[python] support chunk shuffle for append table by steFaiz · Pull Request #8064 · apache/paimon

steFaiz · 2026-06-01T10:34:14Z

Purpose

This PR will close: #8010
I tested on a data-evolution table with 100,000,000 records, several structured columns, and a blob column.
The results are as follows:

Metric	Value
Plan time	49.78s
Avg. read time per chunk	1.199s
Chunk size	100
Avg. chunk Arrow size	41.14 MiB
Avg. files per chunk	81
Columns	length, image_name, conversations, width, height, image_count, dataset, image_bytes

I tested reading the table directly on DFS. Planning is expensive, i.e. generating 1 million DataSplits and shuffling them, because creating 1 million objects in Python is heavy. The same work would complete within several hundred milliseconds in Java.
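To illustrate the planning step described above, here is a minimal sketch of chunk-level shuffling in plain Python. It is not the code in this PR and does not use the Paimon API: `plan_shuffled_chunks` and its parameters are hypothetical names, and it assumes the chunk size is counted in records (which would match 100,000,000 records / 100 = 1,000,000 chunks). Rows inside a chunk stay contiguous; only the order of chunks is randomized.

```python
import random
from typing import List, Tuple


def plan_shuffled_chunks(total_rows: int, chunk_size: int, seed: int) -> List[Tuple[int, int]]:
    """Hypothetical helper: split [0, total_rows) into half-open (start, end)
    row ranges of at most chunk_size rows and return them in shuffled order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # One tuple per chunk; at 1 million chunks this object creation is the
    # kind of per-object overhead that makes planning slow in Python.
    chunks = [
        (start, min(start + chunk_size, total_rows))
        for start in range(0, total_rows, chunk_size)
    ]
    # A seeded RNG keeps the shuffled order reproducible across runs.
    random.Random(seed).shuffle(chunks)
    return chunks


# Small illustrative input: 1,000 rows in chunks of 100 rows.
chunks = plan_shuffled_chunks(total_rows=1_000, chunk_size=100, seed=42)
```

Each returned range would then be read as one unit, so reads stay sequential within a chunk while the training order across chunks is random.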

As a next step, I'll try to add shuffle and buffered shuffle for the PyTorch Paimon Dataset.
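For reference, a buffered shuffle over a stream could look roughly like the sketch below. This is only an illustration of the general technique, not the planned implementation: `buffered_shuffle`, `buffer_size`, and `seed` are hypothetical names, and nothing here depends on PyTorch or Paimon.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Hypothetical helper: approximately shuffle a stream using a bounded buffer."""
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
```

Memory use is bounded by `buffer_size`, at the cost of a weaker shuffle than a full permutation: an element can only move a limited distance forward from its original position.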

Tests

See Unit Tests

JingsongLi · 2026-06-01T11:39:33Z

Hi @steFaiz, thanks for the contribution! Maybe just implement shuffle in the PyTorch Paimon Dataset in this PR, and document it?

[python] support chunk shuffle for append table

7f41e24

steFaiz wants to merge 1 commit into apache:master from steFaiz:chunk_shuffle_support
