
[python] support chunk shuffle for append table#8064

Open
steFaiz wants to merge 1 commit into apache:master from steFaiz:chunk_shuffle_support

Conversation

steFaiz (Contributor) commented Jun 1, 2026

Purpose

Closes #8010.
I tested on a data-evolution table with 100,000,000 records, several structured columns, and a blob column.
The results are below:

| Metric | Value |
| --- | --- |
| Plan | 49.78s |
| AVG per-chunk read | 1.199s |
| Chunk size | 100 |
| AVG chunk Arrow size | 41.14 MiB |
| AVG chunk file num | 81 |
| Columns | length, image_name, conversations, width, height, image_count, dataset, image_bytes |

I tested reading the table directly on DFS. Planning is expensive, i.e. generating 1 million DataSplits and shuffling them, because creating 1 million objects in Python is heavy. The same step would complete within a few hundred milliseconds in Java.
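To make the planning step concrete, here is a minimal sketch of chunk-level shuffling in plain Python. It is not the code in this PR and does not use any pypaimon API; the function name `shuffled_chunk_ranges` and its parameters are hypothetical. It assumes rows can be addressed by a contiguous index, shuffles only a list of chunk indices, and derives each `[start, end)` row range lazily instead of materializing one object per chunk up front.

```python
import random
from typing import Iterator, Tuple


def shuffled_chunk_ranges(
    total_rows: int, chunk_size: int, seed: int = 0
) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, end) row ranges, one per chunk, in shuffled order.

    Hypothetical helper for illustration only; not part of pypaimon.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Ceiling division: the last chunk may be shorter than chunk_size.
    num_chunks = (total_rows + chunk_size - 1) // chunk_size
    # Shuffle plain integers rather than heavier per-chunk objects.
    order = list(range(num_chunks))
    random.Random(seed).shuffle(order)
    for idx in order:
        start = idx * chunk_size
        yield start, min(start + chunk_size, total_rows)
```

With the numbers above (100,000,000 records, chunk size 100), this would shuffle 1,000,000 integers. Rows inside a chunk keep their original order; only the order of chunks changes.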

As a next step, I'll try to add shuffle and buffered shuffle for the PyTorch Paimon Dataset.
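For reference, a buffered shuffle over a stream is commonly implemented roughly as follows. This is only a sketch of the general technique, under the assumption that the dataset yields items one at a time; `buffered_shuffle` and its parameters are hypothetical names, not an existing Paimon or PyTorch API.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    items: Iterable[T], buffer_size: int, seed: int = 0
) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size in-memory buffer.

    Hypothetical helper for illustration only.
    """
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Buffer is full: emit a random resident and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: drain the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer
```

A larger `buffer_size` gives a result closer to a full shuffle at the cost of more memory. Every input item is emitted exactly once.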

Tests

See the unit tests.

JingsongLi (Contributor) commented Jun 1, 2026

Hi @steFaiz, thanks for the contribution! Maybe just implement shuffle in the PyTorch Paimon Dataset in this PR, and document it?
