
[spark] Harden dynamic overwrite against optimized child plans #8052

Open
kerwin-zk wants to merge 1 commit into apache:master from kerwin-zk:spark-dynamic-overwrite-hardening


@kerwin-zk
Contributor

Purpose

PaimonDynamicPartitionOverwriteCommand exposes its child query to the Spark optimizer through V2WriteCommand, but later wraps the same query back into a Dataset in run() before passing it to WriteIntoPaimonTable. This is fragile when Spark has already optimized the child query: the optimized plan may contain optimizer- or planner-side placeholders, such as DynamicPruningSubquery, which are poorly suited to being exposed again to writer-side Dataset operations.

This PR makes the command-to-writer boundary more robust for the dynamic partition overwrite fallback path. Before passing the query to WriteIntoPaimonTable, it converts the child query into an RDD-backed DataFrame via createNewDataFrame(createDataset(...)). As a result, the writer consumes a clean logical plan instead of directly consuming the possibly optimized child plan.
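To illustrate the general idea of an RDD-backed DataFrame, here is a minimal sketch that uses only the public Spark API (`Dataset.rdd` and `SparkSession.createDataFrame`). It is not the PR's implementation: the helper name `detachFromPlan`, the object name, and the sample data are hypothetical, and the `createNewDataFrame(createDataset(...))` helpers used in the PR may work differently internally.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RddBackedDataFrameSketch {

  // Hypothetical helper (not from the PR): rebuild `df` on top of its RDD so the
  // returned DataFrame has a new logical plan rooted at an RDD scan rather than
  // the original operator tree.
  def detachFromPlan(spark: SparkSession, df: DataFrame): DataFrame =
    spark.createDataFrame(df.rdd, df.schema)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rdd-backed-dataframe-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data, assumed purely for illustration.
    val source = Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "pt")
    val query = source.filter($"pt" === "a")

    val detached = detachFromPlan(spark, query)

    // Print both logical plans to compare them: the first still carries the
    // Filter operator, the second is built from the RDD and the schema only.
    println(query.queryExecution.logical)
    println(detached.queryExecution.logical)

    spark.stop()
  }
}
```

One trade-off of this public-API variant is that `df.rdd` converts rows to external `Row` objects, which adds some overhead, so a production implementation might prefer an internal-row path.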

Tests

CI

@leaves12138
Contributor

Thanks for the update. I am holding off on approval for now because the current CI run has a failing job and several jobs are still pending. Please fix or rerun the failed checks, then I can take another pass.

@YannByron
Contributor

+1. @kerwin-zk Thank you for digging into such a deep issue. I will merge once CI has passed.

