[spark] Harden dynamic overwrite against optimized child plans #8052
Open
kerwin-zk wants to merge 1 commit into
Conversation
Contributor
Thanks for the update. I am holding off on approval for now because the current CI run has a failing job and several jobs are still pending. Please fix or rerun the failed checks, then I can take another pass.
Contributor
+1. @kerwin-zk Thank you for digging into this very deep issue. Will merge once CI has passed.
Purpose
PaimonDynamicPartitionOverwriteCommand exposes its child query to the Spark optimizer through V2WriteCommand, but later wraps the same query back into a Dataset in run() before passing it to WriteIntoPaimonTable.

This is fragile when the child query has already been optimized by Spark. The optimized plan may contain optimizer/planner-side placeholders, such as DynamicPruningSubquery, which are not ideal to expose again to writer-side Dataset operations.

This PR makes the command-to-writer boundary more robust for the dynamic partition overwrite fallback path. Before passing the query to WriteIntoPaimonTable, it converts the child query into an RDD-backed DataFrame via createNewDataFrame(createDataset(...)). As a result, the writer consumes a clean logical plan instead of directly consuming the possibly optimized child plan.

Tests
CI
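
For readers unfamiliar with the "RDD-backed DataFrame" idea described under Purpose, here is a minimal sketch using only public Spark APIs. It is not the code in this PR: detachFromPlan is a hypothetical helper, and it is only assumed to behave similarly to Paimon's createNewDataFrame(createDataset(...)), whose internals are not shown here. The sketch assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RddBackedBoundarySketch {

  // Hypothetical helper (NOT Paimon's createNewDataFrame): rebuild a DataFrame
  // from the rows and schema of `df`, so that the result is planned as a scan
  // over an existing RDD instead of the original operator tree.
  def detachFromPlan(spark: SparkSession, df: DataFrame): DataFrame =
    spark.createDataFrame(df.rdd, df.schema)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rdd-backed-boundary-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the command's child query.
    val child = Seq((1, "a"), (2, "b"), (3, "c"))
      .toDF("id", "pt")
      .filter($"id" > 1)

    val detached = detachFromPlan(spark, child)

    // Print both plans to compare what a downstream consumer would see.
    child.explain(true)
    detached.explain(true)

    // The boundary preserves schema and rows.
    assert(detached.schema == child.schema)
    assert(detached.collect().toSet == child.collect().toSet)

    spark.stop()
  }
}
```

The trade-off in this style of boundary is that the downstream consumer no longer sees the original operators, so Spark cannot optimize across it; in exchange, optimizer-only expressions in the upstream plan are not re-analyzed by later Dataset operations.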