[spark] Support ON_ERROR = CONTINUE / SKIP_FILE in COPY INTO #8062
Open
JunRuiLee wants to merge 4 commits into
Motivation
This is part of #8005.
`COPY INTO` previously only supported `ON_ERROR = ABORT_STATEMENT`: any parse or cast error aborted the entire command. In production data-loading pipelines a single malformed row or file would then fail the whole batch, which is often too strict. This adds two error-tolerant modes:
- `CONTINUE`: skip bad rows and load the rest (row-level tolerance).
- `SKIP_FILE`: skip any file that contains an error, all-or-nothing per file.

`ABORT_STATEMENT` remains the default, so existing behavior is unchanged.
Changes
- `ON_ERROR` now accepts `CONTINUE` and `SKIP_FILE` in addition to `ABORT_STATEMENT`.
- `errors_seen` (BIGINT): number of error rows per file.
- `first_error` (STRING): first error message, NULL when the file is clean.
- `status` now also reports `PARTIALLY_LOADED` and `LOAD_FAILED`.
- Load history is recorded so error-tolerant runs stay idempotent under `FORCE = FALSE`.
- `CopyIntoTableExec` is split into focused helpers (`CopyIntoHelper`, `CopyIntoCastValidator`, `CopyIntoDataFrameBuilder`, `CopyIntoErrorHandler`, `CopyIntoResultBuilder`), shared across CSV/JSON/Parquet.
- `sql-write.md`, including the CSV column-count-mismatch caveat under `CONTINUE`.
- Supported for CSV, JSON, and Parquet.
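
For reviewers who want a quick mental model of the three modes, here is a minimal Python sketch of the semantics described above. It is not the code in this PR: `copy_into`, `FileResult`, and the integer-cast "parser" are hypothetical stand-ins, and several details are assumptions rather than confirmed behavior (that a clean file reports `LOADED`, that `CONTINUE` reports `LOAD_FAILED` when every row in a file is bad, and that `SKIP_FILE` keeps counting errors after the first one).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FileResult:
    """Hypothetical per-file result row, mirroring the columns named in this PR."""
    file: str
    status: str                 # LOADED (assumed), PARTIALLY_LOADED, or LOAD_FAILED
    rows_loaded: int
    errors_seen: int
    first_error: Optional[str]  # None when the file is clean


def copy_into(
    files: Dict[str, List[str]], on_error: str = "ABORT_STATEMENT"
) -> Tuple[List[int], List[FileResult]]:
    """Toy model: every raw row must cast to int, otherwise it is an error row."""
    if on_error not in ("ABORT_STATEMENT", "CONTINUE", "SKIP_FILE"):
        raise ValueError(f"unsupported ON_ERROR mode: {on_error}")

    loaded: List[int] = []
    results: List[FileResult] = []
    for name, raw_rows in files.items():
        good: List[int] = []
        errors: List[str] = []
        for raw in raw_rows:
            try:
                good.append(int(raw))
            except ValueError as exc:
                if on_error == "ABORT_STATEMENT":
                    raise  # the whole command fails; the caller receives no rows
                errors.append(str(exc))

        # SKIP_FILE is all-or-nothing per file: one bad row discards the file.
        if errors and on_error == "SKIP_FILE":
            good = []

        if not errors:
            status = "LOADED"
        elif good:
            status = "PARTIALLY_LOADED"
        else:
            status = "LOAD_FAILED"  # assumption: nothing loaded from a file with errors

        loaded.extend(good)
        results.append(
            FileResult(name, status, len(good), len(errors), errors[0] if errors else None)
        )
    return loaded, results


if __name__ == "__main__":
    sample = {"a.csv": ["1", "2"], "b.csv": ["3", "x", "5"]}
    for mode in ("CONTINUE", "SKIP_FILE"):
        rows, report = copy_into(sample, mode)
        print(mode, rows, [(r.file, r.status, r.errors_seen) for r in report])
```

In this model `CONTINUE` keeps the good rows of `b.csv` while `SKIP_FILE` drops the file entirely, which is the row-level versus file-level distinction the description draws; the real status mapping should be checked against the PR's tests and `sql-write.md`.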