Antalya 26.3: support external paths in Iceberg tables #1859
Open
zvonand wants to merge 2 commits into
Port ClickHouse#90740 to antalya-26.3.

Iceberg tables may now reference files (data files, manifests, manifest lists) located outside the table location, including on a different object storage backend. Metadata paths are treated as absolute URIs and resolved at read/delete time via new object-storage helpers (`SchemeAuthorityKey`, `resolveObjectStorageForPath`, `SecondaryStorages`), with the cluster-function protocol bumped to `DBMS_CLUSTER_PROCESSING_PROTOCOL_VERSION_WITH_ICEBERG_ABSOLUTE_PATH`.

Adds the `s3_propagate_credentials_to_other_storages` setting to optionally copy base S3 credentials when creating secondary storages.

Notes on porting to this branch:
- Skipped the `ExpireSnapshotsExecute`, `RemoveOrphanFilesExecute` and `SnapshotFilesTraversal` files: this functionality does not exist in `antalya-26.3`. The `executeCommand` branch using them was dropped and the existing `expireSnapshots` implementation is kept.
- Dropped the `S3UriStyle uri_style` `S3::URI` parameter (from an unrelated upstream change not in this branch); only `enable_url_encoding` is added.
- Dropped the upstream-only `_path` virtual column `storage_id` field, which is not present in `VirtualsForFileLikeStorage` here.
- Folded the metadata-path preference into the existing `getFileIdentifier` helper in the stable task distributor rather than the upstream inline call sites.
- Updated `Mutations.cpp` (`expireSnapshots`) callers for the new `getManifestList` / `getManifestFileEntriesHandle` signatures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
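To illustrate what "treated as absolute URIs" could mean in practice, here is a minimal sketch of splitting a metadata path into scheme, authority and key, and of deriving a lookup key under which a secondary storage might be cached. This is not the code from this PR: `ParsedObjectUri`, `parseAbsoluteUri` and `storageCacheKey` are hypothetical names, and the real `SchemeAuthorityKey` and `SecondaryStorages` helpers may behave differently.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for the idea behind `SchemeAuthorityKey`;
// the field names and parsing rules here are assumptions.
struct ParsedObjectUri
{
    std::string scheme;     // e.g. "s3" or "s3a"
    std::string authority;  // bucket or host
    std::string key;        // object key, without the leading '/'
};

// Splits "scheme://authority/key". Returns std::nullopt for anything that
// is not an absolute URI with a non-empty scheme and authority.
std::optional<ParsedObjectUri> parseAbsoluteUri(std::string_view uri)
{
    const auto scheme_end = uri.find("://");
    if (scheme_end == std::string_view::npos || scheme_end == 0)
        return std::nullopt;

    const std::string_view rest = uri.substr(scheme_end + 3);
    const auto slash = rest.find('/');

    ParsedObjectUri out;
    out.scheme = std::string(uri.substr(0, scheme_end));
    out.authority = std::string(rest.substr(0, slash));
    if (slash != std::string_view::npos)
        out.key = std::string(rest.substr(slash + 1));

    if (out.authority.empty())
        return std::nullopt;
    return out;
}

// Hypothetical cache key: one secondary storage per (scheme, authority) pair.
std::string storageCacheKey(const ParsedObjectUri & parsed)
{
    return parsed.scheme + "://" + parsed.authority;
}
```

With this sketch, `parseAbsoluteUri("s3://other-bucket/data/file.parquet")` yields the scheme `s3`, the authority `other-bucket` and the key `data/file.parquet`, while a relative path such as `data/file.parquet` yields `std::nullopt`.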
Fixes `04034_iceberg_spark_style_location` (S3_ERROR 404 reading `warehouse/db/spark_table/metadata/snap-*.avro`).

When an Iceberg table's metadata `location` differs from where the files actually live (e.g. a Spark-relocated table whose `location` is `s3a://spark-bucket/warehouse/db/spark_table` while the objects are in the configured base storage), the manifest-list / manifest / data paths in the metadata are spelled with that foreign prefix. `tryResolveObjectStorageForPath` matched such a path against `table_location` and returned the raw URI key on the base storage, so reads hit a non-existent key and failed with a 404.

The raw key is only valid for paths whose bucket matches the base storage (handled by the earlier base-bucket branch). For a path that matches `table_location` but not the base bucket, only `IcebergPathResolver::resolve` can map it (strip `table_location`, prepend `table_root`), so defer to it by returning `std::nullopt`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
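The resolution order described in this commit can be sketched as follows. This is a simplified model under stated assumptions, not the actual `tryResolveObjectStorageForPath` or `IcebergPathResolver::resolve` code: the function names, the string-prefix representation of the base bucket, and the sample `table_root` value are all hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

bool startsWith(std::string_view s, std::string_view prefix)
{
    return s.substr(0, prefix.size()) == prefix;
}

// Step 1 (hypothetical): the raw key is usable only when the path lives in
// the base bucket. `base_bucket_prefix` looks like "s3://base-bucket/".
// Any other path returns std::nullopt, i.e. "defer to the path resolver".
std::optional<std::string> tryRawKeyOnBaseStorage(
    std::string_view path, std::string_view base_bucket_prefix)
{
    if (startsWith(path, base_bucket_prefix))
        return std::string(path.substr(base_bucket_prefix.size()));
    return std::nullopt;
}

// Step 2 (hypothetical): strip `table_location`, prepend `table_root`.
std::optional<std::string> resolveViaTableLocation(
    std::string_view path, std::string_view table_location, std::string_view table_root)
{
    assert(!table_location.empty());
    if (!startsWith(path, table_location))
        return std::nullopt;

    std::string_view relative = path.substr(table_location.size());
    // Require a path-segment boundary so ".../spark_table2" does not match.
    if (!relative.empty() && relative.front() != '/')
        return std::nullopt;
    if (!relative.empty())
        relative.remove_prefix(1);

    std::string result(table_root);
    if (!result.empty() && result.back() != '/')
        result += '/';
    result += relative;
    return result;
}

// Combined order: base-bucket branch first, then the table_location mapping.
std::optional<std::string> resolveKey(
    std::string_view path, std::string_view base_bucket_prefix,
    std::string_view table_location, std::string_view table_root)
{
    if (auto raw = tryRawKeyOnBaseStorage(path, base_bucket_prefix))
        return raw;
    return resolveViaTableLocation(path, table_location, table_root);
}
```

In this model, a path under `s3a://spark-bucket/warehouse/db/spark_table` never produces a raw key on the base storage; it is rewritten relative to the assumed `table_root`, which mirrors the behavior the fix restores.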
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Support Iceberg tables that have data files outside the table location or on a different object storage. Cherry-picked from ClickHouse#90740 (by @zvonand).
CI/CD Options
Exclude tests:
Regression jobs to run: