fix: scope large binary storage and cleanup by execution id #5280
kunwp1 wants to merge 17 commits into
Also update existing call site in RegionExecutionCoordinator to pass None for the new field (required because ScalaPB no_default_values_in_constructor is true).
…he#4123) betterproto returns an empty (falsy) ExecutionIdentity for an unset executionId field rather than None, so the previous `is not None` check never triggered and an unset id would silently produce objects/0/... Use truthiness so unset -> None -> create() raises, matching the JVM invariant. Also moves a stray mid-file `import re` to the top.
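The difference between the two checks can be illustrated with a small stand-in. This is only a sketch: `ExecutionIdentityStub` and `resolve_execution_id` are hypothetical names, and the stub merely imitates the falsy-when-unset behaviour described in the commit message rather than using betterproto itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionIdentityStub:
    """Hypothetical stand-in for a generated message with a single int field."""

    id: int = 0

    def __bool__(self) -> bool:
        # Imitates the described behaviour: an all-default message is falsy.
        return self.id != 0


def resolve_execution_id(msg: Optional[ExecutionIdentityStub]) -> Optional[int]:
    """Map an unset (falsy) message to None; otherwise return its id."""
    return msg.id if msg else None


unset = ExecutionIdentityStub()
print(unset is not None)                                   # True: the old check passes
print(resolve_execution_id(unset))                         # None: truthiness catches it
print(resolve_execution_id(ExecutionIdentityStub(id=42)))  # 42
```

In this sketch an id of 0 is indistinguishable from an unset field, which is the same condition that previously produced `objects/0/...` keys.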
Codecov Report
❌ Patch coverage is …
@@ Coverage Diff @@
## main #5280 +/- ##
============================================
+ Coverage 49.12% 49.38% +0.25%
- Complexity 2378 2417 +39
============================================
Files 1051 1051
Lines 40348 40681 +333
Branches 4279 4330 +51
============================================
+ Hits 19821 20090 +269
- Misses 19368 19400 +32
- Partials 1159 1191 +32
Force-pushed: 6a0709e to 94c2804
…apache#4123) Move the per-execution id out of StorageConfig (which holds only static system configuration sourced from storage.conf) into a dedicated module-level holder in large_binary_manager (set_current_execution_id), mirroring the JVM LargeBinaryManager. The Python init handler sets it via that API.
Add get_current_execution_id() and route create() and the tests through it instead of reading the module-level _current_execution_id directly, keeping the holder's access encapsulated.
/request-review @Xiao-zhen-Liu Can you review this PR because you are an engine expert?
    # Set at executor init and read by create()
    _current_execution_id = None


    def set_current_execution_id(execution_id):
        """Sets the execution id used to scope large binaries created by this worker."""
        global _current_execution_id
        _current_execution_id = execution_id


    def get_current_execution_id():
        """Returns the execution id set for this worker, or None if unset."""
        return _current_execution_id
Please avoid using global variables to manage state. It is better to create a manager class for this kind of purpose.
    @@ -31,6 +30,20 @@
    _s3_client = None
Similar to this... can we move these into a class's state?
Addressed this comment. Can you check if it looks good?
@Xiao-zhen-Liu Please review the PR as requested.
apache#4123) Address review feedback: replace the module-level globals (_s3_client, DEFAULT_BUCKET, _current_execution_id) and free functions with a LargeBinaryManager class holding state as instance attributes, exposed as a single shared per-worker singleton. No more `global` statements; mirrors the JVM `object LargeBinaryManager`. Consumers import the singleton, so call sites are unchanged. Update the stream/type tests to patch the singleton instance.
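A minimal sketch of the shape this commit describes, with state held on an instance and a single shared object exported from the module. The method bodies, the `RuntimeError`, the `s3://` URI format, and the singleton name `large_binary_manager` are assumptions made for illustration, not the PR's actual code.

```python
import uuid
from typing import Optional


class LargeBinaryManager:
    """Sketch: per-worker state lives on the instance, not in module globals."""

    def __init__(self, bucket: str = "texera-large-binaries") -> None:
        self._bucket = bucket
        self._s3_client = None  # assumed to be created lazily in real code
        self._current_execution_id: Optional[int] = None

    def set_current_execution_id(self, execution_id: Optional[int]) -> None:
        """Set at executor init; scopes every binary created afterwards."""
        self._current_execution_id = execution_id

    def get_current_execution_id(self) -> Optional[int]:
        return self._current_execution_id

    def create(self) -> str:
        """Return an execution-scoped URI, failing fast if no id is set."""
        eid = self.get_current_execution_id()
        if eid is None:
            raise RuntimeError("no execution id set; cannot create a large binary")
        # URI layout is an assumption; only the objects/{eid}/{uuid} key is from the PR.
        return f"s3://{self._bucket}/objects/{eid}/{uuid.uuid4()}"


# Hypothetical shared instance that consumers would import.
large_binary_manager = LargeBinaryManager()
```

Because consumers import the shared instance, tests can patch attributes on that one object instead of patching module-level names.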
…pache#4123) The pure create() logic (execution-scoped key + fail-fast when no context is set) was only exercised by the MinIO-backed LargeBinaryManagerSpec. Move those two assertions into LargeBinaryManagerUnitSpec so they run without Docker and count toward coverage; the MinIO spec keeps the isolation test that genuinely needs a live S3 endpoint. deleteByExecution's success and swallow branches were already covered by the unit spec.
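The success and swallow branches mentioned for `deleteByExecution` can be sketched in Python against an in-memory stand-in for the bucket. `InMemoryBucket`, `delete_prefix`, and `delete_by_execution` are hypothetical names; the real implementation is JVM-side and talks to S3.

```python
import logging
from typing import Dict

logger = logging.getLogger(__name__)


class InMemoryBucket:
    """Hypothetical stand-in for the shared bucket: maps object key to bytes."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def delete_prefix(self, prefix: str) -> int:
        """Delete every key starting with the prefix; return how many were removed."""
        doomed = [key for key in self.objects if key.startswith(prefix)]
        for key in doomed:
            del self.objects[key]
        return len(doomed)


def delete_by_execution(bucket, eid: int) -> None:
    """Prefix-delete one execution's objects; log and swallow storage errors."""
    try:
        # The trailing slash keeps execution 1 from matching execution 10.
        removed = bucket.delete_prefix(f"objects/{eid}/")
        logger.info("deleted %d large binaries for execution %s", removed, eid)
    except Exception:
        # Swallow branch: a cleanup failure is logged but not propagated.
        logger.warning("failed to delete large binaries for execution %s", eid, exc_info=True)
```

Calling `delete_by_execution(bucket, 1)` on a bucket holding `objects/1/a` and `objects/10/b` removes only the first key.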
What changes were proposed in this PR?
Large binaries were stored in the shared `texera-large-binaries` bucket under flat keys `objects/{timestamp}/{uuid}` with no execution id, and `clearExecutionResources(eid)` deleted all of them via `LargeBinaryManager.deleteAllObjects()`. Any cleanup for one execution therefore erased every other execution's (and user's) large binaries.
This PR namespaces every large binary by its execution id and scopes deletion:
- Keys are now `objects/{eid}/{uuid}` on both the JVM and Python workers.
- The execution id comes from the `InitializeExecutorRequest.executionId` proto field, injected by the system at executor init. The user-facing `largebinary()` / `new LargeBinary()` APIs are unchanged.
- Cleanup goes through `LargeBinaryManager.deleteByExecution(eid)` (prefix delete of `objects/{eid}/`). Both JVM and Python engines share the bucket and key shape, so this single JVM-side delete removes binaries created by both.
- `deleteAllObjects()` is removed.
Pre-existing objects under the old `objects/{timestamp}/...` scheme are left untouched.
Any related issues, documentation, discussions?
Closes #4123.
How was this PR tested?
Requires running `./bin/python-proto-gen.sh`.
Import the following JSON file to create two workflows, run them, and check that each execution creates 6 objects and that one execution does not remove the other execution's large binary objects.
Large.Binary.Python (1).json
Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), models Claude Opus 4.7 and Claude Sonnet 4.6