fix: scope large binary storage and cleanup by execution id #5280

Open
kunwp1 wants to merge 17 commits into apache:main from kunwp1:fix/large-binary-eid-lifecycle

@kunwp1
Contributor

@kunwp1 kunwp1 commented May 28, 2026

What changes were proposed in this PR?

Large binaries were stored in the shared texera-large-binaries bucket under flat keys objects/{timestamp}/{uuid} with no execution id, and clearExecutionResources(eid) deleted all of them via LargeBinaryManager.deleteAllObjects(). Any cleanup for one execution therefore erased every other execution's (and user's) large binaries.

This PR namespaces every large binary by its execution id and scopes deletion:

  • Object keys are now objects/{eid}/{uuid} on both the JVM and Python workers.
  • The execution id is carried to workers via a new InitializeExecutorRequest.executionId proto field, injected by the system at executor init. The user-facing largebinary() / new LargeBinary() APIs are unchanged.
  • Cleanup uses the new LargeBinaryManager.deleteByExecution(eid) (prefix delete of objects/{eid}/). Both JVM and Python engines share the bucket and key shape, so this single JVM-side delete removes binaries created by both.
  • deleteAllObjects() is removed.

Pre-existing objects under the old objects/{timestamp}/... scheme are left untouched.
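
The key scheme and scoped cleanup described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: an in-memory dict stands in for the texera-large-binaries bucket, and the helper names `make_key` and `delete_by_execution` are hypothetical.

```python
import uuid

# In-memory stand-in for the shared bucket (assumption: the real store is S3-compatible).
bucket = {}


def make_key(execution_id):
    """Build an execution-scoped key of the form objects/{eid}/{uuid}."""
    return f"objects/{execution_id}/{uuid.uuid4()}"


def delete_by_execution(execution_id):
    """Delete every object under objects/{eid}/ and return how many were removed."""
    prefix = f"objects/{execution_id}/"  # trailing slash keeps eid 1 from matching eid 10
    doomed = [key for key in bucket if key.startswith(prefix)]
    for key in doomed:
        del bucket[key]
    return len(doomed)


# Two executions write three objects each; one legacy object uses the old timestamp scheme.
for _ in range(3):
    bucket[make_key(1)] = b"a"
    bucket[make_key(10)] = b"b"
bucket["objects/1716900000000/legacy-uuid"] = b"old"

removed = delete_by_execution(1)
```

The trailing slash in the prefix matters: without it, cleaning up execution 1 would also match keys for execution 10 and any legacy timestamp key that starts with 1.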

Any related issues, documentation, discussions?

Closes #4123.

How was this PR tested?

Requires running ./bin/python-proto-gen.sh

Import the following JSON file to create two workflows, run them, and check that each execution creates 6 objects and that one execution does not remove the other execution's large binary objects.
Large.Binary.Python (1).json

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), models Claude Opus 4.7 and Claude Sonnet 4.6

kunwp1 added 9 commits May 28, 2026 10:56
Also update existing call site in RegionExecutionCoordinator to pass
None for the new field (required because ScalaPB no_default_values_in_constructor is true).
…he#4123)

betterproto returns an empty (falsy) ExecutionIdentity for an unset
executionId field rather than None, so the previous `is not None` check
never triggered and an unset id would silently produce objects/0/...
Use truthiness so unset -> None -> create() raises, matching the JVM
invariant. Also moves a stray mid-file `import re` to the top.
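
The pitfall described in this commit message can be reproduced without betterproto. The sketch below uses a hypothetical stand-in class, `FakeExecutionIdentity`, that is falsy when its `id` holds the default value 0, which is the behavior the commit message attributes to an unset message. It is not betterproto's actual class, and the two resolver functions are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeExecutionIdentity:
    """Hypothetical stand-in for a proto message; falsy when its field is default."""

    id: int = 0

    def __bool__(self) -> bool:
        return self.id != 0


def resolve_with_none_check(identity: FakeExecutionIdentity) -> Optional[int]:
    # Buggy variant: an unset message is still an object, so this never yields None.
    return identity.id if identity is not None else None


def resolve_with_truthiness(identity: FakeExecutionIdentity) -> Optional[int]:
    # Fixed variant: an unset (falsy) message maps to None so callers can fail fast.
    return identity.id if identity else None


unset = FakeExecutionIdentity()
print(resolve_with_none_check(unset))  # 0, which would produce keys like objects/0/...
print(resolve_with_truthiness(unset))  # None
print(resolve_with_truthiness(FakeExecutionIdentity(id=42)))  # 42
```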
@codecov-commenter

codecov-commenter commented May 28, 2026

Codecov Report

❌ Patch coverage is 87.93103% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.38%. Comparing base (ec12c88) to head (9e78542).
⚠️ Report is 17 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...pache/texera/service/util/LargeBinaryManager.scala | 60.00% | 4 Missing and 2 partials ⚠️ |
| ...rg/apache/texera/web/service/WorkflowService.scala | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5280      +/-   ##
============================================
+ Coverage     49.12%   49.38%   +0.25%     
- Complexity     2378     2417      +39     
============================================
  Files          1051     1051              
  Lines         40348    40681     +333     
  Branches       4279     4330      +51     
============================================
+ Hits          19821    20090     +269     
- Misses        19368    19400      +32     
- Partials       1159     1191      +32     
| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| access-control-service | 41.89% <ø> (ø) | |
| agent-service | 33.76% <ø> (ø) | Carriedforward from 116291d |
| amber | 52.02% <61.11%> (+0.45%) ⬆️ | |
| computing-unit-managing-service | 1.38% <ø> (+1.38%) ⬆️ | |
| config-service | 54.68% <ø> (+54.68%) ⬆️ | |
| file-service | 38.42% <ø> (+0.42%) ⬆️ | |
| frontend | 40.91% <ø> (-0.17%) ⬇️ | Carriedforward from 116291d |
| pyamber | 90.77% <100.00%> (?) | |
| python | 90.73% <100.00%> (-0.07%) ⬇️ | Carriedforward from 116291d |
| workflow-compiling-service | 58.39% <ø> (+1.57%) ⬆️ | |

*This pull request uses carry forward flags.

kunwp1 added 3 commits May 28, 2026 13:29
…apache#4123)

Move the per-execution id out of StorageConfig (which holds only static
system configuration sourced from storage.conf) into a dedicated module-level
holder in large_binary_manager (set_current_execution_id), mirroring the JVM
LargeBinaryManager. The Python init handler sets it via that API.
Add get_current_execution_id() and route create() and the tests through it
instead of reading the module-level _current_execution_id directly, keeping
the holder's access encapsulated.
@kunwp1
Contributor Author

kunwp1 commented May 28, 2026

/request-review @Xiao-zhen-Liu

Could you review this PR, since you are an engine expert?

@github-actions github-actions Bot requested a review from Xiao-zhen-Liu May 28, 2026 22:43
Comment on lines +33 to +45
# Set at executor init and read by create()
_current_execution_id = None


def set_current_execution_id(execution_id):
    """Sets the execution id used to scope large binaries created by this worker."""
    global _current_execution_id
    _current_execution_id = execution_id


def get_current_execution_id():
    """Returns the execution id set for this worker, or None if unset."""
    return _current_execution_id

Contributor

Please avoid using a global variable to manage state. It is better to create a manager class for this purpose.

Contributor Author

Addressed!

@@ -31,6 +30,20 @@
_s3_client = None
Contributor

Similar to the comment above: can we move these into a class's state?

Contributor Author

Addressed this comment. Can you check if it looks good?

@chenlica
Contributor

chenlica commented Jun 1, 2026

@Xiao-zhen-Liu Please review the PR as requested.

kunwp1 added 3 commits June 1, 2026 12:08
apache#4123)

Address review feedback: replace the module-level globals (_s3_client,
DEFAULT_BUCKET, _current_execution_id) and free functions with a
LargeBinaryManager class holding state as instance attributes, exposed as a
single shared per-worker singleton. No more `global` statements; mirrors the
JVM `object LargeBinaryManager`. Consumers import the singleton, so call sites
are unchanged. Update the stream/type tests to patch the singleton instance.
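
One way to picture this refactor is the sketch below. It is an outline under assumptions rather than the PR's code: the class body, the returned key format, and the `large_binary_manager` singleton name are illustrative, and no real S3 client is created.

```python
import uuid
from unittest import mock


class LargeBinaryManager:
    """Holds per-worker state as instance attributes instead of module globals."""

    def __init__(self, bucket="texera-large-binaries"):
        self.bucket = bucket
        self._s3_client = None  # assumption: the real code creates the client lazily
        self._current_execution_id = None

    def set_current_execution_id(self, execution_id):
        self._current_execution_id = execution_id

    def get_current_execution_id(self):
        return self._current_execution_id

    def create(self):
        """Return an execution-scoped key, failing fast when no execution id is set."""
        eid = self.get_current_execution_id()
        if eid is None:
            raise RuntimeError("execution id is not set for this worker")
        return f"objects/{eid}/{uuid.uuid4()}"


# Single shared per-worker instance; consumers import this name (hypothetical).
large_binary_manager = LargeBinaryManager()

# Tests patch the singleton instance rather than a module-level global.
with mock.patch.object(large_binary_manager, "_current_execution_id", 7):
    key_in_test = large_binary_manager.create()

# patch.object restores the original attribute value when the block exits.
restored_id = large_binary_manager.get_current_execution_id()
```

Keeping the state on one instance means a test can swap a single attribute and have it restored automatically, with no `global` statement involved.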
…pache#4123)

The pure create() logic (execution-scoped key + fail-fast when no context is
set) was only exercised by the MinIO-backed LargeBinaryManagerSpec. Move those
two assertions into LargeBinaryManagerUnitSpec so they run without Docker and
count toward coverage; the MinIO spec keeps the isolation test that genuinely
needs a live S3 endpoint. deleteByExecution's success and swallow branches were
already covered by the unit spec.
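
The success and swallow branches mentioned for deleteByExecution can be exercised without a live endpoint by injecting a fake client. The following is a sketch under assumptions: `FakeClient`, its `delete_prefix` method, and the boolean return value are hypothetical and do not reflect a real S3 client API or the PR's actual signature.

```python
import logging

logger = logging.getLogger("large_binary_sketch")


class FakeClient:
    """Hypothetical storage client; can be told to fail to simulate an outage."""

    def __init__(self, fail=False):
        self.fail = fail
        self.deleted_prefixes = []

    def delete_prefix(self, prefix):
        if self.fail:
            raise ConnectionError("storage unavailable")
        self.deleted_prefixes.append(prefix)


def delete_by_execution(client, execution_id):
    """Best-effort cleanup: log and swallow failures so they never abort the caller."""
    prefix = f"objects/{execution_id}/"
    try:
        client.delete_prefix(prefix)
        return True
    except Exception as exc:  # swallow branch
        logger.warning("failed to delete %s: %s", prefix, exc)
        return False


ok_client = FakeClient()
succeeded = delete_by_execution(ok_client, 5)  # success branch
swallowed = delete_by_execution(FakeClient(fail=True), 5)  # swallow branch, no raise
```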

Development

Successfully merging this pull request may close these issues.

Finish the Life Cycle of Large Binaries