Fix MPI on candide (OpenMPI 5 image + latent code bug); containerize & SLURM-ify candide scripts by cailmdaley · Pull Request #737 · CosmoStat/shapepipe

cailmdaley · 2026-05-30T20:10:23Z

Update: the user-facing docs that were originally part of this PR (README front door, the candide cluster walkthrough, the MPI run docs) have been moved to the docs-rework PR #739 so this PR stays focused on code/infra. It retains the CLAUDE.md build-loop note, which documents the container build workflow these changes introduce.

What

Makes the candide example job scripts actually run ShapePipe through the published container. The container is the supported source of truth (docs/source/container.md), so the example scripts should use it rather than module load intelpython/3 + source activate $HOME/.conda/envs/shapepipe. Verifying the MPI script end-to-end turned up three things in the long-unexercised MPI path that had to be fixed for it to run — a launcher/PMIx mismatch, a code bug, and a stale config — each hidden behind the previous one.

Changes

Job scripts (example/pbs/candide_{smp,mpi}.sh)

Drop conda entirely. The pipeline runs via apptainer exec against the runtime image, pulled once to a SIF whose path is overridable via $SP_IMAGE.
SLURM, not PBS. candide moved to SLURM — qsub/#PBS are gone. The scripts now use #SBATCH / sbatch and mpirun -n $SLURM_NTASKS.
Bind-mount the host clone ($SPDIR) at the same path inside the container so the configs' $SPDIR-relative INPUT_DIR/OUTPUT_DIR resolve identically in- and outside the container.
Fix a stale config path; propagate the real pipeline exit code instead of always exit 0.
Un-rot example/pbs/config_mpi.ini (last touched 2020): its MODULE / section names lacked the _runner suffix the loader now requires, so the run died with No module named 'shapepipe.modules.python_example'. Updated to the current names (matching example/config.ini).
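A stale config of this kind can be caught before submitting a job. The following is a minimal sketch, not part of this PR: it assumes the module list lives in an `[EXECUTION]` section under a comma-separated `MODULE` key, and `stale_module_names` is a hypothetical helper name.

```python
import configparser


def stale_module_names(config_text):
    """Return MODULE entries that lack the `_runner` suffix.

    Assumption: modules are listed as a comma-separated MODULE key in an
    [EXECUTION] section; adjust if the real config layout differs.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    modules = parser.get("EXECUTION", "MODULE").split(",")
    return [m.strip() for m in modules if not m.strip().endswith("_runner")]


# Pre-suffix names, as in the old config_mpi.ini described above.
OLD = """
[EXECUTION]
MODULE = python_example, serial_example, execute_example
"""

# Current names, matching example/config.ini.
NEW = """
[EXECUTION]
MODULE = python_example_runner, serial_example_runner, execute_example_runner
"""

print(stale_module_names(OLD))  # names that would fail to import
print(stale_module_names(NEW))  # empty list when the config is current
```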

Container (Dockerfile)

Build OpenMPI 5.0.x from source (--with-pmix=internal --with-prrte=internal --disable-dlopen), dropping libopenmpi-dev. The image previously shipped OpenMPI 4.1.4 / PMIx 2, which is wire-incompatible with candide's OpenMPI 5.0.x / PMIx 5 host launcher — so hybrid MPI silently degraded to N independent "rank 0 of 1" processes (wrong science, no error). The stock mpi4py wheel dlopens libmpi.so.40 (ABI-stable across OMPI 4↔5), so uv.lock is untouched.

ShapePipe MPI code (src/shapepipe/pipeline/mpi_run.py, src/shapepipe/run.py)

Thread module_config_sec through run_mpi → submit_mpi_jobs → worker(). By git history the bug dates to #415 ("Multiple Module Runs"): worker() gained a module_config_sec parameter, but the MPI submit path wasn't updated in step, so it passes 7 args where 8 are required and every MPI run fails with "worker() missing 1 required positional argument: 'module_runner'". On candide this path wasn't reachable until now (the PMIx mismatch meant MPI never wired up), so fixing the launcher is what surfaced it.
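The failure mode is easy to reproduce in isolation. This is an illustrative sketch only: the parameter names other than `module_config_sec` and `module_runner` are placeholders, and the class is a stand-in rather than ShapePipe's real `WorkerHandler`. It shows why a call site that omits the new parameter reports the last parameter as missing: every later positional argument shifts one slot left.

```python
class WorkerHandler:
    """Stand-in with 8 required parameters (names mostly hypothetical)."""

    def worker(self, process, job_name, w_log_name, run_dirs, config,
               module_config_sec, timeout, module_runner):
        return module_config_sec, module_runner.__name__


def example_runner():
    """Placeholder for a module runner callable."""


handler = WorkerHandler()

# Call site that was never updated: 7 positional args, no module_config_sec.
old_args = ("proc", "job", "log", {}, {}, None, example_runner)
message = ""
try:
    handler.worker(*old_args)
except TypeError as err:
    message = str(err)
print(message)

# Fixed call site: module_config_sec threaded through in the right slot.
new_args = ("proc", "job", "log", {}, {}, "PYTHON_EXAMPLE_RUNNER", None,
            example_runner)
result = handler.worker(*new_args)
print(result)
```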

Exit-code propagation (src/shapepipe/shapepipe_run.py)

main() called run(args) but didn't return it, so exit(main()) was always exit(None) → 0. That means every caught error in ShapePipe — not just MPI — has been exiting 0, invisible to exit $? and CI. Now return run(args), with a regression test. (This is independent of MPI and worth landing on its own.)
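A minimal sketch of the bug and the fix, using stand-in functions rather than the real ShapePipe code (the success return value of `run` and the `fail` flag are assumptions made for illustration). `exit_status` mimics how the interpreter maps `exit(None)` to status 0.

```python
def run(args):
    """Stand-in for run(): returns 1 when it catches an error."""
    try:
        if args.get("fail"):
            raise RuntimeError("simulated pipeline error")
    except RuntimeError:
        return 1
    return 0  # assumed success value for this sketch


def main_before(args):
    run(args)  # return value discarded, so main() returns None


def main_after(args):
    return run(args)  # the fix: forward run()'s status


def exit_status(main, args):
    """Status a shell would see from exit(main(args))."""
    try:
        raise SystemExit(main(args))
    except SystemExit as exc:
        # SystemExit(None) is reported to the shell as status 0.
        return 0 if exc.code is None else exc.code


print(exit_status(main_before, {"fail": True}))  # failure hidden
print(exit_status(main_after, {"fail": True}))   # failure visible
```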

CI (.github/workflows/deploy-image.yml)

Publish images on every branch push, so any PR has a cluster-pullable image to test against before merge.

Verification (on candide)

✅ MPI runs end-to-end — the unmodified candide_mpi.sh against the published runtime image. 2-node / 4-task hybrid run: ranks distribute across two nodes (process 0–3 of 4 on n23+n25, not the pre-fix singletons), all three *_example_runner modules produce output, real exit 0 (the script's exit $?), and shapepipe.log records "A total of 0 errors were recorded."
✅ candide_smp.sh runs the SMP example pipeline end-to-end through the container, 0 errors.
✅ bash -n clean; structural shell-syntax test tier passes.
✅ Caught errors now propagate a non-zero exit code (the main() fix), so a failed run no longer looks like success to exit $?.

Question for review — is MPI a used dependency of ShapePipe at all?

We got MPI working on candide (the three fixes above) — but the bigger question this surfaced, @martinkilbinger, is whether it's worth keeping at all. Its entire footprint in the repo is tiny:

A hard dependency (mpi4py>=4.0 in pyproject.toml) plus the core code (run_mpi in run.py, mpi_run.py).
Two example job scripts — candide_mpi.sh (candide) and cc_mpi.sh (ccin2p3) — and one active example config (config_mpi.ini).
Zero production paths: every canfar/candide submission script runs SMP (N_SMP).

And functionally it adds nothing: SMP and MPI call the identical worker() with identical arguments — the same computation behind two dispatchers (worker_handler.py has no MPI; the MPI path just scatters the independent job-list and gathers results). The workload is embarrassingly parallel, so MPI buys only an ergonomic convenience over "many per-node SMP jobs under SLURM."
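The "same computation, two dispatchers" point can be illustrated without MPI. In this sketch the round-robin split stands in for an MPI scatter and the flattening for a gather; the splitting scheme, the function names, and the toy `worker` are assumptions for illustration, not ShapePipe's implementation.

```python
def worker(job):
    """Toy stand-in for the shared per-job worker."""
    return job.upper()


def run_smp(jobs):
    """Single-node dispatch: every job goes through worker() locally."""
    return [worker(job) for job in jobs]


def run_mpi_like(jobs, size):
    """MPI-style dispatch, simulated in one process.

    Scatter: rank r takes jobs[r::size]. Each rank runs the same worker().
    Gather: per-rank result lists are flattened back into one list.
    """
    chunks = [jobs[rank::size] for rank in range(size)]
    per_rank = [[worker(job) for job in chunk] for chunk in chunks]
    return [result for results in per_rank for result in results]


jobs = ["tile-{}".format(i) for i in range(10)]

# Independent jobs: both dispatchers give the same set of results,
# only the ordering differs.
print(sorted(run_mpi_like(jobs, 4)) == sorted(run_smp(jobs)))
```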

So: is MPI actually used anywhere, or can the mode (and the mpi4py dependency) be dropped? If it's used, this PR restores it to working order and there's a ready follow-up to harden it against silent desync; if not, removing it (mpi_run.py, run_mpi, the import_mpi branches, mpi4py, the two *_mpi.sh scripts, config_mpi.ini) is clean and contained. Your call — you'd know whether anyone depends on it.

Out of scope (deliberately untouched)

The production submission scripts — scripts/sh/run_scratch_local.sh, init_run_exclusive_canfar.sh, job_sp_canfar*.bash (canfar + modern candide) — and the ccin2p3 scripts. These target workflows we can't fully verify here; note they are still SMP + conda and not yet containerized (a separate future cleanup).

🤖 Generated with Claude Code

— Claude on behalf of Cail

candide_smp.sh and candide_mpi.sh activated a personal conda env (module load intelpython/3; source activate $HOME/.conda/envs/shapepipe) and called $SPENV/bin/shapepipe_run. Convert them to run the pipeline through the published container image, matching the supported workflow (the container is the source of truth; see docs/source/container.md).

- Drop the conda environment entirely. The pipeline runs via `apptainer exec` against the slim runtime image (ghcr.io/cosmostat/shapepipe:develop-runtime), pulled once to a SIF whose path is overridable via $SP_IMAGE.
- Bind-mount the host clone ($SPDIR) at the same path inside the container so the example configs' $SPDIR-relative input/output directories resolve identically in- and outside the container.
- MPI uses the standard "hybrid" Apptainer pattern: host mpiexec (module load openmpi) launches one container rank per slot, the in-image mpi4py/OpenMPI handle communication.
- Fix a stale path: candide_mpi.sh pointed at example/config_mpi.ini, which does not exist; the file is example/pbs/config_mpi.ini.
- Propagate the pipeline exit code to the batch system (exit $?) instead of always exiting 0.
- Make $SPDIR overridable for testing.

Tested on candide (c03): candide_smp.sh runs the SMP example pipeline end-to-end through the container with 0 errors. The MPI hybrid launch needs a real multi-node allocation to verify end-to-end (it hangs on a login node); the image's MPI stack (mpiexec + mpi4py 4.1.1) and the shared container invocation are verified via the SMP run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…t question Worker delivered both cleanups as open PRs against develop (not merged): #736 (rho-stats/PSF-plot removal) and #737 (candide container scripts). Resolves the open question — random_cat is a general LSS random-catalogue generator, not rho-stats, so it was kept. Records that stile was already vestigial and the MPI-verification gap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… as #737) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Hybrid Apptainer MPI was broken on candide: the image shipped Debian bookworm's OpenMPI 4.1.4 (PMIx 2.x) while candide's host launcher is now OpenMPI 5.0.x (PMIx 5.x). A PMIx 2 client cannot handshake with a PMIx 5 server, so every rank degraded to a standalone "rank 0 of 1" -- N singletons instead of one N-rank job (the textbook Apptainer symptom).

- Dockerfile: drop libopenmpi-dev/openmpi-bin; build OpenMPI 5.0.8 from source with bundled PMIx 5 / PRRTE (--with-pmix=internal etc.) and --disable-dlopen (static MCA -- fixes an internal-openpmix pdl configure failure and is the right posture for a container). The stock mpi4py wheel dlopens libmpi.so.40, which this build provides, so uv.lock is untouched.
- example/pbs/candide_{mpi,smp}.sh: candide migrated PBS -> SLURM (qsub is gone), so convert #PBS -> #SBATCH and launch with `mpirun -n $SLURM_NTASKS apptainer exec ... shapepipe_run`. Load the cluster-default `openmpi` (any 5.0.x is PMIx-compatible).
- docs + CLAUDE.md: document the hybrid-MPI run pattern and the build-remotely / pull-locally container workflow.

Empirically verified on candide: the 4.1.4 image gives 4x "rank 0 of 1"; an OpenMPI 5.0.8 build wires up correctly. See .felt shapepipe/mpi-hybrid.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Tag each pushed branch's image with the branch name so any open PR has a pullable image (apptainer pull ...:<branch>-runtime) that can be tested on a real cluster before merge. Same-repo branch pushes always carry a registry-write token, so this is safe; fork PRs still only build+test via the pull_request trigger (they have no token to publish with). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Make the public .felt/ team-facing rather than personal collaboration notes: - shapepipe.md root: drop first-person role framing, the 'working agreement with Martin' section, private ~/.claude memory-note pointers, and royal-we voice convention; rewrite as a person-generic gateway (stack division, repo conventions incl. corrected rho-stats/meanshapes boundary, threads). - Delete fabian-coord-bug (body-less personal reminder) and prs-in-flight (personal PR dashboard); rephrase the 3 inbound wikilinks. - Neutralize ngmix-update + docker-uv-revert: strip collaborator names and 'mine'/'we agreed' framing, keep the technical why. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The MPI execution path was broken since #415 ("Multiple Module Runs"): WorkerHandler.worker() gained a `module_config_sec` parameter, but `submit_mpi_jobs` in mpi_run.py was never updated to pass it. So the MPI path called worker() with 7 args where 8 are required, failing every run with:

WorkerHandler.worker() missing 1 required positional argument: 'module_runner'

This stayed invisible for 16 months because MPI is a legacy execution mode (SMP is the production path), and on candide MPI couldn't even wire up due to a PMIx version mismatch -- which masked the code bug beneath. Fixing the launcher (OpenMPI 5.0.x in the image) exposed it.

Thread `module_config_sec` from run_mpi (root rank, broadcast to all ranks) into submit_mpi_jobs and on to worker(), matching the SMP/serial call sites.

Verified end-to-end on candide: 2-node / 4-rank hybrid MPI run of the example pipeline, all three modules complete, 0 errors recorded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…lers The earlier mpi-hybrid close claimed the full pipeline ran clean under MPI. It did not — that run hit a latent ShapePipe code bug and the sbatch RUN_EXIT=0 was a hardcoded echo. Rewrite the empirical close to the true two-layer story: launcher (PMIx) fixed and verified, which then exposed the module_config_sec bug (#415), now fixed in e599973 and re-verified e2e. Reopen status (fix not yet in the published image, #737 not merged). Add exec-modes-schedulers: a reference fiber mapping smp/mpi (execution modes) and PBS/SLURM (schedulers) — what's production (SMP+SLURM) vs legacy (MPI, PBS) — the context that explains why this bug survived 16 months. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ripts-container # Conflicts: # .felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md

The MPI example config still used the pre-suffix module names (`python_example`, `serial_example`, `execute_example`) and section headers from 2019-2020; the module loader needs the full runner names (`*_runner`), as example/config.ini uses. With the stale names, rank 0 failed with "No module named 'shapepipe.modules.python_example'" and the other ranks deadlocked in the collective until the wall-clock timeout. Third layer of MPI bit-rot beneath the launcher and the module_config_sec fix, same root cause: nobody runs MPI, so its example config rotted too. Verified: the unmodified candide_mpi.sh against the published runtime image now runs the example pipeline end-to-end (4 ranks / 2 nodes, all three modules, 0 errors, exit 0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tion Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nknown) Walk back 'nobody runs MPI / invisible for 16 months' across both fibers. What we observed: MPI needed three fixes to run on candide; the code bug dates to #415 by git history; the canfar/candide tooling is SMP-only. What we cannot see: how MPI was actually used, especially on canfar where most processing ran. State the evidence, not the inference about practice. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rtin SMP and MPI call the identical worker() with identical args — same computation, two dispatchers (joblib-on-node vs MPI scatter/gather). worker_handler has no MPI; the workload is embarrassingly parallel. So MPI is an ergonomic convenience, not a computational need. Defer to Martin (in #737) whether MPI earns its keep on candide vs just using SMP; don't retire the documented mode unilaterally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`main()` called `run(args)` but discarded its return value, so `exit(main())` was always `exit(None)` → 0. `run()` returns 1 when it catches an error (`catch_error` + `return 1`), so *every* handled failure has been exiting 0 — invisible to `exit $?` in the job scripts and to any CI/automation. One-word fix: `return run(args)`. Add a regression test that main forwards run's value. Surfaced while end-to-end testing the MPI singleton guard: the guard fired and logged loudly but the job still exited 0 until this fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…etons" When the host MPI launcher and the container's MPI/PMIx stack are incompatible, every process initialises standalone (COMM_WORLD size 1, rank 0). ShapePipe then treats each as master, hands each the full job list, and runs N uncoordinated copies of the pipeline into the same output directory — silently, with exit 0. This is the failure the OpenMPI-5 image fix prevents on candide, but nothing guarded against a future recurrence on another cluster. check_mpi_world() compares the size that actually wired up (COMM_WORLD) against the size the launcher intended (OMPI_COMM_WORLD_SIZE, which is set per process even when the world fails to form) and aborts on a mismatch. Empirically verified on candide: SLURM_NTASKS is NOT reliable for this (reads 1 on remote-node ranks even in a healthy run) — OMPI_COMM_WORLD_SIZE is. Tested both ways on a real allocation: healthy OMPI-5 run passes and completes; OMPI-4 image under the OMPI-5 host launcher fires the check and exits non-zero (together with the exit-code fix). Also catches partial wire-up (N-1 of N). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
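The detection recipe described in this commit message can be sketched as a pure function. This is not the reverted implementation: the signature, the error text, and the choice to skip the check when the variable is absent are assumptions. In a real run the actual size would come from `MPI.COMM_WORLD.Get_size()` (mpi4py) and the environment from `os.environ`.

```python
def check_mpi_world(actual_size, environ):
    """Abort when the wired-up world is not the one the launcher intended.

    actual_size: size of COMM_WORLD as seen by this process.
    environ:     mapping holding OMPI_COMM_WORLD_SIZE, which the OpenMPI
                 launcher sets per process even if the world fails to form.
    """
    intended = environ.get("OMPI_COMM_WORLD_SIZE")
    if intended is None:
        # Assumption: no OpenMPI launcher involved, nothing to compare.
        return
    if int(intended) != actual_size:
        raise RuntimeError(
            "MPI world size mismatch: launcher intended {} ranks but "
            "COMM_WORLD has {}".format(intended, actual_size)
        )


# Healthy 4-rank run: passes silently.
check_mpi_world(4, {"OMPI_COMM_WORLD_SIZE": "4"})

# Singleton symptom: launcher started 4 processes, each sees a world of 1.
try:
    check_mpi_world(1, {"OMPI_COMM_WORLD_SIZE": "4"})
    detected = False
except RuntimeError as err:
    detected = True
    print(err)
```

The same comparison also flags a partial wire-up (for example 3 of 4), since any inequality raises.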

The 'warning sign' pass: added check_mpi_world preflight (OMPI_COMM_WORLD_SIZE vs COMM_WORLD size; SLURM_NTASKS proven unreliable) and, found while testing it e2e, fixed main() swallowing run()'s exit code (every caught error had exited 0). Both tested on a real allocation. Distinct remaining gap: mid-setup rank-0 failure still deadlocks the other ranks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…sion) Reverts the check_mpi_world preflight added earlier in this branch. It guards a real but narrow failure (the "rank 0 of N singletons" desync → silent wrong results), which is already designed out on candide by the OpenMPI-5 image match, and adds a runtime check to core run.py — scope creep for what is an example-script modernization PR, especially while MPI's future is an open question for Martin (it's a hard mpi4py dependency used only by two example scripts and one config, by zero production paths). Keeps the exit-code propagation fix (33494d7), which is broad and unrelated. The guard's detection recipe (OMPI_COMM_WORLD_SIZE vs COMM_WORLD size; SLURM_NTASKS is unreliable) is preserved in the mpi-hybrid fiber as a ready follow-up if MPI is kept. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Guard pulled from #737 (scope creep on a maybe-retired mode); exit-code fix kept. Recipe preserved in Layer 4. Question to Martin sharpened to 'is MPI a used dependency at all?' with the full footprint: hard mpi4py dep, 2 example scripts, 1 config, 0 production paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The job scripts assumed the runtime SIF already existed; the pull step and candide's home-quota gotcha (point APPTAINER_CACHEDIR at a data partition or `apptainer pull` dies on $HOME) lived only in CLAUDE.md and felt. Add a copy-paste "Quickstart on a cluster (candide)" to the README — the on-ramp a newcomer actually reads — covering clone -> quota-safe pull -> sbatch the example, pointing at example/pbs and docs/source/container.md for depth. Verified end to end on candide (c03, apptainer 1.4.5, SLURM): the exact quickstart command form runs candide_smp.sh against the published :develop-runtime image -> job COMPLETED, ExitCode 0:0, "A total of 0 errors were recorded". Also refresh the stale python-3.9 badge to 3.12 (the shipped interpreter) and drop a stray character from its target URL. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cailmdaley · 2026-05-31T17:27:40Z

@fabianhervaspeters @martinkilbinger — beyond fixing MPI, this PR doubles as a readiness test for the container itself, and as far as I can tell it's ready for other people to pull it, run it, and try to break it.

Verified end to end on candide today (c03, apptainer 1.4.5, SLURM) against the published :develop-runtime image — observed, not inferred:

Bundled example via direct apptainer exec → exit 0, "A total of 0 errors were recorded".
apptainer shell resolves the entry point (/app/.venv/bin/shapepipe_run).
Real SLURM job — candide_smp.sh → COMPLETED, ExitCode 0:0, 0 errors, bind-mount + exit-code propagation working.
Branch CI is green.

The user-facing docs for all this — a one-command Quickstart, the candide cluster walkthrough (quota-safe pull → sbatch example/pbs/candide_smp.sh, including the $HOME-quota gotcha), and the MPI run pattern — now live in the docs-rework PR #739, so this PR stays focused on the code/infra. See the README front door and the candide cluster docs there.

So: please pull it and try to break it. The MPI-vs-keep question above still stands separately.

— Claude on behalf of Cail

…ner.md The README had tunnel-visioned onto a candide-specific SLURM walkthrough — wrong altitude for the project's landing page, where the broader community arrives. Restructure it as a front door: a one-sentence-deeper description of what ShapePipe does, a Quickstart that runs the bundled example straight from the published container in one command (apptainer or docker, no install, no cluster specifics), the image tag scheme, and a Documentation signpost to the published pages. Move the candide cluster walkthrough (quota-safe pull -> sbatch candide_smp.sh, the SPDIR bind-mount, the MPI PMIx note) into a new "Running on a cluster (SLURM)" section in container.md, which the README links to. Drop the test-assertion prose ("logs ... and exits 0") that read like a CI check rather than user docs. Both quickstart commands verified on candide against :develop-runtime (including the no-pre-pull `apptainer exec docker://...` form): the bundled example runs to completion, 0 errors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Done-state re-verified fresh: PR #736 (rho-stats removal) and #737 (candide scripts → container) both OPEN against develop, no reviews, not merged. Noted PR #636 (rho-stats feature PR) needs Martin to reconcile against #736's removal. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…open & green Cold survey found zero remaining worker work. PR #736 (rho-stats removal) and #737 (candide scripts via apptainer) are both open, mergeable, and CI-green. The only remaining step is Martin's review, which is outside worker scope. Includes mechanical felt reindex churn. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The README front door, the container.md 'Running on a cluster' section, and the basic_execution.md MPI docs are relocated to #739, which owns the full docs story (cluster docs now live in a dedicated clusters.md, so keeping the walkthrough here too would duplicate it). This PR keeps only the code/infra and the CLAUDE.md build-loop note that the container changes here introduce. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Unify all user-facing docs in this PR (relocated from #737, which is now pure code/infra):

- README front door (Quickstart + Documentation signpost). The signpost now has a dedicated 'Running on a cluster' entry pointing at clusters.html, and the container-workflow entry no longer claims to carry the cluster example (that lives in clusters.md).
- basic_execution.md MPI section: the hybrid-Apptainer run pattern and the OpenMPI-5 PMIx note, kept alongside the conda-framing fix.
- container.md gains a one-line pointer to clusters.md.

This removes the container.md/clusters.md duplication at the source rather than reconciling it after merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Audited every narrative docs page against the current code. The install / container / testing / API pages were already fresh; the staleness concentrated in cluster docs and a few content errors. This rework:

**Machine-specific cluster tree.** Cluster guidance was scattered and half of it invisible (candide lived only inside container.md on a feature branch; canfar was split across orphaned pages; none of canfar/candide were in the sidebar). Add a single `clusters.md` under a new "Running on a cluster" toctree caption: the shared pattern (container = unit of execution, bind-mount, keep SIFs off a quota-limited $HOME), then per-machine sections for candide (SLURM, the candide_{smp,mpi}.sh scripts, the quota-safe pull, MPI/PMIx) and CANFAR (the current canfar_submit_job / canfar_monitor console scripts), with ccin2p3 stubbed. The deep CANFAR production walkthrough stays in pipeline_canfar.md, linked, and is now in the toctree too.

**Delete obsolete pages.** canfar.md (the old curl-VM submission model, superseded by canfar_submit_job), pipeline_v2.0.md (personal paths, a missing script), and work_flow_v2.0.md (an unrealized planning wishlist); all three were orphaned from the toctree. The v2.0 wishlist is preserved in the team's felt store rather than lost.

**Fix content errors.**

- dependencies.md: rewritten against pyproject.toml. Reframed around the abstract-minimums + uv.lock SSOT (was "pinned per release"); ngmix now points at the aguinot/ngmix@stable_version fork (was esheldon upstream); dropped the phantom CDSclient; added the missing CANFAR/data stack (vos, skaha, canfar, cs_util, astroquery, reproject, h5py, numba).
- post_processing.md: dropped the removed rho-statistics step and the dead prepare_tiles_for_final command; added a legacy banner pointing at sp_validation.
- random_cat.md: legacy banner; fixed module name random_runner -> random_cat_runner.
- pipeline_canfar.md: flagged the matched-star / coverage-mask helpers that moved to sp_validation (merge_psf_cat.py, download_headers, …).
- basic_execution.md: replaced the conda-era "activate the environment" framing with the container reality. (MPI sections deferred pending the #737 decision.)
- configuration.md (conifg->config, NUMBERING_LIST->NUMBER_LIST), contributing.md (Pleas->Please), module_develop.md (src/shapepipe/modules).

Verified with a local sphinx-book-theme build: succeeds; the only new warning the tree introduced (a clusters.md heading anchor) is fixed. Remaining warnings are all pre-existing (the autosummary API page needs the installed package; multiple-toctree notices on every page).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cailmdaley and others added 8 commits May 31, 2026 12:53

felt: close cleanup-rhostats-jobscripts (D1 stale premise, D2 shipped…

7a87bef

… as #737) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chore: gitignore felt index WAL sidecars (index.db-shm/-wal)

bf9f1e2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/develop' into cleanup/candide-sc…

9d2e523

…ripts-container # Conflicts: # .felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md

cailmdaley changed the title ~~Run candide PBS job scripts through the container instead of conda~~ Fix MPI on candide (OpenMPI 5 image + latent code bug); containerize & SLURM-ify candide scripts May 31, 2026

cailmdaley and others added 10 commits May 31, 2026 17:24

felt: mpi-hybrid — record Layer 3 (stale config) + final e2e verifica…

be0c724

…tion Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cailmdaley mentioned this pull request May 31, 2026

Docs: machine-specific cluster tree + freshness pass #739

Open

cailmdaley requested review from fabianhervaspeters and martinkilbinger May 31, 2026 21:22

cailmdaley wants to merge 21 commits into develop from cleanup/candide-scripts-container
