Fix MPI on candide (OpenMPI 5 image + latent code bug); containerize & SLURM-ify candide scripts#737
cailmdaley wants to merge 21 commits
candide_smp.sh and candide_mpi.sh activated a personal conda env (module load intelpython/3; source activate $HOME/.conda/envs/shapepipe) and called $SPENV/bin/shapepipe_run. Convert them to run the pipeline through the published container image, matching the supported workflow (the container is the source of truth; see docs/source/container.md).
- Drop the conda environment entirely. The pipeline runs via `apptainer exec` against the slim runtime image (ghcr.io/cosmostat/shapepipe:develop-runtime), pulled once to a SIF whose path is overridable via $SP_IMAGE.
- Bind-mount the host clone ($SPDIR) at the same path inside the container so the example configs' $SPDIR-relative input/output directories resolve identically in- and outside the container.
- MPI uses the standard "hybrid" Apptainer pattern: host mpiexec (module load openmpi) launches one container rank per slot, and the in-image mpi4py/OpenMPI handle communication.
- Fix a stale path: candide_mpi.sh pointed at example/config_mpi.ini, which does not exist; the file is example/pbs/config_mpi.ini.
- Propagate the pipeline exit code to the batch system (exit $?) instead of always exiting 0.
- Make $SPDIR overridable for testing.
Tested on candide (c03): candide_smp.sh runs the SMP example pipeline end-to-end through the container with 0 errors. The MPI hybrid launch needs a real multi-node allocation to verify end-to-end (it hangs on a login node); the image's MPI stack (mpiexec + mpi4py 4.1.1) and the shared container invocation are verified via the SMP run.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t question Worker delivered both cleanups as open PRs against develop (not merged): #736 (rho-stats/PSF-plot removal) and #737 (candide container scripts). Resolves the open question — random_cat is a general LSS random-catalogue generator, not rho-stats, so it was kept. Records that stile was already vestigial and the MPI-verification gap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… as #737) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Hybrid Apptainer MPI was broken on candide: the image shipped Debian
bookworm's OpenMPI 4.1.4 (PMIx 2.x) while candide's host launcher is now
OpenMPI 5.0.x (PMIx 5.x). A PMIx 2 client cannot handshake with a PMIx 5
server, so every rank degraded to a standalone "rank 0 of 1" -- N
singletons instead of one N-rank job (the textbook Apptainer symptom).
- Dockerfile: drop libopenmpi-dev/openmpi-bin; build OpenMPI 5.0.8 from
source with bundled PMIx 5 / PRRTE (--with-pmix=internal etc.) and
--disable-dlopen (static MCA -- fixes an internal-openpmix pdl configure
failure and is the right posture for a container). The stock mpi4py
wheel dlopens libmpi.so.40, which this build provides, so uv.lock is
untouched.
- example/pbs/candide_{mpi,smp}.sh: candide migrated PBS -> SLURM (qsub is
gone), so convert #PBS -> #SBATCH and launch with
`mpirun -n $SLURM_NTASKS apptainer exec ... shapepipe_run`. Load the
cluster-default `openmpi` (any 5.0.x is PMIx-compatible).
- docs + CLAUDE.md: document the hybrid-MPI run pattern and the
build-remotely / pull-locally container workflow.
Empirically verified on candide: the 4.1.4 image gives 4x "rank 0 of 1";
an OpenMPI 5.0.8 build wires up correctly. See .felt shapepipe/mpi-hybrid.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tag each pushed branch's image with the branch name so any open PR has a pullable image (apptainer pull ...:<branch>-runtime) that can be tested on a real cluster before merge. Same-repo branch pushes always carry a registry-write token, so this is safe; fork PRs still only build+test via the pull_request trigger (they have no token to publish with). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the public .felt/ team-facing rather than personal collaboration notes:
- shapepipe.md root: drop first-person role framing, the 'working agreement with Martin' section, private ~/.claude memory-note pointers, and royal-we voice convention; rewrite as a person-generic gateway (stack division, repo conventions incl. corrected rho-stats/meanshapes boundary, threads).
- Delete fabian-coord-bug (body-less personal reminder) and prs-in-flight (personal PR dashboard); rephrase the 3 inbound wikilinks.
- Neutralize ngmix-update + docker-uv-revert: strip collaborator names and 'mine'/'we agreed' framing, keep the technical why.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The MPI execution path was broken since #415 ("Multiple Module Runs"): WorkerHandler.worker() gained a `module_config_sec` parameter, but `submit_mpi_jobs` in mpi_run.py was never updated to pass it. So the MPI path called worker() with 7 args where 8 are required, failing every run with:
WorkerHandler.worker() missing 1 required positional argument: 'module_runner'
This stayed invisible for 16 months because MPI is a legacy execution mode (SMP is the production path), and on candide MPI couldn't even wire up due to a PMIx version mismatch -- which masked the code bug beneath. Fixing the launcher (OpenMPI 5.0.x in the image) exposed it.
Thread `module_config_sec` from run_mpi (root rank, broadcast to all ranks) into submit_mpi_jobs and on to worker(), matching the SMP/serial call sites.
Verified end-to-end on candide: 2-node / 4-rank hybrid MPI run of the example pipeline, all three modules complete, 0 errors recorded.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
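The failure mode in this commit can be reproduced in isolation. The following is a minimal sketch, not the ShapePipe code: `worker`, `submit_stale`, and `submit_fixed` are hypothetical stand-ins with a shortened signature, and only the names `module_config_sec` and `module_runner` come from the commit message. It shows how a call site that predates a new positional parameter produces an error naming the last parameter rather than the one that was added.

```python
def worker(process, job_name, module_config_sec, module_runner):
    """Hypothetical stand-in for WorkerHandler.worker() (shortened signature)."""
    return f"{module_runner}[{module_config_sec}] ran {job_name}:{process}"


def submit_stale(process, job_name, module_runner):
    # Call site written before module_config_sec existed: every argument
    # after job_name lands one slot early, so the *last* parameter is missing.
    return worker(process, job_name, module_runner)


def submit_fixed(process, job_name, module_config_sec, module_runner):
    # Updated call site: threads module_config_sec through to worker().
    return worker(process, job_name, module_config_sec, module_runner)


try:
    submit_stale("0", "job", "python_example_runner")
    stale_error = ""
except TypeError as err:
    stale_error = str(err)

fixed_result = submit_fixed(
    "0", "job", "PYTHON_EXAMPLE_RUNNER", "python_example_runner"
)
print(stale_error)
print(fixed_result)
```

This is why the reported message mentions `module_runner` even though the parameter that was never passed is `module_config_sec`.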
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lers The earlier mpi-hybrid close claimed the full pipeline ran clean under MPI. It did not — that run hit a latent ShapePipe code bug and the sbatch RUN_EXIT=0 was a hardcoded echo. Rewrite the empirical close to the true two-layer story: launcher (PMIx) fixed and verified, which then exposed the module_config_sec bug (#415), now fixed in e599973 and re-verified e2e. Reopen status (fix not yet in the published image, #737 not merged). Add exec-modes-schedulers: a reference fiber mapping smp/mpi (execution modes) and PBS/SLURM (schedulers) — what's production (SMP+SLURM) vs legacy (MPI, PBS) — the context that explains why this bug survived 16 months. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ripts-container # Conflicts: # .felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md
The MPI example config still used the pre-suffix module names (`python_example`, `serial_example`, `execute_example`) and section headers from 2019-2020; the module loader needs the full runner names (`*_runner`), as example/config.ini uses. With the stale names, rank 0 failed with "No module named 'shapepipe.modules.python_example'" and the other ranks deadlocked in the collective until the wall-clock timeout. Third layer of MPI bit-rot beneath the launcher and the module_config_sec fix, same root cause: nobody runs MPI, so its example config rotted too. Verified: the unmodified candide_mpi.sh against the published runtime image now runs the example pipeline end-to-end (4 ranks / 2 nodes, all three modules, 0 errors, exit 0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nknown) Walk back 'nobody runs MPI / invisible for 16 months' across both fibers. What we observed: MPI needed three fixes to run on candide; the code bug dates to #415 by git history; the canfar/candide tooling is SMP-only. What we cannot see: how MPI was actually used, especially on canfar where most processing ran. State the evidence, not the inference about practice. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rtin SMP and MPI call the identical worker() with identical args — same computation, two dispatchers (joblib-on-node vs MPI scatter/gather). worker_handler has no MPI; the workload is embarrassingly parallel. So MPI is an ergonomic convenience, not a computational need. Defer to Martin (in #737) whether MPI earns its keep on candide vs just using SMP; don't retire the documented mode unilaterally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
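The "same computation, two dispatchers" point can be illustrated without MPI or joblib. This is a dependency-free sketch under stated assumptions: `worker`, `dispatch_smp`, and `dispatch_scatter_gather` are hypothetical names, a thread pool stands in for the on-node pool, and plain list slicing emulates scatter/gather. It is not how ShapePipe dispatches jobs, only a demonstration that for independent jobs both strategies yield the same results.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(job):
    """Hypothetical per-job computation, independent of every other job."""
    return job * job


def dispatch_smp(jobs, n_workers):
    # Stand-in for the on-node pool (a thread pool keeps the sketch
    # dependency-free; the commit message names joblib for the real path).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, jobs))


def dispatch_scatter_gather(jobs, n_ranks):
    # Emulates scatter/gather without MPI: one round-robin chunk per rank,
    # each chunk processed independently, results reassembled in job order.
    chunks = [jobs[rank::n_ranks] for rank in range(n_ranks)]
    per_rank = [[worker(job) for job in chunk] for chunk in chunks]
    results = [None] * len(jobs)
    for rank, chunk_results in enumerate(per_rank):
        results[rank::n_ranks] = chunk_results
    return results


jobs = list(range(10))
smp_results = dispatch_smp(jobs, n_workers=3)
mpi_like_results = dispatch_scatter_gather(jobs, n_ranks=4)
print(smp_results == mpi_like_results)
```

Because no job depends on another, the choice of dispatcher changes where the work runs, not what is computed.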
`main()` called `run(args)` but discarded its return value, so `exit(main())` was always `exit(None)` → 0. `run()` returns 1 when it catches an error (`catch_error` + `return 1`), so *every* handled failure has been exiting 0 — invisible to `exit $?` in the job scripts and to any CI/automation. One-word fix: `return run(args)`. Add a regression test that main forwards run's value. Surfaced while end-to-end testing the MPI singleton guard: the guard fired and logged loudly but the job still exited 0 until this fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
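The bug pattern described in this commit is small enough to show in full. In the sketch below, `run`, `main_before`, and `main_after` are hypothetical stand-ins (the real functions take parsed arguments and do far more); the point is only that a function which calls `run(args)` without returning it hands `None` to `sys.exit`, and `sys.exit(None)` means exit status 0.

```python
import sys


def run(args):
    """Hypothetical stand-in: return 1 when an error was caught, else 0."""
    try:
        if args.get("fail"):
            raise RuntimeError("module failed")
    except RuntimeError:
        return 1
    return 0


def main_before(args):
    run(args)  # return value discarded, so this function returns None


def main_after(args):
    return run(args)  # forwards run()'s status to the caller


def exit_code_of(main, args):
    # Capture what sys.exit(main(args)) would hand to the interpreter.
    try:
        sys.exit(main(args))
    except SystemExit as exc:
        return exc.code


before_code = exit_code_of(main_before, {"fail": True})
after_code = exit_code_of(main_after, {"fail": True})
print(before_code, after_code)
```

A regression test in the same spirit only needs to assert that the entry point's return value equals whatever `run` returned.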
…etons" When the host MPI launcher and the container's MPI/PMIx stack are incompatible, every process initialises standalone (COMM_WORLD size 1, rank 0). ShapePipe then treats each as master, hands each the full job list, and runs N uncoordinated copies of the pipeline into the same output directory — silently, with exit 0. This is the failure the OpenMPI-5 image fix prevents on candide, but nothing guarded against a future recurrence on another cluster. check_mpi_world() compares the size that actually wired up (COMM_WORLD) against the size the launcher intended (OMPI_COMM_WORLD_SIZE, which is set per process even when the world fails to form) and aborts on a mismatch. Empirically verified on candide: SLURM_NTASKS is NOT reliable for this (reads 1 on remote-node ranks even in a healthy run) — OMPI_COMM_WORLD_SIZE is. Tested both ways on a real allocation: healthy OMPI-5 run passes and completes; OMPI-4 image under the OMPI-5 host launcher fires the check and exits non-zero (together with the exit-code fix). Also catches partial wire-up (N-1 of N). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
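The detection recipe in this commit (compare the size that wired up against the size the launcher intended) can be sketched as follows. This is an illustrative reconstruction, not the reverted implementation: the function signature, the `environ` parameter, and the error text are assumptions. Under mpi4py the `actual_size` argument would come from `MPI.COMM_WORLD.Get_size()`; it is passed in here so the sketch runs without MPI installed.

```python
import os


def check_mpi_world(actual_size, environ=os.environ):
    """Raise if the MPI world that formed differs from what the launcher intended.

    actual_size: the size COMM_WORLD reports on this process.
    environ: mapping to read OMPI_COMM_WORLD_SIZE from (assumed parameter,
        added so the check can be exercised without a launcher).
    """
    intended = environ.get("OMPI_COMM_WORLD_SIZE")
    if intended is None:
        # Not started by an Open MPI launcher: nothing to compare against.
        return actual_size
    intended = int(intended)
    if actual_size != intended:
        raise RuntimeError(
            f"MPI world has {actual_size} rank(s) but the launcher "
            f"started {intended}; host and container MPI stacks may be "
            "incompatible"
        )
    return intended


healthy = check_mpi_world(4, {"OMPI_COMM_WORLD_SIZE": "4"})
try:
    check_mpi_world(1, {"OMPI_COMM_WORLD_SIZE": "4"})
    singleton_detected = False
except RuntimeError:
    singleton_detected = True
print(healthy, singleton_detected)
```

The same comparison also flags a partial wire-up, since any `actual_size` other than the intended one raises.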
The 'warning sign' pass: added check_mpi_world preflight (OMPI_COMM_WORLD_SIZE vs COMM_WORLD size; SLURM_NTASKS proven unreliable) and, found while testing it e2e, fixed main() swallowing run()'s exit code (every caught error had exited 0). Both tested on a real allocation. Distinct remaining gap: mid-setup rank-0 failure still deadlocks the other ranks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sion) Reverts the check_mpi_world preflight added earlier in this branch. It guards a real but narrow failure (the "rank 0 of N singletons" desync → silent wrong results), which is already designed out on candide by the OpenMPI-5 image match, and adds a runtime check to core run.py — scope creep for what is an example-script modernization PR, especially while MPI's future is an open question for Martin (it's a hard mpi4py dependency used only by two example scripts and one config, by zero production paths). Keeps the exit-code propagation fix (33494d7), which is broad and unrelated. The guard's detection recipe (OMPI_COMM_WORLD_SIZE vs COMM_WORLD size; SLURM_NTASKS is unreliable) is preserved in the mpi-hybrid fiber as a ready follow-up if MPI is kept. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Guard pulled from #737 (scope creep on a maybe-retired mode); exit-code fix kept. Recipe preserved in Layer 4. Question to Martin sharpened to 'is MPI a used dependency at all?' with the full footprint: hard mpi4py dep, 2 example scripts, 1 config, 0 production paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The job scripts assumed the runtime SIF already existed; the pull step and candide's home-quota gotcha (point APPTAINER_CACHEDIR at a data partition or `apptainer pull` dies on $HOME) lived only in CLAUDE.md and felt. Add a copy-paste "Quickstart on a cluster (candide)" to the README — the on-ramp a newcomer actually reads — covering clone -> quota-safe pull -> sbatch the example, pointing at example/pbs and docs/source/container.md for depth. Verified end to end on candide (c03, apptainer 1.4.5, SLURM): the exact quickstart command form runs candide_smp.sh against the published :develop-runtime image -> job COMPLETED, ExitCode 0:0, "A total of 0 errors were recorded". Also refresh the stale python-3.9 badge to 3.12 (the shipped interpreter) and drop a stray character from its target URL. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fabianhervaspeters @martinkilbinger — beyond fixing MPI, this PR doubles as a readiness test for the container itself, and as far as I can tell it's ready for other people to pull it, run it, and try to break it. Verified end to end on candide today (c03, apptainer 1.4.5, SLURM) against the published
The user-facing docs for all this — a one-command Quickstart, the candide cluster walkthrough (quota-safe pull → So: please pull it and try to break it. The MPI-vs-keep question above still stands separately. — Claude on behalf of Cail |
…ner.md
The README had tunnel-visioned onto a candide-specific SLURM walkthrough —
wrong altitude for the project's landing page, where the broader community
arrives. Restructure it as a front door: a one-sentence-deeper description of
what ShapePipe does, a Quickstart that runs the bundled example straight from
the published container in one command (apptainer or docker, no install, no
cluster specifics), the image tag scheme, and a Documentation signpost to the
published pages.
Move the candide cluster walkthrough (quota-safe pull -> sbatch candide_smp.sh,
the SPDIR bind-mount, the MPI PMIx note) into a new "Running on a cluster
(SLURM)" section in container.md, which the README links to. Drop the
test-assertion prose ("logs ... and exits 0") that read like a CI check rather
than user docs.
Both quickstart commands verified on candide against :develop-runtime
(including the no-pre-pull `apptainer exec docker://...` form): the bundled
example runs to completion, 0 errors.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…open & green Cold survey found zero remaining worker work. PR #736 (rho-stats removal) and #737 (candide scripts via apptainer) are both open, mergeable, and CI-green. The only remaining step is Martin's review, which is outside worker scope. Includes mechanical felt reindex churn. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The README front door, the container.md 'Running on a cluster' section, and the basic_execution.md MPI docs are relocated to #739, which owns the full docs story (cluster docs now live in a dedicated clusters.md, so keeping the walkthrough here too would duplicate it). This PR keeps only the code/infra and the CLAUDE.md build-loop note that the container changes here introduce. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Unify all user-facing docs in this PR (relocated from #737, which is now pure code/infra):
- README front door (Quickstart + Documentation signpost). The signpost now has a dedicated 'Running on a cluster' entry pointing at clusters.html, and the container-workflow entry no longer claims to carry the cluster example (that lives in clusters.md).
- basic_execution.md MPI section: the hybrid-Apptainer run pattern and the OpenMPI-5 PMIx note, kept alongside the conda-framing fix.
- container.md gains a one-line pointer to clusters.md.
This removes the container.md/clusters.md duplication at the source rather than reconciling it after merge.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audited every narrative docs page against the current code. The install /
container / testing / API pages were already fresh; the staleness concentrated
in cluster docs and a few content errors. This rework:
**Machine-specific cluster tree.** Cluster guidance was scattered and half
of it invisible (candide lived only inside container.md on a feature branch;
canfar was split across orphaned pages; none of canfar/candide were in the
sidebar). Add a single `clusters.md` under a new "Running on a cluster" toctree
caption: the shared pattern (container = unit of execution, bind-mount, keep
SIFs off a quota-limited $HOME), then per-machine sections for candide (SLURM,
the candide_{smp,mpi}.sh scripts, the quota-safe pull, MPI/PMIx) and CANFAR
(the current canfar_submit_job / canfar_monitor console scripts), with ccin2p3
stubbed. The deep CANFAR production walkthrough stays in pipeline_canfar.md,
linked, and is now in the toctree too.
**Delete obsolete pages.** canfar.md (the old curl-VM submission model,
superseded by canfar_submit_job), pipeline_v2.0.md (personal paths, a missing
script), and work_flow_v2.0.md (an unrealized planning wishlist) — all three
orphaned from the toctree. The v2.0 wishlist is preserved in the team's felt
store rather than lost.
**Fix content errors.**
- dependencies.md: rewritten against pyproject.toml. Reframed around the
abstract-minimums + uv.lock SSOT (was "pinned per release"); ngmix now points
at the aguinot/ngmix@stable_version fork (was esheldon upstream); dropped the
phantom CDSclient; added the missing CANFAR/data stack (vos, skaha, canfar,
cs_util, astroquery, reproject, h5py, numba).
- post_processing.md: dropped the removed rho-statistics step and the dead
prepare_tiles_for_final command; added a legacy banner pointing at sp_validation.
- random_cat.md: legacy banner; fixed module name random_runner -> random_cat_runner.
- pipeline_canfar.md: flagged the matched-star / coverage-mask helpers that
moved to sp_validation (merge_psf_cat.py, download_headers, …).
- basic_execution.md: replaced the conda-era "activate the environment" framing
with the container reality. (MPI sections deferred pending the #737 decision.)
- configuration.md (conifg->config, NUMBERING_LIST->NUMBER_LIST),
contributing.md (Pleas->Please), module_develop.md (src/shapepipe/modules).
Verified with a local sphinx-book-theme build: succeeds; the only new warning
the tree introduced (a clusters.md heading anchor) is fixed. Remaining warnings
are all pre-existing (the autosummary API page needs the installed package;
multiple-toctree notices on every page).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Makes the candide example job scripts actually run ShapePipe through the published container. The container is the supported source of truth (`docs/source/container.md`), so the example scripts should use it rather than `module load intelpython/3` + `source activate $HOME/.conda/envs/shapepipe`. Verifying the MPI script end-to-end turned up three things in the long-unexercised MPI path that had to be fixed for it to run: a launcher/PMIx mismatch, a code bug, and a stale config, each hidden behind the previous one.
Changes
Job scripts (`example/pbs/candide_{smp,mpi}.sh`)
- Run the pipeline via `apptainer exec` against the runtime image, pulled once to a SIF whose path is overridable via `$SP_IMAGE`.
- candide migrated from PBS to SLURM: `qsub`/`#PBS` are gone. The scripts now use `#SBATCH`/`sbatch` and `mpirun -n $SLURM_NTASKS`.
- Bind-mount the host clone (`$SPDIR`) at the same path inside the container so the configs' `$SPDIR`-relative `INPUT_DIR`/`OUTPUT_DIR` resolve identically inside and outside the container.
- Propagate the pipeline exit code (`exit $?`) instead of always ending with `exit 0`.
- Fix the stale `example/pbs/config_mpi.ini` (last touched 2020): its `MODULE` / section names lacked the `_runner` suffix the loader now requires, so the run died with `No module named 'shapepipe.modules.python_example'`. Updated to the current names (matching `example/config.ini`).
Container (`Dockerfile`)
- Build OpenMPI 5.0.8 from source with bundled PMIx 5 / PRRTE (`--with-pmix=internal --with-prrte=internal --disable-dlopen`), dropping `libopenmpi-dev`. The image previously shipped OpenMPI 4.1.4 / PMIx 2, which is wire-incompatible with candide's OpenMPI 5.0.x / PMIx 5 host launcher, so hybrid MPI silently degraded to N independent "rank 0 of 1" processes (wrong science, no error). The stock `mpi4py` wheel dlopens `libmpi.so.40` (ABI-stable across OMPI 4↔5), so `uv.lock` is untouched.
ShapePipe MPI code (`src/shapepipe/pipeline/mpi_run.py`, `src/shapepipe/run.py`)
- Thread `module_config_sec` through `run_mpi` → `submit_mpi_jobs` → `worker()`. By git history the bug dates to #415 ("Multiple Module Runs"): `worker()` gained a `module_config_sec` parameter, but the MPI submit path wasn't updated in step, so it passes 7 args where 8 are required and fails with `worker() missing 1 required positional argument: 'module_runner'`. On candide this path wasn't reachable until now (the PMIx mismatch meant MPI never wired up), so fixing the launcher is what surfaced it.
Exit-code propagation (`src/shapepipe/shapepipe_run.py`)
- `main()` called `run(args)` but didn't return it, so `exit(main())` was always `exit(None)` → 0. That means every caught error in ShapePipe, not just MPI, has been exiting 0, invisible to `exit $?` and CI. Now `return run(args)`, with a regression test. (This is independent of MPI and worth landing on its own.)
CI (`.github/workflows/deploy-image.yml`)
- Tag each pushed branch's image with the branch name (`<branch>-runtime`) so any open PR has a pullable image that can be tested on a real cluster before merge.
Verification (on candide)
- MPI: `candide_mpi.sh` against the published runtime image. 2-node / 4-task hybrid run: ranks distribute across two nodes (`process 0–3 of 4` on n23+n25, not the pre-fix singletons), all three `*_example_runner` modules produce output, real exit 0 (the script's `exit $?`), and `shapepipe.log` records "A total of 0 errors were recorded."
- SMP: `candide_smp.sh` runs the SMP example pipeline end-to-end through the container, 0 errors.
- `bash -n` clean; structural shell-syntax test tier passes.
- Handled errors now exit non-zero (the `main()` fix), so a failed run no longer looks like success to `exit $?`.
Question for review: is MPI a used dependency of ShapePipe at all?
We got MPI working on candide (the three fixes above), but the bigger question this surfaced, @martinkilbinger, is whether it's worth keeping at all. Its entire footprint in the repo is tiny:
- A hard dependency (`mpi4py>=4.0` in `pyproject.toml`) plus the core code (`run_mpi` in `run.py`, `mpi_run.py`).
- Two example scripts, `candide_mpi.sh` (candide) and `cc_mpi.sh` (ccin2p3), and one active example config (`config_mpi.ini`).
- Zero production paths: the production tooling is SMP-only (`N_SMP`).
And functionally it adds nothing: SMP and MPI call the identical `worker()` with identical arguments, the same computation behind two dispatchers (`worker_handler.py` has no MPI; the MPI path just `scatter`s the independent job list and `gather`s results). The workload is embarrassingly parallel, so MPI buys only an ergonomic convenience over "many per-node SMP jobs under SLURM."
So: is MPI actually used anywhere, or can the mode (and the `mpi4py` dependency) be dropped? If it's used, this PR restores it to working order and there's a ready follow-up to harden it against silent desync; if not, removing it (`mpi_run.py`, `run_mpi`, the `import_mpi` branches, `mpi4py`, the two `*_mpi.sh` scripts, `config_mpi.ini`) is clean and contained. Your call: you'd know whether anyone depends on it.
Out of scope (deliberately untouched)
The production submission scripts (`scripts/sh/run_scratch_local.sh`, `init_run_exclusive_canfar.sh`, `job_sp_canfar*.bash`, covering canfar + modern candide) and the ccin2p3 scripts. These target workflows we can't fully verify here; note they are still SMP + conda and not yet containerized (a separate future cleanup).
🤖 Generated with Claude Code
— Claude on behalf of Cail