feat(phase-19): track G eval harness lessons 70-75 #210
📝 Walkthrough
This PR adds six new capstone lessons (70–75) that implement an end-to-end LLM evaluation framework, progressing from task specification validation through classical metrics, code execution scoring, calibration analysis, leaderboard aggregation, and final unified runner orchestration. Catalog metadata is updated to reflect these additions and expanded TypeScript assets for existing capstones.
Changes
Capstone Evaluation Framework
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 14
🧹 Nitpick comments (4)
phases/19-capstone-projects/71-classical-metrics/code/main.py (1)
168-195: 💤 Low value. Consider using `strict=True` in `zip` for explicitness.
The explicit length check at lines 169-170 makes the `zip()` at line 177 safe. However, adding `strict=True` would make the safety contract more explicit and defend against future refactoring that might remove the check.
♻️ Optional improvement
- for cand_text, ref_text in zip(predictions, references):
+ for cand_text, ref_text in zip(predictions, references, strict=True):
phases/19-capstone-projects/73-perplexity-calibration/code/tests/test_calibration.py (1)
111-137: ⚡ Quick win. Add invalid-`bins` tests for decomposition and reliability APIs.
Please add `bins=0`/negative assertions for `brier_decomposition` and `reliability_diagram` so their validation contract is pinned like ECE's.
phases/19-capstone-projects/72-code-exec-metric/code/tests/test_exec.py (1)
137-144: ⚡ Quick win. Add regression tests for `k > n` / `n=0` pass@k behavior.
The current suite doesn't pin the `k > n` edge cases (e.g., no-pass vs at-least-one-pass), so the branch-order bug in `pass_at_k` slipped through.
phases/19-capstone-projects/75-end-to-end-eval-runner/code/tests/test_runner.py (1)
89-89: ⚡ Quick win. Use a throwaway variable for the unused unpacked value.
Line 89 assigns `results` but never reads it.
Suggested fix
- results, buf = run_eval([RuleBasedAdapter()], tasks, parallel=False)
+ _, buf = run_eval([RuleBasedAdapter()], tasks, parallel=False)
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@phases/19-capstone-projects/72-code-exec-metric/code/main.py`:
- Around line 155-168: The code currently uses subprocess.run(...,
capture_output=True) which buffers the entire child stdout before we check
OUTPUT_CAP_BYTES; replace this with spawning the process via subprocess.Popen
(or subprocess.run with stdout=subprocess.PIPE but reading the pipe
incrementally) and stream-read proc.stdout up to OUTPUT_CAP_BYTES so you never
accumulate more than the cap in memory; if the cap is exceeded, terminate the
child and return ExecResult(..., detail="output overflow") as before. Refer to
the subprocess.run/capture_output usage and the OUTPUT_CAP_BYTES/ExecResult
symbols to locate the change.
- Around line 204-207: The early-return order in pass_at_k is wrong: check k > n
before checking n - c < k so cases like pass_at_k(5, 0, 10) don't short-circuit
to 1.0; update the branch order in pass_at_k (or adjust the first condition to
include k <= n) so that the guard if k > n: return float(c > 0) executes before
evaluating if n - c < k: return 1.0, preserving correct probability outputs.
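The streaming-read fix requested for `subprocess.run(..., capture_output=True)` could look roughly like the sketch below. It is an illustration under assumptions, not the lesson's code: `run_capped`, the 64 KiB cap value, the 4096-byte chunk size, and the returned status strings other than "output overflow" are all hypothetical. At most the cap plus one chunk is ever held in memory.

```python
import subprocess
import sys
import threading

OUTPUT_CAP_BYTES = 64 * 1024  # assumed value; the lesson defines its own cap


def run_capped(argv, cap=OUTPUT_CAP_BYTES, timeout=5.0):
    """Run argv and keep at most `cap` bytes of its stdout.

    Returns (detail, data); detail is "ok", "nonzero exit", or "output overflow".
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
    # Kill the child if it runs too long, even while we are blocked in read().
    timer = threading.Timer(timeout, proc.kill)
    timer.start()
    chunks, total, overflow = [], 0, False
    try:
        while True:
            chunk = proc.stdout.read(4096)
            if not chunk:  # EOF: the child closed stdout or was killed
                break
            total += len(chunk)
            if total > cap:  # stop reading and terminate the child
                overflow = True
                proc.kill()
                break
            chunks.append(chunk)
    finally:
        timer.cancel()
        proc.stdout.close()
        proc.wait()
    data = b"".join(chunks)
    if overflow:
        return "output overflow", data
    return ("ok" if proc.returncode == 0 else "nonzero exit"), data


if __name__ == "__main__":
    noisy = [sys.executable, "-c", "print('x' * 1_000_000)"]
    print(run_capped(noisy, cap=1024)[0])
```

A timer thread is used here only to keep the sketch short; a selector-based loop would be an alternative if per-read deadlines matter.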
In `@phases/19-capstone-projects/72-code-exec-metric/docs/en.md`:
- Around line 78-80: The fenced block containing the formula for pass_at_k(n, c,
k) is unlabeled and triggers MD040; edit the fenced block that holds
"pass_at_k(n, c, k) = 1 - C(n - c, k) / C(n, k)" and add a language tag such as
```math or ```latex (e.g., change the opening fence from ``` to ```math) so the
block is properly labeled for markdown linters while leaving the formula text
unchanged.
- Around line 46-47: Update the docs to match the implementation: reconcile the
denylist entries and the exit/overflow behavior described with what's
implemented in main.py by either (A) aligning the documentation to list only the
actual denied imports present in the denylist variable in main.py (remove items
like pathlib, multiprocessing, threading.Thread.start if they are not present)
or (B) update main.py to include the documented entries if they are intended,
and make sure import-rewrite logic references the exact denylist symbol; also
update the outcomes section to reflect that output overflow currently returns
exit_code="error" with overflow details rather than a separate output_overflow
outcome (or change implementation to emit a distinct output_overflow outcome) so
the docs and the behavior of exit_code/overflow are consistent.
In `@phases/19-capstone-projects/72-code-exec-metric/quiz.json`:
- Around line 67-76: The quiz currently implies a separate outcome named
output_overflow, but the runtime normalizes large-output incidents into
exit_code="error" with a detail message indicating output overflow; update the
question, options, and explanation to state that output_overflow is represented
in the runtime as exit_code="error" with a detail string (e.g.,
"output_overflow") rather than a standalone exit state, and adjust the correct
option and explanation text to match this schema (referencing output_overflow,
exit_code, and the detail text).
In `@phases/19-capstone-projects/73-perplexity-calibration/code/main.py`:
- Around line 111-147: Both brier_decomposition and reliability_diagram are
missing validation for the bins parameter; add the same check used by
expected_calibration_error so bins must be an integer > 0 and raise a ValueError
otherwise. In both functions (before calling _bin_indices or using bins)
validate that isinstance(bins, int) and bins > 0, raising a clear ValueError if
not, so downstream calls to _bin_indices and array allocations cannot silently
produce invalid results.
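A minimal sketch of the `bins` check described above, assuming equal-width binning over [0, 1]. `validate_bins` and `bin_index` are hypothetical names, not the lesson's `_bin_indices`; rejecting `bool` is an extra choice made here (because `bool` is a subclass of `int`), and NumPy integer types would also be rejected by this version.

```python
def validate_bins(bins):
    """Raise ValueError unless bins is a positive int."""
    # bool is a subclass of int, so True/False are rejected explicitly.
    if isinstance(bins, bool) or not isinstance(bins, int) or bins <= 0:
        raise ValueError(f"bins must be a positive integer, got {bins!r}")
    return bins


def bin_index(confidence, bins):
    """Map a confidence in [0, 1] to an equal-width bin index in [0, bins - 1]."""
    validate_bins(bins)
    # confidence == 1.0 would land in bin `bins`; clamp it into the last bin.
    return min(int(confidence * bins), bins - 1)


if __name__ == "__main__":
    print(bin_index(1.0, 10))
    try:
        bin_index(0.5, 0)
    except ValueError as exc:
        print("rejected:", exc)
```

Calling the same helper at the top of each public function keeps the validation contract identical across the ECE, decomposition, and reliability entry points.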
In `@phases/19-capstone-projects/73-perplexity-calibration/docs/en.md`:
- Around line 87-89: Update the docs text under the "Empty input" / "perplexity
zero-token case" paragraph to reflect actual behavior: the implementation
returns NaN for zero-token perplexity and currently does not emit a warning;
change the wording that currently claims "returns 0.0 (or NaN, but we choose 0.0
with a flag) and a warning" to state it returns NaN and no warning (or
explicitly document the flag/behavior if you choose to change code instead).
Ensure the doc string references the "perplexity zero-token case" so readers can
map it to the tests and implementation.
In `@phases/19-capstone-projects/74-leaderboard-aggregation/code/main.py`:
- Around line 197-207: pairwise_diffs currently skips the [0,1] score validation
that aggregate enforces; add the same input validation at the start of
pairwise_diffs (before building common_task_scores or computing diffs) by
invoking the same validation logic used by aggregate (i.e., the helper or checks
that assert each run.score is between 0 and 1), so invalid scores raise an error
instead of producing misleading significance results; update pairwise_diffs (and
references to runs/common_task_scores) to reuse that helper rather than
duplicating logic.
- Around line 75-79: In _validate_runs, besides checking score range for each
EvalRun in runs, detect and reject duplicate (model_id, task_id) pairs: iterate
through runs, maintain a seen set of tuples (r.model_id, r.task_id), and if a
tuple is already present raise a ValueError indicating the duplicate; keep the
existing score bounds check for function _validate_runs and type EvalRun intact.
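The two checks requested for `_validate_runs` (score range and duplicate `(model_id, task_id)` pairs) can be sketched together as below. The `EvalRun` shape shown is a hypothetical minimal stand-in; the lesson's type may carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    # Hypothetical minimal shape for illustration only.
    model_id: str
    task_id: str
    score: float


def validate_runs(runs):
    """Reject out-of-range scores and duplicate (model_id, task_id) pairs."""
    seen = set()
    for r in runs:
        # The chained comparison is False for NaN, so NaN is rejected too.
        if not 0.0 <= r.score <= 1.0:
            raise ValueError(
                f"score out of [0, 1] for {r.model_id}/{r.task_id}: {r.score}"
            )
        key = (r.model_id, r.task_id)
        if key in seen:
            raise ValueError(
                f"duplicate run for model={r.model_id!r} task={r.task_id!r}"
            )
        seen.add(key)


if __name__ == "__main__":
    validate_runs([EvalRun("a", "t1", 0.5), EvalRun("a", "t2", 1.0)])
    try:
        validate_runs([EvalRun("a", "t1", 0.5), EvalRun("a", "t1", 0.7)])
    except ValueError as exc:
        print("rejected:", exc)
```

If both `aggregate` and `pairwise_diffs` call one helper like this at entry, the validation cannot drift between them.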
In `@phases/19-capstone-projects/74-leaderboard-aggregation/docs/en.md`:
- Around line 98-99: Docs claim default bootstrap iterations B=1000 but the
public aggregator entry points use parameter b with default=500, causing
inconsistency; pick one source of truth and align them: either update the docs
text in docs/en.md to state the default B is 500, or change the function
signatures that expose the bootstrap parameter (the public aggregator entry
points that accept argument b and currently default to 500) to default to 1000;
ensure the alpha default mentioned in docs remains consistent with the code's
alpha/default_alpha variable as well.
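For context on what the `b` parameter controls, here is a rough percentile-bootstrap sketch using only the standard library. It is not the lesson's aggregator: `bootstrap_ci` is a hypothetical name, `b=500` mirrors the code default quoted in this comment, and `alpha=0.05` is an assumed value.

```python
import random
from statistics import fmean


def bootstrap_ci(scores, b=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    n = len(scores)
    # Resample with replacement b times and record each resample's mean.
    means = sorted(fmean(rng.choices(scores, k=n)) for _ in range(b))
    lo = means[int((alpha / 2) * b)]
    hi = means[min(b - 1, int((1 - alpha / 2) * b))]
    return lo, hi


if __name__ == "__main__":
    print(bootstrap_ci([0.2, 0.4, 0.6, 0.8, 1.0]))
```

A larger `b` mainly reduces the Monte Carlo noise in the interval endpoints, which is why the documented and actual defaults should agree.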
In `@phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py`:
- Around line 145-146: The code builds calibration_buf = {a.model_id: [] for a
in adapters} which silently merges adapters with duplicate model_id values;
before using calibration_buf (and the similar logic around lines that initialize
report rows), validate adapters for unique model_id values and raise or
fail-fast on duplicates. Add a pre-check that collects model_id from adapters
(e.g., via a set or counting) and if any id appears more than once, raise a
ValueError or log an explicit error mentioning the duplicate id(s) and stop
processing; update any initialization that assumes unique keys (calibration_buf
and report row creation) to only run after this uniqueness check. Ensure the
error references the adapter list and model_id to help locate offending entries.
- Line 376: The print call currently uses an f-string with no interpolation
(print(f"ERROR: rule_based should not be at the bottom")); remove the
unnecessary f-prefix so it is a normal string literal (e.g., print("ERROR:
rule_based should not be at the bottom")) to resolve Ruff F541; locate the print
statement in main.py (the line that prints "ERROR: rule_based should not be at
the bottom") and update it accordingly.
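The fail-fast uniqueness check for `model_id` could be sketched as follows. `check_unique_model_ids` is a hypothetical helper that takes plain id strings rather than adapter objects, to keep the example self-contained.

```python
from collections import Counter


def check_unique_model_ids(model_ids):
    """Raise ValueError naming every model_id that appears more than once."""
    counts = Counter(model_ids)
    dupes = sorted(m for m, n in counts.items() if n > 1)
    if dupes:
        raise ValueError(f"duplicate model_id(s) in adapters: {dupes}")


if __name__ == "__main__":
    ids = ["rule_based", "noisy", "biased"]
    check_unique_model_ids(ids)
    # Only build per-model buffers after the ids are known to be unique.
    calibration_buf = {m: [] for m in ids}
    print(sorted(calibration_buf))
```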
In `@phases/19-capstone-projects/75-end-to-end-eval-runner/docs/en.md`:
- Around line 65-66: The docs describe APIs that don't match the code: update
all references of run_eval(adapter, tasks, parallel=False) to reflect the actual
signature run_eval(adapters, tasks, parallel=False) (note plural), remove any
mention of a non-existent correct_threshold parameter, and replace references to
the helper imports with the actual helper name _load_sibling; apply these fixes
consistently across the document (including the other occurrences noted around
lines 78-79 and 121) so the documentation matches the real symbols run_eval,
adapters, correct_threshold (remove), imports -> _load_sibling.
In `@phases/19-capstone-projects/75-end-to-end-eval-runner/quiz.json`:
- Around line 31-40: The quiz explanation references a non-existent "threshold
override" for the runner's calibration buffer logic; either remove that claim
from the explanation or add an explicit optional override parameter to the
runner API and use it when deciding the calibration flag. Concretely: if you
prefer API change, add an optional parameter (e.g., thresholdOverride) to the
runner method that computes the calibration buffer/flag (the function currently
responsible for "calibration buffer" / flag determination), have the decision
logic use thresholdOverride when provided and otherwise default to
exact_match=1.0 and graded metrics=0.5; if you prefer documentation change,
update the explanation text to omit the "runner accepts an override" sentence so
it matches the current runner behavior.
---
Nitpick comments:
In `@phases/19-capstone-projects/71-classical-metrics/code/main.py`:
- Around line 168-195: In corpus_bleu, make the pairing of predictions and
references explicit by passing strict=True to the zip call inside the loop (the
zip in function corpus_bleu that currently reads zip(predictions, references));
this enforces the existing length-check contract at runtime and prevents silent
truncation if future refactors remove the prior len check—keep the existing
length validation and just add strict=True to the zip invocation.
In `@phases/19-capstone-projects/72-code-exec-metric/code/tests/test_exec.py`:
- Around line 137-144: Add explicit regression tests for pass_at_k to pin the
edge-case behavior: add one test (e.g., test_k_greater_than_n_raises) that calls
pass_at_k with k > n (e.g., pass_at_k(5, 3, 1) or pass_at_k(5, 6, 1) depending
on intended semantics) and asserts it raises ValueError, and add another test
(e.g., test_n_zero_returns_zero) that calls pass_at_k with n == 0 (e.g.,
pass_at_k(1, 0, 0)) and asserts it returns 0; reference the pass_at_k function
to locate where to add these tests and use clear test names to lock in the
expected branch behavior.
In
`@phases/19-capstone-projects/73-perplexity-calibration/code/tests/test_calibration.py`:
- Around line 111-137: Add tests asserting that passing non-positive bin counts
raises a validation error: call brier_decomposition(conf, corr, bins=0) and
bins=-1 (use synthetic or small arrays) and assert it raises ValueError (or the
project's chosen validation exception), and similarly call
reliability_diagram(conf, corr, bins=0) and bins=-1 and assert the same;
reference the functions brier_decomposition and reliability_diagram so test
names like test_invalid_bins_raise_for_brier_decomposition and
test_invalid_bins_raise_for_reliability_diagram clearly cover bins<=0 cases.
In
`@phases/19-capstone-projects/75-end-to-end-eval-runner/code/tests/test_runner.py`:
- Line 89: The test unpacks two values from run_eval([...]) into results and buf
but never uses results; change the unpack to use a throwaway variable (e.g., "_"
or "_results") instead of results to make intent clear and avoid unused-variable
warnings. Update the unpack expression where run_eval is called in the test (the
line invoking run_eval with RuleBasedAdapter) to replace the first target with
the throwaway name.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 944fdc59-f42c-47e4-8fef-c77be693d956
📒 Files selected for processing (25)
- catalog.json
- phases/19-capstone-projects/70-task-spec-format/code/main.py
- phases/19-capstone-projects/70-task-spec-format/code/tests/test_spec.py
- phases/19-capstone-projects/70-task-spec-format/docs/en.md
- phases/19-capstone-projects/70-task-spec-format/quiz.json
- phases/19-capstone-projects/71-classical-metrics/code/main.py
- phases/19-capstone-projects/71-classical-metrics/code/tests/test_metrics.py
- phases/19-capstone-projects/71-classical-metrics/docs/en.md
- phases/19-capstone-projects/71-classical-metrics/quiz.json
- phases/19-capstone-projects/72-code-exec-metric/code/main.py
- phases/19-capstone-projects/72-code-exec-metric/code/tests/test_exec.py
- phases/19-capstone-projects/72-code-exec-metric/docs/en.md
- phases/19-capstone-projects/72-code-exec-metric/quiz.json
- phases/19-capstone-projects/73-perplexity-calibration/code/main.py
- phases/19-capstone-projects/73-perplexity-calibration/code/tests/test_calibration.py
- phases/19-capstone-projects/73-perplexity-calibration/docs/en.md
- phases/19-capstone-projects/73-perplexity-calibration/quiz.json
- phases/19-capstone-projects/74-leaderboard-aggregation/code/main.py
- phases/19-capstone-projects/74-leaderboard-aggregation/code/tests/test_leaderboard.py
- phases/19-capstone-projects/74-leaderboard-aggregation/docs/en.md
- phases/19-capstone-projects/74-leaderboard-aggregation/quiz.json
- phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py
- phases/19-capstone-projects/75-end-to-end-eval-runner/code/tests/test_runner.py
- phases/19-capstone-projects/75-end-to-end-eval-runner/docs/en.md
- phases/19-capstone-projects/75-end-to-end-eval-runner/quiz.json
```
pass_at_k(n, c, k) = 1 - C(n - c, k) / C(n, k)
```
Add a language tag to the fenced formula block.
Line 78 uses an unlabeled fenced code block, which trips markdown lint (MD040).
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 78-78: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
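The formula in the flagged block, combined with the guard order this review requests for `pass_at_k` (check `k > n` before `n - c < k`), can be sketched as below. This is an illustration rather than the lesson's implementation; the input validation on the first line is an assumption added for the example.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k).

    n: samples drawn, c: samples that passed, k: attempt budget.
    """
    if n < 0 or k < 1 or not 0 <= c <= n:
        raise ValueError("need n >= 0, k >= 1 and 0 <= c <= n")
    # Guard order matters: with k > n, C(n, k) is 0, and testing n - c < k
    # first would wrongly return 1.0 even when nothing passed.
    if k > n:
        return float(c > 0)
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(5, 0, 10))  # k > n and no passes
    print(pass_at_k(5, 2, 10))  # k > n and at least one pass
    print(pass_at_k(4, 2, 2))   # ordinary case
```

With this ordering, `pass_at_k(5, 0, 10)` yields 0.0 instead of short-circuiting to 1.0, which is the case the inline comment singles out.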
    print(f"ERROR: expected rule_based at top, got {top}")
    return 2
    if bot == "rule_based":
    print(f"ERROR: rule_based should not be at the bottom")
Remove the empty f-string prefix.
Line 376 uses an f-string with no interpolation, which triggers Ruff F541.
Suggested fix
- print(f"ERROR: rule_based should not be at the bottom")
+ print("ERROR: rule_based should not be at the bottom")
🧰 Tools
🪛 Ruff (0.15.14)
[error] 376-376: f-string without any placeholders
Remove extraneous f prefix
(F541)
…arify bootstrap defaults
♻️ Duplicate comments (1)
phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py (1)
379-379: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win. Remove the unnecessary f-string prefix.
This f-string has no placeholders. The fix from the previous review was not applied.
- print(f"ERROR: rule_based should not be at the bottom")
+ print("ERROR: rule_based should not be at the bottom")
🧹 Nitpick comments (1)
phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py (1)
208-220: 💤 Low value. Remove unused parameters from `_perplexity_blocks`.
Parameters `buf` and `results` are never used in the function body. Only `adapters` and `adapter_token_stats` are actually needed.
-def _perplexity_blocks(adapters: Sequence[ModelAdapter], buf: dict[str, list[tuple[float, float, int]]],
-                       results: Sequence[TaskResult], adapter_token_stats: dict[str, list[tuple[float, int]]]) -> dict[str, dict]:
+def _perplexity_blocks(adapters: Sequence[ModelAdapter],
+                       adapter_token_stats: dict[str, list[tuple[float, int]]]) -> dict[str, dict]:
Also update the call site at line 235:
- perplexity = _perplexity_blocks(adapters, calibration_buf, results, adapter_token_stats)
+ perplexity = _perplexity_blocks(adapters, adapter_token_stats)
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py`:
- Line 379: The print call using an f-string has no placeholders; locate the
print statement containing the literal "ERROR: rule_based should not be at the
bottom" and remove the unnecessary f-string prefix so it is a normal string
literal (i.e., change the print invocation to use a plain string instead of an
f-string).
---
Nitpick comments:
In `@phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py`:
- Around line 208-220: The function _perplexity_blocks currently accepts unused
parameters buf and results; remove these parameters from the signature so it
becomes _perplexity_blocks(adapters: Sequence[ModelAdapter],
adapter_token_stats: dict[str, list[tuple[float, int]]]) and update any call
sites that invoke _perplexity_blocks to stop passing buf and results (pass only
adapters and adapter_token_stats), keeping the internal logic (stats extraction,
PerplexityResult construction) unchanged; verify types and run tests to ensure
call sites (the spot that previously passed four args) are updated accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6f9ddd21-9fb2-4509-b602-a6b77dd879b0
📒 Files selected for processing (10)
- phases/19-capstone-projects/72-code-exec-metric/code/main.py
- phases/19-capstone-projects/72-code-exec-metric/docs/en.md
- phases/19-capstone-projects/72-code-exec-metric/quiz.json
- phases/19-capstone-projects/73-perplexity-calibration/code/main.py
- phases/19-capstone-projects/73-perplexity-calibration/docs/en.md
- phases/19-capstone-projects/74-leaderboard-aggregation/code/main.py
- phases/19-capstone-projects/74-leaderboard-aggregation/docs/en.md
- phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py
- phases/19-capstone-projects/75-end-to-end-eval-runner/docs/en.md
- phases/19-capstone-projects/75-end-to-end-eval-runner/quiz.json
✅ Files skipped from review due to trivial changes (6)
- phases/19-capstone-projects/75-end-to-end-eval-runner/quiz.json
- phases/19-capstone-projects/74-leaderboard-aggregation/docs/en.md
- phases/19-capstone-projects/72-code-exec-metric/docs/en.md
- phases/19-capstone-projects/72-code-exec-metric/quiz.json
- phases/19-capstone-projects/73-perplexity-calibration/docs/en.md
- phases/19-capstone-projects/75-end-to-end-eval-runner/docs/en.md
# Conflicts: # catalog.json
Summary
Track G of the Phase 19 capstone series: a self-contained LLM eval harness built from scratch over six atomic lessons.
`ModelAdapter` interface, parallel task execution via `ThreadPoolExecutor`, three mock adapters (rule-based, noisy, biased), self-terminating demo over the lesson-70 fixture.
Stdlib + `numpy` only. No `evaluate`, `lm-eval`, `scipy`, or `nltk`. BLEU / ROUGE / F1 are implemented from scratch.
Test plan
- `python3 code/main.py` exits 0 in all six lessons
- `python3 -m unittest discover -s code/tests` passes in all six lessons (139 tests total)
- `site/`, root `README.md`, or `catalog.json`
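To make the summary's `ModelAdapter` interface and `ThreadPoolExecutor`-based execution concrete, here is a stdlib-only sketch. Everything beyond those two names is an assumption: the `generate` method, the `EchoAdapter` mock, the `run_all` helper, and the worker count are hypothetical and do not reproduce the lesson's `run_eval`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Protocol


class ModelAdapter(Protocol):
    # Assumed minimal interface: an id plus one text-in, text-out method.
    model_id: str

    def generate(self, prompt: str) -> str: ...


class EchoAdapter:
    """Hypothetical mock adapter that upper-cases its prompt."""

    model_id = "echo"

    def generate(self, prompt: str) -> str:
        return prompt.upper()


def run_all(adapters, prompts, parallel=True, max_workers=4):
    """Return {model_id: [outputs in prompt order]}."""
    jobs = [(a, p) for a in adapters for p in prompts]
    if parallel:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Executor.map yields results in submission order, so outputs
            # line up with jobs even if calls finish out of order.
            outputs = list(pool.map(lambda job: job[0].generate(job[1]), jobs))
    else:
        outputs = [a.generate(p) for a, p in jobs]
    results = {a.model_id: [] for a in adapters}
    for (adapter, _), out in zip(jobs, outputs):
        results[adapter.model_id].append(out)
    return results


if __name__ == "__main__":
    print(run_all([EchoAdapter()], ["hello", "world"]))
```

Threads suit this shape of work when adapters spend their time waiting on I/O; a sequential path (`parallel=False`) keeps test runs deterministic and easy to debug.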