feat(phase-19): track G eval harness lessons 70-75 by rohitg00 · Pull Request #210 · rohitg00/ai-engineering-from-scratch

rohitg00 · 2026-05-26T18:41:28Z

Summary

Track G of the Phase 19 capstone series: a self-contained LLM eval harness built from scratch over six atomic lessons.

70 task-spec-format: JSONL task schema, closed metric vocabulary, validator, post-process rules, 10 fixture tasks across 5 categories.
71 classical-metrics: exact-match, F1, BLEU-4 (modified n-gram precision + brevity penalty + add-one smoothing), ROUGE-L (LCS), accuracy. One frozen tokenizer.
72 code-exec-metric: extract fenced code, run in isolated subprocess with wall-clock timeout / output cap / import denylist, five-way exit code vocabulary, pass-at-k math.
73 perplexity-calibration: corpus perplexity, ECE (binned gap), Brier score with reliability/resolution/uncertainty decomposition, reliability diagram data.
74 leaderboard-aggregation: per-model mean + win-rate + per-category breakdown, percentile bootstrap CI on means and pairwise diffs, JSON + markdown rendering.
75 end-to-end-eval-runner: composes 70-74 with a minimal ModelAdapter interface, parallel task execution via ThreadPoolExecutor, three mock adapters (rule-based, noisy, biased), self-terminating demo over the lesson-70 fixture.

Stdlib + numpy only. No evaluate, lm-eval, scipy, or nltk. BLEU / ROUGE / F1 are implemented from scratch.
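
To make "implemented from scratch" concrete, here is a minimal sketch of token-level F1 with multiset overlap, the approach the lesson 71 summary describes. The lowercase whitespace tokenizer and the names `tokenize`, `token_f1`, and `best_f1` are assumptions for illustration, not the PR's actual code.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase and split on whitespace.
    # The lesson's frozen tokenizer may behave differently.
    return text.lower().split()


def token_f1(pred: str, target: str) -> float:
    pred_tokens = tokenize(pred)
    target_tokens = tokenize(target)
    if not pred_tokens or not target_tokens:
        # Two empty strings count as a match; one empty side scores zero.
        return float(pred_tokens == target_tokens)
    # Multiset intersection: a token counts at most min(pred, target) times.
    overlap = sum((Counter(pred_tokens) & Counter(target_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(pred: str, targets: list[str]) -> float:
    # Multi-target scoring keeps the best score over all references.
    return max(token_f1(pred, t) for t in targets)
```

For example, `token_f1("the cat sat", "the cat")` has precision 2/3 and recall 1, which gives an F1 of 0.8.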

Test plan

python3 code/main.py exits 0 in all six lessons
python3 -m unittest discover -s code/tests passes in all six lessons (139 tests total)
Lesson 75 demo composes 70-74 end-to-end in under one second, rule-based adapter ranks #1
No edits to site/, root README.md, or catalog.json
Atomic per-lesson commits, conventional commit messages
Docs use mermaid diagrams, all fences language-tagged, no em-dashes

coderabbitai · 2026-05-26T18:41:43Z

📝 Walkthrough

This PR adds six new capstone lessons (70–75) that implement an end-to-end LLM evaluation framework, progressing from task specification validation through classical metrics, code execution scoring, calibration analysis, leaderboard aggregation, and final unified runner orchestration. Catalog metadata is updated to reflect these additions and expanded TypeScript assets for existing capstones.

Changes

Capstone Evaluation Framework

| Layer / File(s) | Summary |
| --- | --- |
| Catalog metadata `catalog.json` | Increased aggregate lesson count from 435 to 441 and code_files from 487 to 519. Phase 19 lesson_count raised from 17 to 23. Added entries for lessons 70–75, each with has_docs/has_code/has_quiz and code_files (main.py, test_*.py). Extended code_files for existing capstones 09–11 with TypeScript sources and tests. |
| Task Spec Format (Lesson 70) `phases/19-capstone-projects/70-task-spec-format/code/main.py`, `code/tests/test_spec.py`, `docs/en.md`, `quiz.json` | Defines frozen JSONL task schema with required fields (`task_id`, `category`, `prompt`, `targets`, `metric_name`, `post_process`) and optional fields (`few_shot_examples`, `metadata`). Implements validation returning line-numbered errors, detects duplicate `task_id`s, provides prompt rendering with few-shot support, and deterministic post-processing (`extract_letter`, `extract_code_block`, `strip_whitespace`, etc.). Tests cover validation paths, file I/O, rendering, and post-processing rules. |
| Classical Metrics (Lesson 71) `phases/19-capstone-projects/71-classical-metrics/code/main.py`, `code/tests/test_metrics.py`, `docs/en.md`, `quiz.json` | Implements exact-match, token-level F1 (multiset overlap), BLEU-4 (modified n-gram precision, brevity penalty, add-one smoothing), and ROUGE-L (LCS-based). Provides tokenizer contract, a `score(metric_name, pred, targets)` dispatcher with multi-target max selection, corpus-level BLEU, and tests validating numeric correctness and edge cases. |
| Code Execution Metric (Lesson 72) `phases/19-capstone-projects/72-code-exec-metric/code/main.py`, `code/tests/test_exec.py`, `docs/en.md`, `quiz.json` | Extracts Python code from fenced blocks, executes it in an isolated subprocess with import denylist and `os.system` blocked, evaluates assertions as pass/fail, normalizes exit codes (`pass`, `assertion_fail`, `syntax_error`, `timeout`, `error`), computes pass@k probability and estimators. Tests validate extraction, execution outcomes, timeouts, denial cases, and pass@k math. |
| Calibration & Perplexity (Lesson 73) `phases/19-capstone-projects/73-perplexity-calibration/code/main.py`, `code/tests/test_calibration.py`, `docs/en.md`, `quiz.json` | Computes token-level perplexity from adapter-provided negative log-likelihoods, Expected Calibration Error via binning, Brier score with reliability/resolution/uncertainty decomposition, and reliability-diagram aggregates. Adds `CalibrationReport` and `PerplexityResult` dataclasses, synthetic generators for calibrated/over/underconfident data, tests, docs, and quiz. |
| Leaderboard Aggregation (Lesson 74) `phases/19-capstone-projects/74-leaderboard-aggregation/code/main.py`, `code/tests/test_leaderboard.py`, `docs/en.md`, `quiz.json` | Aggregates per-model scores into leaderboard rows with bootstrap mean confidence intervals, computes win-rates with tie handling, produces paired bootstrap pairwise differences with significance flags, and renders Markdown/JSON. Includes tests and documentation. |
| End-to-End Eval Runner (Lesson 75) `phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py`, `code/tests/test_runner.py`, `docs/en.md`, `quiz.json` | Orchestrates the full evaluation pipeline: dynamic sibling-module loading, `ModelAdapter` interface and concrete adapters (RuleBased/Noisy/Biased), per-(adapter,task) scoring (prompt render → generate → post_process → code_exec/metric), buffering per-model calibration and token stats, aggregates into `EvalReport` (leaderboard + pairwise + calibration + perplexity), and includes demo, tests, docs, and quiz. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title 'feat(phase-19): track G eval harness lessons 70-75' clearly describes the main change: adding six new capstone lessons (70-75) that implement an LLM evaluation harness. It is concise, specific, and directly related to the changeset. |
| Description check | ✅ Passed | The PR description provides comprehensive detail about all six lessons (70-75), their purpose, implementation approach, testing verification, and constraints. It directly relates to the changeset and explains the intent clearly. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 14

🧹 Nitpick comments (4)

phases/19-capstone-projects/71-classical-metrics/code/main.py (1)

168-195: 💤 Low value

Consider using strict=True in zip for explicitness.

The explicit length check at lines 169-170 makes the zip() at line 177 safe. However, adding strict=True would make the safety contract more explicit and defend against future refactoring that might remove the check.

♻️ Optional improvement

-    for cand_text, ref_text in zip(predictions, references):
+    for cand_text, ref_text in zip(predictions, references, strict=True):

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@phases/19-capstone-projects/71-classical-metrics/code/main.py` around lines
168 - 195, In corpus_bleu, make the pairing of predictions and references
explicit by passing strict=True to the zip call inside the loop (the zip in
function corpus_bleu that currently reads zip(predictions, references)); this
enforces the existing length-check contract at runtime and prevents silent
truncation if future refactors remove the prior len check—keep the existing
length validation and just add strict=True to the zip invocation.

phases/19-capstone-projects/73-perplexity-calibration/code/tests/test_calibration.py (1)

111-137: ⚡ Quick win

Add invalid-bins tests for decomposition and reliability APIs.

Please add bins=0/negative assertions for brier_decomposition and reliability_diagram so their validation contract is pinned like ECE’s.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@phases/19-capstone-projects/73-perplexity-calibration/code/tests/test_calibration.py`
around lines 111 - 137, Add tests asserting that passing non-positive bin counts
raises a validation error: call brier_decomposition(conf, corr, bins=0) and
bins=-1 (use synthetic or small arrays) and assert it raises ValueError (or the
project's chosen validation exception), and similarly call
reliability_diagram(conf, corr, bins=0) and bins=-1 and assert the same;
reference the functions brier_decomposition and reliability_diagram so test
names like test_invalid_bins_raise_for_brier_decomposition and
test_invalid_bins_raise_for_reliability_diagram clearly cover bins<=0 cases.

phases/19-capstone-projects/72-code-exec-metric/code/tests/test_exec.py (1)

137-144: ⚡ Quick win

Add regression tests for k > n/n=0 pass@k behavior.

Current suite doesn’t pin the k > n edge cases (e.g., no-pass vs at-least-one-pass), so the branch-order bug in pass_at_k slipped through.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@phases/19-capstone-projects/72-code-exec-metric/code/tests/test_exec.py`
around lines 137 - 144, Add explicit regression tests for pass_at_k to pin the
edge-case behavior: add one test (e.g., test_k_greater_than_n_raises) that calls
pass_at_k with k > n (e.g., pass_at_k(3, 0, 5) or pass_at_k(3, 1, 5) in the
(n, c, k) argument order, depending on intended semantics) and asserts it raises
ValueError, and add another test (e.g., test_n_zero_returns_zero) that calls
pass_at_k with n == 0 (e.g., pass_at_k(0, 0, 1)) and asserts it returns 0;
reference the pass_at_k function to locate where to add these tests and use
clear test names to lock in the expected branch behavior.
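
The edge cases in question are easier to see against a concrete function. The sketch below assumes the `(n, c, k)` argument order from the docs formula and follows the guard described in the inline comment on `main.py` (return `float(c > 0)` when `k > n`). It is not the PR's implementation, and if the project prefers to raise `ValueError` for `k > n`, that branch would change accordingly.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples drawn, c of them correct, k the sampling budget."""
    if n < 0 or k <= 0 or not 0 <= c <= n:
        raise ValueError("require n >= 0, k >= 1 and 0 <= c <= n")
    if k > n:
        # Assumed fallback: the estimator is undefined here, so report
        # whether any sample passed. This guard must come first.
        return float(c > 0)
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With the guards in this order, `pass_at_k(5, 0, 10)` returns 0.0 rather than short-circuiting to 1.0, and `pass_at_k(0, 0, 1)` returns 0.0.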

phases/19-capstone-projects/75-end-to-end-eval-runner/code/tests/test_runner.py (1)

89-89: ⚡ Quick win

Use a throwaway variable for the unused unpacked value.

Line 89 assigns results but never reads it.

Suggested fix

-        results, buf = run_eval([RuleBasedAdapter()], tasks, parallel=False)
+        _, buf = run_eval([RuleBasedAdapter()], tasks, parallel=False)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@phases/19-capstone-projects/75-end-to-end-eval-runner/code/tests/test_runner.py`
at line 89, The test unpacks two values from run_eval([...]) into results and
buf but never uses results; change the unpack to use a throwaway variable (e.g.,
"_" or "_results") instead of results to make intent clear and avoid
unused-variable warnings. Update the unpack expression where run_eval is called
in the test (the line invoking run_eval with RuleBasedAdapter) to replace the
first target with the throwaway name.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@phases/19-capstone-projects/72-code-exec-metric/code/main.py`:
- Around line 155-168: The code currently uses subprocess.run(...,
capture_output=True) which buffers the entire child stdout before we check
OUTPUT_CAP_BYTES; replace this with spawning the process via subprocess.Popen
(or subprocess.run with stdout=subprocess.PIPE but reading the pipe
incrementally) and stream-read proc.stdout up to OUTPUT_CAP_BYTES so you never
accumulate more than the cap in memory; if the cap is exceeded, terminate the
child and return ExecResult(..., detail="output overflow") as before. Refer to
the subprocess.run/capture_output usage and the OUTPUT_CAP_BYTES/ExecResult
symbols to locate the change.
- Around line 204-207: The early-return order in pass_at_k is wrong: check k > n
before checking n - c < k so cases like pass_at_k(5, 0, 10) don't short-circuit
to 1.0; update the branch order in pass_at_k (or adjust the first condition to
include k <= n) so that the guard if k > n: return float(c > 0) executes before
evaluating if n - c < k: return 1.0, preserving correct probability outputs.
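
For the first item above (output cap), one possible shape of a streaming read is sketched here. The function name `run_capped`, the status strings, and the 64 KiB cap value are assumptions for illustration. The lesson's `ExecResult` type, denylist handling, and real cap are not reproduced.

```python
import subprocess
import sys
import threading

OUTPUT_CAP_BYTES = 64 * 1024  # assumed value, not taken from the PR


def run_capped(code: str, timeout_s: float = 5.0,
               cap: int = OUTPUT_CAP_BYTES) -> tuple[str, bytes]:
    """Return (status, stdout); status is 'ok', 'timeout' or 'output overflow'."""
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    timed_out = threading.Event()

    def _kill_on_timeout() -> None:
        timed_out.set()
        proc.kill()

    # The timer enforces the wall-clock limit while the main thread reads.
    timer = threading.Timer(timeout_s, _kill_on_timeout)
    timer.start()
    kept = bytearray()
    overflow = False
    try:
        while True:
            # Never request more than the remaining allowance plus one byte,
            # so at most cap + 1 bytes are ever held in memory.
            chunk = proc.stdout.read(min(4096, cap - len(kept) + 1))
            if not chunk:
                break
            kept.extend(chunk)
            if len(kept) > cap:
                overflow = True
                proc.kill()
                break
    finally:
        timer.cancel()
        proc.stdout.close()
        proc.wait()
    if overflow:
        return "output overflow", bytes(kept[:cap])
    if timed_out.is_set() and proc.returncode != 0:
        return "timeout", bytes(kept)
    return "ok", bytes(kept)
```

The design choice is that the parent never buffers the full child output: the read size shrinks as the buffer fills, and the child is killed as soon as one byte past the cap arrives.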

In `@phases/19-capstone-projects/72-code-exec-metric/docs/en.md`:
- Around line 78-80: The fenced block containing the formula for pass_at_k(n, c,
k) is unlabeled and triggers MD040; edit the fenced block that holds
"pass_at_k(n, c, k) = 1 - C(n - c, k) / C(n, k)" and add a language tag such as
```math or ```latex (e.g., change the opening fence from ``` to ```math) so the
block is properly labeled for markdown linters while leaving the formula text
unchanged.
- Around line 46-47: Update the docs to match the implementation: reconcile the
denylist entries and the exit/overflow behavior described with what's
implemented in main.py by either (A) aligning the documentation to list only the
actual denied imports present in the denylist variable in main.py (remove items
like pathlib, multiprocessing, threading.Thread.start if they are not present)
or (B) update main.py to include the documented entries if they are intended,
and make sure import-rewrite logic references the exact denylist symbol; also
update the outcomes section to reflect that output overflow currently returns
exit_code="error" with overflow details rather than a separate output_overflow
outcome (or change implementation to emit a distinct output_overflow outcome) so
the docs and the behavior of exit_code/overflow are consistent.

In `@phases/19-capstone-projects/72-code-exec-metric/quiz.json`:
- Around line 67-76: The quiz currently implies a separate outcome named
output_overflow, but the runtime normalizes large-output incidents into
exit_code="error" with a detail message indicating output overflow; update the
question, options, and explanation to state that output_overflow is represented
in the runtime as exit_code="error" with a detail string (e.g.,
"output_overflow") rather than a standalone exit state, and adjust the correct
option and explanation text to match this schema (referencing output_overflow,
exit_code, and the detail text).

In `@phases/19-capstone-projects/73-perplexity-calibration/code/main.py`:
- Around line 111-147: Both brier_decomposition and reliability_diagram are
missing validation for the bins parameter; add the same check used by
expected_calibration_error so bins must be an integer > 0 and raise a ValueError
otherwise. In both functions (before calling _bin_indices or using bins)
validate that isinstance(bins, int) and bins > 0, raising a clear ValueError if
not, so downstream calls to _bin_indices and array allocations cannot silently
produce invalid results.
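
A shared validator is one way to apply the check described above to every binned function. The sketch below uses a hypothetical helper `_validate_bins` and a simplified, stdlib-only `reliability_diagram`; the return shape and binning rule are assumptions and may differ from the lesson's code.

```python
def _validate_bins(bins) -> None:
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(bins, bool) or not isinstance(bins, int) or bins <= 0:
        raise ValueError(f"bins must be a positive integer, got {bins!r}")


def reliability_diagram(confidences, correct, bins=10):
    _validate_bins(bins)  # fail before any bin index is computed
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    # Per bin: [count, sum of confidences, sum of correctness labels].
    totals = [[0, 0.0, 0.0] for _ in range(bins)]
    for p, y in zip(confidences, correct):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"confidence out of range: {p!r}")
        idx = min(int(p * bins), bins - 1)  # p == 1.0 falls in the last bin
        totals[idx][0] += 1
        totals[idx][1] += p
        totals[idx][2] += y
    rows = []
    for i, (count, conf_sum, acc_sum) in enumerate(totals):
        rows.append({
            "bin": i,
            "count": count,
            # Empty bins report None rather than dividing by zero.
            "mean_confidence": conf_sum / count if count else None,
            "accuracy": acc_sum / count if count else None,
        })
    return rows
```

Calling `_validate_bins` first in `brier_decomposition` as well would give all three functions the same contract for `bins <= 0` and non-integer values.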

In `@phases/19-capstone-projects/73-perplexity-calibration/docs/en.md`:
- Around line 87-89: Update the docs text under the "Empty input" / "perplexity
zero-token case" paragraph to reflect actual behavior: the implementation
returns NaN for zero-token perplexity and currently does not emit a warning;
change the wording that currently claims "returns 0.0 (or NaN, but we choose 0.0
with a flag) and a warning" to state it returns NaN and no warning (or
explicitly document the flag/behavior if you choose to change code instead).
Ensure the doc string references the "perplexity zero-token case" so readers can
map it to the tests and implementation.
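
The zero-token behavior described above can be sketched in a few lines. The function name `corpus_perplexity` and its two-argument signature are assumptions; the lesson's function works from adapter-provided negative log-likelihoods and may be shaped differently.

```python
import math


def corpus_perplexity(total_nll: float, token_count: int) -> float:
    # Perplexity is exp of the mean per-token negative log-likelihood
    # (natural log assumed).
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    if token_count == 0:
        # No tokens means the mean is undefined: return NaN, no warning.
        return float("nan")
    return math.exp(total_nll / token_count)
```

Callers should test the result with `math.isnan`, since NaN never compares equal to itself.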

In `@phases/19-capstone-projects/74-leaderboard-aggregation/code/main.py`:
- Around line 197-207: pairwise_diffs currently skips the [0,1] score validation
that aggregate enforces; add the same input validation at the start of
pairwise_diffs (before building common_task_scores or computing diffs) by
invoking the same validation logic used by aggregate (i.e., the helper or checks
that assert each run.score is between 0 and 1), so invalid scores raise an error
instead of producing misleading significance results; update pairwise_diffs (and
references to runs/common_task_scores) to reuse that helper rather than
duplicating logic.
- Around line 75-79: In _validate_runs, besides checking score range for each
EvalRun in runs, detect and reject duplicate (model_id, task_id) pairs: iterate
through runs, maintain a seen set of tuples (r.model_id, r.task_id), and if a
tuple is already present raise a ValueError indicating the duplicate; keep the
existing score bounds check for function _validate_runs and type EvalRun intact.
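
Both items above can be served by one helper that `aggregate` and `pairwise_diffs` call before doing any work. The sketch below uses a stand-in `EvalRun` with only the three fields the comments mention; the real dataclass likely carries more, and the name `validate_runs` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    # Stand-in with only the fields referenced in the review comments.
    model_id: str
    task_id: str
    score: float


def validate_runs(runs) -> None:
    seen = set()
    for r in runs:
        # The chained comparison also rejects NaN scores.
        if not 0.0 <= r.score <= 1.0:
            raise ValueError(
                f"score out of [0, 1] for {r.model_id}/{r.task_id}: {r.score!r}"
            )
        key = (r.model_id, r.task_id)
        if key in seen:
            raise ValueError(f"duplicate (model_id, task_id) pair: {key}")
        seen.add(key)
```

Keeping the checks in one place means a duplicate or out-of-range run fails the same way regardless of which entry point receives it.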

In `@phases/19-capstone-projects/74-leaderboard-aggregation/docs/en.md`:
- Around line 98-99: Docs claim default bootstrap iterations B=1000 but the
public aggregator entry points use parameter b with default=500, causing
inconsistency; pick one source of truth and align them: either update the docs
text in docs/en.md to state the default B is 500, or change the function
signatures that expose the bootstrap parameter (the public aggregator entry
points that accept argument b and currently default to 500) to default to 1000;
ensure the alpha default mentioned in docs remains consistent with the code's
alpha/default_alpha variable as well.
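
To show where the `b` default enters, here is a minimal percentile bootstrap for a mean. The function name `bootstrap_mean_ci`, the `seed` parameter, and `alpha=0.05` are assumptions; only the parameter name `b` and its default of 500 come from the comment above.

```python
import random
import statistics


def bootstrap_mean_ci(scores, b=500, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if b <= 0:
        raise ValueError("b must be a positive integer")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(scores)
    # Resample with replacement b times and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(b)
    )
    lo_idx = int((alpha / 2) * b)
    hi_idx = min(b - 1, int((1 - alpha / 2) * b))
    return statistics.fmean(scores), means[lo_idx], means[hi_idx]
```

A larger `b` reduces the Monte Carlo noise in the interval endpoints at a linear cost in runtime, which is why the docs and the code should state the same default.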

In `@phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py`:
- Around line 145-146: The code builds calibration_buf = {a.model_id: [] for a
in adapters} which silently merges adapters with duplicate model_id values;
before using calibration_buf (and the similar logic around lines that initialize
report rows), validate adapters for unique model_id values and raise or
fail-fast on duplicates. Add a pre-check that collects model_id from adapters
(e.g., via a set or counting) and if any id appears more than once, raise a
ValueError or log an explicit error mentioning the duplicate id(s) and stop
processing; update any initialization that assumes unique keys (calibration_buf
and report row creation) to only run after this uniqueness check. Ensure the
error references the adapter list and model_id to help locate offending entries.
- Line 376: The print call currently uses an f-string with no interpolation
(print(f"ERROR: rule_based should not be at the bottom")); remove the
unnecessary f-prefix so it is a normal string literal (e.g., print("ERROR:
rule_based should not be at the bottom")) to resolve Ruff F541; locate the print
statement in main.py (the line that prints "ERROR: rule_based should not be at
the bottom") and update it accordingly.
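
For the first item above (duplicate `model_id` values), a fail-fast check could look like the sketch below. `check_unique_model_ids` and the `StubAdapter` class are hypothetical names used only for this illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StubAdapter:
    # Stand-in for a ModelAdapter; only model_id matters for this check.
    model_id: str


def check_unique_model_ids(adapters) -> None:
    counts = Counter(a.model_id for a in adapters)
    duplicates = sorted(mid for mid, n in counts.items() if n > 1)
    if duplicates:
        # Name the offending ids so the caller can find them in the list.
        raise ValueError(f"duplicate model_id in adapters: {duplicates}")
```

Running this before building any dict keyed by `model_id` prevents two adapters from silently sharing one buffer.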

In `@phases/19-capstone-projects/75-end-to-end-eval-runner/docs/en.md`:
- Around line 65-66: The docs describe APIs that don't match the code: update
all references of run_eval(adapter, tasks, parallel=False) to reflect the actual
signature run_eval(adapters, tasks, parallel=False) (note plural), remove any
mention of a non-existent correct_threshold parameter, and replace references to
the helper imports with the actual helper name _load_sibling; apply these fixes
consistently across the document (including the other occurrences noted around
lines 78-79 and 121) so the documentation matches the real symbols run_eval,
adapters, correct_threshold (remove), imports -> _load_sibling.

In `@phases/19-capstone-projects/75-end-to-end-eval-runner/quiz.json`:
- Around line 31-40: The quiz explanation references a non-existent "threshold
override" for the runner's calibration buffer logic; either remove that claim
from the explanation or add an explicit optional override parameter to the
runner API and use it when deciding the calibration flag. Concretely: if you
prefer API change, add an optional parameter (e.g., thresholdOverride) to the
runner method that computes the calibration buffer/flag (the function currently
responsible for "calibration buffer" / flag determination), have the decision
logic use thresholdOverride when provided and otherwise default to
exact_match=1.0 and graded metrics=0.5; if you prefer documentation change,
update the explanation text to omit the "runner accepts an override" sentence so
it matches the current runner behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 944fdc59-f42c-47e4-8fef-c77be693d956

📥 Commits

Reviewing files that changed from the base of the PR and between c1374e1 and 01de8b7.

📒 Files selected for processing (25)

catalog.json
phases/19-capstone-projects/70-task-spec-format/code/main.py
phases/19-capstone-projects/70-task-spec-format/code/tests/test_spec.py
phases/19-capstone-projects/70-task-spec-format/docs/en.md
phases/19-capstone-projects/70-task-spec-format/quiz.json
phases/19-capstone-projects/71-classical-metrics/code/main.py
phases/19-capstone-projects/71-classical-metrics/code/tests/test_metrics.py
phases/19-capstone-projects/71-classical-metrics/docs/en.md
phases/19-capstone-projects/71-classical-metrics/quiz.json
phases/19-capstone-projects/72-code-exec-metric/code/main.py
phases/19-capstone-projects/72-code-exec-metric/code/tests/test_exec.py
phases/19-capstone-projects/72-code-exec-metric/docs/en.md
phases/19-capstone-projects/72-code-exec-metric/quiz.json
phases/19-capstone-projects/73-perplexity-calibration/code/main.py
phases/19-capstone-projects/73-perplexity-calibration/code/tests/test_calibration.py
phases/19-capstone-projects/73-perplexity-calibration/docs/en.md
phases/19-capstone-projects/73-perplexity-calibration/quiz.json
phases/19-capstone-projects/74-leaderboard-aggregation/code/main.py
phases/19-capstone-projects/74-leaderboard-aggregation/code/tests/test_leaderboard.py
phases/19-capstone-projects/74-leaderboard-aggregation/docs/en.md
phases/19-capstone-projects/74-leaderboard-aggregation/quiz.json
phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py
phases/19-capstone-projects/75-end-to-end-eval-runner/code/tests/test_runner.py
phases/19-capstone-projects/75-end-to-end-eval-runner/docs/en.md
phases/19-capstone-projects/75-end-to-end-eval-runner/quiz.json

coderabbitai · 2026-05-26T18:56:42Z

+```
+pass_at_k(n, c, k) = 1 - C(n - c, k) / C(n, k)
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language tag to the fenced formula block.

Line 78 uses an unlabeled fenced code block, which trips markdown lint (MD040).

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 78-78: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@phases/19-capstone-projects/72-code-exec-metric/docs/en.md` around lines 78 - 80, The fenced block containing the formula for pass_at_k(n, c, k) is unlabeled and triggers MD040; edit the fenced block that holds "pass_at_k(n, c, k) = 1 - C(n - c, k) / C(n, k)" and add a language tag such as ```math or ```latex (e.g., change the opening fence from ``` to ```math) so the block is properly labeled for markdown linters while leaving the formula text unchanged.

coderabbitai · 2026-05-26T18:56:42Z

+        print(f"ERROR: expected rule_based at top, got {top}")
+        return 2
+    if bot == "rule_based":
+        print(f"ERROR: rule_based should not be at the bottom")


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove the empty f-string prefix.

Line 376 uses an f-string with no interpolation, which triggers Ruff F541.

Suggested fix

-        print(f"ERROR: rule_based should not be at the bottom")
+        print("ERROR: rule_based should not be at the bottom")

🧰 Tools

🪛 Ruff (0.15.14)

[error] 376-376: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py` at line 376, The print call currently uses an f-string with no interpolation (print(f"ERROR: rule_based should not be at the bottom")); remove the unnecessary f-prefix so it is a normal string literal (e.g., print("ERROR: rule_based should not be at the bottom")) to resolve Ruff F541; locate the print statement in main.py (the line that prints "ERROR: rule_based should not be at the bottom") and update it accordingly.

…ut docs

…arify bootstrap defaults

…real API

coderabbitai

♻️ Duplicate comments (1)

phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py (1)

379-379: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove the unnecessary f-string prefix.

This f-string has no placeholders. The fix from the previous review was not applied.

-        print(f"ERROR: rule_based should not be at the bottom")
+        print("ERROR: rule_based should not be at the bottom")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py` at line
379, The print call using an f-string has no placeholders; locate the print
statement containing the literal "ERROR: rule_based should not be at the bottom"
and remove the unnecessary f-string prefix so it is a normal string literal
(i.e., change the print invocation to use a plain string instead of an
f-string).

🧹 Nitpick comments (1)

phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py (1)

208-220: 💤 Low value

Remove unused parameters from _perplexity_blocks.

Parameters buf and results are never used in the function body. Only adapters and adapter_token_stats are actually needed.

-def _perplexity_blocks(adapters: Sequence[ModelAdapter], buf: dict[str, list[tuple[float, float, int]]],
-                       results: Sequence[TaskResult], adapter_token_stats: dict[str, list[tuple[float, int]]]) -> dict[str, dict]:
+def _perplexity_blocks(adapters: Sequence[ModelAdapter],
+                       adapter_token_stats: dict[str, list[tuple[float, int]]]) -> dict[str, dict]:

Also update the call site at line 235:

-    perplexity = _perplexity_blocks(adapters, calibration_buf, results, adapter_token_stats)
+    perplexity = _perplexity_blocks(adapters, adapter_token_stats)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py` around
lines 208 - 220, The function _perplexity_blocks currently accepts unused
parameters buf and results; remove these parameters from the signature so it
becomes _perplexity_blocks(adapters: Sequence[ModelAdapter],
adapter_token_stats: dict[str, list[tuple[float, int]]]) and update any call
sites that invoke _perplexity_blocks to stop passing buf and results (pass only
adapters and adapter_token_stats), keeping the internal logic (stats extraction,
PerplexityResult construction) unchanged; verify types and run tests to ensure
call sites (the spot that previously passed three args) are updated accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py`:
- Line 379: The print call using an f-string has no placeholders; locate the
print statement containing the literal "ERROR: rule_based should not be at the
bottom" and remove the unnecessary f-string prefix so it is a normal string
literal (i.e., change the print invocation to use a plain string instead of an
f-string).

---

Nitpick comments:
In `@phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py`:
- Around line 208-220: The function _perplexity_blocks currently accepts unused
parameters buf and results; remove these parameters from the signature so it
becomes _perplexity_blocks(adapters: Sequence[ModelAdapter],
adapter_token_stats: dict[str, list[tuple[float, int]]]) and update any call
sites that invoke _perplexity_blocks to stop passing buf and results (pass only
adapters and adapter_token_stats), keeping the internal logic (stats extraction,
PerplexityResult construction) unchanged; verify types and run tests to ensure
call sites (the spot that previously passed three args) are updated accordingly.
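The f-string finding in the prompt above can also be checked mechanically. The following is a small sketch, not the linter the reviewer used: it walks a module's AST with the standard `ast` module and reports the line numbers of f-strings that contain no placeholders. The function name `find_placeholderless_fstrings` is hypothetical.

```python
import ast


def find_placeholderless_fstrings(source: str) -> list[int]:
    """Return line numbers of f-strings that have no {placeholders}."""
    tree = ast.parse(source)
    # A format spec such as the ">10" in f"{x:>10}" is itself a JoinedStr
    # node, so collect those first to avoid reporting them.
    spec_ids = {
        id(node.format_spec)
        for node in ast.walk(tree)
        if isinstance(node, ast.FormattedValue) and node.format_spec is not None
    }
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.JoinedStr) and id(node) not in spec_ids:
            if not any(isinstance(v, ast.FormattedValue) for v in node.values):
                hits.append(node.lineno)
    return sorted(hits)


sample = (
    "x = 1\n"
    'print(f"ERROR: rule_based should not be at the bottom")\n'
    'print(f"{x:>10}")\n'
    'print("plain")\n'
)
flagged = find_placeholderless_fstrings(sample)
```

Here only line 2 of `sample` is flagged: the f-string with a real placeholder and the plain string literal pass.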

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6f9ddd21-9fb2-4509-b602-a6b77dd879b0

📥 Commits

Reviewing files that changed from the base of the PR and between 01de8b7 and 89b4a9b.

📒 Files selected for processing (10)

phases/19-capstone-projects/72-code-exec-metric/code/main.py
phases/19-capstone-projects/72-code-exec-metric/docs/en.md
phases/19-capstone-projects/72-code-exec-metric/quiz.json
phases/19-capstone-projects/73-perplexity-calibration/code/main.py
phases/19-capstone-projects/73-perplexity-calibration/docs/en.md
phases/19-capstone-projects/74-leaderboard-aggregation/code/main.py
phases/19-capstone-projects/74-leaderboard-aggregation/docs/en.md
phases/19-capstone-projects/75-end-to-end-eval-runner/code/main.py
phases/19-capstone-projects/75-end-to-end-eval-runner/docs/en.md
phases/19-capstone-projects/75-end-to-end-eval-runner/quiz.json

✅ Files skipped from review due to trivial changes (6)

phases/19-capstone-projects/75-end-to-end-eval-runner/quiz.json
phases/19-capstone-projects/74-leaderboard-aggregation/docs/en.md
phases/19-capstone-projects/72-code-exec-metric/docs/en.md
phases/19-capstone-projects/72-code-exec-metric/quiz.json
phases/19-capstone-projects/73-perplexity-calibration/docs/en.md
phases/19-capstone-projects/75-end-to-end-eval-runner/docs/en.md


rohitg00 and others added 7 commits May 26, 2026 19:24

feat(phase-19/70): add task-spec-format deep capstone

01ed223

feat(phase-19/71): add classical-metrics deep capstone

ad4bcef

feat(phase-19/72): add code-exec-metric deep capstone

1a9c66b

feat(phase-19/73): add perplexity-calibration deep capstone

25d2604

feat(phase-19/74): add leaderboard-aggregation deep capstone

5403bec

feat(phase-19/75): add end-to-end-eval-runner deep capstone

138c78b

chore(catalog): auto-regen

01de8b7

vercel Bot deployed to Preview May 26, 2026 18:41 View deployment

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

rohitg00 added 4 commits May 26, 2026 21:00

fix(phase-19/72): stream stdout cap, swap pass_at_k branches, align docs

4b823a1

fix(phase-19/73): validate bins in brier/reliability, align empty-input docs

83491eb

fix(phase-19/74): reject duplicate runs, validate pairwise inputs, clarify bootstrap defaults

8d764f0

fix(phase-19/75): enforce unique model ids, align docs and quiz with real API

89b4a9b

vercel Bot deployed to Preview May 26, 2026 20:02 View deployment

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into feat/phase-19-track-g

ab3293a

# Conflicts: # catalog.json

vercel Bot deployed to Preview May 27, 2026 09:13 View deployment

rohitg00 wants to merge 12 commits into main from feat/phase-19-track-g

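Suggested change for the placeholder-free f-string at line 379 of `75-end-to-end-eval-runner/code/main.py`:

-    print(f"ERROR: rule_based should not be at the bottom")
+    print("ERROR: rule_based should not be at the bottom")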
