feat(phase-19): track I safety red-team lessons 82-87 by rohitg00 · Pull Request #212 · rohitg00/ai-engineering-from-scratch

rohitg00 · 2026-05-26T18:45:23Z

Summary

Six capstone sub-lessons under phases/19-capstone-projects/ that compose into a runnable safety + red-team harness.

82 jailbreak-taxonomy - six-category attack taxonomy partitioned by trust boundary abused, 50 hand-authored fixtures with severity 1-5, trigram matcher, validator, stable taxonomy.json artifact
83 prompt-injection-detector - normalize (zero-width, homoglyph, base64, hex, leet, rot13) then substring then regex rules, per-category precision/recall/F1 against the 50-fixture corpus plus a 25-prompt benign baseline
84 refusal-evaluation - under-refusal, over-refusal, accuracy, ECE calibration, per-category breakdown; three mock LLM policies (strict, leaky, over-cautious) exercise opposite failure modes
85 content-classifier-integration - toxicity (negation-window check), PII (Luhn-validated card detection), instruction-leakage (trigram cosine vs known system prompt) behind a severity router with block/redact/warn/log
86 constitutional-rules-engine - YAML constitution with all_of/any_of/not_ predicates, six-rule starter constitution, declarative fixer (append/prepend/replace), structured diff between draft and revised, self-contained YAML subset parser
87 end-to-end-safety-gate - composes 82-86 into pre-gen + during-gen (streaming token-filter) + post-gen with deterministic aggregation table; runs all 50 attack fixtures plus 10 benign prompts end-to-end and emits a per-request trace

Every lesson ships docs (mermaid diagram, 900-1100 words), runnable main.py, tests.py (12-18 unittests each), quiz.json (6 questions), and a skill-*.md output. Total: 92 unit tests across the six lessons, all green; six demos all exit 0 and write JSON artifacts under each lesson's outputs/.

Implementation uses only numpy (lessons 82-84) plus an optional pyyaml fallback (lesson 86 ships its own YAML subset parser so the lesson runs on a stock Python install). No real LLM calls; mock LLMs throughout. No external red-team / safety repo names or paper citations in any file.

Test plan

python3 -m unittest tests passes in each of the six lesson code/ directories (15 + 14 + 15 + 18 + 18 + 12 = 92 tests)
python3 main.py exits 0 in each lesson and writes its artifact
Lesson 87 demo composes all five prior lessons via importlib.spec file-loading without name collisions
No site/, root README.md, or catalog.json touched
Mermaid diagrams render; all code fences language-tagged

Six-category taxonomy (role-play, instruction-override, context-smuggling, multi-turn-ramp, encoding-trick, prefix-injection) partitions attacks by trust boundary abused. Fixtures hand-authored, severity 1-5. Trigram cosine matcher assigns category to candidate prompts. Validator enforces minimum per-category count, severity range, unique ids, non-empty prompts. Includes 50 fixtures, taxonomy.json artifact for downstream lessons, 15 unittest cases, quiz with 6 questions, skill output.

Layered detector pipeline: normalize (zero-width, homoglyph, base64, hex, leet, rot13) then substring rules then regex rules. Each rule carries a category and a base score; aggregator returns the highest scoring category with its confidence. Runner reads the lesson 82 taxonomy artifact, evaluates against the 50-fixture corpus plus a 25-prompt benign baseline, and writes per-category precision/recall/F1 to detector_report.json. Includes 14 unittest cases, quiz with 6 questions, skill output, rules and benign corpus as data files for easy extension.

Two-sided refusal metrics: under-refusal (answered unsafe), over-refusal (refused safe), accuracy, ECE calibration, per-category under-refusal join against the lesson 82 taxonomy. Three mock LLM policies (strict, leaky, over-cautious) demonstrate the framework detects opposite failure modes. Labeled corpus: 25 unsafe prompts tagged with taxonomy ids, 30 safe prompts non-overlapping with the lesson 83 benign set. Includes 15 unittest cases, quiz with 6 questions, skill output, ECE binning implementation, refusal phrase classifier.

Three classifiers behind one severity router. Toxicity (harassment terms with negation-window check), PII (email, phone, SSN, Luhn-validated card, IPv4), instruction-leakage (trigram cosine vs a known system prompt). Router takes max severity across classifiers and applies block, redact, warn, or log. Each classifier carries its own redactor; redact-severity outputs flow through all matching redactors before shipping. Includes 18 unittest cases, quiz with 6 questions, skill output, demo over six fixtures exercising all four severity buckets.

YAML constitution defines rules with name, severity, applies_when, must, explanation, fix. Predicates compose via all_of/any_of/not_. Engine emits per-rule status (pass, violation, not_applicable) with matched span. Fixer applies declarative append/prepend/replace operations per rule. diff function produces structured change list between draft and revised. Self-contained yaml_subset parser so the lesson runs without PyYAML, with graceful fallback to PyYAML when present. Includes 18 unittest cases, quiz with 6 questions, skill output, six-rule constitution covering refusal redirects, code closing, PII in examples, citations, internal library leaks, and length bounds.

Three-checkpoint composition: pre-gen detector on the prompt, during-gen streaming filter that buffers chunks and terminates early on harmful continuations, post-gen classifier router and rules engine on the completed output. Deterministic aggregation table picks the final action (block, redact, warn, allow) from the maximum severity across signals. Each request emits a structured RequestTrace with checkpoint verdicts and latency. Demo runs all 50 lesson 82 fixtures plus 10 benign prompts end-to-end, prints per-action and per-category outcomes, and writes gate_trace.json. Includes 12 unittest cases, quiz with 6 questions, skill output, and direct file-spec imports so sibling lessons compose without packaging.

…fact Title em-dashes stay (matches existing capstone convention). One body em-dash in lesson 82 prose replaced with a colon. Lesson 87 gate trace artifact regenerated.

coderabbitai · 2026-05-26T18:45:36Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5dc8368d-8685-4917-845c-0411d82977a8

📥 Commits

Reviewing files that changed from the base of the PR and between 91e8782 and c2d5b44.

📒 Files selected for processing (18)

phases/19-capstone-projects/83-prompt-injection-detector/code/main.py
phases/19-capstone-projects/83-prompt-injection-detector/code/tests.py
phases/19-capstone-projects/84-refusal-evaluation/code/prompts.py
phases/19-capstone-projects/84-refusal-evaluation/docs/en.md
phases/19-capstone-projects/85-content-classifier-integration/code/classifiers.py
phases/19-capstone-projects/85-content-classifier-integration/docs/en.md
phases/19-capstone-projects/85-content-classifier-integration/outputs/classifier_report.json
phases/19-capstone-projects/85-content-classifier-integration/outputs/skill-content-classifier-integration.md
phases/19-capstone-projects/86-constitutional-rules-engine/code/rules.yml
phases/19-capstone-projects/86-constitutional-rules-engine/code/yaml_subset.py
phases/19-capstone-projects/86-constitutional-rules-engine/outputs/rules_report.json
phases/19-capstone-projects/86-constitutional-rules-engine/quiz.json
phases/19-capstone-projects/87-end-to-end-safety-gate/code/main.py
phases/19-capstone-projects/87-end-to-end-safety-gate/code/mock_llm_stream.py
phases/19-capstone-projects/87-end-to-end-safety-gate/code/safety_gate.py
phases/19-capstone-projects/87-end-to-end-safety-gate/docs/en.md
phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/gate_trace.json
phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/skill-end-to-end-safety-gate.md

✅ Files skipped from review due to trivial changes (5)

phases/19-capstone-projects/87-end-to-end-safety-gate/docs/en.md
phases/19-capstone-projects/85-content-classifier-integration/outputs/skill-content-classifier-integration.md
phases/19-capstone-projects/86-constitutional-rules-engine/quiz.json
phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/gate_trace.json
phases/19-capstone-projects/84-refusal-evaluation/docs/en.md

📝 Walkthrough

Walkthrough

This PR introduces six interconnected capstone lessons (82–87) forming a comprehensive LLM safety framework. Lesson 82 establishes a jailbreak taxonomy with 50 prompts across six trust-boundary categories. Lesson 83 builds a detector using layered normalization and rule matching. Lesson 84 evaluates refusal behavior via mock policies and metrics. Lesson 85 implements three independent classifiers (toxicity, PII, instruction leakage) feeding a severity-based router. Lesson 86 provides a declarative rules engine for output constraints. Lesson 87 orchestrates all prior stages (pre/during/post-generation) into a unified safety gate. The catalog is updated to reflect all six new lessons.

Changes

Capstone Lessons 82–87: Complete Safety Framework

Layer / File(s)	Summary
Catalog updates and Lesson 82: Jailbreak Taxonomy `catalog.json`, `phases/19-capstone-projects/82-jailbreak-taxonomy/code/`, `phases/19-capstone-projects/82-jailbreak-taxonomy/docs/en.md`, `phases/19-capstone-projects/82-jailbreak-taxonomy/outputs/`, `phases/19-capstone-projects/82-jailbreak-taxonomy/quiz.json`	Catalog incremented from 487 to 528 code files and lesson count from 17 to 23. Lesson 82 defines a six-category jailbreak taxonomy (role-play, instruction-override, context-smuggling, multi-turn-ramp, encoding-trick, prefix-injection) with 50 fixtures, trigram-based matching, validation invariants, comprehensive tests, and JSON artifact serialization.
Lesson 83: Prompt Injection Detector `phases/19-capstone-projects/83-prompt-injection-detector/code/`, `phases/19-capstone-projects/83-prompt-injection-detector/docs/en.md`, `phases/19-capstone-projects/83-prompt-injection-detector/outputs/`, `phases/19-capstone-projects/83-prompt-injection-detector/quiz.json`	Normalization pipeline (zero-width, homoglyph, base64/hex, leet, ROT13) feeding substring and regex rules. Loads lesson 82 taxonomy, benign prompts, and computes per-category confusion matrices with precision/recall/F1 metrics and overall accuracy. Outputs detector_report.json.
Lesson 84: Refusal Evaluation `phases/19-capstone-projects/84-refusal-evaluation/code/`, `phases/19-capstone-projects/84-refusal-evaluation/docs/en.md`, `phases/19-capstone-projects/84-refusal-evaluation/outputs/`, `phases/19-capstone-projects/84-refusal-evaluation/quiz.json`	Labeled safe/unsafe prompt corpus (25 each) with three mock LLM policies (strict, leaky, over-cautious) and regex refusal classification. Computes accuracy, under-refusal rate, over-refusal rate, ECE calibration, and per-category under-refusal breakdown. Outputs refusal_eval_report.json.
Lesson 85: Content Classifier Integration `phases/19-capstone-projects/85-content-classifier-integration/code/`, `phases/19-capstone-projects/85-content-classifier-integration/docs/en.md`, `phases/19-capstone-projects/85-content-classifier-integration/outputs/`, `phases/19-capstone-projects/85-content-classifier-integration/quiz.json`	Three classifiers (toxicity via negation-aware matching, PII via email/phone/SSN/card/IP with Luhn validation, instruction leakage via trigram similarity). Router aggregates by max severity into four actions (block, redact, warn, log) with per-action output transformations. Outputs classifier_report.json.
Lesson 86: Constitutional Rules Engine `phases/19-capstone-projects/86-constitutional-rules-engine/code/`, `phases/19-capstone-projects/86-constitutional-rules-engine/docs/en.md`, `phases/19-capstone-projects/86-constitutional-rules-engine/outputs/`, `phases/19-capstone-projects/86-constitutional-rules-engine/quiz.json`	YAML-based rules with recursive predicates (all_of, any_of, not_), constraints (regex, word count), and violations. Fixer applies conditional repairs (append/prepend-if-missing, regex replace). Diff computes line-level changes via SequenceMatcher. Outputs rules_report.json.
Lesson 87: End-to-End Safety Gate `phases/19-capstone-projects/87-end-to-end-safety-gate/code/`, `phases/19-capstone-projects/87-end-to-end-safety-gate/docs/en.md`, `phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/`, `phases/19-capstone-projects/87-end-to-end-safety-gate/quiz.json`	Three-stage orchestration: pre-generation detector verdict, during-generation streaming with early termination on unsafe patterns, post-generation classifier and rules verdicts. Deterministic severity aggregation selects final action (block/redact/warn/allow). RequestTrace records all verdicts, latency, and audit per request. Outputs gate_trace.json.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/phase-19-track-i

coderabbitai

Actionable comments posted: 19

🧹 Nitpick comments (1)

phases/19-capstone-projects/87-end-to-end-safety-gate/code/tests.py (1)

74-77: ⚡ Quick win

test_redact_when_classifier_redacts currently asserts nothing meaningful.

Line 76 allows every possible action, so this test can’t fail on regressions. Assert the intended contract (at least not allow, or specifically redact for this input).

Suggested fix

     def test_redact_when_classifier_redacts(self) -> None:
         trace = self.gate.handle("Please email me at lee@example.com about my account.")
-        self.assertIn(trace.final_action, {"redact", "block", "warn", "allow"})
+        self.assertIn(trace.final_action, {"redact", "block"})

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@phases/19-capstone-projects/87-end-to-end-safety-gate/code/tests.py` around
lines 74 - 77, The test test_redact_when_classifier_redacts is currently vacuous
because it allows every outcome; instead assert the intended contract by
checking trace.final_action against the expected behavior from self.gate.handle
for an email-containing input: replace the broad
self.assertIn(trace.final_action, {"redact","block","warn","allow"}) with a
stricter assertion (e.g., self.assertNotEqual(trace.final_action, "allow") or
self.assertEqual(trace.final_action, "redact")) so the test fails on
regressions; update the assertion in test_redact_when_classifier_redacts to use
trace.final_action and the chosen expected value.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@phases/19-capstone-projects/83-prompt-injection-detector/code/main.py`:
- Around line 29-30: Replace the raw invisible/confusable Unicode in the
normalization constants with explicit \uXXXX escapes so the behavior is
unchanged but the characters are auditable: update the ZERO_WIDTH regex
(ZERO_WIDTH) to use escapes for U+200B, U+200C, U+200D, U+2060 and the range
U+202A–U+202E (e.g. \u200B\u200C\u200D\u2060\u202A-\u202E) and change the
HOMOGLYPHS str.maketrans mapping (HOMOGLYPHS) to use \u04XX/\u04XX-style escapes
for the Cyrillic capitals (А U+0410, В U+0412, С U+0421, Е U+0415, Н U+041D, К
U+041A, М U+041C, О U+041E, Р U+0420, Т U+0422, Х U+0425) mapping to their ASCII
equivalents; keep the same keys/values and semantics but substitute escaped code
points instead of embedding raw characters.
- Around line 83-84: In _try_decode_base64 and _try_decode_hex replace the broad
"except Exception: pass" with targeted exception handling: catch and handle only
the decoding-related exceptions (e.g., for base64 use base64.b64decode and catch
binascii.Error, ValueError and TypeError; for hex use binascii.unhexlify and
catch binascii.Error and ValueError/TypeError), ensure you import binascii if
needed, and return None on those specific failures so other unexpected
exceptions still bubble up.

In `@phases/19-capstone-projects/83-prompt-injection-detector/code/tests.py`:
- Around line 13-15: The test contains raw zero-width characters in the string
literal and assertion which are hard to review; update the test in tests.py to
replace the embedded zero-width characters with explicit Unicode escape
sequences (e.g., use \u200b, \u200c, or \u200d as appropriate) in the variable
assigned to text and in the assertNotIn call that checks normalize(text);
reference the normalize function and the test variable (text) so reviewers can
locate and verify the change.

In `@phases/19-capstone-projects/84-refusal-evaluation/code/prompts.py`:
- Around line 65-96: SAFE_PROMPTS currently has 30 entries (s-01..s-30) but the
docs claim 25; update the documentation to match the implementation by changing
any references that say "25 safe prompts" to "30 safe prompts" (including the
module docstring and en.md). Locate the SAFE_PROMPTS definition to confirm the
count and then update the docstring text and the en.md file so they accurately
state 30 safe prompts and, if present, adjust any related summary or index
numbers to reflect the full list.
- Around line 1-7: The module docstring incorrectly states "25 safe prompts"
while the SAFE_PROMPTS constant actually contains 30 entries (IDs s-01..s-30);
update the top-level docstring in prompts.py to say "30 safe prompts" and
likewise update the corresponding sentence in en.md (around the existing
mention) to reflect 30 safe prompts so both documentation strings match the
SAFE_PROMPTS list.

In `@phases/19-capstone-projects/84-refusal-evaluation/docs/en.md`:
- Around line 42-46: Update the documentation count to match the actual
SAFE_PROMPTS length: change the phrase that reads "plus 25 safe prompts" to
"plus 30 safe prompts" (the code defines SAFE_PROMPTS with entries s-01 through
s-30 in code/prompts.py), and ensure the surrounding sentence in docs/en.md
referencing the labeled corpus reflects this corrected count.

In
`@phases/19-capstone-projects/85-content-classifier-integration/code/classifiers.py`:
- Line 90: The _CARD_RE pattern allows a trailing separator because the repeated
group includes an optional separator; change it so the final character must be a
digit: replace _CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b") with a pattern
that enforces the last character is a digit, e.g. _CARD_RE =
re.compile(r"\b(?:\d[ -]?){12,18}\d\b"), keeping the raw string and re.compile
usage so matches always end on a digit.
- Around line 112-123: The code is recording raw PII matches into the findings
list (uses m.group(0) for _EMAIL_RE, _PHONE_RE, _SSN_RE, _CARD_RE, _IPV4_RE),
which risks leaking sensitive data; update the loops that append to findings to
store only the type labels (e.g., "email", "phone", "ssn", "card", "ip") or a
strictly masked form instead of the full match, while preserving the existing
validation (e.g., keep using _luhn(digits) for card validation) and the same
loop locations (the blocks that iterate _EMAIL_RE, _PHONE_RE, _SSN_RE, _CARD_RE,
_IPV4_RE and append to findings).

In `@phases/19-capstone-projects/85-content-classifier-integration/docs/en.md`:
- Line 43: Update the documentation to match the implemented API: replace
references to Action.redacted_output with Action.output, and update descriptions
of the router function from decide(verdicts) to decide(text, verdicts) so the
docs reflect the actual function signature and returned Action shape (verb,
output, metadata); ensure any examples, explanations, and the mention in lesson
51 use these exact names to avoid integration confusion.

In
`@phases/19-capstone-projects/85-content-classifier-integration/outputs/skill-content-classifier-integration.md`:
- Around line 16-22: The fenced code block showing the data shape for
"ClassifierVerdict" lacks a language tag and triggers markdownlint MD040; update
the opening fence from ``` to a tagged fence such as ```text (or another
appropriate tag like ```yaml) so the block becomes ```text and retains the same
contents for "ClassifierVerdict", preserving indentation and lines for name,
severity, score, and findings.

In `@phases/19-capstone-projects/86-constitutional-rules-engine/code/rules.yml`:
- Around line 33-39: The rule currently detects both email and phone patterns
via the two not_contains_regex entries but the fixer only replaces emails;
update the fix block used by the no-pii-in-examples rule so it also rewrites
phone numbers. Specifically, add a second replace_regex (or expand the existing
pattern) to include the phone regex '\b(\+?\d{1,3}[ .-]?)?(\(?\d{3}\)?[
.-]?)\d{3}[ .-]?\d{4}\b' so that the fixer replaces detected phone numbers
(similar to how the existing replace_regex with pattern
'\b[\w.+-]+@[\w-]+\.[\w.-]+\b' replaces emails with '[example-user]').

In
`@phases/19-capstone-projects/86-constitutional-rules-engine/code/yaml_subset.py`:
- Around line 49-73: The coercion function _coerce currently treats the literal
"{}" as a plain string, breaking inline empty mappings like the `applies_when:
{}` used in rules.yml; update _coerce to detect empty inline mappings and return
an actual empty dict (e.g., if s startswith "{" and endswith "}" and the inner
content is whitespace/empty, return {}), leaving all other coercions unchanged.

In `@phases/19-capstone-projects/86-constitutional-rules-engine/quiz.json`:
- Around line 19-27: The quiz's correct answer index is wrong: update the
"correct" value in the quiz entry for the question "What three fields must every
rule have at minimum?" from 2 to 4 so the correct option becomes "predicate,
severity, owner" (which includes the required severity field) by editing the
"correct" key in quiz.json.

In `@phases/19-capstone-projects/87-end-to-end-safety-gate/code/main.py`:
- Around line 77-84: The benign loop over BENIGN_PROMPTS currently never updates
the global terminations counter; when gate.handle(prompt) returns a trace that
indicates an early termination (use trace.final_action or trace.terminated
flag), increment the same global terminations metric used elsewhere (named
terminations) so benign requests are included; apply the same change to the
other benign-processing block referenced by the reviewer (the logic around
traces.append and per_category_outcome updates) to mirror how terminations is
incremented for non-benign requests.

In
`@phases/19-capstone-projects/87-end-to-end-safety-gate/code/mock_llm_stream.py`:
- Around line 60-70: The stream function uses chunk_tokens as the range step
without validation, so pass a check at the top of stream(prompt: str,
chunk_tokens: int = 4) to ensure chunk_tokens is an int > 0 (e.g., raise
ValueError with a clear message if chunk_tokens <= 0 or not an int) before
calling range(..., chunk_tokens) and then proceed to chunk the tokens; reference
the stream function and the chunk_tokens parameter when adding the validation.

In `@phases/19-capstone-projects/87-end-to-end-safety-gate/code/safety_gate.py`:
- Around line 170-172: The code currently indexes SEVERITY_RANK directly with
post.classifier_severity and post.rules_max_severity causing KeyError on
unexpected tokens; change both places to safely lookup with a fallback (e.g.,
use SEVERITY_RANK.get(post.classifier_severity, <safe-default>) and
SEVERITY_RANK.get(post.rules_max_severity, <safe-default>)) so
signals.append(...) always receives a numeric severity; update the occurrences
that build ("post.classifier", ...) and ("post.rules", ...) to use .get and pick
a sensible default (like 0 or the lowest severity) to degrade gracefully.
- Around line 187-194: After calling classifier_router.run(raw_output) and
getting classifier_action.output into redacted, ensure you don't return an
empty/falsey string: if redacted is empty after classifier_router.run and after
optional rules_fixer.apply, replace it with a safe fallback (for example call a
helper like self.safe_fallback(raw_output) or return a generic safe message)
before returning; update the logic around classifier_action.output,
post.rules_violations, rules_engine.evaluate(...).violations(), and
rules_fixer.apply(...) to perform this empty-check and fallback substitution so
the redact branch never returns a blank body.

In `@phases/19-capstone-projects/87-end-to-end-safety-gate/docs/en.md`:
- Around line 42-43: The aggregation table row that currently reads "detector
confidence 0.5-0.85, no other signal | allow with note" conflicts with the
implemented action which emits final_action="warn"; update the table text for
the "detector confidence 0.5-0.85, no other signal" case to read "warn" (or
otherwise match the exact implemented token final_action="warn") so wording is
consistent with the implementation.

In
`@phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/skill-end-to-end-safety-gate.md`:
- Around line 33-43: The fenced code block showing the RequestTrace schema lacks
a language hint, which triggers MD040 linting; update the markdown fenced block
that contains "RequestTrace" (the block starting with ``` and the schema lines
including request_id, prompt, pre_gen, during_gen, post_gen, final_action,
final_output, latency_ms) to include a language identifier such as "text" (i.e.,
```text) so the block is properly annotated for the linter.

---

Nitpick comments:
In `@phases/19-capstone-projects/87-end-to-end-safety-gate/code/tests.py`:
- Around line 74-77: The test test_redact_when_classifier_redacts is currently
vacuous because it allows every outcome; instead assert the intended contract by
checking trace.final_action against the expected behavior from self.gate.handle
for an email-containing input: replace the broad
self.assertIn(trace.final_action, {"redact","block","warn","allow"}) with a
stricter assertion (e.g., self.assertNotEqual(trace.final_action, "allow") or
self.assertEqual(trace.final_action, "redact")) so the test fails on
regressions; update the assertion in test_redact_when_classifier_redacts to use
trace.final_action and the chosen expected value.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a7bfd78b-0ecc-4edd-8c20-738c790182c2

📥 Commits

Reviewing files that changed from the base of the PR and between c1374e1 and 91e8782.

📒 Files selected for processing (47)

catalog.json
phases/19-capstone-projects/82-jailbreak-taxonomy/code/fixtures.py
phases/19-capstone-projects/82-jailbreak-taxonomy/code/main.py
phases/19-capstone-projects/82-jailbreak-taxonomy/code/tests.py
phases/19-capstone-projects/82-jailbreak-taxonomy/docs/en.md
phases/19-capstone-projects/82-jailbreak-taxonomy/outputs/skill-jailbreak-taxonomy.md
phases/19-capstone-projects/82-jailbreak-taxonomy/outputs/taxonomy.json
phases/19-capstone-projects/82-jailbreak-taxonomy/quiz.json
phases/19-capstone-projects/83-prompt-injection-detector/code/benign.py
phases/19-capstone-projects/83-prompt-injection-detector/code/main.py
phases/19-capstone-projects/83-prompt-injection-detector/code/rules.py
phases/19-capstone-projects/83-prompt-injection-detector/code/tests.py
phases/19-capstone-projects/83-prompt-injection-detector/docs/en.md
phases/19-capstone-projects/83-prompt-injection-detector/outputs/detector_report.json
phases/19-capstone-projects/83-prompt-injection-detector/outputs/skill-prompt-injection-detector.md
phases/19-capstone-projects/83-prompt-injection-detector/quiz.json
phases/19-capstone-projects/84-refusal-evaluation/code/main.py
phases/19-capstone-projects/84-refusal-evaluation/code/mock_llm.py
phases/19-capstone-projects/84-refusal-evaluation/code/prompts.py
phases/19-capstone-projects/84-refusal-evaluation/code/tests.py
phases/19-capstone-projects/84-refusal-evaluation/docs/en.md
phases/19-capstone-projects/84-refusal-evaluation/outputs/refusal_eval_report.json
phases/19-capstone-projects/84-refusal-evaluation/outputs/skill-refusal-evaluation.md
phases/19-capstone-projects/84-refusal-evaluation/quiz.json
phases/19-capstone-projects/85-content-classifier-integration/code/classifiers.py
phases/19-capstone-projects/85-content-classifier-integration/code/main.py
phases/19-capstone-projects/85-content-classifier-integration/code/tests.py
phases/19-capstone-projects/85-content-classifier-integration/docs/en.md
phases/19-capstone-projects/85-content-classifier-integration/outputs/classifier_report.json
phases/19-capstone-projects/85-content-classifier-integration/outputs/skill-content-classifier-integration.md
phases/19-capstone-projects/85-content-classifier-integration/quiz.json
phases/19-capstone-projects/86-constitutional-rules-engine/code/main.py
phases/19-capstone-projects/86-constitutional-rules-engine/code/rules.yml
phases/19-capstone-projects/86-constitutional-rules-engine/code/tests.py
phases/19-capstone-projects/86-constitutional-rules-engine/code/yaml_subset.py
phases/19-capstone-projects/86-constitutional-rules-engine/docs/en.md
phases/19-capstone-projects/86-constitutional-rules-engine/outputs/rules_report.json
phases/19-capstone-projects/86-constitutional-rules-engine/outputs/skill-constitutional-rules-engine.md
phases/19-capstone-projects/86-constitutional-rules-engine/quiz.json
phases/19-capstone-projects/87-end-to-end-safety-gate/code/main.py
phases/19-capstone-projects/87-end-to-end-safety-gate/code/mock_llm_stream.py
phases/19-capstone-projects/87-end-to-end-safety-gate/code/safety_gate.py
phases/19-capstone-projects/87-end-to-end-safety-gate/code/tests.py
phases/19-capstone-projects/87-end-to-end-safety-gate/docs/en.md
phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/gate_trace.json
phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/skill-end-to-end-safety-gate.md
phases/19-capstone-projects/87-end-to-end-safety-gate/quiz.json

coderabbitai · 2026-05-26T18:58:09Z

+        except Exception:
+            pass


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify broad exception swallowing locations in this file. python - <<'PY' from pathlib import Path path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py") for i, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1): if "except Exception" in line: print(f"{path}:{i}:{line.strip()}") PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 262

🏁 Script executed:

#!/bin/bash set -euo pipefail python - <<'PY' from pathlib import Path path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py") lines = path.read_text(encoding="utf-8").splitlines() for start in (75, 90): end = start + 20 print(f"\n--- {path}:{start}-{end} ---") for i in range(start-1, min(end, len(lines))): print(f"{i+1:4d}: {lines[i]}") PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 1957

🏁 Script executed:

#!/bin/bash set -euo pipefail python - <<'PY' from pathlib import Path path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py") lines = path.read_text(encoding="utf-8").splitlines() for start in (75, 90): end = start + 20 print(f"\n--- {path}:{start}-{end} ---") for i in range(start-1, min(end, len(lines))): print(f"{i+1:4d}: {lines[i]}") PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 1957

🏁 Script executed:

#!/bin/bash set -euo pipefail python - <<'PY' from pathlib import Path path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py") lines = path.read_text(encoding="utf-8").splitlines() for i in range(1, 60): if i-1 < len(lines): print(f"{i:4d}: {lines[i-1]}") PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 2022

🏁 Script executed:

#!/bin/bash set -euo pipefail python - <<'PY' from pathlib import Path path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py") lines = path.read_text(encoding="utf-8").splitlines() for i in range(1, 80): if i-1 < len(lines): print(f"{i:4d}: {lines[i-1]}") PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 2731

🏁 Script executed:

#!/bin/bash set -euo pipefail python - <<'PY' from pathlib import Path path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py") lines = path.read_text(encoding="utf-8").splitlines() for i in range(1, 80): if i-1 < len(lines): print(f"{i:4d}: {lines[i-1]}") PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 2731

🏁 Script executed:

#!/bin/bash set -euo pipefail python - <<'PY' from pathlib import Path path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py") lines = path.read_text(encoding="utf-8").splitlines() for i in range(1, 60): if i-1 < len(lines): print(f"{i:4d}: {lines[i-1]}") PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 2022

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "binascii" phases/19-capstone-projects/83-prompt-injection-detector/code/main.py || true

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 62

Narrow exception handling in base64/hex decoding instead of swallowing all exceptions.

_try_decode_base64 and _try_decode_hex both use except Exception: pass (lines 83-84 and 98-99), which can hide unexpected failures and silently alter normalization behavior. Narrow the exceptions caught.

🔧 Proposed fix

+import binascii @@ - except Exception: - pass + except (binascii.Error, ValueError): + continue @@ - except Exception: - pass + except ValueError: + continue

🧰 Tools

🪛 Ruff (0.15.14)

[error] 83-84: try-except-pass detected, consider logging the exception

(S110)

[warning] 83-83: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@phases/19-capstone-projects/83-prompt-injection-detector/code/main.py` around lines 83 - 84, In _try_decode_base64 and _try_decode_hex replace the broad "except Exception: pass" with targeted exception handling: catch and handle only the decoding-related exceptions (e.g., for base64 use base64.b64decode and catch binascii.Error, ValueError and TypeError; for hex use binascii.unhexlify and catch binascii.Error and ValueError/TypeError), ensure you import binascii if needed, and return None on those specific failures so other unexpected exceptions still bubble up.

…ag skill md

…i fixer, quiz key includes severity

… on empty redact, count benign terminations, doc action token

# Conflicts: # catalog.json

rohitg00 and others added 8 commits May 26, 2026 19:24

style(phase-19): drop em-dash from body text, refresh gate trace arti…

0ad2e83

…fact Title em-dashes stay (matches existing capstone convention). One body em-dash in lesson 82 prose replaced with a colon. Lesson 87 gate trace artifact regenerated.

chore(catalog): auto-regen

91e8782

vercel Bot deployed to Preview May 26, 2026 18:45 View deployment

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

rohitg00 added 5 commits May 26, 2026 21:07

fix(phase-19/83): use uXXXX escapes for invisible unicode in normalizer

136f781

docs(phase-19/84): correct safe prompt count from 25 to 30

a963424

fix(phase-19/85): tighten card regex, sync docs API names, language t…

705e3a4

…ag skill md

fix(phase-19/86): handle inline empty map in yaml fallback, extend pi…

7cf1a2c

…i fixer, quiz key includes severity

fix(phase-19/87): guard chunk_tokens, harden severity lookup, refusal…

c2d5b44

… on empty redact, count benign terminations, doc action token

vercel Bot deployed to Preview May 26, 2026 20:07 View deployment

Merge remote-tracking branch 'origin/main' into feat/phase-19-track-i

8b62685

# Conflicts: # catalog.json

vercel Bot deployed to Preview May 27, 2026 09:13 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(phase-19): track I safety red-team lessons 82-87#212

feat(phase-19): track I safety red-team lessons 82-87#212
rohitg00 wants to merge 14 commits into
mainfrom
feat/phase-19-track-i

rohitg00 commented May 26, 2026

Uh oh!

coderabbitai Bot commented May 26, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot May 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

rohitg00 commented May 26, 2026

Summary

Test plan

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented May 26, 2026 •

edited

Loading