Skip to content

feat(phase-19): track I safety red-team lessons 82-87#212

Open
rohitg00 wants to merge 14 commits into
mainfrom
feat/phase-19-track-i
Open

feat(phase-19): track I safety red-team lessons 82-87#212
rohitg00 wants to merge 14 commits into
mainfrom
feat/phase-19-track-i

Conversation

@rohitg00
Copy link
Copy Markdown
Owner

Summary

Six capstone sub-lessons under phases/19-capstone-projects/ that compose into a runnable safety + red-team harness.

  • 82 jailbreak-taxonomy - six-category attack taxonomy partitioned by trust boundary abused, 50 hand-authored fixtures with severity 1-5, trigram matcher, validator, stable taxonomy.json artifact
  • 83 prompt-injection-detector - normalize (zero-width, homoglyph, base64, hex, leet, rot13) then substring then regex rules, per-category precision/recall/F1 against the 50-fixture corpus plus a 25-prompt benign baseline
  • 84 refusal-evaluation - under-refusal, over-refusal, accuracy, ECE calibration, per-category breakdown; three mock LLM policies (strict, leaky, over-cautious) exercise opposite failure modes
  • 85 content-classifier-integration - toxicity (negation-window check), PII (Luhn-validated card detection), instruction-leakage (trigram cosine vs known system prompt) behind a severity router with block/redact/warn/log
  • 86 constitutional-rules-engine - YAML constitution with all_of/any_of/not_ predicates, six-rule starter constitution, declarative fixer (append/prepend/replace), structured diff between draft and revised, self-contained YAML subset parser
  • 87 end-to-end-safety-gate - composes 82-86 into pre-gen + during-gen (streaming token-filter) + post-gen with deterministic aggregation table; runs all 50 attack fixtures plus 10 benign prompts end-to-end and emits a per-request trace

Every lesson ships docs (mermaid diagram, 900-1100 words), runnable main.py, tests.py (12-18 unittests each), quiz.json (6 questions), and a skill-*.md output. Total: 92 unit tests across the six lessons, all green; six demos all exit 0 and write JSON artifacts under each lesson's outputs/.

Implementation uses only numpy (lessons 82-84) plus an optional pyyaml fallback (lesson 86 ships its own YAML subset parser so the lesson runs on a stock Python install). No real LLM calls; mock LLMs throughout. No external red-team / safety repo names or paper citations in any file.

Test plan

  • python3 -m unittest tests passes in each of the six lesson code/ directories (15 + 14 + 15 + 18 + 18 + 12 = 92 tests)
  • python3 main.py exits 0 in each lesson and writes its artifact
  • Lesson 87 demo composes all five prior lessons via importlib.spec file-loading without name collisions
  • No site/, root README.md, or catalog.json touched
  • Mermaid diagrams render; all code fences language-tagged

rohitg00 and others added 8 commits May 26, 2026 19:24
Six-category taxonomy (role-play, instruction-override, context-smuggling,
multi-turn-ramp, encoding-trick, prefix-injection) partitions attacks by
trust boundary abused. Fixtures hand-authored, severity 1-5. Trigram cosine
matcher assigns category to candidate prompts. Validator enforces minimum
per-category count, severity range, unique ids, non-empty prompts.

Includes 50 fixtures, taxonomy.json artifact for downstream lessons,
15 unittest cases, quiz with 6 questions, skill output.
Layered detector pipeline: normalize (zero-width, homoglyph, base64, hex,
leet, rot13) then substring rules then regex rules. Each rule carries a
category and a base score; aggregator returns the highest scoring category
with its confidence. Runner reads the lesson 82 taxonomy artifact, evaluates
against the 50-fixture corpus plus a 25-prompt benign baseline, and writes
per-category precision/recall/F1 to detector_report.json.

Includes 14 unittest cases, quiz with 6 questions, skill output, rules and
benign corpus as data files for easy extension.
Two-sided refusal metrics: under-refusal (answered unsafe), over-refusal
(refused safe), accuracy, ECE calibration, per-category under-refusal join
against the lesson 82 taxonomy. Three mock LLM policies (strict, leaky,
over-cautious) demonstrate the framework detects opposite failure modes.
Labeled corpus: 25 unsafe prompts tagged with taxonomy ids, 30 safe prompts
non-overlapping with the lesson 83 benign set.

Includes 15 unittest cases, quiz with 6 questions, skill output, ECE binning
implementation, refusal phrase classifier.
Three classifiers behind one severity router. Toxicity (harassment terms
with negation-window check), PII (email, phone, SSN, Luhn-validated card,
IPv4), instruction-leakage (trigram cosine vs a known system prompt).
Router takes max severity across classifiers and applies block, redact,
warn, or log. Each classifier carries its own redactor; redact-severity
outputs flow through all matching redactors before shipping.

Includes 18 unittest cases, quiz with 6 questions, skill output, demo over
six fixtures exercising all four severity buckets.
YAML constitution defines rules with name, severity, applies_when, must,
explanation, fix. Predicates compose via all_of/any_of/not_. Engine emits
per-rule status (pass, violation, not_applicable) with matched span.
Fixer applies declarative append/prepend/replace operations per rule.
diff function produces structured change list between draft and revised.
Self-contained yaml_subset parser so the lesson runs without PyYAML, with
graceful fallback to PyYAML when present.

Includes 18 unittest cases, quiz with 6 questions, skill output, six-rule
constitution covering refusal redirects, code closing, PII in examples,
citations, internal library leaks, and length bounds.
Three-checkpoint composition: pre-gen detector on the prompt, during-gen
streaming filter that buffers chunks and terminates early on harmful
continuations, post-gen classifier router and rules engine on the
completed output. Deterministic aggregation table picks the final action
(block, redact, warn, allow) from the maximum severity across signals.
Each request emits a structured RequestTrace with checkpoint verdicts and
latency.

Demo runs all 50 lesson 82 fixtures plus 10 benign prompts end-to-end,
prints per-action and per-category outcomes, and writes gate_trace.json.
Includes 12 unittest cases, quiz with 6 questions, skill output, and
direct file-spec imports so sibling lessons compose without packaging.
…fact

Title em-dashes stay (matches existing capstone convention). One body
em-dash in lesson 82 prose replaced with a colon. Lesson 87 gate trace
artifact regenerated.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5dc8368d-8685-4917-845c-0411d82977a8

📥 Commits

Reviewing files that changed from the base of the PR and between 91e8782 and c2d5b44.

📒 Files selected for processing (18)
  • phases/19-capstone-projects/83-prompt-injection-detector/code/main.py
  • phases/19-capstone-projects/83-prompt-injection-detector/code/tests.py
  • phases/19-capstone-projects/84-refusal-evaluation/code/prompts.py
  • phases/19-capstone-projects/84-refusal-evaluation/docs/en.md
  • phases/19-capstone-projects/85-content-classifier-integration/code/classifiers.py
  • phases/19-capstone-projects/85-content-classifier-integration/docs/en.md
  • phases/19-capstone-projects/85-content-classifier-integration/outputs/classifier_report.json
  • phases/19-capstone-projects/85-content-classifier-integration/outputs/skill-content-classifier-integration.md
  • phases/19-capstone-projects/86-constitutional-rules-engine/code/rules.yml
  • phases/19-capstone-projects/86-constitutional-rules-engine/code/yaml_subset.py
  • phases/19-capstone-projects/86-constitutional-rules-engine/outputs/rules_report.json
  • phases/19-capstone-projects/86-constitutional-rules-engine/quiz.json
  • phases/19-capstone-projects/87-end-to-end-safety-gate/code/main.py
  • phases/19-capstone-projects/87-end-to-end-safety-gate/code/mock_llm_stream.py
  • phases/19-capstone-projects/87-end-to-end-safety-gate/code/safety_gate.py
  • phases/19-capstone-projects/87-end-to-end-safety-gate/docs/en.md
  • phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/gate_trace.json
  • phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/skill-end-to-end-safety-gate.md
✅ Files skipped from review due to trivial changes (5)
  • phases/19-capstone-projects/87-end-to-end-safety-gate/docs/en.md
  • phases/19-capstone-projects/85-content-classifier-integration/outputs/skill-content-classifier-integration.md
  • phases/19-capstone-projects/86-constitutional-rules-engine/quiz.json
  • phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/gate_trace.json
  • phases/19-capstone-projects/84-refusal-evaluation/docs/en.md

📝 Walkthrough

Walkthrough

This PR introduces six interconnected capstone lessons (82–87) forming a comprehensive LLM safety framework. Lesson 82 establishes a jailbreak taxonomy with 50 prompts across six trust-boundary categories. Lesson 83 builds a detector using layered normalization and rule matching. Lesson 84 evaluates refusal behavior via mock policies and metrics. Lesson 85 implements three independent classifiers (toxicity, PII, instruction leakage) feeding a severity-based router. Lesson 86 provides a declarative rules engine for output constraints. Lesson 87 orchestrates all prior stages (pre/during/post-generation) into a unified safety gate. The catalog is updated to reflect all six new lessons.

Changes

Capstone Lessons 82–87: Complete Safety Framework

Layer / File(s) Summary
Catalog updates and Lesson 82: Jailbreak Taxonomy
catalog.json, phases/19-capstone-projects/82-jailbreak-taxonomy/code/, phases/19-capstone-projects/82-jailbreak-taxonomy/docs/en.md, phases/19-capstone-projects/82-jailbreak-taxonomy/outputs/, phases/19-capstone-projects/82-jailbreak-taxonomy/quiz.json
Catalog incremented from 487 to 528 code files and lesson count from 17 to 23. Lesson 82 defines a six-category jailbreak taxonomy (role-play, instruction-override, context-smuggling, multi-turn-ramp, encoding-trick, prefix-injection) with 50 fixtures, trigram-based matching, validation invariants, comprehensive tests, and JSON artifact serialization.
Lesson 83: Prompt Injection Detector
phases/19-capstone-projects/83-prompt-injection-detector/code/, phases/19-capstone-projects/83-prompt-injection-detector/docs/en.md, phases/19-capstone-projects/83-prompt-injection-detector/outputs/, phases/19-capstone-projects/83-prompt-injection-detector/quiz.json
Normalization pipeline (zero-width, homoglyph, base64/hex, leet, ROT13) feeding substring and regex rules. Loads lesson 82 taxonomy, benign prompts, and computes per-category confusion matrices with precision/recall/F1 metrics and overall accuracy. Outputs detector_report.json.
Lesson 84: Refusal Evaluation
phases/19-capstone-projects/84-refusal-evaluation/code/, phases/19-capstone-projects/84-refusal-evaluation/docs/en.md, phases/19-capstone-projects/84-refusal-evaluation/outputs/, phases/19-capstone-projects/84-refusal-evaluation/quiz.json
Labeled safe/unsafe prompt corpus (25 each) with three mock LLM policies (strict, leaky, over-cautious) and regex refusal classification. Computes accuracy, under-refusal rate, over-refusal rate, ECE calibration, and per-category under-refusal breakdown. Outputs refusal_eval_report.json.
Lesson 85: Content Classifier Integration
phases/19-capstone-projects/85-content-classifier-integration/code/, phases/19-capstone-projects/85-content-classifier-integration/docs/en.md, phases/19-capstone-projects/85-content-classifier-integration/outputs/, phases/19-capstone-projects/85-content-classifier-integration/quiz.json
Three classifiers (toxicity via negation-aware matching, PII via email/phone/SSN/card/IP with Luhn validation, instruction leakage via trigram similarity). Router aggregates by max severity into four actions (block, redact, warn, log) with per-action output transformations. Outputs classifier_report.json.
Lesson 86: Constitutional Rules Engine
phases/19-capstone-projects/86-constitutional-rules-engine/code/, phases/19-capstone-projects/86-constitutional-rules-engine/docs/en.md, phases/19-capstone-projects/86-constitutional-rules-engine/outputs/, phases/19-capstone-projects/86-constitutional-rules-engine/quiz.json
YAML-based rules with recursive predicates (all_of, any_of, not_), constraints (regex, word count), and violations. Fixer applies conditional repairs (append/prepend-if-missing, regex replace). Diff computes line-level changes via SequenceMatcher. Outputs rules_report.json.
Lesson 87: End-to-End Safety Gate
phases/19-capstone-projects/87-end-to-end-safety-gate/code/, phases/19-capstone-projects/87-end-to-end-safety-gate/docs/en.md, phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/, phases/19-capstone-projects/87-end-to-end-safety-gate/quiz.json
Three-stage orchestration: pre-generation detector verdict, during-generation streaming with early termination on unsafe patterns, post-generation classifier and rules verdicts. Deterministic severity aggregation selects final action (block/redact/warn/allow). RequestTrace records all verdicts, latency, and audit per request. Outputs gate_trace.json.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/phase-19-track-i

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 19

🧹 Nitpick comments (1)
phases/19-capstone-projects/87-end-to-end-safety-gate/code/tests.py (1)

74-77: ⚡ Quick win

test_redact_when_classifier_redacts currently asserts nothing meaningful.

Line 76 allows every possible action, so this test can’t fail on regressions. Assert the intended contract (at least not allow, or specifically redact for this input).

Suggested fix
     def test_redact_when_classifier_redacts(self) -> None:
         trace = self.gate.handle("Please email me at lee@example.com about my account.")
-        self.assertIn(trace.final_action, {"redact", "block", "warn", "allow"})
+        self.assertIn(trace.final_action, {"redact", "block"})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@phases/19-capstone-projects/87-end-to-end-safety-gate/code/tests.py` around
lines 74 - 77, The test test_redact_when_classifier_redacts is currently vacuous
because it allows every outcome; instead assert the intended contract by
checking trace.final_action against the expected behavior from self.gate.handle
for an email-containing input: replace the broad
self.assertIn(trace.final_action, {"redact","block","warn","allow"}) with a
stricter assertion (e.g., self.assertNotEqual(trace.final_action, "allow") or
self.assertEqual(trace.final_action, "redact")) so the test fails on
regressions; update the assertion in test_redact_when_classifier_redacts to use
trace.final_action and the chosen expected value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@phases/19-capstone-projects/83-prompt-injection-detector/code/main.py`:
- Around line 29-30: Replace the raw invisible/confusable Unicode in the
normalization constants with explicit \uXXXX escapes so the behavior is
unchanged but the characters are auditable: update the ZERO_WIDTH regex
(ZERO_WIDTH) to use escapes for U+200B, U+200C, U+200D, U+2060 and the range
U+202A–U+202E (e.g. \u200B\u200C\u200D\u2060\u202A-\u202E) and change the
HOMOGLYPHS str.maketrans mapping (HOMOGLYPHS) to use \u04XX/\u04XX-style escapes
for the Cyrillic capitals (А U+0410, В U+0412, С U+0421, Е U+0415, Н U+041D, К
U+041A, М U+041C, О U+041E, Р U+0420, Т U+0422, Х U+0425) mapping to their ASCII
equivalents; keep the same keys/values and semantics but substitute escaped code
points instead of embedding raw characters.
- Around line 83-84: In _try_decode_base64 and _try_decode_hex replace the broad
"except Exception: pass" with targeted exception handling: catch and handle only
the decoding-related exceptions (e.g., for base64 use base64.b64decode and catch
binascii.Error, ValueError and TypeError; for hex use binascii.unhexlify and
catch binascii.Error and ValueError/TypeError), ensure you import binascii if
needed, and return None on those specific failures so other unexpected
exceptions still bubble up.

In `@phases/19-capstone-projects/83-prompt-injection-detector/code/tests.py`:
- Around line 13-15: The test contains raw zero-width characters in the string
literal and assertion which are hard to review; update the test in tests.py to
replace the embedded zero-width characters with explicit Unicode escape
sequences (e.g., use \u200b, \u200c, or \u200d as appropriate) in the variable
assigned to text and in the assertNotIn call that checks normalize(text);
reference the normalize function and the test variable (text) so reviewers can
locate and verify the change.

In `@phases/19-capstone-projects/84-refusal-evaluation/code/prompts.py`:
- Around line 65-96: SAFE_PROMPTS currently has 30 entries (s-01..s-30) but the
docs claim 25; update the documentation to match the implementation by changing
any references that say "25 safe prompts" to "30 safe prompts" (including the
module docstring and en.md). Locate the SAFE_PROMPTS definition to confirm the
count and then update the docstring text and the en.md file so they accurately
state 30 safe prompts and, if present, adjust any related summary or index
numbers to reflect the full list.
- Around line 1-7: The module docstring incorrectly states "25 safe prompts"
while the SAFE_PROMPTS constant actually contains 30 entries (IDs s-01..s-30);
update the top-level docstring in prompts.py to say "30 safe prompts" and
likewise update the corresponding sentence in en.md (around the existing
mention) to reflect 30 safe prompts so both documentation strings match the
SAFE_PROMPTS list.

In `@phases/19-capstone-projects/84-refusal-evaluation/docs/en.md`:
- Around line 42-46: Update the documentation count to match the actual
SAFE_PROMPTS length: change the phrase that reads "plus 25 safe prompts" to
"plus 30 safe prompts" (the code defines SAFE_PROMPTS with entries s-01 through
s-30 in code/prompts.py), and ensure the surrounding sentence in docs/en.md
referencing the labeled corpus reflects this corrected count.

In
`@phases/19-capstone-projects/85-content-classifier-integration/code/classifiers.py`:
- Line 90: The _CARD_RE pattern allows a trailing separator because the repeated
group includes an optional separator; change it so the final character must be a
digit: replace _CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b") with a pattern
that enforces the last character is a digit, e.g. _CARD_RE =
re.compile(r"\b(?:\d[ -]?){12,18}\d\b"), keeping the raw string and re.compile
usage so matches always end on a digit.
- Around line 112-123: The code is recording raw PII matches into the findings
list (uses m.group(0) for _EMAIL_RE, _PHONE_RE, _SSN_RE, _CARD_RE, _IPV4_RE),
which risks leaking sensitive data; update the loops that append to findings to
store only the type labels (e.g., "email", "phone", "ssn", "card", "ip") or a
strictly masked form instead of the full match, while preserving the existing
validation (e.g., keep using _luhn(digits) for card validation) and the same
loop locations (the blocks that iterate _EMAIL_RE, _PHONE_RE, _SSN_RE, _CARD_RE,
_IPV4_RE and append to findings).

In `@phases/19-capstone-projects/85-content-classifier-integration/docs/en.md`:
- Line 43: Update the documentation to match the implemented API: replace
references to Action.redacted_output with Action.output, and update descriptions
of the router function from decide(verdicts) to decide(text, verdicts) so the
docs reflect the actual function signature and returned Action shape (verb,
output, metadata); ensure any examples, explanations, and the mention in lesson
51 use these exact names to avoid integration confusion.

In
`@phases/19-capstone-projects/85-content-classifier-integration/outputs/skill-content-classifier-integration.md`:
- Around line 16-22: The fenced code block showing the data shape for
"ClassifierVerdict" lacks a language tag and triggers markdownlint MD040; update
the opening fence from ``` to a tagged fence such as ```text (or another
appropriate tag like ```yaml) so the block becomes ```text and retains the same
contents for "ClassifierVerdict", preserving indentation and lines for name,
severity, score, and findings.

In `@phases/19-capstone-projects/86-constitutional-rules-engine/code/rules.yml`:
- Around line 33-39: The rule currently detects both email and phone patterns
via the two not_contains_regex entries but the fixer only replaces emails;
update the fix block used by the no-pii-in-examples rule so it also rewrites
phone numbers. Specifically, add a second replace_regex (or expand the existing
pattern) to include the phone regex '\b(\+?\d{1,3}[ .-]?)?(\(?\d{3}\)?[
.-]?)\d{3}[ .-]?\d{4}\b' so that the fixer replaces detected phone numbers
(similar to how the existing replace_regex with pattern
'\b[\w.+-]+@[\w-]+\.[\w.-]+\b' replaces emails with '[example-user]').

In
`@phases/19-capstone-projects/86-constitutional-rules-engine/code/yaml_subset.py`:
- Around line 49-73: The coercion function _coerce currently treats the literal
"{}" as a plain string, breaking inline empty mappings like the `applies_when:
{}` used in rules.yml; update _coerce to detect empty inline mappings and return
an actual empty dict (e.g., if s startswith "{" and endswith "}" and the inner
content is whitespace/empty, return {}), leaving all other coercions unchanged.

In `@phases/19-capstone-projects/86-constitutional-rules-engine/quiz.json`:
- Around line 19-27: The quiz's correct answer index is wrong: update the
"correct" value in the quiz entry for the question "What three fields must every
rule have at minimum?" from 2 to 4 so the correct option becomes "predicate,
severity, owner" (which includes the required severity field) by editing the
"correct" key in quiz.json.

In `@phases/19-capstone-projects/87-end-to-end-safety-gate/code/main.py`:
- Around line 77-84: The benign loop over BENIGN_PROMPTS currently never updates
the global terminations counter; when gate.handle(prompt) returns a trace that
indicates an early termination (use trace.final_action or trace.terminated
flag), increment the same global terminations metric used elsewhere (named
terminations) so benign requests are included; apply the same change to the
other benign-processing block referenced by the reviewer (the logic around
traces.append and per_category_outcome updates) to mirror how terminations is
incremented for non-benign requests.

In
`@phases/19-capstone-projects/87-end-to-end-safety-gate/code/mock_llm_stream.py`:
- Around line 60-70: The stream function uses chunk_tokens as the range step
without validation, so pass a check at the top of stream(prompt: str,
chunk_tokens: int = 4) to ensure chunk_tokens is an int > 0 (e.g., raise
ValueError with a clear message if chunk_tokens <= 0 or not an int) before
calling range(..., chunk_tokens) and then proceed to chunk the tokens; reference
the stream function and the chunk_tokens parameter when adding the validation.

In `@phases/19-capstone-projects/87-end-to-end-safety-gate/code/safety_gate.py`:
- Around line 170-172: The code currently indexes SEVERITY_RANK directly with
post.classifier_severity and post.rules_max_severity causing KeyError on
unexpected tokens; change both places to safely lookup with a fallback (e.g.,
use SEVERITY_RANK.get(post.classifier_severity, <safe-default>) and
SEVERITY_RANK.get(post.rules_max_severity, <safe-default>)) so
signals.append(...) always receives a numeric severity; update the occurrences
that build ("post.classifier", ...) and ("post.rules", ...) to use .get and pick
a sensible default (like 0 or the lowest severity) to degrade gracefully.
- Around line 187-194: After calling classifier_router.run(raw_output) and
getting classifier_action.output into redacted, ensure you don't return an
empty/falsey string: if redacted is empty after classifier_router.run and after
optional rules_fixer.apply, replace it with a safe fallback (for example call a
helper like self.safe_fallback(raw_output) or return a generic safe message)
before returning; update the logic around classifier_action.output,
post.rules_violations, rules_engine.evaluate(...).violations(), and
rules_fixer.apply(...) to perform this empty-check and fallback substitution so
the redact branch never returns a blank body.

In `@phases/19-capstone-projects/87-end-to-end-safety-gate/docs/en.md`:
- Around line 42-43: The aggregation table row that currently reads "detector
confidence 0.5-0.85, no other signal | allow with note" conflicts with the
implemented action which emits final_action="warn"; update the table text for
the "detector confidence 0.5-0.85, no other signal" case to read "warn" (or
otherwise match the exact implemented token final_action="warn") so wording is
consistent with the implementation.

In
`@phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/skill-end-to-end-safety-gate.md`:
- Around line 33-43: The fenced code block showing the RequestTrace schema lacks
a language hint, which triggers MD040 linting; update the markdown fenced block
that contains "RequestTrace" (the block starting with ``` and the schema lines
including request_id, prompt, pre_gen, during_gen, post_gen, final_action,
final_output, latency_ms) to include a language identifier such as "text" (i.e.,
```text) so the block is properly annotated for the linter.

---

Nitpick comments:
In `@phases/19-capstone-projects/87-end-to-end-safety-gate/code/tests.py`:
- Around line 74-77: The test test_redact_when_classifier_redacts is currently
vacuous because it allows every outcome; instead assert the intended contract by
checking trace.final_action against the expected behavior from self.gate.handle
for an email-containing input: replace the broad
self.assertIn(trace.final_action, {"redact","block","warn","allow"}) with a
stricter assertion (e.g., self.assertNotEqual(trace.final_action, "allow") or
self.assertEqual(trace.final_action, "redact")) so the test fails on
regressions; update the assertion in test_redact_when_classifier_redacts to use
trace.final_action and the chosen expected value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a7bfd78b-0ecc-4edd-8c20-738c790182c2

📥 Commits

Reviewing files that changed from the base of the PR and between c1374e1 and 91e8782.

📒 Files selected for processing (47)
  • catalog.json
  • phases/19-capstone-projects/82-jailbreak-taxonomy/code/fixtures.py
  • phases/19-capstone-projects/82-jailbreak-taxonomy/code/main.py
  • phases/19-capstone-projects/82-jailbreak-taxonomy/code/tests.py
  • phases/19-capstone-projects/82-jailbreak-taxonomy/docs/en.md
  • phases/19-capstone-projects/82-jailbreak-taxonomy/outputs/skill-jailbreak-taxonomy.md
  • phases/19-capstone-projects/82-jailbreak-taxonomy/outputs/taxonomy.json
  • phases/19-capstone-projects/82-jailbreak-taxonomy/quiz.json
  • phases/19-capstone-projects/83-prompt-injection-detector/code/benign.py
  • phases/19-capstone-projects/83-prompt-injection-detector/code/main.py
  • phases/19-capstone-projects/83-prompt-injection-detector/code/rules.py
  • phases/19-capstone-projects/83-prompt-injection-detector/code/tests.py
  • phases/19-capstone-projects/83-prompt-injection-detector/docs/en.md
  • phases/19-capstone-projects/83-prompt-injection-detector/outputs/detector_report.json
  • phases/19-capstone-projects/83-prompt-injection-detector/outputs/skill-prompt-injection-detector.md
  • phases/19-capstone-projects/83-prompt-injection-detector/quiz.json
  • phases/19-capstone-projects/84-refusal-evaluation/code/main.py
  • phases/19-capstone-projects/84-refusal-evaluation/code/mock_llm.py
  • phases/19-capstone-projects/84-refusal-evaluation/code/prompts.py
  • phases/19-capstone-projects/84-refusal-evaluation/code/tests.py
  • phases/19-capstone-projects/84-refusal-evaluation/docs/en.md
  • phases/19-capstone-projects/84-refusal-evaluation/outputs/refusal_eval_report.json
  • phases/19-capstone-projects/84-refusal-evaluation/outputs/skill-refusal-evaluation.md
  • phases/19-capstone-projects/84-refusal-evaluation/quiz.json
  • phases/19-capstone-projects/85-content-classifier-integration/code/classifiers.py
  • phases/19-capstone-projects/85-content-classifier-integration/code/main.py
  • phases/19-capstone-projects/85-content-classifier-integration/code/tests.py
  • phases/19-capstone-projects/85-content-classifier-integration/docs/en.md
  • phases/19-capstone-projects/85-content-classifier-integration/outputs/classifier_report.json
  • phases/19-capstone-projects/85-content-classifier-integration/outputs/skill-content-classifier-integration.md
  • phases/19-capstone-projects/85-content-classifier-integration/quiz.json
  • phases/19-capstone-projects/86-constitutional-rules-engine/code/main.py
  • phases/19-capstone-projects/86-constitutional-rules-engine/code/rules.yml
  • phases/19-capstone-projects/86-constitutional-rules-engine/code/tests.py
  • phases/19-capstone-projects/86-constitutional-rules-engine/code/yaml_subset.py
  • phases/19-capstone-projects/86-constitutional-rules-engine/docs/en.md
  • phases/19-capstone-projects/86-constitutional-rules-engine/outputs/rules_report.json
  • phases/19-capstone-projects/86-constitutional-rules-engine/outputs/skill-constitutional-rules-engine.md
  • phases/19-capstone-projects/86-constitutional-rules-engine/quiz.json
  • phases/19-capstone-projects/87-end-to-end-safety-gate/code/main.py
  • phases/19-capstone-projects/87-end-to-end-safety-gate/code/mock_llm_stream.py
  • phases/19-capstone-projects/87-end-to-end-safety-gate/code/safety_gate.py
  • phases/19-capstone-projects/87-end-to-end-safety-gate/code/tests.py
  • phases/19-capstone-projects/87-end-to-end-safety-gate/docs/en.md
  • phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/gate_trace.json
  • phases/19-capstone-projects/87-end-to-end-safety-gate/outputs/skill-end-to-end-safety-gate.md
  • phases/19-capstone-projects/87-end-to-end-safety-gate/quiz.json

Comment thread phases/19-capstone-projects/83-prompt-injection-detector/code/main.py Outdated
Comment on lines +83 to +84
except Exception:
pass
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify broad exception swallowing locations in this file.
python - <<'PY'
from pathlib import Path

path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py")
for i, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
    if "except Exception" in line:
        print(f"{path}:{i}:{line.strip()}")
PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 262


🏁 Script executed:

#!/bin/bash
set -euo pipefail
python - <<'PY'
from pathlib import Path
path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py")
lines = path.read_text(encoding="utf-8").splitlines()
for start in (75, 90):
    end = start + 20
    print(f"\n--- {path}:{start}-{end} ---")
    for i in range(start-1, min(end, len(lines))):
        print(f"{i+1:4d}: {lines[i]}")
PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 1957


🏁 Script executed:

#!/bin/bash
set -euo pipefail
python - <<'PY'
from pathlib import Path
path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py")
lines = path.read_text(encoding="utf-8").splitlines()
for start in (75, 90):
    end = start + 20
    print(f"\n--- {path}:{start}-{end} ---")
    for i in range(start-1, min(end, len(lines))):
        print(f"{i+1:4d}: {lines[i]}")
PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 1957


🏁 Script executed:

#!/bin/bash
set -euo pipefail
python - <<'PY'
from pathlib import Path
path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py")
lines = path.read_text(encoding="utf-8").splitlines()
for i in range(1, 60):
    if i-1 < len(lines):
        print(f"{i:4d}: {lines[i-1]}")
PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 2022


🏁 Script executed:

#!/bin/bash
set -euo pipefail
python - <<'PY'
from pathlib import Path
path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py")
lines = path.read_text(encoding="utf-8").splitlines()
for i in range(1, 80):
    if i-1 < len(lines):
        print(f"{i:4d}: {lines[i-1]}")
PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 2731


🏁 Script executed:

#!/bin/bash
set -euo pipefail
python - <<'PY'
from pathlib import Path
path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py")
lines = path.read_text(encoding="utf-8").splitlines()
for i in range(1, 80):
    if i-1 < len(lines):
        print(f"{i:4d}: {lines[i-1]}")
PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 2731


🏁 Script executed:

#!/bin/bash
set -euo pipefail
python - <<'PY'
from pathlib import Path
path = Path("phases/19-capstone-projects/83-prompt-injection-detector/code/main.py")
lines = path.read_text(encoding="utf-8").splitlines()
for i in range(1, 60):
    if i-1 < len(lines):
        print(f"{i:4d}: {lines[i-1]}")
PY

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 2022


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "binascii" phases/19-capstone-projects/83-prompt-injection-detector/code/main.py || true

Repository: rohitg00/ai-engineering-from-scratch

Length of output: 62


Narrow exception handling in base64/hex decoding instead of swallowing all exceptions.

_try_decode_base64 and _try_decode_hex both use except Exception: pass (lines 83-84 and 98-99), which can hide unexpected failures and silently alter normalization behavior. Narrow the exceptions caught.

🔧 Proposed fix
+import binascii
@@
-        except Exception:
-            pass
+        except (binascii.Error, ValueError):
+            continue
@@
-        except Exception:
-            pass
+        except ValueError:
+            continue
🧰 Tools
🪛 Ruff (0.15.14)

[error] 83-84: try-except-pass detected, consider logging the exception

(S110)


[warning] 83-83: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@phases/19-capstone-projects/83-prompt-injection-detector/code/main.py` around
lines 83 - 84, In _try_decode_base64 and _try_decode_hex replace the broad
"except Exception: pass" with targeted exception handling: catch and handle only
the decoding-related exceptions (e.g., for base64 use base64.b64decode and catch
binascii.Error, ValueError and TypeError; for hex use binascii.unhexlify and
catch binascii.Error and ValueError/TypeError), ensure you import binascii if
needed, and return None on those specific failures so other unexpected
exceptions still bubble up.

Comment thread phases/19-capstone-projects/83-prompt-injection-detector/code/tests.py Outdated
Comment thread phases/19-capstone-projects/84-refusal-evaluation/code/prompts.py
Comment thread phases/19-capstone-projects/84-refusal-evaluation/code/prompts.py
Comment thread phases/19-capstone-projects/87-end-to-end-safety-gate/code/safety_gate.py Outdated
Comment thread phases/19-capstone-projects/87-end-to-end-safety-gate/code/safety_gate.py Outdated
Comment thread phases/19-capstone-projects/87-end-to-end-safety-gate/docs/en.md Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant