fix(sanitize): preserve angle brackets inside code blocks and inline code#2408
fix(sanitize): preserve angle brackets inside code blocks and inline code#2408blackwell-systems wants to merge 2 commits into
Conversation
3722fe4 to
680c63b
Compare
jluocsa
left a comment
There was a problem hiding this comment.
Thanks for picking up #2202 — generic-type angle brackets disappearing from code blocks is a long-standing irritation. The CommonMark-aware fence/inline detection is much better than a regex-based approximation.
Before merge, I think there's a security-sensitive issue worth addressing:
Sentinel collision can be used to bypass the HTML sanitizer.
The current sentinels are:
const (
ltSentinel = "\x00LT\x00"
gtSentinel = "\x00GT\x00"
)Sanitize runs in this order:
cleaned := FilterCodeFenceMetadata(FilterInvisibleCharacters(input))
protected := protectCodeAngleBrackets(cleaned)
sanitized := FilterHTMLTags(protected)
return restoreCodeAngleBrackets(sanitized)FilterInvisibleCharacters strips a specific list of Unicode codepoints (see shouldRemoveRune in pkg/sanitize/sanitize.go — zero-width spaces, BiDi controls, tag characters, etc.) but does not strip \x00 (NUL is not in the list).
That means an attacker who controls input — e.g. an issue title, PR body, search-result snippet, etc. — can submit:
\x00LT\x00script\x00GT\x00alert(1)\x00LT\x00/script\x00GT\x00
…outside any code block. The flow is:
FilterInvisibleCharacters→ passes through (NUL not stripped)FilterCodeFenceMetadata→ passes through (no fence)protectCodeAngleBrackets→ passes through (no backticks, the sentinel literal is treated as ordinary text)FilterHTMLTags(bluemonday) → passes through (\x00LT\x00script...is not HTML)restoreCodeAngleBrackets→ substitutes back into<script>alert(1)</script>
The output now contains raw HTML/JS that should have been stripped, defeating the purpose of FilterHTMLTags.
A few mitigation options, in increasing order of robustness:
- Add
\x00toshouldRemoveRune. Cheapest fix; consistent with the existing "strip control / invisible chars" policy. Adding it beforeprotectCodeAngleBracketsguarantees no pre-existing sentinels. - Pre-scan for the literal sentinel and refuse / escape it. A bit more code, no change to filter semantics.
- Use a per-call randomized sentinel (generate two crypto-random byte sequences once per
Sanitizeinvocation). Defeats collision by construction; only downside is a small allocation per call.
Option 1 seems most consistent with what FilterInvisibleCharacters is already doing, and the smallest possible change. A unit test mirroring TestSanitizePreservesAngleBracketsInCodeBlocks with input containing the literal sentinel would lock the behavior in.
Minor: closing-fence detection. protectCodeAngleBrackets treats any run of >= fenceLen backticks as a closing fence (anywhere on the line). CommonMark requires the closing fence to be on its own line and not have an info string. The current implementation is more permissive, which in practice means a stray ``` inside what was meant to be the fenced content will end protection early. This is a soft-fail (some < get stripped that shouldn't be) rather than a security issue, so probably acceptable, but worth a comment near the loop documenting the deviation.
Tests are nicely focused, the inline-vs-fenced split is handled, and [T comparable]-style generics outside code (where they really might be HTML-ish) are still cleaned. Once the sentinel exposure is closed, this is a great fix.
…code bluemonday's StrictPolicy treats angle brackets inside markdown code blocks and inline code spans as HTML tags and strips them. This causes content like `mut_raw_ptr<int>` to become `mut_raw_ptr` when read through MCP issue/PR endpoints. The fix protects angle brackets inside fenced code blocks (```) and inline code spans (`) with sentinels before HTML sanitization, then restores them after. Angle brackets outside code are still sanitized normally, preserving XSS protection. Fixes github#2202 Signed-off-by: Dayna Blackwell <dayna@blackwell-systems.com>
Security fix: Add NUL byte (0x00) to shouldRemoveRune so that FilterInvisibleCharacters strips NUL bytes before protectCodeAngleBrackets runs. Without this, an attacker can inject literal sentinel strings (\x00LT\x00script\x00GT\x00) that bypass FilterHTMLTags and get restored to <script> by restoreCodeAngleBrackets. Also add a comment documenting the CommonMark closing-fence deviation: the implementation treats any run of >= fenceLen backticks as a closing fence even mid-line, which is more permissive than CommonMark (requires own line, no info string). This is a soft-fail (some angle brackets may be unprotected) rather than a security issue. Tests added: - sentinel collision: verifies NUL-byte injection does not produce <script> - NUL bytes in code blocks: verifies code content is preserved after stripping - NUL byte in shouldRemoveRune: verifies 0x00 is in the removal set
f8f50a6 to
3c28543
Compare
SamMorrowDrums
left a comment
There was a problem hiding this comment.
Thanks for this @blackwell-systems — the approach is sound and the security trade-off is defensible. bluemonday continues to handle prose, and the sentinel design has the right ordering: FilterInvisibleCharacters strips NUL bytes before protectCodeAngleBrackets runs, so an attacker can't forge sentinels in the input.
Cc'ing @joannakl as a fellow maintainer — would appreciate a second pair of eyes since this touches the sanitizer.
I've left two inline suggestions to harden the contract (ordering invariant + sentinel doc). One bigger consideration:
Have you considered a CommonMark parser? github.com/yuin/goldmark is already a transitive dep in go.sum. I prototyped a drop-in replacement for protectCodeAngleBrackets against this branch:
func protectCodeAngleBrackets(input string) string {
src := []byte(input)
doc := goldmark.DefaultParser().Parse(text.NewReader(src))
type span struct{ start, stop int }
var spans []span
_ = ast.Walk(doc, func(n ast.Node, entering bool) (ast.WalkStatus, error) {
if !entering {
return ast.WalkContinue, nil
}
switch n.Kind() {
case ast.KindFencedCodeBlock, ast.KindCodeBlock:
lines := n.Lines()
for i := 0; i < lines.Len(); i++ {
seg := lines.At(i)
spans = append(spans, span{seg.Start, seg.Stop})
}
case ast.KindCodeSpan:
for c := n.FirstChild(); c != nil; c = c.NextSibling() {
if t, ok := c.(*ast.Text); ok {
seg := t.Segment
spans = append(spans, span{seg.Start, seg.Stop})
}
}
}
return ast.WalkContinue, nil
})
if len(spans) == 0 {
return input
}
var out bytes.Buffer
out.Grow(len(src))
cursor := 0
for _, s := range spans {
out.Write(src[cursor:s.start])
for _, b := range src[s.start:s.stop] {
switch b {
case '<':
out.WriteString(ltSentinel)
case '>':
out.WriteString(gtSentinel)
default:
out.WriteByte(b)
}
}
cursor = s.stop
}
out.Write(src[cursor:])
return out.String()
}About 40 lines instead of 150, and it picks up indented (4-space) code blocks for free, which the current scanner misses:
Text
let x: Vec<u8> = vec![];
I ran it against the same edge cases (4-backtick fences, nested backticks, unclosed fences, sentinel-injection probes, <script> inside fences, etc.) and behavior matches or improves on the hand-rolled version. NUL-strip ordering is preserved, so the sentinel-collision defense is unchanged.
Would also let us drop ~150 lines and rely on a CommonMark-compliant parser for future edge cases (autolinks adjacent to fences, etc.).
Happy with this either way — if you'd rather land the current approach and follow up, that's totally fine. Up to you and @joannakl. 🙏
A non-blocking nit: the package doc would be a good place to write down the threat model (input source: untrusted GitHub content; goal: strip HTML from prose without corrupting code) so the next reader has the context up front.
| func Sanitize(input string) string { | ||
| return FilterHTMLTags(FilterCodeFenceMetadata(FilterInvisibleCharacters(input))) | ||
| cleaned := FilterCodeFenceMetadata(FilterInvisibleCharacters(input)) | ||
| // Protect angle brackets inside code blocks and inline code spans | ||
| // from being stripped by the HTML sanitizer. bluemonday treats <int>, | ||
| // <T>, etc. as unknown HTML tags and removes them. | ||
| // See https://github.com/github/github-mcp-server/issues/2202 | ||
| protected := protectCodeAngleBrackets(cleaned) | ||
| sanitized := FilterHTMLTags(protected) | ||
| return restoreCodeAngleBrackets(sanitized) |
There was a problem hiding this comment.
Optional: pull out the ordering as a named invariant so future readers don't accidentally reorder these calls. The security argument relies on FilterInvisibleCharacters running first (so NUL is stripped before sentinels are introduced) and on restoreCodeAngleBrackets running last (after FilterHTMLTags).
| func Sanitize(input string) string { | |
| return FilterHTMLTags(FilterCodeFenceMetadata(FilterInvisibleCharacters(input))) | |
| cleaned := FilterCodeFenceMetadata(FilterInvisibleCharacters(input)) | |
| // Protect angle brackets inside code blocks and inline code spans | |
| // from being stripped by the HTML sanitizer. bluemonday treats <int>, | |
| // <T>, etc. as unknown HTML tags and removes them. | |
| // See https://github.com/github/github-mcp-server/issues/2202 | |
| protected := protectCodeAngleBrackets(cleaned) | |
| sanitized := FilterHTMLTags(protected) | |
| return restoreCodeAngleBrackets(sanitized) | |
| func Sanitize(input string) string { | |
| // Ordering is security-load-bearing: | |
| // 1. FilterInvisibleCharacters strips NUL bytes BEFORE sentinels are | |
| // introduced, preventing an attacker from forging \x00LT\x00 in input. | |
| // 2. protectCodeAngleBrackets replaces < and > inside code with sentinels | |
| // so bluemonday does not strip them as HTML. | |
| // 3. FilterHTMLTags (bluemonday) sanitizes prose; sentinels pass through. | |
| // 4. restoreCodeAngleBrackets converts sentinels back to < and >. | |
| // See https://github.com/github/github-mcp-server/issues/2202 | |
| cleaned := FilterCodeFenceMetadata(FilterInvisibleCharacters(input)) | |
| protected := protectCodeAngleBrackets(cleaned) | |
| sanitized := FilterHTMLTags(protected) | |
| return restoreCodeAngleBrackets(sanitized) | |
| } |
| // Sentinels used to protect angle brackets inside code from HTML sanitization. | ||
| // These are chosen to be unlikely to appear in real content. | ||
| const ( | ||
| ltSentinel = "\x00LT\x00" | ||
| gtSentinel = "\x00GT\x00" | ||
| ) |
There was a problem hiding this comment.
Worth noting in the doc comment that the NUL-stripping in FilterInvisibleCharacters is what makes these sentinels collision-free — it's a non-obvious dependency.
| // Sentinels used to protect angle brackets inside code from HTML sanitization. | |
| // These are chosen to be unlikely to appear in real content. | |
| const ( | |
| ltSentinel = "\x00LT\x00" | |
| gtSentinel = "\x00GT\x00" | |
| ) | |
| // Sentinels used to protect angle brackets inside code from HTML sanitization. | |
| // They embed NUL bytes, which are stripped from all input by | |
| // FilterInvisibleCharacters before this function runs — that strip is what | |
| // makes the sentinels collision-free against attacker-controlled input. | |
| const ( | |
| ltSentinel = "\x00LT\x00" | |
| gtSentinel = "\x00GT\x00" | |
| ) |
Description
bluemonday.StrictPolicy()treats angle brackets inside markdown code blocks and inline code spans as HTML tags and strips them. This causes content likemut_raw_ptr<int>in issue/PR bodies to becomemut_raw_ptrwhen read through MCP endpoints.Before
Agent sees:
let ptr: mut_raw_ptr = raw_new int;After
Agent sees:
let ptr: mut_raw_ptr<int> = raw_new int;Root cause
FilterHTMLTagscallsbluemonday.Sanitize()on the entire markdown body without distinguishing code from prose. Bluemonday treats<int>,<T>,<String>, etc. as unrecognized HTML tags and removes them.Fix
Before HTML sanitization, replace
<and>inside fenced code blocks (```) and inline code spans (`) with null-byte sentinels that bluemonday will not interpret as HTML. After sanitization, restore the sentinels to angle brackets.This preserves XSS protection for angle brackets in prose (e.g.
<script>is still stripped) while keeping angle brackets inside code intact.Testing
Added 6 test cases covering:
Vec<String>)Fixes #2202