feat(githubapp): pair PEM keys with App IDs across chunks#4980

Draft
lukem-ts wants to merge 1 commit into main from githubapp-cross-chunk-pairing-experimental

@lukem-ts lukem-ts commented May 25, 2026

Description:

  • Reworks pkg/detectors/githubapp so a PEM private key in one chunk and an App ID in another chunk (same source) can be paired and verified, instead of requiring both halves to live in the same chunk.
  • Pairing state is held in a sync.Map keyed by SourceID with *sourceState values. Each sourceState holds two hashicorp/golang-lru/v2 caches (PEMs, App IDs) capped at 256 entries each. Entries expire after 30m; a best-effort reaper runs at most every 5m.
  • Regexes widened: keyPat now matches any BEGIN/END … PRIVATE KEY block (not just RSA); appPat matches github_app_id, gh-app-id, app id, etc. with 4–9 digit IDs.
  • PEMs are validated via a shape check and de-duplicated by SHA-256 fingerprint before being cached or paired.
  • Implements MaxSecretSizeProvider (4096), MultiPartCredentialProvider (4096), and CustomFalsePositiveChecker.
  • Keywords() now returns {"github", "private key"}.
  • No new dependencies — hashicorp/golang-lru/v2 is already in go.mod.
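
A minimal sketch of the pairing idea described above, not the PR's code. It assumes an int64 source ID, uses plain maps in place of the two LRU caches, and omits the 256-entry cap and the TTL; the names sourceState, addPEM, addAppID and pairing are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// sourceState is a hypothetical stand-in for the PR's per-source state.
// Plain maps replace the two LRU caches; the 256-entry cap and TTL are omitted.
type sourceState struct {
	mu     sync.Mutex
	pems   map[string]struct{} // SHA-256 fingerprints of PEMs seen so far
	appIDs map[string]struct{} // App IDs seen so far
}

// pairing is one (PEM fingerprint, App ID) candidate that could be verified.
type pairing struct{ Fingerprint, AppID string }

// states maps a source ID (int64 here, an assumption) to its *sourceState.
var states sync.Map

func stateFor(sourceID int64) *sourceState {
	fresh := &sourceState{pems: map[string]struct{}{}, appIDs: map[string]struct{}{}}
	v, _ := states.LoadOrStore(sourceID, fresh)
	return v.(*sourceState)
}

// fingerprint returns the hex-encoded SHA-256 digest of the PEM bytes.
func fingerprint(pemBytes []byte) string {
	sum := sha256.Sum256(pemBytes)
	return hex.EncodeToString(sum[:])
}

// addPEM caches a PEM once per source and pairs it with every cached App ID.
func addPEM(sourceID int64, pemBytes []byte) []pairing {
	st := stateFor(sourceID)
	st.mu.Lock()
	defer st.mu.Unlock()
	fp := fingerprint(pemBytes)
	if _, dup := st.pems[fp]; dup {
		return nil // same fingerprint already cached and paired
	}
	st.pems[fp] = struct{}{}
	var out []pairing
	for id := range st.appIDs {
		out = append(out, pairing{fp, id})
	}
	return out
}

// addAppID caches an App ID once per source and pairs it with every cached PEM.
func addAppID(sourceID int64, appID string) []pairing {
	st := stateFor(sourceID)
	st.mu.Lock()
	defer st.mu.Unlock()
	if _, dup := st.appIDs[appID]; dup {
		return nil
	}
	st.appIDs[appID] = struct{}{}
	var out []pairing
	for fp := range st.pems {
		out = append(out, pairing{fp, appID})
	}
	return out
}

func main() {
	// Chunk 1 of source 1 holds only the key; chunk 2 holds only the App ID.
	fmt.Println(len(addPEM(1, []byte("pem bytes from chunk 1")))) // no App ID cached yet
	fmt.Println(len(addAppID(1, "123456")))                        // pairs with the cached PEM
}
```

Whichever half arrives second triggers the pairing, so chunk order within a source does not matter in this sketch.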

Checklist:

  • [x] Tests passing (make test-community)?
  • [x] Lint passing (make lint; this requires golangci-lint)?

Note

Medium Risk
Changes core secret-matching and adds per-source in-memory pairing plus live GitHub API verification; broader regexes may shift false-positive/negative behavior until exercised in production scans.

Overview
Reworks the GitHub App detector so a 2048-bit RSA PKCS#1 PEM and an App ID no longer have to appear in the same scan chunk. When chunk_source_id is present, each source keeps bounded LRU caches of “half” credentials and pairs new chunks with prior halves (with companion_location and pairing metadata); without a source ID, only in-chunk pairs are emitted.

Detection is tightened and broadened at once: PEM blocks match generic BEGIN/END … PRIVATE KEY text but are accepted only after a GitHub-app-shaped key check (no encrypted PEM headers, fixed RSA parameters); App IDs match more config-style labels (github_app_id, gh-app-id, etc.) with 4–9 digits. Results now use RawV2, structured SecretParts, and optional verification that calls GitHub’s /app API and records app/owner/permission fields on success.
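
For context on the verification step, here is a hedged sketch of how a paired key and App ID can be turned into a request credential. It follows GitHub's documented App authentication (an RS256 JWT with the App ID as issuer, sent as a Bearer token on GET https://api.github.com/app) and is not taken from the PR. The helper names appJWT and demoToken are hypothetical, and the HTTP call itself is left out.

```go
package main

import (
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"time"
)

// appJWT builds an RS256-signed JWT whose issuer is the App ID.
func appJWT(key *rsa.PrivateKey, appID string, now time.Time) (string, error) {
	enc := base64.RawURLEncoding
	header := enc.EncodeToString([]byte(`{"alg":"RS256","typ":"JWT"}`))
	claims, err := json.Marshal(map[string]any{
		"iat": now.Add(-time.Minute).Unix(),    // backdated to tolerate clock drift
		"exp": now.Add(9 * time.Minute).Unix(), // GitHub allows at most 10 minutes
		"iss": appID,
	})
	if err != nil {
		return "", err
	}
	signingInput := header + "." + enc.EncodeToString(claims)
	digest := sha256.Sum256([]byte(signingInput))
	sig, err := rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:])
	if err != nil {
		return "", err
	}
	return signingInput + "." + enc.EncodeToString(sig), nil
}

// demoToken signs a token with a throwaway key. A real caller would parse
// the PEM found in the scanned chunk instead of generating one.
func demoToken() (string, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return "", err
	}
	return appJWT(key, "123456", time.Now())
}

func main() {
	tok, err := demoToken()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The token would be sent as "Authorization: Bearer <tok>" on GET /app.
	fmt.Println("token length:", len(tok))
}
```

A successful response to that request is what would let a detector mark the pair as verified; a throwaway key like the one above would be rejected.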

The scanner becomes stateful (sync.Map + TTL reaping), advertises 4096-byte max secret/credential span, adds private key as a keyword, and unit/integration tests are updated for YAML-style fixtures and cross-chunk behavior.

Reviewed by Cursor Bugbot for commit 73b9ad6. Bugbot is set up for automated code reviews on this repo.

…n separate chunks of the same source can be paired and verified together
@CLAassistant
CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

- func (s Scanner) Keywords() []string {
- 	return []string{"github"}
+ func (s *Scanner) Keywords() []string {
+ 	return []string{"github", "private key"}
Keywords don't cover all appPat regex alternatives

Medium Severity

The appPat regex matches three alternatives: github[-_ ]?app[-_ ]?id, gh[-_ ]?app[-_ ]?id, and app[-_ ]?id. However, Keywords() only returns ["github", "private key"]. The Aho-Corasick pre-filter requires at least one keyword to appear in a chunk before it reaches the detector. Chunks containing only gh_app_id or app_id (without "github" or "private key" elsewhere) will never pass the pre-filter, so the detector is never invoked on them. This breaks cross-chunk pairing for those patterns since the app ID half is silently dropped.
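
The gap can be reproduced with a small sketch. The label alternatives and the 4–9 digit ID come from this PR; the separator part of the regex is an assumption, and a case-insensitive substring check stands in for the Aho-Corasick pre-filter.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keywords mirrors what Keywords() returns in this PR.
var keywords = []string{"github", "private key"}

// appPat approximates the PR's App ID pattern. The three label alternatives
// are the ones named in the review; the `\s*[:=]\s*["']?` part is assumed.
var appPat = regexp.MustCompile(
	`(?i)(?:github[-_ ]?app[-_ ]?id|gh[-_ ]?app[-_ ]?id|app[-_ ]?id)\s*[:=]\s*["']?(\d{4,9})\b`)

// passesPrefilter reports whether any keyword occurs in the chunk.
// It is a simplified stand-in for the keyword pre-filter.
func passesPrefilter(chunk string) bool {
	lower := strings.ToLower(chunk)
	for _, kw := range keywords {
		if strings.Contains(lower, kw) {
			return true
		}
	}
	return false
}

func main() {
	chunks := []string{
		"github_app_id: 123456", // keyword hit, regex hit
		"gh_app_id: 123456",     // regex hit, but no keyword
		"app-id = 98765",        // regex hit, but no keyword
	}
	for _, chunk := range chunks {
		fmt.Printf("%-24q prefilter=%-5v regex=%v\n",
			chunk, passesPrefilter(chunk), appPat.MatchString(chunk))
	}
}
```

The last two chunks match the regex but never reach it, which is the dropped App ID half described above.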

@github-actions
Corpora Test Results

Scans a corpus of real-world public code against only the detectors changed in this PR, then compares unique match counts between the PR build and the main baseline to catch regex regressions. Verification is disabled — each detector's regex is measured independently.

0 new · 1 clean  |  Scoped to: githubapp

| Status | Detector | Unique matches (main) | Unique matches (PR) | New | Removed |
|---|---|---|---|---|---|
| ✅ | githubapp | 0 | 0 | 0 | 0 |
  • 🔴 regression: >5 new, >20% increase over main, or any removed
  • ⚠️ warning: 1–5 new and ≤20% increase over main
  • ✅ clean
  • 🆕 new detector (no baseline)
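
The legend can be read as a small decision function. This is a sketch of one interpretation, not the workflow's actual code: the classify helper and its parameters are hypothetical, and it assumes that any new match against a zero baseline counts as exceeding the 20% threshold.

```go
package main

import "fmt"

// classify applies the legend's thresholds to one detector's counts.
// hasBaseline is false when the detector does not exist on main.
func classify(hasBaseline bool, mainCount, added, removed int) string {
	if !hasBaseline {
		return "new detector"
	}
	if removed > 0 || added > 5 {
		return "regression"
	}
	if added == 0 {
		return "clean"
	}
	// 1-5 new matches: the percentage increase decides.
	// Assumption: with a zero baseline, any increase is treated as >20%.
	if mainCount == 0 || added*100 > 20*mainCount {
		return "regression"
	}
	return "warning"
}

func main() {
	// The githubapp row from this run: 0 on main, 0 new, 0 removed.
	fmt.Println(classify(true, 0, 0, 0))
}
```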
