Twin Router Bench

Twin Router Bench is a single benchmark suite for per-step LLM routing: a router chooses which pooled model_id to use on every agent step, under locked pricing and cache rules. The suite ships in one Python distribution (twinrouterbench) and one source tree (TwinRouterBench/).

It contains two tracks inside the same product. They share one protocol and the same locked tables, so they are not two separate benchmarks:

Track	Role	CLI entry
Static	Fast validation on a fixed supervision bank (tier labels + nominal cost metrics).	`twinrouterbench static …`
Dynamic	End-to-end evaluation on SWE-bench Verified with real tool use—mini-swe-agent scaffold or editor scaffold.	`twinrouterbench dynamic …` / `twinrouterbench swe …`

Where to run commands (important)

Many examples use paths relative to the checkout parent directory (the parent of TwinRouterBench/), for example:

semantic-router/... — checkout of the semantic-router repo (KNN weights and models.py).
TwinRouterBench/data/static/... — static track: question_bank.jsonl, manifest.json.
TwinRouterBench/data/dynamic/... — dynamic track: locked pool, pricing, TTL, tier map, SR-KNN mapping.

Recommended: cd to the checkout parent directory, pip install -e "./TwinRouterBench[dynamic]", then run twinrouterbench … from there so those relative paths resolve.

The dynamic CLI also loads TwinRouterBench/.env via miniswerouter.cli (see Configuration). That file is resolved from the TwinRouterBench package directory, not from your shell cwd, so keeping .env next to pyproject.toml under TwinRouterBench/ is the supported layout.
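The cwd-independent lookup described above can be pictured with a short sketch. It assumes the CLI module sits one package level below the TwinRouterBench/ root (as miniswerouter/cli.py does); `dotenv_path_for` is a hypothetical helper written for illustration, not a function exported by the package.

```python
from pathlib import Path


def dotenv_path_for(module_file: str) -> Path:
    """Return the .env path next to pyproject.toml for a module inside the tree.

    Hypothetical helper: assumes the module lives one package level below
    the TwinRouterBench/ root (e.g. TwinRouterBench/miniswerouter/cli.py).
    """
    package_dir = Path(module_file).resolve().parent  # .../TwinRouterBench/miniswerouter
    return package_dir.parent / ".env"                # .../TwinRouterBench/.env


# The result depends only on where the module lives, never on os.getcwd().
example = dotenv_path_for("/work/TwinRouterBench/miniswerouter/cli.py")
print(example)
```

Because the path is derived from the module location, running the CLI from any working directory would still pick up the same file.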

Install

Static track only (lightweight dependencies):

pip install -e ./TwinRouterBench
# or from PyPI once published:
# pip install twinrouterbench

Full suite (adds Docker, SWE-bench harness, mini-swe-agent, LiteLLM, etc.):

pip install -e "./TwinRouterBench[dynamic]"

If you run twinrouterbench dynamic or twinrouterbench swe without [dynamic], the CLI exits with an explicit message to install the extra.

Configuration

`TwinRouterBench/.env`

Copy TwinRouterBench/.env.example to TwinRouterBench/.env.
Fill in real credentials (never commit .env).

Load order (dynamic / mini CLI):

_load_mini_dotenv — reads TwinRouterBench/.env and sets any key that is missing or empty in the process environment (already-set variables win).
_apply_gateway_aliases — if CommonStack variables are set, they override chat-related OpenRouter/SWERouter URL and API key (see table below).

Variable	Purpose
`OPENROUTER_BASE_URL`	OpenAI-compatible root URL (must end with `/v1`; clients append `/chat/completions`). Example: `https://openrouter.ai/api/v1`.
`OPENROUTER_API_KEY`	Bearer token for the gateway above.
`OPENROUTER_API_KEY_EXP`	Optional alternate key; if `OPENROUTER_API_KEY` is empty after dotenv, it is copied from this.
`SWEROUTER_BASE_URL` / `SWEROUTER_API_KEY`	Optional explicit names; `miniswerouterbench run` defaults fall back to `OPENROUTER_*` when unset.
`COMMONSTACK_API_BASE`	If set, replaces `OPENROUTER_BASE_URL` and `SWEROUTER_BASE_URL` after bootstrap.
`COMMONSTACK_API_KEY`	If set, replaces `OPENROUTER_API_KEY`, `OPENROUTER_API_KEY_EXP`, and `SWEROUTER_API_KEY` after bootstrap.

Common pitfall: OPENROUTER_BASE_URL pointing at CommonStack while OPENROUTER_API_KEY is still an OpenRouter sk-or-v1-… key (or the reverse) yields 401 or “invalid access key”. Use a consistent pair: either full CommonStack base + CommonStack access key, or OpenRouter base + OpenRouter key.
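The two-step load order above can be sketched as follows. This is an illustration of the documented semantics, not the code in miniswerouter.cli: the function names mirror the steps but are re-implemented here, the environment is a plain dict rather than os.environ, and `https://gateway.example/v1`, `cs-key`, `shell-key`, and `file-key` are placeholder values.

```python
from typing import Dict, MutableMapping


def load_dotenv_fill_missing(file_values: Dict[str, str],
                             env: MutableMapping[str, str]) -> None:
    """Step 1: copy a key from .env only if it is missing or empty in env."""
    for key, value in file_values.items():
        if not env.get(key):  # already-set, non-empty variables win
            env[key] = value


def apply_gateway_aliases(env: MutableMapping[str, str]) -> None:
    """Step 2: CommonStack variables, when set, override the chat gateway pair."""
    base = env.get("COMMONSTACK_API_BASE")
    key = env.get("COMMONSTACK_API_KEY")
    if base:
        env["OPENROUTER_BASE_URL"] = base
        env["SWEROUTER_BASE_URL"] = base
    if key:
        for name in ("OPENROUTER_API_KEY", "OPENROUTER_API_KEY_EXP", "SWEROUTER_API_KEY"):
            env[name] = key


# Toy environment: the shell already exports a key, .env supplies the rest.
env = {"OPENROUTER_API_KEY": "shell-key"}
load_dotenv_fill_missing(
    {"OPENROUTER_API_KEY": "file-key",
     "OPENROUTER_BASE_URL": "https://openrouter.ai/api/v1"},
    env,
)
before_alias = dict(env)  # snapshot after step 1

# Placeholder CommonStack values; once set, they replace base URL and key together.
env["COMMONSTACK_API_BASE"] = "https://gateway.example/v1"
env["COMMONSTACK_API_KEY"] = "cs-key"
apply_gateway_aliases(env)
```

After step 1 the shell value still wins for `OPENROUTER_API_KEY`; after step 2 both the base URL and the keys come from the CommonStack variables, which keeps the pair consistent.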

Shell snippets under TwinRouterBench/scripts/examples/env.inc.sh mirror the Python alias logic for bash-driven runs.

Prerequisites (dynamic track)

Docker running locally; enough disk for SWE-bench images.
Network for image pulls and the chat gateway.
API keys as above. The dynamic CLIs load TwinRouterBench/.env before parsing arguments.

The static metrics subcommand needs neither Docker nor network access if you only aggregate local JSON.

semantic-router (SR KNN only)

The SemanticRouterKNNRouter loads:

knn_model.json (feature dim 1024 + 14 category one-hot),
semantic-router repo root (for ml_model_selection/models.py / KNNModel.load),
sentence-transformers embedder (default Qwen/Qwen3-Embedding-0.6B; download the weights first).

Ensure a checkout exists at semantic-router/ relative to the checkout parent directory (or pass absolute --router-arg paths).
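To make the "1024 + 14 category one-hot" feature layout concrete, here is a minimal sketch that concatenates an embedding with a one-hot category block. The ordering (embedding first, one-hot last), the helper name `build_feature_vector`, and the placeholder category names are all assumptions; the real names and order come from the semantic-router training code.

```python
from typing import List, Sequence

EMBEDDING_DIM = 1024   # embedding width stated above
NUM_CATEGORIES = 14    # size of the category one-hot stated above


def build_feature_vector(embedding: Sequence[float],
                         category: str,
                         categories: Sequence[str]) -> List[float]:
    """Concatenate embedding + one-hot(category). Layout is an assumption."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} embedding values, got {len(embedding)}")
    if len(categories) != NUM_CATEGORIES:
        raise ValueError(f"expected {NUM_CATEGORIES} categories, got {len(categories)}")
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    one_hot = [1.0 if name == category else 0.0 for name in categories]
    return list(embedding) + one_hot


# Placeholder category names; only "other" appears in this README.
categories = [f"category_{i}" for i in range(13)] + ["other"]
features = build_feature_vector([0.0] * EMBEDDING_DIM, "other", categories)
```

A mismatch between the embedder's output width and the width the KNN model was trained on would surface here as a length error rather than as silently wrong predictions.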

Command-line interface

Primary entrypoint:

twinrouterbench static <subcommand> [args]   # static track
twinrouterbench dynamic <subcommand> [args]  # mini-swe-agent harness
twinrouterbench swe <subcommand> [args]     # editor-scaffold harness

Compatibility console scripts (same code after the same install):

Script	Module
`CommonRouterBench`	`main.cli`
`miniswerouterbench`	`miniswerouter.cli`
`swerouterbench`	`swerouter.cli`

For debugging without installing entrypoints:

export PYTHONPATH=/abs/path/to/TwinRouterBench
python -m miniswerouter.cli run …

Static track

`twinrouterbench static metrics`

Aggregates Section 11–style metrics from a JSON file containing an array of CaseMetrics objects (see main.metrics.CaseMetrics and case_metrics_from_dict).

twinrouterbench static metrics --cases /path/to/cases.json

Each element must at least include case_id, task_passed, and either nominal cost fields or baseline_steps / optimal_steps / test_steps lists with completion_tokens (and optional tier / model). Example skeleton:

[
  {
    "case_id": "example-1",
    "task_passed": true,
    "baseline_cost_nominal": 10.0,
    "optimal_cost_nominal": 4.0,
    "test_cost_nominal": 5.0
  }
]

Question bank: shipped under data/static/ (question_bank.jsonl, manifest.json). The main package exposes DATA_DIR / STATIC_DATA_DIR / QUESTION_BANK_PATH pointing at that directory (see main.dataset). setuptools package-data includes those files for wheel installs. Tier-only eval APIs live under main.eval.

Static track — metric field names (outputs)

1) twinrouterbench static metrics --cases … prints one JSON object from main.metrics.aggregate_routerbench_metrics (Section 11–style task-level savings on nominal tier rates):

Field	Meaning
`valid_cases`	Number of cases in the input array.
`passed_cases`	Cases with `task_passed == true`.
`pass_rate`	`passed_cases / valid_cases`.
`cost_score_cases_used`	Passed cases with `save_gt > 0` included in the cost ratio.
`sum_save_gt_usd` / `sum_save_test_usd`	Sums of per-case savings vs baseline / vs test (USD, passed + positive-save_gt subset).
`cost_savings_score`	`100 * sum_save_test_usd / sum_save_gt_usd` on that subset (`NaN` if denominator is 0).
`money_saved_test`	Per-case `baseline_nominal - test_nominal` stats over passed cases (`mean_per_case_over_passed`, `total_over_passed`).
`pricing` / `cost_score_rule`	Fixed tier rates used and a short rule string (documentation only).
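The ratio fields in the table can be checked by hand on the example skeleton from the previous section. The sketch below is not main.metrics.aggregate_routerbench_metrics; it assumes `save_gt = baseline - optimal` and `save_test = baseline - test` per case, which matches the field descriptions but is not confirmed here.

```python
import math
from typing import Dict, List


def aggregate_sketch(cases: List[Dict]) -> Dict[str, float]:
    """Recompute a few headline fields under the assumed savings definitions."""
    passed = [c for c in cases if c["task_passed"]]
    used = []  # (save_gt, save_test) for passed cases with save_gt > 0
    for c in passed:
        save_gt = c["baseline_cost_nominal"] - c["optimal_cost_nominal"]    # assumed
        save_test = c["baseline_cost_nominal"] - c["test_cost_nominal"]     # assumed
        if save_gt > 0:
            used.append((save_gt, save_test))
    sum_gt = sum(g for g, _ in used)
    sum_test = sum(t for _, t in used)
    return {
        "valid_cases": len(cases),
        "passed_cases": len(passed),
        "pass_rate": len(passed) / len(cases) if cases else math.nan,
        "cost_score_cases_used": len(used),
        "sum_save_gt_usd": sum_gt,
        "sum_save_test_usd": sum_test,
        "cost_savings_score": 100 * sum_test / sum_gt if sum_gt else math.nan,
    }


cases = [{
    "case_id": "example-1",
    "task_passed": True,
    "baseline_cost_nominal": 10.0,
    "optimal_cost_nominal": 4.0,
    "test_cost_nominal": 5.0,
}]
result = aggregate_sketch(cases)
```

For this single case the router saves 5.0 of a possible 6.0, so under these assumptions `cost_savings_score` comes out at 100 * 5 / 6, roughly 83.3.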

2) Tier-supervision eval summary (main.eval.build_eval_summary / run_question_bank_eval): one top-level JSON with routing rows and several metric blocks. Headline tier metrics most papers care about live under scores_v2 (main.eval.compute_v2_scores):

Field (`scores_v2`)	Meaning
`case_pass_rate_percent`	Row fraction with `pred_tier_id >= gold_tier_id` (errors count as fail).
`case_exact_match_percent`	Row fraction with `pred_tier_id == gold_tier_id`.
`trajectory_pass_rate_percent`	Case-weighted share of rows whose whole trajectory passes (every step `pred >= gold`, no error).
`cost_savings_score_percent`	Macro-averaged trajectory-level cost savings vs always-high baseline (see `scores_v2.note` in the JSON).
`combined_score_percent`	Mean of the four percentages above (`NaN` if any component is `NaN`).
`total_rows` / `error_rows`	Row counts; `case_pass_count` / `case_exact_count` are integer numerators.
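The three pass-rate fields can be illustrated on a toy set of rows. This is a sketch under stated assumptions, not main.eval.compute_v2_scores: the row keys `trajectory_id` and `error` are hypothetical, error rows are assumed to fail exact match as well, and "case-weighted" is read as weighting each trajectory by its number of rows.

```python
import math
from collections import defaultdict
from typing import Dict, List


def tier_scores_sketch(rows: List[Dict]) -> Dict[str, float]:
    """Row-level and trajectory-level pass rates, in percent."""
    total = len(rows)
    if total == 0:
        return {k: math.nan for k in (
            "case_pass_rate_percent", "case_exact_match_percent",
            "trajectory_pass_rate_percent")}

    def passes(row: Dict) -> bool:
        # Errors count as fail; otherwise pred must be at least the gold tier.
        return (not row.get("error")) and row["pred_tier_id"] >= row["gold_tier_id"]

    def exact(row: Dict) -> bool:
        return (not row.get("error")) and row["pred_tier_id"] == row["gold_tier_id"]

    by_trajectory = defaultdict(list)
    for row in rows:
        by_trajectory[row["trajectory_id"]].append(row)
    # Count rows that belong to a trajectory in which every step passes.
    rows_in_passing = sum(
        len(group) for group in by_trajectory.values() if all(passes(r) for r in group)
    )
    return {
        "case_pass_rate_percent": 100 * sum(passes(r) for r in rows) / total,
        "case_exact_match_percent": 100 * sum(exact(r) for r in rows) / total,
        "trajectory_pass_rate_percent": 100 * rows_in_passing / total,
    }


rows = [
    {"trajectory_id": "A", "pred_tier_id": 2, "gold_tier_id": 1},  # over-provisioned, passes
    {"trajectory_id": "A", "pred_tier_id": 1, "gold_tier_id": 1},  # exact
    {"trajectory_id": "B", "pred_tier_id": 0, "gold_tier_id": 1},  # under-provisioned, fails
    {"trajectory_id": "B", "pred_tier_id": 2, "gold_tier_id": 2},  # exact
]
scores = tier_scores_sketch(rows)
```

Trajectory B shows why the trajectory-level number can be lower than the row-level one: a single under-provisioned step fails the whole trajectory even though its other step is an exact match.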

Other useful keys on the same summary object:

Field	Meaning
`tier_match_accuracy` / `accuracy_excluding_errors`	Exact tier match rate on rows without API errors.
`exact_match`	Integer count of exact matches (same scope as tier accuracy numerator).
`api_errors` / `valid_response_rate`	Error count and `1 - api_errors/sampled`.
`section_11`	Older single-step pass rate + `cost_savings_score` (uniform tokens); distinct from `scores_v2`.
`router_accounting`	Trajectory-level USD accounting: `D_usd`, `N_usd`, `pass_rate_percent`, `exact_match_rate_percent`, `accounting_savings_score_percent`, `overall_score_percent` (mean of those three headline percents).

Dynamic track (`twinrouterbench dynamic …`)

This forwards to miniswerouter.cli (run, score, audit-infra, audit-trace-cost, render).

`run` — main flags

Flag	Meaning
`--router-import`	Required. `module:factory`, e.g. `swerouter.routers.sr_knn_adapter:SemanticRouterKNNRouter.from_cli_args`.
`--router-arg KEY=VALUE`	Repeatable; passed as kwargs to the factory. Values are strings.
`--router-label`	Required label stored in `eval_summary.json` and traces.
`--output-dir`	Required. Run artifacts root (created if needed).
`--base-url` / `--api-key`	Default from `SWEROUTER_` then `OPENROUTER_` env after `.env` load.
`--instances id1 id2 …`	Optional explicit SWE-bench instance IDs.
`--limit N`	Optional cap on how many dataset instances to consider (ordering is harness-defined).
`--workers`	Parallel workers (default 2).
`--max-steps`	Agent step limit (default 250, matches mini-swe-agent SWE profile).
`--budget-usd`	Agent cost limit in USD (default 3).
`--run-id`	Stored in summaries; use a new id when you want a logically separate run.
`--force-rerun`	Re-run instances even if `results/<instance_id>.json` already exists.
`--pool`, `--pricing`, `--ttl`, `--tier-map`	Override locked JSON paths (defaults under `TwinRouterBench/data/dynamic/`).
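The `module:factory` form and the string-valued `KEY=VALUE` pairs can be sketched as below. This is a hedged illustration of the convention, not the CLI's own parser: `resolve_factory` and `parse_router_args` are hypothetical names, splitting on the first `=` is an assumption, and a standard-library target stands in for a real router factory so the snippet runs without the package installed.

```python
import importlib
from typing import Any, Callable, Dict, List


def resolve_factory(spec: str) -> Callable[..., Any]:
    """Resolve 'package.module:Attr.sub_attr' to a callable."""
    module_name, _, attr_path = spec.partition(":")
    if not module_name or not attr_path:
        raise ValueError(f"expected 'module:factory', got {spec!r}")
    obj: Any = importlib.import_module(module_name)
    for part in attr_path.split("."):  # supports Class.from_cli_args style paths
        obj = getattr(obj, part)
    return obj


def parse_router_args(pairs: List[str]) -> Dict[str, str]:
    """Turn repeated KEY=VALUE strings into kwargs; values stay strings."""
    kwargs: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")  # assumed: split on the first '='
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        kwargs[key] = value
    return kwargs


# Stand-in target from the standard library (a classmethod reached via a dotted path).
factory = resolve_factory("collections:OrderedDict.fromkeys")
kwargs = parse_router_args(["label=sr_knn_smoke", "embedding_device=cpu"])
```

Since every value arrives as a string, a factory such as `from_cli_args` is the natural place to convert numeric or boolean options.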

Resume: without --force-rerun, instances that already have output_dir/results/<instance_id>.json are skipped.
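The resume rule amounts to a file-existence check per instance. A minimal sketch, assuming only the `results/<instance_id>.json` layout stated above; `pending_instances` is a hypothetical helper, and the temporary directory stands in for a real `--output-dir`.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def pending_instances(output_dir: Path,
                      instance_ids: Iterable[str],
                      force_rerun: bool = False) -> List[str]:
    """Return the instance IDs that would still be run."""
    results_dir = Path(output_dir) / "results"
    return [
        iid for iid in instance_ids
        if force_rerun or not (results_dir / f"{iid}.json").exists()
    ]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "results").mkdir()
    # Pretend one instance already finished in an earlier run.
    (out / "results" / "django__django-11133.json").write_text("{}", encoding="utf-8")
    ids = ["django__django-11133", "django__django-11066"]
    todo = pending_instances(out, ids)
    todo_forced = pending_instances(out, ids, force_rerun=True)
```

The same check is handy before a long campaign to estimate how much work a rerun would actually do.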

`run` output layout

Under --output-dir:

results/<instance_id>.json — per-instance outcome (resolved, step_count, errors, etc.).
<instance_id>.trace.jsonl — per-step router and usage trace.
eval_summary.json — run-level aggregate (completed, resolved_count, errors, …).
case_summaries/<instance_id>.summary.json — condensed per-case view.
agent_logs/<instance_id>/agent.log — mini-swe-agent log.

eval_summary.json fields (miniswerouter.harness.run_eval.EvalSummary; the swe harness uses the same keys via swerouter.harness.run_eval): router_label, run_id, started_at, finished_at, dataset_name, dataset_split, pool_fingerprint, pricing_schema_version, ttl_policy_name (may be empty when unused), total_instances, completed, resolved_count, resolved_rate (resolved_count / completed, same resolved predicate as SWE-bench), total_router_cost_usd (sum of realized routed API spend only; no failure penalty), per_instance_paths, errors.
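A quick sanity check on a finished run is to reload eval_summary.json and confirm that `resolved_rate` matches `resolved_count / completed`. The sketch below uses only keys named above; `check_eval_summary` is a hypothetical helper, the numbers are toy values, and treating `completed == 0` as "no expected rate" is an assumption.

```python
import json
import math
import tempfile
from pathlib import Path
from typing import Dict


def check_eval_summary(path: Path) -> Dict:
    """Load a summary and compare resolved_rate with resolved_count / completed."""
    summary = json.loads(Path(path).read_text(encoding="utf-8"))
    completed = summary["completed"]
    expected = summary["resolved_count"] / completed if completed else None
    consistent = expected is not None and math.isclose(
        summary["resolved_rate"], expected, rel_tol=1e-9, abs_tol=1e-12
    )
    return {
        "router_label": summary["router_label"],
        "expected_resolved_rate": expected,
        "rate_consistent": consistent,
        "total_router_cost_usd": summary["total_router_cost_usd"],
    }


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_summary.json"
    # Toy values for illustration only.
    path.write_text(json.dumps({
        "router_label": "demo", "completed": 4, "resolved_count": 1,
        "resolved_rate": 0.25, "total_router_cost_usd": 1.2,
    }), encoding="utf-8")
    report = check_eval_summary(path)
```

Note that `total_router_cost_usd` in this file carries no failure penalty, so it is not the leaderboard bill.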

Long stretches without new console output are normal (Docker pull, repository setup, multi-step LLM calls).

Example — gold-tier oracle (paths under TwinRouterBench)

From the checkout parent directory, after installing [dynamic]:

twinrouterbench dynamic run \
  --router-import swerouter.routers.gold_tier:GoldTierRouter.from_cli_args \
  --router-arg question_bank_path=TwinRouterBench/data/static/question_bank.jsonl \
  --router-arg tier_to_model_path=TwinRouterBench/data/dynamic/tier_to_model.json \
  --router-arg allowed_instance_ids=django__django-11133 \
  --router-arg label=gold_tier_oracle \
  --router-label gold_tier_oracle \
  --output-dir runs/mini_gt_one \
  --instances django__django-11133 \
  --max-steps 250 --budget-usd 3 --run-id mini_gt_one --force-rerun

Adjust question_bank_path if your bank lives elsewhere.

Example — Semantic Router SR KNN router

Requires semantic-router/ at the checkout parent directory and CPU/GPU for embeddings (embedding_device).

twinrouterbench dynamic run \
  --router-import swerouter.routers.sr_knn_adapter:SemanticRouterKNNRouter.from_cli_args \
  --router-arg knn_json_path=semantic-router/src/training/model_selection/ml_model_selection/.cache/ml-models/knn_model.json \
  --router-arg mapping_path=TwinRouterBench/data/dynamic/sr_knn_to_pool.json \
  --router-arg sr_repo_root=semantic-router \
  --router-arg embedding_model=Qwen/Qwen3-Embedding-0.6B \
  --router-arg embedding_device=cpu \
  --router-arg label=sr_knn_smoke \
  --router-arg category=other \
  --router-label sr_knn_smoke \
  --output-dir runs/sr_knn_smoke \
  --instances django__django-11066 django__django-13410 \
  --workers 2 \
  --max-steps 40 \
  --budget-usd 5.0 \
  --run-id sr_knn_smoke \
  --force-rerun

`--router-arg`	Role
`knn_json_path`	Pretrained `knn_model.json` (e.g. under semantic-router `.cache/ml-models/`).
`mapping_path`	`sr_knn_to_pool.json` — maps KNN label strings to `model_id`s in the locked pool.
`sr_repo_root`	Root of semantic-router checkout (for `KNNModel` loader code path).
`embedding_model`	SentenceTransformers model id (training default: Qwen3 embedding).
`embedding_device`	`cpu`, `cuda`, or `mps`.
`category`	VSR one-hot bucket passed into the feature vector (default `other` for smoke).

`score`, `audit-*`, `render`

twinrouterbench dynamic score --run-dir runs/your_run --router-label your_label
twinrouterbench dynamic audit-infra --run-dir runs/your_run
twinrouterbench dynamic audit-trace-cost --run-dir runs/your_run
twinrouterbench dynamic render --score runs/a/score.json runs/b/score.json --out leaderboard.md

score writes score.json (or --out) using the same scorer as the editor harness.

score.json fields (from swerouter.leaderboard.score.score_run_dir):

Field	Meaning
`router_label` / `run_dir`	Router name and scored directory path.
`pool_fingerprint` / `pricing_schema_version` / `pricing_fingerprint`	Locked pool + pricing identity used when repricing traces.
`high_baseline_model_id`	Tier-high / Opus pool id (benchmark metadata).
`failure_penalty_usd`	Fixed add-on per unresolved instance (default `0.60` USD, the per-case price of a hypothetical perfect solver).
`total_leaderboard_bill_usd`	Leaderboard sort key (lower is better): Σ `instance_bill_usd` = router cost + penalty cost on unresolved.
`total_router_cost_usd`	Σ realized router cost only (no penalty).
`total_penalty_cost_usd`	Σ `penalty_usd` (unresolved instances only).
`resolved_count` / `resolved_rate` / `instance_count`	Resolution stats (denominator respects `exclude_infra_failures` when set).
`avg_steps` / `avg_cost_per_resolved_usd`	Mean steps over counted instances; `total_leaderboard_bill_usd / resolved_count` (or `inf` if none resolved).
`per_instance`	List of rows: `instance_id`, `resolved`, `step_count`, `router_actual_cost_usd`, `penalty_usd`, `instance_bill_usd`, plus `model_distribution`, errors, `excluded_from_metrics`, etc.
`exclude_infra_failures` / `raw_instance_count` / `infra_excluded_count`	Present when infra failures are excluded from aggregates.
`reprice_from_raw_usage`	Present when costs were recomputed from trace usage + current pricing tables.

Older score.json files may still use the deprecated key total_actual_bill_usd (same role as total_leaderboard_bill_usd); twinrouterbench dynamic render accepts both.
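The billing fields relate to each other as in the sketch below, which rebuilds the totals from `per_instance`-style rows and reads the sort key with a fallback to the deprecated name. It is an illustration, not swerouter.leaderboard.score.score_run_dir: the helper names are hypothetical, the rows are toy data, and infra exclusions are ignored.

```python
import math
from typing import Dict, List


def bill_sketch(rows: List[Dict], failure_penalty_usd: float = 0.60) -> Dict[str, float]:
    """Sum router cost plus a fixed penalty for each unresolved instance."""
    total_router = sum(r["router_actual_cost_usd"] for r in rows)
    total_penalty = sum(0.0 if r["resolved"] else failure_penalty_usd for r in rows)
    resolved = sum(1 for r in rows if r["resolved"])
    total_bill = total_router + total_penalty
    return {
        "total_router_cost_usd": total_router,
        "total_penalty_cost_usd": total_penalty,
        "total_leaderboard_bill_usd": total_bill,
        "resolved_count": resolved,
        "avg_cost_per_resolved_usd": total_bill / resolved if resolved else math.inf,
    }


def leaderboard_bill(score: Dict) -> float:
    """Read the sort key, accepting the deprecated total_actual_bill_usd."""
    if "total_leaderboard_bill_usd" in score:
        return score["total_leaderboard_bill_usd"]
    return score["total_actual_bill_usd"]


rows = [
    {"instance_id": "a", "resolved": True, "router_actual_cost_usd": 0.25},
    {"instance_id": "b", "resolved": False, "router_actual_cost_usd": 0.50},
]
totals = bill_sketch(rows)
legacy_value = leaderboard_bill({"total_actual_bill_usd": 1.35})
```

The unresolved instance still pays for the tokens it spent and then adds the penalty, so a cheap router that rarely resolves anything does not rank well.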

Editor scaffold (`twinrouterbench swe …`)

Forwards to swerouter.cli (full SWE-bench harness in the editor-oriented layout). Requires the same [dynamic] extra, Docker, and credentials.

Shell helpers

Under TwinRouterBench/scripts/examples/:

env.inc.sh — source TwinRouterBench/.env and apply the same gateway aliases as Python.
example_router_a.sh / example_router_b.sh — wrapped smoke patterns.
resume_until_n.sh — loop python -m miniswerouter.cli run until results/ contains TARGET_N JSON files (for long campaigns).

Troubleshooting

Symptom	Things to check
HTTP 401 on chat	`OPENROUTER_BASE_URL` must match the key type (OpenRouter vs CommonStack). Remove or fix `COMMONSTACK_*` if you intend to use raw OpenRouter keys.
`missing required connection settings` on `run`	Set `SWEROUTER_API_KEY` / `OPENROUTER_API_KEY` (after `.env`) or pass `--api-key` / `--base-url`.
Dynamic import errors for `main.*`	Install editable from `TwinRouterBench/` or set `PYTHONPATH` to the `TwinRouterBench` root.
SR KNN `FileNotFoundError` for knn JSON	Ensure paths exist; using the checkout parent directory as `cwd` is simplest.
Very slow first SR KNN step	Embedding model download + CPU encoding; use `embedding_device=cuda` when available.
“No output” for many minutes	Docker image pull + SWE environment + agent steps; watch `agent_logs/` or `docker ps`.
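For the HTTP 401 row, a small pre-flight check can catch the most common mismatch before any request is sent. This is a heuristic sketch, not part of the package: `gateway_warnings` is a hypothetical helper, it relies only on the `sk-or-v1-` key prefix and the `/v1` suffix mentioned in Configuration, and `https://gateway.example/v1` is a placeholder URL.

```python
from typing import List
from urllib.parse import urlparse


def gateway_warnings(base_url: str, api_key: str) -> List[str]:
    """Return human-readable warnings for an inconsistent base URL / key pair."""
    warnings: List[str] = []
    if not base_url.endswith("/v1"):
        warnings.append("base URL should end with /v1")
    host = urlparse(base_url).hostname or ""
    openrouter_host = host == "openrouter.ai" or host.endswith(".openrouter.ai")
    openrouter_key = api_key.startswith("sk-or-v1-")
    if openrouter_host != openrouter_key:
        # One side looks like OpenRouter and the other does not.
        warnings.append("base URL and API key appear to belong to different gateways")
    return warnings


ok = gateway_warnings("https://openrouter.ai/api/v1", "sk-or-v1-abc")
mismatch = gateway_warnings("https://gateway.example/v1", "sk-or-v1-abc")
bad_suffix = gateway_warnings("https://openrouter.ai/api", "sk-or-v1-abc")
```

Run it against the effective values after `.env` loading and alias application, since `COMMONSTACK_*` may have replaced what you set by hand.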

Repository layout (`TwinRouterBench/`)

Path	Purpose
`main/`	Static-track package (`main.cli`, tokenizer, pricing, eval).
`miniswerouter/`	Dynamic track on mini-swe-agent.
`swerouter/`	Router protocol, pricing, cache simulation, harness, and leaderboard.
`data/static/`	Static track JSONL: `question_bank.jsonl`, `manifest.json`.
`data/dynamic/`	Dynamic track locked JSON: `model_pool.json`, `model_pricing.json`, `ttl_policy.json`, `tier_to_model.json`, `sr_knn_to_pool.json`, …
`twinrouterbench/`	Meta-CLI dispatcher.
`.env.example`	Template for gateway credentials.

Citation

If you use TwinRouterBench in research, cite the TwinRouterBench paper: arXiv:2605.18859.

@misc{yang2026twinrouterbench,
  title={TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing},
  author={Pei Yang and Wanyi Chen and Tongyun Yang and Pengbin Feng and Jiarong Xing and Wentao Guo and Yuhang Yao and Yuhang Han and Hanchen Li and Xu Wang and Zeyu Wang and Jie Xiao and Anjie Yang and Liang Tian and Lynn Ai and Eric Yang and Tianyu Shi},
  year={2026},
  eprint={2605.18859},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.18859}
}

Implementation note (CLI forwarding)

twinrouterbench static|dynamic|swe dispatches in-process to the existing CLIs. For debugging, you may still invoke python -m miniswerouter.cli or python -m swerouter.cli with PYTHONPATH set to TwinRouterBench/.
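In-process forwarding of this kind can be sketched as a dispatcher that strips the track name and hands the remaining arguments to the matching CLI. The code below is an illustration of the pattern, not the contents of twinrouterbench/: `make_dispatcher` and `fake_cli` are hypothetical, and the stub handlers only record what they receive instead of calling the real `main.cli`, `miniswerouter.cli`, or `swerouter.cli`.

```python
import sys
from typing import Callable, Dict, List, Optional, Tuple

Handler = Callable[[List[str]], int]


def make_dispatcher(handlers: Dict[str, Handler]) -> Callable[[Optional[List[str]]], int]:
    """Build a main(argv) that forwards argv[1:] to the handler named by argv[0]."""
    def dispatch(argv: Optional[List[str]] = None) -> int:
        args = list(sys.argv[1:] if argv is None else argv)
        if not args or args[0] not in handlers:
            tracks = ",".join(sorted(handlers))
            print(f"usage: twinrouterbench {{{tracks}}} <subcommand> [args]", file=sys.stderr)
            return 2
        return handlers[args[0]](args[1:])  # same process, no subprocess
    return dispatch


calls: List[Tuple[str, List[str]]] = []


def fake_cli(name: str) -> Handler:
    """Stub standing in for an existing CLI entrypoint."""
    def run(argv: List[str]) -> int:
        calls.append((name, argv))
        return 0
    return run


dispatch = make_dispatcher({
    "static": fake_cli("main.cli"),
    "dynamic": fake_cli("miniswerouter.cli"),
    "swe": fake_cli("swerouter.cli"),
})
exit_code = dispatch(["dynamic", "run", "--limit", "1"])
usage_code = dispatch([])
```

Dispatching in-process keeps environment changes (such as the `.env` load) and exit codes in one interpreter, which is also why the `python -m` form behaves the same for debugging.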
