Benchmark of local LLMs (Qwen 3.6, Gemma 4) vs ChatGPT and Gemini on coding tasks. Apple M5, 32GB. Methodology, raw outputs, judge prompts, scores, and charts.
Updated May 25, 2026 - Jupyter Notebook
AI Benchmark knowledge base: a comprehensive collection of the benchmark question sets that major AI companies use to test model performance.
Saotri Bench: a coding benchmark for evaluating LLM agents on multi-phase programming tasks with hidden requirements.
🔍 Evaluate LLM agents on multi-phase programming tasks with FluxCodeBench, focusing on hidden requirements, long-context retention, and iterative refinement.
Raw logs of Claude Code running on local Qwen3.5-27B (llama.cpp). Builds a Python todo app with 50 tests. Real-world performance data: 30 min, cache thrashing, 38 t/s generation.
Open benchmark harness for the latest major AI models on general reasoning, coding, tool use, and long-context tasks.