[DEPRECATED] Moved to ROCm/rocm-systems repo
Updated May 28, 2026 - Python
[DEPRECATED] Moved to ROCm/rocm-systems repo
Online CUDA Occupancy Calculator
Agents and an RL environment for optimizing GPU kernels on AMD ROCm with LLMs. Benchmarks LLM serving workloads end-to-end, profiles bottleneck kernels, optimizes them via Claude Code or Codex, and scores the results on compilation, correctness, and speedup.
(Spring 2017) Assignment 2: GPU Executor
Runs a single CUDA/OpenCL kernel, taking its source from a file and its arguments from the command line
GPU Drano: static analysis for GPU programs.
The official implementation of the paper "AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation"
AgentKernelArena provides an end-to-end siloed-benchmarking environment where different LLM-powered agents—such as Cursor Agent, Claude Code, Codex, SWE-agent, and GEAK—can be evaluated side-by-side on the same GPU kernel tasks, using objective and reproducible metrics.
Repo containing artifacts for the NeurIPS 2025 tutorial: How to Build Agents to Generate Kernels for Faster LLMs (and Other Models!)
Open-source skill library for AI coding agents to write, optimize, and debug high-performance compute kernels across CUDA, Triton, and quantized workloads.
Prototype for a SPIR-V assembler and disassembler. It provides a composable Java interface for generating SPIR-V code at runtime.
From-scratch reimplementation of DeepSeek's Native Sparse Attention (arXiv:2502.11089) in Triton + CUDA Hopper WGMMA. 7.4x faster than FlashAttention-3 at 64k context. Five-model training fleet, perplexity sweep, LongBench v2, MoBA comparison.
AMD CDNA/RDNA (MI300 gfx942 / MI350 gfx950 / RDNA4 gfx1201) GPU kernel optimization knowledge base, packaged as a Claude Code skill. 7,400+ merged-PR references + 53 ISA-grounded synthesis pages. Inspired by MIT Han Lab's KernelWiki.
Fused Triton kernels for Transformer inference: RMSNorm+RoPE, Gated MLP, FP8 GEMM — CPU-testable references, autotuning, and benchmarking
Real-time NVIDIA GPU command capture, decoding, and visualization
A self-hosted low-level functional-style programming language 🌀
Noeris — autonomous kernel fusion discovery + Triton autotuning for LLM kernels and Gemma layer deeper fusion (A100/H100 wins).
16-step CUDA optimization of FlashAttention-2 achieving 99.2% of official performance on A100 — Ampere architecture
METAL-SCI: a scientific compute benchmark for evolutionary LLM kernel search on Apple Silicon Metal