gpu-kernels

Here are 39 public repositories matching this topic...

ROCm / rocprofiler-compute

[DEPRECATED] Moved to ROCm/rocm-systems repo

linux hpc performance-analysis profiling hardware-counters gpu-kernels

Updated May 28, 2026
Python

xmartlabs / cuda-calculator

Online CUDA Occupancy Calculator

kernel gpu cuda nvidia gpgpu gpu-computing occupancy gpu-kernels gpu-programming

Updated Oct 12, 2021
CoffeeScript

Agents and an RL environment for optimizing GPU kernels on AMD ROCm using LLM agents. Benchmarks LLM serving workloads end-to-end, profiles bottleneck kernels, optimizes them via Claude Code or Codex, and scores on compilation, correctness, and speedup.

llamas gpu-kernels optimization-pipeline

Updated May 29, 2026
Python

dlsys-course / assignment2-2017

(Spring 2017) Assignment 2: GPU Executor

deep-learning computation-graph neural-nets gpu-kernels gpu-executor

Updated Apr 27, 2017
Python

eyalroz / gpu-kernel-runner

Runs a single CUDA/OpenCL kernel, taking its source from a file and arguments from the command-line

gpu opencl cuda runner gpgpu multi-language performance-analysis profiling performance-testing debugging-tool gpu-kernels gpu-kernel-performance

Updated May 17, 2026
C++

upenn-acg / gpudrano-static-analysis_v1.0

GPU Drano Static Analysis for GPU programs.

llvm static-analysis gpu-kernels

Updated Nov 16, 2018
C++

StigLidu / AdaExplore

The official implementation for the paper "AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation"

gpu-kernels llm-agent llm-inference self-improving-agent inference-time-scaling

Updated Apr 28, 2026
Python

AMD-AGI / AgentKernelArena

AgentKernelArena provides an end-to-end siloed-benchmarking environment where different LLM-powered agents—such as Cursor Agent, Claude Code, Codex, SWE-agent, and GEAK—can be evaluated side-by-side on the same GPU kernel tasks, using objective and reproducible metrics.

llamas gpu-kernels agent-evaluation

Updated Jun 1, 2026
Python

AMD-AGI / Neurips2025-GPU-kernels-Tutorial

Repo containing artifacts for the NeurIPS 2025 tutorial: How to Build Agents to Generate Kernels for Faster LLMs (and Other Models!)

llamas gpu-kernels kernel-optimization

Updated May 11, 2026
Jupyter Notebook

KrxGu / kernel-skills

Open source skill library for AI coding agents to write, optimize, and debug high performance compute kernels across CUDA, Triton, and quantized workloads.

cuda high-performance-computing triton quantization rocm gpu-kernels prompt-engineering llm-agents ai-coding kernel-optimization

Updated May 7, 2026
TypeScript

beehive-lab / beehive-spirv-toolkit

Prototype for a SPIR-V assembler and disassembler. It provides a composable Java interface for generating SPIR-V code at runtime.

java library code-generator assembler spirv gpu-kernels dissassembler

Updated Oct 31, 2025
Java

qflen / nsa-from-scratch

From-scratch reimplementation of DeepSeek's Native Sparse Attention (arXiv:2502.11089) in Triton + CUDA Hopper WGMMA. 7.4x faster than FlashAttention-3 at 64k context. Five-model training fleet, perplexity sweep, LongBench v2, MoBA comparison.

cuda pytorch transformer triton hopper attention-mechanism nsa gpu-kernels sparse-attention llm long-context flash-attention deepseek native-sparse-attention wgmma

Updated May 12, 2026
Python

jhinpan / ROCmKernelWiki

AMD CDNA/RDNA (MI300 gfx942 / MI350 gfx950 / RDNA4 gfx1201) GPU kernel optimization knowledge base, packaged as a Claude Code skill. 7,400+ merged-PR references + 53 ISA-grounded synthesis pages. Inspired by MIT Han Lab's KernelWiki.

hip gemm rocm gpu-kernels cdna amd-gpu mi300 flash-attention kernel-optimization mi350