2 changes: 2 additions & 0 deletions .gitignore
@@ -11,3 +11,5 @@ cache/*
!results/timing/
.env
_build_cache/
uv.lock
CLAUDE.md
4 changes: 3 additions & 1 deletion EVAL.md
@@ -14,10 +14,12 @@ If the model can reward hack, it will find ways to reward hack! This can especia
Check out resources here:
- KernelBench [v0.1 Release](https://scalingintelligence.stanford.edu/blogs/kernelbenchv01/)
- Cognition and Stanford's [Kevin](https://arxiv.org/abs/2507.11948) project on various hacking behaviors observed in RL training
- Jiwei Li's awesome [blogpost](https://deep-reinforce.com/defense_kernel_hack.html) on Hacks and Defenses in Automatic GPU Kernel Generation
- Jiwei Li's awesome [blogpost 1](https://deep-reinforce.com/defense_kernel_hack.html) and [blogpost 2](https://deep-reinforce.com/correctness_check.html) on Hacks and Defenses in Automatic GPU Kernel Generation

Our ongoing blogpost and PRs try to systematize and list these behaviors, and to provide tests, detection, and mitigation tooling.

**Disclaimer**: KernelBench is an open-source evaluation framework. Due to limited bandwidth, the KernelBench team does not inspect, validate, or endorse any third-party kernels or reported results. Users are welcome to use the software infrastructure for evaluation, but should independently verify all results.


## Methodology
More on that coming.
45 changes: 28 additions & 17 deletions README.md
@@ -37,7 +37,7 @@ To evaluate model-generated kernels, we need to check if they:
- **is correct ✅**: check against reference torch operators `n_correctness` times on randomized inputs.
- **is performant ⏱️**: compare against reference torch operators `n_trial` times to measure speedup between runtimes.

Check out `src/eval.py` for details on how we implement correctness check and timing.
Check out `src/eval.py` for details on how we implement the correctness check and timing, and `EVAL.md` for notes on evaluation and benchmarking guidelines [WIP].
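
As a rough illustration of the correctness check idea (a minimal sketch only; the real logic, tolerances, and input generation live in `src/eval.py`, and the function and argument names here are illustrative):

```python
# Minimal sketch of the correctness check idea; see src/eval.py for the real implementation.
import torch

def check_correctness(ref_model, custom_model, get_inputs, n_correctness=5, atol=1e-2, rtol=1e-2):
    """Compare the custom kernel against the reference on randomized inputs."""
    for _ in range(n_correctness):
        inputs = get_inputs()  # fresh randomized inputs for each trial
        with torch.no_grad():
            ref_out = ref_model(*inputs)
            test_out = custom_model(*inputs)
        if not torch.allclose(ref_out, test_out, atol=atol, rtol=rtol):
            return False  # mismatch on this trial
    return True
```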

We provide a convenient script `scripts/run_and_check.py` to evaluate a single sample source file against a reference source file, check correctness, and compute speedup. You can use it to evaluate a kernel either locally or remotely by setting `eval_mode=local` or `eval_mode=modal`.
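
For example, a hypothetical local invocation (the path argument names below are assumptions; check the script's config fields for the exact names):

```bash
# Illustrative only: the kernel/reference path field names may differ in the actual script
uv run python scripts/run_and_check.py eval_mode=local kernel_src_path=path/to/custom_kernel.py ref_src_path=path/to/reference_problem.py
```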

@@ -76,28 +76,37 @@ KernelBench/
```

## 🔧 Set up
```
conda create --name kernel-bench python=3.10
conda activate kernel-bench
pip install -r requirements.txt
pip install -e .

We have transitioned to using `pyproject.toml` and `uv` for dependency management. Install [uv](https://docs.astral.sh/uv/getting-started/installation/) if you haven't already.

```bash
# Install base dependencies (works without a local GPU)
uv sync

# Install with GPU dependencies (for local GPU evaluation)
uv sync --extra gpu

# Run commands with uv (which invokes the right environment)
uv run python scripts/<script_name>.py ...
```

You can still use conda (Python 3.10) to create your environment and install dependencies from `requirements.txt`, as sketched below.
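
A minimal sketch of that fallback route, mirroring the previous setup instructions:

```bash
# Fallback setup without uv, using conda + pip
conda create --name kernel-bench python=3.10
conda activate kernel-bench
pip install -r requirements.txt
```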

We use `litellm` for API calls. Please set your keys by creating a `.env` file following our `.env.example`.
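
For example, a `.env` might look like the sketch below (the variable names follow standard `litellm` provider conventions; treat `.env.example` as the source of truth and include only the providers you use):

```bash
# Illustrative .env; follow .env.example for the exact variable names your setup expects
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
DEEPSEEK_API_KEY=...
```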

Running and profiling kernels require a GPU.
If you don't have GPU available locally, you can set up [Modal](https://modal.com/). Set up your modal token after creating an account by running `modal token new`. Then, use the `generate_and_eval_single_sample_modal.py` script.
Running and profiling kernels require a GPU.
If you don't have a GPU available locally, you can set up [Modal](https://modal.com/) for cloud serverless GPU evaluation. Set up your modal token after creating an account by running `modal token new`. Then, use the `generate_and_eval_single_sample_modal.py` script.
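
For instance, a hypothetical end-to-end Modal flow (the key=value fields mirror the local single-sample example and may differ for the Modal script):

```bash
# Illustrative Modal flow; authenticate once, then run the Modal variant of the script
modal token new
uv run python scripts/generate_and_eval_single_sample_modal.py dataset_src=huggingface level=2 problem_id=40 server_type=google model_name=gemini/gemini-2.5-flash
```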

You can also try out our [tutorial notebook](https://bit.ly/kernelbench-neurips-colab) (also at `notebooks/tutorial.ipynb`) on Google Colab.

## 🚀 Usage
### Run on a single problem
It is easiest to get started with a single problem. This will fetch the problem, generate a sample, and evaluate it.

```
# for example, run level 2 problem 40 from huggingface
```bash
# for example, run level 2 problem 40 from huggingface and use google gemini 2.5 flash for generation

python3 scripts/generate_and_eval_single_sample.py dataset_src="huggingface" level=2 problem_id=40
uv run python scripts/generate_and_eval_single_sample.py dataset_src=huggingface level=2 problem_id=40 server_type=google model_name=gemini/gemini-2.5-flash

# dataset_src could be "local" or "huggingface"
# add .verbose_logging for more visibility
@@ -112,21 +121,21 @@ Check the config fields for comprehensive set of options. Note we provide the mo

### Run on all problems

```
```bash
# 1. Generate responses and store kernels locally to runs/{run_name} directory
python3 scripts/generate_samples.py run_name=test_hf_level_1 dataset_src=huggingface level=1 num_workers=50 server_type=deepseek model_name=deepseek-chat temperature=0
uv run python scripts/generate_samples.py run_name=test_hf_level_1 dataset_src=huggingface level=1 num_workers=50 server_type=deepseek model_name=deepseek-chat temperature=0

# 2. Evaluate on all generated kernels in runs/{run_name} directory
python3 scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=local level=1 num_gpu_devices=8 timeout=300
uv run python scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=local level=1 num_gpu_devices=8 timeout=300

# If you like to speedup evaluation, you can use parallelize compilation on CPUs before getting to evluation on GPUs
# If you would like to speed up evaluation, you can parallelize compilation on CPUs before evaluating on GPUs
# add build_cache=True and num_cpu_workers=<num_cpu_workers> to the command
```
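
A sketch of that pre-compilation variant (the `num_cpu_workers` value here is illustrative):

```bash
# Same evaluation command as above, with CPU-side pre-compilation enabled
# num_cpu_workers=32 is only an example; tune it to your machine
uv run python scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=local level=1 num_gpu_devices=8 timeout=300 build_cache=True num_cpu_workers=32
```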
### Analyze the eval results to compute Benchmark Performance
We provide `scripts/benchmark_eval_analysis.py` to analyze the eval results and compute the success rate, timing metrics, and the overall benchmark performance metric `fast_p`.

```
python3 scripts/benchmark_eval_analysis.py run_name=test_hf_level_1 level=1 hardware=L40S_matx3 baseline=baseline_time_torch
```bash
uv run python scripts/benchmark_eval_analysis.py run_name=test_hf_level_1 level=1 hardware=L40S_matx3 baseline=baseline_time_torch
```
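
Roughly, `fast_p` is the fraction of problems whose generated kernel is both correct and faster than the baseline by at least a factor `p`. A minimal sketch (the authoritative computation is in the analysis script, and the field names below are illustrative):

```python
# Minimal sketch of fast_p; see scripts/benchmark_eval_analysis.py for the real computation
def fast_p(results, p=1.0):
    """results: list of dicts with 'correct' (bool) and 'speedup' (float),
    where speedup = baseline_time / kernel_time (illustrative field names)."""
    if not results:
        return 0.0
    wins = sum(1 for r in results if r["correct"] and r["speedup"] > p)
    return wins / len(results)
```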
If you are using different hardware, you can generate baseline times with the `scripts/generate_baseline_time.py` script.
We provide reference baseline times for a variety of NVIDIA GPUs across generations in `results/timing`, but we recommend generating your own baseline times for more accurate results (cluster power, software versions, and other factors all affect timing). See `results/timing/README.md` for more details.
@@ -140,6 +149,8 @@ Check out our [roadmap](https://github.com/ScalingIntelligence/KernelBench/issue
## 🔍 Known Usage
Since release, we have received a lot of interest from researchers, research labs, and companies that use KernelBench to explore this direction. We have documented [known usage](https://docs.google.com/document/d/e/2PACX-1vTjS-UMH1HB5n_PENq2k-3YRfXIXkqKIKeNC2zcWMyLPdl4Jrwvdk4dNDVSsM8ybKrCxZB7GJq1slZF/pub) of KernelBench and related efforts towards automated kernel generation. If you are using KernelBench, we would love to hear more about it!

Disclaimer: KernelBench is designed as an open-source evaluation framework and toolkit. The KernelBench team does not review, validate, or endorse individual kernels or reported results. Users are responsible for independently verifying any results obtained using the framework. Please check out `EVAL.md` for more guidance on benchmarking and evaluating kernels.

## 🪪 License
MIT. Check `LICENSE.md` for more details.

56 changes: 56 additions & 0 deletions pyproject.toml
@@ -0,0 +1,56 @@
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

# this should be our single source of truth for versioning

[project]
# TODO: rename to kb_src later and update the codebase
# for now, let's get all the imports right first
name = "src"
version = "0.0.1"
requires-python = "==3.13.*"
dependencies = [
# Frameworks
"torch==2.9.1",

"transformers>=4.57.3",
"datasets>=4.4.2",
"modal>=1.3.0",

# helper
"tqdm>=4.67.1",
"packaging",
"pydra-config",
"ninja>=1.13.0",
"tomli>=2.3.0",
"tabulate>=0.9.0",

# Numerics
"einops>=0.8.1",
"python-dotenv>=1.2.1",
"numpy==2.4.0",

# LLM providers
"openai>=2.14.0",
"litellm[proxy]>=1.80.10",
]

[project.optional-dependencies]
gpu = [
# GPU-specific dependencies (requires CUDA)
"triton==3.5.1",
"nvidia-cutlass-dsl",
"tilelang",
"cupy-cuda12x==13.6.0",
]
dev = [
"pytest==9.0.2",
"ruff==0.14.10",
]


[tool.setuptools.packages.find]
where = ["."]
include = ["src*"]
# TODO: change to kb_src later
35 changes: 19 additions & 16 deletions requirements.txt
@@ -1,31 +1,34 @@
# ARCHIVED: We are transitioning to pyproject.toml and uv-based project management
# However, we provide this as a backup for now

# Frameworks
# we use latest PyTorch stable release
torch==2.9.0
torch==2.9.*
triton==3.5.*

# we shall upgrade torch for blackwell when it is stable
transformers
datasets
modal
transformers>=4.57.3
datasets>=4.4.2
modal>=1.3.0

# DSLs
nvidia-cutlass-dsl
tilelang
triton

# helper
tqdm
tqdm>=4.67.1
packaging
pydra_config
pytest
ninja
cupy-cuda12x
pydra-config
ninja>=1.13.0
cupy-cuda12x==13.6.0
tomli>=2.3.0
tabulate>=0.9.0

# Numerics
einops
dotenv
numpy
einops>=0.8.1
python-dotenv>=1.2.1
numpy==2.4.0

# use litellm for cloud providers and openai for local
openai
litellm[proxy]

openai>=2.14.0
litellm[proxy]>=1.80.10
2 changes: 1 addition & 1 deletion scripts/eval_from_generations.py
@@ -67,7 +67,7 @@
"g++-10",
"clang"
)
.pip_install_from_requirements(os.path.join(REPO_TOP_DIR, "requirements.txt"))
.uv_sync(uv_project_dir=REPO_TOP_DIR)
.add_local_dir(
KERNEL_BENCH_PATH,
remote_path="/root/KernelBench"
8 changes: 5 additions & 3 deletions scripts/generate_and_eval_single_sample.py
@@ -46,15 +46,17 @@ def __init__(self):

# Evaluation
# local (requires a GPU), modal (cloud GPU) coming soon
self.eval_mode = "local"
self.eval_mode = "local"
# only support local for now
# see scripts/eval_from_generations_modal.py for modal evaluation
# Construct this from mapping from architecture name to torch cuda arch list in the future
# you can either specify SM version or just use the name
self.gpu_arch = ["Ada"]
self.precision = "fp32" # options ["fp32", "fp16", "bf16"]

# Inference config
self.server_type = None
self.model_name = None
self.server_type = REQUIRED
self.model_name = REQUIRED
self.max_tokens = None
self.temperature = None

6 changes: 4 additions & 2 deletions scripts/generate_and_eval_single_sample_modal.py
@@ -14,7 +14,6 @@
from datasets import load_dataset

#from src.dataset import construct_kernelbench_dataset
from src.prompt_constructor_toml import get_prompt_for_backend, get_custom_prompt
from src.utils import extract_first_code, query_server, set_gpu_arch, read_file, create_inference_server_from_presets

app = modal.App("eval_single_sample")
@@ -102,7 +101,7 @@ def __repr__(self):
"g++-10",
"clang" # note i skip a step
)
.pip_install_from_requirements(os.path.join(REPO_TOP_DIR, "requirements.txt"))
.uv_sync(uv_project_dir=REPO_TOP_DIR, extras=["gpu"])
.add_local_python_source("src")
)

@@ -238,6 +237,9 @@ def main(config: EvalConfig):
"include_hardware_info is True but hardware_gpu_name is not provided."
)

# Lazy import prompt constructor
from src.prompt_constructor_toml import get_prompt_for_backend, get_custom_prompt

if custom_prompt_key:
custom_prompt = get_custom_prompt(
custom_prompt_key,
2 changes: 1 addition & 1 deletion scripts/generate_baseline_time_modal.py
@@ -91,7 +91,7 @@ def __init__(self):
"g++-10",
"clang" # note i skip a step
)
.pip_install_from_requirements(os.path.join(REPO_TOP_PATH, "requirements.txt"))
.uv_sync(uv_project_dir=REPO_TOP_PATH, extras=["gpu"])
.add_local_dir(
KERNEL_BENCH_PATH,
remote_path="/root/KernelBench"
2 changes: 1 addition & 1 deletion scripts/run_and_check.py
@@ -35,7 +35,7 @@
image = (
modal.Image.from_registry(f"nvidia/cuda:{tag}", add_python="3.10")
.apt_install("git", "gcc-10", "g++-10", "clang")
.pip_install_from_requirements(os.path.join(REPO_TOP_PATH, "requirements.txt"))
.uv_sync(uv_project_dir=REPO_TOP_PATH)
.add_local_dir(KERNEL_BENCH_PATH, remote_path="/root/KernelBench")
.add_local_python_source("src")
.add_local_python_source("scripts")
8 changes: 0 additions & 8 deletions setup.py

This file was deleted.

1 change: 0 additions & 1 deletion src/frameworks.py
@@ -32,7 +32,6 @@
import shutil
import concurrent
from functools import cache
from transformers import AutoTokenizer
import hashlib

from concurrent.futures import ProcessPoolExecutor, as_completed
4 changes: 1 addition & 3 deletions src/utils.py
@@ -25,11 +25,9 @@
from contextlib import contextmanager
from collections import defaultdict
import time
import shutil
import concurrent
from functools import cache
from transformers import AutoTokenizer
import hashlib


from concurrent.futures import ProcessPoolExecutor, as_completed
