2 changes: 2 additions & 0 deletions .gitignore
@@ -11,3 +11,5 @@ cache/*
!results/timing/
.env
_build_cache/
uv.lock
CLAUDE.md
4 changes: 3 additions & 1 deletion EVAL.md
@@ -14,10 +14,12 @@ If the model can reward hack, it will find ways to reward hack! This can especia
Check out resources here:
- KernelBench [v0.1 Release](https://scalingintelligence.stanford.edu/blogs/kernelbenchv01/)
- Cognition and Stanford's [Kevin](https://arxiv.org/abs/2507.11948) project on various hacking behaviors observed in RL training
- Jiwei Li's awesome [blogpost](https://deep-reinforce.com/defense_kernel_hack.html) on Hacks and Defenses in Automatic GPU Kernel Generation
- Jiwei Li's awesome [blogpost 1](https://deep-reinforce.com/defense_kernel_hack.html) and [blogpost 2](https://deep-reinforce.com/correctness_check.html) on Hacks and Defenses in Automatic GPU Kernel Generation

Our ongoing blogpost and PRs try to systematize and list these behaviors, and to provide tests, detection, and mitigation tooling.

**Disclaimer**: KernelBench is an open-source evaluation framework. Due to limited bandwidth, the KernelBench team does not inspect, validate, or endorse any third-party kernels or reported results. Users are welcome to use the software infrastructure for evaluation, but should independently verify all results.


## Methodology
More on that coming.
45 changes: 28 additions & 17 deletions README.md
@@ -37,7 +37,7 @@ To evaluate model-generated kernels, we need to check if they:
- **is correct ✅**: check against reference torch operators `n_correctness` times on randomized inputs.
- **is performant ⏱️**: compare against reference torch operators `n_trial` times to measure speedup between runtimes.

Check out `src/eval.py` for details on how we implement correctness check and timing.
Check out `src/eval.py` for details on how we implement the correctness check and timing, and `EVAL.md` for notes on evaluation and benchmarking guidelines [WIP].
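
As a rough illustration of the correctness check idea (a minimal sketch only; the real logic, tolerances, and input generation live in `src/eval.py`, and the function and argument names here are illustrative):

```python
# Minimal sketch of the correctness check idea; see src/eval.py for the real implementation.
import torch

def check_correctness(ref_model, custom_model, get_inputs, n_correctness=5, atol=1e-2, rtol=1e-2):
    """Compare the custom kernel against the reference on randomized inputs."""
    for _ in range(n_correctness):
        inputs = get_inputs()  # fresh randomized inputs for each trial
        with torch.no_grad():
            ref_out = ref_model(*inputs)
            test_out = custom_model(*inputs)
        if not torch.allclose(ref_out, test_out, atol=atol, rtol=rtol):
            return False  # mismatch on this trial
    return True
```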

We provide a convenient script `scripts/run_and_check.py` to evaluate a single sample source file against a reference source file, check correctness, and compute speedup. You can use it to evaluate a kernel either locally or remotely by setting `eval_mode=local` or `eval_mode=modal`.
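
For example, a hypothetical local invocation (the path argument names below are assumptions; check the script's config fields for the exact names):

```bash
# Illustrative only: the kernel/reference path field names may differ in the actual script
uv run python scripts/run_and_check.py eval_mode=local kernel_src_path=path/to/custom_kernel.py ref_src_path=path/to/reference_problem.py
```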

@@ -76,28 +76,37 @@ KernelBench/
```

## 🔧 Set up
```
conda create --name kernel-bench python=3.10
conda activate kernel-bench
pip install -r requirements.txt
pip install -e .

We have transitioned to using `pyproject.toml` and `uv` for dependency management. Install [uv](https://docs.astral.sh/uv/getting-started/installation/) if you haven't already.

```bash
# Install base dependencies (works without a local GPU)
uv sync

# Install with GPU dependencies (for local GPU evaluation)
uv sync --extra gpu

# Run commands with uv (which invokes the right environment)
uv run python scripts/<script_name>.py ...
```

You can still use conda (Python 3.10) to create your environment and install dependencies from `requirements.txt`, as sketched below.
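
A minimal sketch of that fallback route, mirroring the previous setup instructions:

```bash
# Fallback setup without uv, using conda + pip
conda create --name kernel-bench python=3.10
conda activate kernel-bench
pip install -r requirements.txt
```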

We use `litellm` for API calls. Please set your keys by creating a `.env` file following our `.env.example`.
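
For example, a `.env` might look like the sketch below (the variable names follow standard `litellm` provider conventions; treat `.env.example` as the source of truth and include only the providers you use):

```bash
# Illustrative .env; follow .env.example for the exact variable names your setup expects
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
DEEPSEEK_API_KEY=...
```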

Running and profiling kernels require a GPU.
If you don't have GPU available locally, you can set up [Modal](https://modal.com/). Set up your modal token after creating an account by running `modal token new`. Then, use the `generate_and_eval_single_sample_modal.py` script.
Running and profiling kernels require a GPU.
If you don't have a GPU available locally, you can set up [Modal](https://modal.com/) for cloud serverless GPU evaluation. Set up your modal token after creating an account by running `modal token new`. Then, use the `generate_and_eval_single_sample_modal.py` script.
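
For instance, a hypothetical end-to-end Modal flow (the key=value fields mirror the local single-sample example and may differ for the Modal script):

```bash
# Illustrative Modal flow; authenticate once, then run the Modal variant of the script
modal token new
uv run python scripts/generate_and_eval_single_sample_modal.py dataset_src=huggingface level=2 problem_id=40 server_type=google model_name=gemini/gemini-2.5-flash
```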

You can also try out our [tutorial notebook](https://bit.ly/kernelbench-neurips-colab) (also at `notebooks/tutorial.ipynb`) on Google Colab.

## 🚀 Usage
### Run on a single problem
It is easiest to get started with a single problem. This will fetch the problem, generate a sample, and evaluate it.

```
# for example, run level 2 problem 40 from huggingface
```bash
# for example, run level 2 problem 40 from huggingface and use google gemini 2.5 flash for generation

python3 scripts/generate_and_eval_single_sample.py dataset_src="huggingface" level=2 problem_id=40
uv run python scripts/generate_and_eval_single_sample.py dataset_src=huggingface level=2 problem_id=40 server_type=google model_name=gemini/gemini-2.5-flash

# dataset_src could be "local" or "huggingface"
# add .verbose_logging for more visibility
@@ -112,21 +121,21 @@ Check the config fields for comprehensive set of options. Note we provide the mo

### Run on all problems

```
```bash
# 1. Generate responses and store kernels locally to runs/{run_name} directory
python3 scripts/generate_samples.py run_name=test_hf_level_1 dataset_src=huggingface level=1 num_workers=50 server_type=deepseek model_name=deepseek-chat temperature=0
uv run python scripts/generate_samples.py run_name=test_hf_level_1 dataset_src=huggingface level=1 num_workers=50 server_type=deepseek model_name=deepseek-chat temperature=0

# 2. Evaluate on all generated kernels in runs/{run_name} directory
python3 scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=local level=1 num_gpu_devices=8 timeout=300
uv run python scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=local level=1 num_gpu_devices=8 timeout=300

# If you like to speedup evaluation, you can use parallelize compilation on CPUs before getting to evluation on GPUs
# If you would like to speed up evaluation, you can parallelize compilation on CPUs before evaluating on GPUs
# add build_cache=True and num_cpu_workers=<num_cpu_workers> to the command
```
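
A sketch of that pre-compilation variant (the `num_cpu_workers` value here is illustrative):

```bash
# Same evaluation command as above, with CPU-side pre-compilation enabled
# num_cpu_workers=32 is only an example; tune it to your machine
uv run python scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=local level=1 num_gpu_devices=8 timeout=300 build_cache=True num_cpu_workers=32
```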
### Analyze the eval results to compute Benchmark Performance
We provide `scripts/benchmark_eval_analysis.py` to analyze the eval results and compute the success rate, timing metrics, and the overall benchmark performance metric `fast_p`.

```
python3 scripts/benchmark_eval_analysis.py run_name=test_hf_level_1 level=1 hardware=L40S_matx3 baseline=baseline_time_torch
```bash
uv run python scripts/benchmark_eval_analysis.py run_name=test_hf_level_1 level=1 hardware=L40S_matx3 baseline=baseline_time_torch
```
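
Roughly, `fast_p` is the fraction of problems whose generated kernel is both correct and faster than the baseline by at least a factor `p`. A minimal sketch (the authoritative computation is in the analysis script, and the field names below are illustrative):

```python
# Minimal sketch of fast_p; see scripts/benchmark_eval_analysis.py for the real computation
def fast_p(results, p=1.0):
    """results: list of dicts with 'correct' (bool) and 'speedup' (float),
    where speedup = baseline_time / kernel_time (illustrative field names)."""
    if not results:
        return 0.0
    wins = sum(1 for r in results if r["correct"] and r["speedup"] > p)
    return wins / len(results)
```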
If you are using different hardware, you can generate baseline times with the `scripts/generate_baseline_time.py` script.
We provide reference baseline times for a variety of NVIDIA GPUs across generations in `results/timing`, but we recommend generating your own baseline times for more accurate results (cluster power, software versions, and other factors all affect timing). See `results/timing/README.md` for more details.
@@ -140,6 +149,8 @@ Check out our [roadmap](https://github.com/ScalingIntelligence/KernelBench/issue
## 🔍 Known Usage
Since release, we have received a lot of interest from researchers, research labs, and companies that use KernelBench to explore this direction. We have documented [known usage](https://docs.google.com/document/d/e/2PACX-1vTjS-UMH1HB5n_PENq2k-3YRfXIXkqKIKeNC2zcWMyLPdl4Jrwvdk4dNDVSsM8ybKrCxZB7GJq1slZF/pub) of KernelBench and related efforts towards automated kernel generation. If you are using KernelBench, we would love to hear more about it!

Disclaimer: KernelBench is designed as an open-source evaluation framework and toolkit. The KernelBench team does not review, validate, or endorse individual kernels or reported results. Users are responsible for independently verifying any results obtained using the framework. Please check out `EVAL.md` for more guidance on benchmarking and evaluating kernels.

## 🪪 License
MIT. Check `LICENSE.md` for more details.

56 changes: 56 additions & 0 deletions pyproject.toml
@@ -0,0 +1,56 @@
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

# this should be our single source of truth for versioning

[project]
# TODO: rename to kb_src later and update the codebase
# for now, let's get all the imports right first
name = "src"
version = "0.0.1"
requires-python = "==3.13.*"
dependencies = [
# Frameworks
"torch==2.9.1",

"transformers>=4.57.3",
"datasets>=4.4.2",
"modal>=1.3.0",

# helper
"tqdm>=4.67.1",
"packaging",
"pydra-config",
"ninja>=1.13.0",
"tomli>=2.3.0",
"tabulate>=0.9.0",

# Numerics
"einops>=0.8.1",
"python-dotenv>=1.2.1",
"numpy==2.4.0",

# LLM providers
"openai>=2.14.0",
"litellm[proxy]>=1.80.10",
]

[project.optional-dependencies]
gpu = [
# GPU-specific dependencies (requires CUDA)
"triton==3.5.1",
"nvidia-cutlass-dsl",
"tilelang",
"cupy-cuda12x==13.6.0",
]
dev = [
"pytest==9.0.2",
"ruff==0.14.10",
]


[tool.setuptools.packages.find]
where = ["."]
include = ["src*"]
# TODO: change to kb_src later
35 changes: 19 additions & 16 deletions requirements.txt
@@ -1,31 +1,34 @@
# ARCHIVED: We are transitioning to pyproject.toml and uv-based project management
# However, we provide this as a backup for now

# Frameworks
# we use latest PyTorch stable release
torch==2.9.0
torch==2.9.*
triton==3.5.*

# we shall upgrade torch for blackwell when it is stable
transformers
datasets
modal
transformers>=4.57.3
datasets>=4.4.2
modal>=1.3.0

# DSLs
nvidia-cutlass-dsl
tilelang
triton

# helper
tqdm
tqdm>=4.67.1
packaging
pydra_config
pytest
ninja
cupy-cuda12x
pydra-config
ninja>=1.13.0
cupy-cuda12x==13.6.0
tomli>=2.3.0
tabulate>=0.9.0

# Numerics
einops
dotenv
numpy
einops>=0.8.1
python-dotenv>=1.2.1
numpy==2.4.0

# use litellm for cloud providers and openai for local
openai
litellm[proxy]

openai>=2.14.0
litellm[proxy]>=1.80.10
2 changes: 1 addition & 1 deletion scripts/eval_from_generations.py
@@ -67,7 +67,7 @@
"g++-10",
"clang"
)
.pip_install_from_requirements(os.path.join(REPO_TOP_DIR, "requirements.txt"))
.uv_sync(uv_project_dir=REPO_TOP_DIR)
.add_local_dir(
KERNEL_BENCH_PATH,
remote_path="/root/KernelBench"
8 changes: 5 additions & 3 deletions scripts/generate_and_eval_single_sample.py
@@ -46,15 +46,17 @@ def __init__(self):

# Evaluation
# local (requires a GPU), modal (cloud GPU) coming soon
self.eval_mode = "local"
self.eval_mode = "local"
# only support local for now
# see scripts/eval_from_generations_modal.py for modal evaluation
# Construct this from mapping from architecture name to torch cuda arch list in the future
# you can either specify SM version or just use the name
self.gpu_arch = ["Ada"]
self.precision = "fp32" # options ["fp32", "fp16", "bf16"]

# Inference config
self.server_type = None
self.model_name = None
self.server_type = REQUIRED
self.model_name = REQUIRED
self.max_tokens = None
self.temperature = None

6 changes: 4 additions & 2 deletions scripts/generate_and_eval_single_sample_modal.py
@@ -14,7 +14,6 @@
from datasets import load_dataset

#from src.dataset import construct_kernelbench_dataset
from src.prompt_constructor_toml import get_prompt_for_backend, get_custom_prompt
from src.utils import extract_first_code, query_server, set_gpu_arch, read_file, create_inference_server_from_presets

app = modal.App("eval_single_sample")
@@ -102,7 +101,7 @@ def __repr__(self):
"g++-10",
"clang" # note i skip a step
)
.pip_install_from_requirements(os.path.join(REPO_TOP_DIR, "requirements.txt"))
.uv_sync(uv_project_dir=REPO_TOP_DIR, extras=["gpu"])
.add_local_python_source("src")
)

@@ -238,6 +237,9 @@ def main(config: EvalConfig):
"include_hardware_info is True but hardware_gpu_name is not provided."
)

# Lazy import prompt constructor
from src.prompt_constructor_toml import get_prompt_for_backend, get_custom_prompt

if custom_prompt_key:
custom_prompt = get_custom_prompt(
custom_prompt_key,
2 changes: 1 addition & 1 deletion scripts/generate_baseline_time_modal.py
@@ -91,7 +91,7 @@ def __init__(self):
"g++-10",
"clang" # note i skip a step
)
.pip_install_from_requirements(os.path.join(REPO_TOP_PATH, "requirements.txt"))
.uv_sync(uv_project_dir=REPO_TOP_PATH, extras=["gpu"])
.add_local_dir(
KERNEL_BENCH_PATH,
remote_path="/root/KernelBench"
2 changes: 1 addition & 1 deletion scripts/run_and_check.py
@@ -35,7 +35,7 @@
image = (
modal.Image.from_registry(f"nvidia/cuda:{tag}", add_python="3.10")
.apt_install("git", "gcc-10", "g++-10", "clang")
.pip_install_from_requirements(os.path.join(REPO_TOP_PATH, "requirements.txt"))
.uv_sync(uv_project_dir=REPO_TOP_PATH)
.add_local_dir(KERNEL_BENCH_PATH, remote_path="/root/KernelBench")
.add_local_python_source("src")
.add_local_python_source("scripts")
8 changes: 0 additions & 8 deletions setup.py

This file was deleted.

1 change: 0 additions & 1 deletion src/frameworks.py
@@ -32,7 +32,6 @@
import shutil
import concurrent
from functools import cache
from transformers import AutoTokenizer
import hashlib

from concurrent.futures import ProcessPoolExecutor, as_completed
4 changes: 1 addition & 3 deletions src/utils.py
@@ -25,11 +25,9 @@
from contextlib import contextmanager
from collections import defaultdict
import time
import shutil
import concurrent
from functools import cache
from transformers import AutoTokenizer
import hashlib


from concurrent.futures import ProcessPoolExecutor, as_completed
