This fall, the KernelBench team will continue to maintain and improve the repo. This issue serves as a roadmap and a living document that we will keep updating. If you have concrete feature requests, please post them below or, ideally, open an issue on the repo.
We have a fantastic group of Stanford undergrads, @AffectionateCurry @nathanjpaek @pythonomar22 @Marsella8, as core maintainers, with @ethanboneh on RL framework integration. We very much welcome community contributions in these directions (we try our best to review the PRs). Thank you to @alexzhang13 and @hqjenny for their feedback.
Goal & Motivation
KernelBench has quickly become the standard for evaluating LLM kernel-generation capabilities. As many in the community have pointed out, and as we found in our follow-up work, there are aspects of the benchmark that could be improved to make it a more valuable tool for the community. We already started on this over the summer with KernelBench v0.1 by @AffectionateCurry @nataliakokoromyti @anneouyang.
Ultimately, we want to make KernelBench easy (push-button eval), usable (easy to integrate), and referenceable (comparable across approaches).
Overall Milestones
- Milestone 1: By October (SF GPU mode hackathon), resolve all outstanding PRs and issues (or at least respond to each of them)
- Milestone 2: Integrate with community projects to support future research directions (RL, evolutionary search, more languages) and to let people experiment with different approaches
- Milestone 3: Create a Referenceable, Reproducible Pipeline
We hope to have an update/announcement by early December (NeurIPS).
Below are the concrete goals and (tentative) assignments. We will try our best to realize all of these features, but we make no guarantees. We would love to see community contributions!
Milestone 1: Improve KernelBench itself
- Collect community feedback on how we can improve KernelBench
- Go through all current issues & PRs on the KernelBench repo: reply to every issue, and for every PR decide whether to merge, close, or abandon it
- Refactor the codebase to make it easier to work with, including LiteLLM integration (Adding LiteLLM support #78), a more robust dataset object (More Robust Dataset Object #71), and TOML-based prompt templates (Toml-based Prompt Organization #85) @pythonomar22 @AffectionateCurry
- Robust and realistic shapes (e.g., Llama shapes) and correctness eval.
- Implement different ways to do timing eval, such as do_bench, host-side timing, and cache clearing, w/ @Marsella8 @PaliC @simonguozirui (Multiple Timing Eval Implementation #89); see the timing sketch after this list
- Write a benchmarking guide + blog post to explain and showcase the differences between do_bench, NCU profiling, and CUDA-event timing
- Create adversarial tests (cache attacks, zero-mean inputs, low overhead, etc.) @bkal01 (Eval Unit Tests for Adversarial Eval Testing #82); see the zero-mean sketch after this list
- Precision support (Precision Support + TileLang Integration #80)
- Support variable input shapes [sweep across shapes]
- Create a speed-of-light (SoL) roofline model (theoretical max speedup given the input shape and hardware spec) [high priority]; see the roofline sketch after this list
- Detailed study of torch.compile and a new default baseline for Levels 2 and 3 @PaliC
- Potential New Problems and Workloads @nathanjpaek @simonguozirui
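As a reference for the timing-eval item above, here is a minimal sketch of two measurement strategies (Triton's `do_bench` and CUDA-event timing) for an arbitrary CUDA-side callable. The warmup/rep values and function names are illustrative, not the settings KernelBench will ship with.

```python
# Minimal sketch of two timing strategies, assuming `fn` is a CUDA-side callable.
# Warmup/rep values below are illustrative, not KernelBench's actual settings.
import torch
from triton.testing import do_bench

def time_with_do_bench(fn) -> float:
    """Triton's do_bench: flushes the L2 cache between reps and returns a latency in ms."""
    return do_bench(fn, warmup=25, rep=100)

def time_with_cuda_events(fn, n_warmup: int = 10, n_rep: int = 50) -> float:
    """Device-side timing with CUDA events; no cache clearing between reps."""
    for _ in range(n_warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(n_rep):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_rep  # average ms per call
```

The two strategies can report noticeably different numbers (cache state, averaging vs. quantiles), which is exactly what the benchmarking guide above is meant to document.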
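For the adversarial-test item, a small CPU-runnable sketch of the zero-mean failure mode it should guard against: when an operator's outputs are small in magnitude, a trivially wrong kernel that returns zeros can pass a purely absolute-tolerance check, so the eval should also use a relative criterion. Shapes and tolerances here are illustrative.

```python
# Sketch of the "zero mean" failure mode: if an op's outputs are small in
# magnitude, an all-zeros "kernel" can pass an absolute-tolerance check.
# Shapes and tolerances are illustrative only.
import torch

torch.manual_seed(0)
ref_out = torch.randn(1024) * 1e-3        # simulated reference output, tiny values
bad_out = torch.zeros_like(ref_out)       # trivially wrong kernel output

# A loose absolute tolerance lets the wrong kernel pass:
print(torch.allclose(ref_out, bad_out, atol=1e-2, rtol=0))   # True

# A relative-error metric catches it:
rel_err = (ref_out - bad_out).norm() / ref_out.norm()
print(rel_err.item())                                        # ~1.0 (100% error)
```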
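For the SoL roofline item, a minimal sketch of the idea for a single GEMM: take the larger of the compute-bound and memory-bound time estimates as a lower bound on achievable kernel time (and hence an upper bound on speedup over a measured baseline). The hardware peak numbers below are rough, illustrative values, not a committed spec.

```python
# Sketch of a speed-of-light (SoL) lower bound on kernel time for a GEMM
# C[M,N] = A[M,K] @ B[K,N]. Hardware peaks are rough, illustrative numbers.
PEAK_TFLOPS_BF16 = 989.0      # illustrative tensor-core peak, TFLOP/s
PEAK_BW_GBPS = 3350.0         # illustrative HBM bandwidth, GB/s

def gemm_sol_time_us(M: int, N: int, K: int, bytes_per_elem: int = 2) -> float:
    flops = 2.0 * M * N * K                                  # multiply-adds
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)   # read A, B; write C
    t_compute = flops / (PEAK_TFLOPS_BF16 * 1e12)            # seconds, compute-bound
    t_memory = bytes_moved / (PEAK_BW_GBPS * 1e9)            # seconds, memory-bound
    return max(t_compute, t_memory) * 1e6                    # whichever bound dominates

# Example: a Llama-style square projection shape
print(f"SoL lower bound: {gemm_sol_time_us(4096, 4096, 4096):.1f} us")
```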
Milestone 2: Framework Integration
- Curate a doc of how KernelBench has been used and the various approaches to tackling it.
DSL (NVIDIA hardware) support.
- ThunderKittens @Willy-Chan ThunderKittens DSL Support #101
- Triton @PaliC @AffectionateCurry Add Triton Backend #35
- CuTe @nathanjpaek Add Triton Backend #35
- TileLang Support @nathanjpaek @AffectionateCurry Precision Support + TileLang Integration #80
- CUTLASS support
- Helion Support
Alternative Hardware platform support.
- AMD HiP Support
- Google TPU support
RL and Search Framework Integration. See #73 for details.
- RL env integration @ethanboneh
- OpenEvolve Search integration @pythonomar22
Milestone 3: Referenceable, Reproducible Pipeline
The goal is to make KernelBench an actual standard; this milestone is led by @pythonomar22 @AffectionateCurry
- Easy-to-use CoLab notebook CoLab Notebook Tutorial #93
- Create a Modal-based (cloud-executable) pipeline for standard evaluation (Adding modal support to run_and_check.py #83 and TODO); see the Modal sketch at the end of this issue
- People submit a set of kernels, we curate them, and then we show who's the fastest (SWE-bench-style leaderboard)
- Create a leaderboard
- Create a project website
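For the Modal-based pipeline item above, a minimal sketch of what a cloud-executable evaluation entry point could look like. The app name, GPU type, and the `eval_kernel` function and its signature are illustrative assumptions, not the repo's actual interface.

```python
# Hypothetical sketch of a Modal-based eval entry point; eval_kernel's name,
# body, and signature are illustrative, not KernelBench's actual interface.
import modal

app = modal.App("kernelbench-eval")
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="L40S", image=image, timeout=600)
def eval_kernel(ref_src: str, kernel_src: str) -> dict:
    # On the remote GPU: compile both sources, check the candidate kernel's
    # correctness against the reference, time it, and return the results (elided).
    raise NotImplementedError

@app.local_entrypoint()
def main(ref_path: str, kernel_path: str):
    with open(ref_path) as f_ref, open(kernel_path) as f_k:
        result = eval_kernel.remote(f_ref.read(), f_k.read())
    print(result)
```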