
Clarify KernelBench’s Benchmarking Scope: Inline CUDA Kernels vs. Forward Kernels Using PyTorch #57

@yuxuan-z19

Description

I’m trying to understand the intended purpose of KernelBench, and the paper seems ambiguous on one point. There appear to be two distinct things a submission could be evaluated on:

  • Inline CUDA kernels: Designed to evaluate LLM-generated, from‑scratch CUDA kernels that do not reuse PyTorch compute primitives (e.g., as in the AI CUDA Engineer workflow).

  • Forward kernels: Designed to evaluate LLM-generated wrappers or glue code that do call into existing PyTorch implementations (e.g., as in the CUDA-L1 workflow). Sketches of both styles appear below.
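
For concreteness, here is a minimal sketch of the first style, assuming the `ModelNew` submission convention used in the KernelBench repo (`add_one` is a toy kernel invented for illustration, not taken from the benchmark):

```python
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

# Hand-written CUDA kernel compiled at load time; PyTorch is used only for
# tensor allocation and extension loading, not for the math itself.
cuda_source = r"""
#include <torch/extension.h>

__global__ void add_one_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

torch::Tensor add_one(torch::Tensor x) {
    // Toy example: assumes a contiguous float32 CUDA tensor.
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_one_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

cpp_source = "torch::Tensor add_one(torch::Tensor x);"

add_one_ext = load_inline(
    name="add_one_ext",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=["add_one"],
)

class ModelNew(nn.Module):
    def forward(self, x):
        return add_one_ext.add_one(x)
```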

The paper does not clearly state which of these two scenarios KernelBench is meant to measure. As a result, it’s unclear whether submissions are expected to:

  1. Write fully self‑contained CUDA kernels with no PyTorch calls, or
  2. Write high‑level forward() functions that simply delegate to PyTorch’s optimized backend (a sketch of this second style follows the list).
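
For contrast, a sketch of the second style under the same assumed `ModelNew` convention, where `forward()` contains no original kernel code at all:

```python
import torch
import torch.nn as nn

class ModelNew(nn.Module):
    def forward(self, a, b):
        # No custom CUDA: delegate straight to the existing
        # cuBLAS-backed PyTorch primitive.
        return torch.matmul(a, b)
```

Both sketches would presumably pass a correctness check against the reference `Model`, yet they represent very different amounts of kernel-engineering work, which is exactly the ambiguity this issue is about.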

Questions

  1. What exactly does KernelBench measure?

    • Inline/custom CUDA kernels only?
    • Forward wrappers calling PyTorch?
    • Both, with separate tracks?
  2. What are the benchmarking requirements?

    • Are PyTorch calls disallowed in the “inline CUDA” track?
    • If PyTorch calls are allowed, what level of originality is expected?

Please update the README and/or paper to explicitly define the two benchmarking modes (if both are intended), including:

  • Allowed APIs and library calls for each track.
  • Example submissions for each mode.

If only one mode is intended, please state which one in both the paper and the repository documentation.
