This repository contains the code and data for the paper "LinuxFLBench: Benchmarking and Enhancing LLM-based Agents in Localizing Linux Kernel Bugs".
LINUXFLBENCH is a new benchmark of 250 Fault Localization tasks derived from real-world Linux kernel bugs.
- The dataset is located at `dataset/LINUXFLBENCH_dataset.jsonl` in JSON Lines format (a loading sketch follows this list).
- Each line is a real Linux kernel bug sample, with fields including:
  - `id`: Bug ID
  - `title`: Bug title
  - `description`: Detailed bug description
  - `Kernel Version`: The version of the Linux kernel in which the bug occurred (e.g., 5.6.7)
  - `patch`: Patch content for the fix
  - `paths`: Source file paths involved (i.e., localization target files)
  - `methods`: Function names involved
  - Additional metadata: kernel version, component, hardware, etc.
- The dataset covers various kernel versions and is suitable for evaluating LLM/agent-based fault localization in large and complex systems (i.e., the Linux kernel).
- The source code for different Linux kernel versions can be downloaded from here.
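For orientation, here is a minimal loading sketch using the `jsonlines` package; the field names follow the list above, but check the file itself for the exact keys:

```python
# Minimal sketch: iterate over the benchmark and print a few fields.
# Field names follow the README list above; verify them against the file.
import jsonlines

with jsonlines.open("dataset/LINUXFLBENCH_dataset.jsonl") as reader:
    for bug in reader:
        print(bug["id"], bug["title"])
        print("  target files:", bug.get("paths", []))
        print("  target methods:", bug.get("methods", []))
```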
The main code is under the code/ directory, organized as follows:
- `scale/`: Candidate file expansion and reasoning
  - `scaling_candidates_with_dir.py`: Directory-based candidate expansion
  - `scaling_candidates_with_guess.py`: LLM-based candidate expansion
- `merge/`: Multi-method result fusion and reranking
  - `merge.py`: Fusion of multiple ranking results
  - `rerank.py`: LLM-based candidate reranking
- `mail/`: Mail-related scripts (a retrieval sketch follows this list)
  - `mails_retrieval.py`: Retrieves relevant emails from the mail dataset based on queries
  - `search_mails_bm25s.py`: BM25-based mail search utilities
- `method_fl/`: Method-level fault localization based on the predicted code files
  - `method_localize.py`: Method-level fault localization script
- `eval/`: Evaluation and metrics
  - `evaluate.py`: Main evaluation script
  - `evaluation_metrics.py`: Common metrics such as Recall@K and MRR
- `utils.py`, `file_parser.py`: General utility functions
- The mail data for retrieval can be downloaded from here.
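As a rough illustration of what BM25-based mail retrieval looks like, here is a generic sketch using the `rank_bm25` package as a stand-in; it is not a listed dependency, and `search_mails_bm25s.py` ships its own utilities, so treat this only as an outline of the idea:

```python
# Illustrative only: generic BM25 retrieval over a tiny hypothetical mail
# corpus, using rank_bm25 as a stand-in for the repository's own utilities.
from rank_bm25 import BM25Okapi

# Hypothetical mail corpus: one string per email (subject + body).
mails = [
    "mm/slub: fix use-after-free in kmem_cache_destroy",
    "drivers/net: null pointer dereference on device removal",
]
tokenized_mails = [m.lower().split() for m in mails]
bm25 = BM25Okapi(tokenized_mails)

query = "use-after-free in slub".lower().split()
print(bm25.get_scores(query))             # one relevance score per mail
print(bm25.get_top_n(query, mails, n=1))  # highest-scoring mail text
```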
- Candidate Expansion: Use scripts in `scale/` to expand candidate file lists for each bug (e.g., Directory-Aware Expansion, Potential Cause Expansion).
- Candidate Integration: Use scripts in `merge/` to fuse multiple candidate ranking results and rerank them with an LLM (see the sketch after this list).
- Evaluation: Use scripts in `eval/` to evaluate the final results with metrics such as Recall@K and MRR.
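To make the integration and evaluation steps concrete, below is a minimal sketch assuming reciprocal rank fusion as the fusion strategy and the dataset's `paths` field as ground truth; the actual `merge.py`, `rerank.py`, and evaluation scripts may work differently:

```python
# Illustrative sketch only: reciprocal rank fusion of two candidate file
# rankings, plus Recall@K / MRR against ground-truth target files.

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of file paths with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, path in enumerate(ranking, start=1):
            scores[path] = scores.get(path, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def recall_at_k(ranking, gold, k):
    """Fraction of ground-truth files found in the top-k predictions."""
    return len(set(ranking[:k]) & set(gold)) / len(gold) if gold else 0.0

def mrr(ranking, gold):
    """Reciprocal rank of the first correct file (0 if none is ranked)."""
    for rank, path in enumerate(ranking, start=1):
        if path in gold:
            return 1.0 / rank
    return 0.0

# Hypothetical outputs of two expansion strategies for one bug.
dir_ranking   = ["mm/slub.c", "mm/slab_common.c", "mm/slab.h"]
guess_ranking = ["mm/slab_common.c", "mm/slub.c", "mm/memcontrol.c"]
gold_paths    = ["mm/slub.c"]

fused = rrf_fuse([dir_ranking, guess_ranking])
print(fused, recall_at_k(fused, gold_paths, 1), mrr(fused, gold_paths))
```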
All experimental results are located in the result/ directory and can be used for reproduction.
This project requires Python 3.8+ and the following packages:
- openai
- jsonlines
Install dependencies with pip:
```bash
pip install openai jsonlines
```

Some scripts require configuration of an OpenAI API key and base_url. See script arguments for details.
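The client configuration this implies looks roughly like the sketch below; in the repository the key and base URL are passed as script arguments (e.g., `--api_key`, `--gpt_base_url`) rather than hard-coded, and the model name here is only a placeholder:

```python
# Minimal sketch of configuring an OpenAI-compatible client.
# The key, base URL, and model below are placeholders, not values from
# the repository; the scripts take these via command-line arguments.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",                # placeholder, not a real key
    base_url="https://api.openai.com/v1",  # or another compatible endpoint
)
resp = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; use whatever your endpoint serves
    messages=[{"role": "user", "content": "Which kernel file likely contains this bug?"}],
)
print(resp.choices[0].message.content)
```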
Example: Directory-Aware Expansion
```bash
python code/scale/scaling_candidates_with_dir.py \
    --data_path dataset/LINUXFLBENCH_dataset.jsonl \
    --save_path results/dir_scaling.jsonl \
    --gpt_base_url https://api.openai.com/v1 \
    --api_key YOUR_API_KEY \
    --kernel_path /path/to/linux/kernel/
```

Evaluate the results:

```bash
python code/eval/evaluate.py --path results/dir_scaling.jsonl
```

For more details, usage, or questions, please open an issue or contact the authors.