Bug Attention Probe (BAP) is a scalable method for fault localization (FL) in code without requiring executable tests, large-scale LLMs, or strong supervision. BAP outperforms traditional fault localization baselines and LLM prompting approaches by leveraging a novel attention-based probing technique.
- Lightweight: BAP elicits FL from small models, outperforming prompting of models >10x larger.
- Test-Free and Line Label-Free: No requirement for executable test cases or line-level labels.
- Multi-line Bug Detection: Effectively identifies faults spread across multiple lines.
- Code-Level Localization: FL granularity can be extended beyond individual lines of code through the choice of aggregation method (see the sketch after this list).
- Robustness: Empirically outperforms alternative FL methods across languages and bug types.
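To make the attention-probing idea concrete, here is a minimal sketch (not the repository's `norm_attention2` probe; the single learned query, scaling, and per-line max-pooling are illustrative assumptions): a small learned query attends over frozen LLM token hidden states, and the token-level attention weights are pooled per line into fault scores.

```python
# Illustrative sketch of an attention probe for fault localization.
# NOT the repo's norm_attention2 implementation: the single learned query,
# scaling, and per-line max-pooling are assumptions for exposition.
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(hidden_dim))  # learned probe query
        self.scale = hidden_dim ** -0.5

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_tokens, hidden_dim) from a frozen base LLM
        scores = (hidden_states @ self.query) * self.scale  # (num_tokens,)
        return torch.softmax(scores, dim=-1)                 # token-level attention

def aggregate_to_lines(token_attn: torch.Tensor, token_to_line: list[int]) -> torch.Tensor:
    # Pool token-level attention into per-line fault scores (max-pooling here);
    # swapping the aggregation changes the localization granularity.
    line_scores = torch.zeros(max(token_to_line) + 1)
    for attn, line in zip(token_attn, token_to_line):
        line_scores[line] = torch.maximum(line_scores[line], attn)
    return line_scores
```

Ranking lines (or coarser code units) by these scores yields the top-K candidates evaluated below.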
To get started, first create a Python environment:
python -m venv .venv
and activate the environment with source .venv/bin/activate.
Then install the necessary requirements:
python -m pip install -r requirements.txt
To train BAP on the different datasets, run:
python scripts/train_probes.py --data <data_name> --emb_model llama3.2-11 --probe norm_attention2 --engineer --lr 1e-4 --batch_size 16 --num_epochs 5 --weight_decay 1 --seed 0
where <data_name> is one of github, blackbox, deepfix, tssb, manysstubs, juliet-java, or juliet-c.
The script will walk you through collecting the token hidden states if you have not already done so.
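If you prefer to cache hidden states yourself, the sketch below shows the general pattern with Hugging Face transformers; the model identifier, example snippet, and output path are placeholders, not the repository's extraction script.

```python
# Illustrative sketch: cache last-layer token hidden states for one code snippet.
# The model name and save path are placeholders, not the repo's exact pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # placeholder; substitute the model you probe
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

code = "def add(a, b):\n    return a - b  # bug: should be a + b\n"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.hidden_states[-1]      # (1, num_tokens, hidden_dim)
torch.save(hidden, "hidden_states.pt")  # cached input for probe training
```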
We provide Hugging Face checkpoints for the Defects4J train/test hidden states, as well as for BAP-Llama3.2-11B trained on the detection task via 10-fold cross-validation.
To generate results for zero-shot prompting of base models, run:
python instruct_zero_shot.py --data <data_name> --model llama3.2-11
Append the --eval flag to compute the top-K accuracies of the corresponding model responses:
python instruct_zero_shot.py --data <data_name> --model llama3.2-11 --eval
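For reference, top-K accuracy is commonly computed as the fraction of bugs for which at least one ground-truth buggy line ranks within the top K predictions; the sketch below illustrates that definition and is not the repository's evaluation code, whose details may differ.

```python
# Illustrative top-K fault localization accuracy: a bug counts as localized
# if any ground-truth buggy line appears among the K highest-ranked lines.
def top_k_accuracy(ranked_lines: list[list[int]], buggy_lines: list[set[int]], k: int) -> float:
    hits = sum(1 for ranked, truth in zip(ranked_lines, buggy_lines) if truth & set(ranked[:k]))
    return hits / len(ranked_lines)

# Example: two programs, K = 1 -> only the first bug is localized (accuracy 0.5).
print(top_k_accuracy([[3, 7, 1], [2, 5]], [{3}, {9}], k=1))
```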
LLMAO follows a procedure similar to BAP's, in which hidden states are extracted and cached prior to training. We ran evaluations both on the originally used CodeGen models and on the authors' original code adapted to BAP's base model, Llama3.2-11.
Extract hidden states from Llama3.2-11 via:
python codegen_loading.py /data <data_name> 3211
The training process is then run via:
python training.py /data <data_name> "llama3.2-11" 1
Evaluation is conducted by running:
python llmao_d4j_window.py --data <data_name> --pretrain_type llama3.2-11 --seed 0
| Method | Defects4J Top-1 Accuracy | Avg. Accuracy (8 Datasets) |
|---|---|---|
| Random | 0.144 | 0.087 |
| TRANSFER-FL | 0.218 | 0.218 |
| Llama-3.3-70B | 0.269 | 0.162 |
| DeepSeek-R1-Distill-Llama-70B | 0.221 | 0.131 |
| GPT-4o | 0.249 | 0.181 |
| LLMAO-Llama-3.2-11B | 0.144 | 0.126 |
| BAP-Llama-3.2-11B | 0.334 | 0.350 |
BAP consistently outperforms traditional test-based FL, existing probing methods, and zero-shot prompting of large models.
If you use BAP in your research, please cite:
@article{bap2025,
  author = {Adam Stein and Arthur Wayne and Aaditya Naik and Mayur Naik and Eric Wong},
  title = {Where's the Bug? Attention Probing for Scalable Fault Localization},
  year = {2025},
  journal = {arXiv preprint}
}