Bug Attention Probe (BAP) is a scalable method for fault localization (FL) in code without requiring executable tests, large-scale LLMs, or strong supervision. BAP outperforms traditional fault localization baselines and LLM prompting approaches by leveraging a novel attention-based probing technique.
- Lightweight: BAP elicits FL from small models, outperforming prompting of models >10x larger.
- Test-Free and Line Label-Free: No requirement for executable test cases or line-level labels.
- Multi-line Bug Detection: Effectively identifies faults spread across multiple lines.
- Code-Level Localization: FL granularity can be extended beyond individual lines of code through the choice of aggregation method (see the sketch after this list).
- Robustness: Empirically outperforms alternative FL methods across languages and bug types.
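To make the attention-probing idea concrete, here is a minimal sketch (not the repository's `norm_attention2` probe; the single learned query, scaling, and per-line max-pooling are illustrative assumptions): a small learned query attends over frozen LLM token hidden states, and the token-level attention weights are pooled per line into fault scores.

```python
# Illustrative sketch of an attention probe for fault localization.
# NOT the repo's norm_attention2 implementation: the single learned query,
# scaling, and per-line max-pooling are assumptions for exposition.
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(hidden_dim))  # learned probe query
        self.scale = hidden_dim ** -0.5

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_tokens, hidden_dim) from a frozen base LLM
        scores = (hidden_states @ self.query) * self.scale  # (num_tokens,)
        return torch.softmax(scores, dim=-1)                 # token-level attention

def aggregate_to_lines(token_attn: torch.Tensor, token_to_line: list[int]) -> torch.Tensor:
    # Pool token-level attention into per-line fault scores (max-pooling here);
    # swapping the aggregation changes the localization granularity.
    line_scores = torch.zeros(max(token_to_line) + 1)
    for attn, line in zip(token_attn, token_to_line):
        line_scores[line] = torch.maximum(line_scores[line], attn)
    return line_scores
```

Ranking lines (or coarser code units) by these scores yields the top-K candidates evaluated below.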
To get started, first create a Python environment:
python -m venv .venv
and activate the environment with source .venv/bin/activate.
Then install the necessary requirements:
python -m pip install -r requirements.txt
To train BAP on the different datasets, run:
python scripts/train_probes.py --data <data_name> --emb_model llama3.2-11 --probe norm_attention2 --engineer --lr 1e-4 --batch_size 16 --num_epochs 5 --weight_decay 1 --seed 0
where <data_name> is one of github, blackbox, deepfix, tssb, manysstubs, juliet-java, or juliet-c.
The script will walk you through collecting the token hidden states if you have not already done so.
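If you prefer to cache hidden states yourself, the sketch below shows the general pattern with Hugging Face transformers; the model identifier, example snippet, and output path are placeholders, not the repository's extraction script.

```python
# Illustrative sketch: cache last-layer token hidden states for one code snippet.
# The model name and save path are placeholders, not the repo's exact pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # placeholder; substitute the model you probe
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

code = "def add(a, b):\n    return a - b  # bug: should be a + b\n"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.hidden_states[-1]      # (1, num_tokens, hidden_dim)
torch.save(hidden, "hidden_states.pt")  # cached input for probe training
```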
We provide Hugging Face checkpoints for the Defects4J train/test hidden states, as well as for BAP-Llama3.2-11B trained on the detection task via 10-fold cross-validation.
To generate results for zero-shot prompting of base models, run:
python instruct_zero_shot.py --data <data_name> --model llama3.2-11
Append the --eval flag to compute the top-K accuracies of the corresponding model responses:
python instruct_zero_shot.py --data <data_name> --model llama3.2-11 --eval
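For reference, top-K accuracy is commonly computed as the fraction of bugs for which at least one ground-truth buggy line ranks within the top K predictions; the sketch below illustrates that definition and is not the repository's evaluation code, whose details may differ.

```python
# Illustrative top-K fault localization accuracy: a bug counts as localized
# if any ground-truth buggy line appears among the K highest-ranked lines.
def top_k_accuracy(ranked_lines: list[list[int]], buggy_lines: list[set[int]], k: int) -> float:
    hits = sum(1 for ranked, truth in zip(ranked_lines, buggy_lines) if truth & set(ranked[:k]))
    return hits / len(ranked_lines)

# Example: two programs, K = 1 -> only the first bug is localized (accuracy 0.5).
print(top_k_accuracy([[3, 7, 1], [2, 5]], [{3}, {9}], k=1))
```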
LLMAO follows a procedure similar to BAP's, in which hidden states are extracted and cached prior to training. We ran evaluations both on the originally used CodeGen models and on the authors' original code adapted to BAP's base model, Llama3.2-11.
Extract hidden states from Llama3.2-11 via:
python codegen_loading.py /data <data_name> 3211
The training process is then run via:
python training.py /data <data_name> "llama3.2-11" 1
Evaluation is conducted by running:
python llmao_d4j_window.py --data <data_name> --pretrain_type llama3.2-11 --seed 0
| Method | Defects4J Top-1 Accuracy | Avg. Accuracy (8 Datasets) |
|---|---|---|
| Random | 0.144 | 0.087 |
| TRANSFER-FL | 0.218 | 0.218 |
| Llama-3.3-70B | 0.269 | 0.162 |
| DeepSeek-R1-Distill-Llama-70B | 0.221 | 0.131 |
| GPT-4o | 0.249 | 0.181 |
| LLMAO-Llama-3.2-11B | 0.144 | 0.126 |
| BAP-Llama-3.2-11B | 0.334 | 0.350 |
BAP consistently outperforms traditional test-based FL, existing probing methods, and zero-shot prompting of large models.
If you use BAP in your research, please cite:
@article{bap2025,
  author = {Adam Stein and Arthur Wayne and Aaditya Naik and Mayur Naik and Eric Wong},
  title = {Where's the Bug? Attention Probing for Scalable Fault Localization},
  year = {2025},
  journal = {arXiv preprint}
}