Where’s the Bug? Attention Probing for Scalable Fault Localization


Bug Attention Probe (BAP) is a scalable method for fault localization (FL) in code without requiring executable tests, large-scale LLMs, or strong supervision. BAP outperforms traditional fault localization baselines and LLM prompting approaches by leveraging a novel attention-based probing technique.
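As a rough illustration of the attention-probing idea (a sketch only, not the exact architecture used in this repo), a single learned query can attend over frozen per-token hidden states: the pooled vector is trained to predict whether a snippet is buggy, and the attention weights double as token-level localization scores. All class and tensor names below are made up for exposition.

import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Minimal attention-probe sketch (illustrative, not BAP's exact architecture)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(hidden_dim))   # single learned query
        self.key_proj = nn.Linear(hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor, mask: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_dim); mask: (batch, seq_len) with 1 for real tokens
        keys = self.key_proj(hidden_states)
        scores = keys @ self.query / hidden_states.size(-1) ** 0.5    # (batch, seq_len)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)                          # token-level localization scores
        pooled = (attn.unsqueeze(-1) * hidden_states).sum(dim=1)      # (batch, hidden_dim)
        bug_logit = self.classifier(pooled).squeeze(-1)               # snippet-level bug prediction
        return bug_logit, attn

Training only needs a snippet-level buggy/not-buggy label; the attention weights provide localization for free, which is why no line-level supervision is required.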

Why Use BAP?

  • Lightweight: BAP elicits FL from small models, outperforming prompting of models more than 10x larger.
  • Test-Free and Line Label-Free: No requirement for executable test cases or line-level labels.
  • Multi-line Bug Detection: Effectively identifies faults spread across multiple lines.
  • Code-Level Localization: FL can be reported at granularities beyond individual lines of code through the choice of aggregation method (see the sketch after this list).
  • Robustness: Empirically outperforms alternative FL methods across languages and bug types.
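
For example, token-level scores can be pooled into line-level (or coarser) scores. The helper below is a hypothetical illustration assuming token scores and a tokenizer offset mapping are already available; it is not code from this repository.

import bisect
from collections import defaultdict

def aggregate_token_scores(code: str, token_spans, token_scores, reduce=max):
    """Aggregate token-level scores into line-level scores (illustrative helper).

    token_spans: (start_char, end_char) per token, e.g. a tokenizer's offset mapping;
    token_scores: one score per token. `reduce` (e.g. max or sum) controls how scores
    are pooled, which is what lets localization be reported at lines, statements,
    or coarser units.
    """
    # Character offset at which each line starts.
    line_starts = [0]
    for i, ch in enumerate(code):
        if ch == "\n":
            line_starts.append(i + 1)

    per_line = defaultdict(list)
    for (start, _end), score in zip(token_spans, token_scores):
        line = bisect.bisect_right(line_starts, start) - 1  # 0-indexed line number
        per_line[line].append(score)

    return {line: reduce(scores) for line, scores in per_line.items()}

Ranking lines by the aggregated scores then yields top-K FL predictions.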

Setup

To get started, first create a Python virtual environment:

python -m venv .venv

and activate the environment with source .venv/bin/activate.

Then install the necessary requirements:

python -m pip install -r requirements.txt

Training and Evaluation

BAP

To train BAP on the different datasets, run:

python scripts/train_probes.py --data <data_name> --emb_model llama3.2-11 --probe norm_attention2 --engineer --lr 1e-4 --batch_size 16 --num_epochs 5 --weight_decay 1 --seed 0

where <data_name> is one of github, blackbox, deepfix, tssb, manysstubs, juliet-java, or juliet-c.

This will guide you through the process of obtaining the token hidden states if you have not already collected them.
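
If you prefer to cache hidden states manually, a minimal sketch with Hugging Face transformers might look like the following. The model identifier, model class, and layer choice are placeholders and assumptions for illustration, not values taken from this repository's configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id: substitute the Hugging Face checkpoint that the repo's
# "llama3.2-11" setting refers to.
MODEL_ID = "<hf-id-for-llama3.2-11>"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

code = "def add(a, b):\n    return a - b  # buggy line\n"
inputs = tokenizer(code, return_tensors="pt", return_offsets_mapping=True)
offsets = inputs.pop("offset_mapping")  # token -> character spans, handy for line-level aggregation

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each (1, seq_len, hidden_dim);
# cache whichever layer(s) the probe will be trained on.
torch.save({"hidden": out.hidden_states[-1].cpu(), "offsets": offsets}, "cached_hidden_states.pt")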

We provide Hugging Face checkpoints for Defects4J train/test hidden states, as well as BAP-Llama3.2-11B trained on the detection task via 10-fold cross-validation.

Zero Shot

To generate results for zero-shot prompting of base models, run:

python instruct_zero_shot.py --data <data_name> --model llama3.2-11

Append the --eval flag to compute the top-K accuracies of the corresponding model responses:

python instruct_zero_shot.py --data <data_name> --model llama3.2-11 --eval
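
For reference, top-K accuracy in FL typically counts a program as correctly localized if any ground-truth buggy line appears among the K highest-ranked lines. The helper below is a hypothetical illustration, not this repository's evaluation code.

def top_k_accuracy(ranked_lines_per_program, buggy_lines_per_program, k=1):
    """Fraction of programs whose top-k ranked lines hit at least one buggy line.

    ranked_lines_per_program: per program, line numbers sorted by predicted score.
    buggy_lines_per_program: per program, the set of ground-truth buggy line numbers.
    """
    hits = 0
    for ranked, buggy in zip(ranked_lines_per_program, buggy_lines_per_program):
        if any(line in buggy for line in ranked[:k]):
            hits += 1
    return hits / max(len(ranked_lines_per_program), 1)


# Example: two programs; top-1 hits the first program's bug but misses the second's.
print(top_k_accuracy([[3, 7, 1], [2, 5]], [{3}, {5}], k=1))  # 0.5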

LLMAO

LLMAO follows a similar procedure to BAP in which hidden states are extracted and cached prior to training. We ran evaluations both on the originally used CodeGen backbone and on the authors' original code further adapted to BAP's base model, Llama3.2-11.

Extract hidden states from Llama3.2-11 via:

python codegen_loading.py /data <data_name> 3211

Training is then launched via:

python training.py /data <data_name> "llama3.2-11" 1

Evaluation is conducted by running:

python llmao_d4j_window.py --data <data_name> --pretrain_type llama3.2-11 --seed 0

Results

Method                           Defects4J Top-1 Accuracy   Avg. Accuracy (8 Datasets)
Random                           0.144                      0.087
TRANSFER-FL                      0.218                      0.218
Llama-3.3-70B                    0.269                      0.162
DeepSeek-R1-Distill-Llama-70B    0.221                      0.131
GPT-4o                           0.249                      0.181
LLMAO-Llama-3.2-11B              0.144                      0.126
BAP-Llama-3.2-11B                0.334                      0.350

BAP consistently outperforms traditional test-based FL, existing probing methods, and zero-shot prompting of large models.

Citation

If you use BAP in your research, please cite:

@article{bap2025,
  author    = {Adam Stein and Arthur Wayne and Aaditya Naik and Mayur Naik and Eric Wong},
  title     = {Where's the Bug? Attention Probing for Scalable Fault Localization},
  year      = {2025},
  journal   = {arXiv preprint}
}
