GitHub - FOR-sight-ai/interpreto: 🪄 Interpreto is an interpretability toolbox for LLMs

Interpreto: Interpretability Toolkit for LLMs

🚀 Quick Start

The library is available on PyPI. To install it, run pip install interpreto.
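After installing, it can be handy to confirm that the package is visible to the current Python environment. The snippet below is a minimal sketch that only uses the standard library (importlib.metadata); the helper name installed_version is an illustrative assumption and is not part of Interpreto.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "interpreto"):
    """Return the installed version string of `package`, or None if it is missing.

    Hypothetical helper for illustration; it is not provided by Interpreto.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"interpreto {found}" if found else "interpreto is not installed")
```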

Check out the tutorials to get started:

Attributions walkthrough (both classification and generation)
Classification concept-based explanations
Generation concept-based explanations

📦 What's Included

Interpreto 🪄 provides a modular framework encompassing Attribution Methods, Concept-Based Methods, and Evaluation Metrics.

Attribution Methods

Interpreto includes both inference-based and gradient-based attribution methods.

They all work for both classification (...ForSequenceClassification) and generation (...ForCausalLM) models.

Inference-based methods:

Gradient-based methods:

GradientShap — Lundberg and Lee, 2017
InputxGradient — Simonyan et al., 2013
Integrated Gradient — Sundararajan et al., 2017
Saliency — Simonyan et al., 2013
SmoothGrad — Smilkov et al., 2017
SquareGrad — Hooker et al., 2019
VarGrad — Richter et al., 2020
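To make the idea behind some of these gradient-based methods concrete, here is a minimal, library-free sketch on a toy differentiable model (a logistic unit over three input features). It does not use Interpreto's API: the model, the weights, and all function names are assumptions chosen for illustration, and saliency is taken here as the absolute gradient, which is one common convention.

```python
import math
import random

# Toy "model": a logistic unit over three input features (illustrative assumption).
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = 0.1


def model(x):
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x):
    """Analytic gradient of the toy model output with respect to each input."""
    s = model(x)
    return [s * (1.0 - s) * w for w in WEIGHTS]


def saliency(x):
    """Absolute value of the gradient."""
    return [abs(g) for g in gradient(x)]


def input_x_gradient(x):
    """Element-wise product of the input and the gradient."""
    return [xi * g for xi, g in zip(x, gradient(x))]


def integrated_gradients(x, baseline=None, steps=200):
    """Average the gradient along the straight path from baseline to x (midpoint rule)."""
    if baseline is None:
        baseline = [0.0] * len(x)
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def smoothgrad(x, noise_std=0.1, samples=50, seed=0):
    """Average the gradient over Gaussian-perturbed copies of the input."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(samples):
        noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
        for i, g in enumerate(gradient(noisy)):
            total[i] += g
    return [t / samples for t in total]


if __name__ == "__main__":
    x = [1.0, 2.0, -1.0]
    print("saliency:", saliency(x))
    print("input x gradient:", input_x_gradient(x))
    print("integrated gradients:", integrated_gradients(x))
    print("smoothgrad:", smoothgrad(x))
```

A useful sanity check for Integrated Gradients is completeness: the attributions should sum to approximately model(x) - model(baseline).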

Concept-Based Methods or Mechanistic Interpretability

Concept-based explanations aim to provide high-level interpretations of latent model representations.

Interpreto generalizes these methods through three core steps:

Concept Discovery (e.g., from latent embeddings)
Concept Interpretation (mapping discovered concepts to human-understandable elements)
Concept-to-Output Attribution (assessing concept relevance to model outputs)

Dictionary Learning for Concept Discovery (mainly via Overcomplete):

Interpret neurons directly via NeuronsAsConcepts
NMF, Semi-NMF, ConvexNMF
ICA, SVD, PCA, KMeans
SAE variants: Vanilla SAE, TopK SAE, JumpReLU SAE, BatchTopK SAE
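As an illustration of what dictionary learning for concept discovery does, the sketch below factorizes a small non-negative activation matrix A into concept activations U and a concept dictionary V (A ≈ U V) using the classic multiplicative-update rule for NMF. It is a pure-Python toy under stated assumptions: the data, the helper names, and the hyperparameters are made up for illustration, and this is not how Interpreto or Overcomplete implement NMF.

```python
import random

# Toy latent activations: 5 samples x 4 latent dimensions (illustrative data).
ACTIVATIONS = [
    [1.0, 0.0, 2.0, 0.0],
    [2.0, 0.0, 4.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 6.0, 0.0, 2.0],
    [1.0, 3.0, 2.0, 1.0],
]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(a, u, v):
    """Squared Frobenius norm of A - U V."""
    approx = matmul(u, v)
    return sum((x - y) ** 2 for ra, rb in zip(a, approx) for x, y in zip(ra, rb))


def nmf(a, n_concepts, iterations=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF: returns (U, V) with non-negative entries."""
    rng = random.Random(seed)
    n, d = len(a), len(a[0])
    u = [[rng.random() + 0.1 for _ in range(n_concepts)] for _ in range(n)]
    v = [[rng.random() + 0.1 for _ in range(d)] for _ in range(n_concepts)]
    for _ in range(iterations):
        # Update the dictionary V with U fixed.
        ut = transpose(u)
        num, den = matmul(ut, a), matmul(matmul(ut, u), v)
        v = [[v[i][j] * num[i][j] / (den[i][j] + eps) for j in range(d)]
             for i in range(n_concepts)]
        # Update the concept activations U with V fixed.
        vt = transpose(v)
        num, den = matmul(a, vt), matmul(u, matmul(v, vt))
        u = [[u[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_concepts)]
             for i in range(n)]
    return u, v


if __name__ == "__main__":
    u, v = nmf(ACTIVATIONS, n_concepts=2)
    print("reconstruction error:", frobenius_error(ACTIVATIONS, u, v))
```

Each row of V can be read as one concept direction in latent space, and each row of U gives how strongly a sample activates each concept.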

Available Concept Interpretation Techniques:

Top-k tokens from tokenizer vocabulary via TopKInputs and use_vocab=True
Top-k tokens/words/sentences/samples from specific datasets via TopKInputs
Label concepts via LLMs with LLMLabels (Bills et al. 2023)
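The top-k idea can be sketched in a few lines: for each concept, keep the inputs whose activation on that concept is highest. The function below is a hypothetical, standard-library illustration of that principle; it is not the TopKInputs implementation, and the tokens and activation values are invented.

```python
import heapq


def top_k_inputs(tokens, concept_activations, k=2):
    """Return, for each concept index, the k tokens with the highest activation.

    tokens: list of strings, one per row of concept_activations.
    concept_activations: list of rows, each holding one activation per concept.
    Repeated tokens are merged by keeping their maximum activation.
    """
    n_concepts = len(concept_activations[0])
    result = {}
    for c in range(n_concepts):
        best = {}
        for token, row in zip(tokens, concept_activations):
            best[token] = max(best.get(token, float("-inf")), row[c])
        result[c] = heapq.nlargest(k, best.items(), key=lambda item: item[1])
    return result


if __name__ == "__main__":
    tokens = ["great", "awful", "movie", "great", "plot"]
    activations = [
        [0.9, 0.0],
        [0.1, 0.8],
        [0.2, 0.1],
        [0.7, 0.05],
        [0.3, 0.2],
    ]
    for concept, top in top_k_inputs(tokens, activations).items():
        print(concept, top)
```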

Concept interpretation techniques planned for future releases:

Input-to-concept attribution from dataset examples (Jourdan et al. 2023)
Theme prediction via LLMs from top-k tokens/sentences
Aligning concepts with human labels (Sajjad et al. 2022)
Word cloud visualizations of concepts (Dalvi et al. 2022)
VocabProj & TokenChange (Gur-Arieh et al. 2025)

Concept-to-Output Attribution:

Estimate the contribution of each concept to the model output.

Can be obtained with any concept-based explainer via MethodConcepts.concept_output_gradient().
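Conceptually, a gradient-based concept-to-output attribution differentiates the model output with respect to the concept activations, going through the concept decoder and the rest of the model. The sketch below shows this on a toy setup (a linear dictionary followed by a linear head and a tanh); every name and value is an assumption for illustration, and it does not reproduce the behavior or signature of concept_output_gradient().

```python
import math

# Toy setup (illustrative): 2 concepts decoded into a 3-dimensional latent space.
DICTIONARY = [
    [1.0, 0.0, 1.0],   # direction of concept 0 in latent space
    [0.0, 2.0, -1.0],  # direction of concept 1 in latent space
]
HEAD = [0.5, -0.25, 1.0]  # linear head applied to the latent vector


def decode(concepts):
    """Map concept activations back to the latent space."""
    return [sum(c * row[i] for c, row in zip(concepts, DICTIONARY))
            for i in range(len(HEAD))]


def output(concepts):
    """Scalar model output computed from concept activations."""
    return math.tanh(sum(w * h for w, h in zip(HEAD, decode(concepts))))


def concept_gradient(concepts):
    """Analytic gradient of the output with respect to each concept activation."""
    out = output(concepts)
    return [(1.0 - out ** 2) * sum(w * d for w, d in zip(HEAD, row))
            for row in DICTIONARY]


if __name__ == "__main__":
    concepts = [0.3, -0.2]
    grads = concept_gradient(concepts)
    print("gradient per concept:", grads)
    # Gradient x activation is one simple way to turn gradients into contributions.
    print("gradient x activation:", [g * c for g, c in zip(grads, concepts)])
```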

Methods from papers, available in the future:

Because this generalization covers all concept-based methods and the architecture is highly flexible, a large number of concept-based methods can be obtained easily:

CAV and TCAV: Kim et al. 2018, Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
ConceptSHAP: Yeh et al. 2020, On Completeness-aware Concept-Based Explanations in Deep Neural Networks
COCKATIEL: Jourdan et al. 2023, COCKATIEL: COntinuous Concept ranKed ATtribution with Interpretable ELements for explaining neural net classifiers on NLP
Yun et al. 2021, Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors
FFN values interpretation: Geva et al. 2022, Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
SparseCoding: Cunningham et al. 2023, Sparse Autoencoders Find Highly Interpretable Features in Language Models
Parameter Interpretation: Dar et al. 2023, Analyzing Transformers in Embedding Space

Evaluation Metrics

Evaluation Metrics for Attribution

To evaluate the faithfulness of attribution methods, Interpreto provides the Insertion and Deletion metrics.
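The principle behind these two metrics can be sketched as follows: Deletion removes inputs from most to least important and tracks how fast the model score drops (lower area under the curve suggests more faithful attributions), while Insertion starts from a baseline and adds inputs back in the same order (higher area is better). The code is a generic illustration on a toy linear model; the function names, the baseline value, and the normalization are assumptions and may differ from Interpreto's implementation.

```python
WEIGHTS = [3.0, 1.0, 2.0]


def toy_model(x):
    """Toy scoring function (illustrative): a weighted sum of the inputs."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x))


def _ranking(attributions):
    """Indices sorted from most to least important."""
    return sorted(range(len(attributions)), key=lambda i: attributions[i], reverse=True)


def deletion_curve(model, inputs, attributions, baseline=0.0):
    """Scores obtained while replacing inputs by the baseline, most important first."""
    current = list(inputs)
    scores = [model(current)]
    for i in _ranking(attributions):
        current[i] = baseline
        scores.append(model(current))
    return scores


def insertion_curve(model, inputs, attributions, baseline=0.0):
    """Scores obtained while restoring inputs from the baseline, most important first."""
    current = [baseline] * len(inputs)
    scores = [model(current)]
    for i in _ranking(attributions):
        current[i] = inputs[i]
        scores.append(model(current))
    return scores


def area_under_curve(scores):
    """Trapezoidal area with the x-axis normalized to [0, 1]."""
    n = len(scores) - 1
    return sum((scores[k] + scores[k + 1]) / 2.0 for k in range(n)) / n


if __name__ == "__main__":
    x = [1.0, 1.0, 1.0]
    faithful = [3.0, 1.0, 2.0]    # matches the true importance of each input
    misleading = [1.0, 3.0, 2.0]  # ranks the least important input first
    for name, attr in [("faithful", faithful), ("misleading", misleading)]:
        print(name,
              "deletion AUC:", area_under_curve(deletion_curve(toy_model, x, attr)),
              "insertion AUC:", area_under_curve(insertion_curve(toy_model, x, attr)))
```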

Evaluation Metrics for Concepts

Concept-based methods have several steps that can be evaluated together via ConSim.

Or independently:

Concept-space (dictionary learning evaluation)
- faithfulness: MSE, FID, and ReconstructionError
- complexity: Sparsity and SparsityRatio
- stability: Stability
Concept interpretations
- No metric yet; metrics will be included soon.
Concept-to-Output attribution
- No metric yet; metrics will be included soon.
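To give an intuition for the concept-space metrics, the sketch below computes a reconstruction MSE (a faithfulness measure) and the fraction of active concept activations (a simple complexity measure) for a toy dictionary. The definitions, names, and data here are illustrative assumptions; Interpreto's MSE, Sparsity, and SparsityRatio metrics may be defined differently.

```python
def reconstruct(concepts, dictionary):
    """Decode concept activations (samples x concepts) with a dictionary (concepts x dims)."""
    dims = len(dictionary[0])
    return [[sum(c * row[i] for c, row in zip(sample, dictionary)) for i in range(dims)]
            for sample in concepts]


def reconstruction_mse(latents, concepts, dictionary):
    """Mean squared error between the latents and their reconstruction from concepts."""
    approx = reconstruct(concepts, dictionary)
    errors = [(x - y) ** 2 for ra, rb in zip(latents, approx) for x, y in zip(ra, rb)]
    return sum(errors) / len(errors)


def active_fraction(concepts, threshold=0.0):
    """Fraction of concept activations whose magnitude exceeds the threshold."""
    values = [abs(v) for sample in concepts for v in sample]
    return sum(1 for v in values if v > threshold) / len(values)


if __name__ == "__main__":
    dictionary = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 concepts x 2 latent dims
    concepts = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
    latents = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.5]]
    print("MSE:", reconstruction_mse(latents, concepts, dictionary))
    print("active fraction:", active_fraction(concepts))
```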

👍 Contributing

Feel free to propose ideas or contribute to the Interpreto 🪄 toolbox! A dedicated document describes, in simple terms, how to make your first pull request.

👀 See Also

🙏 Acknowledgments

This project received funding from the French “Investing for the Future – PIA3” program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL and the FOR projects.

👨‍🎓 Creators

Interpreto 🪄 is a project of the FOR and the DEEL teams at the IRT Saint-Exupéry in Toulouse, France.

🗞️ Citation

If you use Interpreto 🪄 as part of your workflow in a scientific publication, please consider citing 🗞️ our paper:

@article{poche2025interpreto,
    title       = {Interpreto: An Explainability Library for Transformers},
    author      = {Poch{\'e}, Antonin and Mullor, Thomas and Sarti, Gabriele and Boisnard, Fr{\'e}d{\'e}ric and Friedrich, Corentin and Claye, Charlotte and Hoofd, Fran{\c{c}}ois and Bernas, Raphael and Hudelot, C{\'e}line and Jourdan, Fanny},
    journal     = {arXiv preprint arXiv:2512.09730},
    year        = {2025}
}

📝 License

The package is released under the MIT license.
