FOR-sight-ai/interpreto

Interpreto: Interpretability Toolkit for LLMs

Explore Interpreto docs »

🚀 Quick Start

The library is available on PyPI. To install it, run pip install interpreto.
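After installing, it can be useful to confirm that the package is visible to the current Python environment. The following is a minimal sketch that relies only on the standard library's importlib.metadata; the helper name installed_version is hypothetical and is not part of Interpreto.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether the distribution named "interpreto" is installed.
version = installed_version("interpreto")
if version is None:
    print("interpreto is not installed in this environment")
else:
    print(f"interpreto version: {version}")
```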

Check out the tutorials to get started:

📦 What's Included

Interpreto 🪄 provides a modular framework encompassing Attribution Methods, Concept-Based Methods, and Evaluation Metrics.

Attribution Methods

Interpreto includes both inference-based and gradient-based attribution methods.

They all work seamlessly for both classification (...ForSequenceClassification) and generation (...ForCausalLM) models.

Inference-based Methods:
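To illustrate what "inference-based" means, here is a minimal sketch of occlusion, a well-known perturbation technique that only needs forward passes: each token is masked in turn and the drop in the model score is recorded. This is not Interpreto's API. The scoring function toy_score, the word list and the [MASK] placeholder are hypothetical stand-ins for a real classifier.

```python
from typing import Callable, List, Sequence


def occlusion_attribution(
    tokens: Sequence[str],
    score_fn: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
) -> List[float]:
    """Attribute to each token the score drop observed when it is masked."""
    baseline = score_fn(list(tokens))
    attributions = []
    for i in range(len(tokens)):
        perturbed = list(tokens)
        perturbed[i] = mask_token  # occlude one token at a time
        attributions.append(baseline - score_fn(perturbed))
    return attributions


# Hypothetical stand-in for a classifier: counts the positive words present.
POSITIVE_WORDS = {"great", "good"}


def toy_score(tokens: List[str]) -> float:
    return float(sum(token in POSITIVE_WORDS for token in tokens))


print(occlusion_attribution(["a", "great", "movie"], toy_score))
# -> [0.0, 1.0, 0.0]
```

Only the token whose removal changes the score receives a non-zero attribution, which is the behaviour expected from this family of methods.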

Gradient-based Methods:
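As an illustration of the gradient-based family, the sketch below implements Integrated Gradients, a well-known technique, on a toy logistic model whose input gradient is available in closed form. It is not Interpreto's API. The model, its weights and all function names are assumptions chosen so the example runs without any deep learning framework.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(x: Sequence[float], w: Sequence[float], bias: float) -> float:
    """Toy logistic model: sigmoid(w . x + bias)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)


def input_gradient(x: Sequence[float], w: Sequence[float], bias: float) -> List[float]:
    """Analytic gradient of the prediction with respect to each input feature."""
    p = predict(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(
    x: Sequence[float],
    baseline: Sequence[float],
    w: Sequence[float],
    bias: float,
    steps: int = 100,
) -> List[float]:
    """Average the gradient along the straight path from baseline to x (midpoint rule)."""
    totals = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        grad = input_gradient(point, w, bias)
        totals = [t + g for t, g in zip(totals, grad)]
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, totals)]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
weights = [2.0, -1.0, 0.0]
attributions = integrated_gradients(x, baseline, weights, bias=0.0)
print(attributions)
# Completeness check: attributions sum (approximately) to f(x) - f(baseline).
print(sum(attributions), predict(x, weights, 0.0) - predict(baseline, weights, 0.0))
```

With a real transformer the gradient would come from automatic differentiation rather than a closed-form expression, but the path-averaging logic stays the same.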

Concept-Based Methods or Mechanistic Interpretability

Concept-based explanations aim to provide high-level interpretations of latent model representations.

Interpreto generalizes these methods through three core steps:

  1. Concept Discovery (e.g., from latent embeddings)
  2. Concept Interpretation (mapping discovered concepts to human-understandable elements)
  3. Concept-to-Output Attribution (assessing concept relevance to model outputs)
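The first two steps can be sketched on toy data as follows. This is an illustration under stated assumptions, not Interpreto's implementation: concept discovery is done here with a tiny k-means (chosen only because it fits in a few lines of pure Python), interpretation is done by listing the tokens whose activations lie closest to each concept, and the tokens, activations and function names are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def discover_concepts(
    activations: Sequence[Vector], n_concepts: int, n_iters: int = 10
) -> List[Vector]:
    """Step 1: use k-means centroids of the latent activations as concepts."""
    centroids = [list(a) for a in activations[:n_concepts]]  # deterministic init
    for _ in range(n_iters):
        clusters: List[List[Vector]] = [[] for _ in range(n_concepts)]
        for a in activations:
            nearest = min(
                range(n_concepts), key=lambda k: squared_distance(a, centroids[k])
            )
            clusters[nearest].append(a)
        for k, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def interpret_concepts(
    tokens: Sequence[str],
    activations: Sequence[Vector],
    concepts: Sequence[Vector],
    top_k: int = 2,
) -> Dict[int, List[str]]:
    """Step 2: describe each concept by the tokens closest to it in latent space."""
    interpretation = {}
    for k, concept in enumerate(concepts):
        ranked = sorted(
            range(len(tokens)),
            key=lambda i: squared_distance(activations[i], concept),
        )
        interpretation[k] = [tokens[i] for i in ranked[:top_k]]
    return interpretation


# Hypothetical 2-D "latent activations" for four tokens.
tokens = ["cat", "dog", "paris", "rome"]
activations = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

concepts = discover_concepts(activations, n_concepts=2)
print(interpret_concepts(tokens, activations, concepts))
```

On this toy data one concept groups the animal tokens and the other groups the city tokens. The third step then asks how much each such concept contributes to the model output.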

Dictionary Learning for Concept Discovery (mainly via Overcomplete):

Available Concept Interpretation Techniques:

Concept Interpretation Techniques to Be Added in the Future:

Concept-to-Output Attribution:

Estimate the contribution of each concept to the model output.

It can be obtained with any concept-based explainer via MethodConcepts.concept_output_gradient().
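To make the idea of a concept-to-output gradient concrete, the sketch below computes it by hand for a toy setting: the latent vector is reconstructed as a weighted sum of dictionary atoms, and a logistic head maps the latent to an output. This is an assumption-laden illustration and does not reproduce the internals of concept_output_gradient(). The dictionary, head weights and function names are hypothetical.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def reconstruct(
    concept_activations: Sequence[float], dictionary: Sequence[Sequence[float]]
) -> List[float]:
    """Latent vector h = sum_k a_k * d_k, with d_k the k-th dictionary atom."""
    dim = len(dictionary[0])
    return [
        sum(a * atom[j] for a, atom in zip(concept_activations, dictionary))
        for j in range(dim)
    ]


def model_output(
    concept_activations: Sequence[float],
    dictionary: Sequence[Sequence[float]],
    head_weights: Sequence[float],
) -> float:
    """Toy head: y = sigmoid(w . h)."""
    h = reconstruct(concept_activations, dictionary)
    return 1.0 / (1.0 + math.exp(-dot(head_weights, h)))


def concept_gradient(
    concept_activations: Sequence[float],
    dictionary: Sequence[Sequence[float]],
    head_weights: Sequence[float],
) -> List[float]:
    """Chain rule: dy/da_k = y * (1 - y) * (w . d_k)."""
    y = model_output(concept_activations, dictionary, head_weights)
    return [y * (1.0 - y) * dot(head_weights, atom) for atom in dictionary]


dictionary = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three concepts in a 2-D latent
head_weights = [2.0, -1.0]
concept_activations = [0.5, 0.2, 0.1]
print(concept_gradient(concept_activations, dictionary, head_weights))
```

A positive entry means that increasing the corresponding concept activation pushes the output up, and a negative entry means it pushes the output down.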

Papers available in the future:

Because this generalization covers all concept-based methods and the architecture is highly flexible, a large number of concept-based methods can be obtained easily:

Evaluation Metrics

Evaluation Metrics for Attribution

To evaluate the faithfulness of attribution methods, Interpreto provides the Insertion and Deletion metrics.
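The idea behind a Deletion metric can be sketched as follows: tokens are masked from most to least important according to the attributions, the model score is recorded after each step, and the area under the resulting curve is computed. A faithful attribution makes the score drop quickly, giving a small area. This is a generic sketch, not Interpreto's implementation; toy_score, the [MASK] placeholder and the normalisation of the area are assumptions.

```python
from typing import Callable, List, Sequence


def deletion_curve(
    tokens: Sequence[str],
    attributions: Sequence[float],
    score_fn: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
) -> List[float]:
    """Model scores as tokens are masked in decreasing order of attribution."""
    order = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    current = list(tokens)
    scores = [score_fn(current)]
    for i in order:
        current[i] = mask_token
        scores.append(score_fn(current))
    return scores


def area_under_curve(scores: Sequence[float]) -> float:
    """Trapezoidal area with the x-axis normalised to [0, 1]."""
    n_steps = len(scores) - 1
    return sum((scores[i] + scores[i + 1]) / 2 for i in range(n_steps)) / n_steps


# Hypothetical stand-in for a classifier: counts the positive words present.
POSITIVE_WORDS = {"great", "good"}


def toy_score(tokens: List[str]) -> float:
    return float(sum(token in POSITIVE_WORDS for token in tokens))


tokens = ["a", "great", "good", "movie"]
faithful = [0.0, 1.0, 0.5, 0.0]    # ranks the positive words first
unfaithful = [1.0, 0.0, 0.0, 0.5]  # ranks the neutral words first

print(area_under_curve(deletion_curve(tokens, faithful, toy_score)))    # -> 0.5
print(area_under_curve(deletion_curve(tokens, unfaithful, toy_score)))  # -> 1.5
```

An Insertion metric mirrors this procedure: it starts from a fully masked input and restores tokens in decreasing order of attribution, so a larger area indicates a more faithful attribution.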

Evaluation Metrics for Concepts

Concept-based methods have several steps that can be evaluated together via ConSim.

Or independently:

👏 Contributing

Feel free to propose ideas or contribute to the Interpreto 🪄 toolbox! A dedicated document explains, in simple terms, how to make your first pull request.

👀 See Also

More from the DEEL project:

  • Xplique: a Python library dedicated to explaining neural networks (Images, Time Series, Tabular data) on TensorFlow.
  • Puncc: a Python library for predictive uncertainty quantification using conformal prediction.
  • oodeel: a Python library that performs post-hoc deep Out-of-Distribution (OOD) detection on already trained neural network image classifiers.
  • deel-lip: a Python library for training k-Lipschitz neural networks on TensorFlow.
  • deel-torchlip: a Python library for training k-Lipschitz neural networks on PyTorch.
  • Influenciae: a Python library dedicated to computing influence values for the discovery of potentially problematic samples in a dataset.
  • DEEL White paper: a summary by the DEEL team of the challenges of certifiable AI and the role of data quality, representativity and explainability for this purpose.

🙏 Acknowledgments

This project received funding from the French “Investing for the Future – PIA3” program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL and the FOR projects.

👨‍🎓 Creators

Interpreto 🪄 is a project of the FOR and the DEEL teams at the IRT Saint-Exupéry in Toulouse, France.

🗞️ Citation

If you use Interpreto 🪄 as part of your workflow in a scientific publication, please consider citing 🗞️ our paper:

@article{poche2025interpreto,
    title       = {Interpreto: An Explainability Library for Transformers},
    author      = {Poch{\'e}, Antonin and Mullor, Thomas and Sarti, Gabriele and Boisnard, Fr{\'e}d{\'e}ric and Friedrich, Corentin and Claye, Charlotte and Hoofd, Fran{\c{c}}ois and Bernas, Raphael and Hudelot, C{\'e}line and Jourdan, Fanny},
    journal     = {arXiv preprint arXiv:2512.09730},
    year        = {2025}
}

📝 License

The package is released under the MIT license.
