📘 Note: This repository is adapted from the official CLIP Text Decomposition repo by Yossi Gandelsman, Alexei A. Efros, and Jacob Steinhardt, with custom support for CIFAR-100 and transformed datasets.

For instructions on running the finetuning scripts, please see the README in the finetune/ subdirectory.

Interpreting CLIP's Image Representation via Text-Based Decomposition

Official PyTorch Implementation

Yossi Gandelsman, Alexei A. Efros and Jacob Steinhardt

[Teaser figure]

🔥 Check out our latest work on interpreting neurons in CLIP with text.

Setup

We provide an environment.yml file that can be used to create a Conda environment:

conda env create -f environment.yml
conda activate prsclip

Preprocessing

To obtain the projected residual stream components, including the contributions from the multi-head attentions and MLPs, for your dataset (e.g., ImageNet, or this fork's CIFAR-100 and grayscale-transformed variants), please run one of the following. Make sure --data_path points to the directory containing your dataset images or .npy features:

python compute_prs.py --dataset cifar100_grayscale --device cuda:0 --model ViT-B-16 --pretrained laion2b_s34b_b88k --data_path ./data/cifar100_grayscale

python compute_prs.py --dataset imagenet --device cuda:0 --model ViT-L-14 --pretrained laion2b_s32b_b82k --data_path <PATH>
python compute_prs.py --dataset imagenet --device cuda:0 --model ViT-B-16 --pretrained laion2b_s34b_b88k --data_path <PATH>
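As a sanity check, the saved decomposition can be inspected directly. The sketch below is illustrative only: the file name and array layout are assumptions, so check the output directory of compute_prs.py for the actual naming scheme.

```python
# Illustrative only: file name and array layout are assumptions, not the
# repo's guaranteed output format.
import numpy as np

# Assumed layout: [num_images, num_layers, num_heads, embed_dim], i.e. each
# attention head's additive share of the final projected image representation.
attn = np.load("output_dir/imagenet_attn_ViT-B-16.npy")  # hypothetical path

# The decomposition is additive, so summing over layers and heads (plus the
# MLP and class-token terms) should recover the full image representation.
partial_rep = attn.sum(axis=(1, 2))
print(attn.shape, partial_rep.shape)
```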


To precompute the text representations of the ImageNet classes, please run:
python compute_text_projection.py  --dataset imagenet --device cuda:0 --model ViT-H-14 --pretrained laion2b_s32b_b79k
python compute_text_projection.py  --dataset imagenet --device cuda:0 --model ViT-L-14 --pretrained laion2b_s32b_b82k
python compute_text_projection.py  --dataset imagenet --device cuda:0 --model ViT-B-16 --pretrained laion2b_s34b_b88k
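These class embeddings are typically combined with the precomputed image representations for zero-shot classification. A minimal sketch of that use, with hypothetical file names:

```python
# Illustrative names; the actual files come from compute_prs.py /
# compute_text_projection.py and may be named differently.
import numpy as np

image_reps = np.load("imagenet_image_reps.npy")   # [N, D], hypothetical
class_embs = np.load("imagenet_classifier.npy")   # [C, D], hypothetical

# L2-normalize both sides so the dot product is cosine similarity.
image_reps = image_reps / np.linalg.norm(image_reps, axis=-1, keepdims=True)
class_embs = class_embs / np.linalg.norm(class_embs, axis=-1, keepdims=True)

pred = (image_reps @ class_embs.T).argmax(axis=-1)  # [N] predicted classes
```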

Mean-ablations

To verify that the MLPs and the attention from the class token to itself can be mean-ablated, please run:

python compute_ablations.py --model ViT-H-14
python compute_ablations.py --model ViT-L-14
python compute_ablations.py --model ViT-B-16
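The idea behind mean-ablation, in a nutshell: replace a component's per-image contribution with its mean over the dataset, so it carries no image-specific information, then re-measure zero-shot accuracy. A minimal sketch of the ablation itself (not the repo's exact code):

```python
# Not the repo's exact code; a minimal illustration of mean-ablation.
import numpy as np

def mean_ablate(contribs: np.ndarray) -> np.ndarray:
    """contribs: [N, D] per-image contribution of one component (e.g. an MLP).
    Returns the same-shaped array with every row replaced by the dataset
    mean, so the component carries no image-specific information."""
    return np.broadcast_to(contribs.mean(axis=0, keepdims=True), contribs.shape)

# The full representation is the sum of all components; to mean-ablate the
# MLPs, swap their rows for the mean before re-summing and re-evaluating
# zero-shot accuracy.
```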

Convert text labels to representations

To convert the text labels for TextSpan to CLIP text representations, please run:

python compute_text_set_projection.py --device cuda:0 --model ViT-L-14 --pretrained laion2b_s32b_b82k --data_path text_descriptions/google_3498_english.txt
python compute_text_set_projection.py --device cuda:0 --model ViT-L-14 --pretrained laion2b_s32b_b82k --data_path text_descriptions/image_descriptions_general.txt

ImageNet segmentation

Please download the ImageNet-segmentation dataset:

mkdir imagenet_seg
cd imagenet_seg
wget http://calvin-vision.net/bigstuff/proj-imagenet/data/gtsegs_ijcv.mat

To get the evaluation results, please run:

python compute_segmentations.py --device cuda:0 --model ViT-H-14 --pretrained laion2b_s32b_b79k --data_path imagenet_seg/gtsegs_ijcv.mat --save_img
python compute_segmentations.py --device cuda:0 --model ViT-L-14 --pretrained laion2b_s32b_b82k --data_path imagenet_seg/gtsegs_ijcv.mat --save_img
python compute_segmentations.py --device cuda:0 --model ViT-B-16 --pretrained laion2b_s34b_b88k --data_path imagenet_seg/gtsegs_ijcv.mat --save_img

The --save_img flag saves the visualized segmentation results.
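The evaluation builds on the spatial decomposition: each image token contributes additively to the representation, so dotting the per-token contributions with a class text embedding yields a relevance heatmap over the patch grid. A toy sketch, where shapes and names are assumptions:

```python
# Toy illustration with dummy data; real per-token contributions come from
# the repo's spatial decomposition (see demo.ipynb).
import numpy as np

tokens = np.random.randn(196, 512)  # [num_patches, D] per-token contributions (dummy)
text = np.random.randn(512)         # [D] class text embedding (dummy)

heatmap = (tokens @ text).reshape(14, 14)  # 14x14 patch grid for ViT-B-16 at 224px
mask = heatmap > heatmap.mean()            # crude foreground/background threshold
```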

TextSpan

To find meaningful directions for all the attention heads, run:

python compute_complete_text_set.py --device cuda:0 --model ViT-B-16 --texts_per_head 20 --num_of_last_layers 4 --text_descriptions image_descriptions_general
python compute_complete_text_set.py --device cuda:0 --model ViT-L-14 --texts_per_head 20 --num_of_last_layers 4 --text_descriptions image_descriptions_general
python compute_complete_text_set.py --device cuda:0 --model ViT-H-14 --texts_per_head 20 --num_of_last_layers 4 --text_descriptions image_descriptions_general
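For reference, a condensed sketch of the greedy TextSpan step described in the paper: repeatedly pick the text direction that explains the most variance of a head's outputs, then project that direction out of both the outputs and the candidate pool. This is simplified for clarity; compute_complete_text_set.py is the authoritative implementation.

```python
# Condensed sketch of the greedy TextSpan procedure; simplified, not the
# repo's exact implementation.
import numpy as np

def text_span(head_out: np.ndarray, texts: np.ndarray, m: int) -> list[int]:
    """head_out: [N, D] per-image outputs of one head (centered internally).
    texts: [T, D] candidate text embeddings. Returns indices of m texts."""
    A = head_out - head_out.mean(axis=0, keepdims=True)
    R = texts.copy()
    chosen = []
    for _ in range(m):
        # Score each remaining direction by the variance it explains.
        proj = A @ R.T                  # [N, T]
        scores = (proj ** 2).sum(axis=0)
        j = int(scores.argmax())
        chosen.append(j)
        d = R[j] / np.linalg.norm(R[j])
        # Project the selected direction out of the outputs and the pool.
        A -= np.outer(A @ d, d)
        R -= np.outer(R @ d, d)
    return chosen
```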

Other datasets

To download the Waterbirds dataset, run:

wget https://nlp.stanford.edu/data/dro/waterbird_complete95_forest2water2.tar.gz
tar -xf waterbird_complete95_forest2water2.tar.gz

To compute the overall accuracy, run:

python compute_prs.py --dataset binary_waterbirds --device cuda:0 --model ViT-L-14 --pretrained laion2b_s32b_b82k --data_path <PATH>
python compute_text_projection.py  --dataset binary_waterbirds --device cuda:0 --model ViT-L-14 --pretrained laion2b_s32b_b82k
python compute_use_specific_heads.py --model ViT-L-14 --dataset binary_waterbirds 
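compute_use_specific_heads.py evaluates classification using only a subset of heads rather than the full representation. A hypothetical sketch of that idea, with placeholder head indices and file names:

```python
# Hypothetical sketch; (layer, head) indices and file names are placeholders.
import numpy as np

attn = np.load("waterbirds_attn.npy")           # [N, L, H, D], hypothetical
class_embs = np.load("waterbirds_classes.npy")  # [2, D], hypothetical

selected = [(22, 12), (23, 3)]                  # illustrative head choices
rep = sum(attn[:, l, h] for l, h in selected)   # keep only the chosen heads
rep = rep / np.linalg.norm(rep, axis=-1, keepdims=True)
pred = (rep @ class_embs.T).argmax(axis=-1)     # [N] predicted labels
```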

Spatial decomposition

Please see a demo for the spatial decomposition of CLIP in demo.ipynb.

Nearest neighbors search

Please see the nearest neighbors search demo in nns.ipynb.

BibTeX

@inproceedings{
      gandelsman2024interpreting,
      title={Interpreting {CLIP}'s Image Representation via Text-Based Decomposition},
      author={Yossi Gandelsman and Alexei A. Efros and Jacob Steinhardt},
      booktitle={The Twelfth International Conference on Learning Representations},
      year={2024},
      url={https://openreview.net/forum?id=5Ca9sSzuDp}
}
