- `src`: main source code with model and dataset implementations, plus code to train, test, or run inference with a model.
- `notebooks`: notebooks with experiments and visualizations.
- `scripts`: various useful scripts, e.g. to print dataset examples or evaluate existing models.
- `tests`: unit tests.
Create a virtual environment with venv or conda and install the requirements:
```shell
pip install -r requirements.txt
```
For contributions, also install the dev requirements:
```shell
pip install -r requirements-dev.txt
```
For data, we use the `.jsonl` format.
Each line is a JSON object with the following fields: text, label.
For example:
{"text": "скотина! что сказать", "label": 1}
To train a tokenizer on your data, use the `scripts.train_tokenizer` script:
```shell
python -m scripts.train_tokenizer \
    --data resources/data/dataset.jsonl \
    --save-to resources/tokenizer
```
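If you want to inspect the trained tokenizer from Python, a minimal sketch is below. It assumes the script saves a Hugging Face `tokenizers` JSON file; the file name is a guess, so check `scripts.train_tokenizer` for the actual layout:
```python
from tokenizers import Tokenizer

# Assumption: scripts.train_tokenizer writes a Hugging Face tokenizers JSON file;
# "tokenizer.json" is a hypothetical file name, adjust to whatever the script saves.
tokenizer = Tokenizer.from_file("resources/tokenizer/tokenizer.json")
encoding = tokenizer.encode("скотина! что сказать")
print(encoding.tokens)
print(encoding.ids)
```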
To generate a noisy dataset, i.e. replace characters with visually similar ones, use the `scripts.generate_noisy_dataset` script (see it for details about the arguments):
```shell
python -m scripts.generate_noisy_dataset \
    --data resources/data/dataset.jsonl \
    --save-to resources/data/noisy_dataset.jsonl
```
In the noisy dataset, each sample also contains the class of each word. For example:
{"text": [["cкотина", 0], ["!", 0], ["что", 0], ["сказать", 0]], "label": 1}
There are 4 levels of replacements (a toy sketch of the first level follows this list):
- Replace characters with visually similar numbers, e.g. "o" -> "0". Full mapping: letters1.json.
- Replace characters with visually similar symbols or symbols from another language, e.g. "a" -> "@". Full mapping: letters2.json.
- Replace characters with a sequence of symbols, e.g. "ж" -> "}|{". Full mapping: letters3.json.
- Replace characters with a character from the same cluster. Clustering is based on visual similarity between characters in the specified font. Use `scripts.clusterization` to build clusters before applying this augmentation to the data.
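As a toy illustration of a level-1 replacement (the mapping below is a made-up fragment for demonstration, not the contents of letters1.json):
```python
# Toy sketch of level-1 noise: replace characters with visually similar digits.
# This mapping is an illustrative fragment only; the real one lives in letters1.json.
LEVEL1_MAPPING = {"o": "0", "о": "0", "з": "3", "ч": "4"}

def add_noise(word: str, mapping: dict[str, str]) -> str:
    return "".join(mapping.get(ch, ch) for ch in word)

print(add_noise("хорошо", LEVEL1_MAPPING))  # -> "х0р0ш0"
```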
Download dataset from Kaggle: Toxic Russian Comments.
It is better to put it in the `resources/data` folder.
Use `scripts.prepare_ok_dataset` to convert the dataset to the `.jsonl` format:
```shell
python -m scripts.prepare_ok_dataset \
    --data resources/data/dataset.txt \
    --save-to resources/data/dataset.jsonl
```
Example:
From: `__label__INSULT скотина! что сказать`
To: `{"text": "скотина! что сказать", "label": 1}`
For now, we support 2 models:
- Vanilla BERT model, see `src.models.transformer_encoder` for implementation details.
- VTR-based Transformer model, see `src.models.vtr` for implementation details. This model uses convolutions to extract features from visual token representations and passes them as embeddings to the vanilla Transformer.
Each model has 2 variants:
- Sequence classification via the `[CLS]` token and an MLP.
- Sequence labeling (suffix `SL`), where each token is passed to an MLP. Both variants are sketched below.
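A minimal sketch of the two heads, assuming the encoder returns hidden states of shape `(batch, seq_len, hidden)`; class names, sizes, and the two-layer MLP are illustrative assumptions, not the code from `src.models`:
```python
import torch
from torch import nn

class ClassificationHead(nn.Module):
    """Sequence classification: take the [CLS] (first) position and run it through an MLP."""
    def __init__(self, hidden: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.mlp(hidden_states[:, 0])  # one set of logits per sequence

class SequenceLabelingHead(nn.Module):
    """Sequence labeling (SL): every token position is passed through the same MLP."""
    def __init__(self, hidden: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.mlp(hidden_states)  # one set of logits per token
```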
You can study how convolutions on visual tokens work in `src.models.vtr.embedder`.
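The idea can be illustrated with a rough sketch: each token is rendered as a small grayscale image, a convolutional stack turns it into a feature map, and the flattened features are projected to the Transformer's embedding size. This is not the code from `src.models.vtr.embedder`; the image size, channels, and layer choices below are made-up placeholders:
```python
import torch
from torch import nn

class VisualTokenEmbedder(nn.Module):
    """Illustrative sketch: turn rendered token images into Transformer embeddings.

    Assumes tokens are already rendered as (batch, seq_len, height, width) grayscale
    images; all sizes here are arbitrary placeholders, not the repo's real config.
    """
    def __init__(self, height: int = 16, width: int = 32, emb_size: int = 768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.proj = nn.Linear(64 * (height // 4) * (width // 4), emb_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        batch, seq_len, h, w = images.shape
        feats = self.conv(images.view(batch * seq_len, 1, h, w))  # conv features per token
        feats = feats.flatten(start_dim=1)                        # (batch*seq_len, C*H'*W')
        return self.proj(feats).view(batch, seq_len, -1)          # embeddings for the Transformer
```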

First of all, we use wandb to log metrics and artifacts, so you need to create an account and log in:
```shell
wandb login
```
To run training, use `src.main`:
```shell
python -m src.main --vtr --sl \
    --train-data resources/data/noisy_dataset.jsonl \
    --tokenizer resources/tokenizer
```
See:
- `src.main` for basic arguments, e.g. data paths and model type.
- `src.utils.config` for training, model, and VTR configurations.
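For reference, logging metrics to wandb from a training loop generally looks like the sketch below; the project name and metric keys are placeholders, and the actual logging happens inside `src.main` and the training code, not here:
```python
import wandb

# Minimal wandb usage sketch; project name and metric names are placeholders.
run = wandb.init(project="toxic-comments-vtr", config={"model": "vtr", "variant": "sl"})
for step in range(3):
    wandb.log({"train/loss": 1.0 / (step + 1)}, step=step)
run.finish()
```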
- Sequence classification:
| Model | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| BERT | | | | |
| VTR | | | | |
- Sequence labeling:
| Model | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| BERT-SL | | | | |
| VTR-SL | | | | |
