- `src`: main source code with model and dataset implementations, plus code to train, test, or run inference with a model.
- `notebooks`: notebooks with experiments and visualizations.
- `scripts`: various useful scripts, e.g. to print dataset examples or evaluate existing models.
- `tests`: unit tests.
Create a virtual environment with venv or conda and install the requirements:
```shell
pip install -r requirements.txt
```
For contributions, also install the dev requirements:
```shell
pip install -r requirements-dev.txt
```
For data, we use the `.jsonl` format.
Each line is a JSON object with the following fields: text, label.
For example:
{"text": "скотина! что сказать", "label": 1}
To train a tokenizer on your data, use the `scripts.train_tokenizer` script:
```shell
python -m scripts.train_tokenizer \
    --data resources/data/dataset.jsonl \
    --save-to resources/tokenizer
```
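If you want to inspect the trained tokenizer from Python, a minimal sketch is below. It assumes the script saves a Hugging Face `tokenizers` JSON file; the file name is a guess, so check `scripts.train_tokenizer` for the actual layout:
```python
from tokenizers import Tokenizer

# Assumption: scripts.train_tokenizer writes a Hugging Face tokenizers JSON file;
# "tokenizer.json" is a hypothetical file name, adjust to whatever the script saves.
tokenizer = Tokenizer.from_file("resources/tokenizer/tokenizer.json")
encoding = tokenizer.encode("скотина! что сказать")
print(encoding.tokens)
print(encoding.ids)
```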
To generate a noisy dataset, i.e. replace characters with visually similar ones, use the `scripts.generate_noisy_dataset` script (see it for details about the arguments):
```shell
python -m scripts.generate_noisy_dataset \
    --data resources/data/dataset.jsonl \
    --save-to resources/data/noisy_dataset.jsonl
```
In the noisy dataset, each sample also contains the class of each word. For example:
{"text": [["cкотина", 0], ["!", 0], ["что", 0], ["сказать", 0]], "label": 1}
There are 4 levels of replacements (a toy sketch of the first level follows this list):
- Replace characters with visually similar numbers, e.g. "o" -> "0". Full mapping: letters1.json.
- Replace characters with visually similar symbols or symbols from another language, e.g. "a" -> "@". Full mapping: letters2.json.
- Replace characters with a sequence of symbols, e.g. "ж" -> "}|{". Full mapping: letters3.json.
- Replace characters with a character from the same cluster. Clustering is based on visual similarity between characters in the specified font. Use `scripts.clusterization` to build clusters before applying this augmentation to the data.
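As a toy illustration of a level-1 replacement (the mapping below is a made-up fragment for demonstration, not the contents of letters1.json):
```python
# Toy sketch of level-1 noise: replace characters with visually similar digits.
# This mapping is an illustrative fragment only; the real one lives in letters1.json.
LEVEL1_MAPPING = {"o": "0", "о": "0", "з": "3", "ч": "4"}

def add_noise(word: str, mapping: dict[str, str]) -> str:
    return "".join(mapping.get(ch, ch) for ch in word)

print(add_noise("хорошо", LEVEL1_MAPPING))  # -> "х0р0ш0"
```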
Download dataset from Kaggle: Toxic Russian Comments.
It is better to put it in the `resources/data` folder.
Use `scripts.prepare_ok_dataset` to convert the dataset to the `.jsonl` format:
```shell
python -m scripts.prepare_ok_dataset \
    --data resources/data/dataset.txt \
    --save-to resources/data/dataset.jsonl
```
Example:
From: `__label__INSULT скотина! что сказать`
To: `{"text": "скотина! что сказать", "label": 1}`
For now, we support 2 models:
- Vanilla BERT model, see `src.models.transformer_encoder` for implementation details.
- VTR-based Transformer model, see `src.models.vtr` for implementation details. This model uses convolutions to extract features from visual token representations and passes them as embeddings to the vanilla Transformer.
Each model has 2 variants:
- Sequence classification via the `[CLS]` token and an MLP.
- Sequence labeling (suffix `SL`), where each token is passed to an MLP. Both variants are sketched below.
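A minimal sketch of the two heads, assuming the encoder returns hidden states of shape `(batch, seq_len, hidden)`; class names, sizes, and the two-layer MLP are illustrative assumptions, not the code from `src.models`:
```python
import torch
from torch import nn

class ClassificationHead(nn.Module):
    """Sequence classification: take the [CLS] (first) position and run it through an MLP."""
    def __init__(self, hidden: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.mlp(hidden_states[:, 0])  # one set of logits per sequence

class SequenceLabelingHead(nn.Module):
    """Sequence labeling (SL): every token position is passed through the same MLP."""
    def __init__(self, hidden: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.mlp(hidden_states)  # one set of logits per token
```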
You can study how convolutions on visual tokens work in `src.models.vtr.embedder`.
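The idea can be illustrated with a rough sketch: each token is rendered as a small grayscale image, a convolutional stack turns it into a feature map, and the flattened features are projected to the Transformer's embedding size. This is not the code from `src.models.vtr.embedder`; the image size, channels, and layer choices below are made-up placeholders:
```python
import torch
from torch import nn

class VisualTokenEmbedder(nn.Module):
    """Illustrative sketch: turn rendered token images into Transformer embeddings.

    Assumes tokens are already rendered as (batch, seq_len, height, width) grayscale
    images; all sizes here are arbitrary placeholders, not the repo's real config.
    """
    def __init__(self, height: int = 16, width: int = 32, emb_size: int = 768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.proj = nn.Linear(64 * (height // 4) * (width // 4), emb_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        batch, seq_len, h, w = images.shape
        feats = self.conv(images.view(batch * seq_len, 1, h, w))  # conv features per token
        feats = feats.flatten(start_dim=1)                        # (batch*seq_len, C*H'*W')
        return self.proj(feats).view(batch, seq_len, -1)          # embeddings for the Transformer
```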

First of all, we use wandb to log metrics and artifacts, so you need to create an account and log in:
```shell
wandb login
```
To run training, use `src.main`:
```shell
python -m src.main --vtr --sl \
    --train-data resources/data/noisy_dataset.jsonl \
    --tokenizer resources/tokenizer
```
See:
- `src.main` for basic arguments, e.g. data paths and model type.
- `src.utils.config` for training, model, and VTR configurations.
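For reference, logging metrics to wandb from a training loop generally looks like the sketch below; the project name and metric keys are placeholders, and the actual logging happens inside `src.main` and the training code, not here:
```python
import wandb

# Minimal wandb usage sketch; project name and metric names are placeholders.
run = wandb.init(project="toxic-comments-vtr", config={"model": "vtr", "variant": "sl"})
for step in range(3):
    wandb.log({"train/loss": 1.0 / (step + 1)}, step=step)
run.finish()
```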
- Sequence classification:
| Model | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| BERT | | | | |
| VTR | | | | |
- Sequence labeling:
| Model | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| BERT-SL | | | | |
| VTR-SL | | | | |
