This repository provides tools for tokenization, focused on SCRIPT encoding but also supporting UTF-8. It contains implementations of both the BPE and Unigram tokenization algorithms.
For details of the methods, see our papers:
- BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
- Which Pieces Does Unigram Tokenization Really Need?
The repository is organized as follows:
- `pretokenize/`: Pre-tokenizers that handle both chunking and encoding into 'atomic' or 'base' tokens (bytes or script/index pairs); a toy sketch of the script/index idea follows this list
  - `bytes_gpt4` / `bytes_gpt4o`: Classic regex + UTF-8 based tokenizers
  - `bytes_gpt4o_cb`: The same, with character-boundary enforcement
  - `scriptenc_cb`: SCRIPT encoding with character boundaries (the proposed BPE algorithm)
  - `scriptenc_cbi`: SCRIPT encoding with inherited-script enforcement
  - `scriptenc_gpt4o_cb`: Hybrid (regex chunking + SCRIPT encoding)
- `tokenizers/`: Tokenization algorithms; toy sketches of both follow this list
  - `bpe/`: Byte Pair Encoding implementation with multi-worker training
  - `unigram/`: Unigram language model with EM training, a Trie, and lattice-based Viterbi decoding
- `corpus/`: Pretokenized corpus management
  - `PretokenizedCorpus`: Partitioned storage for efficient parallel training
- `analysis/`: Evaluation utilities: compression metrics (see the bytes-per-token sketch below), morphological scoring, and experiment tracking
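As a rough illustration of the script/index idea behind the `scriptenc_*` pre-tokenizers, here is a minimal, self-contained sketch. It is not the repository's implementation: `SCRIPT_RANGES` and `script_encode` are hypothetical names, and real Unicode script data covers far more ranges.

```python
# Toy (script, index) encoding -- NOT the repository's implementation;
# the real encoders live in pretokenize/.

# Hypothetical, drastically simplified script table.
SCRIPT_RANGES = {
    "Latin": (0x0041, 0x007A),
    "Hangul": (0xAC00, 0xD7A3),
    "Han": (0x4E00, 0x9FFF),
}

def script_encode(text: str) -> list[tuple[str, int]]:
    """Map each character to a (script, index-within-range) pair."""
    pairs = []
    for ch in text:
        cp = ord(ch)
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                pairs.append((script, cp - lo))
                break
        else:
            pairs.append(("Common", cp))  # fallback bucket
    return pairs

print(script_encode("한글ab"))
# [('Hangul', 10588), ('Hangul', 512), ('Latin', 32), ('Latin', 33)]
```

Encoding characters this way lets a tokenizer forbid merges that would cross script boundaries, which is roughly the intuition behind the SCRIPT-based pre-tokenizers above.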
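The BPE trainer in `tokenizers/bpe/` adds multi-worker parallelism on top of the standard algorithm. As a reference point only, here is a single-threaded sketch of the textbook merge loop (`train_bpe` is an illustrative name, not the repository's API):

```python
# Toy single-threaded BPE training loop -- a sketch of the textbook
# algorithm, not the multi-worker implementation in tokenizers/bpe/.
from collections import Counter

def train_bpe(words: dict[tuple[str, ...], int], num_merges: int):
    """words maps a word (tuple of symbols) to its corpus frequency."""
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

print(train_bpe({("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}, 2))
# [('l', 'o'), ('lo', 'w')]
```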
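On the Unigram side, decoding reduces to a dynamic program over substring log-probabilities. The sketch below uses a plain dictionary where the repository uses a Trie and Lattice, and `viterbi_segment` is an illustrative name rather than the actual API.

```python
# Toy Viterbi segmentation under a unigram LM -- the idea behind
# tokenizers/unigram/, without its Trie/Lattice machinery.
import math

def viterbi_segment(text: str, logprobs: dict[str, float]) -> list[str]:
    """Return the max-probability segmentation of text into vocab pieces."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start of the final piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logprobs and best[start] + logprobs[piece] > best[end]:
                best[end] = best[start] + logprobs[piece]
                back[end] = start
    # Recover the segmentation by walking the backpointers.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

vocab = {"un": -2.0, "i": -3.0, "gram": -2.5, "ig": -4.0, "ram": -4.0}
print(viterbi_segment("unigram", vocab))  # ['un', 'i', 'gram']
```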
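For the compression metrics, the usual headline number is bytes per token: UTF-8 bytes of the input divided by the number of tokens produced. A one-line sketch (the function name is illustrative; `analysis/` may compute this and other measures differently):

```python
def bytes_per_token(text: str, tokens: list[str]) -> float:
    """Higher is better: more UTF-8 bytes covered per emitted token."""
    return len(text.encode("utf-8")) / len(tokens)

print(bytes_per_token("unigram", ["un", "i", "gram"]))  # 7 / 3 ≈ 2.33
```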
Ensure you have `uv` installed; it should take care of the rest.
To explore the available options for training, run:

```bash
uv run train --help
```

To train a BPE tokenizer:

```bash
uv run train --corpus kor_hang_300mb -n 64000 --pretokenizer scriptenc_cb --model bpe
```

To train a Unigram tokenizer:

```bash
uv run train --corpus kor_hang_300mb -n 64000 --pretokenizer scriptenc_cb --model unigram
```

The `paper_utils/` directory contains scripts to reproduce the paper results from scratch:
- `paper_utils/script_bpe/`: BPE paper reproduction
  - Paper: BPE Stays on SCRIPT
  - `train_monolingual.sh` / `train_multilingual.sh`: Training scripts
  - `monolingual_compression.ipynb` / `multilingual_compression.ipynb`: Analysis notebooks
- `paper_utils/unigram/`: Unigram paper reproduction
  - Paper: Which Pieces Does Unigram Tokenization Really Need?
  - `run_all_experiments.sh`: Run all experiments
  - `generate_main_tables.py` / `generate_appendix_tables.py`: Generate paper tables
  - `train_hyperparameters.py`: Hyperparameter tuning experiments
- An interesting explanation of UTF-8 is given by Computerphile
- For more information on Unicode character properties, refer to the Wikipedia article.