This repository provides tools for tokenization, focused on SCRIPT encoding but also supporting UTF-8. It contains implementations of both the BPE and Unigram tokenization algorithms.
For details of the methods, see our papers:
- BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
- Which Pieces Does Unigram Tokenization Really Need?
The repository is organized as follows:
- `pretokenize/`: Pre-tokenizers that handle both chunking and encoding into 'atomic' or 'base' tokens (bytes or script/index pairs); a toy sketch of the script/index idea follows this list
  - `bytes_gpt4` / `bytes_gpt4o`: Classic regex + UTF-8 based tokenizers
  - `bytes_gpt4o_cb`: The same, with character-boundary enforcement
  - `scriptenc_cb`: SCRIPT encoding with character boundaries (the proposed BPE algorithm)
  - `scriptenc_cbi`: SCRIPT encoding with inherited-script enforcement
  - `scriptenc_gpt4o_cb`: Hybrid (regex chunking + SCRIPT encoding)
- `tokenizers/`: Tokenization algorithms; toy sketches of both follow this list
  - `bpe/`: Byte Pair Encoding implementation with multi-worker training
  - `unigram/`: Unigram language model with EM training, a Trie, and lattice-based Viterbi decoding
- `corpus/`: Pretokenized corpus management
  - `PretokenizedCorpus`: Partitioned storage for efficient parallel training
- `analysis/`: Evaluation utilities: compression metrics (see the bytes-per-token sketch below), morphological scoring, and experiment tracking
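As a rough illustration of the script/index idea behind the `scriptenc_*` pre-tokenizers, here is a minimal, self-contained sketch. It is not the repository's implementation: `SCRIPT_RANGES` and `script_encode` are hypothetical names, and real Unicode script data covers far more ranges.

```python
# Toy (script, index) encoding -- NOT the repository's implementation;
# the real encoders live in pretokenize/.

# Hypothetical, drastically simplified script table.
SCRIPT_RANGES = {
    "Latin": (0x0041, 0x007A),
    "Hangul": (0xAC00, 0xD7A3),
    "Han": (0x4E00, 0x9FFF),
}

def script_encode(text: str) -> list[tuple[str, int]]:
    """Map each character to a (script, index-within-range) pair."""
    pairs = []
    for ch in text:
        cp = ord(ch)
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                pairs.append((script, cp - lo))
                break
        else:
            pairs.append(("Common", cp))  # fallback bucket
    return pairs

print(script_encode("한글ab"))
# [('Hangul', 10588), ('Hangul', 512), ('Latin', 32), ('Latin', 33)]
```

Encoding characters this way lets a tokenizer forbid merges that would cross script boundaries, which is roughly the intuition behind the SCRIPT-based pre-tokenizers above.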
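The BPE trainer in `tokenizers/bpe/` adds multi-worker parallelism on top of the standard algorithm. As a reference point only, here is a single-threaded sketch of the textbook merge loop (`train_bpe` is an illustrative name, not the repository's API):

```python
# Toy single-threaded BPE training loop -- a sketch of the textbook
# algorithm, not the multi-worker implementation in tokenizers/bpe/.
from collections import Counter

def train_bpe(words: dict[tuple[str, ...], int], num_merges: int):
    """words maps a word (tuple of symbols) to its corpus frequency."""
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

print(train_bpe({("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}, 2))
# [('l', 'o'), ('lo', 'w')]
```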
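On the Unigram side, decoding reduces to a dynamic program over substring log-probabilities. The sketch below uses a plain dictionary where the repository uses a Trie and Lattice, and `viterbi_segment` is an illustrative name rather than the actual API.

```python
# Toy Viterbi segmentation under a unigram LM -- the idea behind
# tokenizers/unigram/, without its Trie/Lattice machinery.
import math

def viterbi_segment(text: str, logprobs: dict[str, float]) -> list[str]:
    """Return the max-probability segmentation of text into vocab pieces."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start of the final piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logprobs and best[start] + logprobs[piece] > best[end]:
                best[end] = best[start] + logprobs[piece]
                back[end] = start
    # Recover the segmentation by walking the backpointers.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

vocab = {"un": -2.0, "i": -3.0, "gram": -2.5, "ig": -4.0, "ram": -4.0}
print(viterbi_segment("unigram", vocab))  # ['un', 'i', 'gram']
```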
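For the compression metrics, the usual headline number is bytes per token: UTF-8 bytes of the input divided by the number of tokens produced. A one-line sketch (the function name is illustrative; `analysis/` may compute this and other measures differently):

```python
def bytes_per_token(text: str, tokens: list[str]) -> float:
    """Higher is better: more UTF-8 bytes covered per emitted token."""
    return len(text.encode("utf-8")) / len(tokens)

print(bytes_per_token("unigram", ["un", "i", "gram"]))  # 7 / 3 ≈ 2.33
```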
Ensure you have `uv` installed; it should take care of the rest.
To explore the available options for training, run:

```bash
uv run train --help
```

To train a BPE tokenizer:

```bash
uv run train --corpus kor_hang_300mb -n 64000 --pretokenizer scriptenc_cb --model bpe
```

To train a Unigram tokenizer:

```bash
uv run train --corpus kor_hang_300mb -n 64000 --pretokenizer scriptenc_cb --model unigram
```

The `paper_utils/` directory contains scripts to reproduce the paper results from scratch:
- `paper_utils/script_bpe/`: BPE paper reproduction
  - Paper: BPE Stays on SCRIPT
  - `train_monolingual.sh` / `train_multilingual.sh`: Training scripts
  - `monolingual_compression.ipynb` / `multilingual_compression.ipynb`: Analysis notebooks
- `paper_utils/unigram/`: Unigram paper reproduction
  - Paper: Which Pieces Does Unigram Tokenization Really Need?
  - `run_all_experiments.sh`: Run all experiments
  - `generate_main_tables.py` / `generate_appendix_tables.py`: Generate paper tables
  - `train_hyperparameters.py`: Hyperparameter tuning experiments
- An interesting explanation of UTF-8 is given by Computerphile
- For more information on Unicode character properties, refer to the Wikipedia article.