sanderland/script_tok


SCRIPT: Script/Category Representation In (Pre-)Tokenization

This repository provides tokenization tools centered on SCRIPT encoding, with UTF-8 byte encoding supported as well. It includes implementations of both the BPE and Unigram tokenization algorithms.

For details of the methods, see our papers:

Overview

The code covers SCRIPT encoding-based pre-tokenization for both BPE and Unigram, alongside conventional byte-based tokenization.
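To make the contrast between the two kinds of base tokens concrete, here is a minimal sketch that is not taken from the repository. `utf8_base_tokens` and `pair_base_tokens` are hypothetical names. The pair encoding uses the Unicode general category as the group and the raw code point as the index, which is only a stand-in: the actual SCRIPT encoding defines its own script/category blocks and indices.

```python
import unicodedata


def utf8_base_tokens(text: str) -> list[int]:
    """Byte-based base tokens: one integer per UTF-8 byte."""
    return list(text.encode("utf-8"))


def pair_base_tokens(text: str) -> list[tuple[str, int]]:
    """Illustrative (group, index) pairs, one pair per character.

    ASSUMPTION: group = Unicode general category, index = code point.
    The repository's SCRIPT encoding uses its own blocks and indices.
    """
    return [(unicodedata.category(ch), ord(ch)) for ch in text]


text = "한a"
# Four byte tokens: three for the Hangul syllable, one for "a".
print(utf8_base_tokens(text))
# Two pairs, one per character, regardless of UTF-8 byte length.
print(pair_base_tokens(text))
```

The point of the sketch is only the shape of the output: with bytes, a character's token count depends on its UTF-8 length, while a pair encoding spends the same number of base tokens on every character.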

Core Modules (script_bpe/)

  • pretokenize/: Pre-tokenizers that handle both chunking and encoding to 'atomic' or 'base' tokens (bytes or script/index pairs)

    • bytes_gpt4/bytes_gpt4o: Classic regex + UTF-8 based tokenizer
    • bytes_gpt4o_cb: bytes_gpt4o with character boundary enforcement
    • scriptenc_cb: SCRIPT encoding with character boundaries (proposed BPE algorithm)
    • scriptenc_cbi: SCRIPT encoding with inherited script enforcement
    • scriptenc_gpt4o_cb: Hybrid (regex chunking + script encoding)
  • tokenizers/: Tokenization algorithms

    • bpe/: Byte Pair Encoding implementation with multi-worker training
    • unigram/: Unigram language model with EM training, Trie, and Lattice-based Viterbi decoding
  • corpus/: Pretokenized corpus management

    • PretokenizedCorpus: Partitioned storage for efficient parallel training
  • analysis/: Evaluation utilities

    • Compression metrics, morphological scoring, experiment tracking
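As an illustration of what a compression metric can look like, the sketch below computes UTF-8 bytes per token over a small set of texts. This is an assumption about one common metric, not the repository's analysis code; `bytes_per_token` and the whitespace `encode` function are hypothetical.

```python
from typing import Callable, Iterable, Sequence


def bytes_per_token(texts: Iterable[str], encode: Callable[[str], Sequence]) -> float:
    """Average number of UTF-8 bytes covered by each token (higher = more compression)."""
    total_bytes = 0
    total_tokens = 0
    for text in texts:
        total_bytes += len(text.encode("utf-8"))
        total_tokens += len(encode(text))
    if total_tokens == 0:
        raise ValueError("no tokens produced; cannot compute bytes per token")
    return total_bytes / total_tokens


# Stand-in tokenizer for the example: split on whitespace.
sample = ["hello world", "한국어 토크나이저"]
print(bytes_per_token(sample, str.split))
```

Any tokenizer that maps a string to a sequence of tokens can be passed as `encode`, so the same function can compare byte-based and SCRIPT-based tokenizers on the same text.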

Usage

Installation

Ensure you have uv installed; it should take care of the rest.

Training

To explore the available options for training, run:

uv run train --help

To train a BPE tokenizer:

uv run train --corpus kor_hang_300mb -n 64000 --pretokenizer scriptenc_cb --model bpe
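For readers unfamiliar with what BPE training does, here is a minimal single-process sketch of the textbook merge loop over pre-tokenized byte chunks. It is not the repository's implementation: it has no multi-worker support and no character boundary constraints, and `train_bpe` and `merge_pair` are hypothetical names.

```python
from collections import Counter


def merge_pair(word: tuple, pair: tuple) -> tuple:
    """Replace every non-overlapping occurrence of `pair` in `word` with the merged token."""
    merged = pair[0] + pair[1]
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def train_bpe(chunks: list[bytes], num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` merges; each chunk starts as a sequence of single bytes."""
    words = Counter(tuple(bytes([b]) for b in chunk) for chunk in chunks)
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by chunk frequency.
        pair_counts = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            new_words[merge_pair(word, best)] += freq
        words = new_words
    return merges


print(train_bpe([b"low", b"low", b"lower", b"lowest"], 3))
```

Merges never cross chunk boundaries here, which is why the chunking done by the pre-tokenizer shapes the resulting vocabulary.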

To train a Unigram tokenizer:

uv run train --corpus kor_hang_300mb -n 64000 --pretokenizer scriptenc_cb --model unigram
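Once a Unigram model is trained, text is segmented by finding the highest-probability split into vocabulary pieces. The sketch below shows that Viterbi step with a plain dictionary lookup over characters. It is an illustration only: the repository describes a Trie and Lattice for this, and `viterbi_segment` and the toy vocabulary are hypothetical.

```python
import math


def viterbi_segment(text: str, log_probs: dict[str, float]) -> list[str]:
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    max_len = max((len(piece) for piece in log_probs), default=0)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Toy vocabulary with made-up probabilities.
vocab = {p: math.log(q) for p, q in {
    "h": 0.1, "e": 0.1, "l": 0.1, "o": 0.1, "he": 0.2, "llo": 0.2, "hell": 0.05,
}.items()}
print(viterbi_segment("hello", vocab))
```

EM training would re-estimate the piece probabilities from such segmentations; that part is omitted here.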

Reproducing Paper Results

The paper_utils/ directory contains scripts to reproduce paper results from scratch:

  • paper_utils/script_bpe/: BPE paper reproduction

    • Paper: BPE Stays on SCRIPT
    • train_monolingual.sh / train_multilingual.sh: Training scripts
    • monolingual_compression.ipynb / multilingual_compression.ipynb: Analysis notebooks
  • paper_utils/unigram/: Unigram paper reproduction

Sources

About

Code for the paper "BPE stays on SCRIPT"
