Add Japanese tokenizer (fugashi) and minimal unit test #223

supernaiter · 2025-07-12T22:53:17Z

What does this PR do?

This PR adds minimal support for Japanese prompt tokenization.

What's included:

- A Japanese tokenizer utility (tokenize_jp) using fugashi + unidic-lite
- A Unicode-based language detector (is_japanese_text) to support lang="auto"
- Minimal unit tests for tokenizer correctness
- setup.py updated with extras_require["ja"] to optionally install the Japanese dependencies
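
For readers who have not opened the diff, here is a rough sketch of how the two helpers listed above could fit together. It is not the code in this PR: only the names tokenize_jp and is_japanese_text come from the description, while the `tagger` parameter, the `tokenize` dispatcher, the kana-only heuristic, and the whitespace fallback are assumptions made for illustration.

```python
import re
from functools import lru_cache

# Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF).
# Assumption: kanji alone is not treated as Japanese, because the same
# code points are shared with Chinese text.
_KANA = re.compile(r"[\u3040-\u309F\u30A0-\u30FF]")


def is_japanese_text(text):
    """Heuristic check: True if the text contains at least one kana character."""
    return bool(_KANA.search(text))


@lru_cache(maxsize=1)
def _default_tagger():
    # Imported lazily so the module still loads when the optional
    # Japanese dependencies (fugashi + unidic-lite) are not installed.
    from fugashi import Tagger
    return Tagger()


def tokenize_jp(text, tagger=None):
    """Split Japanese text into surface forms.

    `tagger` is a hypothetical injection point: any callable that returns
    objects with a `.surface` attribute, which is what fugashi.Tagger yields.
    """
    if tagger is None:
        tagger = _default_tagger()
    return [word.surface for word in tagger(text)]


def tokenize(text, lang="auto", tagger=None):
    """Hypothetical dispatcher showing where lang="auto" could plug in."""
    if lang == "ja" or (lang == "auto" and is_japanese_text(text)):
        return tokenize_jp(text, tagger=tagger)
    # Assumed fallback for non-Japanese text: plain whitespace splitting.
    return text.split()
```

With extras_require["ja"] in setup.py, the optional dependencies can be installed from a checkout with `pip install ".[ja]"`. The kana-only check is deliberately conservative: a kanji-only string such as a place name would fall through to the whitespace path under this sketch.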

This is a first step toward Japanese prompt compression. The PR is designed to be self-contained and safe to merge.
Future work (e.g., integration into compress_prompt) will follow in separate PRs.

Fixes: N/A

Before submitting

- This PR is a new feature.
- Changes are backward-compatible.
- Tests for new functionality are included.
- No documentation changes are needed at this stage.
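
As an illustration of what a minimal tokenizer test might check, here is a sketch using the standard-library unittest module. The actual tests in this PR may look different; the `_load_tagger` and `tokenize` helpers and the test names are hypothetical. The checks are kept independent of any exact segmentation, and the whole case is skipped when the optional Japanese dependencies are missing.

```python
import unittest


def _load_tagger():
    """Return a fugashi Tagger, or None if fugashi or its dictionary is unavailable."""
    try:
        import fugashi
        return fugashi.Tagger()
    except Exception:
        # ImportError when fugashi is missing; fugashi raises a RuntimeError
        # when no dictionary (e.g. unidic-lite) can be found.
        return None


TAGGER = _load_tagger()


def tokenize(text):
    # Hypothetical stand-in for the PR's tokenizer: list of surface forms.
    return [word.surface for word in TAGGER(text)]


@unittest.skipIf(TAGGER is None, "fugashi + unidic-lite not installed")
class TokenizerTest(unittest.TestCase):
    SAMPLE = "私は学生です"

    def test_returns_nonempty_strings(self):
        tokens = tokenize(self.SAMPLE)
        self.assertTrue(tokens)
        self.assertTrue(all(isinstance(t, str) and t for t in tokens))

    def test_round_trip(self):
        # For input without whitespace, the surfaces should concatenate
        # back to the original string.
        self.assertEqual("".join(tokenize(self.SAMPLE)), self.SAMPLE)
```

Skipping rather than failing keeps the default test run green for contributors who have not installed the "ja" extra.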

Who can review?

@iofu728 @SiyunZhao: this is a minimal PR for Japanese support.
I would love your input before following up with lang="ja" integration.

supernaiter · 2025-07-12T22:55:21Z

@microsoft-github-policy-service agree

supernaiter · 2025-07-12T22:58:00Z

Hi @iofu728 @SiyunZhao, this is a minimal PR to support Japanese prompt tokenization.
All tests pass, the CLA is signed, and the PR is self-contained.

Would love your feedback or approval when convenient. Thanks!

Add Japanese tokenizer (fugashi) and minimal unit test

953e5a0

supernaiter marked this pull request as ready for review July 12, 2025 22:56
