@supernaiter

What does this PR do?

This PR adds minimal support for Japanese prompt tokenization.

What's included:

  • A Japanese tokenizer utility (tokenize_jp) using fugashi + unidic-lite
  • A Unicode-based language detector (is_japanese_text) to support lang="auto"
  • Minimal unit tests for tokenizer correctness
  • setup.py updated with extras_require["ja"] to optionally install Japanese dependencies
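
To illustrate the Unicode-based detection idea in the list above, here is a minimal sketch of how a detector like `is_japanese_text` might work. It is not the code in this PR: the character ranges, the `threshold` parameter, and its default of 0.3 are assumptions chosen for illustration.

```python
# Hypothetical sketch of a Unicode-range language check; not the PR's actual code.
# The ranges below are standard Unicode blocks; which ones to include is an assumption.
_JA_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (kanji; also used by Chinese)
    (0xFF66, 0xFF9F),  # Halfwidth Katakana
)


def is_japanese_text(text: str, threshold: float = 0.3) -> bool:
    """Return True if at least `threshold` of non-space characters look Japanese.

    `threshold` is an illustrative parameter, not taken from the PR.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    hits = sum(
        1 for c in chars
        if any(lo <= ord(c) <= hi for lo, hi in _JA_RANGES)
    )
    return hits / len(chars) >= threshold
```

Because the kanji block is shared with Chinese, a check like this would also flag purely Chinese text; requiring at least one kana character is one possible refinement if that distinction matters for `lang="auto"`.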

This is the first step toward enabling Japanese prompt compression, designed to be self-contained and safe to merge.
Future work (e.g., integration into compress_prompt) will follow as separate PRs.


Fixes: N/A

Before submitting

  • This PR is a new feature.
  • Changes are backward-compatible.
  • Tests for new functionality are included.
  • No documentation changes are needed at this stage.

Who can review?

@iofu728 @SiyunZhao — this is a minimal PR for Japanese support.
Would love your input before we follow up with lang="ja" integration.

@supernaiter
Author

@microsoft-github-policy-service agree

@supernaiter marked this pull request as ready for review on July 12, 2025, 22:56
@supernaiter
Author

Hi @iofu728 @SiyunZhao — This is a minimal PR to support Japanese prompt tokenization.
All tests passed, CLA is signed, and the PR is self-contained.

Would love your feedback or approval when convenient. Thanks!
