Fix two out-of-bounds read issues when handling truncated UTF-8 input #1005

frankslin · 2026-01-09T06:30:51Z

Two independent out-of-bounds read issues were identified in OpenCC's UTF-8 processing logic when handling malformed or truncated UTF-8 sequences.

1) MaxMatchSegmentation: NextCharLength() could return a value larger than the remaining input size. The previous logic subtracted this value from a size_t length counter, potentially causing underflow and subsequent out-of-bounds reads.
2) Conversion: Similar length handling could allow reads past the end of the input buffer during dictionary matching, potentially propagating unintended bytes to the conversion output.
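
To make the underflow concrete, here is a minimal sketch of the arithmetic involved. `LeadByteLength` and `UncheckedRemaining` are hypothetical helpers written for this illustration only; they are not OpenCC's `NextCharLength()` or its segmentation loop, and the real code may differ in detail.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical lead-byte lookup for illustration; not OpenCC's NextCharLength().
constexpr std::size_t LeadByteLength(unsigned char lead) {
  if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;                             // invalid lead byte: consume one byte
}

// Models an unchecked `remaining -= charLen` on an unsigned counter.
constexpr std::size_t UncheckedRemaining(std::size_t remaining,
                                         std::size_t charLen) {
  return remaining - charLen;  // wraps modulo 2^N when charLen > remaining
}

// Input truncated to {0xE4, 0xB8}: the lead byte claims 3 bytes, only 2 exist.
static_assert(LeadByteLength(0xE4) == 3,
              "lead byte 0xE4 announces a 3-byte sequence");
static_assert(UncheckedRemaining(2, 3) ==
                  std::numeric_limits<std::size_t>::max(),
              "2 - 3 wraps to the largest size_t instead of going negative");
```

Once the counter has wrapped, a loop guarded by `remaining > 0` keeps running, which is how a length mismatch of a single byte can turn into reads far beyond the buffer.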

This patch fixes both issues by:

- Explicitly tracking the end of the input buffer
- Recomputing remaining length on each iteration
- Clamping matched character and key lengths to the remaining buffer size
- Preventing reads past the null terminator
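
The first three points can be sketched as follows. This is an illustration under assumptions, not the code in this patch: `ExpectedLength` and `SplitClamped` are hypothetical names, the input is taken as a pointer plus an explicit length, and null-terminator handling is left out for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical lead-byte lookup; stands in for a NextCharLength()-style helper.
inline std::size_t ExpectedLength(unsigned char lead) {
  if ((lead & 0x80) == 0x00) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 1;  // invalid lead byte: consume one byte
}

// Splits a buffer into per-character pieces without reading past its end.
inline std::vector<std::string> SplitClamped(const char* begin,
                                             std::size_t length) {
  std::vector<std::string> pieces;
  const char* const end = begin + length;  // track the end explicitly
  for (const char* p = begin; p < end;) {
    // Recompute the remaining length from the pointers on every iteration,
    // so there is no running counter that could underflow.
    const std::size_t remaining = static_cast<std::size_t>(end - p);
    // Clamp the announced character length to what is actually left.
    const std::size_t charLen =
        std::min(ExpectedLength(static_cast<unsigned char>(*p)), remaining);
    pieces.emplace_back(p, charLen);
    p += charLen;  // charLen >= 1 here, so the loop always advances
  }
  return pieces;
}
```

With this shape, a buffer ending in a truncated sequence such as `{0xE4, 0xB8}` yields a final two-byte piece rather than a three-byte read, while well-formed input is split exactly as before.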

The changes preserve existing behavior for valid UTF-8 input and add test coverage for truncated UTF-8 sequences.

These issues may have security implications when processing untrusted input and are classified as heap out-of-bounds reads (CWE-125).

frankslin · 2026-01-09T06:34:38Z

Fixes #997
