@frankslin
Two independent out-of-bounds read issues were identified in OpenCC's UTF-8 processing logic when handling malformed or truncated UTF-8 sequences.

  1. MaxMatchSegmentation:
    NextCharLength() could return a value larger than the remaining input size.
    The previous logic subtracted this value from a size_t length counter,
    so the unsigned counter could wrap around to a very large value and
    lead to subsequent out-of-bounds reads.

  2. Conversion:
    Similar length handling could allow reads past the end of the input buffer
    during dictionary matching, potentially propagating unintended bytes to the
    conversion output.
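
To illustrate the wraparound described in item 1, here is a minimal sketch. It is not OpenCC's actual code: `LeadByteLength` and `RemainingAfterUnchecked` are hypothetical names, and the helper is assumed to derive the character length from the lead byte alone, without checking how many bytes remain.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a UTF-8 length helper (not the OpenCC API):
// it trusts the lead byte and ignores how many bytes actually remain.
inline std::size_t LeadByteLength(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;                             // invalid lead byte: consume one byte
}

// Unsafe pattern: subtract the claimed length from an unsigned counter.
// Precondition for this sketch: remaining >= 1, so text[0] is readable.
inline std::size_t RemainingAfterUnchecked(const char* text,
                                           std::size_t remaining) {
  const std::size_t charLen =
      LeadByteLength(static_cast<unsigned char>(text[0]));
  return remaining - charLen;  // wraps around when charLen > remaining
}
```

With a one-byte input consisting of the lead byte `0xE4`, the helper claims three bytes, so `1 - 3` wraps to `SIZE_MAX - 1`. A loop that keeps running while this counter is non-zero would then read far past the end of the buffer.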

This patch fixes both issues by:

  • Explicitly tracking the end of the input buffer
  • Recomputing remaining length on each iteration
  • Clamping matched character and key lengths to the remaining buffer size
  • Preventing reads past the null terminator
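
The first three points above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the patch itself: `SplitCharsClamped` and `LeadByteLength` are hypothetical names, the input is assumed to be given as a pointer plus a byte length, and null-terminator handling is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a UTF-8 length helper (not the OpenCC API).
inline std::size_t LeadByteLength(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;                             // invalid lead byte: consume one byte
}

// Splits [text, text + length) into per-character chunks without ever
// reading or advancing past the end of the buffer.
inline std::vector<std::string> SplitCharsClamped(const char* text,
                                                  std::size_t length) {
  std::vector<std::string> chars;
  const char* const end = text + length;  // explicit end of input
  for (const char* p = text; p < end;) {
    // Recompute the remaining length from the pointers on every iteration
    // instead of decrementing a counter that could wrap.
    const std::size_t remaining = static_cast<std::size_t>(end - p);
    // Clamp the claimed character length to what is actually left.
    const std::size_t charLen = std::min(
        LeadByteLength(static_cast<unsigned char>(*p)), remaining);
    chars.emplace_back(p, charLen);
    p += charLen;  // charLen >= 1, so the loop always makes progress
  }
  return chars;
}
```

For valid UTF-8 the clamp never triggers, so the output matches the unclamped behavior. For an input that ends in a truncated sequence, the final chunk simply contains the bytes that are present.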

The changes preserve existing behavior for valid UTF-8 input and add test coverage for truncated UTF-8 sequences.

These issues may have security implications when processing untrusted input and are classified as heap out-of-bounds reads (CWE-125).

@frankslin

Fixes #997
