[Feature Request] Unicode support for string #8

@jingkaimori

Search before asking

  • I searched the issues and found no similar issues.

What happened + What you expected to happen

R7RS suggests that strings may contain Unicode characters and that built-in string procedures such as (string-ci=?) should handle internationalized case mapping. This interpreter should therefore adjust the internal encoding of its String class.

There are several suggestions and requirements for Unicode string handling in Scheme. R7RS does not require (string-set!) and (string-ref) to run in constant time, but it does require a string index to be a code point index. Neither R7RS nor The Scheme Programming Language suggests exposing surrogate pairs the way Java does.

Unicode support differs between the C++ standard library and spdlog. The standard library indexes UTF-32 strings by code point, but it indexes UTF-8 and UTF-16 strings by code unit (a byte for UTF-8, a 16-bit unit for UTF-16) rather than by code point. The standard library also supports case mapping for Unicode. Judging by the API of spdlog, that library may only accept UTF-8 encoded messages.
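To illustrate the indexing difference, here is a minimal sketch using only standard string types. The helper names (`sample_utf8`, `sample_utf32`, `emoji_utf16`) are hypothetical and exist only for this example; the text is written with explicit escapes so the source file encoding does not matter.

```cpp
#include <string>

// "a", U+00E9 and U+20AC encoded as UTF-8: 1 + 2 + 3 = 6 bytes.
// operator[] and size() on std::string work on bytes, not code points.
inline std::string sample_utf8() { return "a\xC3\xA9\xE2\x82\xAC"; }

// The same three characters as UTF-32: one char32_t per code point,
// so operator[] is effectively a code point index.
inline std::u32string sample_utf32() { return U"a\u00E9\u20AC"; }

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 stores it
// as a surrogate pair: two char16_t code units for one code point.
inline std::u16string emoji_utf16() { return u"\U0001F600"; }
```

With these definitions, `sample_utf8()[1]` yields the first byte of the two-byte sequence for U+00E9, while `sample_utf32()[1]` yields U+00E9 itself.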

So I suggest using UTF-32 as the encoding for strings in this project. UTF-32 consumes more memory, because every character occupies 4 bytes, but locating a code point by a given index is easier than in UTF-8 or UTF-16.
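A rough sketch of what the UTF-32 option could look like, assuming the String class wraps a `std::u32string`. The names `string_ref`, `string_set` and `to_utf8` are hypothetical and not taken from this project. The encoder is included because a UTF-32 representation would need converting to UTF-8 before being handed to a UTF-8-only logging interface.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical (string-ref s k): constant-time code point access.
char32_t string_ref(const std::u32string& s, std::size_t k) {
    if (k >= s.size()) throw std::out_of_range("string-ref: index out of range");
    return s[k];
}

// Hypothetical (string-set! s k c): constant-time in-place replacement.
void string_set(std::u32string& s, std::size_t k, char32_t c) {
    if (k >= s.size()) throw std::out_of_range("string-set!: index out of range");
    s[k] = c;
}

// Encodes UTF-32 as UTF-8. This sketch does not reject surrogate values
// or values above U+10FFFF; a real implementation should validate them.
std::string to_utf8(const std::u32string& s) {
    std::string out;
    for (char32_t c : s) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

The trade-off is visible here: indexing is a plain array access, but every boundary with a UTF-8 consumer costs a full pass over the string.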

Another choice is UTF-8. UTF-8 generally consumes less space than UTF-16 and UTF-32 and can be scanned from start to end, so it is suitable for storing Scheme source input. However, the C++ standard library lacks functions for iterating over the code points of a UTF-8 string.
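If UTF-8 were chosen, code point iteration would have to be written by hand or taken from a third-party library. A minimal hand-written decoder might look like the sketch below. The function name `decode_utf8` is hypothetical, and replacing malformed bytes with U+FFFD is an assumption made for this example rather than something the proposal specifies.

```cpp
#include <cstddef>
#include <string>

// Scans a UTF-8 byte string from start to end and returns its code points.
// Malformed or truncated sequences produce U+FFFD, one per offending byte.
// This sketch does not reject overlong encodings or encoded surrogates.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80) { cp = lead; len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else { out += U'\uFFFD'; ++i; continue; }  // stray continuation or invalid lead

        bool ok = (i + len <= in.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;      // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out += U'\uFFFD'; ++i; continue; }
        out += cp;
        i += len;
    }
    return out;
}
```

Finding the k-th code point with this approach takes time linear in k, which R7RS permits for (string-ref) and (string-set!).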

Reproduction way

THIS IS A FEATURE REQUEST AND PROPOSAL.

Anything else

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
