Light C Unicode Library [closed]

3

55

I'm looking for a small C library to handle UTF-8 strings.

Specifically, I need to split strings on Unicode delimiters for use with stemming algorithms.

Related posts have suggested:

ICU http://www.icu-project.org/ (I found it too bulky for my purposes on embedded devices)

UTF8-CPP: http://utfcpp.sourceforge.net/ (Excellent, but C++ not C)

Has anyone found a small, platform-independent library for handling Unicode strings? (It doesn't need to do normalisation.)

Lemmuela answered 24/11, 2008 at 6:48 Comment(1)
utf8-cpp is great! It ported smoothly to iOS/Android, and it is a header-only library. - Tagalog
39

A nice, light library that I use successfully is utf8proc.

Laylalayman answered 24/11, 2008 at 6:52 Comment(0)
15

There's also MicroUTF-8, but it may require login credentials to view or download the source.

Sheepskin answered 30/10, 2011 at 12:28 Comment(0)
13

UTF-8 is specially designed so that many byte-oriented string functions continue to work or only need minor modifications.

C's strstr function, for instance, will work perfectly as long as both its inputs are valid, null-terminated UTF-8 strings. strcpy works fine as long as its input string starts at a character boundary (for instance the return value of strstr).
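As a rough sketch of that idea applied to the splitting problem in the question: because no valid UTF-8 sequence can begin in the middle of another character, a plain strstr search for a multi-byte delimiter only ever matches at a character boundary. The helper name utf8_next_token below is made up for illustration, and the code assumes both strings are valid, null-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from any library): finds the first token in
 * `text`, where tokens are separated by the UTF-8 encoded string `delim`.
 * Stores the token's length in bytes in *len. Returns a pointer to the
 * text following the delimiter, or NULL if no delimiter was found (in
 * which case the token is the whole remaining string).
 * Assumes both inputs are valid, null-terminated UTF-8.
 */
const char *utf8_next_token(const char *text, const char *delim, size_t *len)
{
    const char *hit;

    assert(text != NULL && delim != NULL && len != NULL);
    assert(delim[0] != '\0');

    /* Byte-wise search; with valid UTF-8 a match is always on a boundary. */
    hit = strstr(text, delim);
    if (hit == NULL) {
        *len = strlen(text);
        return NULL;
    }
    *len = (size_t)(hit - text);
    return hit + strlen(delim);
}
```

Calling it in a loop until it returns NULL walks through every token. Lengths are in bytes, not characters, which is what you want when copying a token out with memcpy.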

So you may not even need a separate library!

Shockley answered 24/11, 2008 at 7:30 Comment(4)
Very true. Until now I had only needed to store and copy strings, and was doing just that. But then I started needing to split and stem words for indexing, so I wanted to make sure I was handling them properly. - Lemmuela
While they work, searching functions will probably not perform as well in the face of UTF-8 characters. For example, if a UTF-8 character can be determined to not match immediately (often possible if it's compared with an ASCII character), the entire UTF-8 character encoding, which can be multiple bytes, can be skipped. But you're right that some of C's functions will work fine with UTF-8 strings, which is one of the reasons that UTF-8 is popular. - Judenberg
Not crashing is not the same as working: something as simple as the string size does not work for UTF-8. UTF-8 is NOT designed especially for library compatibility. - Chitter
@AdrianMaire Actually, strlen works as expected if your expectation is to know how many bytes are required to store the string. For the display length you need to consider the UTF-8 encoding. I have a naive version that is about 5 lines of C code. UTF-8 was designed to be as compatible as possible, but it can't be 100%, just by the nature of the problem. - Caseate
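
For readers wondering what such a naive version might look like, here is one possible sketch (not the commenter's actual code). It counts code points by skipping continuation bytes, which always have the bit pattern 10xxxxxx. The name utf8_codepoints is invented for this example, and the input is assumed to be valid, null-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: counts the code points in a valid,
 * null-terminated UTF-8 string. Every byte that is not a continuation
 * byte (10xxxxxx) starts a new code point.
 */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;

    assert(s != NULL);

    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Note that this gives the number of code points, which is still only an approximation of display length: combining marks and double-width characters are each counted as one.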
