What open source C or C++ libraries can convert arbitrary UTF-32 to NFC? [closed]

What open source C or C++ libraries can convert arbitrary UTF-32 to NFC?

Libraries that I think can do this so far: ICU, Qt, GLib (not sure?).

I don't need any other complex Unicode support; just conversion from arbitrary but known-correct UTF-32 to UTF-32 that is in NFC form.

I'm most interested in a library that can do this directly. For example, Qt and ICU (as far as I can tell) both do everything via an intermediate conversion stage to and from UTF-16.

Dubuffet answered 24/11, 2011 at 6:35 Comment(7)
What is NFC? Unicode Normalization Form Canonical Composition? - Jerky
@BillyONeal: I'm pretty sure that is it. See en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms - Glacial
Why do you care about implementation details? I wouldn't care if a library used UTF-13 internally, as long as it produces the right results. - Soren
"I don't need complex Unicode support" is a strange requirement. Surely, normalization is a very complex operation that requires full access to the Unicode character database... - Ashil
@Soren you are right that implementation details don't matter to a large extent. However, I'm using C++ because I care about memory usage and execution time: a single intermediate conversion could easily double both. If I didn't care at all, I'd just use Python and be done with it. =) - Dubuffet
@Kerrek I didn't say it's a requirement that the library doesn't have complex Unicode support; I just don't need anything except UTF-32 to UTF-32 NFC conversion. For example, Qt is MUCH, MUCH simpler than ICU in its Unicode support, but both support normalization. - Dubuffet
What is the output destined for that requires NFC, and why is an intermediate conversion undesirable? - Logy

ICU or Boost.Locale (wrapping ICU) will be your best bet by a very, very long way. The normalisation mappings will match those used by most other software, which I assume is the point of this conversion.

Logy answered 1/12, 2011 at 4:53 Comment(2)
There is only one possible (correct) NFC normalization mapping, so there isn't any compatibility worry, but I suppose that ICU is perhaps the least likely to ever be buggy. I was hoping for something a little lighter-weight that could just do normalization, but after lots of looking, I ended up deciding that ICU was the best choice as well, so I'm marking this as accepted. =) - Dubuffet
To clarify, by compatibility I mean as always: 'both sides will likely have the same bugs' =) - Logy

Here is the main part of the code I ended up using after deciding on ICU. I figured I should put it here in case it helps someone who tries this same thing.

#include <cassert>
#include <string>

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

std::string normalize(const std::string &unnormalized_utf8) {
    // FIXME: until ICU supports doing normalization over a UText
    // interface directly on our UTF-8, we'll use the insanely less
    // efficient approach of converting to UTF-16, normalizing, and
    // converting back to UTF-8.

    // Convert to UTF-16 string
    auto unnormalized_utf16 = icu::UnicodeString::fromUTF8(unnormalized_utf8);

    // Get a pointer to the global NFC normalizer
    UErrorCode icu_error = U_ZERO_ERROR;
    const auto *normalizer = icu::Normalizer2::getInstance(nullptr, "nfc", UNORM2_COMPOSE, icu_error);
    assert(U_SUCCESS(icu_error));

    // Normalize our string
    icu::UnicodeString normalized_utf16;
    normalizer->normalize(unnormalized_utf16, normalized_utf16, icu_error);
    assert(U_SUCCESS(icu_error));

    // Convert back to UTF-8
    std::string normalized_utf8;
    normalized_utf16.toUTF8String(normalized_utf8);

    return normalized_utf8;
}
Dubuffet answered 3/2, 2013 at 1:37 Comment(0)
