C++ tolower on special characters such as ü
D

4

6

I have trouble transforming a string to lowercase with the tolower() function in C++. With normal strings it works as expected; however, special characters are not converted successfully.

How I use my function:

string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    LowerCase += tolower(NotLowerCase[i]);
}

For example:

  1. Test -> test
  2. TeST2 -> test2
  3. Grüßen -> gr????en
  4. (§) -> ()

As you can see, 3 and 4 are not working as expected.

How can I fix this issue? I have to keep the special chars, but as lowercase.

Depew answered 14/3, 2017 at 16:27 Comment(3)
Do you realise that this is impossible to get right, because ß translates to SS, whereas SS may be translated to either ß or ss depending on the context? - Statolith
Yes, I understand and have already deleted my comment. Great answers, guys, thanks, keep it up. P.S. What is a safe language to use so that this doesn't occur and the word just stays as it was originally, e.g. en_US.iso88591? - Depew
Why do people keep calling perfectly normal letters "special characters"? - Arak
T
7

The sample code below, from the reference documentation for tolower, shows how you fix this: you have to use something other than the default "C" locale.

#include <iostream>
#include <cctype>
#include <clocale>

int main()
{
    unsigned char c = '\xb4'; // the character Ž in ISO-8859-15
                              // but ´ (acute accent) in ISO-8859-1 

    std::setlocale(LC_ALL, "en_US.iso88591");
    std::cout << std::hex << std::showbase;
    std::cout << "in iso8859-1, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
    std::setlocale(LC_ALL, "en_US.iso885915");
    std::cout << "in iso8859-15, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
}

You might also change std::string to std::wstring, which holds Unicode on many C++ implementations.

wstring NotLowerCase = L"Grüßen";
wstring LowerCase;
for (auto&& ch : NotLowerCase) {
    LowerCase += towlower(ch);
}

Guidance from Microsoft is to "Normalize strings to uppercase", so you might use toupper or towupper instead.

Keep in mind that a character-by-character transformation might not work well for some languages. For example, using German as spoken in Germany, making Grüßen all upper-case turns it into GRÜSSEN (although there is now a capital ẞ). There are numerous other "problems" such as combining characters; if you're doing real "production" work with strings, you really want a completely different approach.
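To make the length change concrete, here is a minimal sketch of an uppercase routine that expands ß to SS. The function name upper_latin1_de is hypothetical, and the sketch assumes the input is a wide string containing only ASCII and Latin-1 letters; it is an illustration of the problem, not a general Unicode solution.

```cpp
#include <string>

// Sketch only: uppercases ASCII and Latin-1 letters in a wide string and
// expands U+00DF (sharp s) to "SS". Everything else is copied unchanged.
// upper_latin1_de is a hypothetical name chosen for this example.
std::wstring upper_latin1_de(const std::wstring& in)
{
    std::wstring out;
    out.reserve(in.size() + 1);
    for (wchar_t c : in) {
        if (c == 0x00df) {
            // One input character becomes two output characters.
            out += L"SS";
        } else if (c >= L'a' && c <= L'z') {
            out += static_cast<wchar_t>(c - 0x20);
        } else if (c >= 0x00e0 && c <= 0x00fe && c != 0x00f7) {
            // Latin-1 lowercase letters; 0x00f7 is the division sign.
            out += static_cast<wchar_t>(c - 0x20);
        } else {
            out += c;
        }
    }
    return out;
}
```

Because the result can be longer than the input, a function with the shape char toupper(char) cannot express this mapping; the routine has to work on whole strings.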

Finally, C++ has more sophisticated support for managing locales; see <locale> for details.
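As a rough sketch of what the <locale> route can look like, the following uses the std::ctype<wchar_t> facet of a locale passed in by the caller. The helper name to_lower_in is hypothetical. Which characters actually get converted depends on the locale you pass: with the classic "C" locale, only ASCII letters are guaranteed to change.

```cpp
#include <locale>
#include <string>

// Sketch: lowercase a wide string using the ctype facet of the given locale.
// to_lower_in is a hypothetical helper name used only for this example.
std::wstring to_lower_in(const std::wstring& in, const std::locale& loc)
{
    const std::ctype<wchar_t>& facet =
        std::use_facet<std::ctype<wchar_t>>(loc);

    std::wstring out = in;
    if (!out.empty()) {
        // The range overload converts the characters in place.
        facet.tolower(&out[0], &out[0] + out.size());
    }
    return out;
}
```

A call such as to_lower_in(text, std::locale("")) would use the user's environment locale, assuming that locale is installed on the system.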

Torrance answered 14/3, 2017 at 16:33 Comment(6)
Mind you, this works for ISO-8859-*, but NOT for Unicode. And since it's tagged "htmlspecialcharacters", Unicode is a fair assumption. - Wellrounded
Indeed, I would like to support Unicode, since I will have to process many different languages and therefore multiple character sets. - Depew
Wow, this toupper -> towupper did it. (I modified it to lower of course, but it seems to work for now.) Thanks for your support! - Depew
@Ðаn: Indeed. To be fair, saying tolower is already making assumptions about character sets. Chinese is the classical counter-example. ISO-8859 describes a collection of 8-bit character sets, which together cover most of the alphabets for which lowercase makes sense. But for UTF-8, things suddenly are a lot more complex. And don't get me started about locale-specific case rules; I only have 600 characters per comment. One short example to remember, though: ß=>SS. Even in 8859-1, that can't be done with char toupper(char). The length of strings changes with uppercasing! - Wellrounded
@TVAvanHesteren You cannot really support multiple languages unless you support their individual quirks on a case-by-case basis. You can support characters that are used in multiple languages, but only if you don't manipulate these characters in any way. Changing a word to uppercase and then back to lowercase can be deadly. - Arak
So, how do you suggest to fix this or counter the problem? - Depew
J
3

I think the most portable way to do this is to use the user-selected locale, which is achieved by setting the locale to "" (the empty string).

std::locale::global(std::locale("")); 

That sets the locale to whatever was in use where the program was run, and it affects the standard character conversion routines (std::mbsrtowcs and std::wcsrtombs) that convert between multi-byte and wide-string characters.

Then you can use those functions to convert from the system/user selected multi-byte characters (such as UTF-8) to system standard wide character codes that can be used in functions like std::tolower that operate on one character at a time.

This is important because multi-byte encodings like UTF-8 cannot be converted using single-character operations like std::tolower().

Once you have converted the wide string version to upper/lower case it can then be converted back to the system/user multibyte character set for printing to the console.
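To see why byte-wise operations fall short on UTF-8, here is a small sketch that counts code points by skipping continuation bytes. The helper name utf8_code_points is hypothetical, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Sketch: count code points in a string assumed to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx and are not counted.
// utf8_code_points is a hypothetical helper name for this example.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For "Grüßen" encoded as UTF-8, the string holds 8 bytes but only 6 code points, because ü and ß each take two bytes. A loop that calls tolower on each char sees those bytes individually and cannot map them as letters.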

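#include <cstdlib>   // EXIT_SUCCESS, EXIT_FAILURE
#include <cwchar>    // mbsrtowcs, wcsrtombs, std::mbstate_t
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Convert from multi-byte codes to wide string codes
std::wstring mb_to_ws(std::string const& mb)
{
    std::wstring ws;
    std::mbstate_t ps{};
    char const* src = mb.data();

    // With a null destination the length argument is ignored; the call
    // only counts the wide characters needed (excluding the terminator).
    std::size_t len = 1 + mbsrtowcs(0, &src, 0, &ps);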

    ws.resize(len);
    src = mb.data();

    mbsrtowcs(&ws[0], &src, ws.size(), &ps);

    if(src)
        throw std::runtime_error("invalid multibyte character after: '"
            + std::string(mb.data(), src) + "'");

    ws.pop_back();

    return ws;
}

// Convert from wide string codes to multi-byte codes
std::string ws_to_mb(std::wstring const& ws)
{
    std::string mb;
    std::mbstate_t ps{};
    wchar_t const* src = ws.data();

    std::size_t len = 1 + wcsrtombs(0, &src, 0, &ps);

    mb.resize(len);
    src = ws.data();

    wcsrtombs(&mb[0], &src, mb.size(), &ps);

    if(src)
        throw std::runtime_error("invalid wide character");

    mb.pop_back();

    return mb;
}

int main()
{
    // set locale to the one chosen by the user
    // (or the one set by the system default)
    std::locale::global(std::locale(""));

    try
    {
        std::string NotLowerCase = "Grüßen";

        std::cout << NotLowerCase << '\n';

        // convert system/user multibyte character codes
        // to wide string versions
        std::wstring ws1 = mb_to_ws(NotLowerCase);
        std::wstring ws2;

        for(unsigned int i = 0; i < ws1.length(); i++) {
            // use the system/user locale
            ws2 += std::tolower(ws1[i], std::locale("")); 
        }

        // convert wide string character codes back
        // to system/user multibyte versions
        std::string LowerCase = ws_to_mb(ws2);

        std::cout << LowerCase << '\n';
    }
    catch(std::exception const& e)
    {
        std::cerr << e.what() << '\n';
        return EXIT_FAILURE;
    }
    catch(...)
    {
        std::cerr << "Unknown exception." << '\n';
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

Code not heavily tested
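One caveat with std::locale(""): the constructor throws std::runtime_error when the requested locale cannot be created, which can happen if the environment names a locale that is not installed. A minimal sketch of a fallback, with the hypothetical helper name user_locale_or_classic:

```cpp
#include <locale>
#include <stdexcept>

// Sketch: try the user's environment locale and fall back to the classic
// "C" locale if it cannot be constructed.
// user_locale_or_classic is a hypothetical helper name for this example.
std::locale user_locale_or_classic()
{
    try {
        return std::locale("");
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

With the fallback in place, non-ASCII case conversion will silently stop working under the "C" locale, so whether to fall back or report the error is a design choice.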

Jewry answered 14/3, 2017 at 18:18 Comment(0)
P
0

I'm not a fan of changing the std::locale, so I wrote a little mapping function that converts (the most relevant?) characters of the Unicode table to lowercase. Maybe it comes in handy for you.

wchar_t unicode_tolower(wchar_t c) {
#define LWR_OFFSET(from, to, by){if(c>=(from) && c<=(to)){return c+(by);}}
#define LWR_NEXT(from, to){const int odd = (from) & 0x0001; if(c>=(from) && c<=(to) && ((c&0x0001) == odd)) {return ++c;}}

    LWR_OFFSET(L'A', L'Z', 0x20)
    LWR_OFFSET(0x00c0, 0x00d6, 0x20) // A with grave ... O with diaeresis
    // 0x00d7=multiplication
    LWR_OFFSET(0x00d8, 0x00de, 0x20) // O with stroke ...Thorn

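    // Latin Extended-A: the upper/lower pairing flips parity several times
    LWR_NEXT(0x0100, 0x012f) // A with macron ... I with ogonek
    if (c==0x0130) {return 0x0069;} // I with dot above -> i
    LWR_NEXT(0x0132, 0x0137)
    LWR_NEXT(0x0139, 0x0148)
    LWR_NEXT(0x014a, 0x0177)
    if (c==0x0178) {return 0x00ff;} // Y with diaeresis
    LWR_NEXT(0x0179, 0x017e) // Z with acute ... Z with caron

    LWR_NEXT(0x0370, 0x0373) // greek
    if (c==0x0376) {return ++c;} // 0x0374 and 0x0375 are numeral signs
    LWR_OFFSET(0x0391, 0x03ab, 0x20) // greek
    if (c==0x03f7) {return ++c;}
    if (c==0x03fa) {return ++c;}

    LWR_OFFSET(0x0400, 0x040f, 0x50) // Cyrillic - this range is strange
    LWR_OFFSET(0x0410, 0x042f, 0x20) // Cyrillic
    LWR_NEXT(0x0460, 0x0481)
    // 0x0482 ... 0x0489 are a symbol and combining marks, not letters
    LWR_NEXT(0x048a, 0x04bf)
    LWR_NEXT(0x04c1, 0x04ce)
    LWR_NEXT(0x04d0, 0x052f)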

    LWR_OFFSET(0x0531, 0x0556, 0x30) // Armenian
    return c;
#undef LWR_OFFSET
#undef LWR_NEXT
}
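A per-character mapping function of this shape can be applied to a whole wide string with std::transform. The sketch below is self-contained, so it uses a hypothetical ASCII-only stand-in mapper (ascii_only_tolower); any function with the same wchar_t(wchar_t) signature could be passed instead. The helper name map_chars is also an assumption made for this example.

```cpp
#include <algorithm>
#include <string>

// Hypothetical stand-in mapper (ASCII only), used just to keep this sketch
// self-contained; a wider table-based function has the same signature.
wchar_t ascii_only_tolower(wchar_t c)
{
    return (c >= L'A' && c <= L'Z') ? static_cast<wchar_t>(c + 0x20) : c;
}

// Apply a one-to-one character mapping to every element of the string.
// map_chars is a hypothetical helper name for this example.
std::wstring map_chars(std::wstring text, wchar_t (*mapper)(wchar_t))
{
    std::transform(text.begin(), text.end(), text.begin(), mapper);
    return text;
}
```

Note that this only covers one-to-one mappings; cases where the length changes cannot be handled by a per-character function.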
Panaggio answered 12/3 at 5:45 Comment(0)
C
-6

use ASCII

string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    if(NotLowerCase[i]<65||NotLowerCase[i]>122)
    {
        LowerCase+='?';
    }
    else
        LowerCase += tolower(NotLowerCase[i]);
}
Carapace answered 14/3, 2017 at 16:39 Comment(1)
I need the special characters in lowercase, as stated in the question. This just replaces them with a 'valid' question mark, which is not what was requested. Thanks for your input though. - Depew
