Converting wide char string to lowercase in C++

Asked 23/10, 2009 at 16:37 Answered 23/10, 2009 at 17:0

How do I convert a wchar_t string from upper case to lower case in C++?

The string contains a mixture of Japanese, Chinese, German and Greek characters.

I thought about using towlower...

http://msdn.microsoft.com/en-us/library/8h19t214%28VS.80%29.aspx

.. but the documentation says that:

The case conversion of towlower is locale-specific. Only the characters relevant to the current locale are changed in case.

Edit: Maybe I should describe what I'm doing. I receive a Unicode search query from a user. It's originally UTF-8 encoded, but I convert it to a wide-character string (I may be wrong on the wording). My debugger (VS2008) correctly shows the Japanese, German, etc. characters in the "variable quick watch". I need to go through another set of Unicode data and find matches of the search string. This is no problem when the search is case-sensitive, but it is harder to do case-insensitively. My (maybe naive) approach would be to convert both the search string and the data being searched to lower case and then compare them.

Dustan answered 23/10, 2009 at 16:37 Comment(2)

Another approach would be to use comparison algorithms that ignore case. And case is not your only problem: without normalizing the string, a letter with a diacritic can be encoded either as one precomposed character (é, Õ) or as several individual characters ('e, ~O). Proper normalization (NFC/NFD/NFKC/NFKD) before comparison is vital in your situation. – Sorrento 23/10, 2009 at 17:9

Abel, please post it as a proper answer so it can be upvoted as it should be. It's pretty much the only correct answer in this situation... – Foulard 23/10, 2009 at 17:24

If your string contains all those characters, the codeset must be Unicode-based. If implemented properly, Unicode (Chapter 4 'Character Properties') defines character properties including whether the character is upper case and the lower case mapping, and so on.

Given that preamble, the towlower() function from <wctype.h> is the correct tool to use. If it doesn't do the job, you have a QoI (Quality of Implementation) problem to discuss with your vendor. If you find the vendor unresponsive, then look at alternative libraries. In this case, you might consider ICU (International Components for Unicode).
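As a rough illustration of the lowercase-then-compare approach from the question, here is a minimal sketch built on std::towlower. The helper names to_lower_copy and contains_ignore_case are made up for this example. It assumes the C locale has already been set appropriately (for example with std::setlocale), works one wchar_t at a time, and therefore cannot handle one-to-many case mappings or, where wchar_t is 16 bits wide, characters stored as surrogate pairs.

```cpp
#include <algorithm>
#include <cwctype>
#include <string>

// Hypothetical helper: lowercase a copy of `s` one wchar_t at a time.
// The result depends on the C locale currently in effect.
std::wstring to_lower_copy(std::wstring s) {
    std::transform(s.begin(), s.end(), s.begin(), [](wchar_t c) {
        return static_cast<wchar_t>(std::towlower(c));
    });
    return s;
}

// Hypothetical helper: case-insensitive substring test.
// Lowercases both sides, then does an ordinary search.
bool contains_ignore_case(const std::wstring& haystack,
                          const std::wstring& needle) {
    return to_lower_copy(haystack).find(to_lower_copy(needle)) !=
           std::wstring::npos;
}
```

Whether non-ASCII letters are converted depends on the active locale and on the quality of the implementation, which is exactly the caveat raised in the comments below.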

Dermatologist answered 23/10, 2009 at 16:50 Comment(5)

Unicode case mappings, as specified in the document that you've linked to, are still partially locale-dependent. Quote: "SpecialCasing.txt - Contains additional case mappings that map to more than one character, such as “ß” to “SS”. Also contains context-dependent mappings, with flags to distinguish them from the normal mappings, as well as some locale-dependent mappings.". So tolower cannot avoid being locale specific. – Foulard 23/10, 2009 at 16:59

@Pavel This process is called "normalization of Unicode strings", which makes sure that ß and ss are treated equal (depending on chosen normalization form) and Unicode contains language-neutral algorithms for that, while not ignoring the wish for locale or application specific treatment. – Sorrento 23/10, 2009 at 17:14

@Abel: normalization is not a complete solution. For example, in some Latin languages diacritics disappear on uppercased letters, in other languages they do not. There's no way to tell unless you know which language the text is written in. Then, of course, there's the infamous Turkish dotless "i" problem - you want İ to lowercase to i and I to lowercase to ı for Turkish, but you want I to lowercase to i for any other Latin alphabet language. – Foulard 23/10, 2009 at 17:23

@Pavel: that's an excellent elaboration, I fully agree. No, normalization is not perfect, it's more a simplistic brute-force method, but it helps in a fine bunch of situations. Probably a good moment in the discussion to include a link to the Unicode Collation Algorithm, which discusses this in full (it goes much further than lowercase/uppercase): unicode.org/reports/tr10 and the Unicode Case Mapping: unicode.org/reports/tr21/tr21-5.html – Sorrento 26/10, 2009 at 15:13

@JonathanLeffler: ICU is interesting, but perhaps overkill. I would probably go for processing the UnicodeData.txt [compile to binary and filter out irrelevant parts]. – Iguana 24/5, 2015 at 7:36

You have a nasty problem in hand. A Japanese locale will not help with converting German, and vice versa. There are also languages which do not have the concept of capitalization (toupper and friends would be a no-op here, I suppose). So, can you break up your string into individual chunks of words from the same language? If you can, then you can convert the pieces and string them back together.
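The chunk-by-chunk idea could look roughly like the sketch below. The Chunk struct and lower_chunks function are hypothetical, and the sketch assumes something else has already split the text and decided which std::locale applies to each piece, which is the hard part and is not shown. It uses the std::tolower(c, loc) overload from <locale>.

```cpp
#include <locale>
#include <string>
#include <vector>

// Hypothetical representation: a run of text plus the locale assumed for it.
struct Chunk {
    std::wstring text;
    std::locale loc;
};

// Lowercase each chunk with its own locale and concatenate the results.
std::wstring lower_chunks(const std::vector<Chunk>& chunks) {
    std::wstring out;
    for (const Chunk& chunk : chunks) {
        for (wchar_t c : chunk.text) {
            out += std::tolower(c, chunk.loc);  // locale-aware overload from <locale>
        }
    }
    return out;
}
```

Each chunk would normally carry a named locale; constructing std::locale from a name throws std::runtime_error if that locale is not available on the system, and the available names differ between platforms.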

Winona answered 23/10, 2009 at 16:50 Comment(6)

Japanese and the other ideographic languages from East Asia are examples of languages mainly without upper-case. – Dermatologist 23/10, 2009 at 16:51

Not only that, but individual languages can have different opinions on how a particular letter should be upper/lowercased. There's simply no single algorithm to do it properly on any random Unicode string without knowing the language. – Foulard 23/10, 2009 at 16:55

Though I agree with that assessment, Unicode includes locale-independent uppercase/lowercase properties, whose usage is described under 3.13 "Default Case Operations" and which are to be used in the absence of tailoring for particular languages, so the standard says. – Sorrento 23/10, 2009 at 17:3

It does. The problem is that it is right for, say, 99% of all cases, but you'll get 1% wrong. Which may or may not be a problem. In general, it's good enough when you use it for things like identifiers in code, and maybe even filenames. – Foulard 23/10, 2009 at 17:9

@Pavel: Which means that you can't do it right all the time, but you can do it consistently all the time. I know that lowercasing 'I' to 'i' is wrong in Turkish, but if you're just normalizing the string for comparison rather than printing out the result it may work just fine. – Lest 23/10, 2009 at 19:51

@David: it might not work fine. Say you have text "Diyarbakır" in the original document, and the user entered "DİYARBAKIR" search string. You use the default Unicode casing rules to lowercase both strings; the first one becomes "diyarbakır", the second one "diyarbakir". And now they don't match, and they really should have, if the text is Turkish. – Foulard 23/10, 2009 at 20:16

This SO answer shows how to use facets to work with several locales. If this is on Windows, you can consider using the Win32 API functions; if you can work with C++.NET (managed C++), you can use the char.ToLower and string.ToLower functions, which are Unicode compliant.
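For reference, a facet-based conversion might look something like the following sketch. It is not taken from the linked answer; the function name to_lower_in is invented here, and it simply applies the std::ctype<wchar_t> facet of whichever std::locale the caller passes in.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: lowercase `s` using the ctype facet of `loc`,
// without touching the global locale.
std::wstring to_lower_in(std::wstring s, const std::locale& loc) {
    const std::ctype<wchar_t>& facet =
        std::use_facet<std::ctype<wchar_t>>(loc);
    if (!s.empty()) {
        // Range form: converts the characters in [begin, end) in place.
        facet.tolower(&s[0], &s[0] + s.size());
    }
    return s;
}
```

Because the locale is an explicit argument, the same input can be converted under several locales side by side. Like towlower, the facet maps one character to one character, so it shares the limitations discussed in the comments above.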

Sorrento answered 23/10, 2009 at 16:53 Comment(0)

Have a look at _wcslwr_l in <wchar.h> (MSDN).

You should be able to run the function on the input for each of the locales.

Cristen answered 23/10, 2009 at 17:0 Comment(2)

"You should be able to run the function on the input for each of the locales." - what if two locales in the set map the same character differently? – Foulard 23/10, 2009 at 17:26

As mentioned in other comments, you have to know the language of each part of the string in order to avoid those cases. There's really no getting around that. I'm merely suggesting a different function to use to more easily manage the issue with running the operation on the current locale. – Cristen 23/10, 2009 at 17:55
