Which tolower in C++?

Asked 27/5, 2016 at 11:24 Answered 2/6, 2016 at 13:25

Given string foo, I've written answers on how to use cctype's tolower to convert the characters to lowercase:

transform(cbegin(foo), cend(foo), begin(foo), static_cast<int (*)(int)>(tolower));

But I've begun to consider locale's tolower, which could be used like this:

use_facet<ctype<char>>(cout.getloc()).tolower(data(foo), next(data(foo), foo.size()));

Is there a reason to prefer one of these over the other?
Does their functionality differ at all?
I mean other than the fact that tolower accepts and returns an int which I assume is just some antiquated C stuff?

Steen answered 27/5, 2016 at 11:24 Comment(35)

man, only c++ can make such easy things so difficult... – Ostyak 27/5, 2016 at 11:32

why the static_cast ? Just do std::transform(foo.cbegin(), foo.cend(), foo.begin(), ::tolower). Alternatively, consider boost's to_lower. – Intervalometer 27/5, 2016 at 11:50

@SanderDeDycker: yes, but he is asking the why! only one reason comes to my mind right now, which i posted as answer... but i guess there are more, maybe also considering performance. – Ostyak 27/5, 2016 at 11:53

@progressive_overload With Great Power Comes Great Responsibility – Trihedron 27/5, 2016 at 11:53

@Trihedron converting a string to lowercase using a 86-liner is the most powerful thing i've ever seen in my life – Ostyak 27/5, 2016 at 11:55

@progressive_overload : I made a comment, not an answer. I didn't claim to answer the OP's question. I pointed out an oddity, and suggested an alternative that I consider to be better than either of the suggested ones. – Intervalometer 27/5, 2016 at 11:59

@SanderDeDycker I see, but I would like to know the why as well. :) it is shorter... that is an advantage for sure, but is there more to it? – Ostyak 27/5, 2016 at 12:0

@SanderDeDycker You can check my answer here: https://mcmap.net/q/54349/-why-can-39-t-quot-transform-s-begin-s-end-s-begin-tolower-quot-be-complied-successfully ::tolower is implementation dependent, and I always try to avoid Boost. – Steen 27/5, 2016 at 12:1

@progressive_overload : boost's to_lower is shorter, more readable, and has the option to pass in a locale as well. – Intervalometer 27/5, 2016 at 12:2

@SanderDeDycker Boost always has the massive drawback that you must include the Boost libraries. I recognize there is a place for Boost's convenience, but using it when C++ already provides you not 1 but 2 ways to accomplish this... well it doesn't make any sense to me. – Steen 27/5, 2016 at 12:5

@progressive_overload I don't want to start a flame-war, just saying that: Sure, this specific task could be solved more easily, but on the other hand the STL provides you great flexibility (power). And sometimes what's an advantage in one case is a drawback in another. – Trihedron 27/5, 2016 at 12:13

@Alex Good catch, I've looked at this question like 10 times today and missed it every time. You must program without IntelliSense to have the eagle-eye to catch that ;) – Steen 27/5, 2016 at 12:20

@JonathanMee : ::tolower works fine with #include <ctype.h> - it's all about choices (I'd personally rather put these few functions in the global namespace than to have to deal with overload disambiguation). And about boost : many people want to avoid it as much as possible - I learned to embrace it, but to each their own. I prefer the readability advantage it provides, as well as the seamless support for non-ASCII encodings and locales. – Intervalometer 27/5, 2016 at 12:21

@JonathanMee : oh, and boost does not always require you to include boost libraries. Much of the boost functionality is headers only. Including the functionality I suggested. – Intervalometer 27/5, 2016 at 12:25

@SanderDeDycker The standard has deprecated ctype.h, hence the use of cctype which necessitates the static_cast. Anyway even though I don't want to include Boost, I recognize and share your readability concerns. The standard could do a lot better here. – Steen 27/5, 2016 at 12:27

@JonathanMee : everyone makes their own choices. Unfortunately, my set of choices is incompatible with yours for this specific subject, so my suggestions weren't useful to you. I apologize. Hopefully they can be useful to someone else in the future :) – Intervalometer 27/5, 2016 at 12:33

Regardless of everything you have to cast to uint8_t or unsigned char before converting to int because otherwise you may get unwanted sign extension depending on your platform! – Bailable 27/5, 2016 at 12:33

@Bailable Can you elaborate, string works with signed chars; why would I want to cast to unsigned char when using tolower? – Steen 27/5, 2016 at 12:35

@SanderDeDycker Please don't apologize. I sometimes work in solutions where Boost is already included if I need to do this in such a solution I'll go look up Boost's tolower. So you have provided me with some helpful guidance. It's just not the answer that I want for this question. – Steen 27/5, 2016 at 12:37

@JonathanMee std::string uses char which may or may not be signed. – Bailable 27/5, 2016 at 12:43

@JonathanMee thank post-review for not doing syntax highlighting. – Ionogen 27/5, 2016 at 12:47

@Bailable Isn't string defined as basic_string<char>? So it will be signed? – Steen 27/5, 2016 at 12:49

I already dissected everything you need to see what's wrong. – Bailable 27/5, 2016 at 12:53

@JonathanMee char may or may not be signed, that is implementation defined. – Raphael 27/5, 2016 at 13:2

@BaummitAugen Hmmm, I'm not sure about that, "signed is default if omitted": en.cppreference.com/w/cpp/language/types#Modifiers – Steen 27/5, 2016 at 13:9

@JonathanMee On the same page, see this passage: "char - type for character representation which can be most efficiently processed on the target system (has the same representation and alignment as either signed char or unsigned char, but is always a distinct type)." – Gaul 27/5, 2016 at 13:15

@JonathanMee I am sure I'm right. You want me to find the standard quote or do you believe me? :) – Raphael 27/5, 2016 at 13:16

@BaummitAugen If I have to change everything I've believed about the signed modifier could you at least grace me with a citation from the standard? – Steen 27/5, 2016 at 13:23

@JonathanMee Sure thing. ;) "It is implementation-defined whether objects of char type are represented as signed or unsigned quantities. The signed specifier forces char objects to be signed; it is redundant in other contexts." 7.1.6.2 [dcl.type.simple] in N4140. – Raphael 27/5, 2016 at 13:26

The C classification functions require the input value to be representable by unsigned char or be equal to EOF. Thus calling them directly with plain char is invalid if it is signed and the value is negative. – Caponize 27/5, 2016 at 17:4

@Caponize So if I understand what you're saying correctly, if I am working with a signed char[] using cctype's tolower is invalid o.O – Steen 27/5, 2016 at 17:27

@JonathanMee Due to my quote above, that might even be true for plain char[]. See this. – Raphael 27/5, 2016 at 19:11

@BaummitAugen So because locale's tolower works with chars not ints, it should be preferred then? That may be as good an argument as any as far as why I should choose one over the other. Are you interested in writing it up, if not I can. – Steen 31/5, 2016 at 10:54

@JonathanMee I always just used the C one with the cast, non-trivial string handling was never in the scope of my work. Feel free to write it up and use the potential UB (which is an atrocity, I agree) as argument. – Raphael 31/5, 2016 at 21:44

@BaummitAugen Welp, I've done it. I've written up an answer citing basically the determining factor being whether you are willing to work with the cast. I expect to accept this tomorrow unless you have any showstopping comments or an answer of your own you'd like to add. – Steen 2/6, 2016 at 13:37

It should be noted that the language designers were aware of cctype's tolower when locale's tolower was created. It improved in 2 primary ways:

1. As mentioned in progressive_overload's answer, the locale version allows the use of a ctype facet, even a user-modified one, without having to swap in a new LC_CTYPE via setlocale and restore the previous LC_CTYPE afterwards.
2. From section 7.1.6.2 [dcl.type.simple] paragraph 3:

It is implementation-defined whether objects of char type are represented as signed or unsigned quantities. The signed specifier forces char objects to be signed

This creates the potential for undefined behavior with the cctype version of tolower if its argument:

Is not representable as unsigned char and does not equal EOF

So the cctype version of tolower requires an additional conversion on input (to unsigned char) and on output (from int back to char), yielding:

transform(cbegin(foo), cend(foo), begin(foo), [](const unsigned char i){ return tolower(i); });

Since the locale version operates directly on chars there is no need for a type conversion.
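To make the conversion concrete, here is a minimal sketch of the cctype approach wrapped in a function. The name lower_cctype is hypothetical (not from the question), and the sketch assumes the program is still in the default "C" locale:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: lowercases a copy of s using cctype's tolower,
// which follows the currently installed C locale.
std::string lower_cctype(std::string s) {
    std::transform(s.cbegin(), s.cend(), s.begin(),
        [](unsigned char c) {
            // Taking the parameter as unsigned char guarantees the value
            // handed to std::tolower is representable as unsigned char,
            // even where plain char is signed.
            return static_cast<char>(std::tolower(c));
        });
    return s;
}
```

For example, lower_cctype("Hello, World!") should give "hello, world!". Bytes outside ASCII are passed through the same conversion rather than being fed to tolower as negative values.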

So if you don't need to perform the conversion with a different ctype facet, it simply becomes a style question of whether you prefer the transform with a lambda required by the cctype version, or whether you prefer the locale version's:

use_facet<ctype<char>>(cout.getloc()).tolower(data(foo), next(data(foo), size(foo)));
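As a rough sketch, the facet call can be wrapped so the locale is an explicit parameter. The name lower_facet and the default of std::locale::classic() are assumptions made for illustration; the line above uses cout.getloc() instead:

```cpp
#include <locale>
#include <string>

// Hypothetical helper: lowercases a copy of s with the ctype<char> facet
// of the given locale. Defaults to the classic "C" locale (an assumption).
std::string lower_facet(std::string s,
                        const std::locale& loc = std::locale::classic()) {
    const auto& facet = std::use_facet<std::ctype<char>>(loc);
    // The range overload converts [begin, end) in place and works
    // directly on char, so no unsigned char conversion is needed.
    facet.tolower(&s[0], &s[0] + s.size());
    return s;
}
```

Passing a different std::locale object selects a different facet without touching the global C locale via setlocale.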

Steen answered 2/6, 2016 at 13:25 Comment(0)

Unfortunately, both are equally bad. Although std::string pretends to be a UTF-8 encoded string, none of its methods or related functions (including tolower) are really UTF-8 aware. So while tolower and tolower + locale may work with single-byte characters (= ASCII), they will fail for every other set of languages.
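A small sketch of that failure, assuming the classic "C" locale and a UTF-8 encoded input (the helper name lower_bytes is hypothetical). The facet sees individual bytes, so the two-byte sequence for 'É' (0xC3 0x89) is left alone instead of becoming 'é' (0xC3 0xA9):

```cpp
#include <locale>
#include <string>

// Hypothetical helper: byte-wise lowercase using the classic locale's facet.
std::string lower_bytes(std::string s) {
    const auto& facet =
        std::use_facet<std::ctype<char>>(std::locale::classic());
    facet.tolower(&s[0], &s[0] + s.size());
    return s;
}

// UTF-8 for "ÉCOLE". The literal is split so that 'C' is not parsed
// as part of the \x89 hex escape.
const std::string utf8_input = "\xC3\x89" "COLE";
// What a UTF-8 aware conversion would produce: "école".
const std::string utf8_expected = "\xC3\xA9" "cole";
```

Under these assumptions, lower_bytes(utf8_input) lowercases only the ASCII letters and does not match utf8_expected.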

On Linux, I'd use the ICU library. On Windows, I'd use the CharUpper function.

Lucila answered 27/5, 2016 at 12:47 Comment(8)

You're saying that locale's tolower can't handle UTF-8 either? Hmmm, that would have been a good argument for it. – Steen 27/5, 2016 at 12:52

@JonathanMee Unfortunately, C++ has no meaningful Unicode support in any sense. – Raphael 27/5, 2016 at 13:2

C++ sucks at this indeed, but are you meaning that in 2016 we can't even have a portable library to handle this ? – Lupulin 28/5, 2016 at 10:42

@BaummitAugen Are you guys sure about the lack of UTF-8 support? That's actually one of the things demonstrated in the en.cppreference.com/w/cpp/locale/tolower example. I haven't been able to come up with a way to make it fail with UTF-8, even when using "multi-byte characters". – Steen 1/6, 2016 at 16:3

@JonathanMee Rekt, output should be ω. – Raphael 1/6, 2016 at 16:19

@JonathanMee And then there is stuff like this which should yield SS, but have fun building that with the normal char types. – Raphael 1/6, 2016 at 16:25

@BaummitAugen You are right :( The input I tested with was just wchar_t on Windows not UTF-8. UTF-8 is still broken. Return to your lives citizens. In other news, I happen to have done some personal research on the 'ß' character though. According to the standard it should not convert to "SS" nor to 'ẞ': https://mcmap.net/q/56585/-c-c-utf-8-upper-lower-case-conversions – Steen 1/6, 2016 at 18:41

@JonathanMee Unicode says it should: "German sharp s . The German sharp s character has several complications in case mapping. Not only does its uppercase mapping expand in length, but its default case-pairings are asymmetrical. The default case mapping operations follow standard German orthography, which uses the string “SS” as the regular uppercase mapping for U+00DF ß latin small letter sharp s ." Unicode 8 5.18 And Unicode is the standard that defines the behavior of UTF8, not some C++ standard. – Raphael 1/6, 2016 at 22:12

In the first case (cctype) the locale is set implicitly:

Converts the given character to lowercase according to the character conversion rules defined by the currently installed C locale.

http://en.cppreference.com/w/cpp/string/byte/tolower

In the second (locale's) case you have to explicitly set the locale:

Converts parameter c to its lowercase equivalent if c is an uppercase letter and has a lowercase equivalent, as determined by the ctype facet of locale loc. If no such conversion is possible, the value returned is c unchanged.

http://www.cplusplus.com/reference/locale/tolower/
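The difference between the two quoted descriptions can be sketched with the single-character overloads. The helper names lower_implicit and lower_explicit are hypothetical, chosen only to label the two cases:

```cpp
#include <cctype>
#include <locale>

// cctype: the locale is implicit, namely whatever C locale is installed.
char lower_implicit(char c) {
    // Convert to unsigned char first so negative char values are not
    // passed to std::tolower.
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// <locale>: the locale is an explicit argument of the call.
char lower_explicit(char c, const std::locale& loc) {
    return std::tolower(c, loc);
}
```

For instance, lower_explicit('Q', std::locale::classic()) uses the classic locale regardless of what setlocale has installed, while lower_implicit('Q') depends on the current C locale.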

Ostyak answered 27/5, 2016 at 11:48 Comment(0)