Which tolower in C++?

Given string foo, I've written answers on how to use cctype's tolower to convert the characters to lowercase:

transform(cbegin(foo), cend(foo), begin(foo), static_cast<int (*)(int)>(tolower));

But I've begun to consider locale's tolower, which could be used like this:

use_facet<ctype<char>>(cout.getloc()).tolower(data(foo), next(data(foo), foo.size()));
  • Is there a reason to prefer one of these over the other?
  • Does their functionality differ at all?
  • I mean other than the fact that tolower accepts and returns an int which I assume is just some antiquated C stuff?
Steen answered 27/5, 2016 at 11:24 Comment(35)
man, only c++ can make such easy things so difficult...Ostyak
why the static_cast ? Just do std::transform(foo.cbegin(), foo.cend(), foo.begin(), ::tolower). Alternatively, consider boost's to_lower.Intervalometer
@SanderDeDycker: yes, but he is asking the why! only one reason comes to my mind right now, which i posted as answer... but i guess there are more, maybe also considering performance.Ostyak
@progressive_overload With Great Power Comes Great ResponsibilityTrihedron
@Trihedron converting a string to lowercase using a 86-liner is the most powerful thing i've ever seen in my lifeOstyak
@progressive_overload : I made a comment, not an answer. I didn't claim to answer the OP's question. I pointed out an oddity, and suggested an alternative that I consider to be better than either of the suggested ones.Intervalometer
@SanderDeDycker I see, but I would like to know the why as well. :) it is shorter... that is an advantage for sure, but is there more to it?Ostyak
@SanderDeDycker You can then check my answer here: https://mcmap.net/q/54349/-why-can-39-t-quot-transform-s-begin-s-end-s-begin-tolower-quot-be-complied-successfully ::tolower is implementation dependent, and I always try to avoid Boost.Steen
@progressive_overload : boost's to_lower is shorter, more readable, and has the option to pass in a locale as well.Intervalometer
@SanderDeDycker Boost always has the massive drawback that you must include the Boost libraries. I recognize there is a place for Boost's convenience, but using it when C++ already provides you not 1 but 2 ways to accomplish this... well it doesn't make any sense to me.Steen
@progressive_overload I don't want to start a flame-war, just saying that: Sure, this specific task could be solved more easily, but on the other hand the STL provides you great flexibility (power). And sometimes what's an advantage in one case is a drawback in another.Trihedron
@Alex Good catch, I've looked at this question like 10 times today and missed it every time. You must program without IntelliSense to have the eagle-eye to catch that ;)Steen
@JonathanMee : ::tolower works fine with #include <ctype.h> - it's all about choices (I'd personally rather put these few functions in the global namespace than to have to deal with overload disambiguation). And about boost : many people want to avoid it as much as possible - I learned to embrace it, but to each their own. I prefer the readability advantage it provides, as well as the seamless support for non-ASCII encodings and locales.Intervalometer
@JonathanMee : oh, and boost does not always require you to include boost libraries. Much of the boost functionality is headers only. Including the functionality I suggested.Intervalometer
@SanderDeDycker The standard has deprecated ctype.h, hence the use of cctype which necessitates the static_cast. Anyway even though I don't want to include Boost, I recognize and share your readability concerns. The standard could do a lot better here.Steen
@JonathanMee : everyone makes their own choices. Unfortunately, my set of choices is incompatible with yours for this specific subject, so my suggestions weren't useful to you. I apologize. Hopefully they can be useful to someone else in the future :)Intervalometer
Regardless of everything you have to cast to uint8_t or unsigned char before converting to int because otherwise you may get unwanted sign extension depending on your platform!Bailable
@Bailable Can you elaborate, string works with signed chars; why would I want to cast to unsigned char when using tolower?Steen
@SanderDeDycker Please don't apologize. I sometimes work in solutions where Boost is already included if I need to do this in such a solution I'll go look up Boost's tolower. So you have provided me with some helpful guidance. It's just not the answer that I want for this question.Steen
@JonathanMee std::string uses char which may or may not be signed.Bailable
@JonathanMee thank post-review for nor not doing syntax highlighting.Ionogen
@Bailable Isn't string defined as basic_string<char>? So it will be signed?Steen
I already dissected everything you need to see what's wrong.Bailable
@JonathanMee char may or may not be signed, that is implementation defined.Raphael
@BaummitAugen Hmmm, I'm not sure about that, "signed is default if omitted": en.cppreference.com/w/cpp/language/types#ModifiersSteen
@JonathanMee On the same page, see this passage: "char - type for character representation which can be most efficiently processed on the target system (has the same representation and alignment as either signed char or unsigned char, but is always a distinct type)."Gaul
@JonathanMee I am sure I'm right. You want me to find the standard quote or do you believe me? :)Raphael
@BaummitAugen If I have to change everything I've believed about the signed modifier could you at least grace me with a citation from the standard?Steen
@JonathanMee Sure thing. ;) "It is implementation-defined whether objects of char type are represented as signed or unsigned quantities. The signed specifier forces char objects to be signed; it is redundant in other contexts." 7.1.6.2 [dcl.type.simple] in N4140.Raphael
The C classification functions require the input value to be representable by unsigned char or be equal to EOF. Thus calling them directly with plain char is invalid if it is signed and the value is negative.Caponize
@Caponize So if I understand what you're saying correctly, if I am working with a signed char[] using cctype's tolower is invalid o.OSteen
@JonathanMee Due to my quote above, that might even be true for plain char[]. See this.Raphael
@BaummitAugen So because locale's tolower works with chars not ints, it should be preferred then? That may be as good an argument as any as far as why I should choose one over the other. Are you interested in writing it up, if not I can.Steen
@JonathanMee I always just used the C one with the cast, non-trivial string handling was never in the scope of my work. Feel free to write it up and use the potential UB (which is an atrocity, I agree) as argument.Raphael
@BaummitAugen Welp, I've done it. I've written up an answer citing basically the determining factor being whether you are willing to work with the cast. I expect to accept this tomorrow unless you have any showstopping comments or an answer of your own you'd like to add.Steen
S
1

It should be noted that the language designers were aware of cctype's tolower when locale's tolower was created. The locale version improved on it in 2 primary ways:

  1. As mentioned in progressive_overload's answer, the locale version allows the use of the ctype facet, even a user-modified one, without having to shuffle in a new LC_CTYPE via setlocale and then restore the previous LC_CTYPE
  2. From section 7.1.6.2 [dcl.type.simple]/3:

It is implementation-defined whether objects of char type are represented as signed or unsigned quantities. The signed specifier forces char objects to be signed

This creates the potential for undefined behavior with the cctype version of tolower if its argument:

Is not representable as unsigned char and does not equal EOF

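So the cctype version of tolower requires an additional conversion on the way in (char to unsigned char) and on the way out (int back to char). A lambda taking an unsigned char handles the input side, yielding: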

transform(cbegin(foo), cend(foo), begin(foo), [](const unsigned char i){ return tolower(i); });

Since the locale version operates directly on chars there is no need for a type conversion.

So if you don't need to perform the conversion with a different ctype facet, it simply becomes a style question: do you prefer the transform with the lambda required by the cctype version, or the locale version's:

use_facet<ctype<char>>(cout.getloc()).tolower(data(foo), next(data(foo), size(foo)));
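
For comparison, here is a minimal sketch of the two approaches side by side. The function names lower_cctype and lower_facet are hypothetical, chosen only for this illustration, and the facet version defaults to std::locale::classic() rather than cout.getloc() so that the result does not depend on the stream's locale:

```cpp
#include <algorithm>
#include <cctype>
#include <locale>
#include <string>

// cctype route: each char is converted to unsigned char by the lambda's
// parameter, so a negative char never reaches std::tolower. The int result
// is narrowed back to char explicitly.
std::string lower_cctype(std::string s) {
    std::transform(s.cbegin(), s.cend(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// locale route: the ctype<char> facet lowercases a [begin, end) range of
// char in place, so no conversion is needed.
// Assumption: the classic "C" locale is used unless the caller passes another.
std::string lower_facet(std::string s, const std::locale& loc = std::locale::classic()) {
    if (!s.empty()) {
        std::use_facet<std::ctype<char>>(loc).tolower(&s[0], &s[0] + s.size());
    }
    return s;
}
```

Both functions take the string by value and return the lowercased copy, so a call such as lower_cctype(foo) leaves foo itself untouched.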
Steen answered 2/6, 2016 at 13:25 Comment(0)
L
6

Unfortunately, both are equally bad. Although std::string pretends to be a UTF-8 encoded string, none of its methods or the functions that operate on it (including tolower) are really UTF-8 aware. So while tolower and tolower + locale may work with single-byte characters (= ASCII), they will fail for every other set of languages.

On Linux, I'd use the ICU library. On Windows, I'd use the CharLower function (the lowercase counterpart of CharUpper).
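
To make the failure concrete, here is a small sketch assuming the classic "C" locale. The helper lower_bytes and the two constants are hypothetical names used only for this illustration. The facet sees the string one byte at a time, so the two bytes that encode "Ä" in UTF-8 come back unchanged while the ASCII letter next to them is lowercased:

```cpp
#include <locale>
#include <string>

// Lowercases byte by byte with the ctype<char> facet of the classic "C" locale.
std::string lower_bytes(std::string s) {
    if (!s.empty()) {
        std::use_facet<std::ctype<char>>(std::locale::classic())
            .tolower(&s[0], &s[0] + s.size());
    }
    return s;
}

// "Ä" (U+00C4) is the two bytes C3 84 in UTF-8; "ä" (U+00E4) is C3 A4.
const std::string upper_a_umlaut = "\xC3\x84";
const std::string lower_a_umlaut = "\xC3\xA4";
```

A single-byte locale such as Latin-1 might change individual bytes, but it would still be treating each byte as its own character rather than decoding the UTF-8 sequence.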

Lucila answered 27/5, 2016 at 12:47 Comment(8)
You're saying that locale's tolower can't handle UTF-8 either? Hmmm, that would have been a good argument for it.Steen
@JonathanMee Unfortunately, C++ has no meaningful Unicode support in any sense.Raphael
C++ sucks at this indeed, but are you meaning that in 2016 we can't even have a portable library to handle this ?Lupulin
@BaummitAugen Are you guys sure about the lack of UTF-8 support? That's actually one of the things demonstrated in the en.cppreference.com/w/cpp/locale/tolower example. I haven't been able to come up with a way to make it fail with UTF-8, even when using "multi-byte characters".Steen
@JonathanMee Rekt, output should be ω.Raphael
@JonathanMee And then there is stuff like this which should yield SS, but have fun building that with the normal char types.Raphael
@BaummitAugen You are right :( The input I tested with was just wchar_t on Windows not UTF-8. UTF-8 is still broken. Return to your lives citizens. In other news, I happen to have done some personal research on the 'ß' character though. According to the standard it should not convert to "SS" nor to 'ẞ': https://mcmap.net/q/56585/-c-c-utf-8-upper-lower-case-conversionsSteen
@JonathanMee Unicode says it should: "German sharp s . The German sharp s character has several complications in case mapping. Not only does its uppercase mapping expand in length, but its default case-pairings are asymmetrical. The default case mapping operations follow standard German orthography, which uses the string “SS” as the regular uppercase mapping for U+00DF ß latin small letter sharp s ." Unicode 8 5.18 And Unicode is the standard that defines the behavior of UTF8, not some C++ standard.Raphael
O
4

In the first case (cctype) the locale is set implicitly:

Converts the given character to lowercase according to the character conversion rules defined by the currently installed C locale.

http://en.cppreference.com/w/cpp/string/byte/tolower

In the second (locale's) case you have to explicitly set the locale:

Converts parameter c to its lowercase equivalent if c is an uppercase letter and has a lowercase equivalent, as determined by the ctype facet of locale loc. If no such conversion is possible, the value returned is c unchanged.

http://www.cplusplus.com/reference/locale/tolower/
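
The difference shows up in the call signatures. A minimal sketch, with lower_implicit and lower_explicit being hypothetical wrapper names used only for illustration:

```cpp
#include <cctype>
#include <locale>

// <cctype>: no locale parameter; the result depends on whatever C locale
// is currently installed (via std::setlocale). The cast to unsigned char
// keeps negative char values away from std::tolower.
char lower_implicit(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// <locale>: the locale is an explicit argument, so the global C locale
// does not influence the result.
char lower_explicit(char c, const std::locale& loc) {
    return std::tolower(c, loc);
}
```

For example, lower_explicit('Q', std::locale::classic()) uses the classic "C" locale regardless of any earlier setlocale call.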

Ostyak answered 27/5, 2016 at 11:48 Comment(0)