What is the best way of doing case-insensitive string comparison in C++ without transforming a string to all uppercase or all lowercase?
Please indicate whether the methods are Unicode-friendly and how portable they are.
Boost includes a handy algorithm for this:
#include <boost/algorithm/string.hpp>
// Or, for fewer header dependencies:
//#include <boost/algorithm/string/predicate.hpp>
std::string str1 = "hello, world!";
std::string str2 = "HELLO, WORLD!";
if (boost::iequals(str1, str2))
{
// Strings are equal, ignoring case
}
boost::iequals defers to std::locale(), which is unable to handle these things. Anything not ICU is, at this point of writing, lying through its teeth. – Hannis
The trouble with boost is that you have to link with and depend on boost. Not easy in some cases (e.g. android).
And using char_traits means all your comparisons are case insensitive, which isn't usually what you want.
This should suffice. It should be reasonably efficient. Doesn't handle unicode or anything though.
#include <cctype> // std::tolower
#include <algorithm> // std::equal
bool ichar_equals(char a, char b)
{
return std::tolower(static_cast<unsigned char>(a)) ==
std::tolower(static_cast<unsigned char>(b));
}
bool iequals(const std::string& a, const std::string& b)
{
return a.size() == b.size() &&
std::equal(a.begin(), a.end(), b.begin(), ichar_equals);
}
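In C++14, the four-iterator overload of std::equal compares the lengths itself, so the explicit size check can be dropped:
#include <cctype> // std::tolower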
#include <algorithm> // std::equal
bool iequals(const std::string& a, const std::string& b)
{
return std::equal(a.begin(), a.end(), b.begin(), b.end(), ichar_equals);
}
With std::ranges (C++20):
#include <cctype> // std::tolower
#include <algorithm> // std::equal
#include <string_view> // std::string_view
bool iequals(std::string_view lhs, std::string_view rhs)
{
return std::ranges::equal(lhs, rhs, ichar_equals);
}
std::equal is not available in C++11. – Nonpros
std::tolower should not be called on char directly, a static_cast to unsigned char is needed. – Truant
-O3 and they all do the sane thing. I imagine if you made a compiler that didn't it wouldn't be able to compile a lot of existing code. – Nonpros
[](unsigned char a, unsigned char b), no static_cast is necessary. – Apposite
Σ maps to ς at word's end and to σ elsewhere). While the other way round exists, too, (was that in Turkish?) this case is rarer, so chances to get correct comparison is greater with toupper – sure, doesn't help out if you happen to encode exactly one of the counter-example languages ;) – Heall
std::tolower, but some UTF-8 strings can be compared this way. – Maledict
char into std::tolower. (2) std::tolower isn't explicitly addressable, so I don't believe it would be valid to takes its address and store it in a function pointer. (3) all the solutions could have been simplified and turned into one-liners. – Maledict
std::equal is available in C++11 too, just the version taking four iterators isn't. See timsong-cpp.github.io/cppwp/n3337/alg.equal – Maledict
Take advantage of the standard char_traits. Recall that a std::string is in fact a typedef for std::basic_string<char>, or more explicitly, std::basic_string<char, std::char_traits<char> >. The char_traits type describes how characters compare, how they copy, how they cast etc. All you need to do is typedef a new string over basic_string, and provide it with your own custom char_traits that compare case insensitively.
struct ci_char_traits : public std::char_traits<char> {
    // std::toupper needs a value representable as unsigned char, so cast first.
    static int up(char c) { return std::toupper(static_cast<unsigned char>(c)); }
    static bool eq(char c1, char c2) { return up(c1) == up(c2); }
    static bool ne(char c1, char c2) { return up(c1) != up(c2); }
    static bool lt(char c1, char c2) { return up(c1) < up(c2); }
    static int compare(const char* s1, const char* s2, std::size_t n) {
        while( n-- != 0 ) {
            if( up(*s1) < up(*s2) ) return -1;
            if( up(*s1) > up(*s2) ) return 1;
            ++s1; ++s2;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char a) {
        while( n-- != 0 ) {
            if( up(*s) == up(a) ) return s;
            ++s;
        }
        return 0; // not found
    }
};
typedef std::basic_string<char, ci_char_traits> ci_string;
The details are on Guru of The Week number 29.
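As a rough sketch of how such a traits class might be used alongside ordinary strings: the helper names ci_upper, to_ci and to_std below are hypothetical and not part of the original answer, and the traits class is restated only so the block compiles on its own. The conversion uses the range constructor, and nullptr assumes C++11 or later.

```cpp
#include <cctype>    // std::toupper
#include <cstddef>   // std::size_t
#include <string>    // std::basic_string, std::char_traits

// Hypothetical helper: uppercase one char without handing std::toupper a negative value.
inline int ci_upper(char c) {
    return std::toupper(static_cast<unsigned char>(c));
}

struct ci_char_traits : public std::char_traits<char> {
    static bool eq(char a, char b) { return ci_upper(a) == ci_upper(b); }
    static bool lt(char a, char b) { return ci_upper(a) < ci_upper(b); }
    static int compare(const char* s1, const char* s2, std::size_t n) {
        for (; n != 0; --n, ++s1, ++s2) {
            if (lt(*s1, *s2)) return -1;
            if (lt(*s2, *s1)) return 1;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char a) {
        for (; n != 0; --n, ++s) {
            if (eq(*s, a)) return s;
        }
        return nullptr;  // not found
    }
};

typedef std::basic_string<char, ci_char_traits> ci_string;

// Copy between the two string types with the range constructor (hypothetical helpers).
inline ci_string to_ci(const std::string& s) { return ci_string(s.begin(), s.end()); }
inline std::string to_std(const ci_string& s) { return std::string(s.begin(), s.end()); }
```

With this in place, to_ci("Hello") == to_ci("hELLO") holds, and member functions such as find also ignore case, while to_std returns the characters with their original case preserved.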
typedef std::basic_string<char, ci_char_traits<char> > istring, not typedef std::basic_string<char, std::char_traits<char> > string. – Pointer
find should be std::size_t, not int. Unfortunately I can't edit because the question is locked. Also it's possible to implement find and compare in terms of eq, lt and ne. – Warrant
std::string, it's trivial to convert between the two using the range constructor. – Poitiers
std::unordered_map. (Or at least, the implementation of the string hash in MSVC's standard library does not appear to use the char traits for anything.) So if using this with std::unordered_map, a specialization of std::hash will probably be needed too. – Poitiers
std::toupper should not be called on char directly, a static_cast to unsigned char is needed. – Truant
std::string s1{ "Ignore my CASE" }; std::string s2{ "ignore my case" }; std::basic_string_view<std::string::value_type, ci_char_traits> ci_view{ s1.c_str() }; std::cout << std::boolalpha << "\"" << s1 << "\" equals \"" << s2 << "\": " << (s1.compare(s2) == 0) << std::endl; std::cout << "\"" << s1 << "\" equals \"" << s2 << "\" (ignore casing): " << (ci_view.compare(s2.c_str()) == 0) << std::endl; – Venter
If you are on a POSIX system, you can use strcasecmp. This function is not part of standard C, though, nor is it available on Windows. This will perform a case-insensitive comparison on 8-bit chars, so long as the locale is POSIX. If the locale is not POSIX, the results are undefined (so it might do a localized compare, or it might not). A wide-character equivalent is not available.
Failing that, a large number of historic C library implementations have the functions stricmp() and strnicmp(). Visual C++ on Windows renamed all of these by prefixing them with an underscore because they aren’t part of the ANSI standard, so on that system they’re called _stricmp or _strnicmp. Some libraries may also have wide-character or multibyte equivalent functions (typically named e.g. wcsicmp, mbcsicmp and so on).
C and C++ are both largely ignorant of internationalization issues, so there's no good solution to this problem, except to use a third-party library. Check out IBM ICU (International Components for Unicode) if you need a robust library for C/C++. ICU is for both Windows and Unix systems.
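A minimal sketch of hiding the strcasecmp / _stricmp split behind one function, assuming the _WIN32 macro is a good enough way to tell the two families apart for your toolchain. The name equals_ignore_case is hypothetical. Both C functions stop at the first null byte, so strings with embedded nulls are only compared up to that point.

```cpp
#include <string>

#if defined(_WIN32)
#  include <string.h>   // _stricmp (Microsoft C runtime)
#else
#  include <strings.h>  // strcasecmp (POSIX)
#endif

// Hypothetical wrapper: true when a and b match, ignoring case as the C library defines it.
inline bool equals_ignore_case(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;  // cheap early exit, the length is already known
#if defined(_WIN32)
    return _stricmp(a.c_str(), b.c_str()) == 0;
#else
    return strcasecmp(a.c_str(), b.c_str()) == 0;
#endif
}
```

This keeps the platform check in one place; it does not make the comparison any more Unicode-aware than the underlying C function.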
Are you talking about a dumb case insensitive compare or a full normalized Unicode compare?
A dumb compare will not find strings that might be the same but are not binary equal.
Example:
U+212B (ANGSTROM SIGN)
U+0041 (LATIN CAPITAL LETTER A) + U+030A (COMBINING RING ABOVE)
U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE).
All three are equivalent, but they have different binary representations.
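To make the point concrete, here is a small sketch that spells out the UTF-8 bytes of those three sequences and compares them the "dumb" way. The variable names and the bytes_equal helper are hypothetical, and UTF-8 is an assumption (the answer does not name an encoding).

```cpp
#include <string>

// UTF-8 encodings of the three sequences, written as explicit bytes.
const std::string angstrom_sign = "\xE2\x84\xAB";  // U+212B
const std::string a_plus_ring   = "A\xCC\x8A";     // U+0041 followed by U+030A
const std::string a_with_ring   = "\xC3\x85";      // U+00C5

// Hypothetical helper: plain byte-wise equality, which is all std::string's == does.
inline bool bytes_equal(const std::string& x, const std::string& y) {
    return x == y;
}
```

No pair of these compares equal byte for byte, and they do not even have the same length, so telling that they denote the same letter requires normalization from a Unicode library.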
That said, Unicode Normalization should be a mandatory read, especially if you plan on supporting Hangul, Thai and other Asian languages.
Also, IBM pretty much patented most optimized Unicode algorithms and made them publicly available. They also maintain an implementation: IBM ICU
a, o or u with diaeresis or directly via letters ä, ö, ü – however the distance of the two dots is (slightly) different (direct characters narrower)... – Heall
My first thought for a non-unicode version was to do something like this:
bool caseInsensitiveStringCompare(const string& str1, const string& str2) {
if (str1.size() != str2.size()) {
return false;
}
for (string::const_iterator c1 = str1.begin(), c2 = str2.begin(); c1 != str1.end(); ++c1, ++c2) {
if (tolower(static_cast<unsigned char>(*c1)) != tolower(static_cast<unsigned char>(*c2))) {
return false;
}
}
return true;
}
boost::iequals is not utf-8 compatible in the case of string. You can use boost::locale.
comparator<char,collator_base::secondary> cmpr;
cout << (cmpr(str1, str2) ? "str1 < str2" : "str1 >= str2") << endl;
You can use strcasecmp on Unix, or stricmp on Windows.
One thing that hasn't been mentioned so far is that if you are using stl strings with these methods, it's useful to first compare the length of the two strings, since this information is already available to you in the string class. This could prevent doing the costly string comparison if the two strings you are comparing aren't even the same length in the first place.
I'm trying to cobble together a good answer from all the posts, so help me edit this:
Here is a method of doing this. Although it transforms the strings and is not Unicode-friendly, it should be portable, which is a plus:
bool caseInsensitiveStringCompare( const std::string& str1, const std::string& str2 ) {
    std::string str1Cpy( str1 );
    std::string str2Cpy( str2 );
    // Go through unsigned char: passing a negative char to tolower is undefined behavior.
    auto lower = []( unsigned char c ) { return static_cast<char>( std::tolower( c ) ); };
    std::transform( str1Cpy.begin(), str1Cpy.end(), str1Cpy.begin(), lower );
    std::transform( str2Cpy.begin(), str2Cpy.end(), str2Cpy.begin(), lower );
    return ( str1Cpy == str2Cpy );
}
From what I have read this is more portable than stricmp() because stricmp() is not in fact part of the std library, but only implemented by most compiler vendors.
To get a truly Unicode friendly implementation it appears you must go outside the std library. One good 3rd party library is the IBM ICU (International Components for Unicode)
Also boost::iequals provides a fairly good utility for doing this sort of comparison.
transform the whole string before comparison – Burmaburman
std::tolower should not be called on char directly, a static_cast to unsigned char is needed. – Truant
See std::lexicographical_compare:
// lexicographical_compare example
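#include <iostream> // std::cout, std::boolalpha
#include <string> // std::string
#include <algorithm> // std::lexicographical_compare
#include <cctype> // std::tolower
// a case-insensitive comparison function:
bool mycomp(char c1, char c2) {
    // cast first: std::tolower must not be given a negative char value
    return std::tolower(static_cast<unsigned char>(c1)) < std::tolower(static_cast<unsigned char>(c2));
}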
int main() {
std::string foo = "Apple";
std::string bar = "apartment";
std::cout << std::boolalpha;
std::cout << "Comparing foo and bar lexicographically (foo<bar):\n";
std::cout << "Using default comparison (operator<): ";
std::cout << std::lexicographical_compare(foo.begin(), foo.end(), bar.begin(), bar.end());
std::cout << '\n';
std::cout << "Using custom comparison (mycomp): ";
std::cout << std::lexicographical_compare(foo.begin(), foo.end(), bar.begin(), bar.end(), mycomp);
std::cout << '\n';
return 0;
}
std::tolower works only if the character is ASCII-encoded. There is no such guarantee for std::string - so it can be undefined behavior easily. – Klong
std::string not with std::lexicographical_compare. – Bretbretagne
str1.size() == str2.size() && std::equal(str1.begin(), str1.end(), str2.begin(), [](auto a, auto b){return std::tolower(static_cast<unsigned char>(a))==std::tolower(static_cast<unsigned char>(b));})
You can use the above code in C++14 if you are not in a position to use boost. You have to use std::towlower for wide chars.
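For the wide-character case mentioned above, a sketch along the same lines might look like this. The function name iequals_wide is hypothetical, and the result depends on the current C locale, so outside ASCII this is not a full Unicode comparison.

```cpp
#include <algorithm>  // std::equal
#include <cwctype>    // std::towlower, std::wint_t
#include <string>     // std::wstring

// Hypothetical wide-character counterpart of the one-liner above.
inline bool iequals_wide(const std::wstring& a, const std::wstring& b) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      [](wchar_t x, wchar_t y) {
                          return std::towlower(static_cast<std::wint_t>(x)) ==
                                 std::towlower(static_cast<std::wint_t>(y));
                      });
}
```

The size check comes first so that the three-iterator std::equal never reads past the end of the shorter string.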
str1.size() == str2.size() && to the front so that will not go out of bounds when str2 is a prefix of str1. – Hominoid
Short and nice. No dependencies other than the extended standard C library.
strcasecmp(str1.c_str(), str2.c_str()) == 0
returns true if str1 and str2 are equal.
strcasecmp may not exist; there could be analogs such as stricmp, strcmpi, etc.
Example code:
#include <iostream>
#include <string>
#include <strings.h> // For strcasecmp() (POSIX). Some compilers declare it in <string.h> instead
using namespace std;
/// Simple wrapper
inline bool str_ignoreCase_cmp(std::string const& s1, std::string const& s2) {
if(s1.length() != s2.length())
return false; // optimization since std::string holds length in variable.
return strcasecmp(s1.c_str(), s2.c_str()) == 0;
}
/// Function object - comparator
struct StringCaseInsensetiveCompare {
bool operator()(std::string const& s1, std::string const& s2) {
if(s1.length() != s2.length())
return false; // optimization since std::string holds length in variable.
return strcasecmp(s1.c_str(), s2.c_str()) == 0;
}
bool operator()(const char *s1, const char * s2){
return strcasecmp(s1,s2)==0;
}
};
/// Convert bool to string
inline char const* bool2str(bool b){ return b?"true":"false"; }
int main()
{
cout<< bool2str(strcasecmp("asd","AsD")==0) <<endl;
cout<< bool2str(strcasecmp(string{"aasd"}.c_str(),string{"AasD"}.c_str())==0) <<endl;
StringCaseInsensetiveCompare cmp;
cout<< bool2str(cmp("A","a")) <<endl;
cout<< bool2str(cmp(string{"Aaaa"},string{"aaaA"})) <<endl;
cout<< bool2str(str_ignoreCase_cmp(string{"Aaaa"},string{"aaaA"})) <<endl;
return 0;
}
Output:
true
true
true
true
true
stricmp, strcmpi, strcasecmp, etc. Thank you. message edited. – Promisee
cout << boolalpha rather than my bool2str because It to implicitly convert bool to chars for stream. – Promisee
Visual C++ string functions supporting unicode: http://msdn.microsoft.com/en-us/library/cc194799.aspx
the one you are probably looking for is _wcsnicmp
FYI, strcmp() and stricmp() just process until they hit a null terminator, so they can read past the end of a buffer that is not null-terminated. It's safer to use strncmp() and _strnicmp(), which take a maximum length.
stricmp() and strnicmp() are not part of the POSIX standard :-( However you can find strcasecmp(), strcasecmp_l(), strncasecmp() and strncasecmp_l() in POSIX header strings.h :-) see opengroup.org – Bailiff
The Boost.String library has a lot of algorithms for doing case-insensitive comparisons and so on.
You could implement your own, but why bother when it's already been done?
For my basic case insensitive string comparison needs I prefer not to have to use an external library, nor do I want a separate string class with case insensitive traits that is incompatible with all my other strings.
So what I've come up with is this:
bool icasecmp(const string& l, const string& r)
{
    return l.size() == r.size()
        && equal(l.cbegin(), l.cend(), r.cbegin(),
            // unsigned char parameters keep negative char values away from toupper
            [](unsigned char l1, unsigned char r1)
                { return toupper(l1) == toupper(r1); });
}
bool icasecmp(const wstring& l, const wstring& r)
{
return l.size() == r.size()
&& equal(l.cbegin(), l.cend(), r.cbegin(),
[](wstring::value_type l1, wstring::value_type r1)
{ return towupper(l1) == towupper(r1); });
}
A simple function with one overload for char and another for wchar_t. Doesn't use anything non-standard so should be fine on any platform.
The equality comparison won't consider issues like variable length encoding and Unicode normalization, but basic_string has no support for that that I'm aware of anyway and it isn't normally an issue.
In cases where more sophisticated lexicographical manipulation of text is required, then you simply have to use a third party library like Boost, which is to be expected.
Doing this without using Boost can be done by getting the C string pointer with c_str() and using strcasecmp:
std::string str1 = "aBcD";
std::string str2 = "AbCd";
if (strcasecmp(str1.c_str(), str2.c_str()) == 0)
{
//case insensitive equal
}
Assuming you are looking for a method and not a magic function that already exists, there is frankly no better way. We could all write code snippets with clever tricks for limited character sets, but at the end of the day, at some point you have to convert the characters.
The best approach for this conversion is to do so prior to the comparison. This allows you a good deal of flexibility when it comes to encoding schemes, which your actual comparison operator should be ignorant of.
You can of course 'hide' this conversion behind your own string function or class, but you still need to convert the strings prior to comparison.
I wrote a case-insensitive version of char_traits for use with std::basic_string in order to generate a std::string that is not case-sensitive when doing comparisons, searches, etc using the built-in std::basic_string member functions.
So in other words, I wanted to do something like this.
std::string a = "Hello, World!";
std::string b = "hello, world!";
assert( a == b );
...which std::string can't handle. Here's the usage of my new char_traits:
istring a = "Hello, World!";
istring b = "hello, world!";
assert( a == b );
...and here's the implementation:
/* ---
Case-Insensitive char_traits for std::string's
Use:
To declare a std::string which preserves case but ignores case in comparisons & search,
use the following syntax:
std::basic_string<char, char_traits_nocase<char> > noCaseString;
A typedef is declared below which simplifies this use for chars:
typedef std::basic_string<char, char_traits_nocase<char> > istring;
--- */
template<class C>
struct char_traits_nocase : public std::char_traits<C>
{
static bool eq( const C& c1, const C& c2 )
{
return ::toupper(c1) == ::toupper(c2);
}
static bool lt( const C& c1, const C& c2 )
{
return ::toupper(c1) < ::toupper(c2);
}
static int compare( const C* s1, const C* s2, size_t N )
{
return _strnicmp(s1, s2, N); // Microsoft-specific; POSIX systems offer strncasecmp instead
}
static const C* find( const C* s, size_t N, const C& a )
{
for( size_t i=0 ; i<N ; ++i )
{
if( ::toupper(s[i]) == ::toupper(a) )
return s+i ;
}
return 0 ;
}
static bool eq_int_type( const typename std::char_traits<C>::int_type& c1, const typename std::char_traits<C>::int_type& c2 )
{
return ::toupper(c1) == ::toupper(c2) ;
}
};
template<>
struct char_traits_nocase<wchar_t> : public std::char_traits<wchar_t>
{
static bool eq( const wchar_t& c1, const wchar_t& c2 )
{
return ::towupper(c1) == ::towupper(c2);
}
static bool lt( const wchar_t& c1, const wchar_t& c2 )
{
return ::towupper(c1) < ::towupper(c2);
}
static int compare( const wchar_t* s1, const wchar_t* s2, size_t N )
{
return _wcsnicmp(s1, s2, N);
}
static const wchar_t* find( const wchar_t* s, size_t N, const wchar_t& a )
{
for( size_t i=0 ; i<N ; ++i )
{
if( ::towupper(s[i]) == ::towupper(a) )
return s+i ;
}
return 0 ;
}
static bool eq_int_type( const int_type& c1, const int_type& c2 )
{
return ::towupper(c1) == ::towupper(c2) ;
}
};
typedef std::basic_string<char, char_traits_nocase<char> > istring;
typedef std::basic_string<wchar_t, char_traits_nocase<wchar_t> > iwstring;
Late to the party, but here is a variant that uses std::locale, and thus correctly handles Turkish:
auto tolower = std::bind1st(
std::mem_fun(
&std::ctype<char>::tolower),
&std::use_facet<std::ctype<char> >(
std::locale()));
gives you a functor that uses the active locale to convert characters to lowercase, which you can then use via std::transform to generate lower-case strings:
std::string left = "fOo";
transform(left.begin(), left.end(), left.begin(), tolower);
This also works for wchar_t based strings.
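Since std::bind1st and std::mem_fun were deprecated in C++11 and removed in C++17, the same idea can be sketched with a lambda over the locale's ctype facet. The function name to_lower_copy is hypothetical, and the locale defaults to the current global one; this is an illustration of the approach, not the answerer's code.

```cpp
#include <algorithm>  // std::transform
#include <locale>     // std::locale, std::ctype, std::use_facet
#include <string>

// Hypothetical helper: lower-case a copy of s using the ctype facet of loc.
inline std::string to_lower_copy(std::string s, const std::locale& loc = std::locale()) {
    const std::ctype<char>& facet = std::use_facet<std::ctype<char> >(loc);
    std::transform(s.begin(), s.end(), s.begin(),
                   [&facet](char c) { return facet.tolower(c); });
    return s;
}
```

Two strings can then be compared with to_lower_copy(a) == to_lower_copy(b). Because the facet works one char at a time, this only helps for single-byte encodings of the chosen locale.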
I've had good experience using the International Components for Unicode libraries - they're extremely powerful, and provide methods for conversion, locale support, date and time rendering, case mapping (which you don't seem to want), and collation, which includes case- and accent-insensitive comparison (and more). I've only used the C++ version of the libraries, but they appear to have a Java version as well.
Methods exist to perform normalized compares as referred to by @Coincoin, and can even account for locale - for example (and this a sorting example, not strictly equality), traditionally in Spanish (in Spain), the letter combination "ll" sorts between "l" and "m", so "lz" < "ll" < "ma".
Just use strcmp() for case-sensitive and strcmpi() or stricmp() for case-insensitive comparison. Both are declared in the header file <string.h>, although strcmpi() and stricmp() are non-standard and not available with every compiler.
Format:
int strcmp(const char*,const char*); //for case sensitive
int strcmpi(const char*,const char*); //for case insensitive
Usage:
string a="apple",b="ApPlE",c="ball";
if(strcmpi(a.c_str(),b.c_str())==0) //(if it is a match it will return 0)
cout<<a<<" and "<<b<<" are the same"<<"\n";
if(strcmpi(a.c_str(),c.c_str())<0) //(negative if the first string sorts before the second)
cout<<a[0]<<" comes before "<<c[0]<<", so "<<a<<" comes before "<<c;
Output
apple and ApPlE are the same
a comes before b, so apple comes before ball
A simple way to compare two strings in C++ (tested on Windows) is to use _stricmp:
// Case insensitive (could use equivalent _stricmp)
result = _stricmp( string1, string2 );
If you are looking to use with std::string, an example:
std::string s1 = string("Hello");
if ( _stricmp(s1.c_str(), "HELLO") == 0)
std::cout << "The strings are equal.";
For more information here: https://msdn.microsoft.com/it-it/library/e0z9k731.aspx
Just a note on whatever method you finally choose, if that method happens to include the use of strcmp, which some answers suggest:
strcmp doesn't work with Unicode data in general. In general, it doesn't even work with byte-based Unicode encodings, such as UTF-8, since strcmp only makes byte-per-byte comparisons and Unicode code points encoded in UTF-8 can take more than 1 byte. The only specific Unicode case strcmp properly handles is when a string encoded with a byte-based encoding contains only code points below U+00FF - then the byte-per-byte comparison is enough.
It looks like the solutions above aren't using the compare method and are implementing the whole comparison again, so here is my solution. I hope it works for you (it's working fine for me).
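#include<iostream>
#include<string> // std::string
#include<cctype> // tolower
using namespace std;
string tolow(string a)
{
for(unsigned int i=0;i<a.length();i++)
{
a[i]=tolower(static_cast<unsigned char>(a[i])); // cast first: tolower must not get a negative char
}
return a;
}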
int main()
{
string str1,str2;
cin>>str1>>str2;
int temp=tolow(str1).compare(tolow(str2));
if(temp>0)
cout<<1;
else if(temp==0)
cout<<0;
else
cout<<-1;
}
As of early 2013, the ICU project, maintained by IBM, is a pretty good answer to this.
ICU is a "complete, portable Unicode library that closely tracks industry standards." For the specific problem of string comparison, the Collation object does what you want.
The Mozilla Project adopted ICU for internationalization in Firefox in mid-2012; you can track the engineering discussion, including issues of build systems and data file size, here:
If you have to compare one source string against many other strings, one elegant solution is to use a regex.
std::wstring first = L"Test";
std::wstring second = L"TEST";
std::wregex pattern(first, std::wregex::icase);
bool isEqual = std::regex_match(second, pattern);
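A narrow-string sketch of the same idea, using std::regex and std::string instead of the wide types. The name regex_iequals is hypothetical. One caveat worth stating loudly: the first argument is interpreted as a regular expression rather than a literal, so metacharacters such as '.' or '*' in it will match more than themselves unless they are escaped.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when candidate matches pattern_text, ignoring case.
// pattern_text is treated as a regular expression, not as a literal string.
inline bool regex_iequals(const std::string& pattern_text, const std::string& candidate) {
    const std::regex pattern(pattern_text, std::regex::icase);
    return std::regex_match(candidate, pattern);
}
```

In real use the std::regex object would be built once and reused for every candidate, since constructing it is the expensive part.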
error: conversion from 'const char [5]' to non-scalar type 'std::wstring {aka std::basic_string<wchar_t>}' requested – Nakano
If you don't want to use the Boost library, here is a solution that uses only C++ standard library headers.
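#include <iostream>
#include <string>
#include <cctype>
#include <algorithm>
#include <stdexcept>
#include <cassert>
struct iequal
{
    // unsigned char parameters keep negative char values away from std::toupper
    bool operator()(unsigned char c1, unsigned char c2) const
    {
        return std::toupper(c1) == std::toupper(c2);
    }
};
bool iequals(const std::string& str1, const std::string& str2)
{
    // The three-iterator std::equal assumes the second range is at least as
    // long as the first, so compare the sizes before comparing characters.
    if (str1.size() != str2.size())
    {
        return false;
    }
    return std::equal(str1.begin(), str1.end(), str2.begin(), iequal());
}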
void runTests()
{
assert(iequals("HELLO", "hello") == true);
assert(iequals("HELLO", "") == false);
assert(iequals("", "hello") == false);
assert(iequals("", "") == true);
std::cout << "All tests passed!" << std::endl;
}
int main(void)
{
try
{
runTests();
}
catch (const std::exception& e)
{
std::cerr << "Exception: " << e.what() << std::endl;
}
return 0;
}
bool insensitive_c_compare(char A, char B){
    const char lo2up = 'A' - 'a'; /// the offset from a lowercase letter to its uppercase form
    /// only letters are converted
    /// (trying to turn a 3 into an E would not be pretty!)
    if ('a' <= A and A <= 'z')
        A = A + lo2up; /// convert lowercase letters to uppercase ones
    if ('a' <= B and B <= 'z')
        B = B + lo2up;
    return A == B;
}
This is a bare-bones version with all its bits visible.
It assumes an ASCII-compatible character set, so it is not all that portable, but it works well with whatever is on my computer (no idea, I am of pictures not words)
An easy way to compare strings that differ only in the case of their letters is to do an ASCII comparison. Uppercase and lowercase letters are 32 apart in the ASCII table; using this information we have the following...
size_t count = 0;
if (string1.length() == string2.length())
{
    for( size_t i = 0; i < string2.length(); i++)
    {
        // equal as-is, or exactly 32 apart; note this also accepts non-letter
        // pairs that are 32 apart, such as '@' and '`'
        if (string1[i] == string2[i] || int(string1[i]) == int(string2[i])+32 || int(string1[i]) == int(string2[i])-32)
            count++;
        else
            break;
    }
    if(count == string2.length())
    {
        //then we have a match
    }
}
© 2022 - 2024 — McMap. All rights reserved.
std::stricmp. Otherwise, read what Herb has to say. – Roof
strcasecmp is not part of the standard and is missing from at least one common compiler. – Shetrit