How can I check if a string has special characters in C++ effectively?

I am trying to find out whether there is a better way to check if a string has special characters. In my case, anything other than an alphanumeric character or '_' is considered a special character. Currently, I keep the special characters in a std::string such as "!@#$%^&" and use the std::find_first_of() algorithm to check whether any of them are present in the input string.

I was wondering how to do it based on whitelisting instead. I want to specify the lowercase and uppercase letters, the digits, and an underscore without listing every character. Is there any way to specify an ASCII range of some sort, like [a-zA-Z0-9_]? I then plan to use std::string::find_first_not_of(). That way I can state what I actually want and check for the opposite.

Etruria answered 7/7, 2011 at 2:46 Comment(3)
https://mcmap.net/q/55357/-most-efficient-way-to-remove-special-characters-from-string/82705Pontoon
@Sai Ganesh: different language there (C#)Luisaluise
C++ doesn't assume ASCII. It's even compatible with EBCDIC, in which A-Z is not contiguous.Luisaluise
I
19

Try:

std::string  x(/*Load*/);
if (x.find_first_not_of("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_") != std::string::npos)
{
    std::cerr << "Error\n";
}

Or try boost regular expressions:

// Note: \w matches any word character (alphanumeric plus "_")
boost::regex test("\\w+", boost::regex::perl);
if (!boost::regex_match(x.begin(), x.end(), test))
{
    std::cerr << "Error\n";
}

// The equivalent to \w should be:
boost::regex test("[A-Za-z0-9_]+", boost::regex::perl);
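
Since C++11 the standard library ships <regex>, so the same check no longer needs Boost. The following is a minimal sketch, assuming a C++11 compiler; the function name has_special_char_regex is made up for illustration. It searches for a single character outside the whitelist rather than matching the whole string, so an empty string counts as having no special characters.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if `s` contains any character
// outside [A-Za-z0-9_].
bool has_special_char_regex(const std::string& s) {
    // Built once; constructing a std::regex is comparatively expensive.
    static const std::regex special("[^A-Za-z0-9_]");
    return std::regex_search(s, special);
}
```

For example, has_special_char_regex("abc_123") is false and has_special_char_regex("abc!") is true.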
Ilex answered 7/7, 2011 at 2:50 Comment(3)
I know that I can do this, but I was wondering if I could specify a range such as [a-z A-Z 0-9 _], an ASCII value range, or something similar.Etruria
@Praveen: Added boost version.Ilex
Regex got much simpler since this post: #include <regex>; /*....*/ if(!std::regex_match(str_val,std::regex("[A-Za-z0-9\\-_]+"))) throw;Devitalize
A
4

There's no way using standard C or C++ to do that using character ranges; you have to list out all of the characters. For C strings, you can use strspn(3) and strcspn(3) to find the first character in a string that is a member of or is not a member of a given character set. For example:

// Test if the given string has anything not in A-Za-z0-9_
bool HasSpecialCharacters(const char *str)
{
    return str[strspn(str, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_")] != 0;
}

For C++ strings, you can equivalently use the find_first_of and find_first_not_of member functions.

Another option is to use isalnum(3) and the related functions from <ctype.h> to test whether a given character is alphanumeric; note that these functions are locale-dependent, so their behavior can (and does) change in other locales. If you do not want that behavior, then don't use them. If you do choose to use them, you'll also have to test for underscores separately, since there's no function that tests "alphabetic, numeric, or underscore", and you'll have to code your own loop to search the string (or use std::find_if with an appropriate function object).
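
If the locale dependence is the only concern, one possible workaround is the two-argument std::isalnum from <locale>, called with std::locale::classic() so the global locale is never consulted. This is only a sketch, assuming C++11; the helper name has_special_char_classic is hypothetical.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical helper: true if `s` has a character that is neither
// alphanumeric in the classic "C" locale nor an underscore.
bool has_special_char_classic(const std::string& s) {
    const std::locale& c_locale = std::locale::classic();
    return std::any_of(s.begin(), s.end(), [&](char ch) {
        // The <locale> overload takes the locale explicitly instead of
        // using whatever setlocale() installed.
        return !(std::isalnum(ch, c_locale) || ch == '_');
    });
}
```

The underscore still needs its own test, as noted above.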

Acotyledon answered 7/7, 2011 at 2:58 Comment(0)
S
4

The first thing that you need to consider is "is this ASCII only"? If your answer is yes, I would encourage you to really consider whether or not you should allow ASCII only. I currently work for a company that is having real headaches getting into foreign markets because we didn't think to support Unicode from the get-go.

That being said, ASCII makes it really easy to check for non-alphanumeric characters. Take a look at the ASCII chart.

http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters

  • Iterate through each character
  • Check if the character is decimal value 48 - 57, 65 - 90, 97 - 122, or 95 (underscore)
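
The two steps above could be sketched roughly as follows. This assumes the input really is ASCII-encoded; the function names is_allowed_ascii and has_special_char_ascii are made up for illustration.

```cpp
#include <string>

// Assumes ASCII: the numeric ranges are the ASCII codes for
// 0-9, A-Z, a-z and '_'.
bool is_allowed_ascii(unsigned char c) {
    return (c >= 48 && c <= 57)    // '0'..'9'
        || (c >= 65 && c <= 90)    // 'A'..'Z'
        || (c >= 97 && c <= 122)   // 'a'..'z'
        || c == 95;                // '_'
}

// Iterate through each character and report the first one
// that falls outside the allowed ranges.
bool has_special_char_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (!is_allowed_ascii(c)) return true;
    }
    return false;
}
```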
Syllabi answered 7/7, 2011 at 2:59 Comment(0)
B
4

I think I'd do the job just a bit differently, treating the std::string as a collection, and using an algorithm. Using a C++0x lambda, it would look something like this:

bool has_special_char(std::string const &str) {
    return std::find_if(str.begin(), str.end(),
        [](unsigned char ch) { return !(isalnum(ch) || ch == '_'); }) != str.end();
}

At least when you're dealing with char (not wchar_t), isalnum will typically use a table look up, so it'll usually be (quite a bit) faster than anything based on find_first_of (which will normally use a linear search instead). IOW, this is O(N) (N=str.size()), where something based on find_first_of will be O(N*M), (N=str.size(), M=pattern.size()).
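
To illustrate the table-lookup idea without depending on the locale, here is a rough sketch of a hand-built table. It assumes 8-bit char and C++11; make_allowed_table and has_special_char_table are hypothetical names, and this is not how any particular library implements isalnum.

```cpp
#include <array>
#include <string>

// Hypothetical sketch: a 256-entry table marking allowed characters,
// built once from an explicit list.
static std::array<bool, 256> make_allowed_table() {
    std::array<bool, 256> table{};  // value-initialized: all false
    const std::string allowed =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789_";
    for (unsigned char c : allowed) table[c] = true;
    return table;
}

bool has_special_char_table(const std::string& s) {
    static const std::array<bool, 256> allowed = make_allowed_table();
    for (unsigned char c : s) {
        if (!allowed[c]) return true;  // one constant-time lookup per character
    }
    return false;
}
```

The cost of building the table is paid once, after which each character costs a single array access, which is where the O(N) bound comes from.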

If you want to do the job with pure C, you can use sscanf with a scanset conversion that's theoretically non-portable, but supported by essentially all recent/popular compilers:

int n = 0;  /* stays 0 if the scanset matches nothing */
sscanf(str, "%*[A-Za-z0-9_]%n", &n);
if (str[n] != '\0') {
    /* it has at least one "special" character */
} else {
    /* no special characters */
}

The basic idea here is pretty simple: the scanset skips across all consecutive non-special characters (but doesn't assign the result to anything, because of the *), then %n records how many characters were consumed. If str[n] is not the terminating null, the scan stopped at a character outside the set, so we must have at least one special character. If str[n] is the terminator, the scanset conversion matched the whole string, so all the characters were "non-special". When the string starts with a special character (or is empty), the scanset matches nothing, sscanf returns before reaching %n, and n keeps its initial value of 0, so the test still looks at the first character.

Officially, the C standard says that trying to put a range in a scanset conversion like this isn't portable (a '-' anywhere but the beginning or end of the scanset gives implementation defined behavior). There have even been a few compilers (from Borland) that would fail for this -- they would treat A-Z as matching exactly three possible characters, 'A', '-' and 'Z'. Most current compilers (or, more accurately, standard library implementations) take the approach this assumes: "A-Z" matches any upper-case character.

Backhand answered 7/7, 2011 at 5:57 Comment(4)
isalnum is optimized; it does not do a table lookup, but checks whether the character is in a range: the code works like ('0' <= c && c <= '9') || ('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z') where c is the character. It relies on the fact that the characters of each group (uppercase letters, lowercase letters, digits) are contiguous in ASCII. This is more efficient than regex (which needs a parser or an interpreter) or find_first_not_of.Hydrostatic
@cmdLP: It's up to the implementation to decide how to implement it, of course. That said, here (for one example) is how it's implemented in libstdc++ for Linux: return _M_table[static_cast<unsigned char>(__c)] & __m;. (from: gcc/libstdc++-v3/config/os/gnu-linux/ctype_inline.h). And in libcxx, it's: return isascii(c) ? (ctype<char>::classic_table()[c] & m) != 0 : false;. (libcxx/src/locale.cpp). So, while there may be exceptions, it's usually table based.Backhand
@cmdLP: If you do know of one where it's based on comparison to ranges, however, I'd be interested in knowing what it is--I believe at the present time, that stands a good chance of being more efficient than a table lookup, but I don't know of any implementation that actually does it.Backhand
"... the behavior of std::isalnum is undefined if the argument's value is neither representable as unsigned char nor equal to EOF." As with all functions from cctype, their argument should first be converted to unsigned char.Unto
W
2

I would just use the built-in C facilities here. Iterate over each character in the string and check whether it is '_' or whether isalnum(ch) is true (isalpha(ch) alone would reject digits). If so, it's valid; otherwise it's a special character.

Waisted answered 7/7, 2011 at 2:56 Comment(0)
P
1

The functions (macros) are subject to locale settings, but you should investigate isalnum() and relatives from <ctype.h> or <cctype>.

Pleasing answered 7/7, 2011 at 2:51 Comment(0)
O
1

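Using

    // needs <algorithm> and <cctype>
    bool my_predicate(char c)
    {
        // true for anything that is not alphanumeric or '_'
        return !(std::isalnum(static_cast<unsigned char>(c)) || c == '_');
    }

    s.erase(std::remove_if(s.begin(), s.end(), my_predicate), s.end());

will get you a clean string s.

The erase call strips out all the special characters, and the my_predicate function makes it easy to customise what counts as special.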

Ornamentation answered 23/9, 2012 at 7:59 Comment(0)
M
1

You can use something like this:

#include <cctype>

for (std::size_t i = 0; i < s.length(); i++) {
    // the <cctype> functions need a value representable as unsigned char
    unsigned char ch = s[i];
    if (!std::isalpha(ch) && !std::isdigit(ch) && ch != '_')
        return false;
}

The isalpha() function checks whether a character is alphabetic, and isdigit() checks whether it is a digit.

Monograph answered 26/1, 2021 at 3:53 Comment(1)
This worked without importing ctype.Nebo
M
0

If you want this but don't want to go the whole hog and use regexps, and given your test is for ASCII chars, just create a function to generate the string for find_first_not_of...

#include <iostream>
#include <string>

std::string expand(const char* p)
{
    std::string result;
    while (*p)
        if (p[1] == '-' && p[2])
        {
            for (int c = p[0]; c <= p[2]; ++c)
                result += (char)c;
            p += 3;
        }
        else
            result += *p++;
    return result;
}

int main()
{
    std::cout << expand("A-Za-z0-9_") << '\n';
}
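
A possible way to plug the expanded string into find_first_not_of is sketched below. The expansion logic is repeated so that the snippet compiles on its own, and has_special_char is a hypothetical helper name. Like the original, it assumes a character set in which the ranges are contiguous, such as ASCII.

```cpp
#include <string>

// Same expansion logic as expand() above: "A-C" becomes "ABC",
// any other character is copied through unchanged.
std::string expand(const char* p) {
    std::string result;
    while (*p) {
        if (p[1] == '-' && p[2]) {
            for (int c = p[0]; c <= p[2]; ++c)
                result += static_cast<char>(c);
            p += 3;
        } else {
            result += *p++;
        }
    }
    return result;
}

// Hypothetical helper: the whitelist is expanded once and reused.
bool has_special_char(const std::string& s) {
    static const std::string allowed = expand("A-Za-z0-9_");
    return s.find_first_not_of(allowed) != std::string::npos;
}
```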
Mastery answered 7/7, 2011 at 3:10 Comment(0)
