Why is the alphabet split into multiple ranges in this C code?

2

160

In a custom library I saw an implementation:

inline int is_upper_alpha(char chValue)
{
    if (((chValue >= 'A') && (chValue <= 'I')) ||
        ((chValue >= 'J') && (chValue <= 'R')) ||
        ((chValue >= 'S') && (chValue <= 'Z')))
        return 1;
    return 0;
}

Is that an Easter egg, or what are its advantages over the standard C/C++ method?

inline int is_upper_alpha(char chValue)
{
    return ((chValue >= 'A') && (chValue <= 'Z'));
}
Stonecutter answered 5/5, 2015 at 10:4 Comment(2)
Note that in EBCDIC, the character range for lower-case letters comes before the character range for upper-case letters, and both come before the digits — which is exactly the opposite of the order in ASCII-based encodings (such as the 8859-x series, or Unicode, or CP1252, or …).Lexine
Note: if 'J' - 'I' and 'S' - 'R' both equal 1, then I expect that a reasonable optimizer would turn the former into the latter.Lecroy
215

The author of this code presumably had to support EBCDIC at some point, where the numeric values of the letters are non-contiguous (gaps exist between I and J, and between R and S, as you may have guessed).
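
To make the gaps concrete, here is a minimal sketch that works on raw byte values from EBCDIC code page 037 instead of character literals, so it behaves the same on an ASCII machine. The function names are hypothetical and only for illustration; they are not from the library in the question.

```c
#include <stdbool.h>

/* Code points of the upper-case letters in EBCDIC code page 037:
 *   'A'..'I' = 0xC1..0xC9, 'J'..'R' = 0xD1..0xD9, 'S'..'Z' = 0xE2..0xE9.
 * Both function names are hypothetical, for illustration only. */

/* Three-range test on a raw EBCDIC byte: accepts only the 26 letters. */
static bool ebcdic_is_upper_ranges(unsigned char code)
{
    return (code >= 0xC1 && code <= 0xC9) ||
           (code >= 0xD1 && code <= 0xD9) ||
           (code >= 0xE2 && code <= 0xE9);
}

/* Single-range test: also accepts the non-letters that sit in the gaps. */
static bool ebcdic_is_upper_single(unsigned char code)
{
    return code >= 0xC1 && code <= 0xE9;
}
```

For example, 0xD0 is '}' in code page 037: the single-range test accepts it, while the three-range test rejects it.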

It is worth noting that the C and C++ standards only guarantee that the characters 0 to 9 have contiguous numeric values for precisely this reason, so neither of these methods is strictly standard-conforming.
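
As a small illustration of that one guarantee, the sketch below converts a digit character to its value by subtracting '0'. It relies only on '0' to '9' being contiguous; digit_value is a hypothetical helper name, not something from the question.

```c
/* Returns the value 0..9 of a decimal digit character, or -1 if c is
 * not a digit. Portable across encodings because the C and C++
 * standards guarantee that '0'..'9' are contiguous.
 * digit_value is a hypothetical name used for illustration. */
static int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}
```

The same subtraction trick with 'A' would not be portable, since no such guarantee exists for letters.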

Papeete answered 5/5, 2015 at 10:8 Comment(12)
Yes, the author surely wants to support EBCDIC code page 037. To check the EBCDIC codes, please refer to en.wikipedia.org/wiki/EBCDIC_037 Compliance
Yes you are right. The method is implemented for the non-contiguous letters in EBCDIC. Thanks for the answer!Stonecutter
The real WTF is why didn't the original author put in a comment: // In the EBCDIC coding, the alphabet has gaps between these values. See URL: xxxx for details. Then you'd never even have to ask the question. You'd have the answer built-in to the code.Smasher
@Smasher If the code was originally written for a system where EBCDIC is normally used, it may have seemed obvious at the time and not needed a comment. Unfortunately, things that seemed fine in legacy code seem strange now.Bluegreen
@abelenky: The real WTF is why didn't the original author use standard functionality, i.e. return ( isalpha( chValue ) && isupper( chValue ) )...Contaminate
Does any machine that uses EBCDIC have a C++ compiler at all? To my knowledge, no single computer built after ~1970 uses this... :-)Kathrinkathrine
@Damon: That is not the issue. You might have to process an "alien" encoding even on a system that doesn't use that encoding natively. So you set your locale to the given encoding, and then you have to keep your fingers crossed that the programmer actually used standard functions instead of doing "smart" coding like the above, thinking he knows every encoding his program will ever encounter...Contaminate
If it was written to support EBCDIC from the 1970s, were isalpha and isupper even ANSI or supported by the majority of compilers back then?Hollinger
@Smasher not really; it's clearly depending upon ranges that happen to exist in the encoding(s) in use. It's certainly no more of a WTF than the second piece of code in the question.Characteristically
@Damon: I believe IBM mainframes do still use EBCDIC, at least in compatibility modes but probably by default. Your cutoff date is at least 30 years premature, and probably more than that.Lexine
@DevSolar: Actually isalpha is wrong; its results are locale-specific and meant for processing natural language in the user's configured locale, whereas the actual need for most software is to match a fixed set of characters independent of locale.Suitcase
@R.: In my experience, the actual need for most software is to match "word contents", or similar, and the programmer simply forgot about locale issues completely... in either case, a comment would do loads of good. ;-)Contaminate
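
One way to get the locale-independent, fixed-set matching discussed in these comments without assuming anything about the encoding is to look the character up in a literal string. This is only a sketch; is_upper_alpha_fixed is a hypothetical name and this is not what the original library does.

```c
#include <string.h>

/* Matches exactly the 26 basic upper-case letters, independent of the
 * locale and of whether the execution character set is ASCII or EBCDIC,
 * because the compiler encodes the literal for the target.
 * is_upper_alpha_fixed is a hypothetical name used for illustration. */
static int is_upper_alpha_fixed(char ch)
{
    /* strchr also matches the terminating '\0', so reject it explicitly. */
    return ch != '\0' &&
           strchr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", ch) != NULL;
}
```

The lookup is a linear scan, so it trades a little speed for not depending on any ordering of the letters.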
54

Looks like it attempts to cover both EBCDIC and ASCII. Your alternative method doesn't work for EBCDIC (it has false positives, but no false negatives).

C and C++ do require that '0'-'9' are contiguous.

Note that the standard library calls do know whether they run on ASCII, EBCDIC or other systems, so they're more portable and possibly more efficient.
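
A minimal sketch of the standard-library route, assuming the default "C" locale, in which isupper is true only for the 26 upper-case letters. The wrapper name is_upper_alpha_std is hypothetical.

```c
#include <ctype.h>

/* Wrapper around the standard isupper from <ctype.h>.
 * is_upper_alpha_std is a hypothetical name used for illustration. */
static int is_upper_alpha_std(char ch)
{
    /* Cast to unsigned char first: passing a negative char value to
     * the <ctype.h> functions is undefined behavior. */
    return isupper((unsigned char)ch) != 0;
}
```

If the program calls setlocale, the result may include additional locale-specific upper-case characters.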

Eparch answered 5/5, 2015 at 10:8 Comment(5)
std::isupper actually queries the currently installed global C locale.Feldt
Yes, you are right. The method is written to cover both encodings. Thanks for the answer!Stonecutter
@Lingxi: True, but that doesn't mean you can switch the locale from ASCII to EBCDIC. 'A' has to remain 'A' regardless of locale. ASCII to UTF-8, that would be possible.Eparch
@Lingxi: std::isupper queries the currently installed global C locale, yes, but the phase of compilation that interprets character literals does not.Prognosticate
@Feldt - Just a quick note: it is questionable whether std::isupper is really needed in most cases. It respects the locale used for input from the user, but when parsing files or interacting with databases you usually expect some other locale. Moreover, at least on Linux, these locale-related calls are very slow; for example, std::isalpha calls dynamic_cast twice to "find" the proper locale implementation before actually comparing a single character.Quadripartite
