Why is the alphabet split into multiple ranges in this C code?

Asked 5/5, 2015 at 10:4 Answered 5/5, 2015 at 10:8

160

In a custom library I saw an implementation:

inline int is_upper_alpha(char chValue)
{
    if (((chValue >= 'A') && (chValue <= 'I')) ||
        ((chValue >= 'J') && (chValue <= 'R')) ||
        ((chValue >= 'S') && (chValue <= 'Z')))
        return 1;
    return 0;
}

Is that an Easter egg, or what are the advantages over the standard C/C++ method?

inline int is_upper_alpha(char chValue)
{
    return ((chValue >= 'A') && (chValue <= 'Z'));
}

Stonecutter answered 5/5, 2015 at 10:4 Comment(2)

Note that in EBCDIC, the character range for lower-case letters comes before the character range for upper-case letters, and both come before the digits — which is exactly the opposite of the order in ASCII-based encodings (such as the 8859-x series, or Unicode, or CP1252, or …). – Lexine 6/5, 2015 at 13:44

Note: if 'J' - 'I' and 'S' - 'R' both equal 1, then I expect that a reasonable optimizer would turn the former into the latter. – Lecroy 6/5, 2015 at 14:44

215

The author of this code presumably had to support EBCDIC at some point, where the numeric values of the letters are non-contiguous (gaps exist between I, J and R, S, as you may have guessed).
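
To make the gaps concrete, here is a minimal sketch that hard-codes the upper-case letter code points of EBCDIC code page 037, so it behaves the same on an ASCII machine. The table name and both helper functions are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* Upper-case letters of EBCDIC code page 037 in alphabetical order.
   Hard-coded so the result does not depend on the host encoding. */
static const unsigned char EBCDIC_UPPER[26] = {
    0xC1, 0xC2, 0xC3, 0xC4, 0xC5, 0xC6, 0xC7, 0xC8, 0xC9, /* A-I */
    0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xD6, 0xD7, 0xD8, 0xD9, /* J-R */
    0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9        /* S-Z */
};

/* Distance between the code points of two letters, given as alphabet
   indices (0 = A, 25 = Z). Hypothetical helper. */
int ebcdic_distance(int from_index, int to_index)
{
    return EBCDIC_UPPER[to_index] - EBCDIC_UPPER[from_index];
}

/* Number of places where consecutive letters are not adjacent code points. */
int ebcdic_gap_count(void)
{
    int gaps = 0;
    for (size_t i = 1; i < sizeof EBCDIC_UPPER; ++i) {
        if (EBCDIC_UPPER[i] - EBCDIC_UPPER[i - 1] != 1)
            ++gaps;
    }
    return gaps;
}
```

With this table, ebcdic_gap_count() finds two gaps, and ebcdic_distance(8, 9) (I to J) and ebcdic_distance(17, 18) (R to S) are both larger than 1, which is why the three ranges in the question cannot be merged on such a system.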

It is worth noting that the C and C++ standards only guarantee that the characters 0 to 9 have contiguous numeric values for precisely this reason, so neither of these methods is strictly standard-conforming.
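
As a sketch of what that guarantee does and does not allow: digit arithmetic is portable, while a letter test that assumes nothing about ordering has to name every letter explicitly. Both function names below are hypothetical.

```c
#include <string.h>

/* Portable: the standard guarantees that '0'..'9' are consecutive.
   Returns the digit's value, or -1 if c is not a decimal digit. */
int digit_value(char c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

/* Makes no assumption about how letters are ordered in the execution
   character set: the character is looked up in an explicit list. */
int is_upper_alpha_portable(char c)
{
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    /* strchr also matches the terminating '\0', so exclude it first. */
    return c != '\0' && strchr(upper, c) != NULL;
}
```

The lookup is slower than a range comparison, so this is a trade of speed for independence from the encoding rather than a drop-in recommendation.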

Papeete answered 5/5, 2015 at 10:8 Comment(12)

Yes, the author surely wanted to support EBCDIC code page 037. To check the EBCDIC codes, please refer to en.wikipedia.org/wiki/EBCDIC_037 – Compliance 5/5, 2015 at 10:17

Yes you are right. The method is implemented for the non-contiguous letters in EBCDIC. Thanks for the answer! – Stonecutter 5/5, 2015 at 10:28

The real WTF is why didn't the original author put in a comment: // In the EBCDIC coding, the alphabet has gaps between these values. See URL: xxxx for details. Then you'd never even have to ask the question. You'd have the answer built-in to the code. – Smasher 5/5, 2015 at 15:12

@Smasher If the code was originally for a system where EBCDIC is normally used, it may have seemed obvious at the time and not needed a comment. Unfortunately, things that seemed fine in legacy code seem strange now. – Bluegreen 5/5, 2015 at 15:57

@abelenky: The real WTF is why didn't the original author use standard functionality, i.e. return ( isalpha( chValue ) && isupper( chValue ) )... – Contaminate 6/5, 2015 at 8:12

Does any machine that uses EBCDIC have a C++ compiler at all? To my knowledge, no single computer built after ~1970 uses this... :-) – Kathrinkathrine 6/5, 2015 at 8:26

@Damon: That is not the issue. You might have to process an "alien" encoding even on a system that doesn't use that encoding natively. So you set your locale to the given encoding, and then you have to keep your fingers crossed that the programmer actually used standard functions instead of doing "smart" coding like the above, thinking he knows every encoding his program will ever encounter... – Contaminate 6/5, 2015 at 9:53

If it was written to support EBCDIC from the 1970s, were isalpha and isupper even part of ANSI C or supported by the majority of compilers back then? – Hollinger 6/5, 2015 at 11:45

@Smasher not really; it's clearly depending upon ranges that happen to exist in the encoding(s) in use. It's certainly no more of a WTF than the second piece of code in the question. – Characteristically 6/5, 2015 at 14:19

@Damon: I believe IBM mainframes do still use EBCDIC, at least in compatibility modes but probably by default. Your cutoff date is at least 30 years premature, and probably more than that. – Lexine 6/5, 2015 at 16:54

@DevSolar: Actually isalpha is wrong; its results are locale-specific and meant for processing natural language in the user's configured locale, whereas the actual need for most software is to match a fixed set of characters independent of locale. – Suitcase 7/5, 2015 at 6:44

@R.: In my experience, the actual need for most software is to match "word contents", or similar, and the programmer simply forgot about locale issues completely... in either case, a comment would do loads of good. ;-) – Contaminate 7/5, 2015 at 7:9

Looks like it attempts to cover both EBCDIC and ASCII. Your alternative method doesn't work for EBCDIC: it has false positives (non-letters in the gaps are accepted), but no false negatives.
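
The false positives can be counted with a small sketch that runs both checks over every byte value, using hard-coded EBCDIC code page 037 code points instead of character literals so that it also runs on an ASCII host. All names here are hypothetical.

```c
/* EBCDIC code page 037 code points of the range boundaries. */
enum { E_A = 0xC1, E_I = 0xC9, E_J = 0xD1, E_R = 0xD9, E_S = 0xE2, E_Z = 0xE9 };

/* The single-range check from the question, applied to EBCDIC bytes. */
int single_range(unsigned char b)
{
    return b >= E_A && b <= E_Z;
}

/* The three-range check from the question, applied to EBCDIC bytes. */
int three_ranges(unsigned char b)
{
    return (b >= E_A && b <= E_I) ||
           (b >= E_J && b <= E_R) ||
           (b >= E_S && b <= E_Z);
}

/* Counts how many of the 256 byte values a given check accepts. */
int count_accepted(int (*check)(unsigned char))
{
    int n = 0;
    for (int b = 0; b <= 0xFF; ++b) {
        if (check((unsigned char)b))
            ++n;
    }
    return n;
}
```

Here count_accepted(three_ranges) is 26, one per letter, while count_accepted(single_range) is 41: the 15 extra bytes lie in the gaps after I and after R. Every real letter is accepted by both checks, which matches "false positives, but no false negatives".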

C and C++ do require that '0'-'9' are contiguous.

Note that the standard library calls do know whether they run on ASCII, EBCDIC or other systems, so they're more portable and possibly more efficient.
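
A minimal sketch of the library-based version, assuming the default "C" locale, in which isupper is true only for the 26 upper-case letters. The wrapper name is hypothetical; the cast matters because passing a negative char value to the <ctype.h> functions is undefined behavior.

```c
#include <ctype.h>

/* Delegates to the C library, which knows the execution character set.
   The result depends on the current C locale. */
int is_upper_alpha_std(char c)
{
    /* Convert to unsigned char first: isupper requires a value that is
       representable as unsigned char (or EOF). */
    return isupper((unsigned char)c) != 0;
}
```

If the program calls setlocale, other characters may also count as upper-case, so this fits code that wants locale-aware behavior rather than a fixed set of characters.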

Eparch answered 5/5, 2015 at 10:8 Comment(5)

std::isupper actually queries the currently installed global C locale. – Feldt 5/5, 2015 at 10:21

Yes, you are right. The method is written to cover both encodings. Thanks for the answer! – Stonecutter 5/5, 2015 at 10:26

@Lingxi: True, but that doesn't mean you can switch the locale from ASCII to EBCDIC. 'A' has to remain 'A' regardless of locale. ASCII to UTF-8, that would be possible. – Eparch 5/5, 2015 at 10:29

@Lingxi: std::isupper queries the currently installed global C locale, yes, but the phase of compilation that interprets character literals does not. – Prognosticate 5/5, 2015 at 10:51

@Feldt - Just a quick note. It is questionable whether std::isupper is really needed in most cases. It respects the locale used for input from the user, but when parsing files or interacting with databases you usually expect some other locale. Moreover, at least on Linux, these locale-related calls are very slow: for example, std::isalpha calls dynamic_cast two times to "find" the proper locale implementation before actually comparing a single character. – Quadripartite 6/5, 2015 at 7:2
