Detect if execution character set letters are contiguous
Asked Answered
H

4

7

In the code, a switch is used to convert letters to contiguous values. Optimizing compilers in general won't do the job as well as the simple contiguous digit condition right before the switch. How can I detect which execution character set used and/or conclude that the letters are contiguous to replace it with simple conditionals ?

static long digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return (c-'0');

    switch(c) {
    case 'a':
    case 'A':
        return 10;
    case 'b':
    case 'B':
        return 11;
    case 'c':
    case 'C':
        return 12;
    case 'd':
    case 'D':
        return 13;
    case 'e':
    case 'E':
        return 14;
    case 'f':
    case 'F':
        return 15;
    case 'g':
    case 'G':
        return 16;
    case 'h':
    case 'H':
        return 17;
    case 'i':
    case 'I':
        return 18;
    case 'j':
    case 'J':
        return 19;
    case 'k':
    case 'K':
        return 20;
    case 'l':
    case 'L':
        return 21;
    case 'm':
    case 'M':
        return 22;
    case 'n':
    case 'N':
        return 23;
    case 'o':
    case 'O':
        return 24;
    case 'p':
    case 'P':
        return 25;
    case 'q':
    case 'Q':
        return 26;
    case 'r':
    case 'R':
        return 27;
    case 's':
    case 'S':
        return 28;
    case 't':
    case 'T':
        return 29;
    case 'u':
    case 'U':
        return 30;
    case 'v':
    case 'V':
        return 31;
    case 'w':
    case 'W':
        return 32;
    case 'x':
    case 'X':
        return 33;
    case 'y':
    case 'Y':
        return 34;
    case 'z':
    case 'Z':
        return 35;
    default:
        break;
    }

    return -1;
}
Heinrich answered 20/7, 2020 at 21:27 Comment(6)
Related/context: #13142276Simferopol
Even though the language doesn't require it, unless you need to be portable to an EBCDIC environment you can count on letters being sequential.Scraggly
Would it be enough to just exclude EBCDIC? (that for me is some sort of legend, as I never worked with such charset).Enriquetaenriquez
Character sets may have gaps, but AFAIR they need to be sorted. Wouldn't a check 'z'-'a'==25 do the trick for detecting gaps?Impudence
I've seen modern GCC and Clang turn switches with lots of contiguous entires into array lookups, so frankly if your compiler can't do this, my first relfex is "yes it can, you just didn't turn up the optimization settings high enough", and my second reflex is "ok, maybe it can't, but realistically, almost all situations are either ones where you should just not optimize except for clearly expressing your intent to keep the bar as low as possible for compiler optimizations to catch up, or you can just assume a character set in your niche performance-critical use-case."Bartie
Having said all that, I think the question is valid and I strongly relate to the motivation behind it.Bartie
K
7

How can I detect which execution character set used and/or conclude that the letters are contiguous?

At compile time, simply test them all. ('a-z' left out for simplicity)

static_assert(
  'A' == ('B' - 1) &&
  'B' == ('C' - 1) && 'C' == ('D' - 1) && 'D' == ('E' - 1) && 'E' == ('F' - 1) && 'F' == ('G' - 1) && 'G' == ('H' - 1) && 'H' == ('I' - 1) && 'I' == ('J' - 1) && 'J' == ('K' - 1) && 'K' == ('L' - 1) && 'L' == ('M' - 1) && 'M' == ('N' - 1) && 'N' == ('O' - 1) && 'O' == ('P' - 1) && 'P' == ('Q' - 1) && 'Q' == ('R' - 1) && 'R' == ('S' - 1) && 'S' == ('T' - 1) && 'T' == ('U' - 1) && 'U' == ('V' - 1) && 'V' == ('W' - 1) && 'W' == ('X' - 1) && 'X' == ('Y' - 1) &&
  'Y' == ('Z' - 1), "Dinosaur: not continuous A-Z");

static int digit_value(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
  return -1;
}

Other dinosaur tests.


Or use the slow, but highly portable:

static int digit_value(char c) {
  static const char *base36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
  const char *p = strchr(base36, (unsigned char) c);
  if (p && *p) {
    return (int) (p - base36);
  }
  return -1;
}

Or perhaps a big #if?

#if ('A' == ('B' - 1) && 'B' == ('C' - 1) && 'C' == ('D' - 1) && 'D' == ('E' - 1) && 'E' == ('F' - 1) && 'F' == ('G' - 1) && 'G' == ('H' - 1) && 'H' == ('I' - 1) && 'I' == ('J' - 1) && 'J' == ('K' - 1) && 'K' == ('L' - 1) && 'L' == ('M' - 1) && 'M' == ('N' - 1) && 'N' == ('O' - 1) && 'O' == ('P' - 1) && 'P' == ('Q' - 1) && 'Q' == ('R' - 1) && 'R' == ('S' - 1) && 'S' == ('T' - 1) && 'T' == ('U' - 1) && 'U' == ('V' - 1) && 'V' == ('W' - 1) && 'W' == ('X' - 1) && 'X' == ('Y' - 1) && 'Y' == ('Z' - 1))

static int digit_value(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
  return -1;
}

#else

static int digit_value(char c) {
  static const char *base36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
  const char *p = strchr(base36, (unsigned char) c);
  if (p && *p) {
    return (int) (p - base36);
  }
  return -1;
}

#endif

Or .... if UCHAR_MAX not too big and concern about speed, make a lookup table and skip the sequential concerns.

#include <limits.h>
int digit_value(char c) {
  unsigned char val[UCHAR_MAX] = {['0'] = 1, ['1'] = 2, ['2'] = 3, ['3'] = 4,
      ['4'] = 5, ['5'] = 6, ['6'] = 7, ['7'] = 8, ['9'] = 10, 
      ['A'] = 11, ['B'] = 12, ['C'] = 13, ['D'] = 14, ['E'] = 15, ...
      ['a'] = 11, ['b'] = 12, ['c'] = 13, ['d'] = 14, ['e'] = 15, ...
  };
  return val[(unsigned char) c] - 1;
}
Kcal answered 20/7, 2020 at 21:53 Comment(7)
The code must allow any kind of character set, but static_assert prohibits non-contiguous ones.Heinrich
@Heinrich Simply use if ('A' == ('B' - 1) && 'B' == ('C' - 1) ... 'Y' == ('Z' - 1)) and 1st digit_value(), else use the 2nd digit_value(char c) with its strchr(). The compile will optimize.Kcal
@chux-ReinstateMonica nice, I don't know if I'm happy or sad, but I'll try itHeinrich
@EricPostpischil True 'B' == 'A'+1 may be clearer. Pedantic concern: given A-Z cannot be a null character, 'A' == 'B'-1 cannot overflow (execution character set letters are positive) , yet in theory, some_letter+1 can intmax_t overflow. And we are making code to detect strange implementations.Kcal
My only concern, as I wrote in my answer, is that static_assert assert is a "new" feature of the standard, and it will unlikely supported in an old system.Enriquetaenriquez
@RobertoCaboni For C99, alternatives readily exist: Is there a static_assert replacement which satisfies the C99 standard? Common to emulate with a failure to for an array of size -1.Kcal
@chux, it was just a side concern. Anyway the ` #if` chain would do the trick as well.Enriquetaenriquez
B
1

You can write an appropriate test for conditional compilation, as @chux answered. However, if you need to support character sets with non-contiguous letters, then you need an implementation that supports that, and a table lookup would be a lot more efficient than what you present in the question. So much more so that you could consider using it for all cases instead of maintaining two separate implementations. For example:

static long digit_value(char c) {
    static const long dvs[UCHAR_MAX + 1] = {
      [0] = 1, [1] = 2, [2] = 3, [3] = 4, [5] = 6, [7] = 8, [8] = 9, [9] = 10,
      ['A'] = 11, ['B'] = 12, ['C'] = 13, ['D'] = 14, ['E'] = 15, ['F'] = 16, ['G'] = 17,
      ['H'] = 18, ['I'] = 19, ['J'] = 20, ['K'] = 21, ['L'] = 22, ['M'] = 23, ['N'] = 24,
      ['O'] = 25, ['P'] = 26, ['Q'] = 27, ['R'] = 28, ['S'] = 29, ['T'] = 30, ['U'] = 31,
      ['V'] = 32, ['W'] = 33, ['X'] = 34, ['Y'] = 35, ['Z'] = 36,
      ['a'] = 11, ['b'] = 12, ['c'] = 13, ['d'] = 14, ['e'] = 15, ['f'] = 16, ['g'] = 17,
      ['h'] = 18, ['i'] = 19, ['j'] = 20, ['k'] = 21, ['l'] = 22, ['m'] = 23, ['n'] = 24,
      ['o'] = 25, ['p'] = 26, ['q'] = 27, ['r'] = 28, ['s'] = 29, ['t'] = 30, ['u'] = 31,
      ['v'] = 32, ['w'] = 33, ['x'] = 34, ['y'] = 35, ['z'] = 36
    };

    return dvs[(unsigned char) c] - 1;
}

Note that the characters in the basic execution character set, which include all the decimal digits and upper- and lower-case Latin letters, are guaranteed to have non-negative integer values. That somewhat simplifies the initializer for the table. Note also that elements that are not explicitly initialized will be default-initialized to zero, which eventually gets converted to a -1 return value by subtracting 1 from the tabulated value.

Blackhead answered 20/7, 2020 at 22:27 Comment(2)
Maybe long dvs[] to char dvs[]?Kcal
Perhaps, @chux. I simply matched the table element type with the return type (which in turn I took from the OP's code), knowing that the table will then take up more space than it needs to do. I'm uncertain about the performance implications of the choice of table element type in the OP's target environments, but it is possible that changing it to char would provide both a size advantage and a speed advantage.Blackhead
E
0

First of all I have to say that I would choose the preprocessor solution, as the charset of the compiler would be an information I'd like to discover at compile time.

The _static_assert would be an elegant solution, but since it has been introduced only with C11 standard, real dinosaurs are unlikely to support it.

I would perform the check with common preprocessor directives that is with a chain of #ifs each making sure that the representation of one char is contiguous to the one of the previous char.


A simple runtime check

The compiling time check has been well covered by the other answers, so I just want to suggest a simple runtime way for excluding EBCDIC charset: the assumption is a scenario in which the charset can either be EBCDIC or ASCII.

In this way, by excluding EBCDIC we can assume that the ASCII charset is used, so the characters are contiguous.

Just two simple EBCDIC features have to be considered:

  • The letters that have a non-contiguous decimal representations are well known (for example 'j' != 'i'+1)
  • The alphabetical chars should have common decimal representations in all hundreds EBCDIC variants, as recommended in this IBM page: invariant charset

So:

static long digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return (c-'0');

    if ( 'j' == 'i'+1 ) //if it's not EBCDIC 
    {
        if( c >= 'a' && c <= 'z' )
        {
             return 10 + c - 'a';
        }
        else if ( c >= 'A' && c <= 'Z' )
        {
            return 10 + c - 'A';
        }
    }

    return -1;
}

An error code is returned in case of EBCDIC charset, and in case of ASCII charset a simple conditional make sure that the range [10 - 35] is returned for both lower case and upper case characters.

Enriquetaenriquez answered 20/7, 2020 at 22:20 Comment(0)
N
0

To be portable and most efficient use the lookup table

const char table[128] = {
                         ['0'] = 0,  ['1'] = 1, /* ... */ ['9'] = 9, 
                         ['a'] = 10, ['A'] = 10,
                         ['b'] = 11, ['B'] = 11,
                         ['c'] = 12, ['C'] = 12,
                         ['d'] = 13, ['D'] = 13,
                         ['e'] = 14, ['E'] = 14,
                         ['f'] = 15, ['f'] = 15,
                         ['g'] = 16, ['g'] = 16,
                         ['h'] = 17, ['H'] = 17,
                         ['i'] = 18, ['I'] = 18,
                         ['j'] = 19, ['J'] = 19,
                         /* etc etc */
};

int get_value(char ch)
{
    return table[ch & 0x7f];
}

and the generated code:

get_value:
        movsx   rdi, dil
        movsx   eax, BYTE PTR table[rdi]
        ret
Nembutal answered 20/7, 2020 at 22:30 Comment(1)
Typos in ['i'] = 18, ['J'] = 18, and others. Suggest table[256] or the like and negative index protection in return table[ch];Kcal

© 2022 - 2024 — McMap. All rights reserved.