strcmp() but with 0-9 AFTER A-Z? (C/C++)
Asked Answered
S

6

7

For reasons I completely disagree with but "The Powers (of Anti-Usability) That Be" continue to decree despite my objections, I have a sorting routine which does basic strcmp() compares to sort by its name. It works great; it's hard to get that one wrong. However, at the 11th hour, it's been decided that entries which begin with a number should come AFTER entries which begin with a letter, contrary to the ASCII ordering. They cite the EBCDIC standard has numbers following letters so the prior assumption isn't a universal truth, and I have no power to win this argument... but I digress.

Therein lies my problem. I've replaced all appropriate references to strcmp with a new function call nonstd_strcmp, and now need to implement the modifications to accomplish the sort change. I've used a FreeBSD source as my base: http://freebsd.active-venture.com/FreeBSD-srctree/newsrc/libkern/strncmp.c.html

 if (n == 0)
  return (0);
 do {
  if (*s1 != *s2++)
   return (*(const unsigned char *)s1 -
    *(const unsigned char *)(s2 - 1));
  if (*s1++ == 0)
   break;
 } while (--n != 0);
 return (0);

I guess I might need to take some time away to really think about how it should be done, but I'm sure I'm not the only one who's experienced the brain-deadness of just-before-release spec changes.

Sonjasonnet answered 15/6, 2010 at 21:18 Comment(7)
The amount of hate packed into this question amuses meSelfcontent
definitely nothing to do with C++Nance
And why does this have a C++ tag?Pulpiteer
Do you have instructions on uppercase vs. lowercase letters also (EBCDIC and ASCII differ there also)? BTW, the purpose of your program is to allow other people to get stuff done, and there is nothing sacred about ASCII order. Your only legitimate complaint is the just-before-release requirement change.Millrun
I'm not sure this has "definitely nothing to do with C++;" after all this is valid C++ syntax. Many people deal with C-style strings in C++. I don't think this question should be down-voted. The OP thinks his bosses are being silly and he is probably right, big deal.Mimetic
Forgot to clarify: all input is A-Z 0-9. No lowercase characters exist. Re: C++ tag -- the project is mixed C/C++ code. Either is acceptable. The only constraint is the avoidance of dynamic memory.Sonjasonnet
I was about to ask what is so bad about A-Z 0-9 ordering, but then I remembered that I didn't care.Allanadale
W
4

In this special case with only uppercase letters (as mentioned by the OP in comments) and digits 0-9, you could also omit the order table and instead multiply both differing characters by 4 and compare the results modulo 256. The range of ASCII digits (48 to 57) will not overflow 8 bits (57 × 4 = 228), but the range of uppercase letters (65 to 90) will (65 × 4 = 260). When we compare the multiplied values modulo 256, the value for each letter will be less than that of any digit: 90×4 % 256 = 104 < 192 = 48×4

The code might look something like:

int my_strcmp (const char *s1, const char *s2) {
    for (; *s1 == *s2 && *s1; ++s1, ++s2);
    return (((*(const unsigned char *)s1) * 4) & 0xFF) - \
           (((*(const unsigned char *)s2) * 4) & 0xFF);
}

Of course, the order table solution is far more versatile in general as it allows one to define a sort order for every character—this solution is sensible only for this special case with uppercase letters vs digits. (But e.g. on microcontroller platforms, saving even the small amount of memory used by the table can be a real benefit.)

Whiten answered 16/6, 2010 at 1:22 Comment(1)
Incredibly clever solution Arkku! This is, in fact, on an embedded system (though we're not tightly memory constrained). I used your implementation and it appears to pass initial testing. I've also added a strncmp equivalent where the only changes are for (; *s1 == *s2 && *s1 && n; ++s1, ++s2, --n); if( !n ) return n;Sonjasonnet
R
16

What you need to do is create an ordering table for each character. This is also the easiest way to do case-insensitive comparisons as well.

if (order_table[*s1] != order_table[*s2++])

Be aware that characters might be signed, in which case the index to your table might go negative. This code is for signed chars only:

int raw_order_table[256];
int * order_table = raw_order_table + 128;
for (int i = -128;  i < 128;  ++i)
    order_table[i] = (i >= '0' && i <= '9') ? i + 256 : toupper(i);
Rascal answered 15/6, 2010 at 21:30 Comment(3)
Right... that makes sense. That's the spark I needed to get my brain going again! Thanks!Sonjasonnet
Using an int does mean that the table is needlessly big. Note that this algorithm is needed only when '9' < 'A', and the inputs are [0-9A-Z] only. Hence, you can store (i>='0' && i<='9') ? 26+i : i - 'A', and the table only needs to allocate memory for indices '0' to 'Z'.Ancestral
@MSalters, yes an int is larger than necessary; a short would suffice, or even a char if one is careful with the values. Since int is meant to be the most natural size, that's what I went with. Even if this project restricts the input to [0-9A-Z], I wouldn't code it that way because I'd want it to be reusable for another project without those restrictions.Rascal
S
8

If your powers-that-be are like all the other powers-that-be that I've run into, you may want to make it an option (even if it's hidden):

Sort Order:

o Numbers after Letters

o Letters after Numbers

or even worse, they might figure out that they want Numbers to be sorted numerically (e.g. "A123" comes after "A15"), then it can be

o Numbers after Letters

o Letters after Numbers

o Smart Numbers after Letters

o Letters after Smart Numbers

This gets into diagnosing the real problem, not the symptom. I bet there's a slight chance they may change their mind at the 11th hour and 59th minute.

Subacute answered 15/6, 2010 at 22:12 Comment(4)
Smart numbers are referred to as "Natural Sort". codinghorror.com/blog/2007/12/…Rascal
Unintended pun I guess - did you mean anticipatory or antipathy ?Ancestral
Oh I have no doubt that's what's going to happen. Sounds like you're speaking from many hard-learned lessons in frustration franji1!Sonjasonnet
@Aaron - Yeah, you'll look like a hero if they can't make up their minds and you say "What if we make it a user option?", and you have it done at the 11th hour, 59th minute, and 30th second :-DSubacute
G
5

You could use a lookup table to translate ASCII to EBCDIC when comparing characters ;-)

Guinna answered 15/6, 2010 at 21:20 Comment(3)
His stuff is ASCII not EBCDIC, but a lookup table generally will be faster, and will be simpler to modify when they decide to reorder other characters. +1Atomicity
+1: while converting to EBCDIC as such probably makes no sense, a lookup table is almost certainly the right way to go for this.Georgettegeorgi
He did say that EBCDIC has the numbers after the letters. But any lookup table that puts the numbers after the letters will work.Guinna
W
4

In this special case with only uppercase letters (as mentioned by the OP in comments) and digits 0-9, you could also omit the order table and instead multiply both differing characters by 4 and compare the results modulo 256. The range of ASCII digits (48 to 57) will not overflow 8 bits (57 × 4 = 228), but the range of uppercase letters (65 to 90) will (65 × 4 = 260). When we compare the multiplied values modulo 256, the value for each letter will be less than that of any digit: 90×4 % 256 = 104 < 192 = 48×4

The code might look something like:

int my_strcmp (const char *s1, const char *s2) {
    for (; *s1 == *s2 && *s1; ++s1, ++s2);
    return (((*(const unsigned char *)s1) * 4) & 0xFF) - \
           (((*(const unsigned char *)s2) * 4) & 0xFF);
}

Of course, the order table solution is far more versatile in general as it allows one to define a sort order for every character—this solution is sensible only for this special case with uppercase letters vs digits. (But e.g. on microcontroller platforms, saving even the small amount of memory used by the table can be a real benefit.)

Whiten answered 16/6, 2010 at 1:22 Comment(1)
Incredibly clever solution Arkku! This is, in fact, on an embedded system (though we're not tightly memory constrained). I used your implementation and it appears to pass initial testing. I've also added a strncmp equivalent where the only changes are for (; *s1 == *s2 && *s1 && n; ++s1, ++s2, --n); if( !n ) return n;Sonjasonnet
E
3

While in general agreement with the above answers, I think that it is silly to do lookups for every iteration of the loop, unless you think that most comparisons will have different first characters, when you could instead do

char c1, c2;
while((c1 = *(s1++)) == (c2 = *(s2++)) && c1 != '\0');
return order_table[c1] - order_table[c2];

Also, I would recommend constructing the order_table with a static initializer, which will improve speed (no need to generate every time -- or ever) and also perhaps readability

Elboa answered 15/6, 2010 at 22:31 Comment(4)
This is a good, clever optimization, except you'll want to make sure you get the pointer increment happening at the right time - the example given will not compare the first character of the strings (and will wander off the end of a string if one or both strings happen to be empty).Tonguelashing
This also doesn't work correctly in all cases if two different characters can have the same value in order_table (e.g. case-insensitive sort).Whiten
@Michael wow good catch, didn't see that at all (stupid me). Any way I edited it and it should be good now. @Arkku: OP specified case sensitive search (or rather case-insensitive is not necessary) so I see no reason why order_table would need to have multiple characters share the same ordering valueElboa
I know, I just mentioned it for the benefit of others who might be looking at this solution. =)Whiten
D
2

Here is what should be a pretty good implementation of the string compare similar to the one described by other posts.

static const unsigned char char_remap_table[256] = /* values */

#define char_remap(c) (char_remap_table[(unsigned char) c])

int nonstd_strcmp(const char * restrict A, const char * restrict B) {
     while (1) {
          char a = *A++;
          char b = *B++;
          int x = char_remap(a) - char_remap(b);
          if (x) {
               return x;
          }
          /* Still using null termination, so test that from the original char,
           * but if \0 maps to \0 or you want to use a different end of string
           * then you could use the remapped version, which would probably work
           * a little better b/c the compiler wouldn't have to keep the original
           * var a around. */
          if (!a) { /* You already know b == a here, so only one test is needed */
               return x;  /* x is already 0 and returning it allows the compiler to
                           * store it in the register that it would store function
                           * return values in without doing any extra moves. */
          }
     }
}

Above and beyond that you could generalize the function to take the char_remap_table as a parameter which would allow you to easily use different mappings later if you needed to.

int nonstd_strcmp(const char * restrict a, const char * restrict b, const char * restrict map);
Ditheism answered 15/6, 2010 at 22:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.