I have a requirement to (very) quickly process strings of a limited range, tallying their values. The input file is of the form:
January 7
March 22
September 87
March 36
and so forth. Because the line widths are identical, I can simply read in a line with fread reasonably fast, and I've developed a perfect hashing function which works, but I wanted to see if anyone could offer any advice on how to make it even faster. I'll profile each suggestion to see how it goes.
The hashing function is based on the month name to allow fast allocation of the value to a bucket. Bear with me here. I first figured out the minimal number of characters for a perfect hash:
January
February
March
April
May
June
July
August
September
October
November
December
Keep in mind that each month occupies nine characters (padded with trailing spaces), because I have the entire input line.
Unfortunately, there is no single column that marks a month as unique. Column 1 duplicates J, column 2 duplicates a, column 3 duplicates r, column 4 duplicates u and columns 5 onwards duplicate <space> (there are other duplicates, but one is enough to prevent a single-column hash key).
However, by using the first and fourth columns, I get the values Ju, Fr, Mc, Ai, M<space>, Je, Jy, Au, St, Oo, Ne and De, which are unique. There will be no invalid values in this file, so I don't have to worry about incorrect buckets for the input data.
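The column analysis above can be checked mechanically. The following is a minimal sketch in C, assuming the month field is space-padded to nine characters; the names month_names and column_pair_is_unique are made up for illustration and are not part of the original code.

```c
#include <stddef.h>

/* Month names padded to nine characters, matching the fixed-width field
   (the padding is an assumption based on the description above). */
static const char *const month_names[12] = {
    "January  ", "February ", "March    ", "April    ",
    "May      ", "June     ", "July     ", "August   ",
    "September", "October  ", "November ", "December "
};

/* Returns 1 if the characters at zero-based columns c1 and c2 (both < 9)
   together distinguish all twelve months, otherwise 0. */
static int column_pair_is_unique(size_t c1, size_t c2) {
    for (size_t i = 0; i < 12; i++) {
        for (size_t j = i + 1; j < 12; j++) {
            if (month_names[i][c1] == month_names[j][c1] &&
                month_names[i][c2] == month_names[j][c2])
                return 0;   /* months i and j collide on this pair */
        }
    }
    return 1;
}
```

Calling column_pair_is_unique(0, 3) tests the first and fourth columns, and passing the same column twice tests a single-column key.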
By viewing the hex codes for the characters, I found I could get low unique values by just ANDing with strategic values:
FirstChar  Hex  Binary     &0x0f
---------  ---  ---------  -----
A          x41  0100 0001      1
D          x44  0100 0100      4
F          x46  0100 0110      6
J          x4a  0100 1010     10
M          x4d  0100 1101     13
N          x4e  0100 1110     14
O          x4f  0100 1111     15
S          x53  0101 0011      3

SecondChar  Hex  Binary     &0x1f
----------  ---  ---------  -----
<space>     x20  0010 0000      0
c           x63  0110 0011      3
e           x65  0110 0101      5
i           x69  0110 1001      9
o           x6f  0110 1111     15
r           x72  0111 0010     18
t           x74  0111 0100     20
u           x75  0111 0101     21
y           x79  0111 1001     25
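To make the arithmetic concrete, here is a small sketch in C of how the two masked values can be combined into one table index, assuming ASCII and nine-character padded names. The helper names key_index and indexes_are_distinct are hypothetical, chosen only for this illustration.

```c
#include <stddef.h>

/* Month names padded to nine characters (padding assumed, as described). */
static const char *const month_names[12] = {
    "January  ", "February ", "March    ", "April    ",
    "May      ", "June     ", "July     ", "August   ",
    "September", "October  ", "November ", "December "
};

/* Masked fourth character forms the high bits, masked first character the
   low four bits, giving an index in the range 0..511. */
static unsigned int key_index(const char *str) {
    return ((unsigned int)(str[3] & 0x1f) << 4) | (unsigned int)(str[0] & 0x0f);
}

/* Returns 1 if all twelve months map to different indexes, otherwise 0. */
static int indexes_are_distinct(void) {
    for (size_t i = 0; i < 12; i++)
        for (size_t j = i + 1; j < 12; j++)
            if (key_index(month_names[i]) == key_index(month_names[j]))
                return 0;
    return 1;
}
```

For example, January has u (21) in the fourth column and J (10) in the first, so its index is 21 * 16 + 10 = 346, while May has a space (0) and M (13), giving 13.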
and this allowed me to set up a static array to create a (hopefully) blindingly-fast hash function:
#define __ -1
static unsigned int hash (const char *str) {
    static unsigned char bucket[] = {
        //   A       S   D       F               J           M   N   O
        __, __, __, __, __, __, __, __, __, __, __, __, __,  4, __, __, // space
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __,  2, __, __, // c
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, 11, __, __, __, __, __,  5, __, __, __, 10, __, // e
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __,  3, __, __, __, __, __, __, __, __, __, __, __, __, __, __, // i
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,  9, // o
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __,  1, __, __, __, __, __, __, __, __, __, // r
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __,  8, __, __, __, __, __, __, __, __, __, __, __, __, // t
        __,  7, __, __, __, __, __, __, __, __,  0, __, __, __, __, __, // u
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, //
        __, __, __, __, __, __, __, __, __, __,  6, __, __, __, __, __  // y
    };
    return bucket[((unsigned int)(str[3]&0x1f)<<4)|(str[0]&0xf)];
}
Testing that with the code:
#include <stdio.h>
#include <string.h>

// Hash function here.

// Month names padded to the nine-character field width.
static char *months[] = {
    "January  ", "February ", "March    ", "April    ", "May      ", "June     ",
    "July     ", "August   ", "September", "October  ", "November ", "December "
};

int main (void) {
    size_t i;
    for (i = 0; i < sizeof(months)/sizeof(*months); i++)
        printf ("%-10s -> %2u\n", months[i], hash(months[i]));
    return 0;
}
shows that it's functionally correct:
January    ->  0
February   ->  1
March      ->  2
April      ->  3
May        ->  4
June       ->  5
July       ->  6
August     ->  7
September  ->  8
October    ->  9
November   -> 10
December   -> 11
but I want to know if it can be made faster.
Any suggestions out there? I'm open to any simple optimisations or even a total rewrite if there's something inherently bad with my hashing function.
I don't think this is that important but the final version will be using EBCDIC. The theory will still stand but the AND operation may change slightly since the characters have different code points. I'll be happy with any assistance only on the ASCII front since I'm confident whatever advice is offered will translate okay to EBCDIC.
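One way to reduce the porting risk is to build the lookup table at start-up from the month names themselves, so the entries follow whatever character set the program is compiled for. This is only a sketch under assumptions: the names NO_MONTH, build_bucket_table and month_hash are hypothetical, the masks are the ASCII ones from above, and whether those masks stay collision-free under EBCDIC is not verified here (the builder reports a collision rather than guessing).

```c
#include <string.h>

#define NO_MONTH 0xffu   /* marker for unused slots (hypothetical name) */

/* Month names padded to nine characters (padding assumed, as described). */
static const char *const month_names[12] = {
    "January  ", "February ", "March    ", "April    ",
    "May      ", "June     ", "July     ", "August   ",
    "September", "October  ", "November ", "December "
};

static unsigned char bucket[512];   /* 5-bit row * 16 + 4-bit column */

/* Same index calculation as the hand-written table uses. */
static unsigned int key_index(const char *str) {
    return ((unsigned int)(str[3] & 0x1f) << 4) | (unsigned int)(str[0] & 0x0f);
}

/* Fill the table from the month names. Returns 0 on success, or -1 if two
   months land in the same slot, meaning the masks need changing. */
static int build_bucket_table(void) {
    memset(bucket, NO_MONTH, sizeof bucket);
    for (unsigned int m = 0; m < 12; m++) {
        unsigned int idx = key_index(month_names[m]);
        if (bucket[idx] != NO_MONTH)
            return -1;              /* collision under these masks */
        bucket[idx] = (unsigned char)m;
    }
    return 0;
}

/* Lookup; build_bucket_table() must have succeeded first. */
static unsigned int month_hash(const char *str) {
    return bucket[key_index(str)];
}
```

The lookup itself is still a single table access, so the per-line cost should be comparable to the static array; only the one-off initialisation is added.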
[...] t and u and there's no clashes there because they don't share common columns. While that won't make it faster (in terms of the binary ops), it will take up less space (and possibly make it faster in re cache occupancy). So bang that in an answer and I'll give you a vote (at least). – Alkalimeter
[...] __ or -1). We have agreements in place with the source that invalid data will be tossed away. – Alkalimeter