how to map a specialized string into specified integer
Asked Answered
F

4

6

I am doing some financial trading work. I have a set of stock symbols but they have very clear pattern: it's composed of two characters AB, AC AD and current month which is a four digit number: 1503, 1504, 1505. Some examples are:

AB1504
AB1505
AC1504
AC1505
AD1504
AD1505
....

Since these strings are so well designed patterned, I want to map (hash) each of the string into a unique integer so that I can use the integer as the array index for fast accessing, since I have a lot of retrievals inside my system and std::unordered_map or any other hash map are not fast enough. I have tests showing that general hash map are hundred-nanoseconds latency level while array indexing is always under 100 nanos. my ideal case would be, for example, AB1504 maps to integer 1, AB1505 maps to 2...., then I can create an array inside to access the information related to these symbols much faster. I'm trying to figure out some hash algorithms or other methods that can achieve my goal but couldn't find out. Do you guys have any suggestions on this problem?

Fennie answered 17/5, 2015 at 16:17 Comment(3)
One simple idea: see your pattern as hexadecimal(or higher imaginary base) number and convert it to decimal to get a unique number. although it doesn't start from 0 and they are not consequenceNaranjo
You might also try something like compressing the data (zlib, Huffman, lzw, etc?) and pre-sharing the decompression data (reuse it for all your messages or "evolve it" deterministically on each side of the communication) so that the messages don't have the "header" data as overhead.Bidget
Do you have some more information on the number format? Like does the two first digits represent years after 2000? What does the letters stand for, if anything? Do you need to address stuff earlier than AA1501 (or similar)?Tea
D
1

You can regard the string as a variable-base number representation, and convert that to an integer. For example:

AC1504:
A (range: A-Z)
C (range: A-Z)
15 (range: 0-99)
04 (range: 1-12)

Extract the parts; then a hash function could be

int part1, part2, part3, part4;
...
part1 -= 'A';
part2 -= 'A';
part4 -= 1;
return (((part1 * 26 + part2) * 100 + part3) * 12 + part4;
Drunkard answered 17/5, 2015 at 16:31 Comment(0)
T
0

If you parse the string as a mixed base number, first 2 base-26 digits and then 4 base-10 digits you will quickly get a unique index for each string. The only issue is that if you might get a sparsely populated array.

You can always reorder the digits when calculating the index to minimize the issue mentioned above.

As the numbers are actually months I would calculate the number of months from the first entry and multiply that with the 2 digit base-26 number from the prefix.

Hope you can make some sense from this, typing on my tablet at the moment. :D

Tannen answered 17/5, 2015 at 16:29 Comment(0)
Y
0

The following values should be representable by a 32-bit integer:

XYnnnn => (26 * X + Y) * 10000 + nnnn

Here X and Y take values in the range [0, 26), and n takes values in the range [0, 10).

You have a total of 6,760,000 representable values, so if you only want to associate a small amount of data with it (e.g. a count or a pointer), you can just make a flat array, where each symbol occupies one array entry.

Yesteryear answered 17/5, 2015 at 16:30 Comment(0)
S
0

I assume the format is 'AAyymm', where A is an uppercase character yy a two digit year and mm the two digit month.

Hence you can map it to 10 (AA) + Y (yy) + 4 (mm) bits. where Y = 32 - 10 - 4 = 18 bits for a 32 bit representation (or 262144 years). Having that, you can represent the format as an integer by shifting the characters to there place and shifting the year and month pairs to there places after converting these to an integer.

Note: There will always be gaps in the binary representation, Here the 5+5 bit representation for the characters (6 + 6 values) and in the 4 bit month representation (4 values)

To avoid the gaps change the representation to ABmmmm, were the pair AB is represented by a the number 26*A+B and mmmm is the month relative to some zero month in some year (which covers 2^32/1024/12 = 349525 years - having 32 bits).

However, you might consider a split of stock symbols and time. Combining two values in one field is usually troublesome (It might be a good storage format, but no good 'program data format').

Sawmill answered 17/5, 2015 at 17:46 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.