How can I improve spell check time in a C program?

As one of the assignments in Harvard's CS50 course, students are tasked with creating a spell checking program. The main thrust of the assignment is speed - pure speed - and I've gotten to the point where I'm beating the staff's implementation, but I feel that I can do better, and am looking for a push in the right direction.

Here is my pseudocode:

// read the dictionary word list
Read entire dictionary in one fread into memory
rawmemchr through and pick out the words
send each word through the hash function
create chain links for any index where collisions occur

// accept the incoming test words
Run the test word through the hash function
compare to the existing table / linked list
return the result of the comparison
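
In actual C, the core of that approach looks roughly like this (a simplified sketch, not my exact code - the real version is linked at the bottom, and the table size, hash, and names here are just illustrative):

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE (1u << 20)            /* big table to keep chains short */

typedef struct node {
    const char *key;                     /* points at a word in the dictionary buffer */
    struct node *next;                   /* collision chain */
} node;

static node *table[TABLE_SIZE];

/* the real program uses MurmurHash3; FNV-1a here keeps the sketch self-contained */
static unsigned hash_word(const char *word)
{
    unsigned h = 2166136261u;
    for (; *word; ++word)
        h = (h ^ (unsigned char)*word) * 16777619u;
    return h;
}

static void insert(const char *word)
{
    unsigned idx = hash_word(word) & (TABLE_SIZE - 1);
    node *n = malloc(sizeof *n);
    n->key = word;
    n->next = table[idx];                /* push onto the head of the chain */
    table[idx] = n;
}

static bool check(const char *word)
{
    for (node *n = table[hash_word(word) & (TABLE_SIZE - 1)]; n; n = n->next)
        if (strcmp(n->key, word) == 0)
            return true;                 /* found: spelled correctly */
    return false;                        /* not in the dictionary */
}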

With a dictionary of 150K words, and input text up to 6MB, I'm able to accurately spell check in about a half second.

However, when I look at the words that are coming from the input text, it's pretty clear that a large percentage of those words are common (like "the", "and", "for"), and that most of the misspelled words are also checked multiple times.

My intuition says I should be able to "cache" the "good hits" and "bad hits" so that I'm not hashing the same words over and over for table lookups. Even though the current result is very close to O(1), I feel like I should be able to shave a few microseconds off the time by re-evaluating my approach.

For example, after I have loaded the dictionary, the text input could be 8MB of nothing but this: "missspeling". So rather than hash / check the same word over and over (at a computational expense), I would like to understand if there is a way to programmatically discard words which have already been hashed and rejected, but in a way that is more efficient than the hash / check itself. (I'm using MurmurHash3, fwiw).
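
To make the idea concrete, here's roughly the shape I have in mind (illustrative only, not code from my program): a small direct-mapped cache of recently seen words and their verdicts, probed before the main table. The 46-byte buffer assumes a 45-character maximum word length.

#include <stdbool.h>
#include <string.h>

#define CACHE_SLOTS 4096                 /* power of two, direct-mapped */

struct cache_entry {
    unsigned hash;                       /* full hash of the cached word */
    char word[46];                       /* assumed 45-character maximum + NUL */
    bool known;                          /* slot has been filled */
    bool correct;                        /* verdict from the last full lookup */
};

static struct cache_entry cache[CACHE_SLOTS];

/* returns true if the cache already knows the verdict for this word */
static bool cache_lookup(unsigned h, const char *word, bool *correct)
{
    struct cache_entry *e = &cache[h & (CACHE_SLOTS - 1)];
    if (e->known && e->hash == h && strcmp(e->word, word) == 0) {
        *correct = e->correct;
        return true;
    }
    return false;
}

/* remember the verdict of a full table lookup, evicting whatever was in the slot */
static void cache_store(unsigned h, const char *word, bool correct)
{
    struct cache_entry *e = &cache[h & (CACHE_SLOTS - 1)];
    e->hash = h;
    strncpy(e->word, word, sizeof e->word - 1);
    e->word[sizeof e->word - 1] = '\0';
    e->known = true;
    e->correct = correct;
}

The catch, of course, is that I still have to hash the word once just to probe the cache, so the only thing saved is the chain walk and the string compares.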

I realize that the theoretical performance improvement would be limited to situations where the input text is long, and there are a high number of repeat misspellings. Based on some of the input texts I've evaluated, here are some of the results:

Unique Misspellings: 6960
Total Misspellings:  17845
Words in dictionary: 143091
Words in input text: 1150970
Total Time:          0.56 seconds

Unique Misspellings: 8348
Total Misspellings:  45691
Words in dictionary: 143091
Words in input text: 904612
Total Time:          0.83 seconds 

In the second sample run, you can see that I'm having to go back to the hash table approximately 5.5 times for each misspelled word! That seems nuts to me, and I feel there must be a more efficient way to address this circumstance, since most of my program's time is spent in the hash function.

I could implement Posix threads (this runs on an 8 core system) to improve the program's time, but I am more interested in improving my approach and thought process around the problem.

Sorry this is so long winded, but this is my first Stack Overflow post and I'm trying to be thorough. I searched before posting, but most other "spell check" posts are related to "how" and not "improve". I'm grateful for suggestions that get me pointed in the right direction.

http://github.com/Ganellon/spell_check

Brittenybrittingham answered 17/5, 2016 at 22:31 Comment(7)
The problem with the caching idea: you'll speed up the processing of words that happen to be in the cache, but slow down the processing of words not in the cache. So for example, if the misspelled words are put into a cache, then you'll speed up the processing of about 5000 words, but slow down the processing of about 1.14 million words. That's a bad trade-off. So I'd skip the caching idea, and work on getting all 8 cores computing hash functions.Cuvette
If you don't find what you're looking for here, try on Code Review.Epistle
Did you profile your code?Neoimpressionism
Given that you have a significant number of repeated misspellings, have you considered adding the misspellings to the hash? If you store the true/false result of the lookup, and insert bad words at the start of the hash chains, you might get some performance back.Fancier
@PaulHankin - sort of, using getrusage and rusage struct between calls to each function, where loading the dictionary into memory, checking incoming words against the hash table, and freeing the heap are individually timed, as well as overall Valgrind to ensure that everything is cleaned up.Brittenybrittingham
@AustinHastings Adding the misspellings / marking them "bad" in the hash would require a change to the chain link struct and require an additional check beyond the hash -> value -> key checks I'm doing already. I'll have to think on that. Good suggestion though. I will try it out.Brittenybrittingham
You could append an extra byte to the word, and store it past the trailing '\0'.Fancier

This is a pretty well-solved problem. ;-) You should look into a data structure called a trie. A trie is a tree built of individual characters, so that the path through the tree represents the information. Each node holds the letters that can legitimately follow the current prefix, and when the path to a node spells a complete word, that is also recorded.

For four words:

root-> [a]-> [a]-> [r]-> [d]-> [v]-> [a]-> [r]-> [k*]->[s*]
             [b]
                \> [a]-> [c]-> [i*]
                               [u]-> [s*]

This would represent "aardvark", "aardvarks", "abaci", and "abacus." The nodes are vertically contiguous, so 2nd letter [ab] is a node, and 5th letter [i*u] is a node.

Traverse the trie character by character, and check for a valid word when you hit a space. If you can't traverse with the character you have, it's a bad word. If the node you end on isn't marked as a valid word when you hit the space, it's also a bad word.

This is O(n) to process (n = word length) and it's very, very fast. Building the trie is going to consume a bunch of RAM, but you don't care about that I think.
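
A minimal sketch of the node layout and lookup (assuming lowercase a-z plus apostrophe; the names here are illustrative, not from your code):

#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET 27                      /* 'a'..'z' plus apostrophe */

typedef struct trie_node {
    struct trie_node *child[ALPHABET];   /* letters that may follow this prefix */
    bool is_word;                        /* a complete word ends at this node */
} trie_node;

static int index_of(char c)
{
    return c == '\'' ? 26 : tolower((unsigned char)c) - 'a';
}

static void trie_insert(trie_node *root, const char *word)
{
    for (; *word; ++word) {
        int i = index_of(*word);
        if (!root->child[i])
            root->child[i] = calloc(1, sizeof *root->child[i]);
        root = root->child[i];
    }
    root->is_word = true;                /* mark the end of a dictionary word */
}

static bool trie_check(const trie_node *root, const char *word)
{
    for (; *word; ++word) {
        root = root->child[index_of(*word)];
        if (!root)
            return false;                /* fell off the trie: bad word */
    }
    return root->is_word;                /* valid only if a word ends here */
}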

Fancier answered 17/5, 2016 at 22:45 Comment(4)
Thank you for the note, Austin. I had considered a trie instead of a hash table, but I'm not convinced it would provide a speed improvement. I'm at close to O(1), and moving to a trie would be a step in the wrong direction, unless I am missing something fundamental about your suggestion. I will go ahead and implement a trie anyhow, and post the results in a bit. Was hoping to avoid trial and error, but since I'm learning, those are the breaks. :-)Brittenybrittingham
Hashing a word is O(len) on the word. Trie traversal is O(len) as well. It's an apples/apples scenario.Fancier
@Drew: O(1) can be fast or it can be horribly slow. What matters is the constant factor. (Tweak your instructor about that - I'm amazed how many kiddoes come out of school thinking they can snow people with big-O.) I tend to agree with Austin's trie, because if there are only 50k words in the dictionary, the trie doesn't get very deep. What's more, you can use it for spelling correction. Also, I bet there's another trick they're not telling you - you can precompile the dictionary into code. That will give you massive speedup.Lindi
@MikeDunlavey thanks for your reply. The challenge to each of the points you make (and they are valid) is that I have absolutely no idea ahead of time what dictionary they will throw at me. So, they may give me a dictionary with 10 words, or 500,000 between runs. Otherwise, for sure I'd take the 30-40 seconds to pre-process and persist a perfect hash table. That would be great! But, they're onto that trick. ;-)Brittenybrittingham

In both of your trials, what is noticeable is that the majority of words are correctly spelled. Consequently, you should focus on optimizing lookup of words which are in the dictionary.

In your first trial, for example, only 1.5% of all words are misspelled. Suppose it takes twice as long on average to look up a word which is not in the dictionary (because every word in the bucket needs to be checked). Even if you reduced that to 0 (the theoretical minimum :) ), you would speed your program up by less than 3%.

A common hashtable optimization is to move the key you find to the beginning of the bucket chain, if it is not already there. That will tend to reduce the number of hash entries checked for commonly-used words. It's not a huge speed-up, but in cases where some keys are looked up much more often than others, it can definitely be noticed.
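
A sketch of that move-to-front tweak, assuming a simple chained node like the one described in the question (a key pointer plus a next pointer; names are illustrative):

#include <stdbool.h>
#include <string.h>

typedef struct node {
    const char *key;
    struct node *next;
} node;

/* look up word in its bucket; on a hit, relink the node at the head of the
   chain so frequently checked words are found after fewer string compares */
static bool check_move_to_front(node **bucket, const char *word)
{
    for (node **prev = bucket; *prev; prev = &(*prev)->next) {
        node *n = *prev;
        if (strcmp(n->key, word) == 0) {
            if (n != *bucket) {          /* unlink n and push it onto the front */
                *prev = n->next;
                n->next = *bucket;
                *bucket = n;
            }
            return true;
        }
    }
    return false;
}

Note that this makes lookups write to the table, which gives up the "no locking required because the table is immutable" property mentioned further down; it's only attractive in a single-threaded checker.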

Reducing chain length by decreasing the hashtable occupancy may help more, at the cost of more memory.

Another possibility, since you are not going to modify the dictionary once it is built, is to store each bucket chain in contiguous memory, without pointers. Not only will that reduce memory consumption, it will improve cache performance, because most words are short and so most buckets will fit in a single cache line.

And since words tend to be quite short, you may well be able to find a way to optimize the comparison. strcmp() is well optimized but it is generally optimized for larger strings. If you're allowed to use it, the SSE4.2 PCMPESTRI opcode is amazingly powerful (but figuring out what it does and how to use it to solve your problem can be a huge time-waster). More simply, you should be able to compare four eight-byte prefixes simultaneously with 256-bit comparison operations (and you might even have 512-bit operations available to you), so with clever data arrangement, you may well be able to do an entire bucket comparison in parallel.
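
Purely as an illustration of the four-prefixes-at-once idea (this assumes AVX2 and a bucket layout that stores up to four NUL-padded 8-byte prefixes contiguously; it is a sketch, not your code, and the scalar fallback is omitted):

#include <immintrin.h>                   /* AVX2: compile with -mavx2 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* NUL-padded first eight bytes of a word, packed into one integer */
static uint64_t prefix8(const char *word)
{
    uint64_t p = 0;
    for (size_t i = 0; i < 8 && word[i]; ++i)
        ((unsigned char *)&p)[i] = (unsigned char)word[i];
    return p;
}

/* a bucket holds up to four words; their prefixes sit contiguously so one
   256-bit compare tests all four at once */
struct bucket4 {
    uint64_t prefix[4];                  /* 0 in unused slots */
    const char *key[4];                  /* NULL in unused slots */
};

static bool bucket4_check(const struct bucket4 *b, const char *word)
{
    __m256i probes = _mm256_loadu_si256((const __m256i *)b->prefix);
    __m256i query  = _mm256_set1_epi64x((long long)prefix8(word));
    unsigned mask  = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi64(probes, query));
    for (int i = 0; i < 4; ++i)          /* confirm any prefix hit with strcmp */
        if (((mask >> (8 * i)) & 0xffu) && b->key[i] && strcmp(b->key[i], word) == 0)
            return true;
    return false;
}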

That's not to say that hashtables are necessarily the optimal datastructure for this problem. But remember that the more you can do in a single cache-line, the faster your program will run. Linked-list-intensive datastructures can turn out to be suboptimal even if they look good on paper.


After thinking about this problem for a couple of days and actually writing some code, I came to the conclusion that optimizing for successful hashtable lookup speed is probably not correct for a real-world spellchecker. It's true that most words in the text being looked up are usually correctly spelled -- although that depends on the spellcheck user -- but the algorithm which attempts to suggest correct spellings is likely to do a lot of unsuccessful lookups as it cycles through possible misspellings. I know that's probably out of scope for this problem, but it does make a difference for optimization, because you end up with two quite different strategies.

If you're trying to reject quickly, you need lots of possibly empty bucket chains, or a Bloom filter or its moral equivalent, so you can reject most errors on the first probe.

For example, if you have a good hash algorithm which yields more bits than you need -- and you almost certainly do, because spellcheck dictionaries aren't that big -- then you can just use some otherwise-unused bits in the hash for a secondary hash. Without even going to the trouble of implementing the entire Bloom filter, you can just add, say, a 32-bit mask to each bucket header representing the possible values of five secondary hash bits in the values stored in that bucket. Combined with a sparse table -- I used 30% occupancy for the experiment, which is not that sparse -- you should be able to reject 80-90% of lookup failures without going beyond the bucket header.
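
Concretely, that bucket-header filter can be as small as this (a sketch; the choice of which five bits to use and the node layout are assumptions):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct node { const char *key; struct node *next; };

struct bucket {
    uint32_t filter;                     /* bit i set if some entry's secondary hash is i */
    struct node *chain;                  /* the usual collision chain */
};

/* five hash bits not used for the bucket index select one of 32 filter bits */
static uint32_t filter_bit(uint32_t h)
{
    return 1u << (h >> 27);              /* assumes the top five bits are otherwise unused */
}

static void bucket_insert(struct bucket *b, struct node *n, uint32_t h)
{
    b->filter |= filter_bit(h);
    n->next = b->chain;
    b->chain = n;
}

static bool bucket_check(const struct bucket *b, const char *word, uint32_t h)
{
    if (!(b->filter & filter_bit(h)))    /* most failed lookups stop right here */
        return false;
    for (const struct node *n = b->chain; n; n = n->next)
        if (strcmp(n->key, word) == 0)
            return true;
    return false;
}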

On the other hand, if you're trying to optimize for success, then it might turn out that largish buckets are better because it cuts down on the number of bucket headers, and that improves cache usage. As long as the entire bucket fits into a cache-line, the speed of multiple comparisons is so high that you won't notice the difference. (And since words tend to be short, it's reasonable to expect five or six to fit into a 64-byte cacheline.)

Anyway, without going to too much work, I managed to do a million lookups in 70 milliseconds of CPU. Multiprocessing could speed up elapsed time quite a bit particularly as no locking is required given that the hash table is immutable.

The morals I want to draw from this:

In order to optimize:

  • you need to understand your data

  • you need to understand your expected usage pattern

  • you need to adapt your algorithms based on the above

  • you need to do a lot of experimenting.

Isabelleisac answered 18/5, 2016 at 1:17 Comment(23)
Thank you for your response, rici. I like the idea of moving the "commonest" link to the front of the list, even if it only saves a few iterations. Occupancy is less of a problem since I'm using a very large table with fewer than 2K internal collisions for the 150K word dictionary, though I have not yet analyzed each bucket to determine how evenly the collisions are distributed. (I have a secondary integer table that stores indexes of the hash table for easy cleanup.) I will definitely be looking at PCMPESTRI and I appreciate the suggestion.Brittenybrittingham
PCMPESTRI is actually easier to implement in ASM than it is in C. That is a first for me.Brittenybrittingham
@drew: there are intrinsics :) If you don't have many collisions, it's probably not useful for multiple bucket compares (but getting rid of the linked lists is still a good optimization for dictionaries). PCMPESTRI is also useful for implementing strtok-like things. But it's not sliced bread. Profile and see if it really helps.Isabelleisac
It is interesting that your second experiment takes significantly longer with fewer word lookups. Are the words longer? Or, contrary to what I suggest in my answer, is there something which is hugely slowing down failed lookups?Isabelleisac
re: longer lookup time - this is hosted on a cloud9 Ubuntu IDE, so mid-day runs are dubious. I ran it again just now on the same data set, and the result was: Unique Misspellings: 8348 WORDS MISSPELLED: 45691 WORDS IN DICTIONARY: 143091 WORDS IN TEXT: 904612 TIME IN load: 0.05 TIME IN check: 0.45 TIME IN size: 0.00 TIME IN unload: 0.03 TIME IN TOTAL: 0.53Brittenybrittingham
The variation in those timings makes that benchmarking technique unusable. It is possible that the bulk of the time is reading from the disk; standard benchmarking techniques would help get more consistent values (although not necessarily more meaningful). Do the experiment k times (in a loop); discard outliers and report the mean. That way you will probably have the input file cached in kernel buffers after the first couple of iterations, and the remaining ones won't need to wait for spindles.Isabelleisac
@drew: So, how did it work out for you in the end? For my little hack I used an Ubuntu dictionary of 99,171 words and an old newsfeed corpus I had kicking around. I'm probably cheating by not doing a second lookup on words with uppercase characters, but the result is to detect a lot more failures. Anyway, the input is a bit over 6MB; it has 968234 words, of which 199845 are not in the dictionary; building the dictionary takes about 25 milliseconds and looking up the text takes about 70 milliseconds. (i5 laptop, 2.5GHz, single thread executable.)Isabelleisac
Hey @rici, thanks for asking and for taking the time to try it out yourself. I got sidetracked working on the next project (writing an Apache-like web server) that I finished this afternoon. Now I'm going back to put a stamp on the spell checker. I'm also not checking caps. I'm cheating by taking the incoming word and bitwise OR 32 to give me all lowercase. I will have an answer for you tomorrow.Brittenybrittingham
I'll be working on this today if you want to take a look... github.com/Ganellon/spell_check main is in speller.c but according to the rules of the assignment, this source file may not be changed. All of my code is in dictionary.cBrittenybrittingham
I took a look at your code. I'm not sure if it will matter in your environment, but if you're running 64 bit pointers, you might get a speed up with 1.2 million of them just by cutting them down to 32 bits. Try replacing the pointers to the words in your dictionary with 32 bit values -either near pointers or 32 bit integers offset from the mem_dict base address. (This is primarily to improve cache performance by reducing data size.)Fancier
Also, don't free mem_dict. Just point into it. Allocating individual strings is worse, since malloc() and friends round up to some even block size.Fancier
Finally, consider abandoning mem_dict and throwing memory at the problem. If you treat the first 4 or 8 bytes of the chain entry as an integer type, you can zero-fill the temp buffer and your chain block and compare them as integers. if (*(int32_t*)temp == *(int32_t*)(Trav->key)) {... I think you'll do pretty well with 32 bits, and most of the common words are less than 64 bits long, so you can hard-code the logic (since you know lenny) to skip the memcmp entirely.Fancier
Finally-finally (really!), consider converting to Pascal strings (put an 8-bit length in the first char). You're doing this work already, and it triggers the earliest possible failure for words of different lengths.Fancier
@drew: Is the benchmarking code in speller.c part of what you were given? My immediate reaction was that it was "just plain wrong", but I've settled down to "inaccurate"; it's trying to measure events whose expected running time is (at worst) hundreds of nanoseconds with a microsecond resolution timer. Obviously, that's inexact but there is some reason to believe that it's not too inexact because if you have a lot of events, the timer will roll over to the next value inside an event with a probability roughly equal to the actual time taken by the event.Isabelleisac
@AustinHastings Thank you so much for taking a look! I'm not actually freeing mem_dict until the hash table is filled and have no further use for mem_dict. I thought I could just use it without creating a hash table, but then really I'd have a glorified linear search. Is that what you mean?Brittenybrittingham
@Isabelleisac Yes, everything that's in speller.c is as it arrived as part of the distro. Changes to speller.c are not permitted - and I can guess why in some places where I'd have made changes.Brittenybrittingham
@AustinHastings Just had an ah-ha moment with regard to keeping mem_dict around by moving it into global space and using those addresses. That will definitely save some time, rather than malloc for each new node. I'll still have to malloc for collision nodes, but there are far fewer. Thanks for this tip.Brittenybrittingham
@drew: it's clear that they're trying to not contaminate the timing of your lookup function with, for example, the timing of the word parser or with printing the misspelled words, if the printf is uncommented. But the result is to contaminate the benchmark with the cost of calling getrusage. On my system, getrusage takes about 220 nsec; a reasonable hash lookup shouldn't take even 100 nsec, so at least 2/3 of the accounted time is made up of calling getrusage.Isabelleisac
@Isabelleisac - you can see why I commented that line out of speller.c; having all of those misspellings dumping to the screen is nonsensical. I commented it out because I was only interested in the summary, but that is not the "official" version, which I will have to use in the final product.Brittenybrittingham
@AustinHastings changing node from key[45] to *key and updating the CreateNode and AddNode functions reduced the load function time from .06 to .04 seconds. This will always be a consistent rate based on the size of the dictionary, but this is a huge savings! Great suggestion.Brittenybrittingham
@drew: I suppose they dump the misspellings to stdout in order to verify that the correct misspellings are identified. However, combining a benchmark and a validation is not always a good idea. In this case, I would think that the two should be separated, in which case you could just surround the entire lookup loop with a single pair of getrusage. Of course, that would have the effect of including the cost of parsing, which is also non-trivial compared with the cost of lookup...Isabelleisac
... The way I'd do it would be to first read the entire input into memory, parse it into words, and validate the misspellings. Then I'd reparse the input, keeping track of how much time that took. Then I'd reparse the input again, but this time do the lookups, again keeping track of the time. The difference between the two would be the cost of lookup. Validating first would warm up the cache so that the other two loops would be (more) comparable.Isabelleisac
Let us continue this discussion in chat.Brittenybrittingham

A few insights/ideas you might explore:

  • where the values are similar in length - or only a little bigger than pointers - closed hashing will give you better performance than any open hashing aka separate chaining approach

  • the length of words being checked is a cheap (perhaps free if you're tracking it anyway) way you can direct validations to methods that are most optimal for that word length

    • to get more words onto fewer pages of memory (and thereby be more cache friendly), you could try having multiple hash tables, where the buckets are sized to the longest length of text therein

    • 4-byte and 8-byte buckets conveniently allow single-instruction aligned 32-bit and 64-bit value comparisons, if you pad the strings out with NULs (i.e. you can make a union of uint32_t and char[4], or uint64_t and char[8], and compare the integer values; see the sketch after this list).

    • your choice of hash function is important: try a good few

    • your choice of collision handling strategy is important too: profile with linear, quadratic, and perhaps a list of primes (1, 3, 7, 11...).

    • the number of buckets is a balancing act: too few and you have too many collisions, too many buckets and you have more memory cache misses, so test with a range of values to find the optimal settings

  • you might profile a more collision-averse prime number of buckets (using % to fold hash values into the bucket index range) vs. a power-of-two bucket count (where you can use faster & bitmasking)

  • many of the above interact: e.g. if you use a strong hash function, you have less need of a prime number of buckets; if you have fewer collisions, you have less need of an elaborate post-collision search order through alternative buckets

  • the spell checking is very easy to scale with threads, as you're doing read-only hash table lookups; the prior insertion of the dictionary into the hash table(s) - less so, though using multiple tables as above offers one way to parallelise it
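
To illustrate the 4-byte/8-byte bucket point above, here's a minimal sketch of the union trick for words of up to 7 characters (names are illustrative; a char[4]/uint32_t variant is identical in shape):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* a bucket slot for words of up to 7 characters, NUL-padded to 8 bytes */
union short_word {
    char     text[8];
    uint64_t bits;
};

/* pack a short word into the union, zero-filling the padding bytes */
static union short_word pack8(const char *word)
{
    union short_word w;
    memset(&w, 0, sizeof w);             /* padding must be zero so equal words compare equal */
    strncpy(w.text, word, sizeof w.text - 1);
    return w;
}

/* equality of two packed words is a single aligned 64-bit compare */
static bool short_word_equal(union short_word a, union short_word b)
{
    return a.bits == b.bits;
}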

Offset answered 18/5, 2016 at 4:32 Comment(0)
