One of the assignments in Harvard's CS50 course has students build a spell-checking program. The main thrust of the assignment is speed, pure speed, and I've gotten to the point where I'm beating the staff's implementation, but I feel that I can do better, and I'm looking for a push in the right direction.
Here is my pseudocode:
    // read the dictionary word list
    read the entire dictionary into memory with one fread
    scan the buffer with rawmemchr and pick out the words
    run each word through the hash function
    chain a node onto any index where a collision occurs

    // accept the incoming test words
    run each test word through the hash function
    compare it against the existing table / linked list
    return the result of the comparison
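To make that concrete, here is a stripped-down version of the load and check steps. It's a sketch, not my actual code: djb2 stands in for MurmurHash3 to keep it short, strtok stands in for the rawmemchr scan, and error handling is omitted.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS (1 << 18)          // power of two, so masking replaces modulo

    typedef struct node
    {
        const char *word;               // points into the dictionary buffer
        struct node *next;              // chain link for collisions
    }
    node;

    static node *table[NBUCKETS];

    // djb2 as a stand-in for MurmurHash3; returns the full hash, callers mask it
    static unsigned hash(const char *s)
    {
        unsigned h = 5381;
        while (*s)
        {
            h = h * 33 + (unsigned char) *s++;
        }
        return h;
    }

    // one fread for the whole dictionary, then split and chain
    static void load(const char *path)
    {
        FILE *f = fopen(path, "rb");
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        rewind(f);

        char *buf = malloc(size + 1);
        fread(buf, 1, size, f);
        buf[size] = '\0';
        fclose(f);

        // strtok stands in for the rawmemchr scan
        for (char *w = strtok(buf, "\n"); w != NULL; w = strtok(NULL, "\n"))
        {
            node *n = malloc(sizeof(node));
            unsigned i = hash(w) & (NBUCKETS - 1);
            n->word = w;
            n->next = table[i];
            table[i] = n;
        }
    }

    // returns 1 if word is in the dictionary, 0 otherwise
    static int check(const char *word)
    {
        for (node *n = table[hash(word) & (NBUCKETS - 1)]; n != NULL; n = n->next)
        {
            if (strcmp(n->word, word) == 0)
            {
                return 1;
            }
        }
        return 0;
    }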
With a dictionary of 150K words and input texts of up to 6 MB, I'm able to spell check accurately in about half a second.
However, when I look at the words coming from the input text, it's pretty clear that a large percentage of them are common words (like "the", "and", "for"), and that most of the misspelled words are also checked multiple times.
My intuition says I should be able to "cache" the "good hits" and "bad hits" so that I'm not hashing the same words over and over for table lookups. Even though each lookup is already close to O(1), I feel like I should be able to shave a few microseconds off the time by re-evaluating my approach.
For example, after I've loaded the dictionary, the input text could be 8 MB of nothing but "missspeling" repeated. Rather than hash and check the same word over and over (at a computational expense), I'd like to know whether there is a way to programmatically discard words that have already been hashed and rejected, but in a way that is cheaper than the hash / check itself. (I'm using MurmurHash3, fwiw.)
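Here is the shape of what I'm imagining, building on the sketch above (cheap_key, check_cached, and CACHE_SIZE are names I've invented for illustration, not anything from my actual code). A direct-mapped cache holds the last word that landed in each slot along with its verdict, indexed by a key that is deliberately much cheaper to compute than the real hash, so a repeated word (spelled correctly or not) costs one cheap key and one strcmp instead of a full hash pass plus a chain walk:

    #define CACHE_SIZE (1 << 16)

    typedef struct
    {
        char word[46];                  // LENGTH is 45 in the CS50 handout, plus '\0'
        int verdict;                    // 1 = in dictionary, 0 = misspelled
    }
    cache_slot;

    static cache_slot cache[CACHE_SIZE];

    // deliberately weak key: the whole point is that a cache probe has to be
    // cheaper than a full hash pass, or the cache can't pay for itself
    static unsigned cheap_key(const char *s, size_t len)
    {
        return ((unsigned char) s[0] * 131u
              + (unsigned char) s[len - 1] * 31u
              + (unsigned) len) & (CACHE_SIZE - 1);
    }

    // same contract as check(), but repeated words skip the hash entirely
    static int check_cached(const char *word)
    {
        size_t len = strlen(word);
        if (len == 0 || len > 45)       // oversized words fall through to check()
        {
            return check(word);
        }

        cache_slot *slot = &cache[cheap_key(word, len)];

        // cache hit: one strcmp, and the full hash never runs
        if (strcmp(slot->word, word) == 0)
        {
            return slot->verdict;
        }

        // cache miss: normal hash / chain walk, then remember the verdict
        int verdict = check(word);
        memcpy(slot->word, word, len + 1);
        slot->verdict = verdict;
        return verdict;
    }

Whether this actually wins would depend on the hit rate: every miss pays the probe on top of the normal lookup, and two distinct words that map to the same slot keep evicting each other.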
I realize that the theoretical performance improvement would be limited to situations where the input text is long and the number of repeat misspellings is high. Based on some of the input texts I've evaluated, here are two sample runs:
    Run 1:
        Unique Misspellings:     6960
        Total Misspellings:     17845
        Words in dictionary:   143091
        Words in input text:  1150970
        Total Time:              0.56 seconds

    Run 2:
        Unique Misspellings:     8348
        Total Misspellings:     45691
        Words in dictionary:   143091
        Words in input text:   904612
        Total Time:              0.83 seconds
In the second sample run, you can see that I'm having to go back to the hash table approximately 5.5 times for each unique misspelled word (45691 / 8348 ≈ 5.5)! That seems nuts to me, and I feel there must be a more efficient way to handle this situation, since most of my program's time is spent in the hash function.
I could implement POSIX threads (this runs on an 8-core system) to improve the program's wall-clock time, but I am more interested in improving my approach and thought process around the problem.
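Just for reference, if I did go the pthreads route, I'd expect the split to look roughly like this (slice, worker, and check_parallel are illustrative names). Each thread scans a byte range of the text buffer, with chunk boundaries pushed forward to the next word break; it calls the plain check() rather than the verdict cache above, because that cache isn't thread-safe as written:

    #include <ctype.h>
    #include <pthread.h>

    #define NTHREADS 8                  // one per core on my machine

    typedef struct
    {
        const char *begin;              // half-open byte range of the text buffer
        const char *end;
        long misspelled;                // per-thread tally, merged after join
    }
    slice;

    static void *worker(void *arg)
    {
        slice *s = arg;
        char word[46];
        int len = 0;

        for (const char *p = s->begin; p < s->end; p++)
        {
            // note: words over 45 letters get split here; a real scanner should skip them
            if (isalpha((unsigned char) *p) && len < 45)
            {
                word[len++] = tolower((unsigned char) *p);
            }
            else if (len > 0)
            {
                word[len] = '\0';
                if (!check(word))       // the table is read-only here, so no locks
                {
                    s->misspelled++;
                }
                len = 0;
            }
        }
        if (len > 0)                    // flush a word that touches the slice end
        {
            word[len] = '\0';
            if (!check(word))
            {
                s->misspelled++;
            }
        }
        return NULL;
    }

    static long check_parallel(const char *buf, size_t size)
    {
        pthread_t tid[NTHREADS];
        slice s[NTHREADS];
        size_t chunk = size / NTHREADS;

        for (int i = 0; i < NTHREADS; i++)
        {
            s[i].begin = buf + i * chunk;
            s[i].end = (i == NTHREADS - 1) ? buf + size : buf + (i + 1) * chunk;
            s[i].misspelled = 0;
        }

        // push each boundary to the next non-letter so no word is split,
        // and keep adjacent slices in agreement about where they meet
        for (int i = 0; i + 1 < NTHREADS; i++)
        {
            while (s[i].end < buf + size && isalpha((unsigned char) *s[i].end))
            {
                s[i].end++;
            }
            s[i + 1].begin = s[i].end;
        }

        for (int i = 0; i < NTHREADS; i++)
        {
            pthread_create(&tid[i], NULL, worker, &s[i]);
        }

        long total = 0;
        for (int i = 0; i < NTHREADS; i++)
        {
            pthread_join(tid[i], NULL);
            total += s[i].misspelled;
        }
        return total;
    }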
Sorry this is so long-winded, but this is my first Stack Overflow post and I'm trying to be thorough. I searched before posting, but most of the other "spell check" posts are about "how" rather than "improve". I'm grateful for suggestions that point me in the right direction.