If false positives are acceptable, then one possible solution would be to use a bloom filter. Bloom filters are similar to hash tables, but instead of using one hash value to index a table of buckets, it uses multiple hashes to index a bit array. The bits corresponding to those indices are set. Then, to test if a string is in the filter, the string is hashed again, and if the corresponding indices are set, then the string is "in" the filter.
It doesn't store any information about the strings, so it uses very little memory -- but if there's a collision between two strings, no collision resolution is possible. This means there may be false positives (because a string not in the filter may hash to the same indices as a string that is in the filter). However, there can be no false negatives; any string that really is in the set will be found in the bloom filter.
There are a few Python implementations. It's also not hard to roll your own; I recall coding up a quick-and-dirty bloom filter myself once using bitarray
s that performed pretty well.
set
and let the OS's virtual memory manager page it to disk as necessary. You can also explicitly save it to disk usingpickle
. No need to build a database. – Evette