Efficient way to check existence in a large set of strings

I have a set of 100+ million strings, each up to 63 characters long. I have lots of disk space and very little memory (512 MB). I need to query for existence alone, and store no additional metadata.

My de facto solution is a BDB B-tree. Are there any preferable alternatives? I'm aware of leveldb and Kyoto Cabinet, but I'm not familiar enough with either to identify their advantages.

Gregoire asked 15/11, 2012 at 20:19 Comment(3)
Can you tolerate an occasional false positive? – Gitt
False negatives are unacceptable; occasional false positives are potentially tolerable. – Gregoire
Just store them all in a set and let the OS's virtual memory manager page it to disk as necessary. You can also explicitly save it to disk using pickle. No need to build a database. – Evette

If false positives are acceptable, then one possible solution would be to use a bloom filter. Bloom filters are similar to hash tables, but instead of using one hash value to index a table of buckets, they use multiple hashes to index a bit array. The bits corresponding to those indices are set. Then, to test whether a string is in the filter, the string is hashed again, and if all of the corresponding bits are set, the string is "in" the filter.

It doesn't store any information about the strings, so it uses very little memory -- but if there's a collision between two strings, no collision resolution is possible. This means there may be false positives (because a string not in the filter may hash to the same indices as a string that is in the filter). However, there can be no false negatives; any string that really is in the set will be found in the bloom filter.

There are a few Python implementations. It's also not hard to roll your own; I recall once coding up a quick-and-dirty bloom filter using bit arrays, and it performed pretty well. A minimal sketch along those lines follows.
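
As an illustration (my own sketch, not the answerer's original code), here is a small pure-Python bloom filter built on hashlib and a bytearray; the sizing numbers are assumptions chosen for readability rather than tuned values:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k salted MD5 hashes indexing into a packed bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # one bit per slot

    def _indices(self, item):
        # Derive several bit positions from one string by salting the hash input.
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for i in self._indices(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, item):
        # A hit may be a false positive; a miss is always a true negative.
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indices(item))

# Roughly 10 bits per key with 7 hashes gives on the order of a 1% false-positive
# rate, so 100 million strings would need about a billion bits (~120 MB).
bf = BloomFilter(num_bits=10_000_000, num_hashes=7)
bf.add("drew sears")
print("drew sears" in bf)    # True
print("someone else" in bf)  # False (with high probability)
```

At roughly 120 MB for 100 million keys, a filter sized like this would also fit comfortably within the 512 MB of RAM mentioned in the question.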

Gitt answered 15/11, 2012 at 20:47 Comment(2)
How did your "quick-and-dirty" implementation resolve false positives? – Evette
@martineau, well, it didn't really. False positives were acceptable in my case. I was iterating over a very large dataset, looking for possible duplicates. I didn't need to know for sure that they were duplicates; I was just thinning out the dataset for further processing. – Gitt

You said you have lots of disk, huh? One option would be to store the strings as filenames in nested subdirectories. You could either use the strings directly:

  • Store "drew sears" in d/r/e/w/ sears

or take a hash of the string and follow a similar process:

  • MD5('drew sears') = 'f010fe6e20d12ed895c10b93b2f81c6e'
  • Create an empty file named f0/10/fe/6e/20d12ed895c10b93b2f81c6e.

Think of it as an OS-optimized, hash-table-based NoSQL database.

Side benefits:

  • You could change your mind at a later time and store data in the file.
  • You can replicate your database to another system using rsync.
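
For concreteness, here is a rough sketch of the hashed variant (my own illustration; the stringdb root directory and the two-character split widths are assumptions, not part of the answer):

```python
import hashlib
import os

ROOT = "stringdb"  # hypothetical base directory

def _path_for(s):
    # Per the example above: 'drew sears' -> f0/10/fe/6e/20d12ed895c10b93b2f81c6e
    digest = hashlib.md5(s.encode("utf-8")).hexdigest()
    return os.path.join(ROOT, digest[:2], digest[2:4], digest[4:6], digest[6:8], digest[8:])

def add(s):
    path = _path_for(s)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "a").close()  # touch an empty file; its existence is the record

def contains(s):
    return os.path.exists(_path_for(s))

add("drew sears")
print(contains("drew sears"))    # True
print(contains("someone else"))  # False
```
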
Boulevardier answered 15/11, 2012 at 21:00 Comment(4)
+1 for creativity -- but checking for the existence of such deeply nested files through the OS could be slow, not to mention creating them all in the first place. – Evette
Actually, now that I think about it more, why have a bunch of nested subdirectories? Instead, just create an empty file for each string and store them all in a single directory. Of course, there might be some filesystem issues... – Evette
Without testing, I'm not sure whether it'd be slow or not. It could actually be pretty sprightly depending on the filesystem. A lot of filesystems are much faster with smaller directories, and most (all?) modern FSes cache directory lookups well. That was the motivation for the nested subdirectories. – Boulevardier
Yes, 100 million files in a single directory would cause "some filesystem issues". – Gregoire
