Efficient way to check existence in a large set of strings

I have a set of 100+ million strings, each up to 63 characters long. I have lots of disk space and very little memory (512 MB). I need to query for existence alone, and store no additional metadata.

My de facto solution is a BDB B-tree. Are there any preferable alternatives? I'm aware of leveldb and Kyoto Cabinet, but I'm not familiar enough with either to identify their advantages.

Gregoire asked 15/11, 2012 at 20:19 Comment(3)
Can you tolerate an occasional false positive? – Gitt
False negatives are unacceptable; occasional false positives are potentially tolerable. – Gregoire
Just store them all in a set and let the OS's virtual memory manager page it to disk as necessary. You can also explicitly save it to disk using pickle. No need to build a database. – Evette

If false positives are acceptable, then one possible solution would be to use a bloom filter. Bloom filters are similar to hash tables, but instead of using one hash value to index a table of buckets, they use multiple hashes to index a bit array. The bits corresponding to those indices are set. Then, to test whether a string is in the filter, the string is hashed again, and if all of the corresponding bits are set, the string is "in" the filter.

It doesn't store any information about the strings, so it uses very little memory -- but if there's a collision between two strings, no collision resolution is possible. This means there may be false positives (because a string not in the filter may hash to the same indices as a string that is in the filter). However, there can be no false negatives; any string that really is in the set will be found in the bloom filter.

There are a few Python implementations. It's also not hard to roll your own; I recall once coding up a quick-and-dirty bloom filter using bit arrays, and it performed pretty well. A minimal sketch along those lines follows.
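
As an illustration (my own sketch, not the answerer's original code), here is a small pure-Python bloom filter built on hashlib and a bytearray; the sizing numbers are assumptions chosen for readability rather than tuned values:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k salted MD5 hashes indexing into a packed bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # one bit per slot

    def _indices(self, item):
        # Derive several bit positions from one string by salting the hash input.
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for i in self._indices(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, item):
        # A hit may be a false positive; a miss is always a true negative.
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indices(item))

# Roughly 10 bits per key with 7 hashes gives on the order of a 1% false-positive
# rate, so 100 million strings would need about a billion bits (~120 MB).
bf = BloomFilter(num_bits=10_000_000, num_hashes=7)
bf.add("drew sears")
print("drew sears" in bf)    # True
print("someone else" in bf)  # False (with high probability)
```

At roughly 120 MB for 100 million keys, a filter sized like this would also fit comfortably within the 512 MB of RAM mentioned in the question.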

Gitt answered 15/11, 2012 at 20:47 Comment(2)
How did your "quick-and-dirty" implementation resolve false positives? – Evette
@martineau, well, it didn't really. False positives were acceptable in my case. I was iterating over a very large dataset, looking for possible duplicates. I didn't need to know for sure that they were duplicates; I was just thinning out the dataset for further processing. – Gitt

You said you have lots of disk, huh? One option would be to store the strings as filenames in nested subdirectories. You could either use the strings directly:

  • Store "drew sears" in d/r/e/w/ sears

or take a hash of the string and follow a similar process:

  • MD5('drew sears') = 'f010fe6e20d12ed895c10b93b2f81c6e'
  • Create an empty file named f0/10/fe/6e/20d12ed895c10b93b2f81c6e.

Think of it as an OS-optimized, hash-table-based NoSQL database.

Side benefits:

  • You could change your mind at a later time and store data in the file.
  • You can replicate your database to another system using rsync.
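
For concreteness, here is a rough sketch of the hashed variant (my own illustration; the stringdb root directory and the two-character split widths are assumptions, not part of the answer):

```python
import hashlib
import os

ROOT = "stringdb"  # hypothetical base directory

def _path_for(s):
    # Per the example above: 'drew sears' -> f0/10/fe/6e/20d12ed895c10b93b2f81c6e
    digest = hashlib.md5(s.encode("utf-8")).hexdigest()
    return os.path.join(ROOT, digest[:2], digest[2:4], digest[4:6], digest[6:8], digest[8:])

def add(s):
    path = _path_for(s)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "a").close()  # touch an empty file; its existence is the record

def contains(s):
    return os.path.exists(_path_for(s))

add("drew sears")
print(contains("drew sears"))    # True
print(contains("someone else"))  # False
```
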
Boulevardier answered 15/11, 2012 at 21:00 Comment(4)
+1 for creativity -- but checking for the existence of such deeply nested files through the OS could be slow, not to mention creating them all in the first place. – Evette
Actually, now that I think about it more, why have a bunch of nested subdirectories? Instead, just create an empty file for each string and store them all in a single directory. Of course, there might be some filesystem issues... – Evette
Without testing, I'm not sure whether it'd be slow or not. It could actually be pretty sprightly depending on the filesystem. A lot of filesystems are much faster with smaller directories, and most (all?) modern FSes cache directory lookups well. That was the motivation for the nested subdirectories. – Boulevardier
Yes, 100 million files in a single directory would cause "some filesystem issues". – Gregoire
