I have been working at my job with some very large sets of data, typically several billion elements each, that are all maintained in a memcached cloud and periodically dumped into files. For one of my tasks I'm trying to count the cardinality of such a set.
For some context, each item contains an IP address and some other attributes identifying a person, and is encoded in base64; the item size is 20 bytes. Reducing the size of an item by removing some fields is not an option.
Here is an in-memory version that emulates my dataset (thanks to this post for the string generation):
import base64, os
dataset_size = 10000000000  # that's 10 billion, be careful if you run it!
# 15 random bytes -> a 20-character base64 string, matching my real item size
big_dataset = [base64.b64encode(os.urandom(15)) for i in xrange(dataset_size)]
My first approach was to use a hashset like this:
uniques = set(big_dataset)
print "Cardinality: %d" % len(uniques)
While this works fine in theory on a small dataset, as you can guess there are a couple of hiccups:
- I can't make any assumptions about the uniqueness of my data. 50% of my dataset could be unique, or just as easily 100%. The data is generated dynamically at regular intervals and varies depending on a lot of factors (time of day, for example)
- The dataset size is 10 billion. Each item encoded in base64 is 20 bytes, and 20 bytes times 10 billion is roughly 200 gigabytes of raw data alone, before any Python overhead (see the back-of-envelope calculation below). Unfortunately, I don't have access to a machine with that much RAM!
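To make the memory problem concrete, here is the rough back-of-envelope calculation behind that number (the exact figures printed will vary with the machine and Python build):

import sys, base64, os
sample = base64.b64encode(os.urandom(15))   # one 20-character key, like my real items
raw_bytes = 20 * 10 ** 9                    # 20 bytes * 10 billion = 200 GB of raw key data
print "raw key data:     %d GB" % (raw_bytes / 10 ** 9)
print "one key as a str: %d bytes" % sys.getsizeof(sample)  # per-object overhead on top of the 20 bytes
# a set() additionally stores a hash and a pointer per slot, so the footprint of
# uniques = set(big_dataset) would be several times the 200 GB of raw data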
I've done my homework and found, at best, some research papers or some obscure libraries, but part of the goal here is to understand which approach works and why.
So I'm calling on you, Python users: do you know of any algorithm that would help me estimate cardinality efficiently? By complexity I mean that I don't care much about running-time complexity; I'm far more concerned with space complexity. I don't mind sacrificing a bit of accuracy if it reduces memory use tremendously (so I don't necessarily need the exact number of uniques, even if that would be ideal, but it's probably not a viable approach); an error of up to 5% would be acceptable. I'm looking for something specifically in Python for this project.
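For what it's worth, here is the kind of thing I have been toying with after skimming those papers: a stripped-down HyperLogLog-style counter. The register count (p=14) and the md5-based 64-bit hash are just placeholders I picked for the experiment; if I read the theory right, the standard error of this family of sketches is about 1.04/sqrt(m), so 16384 registers should land well under my 5% budget.

import hashlib, math

class HyperLogLog(object):
    """Stripped-down HyperLogLog: m = 2**p registers, each keeping the maximum
    'rank' (position of the first 1-bit) seen among the hashes routed to it."""

    def __init__(self, p=14):                       # p=14 -> 16384 registers, ~0.8% std error
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant from the HLL paper

    def add(self, item):
        x = int(hashlib.md5(item).hexdigest(), 16) & ((1 << 64) - 1)  # 64-bit hash of the item
        j = x & (self.m - 1)                        # low p bits choose a register
        w = x >> self.p                             # remaining 64-p bits drive the rank
        rank = 64 - self.p - w.bit_length() + 1     # leading zeros in w, plus one
        if rank > self.registers[j]:
            self.registers[j] = rank

    def cardinality(self):
        z = 1.0 / sum(2.0 ** -r for r in self.registers)  # harmonic mean term
        estimate = self.alpha * self.m * self.m * z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:      # small-range correction (linear counting)
            estimate = self.m * math.log(float(self.m) / zeros)
        return int(estimate)

hll = HyperLogLog(p=14)
for item in big_dataset:            # or stream the items straight from the dumped files
    hll.add(item)
print "Estimated cardinality: %d" % hll.cardinality()

The appeal is that memory stays fixed (the registers are a few hundred KB of Python objects at most) no matter how many items I feed it, but I'd like to know whether this is the right direction or whether there is a better-suited algorithm or an existing library I should trust instead.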
Thanks for any help you can provide!
As some people noted, I could use Hadoop/MapReduce, but for this specific project we don't want to go the MR route, and would like to explore algorithms that do this efficiently on a single machine, as the approach could be applied to a few other projects.