How do you count cardinality of very large datasets efficiently in Python?

I have been playing at work with some very large datasets, typically several billion elements, that are all maintained in a memcached cloud and periodically dumped into files. For one of my tasks I'm trying to count the cardinality of this set.

For some context, each item contains an IP and some other attributes identifying a person, and is encoded in base64; item size is 20 bytes. Reducing the size of an item by removing some fields is not a possibility.

Here is something that emulates my dataset as an in-memory version (thanks to this post for string generation):

import base64, os

dataset_size = 10000000000  # that's 10 billion, be careful if you run it!
# 15 random bytes -> 20 base64 characters, matching the 20-byte item size
big_dataset = [base64.b64encode(os.urandom(15)) for i in xrange(dataset_size)]

My first approach was to use a hashset like this:

uniques = set(big_dataset)
print "Cardinality: %d" % len(uniques)

While this in theory works fine on a small dataset, as you can guess there is a hiccup:

  • I can't make any assumption on the uniqueness of my data. I could have 50% of my dataset that is unique, or I could have 100% just as well. This is generated dynamically at regular time intervals and varies depending on a lot of factors (time of day, for example).
  • The dataset size is 10 billion. Each item encoded in base64 is 20 bytes; times 10 billion, that's a couple hundred gigabytes (see the back-of-envelope check right below this list). Unfortunately, I don't have access to a machine with that much RAM!
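
As a quick back-of-envelope check of the raw payload alone (ignoring Python's per-object overhead, which only makes things worse):

item_size = 20                 # bytes per base64-encoded item
dataset_size = 10 * 10**9      # 10 billion items
print "raw payload: %.0f GB" % (item_size * dataset_size / 1e9)   # ~200 GB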

I've done my homework and found at best some research papers, or some obscure libraries, but part of the goal of this is to understand what approach works and why.

So I'm calling on you Python users: do you know of any algorithm that would help me estimate cardinality efficiently? By complexity I mean that I don't care much about running-time complexity; I'm more focused on space complexity. I don't mind sacrificing a bit of accuracy if it boosts performance tremendously (so I don't necessarily need to know the exact number of uniques, even if that would be ideal, but it's probably not a viable approach). I would say up to 5% error would be acceptable. I'm looking for something specifically in Python for this project.

Thanks for any help you can provide!

As some people noted, I could use Hadoop/MR, but for this specific project we don't want to go the MR way; we'd like to explore algorithms to do this efficiently on a single machine, as this could be applied to a few other projects.

Canonicity answered 15/4, 2012 at 18:1 Comment(5)
Just a suggestion, but this sounds like something that might be a good fit for the Map-Reduce framework -- mapping elements in your data down to counts in a dictionary or something. For this, you could use MRJob, the Python Map-Reduce framework created by Yelp. Running it on Amazon EC2 with MRJob might also be a good idea for you. I have used it for word frequency counts in large corpora before. I guess it depends exactly on how you would parse individual data elements.Mellie
Thanks for the suggestion, yes I've thought about MR (I'm in fact using it a lot in other projects), but for this specific problem MR/Hadoop is not an option; we'd like to look into algorithms to do this on a single machine as part of a proof of concept.Canonicity
If 100% accuracy is not important, perhaps a bloom filter that gives you 5% error would fit in memory? If not, and a single machine is necessary, you can simply use a simple NoSQL database with unique keys that stores on disk and removes duplicates. It will be slow, but it will work with whatever amount of RAM you have. You can still parallelize the actual insertion work.Evenings
There was recently an excellent post on HN about a company (Clearspring) doing this in the real world... Big Data Counting: How To Count A Billion Distinct Objects Using Only 1.5KB Of MemoryUprear
Wow, excellent link, that seems exactly like what I need! Wish I had seen that before.Canonicity

I would recommend using hash sketches, namely (Super)LogLog sketches or HyperLogLog sketches.

You can check and perhaps use and improve the simple python implementation that I made: https://github.com/goncalvesnelson/Log-Log-Sketch
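
To make the idea concrete, here is a bare-bones HyperLogLog-style estimator in plain Python. This is only my own sketch of the raw estimator, not the API of the linked library, and it leaves out the small-range and large-range corrections from the paper, so treat it as a starting point rather than something production-ready:

import hashlib

class HyperLogLog(object):
    def __init__(self, p=14):
        self.p = p                       # 2**14 = 16384 registers, ~1% error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # take 64 bits of a SHA-1 hash: the first p bits pick a register,
        # the rest gives the position of the leftmost 1-bit (the "rank")
        h = int(hashlib.sha1(item).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def cardinality(self):
        # raw HyperLogLog estimate: alpha_m * m^2 / sum(2^-register)
        alpha = 0.7213 / (1.0 + 1.079 / self.m)
        return int(alpha * self.m * self.m /
                   sum(2.0 ** -r for r in self.registers))

hll = HyperLogLog()
for i in xrange(1000000):
    hll.add("item-%d" % i)
print "Estimated cardinality: %d" % hll.cardinality()

Conceptually the whole sketch is just the 16384 small registers (a few kilobytes), no matter how many items you stream through it, which is what makes this family of algorithms attractive for a 10-billion-item dataset.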

Stumper answered 15/4, 2012 at 20:3 Comment(4)
Awesome, how does it compare to a Bloom filter as mentioned in the previous post? Do you know the pros and cons of using LogLog/HyperLogLog?Canonicity
The main issue with these techniques is their inaccuracy at small cardinalities (< 2000). In the graph comparing Bloom filters vs hash sketches you can see that for small cardinalities of up to roughly 2000 elements their error is greater than 5%, but for bigger cardinalities it stays below your desired 5%. Despite not having the same accuracy as Bloom filters, looking at this comparison (cl.ly/3M2T1h3s1T2e1G0N1u1K) you can check that both of these techniques are much more efficient in terms of space.Stumper
@Stumper is any source code available for producing those two graphs?Nazarene
@MaxGhenis It was a long time ago, and I've since switched computers, but I found these files: dropbox.com/sh/omgxqmmt4hnjgfq/AABHAu42sEAz_UrGkXL2fUbJa?dl=0 It must be the benchmark and sizebenchmark files, but I'm not sure those are the final versions. I hope this is still helpful.Stumper

I would advise you to try a Bloom filter. Even with such an amount of data you can achieve extremely low error rates with modest system requirements. Given that you will be using the (roughly) optimal k = ln(2) * (Bloom filter size in bits) / (10 billion), you can calculate your Bloom filter size in bits as -(10 billion * ln(desired false positive rate)) / ln(2)^2.

For example, with roughly 1.8 GB of memory you can get an error rate of 0.1% for a billion distinct items; at the full 10 billion items, a 0.1% error rate needs around 18 GB, and a 5% error rate around 8 GB. A very fast and extremely simple implementation of all this is http://mike.axiak.net/python-bloom-filter/docs/html/
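
As a quick sanity check of those formulas in plain Python (using the question's 10 billion items and a 5% false positive target; the numbers are illustrative and not tied to any particular library):

import math

n = 10 * 10**9                            # expected number of distinct items
p = 0.05                                  # acceptable false positive rate
m = -n * math.log(p) / math.log(2) ** 2   # filter size in bits
k = math.log(2) * m / n                   # optimal number of hash functions
print "bits: %d (~%.1f GB), hash functions: %d" % (m, m / 8e9, round(k))
# -> roughly 62 billion bits (~7.8 GB) and 4 hash functions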

Nonplus answered 15/4, 2012 at 20:12 Comment(0)