How to handle a dict variable with 2^50 elements?
I have to find SHA256 hashes of 2^25 random strings, and then look for a collision (using the birthday paradox on the last, say, 50 bits of the hash only).

I am storing the string:hash pairs in a dict variable, then sorting the dict by values (not keys), and then looking for a collision using an O(n) loop.

The problem is that, since there are 2^25 strings and their 2^25 hashes, the dict variable has 2^50 values in it. This is EXTREMELY resource intensive. So, how do I do this with limited RAM, say, around 1GB?

What I have already tried:
1. Running this with 6GB of swap space. The program ran overnight and was still not done. This is essentially even slower than an O(n^2) search! The hashes are calculated with RAM usage of about 3.2 GB, but after that, when it comes to the sort call, the RAM usage starts shooting up again! I thought the Python sort used an in-place quicksort :(
2. Storing only the hashes. I found a collision, but could not find the corresponding strings since I did not store them.

I am not supposed to use databases etc.; at most a text file, but that doesn't help. Also, I am pretty new to Python, but do not let that limit the level of your answer.

PS: This is an assignment. Some claim to have found the collisions in under one minute with 300MB of RAM; I don't know if that is true. I have solved the problem in principle, but actually reaching the answer is beyond my machine! I'm at work, so I do not have access to the code right now; will add it soon.

Code:

from Crypto.Hash import SHA256
import os
import random
import string
from operator import itemgetter

def shaa():

    clist = {}
    # xrange avoids materialising a 33554432-element list (Python 2)
    for i in xrange(0, 33554432):
        sha = SHA256.new(str(i)).hexdigest()
        # keep only the last 50 bits of the digest
        sha = int(sha, 16) & ((1 << 50) - 1)
        clist[i] = sha

    print 'Hashes done.'

    # sort the (string index, truncated hash) pairs by hash value
    clist = sorted(clist.items(), key=itemgetter(1))
    for i in xrange(0, 33554431):
        # compare the hash values, not the whole (index, hash) tuples
        if clist[i][1] == clist[i + 1][1]:
            print clist[i], clist[i + 1]
            return 1
    return 2

result=2
while(result==2):
    result=shaa()
Antiseptic answered 10/4, 2012 at 13:3 Comment(4)
It's not so bad, since 2^25 + 2^25 = 2^26Syphilology
If you are looking for hash collisions, I would suggest that you make the dict hash:string instead. Then when you try to insert a new pair, you can trivially see if the hash is already present, and retrieve the corresponding colliding string.Latchet
You are not supposed to use a database? Not even the anydbm module? Strange requirement, though I think there are better solutions anyway.Greenlee
FYI, Python's list.sort uses Timsort.Promethean
G
3

I'd go for something like this:

Open 16 files (opening them in binary mode should be fine; this will be easiest if all your strings have the same length). Generate your strings and hashes, and write them to a file depending on the first 4 bits of the hash. Then load and process each file separately. This will reduce memory usage by a factor of 16. (Of course you can use any number of files as long as you don't run out of file handles; having to open and close each file on every access would become rather slow.)
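A rough sketch of the file-based variant is below (Python 3 with hashlib rather than the question's PyCrypto; the fixed-width record format and bucket file names are my own choices, not code from this answer). One detail matters: the 4 partition bits should come from within the 50 bits actually being compared, otherwise two strings that collide on those 50 bits could land in different files and the collision would never be seen.

import hashlib

N = 2 ** 25
BITS = 50

# Phase 1: write each (string, truncated hash) record to one of 16 bucket files,
# chosen by the top 4 bits of the 50-bit truncated hash.
buckets = [open('bucket_%x.txt' % b, 'w') for b in range(16)]
for i in range(N):
    s = '%08d' % i                                   # fixed-width input string
    h = int(hashlib.sha256(s.encode()).hexdigest(), 16) & ((1 << BITS) - 1)
    buckets[h >> (BITS - 4)].write('%s %013x\n' % (s, h))
for f in buckets:
    f.close()

# Phase 2: each bucket is now small enough to check in memory with an ordinary dict.
for b in range(16):
    seen = {}                                        # truncated hash (hex) -> string
    with open('bucket_%x.txt' % b) as f:
        for line in f:
            s, hx = line.split()
            if hx in seen and seen[hx] != s:
                print('collision:', seen[hx], s, hx)
            seen[hx] = s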

If generating the strings and hashes is relatively inexpensive, you don't even need the files. Simply do 16 passes, and in each pass keep only those hashes whose upper nibble matches the pass number.
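The multi-pass variant might look roughly like this (again Python 3 with hashlib; the helper name last_bits and the constants are my own). Each pass keeps only the hashes whose top nibble, within the 50 compared bits, matches the pass number, so the dict holds about 1/16 of the data at a time.

import hashlib

N = 2 ** 25
BITS = 50

def last_bits(s, bits=BITS):
    # last `bits` bits of SHA-256(s) as an integer
    return int(hashlib.sha256(s.encode()).hexdigest(), 16) & ((1 << bits) - 1)

for nibble in range(16):                  # one pass per value of the top 4 of the 50 bits
    seen = {}                             # truncated hash -> string, for this pass only
    for i in range(N):
        s = str(i)
        h = last_bits(s)
        if h >> (BITS - 4) != nibble:     # skip hashes that belong to another pass
            continue
        if h in seen and seen[h] != s:
            print('collision:', seen[h], s, hex(h))
        seen[h] = s

The trade-off is that every string is hashed 16 times, which is why this only pays off when generating the hashes is cheap relative to memory.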

Greenlee answered 10/4, 2012 at 13:15 Comment(3)
each text file will be roughly 125 MB (2GB/16 files). Yes this sounds like a good approach. I will try this one out. No idea what a binary file is. Researching that too. Thanks!Antiseptic
@ritratt: "binary file" wasn't a particular good description. What I meant was "file opened in binary mode".Greenlee
I tried this. The memory issue was resolved, but I was unable to find collisions. No idea why :SAntiseptic
E
2

One way to solve the problem is to use a very long bitfield, so that every hash is mapped to a certain position in a 2^25-bit-long memory block.

A better, though not 100%-certain, way to solve this kind of problem is via a Bloom filter or another probabilistic data structure.

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set".

Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries.

A Bloom filter with 1% error requires only about 9.6 bits per element — regardless of the size of the elements.

So, at 9.6 bits per element, 2^25 elements will need just 38.4 MiB of memory.
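For illustration, here is a minimal, generic Bloom filter sketch in Python 3 (not any particular library's API; the class and method names are my own). It takes the usual shortcut of slicing the k bit positions out of a single SHA-256 digest of the item instead of computing k independent hashes.

import hashlib

class BloomFilter(object):
    def __init__(self, m_bits, k):
        self.m = m_bits
        self.k = k                               # k * 4 bytes must fit in one 32-byte digest
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        for j in range(self.k):
            yield int.from_bytes(digest[4 * j:4 * j + 4], 'big') % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# roughly the 1% false-positive regime quoted above: ~9.6 bits per element
bf = BloomFilter(m_bits=int(9.6 * 2 ** 25), k=7)
bf.add('hello')
print('hello' in bf)      # True
print('goodbye' in bf)    # False with high probability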

Electrolyse answered 10/4, 2012 at 13:28 Comment(8)
I don't understand this answer. How do you map a 256-bit hash to 2^25 bits in a way that lets you establish if there has been a hash collision, and more importantly, what strings caused this collision? I can't figure out how this is supposed to work, and I'm tempted to say it doesn't.Greenlee
@SvenMarnach, I think this could be done in two passes. There will be few collisions, so in the first pass, store only the hashes in a memory-efficient data structure, check for collisions, and store any offending strings. For any colliding pair of strings (a, b), this will give all values of b. Store these in a (relatively small) dictionary mapping hashes to strings. Then do a second pass through the strings, checking the dictionary for each one. Does that make sense to you?Bola
@SvenMarnach, both BasicWolf's suggestions would be probabilistic (the first is really a bloom filter using only one hash function), so some false positives would probably have to be weeded out, but that shouldn't be difficult.Bola
@senderle: Hmm, I see what you mean. Citing from the original post: "I have stored only the hashes and found a collision. But could not find the corresponding string since did not store it." So the OP already got this part, and the problem is to iterate the strings a second time (there are certainly ways to do this; random seeds etc.). This answer describes how to do what the OP already manages to do in a more memory-efficient way, but is somewhat incomplete in my opinion.Greenlee
@SvenMarnach, yes I think you're right about that -- and when I looked back over the question, I noticed the same passage. Seems like the OP is closer than he realizes.Bola
It was not memory efficient; it took 2GB of RAM.Antiseptic
@ritratt, the bloom filter should not have taken up that much space. How did you implement it?Bola
I used a normal loop. I have now tried using files; it uses 1GB of RAM. I might learn about Bloom filters and try with one too.Antiseptic
B
1

I think the key insight here -- which I admit evaded me for some time, till I came back to this a few hours later -- is that the sha256 hash digest is its own hash. In other words, you don't have to do any additional hashing or set creation. All you have to do is create a custom hash table, using the sha256 digest as the hash. To save space, don't store the strings; just create a bit-array (using shift operations on integers in an array of ints created with table = numpy.zeros(table_size / bits_per_int + 1, dtype='i')) to detect collisions, and then save colliding strings in a dict mapping hashes to strings for lookup in a second pass.

table_size should be a large prime -- I picked one slightly larger than 2 ** 31, which made for a 268MB table -- because that will produce few new collisions/false positives (introduced by the modulo operation on the hash). You can save the strings themselves to a text file, which you can iterate over.

So for any string, the index of the corresponding bit to set would be index = int(hashlib.sha256('foo').hexdigest(), base=16) % table_size. Then calculate the major_index = index / bits_in_int and minor_index = index % bits_in_int, use shift and bitwise operations on minor_index to check and store the correct bit in the int at table[major_index], and so on.
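To make that index arithmetic concrete, here is a minimal sketch (Python 3; the names TABLE_SIZE, trunc_hash and check_and_set are mine, not this answer's code). Two deliberate deviations from the formulas above: the digest is first reduced to its last 50 bits, since those are the bits the question compares, and the table uses uint32 rather than dtype='i' so the top bit of a word can be set without signed-overflow trouble.

import hashlib
import numpy as np

BITS = 50                        # the question compares only the last 50 bits of the digest
BITS_IN_INT = 32
TABLE_SIZE = 2 ** 31 - 1         # a known prime near 2**31; the answer used one slightly above 2**31
table = np.zeros(TABLE_SIZE // BITS_IN_INT + 1, dtype=np.uint32)   # ~268 MB bit table

def trunc_hash(s):
    # last BITS bits of SHA-256(s) as an integer
    return int(hashlib.sha256(s.encode()).hexdigest(), base=16) & ((1 << BITS) - 1)

def check_and_set(h):
    # Set the bit for truncated hash h; return True if it was already set.
    index = h % TABLE_SIZE                    # this modulo is what introduces false positives
    major_index = index // BITS_IN_INT        # which 32-bit word of the table
    minor_index = index % BITS_IN_INT         # which bit within that word
    mask = np.uint32(1 << minor_index)
    already_set = bool(table[major_index] & mask)
    table[major_index] |= mask
    return already_set

At roughly 2^31 bits, the table comes to about 268 MB, matching the size quoted above.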

Now do a pass through the strings. Whenever a string generates a hash that maps to a bit that has already been set, store a hash:string pair in a dictionary. Or better yet, store a hash:[string_list] pair, appending new strings to the list in the case of multiple collisions. Now for any colliding pair of strings (a, b), the dictionary will contain the hash and value of b.

Then do a second pass through the strings, hashing each in turn, and checking the dictionary for each hash. If the hash is in the dictionary, and the string is not already in the corresponding list, add the string to the list.

Some of the strings in the dictionary will not correspond to true collisions; the [string_list] for most of those hashes will be only one item long, and those hash:[string_list] pairs may be discarded. The others are probably true collisions resulting from the hash function itself, rather than from the modulo operation. However, you may still have a few false positives to weed out, in those cases where there was both a true and a false positive; you'll have to double-check the resulting lists for false positives.
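Continuing the sketch above (same caveats: hypothetical helper names, and my reading of the procedure rather than this answer's actual code), the two passes could be wired together like this:

from collections import defaultdict

def find_collisions(make_strings):
    # make_strings() must return a fresh iterable of the input strings on each call.
    candidates = defaultdict(list)

    # First pass: remember every string whose table bit was already set.
    for s in make_strings():
        h = trunc_hash(s)
        if check_and_set(h):
            candidates[h].append(s)

    # Second pass: collect every other string that maps to a candidate hash.
    for s in make_strings():
        h = trunc_hash(s)
        if h in candidates and s not in candidates[h]:
            candidates[h].append(s)

    # Lists still holding a single string were false positives from the modulo step.
    return {h: lst for h, lst in candidates.items() if len(lst) > 1}

# e.g. find_collisions(lambda: (str(i) + '\n' for i in range(2 ** 25)))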

BasicWolf's suggestion of using a Bloom filter is a good one, and might result in a smaller table. But it adds a lot of complication; I didn't bother. I tried the above method on newline-terminated strings from '0\n' to '33554431\n' and found two hashes with a 54-bit overlap. It took 11 minutes, and the maximum memory usage was about 350MB (though that could probably be reduced). I did some profiling and found that most of the time was spent calculating offsets for the bit-table. Coding this in C would probably provide a significant speed-up, and pre-hashing and storing the hashes as well as the strings would help too.

In fact, I tried pre-hashing the strings, and replaced my rather ad-hoc numpy-based bitarray with a bitarray from the C-based extension module of the same name. This reduced the runtime to just over 2 minutes, while keeping the memory profile around 350MB.

Close enough for government work, I think. Since this is an assignment, I won't post the code, but I'm happy to provide additional pointers.

Bola answered 10/4, 2012 at 19:47 Comment(1)
@ritratt, for what it's worth, I've managed to get fairly close to the numbers you gave (350MB, 2 min).Bola
O
0

Why don't you use a dictionary from last-50-bits-of-hash to string?
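For what it's worth, a minimal sketch of that idea (Python 3, hashlib); note that the dict can still end up holding up to 2^25 entries, which is exactly what strains the ~1GB budget in the question.

import hashlib

seen = {}                                    # last 50 bits of the hash -> string
for i in range(2 ** 25):
    s = str(i)
    h = int(hashlib.sha256(s.encode()).hexdigest(), 16) & ((1 << 50) - 1)
    if h in seen:
        print('collision:', seen[h], s, hex(h))
        break
    seen[h] = s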

Onetoone answered 10/4, 2012 at 13:11 Comment(0)
S
0

Split the hashes, for example, into groups of 10 characters, and nest the values. This way you will have a recursive search, but it should be faster.

Soapbox answered 10/4, 2012 at 13:12 Comment(2)
The OP stated "I am not supposed to use databases etc." I don't think this is a good approach anyway.Greenlee
Sorry, my mistake, replaced with new proposalSoapbox
