How to find identical byte[]-objects in two arrays concurrently?

I'm trying to implement a collision attack on hashes (I'm taking a course on cryptography). Therefore I have two arrays of hashes (byte sequences, i.e. byte[]) and want to find hashes which are present in both arrays. After some research and a lot of thinking, I am sure that the best solution on a single-core machine would be a HashSet (add all elements of the first array and check via contains whether elements of the second array are already present).

However, I want to implement a concurrent solution, since I have access to a machine with 8 cores and 12 GB RAM. The best solution I can think of is a concurrent hash set, which can be created via Collections.newSetFromMap(new ConcurrentHashMap<A,B>()). Using this data structure I could add all elements of the first array in parallel and - after all elements were added - concurrently check via contains for identical hashes.
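
For what it's worth, here is a minimal sketch of that idea (one caveat glossed over above: byte[] uses identity-based equals/hashCode, so the sketch wraps each hash in a ByteBuffer, which compares contents - the wrapping is my workaround, not part of the plan described above):

import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentSetSketch {
  public static void main ( String[] args ) {
    // A concurrent set of hashes; ByteBuffer.wrap gives content-based equality.
    Set<ByteBuffer> seen =
        Collections.newSetFromMap( new ConcurrentHashMap<ByteBuffer, Boolean>() );
    byte[][] first = { { 1, 2 }, { 3, 4 } };
    byte[][] second = { { 3, 4 }, { 5, 6 } };
    // Phase 1: add all hashes of the first array (this loop could be split across cores).
    for ( byte[] h : first )
      seen.add( ByteBuffer.wrap( h ) );
    // Phase 2: check all hashes of the second array (also parallelizable).
    for ( byte[] h : second )
      if ( seen.contains( ByteBuffer.wrap( h ) ) )
        System.out.println( "Collision found." );
  }
}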

So my question is: Do you know an algorithm designed for this exact problem? If not, do you have experience with such a concurrent hash set regarding pitfalls and effective runtime? Or can you recommend another prebuilt data structure which could help me?

PS: If anyone is interested in the details: I plan to use Skandium to parallelize my program.

Rave answered 2/1, 2012 at 12:58 Comment(3)
Are the arrays already sorted? If so, a one-pass merge-like function will find duplicates. Otherwise you could sort array1 and array2 in parallel and do the merge on the results. – Moppet
By byte hashes do you mean that all hashes are in the interval 0-255? – Hygrometer
I meant byte sequences, i.e. byte[]. They are the result of a hash function like SHA or MD5. No, the arrays aren't sorted. Sorting and merging them would need O(n log n) for sorting and O(n + m) for merging. I hoped for a higher efficiency. – Rave

I think it would be a complete waste of time to use any form of HashMap. I am guessing you are calculating multi-byte hashes of various data; these are already hashes, so there is no need to perform any more hashing on them.

Although you do not state it, I am guessing your hashes are byte sequences. Clearly either a trie or a DAWG (directed acyclic word graph) would be ideal for storing these.

I would therefore suggest you implement a trie/dawg and use it to store all of the hashes in the first array. You could then use all of your computing power in parallel to look up each element of your second array in this trie. No locks would be required.

Added

Here's a simple Dawg implementation I knocked together. It seems to work.

public class Dawg {
  // All my children.
  Dawg[] children = new Dawg[256];
  // Am I a leaf.
  boolean isLeaf = false;

  // Add a new word.
  public void add ( byte[] word ) {
    // Finds its location, growing as necessary.
    Dawg loc = find ( word, 0, true );
    loc.isLeaf = true;
  }

  // String form.
  public void add ( String word ) {
    add(word.getBytes());
  }

  // Returns true if word is in the dawg.
  public boolean contains ( byte [] word ) {
    // Finds its location, no growing allowed.
    Dawg d = find ( word, 0, false );
    return d != null && d.isLeaf; 
  }

  // String form.
  public boolean contains ( String word ) {
    return contains(word.getBytes());
  }

  // Find the Dawg - growing the tree as necessary if requested.
  private Dawg find ( byte [] word, int i, boolean grow ) {
    // Mask to 0-255: Java bytes are signed, so word[i] alone could be a negative index.
    int b = word[i] & 0xFF;
    Dawg child = children[b];
    if ( child == null ) {
      // Not present!
      if ( grow ) {
        // Grow the tree.
        child = new Dawg();
        children[b] = child;
      }
    }
    }
    // Found it?
    if ( child != null ) {
      // More to find?
      if ( i < word.length - 1 ) {
        child = child.find(word, i+1, grow);
      }
    }
    return child;
  }

  public static void main ( String[] args ) {
    Dawg d = new Dawg();
    d.add("H");
    d.add("Hello");
    d.add("World");
    d.add("Hell");
    System.out.println("Hello is "+(d.contains("Hello")?"in":"out"));
    System.out.println("World is "+(d.contains("World")?"in":"out"));
    System.out.println("Hell is "+(d.contains("Hell")?"in":"out"));
    System.out.println("Hal is "+(d.contains("Hal")?"in":"out"));
    System.out.println("Hel is "+(d.contains("Hel")?"in":"out"));
    System.out.println("H is "+(d.contains("H")?"in":"out"));
  }
}

Added

This could be a good start at a concurrent lock-free version. These things are notoriously difficult to test, so I cannot guarantee this will work, but to my mind it certainly should.

import java.util.concurrent.atomic.AtomicReferenceArray;


public class LFDawg {
  // All my children.
  AtomicReferenceArray<LFDawg> children = new AtomicReferenceArray<LFDawg> ( 256 );
  // Am I a leaf? Volatile so a leaf flag set by one thread is visible to others.
  volatile boolean isLeaf = false;

  // Add a new word.
  public void add ( byte[] word ) {
    // Finds its location, growing as necessary.
    LFDawg loc = find( word, 0, true );
    loc.isLeaf = true;
  }

  // String form.
  public void add ( String word ) {
    add( word.getBytes() );
  }

  // Returns true if word is in the dawg.
  public boolean contains ( byte[] word ) {
    // Finds its location, no growing allowed.
    LFDawg d = find( word, 0, false );
    return d != null && d.isLeaf;
  }

  // String form.
  public boolean contains ( String word ) {
    return contains( word.getBytes() );
  }

  // Find the Dawg - growing the tree as necessary if requested.
  private LFDawg find ( byte[] word, int i, boolean grow ) {
    // Mask to 0-255: Java bytes are signed, so word[i] alone could be a negative index.
    int b = word[i] & 0xFF;
    LFDawg child = children.get( b );
    if ( child == null ) {
      // Not present!
      if ( grow ) {
        // Grow the tree.
        child = new LFDawg();
        if ( !children.compareAndSet( b, null, child ) ) {
          // Someone else got there before me. Get the one they set.
          child = children.get( b );
        }
      }
    }
    // Found it?
    if ( child != null ) {
      // More to find?
      if ( i < word.length - 1 ) {
        child = child.find( word, i + 1, grow );
      }
    }
    return child;
  }

  public static void main ( String[] args ) {
    LFDawg d = new LFDawg();
    d.add( "H" );
    d.add( "Hello" );
    d.add( "World" );
    d.add( "Hell" );
    System.out.println( "Hello is " + ( d.contains( "Hello" ) ? "in" : "out" ) );
    System.out.println( "World is " + ( d.contains( "World" ) ? "in" : "out" ) );
    System.out.println( "Hell is " + ( d.contains( "Hell" ) ? "in" : "out" ) );
    System.out.println( "Hal is " + ( d.contains( "Hal" ) ? "in" : "out" ) );
    System.out.println( "Hel is " + ( d.contains( "Hel" ) ? "in" : "out" ) );
    System.out.println( "H is " + ( d.contains( "H" ) ? "in" : "out" ) );
  }
}
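
Here is a rough sketch of how the lock-free dawg could be filled and then queried from several threads (the fixed pool of 8 threads and the one-task-per-hash split are illustrative assumptions, not requirements):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LFDawgDemo {
  public static void main ( String[] args ) throws Exception {
    final LFDawg d = new LFDawg();
    final byte[][] first = { { 1, 2, 3 }, { 4, 5, 6 }, { 7, 8, 9 } };
    final byte[][] second = { { 4, 5, 6 }, { 9, 9, 9 } };
    ExecutorService pool = Executors.newFixedThreadPool( 8 );

    // Phase 1: add all hashes of the first array in parallel.
    List<Future<?>> adds = new ArrayList<Future<?>>();
    for ( final byte[] hash : first )
      adds.add( pool.submit( new Runnable() {
        public void run () { d.add( hash ); }
      } ) );
    for ( Future<?> f : adds )
      f.get(); // Wait until the dawg is fully built before any lookups start.

    // Phase 2: look up all hashes of the second array in parallel.
    for ( final byte[] hash : second )
      pool.submit( new Runnable() {
        public void run () {
          if ( d.contains( hash ) )
            System.out.println( "Collision found." );
        }
      } );
    pool.shutdown();
  }
}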
Altonaltona answered 2/1, 2012 at 14:2 Comment(12)
Yeah, you are right: I would hash hashes, which sounds awful. But I couldn't think of another way using prebuilt data structures. I thought about tries, too, but they have lookups in O(log n) rather than the O(1) a HashSet has - or am I wrong about that? Besides, if I could override the hash method of HashSet, I could put my data directly into it, preventing the hashing of hashes. (But I couldn't see how to do that in the JavaDoc of HashSet.) – Rave
@FlorianPilz the (worst case) access time of a trie is indeed O(log n), where n = number of "characters" in your "word". But since hashes all have the same length, this is irrelevant, as n is always the same. Also, keep in mind that O(1) is allowed to take longer than even O(e^n) for small n, and it's only the asymptote that is part of the O() notation. – Sillsby
@nd Thanks for your comment. If I understand you right, the trie would have O(1) best case and worst case, since the length of my words is constant. After some more reading I understand that HashMap and trie are comparable in speed (especially in this scenario), so Paul is right: a trie would be better, since I don't lose speed, but save memory and have a better worst-case runtime complexity. If I got it right, this solution yields a guaranteed O(2n) runtime complexity, if my arrays contain n hashes. Correct? – Rave
@Florian - Yes and no. Once the dawg is built, your maximum search depth for each key is the width of the hash - 1 (because if you hit the hash length you've found a match). n does not enter into the equation at all, which makes searching essentially O(1). – Altonaltona
@Paul Thanks for the added source code. However, I'm looking for a DAWG/trie that can be filled and read in parallel. This way I can use the 8 cores when adding all hashes of the first array. I hope such a concurrent DAWG/trie implementation is out there. But I'm still not sure if this is the best solution. Maybe there is an idea like dynamic programming to lower the complexity by an order of magnitude. – Rave
@Paul That's what I said. ;) The DAWG/trie has O(1) for every operation. The runtime complexity of O(2n) was meant for the whole problem (adding n elements of the first array and calling contains n times to check every element of the second array). – Rave
You can certainly read this in parallel. Making it fillable in parallel may be a little more complex, but certainly doable. You would need an array of AtomicReference<Dawg>s instead of just plain Dawgs, and some thought on growing it in parallel would be necessary. If you would like help with this, post another question. – Altonaltona
A short search reveals that there is no concurrent implementation of DAWGs or tries (or they aren't popular enough to show up on Google). So I will have to build it myself - in that case your code will be of great help. I think I will post a follow-up question in a few days. Thanks a lot for your help! – Rave
@Florian - Please post a comment here when you post your follow-up. I would like to see how it all works out. - Glad to be of help. – Altonaltona
Using word.getBytes() for the string form depends on the default encoding, which means that if the default encoding changes during the life of the JVM, any existing dawgs may no longer respond with true to contains(myStr) if add(myStr) was done before the encoding change. I suggest mandating an encoding, as in word.getBytes("UTF-8"), or leaving string handling out entirely. – Garay
Thanks Mike - the String versions of add and contains are more utility functions than core functions. They are here for ease of testing. The byte versions will be used in almost all cases, so I would leave String handling out in the final version. Thanks for the heads-up. – Altonaltona
Yo dawg, we heard you like recursive data structures, so we put a dawg in your dawg. – Elwin

A simpler approach would be to just split the first array into N equal (or near-equal) parts (with 8 cores, N = 8 seems reasonable). Then solve the problem in the "normal" way, by checking whether any hashes in the 2nd array are present in the N smaller sub-arrays of the first. This can be done in parallel.

That said, I've never heard of tries/dawgs before and I found the main discussion fascinating and informative. (I mainly work with numbers, not words)

This assumes that the byte[] hashes are of some finite, shortish length so that you really can split up the original file to process in parallel. Is that the case?

EDIT ADDED

For an example of this idea, see GPU Computing Gems, edited by Wen-mei W. Hwu, chapter 11, an article by Ligowski, Rudnicki, Liu and Schmidt. They parallelize a massive protein sequence database search by splitting the huge single database into many smaller pieces, then run the normal algorithm on each sub-piece. I like this quote: "The described algorithm is embarrassingly parallel". In their case they used CUDA and had to do a lot of memory optimization, but the principle should still apply to multi-core machines.

semi-PSEUDOCODE FOLLOWS. I'll use Lists for the incoming byte[] hashes, hope that's o.k.

Original, 1 core method

// Semi-pseudocode caveat: byte[] has identity-based equals/hashCode, so in real
// code the hashes would need a content-comparing wrapper (e.g. ByteBuffer).
originalProcess(List<byte[]> list1, List<byte[]> list2) {
   HashSet<byte[]> bigHugeHashOfList1 = new HashSet<byte[]>();
   bigHugeHashOfList1.addAll(list1);
   for (byte[] hash : list2)
      if (bigHugeHashOfList1.contains(hash))
         // do something
}

New method. Uses exact same process method (later on). No DAWGS or TRIES here...

preprocess(List<byte[]> list1, List<byte[]> list2) {
   // Raw-typed array: Java forbids generic array creation (new ArrayList<byte[]>[8] won't compile).
   List<byte[]>[] splitLists = new List[8];
   for (int i=0; i<8; i++)
      splitLists[i] = new ArrayList<byte[]>();
   for (byte[] hash : list1) {
      int idx = hash[0]&7; // I'm taking the 3 low order bits, YMMV
      splitLists[idx].add(hash);
      // a minor speedup would be to create the HashSet here instead of in originalProcess()
   }

   // now, using your favorite parallel/concurrency technique,
   // do the equivalent of
   for (int i=0; i<8; i++)
      originalProcess(splitLists[i], list2);
}    
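
One possible way to fill in the "favorite parallel/concurrency technique" part above (a sketch only: the fixed pool of 8 threads is an assumption, and it presumes originalProcess is in scope in the same class):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Run originalProcess on each of the 8 sub-lists concurrently.
void processInParallel(final List<byte[]>[] splitLists, final List<byte[]> list2)
      throws InterruptedException {
   ExecutorService pool = Executors.newFixedThreadPool(8);
   for (int i = 0; i < 8; i++) {
      final int idx = i;
      pool.submit(new Runnable() {
         public void run() { originalProcess(splitLists[idx], list2); }
      });
   }
   pool.shutdown();
   pool.awaitTermination(1, TimeUnit.HOURS); // wait for all 8 tasks to finish
}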
Tilla answered 2/1, 2012 at 16:33 Comment(9)
Your approach is possible and simpler, but less efficient. Testing whether an element is inside an array of length n costs up to O(n), because you have to iterate through the array. HashMaps and tries perform lookups in O(1), which is way faster. (Sidenote: tries can normally have a lookup time of O(m), where m is the length of the word. In this special case all words have the same (constant) length, so it has no effect on the big-O notation.) – Rave
You could still use a HashMap for the N smaller sub-problems, just like your original single-core solution. That's what I meant by the "normal" way. An advantage is that they need not be concurrent. – Tilla
You could split across 8 cores by taking the first 3 bits of the hash as a discriminator. This would be an excellent first step. – Altonaltona
Yeah, a hash would be better. But now it's the exact same solution I proposed with the ConcurrentHashSet. @Paul You are right, that would be an excellent first step, and I thought about it. I'm not sure if that improves the speed of the algorithm, but it requires some clever splitting algorithm to divide them by the first x bits (remember: the arrays aren't sorted). Maybe you could elaborate on that idea. :) – Rave
@Florian Yes, it is the exact same solution you proposed with a HashSet. That's the point. You have parallelized "strategically", not "tactically". Please see the edits in my answer for more info. – Tilla
I don't get your 'strategically' vs. 'tactically' statement. And the algorithm is the same for both ideas, HashSets and DAWGs/tries, since I only use add and contains. Besides, I don't see why you post an answer that is identical to a solution I wrote in the question. I'm sure you have something in mind, but I still can't see your point - please elaborate. PS: Could you add a link to the paper you are referring to? (Write [Link](Linklocation).) – Rave
@Florian See the added pseudocode. I have a physical copy of the book with the article; I don't think there is an online version, but try some Googling. – Tilla
OK, I understood you correctly. :) Yes, your solution is simpler and as fast as the DAWG/trie variant. Because I think the memory efficiency of a DAWG/trie is worth the effort of building it, I will implement the solution Paul proposed rather than yours. Thank you for taking the time to develop another idea and discuss it with me. – Rave
Yes, my solution is a memory hog. Let us know how the trie/dawg works out. How about an upvote? – Tilla
