Programming Pearls: find one integer appears at least twice
Asked Answered
D

4

8

It's in the section 2.6 and problem 2, the original problem is like this:

"Given a sequential file containing 4,300,000,000 32-bit integers, how can you find one that appears at least twice?"

My question toward this exercise is that: what is the tricks of the above problem and what kind of general algorithm category this problem is in?

Diecious answered 2/4, 2011 at 6:56 Comment(1)
the solution given in the book is binary searchEyeless
L
10

Create a bit array of length 2^32 bits (initialize to zero), that would be about 512MB and will fit into RAM on any modern machine.

Start reading the file, int by int, check bit with the same index as the value of the int, if the bit is set you have found a duplicate, if it is zero, set to one and proceed with the next int from the file.

The trick is to find a suitable data structure and algorithm. In this case everything fits into RAM with a suitable data structure and a simple and efficient algorithm can be used.
If the numbers are int64 you need to find a suitable sorting strategy or make multiple passes, depending on how much additional storage you have available.

Loyce answered 2/4, 2011 at 7:6 Comment(3)
It should be noted this method also requires bit-level access to the data structure. A combination of bitwise operations(<<, &&, etc.) should do the trick. Other than this one tiny implementation detail, the method is pretty straightforward.Mciver
"will fit into RAM on any modern machine" Not at the time of publication of the book :) In general, that seems more like a discussion question, without a single best answer. (I didn't see the book, though) But this is sensible strategy today, so +1Underrate
This is a potential solution but the author in that section promotes us to think in a way were we don't have too much RAM and wants us to make use of binary search for the problem. Can someone comeup with a soln using B.Search.?Subtreasury
B
7

The Pigeonhole Principle -- If you have N pigeons in M pigeonholes, and N>M, there are at least 2 pigeons in a hole. The set of 32-bit integers are our 2^32 pigeonholes, the 4.3 billion numbers in our file are the pigeons. Since 4.3x10^9 > 2^32, we know there are duplicates.

You can apply this principle to test if a duplicate we're looking for is in a subset of the numbers at the cost of reading the whole file, without loading more than a little at a time into RAM-- just count the number of times you see a number in your test range, and compare to the total number of integers in that range. For example, to check for a duplicate between 1,000,000 and 2,000,000 inclusive:

int pigeons = 0;
int pigeonholes = 2000000 - 1000000 + 1; // include both fenceposts
for (each number N in file) {
  if ( N >= 1000000 && N <= 2000000 ) {
    pigeons++
  }
}
if (pigeons > pigeonholes) {
  // one of the duplicates is between 1,000,000 and 2,000,000
  // try again with a narrower range
} 

Picking how big of range(s) to check vs. how many times you want to read 16GB of data is up to you :)

As far as a general algorithm category goes, this is a combinatorics (math about counting) problem.

Bullhorn answered 3/4, 2011 at 11:41 Comment(0)
B
-1

If what do you mean is 32 bit positive integers, I think this problem doesn't require some special algorithm or trick to solve. Just a simple observation will lead to the intended solution.

My observation goes like this, the sequential file will contain only 32 bit integers (which is from 0 to 2 ^ 31 - 1). Assume you put all of them in that file uniquely, you will end up with 2 ^ 31 lines. You can see that if you put those positive integers once again, you will end up with 2 ^ 31 * 2 lines and it is smaller than 4,300,000,000.

Thus, the answer is the whole positive integers ranging from 0 to 2 ^ 31 - 1.

Banket answered 2/4, 2011 at 7:11 Comment(3)
1) That doesn't give you the number itself 2) 32-bit integer usually means 32 bits, no 31 bit.Underrate
1) Yeah, i know. 2) Well.., 32 bit integer is from 0 to 2 ^ 31 - 1, not from 0 to 2 ^ 32 or something. That's why there is an if at the beginning of my post. This solution works if what the writer mean is 32 signed positive integer, not unsigned.Banket
There is no such constraint on the data values - they are just 32 bit intsTerwilliger
C
-1

Sort the integers and loop through them to see if consecutive integers are duplicates. If you want to do this in memory, it requires 16GB memory that is possible with todays machines. If this is not possible, you could sort the numbers using mergesort and by store intermediate arrays to disk.

My first implementation attempt would be to use sort and uniq commands from unix.

Chaff answered 2/4, 2011 at 15:30 Comment(1)
this question is to test your constraints with limited resources. Saying your answer requires x GB of ram isn't in the spirit of the question.Swoon

© 2022 - 2024 — McMap. All rights reserved.