Generate an integer that is not among four billion given ones

38

724

I have been given this interview question:

Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.

My analysis:

The size of the file is 4 × 10^9 × 4 bytes = 16 GB.

We can do external sorting, thus letting us know the range of the integers.

My question is what is the best way to detect the missing integer in the sorted big integer sets?

My understanding (after reading all the answers):

Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4 × 10^9 distinct integers.

Case 1: we have 1 GB = 10^9 bytes × 8 bits/byte = 8 billion bits of memory.

Solution:

If we use one bit to represent each distinct integer, that is enough; we don't need to sort.

Implementation:

int radix = 8; // bits per byte
byte[] bitfield = new byte[(int) ((1L << 32) / radix)]; // 2^32 bits = 512 MB

void findMissing() throws FileNotFoundException {
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        long n = in.nextInt() & 0xFFFFFFFFL; // treat as unsigned 32-bit
        bitfield[(int) (n / radix)] |= 1 << (int) (n % radix);
    }

    for (int i = 0; i < bitfield.length; i++) {
        for (int j = 0; j < radix; j++) {
            if ((bitfield[i] & (1 << j)) == 0)
                System.out.println((long) i * radix + j); // a missing integer
        }
    }
}

Case 2: 10 MB memory = 10 × 10^6 bytes × 8 bits/byte = 80 million bits

Solution:

There are 2^16 = 65536 possible 16-bit prefixes, so we need to build 65536 buckets. For each bucket we need a 4-byte counter, because in the worst case all 4 billion integers belong to the same bucket; that is 2^16 × 4 × 8 = 2 million bits in total.

  1. Build the counter for each bucket in a first pass through the file.
  2. Scan the counters and find the first bucket with fewer than 65536 hits.
  3. In a second pass through the file, build new buckets (over the low 16 bits) for the numbers whose high 16-bit prefix is the one found in step 2.
  4. Scan the buckets built in step 3 and find the first one without a hit.

The code is very similar to the one above.

Conclusion: we trade memory for additional passes over the file.
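A minimal sketch of Case 2's two passes (an in-memory array stands in for the file, so each loop over it is one "pass"; counter and bucket sizes assume unsigned 32-bit values, and the class and method names are ours):

```java
public class FindMissing {
    // Two-pass search for a 32-bit value (0..2^32-1) absent from `values`.
    static long findMissing(long[] values) {
        // Pass 1: count hits per 16-bit high prefix. 2^16 int counters = 256 KB.
        int[] counts = new int[1 << 16];
        for (long n : values) counts[(int) (n >>> 16)]++;

        // Pigeonhole: with fewer than 2^32 inputs, some bucket has < 2^16 hits.
        int prefix = 0;
        while (counts[prefix] >= (1 << 16)) prefix++;

        // Pass 2: mark which low 16-bit halves occur inside that bucket.
        // (A boolean[] for clarity; a packed bit vector would use 8 KB.)
        boolean[] seen = new boolean[1 << 16];
        for (long n : values) {
            if ((int) (n >>> 16) == prefix) seen[(int) (n & 0xFFFF)] = true;
        }
        for (int low = 0; low < (1 << 16); low++) {
            if (!seen[low]) return ((long) prefix << 16) | low;
        }
        throw new IllegalStateException("unreachable by the pigeonhole principle");
    }
}
```

With real input the two `for (long n : values)` loops would each be a fresh Scanner pass over the file, exactly as in the Case 1 code.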


A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file—at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.

Gravelly answered 22/8, 2011 at 21:11 Comment(39)
Well, I think sorting is the best way to do it. I'm no expert, though. Sort and keep track of the last number put into the file. If anything is missing, you'll know. (ie you put in 101 then 103 pops up, so you know that the range 102-102 is missing)Scapolite
Are the integers in the file all unique? If so you can do it in O(n) time for the 10 MBVentura
@trashgod, wrong. For 4294967295 unique integers you'll have 1 integer remaining. To find it, you should summ all integers and substract it from precalculated summ of all possible integers.Larocca
This is the second "pearl" from "Programming Pearls", and I would suggest you read the whole discussion in the book. See books.google.com/…Sailer
@Nakilon: Yes, I was off by 1; thanks. For a file containing 4294967296 unique 32-bit integers, the algorithm is _O_(1): return false.Bromberg
Coming from Superuser and not the best programmer (Yet!) - I would say do it how you would normally without the memory constraints and let the Pagefile pick up the slack!Mel
@William Yeah... that's not going to work. The OS will crash your application long before you finish the algorithm, and your application will have caused the whole machine to get bogged down really fast. You're better off pretending the pagefile doesn't exist.Robertaroberto
@nakilon Are you serious? You have four billion integers. Even if all of them were under one million, what type of storage would you use to get the sum of a million integers? Really bad idea.Lucho
@Richard, my storage would be 1 Integer variable.Larocca
@Larocca Summation of numbers 1 to 1 million = 500,000,500,000. That's not a 32-bit integer. And what about your numbers being 1 to 4 billion? Thinking it over, it could work, but the storage would not at all be "1 integer variable".Lucho
@Lucho a 64 bit int would be more than large enough.Bedcover
the answer to this qn can be found on page 202 of Gayle Laakman's Cracking the coding interview book. not sure if i can post the solution though. but anyway, im sure there are far better answers out there...Rearmost
int getMissingNumber(File inputFile) { return 4; } (reference)Anther
@Richard, do you ever heard about overflow in digital arithmetics?Larocca
This discussion has been moved here. @LaroccaLucho
@cftarnas, a 32 bit int would be enough.Larocca
@Larocca How do you propose storing the sum of (almost) all numbers between 0 and 2^32 - 1 in an integer that can only store values up to 2^32 - 1? You need a 64 bit integer (or two 32 bit integers) to store the result.Fettling
@Larocca In fact, a 64 bit integer is asymptotically barely enough to store the result. Squaring a number will double the number of digits required to safely represent it (in any base). It's really the 1/2 constant factor in n(n-1)/2 that saves us from having to think too hard about it.Fettling
It doesn't matter that you can't store the sum of all integers from 1 to 2^32 because the integer type in languages like C/C++ ALWAYS preserves properties like associativity and commutativity. What this means is that although the sum won't be the right answer, if you calculate the expected sum with overflow, the actual sum with overflow, and then subtract, the result will still be correct (provided it itself does not overflow).Threadfin
@thedayturns, haha, they were faster than you and there are more of them, than us. They already rage and went into my profile to minus my questions and answers just because of feeling, that they don't understand something, like modular arithmetic. It's funny. They made my day )Larocca
@Larocca - your answer was my first thought, and modular arithmetic certainly works in the 2^32-1 integers case, but I'd imagine it would fail if there are "4 billion" integers. I.e. there are actually 294967296 missing integers out of the complete set.Bohlin
@James, ofcourse! our thread in comments were about trashgod's 4294967295 unique integers. Sad, he deleted his first thread-starting comment.Larocca
The size of the file is 4×10^9×4 bytes = 16 GB. I didn't agree. The file may contain text information (numbers not as binary data, but as a text representation of them). Also nothing is said about whether we can modify the input file or not.Ammo
@Threadfin OK yes, that's a good point. But it restricts you to the case when there's a single missing integer; the method with a 64-bit sum can find one missing integer out of many. If you want to do that with a 32-bit sum then you have to check partitions of the difference.Fettling
@Brian, in that particular case (where you're guaranteed to have all 32-bit ints except one, and without duplication), doing everything modulo 2^32 is sufficient.Tsai
@Henning That's what I admitted toFettling
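A sketch of the modular-sum trick debated in this comment thread (valid only when exactly one value in the range is missing and there are no duplicates; the generic range parameters are ours, added so the idea can be demonstrated at small scale):

```java
public class ModularSum {
    // Finds the single missing value when `values` contains every integer
    // in [lo, hi) exactly once except one. Long addition wraps mod 2^64,
    // but the wrapped difference is still exact -- the point made above
    // about overflow being harmless under associativity/commutativity.
    static long findMissing(long[] values, long lo, long hi) {
        long expected = 0;
        for (long k = lo; k < hi; k++) expected += k; // wraps consistently
        long actual = 0;
        for (long v : values) actual += v;            // wraps consistently
        return expected - actual;
    }
}
```

For the full 32-bit case one would replace the first loop with the closed form n(n-1)/2 mentioned in the comments, computed with the same wrapping arithmetic.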
@SecureFish: Case 2 changes from the question to the solution. Should it be 1 MB or 10 MB?Woodcraft
The quoted question isn't very precise on the definition of integer. If you were to take it as "whole number" rather than a more technical definition including explicit bounds (32 bits/64 bits etc.). You could just loop through the file once,find the biggest number and add oneIngunna
@Nakilon, your solution may work in the case where only one integer is missing, but I'm not sure that is assumed in the question, it may be more than one integer.Grison
<<For each bucket, we need 4 bytes holding all possibilities because the worst case is all the 4 billion integers belong to the same bucket.>> If each bucket represents all integers with a given 16-bit prefix, then as stated, there are 65536 such integers. How could all 4 billion integers belong to the same bucket? Couldn't we save ourselves space by allocating 2^16 = 65536 = 2 bytes per bucket instead of 4 bytes per bucket? Also regarding Case 2, for the second pass through the file, I presume it's best to simply use a 65536 bit vector analogous to Case 1's 1 integer:1 bit mapping bit vector?Izabel
The salient point is that it's an interview question. If a decent employer gives you a question like this, they expect you to check some assumptions first. You should respond by asking things like "are we talking about mathematical integers or those restricted to a fixed domain? Are we talking about 32-bit quantities?" and whether your answer needs to fall inside the same domain.Scincoid
A modified bucket approach which record minimum and maximum for each bucket will give you lots of numbers to select from. You can also calculate the average for each bucket in the first pass, and in the second pass, count the numbers below the average, and above average, and the standard deviation if you likeGeorgettegeorgi
2^32 = 4*10^9 ??? howKatrinka
Correct me if I am wrong. The solution for the 2nd case is wrong. Suppose we have all 1s for the first bucket and all 65536+10 for the 2nd bucket, all 2*65536+10 for the 3rd bucket,..... all 65535*65536+10 for the last bucket. Then after the first pass, we will find all buckets have 65536 hits and then we will conclude no number is missing, which is wrong!Treed
In case 1, it is assumed that all integers are read into bitvector at once. Given the size of integers in windows, wouldn't the file size be greater than 1gb which itself is greater than the available memory?Grindelia
As a working programmer with nearly 35 years of experience these "coding interview questions" crack me up. If I was asked a question like this I'd return fire with, "Why are you wasting time with a cop-out question like this? Is this in any way indicative of what developers here do on a daily basis? What is it you're hoping to learn? Will this tell you if I show up on time, if I work a full day, if I'm willing to work OT on occasion to get things wrapped, or if I work and play well with others? No. This is a geek's cop-out. Call when you're ready to conduct a professional interview. Bye".Abloom
I don't think the problem is formulated correctly / unambiguously. Even in the book. Are the numbers unique or not???Roseola
Could anyone explain me the purpose of using a radix here?Randee
Question is from "Cracking the Coding Interview", exercise 10.7.Mature
547

Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes, in one pass through the input file. At least one of the buckets will have been hit fewer than 2^16 times. Do a second pass to find which of the possible numbers in that bucket are already used.

If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.

If "integer" means mathematical integer: Read through the input once and keep track of the length of the longest number you've ever seen. When you're done, output a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it.)
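The "mathematical integer" variant can be sketched like so (tokens stand in for the file's contents, and the helper name is ours; signs and leading zeros are ignored for simplicity):

```java
public class LongerNumber {
    // Returns a decimal numeral strictly longer than every input token,
    // hence a number not equal to any of them. Only the maximum length is
    // tracked, so memory stays tiny no matter how big the numbers are.
    static String longerThanAll(String... tokens) {
        int maxLen = 0;
        for (String t : tokens) maxLen = Math.max(maxLen, t.length());
        // Any (maxLen + 1)-digit number works; use "1" followed by zeros.
        StringBuilder sb = new StringBuilder("1");
        for (int i = 0; i < maxLen; i++) sb.append('0');
        return sb.toString();
    }
}
```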

Tsai answered 22/8, 2011 at 21:28 Comment(44)
Perfect. Your first answer requires only 2 passes through the file!Blueprint
I'm pretty sure that "integer" means 32-bit int, otherwise, you could just track the maximum value and return max + 1. Also, the OP's analysis states that integers are 32 bits. But maybe he misunderstood the interview question.Argyrol
A 10 MB bignum? That's pretty extreme.Candidacandidacy
@Henning: In the second part, what do you mean with ignoring numbers outside the 32-bit range?Mattoid
Good answer. Since the output format wasn't specified, for the "arbitrary integers in some bignum encoding" case, I was just going to suggest printf("googolplex\n"), which is not really an algorithm, but works :)Telemetry
@Legate, just skip overlarge numbers and do nothing about them. Since you're not going to output an overlarge number anyway, there's no need to keep track of which of them you've seen.Tsai
The good thing about Solution 1, is that you can decrease the memory by increasing passes.Pedroza
I am a bit confused at solution 1. Are you saying ignore the high 16bits and count how many times you see the low 16bits? that would require (2^16(bytes for low 16bits) * 3(bytes required to count)) which is 192k so that works but i just wanted to make sure i didnt get it vastly confused.Shinberg
@acidzombie: Actually I was saying ignore the low 16 bits and count the number of times you see each possibility for the high 16 bits. But doing it the other way around works equally well.Tsai
Nice! This is clearly the correct solution to the interviewer's question. I feel like the question was trying to get you to come up with a solution which holds up even when you keep going lower: "What about 10MB? What about 1MB? What about just static constants and two int32 pointers I give you?" I think your answer holds up even under the last constraint.Fettling
Using this technique, I need only 2^16 bits. That's for 2^16 buckets and one bit per bucket. Since we know exactly one number is missing, one of the buckets will have an odd number of hits; the others will have an even number. Initialize the bit array to all zeros. For every hit, toggle the corresponding bit. When you are done, look for the 1 and then scan through that bucket (reusing the bit array) to find the missing number.Carpentry
@Barry: The question above does not indicate that there is exactly one number missing. It doesn't say the numbers in the file don't repeat, either. (Following the question actually asked is probably a good idea in an interview, right? ;-))Barnhill
I'm probably missing something here, but how can solution number one work unless we know that there are no duplicates in the file, and I don't see that in the question?Kirwin
@Thomas, I don't see any reason why having duplicates would be a problem for the solution. Notice that the question does not ask for us to identify all missing numbers, just to find at least one of them. Could you please elaborate?Tsai
@Henning: Assume the file contains 65536 instances of each 32-bit multiple of 65536, that is, the numbers with 16 zeros as suffix. Each counter will then show 65536. -- But wait. I think I see what I missed. That would mean there are 4294967296 numbers in the file, and it was only exactly 4 billion?Kirwin
@Christopher I see now. I based my solution on some of the other questions/comments in the threads here, which distracted me from the wording of the original question.Carpentry
I think I may be missing something. Can someone explain to me how to keep count? Consider the scenario where every single number in the file is n. If we are keeping count, wouldn't there be 4 billion n's requiring us to allocate 2^32 (overhead to represent 4b,max count)* 2^16(number of Hi-bit permutations) == total 2^48 bits == 32768GB? what am i missing?Timid
@mrBorna, even a counter that can reach a value of 4 billion takes up only 32 bits, not a bit for each value it can reach. And you don't even have to keep precise track -- once a bucket counter reaches ceil(4.0e9/65536)=61036 you can stop increasing it because then it cannot possibly be the counter with the least value when the first pass is over. Thus you really only need a 16-bit counter for each bucket.Tsai
@Henning Ok... So based on your clarification (and thank you for that) , each bucket (of the 2^16 buckets) has a 2^16 bit counter. Thats a total of 512MB in memory total for a 1-run count (though less if we split it up into multiple runs). So this is not a 2-run solution for 10MB memory, correct?Timid
@mrBorna, wrong! Each bucket has a two-byte counter (sixteen bits, capable of counting up to 65535), making the entire table take up 128 kilobytes.Tsai
Gah I messed up it totally makes sense. I don't know why I kept saying 2^16 when I meant 16 bits for the counter lol. thanks a lot @Henning!Timid
@HenningMakholm, why the entire table is 128 kb? What I think is that, there are 2^16 buckets which will take up 2^16/8=8192b, and there a 2-byte counter for each bucket, so 2*2^16=131072b is taken. Then I think 8192+131072=139.3kb will be taken, right?Bloodred
@HenningMakholm, the second thing is, if there are duplicates, say there are 65535 duplicates for a bucket, no other numbers for this bucket, will the solution still detect the missing numbers?Bloodred
@Alcott: (1) it is not clear to me what you need the 2^16 single bits (8192 bytes) you speak about for. The counters themselves should be enough. (2) The algorithm does not purport to find all missing numbers. All it needs to do is locate one among the 32-bit numbers that are not in the file. A bucket that is hit by 65535 duplicates of a single number will just be ignored in favor of a less-hit bucket.Tsai
I don't see how the second pass would guarantee it would yield a new number. Each slot keeps track of the number of integers in the file with the 16-bit prefix,I see in the second pass if the counter in a slot is either 0 or 1, when we map to it, we can find a new number by just adding 1 to the prefix or adding 1 to the current integer in the file under the scanning pointer. However, when the counter is greater 1, how does it yield a new number in the 2 second pass just by looking at the 16 bit prefix counter? Someone sheds some light please.Nipha
@CongHui: in the second pass, we only consider those numbers with the specific 16-bit prefix we identified in the first pass, which we already know that there must be less than 2^16 such numbers. Then we record the count of each 16-bit suffix for numbers with the specified 16-bit prefix. There must be a 16-bit suffix which is not in the file. So the number formed by the 16-bit prefix and 16-bit suffix is not in the file, we can output that.Airel
@Harshdeep, what we meant by a "pass" here is the process of reading the numbers in the input file, not the memory. Of course after the second pass (that is, the second time reading the input), we need to loop through the memory to find which 16-bit suffix has not been set.Airel
Wonderful answer! And it scales very well to larger fixed size numbers, such as 4 passes for 64-bit integers, etc.Maryn
Or the length of the length of the longest number! Or the length of that! Or the length of that! Or the length of the number of times you took the length of the longest number! (Or the length of the Knuth up-arrow notation of the longest number! Or ...)Daubigny
The solution for 2nd case is wrong. Just suppose the following case: 1st bucket: all 1s.Treed
@deryk: Is "1st bucket: all 1s" meant to be a description of a counterexample? I don't understand exactly what counterexample that would be; could you perhaps attempt to extend your description to one or more complete sentences?Tsai
@Henning Makholm. There must be something wrong of this website. I wrote a lot and only a few left. Anyway, my point is this: if each bucket only has duplicates of a single number, the first pass will tell you no number is missing which wrong! For example, 1st buckets has 65536 of 10s, 2nd bucket has 65536 of 65536+10, the 3rd bucket has 65536 2*65536+10, ..., the last bucket has 65536 of 65535*65536+10. The first pass will tell you each bucket counts 65536 times and so will report no number is missing. Please correct me if this counterexample is wrong.Treed
@deryk: 65536 buckets with 65536 copies in each would require the input file to contain 65536*65536 = 4294967296 entires -- and it is said explicitly in the original interview question that there are only 4 billion numbers there, which is less than that. (Reasonable minds disagree on whether e.g. "gigabyte" ought to mean 1024^3 or 10^9 bytes, but "billion" is unambiguously 10^9 in English).Tsai
@Henning Makholm, what do you mean "65536 buckets with 65536 copies in each"? Does that mean each bucket has 65536 bits? or in other words, do you allocate a 2-d array of buckets[65536][65536/8]? if so, the total size is way larger than 10MB.Treed
@deryk: No, I'm just saying that your claimed counterexample seems to describe an input file with 65536*65536 integers in it. That doesn't match the description in the question, which is about a file with only 4 billion integers in it. (In the solution I describe each of the 65536 buckets is represented by one 16-bit counter).Tsai
@Henning Makholm, Or are you saying you read in the file 65536 times and each time you only count the numbers in [k*65536, k*65536+65535]. Am my understanding correct?Treed
@Henning Makholm, a little confused. Any differences between "65536*65536 integers in a file" and "4 billion integers in a file"? 65536*65536 is about 4 billion.Treed
@deryk: (1) I read the file once and count how many numbers there are in each length-65536 interval. Then I choose the interval that was hit the fewest times and read the file once again, now keeping track of which of those particular 65536 numbers were hit. (2) Yes, the difference is 65536*65536-4000000000 == 294967296. Your file has about 295 million more numbers in it than the 4 billion it's supposed to have.Tsai
@Henning Makholm: OK, my question is back. For each 16-bit counter, if duplicates exist, do you count them once or multiple times? For example, for the 1st counter, you see 1,1,1, do you count once or three times?Treed
@deryk: You count the number of lines in the file that contain a number in the interval covered by the bucket. It is immaterial whether these numbers are the same or different (since you can't let anything depend on whether there are duplicates without using too much memory).Tsai
@Henning Makholm: now I understand your solution completely. Since the total integers in file is 4 billion which is less than 2^32, for 65536 buckets, there will always be at least a bucket that will be hit less than 65536 times. Good solution! I dig it a little bit more and found It is called pigeonhole principle.Treed
for unbounded: one pass, remove delimiters (everything that is not a digit).Malan
If the integers are unbounded, it's not always possible to store a finite integer larger than all of them (or equivalently, their lengths) in a finite amount of space.Aldora
Could you pls elaborate with an example with lesser size, ex 4 bit integers? unable to clearly understand, my calculations are off. @BlueprintEcumenicism
206

Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.

If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.

If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
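A sketch of the candidate-elimination pass (the class, method, and parameter names are ours; a fixed seed is used only to make the sketch reproducible, and an array stands in for the file):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class RandomCandidate {
    // One pass: draw k distinct random 32-bit candidates (kept as longs in
    // 0..2^32-1), delete every candidate that appears in the input, and
    // return any survivor. With k = 1000 a survivor is all but certain;
    // null signals the improbable case that another batch is needed.
    static Long findLikelyMissing(long[] values, int k, long seed) {
        Random rng = new Random(seed);
        Set<Long> candidates = new HashSet<>();
        while (candidates.size() < k) {
            candidates.add(rng.nextLong() & 0xFFFFFFFFL); // uniform in [0, 2^32)
        }
        for (long n : values) candidates.remove(n); // the single pass
        return candidates.isEmpty() ? null : candidates.iterator().next();
    }
}
```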

Heartfelt answered 23/8, 2011 at 5:4 Comment(12)
I'm pretty sure the integers are bounded. If they weren't, then even a beginner programmer would think of the algorithm "take one pass through the data to find the maximum number, and add 1 to it"Ca
Literally guessing a random output probably won't get you many points on an interviewFettling
@Adrian that was my solution ;) requires one pass through the file and meets the requirements.Heredia
@Adrian, your solution seems obvious (and it was to me, I used it in my own answer) but it's not obvious to everybody. It's a good test to see if you can spot obvious solutions or if you're going to over-complicate everything you touch.Candidacandidacy
As the number of ints in the file increases, the feasibility of the pseudorandom approach decreases. If the file contains, say, (2^32 - 5) ints, trying to find one of the handful of remaining ints randomly is impractical.Kung
@Brian: I think this solution is both imaginitive and practical. I for one would give much kudos for this answer.Anabelle
Great answer, but... Roar. I'm the worst case monster -- annoying but vaguely relevant. Worst case: O(N^2), assuming you check for doubles in your random thing. Still, upboats.Eliza
@brian, If even a one in 64 billion billion billion chance is too high for you, just "guess" your number and then make one pass through to verify it doesn't exist.Maltzman
ah here lies the line between engineers and scientists. Great answer Ben!Dorton
I've got to disagree with this answer, as well as with the comments on its practicality. I have seen on multiple occasions disastrous effects of allegedly statistically improbably collisions. One of the reasons is that the probability of collision within K numbers randomly chosen out of N possibilities, with K << N, is roughly proportional to K^2/N, NOT K/N, see Birthday Paradox. Second reason is in pseudo-randomness shortcomings. You single-pass check definitely helps, but when K is close to sqrt(N) you'd need numerous guesses and passes.Maryn
@Maryn you are right that pseudo-randomness can have shortcomings. In this case the birthday paradox does not apply. But your more general point that extra assumptions lead to extra failure modes is well taken. My answer has value for those situations where the speed improvement is worth the extra implementation and debugging work.Heartfelt
The birthday paradox is irrelevant here because you are checking whether any of your random numbers match the ones in the file, not whether any two of your random numbers match each other. Also, for the second algorithm, if none of the 1,000 numbers generated are unique, then you can just generate another 1,000 and try again. The expected number of passes is imperceptibly greater than 1.Buttonwood
147

A detailed discussion of this problem can be found in Jon Bentley, "Column 1. Cracking the Oyster", Programming Pearls, Addison-Wesley, pp. 3-10.

Bentley discusses several approaches, including external sort, merge sort using several external files, etc., but the best method Bentley suggests is a single-pass algorithm using bit fields, which he humorously calls "Wonder Sort" :) Coming to the problem, 4 billion numbers can be represented in:

4 billion bits = (4000000000 / 8) bytes = about 0.466 GB

The code to implement the bitset is simple (taken from the solutions page):

#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];

void set(int i) {        a[i>>SHIFT] |=  (1<<(i & MASK)); }
void clr(int i) {        a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int  test(int i){ return a[i>>SHIFT] &   (1<<(i & MASK)); }

Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array and then examines this array using test macro above to find the missing number.

If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on the available memory. To take a very simple example, if only 1 byte (i.e. memory to handle 8 numbers) were available and the range were from 0 to 31, we would divide this into the ranges 0-7, 8-15, 16-23, and 24-31, and handle one range in each of 32/8 = 4 passes.
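The k-pass scheme can be sketched as follows (our own sketch, not Bentley's code: an array stands in for the file, one outer-loop iteration is one pass, and a boolean[] replaces the bit-packing for clarity):

```java
public class KPassScan {
    // Scans the 32-bit range in windows of `window` values, one pass over
    // the input per window, and returns the first value never seen.
    // Memory per pass is one flag per value in the window.
    static long findMissing(long[] values, int window) {
        boolean[] seen = new boolean[window];
        for (long lo = 0; lo < (1L << 32); lo += window) {
            java.util.Arrays.fill(seen, false);
            for (long n : values) {                      // one pass
                if (n >= lo && n < lo + window) seen[(int) (n - lo)] = true;
            }
            for (int i = 0; i < window; i++) {
                if (!seen[i]) return lo + i;             // found this pass
            }
        }
        return -1; // every 32-bit value was present
    }
}
```

Halving the window doubles the worst-case number of passes, which is exactly the memory-for-passes trade-off described above.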

HTH.

Noh answered 23/8, 2011 at 4:20 Comment(15)
I dont know the book, but no reason to call it "Wonder Sort", as it is just a bucketsort, with a 1-bit counter.Struck
In the book, Jon Bentley is just being humorous...and calls it "Wonder Sort"Ecumenicism
Although more portable, this code will be annihilated by code written to utilize hardware-supported vector instructions. I think gcc can automatically convert code to using vector operations in some cases though.Fettling
@Brian I think this is more like "super naive code" which is often the most debuggable, but never the most optimized.Heredia
@brian I don't think Jon Bentley was allowing such things into his book on algorithms.Marler
@Brian: Will it? I think that entirely depends on how many times you plan to run the program, and how much your programming (and debugging!) time is worth. If you only need to run it once, you're better off writing something simple that's less likely to have bugs.Muscovado
@Brian: the book was written in 1986, and made up of columns previously written for the CACM journal. I don't think they had hardware-supported vector instructions in those days.Garonne
@Dave: supercomputers had vector instructions since the 1970s. Wikipedia (en.wikipedia.org/wiki/Vector_processor) claims the first work was done in the early 1960s.Ullrich
@vine'th, I don't quite understand what did you mean if there is only one byte memory available? how to do the job with one byte? Can you be more specific?Bloodred
@vine'th, by 4 passes, did you mean read the file 4 times to finish the job?Bloodred
@Alcott: The one byte example was just an very basic example. Yes by 4 passes I meant read the file 4 times. To be more realistic, if we have memory for 1 Billions bits available, we need 4 passes to find the missing number among the 4 billion numbers in the above example.Ecumenicism
@BrianGordon, the time spent in ram will be negligible compared to the time spent reading the file. Forget about optimizing it.Scincoid
@BrianGordon: What are you going to do with vectors? Remember that your input isn't sorted, so for each input element, you need to set a different bit in memory, and those bits won't be contiguous. x86 AVX2 has gathered loads (of 32bit elements). It doesn't have scattered stores of ints, let alone bytes, let alone bits into a bitfield!Jaborandi
@BrianGordon: Or were you talking about the loop at the end to find the first unset bit? Yes, vectors will speed that up, but looping over the bitfield with 64bit integers, looking for one that's != -1 will still saturate memory bandwidth running on a single core (this is SIMD-within-a-register, SWAR, with bits as elements). (For recent Intel/AMD designs). You only have to figure out which bit is unset after you find the 64bit location containing it. (And for that you can not / lzcnt.) Fair point that looping over a single-bit test may not get optimized well.Jaborandi
This programming pearls book is gold. +1 for the reference to it. Why didn't I know of this before...!Donnettedonni
124

Since the problem does not specify that we have to find the smallest possible number that is not in the file we could just generate a number that is longer than the input file itself. :)

Maccabees answered 23/8, 2011 at 11:7 Comment(4)
Unless the largest number in the file is max int then you will just overflowConners
What would be the size of that file in a real World program that may need to generate a new integer and append it to the "used integers" file 100 times?Maryn
I was thinking this. Assuming int is 32 bits, just output 2^64-1. Done.Dermis
If it's one int per line: tr -d '\n' < nums.txt > new_num.txt :DMarxism
61

For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == a 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you're done, iterate over the bits and find the first one that is still '0'. Its index is the answer.
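In Java the same idea can be sketched with java.util.BitSet (an in-memory array stands in for the file; BitSet is int-indexed, so this sketch covers 0..2^31-1 and simply ignores values outside that range, as the comments suggest doing for out-of-range input):

```java
import java.util.BitSet;

public class BitVectorScan {
    // One bit per possible value. Covering the full 2^32 range would take
    // two BitSets (or the manual byte[] shown in the question).
    static int findMissing(int[] values) {
        BitSet seen = new BitSet();
        for (int n : values) {
            if (n >= 0) seen.set(n); // skip values outside the covered range
        }
        return seen.nextClearBit(0); // first index whose bit is still '0'
    }
}
```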

Osteology answered 22/8, 2011 at 21:17 Comment(7)
The range of numbers in the input isn't specified. How does this algorithm work if the input consists of all the even numbers between 8 billion and 16 billion?Candidacandidacy
@Mark, just ignore inputs that are outside the 0..2^32 range. You're not going to output any of them anyway, so there's no need to remember which of them to avoid.Tsai
@Mark whatever algorithm you use to determine how a 32 bit string maps to a real number is up to you. The process is still the same. The only difference is how you print it as a real number to the screen.Blueprint
Instead of iterating yourself you can use bitSet.nextClearBit(0): download.oracle.com/javase/6/docs/api/java/util/…Anabantid
It would be useful to mention that, regardless of the range of the integers, at least one bit is guaranteed to be 0 at the end of the pass. This is due to the pigeonhole principle.Pion
unless you have memory the same speed as your processor, Henning Makholm's solution is better no matter the amount of ram.Allisonallissa
@Brian - if you have 2^32 bits, and 4 billion integers, you can map one bit to each of the integers, and you're guaranteed to have a slot left over, since you run out of integers before you run out of bits, you would have to map two bits to one integer, which is a contradiction. So, assuming 4B really means 2^32, as I suspect Raf did, he is correct.Bohlin
P
51

If they are 32-bit integers (likely from the choice of ~4 billion numbers, close to 2³²), your list of 4 billion numbers will take up at most 93% of the possible integers (4 × 10⁹ / 2³²). So if you create a bit-array of 2³² bits with each bit initialized to zero (which will take up 2²⁹ bytes ≈ 500 MB of RAM; remember a byte = 2³ bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1; and then read through your bit-array and return the first bit that's still 0.

In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit-array for all numbers between 0 and 83886079. So you could read through your list of ints and only record #s that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed, then with overwhelming probability (it differs from 100% by about 10⁻²⁵⁹²⁰⁶⁹) you will find a missing int. In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.

But let's say instead of having about 4 billion numbers you had something like 2³² − 1 numbers and less than 10 MB of RAM; then any small range of ints only has a small possibility of not containing the number.

If you were guaranteed that each int in the list was unique, you could sum the numbers and subtract that sum from the full sum (½)(2³²)(2³² − 1) = 9223372034707292160; the difference is the missing int. However, if an int occurs twice this method will fail.
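The sum trick above is a two-line program (a sketch; `nums` stands in for the file, and `n_values` shrinks the universe so the demo is readable):

```python
def missing_by_sum(nums, n_values=2 ** 32):
    # Full sum of 0..n_values-1, minus the observed sum, is the absent value.
    # Only valid when every value is distinct and exactly one is missing.
    full = n_values * (n_values - 1) // 2   # = 9223372034707292160 for 2^32
    return full - sum(nums)

# Demo on a small universe of 8 values with 5 removed:
print(missing_by_sum([0, 1, 2, 3, 4, 6, 7], n_values=8))  # → 5
```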

However, you can always divide and conquer. A naive method would be to read through the array and count the number of numbers that are in the first half (0 to 2³¹ − 1) and second half (2³¹ to 2³² − 1). Then pick the range with fewer numbers and repeat, dividing that range in half. (Say if there were two fewer numbers in (2³¹, 2³² − 1), then your next search would count the numbers in the ranges (2³¹, 3·2³⁰ − 1) and (3·2³⁰, 2³² − 1).) Keep repeating until you find a range with zero numbers and you have your answer. Should take O(lg N) ~ 32 reads through the array.

That method was inefficient. We are only using two integers in each step (or about 8 bytes of RAM with a 4-byte (32-bit) integer). A better method would be to divide into √(2³²) = 2¹⁶ = 65536 bins, each with 65536 numbers in a bin. Each bin requires 4 bytes to store its count, so you need 2¹⁸ bytes = 256 kB. So bin 0 is (0 to 65535 = 2¹⁶ − 1), bin 1 is (2¹⁶ = 65536 to 2·2¹⁶ − 1 = 131071), bin 2 is (2·2¹⁶ = 131072 to 3·2¹⁶ − 1 = 196607). In Python you'd have something like:

import numpy as np
nums_in_bin = np.zeros(65536, dtype=np.uint32)
for N in four_billion_int_array:
    nums_in_bin[N // 65536] += 1
for bin_num, bin_count in enumerate(nums_in_bin):
    if bin_count < 65536:
        break # we have found an incomplete bin with missing ints (bin_num)

Read through the ~4 billion integer list and count how many ints fall in each of the 2¹⁶ bins, and find an incomplete_bin that doesn't have all 65536 numbers. Then you read through the 4 billion integer list again; but this time only take notice when integers are in that range, flipping a bit when you find them.

del nums_in_bin # allow gc to free old 256kB array
from bitarray import bitarray
my_bit_array = bitarray(65536) # 8 kB
my_bit_array.setall(0)
for N in four_billion_int_array:
    if N // 65536 == bin_num:
        my_bit_array[N % 65536] = 1
for i, bit in enumerate(my_bit_array):
    if not bit:
        print(bin_num * 65536 + i)
        break
Prolactin answered 23/8, 2011 at 5:58 Comment(6)
Such an awesome answer. This would actually work; and has guaranteed results.Cosh
@dr jimbob, what if there is only one number in a bin, and that single number has 65535 duplicates? If so, the bin will still counts 65536, but all of the 65536 numbers are the same one.Bloodred
@Bloodred - I assumed you had 2^32-1 (or fewer) numbers, so by the pigeonhole principle you are guaranteed to have one bin with fewer than 65536 counts to check in more detail. We are trying to find just one missing integer, not all of them. If you had 2^32 or more numbers, you can't guarantee a missing integer and would not be able to use this method (or have a guarantee from the outset there is a missing integer). Your best bet then would be brute force (e.g., read through the array 32 times; checking the first 65536 #s the first time; and stopping once an answer was found).Prolactin
The clever upper-16 / lower-16 method was posted earlier by Henning: https://mcmap.net/q/63405/-generate-an-integer-that-is-not-among-four-billion-given-ones. I loved the add-them-up idea for a unique set of integers missing exactly one member, though.Jaborandi
@PeterCordes - Yes, Henning's solution predates mine, but I think my answer is still useful (working through several things in more detail). That said, Jon Bentley in his book Programming Pearls suggested a multi-pass option for this problem (see vine'th's answer) way before stackoverflow existed (not that I am claiming either of us consciously stole from there or that Bentley was the first to analyze this problem -- it is a fairly natural solution to develop). Two passes seems most natural when the limitation is you no longer have enough memory for a 1 pass solution with a giant bit array.Prolactin
Yup, I agree it's a nice answer for the analysis/explanation. Comparing to other answers, I see the add-them-up idea is similar to the XOR-them-all answers.Jaborandi
D
39

Why make it so complicated? You ask for an integer not present in the file?

According to the rules specified, the only thing you need to store is the largest integer that you encountered so far in the file. Once the entire file has been read, return a number 1 greater than that.

There is no risk of hitting maxint or anything, because according to the rules, there is no restriction to the size of the integer or the number returned by the algorithm.
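Under this answer's premise (unbounded integers), the whole algorithm is a one-liner; Python's arbitrary-precision ints make the "no maxint" assumption literal:

```python
def fresh_integer(numbers):
    # Python ints are unbounded, so one more than the largest value seen
    # cannot appear anywhere in the input.
    return max(numbers) + 1

print(fresh_integer([3, 7, 2 ** 40, 5]))  # → 1099511627777
```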

Denaturalize answered 23/8, 2011 at 14:38 Comment(5)
This would work unless the max int was in the file, which is entirely possible...Taxidermy
The rules do not specify that it is 32-bit or 64-bit or anything, so according to the rules specified, there is no max int. An integer is not a computer term, it is a math term identifying positive or negative whole numbers.Denaturalize
True enough, but one cannot assume that it is a 64 bit number, or that someone wouldn't just sneak in the max int number just to confuse such algorithms.Taxidermy
The entire notion of "max int" is not valid in the context if no programming language has been specified. e.g. look at Python's definition of a long integer. It is limitless. There is no roof. You can always add one. You are assuming it is being implemented in a language that has a maximum allowed value for an integer.Denaturalize
The rules do specify "this is an interview question". So, proposing your solution is very good, but you need to be ready for when the interviewer predictably follows up with "Very good, now same question but with 32-bit integers". Also, if this had not been an interview question but a practical question, then you'd need to start with "First we need to check whether there is a maxint or not", not with "I'll pretend that there is no maxint and hope it's okay"Contort
T
35

This can be solved in very little space using a variant of binary search.

  1. Start off with the allowed range of numbers, 0 to 4294967295.

  2. Calculate the midpoint.

  3. Loop through the file, counting how many numbers were equal, less than or higher than the midpoint value.

  4. If no numbers were equal, you're done. The midpoint number is the answer.

  5. Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.

This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.

This is essentially the same as Henning's solution, except it uses two bins instead of 16k.
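The steps above can be sketched as follows. This is the pigeonhole variant the comments describe (descend into a half whose count is below its capacity); `read_file` is a hypothetical zero-argument callable returning a fresh iterator over the integers for each pass, and the method assumes strictly fewer input values than the size of the starting range (true here: 4 billion < 2³²):

```python
def find_absent(read_file, lo=0, hi=2 ** 32 - 1):
    # Invariant: fewer values lie in [lo, hi] than the range can hold,
    # so [lo, hi] must contain an absent value (pigeonhole principle).
    while lo < hi:
        mid = (lo + hi) // 2
        lower = 0
        for n in read_file():            # one linear pass per halving
            if lo <= n <= mid:
                lower += 1
        if lower < mid - lo + 1:         # lower half has a gap
            hi = mid
        else:                            # then the upper half must have one
            lo = mid + 1
    return lo

# Demo: 0..99 with 42 removed (99 values in a capacity-100 range).
data = [n for n in range(100) if n != 42]
print(find_absent(lambda: iter(data), lo=0, hi=99))  # → 42
```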

Tempera answered 23/8, 2011 at 20:17 Comment(5)
It's what I started with, before I began optimizing for the given parameters.Tsai
@Henning: Cool. It's a nice example of an algorithm where it's easy to tweak the space-time tradeoff.Tempera
@hammar, but what if there are those numbers which appear more than once?Bloodred
@Alcott: then the algorithm will pick the denser bin instead of the sparser bin, but by the pigeonhole principle, it can't ever pick a completely full bin. (The smaller of the two counts will always be less than the bin range.)Jaborandi
this is more like quicksort than binary search. But while going through ram NLgN times is no big deal with files it's pretty bad. So a good solution would go through the file O(1) times at most. Otherwise you're really going to be bogged down.Poison
S
27

EDIT Ok, this wasn't quite thought through as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:


There are ≈4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case, the probability for any one integer to not occur in the file is

( (2³²-1)/2³² )⁴ ⁰⁰⁰ ⁰⁰⁰ ⁰⁰⁰ ≈ .4

The lower the Shannon entropy, the higher this probability gets on the average, but even for this worst case we have a chance of 90% to find a nonoccurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator, store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove this list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even at worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10->99%).

Memory consumption: a few dozen bytes; complexity: O(n); overhead: negligible, as most of the time will be spent in the unavoidable hard disk accesses rather than comparing ints anyway.


The actual worst case, when we do not assume a static distribution, is that every integer occurs at most once, because then only 1 − 4000000000/2³² ≈ 7% of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
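A sketch of this guessing scheme (Python; `read_file` is a hypothetical zero-argument callable returning a fresh iterator over the file for each pass, and the demo "file" is a small stand-in set):

```python
import random

def guess_missing(read_file, guesses_per_pass=5, seed=None):
    # Keep drawing random 32-bit ints until one survives a full scan.
    rng = random.Random(seed)
    while True:
        guesses = {rng.randrange(2 ** 32) for _ in range(guesses_per_pass)}
        for n in read_file():
            guesses.discard(n)       # a guess seen in the file is dead
            if not guesses:
                break                # all guesses hit: re-roll and rescan
        if guesses:
            return guesses.pop()     # any survivor is absent from the file

# Demo on a small stand-in "file":
present = set(range(100000))
missing = guess_missing(lambda: iter(present), seed=42)
assert missing not in present
```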
Slapdash answered 23/8, 2011 at 1:49 Comment(8)
When memory is the constraint and not time, this is probably the best method. Keep re-reading the file over and overHeredia
I'm glad to see someone else thought of this, but why is it way down here at the bottom? This is a 1-pass algo… 10 MB is enough for 2.5 M guesses, and 93%^2.5M ≈ 10^-79000 is truly a negligible chance of needing a second scan. Due to the overhead of binary searching, it goes faster if you use fewer guesses! This is optimal in both time and space.Vodka
@Potatoswatter: good you mentioned binary search. That's probably not worth the overhead when using only 5 guesses, but it certainly is at 10 or more. You could even do the 2 M guesses, but then you should store them in a hash set to get O(1) for the searching.Slapdash
@Vodka Ben Haley's equivalent answer is up near the topFettling
I like this approach, but would suggest a memory-saving improvement: if one has N bits of indexed storage available, plus some constant storage, define a configurable reversible 32-bit scrambling function (permutation), pick an arbitrary permutation, and clear all indexed bits. Then read each number from the file, scramble it, and if the result is less than N, set the corresponding bit. If any bit isn't set at the end of the file, reverse the scramble function on its index. With 64KB of memory, one could effectively test over 512,000 numbers for availability in a single pass.Regolith
If the scrambling function is randomly selected from a set of 2^32 permutations which are equivalent to random distributions, it may be possible for a carefully-constructed list to include all 512,000 numbers for some of those distributions, but not very many. Realistically, even with data that was deliberately designed to be evil, making a truly random one-in-2^32 selection would probably give a better than 99.9999% chance of success in one pass.Regolith
Of course, with this algorithm, the worst case is one where the numbers were created by the same random number generator that you are using. Assuming you can guarantee that's not the case, your best tactic is to use a linear congruential random number generator to generate your list, so that you will go through the number space in a pseudorandom way. That means if you somehow fail, you can continue generating numbers until you have covered the whole range of ints (of have found a gap), without ever duplicating your effort.Simian
@DewiMorgan: right, though that's a really bad worst case (as in, bad luck).Slapdash
S
24

If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:

>>> 0 ^ 1 ^ 3
2
>>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
5

(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)

Spelling answered 24/8, 2011 at 2:43 Comment(3)
Yes, it's easy to prove[] that works when one integer is missing, but it frequently fails if more than one is missing. For example, 0 ^ 1 ^ 3 ^ 4 ^ 6 ^ 7 is 0. [ Writing 2x for 2 to x'th power, and a^b for a xor b, the xor of all k<2x is zero -- k ^ ~k = (2^x)-1 for k < 2^(x-1), and k ^ ~k ^ j ^ ~j = 0 when j=k+2**(x-2) -- so the xor of all but one number is the value of the missing one]Ealasaid
As I mention in a comment on ircmaxell's reply: The problem does not say "one number is missing", it says to find a number not included in the 4 billion numbers in the file. If we assume 32-bit integers, then about 300 million numbers may be missing from the file. The likelihood of the xor of the numbers present matching a missing number is only about 7%.Ealasaid
This is the answer I was thinking of when I initially read the question, but on closer inspection I think the question is more ambiguous than this. FYI, this is the question I was thinking of: #35685Gulosity
V
19

They may be looking to see if you have heard of a probabilistic Bloom Filter which can very efficiently determine absolutely if a value is not part of a large set, (but can only determine with high probability it is a member of the set.)

Ventura answered 23/8, 2011 at 21:20 Comment(5)
With probably over 90% of the possible values set, your Bloom Filter would probably need to degenerate to the bitfield so many answers already use. Otherwise, you'll just end up with a useless completely filled bitstring.Barnhill
@Christopher My understanding of Bloom filters is that you don't get a filled bitarray until you reach 100%Ventura
...otherwise you'd get false negatives.Ventura
@Ventura a filled bit array gives you false positives, which are allowed. In this case the bloom filter would most likely degenerate to the case where the solution, which would be negative, returns a false positive.Calculable
@Paul: You can get a filled bitarray as soon as the number of hash functions multiplied by the number of entries is as large as the length of your field. Of course, that would be an exceptional case, but the probability will rise pretty quickly.Barnhill
C
17

Based on the current wording in the original question, the simplest solution is:

Find the maximum value in the file, then add 1 to it.

Caelian answered 23/8, 2011 at 3:4 Comment(7)
What if the MAXINT is included in the file?Knout
@Petr Peller: A BIGINT library would essentially remove limitations on integer size.Caelian
@oosterwal, if this answer was allowed, than you don't even need to read the file – just print as big number as you can.Larocca
@Nakilon: While that might work for almost every single case, there is no guarantee that the randomly huge number that was printed was not in the file.Caelian
@oosterwal, if your random huge number was the largest you could print, and it was in file, then this task couldn't be solved.Larocca
@Nakilon: +1 Your point is taken. It's roughly equivalent to figuring the total number of digits in the file and printing a number with that many digits.Caelian
@Caelian - There is a guarantee if the randomly huge number was too big to be represented in a file that size. I came to a similar conclusion and discussed it in my answer.Endoergic
C
14

Use a BitSet. 4 billion integers (assuming up to 2^32 integers) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 = approx 0.5 Gb.

To add a bit more detail - every time you read a number, set the corresponding bit in the BitSet. Then, do a pass over the BitSet to find the first number that's not present. In fact, you could do this just as effectively by repeatedly picking a random number and testing if it's present.

Actually BitSet.nextClearBit(0) will tell you the first non-set bit.

Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need 2 BitSets - one for +'ve numbers and one for -'ve numbers - but the memory requirements don't change.

Cherokee answered 22/8, 2011 at 21:17 Comment(1)
Or if you don't want to use a BitSet ... try an array of bits. Does the same thing ;)Heredia
B
12

If there is no size limit, the quickest way is to take the length of the file, and generate the length of the file+1 number of random digits (or just "11111..." s). Advantage: you don't even need to read the file, and you can minimize memory use nearly to zero. Disadvantage: You will print billions of digits.

However, if the only factor was minimizing memory usage, and nothing else is important, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.

Blister answered 24/8, 2011 at 6:9 Comment(0)
A
11

If we assume that the range of numbers will always be 2^n (an even power of 2), then exclusive-or will work (as shown by another poster). As far as why, let's prove it:

The Theory

Given any 0 based range of integers that has 2^n elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.

The Proof

Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:

  • 0 - 00
  • 1 - 01
  • 2 - 10
  • 3 - 11

Now, if we look, each and every bit is set exactly twice. Therefore, since each bit is set an even number of times, an exclusive-or of all the numbers will yield 0. If a single number is missing, the exclusive-or will yield a number that, when exclusive-ored with the missing number, will result in 0. Therefore, the missing number and the resulting exclusive-ored number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).

Now, let's look at n+1. Let's call the number of times each bit is set in n, x and the number of times each bit is set in n+1 y. The value of y will be equal to y = x * 2 because there are x elements with the n+1 bit set to 0, and x elements with the n+1 bit set to 1. And since 2x will always be even, n+1 will always have each bit set an even number of times.

Therefore, since n=2 works, and n+1 works, the xor method will work for all values of n>=2.
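As a sanity check, the claim can be verified exhaustively for small n (a quick Python sketch, removing each element of a 0-based 2^n range in turn and recovering it):

```python
from functools import reduce
from operator import xor

def xor_missing(values):
    # XOR of all present values equals the single absent one
    # in a 0-based range of 2^n elements (n >= 2).
    return reduce(xor, values, 0)

# Exhaustive check for n = 2..8: remove each element in turn and recover it.
for n in range(2, 9):
    full = list(range(2 ** n))
    assert all(xor_missing([v for v in full if v != m]) == m for m in full)
print("verified for 2^2 .. 2^8")
```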

The Algorithm For 0 Based Ranges

This is quite simple. It uses 2·n bits of memory, so for any range with n ≤ 32, two 32-bit integers will suffice (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.

long supplied = 0;
long result = 0;
while (supplied = read_int_from_file()) {
    result = result ^ supplied;
}
return result;

The Algorithm For Arbitrary Based Ranges

This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).

long supplied = 0;
long result = 0;
long offset = INT_MAX;
while (supplied = read_int_from_file()) {
    if (supplied < offset) {
        offset = supplied;
    }
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
    result = result ^ (supplied - offset);
}
return result + offset;

Arbitrary Ranges

We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing bit. It takes 2 passes of an unsorted file, but it will find the single missing number every time:

long supplied = 0;
long result = 0;
long offset = INT_MAX;
long n = 0;
double temp;
while (supplied = read_int_from_file()) {
    if (supplied < offset) {
        offset = supplied;
    }
}
reset_file_pointer();
while (supplied = read_int_from_file()) {
    n++;
    result = result ^ (supplied - offset);
}
// We need to increment n one value so that we take care of the missing 
// int value
n++;
while (n == 1 || 0 != (n & (n - 1))) {
    result = result ^ (n++);
}
return result + offset;

Basically, re-bases the range around 0. Then, it counts the number of unsorted values to append as it computes the exclusive-or. Then, it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then, keep xoring the n value, incremented by 1 each time until n is a power of 2. The result is then re-based back to the original base. Done.

Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):

function find($array) {
    $offset = min($array);
    $n = 0;
    $result = 0;
    foreach ($array as $value) {
        $result = $result ^ ($value - $offset);
        $n++;
    }
    $n++; // This takes care of the missing value
    while ($n == 1 || 0 != ($n & ($n - 1))) {
        $result = $result ^ ($n++);
    }
    return $result + $offset;
}

Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.

Another Approach

Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:

long supplied = 0;
long last = read_int_from_file();
while (supplied = read_int_from_file()) {
    if (supplied != last + 1) {
        return last + 1;
    }
    last = supplied;
}
// The range is contiguous, so what do we do here?  Let's return last + 1:
return last + 1;
Axon answered 24/8, 2011 at 13:56 Comment(2)
The problem does not say "one number is missing", it says to find a number not included in the 4 billion numbers in the file. If we assume 32-bit integers, then about 300 million numbers may be missing from the file. The likelihood of the xor of the numbers present matching a missing number is only about 7%.Ealasaid
If you have a contiguous-but-missing-one range that isn't zero-based, add instead of xor. sum(0..n) = n*(n+1)/2. So missing = nmax*(nmax+1)/2 - nmin*(nmin+1)/2 - sum(input[]). (sum idea from @hammar's answer.)Jaborandi
C
9

Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n, and return n+1.

Of course you'd need a backup plan in case n+1 causes an integer overflow.

Candidacandidacy answered 22/8, 2011 at 21:37 Comment(2)
Here's a solution that works... except when it doesn't. Useful! :-)Cherokee
Unless it's been quoted improperly, the question didn't place a bound on type of integer, or even on language used. Many modern languages have integers only bounded by available memory. If the largest integer in the file is > 10MB, tough luck, task impossible for the second case. My favourite solution.Acetylide
E
9

Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).

void maxNum(ulong filesize)
{
    ulong bitcount = filesize * 8; //number of bits in file

    for (ulong i = 0; i < bitcount; i++)
    {
        Console.Write(9);
    }
}

Should print 10^bitcount − 1, which will always be greater than 2^bitcount. Technically, the number you have to beat is 2^bitcount − (4 × 10⁹ − 1), since you know there are (4 billion − 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.

Endoergic answered 24/8, 2011 at 4:16 Comment(2)
Why not just Console.Write( 1 << bitcount ) instead of the loop? If there are n bits in the file, then any (_n_+1)-bit number with a leading 1 is absolutely guaranteed to be bigger.Lanford
@Lanford - That would just cause integer overflow, unless the file were smaller than the size of an int (4 bytes in C#). C++ might let you use something bigger, but C# doesn't seem to allow anything but 32-bit ints with the << operator. Either way, unless you roll your own gigantic integer type, it's going to be a very small file size. Demo: rextester.com/BLETJ59067Endoergic
E
8
  • The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if number range is limited, which could make min-1 not-a-number.

  • The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.

  • A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^16 numbers. However, it's easily extended to (eg) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).

  • The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~ 4G distinct numbers XOR'd together, and ca. 300M not in file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in file. Note that if the numbers in file aren't all distinct, the XOR method's probability of success rises.

Ealasaid answered 22/8, 2011 at 21:35 Comment(0)
D
7

For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.

Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion-bit number one bit at a time, and that number will not be the same as any in the file. Proof: if it were the same as the nth number, they would agree on the nth bit, but they don't, by construction.
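A sketch of the diagonal construction (Python; for compactness this variant flips the i-th low-order bit of the i-th number rather than left-padding at the high end, but the diagonal argument is identical: the output differs from the i-th input at bit i, so it equals none of them):

```python
def diagonal_missing(numbers):
    # Build a number whose i-th bit differs from the i-th bit of the i-th input.
    out = 0
    for i, n in enumerate(numbers):
        bit = (n >> i) & 1          # i-th (low-order) bit of the i-th number
        out |= (1 - bit) << i       # flip it into the result
    return out

nums = [0b000, 0b011, 0b111]
print(diagonal_missing(nums) not in nums)  # → True
```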

Donkey answered 22/8, 2011 at 21:11 Comment(7)
+1 for creativity (and the smallest worst-case output for a single-pass solution yet).Tsai
But there aren't 4 billion bits to diagonalize on, there are only 32. You'll just end up with a 32 bit number which is different from the first 32 numbers in the list.Fettling
@Henning It's hardly a single pass; you still have to convert from unary to binary. Edit: Well I guess it's one pass over the file. Nevermind.Fettling
@Brian, where's there something "unary" here? The answer is constructing a binary answer one bit a time, and it only reads the input file once, making it single pass. (If decimal output is required, things get problematic -- then you're probably better off constructing one decimal digit per three input numbers and accept a 10% increase in the log of the output number).Tsai
@Brian (first comment), Jonathan explicitly assumed arbitrarily large integers.Tsai
@Henning The problem doesn't make sense for arbitrarily large integers because, as many people have pointed out, it's trivial to just find the largest number and add one, or construct a very long number out of the file itself. This diagonalization solution is particularly inappropriate because rather than branching on the ith bit you could just output 1 bits 4 billion times and throw an extra 1 on the end. I'm OK with having arbitrarily large integers in the algorithm but I think the problem is to output a missing 32-bit integer. It just doesn't make sense any other way.Fettling
@Henning (first comment) For some reason I thought that the ones of the 2^32 bit output would have some meaning as a unary number converted to a 32 bit binary number. That's just the population count though :XFettling
B
7

Strip the white space and non numeric characters from the file and append 1. Your file now contains a single number not listed in the original file.

From Reddit by Carbonetc.

Bach answered 24/8, 2011 at 12:39 Comment(1)
Love It! Even Though it's not quite the answer he was looking for... :DThunderstorm
P
6

You can use bit flags to mark whether an integer is present or not.

After traversing the entire file, scan each bit to determine if the number exists or not.

Assuming each integer is 32 bit, they will conveniently fit in 1 GB of RAM if bit flagging is done.

Pontic answered 22/8, 2011 at 21:18 Comment(2)
0.5 Gb, unless you've redefined byte to be 4 bits ;-)Cherokee
@Cherokee I think he means "comfortably", as in there will be lots of room in the 1Gb.Blueprint
R
6

Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.

Let all possible integers be the range from int_min to int_max, and bool isNotInFile(integer) a function which returns true if the file does not contain a certain integer and false otherwise (by comparing that certain integer with each integer in the file)

for (integer i = int_min; i <= int_max; ++i)
{
    if (isNotInFile(i)) {
        return i;
    }
}
Revile answered 24/8, 2011 at 11:51 Comment(6)
The question was exactly about the algorithm for isNotInFile function. Please make sure you understand the question before answering.Cog
no, the question was "which integer is not in the file", not "is integer x in the file". a function to determine the answer to the latter question could for example just compare every integer in the file to the integer in question and return true on a match.Revile
I think this is a legitimate answer. Except for I/O you only need one integer and a bool flag.Fettling
@Aleks G - I don't see why this is marked as wrong. We all agree it's the slowest algorithm of all :-), but it works and just needs 4 bytes to read the file. The original question does not stipulate the file can only be read once for example.Aerodontia
@Simon I never said the file needs to be read once. I stated that the question includes the requirement for an algorithm to check if a number is in the file.Cog
@Aleks G - Right. I never said you said that either. We just say IsNotInFile can be trivially implemented using a loop on the file: Open;While Not Eof{Read Integer;Return False if Integer=i;Else Continue;}. It needs only 4 bytes of memory.Aerodontia
J
5

For the 10 MB memory constraint:

  1. Convert the number to its binary representation.
  2. Create a binary tree where left = 0 and right = 1.
  3. Insert each number in the tree using its binary representation.
  4. If a number has already been inserted, its leaves will already exist.

When finished, just take a path that has not been created before to create the requested number.

4 billion numbers ≈ 2^32, so a naive tree will not fit in 10 MB.

EDIT

An optimization is possible: if two sibling leaves share a common parent, they can be removed and the parent flagged as not a solution. This prunes branches and reduces the memory needed.

EDIT II

There is also no need to build the tree completely; deep branches are only needed where numbers share long prefixes. Combined with pruning, this solution might in fact work.
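A minimal sketch of the bit-trie idea, without the pruning optimization: insert each number bit by bit from the most significant bit, then search for the first path that was never created. Class and method names here are illustrative, not from the answer:

```java
// Bit-trie sketch: each level consumes one bit of the number, MSB first.
class BitTrie {
    BitTrie[] child = new BitTrie[2];

    void insert(int n) {
        BitTrie node = this;
        for (int d = 31; d >= 0; d--) {
            int b = (n >>> d) & 1;
            if (node.child[b] == null) node.child[b] = new BitTrie();
            node = node.child[b];
        }
    }

    // Returns a value (as unsigned, in a long) whose path was never built,
    // or -1 if every path below this node exists.
    static long findMissing(BitTrie node, int depth, long prefix) {
        if (depth == 0) return -1; // a fully inserted 32-bit path
        if (node.child[0] == null)
            return prefix << depth;                         // prefix + 0 + zeros
        if (node.child[1] == null)
            return (prefix << depth) | (1L << (depth - 1)); // prefix + 1 + zeros
        long r = findMissing(node.child[0], depth - 1, prefix << 1);
        return (r >= 0) ? r : findMissing(node.child[1], depth - 1, (prefix << 1) | 1);
    }
}
```

Without pruning the worst case is far beyond 10 MB, which is exactly the caveat the answer raises.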

Jameyjami answered 22/8, 2011 at 21:38 Comment(2)
... and how will that fit into 10 MB?Tsai
How about: limit the depth of the BTree to something that would fit into 10MB; this would mean you would have results in the set { false positive | positive } and you could iterate through that and use other techniques find values.Cosh
A
5

I will answer the 1 GB version:

There is not enough information in the question, so I will state some assumptions first:

The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.

Pseudo-code:

var bitArray = new bit[4294967296];  // 0.5 GB, initialized to all 0s.

foreach (var number in file) {
    bitArray[number + 2147483648] = 1;   // Shift all numbers so they start at 0.
}

for (var i = 0; i < 4294967296; i++) {
    if (bitArray[i] == 0) {
        return i - 2147483648;
    }
}
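A Java rendering of this pseudocode, parameterized by universe size so it can be tried at small scale; at full scale universe = 2^32 and the long[] bit array is 512 MB, which fits in 1 GB. The method name is illustrative, and values are assumed to be already shifted into 0..universe-1:

```java
class BitMap {
    // Returns the smallest value in [0, universe) not present in values, or -1 if none.
    static long findMissing(long[] values, long universe) {
        long[] bits = new long[(int) ((universe + 63) / 64)]; // one bit per candidate
        for (long v : values) {
            bits[(int) (v >>> 6)] |= 1L << (v & 63);          // mark value as seen
        }
        for (long i = 0; i < universe; i++) {
            if ((bits[(int) (i >>> 6)] & (1L << (i & 63))) == 0) return i;
        }
        return -1;
    }
}
```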
Attested answered 23/8, 2011 at 12:39 Comment(0)
A
4

As long as we're doing creative answers, here is another one.

Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed). Read through the sorted file and output the first number that is missing.
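The final pass over the sorted output might look like this sketch, where values are treated as unsigned 32-bit integers streamed in ascending order and the long[] stands in for the sorted file:

```java
class SortedScan {
    // Returns the smallest unsigned 32-bit value absent from the ascending input.
    static long firstGap(long[] sorted) {
        long expected = 0;
        for (long v : sorted) {
            if (v > expected) return expected; // gap found
            if (v == expected) expected++;     // duplicates (v < expected) are skipped
        }
        return expected; // with 4 billion values < 2^32, a gap always exists
    }
}
```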

Amperehour answered 22/8, 2011 at 21:11 Comment(0)
R
4

As Ryan basically said: sort the file, then scan through the integers; the first skipped value is the one you want :)

EDIT at downvoters: the OP mentioned that the file could be sorted so this is a valid method.

Reese answered 22/8, 2011 at 21:15 Comment(5)
One crucial part is that you should be doing it as you go, that way you only have to read once. Accessing physical memory is slow.Scapolite
@ryan external sort is in most cases a merge sort so on the last merge you can do the check :)Reese
If the data is on disk, it will have to be loaded into memory. This happens automatically by the file system. If we have to find one number (the problem does not state otherwise) then reading the sorted file a line at a time is the most efficient method. It uses little memory and is no slower than anything else - the file must be read.Denishadenison
How will you sort 4 billion integers when you only have 1 GB of memory? If you use virtual memory it will take a loooong time as memory blocks get paged in and out of physical memory.Detrude
@klas merge sort is designed for thatReese
C
3

Bit Elimination

One way is to eliminate bits, however this might not actually yield a result (chances are it won't). Pseudocode:

long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
foreach long fileVal in file
{
    val = val & ~fileVal;
    if (val == 0) error;
}

Bit Counts

Keep track of how often each bit position is set, and use the least-seen bit values to generate a candidate. Again this has no guarantee of generating a correct value.

Range Logic

Keep track of a list of ranges (ordered by start). A range is defined by the structure:

struct Range
{
  long Start, End; // Inclusive.
}
Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };

Go through each value in the file and try and remove it from the current range. This method has no memory guarantees, but it should do pretty well.
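The range bookkeeping can be sketched with a TreeMap keyed by range start (inclusive ends; class and method names are illustrative):

```java
import java.util.Map;
import java.util.TreeMap;

class Ranges {
    final TreeMap<Long, Long> ranges = new TreeMap<>(); // start -> end, both inclusive

    Ranges(long start, long end) { ranges.put(start, end); }

    // Remove one value from whichever range covers it, splitting the range if needed.
    void remove(long v) {
        Map.Entry<Long, Long> e = ranges.floorEntry(v);
        if (e == null || e.getValue() < v) return; // v is not in any remaining range
        long start = e.getKey(), end = e.getValue();
        ranges.remove(start);
        if (start < v) ranges.put(start, v - 1);
        if (v < end) ranges.put(v + 1, end);
    }

    // Any value still covered by a range never appeared in the file; take the smallest.
    long anyMissing() { return ranges.firstKey(); }
}
```

In the worst case (e.g. the file contains every other integer) the map degenerates to billions of single-value ranges, matching the "no memory guarantees" caveat.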

Cosh answered 23/8, 2011 at 12:26 Comment(0)
P
3

2^(128*10^18) + 1 (which is (2^8)^(16*10^18) + 1) - can't it be a universal answer for today? This represents a number that cannot be held in a 16 EB file, which is the maximum file size in any current file system.

Penoyer answered 24/8, 2011 at 9:57 Comment(2)
And how would you print the result? You can't put it in a file, and printing on the screen would take a few billion years. Not an uptime likely to be achieved with today's computers.Blister
it is never said we need to print the result anywhere, just 'generate' it. so it depends on what you mean by generate. anyway, my answer is just a trick to avoid working out a real algorithm :)Penoyer
C
3

I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:

If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.

Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.

The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
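The arithmetic scales down to any word width; here is the same trick at 16 bits so it can be checked quickly (at 32 bits the constants become 32768 -> 2147483648 and 65536 -> 4294967296, and a 32-bit register wraps automatically). This is a sketch assuming exactly one value of 0..2^16-1 is missing and there are no repeats:

```java
class Checksum {
    // nums must contain every 16-bit value except one, each exactly once.
    static int findMissing16(int[] nums) {
        int sum = 0;
        for (int v : nums) sum = (sum + v) & 0xFFFF; // running total with 16-bit overflow
        sum = (sum + 32768) & 0xFFFF;                // add 2^16 / 2, again with overflow
        return (65536 - sum) & 0xFFFF;               // subtract from 2^16
    }
}
```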

Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.

# Grab the file size
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);
# Print four '2' digits for every byte of the file (more decimal digits than it can hold).
for (c=0; c<sz; c++) {
  for (b=0; b<4; b++) {
    print "2";
  }
}
Centroclinal answered 24/8, 2011 at 10:37 Comment(1)
@Larocca and TheDayTurns have pointed this out in the comments to the original questionFettling
C
2

Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GiB memory. Follow up with what you would do if you have only 10 MiB of memory.

The size of the file is 4 * 10^9 * 4 bytes = 16 GB

In case of 32-bit Unsigned Integer

0 <= Number < 2^32
0 <= Number < 4,294,967,296

My proposed solution: C++ without error checking

#include <vector>
#include <fstream>
#include <iostream>
using namespace std;

int main ()
{
    const long long SIZE = 1LL << 32;  // long is only 32 bits on some platforms

    std::vector<bool> checker(SIZE, false);

    std::ifstream infile("file.txt");  // TODO: error checking

    unsigned int num = 0;

    while (infile >> num)
    {
        checker[num] = true ;
    }

    infile.close();

    // print missing numbers

    for (long long i = 0; i < SIZE; i++)
    {
        if (!checker[i])
            cout << i << endl ;
    }

    return 0;
}

Complexity

  • Space ~ 2^32 bits = 2^29 bytes = 2^19 KB = 2^9 MB = 1/2 GB
  • Time ~ Single Pass
  • Completeness ~ Yes
Commander answered 22/8, 2011 at 21:11 Comment(2)
This duplicates all the years-older bitmap answers. Also, std::vector<bool> doesn't have a fast way to scan for an unset bit. Neither does std::bitset, though. (Testing 64bits at a time against (long)-1 is way faster, unless the compiler is really clever and sees through a bit-at-a-time loop.)Jaborandi
Tested this on x86; gcc 4.9.2 generates terrible one-bit-at-a-time loops. clang generates worse loops for setting a sequence of bits, but slightly better loops (using bt r, r) for testing a bit at a time. Both are still ~100 times slower than checking 64bits at a time with cmp r, -1Jaborandi
P
2

If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is 1 in 2^64/(4*10^9), i.e. roughly 1 in 4.6 billion. You'd be right most of the time!

(Joking... kind of.)

Potentiometer answered 24/8, 2011 at 8:12 Comment(5)
I see this has already been suggested :) upvotes for those peoplePotentiometer
The birthday paradox makes this kind of solution not worth the risk, without checking the file to see if your random guess was actually an valid answer. (Birthday paradox doesn't apply in this case, but repeatedly calling this function to generate new unique values does create a birthday paradox situation.)Jaborandi
@PeterCordes Randomly generated 128bit numbers is precisely how UUIDs work - they even mention the birthday paradox when calculating the probability of a collision in the Wikipedia UUID pagePotentiometer
Variant: Find max in set, add 1.Robertaroberto
I would quicksort the original array (no additional storage.) then march through the array and report the first 'skipped' integer. Done. Answered the question.Horrid
T
1

You could speed up finding the missing integers after reading the existing ones by storing ranges of unvisited integers in some tree structure.

You'd start by storing [0..4294967295] and every time you read an integer you splice the range it falls in, deleting a range when it becomes empty. At the end you have the exact set of integers that are missing in the ranges. So if you see 5 as the first integer, you'd have [0..4] and [6..4294967295].

This is a lot slower than marking bits so it would only be a solution for the 10MB case provided you can store the lower levels of the tree in files.

One way to store such a tree would be a B-tree with the start of the range as the key and the end of the range as the value. Worst case usage would be when you get all odd or even integers which would mean storing 2^31 values or tens of GB for the tree... Ouch. Best case is a sorted file where you'd only use a few integers for the whole tree.

So not really the correct answer but I thought I'd mention this way of doing it. I suppose I'd fail the interview ;-)

Toffic answered 22/8, 2011 at 21:11 Comment(0)
J
1

You don't need to sort them, just repeatedly partition subsets of them.

The first step is like the first pass of a quicksort. Pick one of the integers, x, and using it make a pass through the array to put all the values less than x to its left and values more than x to its right. Find which side of x has the greatest number of available slots (integers not in the list). This is easily computable by comparing the value of x with its position. Then repeat the partition on the sub-list on that side of x. Then repeat the partition on the sub-sub list with the greatest number of available integers, etc. Total number of compares to get down to an empty range should be about 4 billion, give or take.
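The scheme above can be sketched as follows, recursing on range midpoints rather than a sampled pivot, which keeps the counting argument the same. This is an in-memory illustration that assumes the values are distinct and at least one value in [lo, hi] is absent:

```java
import java.util.ArrayList;
import java.util.List;

class Partition {
    static long findMissing(List<Long> vals, long lo, long hi) {
        if (vals.isEmpty()) return lo; // the whole remaining range is missing
        long mid = lo + (hi - lo) / 2;
        List<Long> left = new ArrayList<>(), right = new ArrayList<>();
        for (long v : vals) (v <= mid ? left : right).add(v);
        // By pigeonhole, the side with fewer values than slots contains a gap.
        if (left.size() < mid - lo + 1) return findMissing(left, lo, mid);
        return findMissing(right, mid + 1, hi);
    }
}
```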

Jacklynjackman answered 22/8, 2011 at 21:11 Comment(0)
G
1

Perhaps I'm completely missing the point of this question, but you want to find an integer missing from a sorted file of integers?

Uhh... really? Let's think about what such a file would look like:

1 2 3 4 5 6 ... first missing number ... etc.

The solution to this problem seems trivial.

Goldina answered 24/8, 2011 at 0:28 Comment(1)
It's not specified to be sorted.Whisenhunt
S
0

Old question, but I wonder about the "non-functional" requirements. In my opinion a clue should be given - unless this question comes from a book that then discusses all the possibilities with pros and cons. It seems to be asked often enough in job interviews, which leaves me puzzled, since there can't be a definite answer without knowing the soft requirements, e.g. "lookups of missing numbers must be very fast, because they happen x times per second".

Still, I think a reasonable answer can be given:

  • I'd merge-sort all numbers into a new file, using 4 bytes per int. This will of course be slow at first, but it can be done with a small amount of memory (you don't necessarily need to keep everything in RAM)
  • Use binary search to check whether a number exists in the presorted file. Since each value remains 4 bytes, this is no problem

disadvantages:

  • File size
  • Slow first sort - but only needed once

advantages:

  • very quick to lookup

So again, a very nice question for a book. But I think it's an odd question when asking for a single best solution, when the problem to be solved isn't fully known.

Scissure answered 22/8, 2011 at 21:11 Comment(1)
Why do you think a job interviewer would be expecting a "single best solution"? Also you're solving a different problem than the one that was asked -- it's not about checking whether some given number occurs in the file, it's about producing one that doesn't occur there.Tsai
W
0

I came up with the following algorithm.

My idea: go through the whole file of integers once and, for every bit position, count its 0s and 1s. With no value missing, each count would be 2^(numOfBits)/2; therefore, when a count is lower than expected, that bit belongs in the resulting number.

For example, suppose integer is 32 bit, then we require

int[] ones = new int[32];
int[] zeroes = new int[32];

For every number we have to iterate through all 32 bits and increase the count of 0 or 1:

for(int i = 0; i < 32; i++){
   ones[i] += (val>>i&0x1); 
   zeroes[i] += (val>>i&0x1)==1?0:1;
}

Finally, after the file was processed:

int res = 0;
for(int i = 0; i < 32; i++){
   if(ones[i] < (long)1<<31)res|=1<<i;
}
return res;

NOTE: in some languages (ex. Java) 1<<31 is a negative number, therefore, (long)1<<31 is the right way to do it
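Put together and scaled down to 16 bits so it can be checked quickly (the 32-bit version replaces 16 with 32 and the threshold with (long)1<<31 as above); like the original, this sketch assumes exactly one value is missing and there are no duplicates:

```java
class BitCounts {
    // nums must contain every 16-bit value except one, each exactly once.
    static int findMissing16(int[] nums) {
        int[] ones = new int[16];
        for (int v : nums)
            for (int i = 0; i < 16; i++)
                ones[i] += (v >> i) & 1;            // count set bits per position
        int res = 0;
        for (int i = 0; i < 16; i++)
            if (ones[i] < (1 << 15)) res |= 1 << i; // fewer ones than expected:
                                                    // the missing number has this bit set
        return res;
    }
}
```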

Weariless answered 22/8, 2011 at 21:11 Comment(1)
Is this one of the solutions that only works for one missing integer and no duplicates (meaning there's exactly one possible answer)? If so, it looks like it works for the same reason as the much-less-laborious xor method.Jaborandi
G
0

I may be reading this too closely, but the question says "generate an integer which is not contained in the file". I'd just sort the list and add 1 to the max entry. Bam, an integer which is not contained in the file.

Gallonage answered 24/8, 2011 at 2:45 Comment(3)
Why would you sort it if you just want to find the max?Spelling
How would you sort 4 billion integers using no more than 1 GB of memory?Detrude
@Brian lol. As interview questions go, not understanding where the complexity lies is usually worse than not coming up with a good solution. After all, understanding the problem is usually more than halfway of solving it.Detrude
J
-2

Surely, and speaking with limited experience (just started learning Java at uni), you can run through one set (barrel) of ints and, if the number is not found, dispose of the barrel. This would both free up space and run a check through each unit of data. If what you are looking for is found, add it to a count variable. It would take a long time but, if you made multiple variables for each section, ran the check count through each variable, and ensured they exit/dispose at the same time, the variable storage should not increase? And it would speed up the check process. Just a thought.

Jabez answered 22/8, 2011 at 21:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.