find if two arrays contain the same set of integers without extra space and faster than NlogN

14

26

I came across this post, which reports the following interview question:

Given two arrays of numbers, find if the two arrays contain the same set of integers. Suggest an algorithm which can run faster than NlogN without extra space.

The best that I can think of is the following:

  1. (a) sort each array, and then (b) have two pointers moving along the two arrays and check whether you find different values ... but step (a) already has NlogN complexity :(

  2. (a) scan the shorter array and put its values into a map, and then (b) scan the second array and check if you find a value that is not in the map ... here we have linear complexity, but I use extra space (see the sketch below)
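For concreteness, a minimal sketch of option 2, assuming the inputs are std::vector<int>; note that the one-way scan described above only proves containment, so this version builds and compares both sets:

#include <unordered_set>
#include <vector>

// Checks whether two arrays contain the same *set* of integers.
// Linear expected time, but O(n) extra space for the hash sets,
// so it does not meet the "no extra space" requirement.
bool same_set(const std::vector<int>& a, const std::vector<int>& b) {
    std::unordered_set<int> sa(a.begin(), a.end());
    std::unordered_set<int> sb(b.begin(), b.end());
    return sa == sb;   // operator== on unordered_set compares contents, ignoring order
}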

... so, I can't think of a solution for this question.

Ideas?


Thank you for all the answers. I feel many of them are right, but I decided to choose ruslik's, because it offers an interesting option that I had not thought about.

Benyamin answered 14/7, 2011 at 9:39 Comment(9)
@TheHorse: I don't know, the question doesn't say it. But I think I see what you're aiming at: if we have 16-bit integers, then we may use my approach number 1 with a linear-time sorting algorithm like radix sort. Correct?Benyamin
Option 1 can have linear complexity with a counting or pigeonhole sort, but then it takes extra memory.Simms
@larsmans: correct, but then I would go with option 2, which is also linearBenyamin
Less, that's almost the same solution except you skip sorting the second array. For a moment, I thought you meant std::map/sorted tree, but a bitmap would do it in linear time.Simms
naming arrays A and B, is the N in NlogN size(A)+size(B) or max(size(A),size(B)) ?Indigent
@larsmans: yeah ... but still extra space, so that doesn't seem to be the correct answerBenyamin
@Marino: I don't know, we should ask the interviewer :) my guess is that N = max(size(A), size(B))Benyamin
s/Less/Yes/g (that was a weird typo)Simms
@Marino: asymptotically, that's exactly the same if both |A| and |B| are variable.Simms
11

You can try a probabilistic approach by choosing a commutative function for accumulation (e.g., addition or XOR) and a parametrized hash function.

unsigned addition(unsigned a, unsigned b);
unsigned hash(int n, int h_type);

unsigned hash_set(int* a, int num, int h_type){
    unsigned rez = 0;
    for (int i = 0; i < num; i++)
        rez = addition(rez, hash(a[i], h_type));  // commutative accumulation: element order doesn't matter
    return rez;
}

This way, the number of tries needed before you decide that the probability of a false positive is below a certain threshold does not depend on the number of elements, so the check will be linear.

EDIT: In the general case the probability of the sets being the same is very small, so this O(n) check with several hash functions can be used for prefiltering: to decide as fast as possible whether the arrays are surely different or whether there is some probability of them being equivalent, in which case a slow deterministic method should be used. The final average complexity will be O(n), but the worst case will have the complexity of the deterministic method.
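One way to put this together, as a rough sketch; the mixing function, the seeds, and the number of rounds below are illustrative choices of mine, not part of the answer:

#include <cstdint>
#include <cstddef>

// Illustrative parametrized hash: mixes the value with a per-round seed.
uint64_t hash_one(uint64_t x, uint64_t seed) {
    x += seed;
    x *= 0x9E3779B97F4A7C15ull;   // odd 64-bit constant, splitmix64-style
    x ^= x >> 31;
    return x;
}

// Commutative accumulation (here: addition), so element order is irrelevant.
uint64_t hash_multiset(const int* a, size_t n, uint64_t seed) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += hash_one((uint64_t)(int64_t)a[i], seed);
    return acc;
}

// Prefilter: if any round differs, the multisets are certainly different.
// If all rounds agree, a slow deterministic comparison is still needed to be sure.
bool probably_same_multiset(const int* a, size_t na, const int* b, size_t nb, int rounds = 4) {
    if (na != nb) return false;
    for (int r = 0; r < rounds; r++) {
        uint64_t seed = 1u + 12345u * (uint64_t)r;
        if (hash_multiset(a, na, seed) != hash_multiset(b, nb, seed))
            return false;
    }
    return true;
}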

Catacomb answered 14/7, 2011 at 12:26 Comment(2)
I don't understand how this solves the problem. Can't two arrays of different integers produce the same hash value? What if there is a collision? I don't see any guarantee in your solution.Ermaermanno
@Nawaz The hash can only guarantee that they are different. If the hashes are the same, you can use the explicit (slow) comparison. Read the "EDIT" portion carefully.Catacomb
6

You said "without extra space" in the question but I assume that you actually mean "with O(1) extra space".

Suppose that all the integers in the arrays are less than k. Then you can use in-place radix sort to sort each array in time O(n log k) with O(log k) extra space (for the stack, as pointed out by yi_H in comments), and compare the sorted arrays in time O(n log k). If k does not vary with n, then you're done.
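A sketch of the in-place binary (MSD) radix sort this relies on, assuming non-negative values below 2^31 so the sort can start at bit 30; handling negative values is left out:

#include <cstddef>
#include <utility>

// In-place MSD binary radix sort ("radix exchange sort") on a[lo, hi).
// Time O(n log k), extra space O(log k) for the recursion stack.
void radix_sort(int* a, size_t lo, size_t hi, int bit) {
    if (bit < 0 || hi - lo < 2) return;
    size_t i = lo, j = hi;                // partition on the current bit
    while (i < j) {
        if ((a[i] >> bit) & 1)
            std::swap(a[i], a[--j]);      // 1-bit: move to the right block
        else
            ++i;                          // 0-bit: already in place
    }
    radix_sort(a, lo, j, bit - 1);        // recurse on the 0-bit block
    radix_sort(a, j, hi, bit - 1);        // recurse on the 1-bit block
}

After sorting both arrays, e.g. with radix_sort(a, 0, n, 30), one linear pass comparing corresponding elements finishes the check.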

Germaine answered 14/7, 2011 at 12:33 Comment(9)
You need O(k) extra space for the stack. I know, that's still O(1), just wanted to mention it.Fewell
Oh yes, you're right, in-place radix sort recurses on the bins. I'll fix.Germaine
If the numbers are that big then you can't compare even two numbers in time O(n), let alone a whole array of them!Germaine
Why are we making the assumption that the values of the integers are bounded to a polynomial factor of n?Heraclitean
I guess this comes down to what model of computation one prefers to treat as realistic. I like models in which it takes O(k) time to perform an operation on a number with k bits. Models with constant-time operations on arbitrary-precision numbers seem unrealistic to me as I don't know how to program anything like them on the computers I have access to.Germaine
@Gareth: In general, the assumption is that for this type of problem, the size of the machine word will be scaled according to the data given. For example, if one is asked to determine whether an array of N elements is a permutation, the space taken up by the array elements is considered to be O(N) rather than O(NlgN). This assumption is reasonable not because it's "realistic", but because it affects most algorithms equally.Hildredhildreth
Regarding "I like models in which it takes O(k) time to perform an operation on a number with k bits": I don't, and neither do most of my colleagues in theoretical computer science. On commodity hardware, does it take four times as long to add 32-bit integers as 8-bit? No. The point of a model is to capture reality, and almost all processors offer no discount for operating on quantities smaller than a word. Since it's usually the case that 1 word = 1 pointer, the most common assumption is that the word size suffices to index main memory: Theta(log n) bits, with unit-cost operations.Wafer
This comment box is a bit small to have this argument, but look: if you happen to know that all your integers are 32 bits long, then the operation time is O(32), which drops out of the big-O. So you'll get the same answer as me for the complexity of radix sort. But if you don't know that all your integers are bounded like this, then you better account for the operation time. On commodity hardware, it does take longer to operate on, say, 10000-bit integers than it does for 32-bit integers.Germaine
@Gareth, the assumption about number sizes is completely wrong. Sure, each given set has a supremum, and word sizes on real PCs are limited, but in algorithm analysis there is no such limitation. You may solve some problems with the previous approaches by bounding the problem, but that is not an answer for the general case.Burghley
3

I'll assume that the integers in question are of fixed size (eg. 32 bit).

Then, radix-quicksorting both arrays in place (aka "binary quicksort") is constant space and O(n).

In the case of unbounded integers, I believe (but cannot prove, even though it is probably doable) that you cannot break the O(n k) barrier, where k is the number of digits of the greatest integer in either array.

Whether this is better than O(n log n) depends on how k is assumed to scale with n, and therefore depends on what the interviewer expects of you.

Allanallana answered 14/7, 2011 at 12:38 Comment(0)
2

A special, but no harder, case is when one array holds 1, 2, ..., n. This has been discussed many times, and despite many attempts no deterministic solution using O(1) space and O(n) time has been shown. Either you can cheat the requirements in some way (reuse the input space, assume the integers are bounded) or use a probabilistic test.

Probably this is an open problem.

Iliac answered 14/7, 2011 at 16:21 Comment(0)
1

Here is a co-RP algorithm:

In linear time, iterate over the first array (A), building the polynomial Pa = (A[0] - x)(A[1] - x)...(A[n-1] - x). Do the same for array B, naming this polynomial Pb.

We now want to answer the question "is Pa = Pb?" We can check this probabilistically as follows. Select a number r uniformly at random from the range [0...4n] and compute d = Pa(r) - Pb(r) in linear time. If d = 0, return true; otherwise return false.

Why is this valid? First of all, observe that if the two arrays contain the same elements, then Pa = Pb, so Pa(r) = Pb(r) for all r. With this in mind, we can easily see that this algorithm will never erroneously reject two identical arrays.

Now we must consider the case where the arrays are not identical. By the Schwartz-Zippel lemma, P(Pa(r) - Pb(r) = 0 | Pa != Pb) <= n/(4n+1), since a nonzero polynomial of degree at most n has at most n roots. So the probability that we accept the two arrays as equivalent when they are not is less than 1/4.
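A rough sketch of one round of this test, evaluated modulo a prime (as suggested in the comments below) so nothing overflows. The choice of prime, the RNG, and the single-round structure are my own; note that distinct integers that happen to collide modulo the prime could still fool this particular instantiation, so the prime should comfortably exceed both n and the range of the values:

#include <cstdint>
#include <cstddef>
#include <random>

const uint64_t P = 2147483647ull;   // example modulus: 2^31 - 1, a Mersenne prime

// Evaluates Pa(r) = prod_i (a[i] - r) modulo P, using 64-bit intermediates.
uint64_t eval_product(const int* a, size_t n, uint64_t r) {
    uint64_t prod = 1;
    for (size_t i = 0; i < n; i++) {
        int64_t v = (int64_t)a[i] % (int64_t)P;   // reduce a[i] into [0, P)
        if (v < 0) v += (int64_t)P;
        prod = prod * (((uint64_t)v + P - r) % P) % P;
    }
    return prod;
}

// One round: never rejects equal multisets; wrongly accepts unequal ones
// with probability at most n / P (plus the mod-P collision caveat above).
bool probably_equal(const int* a, const int* b, size_t n) {
    std::mt19937_64 gen(std::random_device{}());
    uint64_t r = std::uniform_int_distribution<uint64_t>(0, P - 1)(gen);
    return eval_product(a, n, r) == eval_product(b, n, r);
}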

Heraclitean answered 14/7, 2011 at 12:41 Comment(10)
Building a polynomial is not O(1) space. If you don't build it but calculate it for each x then it's O(1). But still, the calculation is going to overflow very fast. I don't know exactly how overflow works on processors, but I think you will get different results if the order of the operands differs...Fewell
I was assuming by the way the question was worded that we were talking about more abstract models of computation that don't have to worry about overflow. Either way, we don't have to use machine primitives for the operations, we can use some arbitrary precision datatypes (I believe).Heraclitean
with arbitrary precision datatypes you lose the O(1) property of the basic operations like addition and multiplication.Fewell
Again, I was mostly working off the assumption for an abstract computer. If not, just give me a processor with sufficient bit-width to keep a running product.Heraclitean
Again, while you certainly have a witty response, you seem to keep ignoring the part where I say multiple times that this is for an abstract computer.Heraclitean
@yi_H let us continue this discussion in chatHeraclitean
I'm not ignoring it. You clearly started the sentence with "If not, ..."Fewell
You can do computation modulo some prime 2n < p < 4n, it should still be coRP and won't overflow.Iliac
The questions states "do the arrays contain the same set of integers". With A = {1, 1} and B = {1} your method will fail, since (X-1)^2 != X-1.Allanallana
So I thought about that but assumed that to just be ambiguity in the wording and hoped that he really meant multi-set (i.e. same elements but shuffled).Heraclitean
1

The usual assumption for these kinds of problems is Theta(log n)-bit words, because that's the minimum needed to index the input.

  1. sshannin's polynomial-evaluation answer works fine over finite fields, which sidesteps the difficulties with limited-precision registers. All we need is a prime of the appropriate size (easy to find under the same assumptions that support a lot of public-key crypto) or an irreducible polynomial in (Z/2)[x] of the appropriate degree (the difficulty here is multiplying polynomials quickly, but I think the algorithm would be o(n log n)).

  2. If we can modify the input with the restriction that it must maintain the same set, then it's not too hard to find space for radix sort. Select the (n/log n)th element from each array and partition both arrays. Sort the size-(n/log n) pieces and compare them. Now use radix sort on the size-(n - n/log n) pieces. From the previously processed elements, we can obtain n/log n bits, where bit i is on if a[2*i] > a[2*i + 1] and off if a[2*i] < a[2*i + 1]. This is sufficient to support a radix sort with n/(log n)^2 buckets.

Mellicent answered 14/7, 2011 at 16:5 Comment(0)
1

In the algebraic decision tree model, there are known Omega(NlogN) lower bounds for computing set intersection (irrespective of the space limits).

For instance, see here: http://compgeom.cs.uiuc.edu/~jeffe/teaching/497/06-algebraic-tree.pdf

So unless you do clever bit manipulations/hashing type approaches, you cannot do better than NlogN.

For instance, if you used only comparisons, you cannot do better than NlogN.

Mickiemickle answered 15/7, 2011 at 6:17 Comment(0)
0

Why not find the sum, product, and XOR of all the elements of one array and compare them with the corresponding values for the other array?

The XOR of the elements of both arrays may give zero in a case like

2,2,3,3 1,1,2,2

But what if the XORs of the elements of the two arrays turn out to be equal?

consider this

10,3 12,5

Here the XOR of both arrays is the same: (10^3) = (12^5) = 9, but their sums and products are different. I think two different sets of elements cannot have the same sum, product, and XOR! This can be analysed by simple bit-value examination. Is there anything wrong with this approach?
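For what it's worth, a sketch of the comparison described above; this only shows the check itself, and matching fingerprints do not by themselves prove that the multisets are equal (the product also wraps around modulo 2^64):

#include <cstdint>
#include <cstddef>

struct Fingerprint {
    uint64_t sum = 0, prod = 1, xr = 0;
    bool operator==(const Fingerprint& o) const {
        return sum == o.sum && prod == o.prod && xr == o.xr;
    }
};

// Order-independent fingerprint of an array: sum, product and XOR of
// its elements, all in wrap-around 64-bit arithmetic.
Fingerprint fingerprint(const int* a, size_t n) {
    Fingerprint f;
    for (size_t i = 0; i < n; i++) {
        uint64_t x = (uint64_t)(int64_t)a[i];
        f.sum += x;
        f.prod *= x;
        f.xr ^= x;
    }
    return f;
}

Two arrays can then be compared with fingerprint(a, n) == fingerprint(b, n).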

Exhibitioner answered 14/7, 2011 at 9:39 Comment(0)
0

You can break the O(n*log(n)) barrier if you have some restrictions on the range of numbers. But it's not possible to do this if you cannot use any extra memory (you need really silly restrictions to be able to do that).

I would also like to note that even O(n log(n)) with sorting is not trivial if you have an O(1) space limit: merge sort uses O(n) space, and quicksort (which is not even strictly O(n log(n)) in the worst case) needs O(log(n)) space for the stack. You have to use heapsort or smoothsort.

Some companies like to ask questions which cannot be solved, and I think it is good practice: as a programmer you have to know both what's possible and how to code it, and also what the limits are, so you don't waste your time on something that isn't doable.

Check this question for a couple of good techniques to use: Algorithm to tell if two arrays have identical members

Fewell answered 14/7, 2011 at 10:41 Comment(0)
0

For each integer i, check that the numbers of occurrences of i in the two arrays are either both zero or both nonzero, by iterating over the arrays.

Since the number of integers is constant the total runtime is O(n).

No, I wouldn't do this in practice.

Gigantean answered 14/7, 2011 at 11:40 Comment(7)
just a friendly qn: are you sure that is O(n) and not O(n^2)? cos you will need 2 for loops to accomplish the comparison..Steelmaker
Dhruv: It is (A+B)N (where N is the size of the set of ints, and A and B are the lengths of the arrays), which is O(n)Electuary
But in the general case A and B are O(n), making the whole algorithm O(n²) as indicated by Dhruv.Germaine
ah, right. i thought he was running a comparison-based algo, comparing indexes... but still, that means he has to check all integers in the range [-2^31, 2^31-1], which seems expensive...Steelmaker
But in that sense every algorithm on a 32-bit machine is O(1), since there are a constant number (2^32^(2^32)) machine states, hence at most (2^32^(2^32))! execution sequences... This is a decidedly un-practical notion of big-O!Germaine
You wouldn't say anything is wrong with it if the numbers were limited to the range [0,20], right? The constant factor is just ridiculously large. Asymptotic growth is not the silver bullet for choosing the right algorithm.Fewell
I agree! I just think that if you're going to use the finiteness of computers (rather than an actual argument about practicality) to nitpick at the definition of big-O, then why not nitpick in the most general way?Germaine
0

I was just wondering whether there is a way you could hash a cumulative value of each array and compare the two, assuming the hash function doesn't produce collisions for two differing patterns.

Prut answered 19/7, 2011 at 18:31 Comment(0)
-1

I'm not sure that I correctly understood the problem, but if you are interested in the integers that appear in both arrays:

If N >>>>> 2^SizeOf(int) (the number of bits in an integer: 16, 32, 64), there is one solution:

a = Array(N); //length(a) = N
b = Array(M); //length(b) = M
//x86-64: an integer consists of 64 bits.

//The first 2^64 / 64 slots of a are reused as a bitmap below, so
//compare that prefix against b by brute force before overwriting it.
for i := 0 to 2^64 / 64 - 1 do  //very big, but CONST
  for k := 0 to M - 1 do
    if a[i] = b[k] then doSomething;   //detected

//Mark each remaining value of a in the bitmap (one bit per possible value).
for i := 2^64 / 64 to N - 1 do
  if not isSetBit(a[a[i] div 64], a[i] mod 64) then
    setBit(a[a[i] div 64], a[i] mod 64);

//Look up each value of b in the bitmap.
for i := 0 to M - 1 do
  if isSetBit(a[b[i] div 64], b[i] mod 64) then doSomething; //detected

O(N), without additional structures

Encarnacion answered 14/7, 2011 at 10:36 Comment(2)
a[i] where i goes up to 2^64 ? doesn't make sense. also what's that detected thing there? you have to check that those two arrays contain the same set of numbers.Fewell
Yeah, I did not understand the task correctly. My solution finds all integers which appear in both arrays. About 2^64: it is only theoretical.Encarnacion
-1

All I know is that comparison-based sorting cannot possibly be faster than O(NlogN), so we can eliminate most of the "common" comparison-based sorts. I was thinking of doing a bucket sort. Perhaps if this question were asked in an interview, the best response would first be to clarify what sort of data those integers represent. For example, if they represent a person's age, then we know that the range of values is limited, and we can use bucket sort in O(n). However, this will not be in place....

Steelmaker answered 14/7, 2011 at 12:12 Comment(2)
just a terminology note: by "common" sorts you mean comparison-based sorts.Fewell
okay then maybe i deserve the second -1.. lol, oh well, life's too short to be pissed, huh? cheers!Steelmaker
-2

If the arrays have the same size, and there are guaranteed to be no duplicates, sum each of the arrays. If the sum of the values is different, then they contain different integers.

Edit: You can then sum the logs of the entries in the arrays. If that is also the same, then you have the same entries in the arrays.

Electuary answered 14/7, 2011 at 11:42 Comment(17)
A bit more thought, in response to the edit: I'm still not convinced. You'd need to take into account the number of digits of precision needed for the logs to ensure that you couldn't construct a counter example. How does that scale with the number of integers? A deeper problem is the "no duplicates" assumption: you have either limited the array to finite length or you need to allow arbitrary precision integers, which breaks the assumption of O(1) mathematical operations.Headley
(a) I've assumed that the arrays are finite, because if they are not finite, they are streams, rather than arrays, and we would need more information about them anyway. (b) I assume that the number of integers is finite, i.e. we are dealing with fixed-point representations, not mathematical integers. (c) with fixed-width integers, we can work out the necessary precision of the representation of the logs.Electuary
Let me restate: by assuming distinct, fixed-length integers, you have imposed a finite upper bound on the length of the array---N cannot increase without bound. Without that bound, it is not clear that (c) holds; the precision needed may well increase with the length of the array.Headley
I think you are confused: if the precision needed increases with the length of the array, and the length of the array is limited, then there is a limit to the precision required.Electuary
Not confused at all. I'm saying that you've rewritten the question in a way that allows you to determine a limit to the precision, but that you can't apply that precision to the question actually asked.Headley
(a) All of the solutions here rely on rewriting the question into something answerable. (b) Given the usual definition of both array and array of integers in computing machinery, I don't accept that assuming finite representations of the integers or finite arrays represents rewriting the question.Electuary
Even if we want to assume infinite precision for free this does not work. For arrays [a,b,c,d] and [w,x,y,z], there are assignments of values such that a+b+c+d = w+x+y+z and log(a)+log(b)+log(c)+log(d) = log(w)+log(x)+log(y)+log(z) where the arrays are not equal.Heraclitean
@Electuary You've done more than assume a finite array---you've assumed a finite array whose maximum length is independent of N.Headley
When you say 'N', I take it you mean the size of the set of (finite-width) integers. I don't see why you are imputing to me the assumption that the length of the array is independent of N. Indeed, because I assume that there are no duplicates, the maximum size of the arrays would be N.Electuary
@sshannin: This no doubt reveals the limits of my mathematical powers, but I can't think of a scheme for generating integers that would defeat the algorithm above. Perhaps you'd like to outline one?Electuary
I just asked Wolfram to solve for non-identity solutions. It's a bit messy, and doesn't paste well, but you can check there.Heraclitean
@Electuary - Now take N to infinity while keeping the number of bits in the integers fixed. Of course it can't be done with unique integers in the array; the number of bits is what determines the maximum length of the array. You're trying to pass off a finite-sized problem as describing the asymptotic behavior as the size of the problem increases to infinity.Headley
Michael - I don't understand why you consider this to be a problem. When dealing with arrays and integers in real computing machinery we always have a finite sized problem. This doesn't invalidate anything.Electuary
@Electuary Any given problem is finite, but the asymptotic behavior describes how the time and space behavior of the algorithm scales with the size of the problem. By putting an upper bound on the length of the array (i.e., the size of the problem), nothing meaningful is said about the problem as the length goes to infinity. You could call it O(1) at that point, as there would be a strict upper bound on the time and space requirements of the problem. It's completely meaningless.Headley
Michael - So, you believe that computer scientists should never think about bounded problems? You can analyze the complexity of this problem towards infinity as long as you posit that the integers will grow in size. I really don't see what point you're trying to make.Electuary
@Electuary You need to take the size of the integers into account in your analysis. Summing the array will be exponential in the number of bits. Nor will it be pseudo-linear in the length of the array, since the cost of addition will increase with length of the array, going as the number of bits needed to produce an array of length N, which is log(N). Summing the logarithms is a more difficult assessment, as is determining the precision needed for the logarithm function to ensure that no wrong answers result---I'd be amazed if it were easier than just summing up the numbers.Headley
@Electuary let us continue this discussion in chatHeadley
