find if two arrays contain the same set of integers without extra space and faster than NlogN

14

26

I came across this post, which reports the following interview question:

Given two arrays of numbers, find if the two arrays contain the same set of integers. Suggest an algorithm which can run faster than NlogN without extra space.

The best that I can think of is the following:

  1. (a) sort each array, and then (b) have two pointers moving along the two arrays and check whether you find different values ... but step (a) already has NlogN complexity :(

  2. (a) scan the shorter array and put its values into a map, and then (b) scan the second array and check if you find a value that is not in the map ... here we have linear complexity, but I use extra space (see the sketch below)
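For concreteness, a minimal sketch of option 2, assuming the inputs are std::vector<int>; note that the one-way scan described above only proves containment, so this version builds and compares both sets:

#include <unordered_set>
#include <vector>

// Checks whether two arrays contain the same *set* of integers.
// Linear expected time, but O(n) extra space for the hash sets,
// so it does not meet the "no extra space" requirement.
bool same_set(const std::vector<int>& a, const std::vector<int>& b) {
    std::unordered_set<int> sa(a.begin(), a.end());
    std::unordered_set<int> sb(b.begin(), b.end());
    return sa == sb;   // operator== on unordered_set compares contents, ignoring order
}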

... so, I can't think of a solution for this question.

Ideas?


Thank you for all the answers. I feel many of them are right, but I decided to choose ruslik's, because it offers an interesting option that I had not thought about.

Benyamin answered 14/7, 2011 at 9:39 Comment(9)
@TheHorse: I don't know, the question doesn't say it. But I think I see what you're aiming at: if we have 16-bit integers, then we may use my approach number 1 with a linear-time sorting algorithm like radix sort. Correct?Benyamin
Option 1 can have linear complexity with a counting or pigeonhole sort, but then it takes extra memory.Simms
@larsmans: correct, but then I would go with option 2, which is also linearBenyamin
Less, that's almost the same solution except you skip sorting the second array. For a moment, I thought you meant std::map/sorted tree, but a bitmap would do it in linear time.Simms
naming arrays A and B, is the N in NlogN size(A)+size(B) or max(size(A),size(B)) ?Indigent
@larsmans: yeah ... but still extra space, so that doesn't seem to be the correct answerBenyamin
@Marino: I don't know, we should ask the interviewer :) my guess is that N = max(size(A), size(B))Benyamin
s/Less/Yes/g (that was a weird typo)Simms
@Marino: asymptotically, that's exactly the same if both |A| and |B| are variable.Simms
11

You can try a probabilistic approach by choosing a commutative function for accumulation (e.g., addition or XOR) and a parametrized hash function.

unsigned addition(unsigned a, unsigned b);
unsigned hash(int n, int h_type);

unsigned hash_set(int* a, int num, int h_type){
    unsigned rez = 0;
    for (int i = 0; i < num; i++)
        rez = addition(rez, hash(a[i], h_type));  // commutative accumulation: element order doesn't matter
    return rez;
}

This way, the number of tries needed before you decide that the probability of a false positive is below a certain threshold does not depend on the number of elements, so the check will be linear.

EDIT: In the general case the probability of the sets being the same is very small, so this O(n) check with several hash functions can be used for prefiltering: to decide as fast as possible whether the arrays are surely different or whether there is some probability of them being equivalent, in which case a slow deterministic method should be used. The final average complexity will be O(n), but the worst case will have the complexity of the deterministic method.
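One way to put this together, as a rough sketch; the mixing function, the seeds, and the number of rounds below are illustrative choices of mine, not part of the answer:

#include <cstdint>
#include <cstddef>

// Illustrative parametrized hash: mixes the value with a per-round seed.
uint64_t hash_one(uint64_t x, uint64_t seed) {
    x += seed;
    x *= 0x9E3779B97F4A7C15ull;   // odd 64-bit constant, splitmix64-style
    x ^= x >> 31;
    return x;
}

// Commutative accumulation (here: addition), so element order is irrelevant.
uint64_t hash_multiset(const int* a, size_t n, uint64_t seed) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += hash_one((uint64_t)(int64_t)a[i], seed);
    return acc;
}

// Prefilter: if any round differs, the multisets are certainly different.
// If all rounds agree, a slow deterministic comparison is still needed to be sure.
bool probably_same_multiset(const int* a, size_t na, const int* b, size_t nb, int rounds = 4) {
    if (na != nb) return false;
    for (int r = 0; r < rounds; r++) {
        uint64_t seed = 1u + 12345u * (uint64_t)r;
        if (hash_multiset(a, na, seed) != hash_multiset(b, nb, seed))
            return false;
    }
    return true;
}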

Catacomb answered 14/7, 2011 at 12:26 Comment(2)
I don't understand how this solves the problem. Can't two arrays of different integers produce the same hash value? What if there is a collision? I don't see any guarantee in your solution.Ermaermanno
@Nawaz The hash can only guarantee that they are different. If the hashes are the same, you can use the explicit (slow) comparison. Read the "EDIT" portion carefully.Catacomb
6

You said "without extra space" in the question but I assume that you actually mean "with O(1) extra space".

Suppose that all the integers in the arrays are less than k. Then you can use in-place radix sort to sort each array in time O(n log k) with O(log k) extra space (for the stack, as pointed out by yi_H in comments), and compare the sorted arrays in time O(n log k). If k does not vary with n, then you're done.
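A sketch of the in-place binary (MSD) radix sort this relies on, assuming non-negative values below 2^31 so the sort can start at bit 30; handling negative values is left out:

#include <cstddef>
#include <utility>

// In-place MSD binary radix sort ("radix exchange sort") on a[lo, hi).
// Time O(n log k), extra space O(log k) for the recursion stack.
void radix_sort(int* a, size_t lo, size_t hi, int bit) {
    if (bit < 0 || hi - lo < 2) return;
    size_t i = lo, j = hi;                // partition on the current bit
    while (i < j) {
        if ((a[i] >> bit) & 1)
            std::swap(a[i], a[--j]);      // 1-bit: move to the right block
        else
            ++i;                          // 0-bit: already in place
    }
    radix_sort(a, lo, j, bit - 1);        // recurse on the 0-bit block
    radix_sort(a, j, hi, bit - 1);        // recurse on the 1-bit block
}

After sorting both arrays, e.g. with radix_sort(a, 0, n, 30), one linear pass comparing corresponding elements finishes the check.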

Germaine answered 14/7, 2011 at 12:33 Comment(9)
You need O(k) extra space for the stack. I know, that's still O(1), just wanted to mention it.Fewell
Oh yes, you're right, in-place radix sort recurses on the bins. I'll fix.Germaine
If the numbers are that big then you can't compare even two numbers in time O(n), let alone a whole array of them!Germaine
Why are we making the assumption that the values of the integers are bounded to a polynomial factor of n?Heraclitean
I guess this comes down to what model of computation one prefers to treat as realistic. I like models in which it takes O(k) time to perform an operation on a number with k bits. Models with constant-time operations on arbitrary-precision numbers seem unrealistic to me as I don't know how to program anything like them on the computers I have access to.Germaine
@Gareth: In general, the assumption is that for this type of problem, the size of the machine word will be scaled according to the data given. For example, if one is asked to determine whether an array of N elements is a permutation, the space taken up by the array elements is considered to be O(N) rather than O(NlgN). This assumption is reasonable not because it's "realistic", but because it affects most algorithms equally.Hildredhildreth
Regarding "I like models in which it takes O(k) time to perform an operation on a number with k bits": I don't, and neither do most of my colleagues in theoretical computer science. On commodity hardware, does it take four times as long to add 32-bit integers as 8-bit? No. The point of a model is to capture reality, and almost all processors offer no discount for operating on quantities smaller than a word. Since it's usually the case that 1 word = 1 pointer, the most common assumption is that the word size suffices to index main memory: Theta(log n) bits, with unit-cost operations.Wafer
This comment box is a bit small to have this argument, but look: if you happen to know that all your integers are 32 bits long, then the operation time is O(32), which drops out of the big-O. So you'll get the same answer as me for the complexity of radix sort. But if you don't know that all your integers are bounded like this, then you better account for the operation time. On commodity hardware, it does take longer to operate on, say, 10000-bit integers than it does for 32-bit integers.Germaine
@Gareth, the assumption about number sizes is completely wrong. Sure, each given set has a supremum, and word sizes on real PCs are limited, but in algorithm analysis there is no such limitation. You may solve some problems with the previous approaches by bounding the problem, but that is not an answer for the general case.Burghley
3

I'll assume that the integers in question are of fixed size (eg. 32 bit).

Then, radix-quicksorting both arrays in place (aka "binary quicksort") is constant space and O(n).

In the case of unbounded integers, I believe (but cannot prove, even though it is probably doable) that you cannot break the O(n k) barrier, where k is the number of digits of the greatest integer in either array.

Whether this is better than O(n log n) depends on how k is assumed to scale with n, and therefore depends on what the interviewer expects of you.

Allanallana answered 14/7, 2011 at 12:38 Comment(0)
2

A special, but no harder, case is when one array holds 1, 2, ..., n. This has been discussed many times, and despite many attempts no deterministic solution using O(1) space and O(n) time has been shown. Either you can cheat the requirements in some way (reuse the input space, assume the integers are bounded) or use a probabilistic test.

Probably this is an open problem.

Iliac answered 14/7, 2011 at 16:21 Comment(0)
1

Here is a co-RP algorithm:

In linear time, iterate over the first array (A), building the polynomial Pa = (A[0] - x)(A[1] - x)...(A[n-1] - x). Do the same for array B, naming this polynomial Pb.

We now want to answer the question "is Pa = Pb?" We can check this probabilistically as follows. Select a number r uniformly at random from the range [0...4n] and compute d = Pa(r) - Pb(r) in linear time. If d = 0, return true; otherwise return false.

Why is this valid? First of all, observe that if the two arrays contain the same elements, then Pa = Pb, so Pa(r) = Pb(r) for all r. With this in mind, we can easily see that this algorithm will never erroneously reject two identical arrays.

Now we must consider the case where the arrays are not identical. By the Schwartz-Zippel lemma, P(Pa(r) - Pb(r) = 0 | Pa != Pb) <= n/(4n+1), since a nonzero polynomial of degree at most n has at most n roots. So the probability that we accept the two arrays as equivalent when they are not is less than 1/4.
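A rough sketch of one round of this test, evaluated modulo a prime (as suggested in the comments below) so nothing overflows. The choice of prime, the RNG, and the single-round structure are my own; note that distinct integers that happen to collide modulo the prime could still fool this particular instantiation, so the prime should comfortably exceed both n and the range of the values:

#include <cstdint>
#include <cstddef>
#include <random>

const uint64_t P = 2147483647ull;   // example modulus: 2^31 - 1, a Mersenne prime

// Evaluates Pa(r) = prod_i (a[i] - r) modulo P, using 64-bit intermediates.
uint64_t eval_product(const int* a, size_t n, uint64_t r) {
    uint64_t prod = 1;
    for (size_t i = 0; i < n; i++) {
        int64_t v = (int64_t)a[i] % (int64_t)P;   // reduce a[i] into [0, P)
        if (v < 0) v += (int64_t)P;
        prod = prod * (((uint64_t)v + P - r) % P) % P;
    }
    return prod;
}

// One round: never rejects equal multisets; wrongly accepts unequal ones
// with probability at most n / P (plus the mod-P collision caveat above).
bool probably_equal(const int* a, const int* b, size_t n) {
    std::mt19937_64 gen(std::random_device{}());
    uint64_t r = std::uniform_int_distribution<uint64_t>(0, P - 1)(gen);
    return eval_product(a, n, r) == eval_product(b, n, r);
}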

Heraclitean answered 14/7, 2011 at 12:41 Comment(10)
Building a polynomial is not O(1) space. If you don't build it but calculate it for each x then it's O(1). But still, the calculation is going to overflow very fast. I don't know exactly how overflow works on processors, but I think you will get different results if the order of the operands differs...Fewell
I was assuming by the way the question was worded that we were talking about more abstract models of computation that don't have to worry about overflow. Either way, we don't have to use machine primitives for the operations, we can use some arbitrary precision datatypes (I believe).Heraclitean
with arbitrary precision datatypes you lose the O(1) property of the basic operations like addition and multiplication.Fewell
Again, I was mostly working off the assumption for an abstract computer. If not, just give me a processor with sufficient bit-width to keep a running product.Heraclitean
Again, while you certainly have a witty response, you seem to keep ignoring the part where I say multiple times that this is for an abstract computer.Heraclitean
@yi_H let us continue this discussion in chatHeraclitean
I'm not ignoring it. You clearly started the sentence with "If not, ..."Fewell
You can do computation modulo some prime 2n < p < 4n, it should still be coRP and won't overflow.Iliac
The questions states "do the arrays contain the same set of integers". With A = {1, 1} and B = {1} your method will fail, since (X-1)^2 != X-1.Allanallana
So I thought about that but assumed that to just be ambiguity in the wording and hoped that he really meant multi-set (i.e. same elements but shuffled).Heraclitean
1

The usual assumption for these kinds of problems is Theta(log n)-bit words, because that's the minimum needed to index the input.

  1. sshannin's polynomial-evaluation answer works fine over finite fields, which sidesteps the difficulties with limited-precision registers. All we need is a prime of the appropriate size (easy to find under the same assumptions that support a lot of public-key crypto) or an irreducible polynomial in (Z/2)[x] of the appropriate degree (the difficulty here is multiplying polynomials quickly, but I think the algorithm would be o(n log n)).

  2. If we can modify the input with the restriction that it must maintain the same set, then it's not too hard to find space for radix sort. Select the (n/log n)th element from each array and partition both arrays. Sort the size-(n/log n) pieces and compare them. Now use radix sort on the size-(n - n/log n) pieces. From the previously processed elements, we can obtain n/log n bits, where bit i is on if a[2*i] > a[2*i + 1] and off if a[2*i] < a[2*i + 1]. This is sufficient to support a radix sort with n/(log n)^2 buckets.

Mellicent answered 14/7, 2011 at 16:5 Comment(0)
1

In the algebraic decision tree model, there are known Omega(NlogN) lower bounds for computing set intersection (irrespective of the space limits).

For instance, see here: http://compgeom.cs.uiuc.edu/~jeffe/teaching/497/06-algebraic-tree.pdf

So unless you do clever bit manipulations/hashing type approaches, you cannot do better than NlogN.

For instance, if you used only comparisons, you cannot do better than NlogN.

Mickiemickle answered 15/7, 2011 at 6:17 Comment(0)
0

Why not find the sum, product, and XOR of all the elements of one array and compare them with the corresponding values for the other array?

The XOR of the elements of both arrays may give zero in a case like

2,2,3,3 1,1,2,2

But what if the XORs of the elements of the two arrays turn out to be equal?

consider this

10,3 12,5

Here the XOR of both arrays is the same: (10^3) = (12^5) = 9, but their sums and products are different. I think two different sets of elements cannot have the same sum, product, and XOR! This can be analysed by simple bit-value examination. Is there anything wrong with this approach?
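For what it's worth, a sketch of the comparison described above; this only shows the check itself, and matching fingerprints do not by themselves prove that the multisets are equal (the product also wraps around modulo 2^64):

#include <cstdint>
#include <cstddef>

struct Fingerprint {
    uint64_t sum = 0, prod = 1, xr = 0;
    bool operator==(const Fingerprint& o) const {
        return sum == o.sum && prod == o.prod && xr == o.xr;
    }
};

// Order-independent fingerprint of an array: sum, product and XOR of
// its elements, all in wrap-around 64-bit arithmetic.
Fingerprint fingerprint(const int* a, size_t n) {
    Fingerprint f;
    for (size_t i = 0; i < n; i++) {
        uint64_t x = (uint64_t)(int64_t)a[i];
        f.sum += x;
        f.prod *= x;
        f.xr ^= x;
    }
    return f;
}

Two arrays can then be compared with fingerprint(a, n) == fingerprint(b, n).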

Exhibitioner answered 14/7, 2011 at 9:39 Comment(0)
0

You can break the O(n*log(n)) barrier if you have some restrictions on the range of numbers. But it's not possible to do this if you cannot use any extra memory (you need really silly restrictions to be able to do that).

I would also like to note that even O(n log(n)) with sorting is not trivial if you have an O(1) space limit: merge sort uses O(n) space, and quicksort (which is not even strictly O(n log(n)) in the worst case) needs O(log(n)) space for the stack. You have to use heapsort or smoothsort.

Some companies like to ask questions which cannot be solved, and I think it is good practice: as a programmer you have to know both what's possible and how to code it, and also what the limits are, so you don't waste your time on something that isn't doable.

Check this question for a couple of good techniques to use: Algorithm to tell if two arrays have identical members

Fewell answered 14/7, 2011 at 10:41 Comment(0)
0

For each integer i, check that the numbers of occurrences of i in the two arrays are either both zero or both nonzero, by iterating over the arrays.

Since the number of integers is constant the total runtime is O(n).

No, I wouldn't do this in practice.

Gigantean answered 14/7, 2011 at 11:40 Comment(7)
just a friendly qn: are you sure that is O(n) and not O(n^2)? cos you will need 2 for loops to accomplish the comparison..Steelmaker
Dhruv: It is (A+B)N (where N is the size of the set of ints, and A and B are the lengths of the arrays), which is O(n)Electuary
But in the general case A and B are O(n), making the whole algorithm O(n²) as indicated by Dhruv.Germaine
ah, right. i thought he was running a comparison-based algo, comparing indexes... but still, that means he has to check all integers in the range [-2^31, 2^31-1], which seems expensive...Steelmaker
But in that sense every algorithm on a 32-bit machine is O(1), since there are a constant number (2^32^(2^32)) machine states, hence at most (2^32^(2^32))! execution sequences... This is a decidedly un-practical notion of big-O!Germaine
You wouldn't say anything is wrong with it if the numbers were limited to the range [0,20], right? The constant factor is just ridiculously large. Asymptotic growth is not the silver bullet for choosing the right algorithm.Fewell
I agree! I just think that if you're going to use the finiteness of computers (rather than an actual argument about practicality) to nitpick at the definition of big-O, then why not nitpick in the most general way?Germaine
0

I was just wondering whether there is a way you could hash a cumulative value of each array and compare the two, assuming the hash function doesn't produce collisions for two differing patterns.

Prut answered 19/7, 2011 at 18:31 Comment(0)
-1

I'm not sure that I correctly understood the problem, but if you are interested in the integers that appear in both arrays:

If N >>>>> 2^SizeOf(int) (the number of bits in an integer: 16, 32, 64), there is one solution:

a = Array(N); //length(a) = N
b = Array(M); //length(b) = M
//x86-64: an integer consists of 64 bits.

//The first 2^64 / 64 slots of a are reused as a bitmap below, so
//compare that prefix against b by brute force before overwriting it.
for i := 0 to 2^64 / 64 - 1 do  //very big, but CONST
  for k := 0 to M - 1 do
    if a[i] = b[k] then doSomething;   //detected

//Mark each remaining value of a in the bitmap (one bit per possible value).
for i := 2^64 / 64 to N - 1 do
  if not isSetBit(a[a[i] div 64], a[i] mod 64) then
    setBit(a[a[i] div 64], a[i] mod 64);

//Look up each value of b in the bitmap.
for i := 0 to M - 1 do
  if isSetBit(a[b[i] div 64], b[i] mod 64) then doSomething; //detected

O(N), without additional structures

Encarnacion answered 14/7, 2011 at 10:36 Comment(2)
a[i] where i goes up to 2^64 ? doesn't make sense. also what's that detected thing there? you have to check that those two arrays contain the same set of numbers.Fewell
Yeah, I did not understand the task correctly. My solution finds all integers which appear in both arrays. About 2^64: it is only theoretical.Encarnacion
-1

All I know is that comparison-based sorting cannot possibly be faster than O(NlogN), so we can eliminate most of the "common" comparison-based sorts. I was thinking of doing a bucket sort. Perhaps if this question were asked in an interview, the best response would first be to clarify what sort of data those integers represent. For example, if they represent a person's age, then we know that the range of values is limited, and we can use bucket sort in O(n). However, this will not be in place....

Steelmaker answered 14/7, 2011 at 12:12 Comment(2)
just a terminology note: by "common" sorts you mean comparison-based sorts.Fewell
okay then maybe i deserve the second -1.. lol, oh well, life's too short to be pissed, huh? cheers!Steelmaker
-2

If the arrays have the same size, and there are guaranteed to be no duplicates, sum each of the arrays. If the sum of the values is different, then they contain different integers.

Edit: You can then sum the logs of the entries in the arrays. If that is also the same, then you have the same entries in the arrays.

Electuary answered 14/7, 2011 at 11:42 Comment(17)
A bit more thought, in response to the edit: I'm still not convinced. You'd need to take into account the number of digits of precision needed for the logs to ensure that you couldn't construct a counter example. How does that scale with the number of integers? A deeper problem is the "no duplicates" assumption: you have either limited the array to finite length or you need to allow arbitrary precision integers, which breaks the assumption of O(1) mathematical operations.Headley
(a) I've assumed that the arrays are finite, because if they are not finite, they are streams, rather than arrays, and we would need more information about them anyway. (b) I assume that the number of integers is finite, i.e. we are dealing with fixed-point representations, not mathematical integers. (c) with fixed-width integers, we can work out the necessary precision of the representation of the logs.Electuary
Let me restate: by assuming distinct, fixed-length integers, you have imposed a finite upper bound on the length of the array---N cannot increase without bound. Without that bound, it is not clear that (c) holds; the precision needed may well increase with the length of the array.Headley
I think you are confused: if the precision needed increases with the length of the array, and the length of the array is limited, then there is a limit to the precision required.Electuary
Not confused at all. I'm saying that you've rewritten the question in a way that allows you to determine a limit to the precision, but that you can't apply that precision to the question actually asked.Headley
(a) All of the solutions here rely on rewriting the question into something answerable. (b) Given the usual definition of both array and array of integers in computing machinery, I don't accept that assuming finite representations of the integers or finite arrays represents rewriting the question.Electuary
Even if we want to assume infinite precision for free this does not work. For arrays [a,b,c,d] and [w,x,y,z], there are assignments of values such that a+b+c+d = w+x+y+z and log(a)+log(b)+log(c)+log(d) = log(w)+log(x)+log(y)+log(z) where the arrays are not equal.Heraclitean
@Electuary You've done more than assume a finite array---you've assumed a finite array whose maximum length is independent of N.Headley
When you say 'N', I take it you mean the size of the set of (finite-width) integers. I don't see why you are imputing to me the assumption that the length of the array is independent of N. Indeed, because I assume that there are no duplicates, the maximum size of the arrays would be N.Electuary
@sshannin: This no doubt reveals the limits of my mathematical powers, but I can't think of a scheme for generating integers that would defeat the algorithm above. Perhaps you'd like to outline one?Electuary
I just asked Wolfram to solve for non-identity solutions. It's a bit messy, and doesn't paste well, but you can check there.Heraclitean
@Electuary - Now take N to infinity while keeping the number of bits in the integers fixed. Of course it can't be done with unique integers in the array; the number of bits is what determines the maximum length of the array. You're trying to pass off a finite-sized problem as describing the asymptotic behavior as the size of the problem increases to infinity.Headley
Michael - I don't understand why you consider this to be a problem. When dealing with arrays and integers in real computing machinery we always have a finite sized problem. This doesn't invalidate anything.Electuary
@Electuary Any given problem is finite, but the asymptotic behavior describes how the time and space behavior of the algorithm scales with the size of the problem. By putting an upper bound on the length of the array (i.e., the size of the problem), nothing meaningful is said about the problem as the length goes to infinity. You could call it O(1) at that point, as there would be a strict upper bound on the time and space requirements of the problem. It's completely meaningless.Headley
Michael - So, you believe that computer scientists should never think about bounded problems? You can analyze the complexity of this problem towards infinity as long as you posit that the integers will grow in size. I really don't see what point you're trying to make.Electuary
@Electuary You need to take the size of the integers into account in your analysis. Summing the array will be exponential in the number of bits. Nor will it be pseudo-linear in the length of the array, since the cost of addition will increase with length of the array, going as the number of bits needed to produce an array of length N, which is log(N). Summing the logarithms is a more difficult assessment, as is determining the precision needed for the logarithm function to ensure that no wrong answers result---I'd be amazed if it were easier than just summing up the numbers.Headley
@Electuary let us continue this discussion in chatHeadley
