Detecting repetition with infinite input
Asked Answered
P

5

1

What is the most optimal way to find repetition in a infinite sequence of integers?

i.e. if in the infinite sequence the number '5' appears twice then we will return 'false' the first time and 'true' the second time.

In the end what we need is a function that returns 'true' if the integer appeared before and 'false' if the function received the integer the first time.

If there are two solutions, one is space-wise and the second is time-wise, then mention both. I will write my solution in the answers, but I don't think it is the optimal one.

edit: Please don't assume the trivial cases (i.e. no repetitions, a constantly rising sequence). What interests me is how to reduce the space complexity of the non-trivial case (random numbers with repetitions).

Poyssick answered 17/2, 2010 at 9:39 Comment(1)
Are the numbers allways integer numbers? Which range have the numbers?Subminiature
C
1

I'd use the following approach:

Use a hash table as your datastructure. For every number read, store it in your datastructure. If it's already stored before you found a repetition.

If n is the number of elements in the sequence from start to the repetition, then this only requires O(n) time and space. Time complexity is optimal, as you need to at least read the input sequence's elements up to the repetition point.

How long of a sequence are we talking (before the repetition occurs)? Is a repetition even guaranteed at all? For extreme cases the space complexity might become problematic. But to improve it you will probably need to know more structural information on your sequence.

Update: If the sequence is as you say very long with seldom repetitions and you have to cut down on the space requirement, then you might (given sufficient structural information on the sequence) be able to cut down the space cost.

As an example: let's say you know that your infinite sequence has a general tendency to return numbers that fit within the current range of witnessed min-max numbers. Then you will eventually have whole intervals that have already been contained in the sequence. In that case you can save space by storing such intervals instead of all the elements contained within it.

Corrinnecorrival answered 17/2, 2010 at 9:47 Comment(1)
Regarding the last paragraph, this is more of a thought experiment than a real problem that I have, so I don't have a practical answer. But what really interests me is what people do in the more problematic case - a really infinite sequence (let's say you leave the system on for a very long time) with seldom repetitions.Poyssick
S
1

A BitSet for int values (2^32 numbers) would consume 512Mb. This may be ok if the BitSets are allocated not to often, fast enough and the mem is available.

An alternative are compressed BitSets that work best for sparse BitSets.

Subminiature answered 17/2, 2010 at 14:19 Comment(0)
E
1

Actually, if the max number of values is infinite, you can use any lossless compression algorithm for a monochrome bitmap. IF you imagine a square with at least as many pixels as the number of possible values, you can map each value to a pixel (with a few to spare). Then you can represent white as the pixels that appeared and black for the others and use any compression algorithm if space is at a premium (that is certainly a problem that has been studied)

You can also store blocks. The worst case is the same in space O(n) but for that worst case you need that the number appeared have exactly 1 in between them. Once more numbers appear, then the storage will decrease: I will write pseudocode and I will use a List, but you can always use a different structure

List changes // global

boolean addNumber(int number):
  boolean appeared = false
  it = changes.begin()
  while it.hasNext():
    if it.get() < number:
      appeared != appeared
      it = it.next()
    else if it.get() == number:
      if !appeared: return true
      if it.next().get() == number + 1
        it.next().remove() // Join 2 blocks 
      else 
        it.insertAfter(number + 1)  // Insert split and create 2 blocks
      it.remove()
        return false
    else: // it.get() > number
      if appeared: return true
      it.insertBefore(number)
      if it.get() == number + 1:
        it.remove() // Extend next block
      else:
        it.insertBefore(number + 1)  
  }
  return false
}

What this code is the following: it stores a list of blocks. For each number that you add, it iterates over the list storing blocks of numbers that appeared and numbers that didn't. Let me illustrate with an example; I will add [) to illustrate which numbers in the block, the first number is included, the last is not.In the pseudocode it is replaced by the boolean appeared. For instance, if you get the 5, 9, 6, 8, 7 (in this order) you will have the following sequences after each function:

[5,6)

[5,6),[9,10)

[5,7),[9,10)

[5,7),[8,10)

[5,10)

In the last value you keep a block of 5 numbers with only 2.

Exhale answered 28/1, 2012 at 3:17 Comment(0)
P
0

Return TRUE

If the sequence is infinite then there will be repetition of every conceivable pattern.

If what you want to know is the first place in the sequence when there is a repeated digit that's another matter, but there's some difference between your question and your example.

Pretorius answered 17/2, 2010 at 9:52 Comment(3)
Even with a finite set of numbers the only thing known is that at least one has duplicates for an infinite sequence.Subminiature
Well, let's say that the first time we meet a new number then the system needs to do something...Poyssick
@SF good point, I meant that the digits would repeat with any pattern you care to mention. Actually, what I really meant, of course, is that OP's question is vague about what the desired function is to return.Pretorius
P
0

Well, it seems obvious that in any solution we'll need to save the numbers that already appeared, so space wise we will always have a worst-case of O(N) where N<=possible numbers with the word size of our number type (i.e. 2^32 for C# int) - this is problematic over a long time if the sequence is really infinite/rarely repeats itself.

For saving the numbers that already appeared I would use an hash table and then check it each time I receive a new number.

Poyssick answered 17/2, 2010 at 9:57 Comment(4)
2^32 is only 512Mb in a BitSet. :-)Subminiature
The hash table has to much overhead in any case. Arrays for bucket arrays and entry objects. This adds up.Subminiature
I do not consider it obvious that saving n numbers requires O(n) space. This heavily depends on the structure of your sequence. If the sequence is given by s[i] = i, i.e. numbers are steadily increased by one, then you can store all appeared numbers in constant space by just storing the largest witnessed number.Corrinnecorrival
Thomas, could you elaborate more on the alternatives? Frank, you are right, but is it correct to say that in the general case (let's say receiving random integers) it MUST be O(N)?Poyssick

© 2022 - 2024 — McMap. All rights reserved.