Find the indices where a sorted list of integers changes

Assuming a sorted list of integers as below:

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
# [1, 1, 1, 4, 4, 4, 4, 4, 5, 5, 9, 9, 9]

I want to find the indices where the values change, i.e.

[3, 8, 10, 13]

One approach is to use itertools.groupby:

from itertools import groupby

cursor = 0
result = []
for key, group in groupby(data):
    cursor += sum(1 for _ in group)
    result.append(cursor)
print(result)

Output

[3, 8, 10, 13]

This approach is O(n). Another possible approach is to use bisect.bisect_left:

from bisect import bisect_left

cursor = 0
result = []
while cursor < len(data):
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(result)

Output

[3, 8, 10, 13]

This approach is O(k*log n) where k is the number of distinct elements. A variant of this approach is to use an exponential search.
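
For reference, a minimal sketch of the exponential (galloping) search variant, reusing data from above and assuming integer values; the helper name gallop_right is just for illustration:

from bisect import bisect_left

def gallop_right(data, target, lo, hi):
    # Grow the step size until we bracket the first element >= target,
    # then finish with a binary search inside that bracket.
    step = 1
    while lo + step < hi and data[lo + step] < target:
        lo += step
        step *= 2
    return bisect_left(data, target, lo, min(lo + step, hi))

cursor = 0
result = []
while cursor < len(data):
    cursor = gallop_right(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(result)  # [3, 8, 10, 13]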

Is there any faster or more performant way of doing this?

Cowskin answered 23/11, 2021 at 9:16 Comment(8)
What type of "more performant" are you looking for? If you're looking for asymptotic complexity, I think you found the optimal solutions already. The tradeoff between O(n) and O(k*log n) depends on k, which you probably don't know upfront. If you're looking for the fastest algorithm in practice, though, I would expect that a simple for loop that checks elements against the previous one would be fastest, unless k is a lot smaller than n. But you would need to benchmark that to be sure.Rinse
I'm more interested in runtime, but asymptotic complexity could be nice also. You can assume k will be a lot smaller than n.Cowskin
In that case, I would expect the exponential search and bisect approach to be fastest. Somehow the exponential search feels faster because you're only moving forward in the array but I think that's just my intuition messing with me. The bisect approach is probably simpler to implement (since you already have it and it's just a couple of lines), so I'd go for that. But I'd love to see someone run a benchmark to make sure.Rinse
Why is 13 in the expected output? Either it should not be there, or 0 should be included for full symmetry. But taking "indices where the values changes" literally, 13 should not be there, as it is not a valid index.Cantilever
@Cantilever Apologies. I did not express myself correctly, but 13 should be part of the output.Cowskin
Just trying to understand: then why not 0?Cantilever
I'm using this as a sub-routine of a larger algorithm, but yeah, 0 can be part of the output.Cowskin
I think bisect(data, data[cursor], cursor) also works and is nicer (less code and doesn't need the values to be integers).Consternation
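
As an aside, a minimal sketch of the plain forward loop suggested in the first comment (compare each element with its predecessor); it reuses data from the question, is O(n) like groupby, and is expected to have less per-element overhead:

result = []
prev = data[0]  # assumes a non-empty list
for i, x in enumerate(data):
    if x != prev:
        result.append(i)
        prev = x
result.append(len(data))
print(result)  # [3, 8, 10, 13]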

When it comes to asymptotic complexity, I think you can improve slightly on the binary search on average by applying a more evenly spread divide-and-conquer approach: first pinpoint the value change that occurs closest to the middle of the input list, thereby splitting the range into approximately two halves, which shortens the subsequent binary search paths by about one step each.

Yet, because this is Python, the gain might not be noticeable because of the Python-code overhead (yield, yield from, the recursion, ...). It might even perform worse for the list sizes you work with:

from bisect import bisect_left

def locate(data, start, end):
    # An empty range or a block of identical values contains no change point.
    if start >= end or data[start] == data[end - 1]:
        return
    mid = (start + end) // 2
    val = data[mid]
    if val == data[start]:
        # No change in data[start:mid + 1]; look for the next larger value
        # instead (this relies on the values being integers).
        start = mid
        val += 1
    # i is a change point: the first index in (start, end) with a value >= val.
    i = bisect_left(data, val, start + 1, end)
    yield from locate(data, start, i)  # change points left of i
    yield i
    yield from locate(data, i, end)    # change points right of i

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
print(*locate(data, 0, len(data)))  # 3 8 10

Note that this only outputs valid indices, so 13 is not included for this example input.
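
If the trailing len(data) from the question's expected output is wanted as well, it can simply be appended, e.g.:

print([*locate(data, 0, len(data)), len(data)])  # [3, 8, 10, 13]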

Cantilever answered 23/11, 2021 at 11:2 Comment(12)
Out of curiosity what would be the complexity of this approach? O(log n)?Cowskin
No, it could never be sublinear, because if k==n, then the output is O(n). I'm not sure yet about the actual complexity. The worst case would still be O(klogn), but not sure about the average case... Maybe a mathematician could help out here :-)Cantilever
I think you can skip the bisect step and just split at the middle each time, yielding i if the elements on either side are different. What do you think?Cowskin
Should be possible, but then you will have more recursive calls.Cantilever
@DaniMesejo's idea was actually correct. It will not result in more recursive calls. For each block (defined as a maximal run of equal values), the cost is log(block size). But it can be proved that the sum of them will not exceed O(k).Perambulate
That's quite a claim, @lavin, and would need proof. Let k=2 and n large, then it takes O(n) to find the index where the second block starts, so O(k) does not cover it.Cantilever
Why is it O(n) not O(logn) to find the second block? But still thanks for the point. I guess my answer was incomplete in this sense. It should be changed to O(max(k, log(largest block size))) instead. Thanks!Perambulate
Yes I meant O(logn)Cantilever
@Cantilever so more precisely, what I wanted to say was, if we fix k, then the complexity grows in O(log(n)) time; if we fix n, then the complexity grows linearly with k as O(k); if we take both n and k as free variables, then it becomes O(max(k, log(n))).Perambulate
Yes, or equivalently O(k + logn)Cantilever
Revised my answer. I think it may still have issues. I'm just too old for this :'( Maybe someone can take it from here or just downvote it.Perambulate
@Perambulate Btw the benchmark in your answer was wrong, rather obvious that 8.344650268554688e-06 seconds for data2 can't possibly be correct. Your function was a generator and you didn't consume its result, so what you measured was merely the time to create the iterator.Consternation
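
For completeness, a minimal sketch of the bisect-free variant discussed in the comments above (split at the middle, recurse into both halves, prune blocks of identical values); the name locate_split is just for illustration, and unlike the integer trick in the answer it works for any sorted values:

def locate_split(data, start, end):
    # A range whose first and last values match contains no change point.
    if end - start < 2 or data[start] == data[end - 1]:
        return
    mid = (start + end) // 2
    yield from locate_split(data, start, mid)
    if data[mid - 1] != data[mid]:
        yield mid
    yield from locate_split(data, mid, end)

data = [1] * 3 + [4] * 5 + [5] * 2 + [9] * 3
print(*locate_split(data, 0, len(data)))  # 3 8 10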

I tested the execution time of your approaches on two sets of data and added a third one using numpy.

import time
import numpy as np
from itertools import groupby
from bisect import bisect_left

data1 = [1] * 30000000 + [2] * 30000000 + [4] * 50000000 + [5] * 20000000 + [7] * 40000000 + [9] * 30000000 + [11] * 10000000 + [15] * 30000000
data2 = list(range(10000000))

cursor = 0
result = []
start_time = time.time()
for key, group in groupby(data):
    cursor += sum(1 for _ in group)
    result.append(cursor)
print(f'groupby {time.time() - start_time} seconds')

cursor = 0
result = []
start_time = time.time()
while cursor < len(data):
    cursor = bisect_left(data, data[cursor] + 1, cursor, len(data))
    result.append(cursor)
print(f'bisect_left {time.time() - start_time} seconds')

data = np.array(data)
start_time = time.time()
[i + 1 for i in np.where(data[:-1] != data[1:])[0]] + [len(data)]
print(f'numpy {time.time() - start_time} seconds')

# We need to iterate over the results array to add 1 to each index for your expected results.

With data1

groupby 8.864859104156494 seconds
bisect_left 0.0 seconds
numpy 0.27180027961730957 seconds

With data2

groupby 3.602466583251953 seconds
bisect_left 5.440978765487671 seconds
numpy 2.2847368717193604 seconds

As you mentioned, bisect_left depends very much on the number of unique elements, but it seems numpy performs better than itertools.groupby even with the additional iteration over the indices list.

Aggappera answered 23/11, 2021 at 10:58 Comment(1)
Your numpy timing is odd, like you can't decide whether you're using numpy or not. I'd either include the data = np.array(data) in the time or take the index list comp out (for the question as stated it should be the former, but the latter would be interesting in addition).Consternation
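
For illustration, a possible variant of the numpy measurement that counts the list-to-array conversion in the timed section, as the comment suggests (reusing the imports and data from the answer above):

start_time = time.time()
arr = np.array(data)  # conversion is now part of the measured time
result = [i + 1 for i in np.where(arr[:-1] != arr[1:])[0]] + [len(arr)]
print(f'numpy incl. conversion {time.time() - start_time} seconds')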

Since you said "I'm more interested in runtime", here are some faster replacements for cursor += sum(1 for _ in group) of your groupby solution.

Using @Guy's data1 but all segment lengths divided by 10:

             normal  optimized
original     870 ms  871 ms
zip_count    612 ms  611 ms
count_of     344 ms  345 ms
list_index   387 ms  386 ms
length_hint  223 ms  222 ms

Using list(range(1000000)) instead:

             normal  optimized
original     385 ms  331 ms
zip_count    463 ms  401 ms
count_of     197 ms  165 ms
list_index   175 ms  143 ms
length_hint  226 ms  127 ms

The optimized versions use more local variables or list comprehensions.

I don't think __length_hint__ is guaranteed to be exact, not even for a list iterator, but it appears to be (passes my correctness checks) and I don't see why it wouldn't.

The code (Try it online!, but you'll have to reduce something to not exceed the time limit):

from timeit import default_timer as timer
from itertools import groupby, count
from collections import deque
from operator import countOf

def original(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        cursor += sum(1 for _ in group)
        result.append(cursor)
    return result

def original_opti(data):
    cursor = 0
    sum_ = sum
    return [cursor := cursor + sum_(1 for _ in group)
            for _, group in groupby(data)]

def zip_count(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        index = count()
        deque(zip(group, index), 0)
        cursor += next(index)
        result.append(cursor)
    return result

def zip_count_opti(data):
    cursor = 0
    result = []
    append = result.append
    count_, deque_, zip_, next_ = count, deque, zip, next
    for key, group in groupby(data):
        index = count_()
        deque_(zip_(group, index), 0)
        cursor += next_(index)
        append(cursor)
    return result

def count_of(data):
    cursor = 0
    result = []
    for key, group in groupby(data):
        cursor += countOf(group, key)
        result.append(cursor)
    return result

def count_of_opti(data):
    cursor = 0
    countOf_ = countOf
    result = [cursor := cursor + countOf_(group, key)
              for key, group in groupby(data)]
    return result

def list_index(data):
    cursor = 0
    result = []
    for key, _ in groupby(data):
        cursor = data.index(key, cursor)
        result.append(cursor)
    del result[0]
    result.append(len(data))
    return result

def list_index_opti(data):
    cursor = 0
    index = data.index
    groups = groupby(data)
    next(groups, None)
    result = [cursor := index(key, cursor)
              for key, _ in groups]
    result.append(len(data))
    return result

def length_hint(data):
    result = []
    it = iter(data)
    for _ in groupby(it):
        result.append(len(data) - (1 + it.__length_hint__()))
    del result[0]
    result.append(len(data))
    return result

def length_hint_opti(data):
    it = iter(data)
    groups = groupby(it)
    next(groups, None)
    n_minus_1 = len(data) - 1
    length_hint = it.__length_hint__
    result = [n_minus_1 - length_hint()
              for _ in groups]
    result.append(len(data))
    return result

funcss = [
    (original, original_opti),
    (zip_count, zip_count_opti),
    (count_of, count_of_opti),
    (list_index, list_index_opti),
    (length_hint, length_hint_opti),
]

data1 = [1] * 3 + [2] * 3 + [4] * 5 + [5] * 2 + [7] * 4 + [9] * 3 + [11] * 1 + [15] * 3
data1 = [x for x in data1 for _ in range(1000000)]
data2 = list(range(1000000))

for _ in range(3):
    for name in 'data1', 'data2':
        print(name)
        data = eval(name)
        expect = None
        for funcs in funcss:
            print(f'{funcs[0].__name__:11}', end='')
            for func in funcs:
                times = []
                for _ in range(5):
                    start_time = timer()
                    result = func(data)
                    end_time = timer()
                    times.append(end_time - start_time)
                print(f'{round(min(times) * 1e3):5d} ms', end='')
                if expect is None:
                    expect = result
                else:
                    assert result == expect
            print()
        print()
Consternation answered 24/11, 2021 at 16:7 Comment(0)
