Given a string of a million digits, return all repeating 3-digit numbers
Question (139 votes, 13 answers)

I had an interview with a hedge fund company in New York a few months ago and, unfortunately, I did not get the internship offer as a data/software engineer. (They also asked for the solution to be in Python.)

I pretty much screwed up on the first interview problem...

Question: Given a string of a million digits (pi, for example), write a function/program that returns all repeating 3-digit numbers and their repetition counts, for counts greater than 1

For example, if the string was 123412345123456, then the function/program would return:

123 - 3 times
234 - 3 times
345 - 2 times

They did not give me the solution after I failed the interview, but they did tell me that the time complexity for the solution was constant of 1000 since all the possible outcomes are between:

000 --> 999

Now that I'm thinking about it, I don't think it's possible to come up with a constant time algorithm. Is it?

Vestryman answered 30/11, 2017 at 19:37 Comment(24)
If they think the solution is a constant of 1000, then that makes me think that they would have built all the three-digit numbers, and then regex searched for them. It's very common for people to think that operations they didn't actually write/see are "free". I'm pretty sure this would be linear to the length of the string.Silicone
@Silicone thanks for the response, what do you mean by regex searched them? Is that some sort of concept like Bitwise operations?Vestryman
No, regex (or Regular Expressions) are a string searching/matching tool. So what I mean is that they would have got all the three digit numbers as strings and then used a tool to find all the occurrences of those number strings within the longer string. You can read about the Python implementation of regex here: docs.python.org/3/library/re.html and the general concept here: aivosto.com/vbtips/regex.htmlSilicone
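For concreteness, the search Silicone describes might look like this (a sketch, not the interviewer's actual code; the lookahead makes the matches zero-width so overlapping occurrences are counted too):

import re

s = '123412345123456'
for n in range(1000):
    pattern = '%03d' % n
    count = len(re.findall('(?=%s)' % pattern, s))
    if count > 1:
        print('%s - %d times' % (pattern, count))

Note that this makes 1000 passes over the string, i.e. O(1000 * n): linear in the string length, exactly as the comment above says.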
Nitpickingly, if the input size is a constant, every algorithm is constant time ;-)Walcott
a constant of 1000 what? (additions? elephants?)Escapism
@Escapism - exactly, what's being counted is important. On the one hand it is possible the interviewer meant "linear in the length of the string". On the other hand, since the length of the string was fixed at 1M, the interviewer might have meant "O(1)" in terms of time (or space) on the accumulating data structure. E.g., since there are only 1000 three digit numbers use an array to hold the counts, not a linked list. BTW, OP, in a case like this it is always valid to ask the interviewer what is being counted as an operation.Escolar
Well, if the string length is constant (1M) and the substring/number length is constant (3), then technically every solution is constant time…Dreamworld
you can get somewhat close to constant time if you have enough parallel processing resources and can execute all the comparisons in parallel. obviously there is some contention on the atomic increments, but for random data this will approach a small logarithmic rate. also, for some of these kinds of algorithms, doing them in hardware makes some things possible that aren't in linear SW for the same reason.Semen
Reading this, three things stand out: 1) This was the very first question of the interview, 2) they specifically mention Pi (an irrational number) and 3) the ambiguous "constant of 1000". As the answers state it's not possible, making me wonder whether this was a trick question and whether, with a sufficiently long test string, all combinations could be expected to repeat an equal number of times. Was this an interactive question between you and an interviewer? Just a different perspective to the O(n) answers if you got the impression it had to be constant time.Zehe
"They did not give me the solution after I failed the interview, but they did tell me that the time complexity for the solution was constant of 1000 since all the possible outcomes are between: 000 --> 999" This was likely the actual test: to see if you could prove to them why this is not possible, and to show them the correct minimum time complexity.Emaemaciate
@TechMedicNYC: My thoughts too. Assuming the interviewer wasn't stupid, he probably wanted to see how long it would take OP to answer "Wait a minute! That's not possible!".Cystotomy
@roganjosh: You have a point. Every one of the 1000 substrings is present in the first 1E6 decimals of pi; the least frequent appears 898 times and the most frequent one appears 1092 times. Simply returning the 1000 substrings with a frequency of approximately 1000 times wouldn't be too wrong.Cystotomy
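Those numbers are straightforward to check, assuming the mpmath library is available (a sketch; computing a million digits can take a while without gmpy installed):

from collections import Counter
import mpmath

mpmath.mp.dps = 1_000_010                        # working precision, in digits
pi_str = mpmath.nstr(+mpmath.pi, 1_000_002)      # "3.14159..." with 1,000,001 decimals
decimals = pi_str.replace('.', '')[1:1_000_001]  # keep the first million decimals
c = Counter(decimals[i:i+3] for i in range(len(decimals) - 2))
print(c.most_common(1))       # most frequent triple
print(c.most_common()[-1])    # least frequent triple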
All this talk of O(n) vs O(1) reminds me of when I joked to a friend that Java's ArrayList#contains is technically O(1), since Java arrays are capped (by spec, not just implementation) at 2^31-1 elements. :-)Warehouseman
Obligatory honest description of a Job Interview.Defeasance
It could be done with O(1) constant time complexity, but you'd need huge space complexity: a lookup table with 10^6 inputs and 10^3 6-bit outputs, one for each 3-digit string combination xyz. So 10^9 bytes of storage, i.e. an SSD with 1Gb useable space. However to sum the intermediate results you'd also need a huge array of 1000 x depth-20 trees of adders to sum the partial results (can't use Wallace trees because they're O(N) time-complexity). Theoretically possible but in reality silly for a very regular map-reduce type problem like this.Compulsive
@smci: I don't get it. You'd still need to iterate at least once on each decimal of pi, if only to find it in the lookup table. Is your plan really to save the result for every number with 1E6 decimals? Your input size would be 10**(10**6) then.Cystotomy
@EricDuminil exactly my thinking, I've been thrown curve-balls before where the expectation is completely unknown but not in this field. They stress you out. In this case, I think it would be safer to assume that the answer is along the lines of "Did you want it O(1) <for clarification>? The closest you can get is an approximation that probabilistically gets better with longer inputs. However, here's a correct O(n) answer". Not safe to assume the asker is simply dumb as some answers suggest.Zehe
@ezzzCash Do you mean something like Demo?Swami
@lad2025 hey, thanks for taking the time to make a demo using SQL. I wasn't particularly looking for a demo, but if I were, I'd want it in Python (as I said above, and that answer has already been given). However, thanks for doing it in SQL. Always nice to see how it could also be implemented in a server-side language.Vestryman
Giving them the benefit of the doubt, I'd guess they mean O(1000 + n): Under any implementation reading the string is (at least) O(n) and if you play your cards right, the rest of it is O(1000) (iterating the buckets after you've finished). So if you assume every solution must read the input, and consider only how you store, aggregate and retrieve the answers, their statement makes sense. Otherwise, they are loonies, but that seems less likely than they are just not conveying their ideas effectively.Nipping
Yeah, I agree with your reasoning. Considering it was the first round and it was only 30 min, I was given 15 min to solve it, and I guess they were trying to rush it and explain it to me. I hold the most fault, but they weren't perfect either.Vestryman
Honestly, if they think it is constant 1000, you should be happy they didn't hire you. You need somebody who actually knows what they are doing at your first job to give you proper mentorship... Speaking from experience...Celanese
Yeah, I wasn't thinking that at first, but now, with all the helpful advice and answers, maybe the rejection was a good thing.Vestryman
In C I would do it as sums of absolute values of a circular convolution with [1,-1,0] and [0,1,-1] on a char ring buffer. No idea how to do that the best way in Python, though.Lindsy
Answer (score 169)

You got off lightly, you probably don't want to be working for a hedge fund where the quants don't understand basic algorithms :-)

There is no way to process an arbitrarily-sized data structure in O(1) if, as in this case, you need to visit every element at least once. The best you can hope for is O(n) in this case, where n is the length of the string.

Although, as an aside, a nominal O(n) algorithm will be O(1) for a fixed input size so, technically, they may have been correct here. However, that's not usually how people use complexity analysis.

It appears to me you could have impressed them in a number of ways.

First, by informing them that it's not possible to do it in O(1), unless you use the "suspect" reasoning given above.

Second, by showing your elite skills by providing Pythonic code such as:

inpStr = '123412345123456'

# O(1) array creation.
freq = [0] * 1000

# O(n) string processing.
for val in [int(inpStr[pos:pos+3]) for pos in range(len(inpStr) - 2)]:
    freq[val] += 1

# O(1) output of relevant array values.
print ([(num, freq[num]) for num in range(1000) if freq[num] > 1])

This outputs:

[(123, 3), (234, 3), (345, 2)]

though you could, of course, modify the output format to anything you desire.

And, finally, by telling them there's almost certainly no problem with an O(n) solution, since the code above delivers results for a one-million-digit string in well under half a second. It seems to scale quite linearly as well, since a 10,000,000-character string takes 3.5 seconds and a 100,000,000-character one takes 36 seconds.

And, if they need better than that, there are ways to parallelise this sort of stuff that can greatly speed it up.

Not within a single Python interpreter, of course, due to the GIL, but you could split the string into something like the following (the overlap, indicated by vv, is required to allow proper processing of the boundary areas):

    vv
123412  vv
    123451
        5123456

You can farm these out to separate workers and combine the results afterwards.
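A minimal multiprocessing sketch of that split-with-overlap scheme (function names are illustrative, not part of the original answer):

from multiprocessing import Pool

def count_triples(chunk):
    freq = [0] * 1000
    for pos in range(len(chunk) - 2):
        freq[int(chunk[pos:pos+3])] += 1
    return freq

def parallel_freq(s, workers=4):
    size = len(s) // workers
    bounds = [i * size for i in range(workers)] + [len(s)]
    # Each chunk overlaps the next by two characters so triples that
    # straddle a boundary are counted exactly once.
    chunks = [s[bounds[i]:bounds[i + 1] + 2] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(count_triples, chunks)
    return [sum(col) for col in zip(*partials)]

if __name__ == '__main__':
    freq = parallel_freq('123412345123456', workers=3)
    print([(n, c) for n, c in enumerate(freq) if c > 1])
    # [(123, 3), (234, 3), (345, 2)]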

The splitting of input and combining of output are likely to swamp any saving with small strings (and possibly even million-digit strings) but, for much larger data sets, it may well make a difference. My usual mantra of "measure, don't guess" applies here, of course.


This mantra also applies to other possibilities, such as bypassing Python altogether and using a different language which may be faster.

For example, the following C code, running on the same hardware as the earlier Python code, handles a hundred million digits in 0.6 seconds, roughly the same amount of time as the Python code processed one million. In other words, much faster:

#include <stdio.h>
#include <string.h>

int main(void) {
    static char inpStr[100000000+1];
    static int freq[1000];

    // Set up test data.

    memset(inpStr, '1', sizeof(inpStr));
    inpStr[sizeof(inpStr)-1] = '\0';

    // Need at least three digits to do anything useful.

    if (strlen(inpStr) <= 2) return 0;

    // Get initial feed from first two digits, process others.

    int val = (inpStr[0] - '0') * 10 + inpStr[1] - '0';
    char *inpPtr = &(inpStr[2]);
    while (*inpPtr != '\0') {
        // Remove hundreds, add next digit as units, adjust table.

        val = (val % 100) * 10 + *inpPtr++ - '0';
        freq[val]++;
    }

    // Output (relevant part of) table.

    for (int i = 0; i < 1000; ++i)
        if (freq[i] > 1)
            printf("%3d -> %d\n", i, freq[i]);

    return 0;
}
Disseminule answered 1/12, 2017 at 1:35 Comment(15)
This "fixed input size" really sounds like a bad joke either the interviewer or the interviewee didn't get. Every algorithm becomes O(1) is n is fixed or bounded.Cystotomy
If they need better than that, maybe they shouldn't be using Python, at least for the specific algorithm.Kissie
@paxidiablo, thanks so much for the detailed response. Not only did you tackle the problem and answer my question, but you also described the interview process and how to do well next time. I'll keep your advice in mind for the next interview. Also, why is it len(inpStr) - 2 (why minus 2)?Vestryman
@ezzzCash Because there may be overlap at the points where the string is being "broken up" when trying a parallel approach. Since you're looking for 3-digit groups, -2 allows the check on both parallel groupings to not miss a potentially valid match.Defeasance
@Ray, I still don't understand the idea. I think I'm lacking parallel programming knowledge. Is there a resource I can look into to help me understand why it's -2?Vestryman
@ezzzCash It's not a lack of parallel programming knowledge. Consider a string of length N. If you break it into two parts at position N/2, you still need to account for the fact that you could miss a valid 3-digit match at the "border", at the end of string1 and the beginning of string2. Thus, you need to check matches between string1[N/2-2] and string2[2] (using a zero-based index), etc. That's the idea.Defeasance
@ray Okay I see it. Gosh that's so simple, how did I not understand that aha. Thanks for taking the time to explain it to me :)Vestryman
With longer digit-sequences, there'd be something to gain from optimizing the conversion to integer with a sliding window that lets you drop the highest digit and add a new digit. (Python overhead would probably kill this, so it would only apply to C or other low-level implementations). val -= 100 * (d[i]-'0'); to drop the leading digit. val = 10*val + d[i+2]-'0' to accumulate a new least-significant digit (normal string->integer parsing). val % 100 is possibly non-horrible, but only if 100 is a compile-time constant so it doesn't use a real HW divide.Fair
Or maybe optimize the string->int with SIMD: #35127560. Or save temporary values during string->int conversion so you don't have to undo the most significant digit in the first place. (Of course, with longer substrings, the histogram size grows exponentially, so you'd want to start maybe making multiple passes over the array using only a single prefix to keep some memory locality in your histogram. But then there must be ways to optimize across multiple passes...)Fair
@PeterCordes - in one of my comments following my answer, I mentioned the option of getting 4 indexes at a time, num = int(str[i:i+6]) ... i += 4, (grab 6 digits at a time, advance by 4 digits to handle overlap) then using division and modulo to pick out the set of four 3 digit values. As you mentioned, this assumes that divide by constant is optimized to multiply + shift. Considering python can be 100+ times slower than C / C++, I'm not sure if division overhead in Python is that significant.Outspread
@rcgldr: That seems bad vs. saving temporary values (e.g. using 10*d4 + d5 as part of computing both 100*d3 + 10*d4 + d5 and (10*d4 + d5) * 10 + d6). Maybe in Python where the cost of gluing together operations is much higher than the cost of the operation itself. It's easy to say "if you care about performance don't use Python this way in the first place", but sometimes just finding an efficient way in Python is enough to not bother making a C or asm implementation. But note that actual HW division has ~30x (32-bit) or ~90x (64-bit int) the throughput cost of integer mul on x86.Fair
@PeterCordes - I had the impression that the main bottleneck is converting a string to an integer: int(str[start:end]), so I convert 6 digits at a time (although only advancing 4 at a time). My comment about the division overhead being masked by Python overhead is assuming that division (or modulo) by a constant was implemented as multiply + shift (+ subtract for modulo). I looked at a couple of web site benchmarks and python was about 100+ times slower than C. Perhaps this was an interpreted versus compiled version of Python. Java uses a virtual machine, but it's no where near as slow.Outspread
@rcgldr: Right, modern JVMs JIT-compile and do a reasonable job for hot functions. Most Python interpreters only interpret. IDK what the state of the art is, I don't use Python. Int->string overhead presumably depends on the length of the string, unless Python overhead is dwarfs it. For 3-digit numbers, it should be very cheap.Fair
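For concreteness, the "6 digits at a time, advance by 4" idea rcgldr describes might look like this in Python (a hedged sketch; the tail is handled the simple way):

def histogram_six_at_a_time(s):
    counts = [0] * 1000
    i = 0
    while i + 6 <= len(s):
        num = int(s[i:i+6])                # six digits in one conversion
        counts[num // 1000] += 1           # digits 0-2
        counts[(num // 100) % 1000] += 1   # digits 1-3
        counts[(num // 10) % 1000] += 1    # digits 2-4
        counts[num % 1000] += 1            # digits 3-5
        i += 4                             # windows overlap by two digits
    for j in range(i, len(s) - 2):         # remaining tail, one triple at a time
        counts[int(s[j:j+3])] += 1
    return counts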
How does freq = [inpStr.count(str(n)) for n in range(1000) if inpStr.count(str(n)) > 1] compare? Or using a Counter object?Heathheathberry
@Acccumulation, surely that's something you could test yourself. I'm assuming you didn't come up with that list comprehension as a Python beginner so I'd suspect you have an environment available to you :-)Disseminule
Answer (score 79)

Constant time isn't possible. All 1 million digits need to be looked at at least once, so that is a time complexity of O(n), where n = 1 million in this case.

For a simple O(n) solution, create an array of size 1000 that represents the number of occurrences of each possible 3 digit number. Advance 1 digit at a time, first index == 0, last index == 999997, and increment array[3 digit number] to create a histogram (count of occurrences for each possible 3 digit number). Then output the content of the array with counts > 1.
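In Python, that description translates to something like the following (a sketch; the answer itself is prose only):

def histogram_triples(s):
    counts = [0] * 1000                  # one bucket per possible 3-digit number
    for i in range(len(s) - 2):          # first index 0, last index len(s) - 3
        counts[int(s[i:i+3])] += 1
    return [(n, c) for n, c in enumerate(counts) if c > 1]

print(histogram_triples('123412345123456'))   # [(123, 3), (234, 3), (345, 2)]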

Outspread answered 30/11, 2017 at 19:53 Comment(19)
Okay thanks for the response. Would a dictionary work? the key would be the 3 digits repeating and the value would be the number of times it would repeat? As you go through the string, you would add new key value pairs to your dictionary (although, that dictionary would be quite long)Vestryman
@ezzzCash - yes a dictionary would work, but it's not needed. All possible "keys" are known in advance, limited to the range 0 to 999. The difference in overhead would be the time it takes to do a key based access using 3 character strings as keys, versus the time it takes to convert a 3 digit string to an index and then using the index to access the array.Outspread
If you want numeric tricks, you could also decide to go BCD and store the three digits in 12 bits. And decode ASCII digits by masking the low 4 bits. But that x-'0' pattern is not valid in Python, it's a C-ism (where characters are integers).Cirrocumulus
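In Python terms, that masking trick would look like this (a small illustrative aside, standing in for C's c & 0x0F):

digit = ord('7') & 0x0F            # ASCII '0'..'9' are 0x30..0x39, so low 4 bits = value
packed = (1 << 8) | (2 << 4) | 3   # three digits BCD-style in 12 bits: 0x123 encodes "123"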
A dictionary would be much slower than a simple array.Postdoctoral
@LorenPechtel: Dictionary lookups in Python are really fast. Granted, array access is even faster, so if we were dealing with integers from the beginning, you'd be right. However, in this case, we have 3-length strings, which we first have to convert to integers if we want to use them with arrays. It turns out that contrary to what one might first expect, the dictionary lookup is actually faster than the integer conversion + array access. The array solution is in fact 50% slower in this case.Iqbal
@AleksiTorhamo That's really weird. Converting a string to a number should not be more expensive than hashing it, which is what the dictionary lookup has to do.Kissie
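Claims like that are easy to measure directly; for example (a sketch; numbers vary by machine and Python version):

import timeit

setup = "d = {'%03d' % n: 0 for n in range(1000)}; a = [0] * 1000; s = '385'"
print(timeit.timeit("d[s]", setup=setup))       # dict lookup with a string key
print(timeit.timeit("a[int(s)]", setup=setup))  # int conversion + list indexing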
@AleksiTorhamo - Possible optimizations: initialize arr[] to 1000 zeroes, which would eliminate the need to check for empty elements. Pick up 4 indices at a time: | num = int(str[i:i+6]) | arr[num // 1000] += 1 | arr[num % 1000] += 1 | num = (num % 100000) // 10 | arr[num // 10] += 1 | arr[num % 1000] += 1 | i += 4 | ...Outspread
I guess one could argue that if the input number always has exactly 1 million digits, then that algorithm is O(1), with a constant factor of 1 million.Castled
@AleksiTorhamo - If the goal is to compare relative speeds of implementations for an algorithm, I would prefer a traditional language like C or C++, since Python is significantly slower and seems to have overheads unique to Python compared to other languages.Outspread
For purposes of pedantry - the problem is of constant complexity - the input is fixed length of 1,000,000 numbers. Therefore its time complexity is O(1) (and O(1000) and O(10^10^10^10) - these are all the same complexity). In an interview you may be able to score some points by being clear on this but also being able to articulate how your algorithm would scale with a variable length input.Overshoot
@rcgldr - Technically correct is the best kind of correct ;). I think since the interviewer introduced the notion of time complexity, this is an opportunity to show your knowledge of time complexity, and there is a big risk to saying: Here is my O(N) algorithm and have them correct you / mark you down for not understanding / arguing with them that the problem can be solved in constant time. I think the fullest, most correct interview response is to acknowledge this problem is solvable in constant time since the input is limited, then show your good solution that scales with variable input.Overshoot
@Castled - It doesn't matter if the input always has exactly 1 million digits, the term constant time only applies if the algorithm's time complexity is independent of the size of the input.Outspread
@Overshoot - It doesn't matter if the input always has exactly 1 million digits, the term constant time only applies if the algorithm's time complexity is independent of the size of the input.Outspread
@Outspread - You are missing an important distinction: if an algorithm is specified to only be able to take a constant size input (as is the case here) - its runtime is constant. What is the runtime of enumerating all legal configurations of a chessboard? Constant (because there are only a constant number of such configurations). What is the runtime of printing the position and velocity of every particle in the universe? Constant (because there are a finite number of particles). What is the runtime of printing the length of an arbitrary string? O(N) - because the string's length is not bounded.Overshoot
Let us continue this discussion in chat.Overshoot
You can think of it that way: Instead of having a loop over all the digits in the input string, you could replace the code with literally one million lines, each checking a different digit in the fixed-length input. But as @Overshoot said, that's really a bit pedantic and besides any practicality.Castled
As mentioned in paxdiablo's answer, I also consider this as "suspect reasoning" of what I consider mis-usage of the term "constant time".Outspread
@Outspread If the problem space has only one element, then technically all algorithms are independent of the input size.Heathheathberry
@AleksiTorhamo You should avoid saying "50% slower". It's ambiguous between "50% more time" and "100% more time".Heathheathberry
Answer (score 15)

A million is small for the answer I give below. Assuming only that you have to be able to run the solution in the interview without a pause, the following works in less than two seconds and gives the required result:

from collections import Counter

def triple_counter(s):
    c = Counter(s[n-3: n] for n in range(3, len(s) + 1))  # + 1 so the final triple is included
    for tri, n in c.most_common():
        if n > 1:
            print('%s - %i times.' % (tri, n))
        else:
            break

if __name__ == '__main__':
    import random

    s = ''.join(random.choice('0123456789') for _ in range(1_000_000))
    triple_counter(s)

Hopefully the interviewer would be looking for use of the standard library's collections.Counter class.

Parallel execution version

I wrote a blog post on this with more explanation.

Coast answered 1/12, 2017 at 10:54 Comment(7)
It works fine and seems to be the fastest non-numpy solution.Cystotomy
@EricDuminil, I don't think you should worry about having the fastest timings here, when most solutions given won't delay you much. Far better to show that you have a good grasp of the Python standard library and can write maintainable code in an interview situation, I would think. (Unless the interviewer stressed the time criticality, whereupon you should ask for actual timings before assessing what comes next.)Coast
We agree 100%. Though I'm not sure any answer is relevant at all if the interviewer really thinks it's possible to do in O(1).Cystotomy
If the interviewer stressed it was time critical, then, after profiling to confirm this is the limit, it may be time to write a C module to address this bottleneck. I have a script that saw an 84x improvement over python code after we switched to using a c module.Nipping
Hi @TemporalWolf, I read what you said then thought that another, faster, and scaleable solution might be to change it to a parallel algorithm so it could be run on many processes on a compute farm/cloud. You have to split the string into n sections; overlapping the last 3 chars of each section with its next section. Each section can then be scanned for triples independently, the triples summed, and the three char triple at the end of all but the last section subtracted out as it would have been double counted. I have the code, and will probably turn it into a blog post...Coast
Parallelizing, especially across multiple machines, incurs a significant overhead. For a string of 1M characters, it is likely overkill. The more we optimize for speed, the greater the maintenance burden of the code. So we need to justify that burden beforehand (via profiling).Nipping
Of course we need to measure before any optimisation that takes you away from pythonic code. Instead of just wondering about a multi core solution I decided to write the code so I get a first hand idea of the effort. Multi cores are everywhere!Coast
Answer (score 13)

The simple O(n) solution would be to count each 3-digit number:

for nr in range(1000):
    cnt = text.count('%03d' % nr)
    if cnt > 1:
        print('%03d is found %d times' % (nr, cnt))

This would search through all 1 million digits 1000 times.

Traversing the digits only once:

counts = [0] * 1000
for idx in range(len(text)-2):
    counts[int(text[idx:idx+3])] += 1

for nr, cnt in enumerate(counts):
    if cnt > 1:
        print('%03d is found %d times' % (nr, cnt))

Timing shows that iterating only once over the index is twice as fast as using count.

Mickel answered 30/11, 2017 at 19:54 Comment(5)
Is there a black friday discount on text.count()?Cystotomy
@EricDuminil You do have a good point but, since text.count is done in a high-speed compiled language (e.g. C) as opposed to slow python-level interpreted looping, yes there is a discount.Schizophyceous
It's very inefficient to count each number separately, but it's a constant number of passes (1000), thus still O(n).Postdoctoral
The option you proposed that uses count is incorrect, since it won't count overlapping patterns. Note that '111'.count('11') == 1 when we would expect it to be 2.Valuable
Also, your "simple O(n) solution" is actually O(10**d * n) with d the number of searched digits and n the total length of the string. The second one is O(n) time and O(10**d + n) space.Cystotomy
Answer (score 10)

Here is a NumPy implementation of the "consensus" O(n) algorithm: walk through all triplets and bin as you go. Upon encountering, say, "385", one is added to bin[3, 8, 5], which is an O(1) operation. Bins are arranged in a 10x10x10 cube. As the binning is fully vectorized, there is no loop in the code.

def setup_data(n):
    import random
    digits = "0123456789"
    return dict(text = ''.join(random.choice(digits) for i in range(n)))

def f_np(text):
    # Get the data into NumPy
    import numpy as np
    a = np.frombuffer(bytes(text, 'utf8'), dtype=np.uint8) - ord('0')
    # Rolling triplets
    a3 = np.lib.stride_tricks.as_strided(a, (3, a.size-2), 2*a.strides)

    bins = np.zeros((10, 10, 10), dtype=int)
    # Next line performs O(n) binning
    np.add.at(bins, tuple(a3), 1)
    # Filtering is left as an exercise
    return bins.ravel()

def f_py(text):
    counts = [0] * 1000
    for idx in range(len(text)-2):
        counts[int(text[idx:idx+3])] += 1
    return counts

import numpy as np
import types
from timeit import timeit
for n in (10, 1000, 1000000):
    data = setup_data(n)
    ref = f_np(**data)
    print(f'n = {n}')
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert np.all(ref == func(**data))
            print("{:16s}{:16.8f} ms".format(name[2:], timeit(
                'f(**data)', globals={'f':func, 'data':data}, number=10)*100))
        except:
            print("{:16s} apparently crashed".format(name[2:]))

Unsurprisingly, NumPy is a bit faster than @Daniel's pure Python solution on large data sets. Sample output:

# n = 10
# np                    0.03481400 ms
# py                    0.00669330 ms
# n = 1000
# np                    0.11215360 ms
# py                    0.34836530 ms
# n = 1000000
# np                   82.46765980 ms
# py                  360.51235450 ms
Recruit answered 30/11, 2017 at 20:45 Comment(4)
Probably significantly faster to flatten the digit string instead of having nested bins, unless NumPy ends up implementing it as a 3D matrix with efficient indexing. Which version of @Daniel's did you time against; the one that runs a string search for each integer, or the one with a histogram?Fair
@PeterCordes I doubt it. ndarrays, the core numpy type, are all about efficient storage, manipulation and indexing of multidimensional arrays of numbers. Sometimes you can shave off a few % by flattening, but in this case doing the 100 x[0] + 10 x[1] + x[2] by hand won't gain you much. I used the one @Mickel said was faster, you can check the benchmark code yourself.Recruit
I don't really know NumPy (or Python in general; mostly I do C and assembly performance tuning for x86), but I think you have a single 3D array, right? I was thinking from your English text (which apparently I didn't even read carefully) that you had actual nested Python objects and were indexing them separately. But that's not the case, so nvm my first comment.Fair
I think the pure Python version you used is pretty much the same histogram implementation that the even higher voted answers used, but if different ways of writing it in Python affect the speed much.Fair
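For reference, the "flattened" variant discussed above could look like this (a sketch, not benchmarked here):

import numpy as np

def f_np_flat(text):
    a = (np.frombuffer(bytes(text, 'utf8'), dtype=np.uint8) - ord('0')).astype(np.int64)
    # Combine each triple into one value 0..999, then histogram in a single call.
    vals = a[:-2] * 100 + a[1:-1] * 10 + a[2:]
    return np.bincount(vals, minlength=1000)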
Answer (score 3)

I would solve the problem as follows:

def find_numbers(str_num):
    final_dict = {}
    buffer = {}
    for idx in range(len(str_num) - 2):  # the last triple starts at len - 3
        num = int(str_num[idx:idx + 3])
        if num not in buffer:
            buffer[num] = 0
        buffer[num] += 1
        if buffer[num] > 1:
            final_dict[num] = buffer[num]
    return final_dict

Applied to your example string, this yields:

>>> find_numbers("123412345123456")
{345: 2, 234: 3, 123: 3}

This solution runs in O(n) for n being the length of the provided string, and is, I guess, the best you can get.

Codel answered 30/11, 2017 at 20:23 Comment(1)
You could simply use a Counter. You don't need a final_dict, and you don't have to update it at each iteration.Cystotomy
Answer (score 2)

As per my understanding, you cannot have the solution in constant time. It will take at least one pass over the million-digit number (assuming it's a string). You can have a 3-digit rolling iteration over the digits of the million-length number, increasing the value of a hash key by 1 if it already exists, or creating a new hash key (initialized to 1) if it doesn't already exist in the dictionary.

The code will look something like this:

def calc_repeating_digits(number):
    counts = {}
    digits = str(number)   # accept either an int or a string
    for i in range(len(digits) - 2):
        current_three_digits = digits[i:i+3]
        if current_three_digits in counts:
            counts[current_three_digits] += 1
        else:
            counts[current_three_digits] = 1
    return counts

You can filter down to the keys which have item value greater than 1.
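For instance, that last step might be (a small addition, not part of the original answer):

repeats = {k: v for k, v in calc_repeating_digits('123412345123456').items() if v > 1}
print(repeats)   # {'123': 3, '234': 3, '345': 2}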

Antung answered 30/11, 2017 at 20:9 Comment(0)
Answer (score 2)

As mentioned in another answer, you cannot do this algorithm in constant time, because you must look at at least n digits. Linear time is the fastest you can get.

However, the algorithm can be done in O(1) space. You only need to store the counts of each 3 digit number, so you need an array of 1000 entries. You can then stream the number in.
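A sketch of that streaming idea (the function name is illustrative; any iterator of digit characters will do, so the whole string never has to be in memory):

def stream_histogram(digit_chars):
    counts = [0] * 1000   # constant space: 1000 counters
    window = 0
    for i, ch in enumerate(digit_chars):
        window = (window % 100) * 10 + int(ch)   # slide the 3-digit window
        if i >= 2:                               # window is full from the third digit on
            counts[window] += 1
    return counts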

My guess is that either the interviewer misspoke when they gave you the solution, or you misheard "constant time" when they said "constant space."

Mong answered 1/12, 2017 at 5:11 Comment(2)
As others have pointed out, the histogram approach is O(10**d) extra space, where d is the number of decimal digits you're looking for.Fair
Dictionary approach would be O(min(10^d, n)) for n digits. For example, if you have n = 10^9 digits and want to find the rare 15-digit sequences that occur more than once.Leatherworker
Answer (score 1)

Here's my answer:

from timeit import timeit
from collections import Counter
import types
import random

def setup_data(n):
    digits = "0123456789"
    return dict(text = ''.join(random.choice(digits) for i in range(n)))


def f_counter(text):
    c = Counter()
    for i in range(len(text)-2):
        ss = text[i:i+3]
        c.update([ss])
    return (i for i in c.items() if i[1] > 1)

def f_dict(text):
    d = {}
    for i in range(len(text)-2):
        ss = text[i:i+3]
        if ss not in d:
            d[ss] = 0
        d[ss] += 1
    return ((i, d[i]) for i in d if d[i] > 1)

def f_array(text):
    a = [[[0 for _ in range(10)] for _ in range(10)] for _ in range(10)]
    for n in range(len(text)-2):
        i, j, k = (int(ss) for ss in text[n:n+3])
        a[i][j][k] += 1
    for i, b in enumerate(a):
        for j, c in enumerate(b):
            for k, d in enumerate(c):
                if d > 1: yield (f'{i}{j}{k}', d)


for n in (1E1, 1E3, 1E6):
    n = int(n)
    data = setup_data(n)
    print(f'n = {n}')
    results = {}
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        print("{:16s}{:16.8f} ms".format(name[2:], timeit(
            'results[name] = f(**data)', globals={'f':func, 'data':data, 'results':results, 'name':name}, number=10)*100))
    for r in results:
        print('{:10}: {}'.format(r, sorted(list(results[r]))[:5]))

The array lookup method is very fast (even faster than @paul-panzer's numpy method!). Of course, it cheats, since it isn't technically finished after it completes, because it's returning a generator. It also doesn't have to check on every iteration whether the value already exists, which is likely to help a lot.

n = 10
counter               0.10595780 ms
dict                  0.01070654 ms
array                 0.00135370 ms
f_counter : []
f_dict    : []
f_array   : []
n = 1000
counter               2.89462101 ms
dict                  0.40434612 ms
array                 0.00073838 ms
f_counter : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_dict    : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_array   : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
n = 1000000
counter            2849.00500992 ms
dict                438.44007806 ms
array                 0.00135370 ms
f_counter : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_dict    : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_array   : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
Mccleary answered 1/12, 2017 at 0:41 Comment(3)
So what are you comparing exactly? Shouldn't you return lists instead of unused generators?Cystotomy
Counters aren't used that way. Used properly, they become the fastest option with your example. If you use timeit with a list instead of a generator, your method becomes slower than Counter or dict. See here.Cystotomy
Finally, your f_array could be faster if you first convert every char to an int : ints = [int(c) for c in text] and then use i, j, k = ints[n:n+3].Cystotomy
Answer (score 1)

Image as answer: [image not preserved in this copy]

Looks like a sliding window.

Valentinavalentine answered 23/12, 2017 at 11:25 Comment(0)
Answer (score 1)

Here is my solution:

from collections import defaultdict
string = "103264685134845354863"
d = defaultdict(int)
for elt in range(len(string)-2):
    d[string[elt:elt+3]] += 1
d = {key: d[key] for key in d.keys() if d[key] > 1}

With a bit of creativity in the for loop (and an additional lookup list with True/False/None, for example) you should be able to get rid of the last line, as you only want to create keys in the dict for triples visited more than once up to that point. Hope it helps :)

Olivarez answered 26/12, 2017 at 17:57 Comment(1)
See pho7's answer. And comments. Try to figure out why it doesn't get loads of votes.Jabberwocky
Answer (score 0)

Telling it from the perspective of C:

- You can have an int 3-D array: results[10][10][10].
- Go from the 0th location to the (n-4)th location, where n is the size of the string array.
- At each location, check the current, next, and next-to-next characters.
- Increment the counter as results[current][next][next's next]++.
- Print the values of

results[1][2][3]
results[2][3][4]
results[3][4][5]
results[4][5][6]
results[5][6][7]
results[6][7][8]
results[7][8][9]

- It is O(n) time; there are no comparisons involved.
- You can run some parallel stuff here by partitioning the array and calculating the matches around the partitions.
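A rough Python transcription of that 3-D array idea (a sketch, not the answerer's code):

def triple_counts_3d(s):
    # 10x10x10 counters, one per digit triple
    results = [[[0] * 10 for _ in range(10)] for _ in range(10)]
    for i in range(len(s) - 2):
        a, b, c = int(s[i]), int(s[i+1]), int(s[i+2])
        results[a][b][c] += 1   # results[current][next][next-to-next]++
    return results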

Gemmiparous answered 23/2, 2018 at 19:41 Comment(0)
Answer (score -1)

inputStr = '123456123138276237284287434628736482376487234682734682736487263482736487236482634'

count = {}
for i in range(len(inputStr) - 2):
    subNum = int(inputStr[i:i+3])
    if subNum not in count:
        count[subNum] = 1
    else:
        count[subNum] += 1

print(count)
Frederickafredericks answered 5/12, 2017 at 14:52 Comment(1)
Thanks for your answer, but it is too similar to the algorithm that was given by @abhishek arora 5-6 days ago. Also, the original question was not asking for the algorithm but rather a different question (which has already been answered multiple times)Vestryman
