Finding shortest repeating cycle in word?
Asked Answered
K

13

23

I'm about to write a function which, would return me a shortest period of group of letters which would eventually create the given word.

For example word abkebabkebabkeb is created by repeated abkeb word. I would like to know, how efficiently analyze input word, to get the shortest period of characters creating input word.

Kuykendall answered 16/5, 2011 at 17:50 Comment(1)
@Tony The Tiger, the result (shortest period) does not have to be a real word.Kuykendall
D
1

O(n) solution. Assumes that the entire string must be covered. The key observation is that we generate the pattern and test it, but if we find something along the way that doesn't match, we must include the entire string that we already tested, so we don't have to reobserve those characters.

def pattern(inputv):
    pattern_end =0
    for j in range(pattern_end+1,len(inputv)):

        pattern_dex = j%(pattern_end+1)
        if(inputv[pattern_dex] != inputv[j]):

            pattern_end = j;
            continue

        if(j == len(inputv)-1):
            print pattern_end
            return inputv[0:pattern_end+1];
    return inputv;
Dolce answered 16/5, 2011 at 18:12 Comment(7)
Is for pattern_end in range(len(inputv)/2) necessary? I don't think it is.Peculation
@Peculation - sorry I'm not following. Do you mean the look of the len()/2 partDolce
I mean, replacing that line with pattern_end = 0.Peculation
I'm afraid the algorithm is incorrect. Please consider the input: "BCBDBCBCBDBC". The smallest repeating pattern is "BCBDBC", but the algorithm above will miss it. Also, I think it doesn't deal correctly with the case "HELLOHELL" (where it returns "HELLO" instead of the complete string).Spacesuit
@EyalSchneider returning "HELLO" for "HELLOHELL" makes perfect sense. Otherwise you could be using the length of the string to determine which patterns are even allowed. You only need to check patterns whose lengths are combinations of the prime factors (and 1) of the length of the input. For example if you get a string that's 101 characters long what's the point of even trying a 5 letter subsequence? It's not going to fit perfectly into 101 characters. If your input's length is 13 there's no point in even checking, no pattern (except a 1 letter pattern) repeated n times can ever add up to 13.Ciborium
@Boris: The problem is finding the smallest sub-sequence of S such that K>=1 repetitions of it would result in S itself. The input "HELLOHELL" has no repeating subsequence with K>1, so "HELLOHELL" should be returned.Spacesuit
This is a correct answer. not the most voted one. This answer is also work for "aabbccaa" and find "aabbcc" but the top answer doesn't work and print "aabbccaa".Opportunity
T
24

Here is a correct O(n) algorithm. The first for loop is the table building portion of KMP. There are various proofs that it always runs in linear time.

Since this question has 4 previous answers, none of which are O(n) and correct, I heavily tested this solution for both correctness and runtime.

def pattern(inputv):
    if not inputv:
        return inputv

    nxt = [0]*len(inputv)
    for i in range(1, len(nxt)):
        k = nxt[i - 1]
        while True:
            if inputv[i] == inputv[k]:
                nxt[i] = k + 1
                break
            elif k == 0:
                nxt[i] = 0
                break
            else:
                k = nxt[k - 1]

    smallPieceLen = len(inputv) - nxt[-1]
    if len(inputv) % smallPieceLen != 0:
        return inputv

    return inputv[0:smallPieceLen]
Tuberose answered 23/11, 2015 at 5:44 Comment(14)
So is this a solution you have come up with or is this a known algorithm?Anecdotage
Well KMP is a known algorithm. This question was very similar to a homework problem I had, and this is the answer I came up with for the homework. The instructor's solution was a bit different, but also used KMP.Tuberose
Hi Buge, love your solution, and vote up. but confused by this line smallPieceLen = len(inputv) - nxt[-1], and nxt[-1] means if the whole string does not match, what index we will be used to compare next. smallPieceLen represents the differences total length of string and nxt[-1], how it could be represented as shortest repetitive string?Devon
@LinMa: (Buge wasn't active lately) nxt[-1] means if the whole string does not match, what index we will be used to compare next no (contorted grammar, btw.). It is the index to compare next when all of the pattern matched and you want to find its next occurrence in a longer text. nxt[i] = p means pattern[i+1-p:i+1] equals pattern[0:p] (& != for p+1). nxt[-1] is the index to compare next if the "first" mismatch was "at len+1". (In many a presentation/implementation of KMP, there is a special value of -1 at index 0, with the n values as above "shifted to an index higher by one".)Riddance
Hi @greybeard, spend time today to debug your comments and Buge's code carefully, and you are correct, thanks for clarifying the meaning of nxt[i]=j. There is one thing I still mis-understand, len(inputv) - nxt[-1], this line checks the longest prefix, which matches suffix of the same string, smallPieceLen is the remaining part length, why if len(inputv) could be divided by smallPieceLen, then it is shortest repetitive string? Is there a simple prove or some explanation? Is it too aggressive to make the conclusion?Devon
Maybe this is a question to both @Tuberose and greybeard. :)Devon
@LinMa: (both are notified, anyway) Let me call len(inputv) len and nxt[-1] matchLen. If matchLen < smallPieceLen, the only chance for smallPieceLen to divide len is to be equal to it. If smallPieceLenmatchLen, inputv[0:smallPieceLen] equals inputv[smallPieceLen:2*smallPieceLen], and k never got reset (again): inputv is made up of repetitions of inputv[0:smallPieceLen] - the divisibility check just ensures that it ends with a full repetition.Riddance
@greybeard, nice answer, I think about your reply for more than one day, but not sure how do you get this conclusion inputv[0:smallPieceLen] equals inputv[smallPieceLen:2*smallPieceLen], would you elaborate a bit more?Devon
The way I think about it is nxt[-1] is the max amount that the beginning of inputv can overlap with the end of itself (without overlapping the entire thing). For example in abcabcabc nxt[-1]==6 because the last 6 letters are the same as the first 6. I specifically like to think of it as the length of the last characters. So in the example it can be split into 2 parts: abc|abcabc where the first part has length smallPieceLen and the second part has length nxt[-1]. Because of the overlap knowledge, we know abc overlaps at the beginning with abcabc as far as the overlap or `abcTuberose
Can someone translate this answer to a known programming language like java or C? It would be much appreciated to someone like me who doesn't understand this one. Thank you.Myranda
what does: nxt = [0]*len(inputv) do?Myranda
It's python which I consider a known programming language, the same language as the accepted answer. [0]*len(inputv) means create a python list (array) with the same length as inputv and with all elements being 0. Anyway, here it is in Java. C would be a little annoying because of its less flexible strings and arrays.Tuberose
This doesn't work the way I expected if the pattern ends abruptly. For example [1, 2, 1] should return [1, 2] but this code returns [1, 2, 1].Ciborium
@Boris I consider [1, 2] wrong. The question says "For example word abkebabkebabkeb is created by repeated abkeb word." But it's not possible to create [1, 2, 1] by repeating [1, 2]. You need to repeat it, then chop off the end to get [1, 2, 1]. But yes, the question is somewhat ambiguous about what the correct answer to [1, 2, 1] is.Tuberose
N
2

This is an example for PHP:

<?php
function getrepeatedstring($string) {
    if (strlen($string)<2) return $string;
    for($i = 1; $i<strlen($string); $i++) {
        if (substr(str_repeat(substr($string, 0, $i),strlen($string)/$i+1), 0, strlen($string))==$string)
            return substr($string, 0, $i);
    }
    return $string;
}
?>
Nassi answered 16/5, 2011 at 18:8 Comment(7)
This returns 'abkeb' which should be correct but I'm not sure in what way the OP is asking for 'kebab' rather than 'abkeb'.Instability
This is what I'm looking for. But it runs in O(n). Any ideas if this can be speeded up?Kuykendall
@jack44: You can't know if you have the shortest cycle until you've examined the entire string. Unless you have other knowledge, like what the largest possible cycle might be. It might be that the last character in the string throws the whole cycle off, you don't know.Megmega
I don't know PHP, but this looks like it's O(n^2).Dolce
@Richard86 - String comparison is going to O(n), though, isn't it?Dolce
@Kuykendall The speed could be improved. One solution is to start at pos strlen($string)/2. Then, if it is a match, you would go to strlen($string)/4, strlen($string)/8, and so on, until it is not a match, after that, you would check each char from the current position until match. If it isn't a match at pos strlen($string)/2, you would check at pos strlen($string)-strlen($string)/2, if not a match, continue with strlen($string)-strlen($string)/4, strlen($string)-strlen($string)/8 and so on until a match, and then, check from one step back. I'll use PHP if no one tries with pseudo.Nassi
@Kuykendall I believe the O(n) solution by @spin is to prefer, and there is no need to improve this answer as it will not be able to be as fast as the O(n) solution is.Nassi
M
2

Simpler answer which I can come up in an interview is just a O(n^2) solution, which tries out all combinations of substring starting from 0.

int findSmallestUnit(string str){
    for(int i=1;i<str.length();i++){
        int j=0;
        for(;j<str.length();j++){
            if(str[j%i] != str[j]){
                break;
            }
        }
        if(j==str.length()) return str.substr(0,i);
    }
    return str;
}

Now if someone is interested in O(n) solution to this problem in c++:

  int findSmallestUnit(string str){
      vector<int> lps(str.length(),0);
      int i=1;
      int len=0;

      while(i<str.length()){
          if(str[i] == str[len]){
              len++;
              lps[i] = len;
              i++;
          }
          else{
              if(len == 0) i++;
              else{
                  len = lps[len-1];
              }
          }
      }
      int n=str.length();
      int x = lps[n-1];
      if(n%(n-x) == 0){
          return str.substr(0,n-x);    
      }
      return str;
  }

The above is just @Buge's answer in c++, since someone asked in comments.

Myongmyopia answered 31/12, 2020 at 0:31 Comment(0)
D
1

O(n) solution. Assumes that the entire string must be covered. The key observation is that we generate the pattern and test it, but if we find something along the way that doesn't match, we must include the entire string that we already tested, so we don't have to reobserve those characters.

def pattern(inputv):
    pattern_end =0
    for j in range(pattern_end+1,len(inputv)):

        pattern_dex = j%(pattern_end+1)
        if(inputv[pattern_dex] != inputv[j]):

            pattern_end = j;
            continue

        if(j == len(inputv)-1):
            print pattern_end
            return inputv[0:pattern_end+1];
    return inputv;
Dolce answered 16/5, 2011 at 18:12 Comment(7)
Is for pattern_end in range(len(inputv)/2) necessary? I don't think it is.Peculation
@Peculation - sorry I'm not following. Do you mean the look of the len()/2 partDolce
I mean, replacing that line with pattern_end = 0.Peculation
I'm afraid the algorithm is incorrect. Please consider the input: "BCBDBCBCBDBC". The smallest repeating pattern is "BCBDBC", but the algorithm above will miss it. Also, I think it doesn't deal correctly with the case "HELLOHELL" (where it returns "HELLO" instead of the complete string).Spacesuit
@EyalSchneider returning "HELLO" for "HELLOHELL" makes perfect sense. Otherwise you could be using the length of the string to determine which patterns are even allowed. You only need to check patterns whose lengths are combinations of the prime factors (and 1) of the length of the input. For example if you get a string that's 101 characters long what's the point of even trying a 5 letter subsequence? It's not going to fit perfectly into 101 characters. If your input's length is 13 there's no point in even checking, no pattern (except a 1 letter pattern) repeated n times can ever add up to 13.Ciborium
@Boris: The problem is finding the smallest sub-sequence of S such that K>=1 repetitions of it would result in S itself. The input "HELLOHELL" has no repeating subsequence with K>1, so "HELLOHELL" should be returned.Spacesuit
This is a correct answer. not the most voted one. This answer is also work for "aabbccaa" and find "aabbcc" but the top answer doesn't work and print "aabbccaa".Opportunity
F
1

Most easiest one in python:

def pattern(self, s):
    ans=(s+s).find(s,1,-1)
    return len(pat) if ans == -1 else ans
Freehearted answered 26/8, 2020 at 21:3 Comment(1)
It will be helpful if you explain what you didBillhead
A
0

I believe there is a very elegant recursive solution. Many of the proposed solutions solve the extra complexity where the string ends with part of the pattern, like abcabca. But I do not think that is asked for.

My solution for the simple version of the problem in clojure:

 (defn find-shortest-repeating [pattern string]
  (if (empty? (str/replace string pattern ""))
   pattern
   (find-shortest-repeating (str pattern (nth string (count pattern))) string)))

(find-shortest-repeating "" "abcabcabc") ;; "abc"

But be aware that this will not find patterns that are uncomplete at the end.

Amperage answered 4/4, 2017 at 21:1 Comment(0)
T
0

I found a solution based on your post, that could take an incomplete pattern:

(defn find-shortest-repeating [pattern string]
   (if (or (empty? (clojure.string/split string (re-pattern pattern)))
          (empty? (second (clojure.string/split string (re-pattern pattern)))))
    pattern
    (find-shortest-repeating (str pattern (nth string (count pattern))) string)))
Troll answered 12/10, 2017 at 14:29 Comment(1)
@ward (defn find-pattern-string [string] (let [pattern "" working-str string] (reduce #(if (not (or (empty? (clojure.string/split string (re-pattern %1))) (empty? (second (clojure.string/split string (re-pattern %1)))))) (str %1 %2) %1) pattern working-str)))Troll
H
0

My Solution: The idea is to find a substring from the position zero such that it becomes equal to the adjacent substring of same length, when such a substring is found return the substring. Please note if no repeating substring is found I am printing the entire input String.

public static void repeatingSubstring(String input){
    for(int i=0;i<input.length();i++){
        if(i==input.length()-1){
            System.out.println("There is no repetition "+input);
        }
        else if(input.length()%(i+1)==0){
            int size = i+1;
            if(input.substring(0, i+1).equals(input.substring(i+1, i+1+size))){
                System.out.println("The subString which repeats itself is "+input.substring(0, i+1));
                break;
            }
        }
    }
}
Humankind answered 11/2, 2018 at 21:12 Comment(1)
I think this would fail for the string "ababcababc"Angelenaangeleno
M
0

Regex solution:

Use the following regex replacement to find the shortest repeating substring, and only keeping that substring:

^(.+?)\1*$
$1

Explanation:

^(.+?)\1*$
^        $   # Start and end, to match the entire input-string
 (   )       # Capture group 1:
  .+         #  One or more characters,
    ?        #  with a reluctant instead of greedy match†
      \1*    # Followed by the first capture group repeated zero or more times

$1           # Replace the entire input-string with the first capture group match,
             # removing all other duplicated substrings

Greedy vs reluctant would in this case mean: greedy = consumes as many characters as it can; reluctant = consumes as few characters as it can. Since we want the shortest repeating substring, we would want a reluctant match in our regex.

Example input: "abkebabkebabkeb"
Example output: "abkeb"

Try it online in Retina.

Here an example implementation in Java.

Mcguire answered 5/6, 2018 at 15:5 Comment(3)
This won't work for the string abaa where expected output is aPistil
@HimanshuPoddar If I read the question correctly, the full word is always build from multiple repeating substrings without any other characters, so abab (2x ab) or aaa (3x a) are both valid inputs, but abaa not (unless you count it as 1x abaa).Mcguire
Oh yeah, my bad I confused between questions, was reading similar question on the same lines. Thanks for the reply thoPistil
B
0

This is a solution I came up with using the queue, it passed all the test cases of a similar problem in codeforces. Problem No is 745A.

#include<bits/stdc++.h>
using namespace std;
typedef long long ll;

int main()
{
    ios_base::sync_with_stdio(false);
    cin.tie(NULL);

    string s, s1, s2; cin >> s; queue<char> qu; qu.push(s[0]); bool flag = true; int ind = -1;
    s1 = s.substr(0, s.size() / 2);
    s2 = s.substr(s.size() / 2);
    if(s1 == s2)
    {
        for(int i=0; i<s1.size(); i++)
        {
            s += s1[i];
        }
    }
    //cout << s1 << " " << s2 << " " << s << "\n";
    for(int i=1; i<s.size(); i++)
    {
        if(qu.front() == s[i]) {qu.pop();}
        qu.push(s[i]);
    }
    int cycle = qu.size();

    /*queue<char> qu2 = qu; string str = "";
    while(!qu2.empty())
    {
        cout << qu2.front() << " ";
        str += qu2.front();
        qu2.pop();
    }*/


    while(!qu.empty())
    {
        if(s[++ind] != qu.front()) {flag = false; break;}
        qu.pop();
    }
    flag == true ? cout << cycle : cout << s.size();
    return 0;
}

Bowles answered 2/2, 2020 at 7:21 Comment(0)
E
-1

Super delayed answer, but I got the question in an interview, here was my answer (probably not the most optimal but it works for strange test cases as well).

private void run(String[] args) throws IOException {
    File file = new File(args[0]);
    BufferedReader buffer = new BufferedReader(new FileReader(file));
    String line;
    while ((line = buffer.readLine()) != null) {
        ArrayList<String> subs = new ArrayList<>();
        String t = line.trim();
        String out = null;
        for (int i = 0; i < t.length(); i++) {
            if (t.substring(0, t.length() - (i + 1)).equals(t.substring(i + 1, t.length()))) {
                subs.add(t.substring(0, t.length() - (i + 1)));
            }
        }
        subs.add(0, t);
        for (int j = subs.size() - 2; j >= 0; j--) {
            String match = subs.get(j);
            int mLength = match.length();
            if (j != 0 && mLength <= t.length() / 2) {
                if (t.substring(mLength, mLength * 2).equals(match)) {
                    out = match;
                    break;
                }
            } else {
                out = match;
            }
        }
        System.out.println(out);
    }
}

Testcases:

abcabcabcabc
bcbcbcbcbcbcbcbcbcbcbcbcbcbc
dddddddddddddddddddd
adcdefg
bcbdbcbcbdbc
hellohell

Code returns:

abc
bc
d
adcdefg
bcbdbc
hellohell

Equipotential answered 3/8, 2015 at 16:49 Comment(1)
Just looking at the first for loop this is O(n^2), because each .equals() can take n time.Tuberose
A
-1

Works in cases such as bcbdbcbcbdbc.

function smallestRepeatingString(sequence){
  var currentRepeat = '';
  var currentRepeatPos = 0;

  for(var i=0, ii=sequence.length; i<ii; i++){
    if(currentRepeat[currentRepeatPos] !== sequence[i]){
      currentRepeatPos = 0;
      // Add next character available to the repeat and reset i so we don't miss any matches inbetween
      currentRepeat = currentRepeat + sequence.slice(currentRepeat.length, currentRepeat.length+1);
      i = currentRepeat.length-1;
    }else{
      currentRepeatPos++;
    }
    if(currentRepeatPos === currentRepeat.length){
      currentRepeatPos = 0;
    }
  }

  // If repeat wasn't reset then we didn't find a full repeat at the end.
  if(currentRepeatPos !== 0){ return sequence; }

  return currentRepeat;
}
Anecdotage answered 16/11, 2015 at 23:50 Comment(1)
This is actually O(n^2). That is because you reset i to be smaller with i = currentRepeat.length-1;. So with a 10 character string ling 'aaaaaaaaab' it takes 46 iterations. With a 20 character string it takes 191 iterations.Tuberose
S
-1

I came up with a simple solution that works flawlessly even with very large strings.
PHP Implementation:

function get_srs($s){
    $hash = md5( $s );
    $i = 0; $p = '';

    do {
        $p .= $s[$i++];
        preg_match_all( "/{$p}/", $s, $m );
    } while ( ! hash_equals( $hash, md5( implode( '', $m[0] ) ) ) );

    return $p;
}
Sixteenth answered 15/9, 2016 at 20:28 Comment(1)
Would be good if you gave some detail regarding why exactly this works. Providing more detail helps the entire community and helps to get more up votes.Salacious

© 2022 - 2025 — McMap. All rights reserved.