Replace repeating strings in a string
Asked Answered
S

7

5

I'm trying to find (and replace) repeated string in a string.

My string can look like this:

Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.

This should become:

Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.

Note how the amit sit isn't removed since its not repeated.

Or the string can be like this:

Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.

which should become:

Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

So its not just a-z but can also have other (ascii) chars. I'm verry happy if someone can help me with this.

The next step would be to match (and replace) something like this:

2 questions 3 questions 4 questions 5 questions

which would become:

2 questions

The number in the final output can be any number 2,3,4, it doesn't matter. There will only be different numbers in the final example but the words will be the same.

Schramke answered 21/7, 2011 at 19:55 Comment(6)
Why is the second sit not removed in the first paragraph? It is still a repeat of the first sit. How are we able to determine the word boundaries correctly?Pour
Because its not repeated direclty. So in one two one one is not repeated but it is in one one two. Does that answer your question?Schramke
Is this only supposed to work for words? Define what is a word then, since () clearly isn't. And I quote tandu above, "how are we able to determine the word boundaries correctly?" What result would you want from of these examples: foo foo., foo foobar, foo foo-foo, foofoofoo, #¤% #¤% #¤%, #¤%#¤%#¤%.Landreth
Because I had had so much to drink, I didn't realise that that regex might not be so simple...Anh
your first 2 examples are, I think, wrong; the string to reduce is not "... sit amet sit amet sit amet sit ..." but "... sit amet sit amet sit amet sit ...". So the string which is repeated is sec amet, not amet sic. (the resulting recuntion looks the same, but the logic differs).Virgil
Qtax, lets say all whitepace or non-printed chars as boundries and words can cossist of the other basic ascii chars. So you examples will be: foo, foo foobar, foo foo-foo, foofoofoo,#¤% #¤% #¤%, #¤%#¤%#¤%. Pavel: You're right, I noticed it after I posted the question. That's what happens if you make up some text :)Schramke
C
0

First task solution code:

<?php

    function split_repeating($string)
    {
        $words = explode(' ', $string);
        $words_count = count($words);

        $need_remove = array();
        for ($i = 0; $i < $words_count; $i++) {
            $need_remove[$i] = false;
        }

        // Here I iterate through the number of words that will be repeated and check all the possible positions reps
        for ($i = round($words_count / 2); $i >= 1; $i--) {
            for ($j = 0; $j < ($words_count - $i); $j++) {
                $need_remove_item = !$need_remove[$j];
                for ($k = $j; $k < ($j + $i); $k++) {
                    if ($words[$k] != $words[$k + $i]) {
                        $need_remove_item = false;
                        break;
                    }
                }
                if ($need_remove_item) {
                    for ($k = $j; $k < ($j + $i); $k++) {
                        $need_remove[$k] = true;
                    }
                }
            }
        }

        $result_string = '';
        for ($i = 0; $i < $words_count; $i++) {
            if (!$need_remove[$i]) {
                $result_string .= ' ' . $words[$i];
            }
        }
        return trim($result_string);
    }



    $string = 'Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo 'Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.' . '<br>' . '<br>';



    $string = 'Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo 'Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.';

?>

Second task solution code:

<?php

    function split_repeating($string)
    {
        $words = explode(' ', $string);
        $words_count = count($words);

        $need_remove = array();
        for ($i = 0; $i < $words_count; $i++) {
            $need_remove[$i] = false;
        }

        for ($j = 0; $j < ($words_count - 1); $j++) {
            $need_remove_item = !$need_remove[$j];
            for ($k = $j + 1; $k < ($words_count - 1); $k += 2) {
                if ($words[$k] != $words[$k + 2]) {
                    $need_remove_item = false;
                    break;
                }
            }
            if ($need_remove_item) {
                for ($k = $j + 2; $k < $words_count; $k++) {
                    $need_remove[$k] = true;
                }
            }
        }

        $result_string = '';
        for ($i = 0; $i < $words_count; $i++) {
            if (!$need_remove[$i]) {
                $result_string .= ' ' . $words[$i];
            }
        }
        return trim($result_string);
    }



    $string = '2 questions 3 questions 4 questions 5 questions';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo '2 questions';

?>
Cyrano answered 21/7, 2011 at 21:32 Comment(0)
D
2

If it helps, \1, \2, etc. is used to reference previous grouping. so, for example, the following would pick out repeated words and make them repeat only once:

$string =~ s/(\w+) ( \1)+/$1/g

Repeated phrases could be similiarly put.

Dobbins answered 21/7, 2011 at 20:23 Comment(1)
Two minor nits: The \w only matches alphanumeric and underscores, you should probably use .*. And you have an extra space between the two captures.Ramify
L
2

Interesting question. This can be solved with a single preg_replace() statement but the length of the repeated phrase must be limited to avoid excessive backtracking. Here is a solution with a commented regex that works for the test data and fixes doubled, tripled (or repeated n times) phrases having a max length of 50 chars:

Solution to part 1:

$result = preg_replace('/
    # Match a doubled "phrase" having length up to 50 chars.
    (            # $1: Phrase having whitespace boundaries.
      (?<=\s|^)  # Assert phrase preceded by ws or BOL.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $1: Phrase
    (?:          # Group for one or more duplicate phrases.
      \s+        # Doubled phrase separated by whitespace.
      \1         # Match duplicate of phrase.
    ){1,}        # Require one or more duplicate phrases.
    /x', '$1', $text);

Note that with this solution, a "phrase" can consist of a single word, and there are legitimate cases where doubled words are valid grammar and should not be fixed. If the above solution is not the desired behavior, the regex can be easily modified to define a "phrase" as being two or more "words".

Edit: Modified above regex to handle any number of phrase repetitions. Also added solution to the second part of the question below.

And here is a similar solution where the phrase begins with a word of digits and the repeating phrases must also begin with a word of digits (but the repeating phrases' first word of digits do not need to match the original):

Solution to part 2:

$result = preg_replace('/
    # Match doubled "phrases" with wildcard digits first word.
    (            # $1: 1st word of phrase (digits).
    \b           # Anchor 1st phrase word to word boundary.
    \d+          # Phrase 1st word is string of digits.
    \s+          # 1st and 2nd words separated by whitespace.
    )            # End $1:  1st word of phrase (digits).
    (            # $2: Part of phrase after 1st digits word.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $2: Part of phrase after 1st digits word.
    (?:          # Group for one or more duplicate phrases.
      \s+        # Doubled phrase separated by whitespace.
      \d+        # Match duplicate of phrase.
      \s+        # Doubled phrase separated by whitespace.
      \2         # Match duplicate of phrase.
    ){1,}        # Require one or more duplicate phrases.
    /x', '$1$2', $text);
Lenssen answered 22/7, 2011 at 1:59 Comment(0)
B
1

((?:\b|^)[\x20-\x7E]+)(\1)+ will match any repeating string of printable ASCII characters that start on a word boundary. Meaning it will match hello hello but not the double l in hello.

If you want to adjust the characters that will match, you can change and add ranges in the form \x##-\x##\x##-\x## (where ## is a hex value) and omit the -\x## where you just want to add one character.

The only problem I can see is that this somewhat simple approach would would pick out a legitimately repeated word rather than a repeated phrase. If you want to force it to only pick off repeated phrases composed of multiple words, you could use something like ((?:\b|^)[\x20-\x7E]+\s)(\1)+ (note the extra \s).

((?:\b|^)[\x20-\x7E]+\s)(.*(\1))+ is getting close to solving your second problem, but I may have thought myself into a corner on that one.

Edit: just to clarify, you'd use $string ~= /((?:\b|^)[\x20-\x7E]+\s)(.*(\1))+/$1/ig in Perl or the PHP equivalent to use that.

Bey answered 21/7, 2011 at 20:24 Comment(1)
This seems to work OK and might be enough for what I need. I made one change: /(((?:\b|^)[\x20-\x7E]+)\s)(\1){2,}/ so it will only replace if there are more than 2 repeated strings. But it still has some issuesSchramke
S
1

Good old bruteforce...

It's so ugly I inclined to post it as eval(base64_decode(...)), but here it is:

function fixi($str) {
    $a = explode(" ", $str);
    return implode(' ', fix($a));
}

function fix($a) {
    $l = count($a);
    $len = 0;
    for($i=1; $i <= $l/2; $i++) {
        for($j=0; $j <= $l - 2*$i; $j++) {
            $n = 1;
            $found = false;
            while(1) {
                $a1 = array_slice($a, $j, $i);
                $a2 = array_slice($a, $j+$n*$i, $i);
                if ($a1 != $a2)
                    break;
                $found = true;
                $n++;
            }
            if ($found && $n*$i > $len) {
                $len = $n*$i;
                $f_j = $j;
                $f_i = $i;
            }
        }
    }
    if ($len) {
        return array_merge(
            fix(array_slice($a, 0, $f_j)),
            array_slice($a, $f_j, $f_i),
            fix(array_slice($a, $f_j+$len, $l))
        );
    }
    return $a;
}

Punctuation is part of the word, so don't expect miracles.

Stimulant answered 21/7, 2011 at 21:17 Comment(2)
Although this seems to work very good its sloooow. A string with length 3500 takes about 18 secs...Schramke
Well what you expected from a bruteforce algorithm? :)Stimulant
R
1

2 questions 3 questions 4 questions 5 questions

becoming

2 questions

Can be solved using:

$string =~ s/(\d+ (.*))( \d+ (\2))+/$1/g;

It matches a number followed by anything (greedily), and then a series of things beginning with a space followed by a number followed by something that matches the previous anything. For all of that, it replaces it with the first number anything pair.

Ramify answered 21/7, 2011 at 21:27 Comment(0)
C
0

First task solution code:

<?php

    function split_repeating($string)
    {
        $words = explode(' ', $string);
        $words_count = count($words);

        $need_remove = array();
        for ($i = 0; $i < $words_count; $i++) {
            $need_remove[$i] = false;
        }

        // Here I iterate through the number of words that will be repeated and check all the possible positions reps
        for ($i = round($words_count / 2); $i >= 1; $i--) {
            for ($j = 0; $j < ($words_count - $i); $j++) {
                $need_remove_item = !$need_remove[$j];
                for ($k = $j; $k < ($j + $i); $k++) {
                    if ($words[$k] != $words[$k + $i]) {
                        $need_remove_item = false;
                        break;
                    }
                }
                if ($need_remove_item) {
                    for ($k = $j; $k < ($j + $i); $k++) {
                        $need_remove[$k] = true;
                    }
                }
            }
        }

        $result_string = '';
        for ($i = 0; $i < $words_count; $i++) {
            if (!$need_remove[$i]) {
                $result_string .= ' ' . $words[$i];
            }
        }
        return trim($result_string);
    }



    $string = 'Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo 'Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.' . '<br>' . '<br>';



    $string = 'Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo 'Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.';

?>

Second task solution code:

<?php

    function split_repeating($string)
    {
        $words = explode(' ', $string);
        $words_count = count($words);

        $need_remove = array();
        for ($i = 0; $i < $words_count; $i++) {
            $need_remove[$i] = false;
        }

        for ($j = 0; $j < ($words_count - 1); $j++) {
            $need_remove_item = !$need_remove[$j];
            for ($k = $j + 1; $k < ($words_count - 1); $k += 2) {
                if ($words[$k] != $words[$k + 2]) {
                    $need_remove_item = false;
                    break;
                }
            }
            if ($need_remove_item) {
                for ($k = $j + 2; $k < $words_count; $k++) {
                    $need_remove[$k] = true;
                }
            }
        }

        $result_string = '';
        for ($i = 0; $i < $words_count; $i++) {
            if (!$need_remove[$i]) {
                $result_string .= ' ' . $words[$i];
            }
        }
        return trim($result_string);
    }



    $string = '2 questions 3 questions 4 questions 5 questions';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo '2 questions';

?>
Cyrano answered 21/7, 2011 at 21:32 Comment(0)
S
0

Thanks a lot all for answering the question. It helped me a lot!. I tried both Ridgerunners and dtanders regular expressions and although they worked (with some modifications) on some test strings I had troubles with other strings.

So I went for the brute force attack :) which is inspired by Nox. This way I could combine both problems and still have a good performance (even better than regexp since that is slow in PHP).

For anyone interested here is the concept code:

function split_repeating_num($string) {
$words = explode(' ', $string);
$all_words = $words;
$num_words = count($words);
$max_length = 100; //max length of substring to check
$max_words = 4; //maximum number of words in substring 
$found = array();
$current_pos = 0;
$unset = array();
foreach ($words as $key=>$word) {
    //see if this word exist in the next part of the string
    $len = strlen($word);
    if ($len === 0) continue;
    $current_pos += $len + 1; //+1 for the space
    $substr = substr($string, $current_pos, $max_length);
    if (($pos = strpos(substr($string, $current_pos, $max_length), $word)) !== false) {
        //found it
        //set pointer words and all_words to same value
        while (key($all_words) < $key ) next($all_words);
        while (key($all_words) > $key ) prev($all_words);
        $next_word = next($all_words);

        while (is_numeric($next_word) || $next_word === '') {
            $next_word = next($all_words);
        }
        // see if it follows the word directly
        if ($word === $next_word) {
            $unset [$key] = 1;
        } elseif ($key + 3 < $num_words) {
            for($i = $max_words; $i > 0; $i --) {
                $x = 0;
                $string_a = '';
                $string_b = '';
                while ($x < $i ) {
                    while (is_numeric($next_word) || $next_word === '' ) {
                        $next_word = each($all_words);
                    }
                    $x ++;
                    $string_a .= $next_word;
                    $string_b .= $words [key($all_words) + $i];
                }

                if ($string_a === $string_b) {
                    //we have a match
                    for($x = $key; $x < $i + $key; $x ++)
                        $unset [$x] = 1;
                }
            }
        }
    }

}
foreach ($unset as $k=>$v) {
    unset($words [$k]);
}
return implode(' ', $words);

}

There are still some minor issues and I do need to test is, but it seems to do its job.

Schramke answered 22/7, 2011 at 9:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.