How to replace pattern of repeating characters/words only at the beginning of the string?
Asked Answered
K

2

8

Note that this question is in the context of Julia, and therefore (to my knowledge) PCRE.

Suppose that you had a string like this:

"sssppaaasspaapppssss"

and you wanted to match, individually, the repeating characters at the end of the string (in the case of our string, the four "s" characters - that is, so that matchall gives ["s","s","s","s"], not ["ssss"]). This is easy:

r"(.)(?=\1*$)"

It's practically trivial (and easily used - replace(r"(.)(?=\1*$)","hell","k") will give "hekk" while replace(r"(.)(?=\1*$)","hello","k") will give "hellk"). And it can be generalised for repeating patterns by switching out the dot for something more complex:

r"(\S+)(?=( \1)*$)"

which will, for instance, independently match the last three instances of "abc" in "abc abc defg abc h abc abc abc".

Which then leads to the question... how would you match the repeating character or pattern at the start of the string, instead? Specifically, using regex in the way it's used above.

The obvious approach would be to reverse the direction of the above regex as r"(?<=^\1*)(.)" - but PCRE/Julia doesn't allow lookbehinds to have variable length (except where it's fixed-variable, like (?<=ab|cde)), and thus throws an error. The next thought is to use "\K" as something along the lines of r"^\1*\K(.)", but this only manages to match the first character (presumably because it "advances" after matching it, and no longer matches the caret).

For clarity: I'm seeking a regex that will, for instance, result in

replace("abc abc defg abc h abc abc abc",<regex here>,"hello")

producing

"hello hello defg abc h abc abc abc"

As you can see, it's replacing each "abc" from the start with "hello", but only until the first non-match. The reverse one I provide above does this at the other end of the string:

replace("abc abc defg abc h abc abc abc",r"(\S+)(?=( \1)*$)","hello")

produces

"abc abc defg abc h hello hello hello"
Knobkerrie answered 19/7, 2015 at 15:31 Comment(11)
use this ^(\S+)(?:\s+\1)* and then do splitting on space character,Haydenhaydn
@AvinashRaj - if you read through the full question, you'll see that I'm wanting to know if it can be done without multiple steps - that is, just with a regex. I can do it with the end-of-string equivalent, with r"(.)(?=\1*$)" (or more generally r"(\S+)(?=( \1)*$)").Knobkerrie
Reverse the string. ;-)Valerievalerio
@KingMob - you can do that, if you only want to match in one direction. But it's not a regex solution, and it doesn't help if you want to match in both directions (for instance, to capture the first sho in each part of the string "shorts shoes shop shovel shortstuff shoplifter shopshop", and not the second sho in shopshop, you need to look in both directions - you're basically figuring out which part isn't the delimiter, in a sense)Knobkerrie
Well, you could replace (\S+)((?= \1)|.*) with "hello$2" (JavaScript for example), but unfortunately you need to be able to reference the submatches in the replacement string which AFAIK you can't do in Julia :-(Valerievalerio
Or maybe that's (\S+ )((?=\1)|.*) with "hello $2" since the space could be considered part of the repeat, depending on your specification.Valerievalerio
BTW, I think you're headed down a rabbit hole by trying to do all this in a one-pass regexp. And Julia's regexp functions suck.Valerievalerio
Just to mention, for the first string something like ^.|\G(?<=(.))\1 or (?<=(.)(?!\1)).*(*SKIP)(*F)|. would do. What one has to read out is that the headline says repeating characters/words but (\S+)(?=( \1)*$) will always match the last "word" eg cd in ab bc cd regardless repetition. Further it seems you also want to work the "word-version" without any separator, but in your sample (\S+)(?=( \1)*$) there is a space separator. However I found it very interesting :]Apnea
@Jonny5 - the headline was changed by someone else, and the word version should be able to work without a separator, but with the separator is easier to read. The real use I wanted it for has a non-consistent separator (varying length and/or content with one feature maintaining it as a separator), so it should work with any kind of separator or none at all. The question was originally specific to a "lookbehind" method or similar, so it could pair with a lookahead. It would be nice if you could access the previous match's groups or the match itself.Knobkerrie
@GlenO I see edit, yea - anyway I found it not clear out of the text but understood better from studying your tries and comments. So this now sounds more like different tasks. A word version would always rely on some kind of boundary. And if there's none, probably each match of the longest repeated sequence from start if there is repetition, else the first character.Apnea
@Jonny5 - in the simple case, where this matching is the only one being done, you're right. But the intent is to add further conditions on the string... but the extra conditions are separate from this requirement, and I'm happy to bring it together. I wanted the question general enough to be useful for other people as well as myself.Knobkerrie
I
8

You can use the \G anchor that matches the position after the previous match or at the start of the string. In this way you ensure the contiguity of results from the start of the string to the last occurrence:

\G(\S+)( (?=\1 ))?

demo

or to be able to match until the end of the string:

\G(\S+)( (?=\1(?: |\z)))?
Incline answered 19/7, 2015 at 17:32 Comment(5)
+1 - This is certainly a great idea, and once Julia adds the PCRE2 substitution functionality, I think it'll allow me to do what I want to do... unfortunately, it's not going to give me the desired functionality in the meantime, because it captures the space as well.Knobkerrie
@GlenO: I don't think using a reference to a capture group in the replacement string is particular to PCRE2. I don't find the way to do it with julia (I don't know this language) nor any example with a capture in the documentation, so if it isn't possible, it is more an implementation particularity. Whatever, an easy workaround consists to use " hello" as replacement string and to trim the string with lstrip() (or another way to remove the first character). If this limitation really exists, I don't think there is another way to go.Incline
@GlenO: I read somewhere that replace can use a function as replacement, but I don't know if the default parameter is the whole match or a match object (with the capture groups available) and if user defined functions are allowed.Incline
Unfortunately, it's the whole match, not the match object. And I am aware that there are workarounds. What I was really hoping for was a way to achieve, as I said, the same thing that can be done at the end of the string - it's easy to match the repeating patterns at the end of the string... but at the start, the best I've seen is your trick, which doesn't work if there's no nice delimiter for the pattern. If nobody offers a solution that manages to do it perfectly within the next few days, I'll accept your answer as the best possible at this time.Knobkerrie
Yeah, this assumes that a known delimiter is present (which is hinted at) so wouldn't work on the first string in the question. It also won't work for the string "abc abc", though if the space is not part of the repeat then it shouldn't. \1( |$) might fix that. I think at this point we need a better-specified problem.Valerievalerio
H
3

For PCRE style engines, unfortunately there is no way to do this without
variable length lookbehind.

A pure solution is not possible.
There is no \G anchor trickery that can accomplish this.

Here is why the \G anchor won't work.

With the anchor, the only guarantee you have is that the last match
resulted in a match where the forward overlap was checked to be equal
to the current match.

As a result, you can only globally match up to N-1 of the duplicate's from the beginning.

Here is a proof:

Regex:

 # (?:\G([a-c]+)(?=\1))

 (?:
      \G 
      ( [a-c]+ )                    # (1)
      (?=
           \1 
      )
 )

Input:

abcabcabcbca

Output:

 **  Grp 0 -  ( pos 0 , len 3 ) 
abc  
 **  Grp 1 -  ( pos 0 , len 3 ) 
abc  
------------
 **  Grp 0 -  ( pos 3 , len 3 ) 
abc  
 **  Grp 1 -  ( pos 3 , len 3 ) 
abc  

Conclusion:

Even though you know the Nth one is there from the previous lookahead,
the Nth one can't be matched without the condition of the current lookahead.

Sorry, and good luck!
Let me know if you find a pure regex solution.

Handle answered 22/7, 2015 at 22:49 Comment(6)
I think you should clarify that in cases where there is no clear separator, it's not possible to do such replacement. While it was mentioned in the comment, it's not clear from the question.Ronironica
By the way, this is only possible with conditional replacement, which is available in Boost, AFAIK. This is a rare feature among the implementations.Ronironica
@Ronironica - As far as a clear separator, that's the fallacy here. If the lookahead is optional then it will match a single instance. I offered a minimum proof that no way is valid. To think otherwise is a conceptual error. It kind of bugs me people don't accept facts, trying to get blood out of a stone.Handle
For the case with separator, something like this, I guess? regex101.com/r/bV0uA5/1 ^(\S+)( (?=\1(?: |$)))|(?!^)\G(\S+)( (?=\3(?: |$)))? and replacement hello$2$4Ronironica
@Ronironica - Yep, that takes care of the single instance part. But, I don't know, in a sense, it seems to me that the delimiter case is too specialized to be considered a solution since it is not aggregated with the repetition, but must be consumed as a condition for the next match. And I thought he was looking for a general solution...Handle
The OP is indeed looking for a general solution. I only nitpick about the fact that while general solution is not possible, if there is clear separator, then it is possible.Ronironica

© 2022 - 2024 — McMap. All rights reserved.