How do I find the shortest overlapping match using regular expressions?
Asked Answered
S

9

18

I'm still relatively new to regex. I'm trying to find the shortest string of text that matches a particular pattern, but am having trouble if the shortest pattern is a substring of a larger match. For example:

import re
string = "A|B|A|B|C|D|E|F|G"
my_pattern = 'a.*?b.*?c'

my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
matches = my_regex.findall(string)

for match in matches:
    print match

prints:

A|B|A|B|C

but I'd want it to return:

A|B|C

Is there a way to do this without having to loop over each match to see if it contains a substring that matches?

Stethoscope answered 27/1, 2010 at 16:49 Comment(1)
Please check Tim's answer; it's the most concise one, probably should be marked as the answer to your question.Ina
D
16

Contrary to most other answers here, this can be done in a single regex using a positive lookahead assertion with a capturing group:

>>> my_pattern = '(?=(a.*?b.*?c))'
>>> my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
>>> matches = my_regex.findall(string)
>>> print min(matches, key=len)
A|B|C

findall() will return all possible matches, so you need min() to get the shortest one.

How this works:

  • We're not matching any text in this regex, just positions in the string (which the regex engine steps through during a match attempt).
  • At each position, the regex engine looks ahead to see whether your regex would match at this position.
  • If so, it will be captured by the capturing group.
  • If not, it won't.
  • In either case, the regex engine then steps ahead one character and repeats the process until the end of the string.
  • Since the lookahead assertion doesn't consume any characters, all overlapping matches will be found.
Duodenary answered 26/9, 2011 at 11:49 Comment(4)
@JustinHarris: Unless we're using lookaheads.Duodenary
@TimPietzcker good point but I think it's worth clarifying that. I'll delete my earlier comment.Epitome
@JustinHarris You're right. Since I'm not a native speaker, I'm very open to suggestions (e. g. how to clarify the final sentence of my answer).Duodenary
@TimPietzcker my main concern is the line: "findall() will return all overlapping matches...", I think "findall() will return all overlapping matches in this case because of the lookahead..." would be clearer.Epitome
B
1

No. Perl returns the longest, leftmost match, while obeying your non-greedy quantifiers. You'll have to loop, I'm afraid.

Edit: Yes, I realize I said Perl above, but I believe it is true for Python.

Bobbie answered 27/1, 2010 at 17:0 Comment(6)
Perl? what it has to do with Perl?Friesland
Bummer. Ok, that's what I though the answer would be, but thought I'd check with the masters first :). Thanks.Stethoscope
No need to loop. See my answer.Duodenary
Leftmost yes, longest no. Regex-directed flavors like Perl and Python (in search mode) return a match at the earliest possible starting position, but not necessarily the longest possible match at that position.Alcahest
Perl regexes are greedy by default, so leftmost, longest is true for Perl.Bobbie
@Paul: This is simply wrong. The regex (foo|foobar) when applied to "foobar" will match foo because the first alternative matches, so the next one isn't even tried.Duodenary
B
1

This might be a useful application of sexegers. Regular-expression matching is biased toward the longest, leftmost choice. Using non-greedy quantifiers such as in .*? skirts the longest part, and reversing both the input and pattern can get around leftmost-matching semantics.

Consider the following program that outputs A|B|C as desired:

#! /usr/bin/env python

import re

string = "A|B|A|B|C|D|E|F|G"
my_pattern = 'c.*?b.*?a'

my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
matches = my_regex.findall(string[::-1])

for match in matches:
    print match[::-1]

Another way is to make a stricter pattern. Say you don't want to allow repetitions of characters already seen:

my_pattern = 'a[^a]*?b[^ab]*?c'

Your example is generic and contrived, but if we had a better idea of the inputs you're working with, we could offer better, more helpful suggestions.

Brentbrenton answered 27/1, 2010 at 18:25 Comment(1)
All reversing does is get rightmost-matching semantics, which is equally broken, but for different input (such as "A|B|C|B|C").Haemophiliac
H
1

Another regex solution; it finds only the last occurence of .*a.*b.*c:

my_pattern = 'a(?!.*a.*b.*c).*b[^c]*c'

a(?!.*a.*?b.*?c) ensures that there is no 'a.*?b.*?c' after first 'A' strings like A|A|B|C or A|B|A|B|C or A|B|C|A|B|C in results are eliminated

b[^c]*c ensures that after 'B' there is only one 'C' strings like A|B|C|B|C or A|B|C|C in results are eliminated

So you have the smallest matching 'a.*?b.*?c'

Herisau answered 26/9, 2011 at 11:14 Comment(0)
S
0

The regex engine starts searching from the beginning of the string till it finds a match and then exits. Thus if it finds a match before it even considers the smaller one, there is no way for you to force it to consider later matches in the same run - you will have to rerun the regex on substrings.

Setting the global flag and choosing the shortest matched string won't help as it is evident from your example - the shorter match might be a substring of another match (or partly included in it). I believe you will have to start subsequent searches from (1 + index of previous match) and go on like that.

Systematics answered 27/1, 2010 at 17:4 Comment(0)
H
0

I do not think that this task can be accomplished by a single regular expression. I have no proof that this is the case, but there are quite a lot of things that can't be done with regexes and I expected this problem to be one of them. Some good examples of the limitations of regexes are given in this blog post.

Hilton answered 27/1, 2010 at 17:4 Comment(0)
H
0

You might be able to write the regex in such a way that it can't contain smaller matches.

For your regex:

a.*?b.*?c

I think you can write this:

a[^ab]*b[^c]*c

It's tricky to get that correct, and I don't see any more general or more obviously correct way to do it. (Edit—earlier I suggested a negative lookahead assertion, but I don't see a way to make that work.)

Haemophiliac answered 27/1, 2010 at 18:31 Comment(0)
R
0

A Python loop to look for the shortest match, by brute force testing each substring from left to right, picking the shortest:

shortest = None
for i in range(len(string)):
    m = my_regex.match(string[i:])
    if m: 
        mstr = m.group()
        if shortest is None or len(mstr) < len(shortest):
            shortest = mstr

print shortest

Another loop, this time letting re.findall do the hard work of searching for all possible matches, then brute force testing each match right-to-left looking for a shorter substring:

# find all matches using findall
matches = my_regex.findall(string)

# for each match, try to match right-hand substrings
shortest = None
for m in matches:
    for i in range(-1,-len(m),-1):
        mstr = m[i:]        
        if my_regex.match(mstr):
            break
    else:
        mstr = m

    if shortest is None or len(mstr) < len(shortest):
        shortest = mstr

print shortest
Rosalvarosalyn answered 27/1, 2010 at 19:53 Comment(0)
I
0

No, there isn't in the Python regular expression engine.

My take for a custom function, though:

import re, itertools

# directly from itertools recipes
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    for elem in b:
        break
    return itertools.izip(a, b)

def find_matches(rex, text):
    "Find all matches, even overlapping ones"
    matches= list(rex.finditer(text))

    # first produce typical matches
    for match in matches:
        yield match.group(0)

    # next, run it for any patterns included in matches
    for match1, match2 in pairwise(matches):
        subtext= text[match1.start()+1:match2.end()+1]
        for result in find_matches(rex, subtext):
            yield result

    # also test the last match, if there was at least one
    if matches:
        subtext= text[matches[-1].start()+1:matches[-1].end()+1]
        # perhaps the previous "matches[-1].end()+1" can be omitted
        for result in find_matches(rex, subtext):
            yield result

def shortest_match(rex, text):
    "Find the shortest match"
    return min(find_matches(rex, text), key=len)

if __name__ == "__main__":
    pattern= re.compile('a.*?b.*?c', re.I)
    searched_text= "A|B|A|B|C|D|E|F|G"
    print (shortest_match(pattern, searched_text))
Ina answered 28/1, 2010 at 1:3 Comment(1)
@TimPietzcker: thanks for your comment and your answer. I never tried capturing groups in either look-ahead or look-behind assertions.Ina

© 2022 - 2024 — McMap. All rights reserved.