How do I find the shortest overlapping match using regular expressions?

Asked 27/1, 2010 at 16:49 Answered 26/9, 2011 at 11:49

I'm still relatively new to regex. I'm trying to find the shortest string of text that matches a particular pattern, but am having trouble if the shortest pattern is a substring of a larger match. For example:

import re
string = "A|B|A|B|C|D|E|F|G"
my_pattern = 'a.*?b.*?c'

my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
matches = my_regex.findall(string)

for match in matches:
    print match

prints:

A|B|A|B|C

but I'd want it to return:

A|B|C

Is there a way to do this without having to loop over each match to see if it contains a substring that matches?

Stethoscope answered 27/1, 2010 at 16:49 Comment(1)

Please check Tim's answer; it's the most concise one, probably should be marked as the answer to your question. – Ina 27/9, 2011 at 11:23

Contrary to most other answers here, this can be done in a single regex using a positive lookahead assertion with a capturing group:

>>> my_pattern = '(?=(a.*?b.*?c))'
>>> my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
>>> matches = my_regex.findall(string)
>>> print min(matches, key=len)
A|B|C

findall() will return all possible matches, so you need min() to get the shortest one.

How this works:

We're not matching any text in this regex, just positions in the string (which the regex engine steps through during a match attempt).
At each position, the regex engine looks ahead to see whether your regex would match at this position.
If so, it will be captured by the capturing group.
If not, it won't.
In either case, the regex engine then steps ahead one character and repeats the process until the end of the string.
Since the lookahead assertion doesn't consume any characters, all overlapping matches will be found.

Duodenary answered 26/9, 2011 at 11:49 Comment(4)

@JustinHarris: Unless we're using lookaheads. – Duodenary 18/5, 2016 at 13:29

@TimPietzcker good point but I think it's worth clarifying that. I'll delete my earlier comment. – Epitome 18/5, 2016 at 18:26

@JustinHarris You're right. Since I'm not a native speaker, I'm very open to suggestions (e. g. how to clarify the final sentence of my answer). – Duodenary 18/5, 2016 at 19:19

@TimPietzcker my main concern is the line: "findall() will return all overlapping matches...", I think "findall() will return all overlapping matches in this case because of the lookahead..." would be clearer. – Epitome 18/5, 2016 at 21:8

No. Perl returns the longest, leftmost match, while obeying your non-greedy quantifiers. You'll have to loop, I'm afraid.

Edit: Yes, I realize I said Perl above, but I believe it is true for Python.

Bobbie answered 27/1, 2010 at 17:0 Comment(6)

Perl? what it has to do with Perl? – Friesland 27/1, 2010 at 17:4

Bummer. Ok, that's what I though the answer would be, but thought I'd check with the masters first :). Thanks. – Stethoscope 27/1, 2010 at 17:17

No need to loop. See my answer. – Duodenary 26/9, 2011 at 11:52

Leftmost yes, longest no. Regex-directed flavors like Perl and Python (in search mode) return a match at the earliest possible starting position, but not necessarily the longest possible match at that position. – Alcahest 26/9, 2011 at 23:50

Perl regexes are greedy by default, so leftmost, longest is true for Perl. – Bobbie 27/9, 2011 at 18:17

@Paul: This is simply wrong. The regex (foo|foobar) when applied to "foobar" will match foo because the first alternative matches, so the next one isn't even tried. – Duodenary 1/10, 2011 at 6:37

This might be a useful application of sexegers. Regular-expression matching is biased toward the longest, leftmost choice. Using non-greedy quantifiers such as in .*? skirts the longest part, and reversing both the input and pattern can get around leftmost-matching semantics.

Consider the following program that outputs A|B|C as desired:

#! /usr/bin/env python

import re

string = "A|B|A|B|C|D|E|F|G"
my_pattern = 'c.*?b.*?a'

my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
matches = my_regex.findall(string[::-1])

for match in matches:
    print match[::-1]

Another way is to make a stricter pattern. Say you don't want to allow repetitions of characters already seen:

my_pattern = 'a[^a]*?b[^ab]*?c'

Your example is generic and contrived, but if we had a better idea of the inputs you're working with, we could offer better, more helpful suggestions.

Brentbrenton answered 27/1, 2010 at 18:25 Comment(1)

All reversing does is get rightmost-matching semantics, which is equally broken, but for different input (such as "A|B|C|B|C"). – Haemophiliac 27/1, 2010 at 18:39

Another regex solution; it finds only the last occurence of .*a.*b.*c:

my_pattern = 'a(?!.*a.*b.*c).*b[^c]*c'

a(?!.*a.*?b.*?c) ensures that there is no 'a.*?b.*?c' after first 'A' strings like A|A|B|C or A|B|A|B|C or A|B|C|A|B|C in results are eliminated

b[^c]*c ensures that after 'B' there is only one 'C' strings like A|B|C|B|C or A|B|C|C in results are eliminated

So you have the smallest matching 'a.*?b.*?c'

Herisau answered 26/9, 2011 at 11:14 Comment(0)

The regex engine starts searching from the beginning of the string till it finds a match and then exits. Thus if it finds a match before it even considers the smaller one, there is no way for you to force it to consider later matches in the same run - you will have to rerun the regex on substrings.

Setting the global flag and choosing the shortest matched string won't help as it is evident from your example - the shorter match might be a substring of another match (or partly included in it). I believe you will have to start subsequent searches from (1 + index of previous match) and go on like that.

Systematics answered 27/1, 2010 at 17:4 Comment(0)

I do not think that this task can be accomplished by a single regular expression. I have no proof that this is the case, but there are quite a lot of things that can't be done with regexes and I expected this problem to be one of them. Some good examples of the limitations of regexes are given in this blog post.

Hilton answered 27/1, 2010 at 17:4 Comment(0)

You might be able to write the regex in such a way that it can't contain smaller matches.

For your regex:

a.*?b.*?c

I think you can write this:

a[^ab]*b[^c]*c

It's tricky to get that correct, and I don't see any more general or more obviously correct way to do it. (Edit—earlier I suggested a negative lookahead assertion, but I don't see a way to make that work.)

Haemophiliac answered 27/1, 2010 at 18:31 Comment(0)

A Python loop to look for the shortest match, by brute force testing each substring from left to right, picking the shortest:

shortest = None
for i in range(len(string)):
    m = my_regex.match(string[i:])
    if m: 
        mstr = m.group()
        if shortest is None or len(mstr) < len(shortest):
            shortest = mstr

print shortest

Another loop, this time letting re.findall do the hard work of searching for all possible matches, then brute force testing each match right-to-left looking for a shorter substring:

# find all matches using findall
matches = my_regex.findall(string)

# for each match, try to match right-hand substrings
shortest = None
for m in matches:
    for i in range(-1,-len(m),-1):
        mstr = m[i:]        
        if my_regex.match(mstr):
            break
    else:
        mstr = m

    if shortest is None or len(mstr) < len(shortest):
        shortest = mstr

print shortest

Rosalvarosalyn answered 27/1, 2010 at 19:53 Comment(0)

No, there isn't in the Python regular expression engine.

My take for a custom function, though:

import re, itertools

# directly from itertools recipes
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    for elem in b:
        break
    return itertools.izip(a, b)

def find_matches(rex, text):
    "Find all matches, even overlapping ones"
    matches= list(rex.finditer(text))

    # first produce typical matches
    for match in matches:
        yield match.group(0)

    # next, run it for any patterns included in matches
    for match1, match2 in pairwise(matches):
        subtext= text[match1.start()+1:match2.end()+1]
        for result in find_matches(rex, subtext):
            yield result

    # also test the last match, if there was at least one
    if matches:
        subtext= text[matches[-1].start()+1:matches[-1].end()+1]
        # perhaps the previous "matches[-1].end()+1" can be omitted
        for result in find_matches(rex, subtext):
            yield result

def shortest_match(rex, text):
    "Find the shortest match"
    return min(find_matches(rex, text), key=len)

if __name__ == "__main__":
    pattern= re.compile('a.*?b.*?c', re.I)
    searched_text= "A|B|A|B|C|D|E|F|G"
    print (shortest_match(pattern, searched_text))

Ina answered 28/1, 2010 at 1:3 Comment(1)

@TimPietzcker: thanks for your comment and your answer. I never tried capturing groups in either look-ahead or look-behind assertions. – Ina 27/9, 2011 at 11:19

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags