Python Regex Engine - "look-behind requires fixed-width pattern" Error

I am trying to handle unmatched double quotes within a string in CSV format.

To be precise,

"It "does "not "make "sense", Well, "Does "it"

should be corrected as

"It" "does" "not" "make" "sense", Well, "Does" "it"

So basically, I am trying to replace every ' " ' that is

  1. not preceded by the beginning of a line or a comma, and
  2. not followed by a comma or the end of a line

with ' " " '.

For that I use the below regex

(?<!^|,)"(?!,|$)

The problem is that while Ruby regex engines (http://www.rubular.com/) are able to parse this regex, Python regex engines (https://pythex.org/, http://www.pyregex.com/) throw the following error:

Invalid regular expression: look-behind requires fixed-width pattern

With Python 2.7.3, it throws:

sre_constants.error: look-behind requires fixed-width pattern

Can anyone tell me what vexes python here?


Edit:

Following Tim's response, I got the output below for a multi-line string:

>>> str = """ "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it" """
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str)
' "It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" " '

At the end of each line, next to 'it' two double-quotes were added.

So I made a very small change to the regex to handle a new-line.

re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)

But this gives the output

>>> re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
' "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it" " '

Now only the last 'it' is followed by two double quotes.

I wonder why the '$' end-of-line anchor does not recognize that the line has ended there.


The final answer is

re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', str,flags=re.MULTILINE)
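
A minimal sketch of why '$' behaves this way, using a shortened two-line sample (the variable names here are illustrative and not from the session above). Without re.MULTILINE, '$' matches only at the end of the string (or just before a final newline), so the closing quote of every line gets rewritten. With re.MULTILINE, '$' also matches before each newline, but the last line ends with a space before the closing triple quote, so its final quote is not directly followed by an end of line:

```python
import re

# Two lines; the second one has a trailing space, like the string above.
text = '"Does "it"\n"Does "it" '

# Without flags, '$' matches only at the very end of the string.
no_flag = re.sub(r'\b\s*"(?!,|$)', '" "', text)

# With re.MULTILINE, '$' also matches just before each newline.
multiline = re.sub(r'\b\s*"(?!,|$)', '" "', text, flags=re.MULTILINE)

# Allow trailing spaces or tabs between the quote and the end of the line.
fixed = re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', text, flags=re.MULTILINE)

print(repr(no_flag))    # '"Does" "it" "\n"Does" "it" " '
print(repr(multiline))  # '"Does" "it"\n"Does" "it" " '
print(repr(fixed))      # '"Does" "it"\n"Does" "it" '
```

Only the third call leaves both closing quotes alone, which is why the optional [ \t]* before '$' is needed.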
Addy answered 20/11, 2013 at 7:31 Comment(1)
Python lookbehind assertions need to be of constant length, and (?<!^|,) is length 0 or 1, so it doesn't work. I'll think up an alternative solution. - Kaylil
K
25

Python lookbehind assertions need to be fixed width, but you can try this:

>>> s = '"It "does "not "make "sense", Well, "Does "it"'
>>> re.sub(r'\b\s*"(?!,|$)', '" "', s)
'"It" "does" "not" "make" "sense", Well, "Does" "it"'

Explanation:

\b      # Start the match at the end of a "word"
\s*     # Match optional whitespace
"       # Match a quote
(?!,|$) # unless it's followed by a comma or end of string
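
The commented breakdown above can be compiled as written with re.VERBOSE, which makes Python ignore whitespace and # comments inside the pattern. This is only a sketch; the names QUOTE_FIX and fix_quotes are illustrative and not part of the original answer:

```python
import re

# Same pattern as above, spelled out with inline comments.
QUOTE_FIX = re.compile(r'''
    \b        # start the match at the end of a "word"
    \s*       # match optional whitespace
    "         # match a quote
    (?!,|$)   # unless a comma or the end of the string follows
''', re.VERBOSE)


def fix_quotes(line):
    """Close each unmatched quote by replacing it with a quote pair."""
    return QUOTE_FIX.sub('" "', line)


result = fix_quotes('"It "does "not "make "sense", Well, "Does "it"')
print(result)  # "It" "does" "not" "make" "sense", Well, "Does" "it"
```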
Kaylil answered 20/11, 2013 at 7:39 Comment(2)
Is the rule that look-behind assertions need to be fixed width specific to Python only, or is it a generic constraint? - Coimbatore
As far as I know, fixed-length lookbehinds are required by Python, Perl, and the Boost library. Some regex flavors don't support lookbehind at all, some allow variable but finite lengths, and some have no restrictions. - Kaylil
U
97

Python re lookbehinds really need to be fixed-width. When a lookbehind pattern contains alternatives of different lengths, there are several ways to handle the situation:

  • Rewrite the pattern so that you do not have to use alternation. For example, use Tim's answer above with a word boundary, or use (?<=[^,])"(?!,|$), an exact equivalent of your current pattern that requires a char other than a comma before the double quote. Similarly, a common pattern to match words enclosed with whitespace, (?<=\s|^)\w+(?=\s|$), can be written as (?<!\S)\w+(?!\S). Or
  • Split the lookbehinds:
    • Positive lookbehinds need to be alternated in a group (e.g. (?<=a|bc) should be rewritten as (?:(?<=a)|(?<=bc)))
    • If the pattern in a lookbehind is an alternation of an anchor with a single char, you can reverse the sign of the lookbehind and use a negated character class with the char inside. E.g. (?<=\s|^) matches either a whitespace or the start of a string/line (if re.M is used), so in Python re, use (?<!\S). Likewise, (?<=^|;) becomes (?<![^;]). If you also want the start of a line to match, add \n to the negated character class, e.g. (?<![^;\n]) (see Python Regex: Match start of line, or semi-colon, or start of string, none capturing group). Note this is not necessary with (?<!\S), as \S does not match a line feed char.
    • Negative lookbehinds can be just concatenated (e.g. (?<!^|,)"(?!,|$) should look like (?<!^)(?<!,)"(?!,|$)).
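
A short sketch of the rewrites listed above, run against the string from the question. It assumes the standard re module still rejects the original pattern; the helper name starts and the small sample strings are made up for illustration:

```python
import re

s = '"It "does "not "make "sense", Well, "Does "it"'

# The original pattern mixes a zero-width and a one-char alternative,
# so the standard re module is expected to refuse to compile it.
try:
    re.compile(r'(?<!^|,)"(?!,|$)')
    compiled = True
except re.error:
    compiled = False


def starts(pattern, text):
    """Return the start offsets of all matches (helper for this sketch)."""
    return [m.start() for m in re.finditer(pattern, text)]


# Negative lookbehind with alternation -> concatenated lookbehinds.
split_negative = starts(r'(?<!^)(?<!,)"(?!,|$)', s)
# Rewrite without alternation, using a single-char class.
char_class = starts(r'(?<=[^,])"(?!,|$)', s)

# Positive lookbehind with alternation -> alternated lookbehinds in a group.
split_positive = starts(r'(?:(?<=a)|(?<=bc))x', 'ax bcx cx')

# (?<=\s|^)\w+(?=\s|$) rewritten with negated lookarounds.
words = re.findall(r'(?<!\S)\w+(?!\S)', 'foo bar, baz')

print(compiled)                      # False
print(split_negative == char_class)  # True
print(split_positive)                # [1, 5]
print(words)                         # ['foo', 'baz']
```

Both rewrites of the question's pattern find the same quote positions, so either can stand in for the rejected (?<!^|,) form.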

Or, simply install the PyPI regex module using pip install regex (or pip3 install regex) and enjoy infinite-width lookbehind.

Umlaut answered 15/11, 2016 at 18:39 Comment(4)
Why do you think we need the extra ?: in (?:(?<=a)|(?<=bc))? Why isn't (?<=a)|(?<=bc) good enough? Great answer btw. - Cheapjack
@user1993 Grouping is necessary if you plan to add it to another, bigger pattern. Otherwise, the lookbehinds won't be applied to the subsequent pattern, and the pattern will get corrupted. If you need to use (?<=a)|(?<=bc) as is, as a standalone pattern, then yes, no grouping is necessary. - Flanker
(?: (?<=SEP1)|(?<=SEP2)|(?<=SEP3)|(?<=SEP4) ) solves it for me. Put your alternations in SEP! :) - Benbena
Extra-strength upvote for the PyPI regex module. Thank you! - Ejectment
C
2

The simplest solution would be:

import regex as re

The regex module supports variable-width look-behind patterns.

Chuckle answered 26/9, 2022 at 8:46 Comment(3)
This of course requires you to first download and install the third-party regex library for Python, which permits variable-width lookarounds. - Jungly
This has already been suggested. - Flanker
@WiktorStribiżew Sorry, just noticed the last part of your answer. - Chuckle
