Fixed Length Regex Required?

Asked 25/6, 2012 at 21:26 Answered 25/6, 2012 at 21:36

I have this regex that uses forward and backward look-aheads:

import re
re.compile("<!inc\((?=.*?\)!>)|(?<=<!inc\(.*?)\)!>")

I'm trying to port it from C# to Python but keep getting the error

look-behind requires fixed-width pattern

Is it possible to rewrite this in Python without losing meaning?

The idea is for it to match something like

<!inc(C:\My Documents\file.jpg)!>

Update

I'm using the lookarounds to parse HTTP multipart text that I've modified:

body = r"""------abc
Content-Disposition: form-data; name="upfile"; filename="file.txt"
Content-Type: text/plain

<!inc(C:\Temp\file.txt)!>
------abc
Content-Disposition: form-data; name="upfile2"; filename="pic.png"
Content-Type: image/png

<!inc(C:\Temp\pic.png)!>
------abc
Content-Disposition: form-data; name="note"

this is a note
------abc--
"""

multiparts = re.compile(...).split(body)

I want to get just the file paths and the other text when I do the split, without having to remove the opening and closing tags afterwards.

Code brevity is important, but I'm open to changing the <!inc( format if it makes the regex doable.

Fireworm answered 25/6, 2012 at 21:26 Comment(9)

Have you tried using a raw string? re.compile(r'''regex here''') – Trod 25/6, 2012 at 21:31

"backward lookahead". You mean a lookbehind. – Affectional 25/6, 2012 at 21:31

You can use the regex module instead of the standard re, which does support variable-length lookbehinds. – Einstein 25/6, 2012 at 21:36

Apparently you are looking for the <!inc( and )!> parts. Why are you not looking for the file part with (?<=<!inc\().*?(?=\)!>)? – Accessible 25/6, 2012 at 21:36

@thg435 - Sorry, this will be used on other's computers and I don't want to have to distribute an additional module if possible. Thanks. – Fireworm 25/6, 2012 at 22:10

@OlivierJacot-Descombes - See my update, sorry for not explaining better in the first place. – Fireworm 25/6, 2012 at 22:10

@Trod - Thanks for the tip, that will make the code nicer to look at! – Fireworm 25/6, 2012 at 22:14

You want to capture the file path, and everything except the opening <!inc( and closing )!> tags? – Gadid 25/6, 2012 at 22:17

@Gadid - Yes, exactly and I want them to be split so I have an array of "everythings" and "file paths"... and no opening or closing tags. – Fireworm 26/6, 2012 at 1:55
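The lookaround suggested in the comments above is accepted by the standard re module, because its lookbehind, <!inc\(, always matches exactly six characters. A minimal sketch, assuming a made-up sample string, that pulls out only the paths (it does not return the surrounding text that the update asks for):

```python
import re

# Hypothetical sample text; the paths mirror the ones in the question.
sample = r"intro <!inc(C:\Temp\file.txt)!> middle <!inc(C:\Temp\pic.png)!> outro"

# The lookbehind is the literal "<!inc(" (fixed width), so re accepts it.
path_only = re.compile(r"(?<=<!inc\().*?(?=\)!>)")

# Each match is the text between an opening tag and the nearest closing tag.
paths = path_only.findall(sample)
```

This is handy when only the paths matter; for paths plus everything else, splitting is the simpler route.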

For paths + "everything" in the same array, just split on the opening and closing tag:

import re
p = re.compile(r'''<!inc\(|\)!>''')
awesome = p.split(body)

You say you're flexible on the tag format. If )!> can occur elsewhere in the text, consider changing the closing tag to something like )!/inc> (or anything else, as long as it's unique).

See it run.
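As a rough illustration of what the split returns, here is a sketch that uses a shortened stand-in for the body from the question. The names tags, paths and other_text are only for this example, and the odd/even indexing assumes every opening tag has a matching closing tag:

```python
import re

# Shortened stand-in for the multipart body in the question.
body = r"""------abc
Content-Type: text/plain

<!inc(C:\Temp\file.txt)!>
------abc
Content-Disposition: form-data; name="note"

this is a note
------abc--
"""

# Split on either the opening tag or the closing tag.
tags = re.compile(r"<!inc\(|\)!>")
parts = tags.split(body)

# With well-formed tags, odd indices hold the paths and even indices hold the rest.
paths = parts[1::2]
other_text = parts[0::2]
```

If a tag is left unclosed, the odd/even pattern shifts, so it may be worth checking the length of parts before relying on it.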

Gadid answered 25/6, 2012 at 21:33 Comment(7)

+1 :: Optionally replace .*? with .+? for non-blank inside match – Committee 25/6, 2012 at 21:46

@user1215106: That wouldn't match his already existing regex. Keep in mind this is a port from C# to Python. – Gadid 25/6, 2012 at 21:47

That's why I wrote optionally and explain what would change, Sir. – Committee 25/6, 2012 at 21:48

BTW :: For better performance, don't use *? or +? at all, if you don't have to... – Committee 25/6, 2012 at 21:49

Just google for that - for example: blog.stevenlevithan.com/archives/greedy-lazy-performance – Committee 25/6, 2012 at 21:56

Sorry - I should have explained better - see my updated question. Thanks! – Fireworm 25/6, 2012 at 22:6

Yes, I think I'll take this approach. It does lose some accuracy since it doesn't verify the matching end tags, but I'm not too worried about that being a problem. Thanks! – Fireworm 26/6, 2012 at 21:19
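On the accuracy point in the last comment: one way to keep the pairing check without any lookbehind is to put the path in a capturing group, since re.split includes the text of captured groups in its result. A sketch of that idea, with a hypothetical sample string that contains a stray closing tag:

```python
import re

# Hypothetical input with one real tag pair and one stray ")!>".
sample = r"head <!inc(C:\Temp\file.txt)!> tail with a stray )!> in it"

# Splitting on the whole tag keeps the captured path in the output list.
paired = re.compile(r"<!inc\((.*?)\)!>")
parts = paired.split(sample)

# For comparison: splitting on each tag separately also cuts at the stray ")!>".
loose_parts = re.split(r"<!inc\(|\)!>", sample)
```

Because . does not match newlines by default, this version only recognizes tags that open and close on the same line.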

From the documentation:

(?<!...)

Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?<=...)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in abcdef, since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function.

Emphasis mine. No, I don't imagine you can port it to Python in its current form.
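To make the fixed-width rule concrete, here is a small sketch that contrasts the documentation's own example with the lookbehind half of the pattern from the question. The variable names are just for illustration, and the behaviour shown is that of the standard re module as far as I know it:

```python
import re

# Fixed-width lookbehind from the documentation: compiles and matches.
found = re.search(r"(?<=abc)def", "abcdef").group(0)

# Variable-width lookbehind (note the .*? inside it): rejected at compile time.
try:
    re.compile(r"(?<=<!inc\(.*?)\)!>")
    compiled = True
except re.error as exc:
    compiled = False
    message = str(exc)  # the "look-behind requires fixed-width pattern" error
```

The failure happens in re.compile itself, before any input is searched, which is why the error appears as soon as the pattern is built.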

Clew answered 25/6, 2012 at 21:33 Comment(2)

Yeah, I read the documentation and was hoping someone on SO is smart enough to help me rewrite this without the lookarounds since the documentation says they're not allowed. Thanks! – Fireworm 25/6, 2012 at 22:12

This answer has been added to the Stack Overflow Regular Expression FAQ, under "Lookarounds". – Hispanic 10/4, 2014 at 0:30

import re

pat = re.compile("\<\!inc\((.*?)\)\!\>")

f = pat.match(r"<!inc(C:\My Documents\file.jpg)!>").group(1)

results in f == r'C:\My Documents\file.jpg'

In response to Jon Clements:

print re.escape("<!inc(filename)!>")

results in

\<\!inc\(filename\)\!\>

Conclusion: re.escape seems to think they should be escaped.

Emmeram answered 25/6, 2012 at 21:36 Comment(2)

Any reason to escape <,! and > ? The compile statement should traditionally be an r'' str – Suppose 25/6, 2012 at 21:49

11 years later: In Python 3.7 (which reached its EOL just a few months ago) and later, !, ", %, ', ,, /, :, ;, <, =, >, @, and ` are no longer escaped. That said, <!inc\(filename\)!> is sufficient. – Respecting 9/9, 2023 at 12:43
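To illustrate the point about escaping, a short sketch comparing the hand-escaped pattern, the minimal one, and whatever re.escape produces on the running interpreter. The variable names are only for this example:

```python
import re

tag = "<!inc(filename)!>"

# Only the parentheses need escaping; <, ! and > are ordinary characters.
minimal = re.compile(r"<!inc\((.*?)\)!>")
name = minimal.fullmatch(tag).group(1)

# The extra backslashes are harmless: they still match the same literal text.
over_escaped = re.compile(r"\<\!inc\((.*?)\)\!\>")
same_name = over_escaped.fullmatch(tag).group(1)

# re.escape's exact output depends on the Python version,
# but either form matches the original string literally.
escaped_matches = re.fullmatch(re.escape(tag), tag) is not None
```

So the extra backslashes in the answer are a matter of taste rather than correctness; a raw string with only the parentheses escaped is easier to read.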
