Fixed Length Regex Required?

I have this regex that uses forward and backward look-aheads:

import re
re.compile("<!inc\((?=.*?\)!>)|(?<=<!inc\(.*?)\)!>")

I'm trying to port it from C# to Python but keep getting the error:

look-behind requires fixed-width pattern

Is it possible to rewrite this in Python without losing meaning?

The idea is for it to match something like

<!inc(C:\My Documents\file.jpg)!>

Update

I'm using the lookarounds to parse HTTP multipart text that I've modified

body = r"""------abc
Content-Disposition: form-data; name="upfile"; filename="file.txt"
Content-Type: text/plain

<!inc(C:\Temp\file.txt)!>
------abc
Content-Disposition: form-data; name="upfile2"; filename="pic.png"
Content-Type: image/png

<!inc(C:\Temp\pic.png)!>
------abc
Content-Disposition: form-data; name="note"

this is a note
------abc--
"""

multiparts = re.compile(...).split(body)

I just want to get the file paths and the other text when I do the split, without having to remove the opening and closing tags afterwards.

Code brevity is important, but I'm open to changing the <!inc( format if it makes the regex doable.

Fireworm answered 25/6, 2012 at 21:26 Comment(9)
Have you tried using a raw string? re.compile(r'''regex here''') - Trod
"backward lookahead". You mean a lookbehind. - Affectional
You can use the regex module, which does support variable-length lookbehinds, instead of the standard re. - Einstein
Apparently you are looking for the <!inc( and )!> parts. Why are you not looking for the file part with (?<=<!inc\().*?(?=\)!>)? - Accessible
@thg435 - Sorry, this will be used on other people's computers and I don't want to have to distribute an additional module if possible. Thanks. - Fireworm
@OlivierJacot-Descombes - See my update, sorry for not explaining better in the first place. - Fireworm
@Trod - Thanks for the tip, that will make the code nicer to look at! - Fireworm
You want to capture the file path, and everything except the opening <!inc( and closing )!> tags? - Gadid
@Gadid - Yes, exactly, and I want them to be split so I have an array of "everythings" and "file paths"... and no opening or closing tags. - Fireworm
G
3

For paths + "everything" in the same array, just split on the opening and closing tag:

import re
p = re.compile(r'''<!inc\(|\)!>''')
awesome = p.split(body)

You say you're flexible on the closing tags. If )!> can occur elsewhere in the code, you may want to consider changing the closing tag to something like )!/inc> (or anything else, as long as it's unique).
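
As a rough sketch of what this split produces, and of a variant that only splits on properly paired tags: the `body` below is a shortened copy of the sample in the question, and `parts` and `paired` are names introduced here purely for illustration. `re.split` includes the text of any capturing group in its result, so wrapping the path in a group keeps the paths while requiring both tags to be present.

```python
import re

# Shortened version of the multipart body from the question.
body = r"""------abc
Content-Type: text/plain

<!inc(C:\Temp\file.txt)!>
------abc
Content-Disposition: form-data; name="note"

this is a note
------abc--
"""

# Split on either tag: surrounding text and paths alternate in the result.
parts = re.compile(r'<!inc\(|\)!>').split(body)

# Hypothetical variant: split on the whole tag and capture the path,
# so an unpaired "<!inc(" or ")!>" stays in the surrounding text.
paired = re.compile(r'<!inc\((.*?)\)!>').split(body)

print(parts[1])         # the path, without the tags
print(parts == paired)  # identical for this well-formed input
```

For well-formed input the two lists are the same; they only differ when a tag appears without its partner.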

Gadid answered 25/6, 2012 at 21:33 Comment(7)
+1 :: Optionally replace .*? with .+? for non-blank inside match - Committee
@user1215106: That wouldn't match his already existing regex. Keep in mind this is a port from C# to Python. - Gadid
That's why I wrote optionally and explained what would change, Sir. - Committee
BTW :: For better performance, don't use *? or +? at all, if you don't have to... - Committee
Just google for that - for example: blog.stevenlevithan.com/archives/greedy-lazy-performance - Committee
Sorry - I should have explained better - see my updated question. Thanks! - Fireworm
Yes, I think I'll take this approach. It does lose some accuracy since it doesn't verify the matching end tags, but I'm not too worried about that being a problem. Thanks! - Fireworm
C
5

From the documentation:

(?<!...)

Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?<=...)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in abcdef, since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:

Emphasis mine. No, I don't imagine you can port it to Python in its current form.
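
A small sketch of that restriction in practice, assuming the standard `re` module: the variable-width lookbehind from the question is rejected at compile time, while a lookbehind that contains only the fixed-length opening tag compiles. The second pattern is the one suggested in the comments on the question; it only extracts the paths and is not a drop-in replacement for the split. The names `compiled`, `path_re` and `paths` are illustrative.

```python
import re

# The lookbehind contains .*?, so its width is not fixed and re refuses it.
try:
    re.compile(r'(?<=<!inc\(.*?)\)!>')
    compiled = True
except re.error as exc:
    compiled = False
    print("rejected:", exc)

# The literal "<!inc(" always has the same length, so this lookbehind is allowed.
path_re = re.compile(r'(?<=<!inc\().*?(?=\)!>)')
paths = path_re.findall(r'<!inc(C:\Temp\file.txt)!> and <!inc(C:\Temp\pic.png)!>')
print(paths)
```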

Clew answered 25/6, 2012 at 21:33 Comment(2)
Yeah, I read the documentation and was hoping someone on SO is smart enough to help me rewrite this without the lookarounds since the documentation says they're not allowed. Thanks! - Fireworm
This answer has been added to the Stack Overflow Regular Expression FAQ, under "Lookarounds". - Hispanic
E
1
import re

pat = re.compile("\<\!inc\((.*?)\)\!\>")

f = pat.match(r"<!inc(C:\My Documents\file.jpg)!>").group(1)

results in f == r'C:\My Documents\file.jpg'

In response to Jon Clements:

print(re.escape("<!inc(filename)!>"))

results in (on Python versions before 3.7)

\<\!inc\(filename\)\!\>

Conclusion: re.escape seems to think they should be escaped.
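
For what it's worth, the extra backslashes look harmless but are not required. A quick check, assuming Python 3 and using the hypothetical names `escaped` and `plain`, suggests that both spellings extract the same path. Since the output of `re.escape` differs between Python versions, it is probably not a reliable guide to what must be escaped.

```python
import re

# Raw strings pass the backslashes through to the regex engine unchanged.
escaped = re.compile(r"\<\!inc\((.*?)\)\!\>")  # every punctuation character escaped
plain = re.compile(r"<!inc\((.*?)\)!>")        # only the parentheses escaped

text = r"<!inc(C:\My Documents\file.jpg)!>"
print(escaped.match(text).group(1))
print(plain.match(text).group(1))
```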

Emmeram answered 25/6, 2012 at 21:36 Comment(2)
Any reason to escape <, ! and >? The compile statement should traditionally be an r'' string. - Suppose
11 years later: In Python 3.7 (which reached its EOL just a few months ago) and later, !, ", %, ', ,, /, :, ;, <, =, >, @, and ` are no longer escaped. That said, <!inc\(filename\)!> is sufficient. - Respecting
