Regular expression lookbehind problem


Asked 5/2, 2010 at 23:50 Answered 6/2, 2010 at 5:14

I use

(?<!value=\")##(.*)##

to match strings like ##MyString## that are not in the form of:

<input type="text" value="##MyString##">

This works for the above form, but not for the following one, which still matches when it should not:

<input type="text" value="Here is my ##MyString## coming..">

I tried:

(?<!value=\").*##(.*)##

with no luck. Any suggestions will be deeply appreciated.

Edit: I am using PHP's preg_match() function.
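The behaviour described in the question can be reproduced with a short script. This is a minimal sketch, not part of the original post; the variable names are made up for illustration. It suggests why both patterns fall short: the lookbehind in the first pattern only inspects the seven characters directly before "##", and in the second pattern it is tested where the overall match starts (before ".*"), so it no longer guards the position next to "##" at all.

```php
<?php
// The two inputs from the question.
$exact  = '<input type="text" value="##MyString##">';
$inside = '<input type="text" value="Here is my ##MyString## coming..">';

// First pattern: the lookbehind only checks the 7 characters (value=")
// that sit immediately before "##".
$original = '/(?<!value=")##(.*)##/';

// Second pattern: the lookbehind is evaluated at the start of the match,
// before ".*", so it is trivially satisfied at offset 0.
$attempt = '/(?<!value=").*##(.*)##/';

$results = [
    'original/exact'  => preg_match($original, $exact),  // 0: rejected as intended
    'original/inside' => preg_match($original, $inside), // 1: matches, but should not
    'attempt/exact'   => preg_match($attempt, $exact),   // 1: now even this form matches
    'attempt/inside'  => preg_match($attempt, $inside),  // 1
];

var_dump($results);
```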

Genie answered 5/2, 2010 at 23:50 Comment(14)

Don't use regex to parse HTML - use an HTML parser. #1732848 – Donatus 5/2, 2010 at 23:51

I am using this to replace certain text in the HTML code, so preg_match is ok for me. I don't need an HTML parser. – Genie 5/2, 2010 at 23:56

Mark, I get it. Don't parse HTML using regex. But what if the user isn't trying to PARSE HTML, but rather search HTML for a specific string? Is it really necessary to parse the whole document using an XML parser to do this work? I feel that a lot of people are answering regex questions with this answer when it really isn't the right answer. – Palish 5/2, 2010 at 23:57

@Mike, I totally agree, everyone seems to regurgitate the "no regex with HTML" rhetoric without thinking. – Nonary 5/2, 2010 at 23:59

@Paul: I'm not "everyone". I'm not saying it without thinking. I'm saying it because I think that regex is a poor way to solve this problem. If you think it can be done easily with a regex, please do show how. :) – Donatus 6/2, 2010 at 0:9

I was about to post a "working" regex solution, but stopped because it made me feel sick to my stomach. BTW, any solution you find is easily thwarted by a valid HTML counter-example. Heed @Mark's advice. – Curlicue 6/2, 2010 at 0:10

@Dali: Will the input be an entire HTML document, or just a small fragment? What sort of inputs should we expect to see, what HTML could be present in the document - just a limited set of tags, or any tags? Is it important to get 100% accuracy? Can you trust the source of the HTML not to do something malicious to try to cause your code to fail? – Donatus 6/2, 2010 at 0:19

Zano, my point was that sometimes a regex solution does exist for a block of HTML. Look at this question for example: #2174406 There was a valid regex answer for it. I feel like most users just see HTML and regex in the same sentence and post "dont parse HTML with regex" without attempting to even examine the question. – Palish 6/2, 2010 at 0:27

@Mike Sherov: Whilst regex wasn't totally impossible there, and you did get the accepted answer, it's still a far more complex and less robust solution than using an HTML parser. See my answer for that question: #2174406 It is much easier to parse HTML using XPath than regular expressions, because XPath was designed for that purpose. – Donatus 6/2, 2010 at 1:7
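For comparison, the parser-based route argued for in these comments could look roughly like the following. It is a sketch that does not come from the thread. It assumes PHP's dom extension is available, that the fragment has a single root element, and that a hypothetical $translations array maps placeholder names to replacement text. The XPath expression //text() selects text nodes only, so placeholders inside attribute values are never visited.

```php
<?php
// Sample fragment wrapped in a single root element (assumption).
$html = '<div><p>Hello ##Name##</p>'
      . '<input type="text" value="Here is my ##Name## coming.."></div>';

// Hypothetical lookup table: placeholder name => replacement text.
$translations = ['Name' => 'World'];

$doc = new DOMDocument();
// Parse the fragment without adding <html>/<body> wrappers or a doctype.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($doc);
// //text() yields text nodes only; attribute values are not included.
foreach ($xpath->query('//text()') as $node) {
    $node->nodeValue = preg_replace_callback(
        '/##([^#]*)##/',
        function (array $m) use ($translations) {
            // Leave unknown placeholders untouched.
            return $translations[$m[1]] ?? $m[0];
        },
        $node->nodeValue
    );
}

$result = $doc->saveHTML();
echo $result;
```

The regex here only has to find placeholders inside a plain text node, which is the part regular expressions handle well; deciding what is inside or outside a tag is left to the parser.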

@Mark: It will be an entire document which will be used to replace strings with the correct language selected. Yes I can fully trust the source because actually I am producing it :) – Genie 6/2, 2010 at 1:22

@Dali: Can't you just change your document format slightly so that you can do a simpler search and replace without having some parts that mustn't match? For example, use $$foobar$$ for the bits you do want to replace, and ##foobar## for those you don't want to replace. Perhaps you could explain in your question a bit more about why you have chosen the format you did. – Donatus 6/2, 2010 at 1:57

@Mark: I want to take my chance and push the regex way. It's not efficient for me to make the changes you suggest. – Genie 6/2, 2010 at 3:45

@Mark, true. It is much easier to use XPath in that case! And you showed a valid solution. I guess my point is that if you're going to post "use an xml parser", show how you would use the XML parser to get the answer (which you did in that case). Most times I see it though, the answer stops at just "use an xml parser", when the asker is already so close to a valid answer and they just need a slight tweak for their specific case. I do get your point though. Perhaps you've converted me to put the hard work in and write XML parser answers for these questions in the future. Thanks. – Palish 6/2, 2010 at 12:58

Keep in mind that parsing HTML is a lot harder than just using DOMDocument if you plan to work with real world HTML that you have no control over, which could be HTML5 with unicode for example... in which case html5lib should be used (but it is also still in alpha) – Gluck 9/9, 2013 at 9:43

This is not perfect (that's what HTML parsers are for), but it will work for the vast majority of HTML files:

(^|>)[^<>]*##[^#]*##[^<>]*(<|$)

The idea is simple. You're looking for a string that is outside of tags. To be outside of tags, the closest preceding angle bracket must be a closing ">" (or there must be no bracket at all), and the closest following one must be an opening "<" (or none). This assumes that angle brackets are not used in attribute values.
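A quick way to try the pattern above is sketched below. The first half applies it unchanged with preg_match(). The second half is a variant that is not from this answer: since the pattern consumes the surrounding text, a replacement is easier if only the placeholder is matched and the following bracket is checked with a lookahead. Note that this variant checks only the following bracket, not the preceding one. The $translations array is a hypothetical lookup table.

```php
<?php
// Pattern from the answer: "##...##" whose nearest preceding bracket is ">"
// (or none) and whose nearest following bracket is "<" (or none).
$outsideTags = '/(^|>)[^<>]*##[^#]*##[^<>]*(<|$)/';

$inText = '<p>Here is my ##MyString## coming..</p>';
$inAttr = '<input type="text" value="Here is my ##MyString## coming..">';

$matchesText = preg_match($outsideTags, $inText); // 1: placeholder is in text
$matchesAttr = preg_match($outsideTags, $inAttr); // 0: placeholder is in a tag

// Variant for replacing (an assumption, not the answerer's pattern):
// match only the placeholder and require that the next bracket, if any,
// is an opening "<".
$placeholder  = '/##([^#<>]*)##(?=[^<>]*(?:<|$))/';
$translations = ['MyString' => 'translated text']; // hypothetical

$replaced = preg_replace_callback(
    $placeholder,
    function (array $m) use ($translations) {
        // Leave unknown placeholders untouched.
        return $translations[$m[1]] ?? $m[0];
    },
    $inText . $inAttr
);

echo $replaced, "\n";
```

With these inputs, the placeholder inside the paragraph is replaced and the one inside the value attribute is left as it was.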

If you actually care that the attribute name be "value", then you can match for:

value\s*=\s*"([^\"]|\\\")*##[^#]*##([^\"]|\\\")*\"

... and then simply negate the match (!preg_match(...)).
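The negated check could be sketched as follows. The pattern is a simplified form of the one above, under the assumption that attribute values are double-quoted and contain no backslash-escaped quotes (HTML itself uses &quot; rather than \" inside values). The helper name noPlaceholderInValue is made up for this example.

```php
<?php
// Simplified form of the pattern above (assumption: double-quoted
// attribute values without backslash-escaped quotes).
$inValueAttr = '/value\s*=\s*"[^"]*##[^#"]*##[^"]*"/';

// Hypothetical helper: true when no value="..." attribute in $html
// contains a ##...## placeholder.
function noPlaceholderInValue(string $html, string $pattern): bool
{
    return !preg_match($pattern, $html);
}

$samples = [
    '<input type="text" value="##MyString##">',
    '<input type="text" value="Here is my ##MyString## coming..">',
    '<p>Here is my ##MyString## coming..</p>',
];

foreach ($samples as $html) {
    // Prints bool(false), bool(false), bool(true) for the three samples.
    var_dump(noPlaceholderInValue($html, $inValueAttr));
}
```

Because the negation applies to the whole subject string, a document that contains placeholders both inside and outside value attributes is reported as false; the check is most useful on one tag or fragment at a time.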

Gametophyte answered 6/2, 2010 at 5:14 Comment(0)

@OP, you can do it simply without regex.

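// Sample input: placeholder inside a value attribute, preceded by spaces.
$text = '<input type="text" value="   ##MyString##">';
// Strip every space so the placeholder sits directly after value=".
$text = str_replace(" ", "", $text);
if (strpos($text, 'value="##') !== false) {
    // Take what follows value="## and cut it at the closing ##.
    $s = explode('value="##', $text);
    $t = explode("##", $s[1]);
    print "$t[0]\n"; // prints "MyString"
}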

Pulverize answered 6/2, 2010 at 1:2 Comment(3)

I believe there's too much overhead in this. When it comes to replacing, let's say, 50 strings, it will consume too many resources. And it is not always whitespace before ##MyString##; it may be anything. – Genie 6/2, 2010 at 1:35

If it's anything but spaces before ##MyString##, then it shouldn't match, as per your criteria, correct? As for overhead, there's no way to tell unless you run some benchmarks. – Pulverize 6/2, 2010 at 2:0

@Dali More code does not mean more overhead. This solution might even be faster than the regex one in some situations and slower in others. As ghostdog74 says, you need to actually try it. – Gluck 9/9, 2013 at 10:22

Here is a starting point at least; it works for the given examples.

(?<!<[^>]*value="[^>"]*)##(.*)##

Nonary answered 6/2, 2010 at 0:7 Comment(3)

Warning: preg_match(): Compilation failed: lookbehind assertion is not fixed length – Donatus 6/2, 2010 at 0:21

It fails with "Compilation failed: lookbehind assertion is not fixed length at offset 23" I am using PHP preg_match function – Genie 6/2, 2010 at 0:35

@mark, now that you mention it, I think .NET is the only engine that supports this kind of lookbehind! I concede that this problem is actually pretty challenging in any other language. My point above wasn't aimed specifically at you, and you are in fact probably right in this case, but I still say that a lot of people jump on the bandwagon without understanding. – Nonary 6/2, 2010 at 0:42
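Since PCRE rejects the variable-length lookbehind, one PCRE-specific workaround (not proposed anywhere in this thread) is to let the pattern match whole tags first and discard them with the backtracking control verbs (*SKIP)(*FAIL), so that only placeholders outside tags are left to match. The sketch below assumes that ">" never appears inside an attribute value; $translations is a hypothetical lookup table.

```php
<?php
// First alternative: match a whole tag, then force a failure and skip
// past it. Second alternative: a placeholder outside any tag.
$pattern = '/<[^>]*>(*SKIP)(*FAIL)|##([^#<>]*)##/';

$html = '<p>Hi ##Name##</p>'
      . '<input type="text" value="Here is my ##Name## coming..">';

// Hypothetical lookup table: placeholder name => replacement text.
$translations = ['Name' => 'World'];

$result = preg_replace_callback(
    $pattern,
    function (array $m) use ($translations) {
        // Only the second alternative can produce a match, so $m[1] is set.
        return $translations[$m[1]] ?? $m[0];
    },
    $html
);

echo $result, "\n";
```

With this input, the placeholder in the paragraph is replaced while the one inside the value attribute is skipped together with its tag.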
