Get src value of <img> tags with inconsistent quoting

2

5

I need a clever regex to match ... in these:

<img src="..."
<img src='...'
<img src=...

I want to match the inner content of src, but only if it is surrounded by a matching pair of ", a matching pair of ', or no quotes at all. This means that <img src=..." or <img src='... must not be accepted.

Any ideas on how to match these 3 cases with one regex?

So far I use something like ("|'|[\s\S])(.*?)\1, and the part I want to get rid of is the hacky [\s\S], which I use to match the "missing" quote at the beginning and the end of the ...

Spasmodic answered 28/10, 2010 at 22:26 Comment(5)
#1732848Nahshunn
It sounds like what you really need is an HTML parser, and not a regular expression.Microbiology
I use Java. And I DON'T need an HTML parser... really.Spasmodic
"clever" and "regex" rarely go together with a happy ending.Derm
This question is similar to: How to extract img src, title and alt from html using php?. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem.Considering
13

Wow, second one I'm answering today.

Don't parse HTML with regex. Use an HTML/XML parser and your life will be much easier. Tidy will clean up your HTML for you, so you can run the HTML through Tidy first and then through a parser. Some Tidy-based libraries parse in addition to sanitizing, so you may not even need to run the result through another parser.

Java, for example, has JTidy, and PHP has PHP Tidy.
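To give an idea of what the parser route looks like in Java, here is a minimal sketch. Note that it does not use JTidy: as an assumption for illustration it swaps in the lenient HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator), so it runs without extra dependencies. The class name ImgSrcParser and the method name extract are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the src value of every <img> the parser reports.
final class ImgSrcParser {

    static List<String> extract(String html) {
        final List<String> sources = new ArrayList<>();

        // <img> has no closing tag, so the parser reports it as a "simple" tag.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };

        try {
            // The third argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sources;
    }
}
```

The quoting style never shows up in your code here: the parser hands you the attribute value whether it was written with double quotes, single quotes or none.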

UPDATE

Against my better judgement, I'm giving you this:

/<img\s+src\s*=\s*(["'][^"']+["']|[^>]+)>/

This works only for your specific case. Even so, it does not account for escaped " or ' in your image source names, or for the > character, and there are probably a bunch of other limitations as well. The capturing group gives you your image names (for names surrounded by single or double quotes it includes the quotes, but you can strip those out).
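Since you are on Java, here is a hedged sketch of a stricter variant, not a drop-in for the pattern above. It uses one alternative per quoting style, so a value that opens with " must close with ", and the quotes stay out of the captured text. The class name ImgSrcRegex and the method extract are invented for illustration, and the same caveats apply: src must be the first attribute, and < or > inside the value is not supported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the three-alternative approach.
final class ImgSrcRegex {

    // Group 1: double-quoted value, group 2: single-quoted value,
    // group 3: unquoted value. Exactly one of them participates in a match.
    // The lookahead requires whitespace, '/' or '>' right after the value,
    // which rejects a stray quote such as <img src=a.png">.
    private static final Pattern IMG_SRC = Pattern.compile(
        "<img\\s+src\\s*=\\s*(?:\"([^\"<>]*)\"|'([^'<>]*)'|([^\\s\"'<>]+))(?=[\\s/>])",
        Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            // Pick whichever alternative matched.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    sources.add(m.group(g));
                    break;
                }
            }
        }
        return sources;
    }
}
```

Using an alternation instead of a backreference is what removes the need for a placeholder to stand in for the "missing" quote: the unquoted case simply gets its own branch.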

Tripp answered 28/10, 2010 at 22:34 Comment(5)
No, I planned not to use parser. The task is simple enough to be done by a small regex.Spasmodic
What we are telling you is that the task is not simple enough to be done by a small regex. If it was, you'd have already made it happen.Nevlin
@Lucho, if the task is simple enough to be done by a regex, why are you asking us? We're telling you that the task is not simple enough to be solved by a regex (small or otherwise).Tripp
OK, you've convinced me :-) The world is cruel and probably full of ugly, messed-up HTML, so a parser is the rescue... but in a perfect world it would probably be possible to just grep the content of the src attributes of img tags :DSpasmodic
@Lucho perhaps, but probably not. HTML is not regular :)Tripp
0

Depending on what scripting or programming language you are using to solve this, it can be done with either multiple regexes or a single regex that checks groups.

<img\b[^>]*?\ssrc\s*=\s*("([^"]*)"|'([^']*)'|([^\s"'>]+))[^>]*>

If all you want is the image src attribute, you don't need a full parser. If you want other attributes as well, just use a different regex for each. You will run into issues when the input contains multiple image tags, but in that case first match the image tags themselves, and then run your desired regex on each one.
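A rough Java sketch of that two-step idea, under stated assumptions: the class name ImgTagScanner, the method srcValues and both patterns are illustrative, not taken from any library. It still breaks on a > inside an attribute value, or on the text " src=" appearing inside another attribute's value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: step 1 finds each <img ...> tag, step 2 looks for src inside it.
final class ImgTagScanner {

    // Step 1: one whole img tag, from "<img" up to the next ">".
    private static final Pattern IMG_TAG =
        Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Step 2: the src attribute in any position. The leading \s keeps
    // attributes such as data-src from matching.
    private static final Pattern SRC_ATTR = Pattern.compile(
        "\\ssrc\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))",
        Pattern.CASE_INSENSITIVE);

    static List<String> srcValues(String html) {
        List<String> sources = new ArrayList<>();
        Matcher tag = IMG_TAG.matcher(html);
        while (tag.find()) {
            // Search only within the text of this one tag.
            Matcher src = SRC_ATTR.matcher(tag.group());
            if (src.find()) {
                for (int g = 1; g <= 3; g++) {
                    if (src.group(g) != null) {
                        sources.add(src.group(g));
                        break;
                    }
                }
            }
        }
        return sources;
    }
}
```

Restricting the second regex to a single tag is what keeps a match from running across several image tags, and pulling out another attribute only means swapping the second pattern.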

Inebriate answered 7/6, 2014 at 14:35 Comment(0)
