How to use JavaScript regex over multiple lines?
E

8

352
var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre.*?<\/pre>/gm );
alert(arr);     // null

I want the PRE block to be matched even though it spans newline characters. I thought the 'm' flag would do that, but it does not.

I found the answer here before posting. Since I thought I knew JavaScript (read three books, worked hours) and there wasn't an existing solution on SO, I'll dare to post anyway. Throw stones here.

So the solution is:

var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[\s\S]*?<\/pre>/gm );
alert(arr);     // <pre>...</pre> :)

Does anyone have a less cryptic way?

Edit: this is a duplicate, but since the original is harder to find than mine, I won't remove this one.

It proposes [^] as a "multiline dot". What I still don't understand is why [.\n] does not work. Guess this is one of the sad parts of JavaScript..

Emersen answered 30/12, 2009 at 12:13 Comment(3)
A less cryptic regex? Impossible, by nature.Spirituality
btw, you should read: "Parsing Html: The Cthulhu Way" codinghorror.com/blog/archives/001311.htmlSpirituality
The link changed from the previous comment: blog.codinghorror.com/parsing-html-the-cthulhu-way (5yrs-ish later)Reverberation
S
283

[.\n] does not work because . has no special meaning inside of [], it just means a literal .. (.|\n) would be a way to specify "any character, including a newline". If you want to match all newlines, you would need to add \r as well to include Windows and classic Mac OS style line endings: (.|[\r\n]).

That turns out to be somewhat cumbersome, as well as slow, (see KrisWebDev's answer for details), so a better approach would be to match all whitespace characters and all non-whitespace characters, with [\s\S], which will match everything, and is faster and simpler.
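As a minimal sketch of the point above, the three patterns can be run side by side against the string from the question. The variable names are illustrative only and are not part of the original answer.

```javascript
// Sample string from the question.
const sample = "<pre>aaaa\nbbb\nccc</pre>ddd";

// Inside a character class "." is a literal dot, so [.\n] only
// matches dots and newlines and never gets past the ">" character.
const literalDot = sample.match(/<pre[.\n]*?<\/pre>/g);

// Alternation works, but it introduces a capturing group.
const alternation = sample.match(/<pre(.|[\r\n])*?<\/pre>/g);

// Whitespace plus non-whitespace covers every character.
const anyChar = sample.match(/<pre[\s\S]*?<\/pre>/g);

console.log(literalDot);  // null
console.log(alternation); // one match: the whole <pre>...</pre> block
console.log(anyChar);     // one match: the whole <pre>...</pre> block
```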

In general, you shouldn't try to use a regexp to match the actual HTML tags. See, for instance, these questions for more information on why.

Instead, try actually searching the DOM for the tag you need (using jQuery makes this easier, but you can always do document.getElementsByTagName("pre") with the standard DOM), and then search the text content of those results with a regexp if you need to match against the contents.

Strati answered 30/12, 2009 at 18:29 Comment(6)
What I'm doing is making .wiki -> HTML conversion on the fly, using JavaScript. Therefore, I don't have the DOM available, yet. Wiki file is mostly its own syntax, but I allow HTML tags to be used if needed. Your advice is very valid, if I was dealing in DOM with this. Thanks. :)Emersen
Fair enough. I suppose that is a valid reason to want to use regexes on HTML, though wiki syntaxes mixed with HTML can have all kinds of fun corner cases themselves.Strati
[\r\n] applied to the sequence \r\n would first match \r and then \n. If you want to match the entire sequence at once, regardless of whether that sequence is \r\n or just \n, use the pattern .|\r?\nBouffant
To match an entire multiline string, try the greedy [\s\S]+.Electromagnetism
I just want to add for posterity that JS regex syntax ignoring the meaning of . inside [] is different than other regex frameworks, particularly the advanced one in .NET. People, please do not assume that regexes are cross platform, they frequently are not!!Faulk
Also, be careful with escaping: In new RegExp('...') you will need [\\s\\S] whereas in /.../ a simple [\s\S] is sufficient! Another thing to be aware of is to NOT use the multiline flag!Flimflam
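The escaping point in the last comment can be sketched as follows. The sample text and variable names are assumptions made for illustration. The last two lines show that the m flag only changes how ^ and $ behave; it has no effect on what . or [\s\S] match.

```javascript
const text = "<pre>aaaa\nbbb</pre>";

// Regex literal: single backslashes are enough.
const fromLiteral = /<pre[\s\S]*?<\/pre>/;

// String passed to the constructor: backslashes must be doubled.
const fromString = new RegExp("<pre[\\s\\S]*?</pre>");

// With single backslashes the string escape "\s" collapses to "s",
// so this pattern ends up as /<pre[sS]*?<\/pre>/.
const broken = new RegExp("<pre[\s\S]*?</pre>");

console.log(fromLiteral.test(text)); // true
console.log(fromString.test(text));  // true
console.log(broken.test(text));      // false

// The m flag only lets ^ and $ match at line boundaries.
console.log(/^bbb/.test("aaa\nbbb"));  // false
console.log(/^bbb/m.test("aaa\nbbb")); // true
```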
L
398

DON'T use (.|[\r\n]) instead of . for multiline matching.

DO use [\s\S] instead of . for multiline matching

Also, avoid greediness where not needed by using *? or +? quantifier instead of * or +. This can have a huge performance impact.
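Apart from speed, the greedy and lazy quantifiers can match different text. A small sketch, using a made-up input with two PRE blocks:

```javascript
// Hypothetical input containing two separate <pre> blocks.
const html = "<pre>one</pre> text <pre>two</pre>";

// Greedy: runs to the LAST </pre>, swallowing everything in between.
const greedy = html.match(/<pre[\s\S]*<\/pre>/g);

// Lazy: stops at the FIRST </pre> after each opening tag.
const lazy = html.match(/<pre[\s\S]*?<\/pre>/g);

console.log(greedy.length); // 1
console.log(lazy.length);   // 2
```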

See the benchmark I have made: https://jsben.ch/R4Hxu

Using [^]: fastest
Using [\s\S]: 0.83% slower
Using (.|\r|\n): 96% slower
Using (.|[\r\n]): 96% slower

NB: You can also use [^], but the comment below recommends against it.

Leggat answered 20/4, 2013 at 11:20 Comment(7)
Good points, but I recommend against using [^] anyway. On one hand, JavaScript is the only flavor I know that supports that idiom, and even there it's used nowhere near as often as [\s\S]. On the other hand, most other flavors let you escape the ] by listing it first. In other words, in JavaScript [^][^] matches any two characters, but in .NET it matches any one character other than ], [, or ^.Heliport
How do you know that \S will match \r or \n versus some other character?Falconer
See this question for \s\S details. This is a hack to match all white-space characters + all non-whitespace characters = all characters. See also MDN for regexp special character documentation.Leggat
+1 [\s\S] is just what I need instead of (.|\n)* to prevent the result from creating an extra capturing group :)Hodge
So what is the solution, then? Is it ss.match( /<pre[\s\S]*?<\/pre>/gm ); or is it something else? Thanks.Misology
Any reason to prefer [\s\S] over others, like [\d\D] or [\w\W]?Wilsey
Let me quickly point out that your test for the greedy operator is rigged. /<p>Can[^]*?<\/p>/ doesn't match the same content as /<p>Can[^]*<\/p>/. The greedy variant should be changed to /<p>(?:[^<]|<(?!\/p>))*<\/p>/ to match the same content.Azole
O
58

You do not specify your environment and version of JavaScript (ECMAScript), and I realise this post was from 2009, but just for completeness:

With the release of ECMAScript 2018 we can now use the s flag to cause . to match \n (see https://mcmap.net/q/94053/-how-to-use-dotall-flag-for-regex-exec).

Thus:

let s = 'I am a string\nover several\nlines.';
console.log('String: "' + s + '".');

let r = /string.*several.*lines/s; // Note 's' modifier
console.log('Match? ' + r.test(s)); // 'test' returns true

This is a recent addition and will not work in many current environments; for example, Node v8.7.0 does not seem to recognise it. It does work in Chromium, and I'm using it in a TypeScript test I'm writing. Presumably it will become more mainstream as time goes by.
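Where support is uncertain, one possible approach is to detect the flag at runtime and fall back to [\s\S]. This is only a sketch: supportsDotAll and makePreMatcher are hypothetical helper names, not part of any library. The pattern is built with the RegExp constructor so that an engine without the s flag does not fail while parsing the script.

```javascript
// Hypothetical helper: true when the engine accepts the "s" flag.
function supportsDotAll() {
  try {
    new RegExp(".", "s");
    return true;
  } catch (e) {
    // Engines that do not know the flag throw a SyntaxError here.
    return false;
  }
}

// Hypothetical helper: picks whichever form of the pattern is available.
function makePreMatcher() {
  return supportsDotAll()
    ? new RegExp("<pre.*?</pre>", "gs")
    : /<pre[\s\S]*?<\/pre>/g;
}

const found = "<pre>aaaa\nbbb\nccc</pre>ddd".match(makePreMatcher());
console.log(found.length); // 1
```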

Oxblood answered 11/4, 2018 at 4:17 Comment(2)
This works great in Chrome (v67) but completely breaks the regex (also stops working line-by-line) in IE11 and IEdge(v42)Cribbing
Thanks @freedomn-m .. IE not supporting a very new feature is almost completely unsurprising :) But yes, it's worth mentioning where it doesn't work to save anyone trying to 'debug' why their attempt to use it isn't working as expected.Oxblood
R
22

Now there's the s (single line) modifier, which lets the dot match newlines as well :) \s will also match newlines :D

Just add the s after the closing slash:

 /<pre>.*?<\/pre>/gms
Rudiment answered 1/2, 2021 at 22:42 Comment(0)
P
13

[.\n] doesn't work, because dot in [] (by regex definition; not javascript only) means the dot-character. You can use (.|\n) (or (.|[\n\r])) instead.

Pangenesis answered 30/12, 2009 at 18:18 Comment(2)
[\s\S] is the most common JavaScript idiom for matching everything including newlines. It's easier on the eyes and much more efficient than an alternation-based approach like (.|\n). (It literally means "any character that is whitespace or any character that isn't whitespace.)Heliport
You're right, but the question was about . and \n, and why [.\n] doesn't work. As mentioned in the question, the [^] is also nice approach.Pangenesis
B
11

I have tested it (Chrome) and it works for me when the dot (.) is replaced with either [^\0] or [^], because the dot doesn't match line breaks (see here: http://www.regular-expressions.info/dot.html).

var ss= "<pre>aaaa\nbbb\nccc</pre>ddd";
var arr= ss.match( /<pre[^\0]*?<\/pre>/gm );
alert(arr);     //Working
Berzelius answered 4/7, 2017 at 13:10 Comment(1)
The problem with [^\0] is that it won't match null characters even though null characters are allowed in Javascript strings (see this answer).Scornik
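The limitation described in this comment can be checked with a short sketch. The input string is made up for illustration and contains a null character inside the PRE block.

```javascript
// Hypothetical input with a NUL character (\0) between "a" and "b".
const withNull = "<pre>a\0b\nc</pre>";

// [^\0] matches anything except NUL, so the match stops at the NUL.
const notNull = withNull.match(/<pre[^\0]*?<\/pre>/);

// [\s\S] has no such gap.
const anything = withNull.match(/<pre[\s\S]*?<\/pre>/);

console.log(notNull);           // null
console.log(anything !== null); // true
```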
B
0

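In addition to the examples above, here is an alternative:

^[\\w\\s]*$

where \w matches word characters (letters, digits and underscore) and \s matches whitespace, including newlines. The backslashes are doubled because the pattern is written as a string.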
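A brief sketch of what this pattern does and does not cover, assuming it is passed to the RegExp constructor as a string. It spans newlines, but it rejects any character outside \w and \s, such as the angle brackets in the question's input.

```javascript
// Doubled backslashes because the pattern is given as a string.
const wordsAndSpaces = new RegExp("^[\\w\\s]*$");

console.log(wordsAndSpaces.test("line one\nline two")); // true
console.log(wordsAndSpaces.test("<pre>a\nb</pre>"));    // false
```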

Burletta answered 16/2, 2018 at 7:4 Comment(0)
M
0

[\\w\\s]*

This one was beyond helpful for me, especially for matching multiple things that include newlines. Every other answer ended up grouping all of the matches together.

Morena answered 3/1, 2021 at 7:22 Comment(0)
