Lua pattern matching vs. regular expressions

4

67

I'm currently learning Lua. Regarding pattern matching in Lua, I found the following sentence in the Lua documentation on lua.org:

Nevertheless, pattern matching in Lua is a powerful tool and includes some features that are difficult to match with standard POSIX implementations.

As I'm familiar with POSIX regular expressions, I would like to know whether there are any common examples where Lua pattern matching is "better" than regular expressions, or did I misinterpret the sentence? And if there are such examples: why is one of the two, pattern matching or regular expressions, better suited than the other?

Edveh answered 22/4, 2010 at 18:18 Comment(2)
A link to where you read this in the docs would be nice. - Supersensual
@g33kz0r The docs are available at lua.org/pil/20.1.html. The citation is the last sentence of the second paragraph (the one starting with "Unlike several other scripting languages, ..."). - Edveh
73

Are any common samples where lua pattern matching is "better" compared to regular expression?

It is not so much particular examples as the overall design: Lua patterns have a higher signal-to-noise ratio than POSIX regular expressions, and that is what often makes them preferable.

Here are some factors that contribute to the good design:

  • Very lightweight syntax for matching common character types including uppercase letters (%u), decimal digits (%d), space characters (%s) and so on. Any character type can be complemented by using the corresponding capital letter, so pattern %S matches any nonspace character.

  • Quoting is extremely simple and regular. The quoting character is %, so it is always distinct from the string-quoting character \, which makes Lua patterns much easier to read than POSIX regular expressions (when quoting is necessary). It is always safe to quote symbols, and it is never necessary to quote letters, so you can just go by that rule of thumb instead of memorizing what symbols are special metacharacters.

  • Lua offers "captures" and can return multiple captures as the result of a match call. This interface is much, much better than capturing substrings through side effects or having some hidden state that has to be interrogated to find captures. Capture syntax is simple: just use parentheses.

  • Lua has a "shortest match" modifier, -, to go along with the "longest match" operator, *. So for example s:find '%s(%S-)%.' finds the shortest sequence of nonspace characters that is preceded by a space and followed by a dot.

  • The expressive power of Lua patterns is comparable to POSIX "basic" regular expressions, without the alternation operator |. What you are giving up is "extended" regular expressions with |. If you need that much expressive power, I recommend going all the way to LPeg, which gives you essentially the power of context-free grammars at quite reasonable cost.
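
A few of the points in the list above can be tried in a plain Lua interpreter. The following is only a minimal sketch; the sample strings and variable names are made up for illustration and are not from the Lua documentation.

```lua
-- Captures: string.match returns every capture as a separate result,
-- so there is no hidden match state to interrogate afterwards.
local date = "released on 2010-04-23"
local year, month, day = date:match("(%d+)%-(%d+)%-(%d+)")
print(year, month, day)                  --> 2010  04  23

-- %S is the complement of %s: any nonspace character.
local word = ("  hello world"):match("%S+")
print(word)                              --> hello

-- Longest match (*) versus shortest match (-).
local text = "<b>one</b> and <b>two</b>"
local greedy = text:match("<b>(.*)</b>")
local lazy = text:match("<b>(.-)</b>")
print(greedy)                            --> one</b> and <b>two
print(lazy)                              --> one

-- The pattern from the list: a space, the shortest run of nonspace
-- characters, then a literal dot (quoted with %).
local s = "see foo.bar"
local start, stop, name = s:find("%s(%S-)%.")
print(start, stop, name)                 --> 4  8  foo
```

Note that the literal "-" in the date pattern has to be written as "%-", because an unquoted "-" is the shortest-match modifier.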

Prostate answered 23/4, 2010 at 5:47 Comment(9)
Thanks, a lot of information. I think I have to delve deeper into Lua pattern matching before I fully understand what was meant by the quoted sentence ... - Edveh
Isn't the "shortest match" modifier just the same as the PCRE "frugal match" operator "*?"? - Rhythmandblues
There is also %bxy, which matches a balanced pair of delimiters, such as parentheses or braces. Balanced parenthesis matching cannot be done in POSIX regular expressions. Also, there is the frontier pattern, which is present but undocumented in Lua 5.1 and becomes a documented feature in 5.2. The wiki says "The frontier pattern %f followed by a set detects the transition from "not in set" to "in set"". This operation is possible but a lot more verbose in regexp. - Nesselrode
(This post is immortalized as a top result for google.com/search?q=lua+decimal+regex , where I came to find out what to do when "\d" didn't work. Good to save the next person the half-step toward the solution. [Thanks for writing this post to do most of the work.]) - Aerophobia
In Lua, the modifiers *, +, -, and ? can only be applied to a character class. I wish I could group patterns under a modifier. For example, '(xx)*x' would match an odd number of x's. I have an app that lets users perform searches with Lua pattern strings. I would like to be able to modify their pattern to make it case-insensitive. Thus '%%ab%ac%%%a' would become '%%[aA][bB]%a[cC]%%%a'. The ability to search for an even number of escape chars ('%') would be useful here. Something like p = str:gsub("(%%%%)*%a", function(a,b) return string.format("%s[%s%s]", a, b:lower(), b:upper()) end) - Kablesh
Bear in mind that you don't get Unicode. Lua patterns match on bytes. If you're using a multibyte encoding, you have to be very careful. - Framing
Besides what is stated in the Lua specs, the only real advantages that Lua patterns offer are "%bXY" to match pairs (by adding an additional counter to the finite state machine) and "%f[set]" frontiers (additional types of anchors). - Prent
Everything else is fully covered by POSIX regexps. I see no real advantage in using '%' instead of '\' when Lua also has its own use of '\' for string escapes, which creates even more confusion: for example when you need to write "%\\", or when you still need a "%" before "\045" or "\x2D" or "\u{002D}" to match a literal hyphen only, but MUST NOT use a "%" before "\097" or "\x61" or "\u{0061}" to match a literal 'a' only! - Prent
Another minor advantage is that it requires only ~500 lines of C code to implement, versus ~4000 for full POSIX regexps (but much less if all you want is to add the critically missing "|" feature). Those additional lines of source code don't generate much binary code, and POSIX regexps are already implemented and used on the same system. This extra cost in the engine is very small (compared to the memory needs of the whole basic Lua engine itself and its default "standard" library). But it saves Lua implementers the cost of doing coverage tests. - Prent
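
Several points from the comments above can be checked directly. The following is a rough sketch under stated assumptions: case_insensitive is a hypothetical helper written for this illustration (it is not a Lua library function), and it only handles plain letters and simple % escapes, not letters inside [sets] or the arguments of %b and %f.

```lua
-- %b() matches a balanced pair of parentheses, including nested ones.
local expr = "f(a, g(b, c)) + h(d)"
local balanced = expr:match("%b()")
print(balanced)                          --> (a, g(b, c))

-- %f[set] is a frontier: the position where the previous character is
-- not in the set and the next one is. The two frontiers below make
-- "the" match as a whole word only.
-- (Undocumented in Lua 5.1, documented from Lua 5.2 on.)
local sentence = "other then the"
local count = select(2, sentence:gsub("%f[%a]the%f[%A]", ""))
print(count)                             --> 1

-- Patterns work on bytes, not on Unicode characters.
local e_acute = "\195\169"               -- the two UTF-8 bytes of U+00E9
print(#e_acute, #(e_acute:match(".")))   --> 2  1

-- Hypothetical helper: make the plain letters of a pattern
-- case-insensitive. "(%%?)(.)" consumes either a %-escaped pair or a
-- single character, so "%%a" is read as a literal % followed by the
-- letter a, while "%a" is left alone as a character class.
local function case_insensitive(pattern)
  return (pattern:gsub("(%%?)(.)", function(escape, ch)
    if escape == "" and ch:match("%a") then
      return "[" .. ch:lower() .. ch:upper() .. "]"
    end
    return escape .. ch
  end))
end

local converted = case_insensitive("%%ab%ac%%%a")
print(converted)                         --> %%[aA][bB]%a[cC]%%%a
```

The helper sidesteps the missing group repetition by letting gsub walk the pattern left to right in units, which keeps track of escape parity without needing something like "(%%%%)*".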
9

http://lua-users.org/wiki/LibrariesAndBindings contains a listing of functionality including regex libraries if you wish to continue using them.

To answer the question (and note that I'm by no means a Lua guru): the language has a strong tradition of being used in embedded applications, where a full regex engine would unduly increase the size of the code on the platform. Such an engine is sometimes much larger than the entire Lua library itself.

[Edit] I just found in the online version of Programming in Lua (an excellent resource for learning the language) where this is described by one of the principles of the language: see the comments below [/Edit]

I find personally that the default pattern matching Lua provides satisfies most of my regex-y needs. Your mileage may vary.

Inferno answered 22/4, 2010 at 18:42 Comment(2)
OK -- I thought it wasn't just about the size. I read that Lua's pattern matching library is about 500 LOC compared to regexp libs with ~4000 LOC. That's cool, but I thought it was also about convenience: I'm doing a lot with regexps and I know that this stuff can get very complex and complicated. So: are there any other features which make Lua's pattern matching more convenient or easier to use than POSIX regexps, besides the LOC? Please keep in mind: it's about learning, not flaming. - Edveh
I'd agree with what Norman posted (which is why he would get my upvote if I had the reputation!). I can't add much more than that other than the personal aesthetic of using it - it just feels better to me. Again, YMMV :) FWIW, when I bounce between differing regex/pattern-matching styles (sed vs. Lua, for instance), it does cause me a headache and often sends me running to the documentation. I tend to stay in the tool that I use most often for this, which happens to be Lua. - Inferno
3

Ok, just a slight noob note for this discussion; I particularly got confused by this page:

SciTE Regular Expressions

since that one says \s matches whitespace, as I know from other regular expression syntaxes... And so I'm trying it in a shell:

$ lua
Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> c="   d"
> print(c:match(" "))

> print(c:match("."))

> print(c:match("\s"))
nil
> print("_".. c:match("[ ]") .."_")
_ _
> print("_".. c:match("[ ]*") .."_")
_   _
> print("_".. c:match("[\s]*") .."_")
__

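Hmmm... seems \s doesn't get recognized here, so that page probably refers to the regular expressions in SciTE's Find/Replace, not to Lua's pattern syntax (which SciTE also uses). (In Lua 5.1 an unrecognized string escape such as \s simply yields the letter s, which is why nothing matched; Lua 5.2 and later reject it as an invalid escape sequence.)

Then I reread lua-users wiki: Patterns Tutorial, and started to get the comment in @NormanRamsey's answer about the escape character being %, not \. So, trying this: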

> print("_".. c:match("[%s]*") .."_")
_   _

... does indeed work.

So, although I originally thought that Lua's "patterns" were a different set of commands/engine from Lua's "regular expressions", I guess a better way to say it is: Lua's "patterns" are the Lua-specific "regular expression" syntax/engine (in other words, there aren't two of them :) )

Cheers!

Grimonia answered 2/5, 2012 at 10:34 Comment(0)
3

At the risk of getting some downvotes for speaking the truth, I'll be bluntly honest about it (like an answer should be, after all): aside from being able to return multiple captures for a single match call (possible in regular expressions, but in a much more convoluted manner) and the %bxy pattern, which matches a balanced pair of delimiters (e.g. all kinds of brackets and such) and qualifies as useful, powerful and "better", almost everything Lua patterns can do, regular expressions can do as well.

The shortcomings of Lua patterns compared to regular expressions when it comes to "features", on the other hand, are significant and too many to mention (e.g. lack of OR, lack of non-capturing groups, lack of lookaround expressions, etc.). Now that would be balanced if, say, Lua patterns were significantly faster than the usually slower regular expressions, but I'm not sure whether, and where, such a comparison exists, one that would exclude the general native Lua speed due to its lightweight nature, the use of tables and so on.

The real reason Lua didn't bother to add regular expressions to its toolbox can't be the length of the required code (that's nonsense, modern computers don't even blink when it comes to 4000 lines of code vs "just" 500, even if it translates a bit differently into a library). It is probably that, because Lua is a scripting language, it was assumed that the "parent" language already includes the ability to use regular expressions. It is plainly obvious when looking at the overall picture that Lua as a language was designed with simplicity, speed and only the necessary features in mind. It works well in most cases, but if you need more capabilities in this area and you cannot replicate them using Lua's other features, regular expressions are more comprehensive.
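
As an illustration of replicating a missing feature with Lua's other features, the lack of OR can often be worked around with a small loop. This is only a sketch: match_any is a hypothetical helper written for this example, not something Lua provides, and it is not a general replacement for regex alternation inside a larger pattern.

```lua
-- Hypothetical helper: try each pattern in turn and return the match
-- that starts earliest in the string. On a tie, the pattern listed
-- first wins, similar to how alternation is usually ordered.
local function match_any(s, patterns, init)
  local best_start, best_stop
  for _, p in ipairs(patterns) do
    local start, stop = s:find(p, init)
    if start and (not best_start or start < best_start) then
      best_start, best_stop = start, stop
    end
  end
  if best_start then
    return s:sub(best_start, best_stop), best_start, best_stop
  end
  return nil
end

-- Rough equivalent of the regular expression  cat|dog|bird
local found, from, to = match_any("a dog chased the cat", { "cat", "dog", "bird" })
print(found, from, to)                   --> dog  3  5
```

For anything beyond a handful of alternatives, a grammar library such as LPeg or one of the regex bindings is probably the more maintainable choice.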

The good thing is that the differences in syntax between Lua patterns and regular expressions are mostly minor, so if you know one you can adapt to the other relatively easily.

Hutcheson answered 26/5, 2021 at 11:57 Comment(0)
