What is the proper Lua pattern for quoted text?
T

4

5

I've been playing with this for an hour or two and have hit a roadblock with Lua's pattern-matching utilities. I am attempting to match all quoted text in a string and replace it if needed.

The pattern I have come up with so far is: (\?[\"\'])(.-)%1

This works in some cases, but not all:

Working: "This \"is a\" string of \"text to\" test with"

Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"

In the non-working example I would like it to match the following (I have written a function that gets the matches I want; I'm just looking for a pattern to use with gsub, and I'm curious whether a Lua pattern can do this):

 string
 a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit

I'm going to continue to use my function for the time being, but I'm curious whether there is a pattern I could or should be using and I'm just missing something about patterns.

(A few edits because I forgot about Stack Overflow's formatting.) (Another edit to use a non-HTML example, since the original was leading to assumptions that I was attempting to parse HTML.)

Top answered 30/11, 2010 at 18:50 Comment(1)
Possible duplicate of "RegEx match open tags except XHTML self-contained tags" - Goulash
S
5

Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daisies) from a field using a lawnmower.

I made a function that gets the matches I desire

This is the correct move.

I'm curious if a lua pattern can do this

From a practical point of view, even if a pattern can do this, you don't want to use one. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (written with Lua quoting conventions):

[[[^\](\\)*"(.-[^\](\\)*)"]]

And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern. So my guess is that this problem cannot be solved using Lua patterns, but since Lua patterns are not a standard thing in automata theory, I'm not aware of any body of proof technique that you could use to prove it.

Shaeffer answered 1/12, 2010 at 3:5 Comment(1)
Thanks to both Norman and Kevin; these are exactly the answers I was expecting and looking for. - Top
P
2

The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
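The "iterate through the string and parse it" option can be sketched as follows. This is only a minimal illustration, assuming double quotes are the only delimiter and that a backslash always escapes the single character after it; the function name quoted_segments is hypothetical and is not the asker's function.

```lua
-- Hypothetical helper: collects the contents of each double-quoted segment
-- in s, treating a backslash as escaping whatever character follows it.
local function quoted_segments(s)
  local results = {}
  local start = nil   -- index of the opening quote, or nil while outside quotes
  local i = 1
  while i <= #s do
    local c = s:sub(i, i)
    if c == "\\" then
      -- Skip the backslash and the character it escapes, so an escaped
      -- quote is never treated as a delimiter.
      i = i + 2
    else
      if c == '"' then
        if start then
          results[#results + 1] = s:sub(start + 1, i - 1)
          start = nil
        else
          start = i
        end
      end
      i = i + 1
    end
  end
  return results
end

for _, segment in ipairs(quoted_segments([[say "a \"b\" c" and "d\\" end]])) do
  print(segment)
end
```

Because the scanner consumes a backslash together with the character after it, the odd/even backslash count never has to be computed explicitly.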

If you absolutely must use patterns for some reason, you could try doing this in a multi-step process. First, gsub for all occurrences of two backslashes in a row, and replace them with some sentinel value. This must be a value that does not already occur in the string. You could try something like "\001" if you know this string doesn't contain non-printable characters. Anyway, once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and then finally you can replace all instances of your sentinel value with two backslashes again.
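The multi-step idea above might look roughly like this. It is a sketch under stated assumptions, not a drop-in solution: it handles double quotes only, it swaps the asker's original pattern for a simpler one, and it adds a second sentinel for escaped quotes, which is an extension of the steps described above. The names PAIR, QUOTE, restore, and gsub_quoted are hypothetical.

```lua
-- Assumption: neither of these bytes ever occurs in the input string.
local PAIR, QUOTE = "\1", "\2"

-- Turn the sentinels back into the text they stand for.
local function restore(s)
  return (s:gsub(QUOTE, '\\"'):gsub(PAIR, "\\\\"))
end

-- Hypothetical helper: calls fn on the contents of every double-quoted
-- segment of s and substitutes the result, leaving escaped quotes alone.
local function gsub_quoted(s, fn)
  assert(not s:find("[\1\2]"), "input already contains a sentinel byte")
  -- Step 1: hide every pair of backslashes; any backslash that is left
  -- now escapes the character after it.
  local t = s:gsub("\\\\", PAIR)
  -- Step 2 (an addition to the steps above): hide escaped double quotes,
  -- so every remaining quote is a real delimiter.
  t = t:gsub('\\"', QUOTE)
  -- Step 3: an ordinary Lua pattern is now enough to find each segment.
  t = t:gsub('"([^"]*)"', function(body)
    return '"' .. fn(restore(body)) .. '"'
  end)
  -- Step 4: put the hidden sequences back.
  return restore(t)
end

print(gsub_quoted([[say "a \"b\" c" and "d\\" end]], string.upper))
```

The sentinel check at the top matters: if the input already contained one of the sentinel bytes, the final restore step would corrupt it.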

Pacifism answered 30/11, 2010 at 22:6 Comment(0)
X
1

Lua's pattern language is adequate for many simple cases. And it has at least one trick you don't find in a typical regular expression package: a way to match balanced parentheses (the %b item, as in %b()). But it has its limits as well.

When those limits are exceeded, I reach for LPeg. LPeg is an implementation of Parsing Expression Grammars for Lua, and it was implemented by one of Lua's original authors, so the adaptation to Lua is done quite well. A PEG lets you specify anything from simple patterns through complete language grammars. LPeg compiles the grammar to bytecode and executes it extremely efficiently.
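As a rough illustration of what this looks like for the quoted-text problem, here is a small sketch. It assumes the LPeg library is installed and loadable as "lpeg", handles double quotes only, and treats a backslash as escaping the next character; the names escaped, plain, quoted, scanner, and quoted_segments_peg are hypothetical.

```lua
local lpeg = require("lpeg")  -- assumption: LPeg is installed
local P, S, C, Ct = lpeg.P, lpeg.S, lpeg.C, lpeg.Ct

-- A backslash followed by any one character (the escaped character).
local escaped = P("\\") * P(1)
-- Any character that is neither a double quote nor a backslash.
local plain = P(1) - S('"\\')
-- An opening quote, a captured body of escapes and plain characters,
-- and a closing quote.
local quoted = P('"') * C((escaped + plain)^0) * P('"')
-- Scan the whole subject: capture quoted segments, skip everything else.
local scanner = Ct((quoted + escaped + P(1))^0)

-- Hypothetical helper: returns a table of the quoted bodies found in s.
local function quoted_segments_peg(s)
  return lpeg.match(scanner, s)
end

for _, segment in ipairs(quoted_segments_peg([[say "a \"b\" c" and "d\\" end]])) do
  print(segment)
end
```

The repetition applied to a parenthesized alternative, (escaped + plain)^0, is exactly the construct that Lua's built-in patterns lack.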

Xerophyte answered 1/12, 2010 at 7:26 Comment(0)
G
0

You should NOT be trying to parse HTML with regular expressions. HTML and XML are NOT regular languages and cannot be successfully manipulated with regular expressions. You should use a dedicated HTML parser. Here are lots of explanations why.

Goulash answered 30/11, 2010 at 20:41 Comment(4)
I couldn't care less about the HTML; it was just my test string that I grabbed from a random file I had open. All I care about are the quotes. - Top
Then I would suggest using a non-HTML example to remove that ambiguity. - Goulash
Are you treating ' and " equally as quotes? If so, how would you expect your 'not working' example to be parsed? For instance, "is' a" string" of' text" contains overlapping quotes. Are we supposed to find "is' a" and " of' text", or ' a" string" of', or all three? If it's the latter, you're going to need to do that in two passes. - Whitman
(Arguably) THE answer: #1732848 - Walkling
