Using regexes, how to efficiently match strings between double quotes with embedded double quotes?
Asked Answered
D

1

6

Let us have a text in which we want to match all strings between double quotes; but within these double quotes, there can be quoted double quotes. Example:

"He said \"Hello\" to me for the first time"

Using regexes, how do you match this efficiently?

Dardanus answered 11/6, 2013 at 11:54 Comment(7)
Shouldn't you ask a question and then post this as an answer ? I'm afraid this gets closed as not a real question ...Stabler
This is called "unrolling the loop" (or Friedl calls it that in his book). I think you should move this (or the main part of this) into an answer, as it's certainly not a question. A good question could be "What's an efficient way to match strings with embedded quotes?" or something like that.Smidgen
OK, I will make that an answer... This is my first attempt at such an entry!Dardanus
How about now? Anything else to modify/improve?Dardanus
Very good. Well, the title still contains the answer (and is a bit long because of that) - why not remove the second half?Smidgen
@TimPietzcker I can do that... What about "flagging" it as a community entry in the subject? This is meant to be a candidate. Are there best practices for that?Dardanus
@fge: Why? You deserve the reputation you're earning for it. And when enough other people modify it, it becomes a CW automatically.Smidgen
D
17

A very efficient solution to match such inputs is to use the normal* (special normal*)* pattern; this name is quoted from the excellent book by Jeffrey Friedl, Mastering Regular Expressions.

It is a pattern useful in general to match inputs consisting of regular entries (the normal part) with separators inbetween (the special part).

Note that like all things regex, it should be used when there is no better choice; while one could use this pattern for parsing CSV data, for instance, if you use Java, you're better off using OpenCSV instead.

Also note that while the quantifiers in the pattern name are stars (ie, zero or more), you can vary them to suit your needs.

Strings with embedded double quotes

Let us take the above example again; and please consider that this text sample may be anywhere in your input:

"He said \"Hello\" to me for the first time"

No matter how hard you try, no amount of "dot plus greedy/lazy quantifiers" magic will help you solve it. Instead, categorize the input between quotes as normal and special:

  • normal is anything but a backslash or a double quote: [^\\"];
  • special is the sequence of a backslash followed by a double quote: \\".

Substituting this into the normal* (special normal*)* pattern, this gives the following regex:

[^\\"]*(\\"[^\\"]*)*

Adding the double quotes around to match the full text gives the final regex:

"[^\\"]*(\\"[^\\"]*)*"

You will note that this will also match empty quoted strings.

Words with dash separators

Here we will have to use a variant on the quantifiers, since:

  • we don't want empty words,
  • we don't want words starting with a dash,
  • when a dash appears, it must have at least one letter before another dash, if any.

For simplicity, we will also suppose that only lowercase, ASCII letters are allowed.

Sample input:

the-word-to-match

Let us decompose again into normal and special:

  • normal: a lowercase, ASCII letter: [a-z];
  • special: the dash: -

The canonical form of the pattern would be:

[a-z]*(-[a-z]*)*

But as we said:

  • we don't want words starting with a dash: the first * should become +;
  • when a dash is found, there should be at least one letter after it: the second * should become +.

We end up with:

[a-z]+(-[a-z]+)*

Adding word anchors around it to obtain the final result:

\b[a-z]+(-[a-z]+)*\b

Other operator variations

The examples above limit themselves to replacing * with +, but of course you can have as many variations as you wish. One ultra classical example would be an IP address:

  • normal is up to three digits (\d{1,3}),
  • special is the dot: (\.),
  • the first normal appears only once, therefore no quantifier,
  • the normal inside the (special normal*) also appears only once, therefore no quantifier,
  • finally the (special normal*) part appears exactly three times, therefore {3}.

Which gives the expresison (decorated with word anchors):

\b\d{1,3}(\.\d{1,3}){3}\b

Conclusion

This pattern's flexibility makes it one of the most useful tools in your regex toolbox. While many problems exist which you should not use regexes for if libraries exist, in some situations, you have to use regexes. And this will become one of your best friends once you have practiced with it a bit!

Tips

  • It is more than likely that you don't need (or want) to capture the repeated part (the (special normal*) part); it is therefore recommended that you use a non-capturing group. For instance, use "[^\\"]*(?:\\"[^\\"]*)*" for quoted strings. In fact, had you wanted it, capturing would almost never lead to the desired results in this case, because repeating a capturing group will only ever give you the last capture (all previous repetitions will be overwritten), unless you are using this pattern in .NET. (thanks @ohaal)
Dardanus answered 11/6, 2013 at 12:1 Comment(10)
If you want to go really crazy, this pattern can also be applied to recursive loops in regex flavors that support them. See https://mcmap.net/q/1632328/-is-there-a-regex-like-that-is-capable-of-parsing-matching-symbolsAcatalectic
Wouldn't it be more efficient to use a non-capturing group instead? (Referring to "[^\\"]*(\\"[^\\"]*)*")Distinct
@Distinct that is true (should be mentioned in a tips section or something); however, I wanted to keep things relatively simple. Also, I believe some regex engines will optimize capturing groups away if you don't actually use them.Dardanus
@Dardanus I took my freedom to change that tip a bit, as I think the wording slightly missed the point.Acatalectic
@m.buettner actually that was not the intent of my comment in the first place; as I said, many regex engines optimize captures away (ie, discard them) when they are not used. What I mentioned was the case where the intent was to capture.Dardanus
@Dardanus only very few can possibly optimize them away, because they cannot not know if you will access them later from outside the engine. also, just adding capturing around the pattern has nothing to do with this particular technique, so I thought it's more important to focus on the optimization inside the pattern.Acatalectic
@m.buettner OK, re-edited a bit. Also, what do you mean with .NET languages?Dardanus
@Dardanus ah yeah I like that edit better. and .NET has the only regex flavor that allows you to access all captures of a group, i.e. if it's repeated or if you use the same group name twice. with this kind of feature it can make sense to capture the repeated part (or at least the "normal" bit of the repeated part)Acatalectic
@Dardanus Sir, what to do when you need to match strings between parenthesis with embedded parenthesis?Train
You're string matching is wrong, this it he correct one [^\\"]*(\\.[^\\"]*)* (before it wasn't matching other escaped characters like \\ or \nFredenburg

© 2022 - 2024 — McMap. All rights reserved.