Regular expression for a string literal in flex/lex

6

60

I'm experimenting to learn flex and would like to match string literals. My code currently looks like:

"\""([^\n\"\\]*(\\[.\n])*)*"\""        {/*matches string-literal*/;}

I've been struggling with variations for an hour or so and can't get it working the way it should. I'm essentially hoping to match a string literal that can't contain a new-line (unless it's escaped) and supports escaped characters.

I am probably just writing a poor regular expression or one incompatible with flex. Please advise!

Shoelace answered 11/1, 2010 at 3:45 Comment(3)
Thanks so much everyone! All your comments were very helpful. The regex that has finally worked for me is a variant of the one used in the C specification linked by codadict (and explained by Jonathan): \"(\\(.|\n)|[^\\"\n])*\" - Shoelace
Since you found Jonathan's answer helpful, consider adding an upvote for his answer. - Gateshead
By the way: nowhere in your question do you specify what language's string literals you're interested in. It's a very good idea to put the language you're asking about in one of the question's tags. - Tamarisk
131

A string consists of a quote mark

"

followed by zero or more of either an escaped anything

\\.

or a non-quote, non-backslash character

[^"\\]

and finally a terminating quote

"

Put it all together, and you've got

\"(\\.|[^"\\])*\"

The delimiting quotes are escaped because they are Flex meta-characters.
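
To make the pieces above concrete, here is a minimal hand-written C sketch of what the pattern accepts. It is not what flex generates; the function name match_string_literal is made up for illustration, and the escape branch is assumed not to match a newline, as with flex's `.`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the string literal that starts
 * at s[0] according to \"(\\.|[^"\\])*\", or 0 if there is no match. */
size_t match_string_literal(const char *s)
{
    assert(s != NULL);
    if (s[0] != '"')
        return 0;                       /* opening quote */
    size_t i = 1;
    for (;;) {
        char c = s[i];
        if (c == '\0')
            return 0;                   /* ran out of input: unterminated */
        if (c == '"')
            return i + 1;               /* terminating quote */
        if (c == '\\') {                /* \\. : backslash plus any char but newline */
            if (s[i + 1] == '\0' || s[i + 1] == '\n')
                return 0;
            i += 2;
        } else {
            i += 1;                     /* [^"\\] : ordinary character */
        }
    }
}
```

Because a backslash can only be consumed by the escape branch, the loop never has to choose between alternatives: the first quote that is not part of an escape ends the literal.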

Appressed answered 11/1, 2010 at 3:53 Comment(8)
This doesn't handle escaping, unfortunately. So this would incorrectly lex "\"" - Concelebrate
You must have missed "zero or more of an escaped anything"? - Appressed
There are several problems with this answer. First, it's not a valid flex pattern. The leading and trailing double-quotes need to be escaped because otherwise flex treats them as meta-characters. So the pattern should be (perhaps) \"(\\.|[^"])*\" . Second, that pattern still doesn't work. For example, it gets this input wrong: "\\\\" . Third, it doesn't meet the original question's requirement of disallowing newlines. - Kob
As a regex, this is totally correct. Except for the newline thing, which is easily fixed by replacing . with [^\n] and [^"] with [^"\n]. It certainly should match "\\\\" too, since the repetition will match the quote ", then the escaped slash \\ , then the next escaped slash \\ , then the terminating quote ". The pattern certainly works for me outside of the scope of flex. - Slogan
It doesn't matter whether it works outside the scope of flex. The question was about flex. If the lexer produced by flex sees "\\\\"foo", it will match the entire input, instead of just matching the "\\\\" part, because the character class doesn't exclude backslashes. - Kob
@robmayoff is correct. This will incorrectly match all of "\\"a" (as: quote, not-quote, backslash-dot-anything, not-quote, quote). The regex should say [^"\\], not [^"]. - Glare
There's actually one more subtlety here since . won't match a \n. So the final flex pattern needed with escaping is \"(\\.|\\\n|[^"\\])*\" - Impermeable
lol upvoted not b/c it's in flex, but I've been looking for a regex that matches a string literal. - Lightsome
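
The disagreement in the comments above comes down to flex preferring the longest match. A rough C sketch (not flex itself; longest_match and its exclude_backslash flag are invented for illustration) tracks every way \"(\\.|CLASS)*\" can consume the input and reports the longest match, with CLASS being either [^"] or [^"\\].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Longest prefix of s matched by \"(\\.|CLASS)*\", where CLASS is [^"\\] if
 * exclude_backslash is nonzero and [^"] otherwise. Returns 0 on no match.
 * Inputs longer than 255 bytes are rejected to keep the sketch simple. */
size_t longest_match(const char *s, int exclude_backslash)
{
    assert(s != NULL);
    size_t n = strlen(s);
    unsigned char reach[257] = {0};     /* reach[i]: the loop can sit before s[i] */
    size_t best = 0;

    if (n == 0 || n > 255 || s[0] != '"')
        return 0;
    reach[1] = 1;                       /* opening quote consumed */
    for (size_t i = 1; i < n; i++) {
        if (!reach[i])
            continue;
        if (s[i] == '"')
            best = i + 1;               /* closing quote: candidate match */
        if (s[i] == '\\' && i + 1 < n && s[i + 1] != '\n')
            reach[i + 2] = 1;           /* \\. branch */
        if (s[i] != '"' && !(exclude_backslash && s[i] == '\\'))
            reach[i + 1] = 1;           /* CLASS branch */
    }
    return best;
}
```

On the input "\\"a" (six characters), the [^"] variant returns 6 because a backslash may also be read as an ordinary character, while the [^"\\] variant returns 4 and stops at the quote after the escaped backslash.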
29

For a single-line string, you can use this:

\"([^\\\"]|\\.)*\"  {/*matches string-literal on a single line*/;}
Shocking answered 13/2, 2012 at 12:30 Comment(3)
This is the best answer here. - Steinberg
Shouldn't we make sure that the starting quote is not preceded by a backslash (i.e., is not escaped)? - Barramunda
If it is preceded by a backslash, it should match no other rule and thus signal an error. - Burglar
9

How about using a start state...

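%{
int enter_dblquotes = 0;   /* nonzero while inside a quoted string */
%}

%x DBLQUOTES
%%

\"  { BEGIN(DBLQUOTES); enter_dblquotes++; }

<DBLQUOTES>[^"\n]*\" {
   if (enter_dblquotes) {
       /* yytext is the string body followed by the closing quote */
       handle_this_dblquotes(yytext);
       BEGIN(INITIAL); /* revert back to normal */
       enter_dblquotes--;
   }
}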
         ...more rules follow...

Something to that effect. Flex uses %s or %x to declare start states (inclusive and exclusive, respectively). When the scanner sees a quote, it switches to another state and continues lexing until it reaches the next quote, at which point it reverts to the normal state.
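
As a rough illustration of that state switching outside flex, here is a small C sketch. The names collect_quoted and enum state are hypothetical, and, like the rules above, it does no escape handling.

```c
#include <assert.h>
#include <stddef.h>

enum state { INITIAL, DBLQUOTES };      /* mirrors the two flex start states */

/* Hypothetical sketch: copies the body of each quoted run in `in` to `out`,
 * each followed by '\n', and returns how many runs were closed.
 * `out` must hold at least strlen(in) + 1 bytes. */
size_t collect_quoted(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    enum state st = INITIAL;
    size_t count = 0, o = 0;

    for (size_t i = 0; in[i] != '\0'; i++) {
        if (st == INITIAL) {
            if (in[i] == '"')
                st = DBLQUOTES;         /* like BEGIN(DBLQUOTES) */
        } else if (in[i] == '"') {
            out[o++] = '\n';            /* end of one string body */
            st = INITIAL;               /* like BEGIN(INITIAL) */
            count++;
        } else {
            out[o++] = in[i];           /* inside the quotes: keep the character */
        }
    }
    out[o] = '\0';
    return count;
}
```

The single state variable plays the role of the start condition: the same character (a quote) means "open" in one state and "close" in the other. An unterminated string leaves its partial body in out without a trailing newline.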

Aneroid answered 11/1, 2010 at 4:4 Comment(2)
@Samoz: Not really. It's used in languages that have string literals: it eats up what's between a beginning quote and an end quote, even if there are extra quotes inside it, hence the switching of states in order to chew up the quotes... - Aneroid
The flex manual contains a full example (in terms of flex usage) of parsing C-style strings: flex.sourceforge.net/manual/Start-Conditions.html . Search for "quoted strings" on that page. - Kob
3

Here is my code snippet for handling strings in flex; I hope it gives you some ideas.

Using a start condition to handle string literals is more scalable and clearer.

%x SINGLE_STRING

%%

\"                          BEGIN(SINGLE_STRING);
<SINGLE_STRING>{
  \n                        { yyerror("the string misses \" to terminate before newline"); BEGIN(INITIAL); }
  <<EOF>>                   { yyerror("the string misses \" to terminate before EOF"); yyterminate(); }
  ([^\\\"\n]|\\.)+          {/* do your work like save in here */}
  \"                        BEGIN(INITIAL);
  .                         ;
}
Superficial answered 20/2, 2019 at 15:23 Comment(1)
How do I save the string to yytext using this method? - Heptagon
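
On saving the string: with a start condition, yytext only holds the most recent match, so the pieces have to be accumulated somewhere. One possible approach is a small buffer that the actions append to. The sketch below is only an assumption about how that could look; str_begin, str_append, str_end, and STR_MAX are hypothetical names, not part of flex.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STR_MAX 1024                    /* arbitrary limit for this sketch */

static char   str_buf[STR_MAX];         /* hypothetical accumulation buffer */
static size_t str_len;

/* Call when the opening quote is seen. */
void str_begin(void) { str_len = 0; str_buf[0] = '\0'; }

/* Call with (yytext, yyleng) for each chunk matched inside the string.
 * Returns 0 on success, -1 if the chunk would overflow the buffer. */
int str_append(const char *text, size_t len)
{
    assert(text != NULL);
    if (len >= STR_MAX - str_len)
        return -1;                      /* no room for the chunk plus the NUL */
    memcpy(str_buf + str_len, text, len);
    str_len += len;
    str_buf[str_len] = '\0';
    return 0;
}

/* Call when the closing quote is seen; the result is valid until str_begin(). */
const char *str_end(void) { return str_buf; }
```

In the rules above, the opening-quote action would call str_begin(), the body rule would call str_append(yytext, yyleng), and the closing-quote action would read the result from str_end() before returning a token.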
2

This is what we use in Zolang for single-line string literals with embedded templates ${...}:

\"(\$\{.*\}|\\.|[^\"\\])*\"

Epistasis answered 27/9, 2018 at 0:48 Comment(1)
It would be able to match this properly: "Hello ${some + "world"}" - Hellbender
0

A late answer, but it may be useful to the next person who needs it:

\"(([^\"]|\\\")*[^\\])?\"
Methuselah answered 3/6, 2017 at 20:31 Comment(1)
Welcome to SO. This answer would be improved with text explaining how it works and how it is different. - Ummersen
