Unix Flex Regex for Multi-Line Comments
C

3

12

I am writing a lexical analyzer using Flex on Unix. If you've used it before, you know that you mainly just define the regexes for the tokens of whatever language you are writing the lexical analyzer for. I am stuck on the final part: I need the correct regex for multi-line comments that allows something like

/* This is a comment \*/

but also allows

/* This **** //// is another type of comment */

Can anyone help with this?

Cohabit answered 21/1, 2011 at 6:15 Comment(2)
Can you edit your question to improve the “problem” samples? They need newlines to properly express what you're having problems with, but I couldn't work out where they were missing. (Indenting by 4 spaces makes a paragraph into a sample code section.) - Apennines
Possible duplicate of Why are multi-line comments in flex/bison so evasive? - Bornholm
A
19

You don't match C-style comments with a simple regular expression in Flex; they require a more complex matching method based on start states. The Flex FAQ explains how (well, it does for the /*...*/ form; handling the other form in just the <INITIAL> state should be simple).
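To make the start-state idea concrete without quoting the FAQ, here is a minimal sketch in plain C (not Flex rules) of the two-state scan that a start condition amounts to. The names strip_comments, ST_INITIAL and ST_COMMENT are made up for this illustration, and the sketch assumes the whole input is already in a NUL-terminated string; it also ignores string and character literals, which a real C lexer would have to handle.

```c
#include <stddef.h>

/* Two scanner states, standing in for Flex's INITIAL state and an
   exclusive "comment" start state (the names are illustrative only). */
enum scan_state { ST_INITIAL, ST_COMMENT };

/* Copy `in` to `out` with comments removed.
   `out` must have room for strlen(in) + 1 bytes.
   Returns 0 on success, -1 if a comment is left unterminated. */
int strip_comments(const char *in, char *out)
{
    enum scan_state state = ST_INITIAL;
    size_t i = 0, o = 0;

    while (in[i] != '\0') {
        if (state == ST_INITIAL) {
            if (in[i] == '/' && in[i + 1] == '*') {
                state = ST_COMMENT;   /* enter the comment state */
                i += 2;
            } else {
                out[o++] = in[i++];   /* ordinary text passes through */
            }
        } else {
            if (in[i] == '*' && in[i + 1] == '/') {
                state = ST_INITIAL;   /* leave the comment state */
                i += 2;
            } else {
                i++;                  /* discard one comment character */
            }
        }
    }
    out[o] = '\0';
    return state == ST_INITIAL ? 0 : -1;
}
```

Because each comment character is discarded as soon as it is seen, nothing has to be buffered no matter how long the comment is, which is the main attraction of the start-state approach.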

Apennines answered 21/1, 2011 at 8:59 Comment(4)
Ah, I figured there was a FAQ about it! :) +1 - Bornholm
@Bart: I found it the other day when answering a SO question (on parsing XML CDATA sections, a very similar problem in parsing terms except for the fact that it's even more important to do it the right way because the end-section sequence is three characters long). - Apennines
If regex-only is necessary, "/*"( [^*] | (\*+[^*/]) )*\*+\/ would do the job. I've explained in greater detail in https://mcmap.net/q/895951/-unix-flex-regex-for-multi-line-comments - Fellah
@DonalFellows please add the answer from that page to your answer; otherwise it could get lost. - Kele
F
12

If you're required to make do with just regex, however, there is indeed a not-too-complex solution:

"/*"( [^*] | (\*+[^*/]) )*\*+\/

A full explanation and derivation of that regex are given here.

In short:

  • "/*" marks the start of the comment
  • ( [^*] | (\*+[^*/]) )* accepts either any character that is not '*' (the [^*]) or a run of one or more '*' that is followed by a character other than '*' or '/' (the (\*+[^*/])). This means every '******...' run is accepted except one that ends the comment, such as '*****/', because no run of '*' there is followed by something other than '*' or '/'.
  • The '*******/' case is then handled by the last part of the regex, \*+\/, which matches one or more '*' followed by a '/' to mark the end of the comment.
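The three parts above can be checked by hand with a small matcher. This is a sketch in plain C that follows the structure of the pattern, written without the spaces, branch by branch; it is not Flex output, and the function name match_comment is made up for this illustration.

```c
#include <stddef.h>

// If `s` begins with text matching  "/*"([^*]|\*+[^*/])*\*+\/
// return the length of the match; otherwise return 0.
size_t match_comment(const char *s)
{
    size_t i;

    if (s[0] != '/' || s[1] != '*')   // the opening "/*"
        return 0;
    i = 2;

    for (;;) {
        if (s[i] == '\0')
            return 0;                 // input ended before the comment closed
        if (s[i] != '*') {            // branch [^*]
            i++;
            continue;
        }
        while (s[i] == '*')           // a run of one or more '*'
            i++;
        if (s[i] == '/')              // run followed by '/': the closing part
            return i + 1;
        if (s[i] == '\0')
            return 0;                 // run of '*' at end of input, no '/'
        i++;                          // run followed by [^*/]: second branch
    }
}
```

For example, on the input /** hello ****/ printf("lol"); the function returns 15, the length of the first comment only. The body can never consume a '*' that is directly followed by '/', so the match always stops at the first closing sequence.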
Fellah answered 31/8, 2015 at 22:15 Comment(3)
I don't think this regex will compile. Flex doesn't accept unescaped whitespace in patterns. See: https://mcmap.net/q/910301/-whitespace-in-flex-patterns-leads-to-quot-unrecognized-rule-quot - Tramp
In any case, the problem here is that comments can be arbitrarily long, and the rules of lex(1) and flex(1) would require it to accumulate the entire match before dispatching it, which is entirely undesirable. - Evans
@Evans It depends. If you just want to ignore comments, that can slow the lexer down somewhat without any benefit, but if you want to capture the comment, it is desirable. - Tramp
B
1

http://www.lysator.liu.se/c/ANSI-C-grammar-l.html does:

"/*"            { comment(); }

comment() {
    char c, c1;

loop:
    while ((c = input()) != '*' && c != 0)
        putchar(c);

    if ((c1 = input()) != '/' && c != 0) {
        unput(c1);
        goto loop;
    }

    if (c != 0)
        putchar(c1);
}

A question which would also solve this is How do I write a non-greedy match in LEX / FLEX?

Balsaminaceous answered 10/4, 2013 at 12:54 Comment(4)
If anyone can guess why the downvote, I'd love to hear. - Balsaminaceous
Not a downvote here - but that fails for an even number of asterisks: /** hello ****/ printf("lol"); /** hmmm */ The reason is that *[^/] consumes two *s at a time if the * is not followed by a /. - Fellah
@AbrahamPhilip thanks! My regex was wrong, and yours seems right. Removed it from the answer. - Balsaminaceous
np, glad to be of help :) - Fellah
