How to ignore comments inside string literals

I'm writing a lexer as part of a university course. One of the brain teasers (extra assignments that don't contribute to the scoring) our professor gave us is how we could implement comments inside string literals.

Our string literals start and end with an exclamation mark, e.g. !this is a string literal!

Our comments start and end with three periods, e.g. ...This is a comment...

Removing comments from string literals was relatively straightforward: match the string literal via /!.*!/ and remove the comment via regex. If a comment is opened with three consecutive periods but there are no closing periods, throw an error.

However, I want to take this even further. I want to implement the escaping of the exclamation mark within the string literal. Unfortunately, I can't seem to get both comments and exclamation mark escapes working together.

What I want to create are string literals that can contain both comments and exclamation mark escapes. How could this be done?

Examples:

!Normal string!
!String with escaped \! exclamation mark!
!String with a comment ... comment ...!
!String \! with both ... comments can have unescaped exclamation marks!!!... !

This is my current code that can't ignore exclamation marks inside comments:

import re

def t_STRING_LITERAL(t):
    r'![^!\\]*(?:\\.[^!\\]*)*!'
    # remove the escape characters from the string
    t.value = re.sub(r'\\!', "!", t.value)
    # remove single line comments
    t.value = re.sub(r'\.\.\.[^\r\n]*\.\.\.', "", t.value)
    return t
Heeled answered 5/10, 2020 at 14:37 Comment(0)

Look at this regex to match string literals (demo: https://regex101.com/r/v2bjWi/2):

(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!

  • It is surrounded by two (?<!\\)!, each meaning an unescaped exclamation mark.
  • Between them it lazily repeats an alternation of escaped exclamation marks \\!, comments (?:\.\.\.(?P<comment>.*?)\.\.\.), and non-exclamation characters [^!].

Note that this is about as much as you can achieve with a regular expression. Any additional requirement, and a regex will no longer be sufficient.
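As a quick check of the pattern above, here is a minimal Python sketch that runs it against the four sample literals from the question. The names STRING_LITERAL, SAMPLES and match_literal are made up for this illustration and are not part of the answer.

```python
import re

# Pattern quoted from the answer above, unchanged.
STRING_LITERAL = re.compile(
    r'(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!'
)

# The four sample literals from the question.
SAMPLES = [
    r'!Normal string!',
    r'!String with escaped \! exclamation mark!',
    r'!String with a comment ... comment ...!',
    r'!String \! with both ... comments can have unescaped exclamation marks!!!... !',
]

def match_literal(text):
    """Return (matched_text, last_comment) for the first literal in text, or None."""
    m = STRING_LITERAL.search(text)
    if m is None:
        return None
    # A named group inside a repeated group keeps only its most recent capture.
    return m.group(0), m.group('comment')

for sample in SAMPLES:
    print(match_literal(sample))
```

Each sample should be matched in full. Because the comment group sits inside a repeated group, only the last comment in a literal is available through it, so a literal with several comments would still need a separate pass to strip them all.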
Cue answered 5/10, 2020 at 15:20 Comment(4)
This was exactly what I needed. Thank you Alexander! I never thought about negative lookbehinds, since they weren't included in any regex cheatsheets or tutorials I saw. Your regex skills are superb! - Heeled
(?<!\\)! recognises ! not preceded by a backslash. So it will not match a ! preceded by an escaped backslash (i.e. two backslashes). This is a common bug in regexes, which manifests as a frustrating impossibility of entering a string whose value ends with a backslash. If you scan backwards, you need to match delimiters preceded by an even number of escapes, and skip the ones preceded by an odd number. Forward scanning, as in OP's original, is less complicated. - Cockneyfy
Well, the author hadn't asked for escaped backslashes. - Cue
True, but their regex handles \\. as escapes. - Cockneyfy
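To make the point in the comments concrete, the sketch below feeds a literal ending in an escaped backslash to the lookbehind pattern and to a forward-scanning one. FORWARD is a hypothetical pattern written for this illustration (the question's original forward scan with a comment alternative added), not something proposed in the thread, and it is not hardened against pathological backtracking.

```python
import re

# Lookbehind-based pattern from the answer above.
LOOKBEHIND = re.compile(
    r'(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!'
)

# Hypothetical forward-scanning alternative: every backslash is consumed
# together with the character it escapes, so "\\" never leaves a dangling
# backslash in front of the closing delimiter.
FORWARD = re.compile(r'!(?:\\.|\.\.\..*?\.\.\.|[^!\\])*!')

# A literal whose value ends with a backslash, written as an escaped backslash.
tricky = r'!path ends with a backslash\\!'

print(LOOKBEHIND.search(tricky))                # None: the closing ! looks escaped
print(FORWARD.fullmatch(tricky) is not None)    # True: \\ is consumed as one escape
```

The lookbehind only inspects the single character before the delimiter, whereas the forward scan pairs backslashes up as it goes, which is why it gets the even/odd count right without extra work.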

Perhaps this might be another option.

Match 0+ times any character except a backslash, dot or exclamation mark using the first negated character class.

Then, when you reach a character that the first character class does not match, use an alternation to match one of the following, and repeat that 0+ times:

  • a dot that is not directly followed by 2 dots
  • or from 3 dots up to the next occurrence of 3 dots
  • or an escaped character

To prevent catastrophic backtracking, you can mimic an atomic group in Python using a positive lookahead with a capturing group inside. If the assertion is true, then use the backreference to \1 to match.

For example

(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!

Explanation

  • (?<!\\)! Match ! not directly preceded by \
  • [^!\\.]* Match 0+ times any char except ! \ or .
  • (?: Non capture group
    • (?:\.(?!\.\.) Match a dot not directly followed by 2 dots
    • | Or
    • (?=(\.{3}.*?\.{3}))\1 Assert and capture in group 1 from ... to the nearest ...
    • | Or
    • \\. Match an escaped char
  • ) Close group
  • [^!\\.]* Match 0+ times any char except ! \ or .
  • )*! Close non capture group and repeat 0+ times, then match !

Regex demo
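A possible way to use this pattern in Python is sketched below: validate the token with the regex, then resolve escapes and drop comments in a single left-to-right pass. The helper pattern ESCAPE_OR_COMMENT and the function literal_value are assumptions made for this example, not part of the answer.

```python
import re

# Pattern quoted from this answer, unchanged.
STRING_LITERAL = re.compile(
    r'(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!'
)

# Hypothetical helper pattern: either an escaped character or a whole comment.
ESCAPE_OR_COMMENT = re.compile(r'\\(.)|\.{3}.*?\.{3}')

def literal_value(token):
    """Return the value of a string-literal token, or None if it is not one."""
    if STRING_LITERAL.fullmatch(token) is None:
        return None
    body = token[1:-1]  # drop the surrounding exclamation marks
    # Escapes become the escaped character; comments become the empty string.
    return ESCAPE_OR_COMMENT.sub(lambda m: m.group(1) or '', body)

print(repr(literal_value(r'!String with escaped \! exclamation mark!')))
print(repr(literal_value(r'!String with a comment ... comment ...!')))
```

Handling escapes and comments in the same substitution keeps the post-processing in step with how the matcher walked the literal, so an exclamation mark inside a comment is never mistaken for an escape target.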

Mold answered 5/10, 2020 at 18:21 Comment(0)