Regex quirk in Tcl

This question is about the behaviour of a specific regex in the Tcl 8.5 interpreter built into Vivado. In particular, when I OR together two regex parts, I get unexpected results.

I worked on indenting a block of text for the command line using regular expressions. My first thought was to replace every newline by a newline and some spaces (replaced by X here for clarity) for indentation, so:

puts [regsub -all "\n" "foo\nBar\nBaz" "\nXX"]
foo
XXBar
XXBaz

This does not indent the first line. To match the start of the string, I use ^:

puts [regsub -all "^" "foo\nBar\nBaz" "\nXX"]

XXfoo
Bar
Baz

Now it should just be a matter of combining the two regex parts with |. However, I get output I cannot explain:

puts [regsub -all "^|\n" "foo\nBar\nBaz" "\nXX"]

XXfoo
XX
XXBar
XX
XXBaz

Where do the additional newlines and indentation marks (X) come from? Why does it look like I get two substitutions? Is this a bug, or is there something I do not understand about regular expression syntax?

For completeness' sake, here is the regex I use now: puts [regsub -all -line "^" "foo\nBar\nBaz" "XX"]

Strophanthus answered 27/12, 2017 at 16:27 Comment(4)
Interesting question. Parenthetically, you can use the -line option in place of (?n) -> set t [regsub -all -line "^" $string "XX"]. IMO that's more readable.Pean
Also, -lineanchor would suffice here, or the (?w) inline option, as in the "(?w)^" pattern. -line or (?n) also modifies the behavior of . and negated bracket expressions, which are not used in the pattern.Incandesce
@glennjackman Nice catch. The 8.0 documentation left me under the impression that I can't get the substitution result as a return value and had to specify a variable for it. I agree that your version is a lot more readable.Strophanthus
Note 1) that Tcl 8.0 is over twenty years old, and a lot will work differently in a modern Tcl, and 2) while I recognize that the question is about regexes, a much better solution is to use ::textutil::adjust::indent foo\nBar\nBaz XX or at least join [lmap line [split foo\nBar\nBaz \n] {format {XX%s} $line}] \n.Dorotheadorothee

Basic versus Extended regular expressions

I think the explanation hinges on the fact that the expression ^ is treated as a basic regular expression (BRE), but when you add | it is treated like an advanced regular expression (ARE), which is a superset of extended regular expressions (ERE). This is based on the following, from the re_syntax man page:

An ARE is one or more branches, separated by “|”, matching anything that matches any of the branches.

The second part of the puzzle is that ^ is treated differently in basic and extended/advanced regular expressions. In a basic regular expression, ^ only has a special meaning when it is the first character of the expression. Again, from the re_syntax man page:

BREs differ from EREs in several respects ... ^ is an ordinary character except at the beginning of the RE or the beginning of a parenthesized subexpression,...

In other words, for a BRE, ^ will only match the very start of the string, but in an ARE it will match the beginning of a line.

So, what exactly is happening?

First, ^ matches the beginning of the string, so it is replaced with the replacement \nXX. Next, the matcher sees f, then o, then o, none of which match. Then it sees \n, which matches, so that is replaced with the replacement as well.

At this point the matcher has consumed the characters foo\n. What remains is Bar\nBaz. The matcher now looks at that string, and the pattern ^ matches, so it again replaces it with the replacement. Thus, you end up with two copies of the replacement string, one for the newline and one for the beginning of the string that remains.
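
One way to check this walkthrough against a particular interpreter is to ask regexp where it actually finds matches. The snippet below is only a diagnostic sketch: the pattern and the input string are the reduced case from the comments on this answer, and the variable names are made up for illustration. Since the result appears to depend on the Tcl version, no expected output is shown.

```tcl
# Reduced pattern and input taken from the discussion of this question.
set pattern {^|\n}
set input "foo\nbar"

# Number of matches found when scanning the whole string.
set count [regexp -all $pattern $input]

# One {first last} index pair per match; an empty match is reported
# with last = first - 1.
set spans [regexp -all -inline -indices $pattern $input]

puts "matches: $count"
puts "spans:   $spans"
```

If an empty match is reported directly after the newline, that is the extra substitution described above.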

Adding something to the start of every line

If your end goal is to add indentation to every line, you can use newline-sensitive matching with regsub and then use ^ to match the start of every line, including the first, rather than trying to match both newlines and the start of the string. You do this by adding the -line option to regsub. For example:

regsub -line -all "^" "foo\nBar\nBaz" "XX" t; puts $t
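
If this is needed in more than one place, it could be wrapped in a small helper. The following is a minimal sketch, not part of Tcl: the proc names indent_lines and indent_lines_split are hypothetical. The first proc escapes & and backslashes in the prefix, because regsub treats them specially in the replacement string. The second avoids regular expressions altogether and uses only split, foreach, and join, so it does not rely on lmap.

```tcl
# Hypothetical helper: put $prefix in front of every line of $text.
proc indent_lines {text prefix} {
    # "&" and backslashes are special in a regsub replacement,
    # so escape them to keep the prefix literal.
    set subSpec [string map [list \\ \\\\ & \\&] $prefix]
    # -line makes ^ match at the start of each line,
    # not only at the start of the string.
    return [regsub -all -line {^} $text $subSpec]
}

# Hypothetical alternative without regular expressions.
proc indent_lines_split {text prefix} {
    set out {}
    foreach line [split $text "\n"] {
        lappend out $prefix$line
    }
    return [join $out "\n"]
}

puts [indent_lines "foo\nBar\nBaz" "XX"]
# Prints:
# XXfoo
# XXBar
# XXBaz
```

Both procs give the same result for the example string. How a trailing newline should be treated is a design choice that this sketch does not address.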
Helles answered 27/12, 2017 at 17:37 Comment(5)
Thanks for a very detailed explanation, now I am only left to wonder who came up with the idea that the input parsing behaviour changes implicitly based on the input (rather than explicitly based on flags).Strophanthus
It doesn't appear to be changing to BRE mode, but rather looks like a plain old bug in the RE engine. Huh. (Also, use -line "^"; much clearer.)Ingaborg
@DonalFellows: I suggested it changed from BRE mode, not to it. But I'll defer to your analysis. My gut reaction was this was a bug, but then I followed my own advice that "when you think you've found a bug in Tcl, you're probably wrong". :-)Helles
The switching of modes would definitely be a bug, as BRE mode is supposed to only enable when asked for (via the documented-but-rarely-used (?b) flag) but I really don't think that that's the issue. The problem (probably) is that ^ is matching at newlines even when not in -line mode. It might relate to the code in the RE engine that determines whether it is at a start when matching recommences to find second-or-later matches… but I hesitate to peek inside the RE engine in the first place. (That code is seriously scary!)Ingaborg
And I think the issue can be boiled down to regexp -all {^|\n} "foo\nbar" returning 3 instead of 2; that's the true unexpected result given that the string only has one start and one embedded newline. Issue filed.Ingaborg
