How to do conditional greedy match in Perl?

I want Perl to parse some code text and identify certain constructs. Example code:

use strict;
use warnings;

$/ = undef;

while (<DATA>) {
  s/(\w+)(\s*<=.*?;)/$1_yes$2/gs;
  print;
}

__DATA__
always @(posedge clk or negedge rst_n)
if(!rst_n)begin
        d1 <= 0; //perl_comment_4
        //perl_comment_5
        d2 <= 1  //perl_comment_6
                 + 2;
        end
else if( d3 <= d4 && ( d5 <= 3 ) ) begin
        d6 <= d7 +
                 (d8 <= d9 ? 1 : 0);
        //perl_comment_7
        d10 <= d11 <=
                      d12
                        + d13
                            <= d14 ? 1 : 0;
        end

Match target is something that meets all of the following:

(1) It begins with the combination word\s*<=. Here \s* may be zero or more spaces, newlines, or tabs.

(2) The aforementioned "combination" must lie outside any pair of ( and ).

(3) If multiple "combinations" appear consecutively, then take the first one as the beginning. (Something like "greedy" matching at the left boundary.)

(4) It ends with the first ; after the "combination" mentioned in (1).

There may be word\s*<= and ; in code comments (there may be anything in comments); this makes things more complicated. To make life easier, I already pre-processed the text, scanning for comments and replacing them with stuff like //perl_comment_6. (This solution seems rather cumbersome and stupid. Any smarter, more elegant solutions?)

What I wanna do:

For all matched word\s*<=, replace word with word_yes. For the example code, d1, d2, d6 and d10 should be replaced by d1_yes, d2_yes, d6_yes and d10_yes, respectively, and all other parts of the text should remain unchanged.

In my current code I use s/(\w+)(\s*<=.*?;)/$1_yes$2/gs;, which correctly recognizes d1, d2 and d10, but fails to recognize d6 and mistakenly recognizes d3.

Any suggestions? Thanks in advance~

Swastika answered 22/2, 2016 at 11:11 Comment(10)
Check this code and this regex demo. - Stephan
Write a parser for the language. See Marpa::R2 or Parse::RecDescent. - Buenabuenaventura
@WiktorStribiżew The match pattern may or may NOT follow an if, so your regex is a little bit limited in its scope, but I really think it's tidy and neat. Is it possible to make it applicable in more scenarios? - Swastika
Katyusza, you may remove if\s* and it will be a rather generic pattern. - Stephan
It is not trivial to parse Verilog: metacpan.org/pod/Verilog-Perl - Dragnet
@WiktorStribiżew Thanks buddy; I'll try your (\((?>[^()]|(?1))*\))(*SKIP)(*F)|(\w+)(\s*<=[^;]*) tomorrow ;-) - Swastika
I believe you should try toolic's suggestion. - Stephan
The reason I do not use a ready-made parser is that, as a beginner, I think I can learn a lot by hand-crafting such a parser on my own, even if it's a very, very ugly parser >,< - Swastika
@katyusza: If you go down that route then you should be aware that you're setting yourself an enormous task. But do take note of choroba's comment regarding modules that you might use. You are doomed to fail if you start with simple regexes. - Carmarthenshire
@Carmarthenshire Got that; I will read it. Never thought there were existing tools to help parse a given language! - Swastika

This is a lot more complicated than you might imagine, and it is impossible to do properly without writing a parser for the language you are trying to process. However, you may be in luck if your sample is a consistently limited subset of the language.

The best way I can see to do this is to use split to separate out all the subsections of the string that are in parentheses from the "top level" sections where the replacements are to be done. Then the changes can be made to the relevant parts and the split sections joined back together.

Even this relies on the code having properly balanced parentheses, and an odd opening or closing parenthesis that appears in, say, a string or a comment will throw the process out. The regex used in the split has to be recursive so that nested parentheses can be matched, and making it a capturing regex makes split return all of the parts of the string instead of just the sections between the matches.
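
As a minimal sketch of that split behaviour on its own, here is a made-up one-line input (the string and the variable names $text and @parts are illustrative, not taken from the question). It only shows that a capturing, recursive pattern makes split hand back the balanced (...) group along with the text on either side of it:

```perl
use strict;
use warnings;

# Hypothetical input, chosen only to show one nested group
my $text = 'a <= f(b, (c)) + d;';

# Group 1 matches one balanced (...) group; (?1) recurses into group 1
# so that nested parentheses are consumed as part of the same match.
# Because the group captures, split returns the delimiter text too.
my @parts = split / ( \( (?> [^()]+ | (?1) )* \) ) /x, $text;

print "[$_]\n" for @parts;
```

This should print three pieces: the text before the group, the group (b, (c)) itself, and the text after it. Joining the pieces with an empty string gives back the original input, which is what lets the full program reassemble the text after editing only the top-level pieces.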

The following code will do as you ask, but beware that, as I described, it is extremely fragile:

use strict;
use warnings;

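# Slurp the whole DATA section into one string
my $data = do {
    local $/;
    <DATA>;
};

# Split on balanced (...) groups; the capturing group makes split
# return the parenthesised pieces as well as the text between them
my @split = split / ( \( (?> [^()] | (?1) )* \) ) /x, $data;

for ( @split ) {
    # Leave any piece that contains parentheses untouched
    next if /[()]/;
    # Append _yes to a word that begins a line (or a piece) and is followed by <=
    s/ ^ \s* \w+ \K (?= \s* <= ) /_yes/xgm;
}

# Reassemble the text in its original order
print join '', @split;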


__DATA__
always @(posedge clk or negedge rst_n)
if(!rst_n)begin
        d1 <= 0; //perl_comment_4
        //perl_comment_5
        d2 <= 1  //perl_comment_6
                 + 2;
        end
else if( d3 <= d4 && ( d5 <= 3 ) ) begin
        d6 <= d7 +
                 (d8 <= d9 ? 1 : 0);
        //perl_comment_7
        d10 <= d11 <=
                      d12
                        + d13
                            <= d14 ? 1 : 0;
        end

output

always @(posedge clk or negedge rst_n)
if(!rst_n)begin
        d1_yes <= 0; //perl_comment_4
        //perl_comment_5
        d2_yes <= 1  //perl_comment_6
                 + 2;
        end
else if( d3 <= d4 && ( d5 <= 3 ) ) begin
        d6_yes <= d7 +
                 (d8 <= d9 ? 1 : 0);
        //perl_comment_7
        d10_yes <= d11 <=
                      d12
                        + d13
                            <= d14 ? 1 : 0;
        end
Carmarthenshire answered 22/2, 2016 at 11:35 Comment(4)
@Swastika It's similar to s/(\((?>[^()]+|(?1))*\))(*SKIP)(*F)|^\s*\w+\K(?=\s*<=)/_yes/gm; just more complicated and with worse performance. Furthermore, it relies on matches being preceded by the start of a line followed by any amount of whitespace. The recursive part was suboptimal, and there is no need for the s flag as no dot is used. See demo at regex101. - Dys
@bobblebubble: I've made the changes that I assume you meant, but they're trivial and I'm certain that they'll have no impact on the performance of the program. Unless the data is enormous this will be disk-bound. I really don't know what to do instead of anchoring at the start of a line: the whole thing really needs to be anchored behind at a statement boundary, which could be a semicolon or begin or perhaps something else, and then those must be ignored if they're inside quotes or comments. This really isn't a job for a one-off regex. - Carmarthenshire
@Carmarthenshire I don't see why you would split/join rather than use (*SKIP)(*F), which is handy for such cases and not "magical". It discards the parts that yours uses as the split sequence. Whether to anchor at the start of the line or consume up to the semicolon depends on the input. I would prefer the semicolon variant, but yours is also fine. For performance I would add a + quantifier to [^()] for less alternation: 184 steps for \((?>[^()]|(?R))*\) vs 50 steps for \((?>[^()]+|(?R))*\), or (\((?>[^()]*(?R)?)*\)) with 63 steps in between. Elaborative answer. - Dys
@bobblebubble: As I said, this problem cannot be solved without regex recursion. I judge \K and look-aheads to be within an average Perl programmer's vocabulary, whereas far fewer have even considered backtracking or know about the star commands. If you think my criticism is wrong, then surely you are also wrong to copy it? - Carmarthenshire
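
For readers comparing the two approaches discussed in these comments, the (*SKIP)(*F) substitution can be sketched as below. The pattern is the one quoted in the first comment, spread out under the x flag; the short input and the variable names $code and $out are made up for illustration. It carries the same assumptions as the answer: balanced parentheses, and targets that begin a line.

```perl
use strict;
use warnings;

# Hypothetical Verilog-like fragment, not the question's full data
my $code = <<'END';
if( d3 <= d4 ) begin
    d6 <= d7 +
        (d8 <= d9 ? 1 : 0);
end
END

# First alternative: match a balanced (...) group, then (*SKIP)(*F)
# fails the match and stops the engine from retrying inside the group.
# Second alternative: a word at the start of a line followed by <= ;
# \K keeps the word out of the match so only _yes is inserted.
(my $out = $code) =~ s/
    ( \( (?> [^()]+ | (?1) )* \) ) (*SKIP)(*F)
  |
    ^ \s* \w+ \K (?= \s* <= )
/_yes/xgm;

print $out;
```

On this input only d6 becomes d6_yes; d3 and d8 sit inside parentheses and are skipped. The result is the same as the split/join version on such input, so the choice between them is mostly about which construct the maintainer finds easier to read.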