In Perl, how can I get the matched substring from a regex?
My program reads other programs' source code and collects information about the SQL queries they use. I have a problem extracting a substring.

...
$line = <FILE_IN>;
until( ($line =~m/$values_string/i && $line !~m/$rem_string/i) || eof )
{
   if($line =~m/ \S{2}DT\S{3}/i)
   {

   # here I wish to get (only) substring that match to pattern \S{2}DT\S{3} 
   # (7 letter table name) and display it.
      $line =~/\S{2}DT\S{3}/i;
      print $line."\n";
...

As a result, print outputs the whole line and not the substring I expect. I tried different approaches, but I seldom use Perl and am probably making a basic conceptual error. (The position of the table name in the line is not fixed. Another problem is multiple occurrences, e.g. [... SELECT * FROM AADTTAB, BBDTTAB, ...].) How can I obtain that substring?

Sillabub answered 15/7, 2009 at 15:13 Comment(3)
Thank you all for the quick and varied approaches. I tried them all yesterday and this morning, but only $& works for me. Thanks also for the (use strict; use warnings;) hint, which exposed my improvisational style. I also realized today that I never mentioned I work under Windows (my pearl is: This is perl, v5.8.7 built for MSWin32-x86-multi-thread Copyright 1987-2005, Larry Wall Binary build 813 [148120] provided by ActiveState www.ActiveState.com Built Jun 6 2005 13:36:37). Thank you once again. - Sillabub
I was a little irritated after having "ignorance is bliss" thrown in my face, but it pushed me to ... well ... let's just say that now I know what 'capturing group' and 'parens/parentheses' mean, and it really works. Please don't comment; I feel silly already. BTW, is anyone in favor of a global vote to rename perl to, I don't know, pearl? ;) - Sillabub
There was already a language named Pearl when Larry Wall went looking for names. - Loganloganberry
22

Use grouping with parentheses and store the first group.

if( $line =~ /(\S{2}DT\S{3})/i )
{
  my $substring = $1;
}

The code above fixes the immediate problem of pulling out the first table name. However, the question also asked how to pull out all the table names. So:

# FROM\s+     match FROM followed by one or more spaces
# (.+?)       match (non-greedy) and capture any characters until...
# (?:x|y)     match x OR y - the next 2 patterns
# (?<=[^,])\s+(?=[^,\s])  1 or more spaces with no comma on either side
#                         (lookarounds, so no table-name characters are consumed)
# \s*;        match 0 or more spaces followed by a semicolon
if( $line =~ /FROM\s+(.+?)(?:(?<=[^,])\s+(?=[^,\s])|\s*;)/i )
{
  # $1 will be table1, table2, table3
  my @tables = split(/\s*,\s*/, $1);
  # delim is a space/comma
  foreach(@tables)
  {
     # $_ = table name
     print $_ . "\n";
  }
}

Result:

If $line = "SELECT * FROM AADTTAB, BBDTTAB;"

Output:

AADTTAB
BBDTTAB

If $line = "SELECT * FROM AADTTAB;"

Output:

AADTTAB

Perl Version: v5.10.0 built for MSWin32-x86-multi-thread
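
For the multiple-occurrence case, another option is a global match in list context, which returns every captured substring at once. This is only a sketch: it assumes table names are seven-character words with DT in positions 3 and 4, as in the question, and the sample line and the @tables name are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line; in the real program it would be read from the file.
my $line = 'SELECT * FROM AADTTAB, BBDTTAB WHERE X = 1';

# With /g in list context, the match returns every capture it finds.
# \b keeps the pattern from matching inside longer words.
my @tables = $line =~ /\b(\w{2}DT\w{3})\b/gi;

print "$_\n" for @tables;
```

Unlike the FROM-based pattern, this does not care where in the line the names appear, so it may also pick up seven-character words that merely look like table names.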

Anselma answered 15/7, 2009 at 15:18 Comment(0)
19

I prefer this:

my ( $table_name ) = $line =~ m/(\S{2}DT\S{3})/i;

This

  1. scans $line and captures the text corresponding to the pattern
  2. returns "all" the captures (1) to the "list" on the other side.

This pseudo-list context is how we catch the first item in a list. It's done the same way as with parameters passed to a subroutine.

my ( $first, $second, @rest ) = @_;


my ( $first_capture, $second_capture, @others ) = $feldman =~ /$some_pattern/;

NOTE: That said, your regex assumes too much about the text to be useful in more than a handful of situations. It will not capture any table name that doesn't have DT in positions 3 and 4 out of 7. It's good enough if 1) you want something quick and dirty, or 2) you're okay with limited applicability.
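
To make the role of the parentheses around the variable concrete, here is a small sketch comparing scalar and list assignment of the same match. The input line and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

my $line = 'SELECT * FROM AADTTAB';   # hypothetical input

# Scalar assignment: the match runs in scalar context and yields true/false.
my $matched = $line =~ /(\S{2}DT\S{3})/i;

# List assignment (note the parentheses): the match returns its captures.
my ($table_name) = $line =~ /(\S{2}DT\S{3})/i;

print "matched: $matched\n";
print "table:   $table_name\n";
```

If the match fails, $table_name is left undefined, so it is worth checking it before use.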

Honea answered 15/7, 2009 at 19:8 Comment(2)
It's really list context, there's nothing pseudo about it! The tricky thing is using a list of one item. Capturing the results of an operation in a single item list can be very handy when you want to force list-context behavior from the operator or subroutine you are calling. my $foo = @bar; is very different from my ($foo) = @bar;, and the distinction can come in very handy. - Sigmon
Oh, it does come in handy. I use it all the time. I guess "pseudo" is a bad way to put it. I know that a list of one is still a list, it just looks an awful lot like a scalar--and that's all I'm trying to get anyway. - Honea
8

It would be better to match the pattern only where it follows FROM. I assume table names consist solely of ASCII letters. In that case, it is best to say exactly what you want in the pattern ([A-Z] rather than \S). With those two remarks out of the way, note that a successful capturing regex match in list context returns the matched substring(s).

#!/usr/bin/perl

use strict;
use warnings;

my $s = 'select * from aadttab, bbdttab';
if ( my ($table) = $s =~ /FROM ([A-Z]{2}DT[A-Z]{3})/i ) {
    print $table, "\n";
}
__END__

Output:

C:\Temp> s
aadttab

Depending on the version of perl on your system (5.10 or later), you may be able to use a named capturing group, which might make the whole thing easier to read:

if ( $s =~ /FROM (?<table>[A-Z]{2}DT[A-Z]{3})/i ) {
    print $+{table}, "\n";
}

See perldoc perlre.

Symbol answered 15/7, 2009 at 15:18 Comment(0)
7

Parens will let you grab part of the regex into special variables: $1, $2, $3... So:

$line = ' abc andtabl 1234';
if($line =~m/ (\S{2}DT\S{3})/i)   {   
    # here I wish to get (only) substring that match to pattern \S{2}DT\S{3}    
    # (7 letter table name) and display it.      
    print $1."\n";
}
Feinleib answered 15/7, 2009 at 15:22 Comment(0)
3

Use a capturing group:

my $substr;
if( $line =~ /(\S{2}DT\S{3})/i ) {
    $substr = $1;
}
Lilytrotter answered 15/7, 2009 at 15:19 Comment(1)
Always check if the match succeeded before using match variables. - Bioscope
3

$& contains the string matched by the last pattern match.

Example:

$str = "abcdefghijkl";
$str =~ m/cdefg/;
print $&;
# Output: "cdefg"

So you could do something like

if($line =~m/ \S{2}DT\S{3}/i) {
    print $&."\n";
}

WARNING:

If you use $& in your code it will slow down all pattern matches.

UPDATE 2023:

Brian D. Foy says in the comments (also see his recent answer here):

Perl v5.20 mostly did away with the performance penalty caused by $&, so this is the easiest way to go.

Schizoid answered 15/7, 2009 at 16:11 Comment(6)
Avoid using $& and the related $` and $', they cause a performance penalty on all regexes in your code. See perlre (perldoc.perl.org/perlre.html) for more info. - Sigmon
Just the mere mention of $&, anywhere in your code, will slow down all regexes. It doesn't even matter if you actually use the value. - Loganloganberry
During my studies I had a habit of testing statements like this. Has anybody measured how bad this ($&) practice really is? Up to 10%/30%? Can you share results? - Sillabub
I think I remember reading that $& was slated to be deprecated, sometime in the future. - Loganloganberry
I think there may have been some changes that reduce the effect in perl 5.10. - Loganloganberry
Perl v5.20 mostly did away with the performance penalty caused by $&, so this is the easiest way to go. - Foment
2

The advice about using a capture may have been the way to go when people originally answered this. Perl has moved on since then, and using $& is probably the best answer now.

There's one big reason not to use a capture: it throws off the numbering for all other captures inside the pattern. In that case, you can use labeled captures, such as (?<name>\w+), and look in either %- or %+ for them so you don't depend on the numbers.
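
As a rough illustration of labeled captures, the sketch below pulls two pieces out of one line by name. The input line, the labels table and column, and the WHERE clause are all made up for the example; it needs Perl v5.10 or later.

```perl
use strict;
use warnings;

my $line = 'SELECT * FROM AADTTAB WHERE ID = 42';  # hypothetical input

my ( $table, $column );

# Named groups are looked up by label, so adding or removing other
# groups in the pattern does not change how these are retrieved.
if ( $line =~ /FROM\s+(?<table>\w{2}DT\w{3})\s+WHERE\s+(?<column>\w+)/i ) {
    $table  = $+{table};
    $column = $+{column};
    print "table:  $table\n";
    print "column: $column\n";
}
```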

Another answer mentioned $&, which is the part of the string that matched the pattern. That answer also noted that it slows down the overall program because perl now needs to track this information for every regex just in case you use it for that pattern.

However, Perl v5.20 started using copy-on-write in many places, and the issue with $& became mostly moot. Perl v5.18 had also made some changes so it only tracked the special per-match variables that you actually used instead of all three of them ($`, $&, $').

Previously, Perl v5.10 had already added the /p switch to enable a parallel set of per-match variables that did not have this performance penalty. These variables only have long names:

use v5.10;
if( $string =~ m/.../p ) {
    say <<"HERE";
Before match: ${^PREMATCH}    
Matched: ${^MATCH}
After match: ${^POSTMATCH}    
HERE
    }

And v5.26 added @{^CAPTURE} so you could get a list of all captures without knowing how many captures there were. However, instead of having the first item (index 0) be the equivalent of $&, it's just $1, so everything is off by one. :/
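
A short sketch of that indexing, assuming Perl v5.26 or later; the input line and the @captures name are invented for the example.

```perl
use v5.26;
use warnings;

my $line = 'SELECT * FROM AADTTAB, BBDTTAB';   # hypothetical input

my @captures;
if ( $line =~ /FROM\s+(\w+),\s*(\w+)/i ) {
    @captures = @{^CAPTURE};   # copy now; a later match would replace it
    # Index 0 holds $1 and index 1 holds $2.
    say "first:  $captures[0]";
    say "second: $captures[1]";
}
```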

Foment answered 31/8, 2023 at 12:18 Comment(0)
