How to redefine \s to match underscores?

In Perl (< v5.18), the regular expression character class \s for whitespace is the same as [\t\n\f\r ].

Since some filenames use underscores in place of spaces, I was wondering whether it's possible to redefine \s (locally) to match underscores in addition to whitespace.

This would be merely for the sake of readability of otherwise convoluted regular expressions having many [\s_]. Can I do this? If so, how?

Nmr answered 13/7, 2015 at 23:56 Comment(6)
IMHO, changing the meaning of \s to silently behave in a non-standard way would harm readability, not improve it. Even if you clearly document this in the comments, it requires anyone reading your code to remember that every time they see \s, they need to mentally replace it with [\s_]. - Kilimanjaro
$s=qr/[\s_]/; - Essay
The offhand way it's done is by using qr in custom overloads. See "Creating Custom RE Engines" in the perlre docs. - Layfield
In ASCII only, \s matches [\t\n\x0B\f\r ]. \x0B is a vertical tab character or line tabulation. In Unicode it matches another 18 extended characters. - Soche
Overriding qr would be very complicated. You can actually create custom properties quite easily. - Essay
You could use the (?(DEFINE)(?<MY_PATTERN>...)) mechanism, but that'd end up uglier than [\s_]. - Balzer
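
Two of the alternatives raised in the comments can be sketched roughly as below. This is only a minimal illustration, not code from any of the commenters: the names $ws and IsSpaceOrUnderscore are made up for the example, and the custom property as written covers only ASCII whitespace plus the underscore.

```perl
use strict;
use warnings;

# Option 1: keep the class in a precompiled qr// object and interpolate it
# wherever [\s_] would otherwise be written. ($ws is an illustrative name.)
my $ws = qr/[\s_]/;

# Option 2: a user-defined character property. The sub name must begin with
# "In" or "Is"; it returns newline-separated ranges of hex code points.
# (IsSpaceOrUnderscore is an illustrative name; ASCII-only by assumption.)
sub IsSpaceOrUnderscore {
    return join "\n",
        "0009\t000D",    # tab, newline, vertical tab, form feed, carriage return
        "0020\t0020",    # space
        "005F\t005F",    # underscore
        "";
}

for my $name ('aaa bbb.txt', 'aaa_bbb.txt', 'aaabbb.txt') {
    my $by_qr   = $name =~ /aaa$ws.*txt/                     ? 'yes' : 'no';
    my $by_prop = $name =~ /aaa\p{IsSpaceOrUnderscore}.*txt/ ? 'yes' : 'no';
    print "$name: qr=$by_qr property=$by_prop\n";
}
```

Both variants leave \s itself untouched, so a later reader does not have to remember a redefinition.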

Whenever I think that something is impossible in Perl, it usually turns out that I am wrong. And sometimes when I think that something is very difficult in Perl, I am wrong, too. @sln pointed me onto the right track.

Let's not override \s just yet, although you could. For the sake of the heirs of your program who expect \s to mean something specific, let's instead define the sequence \_ to mean "any whitespace character or the _ character" inside a regular expression. The details are in the "Creating Custom RE Engines" section of the perlre docs that @sln pointed to, but the implementation looks like:

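package myspace;  # redefine  \_  to mean  [\s_]
use strict;
use warnings;
use overload;

# Escape sequence => replacement text. '\\' maps to itself, so an escaped
# backslash followed by '_' is consumed first and left alone.
my %rules = ('\\' => '\\\\', '_' => qr/[\t\n\x{0B}\f\r _]/ );

sub import {
    die if @_ > 1;    # this module takes no import arguments
    # Rewrite the source text of each constant regex piece in the caller's scope.
    overload::constant 'qr' => sub {
        my $re = shift;
        $re =~ s{\\(\\|_)}{$rules{$1}}gse;
        return $re;
    };
}
1;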

Now in your script, say

use myspace;

and now \_ in a regular expression means [\s_].

Demo:

use myspace;
while (<DATA>) {
    chomp;
    if ($_ =~ /aaa\s.*txt/) {      # match whitespace
        print "match[1]: $_\n";
    }
    if ($_ =~ /aaa\_.*txt/) {      # match [\s_]
        print "match[2]: $_\n";
    }
    if ($_ =~ /\\_/) {             # match literal  '\_'
        print "match[3]: $_\n";
    }
}
__DATA__
aaabbb.txt
aaa\_ccc.txt
cccaaa bbb.txt
aaa_bbb.txt

Output:

match[3]: aaa\_ccc.txt
match[1]: cccaaa bbb.txt
match[2]: cccaaa bbb.txt
match[2]: aaa_bbb.txt

The third case demonstrates that \\_ in a regular expression matches a literal \_, just as \\s matches a literal \s.
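
Since the question asked for a local redefinition, it may help to note that handlers installed with overload::constant are lexically scoped: they affect only regexes compiled in the block that contains the use statement. The sketch below is a single-file variant under that assumption. The BEGIN block stands in for a separate myspace.pm file, and the two helper subs are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# Inline stand-in for myspace.pm so that the sketch runs as a single file.
BEGIN {
    package myspace;
    use overload;
    my %rules = ('\\' => '\\\\', '_' => qr/[\t\n\x{0B}\f\r _]/);
    sub import {
        die if @_ > 1;
        overload::constant 'qr' => sub {
            my $re = shift;
            $re =~ s{\\(\\|_)}{$rules{$1}}ge;
            return $re;
        };
    }
    $INC{'myspace.pm'} = __FILE__;    # tell `use` the module is already loaded
}

# Hypothetical helper: \_ is rewritten only for regexes inside this block.
sub matches_with_myspace {
    my ($name) = @_;
    use myspace;
    return $name =~ /^aaa\_bbb\.txt\z/ ? 1 : 0;
}

# Hypothetical helper: same pattern text, outside the scope of `use myspace`.
sub matches_without_myspace {
    my ($name) = @_;
    return $name =~ /^aaa\_bbb\.txt\z/ ? 1 : 0;
}

for my $name ('aaa bbb.txt', 'aaa_bbb.txt') {
    printf "%-12s with=%d without=%d\n", $name,
        matches_with_myspace($name), matches_without_myspace($name);
}
```

Keeping the use statement inside the smallest block that needs it limits how much code a reader has to treat as using the non-standard escape.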

Hazen answered 14/7, 2015 at 1:26 Comment(1)
No one is using myspace nowadays... :) - Tallboy
