Perl: How to match FULLWIDTH LATIN SMALL

Asked 9/5, 2013 at 20:17 Answered 9/5, 2013 at 20:34

Solved regex perl unicode character-properties

I use listadmin to manage many Mailman-based mailing lists, with a long list of Subject and From patterns set up to block spam. Recently I have been receiving smarter spam that uses nice-looking Unicode characters, e.g.:

Subject: Ａl l ｔhe ad ult mov ies you' ve see n a r e nothing c ompari- ng t o our eｘｘ xci t i ng compilation of 1３' ０00 mov ies in HD t hat arｅ a v ailable for y ｏu now!

Subject: HD qua lit y vi d eos an d pho to graph s ｏ f ho t c hic kｓ
are here for ｕ

Now I want to use a smart Perl regex to block that. Piping these subjects to hexdump revealed that many of the characters are FULLWIDTH LATIN SMALL LETTERs. However, \p{FULLWIDTH LATIN SMALL LETTER} doesn't work; Perl reports: Can't find Unicode property definition "FULLWIDTH LATIN SMALL LETTER"

So the question is: Is there a \p{something} to match those fullwidth characters? Alternatively: is there another way to match those characters?

Kelso answered 9/5, 2013 at 20:17 Comment(0)

The perlunicode page documents the available Unicode character properties. I found it referenced from perlrebackslash, which documents special character classes and backslash sequences like \p{...} in regexes.

The summary is that all but the most common property classes require a property type and a property value, which are separated by : or =. However, there does not seem to be a mention of fullwidth characters as a predefined property.

But there is the Block (short form: Blk) property, which can take Halfwidth and Fullwidth Forms (U+FF00–U+FFEF) as its value:

/\p{Block=Halfwidth and Fullwidth Forms}/

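This will match on your input (tested on v16.3). Note that the block covers more than the small letters: it also contains the fullwidth capitals, digits and punctuation, as well as halfwidth Katakana and Hangul.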
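A minimal sketch of how that property could be used to flag subjects. The helper name has_fullwidth and the sample subjects are made up for illustration, and nothing here is tied to how listadmin itself applies its patterns:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string contains any character from
# the Halfwidth and Fullwidth Forms block (U+FF00..U+FFEF), otherwise 0.
sub has_fullwidth {
    my ($text) = @_;
    return $text =~ /\p{Block=Halfwidth and Fullwidth Forms}/ ? 1 : 0;
}

# Sample subjects (assumed); \x{FF55} is FULLWIDTH LATIN SMALL LETTER U.
my @subjects = (
    "are here for \x{FF55}",
    "are here for u",
);

for my $subject (@subjects) {
    print has_fullwidth($subject) ? "blocked\n" : "ok\n";
}
```

The sketch assumes the subject is already a decoded character string. Raw mail headers are bytes and are often MIME-encoded, so they would probably need something like Encode::decode('MIME-Header', $raw) first.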

A useful tool for this is uniprops.

$ uniprops U+FF41
U+FF41 ‹ａ› \N{FULLWIDTH LATIN SMALL LETTER A}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    All Any Alnum Alpha Alphabetic Assigned InHalfwidthAndFullwidthForms
    Cased Cased_Letter LC Changes_When_Casemapped CWCM
    Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT
    Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase
    Halfwidth_And_Fullwidth_Forms Hex XDigit Hex_Digit ID_Continue IDC
    ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower Lowercase
    Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum
    X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
    X_POSIX_XDigit

As the InHalfwidthAndFullwidthForms entry in that output shows, \p{Block=Halfwidth and Fullwidth Forms} can also be written \p{In Halfwidth and Fullwidth Forms}.
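To check that the different spellings really select the same characters, something like the following could be used. It is only a sketch: count_matches is a hypothetical helper, and the explicit range [\x{FF00}-\x{FFEF}] is added for comparison, taken from the block boundaries quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the code points in $first..$last whose
# single-character string matches the compiled pattern $re.
sub count_matches {
    my ($re, $first, $last) = @_;
    return scalar grep { chr($_) =~ $re } $first .. $last;
}

my %spelling = (
    block_form => qr/\p{Block=Halfwidth and Fullwidth Forms}/,
    in_form    => qr/\p{InHalfwidthAndFullwidthForms}/,
    range_form => qr/[\x{FF00}-\x{FFEF}]/,
);

# Scan a window that starts before the block and ends at its last code point.
for my $name (sort keys %spelling) {
    my $n = count_matches($spelling{$name}, 0xFE00, 0xFFEF);
    print "$name: $n\n";
}
```

All three should report the same count, since a block property matches every code point in the block's range, assigned or not.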

Mcquillin answered 9/5, 2013 at 20:34 Comment(2)

many thanks @ikegami for the enlightening edit and the entertaining module it linked to. – Mcquillin 9/5, 2013 at 23:26

It's one of tchrist's. unichars can be used to do the converse. e.g. unichars -au '\p{InHalfwidthAndFullwidthForms}' lists the chars in the HalfwidthAndFullwidthForms block. – Nucleus 9/5, 2013 at 23:36

You can use charnames::viacode to get the character names from their codes:

#!/usr/bin/perl
use warnings;
use strict;
use utf8;

use charnames qw();  # only charnames::viacode is needed, so import nothing

my $string = q(Subject: Ａl l ｔhe ad ult mov ies you' ve see n a r e nothing )
            .q(c ompari- ng t o our eｘｘ xci t i ng compilation of 1３' ０00 )
            .q(mov ies in HD t hat arｅ a v ailable for y ｏu now!);

# Look up each character's Unicode name and count the names containing "FULLWIDTH".
my $count = grep /FULLWIDTH/, map charnames::viacode(ord), split //, $string;
print "$count fullwidth characters.\n";

Mohsen answered 9/5, 2013 at 20:34 Comment(0)
