How do I match only fully-composed characters in a Unicode string in Perl?
E

5

8

I'm looking for a way to match only fully composed characters in a Unicode string.

Is [:print:] dependent upon locale in any regular expression implementation that incorporates this character class? For example, will it match the Japanese character 'あ', since it is not a control character, or is [:print:] always going to be ASCII codes 0x20 to 0x7E?

Is there any character class, including Perl REs, that can be used to match anything other than a control character? If [:print:] includes only characters in ASCII range I would assume [:cntrl:] does too.

Electrograph answered 15/10, 2008 at 3:10 Comment(0)
N
6
echo あ| perl -nle 'BEGIN{binmode STDIN,":utf8"} print"[$_]"; print /[[:print:]]/ ? "YES" : "NO"'

This mostly works, though it generates a warning about a wide character. But it gives you the idea: you must be sure you're dealing with a real Unicode string (check utf8::is_utf8). Or just read perlunicode; the whole subject still makes my head spin.
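For code that cannot rely on command-line switches or binmode (a module, say), the same idea can be sketched by decoding the bytes explicitly. This is only a minimal illustration: the variable names $bytes and $chars are made up for the example, and the input is assumed to be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes for U+3042 (hiragana a), as they would arrive
# from a filehandle that has no decoding layer.
my $bytes = "\xE3\x81\x82";

# Decode the bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d\n", length $bytes;
printf "chars: length %d, utf8 flag %s\n",
    length $chars, utf8::is_utf8($chars) ? 'on' : 'off';
printf "decoded string matches [[:print:]]: %s\n",
    $chars =~ /\A[[:print:]]\z/ ? 'YES' : 'NO';
```

The script prints only ASCII, so it avoids the wide character warning regardless of how STDOUT is set up.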

Nebraska answered 15/10, 2008 at 5:27 Comment(4)
You can get rid of the ugly BEGIN{binmode STDIN, ":utf8"} kludge by supplying the option -CS on the command line.Censorship
... that will also make the warning go away, because it sets up STDOUT in the same way as STDIN.Censorship
That may not be as much of an option if the OP is writing a module to handle this instead of a standalone script. So I'm going to leave my solution, as well as your fix in the hopes the OP can figure out which one is better for his/her scenario. Thanks :-)Nebraska
This pattern is wrong. [[:print:]] will match "\x{3099}" which is not a fully-composed character! See my answer for a working pattern.Crosslink
C
5

I think you don't want or need locales for that, but rather Unicode. If you have decoded a text string, \w will match word characters in any language, \d matches not just 0..9 but every Unicode digit, etc. In regexes you can query Unicode properties with \p{PropertyName}. Particularly interesting for you might be \p{Print}. Here's a list of all the available Unicode character properties.

I wrote an article about the basics and subtleties of Unicode and Perl; it should give you a good idea of what to do so that perl recognizes your string as a sequence of characters, not just a sequence of bytes.

Update: with Unicode you don't get language-dependent behaviour, but instead sane defaults regardless of language. This may or may not be what you want, but for the distinction between printable and control characters I don't see why you'd need language-dependent behaviour.
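A small sketch of these classes applied to decoded character strings. The sample characters are chosen only for illustration and are written as escapes, so the source file encoding does not matter.

```perl
use strict;
use warnings;

my $hiragana_a   = "\x{3042}";  # HIRAGANA LETTER A
my $arabic_digit = "\x{0663}";  # ARABIC-INDIC DIGIT THREE
my $tab          = "\t";        # an ASCII control character

# \w and \d follow Unicode rules on character strings like these.
print "hiragana matches \\w\n"           if $hiragana_a   =~ /\A\w\z/;
print "Arabic-Indic digit matches \\d\n" if $arabic_digit =~ /\A\d\z/;

# \p{Print} is the property form of [[:print:]].
print "hiragana matches \\p{Print}\n"    if $hiragana_a =~ /\A\p{Print}\z/;
print "tab does not match \\p{Print}\n"  if $tab        !~ /\A\p{Print}\z/;
```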

Censorship answered 15/10, 2008 at 6:48 Comment(0)
C
4

\X matches a fully-composed character (sequence). Proof:

#!/usr/bin/env perl
use 5.010;
use utf8;
use Encode qw(encode_utf8);

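# Test data written as escapes so the precomposed and decomposed forms cannot
# be confused: あ, precomposed ご, こ + combining U+3099, and U+3099 alone.
for my $string ("\x{3042}", "\x{3054}", "\x{3053}\x{3099}", "\x{3099}") {
    # The anchored pattern succeeds only if the whole string is a single \X match.
    say encode_utf8 sprintf "%s $string", $string =~ /\A \X \z/msx ? 'ok' : 'nok';
}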

The test data are: a normal character, a precomposed character, a combining character sequence, and a lone combining character (which "doesn't count" on its own; this is a simplification of Chapter 3 of the Unicode Standard).

Substitute \X with [[:print:]] to see that Tanktalus' answer produces false matches for the last two cases.
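As a further sketch, \X can also split a longer string into character sequences, and NFC from the core Unicode::Normalize module can compose sequences that have a precomposed form. The sample string is made up for the example. One caveat to check on your own perl: on newer versions, where \X follows the Unicode extended grapheme cluster rules, a lone combining mark may itself count as a cluster, so the last test case above is worth rerunning there.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# U+3053 followed by the combining mark U+3099, then the precomposed U+3054.
my $text = "\x{3053}\x{3099}\x{3054}";

# Each \X match is one base character plus any combining marks that follow it.
my @clusters = $text =~ /(\X)/g;
printf "%d code points, %d clusters\n", length $text, scalar @clusters;

# NFC composes the first pair into the single code point U+3054.
my $nfc = NFC($text);
printf "after NFC: %d code points\n", length $nfc;
```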

Crosslink answered 7/1, 2010 at 23:12 Comment(0)
R
2

Yes, those expressions are locale dependent.

Raby answered 15/10, 2008 at 3:11 Comment(1)
Can you name an environment and/or regular expression implementation that allows [:print:] to respect a Japanese UTF-8 locale/encoding? I am using Perl on Linux with a Japanese UTF-8 locale/encoding and it does not match a Japanese character.Electrograph
N
1

You could always use the character class [^[:cntrl:]] to match non-control characters.

Nadenenader answered 15/10, 2008 at 3:26 Comment(1)
[:cntrl:] does not match Unicode control characters (in my environment setup and using Perl). There are Unicode control characters for changing text direction and so on. Using [^[:cntrl:]] will therefore match these Unicode ones, but not the ASCII ones.Electrograph
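The direction-changing characters mentioned in this comment are in the Unicode general category Cf (format), not Cc (control), which would explain why a control-character class lets them through. A minimal sketch that rejects both categories follows. The helper name has_no_control_or_format is hypothetical, and excluding all of Cf is an assumption that may be too strict for some text, since Cf also covers characters such as the soft hyphen.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): true when the string contains
# no control (Cc) and no format (Cf) characters.
sub has_no_control_or_format {
    my ($str) = @_;
    return $str !~ /[\p{Cc}\p{Cf}]/ ? 1 : 0;
}

my @samples = (
    [ 'hiragana a, U+3042',         "\x{3042}" ],
    [ 'ASCII escape, U+001B',       "\x{1B}"   ],
    [ 'right-to-left mark, U+200F', "\x{200F}" ],
);

for my $sample (@samples) {
    my ($label, $char) = @$sample;
    printf "%-28s %s\n", $label,
        has_no_control_or_format($char) ? 'accepted' : 'rejected';
}
```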