Regular expression to match boundary between different Unicode scripts
Regular expression engines have a concept of "zero width" matches, some of which are useful for finding edges of words:

  • \b - present in most engines to match any boundary between word and non-word characters
  • \< and \> - present in Vim to match only the boundary at the beginning of a word, and at the end of a word, respectively.

A newer concept in some regular expression engines is Unicode classes. One such class is script, which can distinguish Latin, Greek, Cyrillic, etc. These examples are all equivalent and match any character of the Greek writing system:

  • \p{greek}
  • \p{script=greek}
  • \p{script:greek}
  • [:script=greek:]
  • [:script:greek:]
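As a quick illustration, here is a minimal Perl sketch of the first three spellings (Perl is assumed only because the answer further down uses it). The bracketed `[:script=greek:]` forms are left out, since as far as I know Perl does not accept them. The sample string is made up for the example.

```perl
use strict;
use warnings;
use utf8;                              # the source below contains literal UTF-8

binmode STDOUT, ':encoding(UTF-8)';

my $text = "παν語 abc";                # hypothetical sample: Greek, Han, Latin

# Compound form with "=": every character whose Script property is Greek.
my @greek_eq    = $text =~ /\p{Script=Greek}/g;

# Compound form with ":" is the same property test, spelled differently.
my @greek_colon = $text =~ /\p{Script:Greek}/g;

# Property names and values are matched loosely, so lowercase works too.
my @greek_lower = $text =~ /\p{script=greek}/g;

print "Greek characters found: ", scalar(@greek_eq), "\n";
print "Characters: @greek_eq\n";
```

One caveat: in recent Perl releases the short form `\p{Greek}` is documented as testing Script_Extensions rather than Script, so the explicit `Script=` spelling is the safer one when exact script membership matters.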

But so far in my reading through sources on regular expressions and Unicode I haven't been able to determine if there is any standard or nonstandard way to achieve a zero-width match where one script ends and another begins.

In the string παν語 there would be a match between the ν and 語 characters, just as \b and \< would match just before the π character.

Now for this example I could hack something together based on looking for \p{Greek} followed by \p{Han}, and I could even hack something together based on all possible combinations of two Unicode script names.
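A minimal sketch of that pairwise hack, assuming Perl: a lookbehind for a Greek character combined with a lookahead for a Han character gives a zero-width match exactly at the Greek-to-Han edge. The variable names are invented for the example.

```perl
use strict;
use warnings;
use utf8;                              # the source below contains literal UTF-8

binmode STDOUT, ':encoding(UTF-8)';

my $text = "παν語";

# Zero-width assertion: Greek character on the left, Han character on the
# right. It consumes nothing, so it behaves like \b for this one script pair.
my $greek_to_han = qr/(?<=\p{Script=Greek})(?=\p{Script=Han})/;

# Splitting on a zero-width pattern cuts the string at the boundary.
my @pieces = split $greek_to_han, $text;

print "Piece: '$_'\n" for @pieces;
```

This handles exactly one ordered pair of scripts. Covering every pair means generating the pattern from a list of script names, and that list has to be kept in step with each Unicode release.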

But this wouldn't be a deterministic solution since new scripts are still being added to Unicode with each release. Is there a future-proof way to express this? Or is there a proposal to add it?

Grose answered 11/5, 2013 at 1:39 Comment(4)
Close but not exactly the same: #14943152. My answer is a boundary for a single character class (and this applies to any character class). Your question is about the boundary between any two languages. - Dorindadorine
@nhahtdh: Thanks. I'm surprised I didn't find your question in my searching. - Grose
I think everyone should read section 2 of this: unicode.org/reports/tr24 - Dorindadorine
I have a really hairy solution, one that pretty much works. However, it also core dumps under certain predictable circumstances, which means there’s an interpreter bug somewhere that my solution is tickling. I’m checking into it, because I don’t want to give you a solution that might dump core. - Bronk

EDIT: I just noticed you didn’t actually specify which pattern-matching language you were using. Well, I hope a Perl solution will work for you, since the needed machinations are likely to be really tough in any other language. Plus, if you’re doing pattern matching with Unicode, Perl really is the best choice available for that particular kind of work.


When the $rx variable below is set to the appropriate pattern, this little snippet of Perl code:

my $data = "foo1 and Πππ 語語語 done";

while ($data =~ /($rx)/g) {
   print "Got string: '$1'\n"; 
} 

Generates this output:

Got string: 'foo1 and '
Got string: 'Πππ '
Got string: '語語語 '
Got string: 'done'

That is, it pulls out a Latin string, a Greek string, a Han string, and another Latin string. This is pretty darned close to what I think you actually need.

The reason I didn’t post this yesterday is that I was getting weird core dumps. Now I know why.

My solution uses lexical variables inside of a (??{...}) construct. Turns out that that is unstable before v5.17.1, and at best worked only by accident. It fails on v5.17.0, but succeeds on v5.18.0 RC0 and RC2. So I’ve added a use v5.17.1 to make sure you’re running something recent enough to trust with this approach.

First, I decided that you didn’t actually want a run of all the same script type; you wanted a run of all the same script type plus Common and Inherited. Otherwise you will get messed up by punctuation and whitespace and digits for Common, and by combining characters for Inherited. I really don’t think you want those to interrupt your run of “all the same script”, but if you do, it’s easy to stop considering those.

So what we do is lookahead for the first character that has a script type of other than Common or Inherited. More than that, we extract from it what that script type actually is, and use this information to construct a new pattern that is any number of characters whose script type is either Common, Inherited, or whatever script type we just found and saved off. Then we evaluate that new pattern and continue.

Hey, I said it was hairy, didn’t I?

In the program I’m about to show, I’ve left in some commented-out debugging statements that show just what it’s doing. If you uncomment them, you get this output for the last run, which should help understand the approach:

DEBUG: Got peekahead character f, U+0066
DEBUG: Scriptname is Latin
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
Got string: 'foo1 and '
DEBUG: Got peekahead character Π, U+03a0
DEBUG: Scriptname is Greek
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Greek}]*}
Got string: 'Πππ '
DEBUG: Got peekahead character 語, U+8a9e
DEBUG: Scriptname is Han
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Han}]*}
Got string: '語語語 '
DEBUG: Got peekahead character d, U+0064
DEBUG: Scriptname is Latin
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
Got string: 'done'

And here at last is the big hairy deal:

use v5.17.1;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:std :utf8);
use utf8;

use Unicode::UCD qw(charscript);

# regex to match a string that's all of the
# same Script=XXX type
#
my $rx = qr{
    (?=
       [\p{Script=Common}\p{Script=Inherited}] *
        (?<CAPTURE>
            [^\p{Script=Common}\p{Script=Inherited}]
        )
    )
    (??{
        my $capture = $+{CAPTURE};
   #####printf "DEBUG: Got peekahead character %s, U+%04x\n", $capture, ord $capture;
        my $scriptname = charscript(ord $capture);
   #####print "DEBUG: Scriptname is $scriptname\n";
        my $run = q([\p{Script=Common}\p{Script=Inherited}\p{Script=)
                . $scriptname
                . q(}]*);
   #####print "DEBUG: string to re-interpolate as regex is q{$run}\n";
        $run;
    })
}x;


my $data = "foo1 and Πππ 語語語 done";

$| = 1;

while ($data =~ /($rx)/g) {
   print "Got string: '$1'\n";
}

Yeah, there oughta be a better way. I don’t think there is—yet.

So for now, enjoy.
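For anyone who cannot rely on (??{...}), the same run-splitting logic can be sketched as a plain loop with no regex recursion. This is only a sketch under assumptions: the function name script_runs and the [script, substring] return shape are made up for illustration, and it leans on the same charscript function from Unicode::UCD used above.

```perl
use strict;
use warnings;
use utf8;                              # the source below contains literal UTF-8
use Unicode::UCD qw(charscript);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: split $text into runs of one script each, letting
# Common and Inherited characters ride along with whatever run they are in.
# Returns a list of [ $script_name_or_undef, $substring ] pairs.
sub script_runs {
    my ($text) = @_;
    my @runs;
    for my $char (split //, $text) {
        my $script  = charscript(ord $char) // 'Unknown';
        my $neutral = $script eq 'Common' || $script eq 'Inherited';

        if (!@runs) {
            # First character: a neutral one leaves the run's script undecided.
            push @runs, [ $neutral ? undef : $script, $char ];
        }
        elsif ($neutral || !defined $runs[-1][0] || $runs[-1][0] eq $script) {
            # Same script, a neutral character, or an undecided run: extend it.
            $runs[-1][0] //= $script unless $neutral;
            $runs[-1][1] .= $char;
        }
        else {
            # A different real script starts a new run.
            push @runs, [ $script, $char ];
        }
    }
    return @runs;
}

my $data = "foo1 and Πππ 語語語 done";
for my $run (script_runs($data)) {
    my ($script, $string) = @$run;
    printf "Got %s string: '%s'\n", $script // 'neutral-only', $string;
}
```

On the sample data this yields the same four runs as the regex. One deliberate difference: a string made up only of Common and Inherited characters comes back as a single run with an undefined script, whereas the pattern above would not match it at all.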

Bronk answered 14/5, 2013 at 0:14 Comment(5)
Oh I specifically didn't specify a regex dialect, rather I asked about "standard", "nonstandard", and "proposed". I'm actually playing with XRegExp and reading through UTS #18 and regular-expressions.info but I'm more accustomed to Perl's and Vim's implementations. I guess I want to know what I should be able to do, even if specific dialects haven't implemented it yet. For workarounds I suppose JavaScript or even an extension to XRegExp would be best. (I'm writing this before reading the body of your answer by the way...) - Grose
@Grose UTS#18 wouldn’t cover this until at least Level 3, and nobody implements that yet. So we make do with what we can in the meanwhile. I haven’t actually looked at it lately, so I don’t know if this would be possible under Level 3. - Bronk
Besides yourself of course, who is actively pushing forward Unicode regex development these days? I know Perl has by far the best Unicode support and it's one of the main reasons it was my main language for years, but now I've moved for other reasons to a language with some of the worst Unicode support. I can definitely come up with a non-regex string splitter but it seemed like an obvious feature to include in a regex engine. Maybe I should submit some proposals? - Grose
@Grose Yes, you probably should. There’s a UTS out there on security and identifiers that might be worth looking at, because mixed scripts are a spoofing issue that I seem to recall is mentioned there. This would be really useful in that context. - Bronk
@Grose UTS#18 RL 2.2 Extended Grapheme Clusters talks about the possibility of \b{w} for word boundaries, \b{s} for sentence boundaries, etc. Seems like what you might want here is something more like a hypothetical \b{script}. But remember the Common and Inherited issue. There’s also RL3.3 Tailored Word Boundaries, but I don’t think that’s quite right either. - Bronk
