How to properly write regex for unicode first name in Java?

I need to write a regular expression to replace the invalid characters in a user's input before sending it on. I think I need to use string.replaceAll("regex", "replacement") for that. The line of code should replace every character that is not a Unicode letter, so it is effectively a whitelist of Unicode characters. Basically, it validates a user's first name and replaces the invalid characters.

What I've found so far is this: \p{L}\p{M}, but I'm not sure how to fire it up in regexp so it would work as I explained above. Would this be a regex negation case?

Associative answered 27/6, 2011 at 13:54 Comment(2)
Take note that some people's first names can contain spaces (María de la Cruz), hyphens (Anne-Marie), apostrophes (D'Juan), and maybe some other characters too.Sigismond
@KarolS I would argue that de la is part of the family name in this case. Though there certainly are people with multiple first names (like me being "Paul Peter Richard" officially), which would create spaces.Concatenate

Yes, you need negation. The regular expression would be [^\p{L}] for anything except letters. Another way to write this would be \P{L}.

\p{M} means "all marks", thus [^\p{L}\p{M}] means anything which is neither a letter nor a mark. This could also be written as [\P{L}&&[\P{M}]], but this is not really better.

In a Java-String all \ have to be doubled, so you would write string.replaceAll("[^\\p{L}\\p{M}]", "replacement") there.
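
As a minimal sketch of the call above: the class name NameFilter and the helper keepLettersAndMarks are made up for illustration, and the empty string is just one possible replacement. Note that this pattern also strips spaces, hyphens and apostrophes, which the comments on the question point out can be part of a first name.

```java
public class NameFilter {
    // Matches any single code point that is neither a letter (L) nor a mark (M).
    private static final String NOT_LETTER_OR_MARK = "[^\\p{L}\\p{M}]";

    // Hypothetical helper: removes everything except letters and marks.
    static String keepLettersAndMarks(String input) {
        return input.replaceAll(NOT_LETTER_OR_MARK, "");
    }

    public static void main(String[] args) {
        // Digits and punctuation are removed, the accented letter stays.
        System.out.println(keepLettersAndMarks("Ren\u00e9e42!"));
        // The hyphen is removed as well, which may not be what you want.
        System.out.println(keepLettersAndMarks("Anne-Marie"));
    }
}
```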


From a comment:

By the way, regarding to your answer, what fall in the marks category? Do I even need that? Wouldn't just letters be fine for firstname?

This category consists of the subcategories

  • Mn: Mark, Non-Spacing

    An example for this is ̀, U+0300. This is the COMBINING GRAVE ACCENT, and can be used together with a letter (the letter before) to create accented characters. For the commonly used accented characters there is already a precomposed form (e.g. é), but for other ones there is not.

  • Mc: Mark, Spacing Combining.

    These are quite rare ... I found them mainly in South Asian scripts, and for musical notes. For example, we have U+1D165, MUSICAL SYMBOL COMBINING STEM (𝅥), which could be combined with U+1D15D, MUSICAL SYMBOL WHOLE NOTE (𝅝), to something like 𝅝𝅥. (Hmm, the images do not look right here. I suppose my browser does not support these characters. Have a look at the code charts, if they are wrong here.)

  • Me: Mark, Enclosing

    These are marks which somehow enclose the base letter (the previous one, if I understand right). One example would be U+20DD, ⃝, which allows creating things like A⃝. (This should be rendered as an A enclosed by a circle, if I understand right. It does not, in my browser.) Another one would be U+20E3, ⃣, COMBINING ENCLOSING KEYCAP, which should give the look of a key cap with the letter on it (A⃣). (They do not show in my browser. Have a look at the code chart, if you can't see them.)

You can find them all by searching in UnicodeData.txt for ;Mn;, ;Mc; or ;Me;, respectively. Some more information is in the FAQ: Characters and Combining Marks.
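
If you would rather ask Java than search the data file, here is a small sketch using Character.getType, which reports the general category of a code point. The class name MarkCategories and the helper markCategory are only illustrative, and the result depends on the Unicode version your JDK ships with.

```java
public class MarkCategories {
    // Maps a code point to the abbreviation of its mark subcategory,
    // or "other" if it is not a mark at all.
    static String markCategory(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.NON_SPACING_MARK:       return "Mn";
            case Character.COMBINING_SPACING_MARK: return "Mc";
            case Character.ENCLOSING_MARK:         return "Me";
            default:                               return "other";
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x0300, 0x0903, 0x20DD, 'A' };
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, markCategory(cp));
        }
    }
}
```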

Do you need them? I'm not sure here. Most common names (at least in Latin alphabets) would use precomposed letters, I think. But the user might input them in decomposed form - I think on Mac OS X this is actually the default. In that case you would have to run the normalization algorithm before filtering away unknown characters. (Running the normalization seems a good idea anyway if you want to compare the names and not only show them on screen.)
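
A rough sketch of "normalize first, then filter", using java.text.Normalizer from the standard library. The class name NormalizedNameFilter and the method clean are invented for this example, and NFC is just one reasonable choice of form. Keeping \p{M} in the pattern still matters, because not every letter-plus-mark combination has a precomposed form.

```java
import java.text.Normalizer;

public class NormalizedNameFilter {
    // Hypothetical helper: composes the input to NFC, then removes
    // everything that is neither a letter nor a mark.
    static String clean(String input) {
        String nfc = Normalizer.normalize(input, Normalizer.Form.NFC);
        return nfc.replaceAll("[^\\p{L}\\p{M}]", "");
    }

    public static void main(String[] args) {
        // "e" followed by COMBINING ACUTE ACCENT is composed into one letter.
        String composed = clean("Jose\u0301 1");
        System.out.println(composed.length()); // 4 chars: J, o, s and the composed letter
        // "x" with COMBINING GRAVE ACCENT has no precomposed form, so the
        // mark survives normalization and must be allowed by the filter.
        System.out.println(clean("x\u0300").length()); // 2
    }
}
```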


Edit: not directly relating to the question, but relating to the discussion in the comments:

I wrote a quick test program to show that [^\pL\pM] is not equivalent to [\PL\PM]:

package de.fencing_game.paul.examples;

import java.util.regex.*;

public class RegexSample {

    static String[] regexps = {
        "[^\\pL\\pM]", "[\\PL\\PM]",
        ".", "\\pL", "\\pM",
        "\\PL", "\\PM"
    };

    static String[] strings = {
        "x", "A", "3", "\n", ".", "\t", "\r", "\f",
        " ", "-", "!", "»", "›", "‹", "«",
        "ͳ", "Θ", "Σ", "Ϫ", "Ж", "ؤ",
        "༬", "༺", "༼", "ང", "⃓", "✄",
        "⟪", "や", "゙", 
        "+", "→", "∑", "∢", "※", "⁉", "⧓", "⧻",
        "⑪", "⒄", "⒰", "ⓛ", "⓶",
        "\u0300" /* COMBINING GRAVE ACCENT, Mn */,
        "\u0BCD" /* TAMIL SIGN VIRAMA, Mn */,
        "\u20DD" /* COMBINING ENCLOSING CIRCLE, Me */,
        "\u2166" /* ROMAN NUMERAL SEVEN, Nl */,
    };


    public static void main(String[] params) {
        Pattern[] patterns = new Pattern[regexps.length];

        System.out.print("       ");
        for(int i = 0; i < regexps.length; i++) {
            patterns[i] = Pattern.compile(regexps[i]);
            System.out.print("| " + patterns[i] + " ");
        }
        System.out.println();
        System.out.print("-------");
        for(int i = 0; i < regexps.length; i++) {
            System.out.print("|-" +
                             "--------------".substring(0,
                                                        regexps[i].length()) +
                             "-");
        }
        System.out.println();

        for(int j = 0; j < strings.length; j++) {
            System.out.printf("U+%04x ", (int)strings[j].charAt(0));
            for(int i = 0; i < regexps.length; i++) {
                boolean match = patterns[i].matcher(strings[j]).matches();
                System.out.print("| " + (match ? "✔" : "-")  +
                                 "         ".substring(0, regexps[i].length()));
            }
            System.out.println();
        }
    }
}

Here is the output (with OpenJDK 1.6.0_20 on OpenSUSE):

       | [^\pL\pM] | [\PL\PM] | . | \pL | \pM | \PL | \PM 
-------|-----------|----------|---|-----|-----|-----|-----
U+0078 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0041 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0033 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+000a | ✔         | ✔        | - | -   | -   | ✔   | ✔   
U+002e | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0009 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+000d | ✔         | ✔        | - | -   | -   | ✔   | ✔   
U+000c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0020 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+002d | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0021 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+00bb | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+203a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2039 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+00ab | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0373 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0398 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+03a3 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+03ea | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0416 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0624 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0f2c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0f3a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0f3c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0f44 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+20d3 | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+2704 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+27ea | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+3084 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+3099 | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+002b | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2192 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2211 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2222 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+203b | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2049 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+29d3 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+29fb | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+246a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2484 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+24b0 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+24db | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+24f6 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0300 | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+0bcd | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+20dd | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+2166 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   

We can see that:

  1. [^\pL\pM] is not equivalent to [\PL\PM]
  2. [\PL\PM] really matches everything, but
  3. still [\PL\PM] is not equal to ., since . does not match \n and \r.

The second point is caused by the fact that [\PL\PM] is the union of \PL and \PM: \PL contains characters from all categories other than L (including M), and \PM contains characters from all categories other than M (including L) - together they contain the whole character repertoire.

[^\pL\pM], on the other hand, is the complement of the union of \pL and \pM, which is equivalent to the intersection of \PL and \PM.
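
To check this equivalence without reading the whole table, here is a small sketch that compares the negated union with the explicit intersection [\P{L}&&[\P{M}]] from the first part of the answer. The class and field names are made up for this example, and the sample strings are just a handful from the test program above.

```java
import java.util.regex.Pattern;

public class ComplementCheck {
    // Complement of the union: neither letter nor mark.
    static final Pattern NEGATED_UNION = Pattern.compile("[^\\pL\\pM]");
    // Intersection of the two complements, written with Java's && operator.
    static final Pattern INTERSECTION = Pattern.compile("[\\P{L}&&[\\P{M}]]");
    // Union of the two complements: matches every single code point.
    static final Pattern UNION = Pattern.compile("[\\PL\\PM]");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "x", "3", " ", "-", "\u0300", "\u20DD" };
        for (String s : samples) {
            System.out.printf("U+%04X negatedUnion=%b intersection=%b union=%b%n",
                    s.codePointAt(0),
                    matches(NEGATED_UNION, s),
                    matches(INTERSECTION, s),
                    matches(UNION, s));
        }
    }
}
```

For every sample the first two columns should agree, while the third is always true.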

Triglyph answered 27/6, 2011 at 14:13 Comment(11)
The braces are superfluous. And remember that [^\pL\pM] is the same as [\PL\PM], which means that you don’t need negation at all.Tho
@tchrist: Is it the same? I would understand the latter as anything that is not a letter, or not a mark, which would be everything which is not both mark and letter.Concatenate
(About the braces, no idea. The documentation shows it only with braces, thus I put them in.)Concatenate
[\PL\PM] means anything that is either a non-letter or a non-mark. It cannot be something that is “not both a mark and a letter” because these are (shortcuts to) General Category assignments, and it is guaranteed that any code point has only a single GC. \pL is an alias for [\p{Lu}\p{Lt}\p{Ll}\p{Lm}\p{Lo}] and \pM an alias for [\p{Mn}\p{Me}\p{Mc}]. That reminds me, you should (probably?) be allowing through \p{Lower} and \p{Upper}, but Java doesn’t support full Unicode properties. Alas.Tho
Regarding braces, I know they aren’t needed for the single-letter GC group aliases for three reasons: ① They are modelled after Perl’s, which work that way. ② I’ve glared at the relevant Java source code for Pattern.java really rather hard vis-à-vis JDK7. ③ I’ve very often used those particular shortcuts in real Java regex code.Tho
Yes, [\PL\PM] means either a non-letter or a non-mark, and this will include both marks and letters, i.e. this is equivalent to .. On the other hand, [^\pL\pM] would mean neither letter nor mark, which will include only digits, punctuation, spaces and such. But I think I should try it instead of trying to apply my knowledge of set theory here.Concatenate
It is not “equivalent to .”. The 7 General Classes are L, N, M, P, S, Z, and C. So if it is neither L nor M, it could be any of the remaining 5. Note however that there are alphabetic and indeed cased code points in other categories than L! However, for want of full properties, you can’t do \p{Alphabetic} or \p{Cased} in Java — yet. In JDK7, the UNICODE_CHARACTER_CLASSES compilation flag or the (?U) embedded flag will swap \p{alpha} around to behave properly.Tho
@tchrist: You are right, [\PL\PM] is not equivalent to . - but only as . does not match everything. I added an experiment to the answer, have a look.Concatenate
By the way, regarding to your answer, what fall in the marks category? Do I even need that? Wouldn't just letters be fine for firstname?Associative
@Richards: I must admit that before your question I had never heard of this category. I searched for some examples of this category for my example program (and I'm now adding some text about this to the answer).Concatenate
@Richards: No, “just letters” are very much not “just fine” for a first name! Consider simple names like Renée, François, José, &c. Yes, you can normalize those particular examples into NFC and sneak past on those three alone, but there are plenty you cannot. (You did normalize to NFC or NFD, didn’t you?) Plus there are innumerable other issues, as Elizabeth mentions. Even using JDK7’s \p{alpha} with the new UNICODE_CHARACTER_CLASSES compilation flag leaves out quite a bit.Tho

I don't believe that Java’s default regex library (read: outside of linking to ICU’s, which I would suggest doing even though it requires JNI) supports the Unicode properties you need for this.

If it did, you would include \p{Diacritic} in your pattern. But you need full property support for that.

I suppose that you could shoot for (\pL\pM*)+ but that fails for various diacritics: What if someone’s first name is not just Étoile but L’étoile?
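
Here is a small sketch of how that pattern behaves when used for validation rather than replacement. The class name GraphemeNameCheck and the method looksLikeName are invented for illustration, and I am assuming the apostrophe in the second name is U+2019.

```java
import java.util.regex.Pattern;

public class GraphemeNameCheck {
    // One or more letters, each optionally followed by combining marks.
    private static final Pattern NAME = Pattern.compile("(\\pL\\pM*)+");

    // Hypothetical helper: true if the whole string fits the pattern.
    static boolean looksLikeName(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeName("\u00c9toile"));        // true
        System.out.println(looksLikeName("E\u0301toile"));       // true, decomposed form
        System.out.println(looksLikeName("L\u2019\u00e9toile")); // false, the apostrophe is punctuation
    }
}
```

Unlike [\pL\pM]+, this form also rejects a string that starts with a stray combining mark.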

Also, I thought that the problem of validating people’s names was considered virtually unsolvable, and so you should just let people use whatever they like, possibly cleaned up per RFC 3454’s “stringprep” algorithm.

Tho answered 27/6, 2011 at 14:12 Comment(0)
