What does regex "\\p{Z}" mean?

I am working with some Java code that contains a statement like

String tempAttribute = ((String) attributes.get(i)).replaceAll("\\p{Z}","")

I am not used to regex, so what is the meaning of it? (If you could provide a website to learn the basics of regex that would be wonderful) I've seen that for a string like

ept as y the result is eptasy, which doesn't seem right. I believe whoever wrote this may only have wanted to trim leading and trailing spaces.

Public answered 12/5, 2015 at 15:40 Comment(2)
No, it's correct: you can see here that it matches all the whitespace, so the given code removes it with replaceAll(). - Vinegarroon
I found this "General Category" table on Wikipedia very helpful: en.wikipedia.org/wiki/… tl;dr: it's the Unicode database equivalent of a character class for "glyphs that separate words". - Tellurium

It removes all the whitespace (replaces all whitespace matches with empty strings).

A wonderful regex tutorial is available at regular-expressions.info. To quote that site:

\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
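
As a minimal sketch of what this means for the question's example (the class and method names here are made up for illustration, and the trimming pattern is only one possible alternative, not something from the original code), the following compares removing every separator with removing only leading and trailing ones:

```java
public class SeparatorStripDemo {

    // Removes every character in Unicode category Z, wherever it occurs.
    public static String removeAllSeparators(String s) {
        return s.replaceAll("\\p{Z}", "");
    }

    // Hypothetical alternative: removes separators only at the start and end.
    public static String trimSeparators(String s) {
        return s.replaceAll("^\\p{Z}+|\\p{Z}+$", "");
    }

    public static void main(String[] args) {
        String input = " ept as y ";
        System.out.println("[" + removeAllSeparators(input) + "]"); // [eptasy]
        System.out.println("[" + trimSeparators(input) + "]");      // [ept as y]
    }
}
```

If trimming was really the intent, something like the second method (or String.trim() for plain ASCII whitespace) would be closer to it than the original replaceAll call.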

Lais answered 12/5, 2015 at 15:42 Comment(3)
And what about the first backslash? - Public
The first backslash is an escape character. It denotes that p{Z} is a regular expression construct that looks for whitespace, instead of just the literal characters p, {, Z, and }. - Zebulun
The backslash is doubled in the program code because that is Java's syntax for string literals. The Java compiler turns the pair into a single backslash, and the string with one backslash is passed to the regex engine. See the Regex tutorial, "Special Characters and Programming Languages" section. - Lais
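
To make the escaping point concrete, here is a small sketch (the class and method names are illustrative only) that prints what the regex engine actually receives from the Java literal:

```java
public class BackslashDemo {

    // The Java string literal as it would be written in source code.
    public static String patternSource() {
        return "\\p{Z}";
    }

    public static void main(String[] args) {
        String literal = patternSource();
        // The compiler collapses "\\" into one backslash, so the
        // regex engine sees the five characters \ p { Z }.
        System.out.println(literal);          // \p{Z}
        System.out.println(literal.length()); // 5
    }
}
```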

First of all, \p means you are matching a class (a collection of characters), not one specific character. For reference, see the Javadoc of the Pattern class: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

Unicode scripts, blocks, categories and binary properties are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property.
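
A short sketch of the \p versus \P distinction described in that quotation, using the Z category from the question (the class and method names are invented for this example):

```java
public class PropertyComplementDemo {

    // \p{Z} matches separator characters, so this removes them.
    public static String dropSeparators(String s) {
        return s.replaceAll("\\p{Z}", "");
    }

    // \P{Z} matches everything that is NOT a separator,
    // so this removes all other characters and keeps only the separators.
    public static String keepOnlySeparators(String s) {
        return s.replaceAll("\\P{Z}", "");
    }

    public static void main(String[] args) {
        String input = "a b c";
        System.out.println("[" + dropSeparators(input) + "]");     // [abc]
        System.out.println("[" + keepOnlySeparators(input) + "]"); // two spaces between the brackets
    }
}
```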

Then Z is the name of a class (collection, set) of characters. In this case, it is the abbreviation for the Separator category. Separator contains three subcategories: Space_Separator (Zs), Line_Separator (Zl) and Paragraph_Separator (Zp).
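
The three subcategories can be inspected from Java with Character.getType. The following is a rough sketch (the class name, the categoryOf helper and the choice of sample characters are my own, not from the Javadoc):

```java
public class SeparatorCategoriesDemo {

    // Maps a character to its separator subcategory abbreviation,
    // or "other" if it is not in category Z at all.
    public static String categoryOf(char c) {
        switch (Character.getType(c)) {
            case Character.SPACE_SEPARATOR:     return "Zs";
            case Character.LINE_SEPARATOR:      return "Zl";
            case Character.PARAGRAPH_SEPARATOR: return "Zp";
            default:                            return "other";
        }
    }

    public static void main(String[] args) {
        // space, no-break space, line separator, paragraph separator, tab
        char[] samples = { ' ', '\u00A0', '\u2028', '\u2029', '\t' };
        for (char c : samples) {
            boolean matchesZ = String.valueOf(c).matches("\\p{Z}");
            System.out.println(String.format("U+%04X %s %b",
                    (int) c, categoryOf(c), matchesZ));
        }
    }
}
```

With these samples, only the tab should be reported as "other" and fail to match \p{Z}, since it is a control character and not a separator.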

For the characters these classes contain, see: Unicode Character Database or Unicode Character Categories

Further documentation: http://www.unicode.org/reports/tr18/#General_Category_Property

Balkh answered 18/7, 2019 at 3:31 Comment(0)

The OP stated that the code fragment was in Java. To comment on the statement:

\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.

the sample code below shows that this does not apply in Java.

import java.util.regex.PatternSyntaxException;

public class SeparatorTest {

    public static void main(String[] args) {

        // some normal white space characters: space, tab, line feed,
        // form feed, carriage return and vertical tab
        String str = "word1 \t \n \f \r " + '\u000B' + " word2";

        // various regex patterns meant to remove ALL white spaces
        String s = str.replaceAll("\\s", "");
        String p = str.replaceAll("\\p{Space}", "");
        String b = str.replaceAll("\\p{Blank}", "");
        String z = str.replaceAll("\\p{Z}", "");

        // \\s removed all white spaces
        System.out.println("s [" + s + "]\n");

        // \\p{Space} removed all white spaces
        System.out.println("p [" + p + "]\n");

        // \\p{Blank} removed only tabs and spaces, not \n \f \r or the vertical tab
        System.out.println("b [" + b + "]\n");

        // \\p{Z} removed only spaces, not \t \n \f \r or the vertical tab
        System.out.println("z [" + z + "]\n");

        // NOTE: \p{Separator} throws a PatternSyntaxException
        try {
            String t = str.replaceAll("\\p{Separator}", "");
            System.out.println("t [" + t + "]\n"); // not reached
        } catch (PatternSyntaxException e) {
            System.out.println("throws " + e.getClass().getName() +
                    " with message\n" + e.getMessage());
        }

    } // public static void main
}

The output for this is:

s [word1word2]

p [word1word2]

b [word1


word2]

z [word1    


word2]

throws java.util.regex.PatternSyntaxException with message
Unknown character property name {Separator} near index 12
\p{Separator}
            ^

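This shows that in Java \\p{Z} removed only the space characters from the test string and left the tab, line feed, form feed, carriage return and vertical tab in place, so it does not match "any kind of whitespace or invisible separator".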

These results also show that in Java \\p{Separator} throws a PatternSyntaxException.

Privilege answered 7/11, 2016 at 3:19 Comment(2)
\\s does not match '\u00A0' (the no-break space character). - Skantze
\n, \r and \f are in category Cc (control characters), which is not included in Z, so they are not replaced. - Stalinism
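
The two comments above can be checked with a short sketch. The class name, the helper methods and the combined character class [\s\p{Z}] are my own additions for illustration, and the results assume Java's default matching mode (without the UNICODE_CHARACTER_CLASS flag):

```java
public class WhitespaceCoverageDemo {

    // True if the single character c is matched by the given regex.
    public static boolean matches(String regex, char c) {
        return String.valueOf(c).matches(regex);
    }

    // One possible way to cover both sets: ASCII whitespace plus Unicode separators.
    public static String stripAll(String s) {
        return s.replaceAll("[\\s\\p{Z}]", "");
    }

    public static void main(String[] args) {
        char nbsp = '\u00A0';
        char newline = '\n';

        System.out.println(matches("\\s", nbsp));        // false
        System.out.println(matches("\\p{Z}", nbsp));     // true
        System.out.println(matches("\\s", newline));     // true
        System.out.println(matches("\\p{Z}", newline));  // false

        // The line feed is a control character (Cc), not a separator.
        System.out.println(Character.getType(newline) == Character.CONTROL); // true

        System.out.println(stripAll("a\u00A0b\nc \td")); // abcd
    }
}
```

So \s and \p{Z} each miss characters the other one catches, and a combined class is one way to handle both if that is what the code needs.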
