In Java, a symbol is \pS, which is not the same as punctuation characters, which are \pP.
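To make the split concrete, here is a small sketch that sorts the printable ASCII range into the two properties. The class and method names are made up for illustration; only `\pS` and `\pP` come from the text above.

```java
import java.util.regex.Pattern;

// Hypothetical helper: lists which printable ASCII characters Java's regex
// engine files under \pS (symbols) versus \pP (punctuation).
class SymbolVsPunct {
    private static final Pattern SYMBOL = Pattern.compile("\\pS");
    private static final Pattern PUNCT = Pattern.compile("\\pP");

    // Collects every character from '!' through '~' that the pattern matches.
    static String asciiMatching(Pattern p) {
        StringBuilder sb = new StringBuilder();
        for (char c = 0x21; c <= 0x7E; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    static String asciiSymbols() {
        return asciiMatching(SYMBOL);
    }

    static String asciiPunctuation() {
        return asciiMatching(PUNCT);
    }

    public static void main(String[] args) {
        System.out.println("\\pS: " + asciiSymbols());
        System.out.println("\\pP: " + asciiPunctuation());
    }
}
```

Nine ASCII characters, such as `$`, `+` and `~`, come out as symbols; the other twenty-three are punctuation, so a pattern that uses only one of the two properties silently misses the rest.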
I talk about this issue, plus enumerate the types for all the ASCII punctuation and symbols, here in this answer.
Patterns like [\p{Alnum}\s] only work on legacy datasets from the 1960s. To work with Java's native character set, you need something on the order of:
identifier_charclass = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";
whitespace_charclass = "[\\u000A\\u000B\\u000C\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2007\\u2008\\u2009\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";
ident_or_white = "[" + identifier_charclass + whitespace_charclass + "]";
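For anyone who wants to try these out, here is a minimal runnable sketch wrapping the three strings above. The class name, the `onlyIdentOrWhite` and `legacyOnly` helpers, and the sample strings are my own inventions for illustration; the character classes themselves are copied from above.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the three character classes shown above.
class IdentOrWhite {
    static final String IDENTIFIER_CHARCLASS =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";
    static final String WHITESPACE_CHARCLASS =
        "[\\u000A\\u000B\\u000C\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
        + "\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2007\\u2008"
        + "\\u2009\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";
    // Nesting both classes inside one bracket pair gives their union.
    static final String IDENT_OR_WHITE =
        "[" + IDENTIFIER_CHARCLASS + WHITESPACE_CHARCLASS + "]";

    private static final Pattern FANCY = Pattern.compile(IDENT_OR_WHITE + "+");
    private static final Pattern LEGACY = Pattern.compile("[\\p{Alnum}\\s]+");

    // True when the whole string consists of identifier or whitespace characters.
    static boolean onlyIdentOrWhite(String s) {
        return FANCY.matcher(s).matches();
    }

    // Same test with the ASCII-only pattern, for comparison.
    static boolean legacyOnly(String s) {
        return LEGACY.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Accented letters, an ordinary space and a no-break space (sample data, assumed).
        String sample = "na\u00EFve caf\u00E9\u00A0x_1";
        System.out.println("fancy:  " + onlyIdentOrWhite(sample));
        System.out.println("legacy: " + legacyOnly(sample));
        // A circled capital A is in Enclosed Alphanumerics and has category So.
        System.out.println("circled A: " + onlyIdentOrWhite("\u24B6"));
        // A circled digit one is in the same block but has category No, so it is rejected.
        System.out.println("circled 1: " + onlyIdentOrWhite("\u2460"));
    }
}
```

The two circled characters show what the intersection is for: it admits the enclosed letters without dragging in every other So symbol.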
I’m sorry that Java makes it so difficult to work with modern datasets, but at least it is possible.
Just don’t ask about boundaries or grapheme clusters. For that, see my other postings.
!"#$%&'()*+,-./:;<=>?@[\]^_ˋ{|}~¡¢£¤¥¦§¨©«¬®¯°±´¶·¸»¿×÷˂˃˄˅˘˙˚˜˝϶҂՚׀׃׆׳״‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‹›‼‽‾‿⁀
then you must use my fancier formulations. – Entryway
\S class, no? – Nixon
^\s*\S+$ “succeeds” against "\t\n ". I find that counterintuitive to the point of being wrong: obviously it should fail, not succeed. Nothing but the casuistry of a language-lawyer paid off by the Evil Empire could make anyone believe otherwise. It is simply nuts! – Entryway
"\t\n " does not match ^\s*\S+$. \S+ says that there must be at least one non-whitespace character, and there are none. Check this ideone.com demo. – Nixon
String sample = "\t\n "; String regex = "^\\s*\\S+$"; stdout.printf("String '%s' %s pattern /%s/\n", sample, sample.matches(regex) ? "MATCHES" : "FAILS TO MATCH", regex); prints this out (with the newline gobbled by SO): String ' ' MATCHES pattern /^\s*\S+$/. Do you understand why? I think you may become upset with me if I have to tell you instead of your figuring it out for yourself. ☹ This is a real-world problem I stumbled upon in my job doing biomedical text-mining. It really sucks! – Entryway
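One plausible reading of the disagreement, which the thread itself never confirms: by default Java's \s covers only the six ASCII whitespace characters, so a string that looks blank but ends in, say, U+00A0 NO-BREAK SPACE still satisfies \S+. The sketch below assumes that is what the sample contained; the class name and the NBSP sample are my guesses, not taken from the comments. `Pattern.UNICODE_CHARACTER_CLASS` exists from Java 7 onward.

```java
import java.util.regex.Pattern;

// Hypothetical demo: the same regex with and without Unicode-aware \s.
class BlankLooksNonBlank {
    // Default mode: \s is space, tab, newline, vertical tab, form feed, carriage return.
    private static final Pattern ASCII_WS = Pattern.compile("^\\s*\\S+$");
    // With the flag, \s and \S follow the Unicode White_Space property.
    private static final Pattern UNICODE_WS =
        Pattern.compile("^\\s*\\S+$", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean matchesDefault(String s) {
        return ASCII_WS.matcher(s).matches();
    }

    static boolean matchesUnicode(String s) {
        return UNICODE_WS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String asciiOnly = "\t\n ";     // ends in an ordinary space
        String withNbsp = "\t\n\u00A0"; // ends in a no-break space (assumed sample)

        System.out.println(matchesDefault(asciiOnly)); // false
        System.out.println(matchesDefault(withNbsp));  // true
        System.out.println(matchesUnicode(asciiOnly)); // false
        System.out.println(matchesUnicode(withNbsp));  // false
    }
}
```

Under that assumption both commenters are right about the string each of them actually tested: the pure-ASCII version fails to match, while the one with the no-break space matches until \s is made Unicode-aware or is replaced by an explicit whitespace class.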