Java regex for any symbol?
Is there a regex which accepts any symbol?

EDIT: To clarify what I'm looking for: I want to build a regex which will accept ANY number of whitespace characters, and then it must contain at least 1 symbol (e.g. , . " ' $ £ etc.) or (not exclusive or) at least 1 character.

Interrogate answered 3/12, 2010 at 12:48 Comment(7)
Please define "Symbol" - is it any char including whitespaces? Or anything but whitespaces...Malvoisie
@Ulkmum: See my answer: you are including things that Java has trouble with, because they’re in its native character set instead of the legacy character set. If you have to deal with any of these: !"#$%&'()*+,-./:;<=>?@[\]^_ˋ{|}~¡¢£¤¥¦§¨©«¬®¯°±´¶·¸»¿×÷˂˃˄˅˘˙˚˜˝϶҂՚׀׃׆׳״‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‹›‼‽‾‿⁀ then you must use my fancier formulations.Entryway
Uhm, correct me if I'm wrong, but all of those characters are included in the \S class, no?Nixon
@Ulkmun: I’m afraid the selected answer is wrong. I can make it fail on simple data very easily. :(Entryway
@aioobe: In Java — but not in Perl — the pattern ^\s*\S+$ “succeeds” against "\t\n   ". I find that counterintuitive to the point of being wrong: obviously it should fail, not succeed. Nothing but the casuistry of a language-lawyer paid off by the Evil Empire could make anyone believe otherwise. It is simply nuts!Entryway
@tchrist: I'm not sure I follow you. "\t\n " does not match ^\s*\S+$. \S+ says that there must be at least one non-whitespace character, and there are none. Check this ideone.com demo.Nixon
Wrong, check this demo: String sample = "\t\n "; String regex = "\\s*\\S+$"; stdout.printf("String '%s' %s pattern /%s/\n", sample, sample.matches(regex) ? "MATCHES" : "FAILS TO MATCH", regex); that prints this out (with the newline gobbled by SO): String '  ' MATCHES pattern /^\s*\S+$/. Do you understand why? I think you may become upset with me if I have to tell you instead of your figuring it out for yourself. ☹ This is a real-world problem I stumbled upon in my job doing biomedical text-mining. It really sucks!Entryway

Yes. The dot (.) will match any symbol, at least if you use it together with the Pattern.DOTALL flag (otherwise it won't match newline characters). From the docs:

In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
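As a small illustration of the quoted behaviour, here is a sketch comparing `.+` with and without `Pattern.DOTALL`. The class and method names are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows how Pattern.DOTALL changes what '.' matches.
class DotAllDemo {
    // Default mode: '.' does not match line terminators.
    static final Pattern DEFAULT_DOT = Pattern.compile(".+");
    // DOTALL mode: '.' matches any character, including line terminators.
    static final Pattern DOTALL_DOT = Pattern.compile(".+", Pattern.DOTALL);

    static boolean matchesDefault(String s) {
        return DEFAULT_DOT.matcher(s).matches();
    }

    static boolean matchesDotAll(String s) {
        return DOTALL_DOT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String twoLines = "a$\n\u00A3b"; // contains a newline and a pound sign
        System.out.println(matchesDefault(twoLines)); // false
        System.out.println(matchesDotAll(twoLines));  // true
    }
}
```

The same effect is available inline by prefixing the pattern with `(?s)`.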


Regarding your edit:

I want to build a regex which will accept ANY number of whitespace characters, and then it must contain at least 1 symbol (e.g. , . " ' $ £ etc.) or (not exclusive or) at least 1 character.

Here is a suggestion:

\s*\S+
  • \s* any number of whitespace characters
  • \S+ one or more ("at least one") non-whitespace character.
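A quick way to try the suggestion is sketched below. It assumes whole-string matching with `matches()`, which means `\s*\S+` rejects input with whitespace after the first visible character. The second pattern, `(?s)\s*\S.*`, is the variant proposed in the comments under this answer. Class and method names are invented for the example, and the inputs are ASCII only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the two patterns discussed in this answer.
class WhitespaceThenSymbol {
    // Leading whitespace, then one run of non-whitespace and nothing else.
    static final Pattern STRICT = Pattern.compile("\\s*\\S+");
    // Leading whitespace, one non-whitespace character, then anything at all.
    static final Pattern RELAXED = Pattern.compile("(?s)\\s*\\S.*");

    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    static boolean relaxed(String s) {
        return RELAXED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(strict("   $"));     // true
        System.out.println(strict("   "));      // false: no visible character
        System.out.println(strict("  a b"));    // false: whitespace after 'a'
        System.out.println(relaxed("  a b"));   // true
        System.out.println(relaxed(" \t\n"));   // false: whitespace only
    }
}
```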
Nixon answered 3/12, 2010 at 12:48 Comment(13)
Right, so a regex that would accept strings which contain any number of whitespaces and AT LEAST 1 word and any number of symbols would be... \\s*\\p{Alnum}[\\p{Alnum}\\s]* ... where does the dot go?Interrogate
Strictly speaking LF and CR are control codes not symbols but you're still correct in that . won't match every possible character value.Fratricide
Aren't we confusing "symbol" with "character"? I interpreted "symbol" in the question as "non-alphanumeric character".Douche
I suppose you could change [\\p{Alnum}\\s]* into .* instead.Nixon
Generally when you ask for help with regular expressions, it helps a lot if you provide a few examples of strings that should match, and a few examples of strings that should not match.Nixon
ah, well change it to \s*\S.* then. Then you're actually quite close to what I suggested previously, change [\p{Alnum}\s]* into .*: you would then get \s*\p{Alnum}.*.Nixon
I see a non-ASCII character in your example. You therefore must use my solution. Sorry!Entryway
WARNING: Java’s \s fails to match things like U+A0, NO-BREAK SPACE or &nbsp;. Java’s \p{Punct} fails to match things like the £ used in the OP’s example. Java’s \S fails to match things like U+85, NEXT LINE (NEL). And Java’s \b\w+\b fails to match the string "élève" anywhere whatsoever. Java’s regex character classes are completely broken. You cannot use them. You have to use the formulations I describe in my answer. I deeply regret this, but it is true, and regret will not change that.Entryway
@tchrist, IMHO, I believe you're a bit picky. Besides, \w is clearly documented not to match é, right? Also, there is no need to "shout" using bold caps...Nixon
@Ulkmun, I see your concern. Try this pattern: (?s)\s*\S.* (or construct your pattern using the Pattern.DOTALL flag).Nixon
@aioobe: There is a need to shout when day in and day out, you see people making the same mistakes over and over again. The Java charclass shortcuts and the POSIX character classes in Java work only on legacy data. They do not even work with Java’s own native character set! This is a very serious issue, one I feel people need to be fully informed of. The user in this case mentioned non-legacy data, and you are all giving him legacy-only solutions. If I need to shout to get this gross oversight noticed, then I shall certainly do so.Entryway
I work for a biomedical text-mining group at a public university. Well under 1% of text we process falls in the legacy category. Our code is all in Java and Perl. Because Perl regexes handle modern data transparently, but Java’s do not, a great deal of effort must be made just to get Java regexes to work with Java characters! It is an important issue, one that everyone doing regexes in Java needs to understand. Do you understand why Java is incapable of matching the string "élève" with the pattern \b\w+\b anywhere at all, not just the whole thing? Few do, and we few fear it.Entryway
@aioobe: “Picky?” Is it picky to expect "élève" to have at least one match using \b\w+\b? What’s so picky about that? It’s not picky. Compile that pattern. Match it against that string. Try both matches() for the whole thing and find() for anywhere. Nothing. Niente. Nada. Do you understand why? If you do not, you should not be advocating these legacy-only solutions. If you do, then could you please explain to me the contorted justification under which this insane situation makes the least scintilla of sense?Entryway
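For readers who want to reproduce the disagreement in these comments, here is a hedged sketch. It assumes the disputed whitespace sample contained a non-ASCII space such as U+00A0, which the thread does not show explicitly. It also uses `Pattern.UNICODE_CHARACTER_CLASS`, a flag that only arrived in Java 7, after this thread was written. Class and method names are hypothetical. Only `matches()` is shown, because `find()` results for `\b\w+\b` without the flag may differ between JDK releases.

```java
import java.util.regex.Pattern;

// Hypothetical demo: default (ASCII-only) shorthand classes vs. the Unicode flag.
class UnicodeClassDemo {
    // Whole-string match of \b\w+\b under the given flags.
    static boolean wholeWord(String s, int flags) {
        return Pattern.compile("\\b\\w+\\b", flags).matcher(s).matches();
    }

    // Whole-string match of \s*\S+ under the given flags.
    static boolean blankThenVisible(String s, int flags) {
        return Pattern.compile("\\s*\\S+", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        String eleve = "\u00E9l\u00E8ve";  // "eleve" with accents, written with escapes
        String blank = "\t\n\u00A0";       // tab, newline, NO-BREAK SPACE (assumed sample)
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;

        System.out.println(wholeWord(eleve, 0));        // false: \w is [a-zA-Z_0-9] by default
        System.out.println(wholeWord(eleve, unicode));  // true
        System.out.println(blankThenVisible(blank, 0));       // true: default \S matches U+00A0
        System.out.println(blankThenVisible(blank, unicode)); // false: \s now covers U+00A0
    }
}
```

The flag can also be switched on inside the pattern with `(?U)`.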

In Java, a symbol is \pS, which is not the same as punctuation characters, which are \pP.

I talk about this issue, plus enumerate the types for all the ASCII punctuation and symbols, here in this answer.
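To make the distinction concrete, here is a minimal sketch that classifies a few of the characters from the question. The class and method names are invented for illustration. `\pS` and `\pP` are the Unicode general categories Symbol and Punctuation, while `\p{Punct}` is the ASCII-only POSIX class (when no Unicode flag is set).

```java
// Hypothetical demo: Unicode symbols (\pS) vs. Unicode punctuation (\pP)
// vs. the POSIX \p{Punct} class.
class SymbolVsPunctuation {
    static boolean isSymbol(String ch) {
        return ch.matches("\\pS");
    }

    static boolean isPunctuation(String ch) {
        return ch.matches("\\pP");
    }

    static boolean isPosixPunct(String ch) {
        return ch.matches("\\p{Punct}");
    }

    public static void main(String[] args) {
        System.out.println(isSymbol("$"));           // true: currency symbol
        System.out.println(isSymbol("\u00A3"));      // true: pound sign is a currency symbol
        System.out.println(isSymbol(","));           // false
        System.out.println(isPunctuation(","));      // true
        System.out.println(isPunctuation("$"));      // false
        System.out.println(isPosixPunct("$"));       // true: ASCII punctuation
        System.out.println(isPosixPunct("\u00A3"));  // false: not ASCII
    }
}
```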

Patterns like [\p{Alnum}\s] only work on legacy data from the 1960s. To work with Java's native character set, you need something on the order of

String identifier_charclass = "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";
String whitespace_charclass = "[\\u0009\\u000A\\u000B\\u000C\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2007\\u2008\\u2009\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";

String ident_or_white = "[" + identifier_charclass + whitespace_charclass + "]";
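A rough sketch of how such hand-built classes might be used follows. The class name, constants and method are hypothetical, and the whitespace list here includes U+0009 (TAB) explicitly. The test strings are borrowed from the comments on this answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo: explicit Unicode-aware classes in place of [\p{Alnum}\s].
class ExplicitClasses {
    static final String IDENTIFIER =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";
    static final String WHITESPACE =
        "[\\u0009\\u000A\\u000B\\u000C\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
        + "\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005\\u2006\\u2007\\u2008\\u2009"
        + "\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";
    // Nested classes inside [...] form a union in Java regex syntax.
    static final Pattern IDENT_OR_WHITE =
        Pattern.compile("[" + IDENTIFIER + WHITESPACE + "]+");

    // True when the whole string consists of identifier or whitespace characters.
    static boolean onlyIdentOrWhite(String s) {
        return IDENT_OR_WHITE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(onlyIdentOrWhite("the Moli\u00E8re exhibition")); // true
        System.out.println(onlyIdentOrWhite("this\u00A0and that"));          // true
        System.out.println(onlyIdentOrWhite("\u00A320"));                    // false: pound sign is a symbol
        // The legacy class rejects the accented letter:
        System.out.println("the Moli\u00E8re exhibition".matches("[\\p{Alnum}\\s]+")); // false
    }
}
```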

I’m sorry that Java makes it so difficult to work with modern data, but at least it is possible.

Just don’t ask about boundaries or grapheme clusters. For that, see my other postings.

Entryway answered 3/12, 2010 at 13:0 Comment(4)
"Patterns like [\p{Alnum}\s] only work on legacy dataset from the 1960s" -- Uhm, no, I've seen them work on a few newer ones too...Nixon
@aioobe: Nope, you have not: [\p{Alnum}\s]+$ fails on even simple things like £20, on "this and that", and on "the Molière exhibition". Welcome to Java! Are we having fun yet?Entryway
Well, \p{Alnum} is clearly documented to match [a-zA-Z0-9], so I wouldn't say that the behavior is buggy. Heck I would have been surprised if it matched a £.Nixon
Fine: add \p{Punct} then. Despite their disingenuous bait & switch re Unicode, Java’s stuck in the Dark Ages of computing, the 1960s. They have fundamentally misunderstood that \b and \w are and must be ineluctably linked. By severing that linkage they have created asinine Catch-22s in their language that confuse, confound, and consternate anyone trying to use them. You have 3 choices: [1] Don’t use Java regexes [2] Painstakingly rewrite all Java regexes by hand following the guidelines I have here and elsewhere set forth [3] Use my alpha rewrite code now, beta and production later.Entryway
