Java regex to exclude specific strings from a larger one
B

3

5

I have been banging my head against this for some time now: I want to capture all [a-z]+[0-9]? character sequences, excluding strings such as sin, cos, tan, etc. Having done my regex homework, I believe the following regex should work:

(?:(?!(sin|cos|tan)))\b[a-z]+[0-9]?

As you can see, I am using a negative lookahead with alternation. The \b after the non-capturing group's closing parenthesis is critical to avoid matching the "in" of "sin", etc. The regex makes sense to me, and I have tried it in RegexBuddy with Java as the target implementation and got the result I wanted, but it doesn't work with Java's Matcher and Pattern objects! Any thoughts?

cheers

Beffrey answered 3/2, 2010 at 10:21 Comment(6)
Note: I don't think you need ?: when you use ?!. - Briarwood
The ?: is there so that the groups are not captured for backreferences; it's for performance and shouldn't cause trouble. But I have tried without it, to no avail. - Beffrey
If you posted some sample inputs and what you expect from the output in each case, I think more people would be in a position to help. - Tehuantepec
@nvrs: regarding the ?: - zero-width assertions are not captured by default. As far as the regex engine is concerned, (?:(?!(sin|cos|tan))) is a complex way of saying (?!sin|cos|tan). - Denyse
@ninesided: You are right. I am actually trying to parse a mathematical equation and extract the variables. A variable can be any string of characters [a-z] followed by an optional single digit, e.g. x1 + yvar2. However, I want to exclude some strings such as log, sin, etc., since they are bound to functions implemented by my lib. - Beffrey
If something works in RegexBuddy but not in your actual application, the most likely cause is that you're not doing the same thing in RegexBuddy as in your actual application. In such cases it is very helpful if you post both the regex you're using in RegexBuddy and the code you're using in your application (Java code, in this case). - Landlord
K
6

The \b is in the wrong place. It would be looking for a word boundary that didn't have sin/cos/tan before it. But a boundary just after any of those would have a letter at the end, so it would have to be an end-of-word boundary, which it can't be if the next character is a-z.

Also, the negative lookahead would (if it worked) exclude strings like cost, which I'm not sure you want if you're just filtering out keywords.

I suggest:

\b(?!sin\b|cos\b|tan\b)[a-z]+[0-9]?\b

Or, more simply, you could just match \b[a-z]+[0-9]?\b and filter out the strings in the keyword list afterwards. You don't always have to do everything in regex.
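
Both suggestions can be sketched in Java roughly as follows. This is only an illustration, not code from the thread: the class name `VariableExtractor`, the method names, and the sample expression are made up for the example, and `Set.of` assumes Java 9 or later. Note the doubled backslashes that a Java string literal needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VariableExtractor {
    // Regex-only approach: the lookahead rejects the whole words sin, cos and tan.
    private static final Pattern WITH_LOOKAHEAD =
            Pattern.compile("\\b(?!sin\\b|cos\\b|tan\\b)[a-z]+[0-9]?\\b");

    // Match-then-filter approach: a plain identifier pattern plus a keyword set.
    private static final Pattern IDENTIFIER = Pattern.compile("\\b[a-z]+[0-9]?\\b");
    private static final Set<String> KEYWORDS = Set.of("sin", "cos", "tan");

    public static List<String> extractWithLookahead(String expr) {
        List<String> names = new ArrayList<>();
        Matcher m = WITH_LOOKAHEAD.matcher(expr);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static List<String> extractAndFilter(String expr) {
        List<String> names = new ArrayList<>();
        Matcher m = IDENTIFIER.matcher(expr);
        while (m.find()) {
            // Drop reserved function names after matching.
            if (!KEYWORDS.contains(m.group())) {
                names.add(m.group());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String expr = "x1 + yvar2 * sin(x1) - cost";
        System.out.println(extractWithLookahead(expr)); // [x1, yvar2, x1, cost]
        System.out.println(extractAndFilter(expr));     // [x1, yvar2, x1, cost]
    }
}
```

In both variants `cost` is kept because only the whole words are excluded. The filter variant is probably easier to extend if the keyword list grows.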

Kimon answered 3/2, 2010 at 10:42 Comment(7)
Matches cos1 but it should not (if I understood the requirement correctly). - Denyse
@Tomalak: No, the negative lookahead is meant to match full words, not prefixes. If there were a trig function called cos1, it would be listed as such: (?!(?:sin|cos1?|tan)\b) - Guria
Yeah, the requirements aren't wholly clear, but that was my guess. - Kimon
@bobince: Thanks, you were right about the positioning of \b. Of course, the original regex would have matched most of what I wanted (although not completely correctly according to the requirements I described) if I hadn't forgotten to escape the \b for Java, i.e. \\b. Now I think how ridiculous \\\\ will look when you want to include a literal \ in the regex... - Beffrey
Yeah, backslashes easily get out of hand in nested escaping contexts! It's a pity Java doesn't have the ‘raw strings’ some languages use to get around the problem. (Or regex literals like in JS, though I personally find that a bit ugly.) - Kimon
@nvrs, The problem is solved, then? Have you considered marking this answer "accepted"? It improves on your regex in ways other than the escaping issue you mentioned. - Guria
The difference between the regex in this answer and the regex in the question is not the positioning of the initial word boundary but the addition of the trailing word boundaries. The regexes (?!lookahead)\b and \b(?!lookahead) yield the same matches. Both \b and (?!lookahead) are zero-width, so they're attempted at the same position regardless of their order. - Landlord
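
The point in the last comment about the order of the two zero-width assertions can be checked with a small sketch like the one below. The class name `LookaheadOrder`, the constant names, and the sample input are invented for the illustration; the two patterns are the question's regex (simplified as suggested in the comments on the question) and the same regex with \b moved to the front.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadOrder {
    // Lookahead first, then the word boundary (as in the question).
    public static final String LOOKAHEAD_FIRST = "(?!sin|cos|tan)\\b[a-z]+[0-9]?";
    // Word boundary first, then the lookahead.
    public static final String BOUNDARY_FIRST = "\\b(?!sin|cos|tan)[a-z]+[0-9]?";

    public static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "x1 + sin(y) + cost";
        // Both assertions are tested at the same position, so the order does not matter.
        System.out.println(findAll(LOOKAHEAD_FIRST, input)); // [x1, y]
        System.out.println(findAll(BOUNDARY_FIRST, input));  // [x1, y]
    }
}
```

Both patterns also skip `cost`, which shows the prefix problem mentioned in the answer: without a trailing \b inside the lookahead, any word starting with sin, cos or tan is rejected.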
D
1

So you want [a-z]+[0-9]? (a sequence of at least one letter, optionally followed by a digit), unless that letter sequence resembles one of sin cos tan?

\b(?!(sin|cos|tan)(?=\d|\b))[a-z]+\d?\b

results:

cos   - no match
cosy  - full match
cos1  - no match
cosy1 - full match
bla9  - full match
bla99 - no match
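
The results listed above can be reproduced in Java with a sketch along these lines. The class name `KeywordPrefixCheck` and the method `isVariable` are made up for the example; the pattern is the one from this answer with the backslashes doubled for a Java string literal, and `matches()` is used here so that each token is tested as a whole.

```java
import java.util.regex.Pattern;

public class KeywordPrefixCheck {
    // Rejects sin, cos and tan both as whole words and when followed by a digit.
    private static final Pattern VARIABLE =
            Pattern.compile("\\b(?!(sin|cos|tan)(?=\\d|\\b))[a-z]+\\d?\\b");

    public static boolean isVariable(String token) {
        // matches() requires the pattern to cover the entire token.
        return VARIABLE.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {"cos", "cosy", "cos1", "cosy1", "bla9", "bla99"};
        for (String token : tokens) {
            System.out.println(token + " - " + (isVariable(token) ? "full match" : "no match"));
        }
    }
}
```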
Denyse answered 3/2, 2010 at 10:45 Comment(2)
Hi, thanks for replying, but I still don't get any matches. I see that, based on what I said, you added matches such as cosy etc., which is correct, but using: Pattern p = Pattern.compile("\b(?!(sin|cos|tan)(?=[^a-z]|\b))[a-z]+[0-9]?\b"); Matcher m = f.matcher(stringToMatch); I get no matches at all! - Beffrey
In Java strings, backslashes need to be escaped. I have shown the pure regex. Of course, you need to adapt it to the string escaping rules of your programming language yourself. - Denyse
B
0

I forgot to escape the \b for Java, so \b should be \\b, and it now works. Cheers
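
For anyone hitting the same problem, here is a small sketch of why the unescaped version fails. In a Java string literal, "\b" is the backspace character (U+0008), so the regex engine never sees a word-boundary assertion. The class name `BackslashEscaping` and the sample pattern are invented for this illustration.

```java
import java.util.regex.Pattern;

public class BackslashEscaping {
    // "\b" in a string literal is a single backspace character, not a regex \b.
    public static final String UNESCAPED = "\bcos\b";
    // "\\b" is a backslash followed by 'b', which the regex engine reads as \b.
    public static final String ESCAPED = "\\bcos\\b";

    public static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(UNESCAPED.length());        // 5
        System.out.println(ESCAPED.length());          // 7
        System.out.println(finds(UNESCAPED, "cos x")); // false
        System.out.println(finds(ESCAPED, "cos x"));   // true
    }
}
```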

Beffrey answered 3/2, 2010 at 11:11 Comment(2)
When posting regex questions, it's a good idea to include the regex exactly as it appears in your source code; \bfoo\b looks fine, but "\bfoo\b" is likely to raise questions, even from people who don't speak Java and aren't sure how its string literals work. - Guria
Also, did you try having RegexBuddy generate the Java source code? (That's the "Use" tab, in case you don't know.) I've never liked auto-generated source code, but I sometimes use "Use" to remind myself about the escaping rules for languages I'm not fluent in. - Guria
