How would you use a regular expression to ignore strings that contain a specific substring?

How would I go about using a negative lookbehind (or any other method) in a regular expression to ignore strings that contain a specific substring?

I've read two previous Stack Overflow questions:
java-regexp-for-file-filtering
regex-to-match-against-something-that-is-not-a-specific-substring

They are nearly what I want. My problem is that the string doesn't end with the text I want to ignore; if it did, this would not be a problem.

I have a feeling this has to do with the fact that lookarounds are zero-width and something is matching on the second pass through the string, but I'm not too sure of the internals.

Anyway, if anyone is willing to take the time to explain it, I would greatly appreciate it.

Here is an example of an input string that I want to ignore:

192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] "GET /FOO/BAR/ HTTP/1.1" 200 2246

Here is an example of an input string that I want to keep for further evaluation:

192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] "GET /FOO/BAR/content.js HTTP/1.1" 200 2246

The key for me is that I want to ignore any HTTP GET that is going after a document root default page.

Following is my little test harness and the best RegEx I've come up with so far.

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Class name is illustrative; the original post showed only main().
public class RegexTest {
    public static void main(String[] args) {
        String inString = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/1.1\" 200 2246";
        //String inString = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246";
        //String inString = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/"; // This works
        //String inString = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/"; // This works
        String inRegEx = "^.*(?:GET).*$(?<!.?/ HTTP/)";
        try {
            Pattern pattern = Pattern.compile(inRegEx);
            Matcher matcher = pattern.matcher(inString);

            if (matcher.find()) {
                System.out.printf("I found the text \"%s\" starting at "
                        + "index %d and ending at index %d.%n",
                        matcher.group(), matcher.start(), matcher.end());
            } else {
                System.out.printf("No match found.%n");
            }
        } catch (PatternSyntaxException pse) {
            System.out.println("Invalid RegEx: " + inRegEx);
            pse.printStackTrace();
        }
    }
}
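
A note on why the pattern above finds a match in both full log lines: the `$` pins the lookbehind to the very end of the input, so `(?<!.?/ HTTP/)` only inspects the characters just before the end (` 200 2246` here) and never looks at the request path. The sketch below reruns the question's pattern, unchanged, on the four test strings. The class and method names are illustrative and not part of the original post.

```java
import java.util.regex.Pattern;

class EndAnchoredLookbehind {
    // The pattern from the question, unchanged.
    static final Pattern ATTEMPT = Pattern.compile("^.*(?:GET).*$(?<!.?/ HTTP/)");

    // True when the question's pattern finds a match in the line.
    static boolean found(String line) {
        return ATTEMPT.matcher(line).find();
    }

    public static void main(String[] args) {
        String prefix = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/";

        // Truncated inputs: the text right before the end of input is the
        // request itself, so the lookbehind sees what it was meant to see.
        System.out.println(found(prefix + " HTTP/"));           // false
        System.out.println(found(prefix + "content.js HTTP/")); // true

        // Full log lines: the text right before the end of input is " 200 2246",
        // so the negative lookbehind succeeds for both lines.
        System.out.println(found(prefix + " HTTP/1.1\" 200 2246"));           // true
        System.out.println(found(prefix + "content.js HTTP/1.1\" 200 2246")); // true
    }
}
```

This is consistent with the "This works" comments in the harness: the check behaves as intended only when the line ends exactly at `HTTP/`.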
Angloirish answered 9/2, 2009 at 22:55 Comment(2)
So you're only interested in requests that explicitly name a "file" (e.g. /path/to/file.txt) and not ones that point at a "directory" (e.g. /path/to/)? Is the only requirement that the requested URI end with some "extension" (.js in your example)? - Giga
Correct on the first question. I only want "files" and not "directories." The file name and extension don't matter; I just want to ignore requests to the document root. - Angloirish
4

Could you just match any path that doesn't end with a /?

String inRegEx = "^.* \"GET (.*[^/]) HTTP/.*$";

This can also be done using a negative lookbehind:

String inRegEx = "^.* \"GET (.+)(?<!/) HTTP/.*$";

Here, (?<!/) asserts that the character immediately before this position (the end of the captured path) is not a /.
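
As a quick check, here is a minimal sketch that runs both patterns from this answer against the two log lines in the question. The class, constant, and method names are made up for illustration; only the two regular expressions come from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileRequestFilter {
    // Character-class variant: the path must end in something other than '/'.
    static final Pattern CHAR_CLASS = Pattern.compile("^.* \"GET (.*[^/]) HTTP/.*$");
    // Lookbehind variant: the position right after the path must not be preceded by '/'.
    static final Pattern LOOKBEHIND = Pattern.compile("^.* \"GET (.+)(?<!/) HTTP/.*$");

    // Returns the requested path when the line is a "file" request, otherwise null.
    static String requestedFile(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String dir = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/1.1\" 200 2246";
        String file = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246";

        System.out.println(requestedFile(CHAR_CLASS, dir));  // null
        System.out.println(requestedFile(CHAR_CLASS, file)); // /FOO/BAR/content.js
        System.out.println(requestedFile(LOOKBEHIND, dir));  // null
        System.out.println(requestedFile(LOOKBEHIND, file)); // /FOO/BAR/content.js
    }
}
```

Both variants anchor the check to the text directly in front of ` HTTP/`, which is why they work even though the log line continues after the request.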

Dafodil answered 9/2, 2009 at 23:1 Comment(1)
Thank you Zack. This works perfectly and I'm sure it is a much better performer than doing a lookaround. Now, for my own edification, is it possible to do this with a lookaround and the Java regex engine? - Angloirish
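
On the lookaround question in the comment above: one common idiom for "ignore any string that contains X" is a negative lookahead anchored at the start of the line, which scans the whole line for the unwanted substring before anything else is matched. The following is only a sketch of that idiom applied to the question's log lines. The pattern and all names in it are illustrative assumptions and are not taken from any answer in this thread.

```java
import java.util.regex.Pattern;

class SubstringExcluder {
    // (?!.*/ HTTP/) fails at position 0 if "/ HTTP/" occurs anywhere in the line;
    // the rest of the pattern then requires a GET somewhere in the line.
    static final Pattern NO_DIRECTORY_GET = Pattern.compile("^(?!.*/ HTTP/).*GET.*$");

    // True when the line should be kept for further evaluation.
    static boolean keep(String line) {
        return NO_DIRECTORY_GET.matcher(line).matches();
    }

    public static void main(String[] args) {
        String dir = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/1.1\" 200 2246";
        String file = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246";

        System.out.println(keep(dir));  // false
        System.out.println(keep(file)); // true
    }
}
```

Unlike a lookbehind placed after `$`, the lookahead is evaluated at the start of the line and can see the entire input, so it does not matter what follows the request.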
1

Maybe I'm missing something here, but couldn't you just go without any regular expression and ignore anything for which this is true:

string.contains("/ HTTP")

Because a file path will never end with a slash.
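
A minimal sketch of that idea applied to a batch of log lines, assuming the substring `/ HTTP` shows up only where a request path ends in a slash. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ContainsFilter {
    // True when the request path ends with '/', i.e. a "directory" request.
    static boolean isDirectoryRequest(String line) {
        return line.contains("/ HTTP");
    }

    // Keeps only the lines that are not directory requests.
    static List<String> keepFileRequests(List<String> lines) {
        return lines.stream()
                .filter(line -> !isDirectoryRequest(line))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/1.1\" 200 2246",
                "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246");

        // Prints only the content.js line.
        keepFileRequests(lines).forEach(System.out::println);
    }
}
```

Note that this check does not look at the request method, so it would also drop non-GET requests whose path ends in a slash.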

Lithophyte answered 9/2, 2009 at 23:42 Comment(0)
0

I would use something like this:

"\"GET /FOO/BAR/[^ ]+ HTTP/1\\.[01]\""

This matches every path that’s not just /FOO/BAR/.
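
A short sketch of this pattern in use, with illustrative class and method names. Inside a Java string literal the dot has to be written as `\\.`, and because the pattern is not anchored it is applied with `find()` rather than `matches()`.

```java
import java.util.regex.Pattern;

class FooBarFilter {
    // Requires at least one non-space character after /FOO/BAR/.
    static final Pattern UNDER_FOO_BAR =
            Pattern.compile("\"GET /FOO/BAR/[^ ]+ HTTP/1\\.[01]\"");

    // True when the line requests something below /FOO/BAR/.
    static boolean requestsSomethingBelow(String line) {
        return UNDER_FOO_BAR.matcher(line).find();
    }

    public static void main(String[] args) {
        String prefix = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/";

        System.out.println(requestsSomethingBelow(prefix + " HTTP/1.1\" 200 2246"));           // false
        System.out.println(requestsSomethingBelow(prefix + "content.js HTTP/1.1\" 200 2246")); // true
        // Caveat: a subdirectory request also matches, since "sub/" satisfies [^ ]+.
        System.out.println(requestsSomethingBelow(prefix + "sub/ HTTP/1.1\" 200 2246"));       // true
    }
}
```

The pattern is tied to the /FOO/BAR/ prefix, so it fits best when only that one directory needs to be filtered.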

Previous answered 9/2, 2009 at 23:6 Comment(0)
-1

If you are writing Regex this complex, I would recommend building a library of resources outside of StackOverflow.

Teagan answered 9/2, 2009 at 23:17 Comment(2)
Thank you for the great recommendations. Oddly, I have Friedl's book and Habibi's book, and I am just too ignorant to grasp negative lookbehind (and lookarounds in general) from reading about the topics. Generally I get almost everything from those two sources, but this one has me perplexed! - Angloirish
While a helpful comment, it's not really an answer to the question. Recommend changing it to a comment. - Menides
