Snort/PCRE Regex: odd character class syntax

While I was parsing the Snort regex set I found a very odd character class syntax, like [\x80-t] or [\x01-t\x0B\x0C\x0E-t\x80-t], and I can't figure out (really no clue) what -t means. I don't even know if it's standard PCRE or a sort of Snort extension.

Here are some regular expressions that contain these character classes:

/\x3d\x00\x12\x00..........(.[\x80-t]|...[\x80-t])/smiR
/^To\x3A[^\r\n]+[\x01-t\x0B\x0C\x0E-t\x80-t]/smi

PS: please note that \x80-t is not even a valid range in the standard sense, because the character t is \x74, which is lower than \x80.
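To make the PS concrete, here is a minimal sketch that checks the range against a regex engine. It uses Python's `re` module on byte patterns as a stand-in (an assumption: this is not PCRE itself, although PCRE likewise rejects out-of-order ranges), and the helper name `compiles` is made up for the illustration.

```python
import re

def compiles(pattern: bytes) -> bool:
    """Return True if Python's re module accepts the byte pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# 't' is 0x74, so \x80-t runs from a higher byte value down to a lower one.
print(hex(ord("t")))            # 0x74
print(compiles(rb"[\x80-t]"))   # False: reversed range is rejected
print(compiles(rb"[t-\x80]"))   # True: same endpoints in ascending order
```

So, at least in this engine, the class cannot be read as an ordinary range, which fits the suspicion that -t is either a corruption or some non-standard notation.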

Anthonyanthophore answered 12/12, 2013 at 14:50 Comment(8)
I'm intrigued. Can I ask exactly where you found this? - Maigre
@polkadotcadaver Of course. I was investigating some projects, one of which is netbench. It contains several regular expressions from L7, Bro and Snort under the pattern_match/rules directory. Some of these character classes are in Snort/voip.rules.pcre, others in Snort/exploit.rules.pcre. - Anthonyanthophore
@Anthonyanthophore It's definitely a range. I searched the pcre manual for -t\b but there was no match, which means there's nothing special about -t in pcre. Now there are a few possibilities: 1) The range is just an error by the author. 2) 0x80 is 128 in decimal; if you try &#128; in a browser you get the euro symbol (€), so maybe the program is using some other encoding/character table? - Restless
Does the code where that comes from compile? - Fluorometer
Also, did you copy/paste the regex? (Just to be sure it is t and not τ or another letter that looks close to t.) - Fluorometer
@Hamza I've also looked into the pcre manual and I thought about an error, too. But there are more than two regexes, and it's strange to find so many similar errors. I have found nothing about the encoding/character table, but that doesn't mean it can't be the cause. - Anthonyanthophore
@AngeloNeuschitzer I copied and pasted the regexes from the files mentioned above. - Anthonyanthophore
@Anthonyanthophore Did you ever get around to solving this? - Fluorometer

This could refer to a different character encoding in which t has a code above \x80 and \x80 can't be addressed normally.

Take EBCDIC Scan codes for example (see here for a reference).

(But I too have no clue why somebody would want to write it that way)
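A quick way to check the encoding idea is to compare the byte value of t in ASCII and in an EBCDIC code page. The sketch below assumes cp037 as one representative EBCDIC variant (nothing in the rule files says which code page, if any, was involved).

```python
# Assumption: cp037 stands in for "EBCDIC"; other EBCDIC code pages exist.
ascii_t = "t".encode("ascii")[0]
ebcdic_t = "t".encode("cp037")[0]

print(hex(ascii_t))        # 0x74
print(hex(ebcdic_t))       # 0xa3
print(0x80 <= ebcdic_t)    # True: \x80-t would be an ascending range here
```

In that code page \x80-t would at least be a well-formed range, although that alone does not explain why anyone would write it this way.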

For ASCII I have a wild guess: if -t means "up to the next token minus 1" or, when it is the last item in the class, "up to the end of the allowed characters", the second regex would state this:

To:(not a newline, one or more characters)(not a newline)

So basically the expression [\x01-t\x0B\x0C\x0E-t\x80-t] would mean roughly [^\r\n].
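The guess can be worked through mechanically. The sketch below is purely hypothetical: the token list, the `expand` helper and the 0xFF upper bound are assumptions made to illustrate the "up to the next token minus 1" reading, not anything Snort or PCRE defines.

```python
def expand(tokens, last_byte=0xFF):
    """Expand a class under the hypothetical '-t' rule into a set of byte values."""
    members = set()
    for i, (kind, value) in enumerate(tokens):
        if kind == "single":
            members.add(value)
            continue
        # Hypothetical "-t" rule: run up to just before the next token,
        # or up to the last allowed byte when nothing follows.
        end = tokens[i + 1][1] - 1 if i + 1 < len(tokens) else last_byte
        members.update(range(value, end + 1))
    return members

# [\x01-t\x0B\x0C\x0E-t\x80-t] written as (kind, start byte) pairs
tokens = [("open", 0x01), ("single", 0x0B), ("single", 0x0C),
          ("open", 0x0E), ("open", 0x80)]
covered = expand(tokens)
missing = sorted(set(range(256)) - covered)
print([hex(b) for b in missing])   # ['0x0', '0xd']
```

Under these assumptions the class leaves out only \x00 and \r (\x0D), so it still admits \n (\x0A). That is close to [^\r\n] but not identical to it.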

If one applies that to (.[Ç-t]|...[Ç-t]), it would match any character above 7-bit ASCII, which could also cover all of Unicode (apart from the first 128 characters).

(That being said, I still have no clue why somebody would write it like this, but at least that's a coherent explanation besides "It's a bug".)

Maybe helpful: what do the regexes you posted mean if one writes out the \xYY escapes? In ASCII:

/=\NULL\DEVICE_CONTROL_2\NULL\.{10}\(.[Ç-t]|...[Ç-t])/smiR
/^To\:[^\r\n]+[\START_OF_HEADING-t\VERTICALTAB\FORMFEED\SHIFTOUT-t\Ç-t]/smi

Looking into \x12 (Device Control 2) could help, because it won't show up in text but may well show up in network traffic.

Fluorometer answered 17/12, 2013 at 9:35 Comment(6)
This is an interesting point, but in this case I can't understand the class [\x01-t\x0B\x0C\x0E-t\x80-t], which would have overlapping ranges. - Anthonyanthophore
@Anthonyanthophore Could you post some code that uses that regex? Also, overlapping ranges may not be "good" but they should work. So although it may not "make sense" to the observer, it can be explained by an author who meddles with his regexes "until they work" but does not check them thoroughly afterwards. - Fluorometer
This is true, but really weird. About the code: these regexes should come from the Snort regex set (if the netbench team collected them without mistakes; see one of my first comments on the question for netbench). For now I'm trying to convert them into Java regexes and into a parse tree for other purposes, so I would say that the regexes are the data. - Anthonyanthophore
You have found a strange artifact, no question. I don't know netbench, but if it can be "run", did you try to run it and trigger that expression? I mean, it SHOULD break; maybe it does? - Fluorometer
This is a good piece of advice. I will investigate in this direction for sure. - Anthonyanthophore
@Anthonyanthophore I added a wild guess about what that could mean to my answer, along with some further thoughts on the regexes. - Fluorometer

The second regex matches lines that begin with To: (case-insensitive) followed by at least one character that isn't a line feed or carriage return. Since this is a greedy match, I'd expect \r or \n to be the only possible terminating matches in the [\x01-t\x0B\x0C\x0E-t\x80-t] character class. Note: \r is equivalent to \x0D and \n is equivalent to \x0A. Not sure what -t means but let's pretend it was - instead. Then the character class would be [\x01-\x0B\x0C\x0E-\x80-], which is still a bit convoluted but would make a little bit more sense - i.e. allowing a \n as a terminating character but not \r.
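The "pretend it was -" reading is easy to test. This is a minimal sketch using Python's `re` on byte patterns as a stand-in for PCRE (an assumption), with a made-up helper `in_class`:

```python
import re

# Assumption: every "-t" replaced by "-", as suggested above.
# The trailing "-" before "]" is a literal hyphen.
candidate = re.compile(rb"[\x01-\x0B\x0C\x0E-\x80-]")

def in_class(byte_value: int) -> bool:
    """Return True if the single byte is matched by the candidate class."""
    return candidate.fullmatch(bytes([byte_value])) is not None

print(in_class(0x0A))   # True: line feed is inside \x01-\x0B
print(in_class(0x0D))   # False: carriage return falls in the gap
```

So with this substitution the class does accept \n and reject \r, as described.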

This is a very long shot but is there any chance this could be some kind of search-and-replace gone wrong?! (Guess this can probably be quickly discounted if there are other regexes that have normal ranges without the t.)

Cascabel answered 19/12, 2013 at 0:51 Comment(0)
