POSIX regular expression non-greedy

Is there a way to use a non-greedy regular expression in C like one can use in Perl? I tried several things, but it's actually not working.

I'm currently using this regex that matches an IP address and the corresponding HTTP request, but it's greedy although I'm using the *?:

([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1

In this example, it always matches the whole string:

#include <regex.h>
#include <stdio.h>

int main() {

    int a, i;
    regex_t re;
    regmatch_t pm;
    char *mpages = "TEST 127.0.0.1 GET /test.php HTTP/1.1\" 404 525 \"-\" \"Mozilla/5.0 (Windows NT  HTTP/1.1 TEST";

    a = regcomp(&re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1", REG_EXTENDED);

    if(a!=0) {
        printf(" -> Error: Invalid Regex\n");
        return 1;
    }

    /* eflags accepts REG_NOTBOL/REG_NOTEOL, not REG_EXTENDED, so pass 0 */
    a = regexec(&re, &mpages[0], 1, &pm, 0);

    if(a==0) {

        for(i = pm.rm_so; i < pm.rm_eo; i++)
            printf("%c", mpages[i]);
        printf("\n");
    }
    return 0;
}

$ ./regtest

127.0.0.1 GET /test.php HTTP/1.1" 404 525 "-" "Mozilla/5.0 (Windows NT HTTP/1.1

Resolutive answered 27/11, 2013 at 10:26 Comment(7)
Can you add your input string to the question? It seems to work for me. – Shamrock
I don't know C so can't advise, but the problem is in your code, not your regex. If you add more to the end of your input string it'll probably become apparent that it's not matching to the second HTTP/1.1 but rather returning the entire input string. – Shamrock
You may use a more accurate IP matching. Check this answer: https://mcmap.net/q/18841/-regular-expression-to-match-dns-hostname-or-ip-address – Niobous
Don't use magic values. What does 1 mean when you call regcomp? – Niobous
I used more accurate IP matching: same results. I also added content to the start and the end of the string: same results. – Resolutive
REG_EXTENDED means "Use Extended Regular Expressions"; that should be okay. – Resolutive
It's better to use grep for testing purposes; that way you can refactor your question to reach a broader "audience". – Regeniaregensburg
T
9

No, there are no non-greedy quantifiers in POSIX regular expressions. But there is a library that provides perl-like regular expressions for C: http://www.pcre.org/

Titanothere answered 27/11, 2013 at 11:46 Comment(0)
R
0

As I said earlier in a comment, use grep -E to test POSIX regexes; that will shorten development time. Either way, it seems your problem is with the regular expression itself rather than with the missing feature.

I'm not quite clear on what you want to grab from the request. Supposing you just want the IP address, the HTTP verb and the resource, one could end up with the following regex:

regcomp(&re, "\\b(.?[0-9])+\\s+(GET|POST|PUT)\\s+([^ ]+)", REG_EXTENDED);

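Be aware that several assumptions have been made. For example, this regex assumes the IP address is well formed and that the request uses one of the HTTP verbs GET, POST or PUT. Note also that \b and \s (written \\b and \\s in the C string) are not part of POSIX extended regular expressions; they are extensions that some implementations, such as glibc, accept. Edit according to your needs.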

Regeniaregensburg answered 27/11, 2013 at 12:32 Comment(0)
L
0

The brute-force method of getting a regex to match up to the next occurrence of a word is:

"([^H]|H[^T]|HT[^T]|HTT[^P]|HTTP[^/]|HTTP/[^1]|HTTP/1[^.]|HTTP/1\\.[^1])*HTTP/1\\.1"

unless you can get smarter about your match -- which you can: HTTP requests are

Request-Line   = Method SP Request-URI SP HTTP-Version CRLF

and none of the nonterminals on the right match embedded spaces. So:

"[0-9]{1,3}(\\.[0-9]{1,3}){3} [^ ]* [^ ]* HTTP/1\\.1"

I dropped the parentheses since you're only allocating space for the whole-expression match; put the parens back in to get the pieces.

Lighthouse answered 27/11, 2013 at 13:3 Comment(0)
T
0
a = regcomp(&re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1",  REG_EXTENDED|REG_ENHANCED);  

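REG_ENHANCED is a non-POSIX extension (see the comment in the header), so this is not portable. The macro did not exist in older SDKs; the header defines it only when targeting OS X 10.8 or iOS 6.0 and later: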

#if __MAC_OS_X_VERSION_MIN_REQUIRED  >= __MAC_10_8 \
 || __IPHONE_OS_VERSION_MIN_REQUIRED >= __IPHONE_6_0
#define REG_ENHANCED    0400    /* Additional (non-POSIX) features */
#endif
Tremolant answered 6/3, 2017 at 10:30 Comment(0)
M
-1

In your code, pm should be an array of regmatch_t, and in your case, should have at least 2 to 4 elements, depending upon which () sub-expressions you want to capture.

You have only one element. The first element, pm[0], always gets whatever text matches your entire RE. That's the one you'll be getting. It is pm[1] that will get the text of the first () sub-expression (the IP address), and pm[3] that will get the text matching your (.*?) term.

But even so, as stated above (by Wumbley, W. Q.) the POSIX regex library may not support non-greedy quantifiers.

Mather answered 11/11, 2015 at 2:20 Comment(0)
