Best ways of parsing a URL using C?

I have a URL like this:

http://192.168.0.1:8080/servlet/rece

I want to parse the URL to get the values:

IP: 192.168.0.1
Port: 8080
Page: /servlet/rece

How do I do that?

Conduplicate answered 7/4, 2009 at 14:47 Comment(1)
for windows, use CoInternetParseUrlZingaro
I
-3

Write a custom parser or use one of the string replace functions to replace the separator ':' and then use sscanf().

Inflammatory answered 7/4, 2009 at 14:54 Comment(3)
There are many traps to watch so a custom parser seems to me a bad idea.Corporative
@bortzmeye: that doesn't make the suggestion invalid. It's vague reasoning. Also, a custom parser is the most powerful/efficient/dependency free. The sscanf is easier to get wrong.Inflammatory
how is "write some code that does what you need" an accepted answer?Looper
C
29

Personally, I steal the HTParse.c module from the W3C (it is used in the lynx Web browser, for instance). Then, you can do things like:

 strncpy(hostname, HTParse(url, "", PARSE_HOST), size);

The important thing about using a well-established and debugged library is that you do not fall into the typical traps of URL parsing (many regexps fail when the host is an IP address, for instance, especially an IPv6 one).
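To illustrate the IPv6 trap, here is a minimal sketch of splitting the host[:port] part of a URL while honouring the square brackets that RFC 3986 requires around IPv6 literals. The function name split_authority and its return convention are made up for this example. It is not taken from HTParse.c, and it rejects some inputs the RFC allows, such as an empty port after the colon.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: splits "host", "host:port" or "[v6-literal]:port".
 * Returns 0 on success, -1 on malformed input or if the host does not fit.
 * *port is left untouched when no port is present. */
int split_authority(const char *auth, char *host, size_t hostsz, int *port)
{
    const char *host_start = auth;
    const char *host_end;
    const char *rest;

    assert(auth != NULL && host != NULL && port != NULL);

    if (auth[0] == '[') {
        /* IP literal: everything up to the closing ']' is the host. */
        host_start = auth + 1;
        host_end = strchr(host_start, ']');
        if (host_end == NULL)
            return -1;
        rest = host_end + 1;
    } else {
        /* Name or IPv4 address: the first ':' starts the port. */
        host_end = auth + strcspn(auth, ":");
        rest = host_end;
    }

    size_t len = (size_t)(host_end - host_start);
    if (len == 0 || len >= hostsz)
        return -1;
    memcpy(host, host_start, len);
    host[len] = '\0';

    if (*rest == '\0')
        return 0;                      /* no port given */
    if (*rest != ':' || rest[1] < '0' || rest[1] > '9')
        return -1;                     /* junk after the host */

    char *end;
    long value = strtol(rest + 1, &end, 10);
    if (*end != '\0' || value > 65535)
        return -1;
    *port = (int)value;
    return 0;
}
```

Set the port to your default before calling. With this sketch, "[3ffe:0501::1:2]:2" yields port 2, "[3ffe:0501::1:2]" keeps the default, and an unbracketed "3ffe:0501::1:2" is rejected rather than guessed at.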

Corporative answered 7/4, 2009 at 16:57 Comment(4)
In particular, be aware that with IPv6 there are ambiguous cases if you try to use the colon as a separator, e.g. 3ffe:0501::1:2: is that a port of 2, or a full address with your default port? The URL specs have dealt with this, as have the prewritten libraries.Elevated
Note there is no real ambiguity. The URI standard, RFC 3986, is clear and your example is illegal (you need square brackets).Corporative
Thanks, this is comforting. I was under the mistaken impression that user facing code, like browser address bars, was accepting the addresses without square brackets. A quick tour of some popular browsers reveals this is not the case.Elevated
HTParse.c has a number of dependencies, any chance you can explain how you can "steal" this from the project easily? Maybe back in 2009 it did not ;)Linkboy
C
15

I wrote some simple code using sscanf that can parse very basic URLs.

#include <stdio.h>

int main(void)
{
    const char text[] = "http://192.168.0.2:8888/servlet/rece";
    char ip[100];
    int port = 80;
    char page[100];
    sscanf(text, "http://%99[^:]:%99d/%99[^\n]", ip, &port, page);
    printf("ip = \"%s\"\n", ip);
    printf("port = \"%d\"\n", port);
    printf("page = \"%s\"\n", page);
    return 0;
}

./urlparse
ip = "192.168.0.2"
port = "8888"
page = "servlet/rece"
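The format above only matches when a port is present. Below is a rough sketch of one way to make the port optional by checking the return value of sscanf() and retrying without the port. The function name parse_http_url and the buffer sizes are assumptions for illustration. A number such as 99 in %99[...] is a maximum field width: at most 99 characters are stored, which leaves room for the terminating '\0' in a 100-byte buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parses "http://host[:port][/page]".
 * host must hold 100 bytes and page 200 bytes, matching the widths below.
 * Returns 0 on success, -1 if the text is not an http URL with a host. */
int parse_http_url(const char *text, char *host, int *port, char *page)
{
    assert(text != NULL && host != NULL && port != NULL && page != NULL);

    if (strncmp(text, "http://", 7) != 0)
        return -1;
    *port = 80;        /* default when the URL names no port */
    page[0] = '\0';    /* stays empty when the URL has no path */

    /* %99[^:/] stops at ':' or '/', so the host never swallows
     * the port or the path. */
    int n = sscanf(text + 7, "%99[^:/]:%d/%199[^\n]", host, port, page);
    if (n < 1)
        return -1;     /* empty host */
    if (n == 1) {
        /* No ":port" followed the host: skip the host and look for a path. */
        sscanf(text + 7, "%*99[^:/]/%199[^\n]", page);
    }
    return 0;
}
```

With this sketch, "http://192.168.0.2/servlet/rece" gives port 80 and page "servlet/rece". It does no validation beyond that, so a malformed port such as ":abc" is silently treated as missing.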
Conduplicate answered 7/4, 2009 at 15:2 Comment(5)
What platform is this on? I did not know you could put regexp like [^:] in a sscanf format.In
My platform is: uname -a Linux ubuntu 2.6.24-21-generic #1 SMP Tue Oct 21 23:43:45 UTC 2008 i686 GNU/LinuxConduplicate
[^:] is not a regexp in this context, it's merely a special format specifier for sscanf(). It is standard. See for instance this manual page: <linux.die.net/man/3/sscanf>.Aurel
The parser fails when there is no port number; it doesn't work well. How can I fix it?Conduplicate
What does %99 do here? How does it work? Please explain.Havard
A
11

Use a regular expression if you want the easy way. Otherwise use Flex/Bison.

You could also use a URI parsing library.
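As a sketch of the regular-expression route, the example below uses the POSIX regcomp()/regexec() functions (POSIX, not standard C) with the expression given in RFC 3986, Appendix B. The helper names copy_group and split_uri are invented for this example. It only separates scheme, authority and path. The authority (for example 192.168.0.1:8080) still has to be split into host and port afterwards.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copies capture group idx into out (empty if unset),
 * truncating if the group does not fit. */
static void copy_group(const char *s, const regmatch_t *m, int idx,
                       char *out, size_t outsz)
{
    size_t len = 0;
    if (m[idx].rm_so != -1) {
        len = (size_t)(m[idx].rm_eo - m[idx].rm_so);
        if (len >= outsz)
            len = outsz - 1;
        memcpy(out, s + m[idx].rm_so, len);
    }
    out[len] = '\0';
}

/* Splits a URI into scheme, authority and path; each buffer holds bufsz bytes.
 * Returns 0 on success, -1 if the pattern fails to compile or match. */
int split_uri(const char *uri, char *scheme, char *authority, char *path,
              size_t bufsz)
{
    /* Regular expression from RFC 3986, Appendix B. */
    static const char pattern[] =
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?";
    regex_t re;
    regmatch_t m[10];

    assert(uri && scheme && authority && path && bufsz > 0);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    int rc = regexec(&re, uri, 10, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    copy_group(uri, m, 2, scheme, bufsz);     /* group 2: scheme */
    copy_group(uri, m, 4, authority, bufsz);  /* group 4: authority */
    copy_group(uri, m, 5, path, bufsz);       /* group 5: path */
    return 0;
}
```

For the URL in the question this should give scheme "http", authority "192.168.0.1:8080" and path "/servlet/rece".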

Amedeo answered 7/4, 2009 at 14:54 Comment(2)
Indeed, using a library seems the only reasonable thing, since there are many traps (http vs. https, explicit port, encoding in the path, etc).Corporative
Hi, I wrote a BNF for the URL, like this: URL = "http://" {IP} {PORT}? {PAGE}? Flex generated a file which parsed the URL. But how do I fetch the individual parts like IP, PORT and PAGE from the URL?Lianna
F
11

This may be late, but what I have used is the http_parser_parse_url() function and the macros it requires, separated out from the Joyent HTTP parser library. That worked well and comes to about 600 lines of code.

Ferrante answered 29/11, 2013 at 6:48 Comment(1)
Yep. The node.js HTTP parser lib is great and very well tested for anything that has to do with HTTP requests / responses.Lewse
G
3

libcurl now has a curl_url_get() function that can extract the host, path, etc.

Example code: https://curl.haxx.se/libcurl/c/parseurl.html

/* extract host name from the parsed URL */ 
uc = curl_url_get(h, CURLUPART_HOST, &host, 0);
if(!uc) {
  printf("Host name: %s\n", host);
  curl_free(host);
}
Grouse answered 6/12, 2018 at 19:32 Comment(0)
D
2

This one is small and worked excellently for me: http://draft.scyphus.co.jp/lang/c/url_parser.html . It is just two files (*.c, *.h).
I had to adapt the code [1].

[1] Change all the function calls from http_parsed_url_free(purl) to parsed_url_free(purl):

   //Rename the function called
   //http_parsed_url_free(purl);
   parsed_url_free(purl);
Demodulator answered 23/8, 2013 at 10:8 Comment(2)
@ tremendows : Excellent link. It works like a charm.Kief
Sadly that excellent code is copyrighted 'all rights reserved', so it should not be used in other than a personal project.Jewfish
O
2

Pure sscanf() based solution:

//Code
#include <stdio.h>

int
main (int argc, char *argv[])
{
    const char *uri = "http://192.168.0.1:8080/servlet/rece";
    char ip_addr[64], path[100];  /* room for hosts longer than this 11-character example */
    int port;
    
    int uri_scan_status = sscanf(uri, "%*[^:]%*[:/]%[^:]:%d%s", ip_addr, &port, path);
    
    printf("[info] URI scan status : %d\n", uri_scan_status);
    if( uri_scan_status == 3 )
    {   
        printf("[info] IP Address : '%s'\n", ip_addr);
        printf("[info] Port: '%d'\n", port);
        printf("[info] Path : '%s'\n", path);
    }
    
    return 0;
}

However, keep in mind that this solution is tailor-made for URIs of the form [protocol_name]://[ip_address]:[port][/path]. To understand more about the components of URI syntax, you can head over to RFC 3986.

Now let's break down our tailor-made format string: "%*[^:]%*[:/]%[^:]:%d%s"

  • %*[^:] skips the protocol/scheme (e.g. http, https, ftp, etc.)

    It matches the string from the beginning until it encounters the : character for the first time. Because the * comes right after the % character, the matched string is discarded rather than stored.

  • %*[:/] skips the separator that sits between the protocol and the IP address, i.e. ://

  • %[^:] captures the string after the separator, up to the next : character. This captured string is the IP address.

  • :%d captures the number right after the : character (the one that ended the IP address). This number is the port.

  • %s captures the remaining string, up to the first whitespace character, which is the path of the resource you are looking for.

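The breakdown above can be checked with a small wrapper that reports how many fields were assigned. This is only a sketch: the name scan_uri is made up, and the field widths (%63 and %99) are an addition so that sscanf() cannot write past the buffers. Suppressed conversions (those with *) do not count towards the return value, so 3 means the IP address, port and path were all found.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical wrapper around the format string discussed above, with
 * field widths added. ip_addr must hold 64 bytes and path 100 bytes.
 * Returns the number of fields sscanf() assigned (3 means full success). */
int scan_uri(const char *uri, char *ip_addr, int *port, char *path)
{
    assert(uri != NULL && ip_addr != NULL && port != NULL && path != NULL);

    ip_addr[0] = '\0';
    path[0] = '\0';
    return sscanf(uri, "%*[^:]%*[:/]%63[^:]:%d%99s", ip_addr, port, path);
}
```

For "http://192.168.0.1:8080/servlet/rece" this returns 3. Without a path it returns 2, and for a URI without a port, such as "http://example.com/index.html", it returns 1, because %63[^:] runs to the end of the string looking for a colon. That is why the result should always be compared against 3 before the values are used.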

Opportunity answered 9/9, 2020 at 12:14 Comment(0)
V
1

This C gist could be useful. It implements a pure C solution with sscanf.

https://github.com/luismartingil/per.scripts/tree/master/c_parse_http_url

It uses

// Parsing the tmp_source char*
if (sscanf(tmp_source, "http://%99[^:]:%i/%199[^\n]", ip, &port, page) == 3) { succ_parsing = 1;}
else if (sscanf(tmp_source, "http://%99[^/]/%199[^\n]", ip, page) == 2) { succ_parsing = 1;}
else if (sscanf(tmp_source, "http://%99[^:]:%i[^\n]", ip, &port) == 2) { succ_parsing = 1;}
else if (sscanf(tmp_source, "http://%99[^\n]", ip) == 1) { succ_parsing = 1;}
(...)
Vacancy answered 17/9, 2013 at 15:58 Comment(1)
The third if statement will never be tested, because the second one has the same meaning, so this could cause a problem with port/page.Illmannered
S
1

I wrote this

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
typedef struct
{
    /* C has no default member initializers; split_url() sets every field. */
    const char* protocol;
    const char* site;
    const char* port;
    const char* path;
} URL_INFO;
URL_INFO* split_url(URL_INFO* info, const char* url)
{
    if (!info || !url)
        return NULL;
    info->protocol = strtok(strcpy((char*)malloc(strlen(url)+1), url), "://");
    info->site = strstr(url, "://");
    if (info->site)
    {
        info->site += 3;
        char* site_port_path = strcpy((char*)calloc(1, strlen(info->site) + 1), info->site);
        info->site = strtok(site_port_path, ":");
        info->site = strtok(site_port_path, "/");
    }
    else
    {
        char* site_port_path = strcpy((char*)calloc(1, strlen(url) + 1), url);
        info->site = strtok(site_port_path, ":");
        info->site = strtok(site_port_path, "/");
    }
    char* URL = strcpy((char*)malloc(strlen(url) + 1), url);
    info->port = strstr(URL + 6, ":");
    char* port_path = 0;
    char* port_path_copy = 0;
    if (info->port && isdigit(*(port_path = (char*)info->port + 1)))
    {
        port_path_copy = strcpy((char*)malloc(strlen(port_path) + 1), port_path);
        char * r = strtok(port_path, "/");
        if (r)
            info->port = r;
        else
            info->port = port_path;
    }
    else
        info->port = "80";
    if (port_path_copy)
        info->path = port_path_copy + strlen(info->port ? info->port : "");
    else 
    {
        char* path = strstr(URL + 8, "/");
        info->path = path ? path : "/";
    }
    int r = strcmp(info->protocol, info->site) == 0;
    if (r && strcmp(info->port, "80") == 0)  /* compare contents, not pointers */
        info->protocol = "http";
    else if (r)
        info->protocol = "tcp";
    return info;
}

Test

int main()
{
    URL_INFO info;
    split_url(&info, "ftp://192.168.0.1:8080/servlet/rece");
    printf("Protocol: %s\nSite: %s\nPort: %s\nPath: %s\n", info.protocol, info.site, info.port, info.path);
    return 0;
}

Out

Protocol: ftp
Site: 192.168.0.1
Port: 8080
Path: /servlet/rece
Stationer answered 18/8, 2018 at 8:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.