Top level domain from URL in C#

7

17

I am using C# and ASP.NET for this.

We receive a lot of "strange" requests on our IIS 6.0 servers and I want to log and catalog these by domain.

E.g. we get some strange requests like these:

The latter three are fairly obvious, but I would like to sort them all into one group, since "example.com" IS hosted on our servers. The rest aren't, sorry :-)

So I am looking for some good ideas for how to retrieve example.com from the above. Secondly, I would like to match m., wap., iphone. etc. into a group, but that's probably just a quick lookup in a list of mobile prefixes. I could hand-code this list for a start.
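The "quick lookup in a list" idea for mobile hosts could look roughly like the sketch below. The class name, method name, and the prefix list itself are assumptions made up for illustration; the real list would be whatever you hand-code.

```csharp
using System;
using System.Collections.Generic;

static class HostGrouping
{
    // Hypothetical, hand-maintained list of "mobile" host prefixes (an assumption).
    static readonly HashSet<string> MobilePrefixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "m", "wap", "iphone", "mobile" };

    // Returns true when the first label of the host is one of the known mobile prefixes.
    public static bool IsMobileHost(string host)
    {
        if (string.IsNullOrEmpty(host)) return false;
        int dot = host.IndexOf('.');
        string firstLabel = dot < 0 ? host : host.Substring(0, dot);
        return MobilePrefixes.Contains(firstLabel);
    }
}
```

Note that this only looks at the first label, so a bare host such as m.com would also be flagged; you would probably want to extract the domain first and only test what is left in front of it.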

But is a regexp the answer here, or is pure string manipulation the easiest way? I was thinking of "splitting" the URL string by "." and then looking at item[0] and item[1]...

Any ideas?

Adalai answered 10/1, 2011 at 2:28 Comment(2)
I also need a solution that will work well for co.uk type domains...Cohlier
I think you should detect that it's a co.uk first, then go to a special case for that. Not every country has similar "top/second" level domains. So I am going for "top level" selection first, then sorting down afterwards.Adalai
M
24

You can use the Nager.PublicSuffix NuGet package. It uses the same data source that browser vendors use.

nuget

PM> Install-Package Nager.PublicSuffix

Example

var domainParser = new DomainParser(new WebTldRuleProvider());

var domainInfo = domainParser.Parse("sub.test.co.uk");
//domainInfo.Domain = "test";
//domainInfo.Hostname = "sub.test.co.uk";
//domainInfo.RegistrableDomain = "test.co.uk";
//domainInfo.SubDomain = "sub";
//domainInfo.TLD = "co.uk";
Muttonhead answered 27/10, 2016 at 17:59 Comment(2)
Thank you this was exactly what I was looking for.Rondeau
This is the only correct approach; this package downloads and caches the publicsuffix.org maintained and curated suffix list, the same list that browser vendors use.Jaclin
R
12

The following code uses the Uri class to obtain the host name, and then obtains the second level host (examplecompany.com) from Uri.Host by splitting the host name on periods.

var uri = new Uri("http://www.poker.winner4ever.examplecompany.com/");
var splitHostName = uri.Host.Split('.');
if (splitHostName.Length >= 2)
{
    var secondLevelHostName = splitHostName[splitHostName.Length - 2] + "." +
                              splitHostName[splitHostName.Length - 1];
}
Ramayana answered 10/1, 2011 at 2:37 Comment(5)
This might be suitable for the OP's needs but it isn't correct for all domains. For example, the hostname for google.co.uk or bbc.co.uk would be given as "co.uk".Karlow
@LukeH: Very good point. I was just considering the OP's needs and country code TLD's didn't even cross my mind :-/Ramayana
@Karlow - The OP already specified the domain in which he is interested, so it doesn't appear he is looking for a general solution that would work for any TLD - he says '"examplecompany.com" IS hosted on our servers'. Matching TLDs using a regex in the general case is actually pretty difficult and full of pitfalls.Foppery
I get a LONG logfile of strange URLs - I don't know the incoming URLs beforehand. So I can't use some "indexOf" on the strings, as we also receive queries for non-existent, never-been, has-been domains on our IPs. Sometimes I think people have wrongly pointed their domains at our servers just for fun... but what do I know.Adalai
@LukeH: I am aware of this "problem", but I will deal with co.uk in special cases, if I can just get the domains sorted. So if I split it up, I get "domain.tld" - I could make a "trouble list" of "co.uk" etc. where I add an extra level if this is a match though. I guess that's going to be the only way to deal with them.Adalai
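The "trouble list" idea from the last comment could be sketched like this: take the last two labels as in the answer, and add one more label when those two are on a hand-maintained list. The class name, method name, and list contents are assumptions for illustration only, and the list is nowhere near complete.

```csharp
using System;
using System.Collections.Generic;

static class TroubleListDomains
{
    // Hypothetical "trouble list" of two-label suffixes; extend it as new cases show up.
    static readonly HashSet<string> TwoLabelSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "co.uk", "org.uk", "com.au" };

    // Returns "domain.tld", or "domain.co.uk" style when the last two labels are on the list.
    public static string GetDomain(string host)
    {
        string[] labels = host.Split('.');
        if (labels.Length < 2) return host;

        string lastTwo = labels[labels.Length - 2] + "." + labels[labels.Length - 1];
        if (labels.Length >= 3 && TwoLabelSuffixes.Contains(lastTwo))
            return labels[labels.Length - 3] + "." + lastTwo;

        return lastTwo;
    }
}
```

This keeps the simple split for the common case and only pays for the extra level where you have explicitly listed it, at the cost of being wrong for every suffix you have not listed yet.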
P
9

There may be cases where this returns something other than what you want, but country-code TLDs are the only TLDs that are 2 characters long, and they may or may not be used with a short second level (typically 2 or 3 characters, such as co or com). Therefore, this will give you what you want in most cases:

string GetRootDomain(string host)
{
    string[] domains = host.Split('.');

    if (domains.Length >= 3)
    {
        int c = domains.Length;
        // handle international country code TLDs 
        // www.amazon.co.uk => amazon.co.uk
        if (domains[c - 1].Length < 3 && domains[c - 2].Length <= 3)
            return string.Join(".", domains, c - 3, 3);
        else
            return string.Join(".", domains, c - 2, 2);
    }
    else
        return host;
}
Pedanticism answered 4/2, 2016 at 23:33 Comment(3)
actually should give you "police.uk" for gmp.police.uk, since "police" is longer than 3 characters.Pedanticism
Ah, that's wrong. The domain is 'police.uk' the host is 'gmp'. Another example is: devon-cornwall.police.ukDansby
very nice in between solutionBiparous
N
5

This is not possible without an up-to-date database of the different domain levels.

Consider:

s1.moh.gov.cn
moh.gov.cn
s1.google.com
google.com

So at which level do you want to get the domain? It depends entirely on the TLD, SLD, ccTLD... Because ccTLDs are under the control of individual countries, they may define very special SLDs that are unknown to you.

Neile answered 10/1, 2011 at 5:53 Comment(3)
I agree, but I still want to be able to sort our incoming traffic.Adalai
At that point I suggest going with the normal TLD format and sacrificing rare ccTLD domains. Then the other answers will be of more help.Neile
gov.cn is a TLD in s1.moh.gov.cn , do you think otherwise?Pneumo
C
2

I've written a library for use in .NET 2+ to help pick out the domain components of a URL.

More details are on GitHub, but one benefit over previous options is that it can download the latest data from http://publicsuffix.org automatically (once per month), so the output from the library should be more-or-less on a par with the output used by web browsers to establish domain security boundaries (i.e. pretty good).

It's not perfect yet but suits my needs and shouldn't take much work to adapt to other use cases so please fork and send a pull request if you want.

Cowitch answered 1/7, 2015 at 21:12 Comment(2)
have you considered the new toplevel domains too in your lib?Adalai
Yes. Since the lib is built on data from publicsuffix.org new toplevel domains will be supported within a month of support being added to the nightly builds of browsers like Firefox and Chrome. You could force this to occur more quickly by deleting the cached copy of the publicsuffix database before it expires within a month but that's only likely to be useful in rare cases when developing software far in advance of mainstream support for the new suffix.Cowitch
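To give a feel for what a library built on the publicsuffix.org data has to do, here is a minimal sketch of matching a host against rules in the public suffix list format (one rule per line, // comments, *. wildcard rules, ! exception rules). This is not the code of the library above; the SuffixRules class and its members are hypothetical, the rules are passed in as lines rather than downloaded, and things like internationalized names are ignored.

```csharp
using System;
using System.Collections.Generic;

sealed class SuffixRules
{
    readonly HashSet<string> exact = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    readonly HashSet<string> wildcard = new HashSet<string>(StringComparer.OrdinalIgnoreCase);  // "*.ck" stored as "ck"
    readonly HashSet<string> exception = new HashSet<string>(StringComparer.OrdinalIgnoreCase); // "!www.ck" stored as "www.ck"

    public SuffixRules(IEnumerable<string> lines)
    {
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0 || line.StartsWith("//", StringComparison.Ordinal)) continue; // blank or comment
            if (line.StartsWith("!", StringComparison.Ordinal)) exception.Add(line.Substring(1));
            else if (line.StartsWith("*.", StringComparison.Ordinal)) wildcard.Add(line.Substring(2));
            else exact.Add(line);
        }
    }

    // Returns the registrable domain (public suffix plus one label),
    // or null when the host is itself a public suffix.
    public string GetRegistrableDomain(string host)
    {
        string[] labels = host.Trim('.').Split('.');
        int suffixLabels = 1; // implicit default rule "*": the last label is the suffix

        for (int i = 0; i < labels.Length; i++)
        {
            int count = labels.Length - i;
            string candidate = string.Join(".", labels, i, count);

            // An exception rule wins outright; the suffix is the rule minus its first label.
            if (exception.Contains(candidate)) { suffixLabels = count - 1; break; }

            if (exact.Contains(candidate)) suffixLabels = Math.Max(suffixLabels, count);

            // Wildcard rule "*.rest" matches any single label in front of "rest".
            if (count > 1 && wildcard.Contains(string.Join(".", labels, i + 1, count - 1)))
                suffixLabels = Math.Max(suffixLabels, count);
        }

        if (labels.Length <= suffixLabels) return null;
        return string.Join(".", labels, labels.Length - suffixLabels - 1, suffixLabels + 1);
    }
}
```

With the rules com, cn, gov.cn, uk and co.uk loaded, s1.moh.gov.cn gives moh.gov.cn and s1.google.com gives google.com, which is the distinction that plain splitting on "." cannot make.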
F
1

Use a regular expression:

^https?://(?:([\w.-]+)\.)?([\w-]+)\.(com|co\.uk|com\.au)(?:/.*)?$

This will match any URL ending with a TLD in which you are interested. Extend the list for as many as you want. Further, the capturing groups will contain the subdomain, hostname and TLD respectively.

Foppery answered 10/1, 2011 at 4:15 Comment(5)
Hmmm, wouldn't this require me to know what the two domains are first?Adalai
For the general case you need the complete list of rules on how each country organizes its domains. Some countries we are familiar with (e.g. we know that anything before .com or .co.uk is the website name), but how do they do it in e.g. Romania? For instance, in the URL something.com.ro, is the website called "com" and the subdomain "something"? Or does Romania use "com.ro" as its TLD for commercial sites? I don't know, but I believe that you're going to need this kind of information if you want to do this properly.Foppery
The Mozilla foundation have made a (probably non-exhaustive) list of these TLDs: mxr.mozilla.org/mozilla-central/source/netwerk/dns/…Foppery
Thanks for that link, I can see your point. Ouch! At the moment, though, it's a script for our own servers and we know the types of top levels included in our hosting. Not that many. The problem is more when we receive tons of "strange URL mappings" to our domains - it's those I would like to retrieve and sort for easy viewing. But really, thanks for the TLD link. Going to check up on it too. Perhaps build some kind of import from that page.Adalai
@Mikey Cee , good link. Previously found a list of patterns to determine the TLD in google Guava library and translated it to xml. Here is the link to it docs.google.com/file/d/0B8ALaar6dLM7ZUc2MUtidVE4RXM/…Semiautomatic
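For reference, using such a pattern from C# might look like the sketch below. The pattern here is an assumed variant of the answer's regex with the alternation grouped and the dots escaped; the UrlPattern class and TryParse method are made-up names, and the suffix list is just the three from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class UrlPattern
{
    // Groups: 1 = subdomain (optional), 2 = hostname, 3 = TLD from a fixed, hand-picked list.
    static readonly Regex Pattern = new Regex(
        @"^https?://(?:([\w.-]+)\.)?([\w-]+)\.(com|co\.uk|com\.au)(?:/.*)?$",
        RegexOptions.IgnoreCase);

    // Returns false (and nulls) when the URL does not end in one of the listed TLDs.
    public static bool TryParse(string url, out string subdomain, out string name, out string tld)
    {
        Match m = Pattern.Match(url);
        subdomain = m.Success ? m.Groups[1].Value : null; // empty string when there is no subdomain
        name = m.Success ? m.Groups[2].Value : null;
        tld = m.Success ? m.Groups[3].Value : null;
        return m.Success;
    }
}
```

As the comments point out, this only works for the TLDs you list up front, so it suits the "we know what we host" case rather than arbitrary incoming hosts.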
L
0
uri.Host.ToLower().Replace("www.","").Substring(uri.Host.ToLower().Replace("www.","").IndexOf('.'))
  • returns ".net" for Uri uri = new Uri("https://mcmap.net/q/276958/-top-level-domain-from-url-in-c");

  • returns ".co.jp" for Uri uri = new Uri("http://stackoverflow.co.jp");

  • returns ".s1.moh.gov.cn" for Uri uri = new Uri("http://stackoverflow.s1.moh.gov.cn");

etc.

Leprous answered 18/2, 2011 at 16:59 Comment(0)
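The one-liner above can be unpacked into a small helper, sketched below under a couple of assumptions: the HostSuffix class and AfterFirstLabel method are made-up names, "www." is stripped only when it is a prefix (Replace removes it anywhere in the host), and a host without any dot returns an empty string instead of throwing from Substring.

```csharp
using System;

static class HostSuffix
{
    // Returns everything after the first label of the host (leading "www." ignored),
    // including the leading dot, or an empty string when the host has no dot.
    public static string AfterFirstLabel(Uri uri)
    {
        string host = uri.Host.ToLowerInvariant();
        if (host.StartsWith("www.", StringComparison.Ordinal))
            host = host.Substring(4);

        int dot = host.IndexOf('.');
        return dot < 0 ? string.Empty : host.Substring(dot);
    }
}
```

It has the same limitation as the original: everything after the first label is returned, whether or not it is really a TLD.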
