Need a regular expression to capture second level domain (SLD)

I need a regular expression to capture a given URL's SLD.

Examples:

jack.bop.com -> bop
bop.com -> bop
bop.de -> bop
bop.co.uk -> bop
bop.com.br -> bop

All bops :). So this regex needs to ignore ccTLDs, gTLDs and ccSLDs. The last of these is the difficult part, since I want to keep the regex as simple as possible.

The first task would be to remove ccTLDs then gTLDs, and then check for ccSLDs and remove them if present.

Any help is much appreciated :)

--

If it helps, ccTLDs are matched by:

\.([a-z]{2})$

And gTLDs are matched by:

\.([a-z]{3,6})$

Luckily, these two patterns are mutually exclusive.

Body answered 15/12, 2010 at 17:23

Technically, '.co.uk' is the second level domain in 'bop.co.uk'. What you seem to be asking for is the highest-level part of the domain that was open to public registration, with the registry's own suffix stripped off.

RFC 6265 §5.3 calls the suffix that you don't want a "public suffix":

A "public suffix" is a domain that is controlled by a public registry, such as "com", "co.uk", and "pvt.k12.wy.us".

Mozilla maintains a list of all known public suffixes.

To create your regex, you'll have to enumerate all of the public suffixes. You should order them so that elements that are suffixes of other elements appear later. An easy way to do this is to sort by descending length. It looks like reversing Mozilla's list would also suffice.

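After that, the regex is pretty straightforward. The optional subdomain prefix has to be lazy: with a greedy (.+\.)? the engine backtracks until the shorter suffix uk matches, and captures "co" from bop.co.uk instead of "bop":

(.+?\.)??([^.]+)\.(?:<suffixes>)$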

Where <suffixes> would be the | separated list of suffixes. A piece of it would look something like:

gov\.uk|ac\.uk|co\.uk|com|org|net|us|uk
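
To make the construction concrete, here is a minimal Python sketch. SAMPLE_SUFFIXES is a hypothetical hand-picked subset, not Mozilla's actual list, and build_pattern and registrable_label are names invented for this illustration. The sketch sorts by descending length as suggested above and uses a lazy, lazily-optional prefix so that the longest matching suffix wins.

```python
import re

# Hypothetical sample; a real list would be generated from Mozilla's
# Public Suffix List rather than typed in by hand.
SAMPLE_SUFFIXES = ["com", "org", "net", "de", "us", "uk", "br",
                   "co.uk", "ac.uk", "gov.uk", "com.br"]


def build_pattern(suffixes):
    """Compile a pattern whose group 2 is the label left of the suffix."""
    # Descending length puts "co.uk" ahead of "uk".
    ordered = sorted(suffixes, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # Lazy prefix: try the shortest subdomain part (ideally none) first.
    return re.compile(r"^(.+?\.)??([^.]+)\.(?:" + alternation + r")$")


def registrable_label(host, pattern):
    """Return the label directly left of the public suffix, or None."""
    match = pattern.match(host.lower())
    return match.group(2) if match else None


PATTERN = build_pattern(SAMPLE_SUFFIXES)

if __name__ == "__main__":
    for host in ["jack.bop.com", "bop.com", "bop.de", "bop.co.uk", "bop.com.br"]:
        print(host, "->", registrable_label(host, PATTERN))
```

With this sample list, all five hosts from the question come out as bop. A host whose ending is not in the list simply returns None, which is why the full list matters in practice.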

There are ways to make this shorter by collapsing common suffixes, though this makes the regex (and the process of computing it) much more complex. For example:

(?:gov\.|ac\.|co\.|)uk|com|org|net|us
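
As a quick sanity check, the flat and collapsed alternations can be compared on a few hosts. This is only a sketch: compile_with, label, and HOSTS are made up for the illustration, and the surrounding pattern uses a lazy prefix so that the longest suffix is preferred.

```python
import re

# The two suffix fragments from the answer, flat and collapsed.
FLAT = r"gov\.uk|ac\.uk|co\.uk|com|org|net|us|uk"
COLLAPSED = r"(?:gov\.|ac\.|co\.|)uk|com|org|net|us"


def compile_with(suffixes):
    """Wrap a suffix alternation in the full host pattern."""
    return re.compile(r"^(.+?\.)??([^.]+)\.(?:" + suffixes + r")$")


def label(pattern, host):
    """Return the captured label (group 2), or None if nothing matches."""
    match = pattern.match(host)
    return match.group(2) if match else None


flat = compile_with(FLAT)
collapsed = compile_with(COLLAPSED)

# Hypothetical test hosts covering each branch of the collapsed form.
HOSTS = ["bop.co.uk", "www.bop.gov.uk", "bop.ac.uk", "bop.uk",
         "bop.com", "jack.bop.org", "bop.xyz"]

if __name__ == "__main__":
    for host in HOSTS:
        print(host, label(flat, host), label(collapsed, host))
```

Both patterns should give the same label for every host here. The empty branch in (?:gov\.|ac\.|co\.|) is what lets a bare uk still match.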
Gnathous answered 15/12, 2010 at 17:55
