How to get domain name from URL
C

25

65

How can I fetch a domain name from a URL String?

Examples:

+----------------------+------------+
| input                | output     |
+----------------------+------------+
| www.google.com       | google     |
| www.mail.yahoo.com   | mail.yahoo |
| www.mail.yahoo.co.in | mail.yahoo |
| www.abc.au.uk        | abc        |
+----------------------+------------+

Clarendon answered 20/2, 2009 at 11:1 Comment(9)
what about www.abc.def.ghi.au.uk? – Kolyma
What about “foo.bar.com”? And “foo.com”? – Upturn
Well, the second post in minutes about a very similar topic -- homework? (stackoverflow.com/questions/568864/…) – Guillerminaguillermo
What for, may I ask? It's hard to imagine why you would need domain names without the second-level domain suffix (like .co.uk). – Moulton
@Hemal: in this case the expected output is abc.def.ghi. @Bombe: I need to remove the www prefix anyway. – Clarendon
The problem is not solvable. You can't tell whether xx in foo.xx.yy has to be removed too. (Why did you remove au.uk and not just uk?) – Pompous
Agree with 'not solvable'. Too many mutually exclusive conditions. – Moulton
@Chinmay: Your terminology is all sorts of wrong here. All of the inputs you list are domain names, not URLs. This is a URL: http://en.wikipedia.org/wiki/URL, and the domain name in that URL is en.wikipedia.org – Chickweed
I found this answer very useful: https://mcmap.net/q/18878/-implementing-public-suffix-extraction-using-java. – Carrasquillo
B
45

I once had to write such a regex for a company I worked for. The solution was this:

  • Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but it lacks ac.uk, for example, so it is not really usable for this.
  • Join the list like the example below. A warning: ordering is important! If org.uk appeared after uk, then example.org.uk would match org instead of example.

Example regex:

^.*?([^.]+)\.(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$

This worked really well and also matched weird, unofficial top-levels like de.com and friends.
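
The joining step can also be automated instead of being done by hand. The following is a minimal Python sketch of that idea, assuming a tiny made-up sample of suffixes; the names `SAMPLE_SUFFIXES`, `build_domain_regex` and `split_domain` are hypothetical and not part of the original solution.

```python
import re

# Assumption: a tiny sample; a real list would be generated from IANA data
# or the Public Suffix List.
SAMPLE_SUFFIXES = ["com", "net", "org", "uk", "co.uk", "org.uk", "ac.uk", "de.com"]

def build_domain_regex(suffixes):
    # Put multi-label suffixes first so "org.uk" is tried before "uk".
    ordered = sorted(suffixes, key=lambda s: (-s.count("."), -len(s)))
    alternation = "|".join(re.escape(s) for s in ordered)
    # (?:^|\.) anchors the captured name at a label boundary.
    return re.compile(r"(?:^|\.)([^.]+)\.(" + alternation + r")$", re.IGNORECASE)

DOMAIN_RE = build_domain_regex(SAMPLE_SUFFIXES)

def split_domain(host):
    """Return (name, suffix), or None if no known suffix matches."""
    match = DOMAIN_RE.search(host)
    return match.groups() if match else None
```

With this sample list, `split_domain("www.example.org.uk")` gives `("example", "org.uk")`; a host whose suffix is not in the list gives `None`.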

The upside:

  • Very fast if regex is optimally ordered

The downside of this solution is of course:

  • Handwritten regex which has to be updated manually if ccTLDs change or get added. Tedious job!
  • Very large regex so not very readable.
Bellied answered 20/2, 2009 at 11:30 Comment(9)
RE: tedious to update - Write a little code generator program to generate the regex based on the input data files. – Hardpan
True. With a good test harness this should be possible. We of course did no testing then... – Bellied
The list from Mozilla seems pretty good actually -- it has *.uk to match .ac.uk. You just have to figure out the format and interpret the rules correctly. – Mccorkle
Worth noting that if you parse the Mozilla list for ALL the possible TLDs, the regex compilation fails (on PHP at least). – Sigrid
I needed this for a couple of projects, so I implemented it in Python and opened it up on GitHub. You can also query it via an HTTP endpoint on App Engine. Feel free to contribute! – Notability
There are libraries based upon Mozilla's Public Suffix List that make getting any portion of the domain easy. – Bighead
The Mozilla PSL now matches *.uk, so @pi.'s concern about it being unable to match ac.uk no longer applies. – Textuary
I think this answer is useless nowadays since restrictions for gTLDs have been removed by IANA and the list would need to be updated very frequently. – Winny
Interesting approach; however, it needed a small change to work for me. This regex works fine if you want to find the hostname (root domain address): /([^\.]+)\.(com|net|org|info)$/i – Upkeep
M
24

A little late to the party, but:

const urls = [
  'www.abc.au.uk',
  'https://github.com',
  'http://github.ca',
  'https://www.google.ru',
  'http://www.google.co.uk',
  'www.yandex.com',
  'yandex.ru',
  'yandex'
]

urls.forEach(url => console.log(url.replace(/.+\/\/|www\.|\..+/g, '')))
Mcnutt answered 24/1, 2020 at 13:48 Comment(3)
This is my favourite answer. Thank you. – Hideout
This doesn't work: for the input www.mail.yahoo.co.in the desired output is mail.yahoo, but this outputs mail. – Pulsatory
That's more than fine, neither does the accepted answer, but this way is scalable and more dynamic. Whatever 10-20% of cases that you need to specifically match, where this approach comes short, you can hardcode as the accepted answer has done. It's an answer for the community, not for the OP, who's already received his answer 11 years ago. – Mcnutt
N
16

Extracting the Domain name accurately can be quite tricky mainly because the domain extension can contain 2 parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be there. Listing all domain extensions is not an option because there are hundreds of these. EuroDNS.com for example lists over 800 domain name extensions.

I therefore wrote a short php function that uses 'parse_url()' and some observations about domain extensions to accurately extract the url components AND the domain name. The function is as follows:

function parse_url_all($url){
    $url = substr($url,0,4)=='http'? $url: 'http://'.$url;
    $d = parse_url($url);
    $tmp = explode('.',$d['host']);
    $n = count($tmp);
    if ($n>=2){
        if ($n==4 || ($n==3 && strlen($tmp[($n-2)])<=3)){
            $d['domain'] = $tmp[($n-3)].".".$tmp[($n-2)].".".$tmp[($n-1)];
            $d['domainX'] = $tmp[($n-3)];
        } else {
            $d['domain'] = $tmp[($n-2)].".".$tmp[($n-1)];
            $d['domainX'] = $tmp[($n-2)];
        }
    }
    return $d;
}

This simple function will work in almost every case. There are a few exceptions, but these are very rare.
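
For readers without PHP, here is a rough Python port of the heuristic above. It is only a sketch that mirrors the same rules, not the author's code, and it inherits the same blind spot for three-label hosts with a short middle label.

```python
from urllib.parse import urlparse

def parse_url_all(url):
    # Same trick as the PHP version: make sure there is a scheme to parse.
    if not url.startswith("http"):
        url = "http://" + url
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    n = len(parts)
    info = {"host": host}
    if n >= 2:
        # Four labels, or three labels with a short middle one (e.g. "co"),
        # are treated as having a two-part extension.
        if n == 4 or (n == 3 and len(parts[-2]) <= 3):
            info["domain"] = ".".join(parts[-3:])
            info["domainX"] = parts[-3]
        else:
            info["domain"] = ".".join(parts[-2:])
            info["domainX"] = parts[-2]
    return info
```

With this port, `parse_url_all("www.cnn.com")["domainX"]` comes out as `www`, because without a TLD list a three-letter middle label cannot be told apart from an extension part such as `co`.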

To demonstrate / test this function you can use the following:

$urls = array('www.test.com', 'test.com', 'cp.test.com' .....);
echo "<div style='overflow-x:auto;'>";
echo "<table>";
echo "<tr><th>URL</th><th>Host</th><th>Domain</th><th>Domain X</th></tr>";
foreach ($urls as $url) {
    $info = parse_url_all($url);
    echo "<tr><td>".$url."</td><td>".$info['host'].
    "</td><td>".$info['domain']."</td><td>".$info['domainX']."</td></tr>";
}
echo "</table></div>";

The output for the URLs listed will be as follows:

[screenshot of the resulting table with the columns URL, Host, Domain and Domain X; image not available]

As you can see, the domain name and the domain name without the extension are consistently extracted whatever the URL that is presented to the function.

I hope that this helps.

Nebulous answered 11/7, 2017 at 20:36 Comment(6)
Clinton said: "I therefore wrote a short php function that uses 'parse_url()' and some observations about domain extensions to accurately extract the url components AND the domain name." Does anyone have a JavaScript version of this function? – Avigation
Good script. Is it still safe to use? – Castalia
Thank you. I still use it on a number of applications that involve URL and domain checks and it works every time for me. – Nebulous
I don't have PHP to test your code. Does sub1.sub2.test.co.it work in your case? – Olathe
The code as it stands will work with 4 layers in the domain definition. You can easily extend this to 5 layers (as in your example) by changing the innermost if statement. – Nebulous
This is a nice little script that works for 95% of the cases. Thanks! I just wanted to point out that it will fail if the domain is 3 or fewer letters long (www.cnn.com), so be careful if you just copy and paste. The problem is that it's impossible to know if the domain is "www" with "cnn.com" as the TLD or "cnn" with "com" as the TLD. In this case it's obvious, but you need to know all the TLDs to know for certain. – Iterate
C
9
/^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$/
Chrisom answered 20/2, 2009 at 11:19 Comment(1)
I think the examples are there just to illustrate a general rule. This would only work on the input the OP gave. – Incidental
L
9

There are two ways

Using split

Then just parse that string

var domain;
//find & remove protocol (http, ftp, etc.) and get domain
if (url.indexOf('://') > -1) {
    domain = url.split('/')[2];
} else if (url.indexOf('//') === 0) {
    domain = url.split('/')[2];
} else {
    domain = url.split('/')[0];
}

//find & remove port number
domain = domain.split(':')[0];
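
The same host extraction can be sketched in Python with the standard library's `urllib.parse.urlsplit`, which already separates the scheme, credentials and port. The helper name `get_host` is made up for this example, and treating bare inputs as hosts by prepending `//` is an assumption about what the caller wants.

```python
from urllib.parse import urlsplit

def get_host(url):
    # urlsplit only recognises the host when "//" precedes it, so prepend
    # "//" to bare inputs such as "www.example.com/path".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    # .hostname drops any "user:pass@" prefix and ":port" suffix.
    return urlsplit(url).hostname
```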

Using Regex

 var r = /:\/\/(.[^/]+)/;
 "https://mcmap.net/q/18879/-get-the-domain-and-page-name-from-a-string-url".match(r)[1] 
 // => "mcmap.net"

Hope this helps

Lathrop answered 10/9, 2015 at 6:22 Comment(1)
This works but needs to have the protocol in the URL. – Tadeas
P
4

I don't know of any libraries, but the string manipulation of domain names is easy enough.

The hard part is knowing if the name is at the second or third level. For this you will need a data file you maintain (e.g. for .uk it is not always the third level; some organisations (e.g. bl.uk, jet.uk) exist at the second level).

The source of Firefox from Mozilla has such a data file, check the Mozilla licensing to see if you could reuse that.

Peltier answered 20/2, 2009 at 11:7 Comment(0)
S
4

It is not possible without using a TLD list to compare with, as there exist many cases like http://www.db.de/ or http://bbc.co.uk/ that will be interpreted by a regex as the domains db.de (correct) and co.uk (wrong).

But even with that you won't have success if your list does not contain SLDs, too. URLs like https://liverpool.gov.uk.com/ would be interpreted as gov.uk.com (wrong).

Because of that, all browsers use Mozilla's Public Suffix List: https://en.wikipedia.org/wiki/Public_Suffix_List

You can use it in your code by importing it through this URL: https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat
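
To illustrate how the rules in that file are applied, here is a minimal Python sketch. It uses a tiny hand-written excerpt in the list's format instead of downloading the file, and all function names are made up for this example; it is not the code from the forum post linked below.

```python
# Assumption: a tiny hand-picked excerpt in Public Suffix List format,
# standing in for the real public_suffix_list.dat file.
SAMPLE_RULES = """
// comment lines start with two slashes
com
uk
co.uk
in
co.in
*.ck
!www.ck
"""

def parse_rules(text):
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("//"):
            rules.add(line.split()[0])
    return rules

def public_suffix_length(labels, rules):
    """Return how many trailing labels form the public suffix."""
    best = 1  # default rule "*": the last label alone is a suffix
    for i in range(len(labels)):
        candidate = labels[i:]
        exact = ".".join(candidate)
        wildcard = ".".join(["*"] + candidate[1:])
        if "!" + exact in rules:
            return len(candidate) - 1  # exception rule wins
        if exact in rules or wildcard in rules:
            best = max(best, len(candidate))
    return best

def strip_www_and_suffix(host, rules):
    labels = host.lower().rstrip(".").split(".")
    rest = labels[:-public_suffix_length(labels, rules)]
    if rest and rest[0] == "www":
        rest = rest[1:]
    return ".".join(rest)
```

With these sample rules, `strip_www_and_suffix("www.mail.yahoo.co.in", parse_rules(SAMPLE_RULES))` yields `mail.yahoo`, which is the output the question asks for.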

Feel free to extend my function to extract the domain name, only. It won't use regex and it is fast: http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm#3471878

Stopcock answered 9/3, 2012 at 10:48 Comment(0)
P
3
from urllib.parse import urlparse  # Python 3; the original used the Python 2 urlparse module

GENERIC_TLDS = [
    'aero', 'asia', 'biz', 'com', 'coop', 'edu', 'gov', 'info', 'int', 'jobs', 
    'mil', 'mobi', 'museum', 'name', 'net', 'org', 'pro', 'tel', 'travel', 'cat'
    ]

def get_domain(url):
    hostname = urlparse(url.lower()).netloc
    if hostname == '':
        # Force the recognition as a full URL
        hostname = urlparse('http://' + url.lower()).netloc

    # Remove the 'user:passw' and ':port' parts
    hostname = hostname.split('@')[-1].split(':')[0]
    # Remove a leading 'www.' (lstrip('www.') would strip any leading 'w' and '.' characters)
    if hostname.startswith('www.'):
        hostname = hostname[len('www.'):]
    hostname = hostname.split('.')

    num_parts = len(hostname)
    if (num_parts < 3) or (len(hostname[-1]) > 2):
        return '.'.join(hostname[:-1])
    if len(hostname[-2]) > 2 and hostname[-2] not in GENERIC_TLDS:
        return '.'.join(hostname[:-1])
    if num_parts >= 3:
        return '.'.join(hostname[:-2])

This code isn't guaranteed to work with all URLs and doesn't filter those that are grammatically correct but invalid like 'example.uk'.

However it'll do the job in most cases.

Pyrology answered 19/7, 2010 at 3:21 Comment(0)
I
2

Basically, what you want is:

google.com        -> google.com    -> google
www.google.com    -> google.com    -> google
google.co.uk      -> google.co.uk  -> google
www.google.co.uk  -> google.co.uk  -> google
www.google.org    -> google.org    -> google
www.google.org.uk -> google.org.uk -> google

Optional:

www.google.com     -> google.com    -> www.google
images.google.com  -> google.com    -> images.google
mail.yahoo.co.uk   -> yahoo.co.uk   -> mail.yahoo
mail.yahoo.com     -> yahoo.com     -> mail.yahoo
www.mail.yahoo.com -> yahoo.com     -> mail.yahoo

You don't need to construct an ever-changing regex as 99% of domains will be matched properly if you simply look at the 2nd last part of the name:

(co|com|gov|net|org)

If it is one of these, then you need to match 3 dots, else 2. Simple. Now, my regex wizardry is no match for that of some other SO'ers, so the best way I've found to achieve this is with some code, assuming you've already stripped off the path:

 my @d=split /\./,$domain;                # split the domain part into an array
 $c=@d;                                   # count how many parts
 $dest=$d[$c-2].'.'.$d[$c-1];             # use the last 2 parts
 if ($d[$c-2]=~m/^(co|com|gov|net|org)$/) { # is the second-last part one of these?
   $dest=$d[$c-3].'.'.$dest;              # if so, add a third part
 };
 print $dest;                             # show it

To just get the name, as per your question:

 my @d=split /\./,$domain;                # split the domain part into an array
 $c=@d;                                   # count how many parts
 if ($d[$c-2]=~m/^(co|com|gov|net|org)$/) { # is the second-last part one of these?
   $dest=$d[$c-3];                        # if so, give the third last
   $dest=$d[$c-4].'.'.$dest if ($c>3);    # optional bit
 } else {
   $dest=$d[$c-2];                        # else the second last
   $dest=$d[$c-3].'.'.$dest if ($c>2);    # optional bit 
 };
 print $dest;                             # show it

I like this approach because it's maintenance-free. Unless you want to validate that it's actually a legitimate domain, but that's kind of pointless because you're most likely only using this to process log files and an invalid domain wouldn't find its way in there in the first place.

If you'd like to match "unofficial" subdomains such as bozo.za.net, or bozo.au.uk, bozo.msf.ru just add (za|au|msf) to the regex.

I'd love to see someone do all of this using just a regex, I'm sure it's possible.
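
One possible single-regex version of the heuristic above, sketched in Python. This is an illustration only: the names `NAME_RE` and `domain_name` are made up, and stripping a leading www. is an extra assumption taken from the question rather than from the Perl code.

```python
import re

NAME_RE = re.compile(
    r"^(?:www\.)?"                      # optional www prefix
    r"(.+?)"                            # the part we want, as short as possible
    r"\.(?:(?:co|com|gov|net|org)\.)?"  # optional generic second-last label
    r"[^.]+$",                          # final label
    re.IGNORECASE,
)

def domain_name(host):
    """Return the host without www. and without its one- or two-part ending."""
    match = NAME_RE.match(host)
    return match.group(1) if match else None
```

Like the Perl version, this only knows the five hard-coded second-level labels, so it shares the same limits.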

Impropriety answered 20/2, 2009 at 11:1 Comment(0)
D
2

Could you just look for the word before .com (or another TLD)? The alternatives in the TLD list would be ordered by the opposite of their frequency (see here).

Then take the first match, i.e.

window.location.host.match(/(\w|-)+(?=(\.(com|net|org|info|coop|int|co|ac|ie|co|ai|eu|ca|icu|top|xyz|tk|cn|ga|cf|nl|us|eu|de|hk|am|tv|bingo|blackfriday|gov|edu|mil|arpa|au|ru)(\.|\/|$)))/g)[0]

You can test it by copying this line into the developer console on any tab.

This example works in the following cases:

[screenshot of the matching cases; image not available]

Derogatory answered 21/4, 2021 at 10:44 Comment(0)
S
1

/[^w{3}\.]([a-zA-Z0-9]([a-zA-Z0-9\-]{0,65}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}/gim

This JavaScript regex ignores www and the following dot while keeping the domain intact. It also properly matches domains without www and with ccTLDs.

Selfdelusion answered 3/11, 2010 at 21:9 Comment(0)
E
1

I know you actually asked for a regex and did not specify a language, but in JavaScript you can do it like this. Maybe other languages can parse a URL in a similar way.

Easy JavaScript solution

const domain = (new URL(str)).hostname.replace(/^www\./, "");

I'm leaving this JavaScript solution here for completeness.

Ean answered 26/1, 2022 at 18:55 Comment(0)
D
0

So if you just have a string and not a window.location you could use...

String.prototype.toUrl = function(){

if(this.length === 0)
{
    return undefined;
}
var original = this.toString();
var s = original;
if(!original.toLowerCase().startsWith('http'))
{
    s = 'http://' + original;
}

s = s.split('/');

var protocol = s[0];
var host = s[2];
var relativePath = '';

if(s.length > 3){
    for(var i=3;i< s.length;i++)
    {
        relativePath += '/' + s[i];
    }
}

s = host.split('.');
var domain = s[s.length-2] + '.' + s[s.length-1];    

return {
    original: original,
    protocol: protocol,
    domain: domain,
    host: host,
    relativePath: relativePath,
    getParameter: function(param)
    {
        return this.getParameters()[param];
    },
    getParameters: function(){
        var vars = [], hash;
        var hashes = this.original.slice(this.original.indexOf('?') + 1).split('&');
        for (var i = 0; i < hashes.length; i++) {
            hash = hashes[i].split('=');
            vars.push(hash[0]);
            vars[hash[0]] = hash[1];
        }
        return vars;
    }
};};

How to use.

var str = "http://en.wikipedia.org/wiki/Knopf?q=1&t=2";
var url = str.toUrl();

var host = url.host;
var domain = url.domain;
var original = url.original;
var relativePath = url.relativePath;
var paramQ = url.getParameter('q');
var paramT = url.getParameter('t');
Deadradeadweight answered 28/2, 2013 at 16:26 Comment(0)
A
0

For a certain purpose I did this quick Python function yesterday. It returns domain from URL. It's quick and doesn't need any input file listing stuff. However, I don't pretend it works in all cases, but it really does the job I needed for a simple text mining script.

Output looks like this :

http://www.google.co.uk => google.co.uk
http://24.media.tumblr.com/tumblr_m04s34rqh567ij78k_250.gif => tumblr.com

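import re

def getDomain(url):
    parts = re.split(r"/", url)
    host = parts[2]
    match = re.match(r"([\w\-]+\.)*([\w\-]+\.\w{2,6}$)", host)
    if match is None:
        return ''
    if re.search(r"\.uk", host):
        # .uk hosts usually have a two-part suffix (e.g. co.uk), so keep three labels when possible
        uk_match = re.match(r"([\w\-]+\.)*([\w\-]+\.[\w\-]+\.\w{2,6}$)", host)
        if uk_match is not None:
            match = uk_match
    return match.group(2)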

Seems to work pretty well.
However, it has to be modified to remove domain extensions on output as you wished.

Alansen answered 15/6, 2013 at 8:27 Comment(0)
S
0
/^(?:https?:\/\/)?(?:www\.)?([^\/]+)/i
Schutzstaffel answered 1/4, 2015 at 22:38 Comment(1)
Generally, answers are much more helpful if they include an explanation of what the code is intended to do, and why that solves the problem without introducing others. This is especially true of regexen, which are notorious for being opaque line noise to most. Here, too, it's not especially clear that it solves the entirety of the problem, and since there are answers that do, and do so well, and with excellent explanations…. – Dorolisa
C
0
  1. How is this?

    =((?:(?:(?:http)s?:)?\/\/)?(?:(?:[a-zA-Z0-9]+)\.?)*(?:(?:[a-zA-Z0-9]+))\.[a-zA-Z0-9]{2,3}) (you may want to add "\/" to the end of the pattern)

  2. If your goal is to get rid of URLs passed in as a parameter, you may add the equal sign as the first character, like:

    =((?:(?:(?:http)s?:)?//)?(?:(?:[a-zA-Z0-9]+).?)*(?:(?:[a-zA-Z0-9]+)).[a-zA-Z0-9]{2,3}/)

    and replace with "/"

The goal of this example is to get rid of any domain name regardless of the form it appears in (i.e. to ensure URL parameters don't include domain names, to avoid XSS attacks).

Columbian answered 24/5, 2017 at 16:27 Comment(0)
S
0

All the answers here are very nice, but all of them will fail sometimes. I know it is not common to link to something already answered elsewhere, but it will save you from wasting your time on an impossible task: for domains like mydomain.co.uk there is no way to know, from the string alone, whether an extracted domain is correct. If you are extracting from URLs, which may have http, https or nothing in front, note that FILTER_VALIDATE_URL does not recognize a string that does not begin with http as a URL. So if inputs with nothing in front are possible, you have to remove the

filter_var($url, FILTER_VALIDATE_URL)

check from the code below. For URLs that do start with http or https, you can get by with something as simple as this:

$url = strtolower('hTTps://www.example.com/w3/forum/index.php');

if (filter_var($url, FILTER_VALIDATE_URL) && substr($url, 0, 4) == 'http') {
    // array order is important: the longer "www." prefixes must come first
    $domain = str_replace(array("http://www.", "https://www.", "http://", "https://"), array("", "", "", ""), $url);
    $spos = strpos($domain, '/');
    if ($spos !== false) {
        $domain = substr($domain, 0, $spos);
    }
} else {
    $domain = "can't extract a domain";
}

echo $domain;

Check FILTER_VALIDATE_URL default behavior here

But if you want to check a domain for validity, and ALWAYS be sure that the extracted value is correct, then you have to check against an array of valid top-level domains, as explained here: https://mcmap.net/q/18880/-how-to-validate-a-domain-name-using-regex-amp-php. Otherwise you will NEVER be sure that the extracted string is the correct domain. Unfortunately, all the answers here will fail sometimes.

P.S. The only answer here that makes sense to me is this one (sorry, I had not read it before). It provides the same solution, although without an example like the one above: https://mcmap.net/q/18847/-how-to-get-domain-name-from-url

Syringa answered 10/1, 2022 at 9:54 Comment(0)
B
0

In Javascript, the best way to do this is using the tld-extract npm package. Check out an example at the following link.

Below is the code for the same:

var tldExtract = require("tld-extract")

const urls = [
  'http://www.mail.yahoo.co.in/',
  'https://mail.yahoo.com/',
  'https://www.abc.au.uk',
  'https://github.com',
  'http://github.ca',
  'https://www.google.ru',
  'https://google.co.uk',
  'https://www.yandex.com',
  'https://yandex.ru',
]

const tldList = [];

urls.forEach(url => tldList.push(tldExtract(url)))

console.log({tldList})

which results in the following output:

0: Object {tld: "co.in", domain: "yahoo.co.in", sub: "www.mail"}
1: Object {tld: "com", domain: "yahoo.com", sub: "mail"}
2: Object {tld: "uk", domain: "au.uk", sub: "www.abc"}
3: Object {tld: "com", domain: "github.com", sub: ""}
4: Object {tld: "ca", domain: "github.ca", sub: ""}
5: Object {tld: "ru", domain: "google.ru", sub: "www"}
6: Object {tld: "co.uk", domain: "google.co.uk", sub: ""}
7: Object {tld: "com", domain: "yandex.com", sub: "www"}
8: Object {tld: "ru", domain: "yandex.ru", sub: ""}
Burchfield answered 7/2, 2023 at 15:37 Comment(0)
B
0

Found a custom function which works in most of the cases:

function getDomainWithoutSubdomain(url) {
    const urlParts = new URL(url).hostname.split('.')

    return urlParts
        .slice(0)
        .slice(-(urlParts.length === 4 ? 3 : 2))
        .join('.')

}
Burchfield answered 7/2, 2023 at 15:39 Comment(0)
U
-1

You need a list of what domain prefixes and suffixes can be removed. For example:

Prefixes:

  • www.

Suffixes:

  • .com
  • .co.in
  • .au.uk
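
As a rough illustration of this list-based approach, here is a minimal Python sketch. The two lists contain only the entries named above, and the helper name `strip_affixes` is made up for this example.

```python
# Only the prefixes and suffixes named in the answer; extend as needed.
PREFIXES = ["www."]
SUFFIXES = [".com", ".co.in", ".au.uk"]

def strip_affixes(host):
    for prefix in PREFIXES:
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    # Try longer suffixes first so a multi-part suffix is removed whole.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if host.endswith(suffix):
            host = host[:-len(suffix)]
            break
    return host
```

This reproduces the four rows of the table in the question, but any host with a prefix or suffix missing from the lists is returned partly or fully unchanged.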
Upbuild answered 20/2, 2009 at 11:15 Comment(1)
Works only for the samples, and maintaining such lists does not scale. – Subkingdom
I
-1
#!/usr/bin/perl -w
use strict;

my $url = $ARGV[0];
if($url =~ /([^:]*:\/\/)?([^\/]*\.)*([^\/\.]+)\.[^\/]+/g) {
  print $3;
}
Incoherent answered 23/3, 2010 at 4:2 Comment(4)
If you used characters other than a forward slash for the match operator, you wouldn't need so many escape characters and could make the regex more readable, e.g. $url =~ m{([^:]*://)?([^/]*\.)*([^/\.]+)\.[^/]+}. Not sure you want the looping operator (/g) either? – Velamen
True, although the big problem with my response is that it won't work for foreign domains since they don't follow the standard US format "xxx.(com|edu|org|etc)". So telegraph.co.uk won't match. Makes me think that you really do need to explicitly list out all of the various country codes in order to match something like that. – Incoherent
Or, since other people have already figured this stuff out, just use a module to do it, such as URI::Find - search.cpan.org/perldoc?URI::Find or if you just want a regex then search.cpan.org/perldoc?Regexp::Common::URI – Velamen
Of course, but when someone asks for a regex, it's always fun to work it out :) – Incoherent
B
-1

Just for knowledge:

'http://api.livreto.co/books'.replace(/^(https?:\/\/)([a-z]{3}[0-9]?\.)?(\w+)(\.[a-zA-Z]{2,3})(\.[a-zA-Z]{2,3})?.*$/, '$3$4$5');

// returns 'livreto.co'
Bibbie answered 1/10, 2015 at 18:46 Comment(0)
G
-1

I know the question is seeking a regex solution, but no single regex will cover every case.

I decided to write this method in Python, which only works with URLs that have a single subdomain (e.g. www.mydomain.co.uk) and not multi-level subdomains like www.mail.yahoo.com.

def urlextract(url):
  url_split=url.split(".")
  if len(url_split) <= 2:
      raise Exception("Full url required with subdomain:",url)
  return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}
Genitive answered 28/5, 2019 at 16:48 Comment(0)
M
-1

Let's say we have this: http://google.com

and you only want the domain name

let url = "http://google.com";
let domainName = url.split("://")[1];
console.log(domainName);
Muskrat answered 30/12, 2021 at 17:54 Comment(0)
R
-2

Use this (.)(.*?)(.) then just extract the leading and end points. Easy, right?

Restraint answered 3/8, 2015 at 2:59 Comment(0)
