Implementing Public Suffix extraction using Java

Asked 27/1, 2011 at 17:33 Answered 15/7, 2014 at 9:31

I need to extract the top domain of a URL, and I found this: http://publicsuffix.org/index.html

The Java implementation is in http://guava-libraries.googlecode.com, but I could not find any example of extracting the domain name.

For example:
example.google.com
returns google.com

and bing.bing.bing.com
returns bing.com

Can anyone tell me how to implement this using the library, with an example?

Shaina answered 27/1, 2011 at 17:33 Comment(5)

So, you're looking to extract TLD (the .com part) and SLD (the google or bing part) from URLs? – Rigadoon 27/1, 2011 at 17:36

If you just want the last two parts of the domain, couldn't you just String.split("\\.") to get the parts and return the last two? Or do a String.substring(indexOfPenultimatePeriod) after (easily) working out the appropriate index? What is the complexity here? – Gratin 27/1, 2011 at 17:37

@Andrzej Doyle Yes, you are right. It is a list of 10k URLs with different suffixes, such as .com, .com.jp, .org, .com.in, etc. – Shaina 27/1, 2011 at 17:40

@Shaina - good point, you should add those cases to the examples. The only way to cope with this is to have a list of definitive TLDs, and match the end of your domain string against them. – Gratin 27/1, 2011 at 18:6

@ramuvan: Guava does have a solution that makes this easy... see my answer. – Sibert 27/1, 2011 at 19:13

It looks to me like InternetDomainName.topPrivateDomain() does exactly what you want. Guava maintains a list of public suffixes (based on Mozilla's list at publicsuffix.org) that it uses to determine what the public suffix part of the host is... the top private domain is the public suffix plus its first child.

Here's a quick example:

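import com.google.common.collect.ImmutableList;
import com.google.common.net.InternetDomainName;
import java.net.URI;
import java.net.URISyntaxException;

public class Test {
  public static void main(String[] args) throws URISyntaxException {
    ImmutableList<String> urls = ImmutableList.of(
        "http://example.google.com", "http://google.com",
        "http://bing.bing.bing.com", "http://www.amazon.co.jp/");
    for (String url : urls) {
      System.out.println(url + " -> " + getTopPrivateDomain(url));
    }
  }

  private static String getTopPrivateDomain(String url) throws URISyntaxException {
    // Let java.net.URI parse the URL instead of splitting the string by hand.
    String host = new URI(url).getHost();
    InternetDomainName domainName = InternetDomainName.from(host);
    // toString() returns the domain as a string; the older name() accessor
    // was removed from later Guava releases.
    return domainName.topPrivateDomain().toString();
  }
}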

Running this code prints:

http://example.google.com -> google.com
http://google.com -> google.com
http://bing.bing.bing.com -> bing.com
http://www.amazon.co.jp/ -> amazon.co.jp
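
To make the "public suffix plus its first child" rule concrete, here is a rough sketch that uses only the standard library. It is not how Guava implements it: the class name, the method name, and the tiny hard-coded suffix set are all made up for illustration, and the wildcard and exception rules of the real list at publicsuffix.org are ignored.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Stdlib-only illustration of "public suffix plus its first child". */
final class TopPrivateDomainSketch {

  // Hypothetical sample only; a real list would be loaded from publicsuffix.org.
  private static final Set<String> SAMPLE_SUFFIXES = new HashSet<>(
      Arrays.asList("com", "jp", "co.jp", "blogspot.com", "s3.amazonaws.com"));

  /** Returns the top private domain, or null if the sample list yields none. */
  static String topPrivateDomain(String host) {
    String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
    // Try candidates from longest to shortest, so the first hit is the
    // longest matching public suffix ("co.jp" is tried before "jp").
    for (int i = 0; i < labels.length; i++) {
      String candidate =
          String.join(".", Arrays.copyOfRange(labels, i, labels.length));
      if (SAMPLE_SUFFIXES.contains(candidate)) {
        if (i == 0) {
          return null; // the host itself is a public suffix
        }
        // Prepend the single label immediately to the left of the suffix.
        return labels[i - 1] + "." + candidate;
      }
    }
    return null; // no suffix from the sample list matched
  }

  public static void main(String[] args) {
    String[] hosts = {
        "example.google.com", "bing.bing.bing.com", "www.amazon.co.jp",
        "myblog.blogspot.com", "s3.amazonaws.com"};
    for (String host : hosts) {
      System.out.println(host + " -> " + topPrivateDomain(host));
    }
  }
}
```

Trying the longest candidate first is what keeps www.amazon.co.jp from being cut down to co.jp. In this sketch a host that is itself on the list has no top private domain, so the method returns null for it.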

Sibert answered 27/1, 2011 at 19:9 Comment(5)

TLD and Public Suffix are not the same. For example http://myblog.blogspot.com -> myblog.blogspot.com. Read this for further details – Guinna 3/3, 2014 at 17:47

Do you know why s3.amazonaws.com returns a null? – Flanders 14/11, 2014 at 16:15

@Liquid: s3.amazonaws.com is itself a public suffix: publicsuffix.org/list/effective_tld_names.dat – Sibert 14/11, 2014 at 17:11

Sorry, it works well... I implemented it the wrong way. – Flanders 14/11, 2014 at 17:40

[javac] symbol  : method name()     [javac] location: class com.google.common.net.InternetDomainName     [javac]                             this.domain = domainName.topPrivateDomain().name();     [javac]                                                                        ^     [javac] Note: Some input files use unchecked or unsafe operations.     [javac] Note: Recompile with -Xlint:unchecked for details.     [javac] 1 error

What does this error mean? Why the .name() method? – Flanders 14/11, 2014 at 19:29

I recently implemented a Public Suffix List API:

PublicSuffixList suffixList = new PublicSuffixListFactory().build();

assertEquals(
    "google.com", suffixList.getRegistrableDomain("example.google.com"));

assertEquals(
    "bing.com", suffixList.getRegistrableDomain("bing.bing.bing.com"));

assertEquals(
    "amazon.co.jp", suffixList.getRegistrableDomain("www.amazon.co.jp"));

Buckels answered 15/7, 2014 at 9:31 Comment(2)

Do you know why s3.amazonaws.com returns a null? – Flanders 14/11, 2014 at 16:15

The PSL considers s3.amazonaws.com to be a public suffix. – Buckels 19/11, 2014 at 15:56

EDIT: Sorry, I was a little too fast. I didn't think of co.jp, co.uk, and so on. You will need to get a list of possible TLDs from somewhere. You could also take a look at http://commons.apache.org/validator/ to validate a TLD.

I think something like this should work, though maybe a standard Java function for it exists:

String url = "http://www.foobar.com/someFolder/index.html";
// Strip the scheme, if present.
if (url.contains("://")) {
  url = url.split("://")[1];
}

// Strip the path, if present.
if (url.contains("/")) {
  url = url.split("/")[0];
}

// You need to get your TLDs from somewhere (entries without a leading dot)...
List<String> magicListofTLD = getTLDsFromSomewhere();

// Match only at the end of the host, and prefer the longest suffix
// so that "co.jp" wins over "jp".
String usedTLD = null;
for (String tld : magicListofTLD) {
  if (url.endsWith("." + tld)
      && (usedTLD == null || tld.length() > usedTLD.length())) {
    usedTLD = tld;
  }
}

if (usedTLD == null) {
  return;
}
// Drop the suffix and its leading dot, then keep the last remaining label.
url = url.substring(0, url.length() - usedTLD.length() - 1);
String[] strings = url.split("\\.");

String foo = strings[strings.length - 1] + "." + usedTLD;
System.out.println(foo);

Barony answered 27/1, 2011 at 17:48 Comment(3)

yeah, sorry, didn't think of co.jp, co.uk and so on. I guess you have to get a list of possible TLDs and try to match them with the String. – Barony 27/1, 2011 at 18:0

Guava has built in functionality for doing this, including an internal TLD list that will be updated with new releases as the TLD list changes. On top of that, Java has built in functionality for parsing and getting the host part of a URL... I don't think parsing it out manually with split is a good idea. – Sibert 27/1, 2011 at 19:16

@ColinD: Nice library. Didn't know of it. – Barony 27/1, 2011 at 19:27
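
To illustrate the comment above about Java's built-in URL parsing, here is a minimal sketch that gets the host with java.net.URI instead of split. The class and method names are made up for this example, and the URL with a user name and port is an invented input added to show a case where manual splitting would leave extra pieces attached to the host.

```java
import java.net.URI;

final class HostExtractionSketch {

  /** Returns the host part of the URL, or null if the URI has no host. */
  static String hostOf(String url) {
    // URI.create throws IllegalArgumentException for syntactically invalid input.
    return URI.create(url).getHost();
  }

  public static void main(String[] args) {
    // Scheme and path are dropped by the parser.
    System.out.println(hostOf("http://www.foobar.com/someFolder/index.html"));
    // User info and port are not part of the host either.
    System.out.println(hostOf("https://user@example.google.com:8443/a?b=c"));
    // Without a scheme this is parsed as a relative path, so there is no host.
    System.out.println(hostOf("www.foobar.com/index.html"));
  }
}
```

The last call returns null rather than a host name, so input without a scheme needs to be handled before the result is passed on to any suffix lookup.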
