Implementing Public Suffix extraction using Java

3

5

I need to extract the top domain from a URL, and I found the Public Suffix List at http://publicsuffix.org/index.html

There is a Java implementation in Guava (http://guava-libraries.googlecode.com), but I could not find any example showing how to extract the domain name.

For example:
example.google.com
returns google.com

and bing.bing.bing.com
returns bing.com

Can anyone show me, with an example, how to do this using this library?

Shaina answered 27/1, 2011 at 17:33 Comment(5)
So, you're looking to extract the TLD (the .com part) and the SLD (the google or bing part) from URLs? - Rigadoon
If you just want the last two parts of the domain, couldn't you just String.split("\\.") to get the parts and return the last two? Or do a String.substring(indexOfPenultimatePeriod) after (easily) working out the appropriate index? What is the complexity here? - Gratin
@Andrzej Doyle Yes, you are right, but this is a list of 10k URLs with different suffixes such as .com, .com.jp, .org, .com.in, etc. - Shaina
@Shaina - good point, you should add those cases to the examples. The only way to cope with this is to have a definitive list of TLDs and match the end of your domain string against them. - Gratin
@ramuvan: Guava does have a solution that makes this easy... see my answer. - Sibert
18

It looks to me like InternetDomainName.topPrivateDomain() does exactly what you want. Guava maintains a list of public suffixes (based on Mozilla's list at publicsuffix.org) that it uses to determine what the public suffix part of the host is... the top private domain is the public suffix plus its first child.

Here's a quick example:

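import com.google.common.collect.ImmutableList;
import com.google.common.net.InternetDomainName;
import java.net.URI;
import java.net.URISyntaxException;

public class Test {
  public static void main(String[] args) throws URISyntaxException {
    ImmutableList<String> urls = ImmutableList.of(
        "http://example.google.com", "http://google.com",
        "http://bing.bing.bing.com", "http://www.amazon.co.jp/");
    for (String url : urls) {
      System.out.println(url + " -> " + getTopPrivateDomain(url));
    }
  }

  private static String getTopPrivateDomain(String url) throws URISyntaxException {
    // Let java.net.URI parse the URL and hand back only the host part.
    String host = new URI(url).getHost();
    InternetDomainName domainName = InternetDomainName.from(host);
    // toString() returns the domain name; the older name() accessor
    // is no longer available in later Guava releases.
    return domainName.topPrivateDomain().toString();
  }
}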

Running this code prints:

http://example.google.com -> google.com
http://google.com -> google.com
http://bing.bing.bing.com -> bing.com
http://www.amazon.co.jp/ -> amazon.co.jp
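
To make the "public suffix plus its first child" rule concrete, here is a minimal sketch that does not use Guava at all. It is not how Guava implements it: the class name TopPrivateDomainSketch and the tiny SAMPLE_SUFFIXES set are made up for illustration, and the wildcard and exception rules of the real Public Suffix List are ignored.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

final class TopPrivateDomainSketch {
  // Hypothetical, hand-picked subset of public suffixes; the real list is far larger.
  private static final Set<String> SAMPLE_SUFFIXES =
      Set.of("com", "org", "jp", "co.jp", "in", "co.in", "s3.amazonaws.com");

  /** Returns the longest matching suffix plus one more label, or empty if there is none. */
  static Optional<String> topPrivateDomain(String host) {
    List<String> labels = Arrays.asList(host.toLowerCase(Locale.ROOT).split("\\."));
    // Candidates shrink from the left, so the first hit is the longest matching suffix.
    for (int i = 0; i < labels.size(); i++) {
      String candidate = String.join(".", labels.subList(i, labels.size()));
      if (SAMPLE_SUFFIXES.contains(candidate)) {
        // i == 0 means the host itself is a public suffix: it has no private domain.
        return i == 0
            ? Optional.empty()
            : Optional.of(String.join(".", labels.subList(i - 1, labels.size())));
      }
    }
    return Optional.empty(); // no suffix from the sample set matched
  }

  public static void main(String[] args) {
    for (String host : List.of("example.google.com", "bing.bing.bing.com",
        "www.amazon.co.jp", "s3.amazonaws.com")) {
      System.out.println(host + " -> " + topPrivateDomain(host).orElse("(none)"));
    }
  }
}
```

With this sample set, a host that is itself listed as a suffix (such as s3.amazonaws.com) has no top private domain, so the sketch returns an empty Optional for it.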
Sibert answered 27/1, 2011 at 19:9 Comment(5)
TLD and Public Suffix are not the same. For example, http://myblog.blogspot.com -> myblog.blogspot.com. Read this for further details - Guinna
Do you know why s3.amazonaws.com returns a null? - Flanders
@Liquid: s3.amazonaws.com is itself a public suffix: publicsuffix.org/list/effective_tld_names.dat - Sibert
Sorry, it works well... I had implemented it the wrong way. - Flanders
[javac] symbol : method name()
[javac] location: class com.google.common.net.InternetDomainName
[javac] this.domain = domainName.topPrivateDomain().name();
[javac] ^
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 1 error
What does this error mean? Why the .name() method? - Flanders
2

I recently implemented a Public Suffix List API:

PublicSuffixList suffixList = new PublicSuffixListFactory().build();

assertEquals(
    "google.com", suffixList.getRegistrableDomain("example.google.com"));

assertEquals(
    "bing.com", suffixList.getRegistrableDomain("bing.bing.bing.com"));

assertEquals(
    "amazon.co.jp", suffixList.getRegistrableDomain("www.amazon.co.jp"));
Buckels answered 15/7, 2014 at 9:31 Comment(2)
Do you know why s3.amazonaws.com returns a null? - Flanders
The PSL considers s3.amazonaws.com a public suffix. - Buckels
1

EDIT: Sorry, I was a little too fast. I didn't think of co.jp, co.uk, and so on. You will need to get a list of possible TLDs from somewhere. You could also take a look at http://commons.apache.org/validator/ to validate a TLD.

I think something like this should work, though there may be a standard Java function for it:

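// Snippet: assumes it runs inside a void method.
String url = "http://www.foobar.com/someFolder/index.html";
// Strip the scheme, if present.
if (url.contains("://")) {
  url = url.split("://")[1];
}

// Strip the path, leaving only the host.
if (url.contains("/")) {
  url = url.split("/")[0];
}

// You need to get your TLDs from somewhere...
// Entries are expected without a leading dot, e.g. "com" or "co.jp".
List<String> magicListofTLD = getTLDsFromSomewhere();

// Pick the longest TLD the host ends with, so "co.jp" wins over "jp".
// Matching with endsWith avoids hits in the middle of the host.
String usedTLD = null;
for (String tld : magicListofTLD) {
  if (url.endsWith("." + tld)
      && (usedTLD == null || tld.length() > usedTLD.length())) {
    usedTLD = tld;
  }
}

if (usedTLD == null) {
  return; // no known TLD matched
}
// Drop "." + TLD, then keep the last remaining label.
url = url.substring(0, url.length() - usedTLD.length() - 1);
String[] strings = url.split("\\.");

String foo = strings[strings.length - 1] + "." + usedTLD;
System.out.println(foo);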
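
The scheme and path stripping at the top of the snippet can also be left to java.net.URI instead of split. The following is only a sketch of that step: the class name HostExtractionSketch and the helper hostOf are made up for illustration, and returning null for unparseable input is an assumption about what a caller would want.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class HostExtractionSketch {
  /** Returns the lower-cased host of the URL, or null if none can be determined. */
  static String hostOf(String url) {
    try {
      // getHost() ignores the scheme, port, path and query; it is null
      // when the input has no authority part (e.g. no "http://" prefix).
      String host = new URI(url.trim()).getHost();
      return host == null ? null : host.toLowerCase(Locale.ROOT);
    } catch (URISyntaxException e) {
      return null; // not a syntactically valid URI
    }
  }

  public static void main(String[] args) {
    System.out.println(hostOf("http://www.foobar.com/someFolder/index.html"));
    System.out.println(hostOf("http://WWW.Foobar.com:8080/x?y=1"));
  }
}
```

The returned host can then be matched against the TLD list exactly as in the loop above.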
Barony answered 27/1, 2011 at 17:48 Comment(3)
Yeah, sorry, I didn't think of co.jp, co.uk and so on. I guess you have to get a list of possible TLDs and try to match them against the String. - Barony
Guava has built-in functionality for doing this, including an internal TLD list that will be updated with new releases as the TLD list changes. On top of that, Java has built-in functionality for parsing a URL and getting its host part... I don't think parsing it out manually with split is a good idea. - Sibert
@ColinD: Nice library. Didn't know of it. - Barony
