How to normalize a URL in Java?
L

9

38

URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.

Strategies include adding trailing slashes, converting https to http, and so on. The Wikipedia page lists many.

Got a favorite method of doing this in Java? Perhaps a library (Nutch?), but I'm open. Smaller and fewer dependencies is better.

I'll handcode something for now and keep an eye on this question.

EDIT: I want to normalize aggressively, counting URLs as the same if they refer to the same content. For example, I ignore the parameters utm_source, utm_medium and utm_campaign, and I ignore the subdomain if the page title is the same.

Lunsford answered 7/6, 2010 at 22:33 Comment(0)
A
30

Have you taken a look at the URI class?

http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()
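As a rough illustration of what that method covers (the class name UriNormalizeDemo and the sample URLs are made up for this sketch): it resolves "." and ".." path segments, but leaves host case, query parameters and the fragment untouched.

```java
import java.net.URI;

public class UriNormalizeDemo {
    // Returns the string form of the URI after java.net.URI's built-in normalization.
    static String normalized(String url) {
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // "." and ".." segments are resolved.
        System.out.println(normalized("http://example.com/a/./b/../c"));
        // Host case, query parameters and the fragment are left as they are.
        System.out.println(normalized("http://Example.COM/x?utm_source=feed#top"));
    }
}
```

So anything beyond path cleanup, such as dropping tracking parameters, still has to be layered on top.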

Alleras answered 7/6, 2010 at 22:36 Comment(3)
Good one! However, it doesn't go nearly far enough for me. The first thing I did which helped was to pitch the following parameters: utm_source, utm_medium, utm_campaign. They are on lots of URLs in the wild, but removing them leaves the URLs semantically the same for purposes of analyzing which content they refer to.Lunsford
@Lunsford That's not necessarily true. There's nothing to stop a site from serving different content based on those parameters.Delitescent
Sure, but practically speaking, those are used by some marketing package (Google analytics?) to track campaigns, so they will not likely vary.Lunsford
T
21

I found this question last night, but none of the answers was what I was looking for, so I wrote my own. Here it is in case somebody wants it in the future:

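import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * - Convert the scheme and host to lowercase.
 * - Normalize the path (done by java.net.URI).
 * - Keep the port number only when it is not the default one.
 * - Remove the fragment (the part after the #).
 * - Remove trailing slash.
 * - Sort the query string params.
 * - Remove some query string params like "utm_*" and "*session*".
 */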
public class NormalizeURL
{
    public static String normalize(final String taintedURL) throws MalformedURLException
    {
        final URL url;
        try
        {
            url = new URI(taintedURL).normalize().toURL();
        }
        catch (URISyntaxException e) {
            throw new MalformedURLException(e.getMessage());
        }

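        // String.replace() treats "/$" literally; replaceAll() takes a regex, so this strips one trailing slash.
        final String path = url.getPath().replaceAll("/$", "");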
        final SortedMap<String, String> params = createParameterMap(url.getQuery());
        final int port = url.getPort();
        final String queryString;

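        if (params != null)
        {
            // Some params are only relevant for user tracking, so remove the most common ones.
            for (Iterator<String> i = params.keySet().iterator(); i.hasNext();)
            {
                final String key = i.next();
                if (key.startsWith("utm_") || key.contains("session"))
                {
                    i.remove();
                }
            }
            // Avoid emitting a bare "?" when every parameter was stripped.
            queryString = params.isEmpty() ? "" : "?" + canonicalize(params);
        }
        else
        {
            queryString = "";
        }

        // Lowercase the host explicitly, and drop the port when it is the
        // scheme's default (80 for http, 443 for https).
        return url.getProtocol().toLowerCase(Locale.ROOT) + "://" + url.getHost().toLowerCase(Locale.ROOT)
            + (port != -1 && port != url.getDefaultPort() ? ":" + port : "")
            + path + queryString;
    }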

    /**
     * Takes a query string, separates the constituent name-value pairs, and
     * stores them in a SortedMap ordered by lexicographical order.
     * @return Null if there is no query string.
     */
    private static SortedMap<String, String> createParameterMap(final String queryString)
    {
        if (queryString == null || queryString.isEmpty())
        {
            return null;
        }

        final String[] pairs = queryString.split("&");
        final Map<String, String> params = new HashMap<String, String>(pairs.length);

        for (final String pair : pairs)
        {
            if (pair.length() < 1)
            {
                continue;
            }

            String[] tokens = pair.split("=", 2);
            for (int j = 0; j < tokens.length; j++)
            {
                try
                {
                    tokens[j] = URLDecoder.decode(tokens[j], "UTF-8");
                }
                catch (UnsupportedEncodingException ex)
                {
                    ex.printStackTrace();
                }
            }
            switch (tokens.length)
            {
                case 1:
                {
                    if (pair.charAt(0) == '=')
                    {
                        params.put("", tokens[0]);
                    }
                    else
                    {
                        params.put(tokens[0], "");
                    }
                    break;
                }
                case 2:
                {
                    params.put(tokens[0], tokens[1]);
                    break;
                }
            }
        }

        return new TreeMap<String, String>(params);
    }

    /**
     * Canonicalize the query string.
     *
     * @param sortedParamMap Parameter name-value pairs in lexicographical order.
     * @return Canonical form of query string.
     */
    private static String canonicalize(final SortedMap<String, String> sortedParamMap)
    {
        if (sortedParamMap == null || sortedParamMap.isEmpty())
        {
            return "";
        }

        final StringBuilder sb = new StringBuilder(350);
        final Iterator<Map.Entry<String, String>> iter = sortedParamMap.entrySet().iterator();

        while (iter.hasNext())
        {
            final Map.Entry<String, String> pair = iter.next();
            sb.append(percentEncodeRfc3986(pair.getKey()));
            sb.append('=');
            sb.append(percentEncodeRfc3986(pair.getValue()));
            if (iter.hasNext())
            {
                sb.append('&');
            }
        }

        return sb.toString();
    }

    /**
     * Percent-encode values according to RFC 3986. The built-in Java URLEncoder does not encode
     * according to the RFC, so we make the extra replacements.
     *
     * @param string Decoded string.
     * @return Encoded string per RFC 3986.
     */
    private static String percentEncodeRfc3986(final String string)
    {
        try
        {
            return URLEncoder.encode(string, "UTF-8").replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
        }
        catch (UnsupportedEncodingException e)
        {
            return string;
        }
    }
}
Tullis answered 30/10, 2010 at 6:1 Comment(3)
Thanks for this, I like the approach, but I've found a few problems with the implementation: 1) A concurrent modification exception is raised in the loop removing utm_ and session keys (unless it's the last entry), since you're removing from the collection during iteration. You should use an iterator and the remove() method. 2) the re-escaping of the parameters breaks some websites I've tried. That's fine if you're just using the canonical version to compare URLs though, which is what I've ended up doing. I imagine removing the session token could also break some sites, so it's moot really.Bouchier
It's not good to strip the trailing slash from a URL. It makes a different URL in fact. For example Apache aliasing might not work if it's setup with a trailing slash.Proficient
I think it's not good to implement this type of functionality yourself. Too likely to forget some corner case.Chapeau
L
3

Because you also want to identify URLs which refer to the same content, I found this paper from the WWW2007 pretty interesting: Do Not Crawl in the DUST: Different URLs with Similar Text. It provides you with a nice theoretical approach.

Licensee answered 25/5, 2011 at 9:37 Comment(0)
C
3

No, there is nothing in the standard libraries to do this. Canonicalization includes things like decoding unnecessarily encoded characters, converting hostnames to lowercase, etc.

e.g. http://ACME.com/./foo%26bar becomes:

http://acme.com/foo&bar

URI's normalize() does not do this.
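A minimal sketch of that kind of syntax-based normalization, built on top of java.net.URI, might look like the following. The class SyntaxNormalizer and its method names are hypothetical, and it assumes an absolute URL with a host and no user-info part. Note one deliberate difference from the example above: RFC 3986 only treats escapes of unreserved characters (letters, digits, "-", ".", "_", "~") as safe to decode, so this sketch leaves %26 (a reserved "&") encoded and merely upper-cases the hex digits of such escapes.

```java
import java.net.URI;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SyntaxNormalizer {
    private static final Pattern ESCAPE = Pattern.compile("%([0-9a-fA-F]{2})");

    // Decodes escapes of unreserved characters and upper-cases the hex
    // digits of every other escape; all other text is copied unchanged.
    static String fixEscapes(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            boolean unreserved = (c < 128 && Character.isLetterOrDigit(c)) || "-._~".indexOf(c) >= 0;
            String repl = unreserved ? String.valueOf(c) : "%" + m.group(1).toUpperCase(Locale.ROOT);
            m.appendReplacement(out, Matcher.quoteReplacement(repl));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Assumes an absolute URL with a host and no user-info (assumption of this sketch).
    static String normalize(String url) {
        URI u = URI.create(url).normalize(); // resolves "." and ".." segments
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append("://");
        sb.append(u.getHost().toLowerCase(Locale.ROOT));
        if (u.getPort() != -1) sb.append(':').append(u.getPort());
        sb.append(fixEscapes(u.getRawPath()));
        if (u.getRawQuery() != null) sb.append('?').append(fixEscapes(u.getRawQuery()));
        if (u.getRawFragment() != null) sb.append('#').append(fixEscapes(u.getRawFragment()));
        return sb.toString();
    }
}
```

With this sketch, SyntaxNormalizer.normalize("http://ACME.com/./foo%26bar%7e") yields http://acme.com/foo%26bar~.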

Conchita answered 2/8, 2012 at 16:22 Comment(1)
new URI("http://ACME.com/./foo%26bar").normalize() results in http://ACME.com/foo%26bar. It doesn't transform host to lowercase, but handles correctly equality: new URI("http://ACME.com/./foo%26bar").normalize().equals(new URI("http://acme.com/foo%26bar"))Conventioner
S
3

The RL library: https://github.com/backchatio/rl goes quite a ways beyond java.net.URI.normalize(). It's in Scala, but I imagine it should be usable from Java.

Summon answered 15/11, 2012 at 23:12 Comment(0)
P
1

You can do this with the Restlet framework using Reference.normalize(). You should also be able to remove the elements you don't need quite conveniently with this class.

Postulate answered 8/7, 2010 at 23:23 Comment(0)
V
1

In Java, normalize parts of a URL

Example of a URL: https://i0.wp.com:55/lplresearch.com/wp-content/feb.png?ssl=1&myvar=2#myfragment

protocol:        https
domain name:     i0.wp.com
subdomain:       i0
port:            55
path:            /lplresearch.com/wp-content/feb.png
query:           ?ssl=1&myvar=2
parameters:      ssl=1, myvar=2
fragment:        #myfragment

Code to do the URL parsing:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlRegex {
    // Each helper returns the captured part, or null when the URL does not match.
    // matches() must be called before group(); otherwise group() throws IllegalStateException.
    public static String getProtocol(String the_url){
        Pattern p = Pattern.compile("^(http|https|smtp|ftp|file|pop)://.*");
        Matcher m = p.matcher(the_url);
        return m.matches() ? m.group(1) : null;
    }
    public static String getParameters(String the_url){
        Pattern p = Pattern.compile(".*(\\?[-a-zA-Z0-9_.@!$&''()*+,;=]+)(#.*)*$");
        Matcher m = p.matcher(the_url);
        return m.matches() ? m.group(1) : null;
    }
    public static String getFragment(String the_url){
        Pattern p = Pattern.compile(".*(#.*)$");
        Matcher m = p.matcher(the_url);
        return m.matches() ? m.group(1) : null;
    }
    public static void main(String[] args){
        String the_url =
            "https://i0.wp.com:55/lplresearch.com/" +
            "wp-content/feb.png?ssl=1&myvar=2#myfragment";
        System.out.println(getProtocol(the_url));
        System.out.println(getFragment(the_url));
        System.out.println(getParameters(the_url));
    }
}

Prints

https
#myfragment
?ssl=1&myvar=2

You can then push and pull on the parts of the URL until they pass muster.
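If hand-written regexes feel fragile, the same parts can probably be pulled out with the accessors on java.net.URI instead. This is only a sketch; the class name UriParts and the output format are made up here.

```java
import java.net.URI;

public class UriParts {
    // Formats the components of a URL as reported by java.net.URI's accessors.
    static String describe(String url) {
        URI u = URI.create(url);
        return "scheme=" + u.getScheme()
            + " host=" + u.getHost()
            + " port=" + u.getPort()
            + " path=" + u.getPath()
            + " query=" + u.getQuery()
            + " fragment=" + u.getFragment();
    }

    public static void main(String[] args) {
        System.out.println(describe(
            "https://i0.wp.com:55/lplresearch.com/wp-content/feb.png?ssl=1&myvar=2#myfragment"));
    }
}
```

Unlike the regex version, the accessors return the query and fragment without their leading "?" and "#".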

Velma answered 11/4, 2013 at 17:32 Comment(2)
Normalization/canonicalization refers to a transformation that ensures data that are defined to be semantically equivalent become identical. Stripping essential data is not normalization.Crore
Granted, but the official rules for "normalization" exist in conflict and are continuing to diverge, some out of malice and hostility under the general rules of data cyber warfare. And so differences that "normalize to the same" for you might be differences that introduce a breaking difference for someone else under a different country/culture/scheme. We have to hammer out disagreements such as: "Why does "ww3.whatever.com" normalize to the same with "btap7://ww9.whatever.drone", in Canada and Ukraine, but not in China over their content-censors under-sea cables?Velma
A
0
private String normalize(String path) {
  if (path != null) {
    String trim = path.trim();
    if (trim.endsWith("/")) {
        return trim.substring(0, trim.length() - 1);
    }
    return trim;
  }
  return path;
}

Alternatively, you can use Java's URI class, as below.

URI.create(paramString).normalize()
Amylase answered 5/1, 2024 at 11:17 Comment(0)
H
-5

I have a simple way to solve it. Here is my code:

public static String normalizeURL(String oldLink)
{
    int pos=oldLink.indexOf("://");
    String newLink="http"+oldLink.substring(pos);
    return newLink;
}
Hiedihiemal answered 11/9, 2018 at 16:48 Comment(1)
This just changes the protocol to http in all cases. I don't think you understand the question.Tillo
