Normalising possibly encoded URI strings in Java
Using Java, I want to strip the fragment identifier and do some simple normalisation (e.g., lowercase schemes, hosts) of a diverse set of URIs. The input and output URIs should be equivalent in a general HTTP sense.

Typically, this should be straightforward. However, for URIs like http://blah.org/A_%28Secret%29.xml#blah, in which the parentheses around Secret are percent-encoded, the behaviour of java.net.URI makes life difficult.

The normalisation method should return http://blah.org/A_%28Secret%29.xml, since the URIs http://blah.org/A_%28Secret%29.xml and http://blah.org/A_(Secret).xml are not equivalent in interpretation [§2.2; RFC3986].

So we have the two following normalisation methods:

URI u = new URI("http://blah.org/A_%28Secret%29.xml#blah");
System.out.println(u);
        // prints "http://blah.org/A_%28Secret%29.xml#blah"

String path1 = u.getPath();      // gives "/A_(Secret).xml" (escapes decoded)
String path2 = u.getRawPath();   // gives "/A_%28Secret%29.xml" (escapes kept)


//NORMALISE METHOD 1
URI norm1 = new URI(u.getScheme().toLowerCase(), u.getUserInfo(), 
                      u.getHost().toLowerCase(), u.getPort(), path1, 
                      u.getQuery(), null);
System.out.println(norm1);
// prints "http://blah.org/A_(Secret).xml"

//NORMALISE METHOD 2
URI norm2 = new URI(u.getScheme().toLowerCase(), u.getUserInfo(),
                      u.getHost().toLowerCase(), u.getPort(), path2, 
                      u.getQuery(), null);
System.out.println(norm2);
// prints "http://blah.org/A_%2528Secret%2529.xml"

As we see, the URI is parsed and rebuilt without the fragment identifier.

However, for method 1, u.getPath() returns the decoded path, so the percent-escapes are lost and the final URI changes.

For method 2, u.getRawPath() returns the original path, but when it is passed to the multi-argument URI constructor, Java percent-encodes each '%' again, which double-encodes the escapes.

This feels like a Chinese finger trap.

So two main questions:

  • Why does java.net.URI feel the need to play with encoding?
  • How can this normalise method be implemented without fiddling with the original percent encoding?

(I would rather not have to reimplement the parse/concatenate methods of java.net.URI, which are non-trivial.)


EDIT: Here's some further info from URI javadoc.

  • The single-argument constructor requires any illegal characters in its argument to be quoted and preserves any escaped octets and other characters that are present.

  • The multi-argument constructors quote illegal characters as required by the components in which they appear. The percent character ('%') is always quoted by these constructors. Any other characters are preserved.

  • The getRawUserInfo, getRawPath, getRawQuery, getRawFragment, getRawAuthority, and getRawSchemeSpecificPart methods return the values of their corresponding components in raw form, without interpreting any escaped octets. The strings returned by these methods may contain both escaped octets and other characters, and will not contain any illegal characters.

  • The getUserInfo, getPath, getQuery, getFragment, getAuthority, and getSchemeSpecificPart methods decode any escaped octets in their corresponding components. The strings returned by these methods may contain both other characters and illegal characters, and will not contain any escaped octets.

  • The toString method returns a URI string with all necessary quotation but which may contain other characters.

  • The toASCIIString method returns a fully quoted and encoded URI string that does not contain any other characters.

So I cannot use the multi-argument constructor without having the URL encoding messed with internally by the URI class. Pah!

Gama answered 23/2, 2012 at 19:15 Comment(2)
The use-case is a crawler. We would like to take a set of extracted URIs and "normalise" them to as small a set as possible, still ensuring that the retrieved content is guaranteed to be the same. (The question #2994149 is related but does not address the issue of stripping fragment IDs, with URL encoding changing.)Gama
I am a long way from being a URI expert, and I am not sure whether you need a standard solution based on the URI API, but if I just wanted this functionality implemented somehow, I would either 1) take the substring of the original URL up to the first occurrence of # or ? or &, since that is what separates the URL from the extra information, or 2) let URI create the normalised URI (norm2 in the example) and then replace each %<digits> sequence with the corresponding one from the original, in positional order (the 1st in norm2 with the 1st in the original, and so on). Of course, this applies only if the standard way is not usable.Wentz

This is because java.net.URI was introduced in Java 1.4 (released in 2002) and is based on RFC2396, which treats '(' and ')' as characters that do not need escaping. Under that RFC, escaping them does not change the semantics of the URI, and it even says they should not be escaped unless necessary (§2.3, RFC2396).

RFC3986 (published in 2005) changed this, and I guess the JDK developers decided not to change the behaviour of java.net.URI, to stay compatible with existing code.
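To make the RFC2396 behaviour concrete, here is a small sketch using only java.net.URI. The class and method names (Rfc2396Marks, rawPathOf, decodedPathOf) are made up for illustration. It suggests that the multi-argument constructors leave parentheses alone but always quote '%' and illegal characters such as spaces, while getPath() decodes every escape it finds.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; the names are illustrative only.
final class Rfc2396Marks {

    // Raw path produced when a decoded path goes through a multi-argument constructor.
    static String rawPathOf(String decodedPath) {
        try {
            return new URI(null, null, decodedPath, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decoded path returned after a single-argument parse of a raw path.
    static String decodedPathOf(String rawPath) {
        return URI.create(rawPath).getPath();
    }

    public static void main(String[] args) {
        System.out.println(rawPathOf("/A_(Secret).xml"));         // prints "/A_(Secret).xml"
        System.out.println(rawPathOf("/a b"));                    // prints "/a%20b"
        System.out.println(rawPathOf("/100%"));                   // prints "/100%25"
        System.out.println(decodedPathOf("/A_%28Secret%29.xml")); // prints "/A_(Secret).xml"
    }
}
```

So a decode followed by a re-encode through these constructors cannot restore %28 and %29, which is why neither method in the question round-trips.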

After some searching, I found that Jena IRI looks promising:

import java.util.ArrayList;

// Package name as used by recent Apache Jena releases (assumption);
// older releases used com.hp.hpl.jena.iri instead.
import org.apache.jena.iri.IRI;
import org.apache.jena.iri.IRIFactory;

public class IRITest {
    public static void main(String[] args) {
        IRIFactory factory = IRIFactory.uriImplementation();

        // Percent-encoded parentheses
        IRI iri = factory.construct("http://blah.org/A_%28Secret%29.xml#blah");
        ArrayList<String> a = new ArrayList<String>();
        a.add(iri.getScheme());
        a.add(iri.getRawUserinfo());
        a.add(iri.getRawHost());
        a.add(iri.getRawPath());
        a.add(iri.getRawQuery());
        a.add(iri.getRawFragment());

        // Literal parentheses
        IRI iri2 = factory.construct("http://blah.org/A_(Secret).xml#blah");
        ArrayList<String> b = new ArrayList<String>();
        b.add(iri2.getScheme());
        b.add(iri2.getRawUserinfo());
        b.add(iri2.getRawHost());
        b.add(iri2.getRawPath());
        b.add(iri2.getRawQuery());
        b.add(iri2.getRawFragment());

        System.out.println(a);
        //[http, null, blah.org, /A_%28Secret%29.xml, null, blah]
        System.out.println(b);
        //[http, null, blah.org, /A_(Secret).xml, null, blah]
    }
}
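If adding Jena is not an option, one way around the double-encoding is to avoid the multi-argument constructors entirely: read the raw components, concatenate them by hand, and re-parse the result with the single-argument form, which preserves existing escapes. The following is only a minimal sketch under that assumption. UriNormaliser and normalise are hypothetical names, and it handles only absolute, server-based URIs such as http and https.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of the JDK or of Jena.
final class UriNormaliser {

    // Drops the fragment and lowercases scheme and host. Userinfo, path and
    // query are copied in raw form, so existing percent-escapes are untouched.
    static String normalise(String input) {
        URI u = URI.create(input); // single-argument parse preserves escapes
        if (u.getScheme() == null || u.getHost() == null) {
            throw new IllegalArgumentException("not an absolute server-based URI: " + input);
        }
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append("://");
        if (u.getRawUserInfo() != null) {
            sb.append(u.getRawUserInfo()).append('@');
        }
        sb.append(u.getHost().toLowerCase(Locale.ROOT));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        if (u.getRawPath() != null) {
            sb.append(u.getRawPath());
        }
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        // Re-parse with the single-argument form to validate the rebuilt string.
        return URI.create(sb.toString()).toString();
    }

    public static void main(String[] args) {
        System.out.println(normalise("HTTP://Blah.ORG/A_%28Secret%29.xml#blah"));
        // prints "http://blah.org/A_%28Secret%29.xml"
        System.out.println(normalise("http://blah.org/A_(Secret).xml#blah"));
        // prints "http://blah.org/A_(Secret).xml"
    }
}
```

The two inputs stay distinct after normalisation, which is the behaviour the question asks for. Opaque URIs (for example mailto:) and registry-based authorities are rejected here rather than guessed at.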
Merline answered 3/3, 2012 at 16:53 Comment(0)

Note this passage at the end of [§2.2; RFC3986]:

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component. If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character's encoding in US-ASCII.

So, as long as the scheme is http or https, the encoding is the correct behavior.

Try using the toASCIIString method instead of toString for printing the URI. E.g.:

System.out.println(norm1.toASCIIString());
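As a quick check of what toASCIIString actually changes, here is a small sketch. AsciiStringCheck and render are hypothetical names, and the café path is an invented example. As far as the Javadoc describes it, toASCIIString only encodes non-ASCII "other" characters, so for a path made of ASCII characters such as parentheses it should produce the same string as toString.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; the names are illustrative only.
final class AsciiStringCheck {

    // Returns { toString(), toASCIIString() } for a URI built from a decoded path.
    static String[] render(String decodedPath) {
        try {
            URI u = new URI("http", null, "blah.org", -1, decodedPath, null, null);
            return new String[] { u.toString(), u.toASCIIString() };
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String[] plain = render("/A_(Secret).xml");
        System.out.println(plain[0].equals(plain[1])); // prints "true"

        String[] accented = render("/caf\u00e9.xml");
        System.out.println(accented[1]); // prints "http://blah.org/caf%C3%A9.xml"
    }
}
```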
Omidyar answered 23/2, 2012 at 19:37 Comment(1)
Thanks for the info! Not sure I agree with your interpretation of the passage. This part: "unless these characters are specifically allowed by the URI scheme to represent data in that component" suggests that it is not necessary for HTTP/HTTPS which allow, e.g., "()" chars. In any case, the question becomes moot for a crawler if you consider the passage "Percent-encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications.". (The toASCIIString method has no effect here.)Gama
