Using Java, I want to strip the fragment identifier and do some simple normalisation (e.g., lowercase schemes, hosts) of a diverse set of URIs. The input and output URIs should be equivalent in a general HTTP sense.
Typically, this should be straightforward. However, for URIs like http://blah.org/A_%28Secret%29.xml#blah, which percent-encodes (Secret), the behaviour of java.net.URI makes life difficult.
The normalisation method should return http://blah.org/A_%28Secret%29.xml, since the URIs http://blah.org/A_%28Secret%29.xml and http://blah.org/A_(Secret).xml are not equivalent in interpretation [§2.2; RFC 3986].
So we have the two following normalisation methods:
import java.net.URI;

// Fragment: run inside a method that declares "throws URISyntaxException".
URI u = new URI("http://blah.org/A_%28Secret%29.xml#blah");
System.out.println(u);
// prints "http://blah.org/A_%28Secret%29.xml#blah"
String path1 = u.getPath();    // decoded path: "/A_(Secret).xml"
String path2 = u.getRawPath(); // raw path: "/A_%28Secret%29.xml"

// NORMALISE METHOD 1: rebuild from the decoded path
URI norm1 = new URI(u.getScheme().toLowerCase(), u.getUserInfo(),
        u.getHost().toLowerCase(), u.getPort(), path1,
        u.getQuery(), null);
System.out.println(norm1);
// prints "http://blah.org/A_(Secret).xml"

// NORMALISE METHOD 2: rebuild from the raw path
URI norm2 = new URI(u.getScheme().toLowerCase(), u.getUserInfo(),
        u.getHost().toLowerCase(), u.getPort(), path2,
        u.getQuery(), null);
System.out.println(norm2);
// prints "http://blah.org/A_%2528Secret%2529.xml"
As we see, the URI is parsed and rebuilt without the fragment identifier.
However, for method 1, u.getPath() returns the decoded path, which changes the final URI.
For method 2, u.getRawPath() returns the original path, but when it is passed to the multi-argument URI constructor, the '%' characters are themselves percent-encoded, so the result is double-encoded.
This feels like a Chinese finger trap.
So two main questions:
- Why does java.net.URI feel the need to play with encoding?
- How can this normalise method be implemented without fiddling with the original percent encoding?
(I would rather not have to implement the parse/concatenate methods of java.net.URI, which are non-trivial.)
EDIT: Here's some further info from the URI javadoc:
The single-argument constructor requires any illegal characters in its argument to be quoted and preserves any escaped octets and other characters that are present.
The multi-argument constructors quote illegal characters as required by the components in which they appear. The percent character ('%') is always quoted by these constructors. Any other characters are preserved.
The getRawUserInfo, getRawPath, getRawQuery, getRawFragment, getRawAuthority, and getRawSchemeSpecificPart methods return the values of their corresponding components in raw form, without interpreting any escaped octets. The strings returned by these methods may contain both escaped octets and other characters, and will not contain any illegal characters.
The getUserInfo, getPath, getQuery, getFragment, getAuthority, and getSchemeSpecificPart methods decode any escaped octets in their corresponding components. The strings returned by these methods may contain both other characters and illegal characters, and will not contain any escaped octets.
The toString method returns a URI string with all necessary quotation but which may contain other characters.
The toASCIIString method returns a fully quoted and encoded URI string that does not contain any other characters.
So I cannot use the multi-argument constructor without having the URL encoding messed with internally by the URI class. Pah!
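One possible way around this, going by the javadoc note above that the single-argument constructor preserves escaped octets, is to reassemble the URI string from the raw components and parse it once. The following is only a minimal sketch: the class and method names (UriNormaliser, normalise) are made up for illustration, and it assumes that lowercasing the scheme and host and dropping the fragment is all the normalisation wanted.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper class; not part of java.net.URI.
final class UriNormaliser {

    // Rebuilds the URI from its raw (still percent-encoded) components,
    // lowercasing the scheme and host and leaving out the fragment.
    static URI normalise(URI u) throws URISyntaxException {
        StringBuilder sb = new StringBuilder();
        if (u.getScheme() != null) {
            sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append(':');
        }
        if (u.isOpaque()) {
            // Opaque URIs (e.g. mailto:) have no path or host to pick apart.
            sb.append(u.getRawSchemeSpecificPart());
            return new URI(sb.toString());
        }
        if (u.getHost() != null) {
            // Server-based authority: userinfo, host and port are available.
            sb.append("//");
            if (u.getRawUserInfo() != null) {
                sb.append(u.getRawUserInfo()).append('@');
            }
            sb.append(u.getHost().toLowerCase(Locale.ROOT));
            if (u.getPort() != -1) {
                sb.append(':').append(u.getPort());
            }
        } else if (u.getRawAuthority() != null) {
            // Registry-based authority: copy it through unchanged.
            sb.append("//").append(u.getRawAuthority());
        }
        if (u.getRawPath() != null) {
            sb.append(u.getRawPath());
        }
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        // The single-argument constructor keeps existing escapes as they are.
        return new URI(sb.toString());
    }

    public static void main(String[] args) throws URISyntaxException {
        URI u = new URI("HTTP://Blah.ORG/A_%28Secret%29.xml#blah");
        System.out.println(normalise(u));
        // prints "http://blah.org/A_%28Secret%29.xml"
    }
}
```

Because only the raw getters are used, nothing is decoded and re-encoded along the way, so the %28 and %29 escapes survive untouched. Whether this covers every URI in a diverse input set is untested here.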
# or ? or & since that is what actually separates the URL from extra info, or 2) let the URI class create the normalised URI (norm2 in the example) and then replace all the %<digits> with the original ones in positional sequence (1st of norm2 with 1st of the original, etc.). Of course, this is only if the standard way is not usable. – Wentz