How do you unescape URLs in Java?

4

39

When I read the XML through a URL's InputStream and then cut out everything except the URL, I get "http://cliveg.bu.edu/people/sganguly/player/%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3".

As you can see, there are a lot of "%20"s.

I want the URL to be unescaped.

Is there any way to do this in Java, without using a third-party library?

Tva answered 8/3, 2009 at 16:46 Comment(3)
Just to be pedantic, there is no such thing as "normal unicode". UTF8 is one of several ways to represent unicode text. But there is no "true" canonical representation.Fantastic
As Jon and ng said, this has nothing to do with Unicode or UTF-8. You might want to change the title.Consolation
The answer now marked as correct is clearly wrong and should be removed.Bicker
68

This is not escaped XML; it is URL-encoded text. It looks to me like you want to use the following on the URL strings.

URLDecoder.decode(url);

This will give you the correct text. The result of decoding the link you provided is this.

http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3

The %20 is an escaped space character. To get the above I used the static decode method of the java.net.URLDecoder class.
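As a minimal sketch, the call can be applied to the URL from the question as shown here. It uses the two-argument overload so the result does not depend on the platform's default encoding. The class name `DecodeQuestionUrl` and the helper `decode` are hypothetical names chosen for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecodeQuestionUrl {
    // Hypothetical helper: decodes percent-escapes, interpreting the bytes as UTF-8.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = "http://cliveg.bu.edu/people/sganguly/player/"
                + "%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3";
        // Prints the URL with every %20 replaced by a space.
        System.out.println(decode(encoded));

        // URLDecoder follows form-encoding rules, so a literal '+' also becomes a space.
        System.out.println(decode("a+b%2Bc")); // prints "a b+c"
    }
}
```

Because of the `+` handling, this approach is best suited to strings where a plus sign is known to stand for a space or does not occur at all.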

Breakdown answered 8/3, 2009 at 17:52 Comment(1)
That method is deprecated. Use URLDecoder.decode(location,"UTF-8");Mashie
19

Starting from Java 10 (so including Java 11 and later), use

URLDecoder.decode(url, StandardCharsets.UTF_8)

For Java 7/8/9, use URLDecoder.decode(url, "UTF-8").

URLDecoder.decode(String s) has been deprecated since Java 5.

Regarding the chosen encoding:

Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
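The two overloads can be sketched side by side as follows. This assumes a Java 10 or later compiler, since the `Charset` overload does not exist before that. The class and method names (`DecodeOverloads`, `decodeModern`, `decodeLegacy`) are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class DecodeOverloads {
    // Java 10+: the Charset overload declares no checked exception.
    static String decodeModern(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    // Older versions: the overload taking a charset name declares
    // UnsupportedEncodingException, which has to be handled or rethrown.
    static String decodeLegacy(String s) {
        try {
            // StandardCharsets.UTF_8.name() returns the canonical name "UTF-8".
            return URLDecoder.decode(s, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 must be supported by every JVM", e);
        }
    }

    public static void main(String[] args) {
        String encoded = "Tu%20Bin%20Bataye.mp3";
        System.out.println(decodeModern(encoded)); // prints "Tu Bin Bataye.mp3"
        System.out.println(decodeLegacy(encoded)); // prints "Tu Bin Bataye.mp3"
    }
}
```

Passing `StandardCharsets.UTF_8.name()` instead of a string literal avoids typos in the charset name while staying compatible with Java 7 through 9.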

Bicker answered 27/11, 2017 at 14:43 Comment(5)
For Java 8 and 9, use URLDecoder.decode(s, "UTF-8");Tinytinya
StandardCharsets has been available since Java 7. Am I wrong?Bicker
Yes, but in Java 8 the URLDecoder.decode method only takes (String, String).Tinytinya
@user16320675 I'd considered that, but will it work with the underscore rather than the hyphen in "UTF-8"?Tinytinya
Thank you for sharing. I never knew about this handy method before I read this answer! One cool feature of URLDecoder.decode() vs. new URI().getPath(): the URI constructor rejects URLs that are already decoded, whereas URLDecoder.decode() accepts both encoded and decoded URLs, e.g., (decoded) /path/to/here and there and (encoded) /path/to/here%20and%20there.Bobsledding
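The difference described in the last comment can be illustrated with a small sketch, assuming Java 10 or later for the `Charset` overload. The class `UriVersusDecoder` and its helpers are hypothetical names, not part of any library.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class UriVersusDecoder {
    // Returns the decoded path, or an empty Optional if the URI constructor rejects the input.
    static Optional<String> pathViaUri(String s) {
        try {
            return Optional.ofNullable(new URI(s).getPath());
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    // URLDecoder leaves plain text alone, apart from turning '+' into a space.
    static String viaDecoder(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = "/path/to/here%20and%20there";
        String decoded = "/path/to/here and there";

        System.out.println(pathViaUri(encoded)); // Optional[/path/to/here and there]
        System.out.println(pathViaUri(decoded)); // Optional.empty (a raw space is illegal in a URI)

        System.out.println(viaDecoder(encoded)); // /path/to/here and there
        System.out.println(viaDecoder(decoded)); // /path/to/here and there
    }
}
```

One trade-off to keep in mind: `URI.getPath()` keeps a literal `+` as is, while `URLDecoder.decode` turns it into a space, so the two are not interchangeable for paths that contain plus signs.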
0

I'm having problems using this method when I have special characters like á, é, í, etc. My (probably wild) guess is widechars are not being encoded properly... well, at least I was expecting to see sequences like %uC2BF instead of %C2%BF.

Edited: My bad, this post explains the difference between URL encoding and JavaScript's escape sequences: URI encoding in UNICODE for apache httpclient 4
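To make the point about accented characters concrete, here is a minimal sketch, assuming Java 10 or later, of how URL encoding represents a non-ASCII character as one %XX escape per UTF-8 byte, and what happens when the wrong charset is used for decoding. The class `AccentedCharacters` and its methods are hypothetical names.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class AccentedCharacters {
    static String encodeUtf8(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    static String decodeUtf8(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    static String decodeLatin1(String s) {
        return URLDecoder.decode(s, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E1 (a with acute) is two bytes in UTF-8, hence two escapes.
        System.out.println(encodeUtf8("\u00e1"));   // prints "%C3%A1"

        // Decoding with the same charset restores the single character.
        System.out.println(decodeUtf8("%C3%A1").equals("\u00e1"));         // prints "true"

        // %C2%BF is the UTF-8 form of U+00BF (inverted question mark).
        System.out.println(decodeUtf8("%C2%BF").equals("\u00bf"));         // prints "true"

        // Decoding with ISO-8859-1 yields two characters instead of one (mojibake).
        System.out.println(decodeLatin1("%C3%A1").equals("\u00c3\u00a1")); // prints "true"
    }
}
```

The `%uXXXX` form mentioned above is not produced or understood by `URLEncoder`/`URLDecoder`; it belongs to JavaScript's `escape` function, as the linked post explains.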

Thuggee answered 9/2, 2011 at 16:45 Comment(0)
0

In my case the URL contained escaped HTML entities, so StringEscapeUtils.unescapeHtml4() from Apache Commons did the trick.

Stratiform answered 25/2, 2023 at 11:17 Comment(0)
