HTTP URL Address Encoding in Java
B

24

391

My Java standalone application gets a URL (which points to a file) from the user and I need to hit it and download it. The problem I am facing is that I am not able to encode the HTTP URL address properly...

Example:

URL:  http://search.barnesandnoble.com/booksearch/first book.pdf

java.net.URLEncoder.encode(url.toString(), "ISO-8859-1");

returns me:

http%3A%2F%2Fsearch.barnesandnoble.com%2Fbooksearch%2Ffirst+book.pdf

But, what I want is

http://search.barnesandnoble.com/booksearch/first%20book.pdf

(space replaced by %20)

I guess URLEncoder is not designed to encode HTTP URLs... The JavaDoc says "Utility class for HTML form encoding"... Is there any other way to do this?

Baziotes answered 7/4, 2009 at 3:28 Comment(3)
Nitpicking: a string containing a whitespace character by definition is not a URI. So what you're looking for is code that implements the URI escaping defined in Section 2.1 of RFC 3986.Ion
See also #10786542Trinette
The behaviour is entirely correct. URL encode is to turn something into a string that can be safely passed as a URL parameter, and isn't interpreted as a URL at all. Whereas you want it to just convert one small part of the URL.Barrios
E
322

The java.net.URI class can help; in the documentation of URL you find

Note, the URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use an URI

Use one of the constructors with more than one argument, like:

URI uri = new URI(
    "http", 
    "search.barnesandnoble.com", 
    "/booksearch/first book.pdf",
    null);
URL url = uri.toURL();
//or String request = uri.toString();

(the single-argument constructor of URI does NOT escape illegal characters)
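As a minimal sketch of the difference between the two kinds of constructor (the class name UriConstructorDemo and its helper methods are illustrative, not part of the original answer): the multi-argument constructor quotes the space, while the single-argument constructor rejects the same string.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriConstructorDemo {

    // Multi-argument constructor: illegal characters in the path are quoted.
    static String viaComponents() {
        try {
            URI uri = new URI("http", "search.barnesandnoble.com",
                    "/booksearch/first book.pdf", null);
            return uri.toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Single-argument constructor: parses the string as-is and rejects the space.
    static boolean singleArgumentRejectsSpace() {
        try {
            new URI("http://search.barnesandnoble.com/booksearch/first book.pdf");
            return false;
        } catch (URISyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // prints http://search.barnesandnoble.com/booksearch/first%20book.pdf
        System.out.println(viaComponents());
        // prints true
        System.out.println(singleArgumentRejectsSpace());
    }
}
```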


Only illegal characters get escaped by the above code; it does NOT escape non-ASCII characters (see fatih's comment).
The toASCIIString method can be used to get a String containing only US-ASCII characters:

URI uri = new URI(
    "http", 
    "search.barnesandnoble.com", 
    "/booksearch/é",
    null);
String request = uri.toASCIIString();

For a URL with a query like http://www.google.com/ig/api?weather=São Paulo, use the 5-parameter version of the constructor:

URI uri = new URI(
        "http", 
        "www.google.com", 
        "/ig/api",
        "weather=São Paulo",
        null);
String request = uri.toASCIIString();
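
To make the difference between toString() and toASCIIString() concrete, here is a small sketch using the same weather query (the class name AsciiUriDemo is made up for illustration; "ã" is written as a Unicode escape only so the source file's encoding does not matter):

```java
import java.net.URI;
import java.net.URISyntaxException;

final class AsciiUriDemo {

    // Builds the URI from its components; \u00e3 is the letter "ã".
    static URI weatherUri() {
        try {
            return new URI("http", "www.google.com", "/ig/api",
                    "weather=S\u00e3o Paulo", null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = weatherUri();
        // toString() quotes the space but keeps the non-ASCII letter:
        // http://www.google.com/ig/api?weather=São%20Paulo
        System.out.println(uri.toString());
        // toASCIIString() additionally percent-encodes the UTF-8 bytes of "ã":
        // http://www.google.com/ig/api?weather=S%C3%A3o%20Paulo
        System.out.println(uri.toASCIIString());
    }
}
```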
Erbium answered 7/4, 2009 at 9:12 Comment(19)
Please note, the URI class mentioned here is from "org.apache.commons.httpclient.URI", not "java.net"; the "java.net" URI doesn't accept illegal characters unless you use the constructors that build the URI from its components, like the way mentioned in Matt's comment below.Warbler
@Mohamed: the class I mentioned and used for testing actually is java.net.URI: it worked perfectly (Java 1.6). I would mention the fully qualified class name if it was not the standard Java one and the link points to the documentation of java.net.URI. And, by the comment of Sudhakar, it solved the problem without including any "commons libraries"!Erbium
URI uri = new URI("http", "search.barnesandnoble.com", "/booksearch/é",null); Does not do correct escaping with this sample? This should have been escaped with % escapesAugustin
@fatih - that's correct, thanks! Normally that should not be a problem, but there is a simple solution - almost same as I wrote before. See 2nd edit.Erbium
@Carlos Thx for the edit. Now it does escape but not correct escaping. It should be adding a % to the HEX value of char for Path params meaning é char should be converted to %e9Augustin
@fatih - not correct? Why not? The code uses just standard Java classes... It's returning %C3%A9 for me, which is working perfectly (firefox and google). The URI standard (RFC2396) does not specify any particular character set to be used for encoding. The URI class uses UTF-8. Just because there are other options, does not mean the one is wrong.Erbium
Sorry, I should have placed some link/more info in my previous reply. tools.ietf.org/html/rfc2396 The RFC2396 actually defines how to escape : " Data must be escaped if it does not have a representation using an unreserved character; this includes data that does not correspond to a printable character of the US-ASCII coded character set, or that corresponds to any US-ASCII character that is disallowed, as explained below."(2.4. Escape Sequences) So if the char is not in US-ASCII, then it needs to be escaped.Augustin
(2.4.1) Escape encoding is done by prepending a % char to the 2 digit hex value of that character. '%C3%A9' so this is encoded in UTF8 but URI escaping needs to be done like the way defined in 2.4.1 and it needs to be exactly 3 digits. However, one bit which is confused is; RFC2396 does not say anything about the character set so if you are to include € in the URI, it needs to be converted to ('\u20AC') and then escape the resulting string as "%E2%82%AC". download.oracle.com/javase/6/docs/api/java/net/URI.html search for rfc2396, example is from javadocs of URI class itself.Augustin
w3schools.com/TAGS/ref_urlencode.asp see the list of escaped chars values. Again, the table shows everything correctly but if you use the "URL Encode" functionality/button on the same page, you will see it is actually not escaping properly but returning utf-8 value for "é". :)Augustin
@fatih - it is done like defined - (2.4.1): “An escaped octet is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing the octet code. …” An octet is not the same as a char… see 2.1 “URI and non-ASCII characters” from the same RFC2396. Surely not so simple...Erbium
@Carlos, You are right, surely it is not char thats my poor wording just to explain.Augustin
This one breaks at the ?. Any other solutions? http://www.google.com/ig/api?weather=São PauloMeasly
@Paulo Casaretto - Check the EDIT 3 that I added to my answer above!Erbium
I tried this proposed solution, but it failed because it also escaped ampersand (&) character.Ruthful
So, it seems that unicode characters (e.g. ã) are not being encoded in any way (except by toASCIIString()) nor are spaces being converted to '+'. The code from edit 3 returns "weather=São%20Paulo" as the query string. What steps should I take to escape the arguments to new URI()?Rabassa
@EdwardFalk This is "correct" behaviour: it appears that Java tried to add support for non-ASCII URIs before they were standardized as IRIs by RFC 3987; unhelpfully the spec for java.net.URI permits many unwise characters (e.g. Unicode control characters like directional formatting). Additionally, representing spaces with + is specific to HTML's application/x-www-form-urlencoded; to do that, use the (unhelpfully-named) java.net.URLEncoder class.Sassan
Anyone found a solution? I have a query with multiple variables and €-signs, how to deal with that?Nolde
It does help. You can use java.net.URL to decompose the bad URL, and java.net.URI to put it back together correctly. For http(s) URLs anyway. (Ugh!)Deledda
for Android developers there is a somewhat more convenient alternative: android.net.Uri.encode()Tutelage
Y
97

Please be warned that most of the answers above are INCORRECT.

The URLEncoder class, despite its name, is NOT what is needed here. It's unfortunate that Sun named this class so misleadingly. URLEncoder is meant for passing data as parameters, not for encoding the URL itself.

In other words, "http://search.barnesandnoble.com/booksearch/first book.pdf" is the URL. Parameters would be, for example, "http://search.barnesandnoble.com/booksearch/first book.pdf?parameter1=this&param2=that". The parameters are what you would use URLEncoder for.

The following two examples highlight the differences between the two.

The following produces the wrong parameters, according to the HTTP standard. Note the ampersand (&) and plus (+) are encoded incorrectly.

URI uri = new URI("http", null, "www.google.com", 80, 
"/help/me/book name+me/", "MY CRZY QUERY! +&+ :)", null);

// URI: http://www.google.com:80/help/me/book%20name+me/?MY%20CRZY%20QUERY!%20+&+%20:)

The following will produce the correct parameters, with the query properly encoded. Note the spaces, ampersands, and plus marks.

uri = new URI("http", null, "www.google.com", 80, "/help/me/book name+me/", URLEncoder.encode("MY CRZY QUERY! +&+ :)", "UTF-8"), null);

// URI: http://www.google.com:80/help/me/book%20name+me/?MY+CRZY+QUERY%2521+%252B%2526%252B+%253A%2529
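
The %25 sequences in the second result come from the URI constructor quoting the percent signs that URLEncoder already produced. One possible way around that, sketched here under the assumption that the query is a list of key=value pairs (the key "q", the value "a&b c+d", and the class name QueryEncodingDemo are made up for illustration), is to let the multi-argument constructor handle only the path, form-encode each key and value separately, and then parse the assembled string with the single-argument constructor, which validates but does not re-escape:

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

final class QueryEncodingDemo {

    // Form-encodes one key or one value (space -> "+", "&" -> "%26", ...).
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    static String build(String key, String value) {
        try {
            // Path is quoted by the multi-argument constructor; no query yet.
            URI base = new URI("http", null, "www.google.com", 80,
                    "/help/me/book name+me/", null, null);
            String query = formEncode(key) + "=" + formEncode(value);
            // The single-argument constructor keeps existing %XX escapes as they are.
            return new URI(base.toString() + "?" + query).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://www.google.com:80/help/me/book%20name+me/?q=a%26b+c%2Bd
        System.out.println(build("q", "a&b c+d"));
    }
}
```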
Yt answered 7/4, 2010 at 21:1 Comment(4)
That's right, the URI constructor already encodes the querystring, according to the documentation docs.oracle.com/javase/1.4.2/docs/api/java/net/…, java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, java.lang.String)Trafalgar
@Draemon The answer is correct but uses the query string in an uncommon way; a more normal example might be query = URLEncoder.encode(key) + "=" + URLEncoder.encode(value). The docs merely say that "any character that is not a legal URI character is quoted".Sassan
I agree with Matt here. If you type this URL: "google.com/help/me/book name+me/?MY CRZY QUERY! +&+ :)" in a browser, it automatically encodes the spaces but the "&" is used as query value separator and "+" are lost.Suppuration
Unfortunately, this answer is also wrong, because it double-encodes things. With the multi-param URI constructor, if you have slashes in your path, or '&' or '=' in your query params or values, you are either going to fail to encode these, or double encode them.Fun
E
92

I'm going to add one suggestion here aimed at Android users. The approach below avoids having to pull in any external libraries. Also, all the search/replace character solutions suggested in some of the answers above are perilous and should be avoided.

Give this a try:

String urlStr = "http://abc.dev.domain.com/0007AC/ads/800x480 15sec h.264.mp4";
URL url = new URL(urlStr);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
url = uri.toURL();

You can see that in this particular URL, I need to have those spaces encoded so that I can use it for a request.

This takes advantage of a couple features available to you in Android classes. First, the URL class can break a url into its proper components so there is no need for you to do any string search/replace work. Secondly, this approach takes advantage of the URI class feature of properly escaping components when you construct a URI via components rather than from a single string.

The beauty of this approach is that you can take any valid url string and have it work without needing any special knowledge of it yourself.
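
A self-contained sketch of the decompose-and-recompose step using only java.net classes (the class name RecomposeDemo and the example.com URL are illustrative). It also shows one limit of the approach: because the multi-argument URI constructor always quotes the percent character, input that is already escaped gets escaped a second time.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class RecomposeDemo {

    // Splits the string with java.net.URL, then rebuilds it with java.net.URI.
    static String recompose(String urlStr) {
        try {
            // new URL(String) is deprecated in recent JDKs but still available.
            URL url = new URL(urlStr);
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(),
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            return uri.toString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Spaces are quoted:
        // http://abc.dev.domain.com/0007AC/ads/800x480%2015sec%20h.264.mp4
        System.out.println(recompose(
                "http://abc.dev.domain.com/0007AC/ads/800x480 15sec h.264.mp4"));
        // An already-escaped space is escaped again:
        // http://example.com/a%2520b
        System.out.println(recompose("http://example.com/a%20b"));
    }
}
```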

Erythema answered 22/1, 2012 at 17:4 Comment(4)
Nice approach, but I would like to point out that this code does not prevent double encoding, e.g. %20 got encoded into %2520. Scott's answer does not suffer from this.Ahl
Or if you just want to do path quoting: new URI(null, null, "/path with spaces", null, null).toString()Swine
@Stallman If your file name contains #, the URL class will put it into "ref" (equivalent of "fragment" in the URI class). You can detect whether URL.getRef() returns something that might be treated as a part of the path and pass URL.getPath() + "#" + URL.getRef() as the "path" parameter and null as the "fragment" parameter of the URI class 7 parameters constructor. By default, the string after # is treated as a reference (or an anchor).Jerkin
great answer, i have simple urls and it works for me. Although i don't think its very android specific. I used java.net.URI and java.net.URL and this answer was working perfectly. I am even able to unit test this.Educatee
A
49

A solution I developed, which I have found to be much more stable than the others:

public class URLParamEncoder {

    public static String encode(String input) {
        StringBuilder resultStr = new StringBuilder();
        for (char ch : input.toCharArray()) {
            if (isUnsafe(ch)) {
                resultStr.append('%');
                resultStr.append(toHex(ch / 16));
                resultStr.append(toHex(ch % 16));
            } else {
                resultStr.append(ch);
            }
        }
        return resultStr.toString();
    }

    private static char toHex(int ch) {
        return (char) (ch < 10 ? '0' + ch : 'A' + ch - 10);
    }

    private static boolean isUnsafe(char ch) {
        if (ch > 128 || ch < 0)
            return true;
        return " %$&+,/:;=?@<>#%".indexOf(ch) >= 0;
    }

}
Augustin answered 5/1, 2011 at 15:28 Comment(7)
that also requires you to break the url into pieces. There is no way for a computer to know which part of the url to encode. See my above editAugustin
@Augustin Thanks for that piece of code! It should be noted that this isn't UTF-8. To get UTF-8 just pre-process the input with String utf8Input = new String(Charset.forName("UTF-8").encode(input).array()); (taken from here)Lutz
Actually, I use it with a trim() and explicit encoding now although the latter is probably unnecessary: new String(Charset.forName("UTF-8").encode(q).array(), "ISO-8859-1").trim(); The trim() is needed as encode() appends null bytes at the end which the String constructor doesn't remove. Don't know if it's fully correct, but works for me...Lutz
This solution will actually also encode the "http://" part into "http%3A%2F%2F", which is what the initial question tried to avoid.Idiotism
You only pass what you need to encode, not the whole URL. There is no way to pass one whole URL string and expect correct encoding. In all cases, you need to break the url into its logical pieces.Augustin
I had problems with this answer because it doesn't encode unsafe chars to UTF-8.. may be dependent on the peer application though.Hetaerism
This fails when a string having Chinese characters is passed as input: eg: "Test Sample-000001363/这是一个演示文件.docx"Gonion
P
40

If you have a URL, you can pass url.toString() into this method. First decode, to avoid double encoding (for example, encoding a space results in %20 and encoding a percent sign results in %25, so double encoding will turn a space into %2520). Then, use the URI as explained above, adding in all the parts of the URL (so that you don't drop the query parameters).

public URL convertToURLEscapingIllegalCharacters(String string){
    try {
        String decodedURL = URLDecoder.decode(string, "UTF-8");
        URL url = new URL(decodedURL);
        URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef()); 
        return uri.toURL(); 
    } catch (Exception ex) {
        ex.printStackTrace();
        return null;
    }
}
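
Two properties of URLDecoder are worth checking before relying on the decode-first step. The sketch below (class and method names are illustrative) shows that URLDecoder follows form-encoding rules, so a literal "+" in the input becomes a space, and that a "%" not followed by two hex digits makes it throw IllegalArgumentException:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class DecodeFirstCaveats {

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // True when URLDecoder refuses the input because of a malformed % escape.
    static boolean rejects(String s) {
        try {
            decode(s);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // "+" is treated as an encoded space, so the plus sign is lost.
        System.out.println(decode("a+b%20c"));      // prints a b c
        // "%!" is not a valid escape sequence.
        System.out.println(rejects("q=123%!123"));  // prints true
    }
}
```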
Punak answered 3/3, 2012 at 2:12 Comment(1)
URLDecoder.decode(string, "UTF-8") fails with an IllegalArgumentException when you pass the string as "google.co.in/search?q=123%!123". This is a valid URL. I guess this API doesn't work when % is used as data instead of the encoding character.Dennet
C
27

Yes, URL encoding is going to encode that string so that it can be passed properly in a URL to a final destination. For example, you could not have http://stackoverflow.com?url=http://yyy.com. URL-encoding the parameter would fix that parameter value.

So I have two choices for you:

  1. Do you have access to the path separate from the domain? If so you may be able to simply UrlEncode the path. However, if this is not the case then option 2 may be for you.

  2. Get commons-httpclient-3.1. This has a class URIUtil:

    System.out.println(URIUtil.encodePath("http://example.com/x y", "ISO-8859-1"));

This will output exactly what you are looking for, as it will only encode the path part of the URI.

FYI, you'll need commons-codec and commons-logging for this method to work at runtime.

Clementinaclementine answered 7/4, 2009 at 3:34 Comment(2)
Sidenote apache commons stopped maintaining URIUtil in 4.x branches apparently, recommending you use JDK's URI class instead. Just means you have to break up the string yourself.Seignior
2) Exactly it is also suggested here #5330604 I also used URIUtil solutionDeuno
W
14

If anybody doesn't want to add a dependency to their project, these functions may be helpful.

We pass the 'path' part of our URL into here. You probably don't want to pass the full URL in as a parameter (query strings need different escapes, etc).

/**
 * Percent-encodes a string so it's suitable for use in a URL Path (not a query string / form encode, which uses + for spaces, etc)
 */
public static String percentEncode(String encodeMe) {
    if (encodeMe == null) {
        return "";
    }
    String encoded = encodeMe.replace("%", "%25");
    encoded = encoded.replace(" ", "%20");
    encoded = encoded.replace("!", "%21");
    encoded = encoded.replace("#", "%23");
    encoded = encoded.replace("$", "%24");
    encoded = encoded.replace("&", "%26");
    encoded = encoded.replace("'", "%27");
    encoded = encoded.replace("(", "%28");
    encoded = encoded.replace(")", "%29");
    encoded = encoded.replace("*", "%2A");
    encoded = encoded.replace("+", "%2B");
    encoded = encoded.replace(",", "%2C");
    encoded = encoded.replace("/", "%2F");
    encoded = encoded.replace(":", "%3A");
    encoded = encoded.replace(";", "%3B");
    encoded = encoded.replace("=", "%3D");
    encoded = encoded.replace("?", "%3F");
    encoded = encoded.replace("@", "%40");
    encoded = encoded.replace("[", "%5B");
    encoded = encoded.replace("]", "%5D");
    return encoded;
}

/**
 * Percent-decodes a string, such as used in a URL Path (not a query string / form encode, which uses + for spaces, etc)
 */
public static String percentDecode(String encodeMe) {
    if (encodeMe == null) {
        return "";
    }
    String decoded = encodeMe.replace("%21", "!");
    decoded = decoded.replace("%20", " ");
    decoded = decoded.replace("%23", "#");
    decoded = decoded.replace("%24", "$");
    decoded = decoded.replace("%26", "&");
    decoded = decoded.replace("%27", "'");
    decoded = decoded.replace("%28", "(");
    decoded = decoded.replace("%29", ")");
    decoded = decoded.replace("%2A", "*");
    decoded = decoded.replace("%2B", "+");
    decoded = decoded.replace("%2C", ",");
    decoded = decoded.replace("%2F", "/");
    decoded = decoded.replace("%3A", ":");
    decoded = decoded.replace("%3B", ";");
    decoded = decoded.replace("%3D", "=");
    decoded = decoded.replace("%3F", "?");
    decoded = decoded.replace("%40", "@");
    decoded = decoded.replace("%5B", "[");
    decoded = decoded.replace("%5D", "]");
    decoded = decoded.replace("%25", "%");
    return decoded;
}

And tests:

@Test
public void testPercentEncode_Decode() {
    assertEquals("", percentDecode(percentEncode(null)));
    assertEquals("", percentDecode(percentEncode("")));

    assertEquals("!", percentDecode(percentEncode("!")));
    assertEquals("#", percentDecode(percentEncode("#")));
    assertEquals("$", percentDecode(percentEncode("$")));
    assertEquals("@", percentDecode(percentEncode("@")));
    assertEquals("&", percentDecode(percentEncode("&")));
    assertEquals("'", percentDecode(percentEncode("'")));
    assertEquals("(", percentDecode(percentEncode("(")));
    assertEquals(")", percentDecode(percentEncode(")")));
    assertEquals("*", percentDecode(percentEncode("*")));
    assertEquals("+", percentDecode(percentEncode("+")));
    assertEquals(",", percentDecode(percentEncode(",")));
    assertEquals("/", percentDecode(percentEncode("/")));
    assertEquals(":", percentDecode(percentEncode(":")));
    assertEquals(";", percentDecode(percentEncode(";")));

    assertEquals("=", percentDecode(percentEncode("=")));
    assertEquals("?", percentDecode(percentEncode("?")));
    assertEquals("@", percentDecode(percentEncode("@")));
    assertEquals("[", percentDecode(percentEncode("[")));
    assertEquals("]", percentDecode(percentEncode("]")));
    assertEquals(" ", percentDecode(percentEncode(" ")));

    // Get a little complex
    assertEquals("[]]", percentDecode(percentEncode("[]]")));
    assertEquals("a=d%*", percentDecode(percentEncode("a=d%*")));
    assertEquals(")  (", percentDecode(percentEncode(")  (")));
    assertEquals("%21%20%2A%20%27%20%28%20%25%20%29%20%3B%20%3A%20%40%20%26%20%3D%20%2B%20%24%20%2C%20%2F%20%3F%20%23%20%5B%20%5D%20%25",
                    percentEncode("! * ' ( % ) ; : @ & = + $ , / ? # [ ] %"));
    assertEquals("! * ' ( % ) ; : @ & = + $ , / ? # [ ] %", percentDecode(
                    "%21%20%2A%20%27%20%28%20%25%20%29%20%3B%20%3A%20%40%20%26%20%3D%20%2B%20%24%20%2C%20%2F%20%3F%20%23%20%5B%20%5D%20%25"));

    assertEquals("%23456", percentDecode(percentEncode("%23456")));

}
Whale answered 19/5, 2017 at 18:32 Comment(2)
Thanks for this, but what is that I need to do to encode a space -> use %20 instead as per your example?Gaskins
Updated to account for spaces as %20Whale
T
11

Unfortunately, org.apache.commons.httpclient.util.URIUtil is deprecated, and the replacement org.apache.commons.codec.net.URLCodec does encoding suitable for form posts, not for actual URLs. So I had to write my own function, which encodes a single component (it is not suitable for entire query strings that contain ?'s and &'s):

public static String encodeURLComponent(final String s)
{
  if (s == null)
  {
    return "";
  }

  final StringBuilder sb = new StringBuilder();

  try
  {
    for (int i = 0; i < s.length(); i++)
    {
      final char c = s.charAt(i);

      if (((c >= 'A') && (c <= 'Z')) || ((c >= 'a') && (c <= 'z')) ||
          ((c >= '0') && (c <= '9')) ||
          (c == '-') ||  (c == '.')  || (c == '_') || (c == '~'))
      {
        sb.append(c);
      }
      else
      {
        final byte[] bytes = ("" + c).getBytes("UTF-8");

        for (byte b : bytes)
        {
          sb.append('%');

          int upper = (((int) b) >> 4) & 0xf;
          sb.append(Integer.toHexString(upper).toUpperCase(Locale.US));

          int lower = ((int) b) & 0xf;
          sb.append(Integer.toHexString(lower).toUpperCase(Locale.US));
        }
      }
    }

    return sb.toString();
  }
  catch (UnsupportedEncodingException uee)
  {
    throw new RuntimeException("UTF-8 unsupported!?", uee);
  }
}
Tartuffery answered 30/6, 2011 at 6:29 Comment(1)
Come on, there has to be a library that does this.Womanhater
S
10

URLEncoder can encode HTTP URLs just fine, as you've unfortunately discovered. The string you passed in, "http://search.barnesandnoble.com/booksearch/first book.pdf", was correctly and completely encoded into a URL-encoded form. You could pass that entire long string of gobbledygook that you got back as a parameter in a URL, and it could be decoded back into exactly the string you passed in.

It sounds like you want to do something a little different than passing the entire URL as a parameter. From what I gather, you're trying to create a search URL that looks like "http://search.barnesandnoble.com/booksearch/whateverTheUserPassesIn". The only thing that you need to encode is the "whateverTheUserPassesIn" bit, so perhaps all you need to do is something like this:

String url = "http://search.barnesandnoble.com/booksearch/" + 
       URLEncoder.encode(userInput,"UTF-8");

That should produce something rather more valid for you.

Secant answered 7/4, 2009 at 3:46 Comment(2)
That would replace the spaces in userInput with "+". The poster needs them replaced with "%20".Mcgee
@vocaro: that is a very good point. URLEncoder escapes like the arguments are query parameters, not like the rest of the URL.Secant
F
8

There is still a problem if you have got an encoded "/" (%2F) in your URL.

RFC 3986 - Section 2.2 says: "If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed."
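
On the Java side, java.net.URI keeps this distinction as long as you read the raw form. A small sketch (the example.com URL and the class name EncodedSlashDemo are made up): getRawPath() preserves the escaped slash, while getPath() decodes it, after which it can no longer be told apart from a real path delimiter.

```java
import java.net.URI;

final class EncodedSlashDemo {

    // "a%2Fb" is a single path segment whose data happens to contain a slash.
    static URI sample() {
        return URI.create("http://example.com/files/a%2Fb/c");
    }

    public static void main(String[] args) {
        System.out.println(sample().getRawPath()); // prints /files/a%2Fb/c
        System.out.println(sample().getPath());    // prints /files/a/b/c
    }
}
```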

But there is an issue with Tomcat:

http://tomcat.apache.org/security-6.html - Fixed in Apache Tomcat 6.0.10

important: Directory traversal CVE-2007-0450

Tomcat permits '\', '%2F' and '%5C' [...] .

The following Java system properties have been added to Tomcat to provide additional control of the handling of path delimiters in URLs (both options default to false):

  • org.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH: true|false
  • org.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH: true|false

Due to the impossibility to guarantee that all URLs are handled by Tomcat as they are in proxy servers, Tomcat should always be secured as if no proxy restricting context access was used.

Affects: 6.0.0-6.0.9

So if you have a URL containing %2F, Tomcat returns: "400 Invalid URI: noSlash"

You can switch off the bugfix in the Tomcat startup script:

set JAVA_OPTS=%JAVA_OPTS% %LOGGING_CONFIG%   -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true 
Fung answered 28/9, 2010 at 7:33 Comment(0)
D
8

I read the previous answers and wrote my own method, because I could not get the earlier solutions to work properly. It looks good to me, but if you find a URL that does not work with it, please let me know.

public static URL convertToURLEscapingIllegalCharacters(String toEscape) throws MalformedURLException, URISyntaxException {
    URL url = new URL(toEscape);
    URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
    // if a % is included in the toEscape string, it will be re-encoded to %25 and we don't want re-encoding, just encoding
    return new URL(uri.toString().replace("%25", "%"));
}
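
One case to try against this method is an input with a literal percent sign that is data rather than the start of an escape. The sketch below repeats the same steps but returns a String so the result can be inspected (the class name, helper names, and example.com URLs are illustrative): undoing %25 restores an existing %20 as intended, but it also turns a genuine "%" back into a bare percent sign, which java.net.URI then refuses to parse.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class PercentReplaceCaveat {

    // Same steps as the method above, returning the resulting string.
    static String escape(String toEscape) {
        try {
            URL url = new URL(toEscape);
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(),
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            return uri.toString().replace("%25", "%");
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // True when the string parses as a URI without any further escaping.
    static boolean isValidUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Already-escaped input survives unchanged:
        // http://example.com/first%20book.pdf
        System.out.println(escape("http://example.com/first%20book.pdf"));
        // A literal percent sign ends up unescaped:
        // http://example.com/100%%20sure.pdf
        String result = escape("http://example.com/100% sure.pdf");
        System.out.println(result);
        System.out.println(isValidUri(result)); // prints false
    }
}
```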
Diegodiehard answered 4/6, 2015 at 10:2 Comment(1)
"example.com?q=plus+plus noplus" plus sign is not encoded and might get picked up as space characterWaybill
A
5

You could try UriUtils in org.springframework.web.util:

UriUtils.encodeUri(input, "UTF-8")
Auditory answered 14/3, 2013 at 6:49 Comment(0)
D
5

You can also use Guava and one of its UrlEscapers: UrlEscapers.urlFragmentEscaper().escape(relativePath)

Deuno answered 18/5, 2016 at 11:54 Comment(0)
K
4

I agree with Matt. Indeed, I've never seen it well explained in tutorials, but one matter is how to encode the URL path, and a very different one is how to encode the parameters which are appended to the URL (the query part, behind the "?" symbol). They use similar encoding, but not the same.

This matters especially for the white space character. The URL path needs it to be encoded as %20, whereas the query part allows %20 and also the "+" sign. The best idea is to test it ourselves against our Web server, using a Web browser.

For both cases, I ALWAYS would encode COMPONENT BY COMPONENT, never the whole string. Indeed URLEncoder allows that for the query part. For the path part you can use the class URI, although in this case it asks for the entire string, not a single component.

Anyway, I believe that the best way to avoid these problems is to use a personal non-conflictive design. How? For example, I never would name directories or parameters using characters other than a-z, A-Z, 0-9 and _ . That way, the only need is to encode the value of every parameter, since it may come from user input and the characters used are unknown.
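
A short sketch of the component-by-component idea using the URL from the question (the parameter name "title" and the class and method names are made up for illustration): the path goes through the multi-argument java.net.URI constructor, which yields %20 for the space, and a query value goes through URLEncoder, which yields "+".

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

final class PathVersusQueryDemo {

    // Path encoding: the multi-argument URI constructor quotes illegal characters.
    static String encodePath(String path) {
        try {
            return new URI(null, null, path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Query value encoding: form rules, where a space becomes "+".
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String url = "http://search.barnesandnoble.com"
                + encodePath("/booksearch/first book.pdf")
                + "?title=" + encodeQueryValue("first book");
        // prints http://search.barnesandnoble.com/booksearch/first%20book.pdf?title=first+book
        System.out.println(url);
    }
}
```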

Key answered 4/6, 2011 at 14:3 Comment(1)
sample code using the URL in the question would be a good thing to put in your answerFaithfaithful
P
3

I took the content above and changed it around a bit. I like positive logic first, and I thought a HashSet might give better performance than some other options, like searching through a String. Although, I'm not sure if the autoboxing penalty is worth it, but if the compiler optimizes for ASCII chars, then the cost of boxing will be low.

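/**
 * Replaces every character that is not explicitly unreserved with its
 * percent-encoded UTF-8 bytes.
 * @param s the component to encode
 * @return the percent-encoded component
 */
public static String encodeURIcomponent(String s)
{
    StringBuilder o = new StringBuilder();
    // Encode byte by byte so that non-ASCII characters become valid %XX triplets.
    for (byte b : s.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
        int octet = b & 0xFF;
        if (isSafe((char) octet)) {
            o.append((char) octet);
        }
        else {
            o.append('%');
            o.append(toHex(octet / 16));
            o.append(toHex(octet % 16));
        }
    }
    return o.toString();
}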

private static char toHex(int ch)
{
    return (char)(ch < 10 ? '0' + ch : 'A' + ch - 10);
}

// https://tools.ietf.org/html/rfc3986#section-2.3
public static final HashSet<Character> UnreservedChars = new HashSet<Character>(Arrays.asList(
        'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z',
        'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',
        '0','1','2','3','4','5','6','7','8','9',
        '-','_','.','~'));
public static boolean isSafe(char ch)
{
    return UnreservedChars.contains(ch);
}
Postfree answered 8/8, 2018 at 19:41 Comment(0)
T
2

In addition to Carlos Heuberger's reply: if a port other than the default (80) is needed, the 7-parameter constructor should be used:

URI uri = new URI(
        "http",
        null, // this is for userInfo
        "www.google.com",
        8080, // port number as int
        "/ig/api",
        "weather=São Paulo",
        null);
String request = uri.toASCIIString();
Trifling answered 29/7, 2011 at 13:20 Comment(0)
G
2

Use the following standard Java solution (passes around 100 of the test cases provided by Web Platform Tests):

0. Test if URL is already encoded.

1. Split URL into structural parts. Use java.net.URL for it.

2. Encode each structural part properly!

3. Use IDN.toASCII(putDomainNameHere) to Punycode encode the host name!

4. Use java.net.URI.toASCIIString() to percent-encode the NFC-normalized Unicode (NFKC would be better!).

Find more here: https://mcmap.net/q/24456/-java-url-encoding-of-query-string-parameters
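
Steps 1 to 4 above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name AsciiUrlDemo, the method name, and the sample host bücher.example are made up, step 0 (detecting already-encoded input) is left out, and the multi-argument URI constructor stands in for "encode each structural part".

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class AsciiUrlDemo {

    static String toAsciiUrl(String raw) {
        try {
            // Step 1: split into structural parts.
            URL url = new URL(raw);
            // Step 3: Punycode-encode the host name.
            String host = IDN.toASCII(url.getHost());
            // Step 2: let the multi-argument constructor quote each part.
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), host,
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            // Step 4: percent-encode the remaining non-ASCII characters.
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // \u00fc is "ü" and \u00e9 is "é".
        System.out.println(toAsciiUrl("http://b\u00fccher.example/caf\u00e9 menu.pdf"));
    }
}
```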

Gamete answered 12/4, 2018 at 13:7 Comment(0)
G
2

If you are using spring, you can try org.springframework.web.util.UriUtils#encodePath

Grandee answered 26/7, 2021 at 6:45 Comment(0)
G
0

I've created a new project to help construct HTTP URLs. The library will automatically URL encode path segments and query parameters.

You can view the source and download a binary at https://github.com/Widen/urlbuilder

The example URL in this question:

new UrlBuilder("search.barnesandnoble.com", "booksearch/first book.pdf").toString()

produces

http://search.barnesandnoble.com/booksearch/first%20book.pdf

Gerstner answered 15/1, 2011 at 5:0 Comment(0)
I
0

I had the same problem. Solved this by using:

android.net.Uri.encode(urlString, ":/");

It encodes the string but skips ":" and "/".

Interlink answered 3/4, 2017 at 9:55 Comment(0)
E
-1

I develop a library that serves this purpose: galimatias. It parses URL the same way web browsers do. That is, if a URL works in a browser, it will be correctly parsed by galimatias.

In this case:

// Parse
io.mola.galimatias.URL.parse(
    "http://search.barnesandnoble.com/booksearch/first book.pdf"
).toString()

Will give you: http://search.barnesandnoble.com/booksearch/first%20book.pdf. Of course this is the simplest case, but it'll work with anything, way beyond java.net.URI.

You can check it out at: https://github.com/smola/galimatias

Etesian answered 18/3, 2014 at 14:57 Comment(1)
I'm not sure why this answer was downvoted so much. This library albeit a bit big in footprint does exactly what I need.Fowle
D
-2

I use this:

org.apache.commons.text.StringEscapeUtils.escapeHtml4("my text % & < >");

Add this dependency:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.8</version>
</dependency>
Dessiedessma answered 16/9, 2019 at 11:41 Comment(1)
This escapes HTML tags but not URLsUnmeriting
I
-3

You can use a function like this. Complete and modify it to suit your needs:

/**
 * Encode URL (except :, /, ?, &, =, ... characters)
 * @param url to encode
 * @param encodingCharset url encoding charset
 * @return encoded URL
 * @throws UnsupportedEncodingException
 */
public static String encodeUrl(String url, String encodingCharset) throws UnsupportedEncodingException {
    // Assumes URLCodec is org.apache.commons.codec.net.URLCodec (Apache Commons Codec).
    return new URLCodec().encode(url, encodingCharset)
            .replace("%3A", ":")
            .replace("%2F", "/")
            .replace("%3F", "?")
            .replace("%3D", "=")
            .replace("%26", "&");
}

Example of use:

String urlToEncode = "http://www.growup.com/folder/intérieur-à_vendre?o=4";
Utils.encodeUrl(urlToEncode, "UTF-8");

The result is : http://www.growup.com/folder/int%C3%A9rieur-%C3%A0_vendre?o=4

Ia answered 22/8, 2014 at 23:13 Comment(2)
This answer is incomplete without URLCodec.Ellynellynn
upvote for .replace() chaining, it's not ideal but it's enough for basic ad-hoc use casesHowler
C
-7

How about:

public String UrlEncode(String in_) {
    String retVal = "";
    try {
        retVal = URLEncoder.encode(in_, "UTF8");
    } catch (UnsupportedEncodingException ex) {
        Log.get().exception(Log.Level.Error, "urlEncode ", ex);
    }
    return retVal;
}

Cutler answered 20/3, 2012 at 1:11 Comment(1)
URLEncoder can't be used to escape invalid URL characters. Only to encode forms.Zippora
