Correctly generating a sitemap XML with special characters
B

4

9

I've a program which generates XML sitemaps for Google Webmaster Tools (among other things).
GWT is giving me errors for some sitemaps because the URLs contain character sequences like ã¾, ã‹, ã€, etc.

Sitemap specification says:

Your Sitemap file must be UTF-8 encoded (you can generally do this when you save the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters listed: &, ', ", <, >.

The special characters are escaped in the XML files (with HTML entities). XML file snippet:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://domain/folder/listing-&#227;&#129;.shtml</loc>
        ...

Are my URLs UTF-8 encoded? If not, how do I do this in Java? The following is the line in my program where I add the URL to the sitemap:

    siteMap.addUrl(StringEscapeUtils.escapeXml(countryName+"/"+twoCharFile.getRelativeFileName().toLowerCase()));

I'm not sure which ones are causing the error, probably the first two examples.

Bakken answered 23/5, 2011 at 11:28 Comment(5)
I don't really understand your question. It seems as though you haven't HTML-escaped your data (regardless of using UTF-8). Are you escaping or not?Zolner
I edited the question a lot.Bakken
Open your sitemap XML files in an editor that supports UTF-8 encoding (like Notepad++) for a quick test to determine whether your files are saved in the correct encoding.Inna
@Vineet Done. Not certain where to look to see if the URLs are correctly UTF-8 encoded. I've supplied a snippet of the XML file. It looks like the characters have been escaped (with HTML entities).Bakken
The Encoding menu in Notepad++ shows the encoding currently in use. You could change the encoding of the file, but that is not the point; use the suggested approach to specify the encoding for the URL. Also ensure that you write the sitemap file using UTF-8 encoding (when you use the FileOutputStream class or a different class).Inna
O
17

Try using URLEncoder.encode(stringToBeEncoded, "UTF-8") to encode the url.
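Before adopting this for a whole path, it may help to see what the call produces. URLEncoder performs application/x-www-form-urlencoded encoding, so it also encodes `/` and turns spaces into `+`. A minimal sketch, in which the sample path and the class and method names are made up for illustration:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, not part of the asker's program.
final class UrlEncoderPathCheck {

    // Form-encodes the input as UTF-8 using the standard URLEncoder.
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up relative path containing a slash, a space and U+00E3.
        System.out.println(formEncode("folder/my file-\u00E3.shtml"));
        // prints: folder%2Fmy+file-%C3%A3.shtml
    }
}
```

The non-ASCII character is encoded as expected, but the path separator becomes `%2F` and the space becomes `+`, which is not what a URL path segment needs.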

Ohmmeter answered 23/5, 2011 at 11:33 Comment(7)
This will encode the string as application/x-www-form-urlencoded. That is generally only acceptable for parameters used in the query part. It would not encode the path segments correctly, for example.Abigailabigale
How sure are you this will work? Are you suggesting I change the line to siteMap.addUrl(StringEscapeUtils.escapeXml(URLEncoder.encode(countryName+"/"+twoCharFile.getRelativeFileName().toLowerCase(), "UTF-8")));?Bakken
@Adam - no, you can't just pass a path part through this method - forward slashes will be encoded and spaces will be encoded incorrectly. This method is only useful for URIs when encoding query parameters for servers that expect them.Abigailabigale
@Abigailabigale hmm ok so siteMap.addUrl(StringEscapeUtils.escapeXml(countryName+"/"+URLEncoder.encode(twoCharFile, "UTF-8").getRelativeFileName().toLowerCase())); would be correct I take it? (twoCharFile would be the ã¾ for example)Bakken
McDowell is correct. This is mostly for parameters. I still suggest you try a few combinations of both XML escaping and URL encoding. (I feel running one over the other might corrupt the entire string, so you may have to see which parts need XML escaping and which parts of the path need this solution.)Ohmmeter
@Abigailabigale @Ohmmeter But does the % need to be escaped (for XML)?Bakken
Don't do this. Use java.net.URI.create(url).toASCIIStringItin
A
2

URLs must be percent-encoded as per the URI spec.

For example, the code point U+00e3 (ã) would become the encoded sequence %C3%A3.

When a URI is emitted in an XML document, it must conform to the markup requirements for XML.

For example, the URI http://foo/bar?a=b&x=%C3%A3 becomes http://foo/bar?a=b&amp;x=%C3%A3. The ampersand is an escape character in XML.

You can find a detailed discussion of URI encoding here.
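The two steps above can be sketched in Java roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical, the URL is passed in as separate components, and the XML escaping is hand-rolled for the five characters the sitemap spec lists rather than taken from a library. It relies on the multi-argument java.net.URI constructor quoting illegal characters and on toASCIIString() encoding the remaining non-ASCII characters as UTF-8 octets.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class, not part of the asker's program.
final class SitemapLocSketch {

    // Step 1: percent-encode. The path must start with "/"; query may be null.
    static String percentEncode(String scheme, String host, String path, String query) {
        try {
            return new URI(scheme, host, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Step 2: escape the five XML markup characters named in the sitemap spec.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Percent-encode first, then XML-escape the result for use inside <loc>.
    static String toLoc(String scheme, String host, String path, String query) {
        return escapeXml(percentEncode(scheme, host, path, query));
    }

    public static void main(String[] args) {
        // The example URI from above, with U+00E3 written as a Unicode escape.
        System.out.println(toLoc("http", "foo", "/bar", "a=b&x=\u00E3"));
    }
}
```

The `%` character is not an XML markup character, so the percent-encoded sequences pass through the XML escaping step unchanged; only the ampersand in the query is rewritten.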

Abigailabigale answered 23/5, 2011 at 11:53 Comment(0)
G
2

Don't confuse percentage encoding of non-ASCII characters in URLs with XML entity escapes of characters in URLs. You need to do both when creating XML sitemaps.

In honesty from reading your original post, it seems something funky is going on because the characters you mention remind me of when an unsuccessful conversion has taken place :)

Are you sure those characters truly are part of your URLs when using UTF-8?
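One way such sequences can arise, offered as an assumption rather than a diagnosis of your data, is UTF-8 bytes being decoded with a single-byte charset. The `&#227;&#129;` in your XML snippet would be consistent with the bytes E3 81, which begin the UTF-8 encoding of many Japanese kana. A minimal sketch, with hypothetical class and method names, using the character U+307E and ISO-8859-1 as the wrong charset:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class illustrating a bad charset conversion.
final class MojibakeSketch {

    // Encode as UTF-8, then decode those bytes with the wrong charset.
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverse the mistake: recover the bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u307E");
        // Print each resulting character as a decimal code point.
        for (int i = 0; i < garbled.length(); i++) {
            System.out.print((int) garbled.charAt(i) + " ");
        }
        System.out.println();
        // prints: 227 129 190
    }
}
```

If the stored strings were damaged this way, the repair step recovers the original text only when no bytes were lost in the bad conversion, so it is worth checking a few rows by hand first.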

Gluten answered 27/5, 2011 at 16:18 Comment(3)
"In honesty from reading your original post, it seems something funky is going on because the characters you mention remind me of when an unsuccessful conversion has taken place." You are right, and I've a script ready to go through the DB and clean that up. But there's still a problem with the encoding too. So if I had those characters, do I need to percent-encode those characters alone and then escape the result for XML (with entities)?Bakken
1) Convert document to UTF-8 2) Percentage encode all non-ASCII chars 3) Convert & to &amp; < to &lt; etc.Gluten
I've step one done. And I know how to do step 2 but does % need to be escaped?Bakken
T
1

All non-ASCII characters in a URL have to be 'x-url-encoding' encoded (percent-encoded).

Here is the wiki link that explains it: http://en.wikipedia.org/wiki/Percent-encoding.

In addition, all XML special symbols (&, >, <, etc.) have to be escaped.

Jai's answer shows the correct method to x-url-encode an arbitrary string. Note, however, that it does not do XML escaping.

Tarmac answered 23/5, 2011 at 11:35 Comment(4)
Instead of percent-encoding, punycode is also a possibility: tools.ietf.org/html/rfc3492Ultrasonics
I've added a snippet of the XML file. Are both of your answers still applicable?Bakken
@Adam. Still applies, as your resulting URL is not x-url-encoded. Also, because x-url-encoding is not a trivial operation, I highly recommend keeping URL parts in plain ASCII. I don't know what the requirements are for your system, but could you possibly rename the file to listing-20110523.shtml (or something along those lines)? This way you don't even have to bother with encoding your URLs.Tarmac
No not really possible. We have a big big system done this way.Bakken
