I have a program that generates XML sitemaps for Google Webmaster Tools (GWT), among other things.
GWT is giving me errors for some sitemaps because the URLs contain character sequences like ã¾, ã‹, ã€, etc. The sitemap guidelines state:
Your Sitemap file must be UTF-8 encoded (you can generally do this when you save the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters listed: &, ', ", <, >.
The special characters listed there are already escaped with entities in my XML files. A snippet from one of the files:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://domain/folder/listing-ã.shtml</loc>
...
Are my URLs UTF-8 encoded? If not, how do I do this in Java? The following is the line in my program where I add the URL to the sitemap:
siteMap.addUrl(StringEscapeUtils.escapeXml(countryName+"/"+twoCharFile.getRelativeFileName().toLowerCase()));
I'm not sure which of the sequences are causing the error; probably the first two of the examples above.
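For reference, here is a minimal sketch of what I understand the requirement to mean: percent-encode each path segment as UTF-8 bytes first, then entity-escape the result for XML. I'm assuming that percent-encoding is what is expected for non-ASCII characters in a sitemap URL. The class and method names (`SitemapUrlSketch`, `encodeSegment`, `encodePath`) are hypothetical and not from my program, and the hand-rolled `escapeXml` is only a stand-in for `StringEscapeUtils.escapeXml` so that the example runs without Commons Lang.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the real program: a sketch only.
final class SitemapUrlSketch {

    // Percent-encode a single path segment using its UTF-8 bytes.
    static String encodeSegment(String segment) {
        try {
            // URLEncoder produces form encoding, where a space becomes '+';
            // in a URL path a space should be %20 instead.
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Encode each segment separately so the '/' separators are preserved.
    static String encodePath(String path) {
        StringBuilder out = new StringBuilder();
        String[] segments = path.split("/", -1);
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                out.append('/');
            }
            out.append(encodeSegment(segments[i]));
        }
        return out.toString();
    }

    // Stand-in for StringEscapeUtils.escapeXml: escapes the five characters
    // named in the guidelines (&, ', ", <, >).
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e3 is the character from the <loc> example above.
        String path = encodePath("folder/listing-\u00e3.shtml");
        System.out.println("http://domain/" + escapeXml(path));
        // prints http://domain/folder/listing-%C3%A3.shtml
    }
}
```

If this is the right idea, the line in my program would presumably become something like `siteMap.addUrl(StringEscapeUtils.escapeXml(encodePath(...)))`, with the percent-encoding applied before the XML escaping.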