How do I convert HTML percent-encoding to Unicode, with XSLT?

There are plenty of questions and answers online about this, but they all go in the opposite direction from what I need. My iTunes XML contains thousands of percent-encoded entries, in multiple languages, that I'm trying to convert to Unicode text with an XSLT stylesheet. Is there a function or process that I'm missing, other than tracking down every single character and doing a replace? Here is a small sample of the variety I'm working with. In each pair, the first line is the XML string value and the second is the plain text that I'm trying to generate and write to a text file.

<string>/iTunes/iTunes%20Music/Droit%20devant/L'odysse%CC%81e.mp3</string>

/iTunes/iTunes Music/Droit devant/L'odyssée.mp3

<string>A%CC%80%20la%20Pe%CC%82che</string>

À la Pêche

<string>%D0%97%D0%B0%D0%BF%D0%BE%D0%BC%D0%B8%D0%BD%D0%B0%D0%B8%CC%86</string>

Запоминай

<string>%CE%9A%CE%BF%CC%81%CF%84%CF%83%CC%8C%CE%B1%CF%81%CE%B9</string>

Κότσ̌αρι

This last one may not display properly for some, because of the overstriking hacek/caron.

Thanks in advance for any advice or leads

Spectator answered 7/12, 2012 at 18:17 Comment(3)
Why does it have to be done with XSLT? Can't you just open the file, parse it, and replace the text, serialize and save?Hautbois
It would help if you let us know which version of XSLT you're using and on which platform?Bibbie
Sorry, it's been a while since I've worked with XML/XSLT, so I forgot those details. I'm using XSLT 2, saxon9he for processing, working on Linux and OSX.Spectator

A pure XSLT 2.0 solution can use the string-to-codepoints() and codepoints-to-string() functions. The UTF-8 decoding is a bit messy, but it can be done.

This XSLT 2.0 style-sheet...

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:so="http://stackoverflow.com/questions/13768754"
  exclude-result-prefixes="xsl xs so">
<xsl:output encoding="UTF-8" omit-xml-declaration="yes" indent="yes" />
<xsl:strip-space elements="*"/>

<xsl:variable name="cp-base" select="string-to-codepoints('0A')" as="xs:integer+" />

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()" />
  </xsl:copy>
</xsl:template>

<xsl:function name="so:utf8decode" as="xs:integer*">
  <xsl:param name="bytes" as="xs:integer*" />
  <xsl:choose>
    <xsl:when test="empty($bytes)" />
    <xsl:when test="$bytes[1] eq 0"><!-- The null character is not valid for XML. -->
      <xsl:sequence select="so:utf8decode( remove( $bytes, 1))" />
    </xsl:when>
    <xsl:when test="$bytes[1] le 127">
      <xsl:sequence select="$bytes[1], so:utf8decode( remove( $bytes, 1))" />
    </xsl:when>
    <xsl:when test="$bytes[1] lt 224">
      <xsl:sequence select="
      ((($bytes[1] - 192) * 64) +
        ($bytes[2] - 128)        ),
        so:utf8decode( remove( remove( $bytes, 1), 1))" />
    </xsl:when>
    <xsl:when test="$bytes[1] lt 240">
      <xsl:sequence select="
      ((($bytes[1] - 224) * 4096) +
       (($bytes[2] - 128) *   64) +
        ($bytes[3] - 128)          ),
        so:utf8decode( remove( remove( remove( $bytes, 1), 1), 1))" />
    </xsl:when>
    <xsl:when test="$bytes[1] lt 248">
      <xsl:sequence select="
      ((($bytes[1] - 240) * 262144) +
       (($bytes[2] - 128) *   4096) +
       (($bytes[3] - 128) *     64) +
        ($bytes[4] - 128)            ),
        so:utf8decode( $bytes[position() gt 4])" />
    </xsl:when>
    <xsl:otherwise>
      <!-- A lead byte of 248 or above is not valid UTF-8; skip it. -->
      <xsl:sequence select="so:utf8decode( remove( $bytes, 1))" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:function>

<xsl:template match="string/text()">
  <xsl:analyze-string select="." regex="(%[0-9A-F]{{2}})+" flags="i">
    <xsl:matching-substring>
      <xsl:variable name="utf8-bytes" as="xs:integer+">
        <xsl:analyze-string select="." regex="%([0-9A-F]{{2}})" flags="i">
          <xsl:matching-substring>
          <xsl:variable name="nibble-pair" select="
            for $nibble-char in string-to-codepoints( upper-case(regex-group(1))) return
              if ($nibble-char ge $cp-base[2]) then
                  $nibble-char - $cp-base[2] + 10
                else
                  $nibble-char - $cp-base[1]" as="xs:integer+" />
            <xsl:sequence select="$nibble-pair[1] * 16 + $nibble-pair[2]" />                
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:variable>
      <xsl:value-of select="codepoints-to-string( so:utf8decode( $utf8-bytes))" />
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="." />
    </xsl:non-matching-substring>
    <xsl:fallback>
      <!-- For XSLT 1.0 operating in forward compatibility mode,
           just echo -->
      <xsl:value-of select="." />
    </xsl:fallback>
  </xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>

...applied to this input...

<doc>
    <string>/iTunes/iTunes%20Music/Droit%20devant/L'odysse%CC%81e.mp3</string>
    <string>A%Cc%80%20la%20Pe%CC%82che</string>
    <string>%D0%97%D0%B0%D0%BF%D0%BE%D0%BC%D0%B8%D0%BD%D0%B0%D0%B8%CC%86</string>
    <string>%CE%9A%CE%BF%CC%81%CF%84%CF%83%CC%8C%CE%B1%CF%81%CE%B9</string>
</doc>

...yields...

<doc>
   <string>/iTunes/iTunes Music/Droit devant/L'odyssée.mp3</string>
   <string>À la Pêche</string>
   <string>Запоминай</string>
   <string>Κότσ̌αρι</string>
</doc>
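
As a quick sanity check of the two-byte branch in so:utf8decode, here is a minimal sketch that decodes %CC%81 by hand. The template name main is an assumption made for this illustration and is not part of the stylesheet above. With Saxon it can be run as the initial template, for example with the -it:main option.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="UTF-8" />

<!-- Hypothetical entry point, used only for this illustration. -->
<xsl:template name="main">
  <!-- %CC%81 is the byte pair 0xCC = 204 and 0x81 = 129.
       Two-byte rule: (lead - 192) * 64 + (continuation - 128). -->
  <xsl:variable name="cp" select="(204 - 192) * 64 + (129 - 128)" />
  <!-- $cp is 769, that is U+0301 COMBINING ACUTE ACCENT.
       Code point 101 is the letter e, so this emits e followed by
       the combining accent, which renders as é. -->
  <xsl:value-of select="codepoints-to-string((101, $cp))" />
</xsl:template>

</xsl:stylesheet>
```

Note that the result is still in decomposed form (a base letter plus a combining mark), exactly as it was encoded in the input.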
Nun answered 8/12, 2012 at 15:18 Comment(8)
Wow, that is more complex than I imagined. I was hoping there was just a "decodeURI" function that I was missing. I accepted this answer for the pure XSLT aspect. Thanks for such an involved answer.Spectator
Six-plus years later, still no native "decode-from-uri" function, despite apparent demand and numerous well-articulated use cases. Many thanks to Sean for providing a workaround!Furunculosis
The function contained a little bug: (($bytes[1] - 224) * 262144) should be (($bytes[1] - 240) * 262144). Thank you to Gerrit Imsieke for the observation.Omasum
@EiríkrÚtlendi XSLT was designed and implemented by academics who usually don't consider real-world use cases. That explains why it isn't very popular and why its syntax is overly pedantic. Most developers would rather process XML using their preferred language and tools. If XSLT were easier to work with, it would have made XML the de facto format for content and data. It may have eliminated the need for content management systems.Nicky
@ATL_DEV: Re: that XML could have been "the defacto format for content and data", I don't know if I believe that -- XML itself is straightforward and simple enough, yet I still see new things appear like JSON that basically reproduce pretty much the same data-structure paradigms. Devs like inventing. Seems like no wheel is too round that it doesn't bear a little reinventing every now and then. 😄Furunculosis
@EiríkrÚtlendi My point exactly. XML is straightforward and simple enough, but transforming it with XSLT is painful. Most programmers prefer to process it using familiar programming languages, but its structure doesn't lend itself well to non-declarative code. Microsoft's XSD processor makes it easier by generating strongly typed objects that are easier to work with. Regardless, writing an application for a trivial transformation is a bit much. XSLT could have fit this role nicely without the drawbacks. JSON, unfortunately, doesn't do any better at solving these problems.Nicky
I suspect the absence of the decode-from-uri function has more to do with the fact that historically XPath and related languages such as XSLT have been used in web development, but mostly focused on the aspect of presenting XML to the user (i.e. in generating UIs) rather than on consuming and parsing HTTP requests (i.e. in the role of web server), where normally the XPath-based code is embedded in some other web server platform which is responsible for the low-level HTTP request parsing.Cabalistic
Additionally, if your uri contains + signs representing spaces, then you need to replace($uri, '\+', ' ') and feed that into the above.Mellissamellitz
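
The plus-sign step from the last comment can be sketched on its own as follows. This is a minimal illustration that performs only the replacement and leaves the percent-decoding to the stylesheet in the answer. It assumes the same doc/string input layout as the example input. The replacement has to happen before decoding, so that an encoded %2B still ends up as a literal plus sign.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="UTF-8" />
<xsl:strip-space elements="*"/>

<xsl:template match="string">
  <!-- Turn every literal '+' into a space. The backslash escapes
       '+', which is otherwise a quantifier in the regex. Any %2B in
       the input is left untouched for the later decoding step. -->
  <xsl:value-of select="replace(., '\+', ' ')" />
  <xsl:text>&#xA;</xsl:text>
</xsl:template>

</xsl:stylesheet>
```

For a hypothetical input such as <string>A%CC%80+la+Pe%CC%82che</string>, this emits A%CC%80 la Pe%CC%82che, which can then be percent-decoded as shown in the answer.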

Here's one option using the java.net.URLDecoder.decode Java method, but you'll either have to upgrade to Saxon-PE (or EE) or downgrade to Saxon-B.

Saxon-B is free and is still an XSLT 2.0 processor. Both can be found here: http://saxon.sourceforge.net/

Example...

XML Input

<doc>
    <string>/iTunes/iTunes%20Music/Droit%20devant/L'odysse%CC%81e.mp3</string>
    <string>A%CC%80%20la%20Pe%CC%82che</string>
    <string>%D0%97%D0%B0%D0%BF%D0%BE%D0%BC%D0%B8%D0%BD%D0%B0%D0%B8%CC%86</string>
    <string>%CE%9A%CE%BF%CC%81%CF%84%CF%83%CC%8C%CE%B1%CF%81%CE%B9</string>
</doc>

XSLT 2.0 (tested with Saxon-PE 9.4 and Saxon-B 9.1)

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
    xmlns:java-urldecode="java.net.URLDecoder">
    <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="string">
        <xsl:value-of select="java-urldecode:decode(.,'UTF-8')"/>
        <xsl:text>&#xA;</xsl:text>
    </xsl:template>

</xsl:stylesheet>

Output

/iTunes/iTunes Music/Droit devant/L'odyssée.mp3
À la Pêche
Запоминай
Κότσ̌αρι
Peanuts answered 8/12, 2012 at 8:3 Comment(1)
Great compact answer. +1 Thanks!Spectator
