XSLT: normalize non-breaking whitespace characters

I have a sample XML file like this:

<doc>
    <p>text1 text2  </p>
    <p>text1 text2     </p>
    <p>text1 text2   </p>
</doc>

In this sample XML, the first <p> ends with space characters (&#x0020;), the second <p> with tab characters (&#x9;), and the third <p> with non-breaking space characters (&#x00A0;).

I need to remove any whitespace appearing just before the closing tag.

So the expected output is:

<doc>
    <p>text1 text2</p>
    <p>text1 text2</p>
    <p>text1 text2</p>
</doc>

Using XSLT's normalize-space() I can remove the unnecessary spaces and tab characters, but not the non-breaking space characters.

<xsl:template match="p/text()">
    <xsl:value-of select="normalize-space()"/>
</xsl:template>

Any suggestions on how I can normalize all whitespace, including non-breaking spaces, in XSLT?

Maccaboy answered 2/12, 2016 at 5:42 Comment(1)
If you have a list of the potential non-breaking characters, you could translate() them to a normal space before calling normalize-space(). - Polyhydroxy

You could do:

<xsl:value-of select="normalize-space(translate(., '&#160;', ' '))"/>

This will work in XSLT 1.0 and 2.0 alike.
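For context, here is a minimal sketch of how that expression could sit in a complete XSLT 1.0 stylesheet. The identity template and the overall layout are assumptions added for illustration; only the `p/text()` match comes from the question and only the `select` expression comes from the line above.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template (assumed): copy every node that has no more specific rule. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text directly inside <p>: map NO-BREAK SPACE (&#160;) to a plain space,
       then let normalize-space() trim both ends and collapse internal runs. -->
  <xsl:template match="p/text()">
    <xsl:value-of select="normalize-space(translate(., '&#160;', ' '))"/>
  </xsl:template>

</xsl:stylesheet>
```

Note that normalize-space() also strips leading whitespace and collapses runs of whitespace inside the text to a single space, so this does more than trim before the closing tag. Further characters can be handled by listing them in the second argument of translate() and adding one space per character to the third.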


In XSLT 2.0, you could also use regex - for example:

<xsl:value-of select="replace(., '[\t\p{Zs}]', '')"/>

will remove every horizontal tab character as well as every character in the Unicode Space_Separator (Zs) category, wherever it occurs in the text and not only before the closing tag. That category includes not only the space and non-breaking space characters but also other space characters. Documentation is hard to find, but I believe this is currently the complete list (extracted from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt):

&#x0020; SPACE
&#x00A0; NO-BREAK SPACE
&#x1680; OGHAM SPACE MARK
&#x2000; EN QUAD
&#x2001; EM QUAD
&#x2002; EN SPACE
&#x2003; EM SPACE
&#x2004; THREE-PER-EM SPACE
&#x2005; FOUR-PER-EM SPACE
&#x2006; SIX-PER-EM SPACE
&#x2007; FIGURE SPACE
&#x2008; PUNCTUATION SPACE
&#x2009; THIN SPACE
&#x200A; HAIR SPACE
&#x202F; NARROW NO-BREAK SPACE
&#x205F; MEDIUM MATHEMATICAL SPACE
&#x3000; IDEOGRAPHIC SPACE

&#x10CB0; OLD HUNGARIAN CAPITAL LETTER EZS
&#x10CF0; OLD HUNGARIAN SMALL LETTER EZS
&#x16F36; MIAO LETTER ZSHA
&#x16F3C; MIAO LETTER ZSA
&#x16F3E; MIAO LETTER ZZSA
&#x16F41; MIAO LETTER ZZSYA

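However, testing with Saxon 9.5 (http://xsltransform.net/ncntCSo) shows that the last 6 characters are not recognized. That is expected: they are letters (general categories Lu, Ll and Lo), not space separators, and presumably turned up in the search of UnicodeData.txt only because their names contain "ZS". The 17 characters in the first group are the actual members of Zs.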
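Since the question only asks to strip whitespace just before the closing tag, the regex can be anchored to the end of the string. The following is a sketch under stated assumptions, not a tested solution: the identity template, the `[not(following-sibling::node())]` predicate and the use of `\s` to also cover line breaks are illustrative additions, not part of the original answer.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template (assumed): copy every node that has no more specific rule. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Only the text node that sits immediately before </p>.
       [\s\p{Zs}]+$ matches a trailing run of XML whitespace (space, tab,
       CR, LF) and Unicode Space_Separator characters; without the "m" flag,
       $ matches only at the end of the whole string. -->
  <xsl:template match="p/text()[not(following-sibling::node())]">
    <xsl:value-of select="replace(., '[\s\p{Zs}]+$', '')"/>
  </xsl:template>

</xsl:stylesheet>
```

Unlike the unanchored pattern, this leaves the space between text1 and text2 untouched, and unlike normalize-space() it does not collapse internal runs or trim leading whitespace.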

Sophiasophie answered 2/12, 2016 at 7:54 Comment(1)
I'd never come across Unicode categories in regex before, so reading regular-expressions.info/unicode.html helped me understand \p{Zs}. - Donelu
