How do I strip accents from characters in XSL?

Asked 22/3, 2011 at 21:35 Answered 9/12, 2020 at 20:57

Solved xml xslt unicode character-encoding

I keep looking but can't find an XSL function that does for characters what normalize-space() does for whitespace. My content contains accented Unicode characters, which is fine, but I'm creating a filename from that content and don't want the accents there.

So, is there something that I'm overlooking, or not googling properly, to easily process characters?

In the XML data:

<filename>gri_gonéwiththèw00mitc</filename>

In XSLT stylesheet:

<xsl:variable name="file">
    <xsl:value-of select="filename"/>
</xsl:variable>

<xsl:value-of select="$file"/>

results in "gri_gonéwiththèw00mitc"

where

<xsl:value-of select='replace( normalize-unicode( "$file", "NFKD" ), "[^\\p{ASCII}]", "" )'/>

results in nothing.

What I'm aiming for is gri_gonewiththew00mitc (no accents)

Am I using the syntax wrong?

Intrigante answered 22/3, 2011 at 21:35 Comment(2)

Removing accents only works for a small subset of Unicode characters. As far as I know, there's no standard way of latinized transcription of characters. (That is to say, there's a different one for each language.) – Ostracon 22/3, 2011 at 21:50

Check my answer for the correct RegExp syntax. – Blameworthy 24/3, 2011 at 2:59

In XSLT/XPath 1.0, if you want to replace those accented characters with their unaccented counterparts, you can use the translate() function.

But that assumes your accented Unicode characters are single precomposed code points rather than base letters followed by combining marks. If they are not, you would need the XPath 2.0 normalize-unicode() function.

And if the real goal is to have a valid URI, you should use encode-for-uri().

Update: Examples

translate('gri_gonéwiththèw00mitc','áàâäéèêëíìîïóòôöúùûü','aaaaeeeeiiiioooouuuu')

Result: gri_gonewiththew00mitc
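To illustrate why translate() depends on how the accents are stored, here is a small Python sketch (not XPath). It applies the same explicit character table to a precomposed and to a decomposed copy of the filename. The names `TABLE`, `precomposed` and `decomposed` are made up for this example, and Python's `str.translate` is only a stand-in for the XPath function.

```python
import unicodedata

# Same character lists as in the translate() call above; NFC keeps each
# accented letter as one precomposed code point so both lists line up.
ACCENTED = unicodedata.normalize("NFC", "áàâäéèêëíìîïóòôöúùûü")
PLAIN = "aaaaeeeeiiiioooouuuu"
TABLE = str.maketrans(ACCENTED, PLAIN)

name = "gri_gonéwiththèw00mitc"
precomposed = unicodedata.normalize("NFC", name)  # é is one code point
decomposed = unicodedata.normalize("NFD", name)   # e + combining acute

# The explicit table matches only the precomposed letters.
print(precomposed.translate(TABLE))
# On decomposed input the combining marks pass through untouched.
print("\u0301" in decomposed.translate(TABLE))
```

The first call gives the accent-free name; the second shows that the combining acute accent is still in the string, which is the case where normalization is needed instead.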

encode-for-uri('gri_gonéwiththèw00mitc')

Result: gri_gon%C3%A9withth%C3%A8w00mitc

Corrected expression, as suggested by @biziclop:

replace(normalize-unicode('gri_gonéwiththèw00mitc','NFKD'),'\P{ASCII}','')

Result: gri_gonewiththew00mitc

Note: In XPath 2.0, the correct character class negation is with a capital \P.
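For anyone who wants to see the two steps separately, here is a rough Python sketch of the same idea (decompose to NFKD, then drop everything outside ASCII). It is not XPath, and `strip_accents` is a hypothetical helper name used only for illustration.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose to NFKD, then drop every character above U+007F."""
    # Step 1: NFKD splits é into 'e' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Step 2: keep only basic ASCII; the combining marks are discarded.
    return "".join(ch for ch in decomposed if ord(ch) < 128)

print(strip_accents("gri_gonéwiththèw00mitc"))  # gri_gonewiththew00mitc
```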

Blameworthy answered 22/3, 2011 at 21:52 Comment(3)

translate() assumes that you list all the characters you want to replace. My guess is that OP wants to avoid this. Although I don't think it's possible in general. – Ostracon 22/3, 2011 at 21:59

@biziclop: There is a reason for my answer having only one link to encode-for-uri() function. – Blameworthy 22/3, 2011 at 22:7

@Alejandro On second thought, if you normalize your string to NFKD form and then throw away every non-basic ASCII (0-127) character (you can use a regexp replace for that), you will get an accent-free string. – Ostracon 22/3, 2011 at 22:38

So, contrary to my comment, you could try this:

replace( normalize-unicode( "öt hűtőházból kértünk színhúst", "NFKD" ), "[^\\p{ASCII}]", "" )

Be warned, though, that any characters that can't be decomposed and aren't basic ASCII (Norwegian ø or Icelandic Þ, for example) will be deleted from the string entirely. That is probably acceptable for your requirements.
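The deletion of non-decomposable characters can be seen in a short Python sketch (again only an illustration, not XPath). One possible workaround is to map such characters explicitly before stripping. The `FALLBACK` table and the `to_ascii` helper are hypothetical, and the replacements chosen (ø to o, Þ to Th) are assumptions, not a standard transliteration.

```python
import unicodedata

# Hypothetical fallback table for letters that have no Unicode decomposition.
# The replacements are illustrative choices, not a standard.
FALLBACK = str.maketrans({"ø": "o", "Ø": "O", "Þ": "Th", "þ": "th"})

def to_ascii(text: str, use_fallback: bool = True) -> str:
    """Optionally map non-decomposable letters, then NFKD and keep ASCII only."""
    if use_fallback:
        text = text.translate(FALLBACK)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ord(ch) < 128)

print(to_ascii("Tromsø", use_fallback=False))  # Troms  (ø silently dropped)
print(to_ascii("Tromsø"))                      # Tromso
```

In XPath the same pre-mapping could be done with translate() for one-to-one replacements, applied before normalize-unicode().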

Ostracon answered 22/3, 2011 at 22:55 Comment(2)

Good example. Do check my update for the correct RegExp character class negation syntax. – Blameworthy 24/3, 2011 at 2:59

Where is this RegExp syntax example? I need to replace a list of characters.... ɔ ɛ and a few that I can't even enter here.... As noted, these characters are deleted from the string using the above replace(normalize-unicode()) example. thx – Venerable 15/5, 2023 at 14:3

The previously suggested expressions use a character class named 'ASCII', which is not recognised. In my experience, XPath 2.0 recognises the block 'IsBasicLatin', which should serve the same purpose as 'ASCII'.

replace(normalize-unicode('Lliç d''Am Oükl Úkřeč', 'NFKD'), '\P{IsBasicLatin}', '')

Iffy answered 25/2, 2015 at 14:14 Comment(0)

The top-voted answer no longer works in XPath 2.0, as mentioned by Yuri. 'IsBasicLatin' is an appropriate substitute for 'ASCII'.

The following code works:

replace(normalize-unicode('çgri_gonéwiththèmitç','NFKD'),'\P{IsBasicLatin}','')

Acerbity answered 9/12, 2020 at 20:57 Comment(0)
