Convert accented characters into ASCII characters

What is the optimal way to remove German (or French) accents from a vector of 16 million strings?

e.g., 'Sjögren's syndrome' into 'Sjogren's syndrome'

Conversion of a single character into a single character is preferred over transliteration such as

ä => ae ö => oe ü => ue.

Using regular expressions would be one option, but is there something better (an R package for this)?

gsub('ü','u',gsub('ö','o',"Sjögren's syndrome ( über) "))

There are SO solutions for non-R platforms but not a good one for R.

Suspicious answered 28/11, 2012 at 16:54 Comment(2)
See the answer to this post: https://mcmap.net/q/293114/-force-character-vector-encoding-from-quot-unknown-quot-to-quot-utf-8-quot-in-r - Ha

Use iconv to convert to ASCII with transliteration (if supported):

iconv(c("über","Sjögren's"),to="ASCII//TRANSLIT")
[1] "uber"      "Sjogren's"
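
Since the result of //TRANSLIT depends on the platform's iconv implementation, a base-R fallback that maps exactly one character to one character is chartr(). This is a different technique from the iconv call above, shown only as a minimal sketch: the character table is illustrative rather than exhaustive, and the names from_chars, to_chars, x and ascii_x are made up for the example. It assumes the input is UTF-8.

```r
# Illustrative one-to-one mapping table (an assumption, not a complete list).
from_chars <- "äöüÄÖÜéèêàâçñ"
to_chars   <- "aouAOUeeeaacn"

# chartr() needs both strings to have the same number of characters.
stopifnot(nchar(from_chars) == nchar(to_chars))

x <- c("über", "Sjögren's syndrome", "café")

# Vectorised over x; characters not in the table (including ') are left alone.
ascii_x <- chartr(from_chars, to_chars, x)
print(ascii_x)
```

Because chartr() only touches the characters listed in the table, the apostrophe in "Sjögren's" is left as it is.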
Twylatwyman answered 28/11, 2012 at 17:9 Comment(1)
For accented characters, e.g. é, this will result in something that looks like 'e. Run this command over the output vector of the operation above: out <- gsub("\\'", '', out) - Frustrated

One of the linked answers suggests:

library(stringi)
stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")

[1] "Zazolc gesla jazn"
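
For a vector of 16 million strings, it may help to transliterate each distinct value only once and then map the results back. The sketch below assumes the data contains many repeated strings (which may not hold for your data); to_ascii is a hypothetical helper name, not part of stringi.

```r
library(stringi)

# Hypothetical helper: transliterate the unique values, then index back
# into the result so the output has the same length and order as x.
to_ascii <- function(x) {
  ux <- unique(x)
  stri_trans_general(ux, "Latin-ASCII")[match(x, ux)]
}

x <- c("Sjögren's syndrome", "über", "Sjögren's syndrome")
res <- to_ascii(x)
print(res)
```

Whether this is faster than calling stri_trans_general on the full vector depends on how many duplicates there are, so it is worth timing both on a sample.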
Suspicious answered 27/4, 2016 at 18:35 Comment(0)
