What is the best way to remove accents (normalize) in a Python unicode string?

Asked 5/2, 2009 at 21:10 Answered 6/7, 2023 at 8:32

Solved python python-3.x unicode python-2.x diacritics

801

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

1. Convert the Unicode string to its long normalized form (with a separate character for letters and diacritics).
2. Remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU or is this possible with just the Python standard library? And what about python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.

Aspire answered 5/2, 2009 at 21:10 Comment(0)

420

How about this:

import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

This works on greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
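
To make the comparison concrete, here is a small sketch (not part of the original answer; the name strip_accents_combining is made up for illustration). It prints both properties for a decomposed "é" and then filters on unicodedata.combining instead of the category:

```python
import unicodedata

# NFD splits "é" (U+00E9) into "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", "\u00e9")
for ch in decomposed:
    # category() is the general category, combining() the canonical combining class.
    print("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.combining(ch))


def strip_accents_combining(s):
    """Hypothetical variant: drop every character with a nonzero combining class."""
    return "".join(
        c for c in unicodedata.normalize("NFD", s) if not unicodedata.combining(c)
    )


print(strip_accents_combining("A \u00c0 \u0394 \u038e"))
```

The two filters agree for common Latin and Greek accents, but they are not identical in general: some nonspacing marks have combining class 0, and some characters with a nonzero class are not in category Mn. Which one is right depends on the text you expect.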

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".

Vraisemblance answered 5/2, 2009 at 22:17 Comment(9)

These are not composed characters, unfortunately--even though "ł" is named "LATIN SMALL LETTER L WITH STROKE"! You'll either need to play games with parsing unicodedata.name, or break down and use a look-alike table-- which you'd need for Greek letters anyway (Α is just "GREEK CAPITAL LETTER ALPHA"). – Mytilene 7/4, 2012 at 11:25

@andi, I'm afraid I can't guess what point you want to make. The email exchange reflects what I wrote above: Because the letter "ł" is not an accented letter (and is not treated as one in the Unicode standard), it does not have a decomposition. – Mytilene 23/11, 2014 at 0:12

@Mytilene (late follow-up): This works perfectly well for Greek as well – eg. "GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA" is normalised into "GREEK CAPITAL LETTER ALPHA" just as expected. Unless you are referring to transliteration (eg. "α" → "a"), which is not the same as "removing accents"... – Duvalier 16/5, 2016 at 7:41

@lenz, I wasn't talking about removing accents from Greek, but about the "stroke" on the ell. Since it is not a diacritic, changing it to plain ell is the same as changing Greek Alpha to A. If you don't want it, don't do it, but in both cases you're substituting a Latin (near) look-alike. – Mytilene 16/5, 2016 at 17:1

Mostly works nice :) But it doesn't transform ß into ascii ss in example. I would still use unidecode to avoid accidents. – Thermomotor 1/3, 2017 at 6:53

Should probably use .combining() to check the property directly, rather than only handling .category() == 'Mn, which will mess up – Indecorum 5/5, 2017 at 21:46

+ for not requiring installing anything – Paramour 24/11, 2020 at 2:5

the spanish "ñ" is not an accent, but this changes it to an "n" (another letter) – Cetus 16/3, 2022 at 5:33

This is definitely a better solution than the above (by @oefe) since it uses a Python internal module, whereas the above uses an external module which needs to be installed. An external module should be installed only when a problem cannot be solved with internal modules. This should be added to "The Zen of Python" :) – Vapid 15/6, 2023 at 10:4
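
The comments above about "ł", "ß" and "ñ" come down to whether Unicode defines a decomposition for the character. A quick way to check, as a sketch (has_decomposition is a helper name invented here):

```python
import unicodedata


def has_decomposition(ch):
    """Return True if the Unicode database lists a decomposition mapping for ch."""
    return unicodedata.decomposition(ch) != ""


# é and ñ decompose into base letter + combining mark; ł, ß and ø do not.
for ch in "\u00e9\u00f1\u0142\u00df\u00f8":
    print(ch, unicodedata.name(ch), repr(unicodedata.decomposition(ch)))
```

Characters without a decomposition pass through strip_accents unchanged, while "ñ" is split into "n" plus a combining tilde and therefore loses the tilde.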

800

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.

Example:

>>> from unidecode import unidecode
>>> unidecode('kožušček')
'kozuscek'
>>> unidecode('北亰')
'Bei Jing '
>>> unidecode('François')
'Francois'

Reticent answered 13/4, 2010 at 21:21 Comment(14)

Yeah, this is a better solution than simply stripping the accents. It provides much more useful transliterations for the languages that have conventions for writing words in ASCII. – Reparable 13/4, 2010 at 21:29

depends what you're trying to achieve. for example I'm doing a search right now, and I don't want to transliterate greek/russian/chinese, I just want to replace "ą/ę/ś/ć" with "a/e/s/c" – Francophile 31/3, 2012 at 18:15

@EOL unidecode works great for strings like "François", if you pass unicode objects to it. It looks like you tried with a plain byte string. – Parrett 30/4, 2012 at 9:38

@EOL It looks like the "C cédille" is now handled properly. So, as far as I tested unidecode, which isn't much, I now consider it gives very good results. – Gannes 3/3, 2013 at 6:13

Note that unidecode >= 0.04.10 (Dec 2012) is GPL. Use earlier versions or check github.com/kmike/text-unidecode if you need a more permissive license and can stand a slightly worse implementation. – Cerveny 23/2, 2014 at 22:27

Doesn't seem to work with German eg. Ö => O Where it should be Oe – Woolsack 7/1, 2015 at 13:33

how to use it with variables? – Discordant 29/11, 2015 at 19:12

@Woolsack the Ö => OE is quite German-specific. In Finnish, some words like ääliö would be rendered as the completely unrecognizable aeaelioe; it is simply more correct to omit the diaeresis than to add the e, though pronunciation of the accented letter is pretty much on par with the German umlaut. – Propitiatory 20/8, 2016 at 6:7

@EOL You'll be pleased to know that in the latest version of the library, 'François' is mapped to 'Francois' as you'd expect. – Zeiler 15/9, 2016 at 13:44

unidecode replaces ° with deg. It does more than just removing accents. – Outpost 28/4, 2017 at 12:2

People need to understand that Unicode character decomposition is a language specific mapping, it does not work universally and modules like unidecode are never going to work well while ignoring the locale or language of the input. As to CJK characters, it's a childish assumption that you can take an arbitrary CJK character and 'render' it with ASCII: CJK characters can have multiple readings both in Chinese and Japanese, and the Chinese, Japanese, etc. readings are also going to be different. These modules are a waste of time. – Twana 14/5, 2017 at 17:0

What if I'm reading a string from a file, how do I give it as input to the library? Like u+'str', but that would give me the error: name u is not defined – Gerger 30/7, 2018 at 11:14

pip install Unidecode – Koto 6/4, 2022 at 10:35

Unfortunately, It also replaces cyrillic characters with latin ones like "ф" -> "f". – Audible 6/9, 2023 at 8:15
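
For the use case raised in the comments above (folding "ą/ę/ś/ć" to "a/e/s/c" while leaving Greek, Cyrillic or Chinese untouched), a standard-library sketch is possible. It is not what unidecode does; strip_latin_accents is a hypothetical name, and it assumes that "Latin" can be detected from the Unicode character name:

```python
import unicodedata


def strip_latin_accents(text):
    """Drop combining marks only when they follow a Latin base letter."""
    out = []
    base_is_latin = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if base_is_latin:
                continue  # mark attached to a Latin letter: drop it
        else:
            # Assumption: Latin letters have names starting with "LATIN".
            base_is_latin = unicodedata.name(ch, "").startswith("LATIN")
        out.append(ch)
    # Recompose whatever was decomposed but not stripped (e.g. Cyrillic "й").
    return unicodedata.normalize("NFC", "".join(out))


print(strip_latin_accents("\u0105\u0119\u015b\u0107 \u0439 \u03ac"))
```

Note that the result is always in NFC form, so input that was not NFC-normalized may come back with a different (but canonically equivalent) code point sequence.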

218

I just found this answer on the Web:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.

Edit: this does the trick:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) returns the canonical combining class of the character c: a nonzero (truthy) value if c combines with the preceding character, which mainly means it is a diacritic, and 0 otherwise.
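
The answers on this page use both 'NFD' and 'NFKD'. The following sketch (strip_marks is a name chosen for this illustration) shows the practical difference: NFKD additionally replaces compatibility characters such as ligatures and superscripts, which may or may not be what you want.

```python
import unicodedata


def strip_marks(text, form):
    """Normalize with the given form, then drop all combining characters."""
    return "".join(
        c for c in unicodedata.normalize(form, text) if not unicodedata.combining(c)
    )


sample = "\ufb01anc\u00e9 \u00b2"  # "fi" ligature + "ancé", then SUPERSCRIPT TWO
print(strip_marks(sample, "NFD"))   # ligature and superscript are kept
print(strip_marks(sample, "NFKD"))  # ligature becomes "fi", superscript becomes "2"
```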

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café"  # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)

Aspire answered 5/2, 2009 at 21:19 Comment(12)

I had to add 'utf8' to unicode: nkfd_form = unicodedata.normalize('NFKD', unicode(input_str, 'utf8')) – Squelch 8/1, 2012 at 23:27

@Jabba: , 'utf8' is a "safety net" needed if you are testing input in terminal (which by default does not use unicode). But usually you don't have to add it, since if you're removing accents then input_str is very likely to be utf8 already. It doesn't hurt to be safe, though. – Dried 17/4, 2012 at 23:15

>>> def remove_accents(input_str): ... nkfd_form = unicodedata.normalize('NFKD', unicode(input_str)) ... return u"".join([c for c in nkfd_form if not unicodedata.combining(c)]) ... >>> remove_accents('é') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<stdin>", line 2, in remove_accents UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128) – Naresh 9/6, 2013 at 15:40

@rbp: you should pass a unicode string to remove_accents instead of a regular string (u"é" instead of "é"). You passed a regular string to remove_accents, so when trying to convert your string to a unicode string, the default ascii encoding was used. This encoding does not support any byte whose value is >127. When you typed "é" in your shell, your O.S. encoded that, probably with UTF-8 or some Windows Code Page encoding, and that included bytes >127. I'll change my function in order to remove the conversion to unicode: it will bomb more clearly if a non-unicode string is passed. – Aspire 11/6, 2013 at 10:11

@Aspire that worked perfectly >>> remove_accents(unicode('é')) – Naresh 12/6, 2013 at 20:59

This answer gave me the best result on a large data set, the only exception is "ð"- unicodedata wouldn't touch it! – Levenson 8/6, 2018 at 2:38

The first example removes "ł" ("LATIN SMALL LETTER L WITH STROKE") completely :( – Nessus 8/11, 2019 at 12:42

In Python 3, the first version of remove_accents in this post returns a bytes. To return a str, you need to call nfkd_form.encode('ASCII', 'ignore').decode('utf8') – Feola 28/11, 2020 at 17:1

doesn't work on đ character, it supposed to be d – Cwmbran 19/1, 2021 at 7:56

works good but note that the first function doesnt work for ß – Rainarainah 18/5, 2021 at 20:22

also note that the function returns bytes like b"xxx" not string like "xxx" you have to convert it to string first like str(remove_accents(input_str), 'utf8') stackabuse.com/convert-bytes-to-string-in-python – Rainarainah 18/5, 2021 at 21:26

the Spanish "ñ" is not an accent, but this changes it to an "n". "n" and "ñ" are different letters in Spanish... – Cetus 16/3, 2022 at 5:35
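
As several comments point out, the first version of remove_accents returns bytes on Python 3 and silently drops letters that have no ASCII base. A minimal Python 3 sketch of that first variant (renamed remove_accents_ascii here to avoid confusion) makes both effects visible:

```python
import unicodedata


def remove_accents_ascii(input_str: str) -> str:
    """Python 3 take on the first variant: returns str instead of bytes."""
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    # encode() yields bytes; decoding them again gives back a str.
    return nfkd_form.encode("ascii", "ignore").decode("ascii")


print(remove_accents_ascii("caf\u00e9"))       # accent removed
print(remove_accents_ascii("Stra\u00dfe"))     # "ß" has no decomposition and is dropped
print(repr(remove_accents_ascii("\u0142")))    # "ł" is dropped entirely
```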

I work on a project that must stay compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user input.

Thanks to the answers here, I have written this function, which works wonders.

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

result:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'

Runyan answered 24/7, 2015 at 10:8 Comment(2)

With Py2.7, passing an already unicode string errors at text = unicode(text, 'utf-8'). A workaround for that was to add except TypeError: pass – Ribera 18/3, 2016 at 15:56

Is there some way to make sure that if the letter M is large (capitalized) in the input then M is also large (capitalized) in the output? The input string "Montréal" becomes the output string "Montreal". – Onus 23/2, 2023 at 21:30
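
Regarding the last comment: the lowercasing happens only because text_to_id calls text.lower(). A Python 3 only sketch that keeps the original case (text_to_id_keep_case is a hypothetical name, and the Python 2 handling is deliberately left out) could look like this:

```python
import re
import unicodedata


def text_to_id_keep_case(text):
    """Like text_to_id above, but without lowercasing the input."""
    # Decompose, then drop everything that is not plain ASCII.
    text = unicodedata.normalize("NFD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r" +", "_", text)               # runs of spaces -> one underscore
    return re.sub(r"[^0-9a-zA-Z_-]", "", text)    # remove any other punctuation


print(text_to_id_keep_case("Montr\u00e9al, \u00fcber, 12.89"))
```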

This handles not only accents, but also "strokes" (as in ø etc.):

import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char

This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is very elegant at all. In fact, it's more of a hack, as pointed out in the comments, since Unicode names are really just names: they come with no guarantee of consistency.

There are still special letters that are not handled by this, such as turned and inverted letters, since their unicode name does not contain 'WITH'. It depends on what you want to do anyway. I sometimes needed accent stripping for achieving dictionary sort order.

EDIT NOTE:

Incorporated suggestions from the comments (handling lookup errors, Python-3 code).
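
As a usage sketch, here is a slightly stricter variant applied to a whole string. It is an illustration under stated assumptions, not the answer's code: base_letter is a made-up name, it only touches names containing " LETTER " (so "UMBRELLA WITH RAIN DROPS" is left alone), it tolerates unnamed characters such as newlines, and it memoizes results with lru_cache as suggested in the comments below.

```python
import unicodedata as ud
from functools import lru_cache


@lru_cache(maxsize=None)
def base_letter(char):
    """Return the letter named before ' WITH ' in char's Unicode name, if any."""
    name = ud.name(char, "")  # "" for characters without a name (e.g. "\n")
    head, sep, _ = name.partition(" WITH ")
    if not sep or " LETTER " not in head:
        return char
    try:
        return ud.lookup(head)
    except KeyError:
        return char  # stripping "WITH ..." produced an invalid name


# Works on precomposed (NFC) input; decomposed combining marks are not touched.
print("".join(base_letter(c) for c in "\u0141\u00f3d\u017a \u00f8 \u2614"))
```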

Duvalier answered 21/3, 2013 at 12:39 Comment(5)

You should catch the exception if the new symbol doesn't exist. For example there's SQUARE WITH VERTICAL FILL ▥, but there's no SQUARE. (not to mention that this code transforms UMBRELLA WITH RAIN DROPS ☔ into UMBRELLA ☂). – Uropygium 9/7, 2015 at 9:45

This looks elegant in harnessing the semantic descriptions of characters that are available. Do we really need the unicode function call in there with python 3 though? I think a tighter regex in place of the find would avoid all the trouble mentioned in the comment above, and also, memoization would help performance when it's a critical code path. – Islean 29/12, 2018 at 14:30

@matanster no, this is an old answer from the Python-2 era; the unicode typecast is no longer appropriate in Python 3. In any case, in my experience there is no universal, elegant solution to this problem. Depending on the application, any approach has its pros and cons. Quality-thriving tools like unidecode are based on hand-crafted tables. Some resources (tables, algorithms) are provided by Unicode, eg. for collation. – Duvalier 29/12, 2018 at 14:45

I just repeat, what is above (py3): 1) unicode(char)->char 2) try: return ud.lookup(desc) except KeyError: return char – Nessus 8/11, 2019 at 12:50

@Nessus you are right: since this thread is so popular, this answer deserves some updating/improving. I edited it. – Duvalier 8/11, 2019 at 18:22

In my view, the proposed solutions should NOT be accepted answers. The original question asks for the removal of accents, so the correct answer should do only that, not that plus other, unspecified changes.

Simply observe the result of this code, which is the accepted answer, where I have changed "Málaga" to "Málagueña":

accented_string = u'Málagueña'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaguena' and is of type 'str'

There is an additional change (ñ -> n), which is not requested in the OQ.

A simple function that does the requested task, in lower form:

import re

def f_remove_accents(old):
    """
    Removes common accent characters, lower form.
    Uses: regex.
    """
    new = old.lower()
    new = re.sub(r'[àáâãäå]', 'a', new)
    new = re.sub(r'[èéêë]', 'e', new)
    new = re.sub(r'[ìíîï]', 'i', new)
    new = re.sub(r'[òóôõö]', 'o', new)
    new = re.sub(r'[ùúûü]', 'u', new)
    return new

Persecution answered 8/9, 2021 at 8:43 Comment(2)

"correct answer should only do that, not that plus other, unspecified, changes" -> you make capital letters in lower case – Koo 18/2, 2022 at 13:5

Well... your answer converts "M" into "m" (not requested by OQ) – Forras 29/8, 2022 at 11:10
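
To address the case issue raised in these comments, the same explicit mapping can be expressed as a translation table that covers both cases. This is only a sketch of that idea (GROUPS and ACCENT_TABLE are names invented here), and like the answer it relies on the explicit mapping the question wanted to avoid:

```python
# Same vowel groups as in f_remove_accents above; "ñ" is deliberately absent.
GROUPS = {"a": "àáâãäå", "e": "èéêë", "i": "ìíîï", "o": "òóôõö", "u": "ùúûü"}

ACCENT_TABLE = {}
for plain, accented in GROUPS.items():
    for ch in accented:
        ACCENT_TABLE[ord(ch)] = plain                  # lowercase form
        ACCENT_TABLE[ord(ch.upper())] = plain.upper()  # uppercase form


def remove_listed_accents(text):
    """Replace only the listed accented vowels, preserving case."""
    return text.translate(ACCENT_TABLE)


print(remove_listed_accents("M\u00e1lague\u00f1a"))
```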

gensim.utils.deaccent(text) from Gensim - topic modelling for humans:

'Sef chomutovskych komunistu dostal postou bily prasek'

Another solution is unidecode.

Note that the suggested solution with unicodedata typically removes accents only in some characters (e.g. it turns 'ł' into '', rather than into 'l').

Zelda answered 30/1, 2018 at 0:27 Comment(4)

deaccent still gives ł instead of l. – Cairistiona 10/6, 2019 at 8:13

You shouldn't need to install NumPy and SciPy just to get accents removed. – Ciprian 13/9, 2019 at 18:46

thanks for gensim reference ! how does it compare to unidecode (in terms of speed or accuracy) ? – Unmusical 20/12, 2019 at 11:38

Changes the "ñ" for "n" which you wouldn't want, at least if you're looking for removing accents in Spanish – Cetus 16/3, 2022 at 5:30

In response to @MiniQuark's answer:

I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats. As a test, I created a test.txt file that looked like this:

Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 to get it to work (which I found in a python ticket), as well as incorporate @Jabba's comment:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

The result:

Montreal
uber
12.89
Mere
Francoise
noel
889

(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)

Tantalizing answered 12/6, 2013 at 15:48 Comment(4)

remove_accents was meant to remove accents from a unicode string. In case it's passed a byte-string, it tries to convert it to a unicode string with unicode(input_str). This uses python's default encoding, which is "ascii". Since your file is encoded with UTF-8, this would fail. Lines 2 and 3 change python's default encoding to UTF-8, so then it works, as you found out. Another option is to pass remove_accents a unicode string: remove lines 2 and 3, and on the last line replace element by element.decode("utf-8"). I tested: it works. I'll update my answer to make this clearer. – Aspire 12/6, 2013 at 19:52

Nice edit, good point. (On another note: The real problem I've realised is that my data file is apparently encoded in iso-8859-1, which I can't get to work with this function, unfortunately!) – Tantalizing 12/6, 2013 at 20:11

aseagram: simply replace "utf-8" with "iso-8859-1", and it should work. If you're on windows, then you should probably use "cp1252" instead. – Aspire 13/6, 2013 at 7:43

BTW, reload(sys); sys.setdefaultencoding("utf-8") is a dubious hack sometimes recommended for Windows systems; see #28657510 for details. – Offensive 16/5, 2018 at 13:13
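
On Python 3 none of the default-encoding workarounds discussed above are needed, because the encoding is stated when the file is opened. A sketch under that assumption (deaccent_rows is a hypothetical helper; an in-memory buffer stands in for test.txt so the example is self-contained):

```python
import csv
import io
import unicodedata


def remove_accents(input_str):
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))


def deaccent_rows(lines):
    """Yield each CSV row with accents removed and surrounding spaces stripped."""
    for row in csv.reader(lines):
        yield [remove_accents(cell.strip()) for cell in row]


# With a real file, state its encoding explicitly, for example:
#   with open("test.txt", newline="", encoding="utf-8") as f:
#       rows = list(deaccent_rows(f))
sample = io.StringIO("Montr\u00e9al, \u00fcber, 12.89, M\u00e8re, Fran\u00e7oise, no\u00ebl, 889\n")
rows = list(deaccent_rows(sample))
print(rows)
```

For a file in another encoding, such as the iso-8859-1 case mentioned in the comments, only the encoding argument passed to open() would change.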

import unicodedata
from random import choice

import perfplot
import regex
import text_unidecode


def remove_accent_chars_regex(x: str):
    return regex.sub(r'\p{Mn}', '', unicodedata.normalize('NFKD', x))


def remove_accent_chars_join(x: str):
    # answer by MiniQuark
    # https://mcmap.net/q/40948/-what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string
    return u"".join([c for c in unicodedata.normalize('NFKD', x) if not unicodedata.combining(c)])


perfplot.show(
    setup=lambda n: ''.join([choice('Málaga François Phút Hơn 中文') for i in range(n)]),
    kernels=[
        remove_accent_chars_regex,
        remove_accent_chars_join,
        text_unidecode.unidecode,
    ],
    labels=['regex', 'join', 'unidecode'],
    n_range=[2 ** k for k in range(22)],
    equality_check=None, relative_to=0, xlabel='str len'
)

Godard answered 3/2, 2021 at 2:59 Comment(1)

Haha... amazing. All these bits and pieces did actually install. The script did actually run. The graph actually displayed. And it is very similar to yours. unidecode actually handles the Chinese characters. And none of the three comes up with the hilarious "FranASSois". – Uptake 7/2, 2021 at 13:39

Here is a short function which strips the diacritics, but keeps the non-latin characters. Most cases (e.g., "à" -> "a") are handled by unicodedata (standard library), but several (e.g., "æ" -> "ae") rely on the given parallel strings.

Code

from unicodedata import combining, normalize

LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))

NB. The default argument outliers is evaluated once and not meant to be provided by the caller.

Intended usage

As a key to sort a list of strings in a more “natural” order:

sorted(['cote', 'coteau', "crottez", 'crotté', 'côte', 'côté'], key=remove_diacritics)

Output:

['cote', 'côte', 'côté', 'coteau', 'crotté', 'crottez']

If your strings mix texts and numbers, you may be interested in composing remove_diacritics() with the function string_to_pairs() I give elsewhere.

Tests

To make sure the behavior meets your needs, take a look at the pangrams below:

examples = [
    ("hello, world", "hello, world"),
    ("42", "42"),
    ("你好，世界", "你好，世界"),
    (
        "Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.",
        "des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d’exquis rotis de boeuf au kir, a l’ay d’age mur, &caetera.",
    ),
    (
        "Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.",
        "falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",
    ),
    (
        "Љубазни фењерџија чађавог лица хоће да ми покаже штос.",
        "љубазни фењерџија чађавог лица хоће да ми покаже штос.",
    ),
    (
        "Ljubazni fenjerdžija čađavog lica hoće da mi pokaže štos.",
        "ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",
    ),
    (
        "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.",
        "quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",
    ),
    (
        "Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa.",
        "kaemi ny oexi her ykist þjofum nu baedi vil og adrepa.",
    ),
    (
        "Glāžšķūņa rūķīši dzērumā čiepj Baha koncertflīģeļu vākus.",
        "glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",
    )
]

for (given, expected) in examples:
    assert remove_diacritics(given) == expected

Case-preserving variant

LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü  Ä  Æ  Ǽ  Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö  Œ  ẞ  Ŧ Ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))

Greatest answered 9/3, 2022 at 10:46 Comment(2)

This looks like a nice solution, but on Python 3 I get the error message ValueError: string keys in translate table must be of length 1. This limitation is clearly stated in the Python docs: docs.python.org/3/library/stdtypes.html#str.maketrans. Maybe you only tested with Python 2? – Semple 19/6, 2023 at 10:20

You're right, there was a "SS" in the (untested) LATIN string for the case-preserving variant. I have replaced it by "ẞ", namely LATIN CAPITAL LETTER SHARP S. Thanks! – Greatest 19/6, 2023 at 12:53
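
The limitation discussed in these two comments is easy to reproduce. In this small sketch, a two-character key in the dict passed to str.maketrans raises ValueError, while the single character "ẞ" (U+1E9E) is accepted:

```python
error = None
try:
    # A key of length 2 is rejected when the table is built.
    str.maketrans({"SS": "ss"})
except ValueError as exc:
    error = str(exc)
print(error)

# Single-character keys may map to replacement strings of any length.
table = str.maketrans({"\u1e9e": "SS", "\u00df": "ss"})
print("Stra\u00dfe".translate(table))
```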

There are already many answers here, but this was not previously considered: using sklearn

from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode

accented_string = u'Málagueña®'

print(strip_accents_unicode(accented_string)) # output: Malaguena®
print(strip_accents_ascii(accented_string)) # output: Malaguena

This is particularly useful if you are already using sklearn to process text. Those are the functions internally called by classes like CountVectorizer to normalize strings: when using strip_accents='ascii' then strip_accents_ascii is called and when strip_accents='unicode' is used, then strip_accents_unicode is called.

More details

Finally, consider those details from its docstring:

Signature: strip_accents_ascii(s)
Transform accentuated unicode symbols into ascii or nothing

Warning: this solution is only suited for languages that have a direct
transliteration to ASCII symbols.

and

Signature: strip_accents_unicode(s)
Transform accentuated unicode symbols into their simple counterpart

Warning: the python-level loop and join operations make this
implementation 20 times slower than the strip_accents_ascii basic
normalization.

Particularly answered 2/6, 2022 at 12:51 Comment(0)

Some languages have combining diacritics as language letters and accent diacritics to specify accent.

I think it is safer to specify explicitly which diacritics you want to strip:

import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
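
As a usage sketch, the accents parameter can be narrowed so that letters such as the Spanish "ñ" survive. The function is repeated here under the name strip_selected only to keep the example self-contained:

```python
import unicodedata


def strip_selected(string, accents=("COMBINING ACUTE ACCENT",
                                    "COMBINING GRAVE ACCENT",
                                    "COMBINING TILDE")):
    # Same logic as strip_accents above.
    marks = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize("NFD", string) if c not in marks]
    return unicodedata.normalize("NFC", "".join(chars))


word = "M\u00e1lague\u00f1a"
print(strip_selected(word))                                        # default: tilde is stripped too
print(strip_selected(word, accents=("COMBINING ACUTE ACCENT",)))   # keeps the "ñ"
```

Because the result is recomposed with NFC, any mark that is not listed ends up back on its base letter.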

Jugglery answered 24/7, 2015 at 11:34 Comment(0)

If you are hoping to get functionality similar to Elasticsearch's asciifolding filter, you might want to consider fold-to-ascii, which is [itself]...

A Python port of the Apache Lucene ASCII Folding Filter that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into ASCII equivalents, if they exist.

Here's an example from the page mentioned above:

from fold_to_ascii import fold
s = u'Astroturf® paté'
fold(s)
> u'Astroturf pate'
fold(s, u'?')
> u'Astroturf? pate'

EDIT: The fold_to_ascii module seems to work well for normalizing Latin-based alphabets; however unmappable characters are removed, which means that this module will reduce Chinese text, for example, to empty strings. If you want to preserve Chinese, Japanese, and other Unicode alphabets, consider using @mo-han's remove_accent_chars_regex implementation, above.

Stith answered 2/5, 2021 at 10:9 Comment(0)

-3

I came up with this one (especially for Latin letters - linguistic purposes)

import string
from functools import lru_cache

import unicodedata


# This can improve performance by avoiding redundant computations when the function is
# called multiple times with the same arguments.
@lru_cache
def lookup(
    l: str, case_sens: bool = True, replace: str = "", add_to_printable: str = ""
):
    r"""
    Look up information about a character and suggest a replacement.

    Args:
        l (str): The character to look up.
        case_sens (bool, optional): Whether to consider case sensitivity for replacements. Defaults to True.
        replace (str, optional): The default replacement character when not found. Defaults to ''.
        add_to_printable (str, optional): Additional uppercase characters to consider as printable. Defaults to ''.

    Returns:
        dict: A dictionary containing the following information:
            - 'all_data': A sorted list of words representing the character name.
            - 'is_printable_letter': True if the character is a printable letter, False otherwise.
            - 'is_printable': True if the character is printable, False otherwise.
            - 'is_capital': True if the character is a capital letter, False otherwise.
            - 'suggested': The suggested replacement for the character based on the provided criteria.
    Example:
        sen = "Montréal, über, 12.89, Mère, Françoise, noël, 889"
        norm = ''.join([lookup(k, case_sens=True, replace='x', add_to_printable='')['suggested'] for k in sen])
        print(norm)
        #########################
        sen2 = 'kožušček'
        norm2 = ''.join([lookup(k, case_sens=True, replace='x', add_to_printable='')['suggested'] for k in sen2])
        print(norm2)
        #########################

        sen3="Falsches Üben von Xylophonmusik quält jeden größeren Zwerg."
        norm3 = ''.join([lookup(k, case_sens=True, replace='x', add_to_printable='')['suggested'] for k in sen3]) # doesn't preserve ü - ue ...
        print(norm3)
        #########################
        sen4 = "cætera"
        norm4 = ''.join([lookup(k, case_sens=True, replace='x', add_to_printable='ae')['suggested'] for k in
                         sen4])  
        print(norm4)


        # Montreal, uber, 12.89, Mere, Francoise, noel, 889
        # kozuscek
        # Falsches Uben von Xylophonmusik qualt jeden groseren Zwerg.
        # caetera
    """
    # The name of the character l is retrieved using the unicodedata.name()
    # function and split into a list of words and sorted by len (shortest is the wanted letter)
    v = sorted(unicodedata.name(l).split(), key=len)
    sug = replace
    stri_pri = string.printable + add_to_printable.upper()
    is_printable_letter = v[0] in stri_pri
    is_printable = l in stri_pri
    is_capital = "CAPITAL" in v
    # Depending on the values of the boolean variables, the variable sug may be
    # updated to suggest a replacement for the character l. If the character is a printable letter,
    # the suggested replacement is set to the first word in the sorted list of names (v).
    # If case_sens is True and the character is a printable letter but not a capital,
    # the suggested replacement is set to the lowercase version of the first word in v.
    # If the character is printable, the suggested replacement is set to the character l itself.
    if is_printable_letter:
        sug = v[0]

        if case_sens:
            if not is_capital:
                sug = v[0].lower()
    elif is_printable:
        sug = l
    return {
        "all_data": v,
        "is_printable_letter": is_printable_letter,
        "is_printable": is_printable,
        "is_capital": is_capital,
        "suggested": sug,
    }

There is another solution I came up with which is also based on lookup dicts and Numba, but the source code is way too big to post it here. Here is the GitHub link: https://github.com/hansalemaos/charchef

Araiza answered 6/7, 2023 at 8:32 Comment(0)
