Regex for accent insensitive replacement in python

Asked 26/4, 2017 at 12:39 Answered 23/9, 2021 at 18:40

Solved python regex unicode non-ascii-characters accent-insensitive

In Python 3, I'd like to be able to use re.sub() in an "accent-insensitive" way, as we can do with the re.I flag for case-insensitive substitution.

Could be something like a re.IGNOREACCENTS flag:

original_text = "¿It's 80°C, I'm drinking a café in a cafe with Chloë。"
accent_regex = r'a café'
re.sub(accent_regex, 'X', original_text, flags=re.IGNOREACCENTS)

This would lead to "¿It's 80°C, I'm drinking X in X with Chloë。" (note that there's still an accent on "Chloë") instead of "¿It's 80°C, I'm drinking X in a cafe with Chloë。" in real python.

I think that such a flag doesn't exist. So what would be the best option to do this? Using re.finditer and unidecode on both original_text and accent_regex and then replace by splitting the string? Or modifying all characters in the accent_regex by their accented variants, for instance: r'[cç][aàâ]f[éèêë]'?

Linin answered 26/4, 2017 at 12:39 Comment(2)

Could be something like... @WiktorStribiżew – Gabor 26/4, 2017 at 12:45

What you are looking for is an equivalence class, though I don't know of any Python regex module that supports them. The syntax is usually something like [[=a=]] – Lacombe 26/4, 2017 at 12:51

unidecode is often mentioned for removing accents in Python, but it also does more than that: it converts '°' to 'deg', which might not be the desired output.

unicodedata seems to have enough functionality to remove accents.

With any pattern

This method should work with any pattern and any text.

You can temporarily remove the accents from both the text and regex pattern. The match information from re.finditer() (start and end indices) can be used to modify the original, accented text.

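Note that the matches must be processed in reverse order, so that each replacement does not shift the indices of the matches that are still to be handled. The approach also relies on remove_accents() returning a string of the same length as its input, so that indices in the stripped text line up with the original text.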

import re
import unicodedata

original_text = "I'm drinking a 80° café in a cafe with Chloë, François Déporte and Francois Deporte."

accented_pattern = r'a café|François Déporte'

def remove_accents(s):
    return ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))

print(remove_accents('äöüßéèiìììíàáç'))
# aoußeeiiiiiaac

pattern = re.compile(remove_accents(accented_pattern))

modified_text = original_text
matches = list(re.finditer(pattern, remove_accents(original_text)))

for match in matches[::-1]:
    modified_text = modified_text[:match.start()] + 'X' + modified_text[match.end():]

print(modified_text)
# I'm drinking a 80° café in X with Chloë, X and X.
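The loop above assumes that stripping accents never changes the string length. That holds for precomposed Latin letters, but not for text that is already decomposed or for characters such as Hangul syllables, which NFD splits into several code points. A minimal sketch of one way around this is to record, for every character of the stripped text, the index of the original character it came from. The helper names strip_accents_with_map and accent_insensitive_sub are hypothetical, not part of any library:

```python
import re
import unicodedata

def strip_accents_with_map(text):
    """Return (stripped, index_map), where index_map[i] is the index in
    `text` of the character that produced stripped[i]."""
    stripped, index_map = [], []
    for i, ch in enumerate(text):
        for d in unicodedata.normalize('NFD', ch):
            if unicodedata.category(d) != 'Mn':  # drop combining marks
                stripped.append(d)
                index_map.append(i)
    return ''.join(stripped), index_map

def accent_insensitive_sub(accented_pattern, repl, text):
    """Hypothetical helper: replace accent-insensitive matches in `text`."""
    stripped_pattern, _ = strip_accents_with_map(accented_pattern)
    stripped_text, index_map = strip_accents_with_map(text)
    result = text
    # Work backwards so earlier indices stay valid after each replacement.
    for m in reversed(list(re.finditer(stripped_pattern, stripped_text))):
        if m.start() == m.end():
            continue  # ignore empty matches
        start = index_map[m.start()]
        end = index_map[m.end() - 1] + 1
        # Swallow combining marks that follow the last matched character.
        while end < len(text) and unicodedata.category(text[end]) == 'Mn':
            end += 1
        result = result[:start] + repl + result[end:]
    return result

original_text = "I'm drinking a 80° café in a cafe with Chloë, François Déporte and Francois Deporte."
print(accent_insensitive_sub(r'a café|François Déporte', 'X', original_text))
# I'm drinking a 80° café in X with Chloë, X and X.
```

The replacement string is inserted literally here; group references such as \1 are not expanded in this sketch.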

If pattern is a word or a set of words

You could:

- remove the accents from your pattern words and save them in a set for fast lookup
- look for every word in your text with \w+
- remove the accents from each word:
  - if it matches, replace it by X
  - if it doesn't match, leave the word untouched

import re
from unidecode import unidecode

original_text = "I'm drinking a café in a cafe with Chloë."

def remove_accents(string):
    return unidecode(string)

accented_words = ['café', 'français']

words_to_remove = set(remove_accents(word) for word in accented_words)

def remove_words(matchobj):
    word = matchobj.group(0)
    if remove_accents(word) in words_to_remove:
        return 'X'
    else:
        return word

print(re.sub(r'\w+', remove_words, original_text))
# I'm drinking a X in a X with Chloë.

Minutes answered 26/4, 2017 at 13:19 Comment(11)

Thanks, this method is smart! How to modify it to replace not only words but also n-gram? (I edited my question to take this option into account, for instance to replace "François Déporte" in a text where only "Francois Deporte" appears) – Wader 26/4, 2017 at 14:27

@AntoineDusséaux: No problem, the first method works fine. – Minutes 26/4, 2017 at 14:31

I initially thought it would work but after a few tests this method failed if the length of the unidecoded string is not the same as the original one. For instance unidecode('°') is 'deg', so if original_text = "I'm drinking a 18C° hot café in a cafe with Chloë, François Déporte and Francois Deporte." you get 'I'm drinking a 18C° hot café in a Xith Chloë, FrXnd FrX'. What would be another way to unidecode and keep the length constant? – Wader 28/4, 2017 at 10:4

@AntoineDusséaux: Not cool. :-/ You could replace ° by @ or # in the temporary string, before applying unidecode. I don't know of any other method which could remove accents while keeping the string length constant. – Minutes 28/4, 2017 at 10:8

@AntoineDusséaux: I found an alternative with unicodedata. No need for external package and as far as I know, it only removes the accents. – Minutes 28/4, 2017 at 12:7

Smart! I tried everything to break your solution and it seems it doesn't work with 'string_to_change = 'äöüßéèiìììíàáç°。阿bcqf反題梓z≤«»Ωﬁñ한か¿？！'' ;) so I changed it a little bit: return ''.join((unidecode.unidecode(c) if len(unidecode.unidecode(c)) == len(c) else c for c in s)) and it seems to work fine! – Wader 28/4, 2017 at 13:29

@AntoineDusséaux: Unicode can be so complex that it's pretty easy to break any code with it. – Minutes 28/4, 2017 at 13:33

Sure, for instance I realized some Unicode characters would be replaced with regex special characters, for instance Chinese dot 。 becomes . and then matches everything! – Wader 28/4, 2017 at 13:38

To avoid this we could escape all characters: return ''.join(re.escape(unidecode.unidecode(c)) if len(unidecode.unidecode(c)) == len(c) else re.escape(c) for c in s) (tested with

string_to_change = 'äöüßéèiìììíàáç°。阿bcqf反題梓z≤«»Ωﬁñ한か¿⸼？！..·⁂‖¦•+⸮…'

) – Wader 28/4, 2017 at 13:40

Let us continue this discussion in chat. – Wader 28/4, 2017 at 13:46

You may want to use 'NFKD' instead of 'NFD' to handle character equivalences, e.g. U+2160 (ROMAN NUMERAL ONE) is in NFKD the same as U+0049 (LATIN CAPITAL LETTER I). – Actinomycin 8/1, 2019 at 22:37
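The difference between the two normalization forms mentioned in the last comment can be checked with unicodedata alone. This is only a small illustration; the variable names are made up for the example:

```python
import unicodedata

roman_one = '\u2160'   # ROMAN NUMERAL ONE
ligature_fi = '\ufb01'  # LATIN SMALL LIGATURE FI

# NFD only applies canonical decompositions, so both characters are unchanged.
nfd_roman = unicodedata.normalize('NFD', roman_one)
nfd_fi = unicodedata.normalize('NFD', ligature_fi)

# NFKD also applies compatibility decompositions.
nfkd_roman = unicodedata.normalize('NFKD', roman_one)
nfkd_fi = unicodedata.normalize('NFKD', ligature_fi)

print(nfd_roman == roman_one)  # True
print(nfkd_roman)              # I
print(nfkd_fi, len(nfkd_fi))   # fi 2
```

Note that NFKD can change the length of the string (the ligature becomes two letters), so start and end indices found in the normalized text no longer line up with the original text.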

You can use Unidecode:

$ pip install unidecode

In your program:

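import re
from unidecode import unidecode

original_text = "I'm drinking a café in a cafe."
# unidecode() strips the accents from the whole text, not only from the matches
unidecoded_text = unidecode(original_text)
regex = r'cafe'
print(re.sub(regex, 'X', unidecoded_text))
# I'm drinking a X in a X.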

Flapper answered 26/4, 2017 at 12:47 Comment(2)

Thanks, but this won't help as I'd like to keep the other accents from the original string. – Wader 26/4, 2017 at 13:4

@AntoineDusséaux Right, I didn't think about that. The other answer seems to be correct. – Flapper 26/4, 2017 at 13:56

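Instead of removing accents, I needed to preserve the accents in the text, so I expanded each letter of the pattern into its accented variants:

import re

# Each unaccented uppercase letter maps to an alternation of its variants
accents_dic = {
    'A': '(A|Á|À|Â|Ã)',
    'E': '(E|É|È)',
    'I': '(I|Í|Ï)',
    'O': '(O|Ó|Ô|Õ|Ö)',
    'U': '(U|Ú|Ü)',
    'C': '(C|Ç)'
}

def define_regex_name(name):
    # The keys are uppercase, so `name` is expected to be uppercase and unaccented
    for i, j in accents_dic.items():
        name = re.sub(i, j, name)
    # IGNORECASE lets the compiled pattern match lowercase text as well
    return re.compile(name, re.IGNORECASE)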
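As a rough sketch of the same idea for lowercase or already accented search words, the word can be stripped of its accents first and every base letter then expanded into a character class. The ACCENT_CLASSES table and the accent_tolerant_regex name are hypothetical, the table only covers a few letters, and the input is assumed to be a literal word or phrase rather than a regex:

```python
import re
import unicodedata

# Hypothetical, deliberately incomplete table of accented variants.
ACCENT_CLASSES = {
    'a': '[aáàâãä]',
    'e': '[eéèêë]',
    'i': '[iíìîï]',
    'o': '[oóòôõö]',
    'u': '[uúùûü]',
    'c': '[cç]',
}

def accent_tolerant_regex(word):
    """Compile a pattern matching `word` with or without accents."""
    # Reduce the word to lowercase base letters first.
    base = ''.join(
        c for c in unicodedata.normalize('NFD', word.lower())
        if unicodedata.category(c) != 'Mn'
    )
    # Expand known letters into classes; escape everything else literally.
    parts = [ACCENT_CLASSES.get(ch, re.escape(ch)) for ch in base]
    return re.compile(''.join(parts), re.IGNORECASE)

text = "I'm drinking a café in a cafe with Chloë."
print(accent_tolerant_regex('café').sub('X', text))
# I'm drinking a X in a X with Chloë.
```

Because the substitution runs on the original text, accents outside the matches (such as in "Chloë") are left untouched.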

Pediatrician answered 23/9, 2021 at 18:40 Comment(0)
