Keep only alphabetic characters (multilingual) in a string
On Stack Overflow there are a lot of answers about how to keep only alphabetic characters from a string; the most commonly accepted one is the famous regex '[^a-zA-Z]'. But this answer is wrong for multilingual text because it assumes everybody writes only in English... I thought about downvoting all these answers, but I finally decided it would be more constructive to ask the question again, because I can't find the answer.

Is there an easy (or not...) way in Python to keep only alphabetic characters from a string that works for all languages? Maybe there is a library that does what XRegExp does in JavaScript... By all languages I mean English but also French, Russian, Chinese, Greek, etc.

Claritaclarity answered 27/6, 2017 at 11:37 Comment(2)
I think it may be easier to include the whole of Unicode and exclude non-alphabetic characters. - Pericynthion
@MoonCheesez I was thinking of exactly the same approach. There's an easy way of doing this in shell scripting, but I can't think of a Pythonic way right now. - Ehrenburg
M
11

[^\W\d_]

In Python 3, or in Python 2 with the re.UNICODE flag, you can use [^\W\d_].

\W : If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.

So [^\W\d_] matches anything that is not a non-word character, not a digit, and not an underscore. In other words, it's any alphabetic character. :)

>>> import re
>>> re.findall(r"[^\W\d_]", "jüste Ä tösté 1234 ßÜ א д", re.UNICODE)
['j', 'ü', 's', 't', 'e', 'Ä', 't', 'ö', 's', 't', 'é', 'ß', 'Ü', 'א', 'д']
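
Since the question asks for a string rather than a list of characters, the matches can simply be joined back together. A minimal sketch for Python 3, where str patterns are Unicode-aware by default; keep_letters is a hypothetical helper name, not part of re:

```python
import re

# One character that is a word character but neither a digit nor "_".
LETTER = re.compile(r"[^\W\d_]")

def keep_letters(text):
    """Return text with every character not matched by LETTER removed."""
    return "".join(LETTER.findall(text))

print(keep_letters("abc_123 дом 中文!"))  # abcдом中文
```

Whitespace is dropped as well, since a space is not a word character.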

Remove digits first, then look for "\w"

To avoid this convoluted logic, you could also remove digits and underscores first, and then look for alphanumeric characters:

>>> without_digit = re.sub(r"[\d_]", "", "jüste Ä tösté 1234 ßÜ א д", flags=re.UNICODE)
>>> re.findall(r"\w", without_digit, re.UNICODE)
['j', 'ü', 's', 't', 'e', 'Ä', 't', 'ö', 's', 't', 'é', 'ß', 'Ü', 'א', 'д']
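
One thing to watch with the two-step version: the fourth positional parameter of re.sub is count, not flags, so flags are best passed by keyword. re.UNICODE has the integer value 32, so passing it positionally would be read as "replace at most 32 matches". A small sketch wrapping the two steps; letters_two_step is a hypothetical name:

```python
import re

def letters_two_step(text):
    """Drop digits and underscores, then collect the remaining word characters."""
    # flags= is a keyword argument here on purpose: the fourth positional
    # parameter of re.sub is count.
    without_digits = re.sub(r"[\d_]", "", text, flags=re.UNICODE)
    return re.findall(r"\w", without_digits, flags=re.UNICODE)

print(letters_two_step("a1_б2 中3"))  # ['a', 'б', '中']
```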

regex module

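The regex module can also help, since it understands \p{L} (any Unicode letter) and set differences such as [\w--\d_] (set operations need its VERSION1 behaviour, for example a leading (?V1)).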

This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.

>>> import regex as re
>>> re.findall(r"\p{L}", "jüste Ä tösté 1234 ßÜ א д", re.UNICODE)
['j', 'ü', 's', 't', 'e', 'Ä', 't', 'ö', 's', 't', 'é', 'ß', 'Ü', 'א', 'д']

(Tested with Anaconda Python 3.6)
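
If installing regex is not an option, the standard library gets close to \p{L}: str.isalpha() is documented as true for characters whose Unicode general category is a letter (Lu, Ll, Lt, Lm, Lo). The sketch below is an alternative, not what the answer above uses, and keep_alpha is a hypothetical name. It also shows that [^\W\d_] is slightly looser, because \w accepts numeric characters such as '²' that \d does not remove:

```python
import re

def keep_alpha(text):
    """Keep only characters whose Unicode general category is a letter."""
    return "".join(ch for ch in text if ch.isalpha())

print(keep_alpha("x² + ½y = д"))        # xyд
print(re.findall(r"[^\W\d_]", "x²½"))   # ['x', '²', '½']
```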

Mountain answered 27/6, 2017 at 11:50 Comment(9)
I tried the regex module and \p{L}; it looks like it keeps only Latin letters without accents... Maybe I missed something, but it should work according to one of the documentation examples: [\p{L}--QW] # Set containing all letters except 'Q' and 'W' - Claritaclarity
It looks like your first and second examples work perfectly :) - Claritaclarity
I use version 2.7.9. I tried version 3, but I had a problem importing regex after installing it with pip. Since I'm not a Python expert and didn't want to spend too much time on the import, I used 2.7.9 for the tests. - Claritaclarity
I marked this as the answer but have some doubts about the regex module solution, since in my tests it kept only unaccented letters (tested with Python 2.7.9). - Claritaclarity
I'm fairly new to Python, so it takes me time to test all this. I tried this code and the output is only Latin letters: # -*- coding: UTF-8 -*- re.UNICODE print (re.findall("[\p{L}]","jüste Ä tösté 1234 ßÜ א д")) - Claritaclarity
Looks like I'm close; this is my output: ['j', '\xc3', 's', 't', 'e', '\xc3', 't', '\xc3', 's', 't', '\xc3', '\xc3', '\xc3', '\xd0']. It's not a console problem since I can't print any language in it. Now I just have to find a way to replace these "\x" values. I'm just realizing that encoding is not the easiest thing to deal with in Python :) - Claritaclarity
OK, many thanks for your help @Eric, it was very helpful ;) I don't want to bother you any more since you answered my question correctly. I think I can get to the end by myself from here, since my current problem is not related to my question. It's more an encoding problem than a matter of getting rid of characters in a string now... Thank you for your patience. - Claritaclarity
@Laurent: My pleasure, I also learn a lot while answering Python and Ruby questions. - Mountain
It is important to note that \w in re and regex covers a different repertoire of characters and will give different results for many languages. - Sokol
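
The '\xc3' entries quoted in the comments above look like the lead bytes of UTF-8 sequences, which would fit the pattern having been applied to encoded bytes rather than decoded text (an assumption, since the full script isn't shown). Python 3 behaves differently from 2.7 in the details, but the general point can be sketched there: decode to str before matching.

```python
import re

raw = "дом abc".encode("utf-8")  # bytes, e.g. as read from a file in binary mode

# With a bytes pattern, \w and \d only know ASCII, so non-ASCII letters are lost.
ascii_only = re.findall(rb"[^\W\d_]", raw)
print(ascii_only)  # [b'a', b'b', b'c']

# Decoding first yields a str, and the pattern then sees whole characters.
text = raw.decode("utf-8")
letters = re.findall(r"[^\W\d_]", text)
print(letters)  # ['д', 'о', 'м', 'a', 'b', 'c']
```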
C
0

There is in fact a simple way to accomplish this task. It is limited to ~150 languages, but it covers the most common ones. The Alphabetic library offers a function for exactly this problem.

First, install it via pip install alphabetic (or, for the latest commit, pip install git+https://github.com/Halvani/alphabetic.git).

Once installed, use the following:

from alphabetic import WritingSystem

ws = WritingSystem()
ws.keep_only_script_characters("#jüste BAD/good tösté X4567Y ßÜ משהו действует?!")

This will return the cleaned string:

'jüste BADgood tösté XY ßÜ משהו действует'
Cauley answered 12/6 at 12:45 Comment(11)
An interesting package with an odd mix of language and script codes. I especially like Amharic, which should return True for an abugida and also return True for a syllabary. There are other interesting odds and ends in it as well. I find both Wolof and Hausa interesting, but that would also apply to many West African languages that have more than one writing system. - Sokol
The Amharic script is an abugida, according to en.wikipedia.org/wiki/Amharic and omniglot.com/writing/amharic.htm. Do you have a source that classifies this script as a syllabary? Regarding languages that involve multiple script types, please take a look at the design considerations for this package: github.com/Halvani/… - Cauley
It is an abugida, but it is written in the Ethiopic script. The package will return True for Amharic as an abugida, but the text is also Ethiopic script, so according to the package's documentation Amharic text should return True for both abugida and syllabary. - Sokol
The Amharic script is a modification of the Ge'ez script, which is why it is considered an abugida (see here: en.wikipedia.org/wiki/Amharic#Writing_system). Please provide a source that clearly classifies it as a syllabary. Please also look at en.wikipedia.org/wiki/List_of_writing_systems#Syllabaries: Amharic is not listed there as a syllabary. - Cauley
Just following the dominant writing system isn't always the optimal approach. It depends on what you are using the data for. If I think of my X/Twitter feed, those languages are more often seen in a non-dominant script, especially outside the francophone sphere of the internet. And I am uncertain about the logic of handling some things in terms of script and others in terms of language. - Sokol
I am not saying it's a syllabary; I am saying the package documentation lists Ethiopic script as a syllabary. The Amharic script is a subset of the Ethiopic Unicode script. - Sokol
Technically Ge'ez started as an abjad and evolved into an abugida. - Sokol
Umm, I tried installing the package after reading the documentation, but it keeps giving me FileNotFoundError when I try to use it. - Sokol
Oh, you're right regarding the documentation! I've just corrected the package. Now ws.is_abugida("ቅልአሐዱምስ") correctly returns True, whereas ws.is_syllabary("ቅልአሐዱምስ") returns False. Thank you for mentioning this! - Cauley
@Andj, can you please open a detailed issue on that with a full stack trace? I've just run the tests and all of them passed. github.com/Halvani/alphabetic/issues - Cauley
Let us continue this discussion in chat. - Cauley
