Python: keep only letters in string
S

6

34

What is the best way to remove all characters from a string that are not letters of the alphabet? I mean remove all spaces, punctuation, brackets, digits, mathematical operators, and so on.

For example:

input: 'as32{ vd"s k!+'
output: 'asvdsk'
Sashasashay answered 11/12, 2015 at 0:16 Comment(0)
R
69

You could use re, but you don't really need to.

>>> s = 'as32{ vd"s k!+'
>>> ''.join(x for x in s if x.isalpha())
'asvdsk'
>>> filter(str.isalpha, s) # Python 2.7 only, where filter() on a str returns a str
'asvdsk'
>>> ''.join(filter(str.isalpha, s)) # Python 3, where filter() returns an iterator
'asvdsk'
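
Note that str.isalpha() is Unicode-aware, so it also keeps letters such as é. If only a-z and A-Z should survive, a membership test against string.ascii_letters is one option. The following is a minimal sketch of both behaviours; keep_letters and keep_ascii_letters are hypothetical helper names chosen for illustration, not part of the answer above.

```python
import string

# Set of a-z and A-Z for fast membership tests.
ASCII_LETTERS = frozenset(string.ascii_letters)

def keep_letters(text):
    """Keep every character that Unicode classifies as a letter."""
    return ''.join(ch for ch in text if ch.isalpha())

def keep_ascii_letters(text):
    """Keep only the ASCII letters a-z and A-Z."""
    return ''.join(ch for ch in text if ch in ASCII_LETTERS)

print(keep_letters('as32{ vd"s k!+'))         # asvdsk
print(keep_letters('caf\u00e9 42'))           # café
print(keep_ascii_letters('caf\u00e9 42'))     # caf
```

Which variant is appropriate depends on whether accented and non-Latin letters count as "letters" for your input.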
Roofing answered 11/12, 2015 at 0:19 Comment(0)
M
41

If you want to use a regular expression, this should be quicker:

import re
s = 'as32{ vd"s k!+'
print(re.sub('[^a-zA-Z]+', '', s))

This prints asvdsk.
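
Whether the regular expression is actually quicker depends on the Python version and the input, so it may be worth measuring. A rough sketch using timeit follows; the input length, the run count, and the function names with_regex and with_isalpha are arbitrary choices made for illustration.

```python
import re
import timeit

# Repeat the sample so each run does enough work to time meaningfully.
s = 'as32{ vd"s k!+' * 1000

# Compile the pattern once and reuse it in every run.
non_letters = re.compile('[^a-zA-Z]+')

def with_regex():
    return non_letters.sub('', s)

def with_isalpha():
    return ''.join(ch for ch in s if ch.isalpha())

# On ASCII-only input both approaches give the same result.
assert with_regex() == with_isalpha()

for fn in (with_regex, with_isalpha):
    seconds = timeit.timeit(fn, number=100)
    print(f'{fn.__name__}: {seconds:.4f} s for 100 runs')
```

The two functions only agree on ASCII input: the pattern [^a-zA-Z] drops accented letters, while str.isalpha() keeps them.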

Mcphee answered 11/12, 2015 at 0:26 Comment(0)
R
4

Here is a method that uses ASCII ranges to check whether a character is in the uppercase or lowercase alphabet (and appends it to a string if it is):

s = 'as32{ vd"s k!+'
sfiltered = ''

for char in s:
    if 97 <= ord(char) <= 122 or 65 <= ord(char) <= 90:  # a-z or A-Z
        sfiltered += char

The variable sfiltered then holds the result, which is 'asvdsk' as expected.
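
Building the result with repeated += can get slow on long inputs, because each concatenation may copy the string. A minimal sketch of the same ASCII-range check that collects the characters in a list and joins them once; filter_ascii_letters is a hypothetical name used only for this example.

```python
def filter_ascii_letters(text):
    """Return text with everything except a-z and A-Z removed."""
    kept = []
    for char in text:
        code = ord(char)
        # 65-90 is A-Z and 97-122 is a-z in ASCII.
        if 65 <= code <= 90 or 97 <= code <= 122:
            kept.append(char)
    return ''.join(kept)

print(filter_ascii_letters('as32{ vd"s k!+'))  # asvdsk
```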

Rinarinaldi answered 11/12, 2015 at 0:36 Comment(0)
N
0

This simple expression matches all letters, including non-ASCII letters like áàãéèêçĉ and many more used in various languages.

r"[^\W\d_]+"

It means "match a sequence of one or more characters that are not non-word characters, digits, or underscores". The underscore is excluded explicitly because \w also matches it.
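
As a sketch of how a pattern of this kind could be applied, re.findall returns each run of letters, and joining the runs gives the filtered string. The underscore is excluded explicitly here because \w also matches it. The sample text is an assumption made for illustration, and this relies on Python 3, where str patterns are Unicode-aware by default.

```python
import re

# Word characters minus digits and the underscore, i.e. letters only.
letters = re.compile(r'[^\W\d_]+')

# "ação" is written with escapes so the source stays plain ASCII.
text = 'as32{ vd"s k!+ a\u00e7\u00e3o_9'

runs = letters.findall(text)
print(runs)            # ['as', 'vd', 's', 'k', 'ação']
print(''.join(runs))   # asvdskação
```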

Nonsuch answered 28/4, 2022 at 3:16 Comment(0)
A
0

If you'd like to preserve characters like áàãéèêçĉ that are used in many languages around the world, try this:

import re
print(re.sub(r'[\W\d_]+', '', yourString))
Agripinaagrippa answered 2/6, 2022 at 14:36 Comment(1)
You're missing the argument to sub into the string: re.sub('[\W\d_]+', '', yourString) - Integumentary
P
0

As an alternative approach, the Alphabetic package can be used, which provides a function for this purpose.

First install the package via pip install alphabetic, then proceed as follows:

from alphabetic import WritingSystem

input_str = 'as32{ vd"s k!+'

ws = WritingSystem()
ws.strip_non_script_characters(input_str, 
                               ws.Language.English, 
                               process_token_wise=False)

In this way you will get the desired output:

'asvdsk'

Note: If ws.Language.English is not passed as an argument, all characters of all supported languages (>150) are taken into account instead.

Pathe answered 15/6 at 15:36 Comment(0)