Removing everything except letters and spaces from string in Python3.3

Asked 4/2, 2014 at 22:12 Answered 26/3, 2018 at 7:46

Solved python regex python-3.3 translate

I have this example string: happy t00 go 129.129 and I want to keep only the spaces and letters. All I have been able to come up with so far that is pretty efficient is:

print(re.sub(r"\d", "", 'happy t00 go 129.129'.replace('.', '')))

but it only works for my example string. How can I remove all characters other than letters and spaces?

Pediatrician answered 4/2, 2014 at 22:12 Comment(1)

None of the answers handles letters beyond the 26 ASCII ones, e.g. ß, Ä, Ö, Ü, Ą, Ż, etc. Perhaps the question should mention only ASCII letters? – Dowable 21/9, 2020 at 15:52

whitelist = set('abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ')
myStr = "happy t00 go 129.129$%^&*("
answer = ''.join(filter(whitelist.__contains__, myStr))

Output:

>>> answer
'happy t go '

Mccormac answered 4/2, 2014 at 22:15 Comment(4)

After testing, I found this to be 0.0029 usec faster than Joel's answer when each snippet was run with python -m timeit -n 100 -s in Command Prompt. – Pediatrician 4/2, 2014 at 23:39

@Gronk:

>>> Timer('"".join(filter(whitelist.__contains__, myStr))', '''
... whitelist = set('abcdefghijklmnopqrstuvwxy ABCDEFGHIJKLMNOPQRSTUVWXYZ')
... myStr = 'happy t00 go 129.129' * 10''').timeit(number=1000)
0.02490997314453125
>>> Timer('re.sub(r"[^a-zA-Z ]+", "", myStr)', '''import re
... myStr = 'happy t00 go 129.129' * 10''').timeit(number=1000)
0.011039972305297852

My point is that 0.0029 usec is definitely within the normal variation for a sample size of 100. – Jahdai 5/2, 2014 at 1:9

This also filters out accented letters, which might be a problem. – Psychedelic 29/9, 2017 at 4:10

The lowercase character "z" is missing – Carriecarrier 19/5, 2018 at 16:32
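The comments above point out that a hand-typed whitelist drops accented letters. A minimal sketch of a Unicode-aware variant follows, assuming that "letter" means whatever str.isalpha() accepts (which is broader than the 26 ASCII letters). The function name keep_letters_and_spaces is made up for illustration and is not from any answer here.

```python
def keep_letters_and_spaces(text):
    """Keep characters that Unicode classifies as letters, plus the space character."""
    # str.isalpha() is True for accented and non-Latin letters such as 'ß' or 'Ą',
    # so this keeps more than an ASCII-only whitelist would.
    return ''.join(ch for ch in text if ch.isalpha() or ch == ' ')


ascii_example = keep_letters_and_spaces('happy t00 go 129.129')
unicode_example = keep_letters_and_spaces('Straße t00 Żółć 129.129')
print(repr(ascii_example))
print(repr(unicode_example))
```

If only ASCII letters should survive, the whitelist or regex approaches in the answers are the safer choice.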

Use a negated character class (the complement of the allowed set):

re.sub(r'[^a-zA-Z ]+', '', 'happy t00 go 129.129')
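For completeness, here is a runnable sketch of the same substitution with the import included and the pattern pre-compiled. The names NON_LETTER_OR_SPACE and strip_non_letters are illustrative only.

```python
import re

# Negated character class: match runs of anything that is not an ASCII letter or a space.
NON_LETTER_OR_SPACE = re.compile(r'[^a-zA-Z ]+')


def strip_non_letters(text):
    """Delete every run of characters other than ASCII letters and spaces."""
    return NON_LETTER_OR_SPACE.sub('', text)


result = strip_non_letters('happy t00 go 129.129')
print(repr(result))  # 'happy t go '
```

Note that the spaces around removed tokens are kept, so the result ends with a trailing space.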

Jahdai answered 4/2, 2014 at 22:15 Comment(0)

Slight variation on inspectorG4dget's method - import from string and use a generator expression:

from string import ascii_letters

allowed = set(ascii_letters + ' ')
myStr = 'happy t00 go 129.129'
answer = ''.join(l for l in myStr if l in allowed)
answer
# >>> 'happy t go '

Performance comparison:

(I made myStr a bit longer and pre-compiled the regex to make things a bit more interesting)

import re
from string import ascii_letters
myStr = 'happy t00 go 129.129'*20
allowed = set(ascii_letters + ' ')

# Generator
%timeit answer = ''.join(l for l in myStr if l in allowed)

# filter/__contains__
%timeit answer = ''.join(filter(allowed.__contains__, myStr))

# Regex
pat = re.compile(r'[^a-zA-Z ]+')
%timeit answer = re.sub(pat, '', myStr)

# Generator
53 µs ± 6.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# filter/__contains__
43.3 µs ± 7.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Regex
26 µs ± 509 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
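The %timeit lines above only work inside IPython or Jupyter. A rough equivalent using the standard-library timeit module might look like the sketch below. The dictionary of candidates and the repeat counts are my own choices, and the numbers it prints will differ from machine to machine.

```python
import re
import timeit
from string import ascii_letters

my_str = 'happy t00 go 129.129' * 20
allowed = set(ascii_letters + ' ')
pat = re.compile(r'[^a-zA-Z ]+')

# The three approaches from the answers, wrapped as zero-argument callables.
candidates = {
    'generator': lambda: ''.join(ch for ch in my_str if ch in allowed),
    'filter': lambda: ''.join(filter(allowed.__contains__, my_str)),
    'regex': lambda: pat.sub('', my_str),
}

# All three approaches should agree before their timings are compared.
results = {name: func() for name, func in candidates.items()}
assert len(set(results.values())) == 1

for name, func in candidates.items():
    # Best of 5 repeats of 1,000 calls, reported per call in microseconds.
    best = min(timeit.repeat(func, repeat=5, number=1000))
    print(f'{name:>9}: {best / 1000 * 1e6:.1f} µs per call')
```

Taking the minimum over several repeats is a common way to reduce noise from other processes. It does not make small differences between runs meaningful.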

Anthia answered 26/3, 2018 at 7:46 Comment(1)

I found this to be the best answer. It is more readable, and it shows how we can use the string constants instead of typing them manually, which could easily introduce an error. – Grocer 23/2, 2019 at 3:36
