Removing everything except letters and spaces from a string in Python 3.3

3

19

I have this example string: happy t00 go 129.129, and I want to keep only the letters and spaces. The most efficient thing I have come up with so far is:

import re
print(re.sub(r"\d", "", 'happy t00 go 129.129'.replace('.', '')))

but it only works for my example string. How can I remove all characters other than letters and spaces?

Pediatrician answered 4/2, 2014 at 22:12 Comment(1)
None of the answers handles letters beyond the 26 ASCII ones, e.g. ß, Ä, Ö, Ü, Ą, Ż, etc. Perhaps the question should mention only ASCII letters?Dowable
30
whitelist = set('abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ')
myStr = "happy t00 go 129.129$%^&*("
answer = ''.join(filter(whitelist.__contains__, myStr))

Output:

>>> answer
'happy t go '
Mccormac answered 4/2, 2014 at 22:15 Comment(4)
After testing, I found this to be 0.0029 usec faster than Joel's answer when each snippet was timed with python -m timeit -n 100 -s in Command Prompt.Pediatrician
@Gronk:
>>> Timer('"".join(filter(whitelist.__contains__, myStr))', '''
... whitelist = set('abcdefghijklmnopqrstuvwxy ABCDEFGHIJKLMNOPQRSTUVWXYZ')
... myStr = 'happy t00 go 129.129' * 10''').timeit(number=1000)
0.02490997314453125
>>> Timer('re.sub(r"[^a-zA-Z ]+", "", myStr)', '''import re
... myStr = 'happy t00 go 129.129' * 10''').timeit(number=1000)
0.011039972305297852
My point is that 0.0029 usec is definitely within the normal variation for a sample size of 100.Jahdai
This also filters out accented letters, which might be a problem.Psychedelic
The lowercase character "z" is missingCarriecarrier
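Following up on the comments about accented letters: a minimal sketch of a Unicode-aware variant, assuming "letter" means whatever str.isalpha() accepts (this also lets through non-Latin letters such as Greek or CJK, which may or may not be what you want). The function name keep_letters_and_spaces is made up for this example.

```python
def keep_letters_and_spaces(text):
    """Keep characters that str.isalpha() treats as letters, plus the plain space."""
    # ch == ' ' keeps only the ordinary space; tabs and newlines are dropped.
    return ''.join(ch for ch in text if ch.isalpha() or ch == ' ')


# Accented letters survive; digits and punctuation are removed.
print(repr(keep_letters_and_spaces('Größe 42')))
print(repr(keep_letters_and_spaces('happy t00 go 129.129')))
```

If only ASCII letters should be kept, the whitelist above is the stricter choice.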
19

Use a set complement (a negated character class):

import re
re.sub(r'[^a-zA-Z ]+', '', 'happy t00 go 129.129')
Jahdai answered 4/2, 2014 at 22:15 Comment(0)
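For reuse, the same negated character class can be wrapped in a small helper. This is only a sketch: the names NON_LETTER_OR_SPACE and strip_non_letters are invented here, and the pattern is compiled once purely as a convenience.

```python
import re

# [^a-zA-Z ]+ matches any run of characters that are neither ASCII letters
# nor the space character, so each run is removed in a single substitution.
NON_LETTER_OR_SPACE = re.compile(r'[^a-zA-Z ]+')


def strip_non_letters(text):
    """Remove everything except ASCII letters and plain spaces."""
    return NON_LETTER_OR_SPACE.sub('', text)


print(repr(strip_non_letters('happy t00 go 129.129')))  # 'happy t go '
```

Note that only the plain space is kept, so tabs and newlines are removed along with digits and punctuation.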
9

Slight variation on inspectorG4dget's method: import the letters from string and use a generator expression:

from string import ascii_letters

allowed = set(ascii_letters + ' ')
myStr = 'happy t00 go 129.129'
answer = ''.join(l for l in myStr if l in allowed)
answer
# >>> 'happy t go '

Performance comparison:

(I made myStr a bit longer and pre-compiled the regex to make things a bit more interesting)

import re
from string import ascii_letters
myStr = 'happy t00 go 129.129'*20
allowed = set(ascii_letters + ' ')

# Generator
%timeit answer = ''.join(l for l in myStr if l in allowed)

# filter/__contains__
%timeit answer = ''.join(filter(allowed.__contains__, myStr))

# Regex
pat = re.compile(r'[^a-zA-Z ]+')
%timeit answer = re.sub(pat, '', myStr)

Generator: 53 µs ± 6.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
filter/__contains__: 43.3 µs ± 7.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Regex: 26 µs ± 509 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
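
%timeit only works inside IPython/Jupyter. Outside of it, roughly the same comparison can be sketched with the standard timeit module. The names my_str, candidates and timings are made up for this example, the lambdas add a little call overhead, and the numbers will differ from machine to machine.

```python
import re
import timeit
from string import ascii_letters

my_str = 'happy t00 go 129.129' * 20
allowed = set(ascii_letters + ' ')
pat = re.compile(r'[^a-zA-Z ]+')

# The three approaches from this thread, wrapped as zero-argument callables.
candidates = {
    'generator': lambda: ''.join(c for c in my_str if c in allowed),
    'filter': lambda: ''.join(filter(allowed.__contains__, my_str)),
    'regex': lambda: pat.sub('', my_str),
}

# Sanity check: all three must produce the same string before timing them.
results = {name: fn() for name, fn in candidates.items()}
assert len(set(results.values())) == 1

runs = 1000
timings = {}
for name, fn in candidates.items():
    # timeit.timeit returns total seconds for `runs` calls.
    timings[name] = timeit.timeit(fn, number=runs) / runs
    print('{:>10}: {:.1f} µs per call'.format(name, timings[name] * 1e6))
```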

Anthia answered 26/3, 2018 at 7:46 Comment(1)
I found this to be the best answer. It is more readable and it shows how we can use the string constants instead of typing them manually which could easily introduce an error.Grocer
