How can I remove non-ASCII characters but leave periods and spaces?

Asked 31/12, 2011 at 18:23 Answered 30/1, 2023 at 22:44

134

I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:

def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char

def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data

How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.

Lettuce answered 31/12, 2011 at 18:23 Comment(2)

Thanks (sincerely) for the clarification John. I understood that spaces and periods are ASCII characters. However, I was removing both of them unintentionally while trying to remove only non-ASCII characters. I see how my question might have implied otherwise. – Lettuce 31/12, 2011 at 21:38

@PoliticalEconomist: Your problem is still very under-specified. See my answer. – Krahling 31/12, 2011 at 22:5

230

You can filter all characters from the string that are not printable using string.printable, like this:

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

string.printable on my machine contains:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter will return an iterable. The correct way to obtain a string back would be:

''.join(filter(lambda x: x in printable, s))
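For Python 3, the same idea can be wrapped in a small helper. This is only a sketch: the name keep_printable is made up for illustration, and it assumes that "printable" means membership in string.printable.

```python
import string

# Build the set once so each membership test is a fast hash lookup.
PRINTABLE = set(string.printable)

def keep_printable(text):
    """Return text with every character outside string.printable removed.

    keep_printable is a hypothetical helper name, not a standard library function.
    """
    return ''.join(ch for ch in text if ch in PRINTABLE)

print(keep_printable("some\x00string. with\x15 funny characters"))
```

Spaces and periods are both in string.printable, so they survive the filter.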

Popovich answered 31/12, 2011 at 18:29 Comment(23)

chr(127) in string.printable ? – Sociometry 31/12, 2011 at 18:39

what's up with those printable chars that are below ordinal 48 ? – Sociometry 31/12, 2011 at 18:46

chr(127) in string.printable == False – Popovich 31/12, 2011 at 18:48

Do you mean 0b and 0c? They are part of string.whitespace. – Popovich 31/12, 2011 at 18:49

Yes, and from the OP: if ord(char) < 48 or ord(char) > 127. About my second comment, I am referring to '*', '(', and other printable characters which are eliminated by the OP... – Sociometry 31/12, 2011 at 18:54

Yeah, I was extrapolating that the OP probably meant all printable characters, rather than what was actually said, but might not be the case. – Popovich 31/12, 2011 at 18:57

Thanks! I understand now. Sorry for the confusion - jterrace correctly interpreted my question. – Lettuce 31/12, 2011 at 20:56

this is also great for just filtering to digits - filter(lambda x: x in string.digits, s) – Danyel 8/10, 2013 at 15:35

This is incredibly slow in a large file. Any suggestions? – Morrison 12/1, 2014 at 22:34

@Morrison create a set(string.printable) and re-use it for the filtering. Also don't filter the whole file at once - do it in chunks of 8K-512K – Popovich 13/1, 2014 at 0:5

The only problem with using filter is that it returns an iterable. If you need a string back (as I did because I needed this when doing list comprehension) then do this: ''.join(filter(lambda x: x in string.printable, s)). – Njord 5/9, 2014 at 19:23

@Njord - comment is python 3 specific, but very useful. Thanks! – Entertainer 13/1, 2015 at 15:13

Why not use regular expression: re.sub(r'[^\x00-\x7f]',r'', your-non-ascii-string) . See this thread https://mcmap.net/q/98527/-replace-non-ascii-characters-with-a-single-space – Melo 18/1, 2016 at 16:8

This is the most compatible way of doing the OP's task, I tested in from Python 2.6 to Python 3.5. – Bridgeboard 30/1, 2016 at 16:31

@NoamManos this was 4-5 times faster for me than the join...filter...lambda solution, thanks. – Bigwig 22/2, 2016 at 11:59

I suspect changing lambda x: x in printable to printable.__contains__ would make it run faster; the lambda means more Python level code execution, while directly passing the built-in membership test method removes per character byte code execution. – Cumulative 4/4, 2016 at 22:52

PyLint Complains on the use of filter when using the above code. Given that list comprehensions seem to be preferred would using ''.join(x for x in s if x in printable) be a) equivalent, and b) any better? – Tisdale 17/6, 2016 at 14:37

Edit: I realise the above is a generator expression, but does the same apply? – Tisdale 18/6, 2016 at 14:44

@Tisdale - it's most likely equivalent, but I'd have to profile it to know for sure – Popovich 19/6, 2016 at 17:2

@Jonny, The result is the same, time differs (you need to compare if it happens to be a bottleneck). This is easier for an eye - the less the diversity of tools, the faster is reading comprehension. You may want to add an [Enter] before if and indent the second line so if starts just after ( from the first line. – Sheliasheline 6/6, 2018 at 11:34

Am I the only one for whom this doesn't work? Why wouldn't those characters be included in the printable list, like 0 or x for example? – Caliph 27/1, 2020 at 21:36

@CharlesSmith - those are escape sequences – Popovich 28/1, 2020 at 20:32

When assigning the value to a variable it works fine, whereas when reading from a file the filtering has no effect. Don't know why. Any ideas? – Melanymelaphyre 4/6, 2020 at 22:5

119

An easy way to change to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:

    >>> s = u'Good bye in Swedish is Hej d\xe5'
    >>> s = s.encode('ascii',errors='ignore')
    >>> print s
    Good bye in Swedish is Hej d

Edit:

Python3: str -> bytes -> str

>>> "hej då".encode("ascii", errors="ignore").decode()
'hej d'

Python2: unicode -> str -> unicode

>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'

Python2: str -> unicode -> str (decode and encode in reverse order)

>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
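If the text comes from a file, as in the question, the same encode/decode round trip can be applied after reading. Below is a sketch for Python 3. It assumes the file is UTF-8 encoded, which the question does not state, and read_ascii_only is a hypothetical helper name.

```python
def read_ascii_only(file_path, encoding="utf-8"):
    """Read a text file and drop every non-ASCII character.

    Assumption: the file is encoded as `encoding` (UTF-8 by default).
    """
    with open(file_path, "r", encoding=encoding) as f:
        data = f.read()
    # str -> bytes (non-ASCII characters are dropped) -> str
    return data.encode("ascii", errors="ignore").decode("ascii")
```

Decoding the file with its real encoding first avoids the UnicodeDecodeError that appears when raw bytes are treated as ASCII.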

Tennyson answered 25/8, 2013 at 15:50 Comment(4)

I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 27 – Morrison 12/1, 2014 at 22:33

I got that error when I put the actual unicode character in the string via copy paste. When you specify a string as u'thestring' encode works correctly. – Sot 30/4, 2015 at 21:5

Works only on Py3, but it's elegant. – Bridgeboard 30/1, 2016 at 16:32

For those who are getting the same error as @Morrison : you should first .decode() the string, and only after that encode. For example s.decode('utf-8').encode('ascii', errors='ignore') – Yann 21/3, 2017 at 17:40

According to @artfulrobot, this should be faster than filter and lambda:

import re
re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)

See more examples in "Replace non-ASCII characters with a single space".
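As a sketch of how this pattern behaves: [^\x00-\x7f] matches any character outside the ASCII range, so ASCII control characters such as \x00 are kept. The second pattern below is an assumed, stricter definition of "printable" (space through tilde, plus tab and newline); it is not something this answer specifies.

```python
import re

# Matches anything outside the ASCII range 0x00..0x7f.
NON_ASCII = re.compile(r'[^\x00-\x7f]')
# Assumed definition of "printable": space (0x20) to tilde (0x7e), plus tab and newline.
NON_PRINTABLE = re.compile(r'[^\x20-\x7e\t\n]')

sample = "Hej d\xe5.\x00 Bye\n"
print(repr(NON_ASCII.sub('', sample)))      # the control character \x00 is kept
print(repr(NON_PRINTABLE.sub('', sample)))  # \x00 is removed as well
```

Compiling the patterns once is a small convenience when the substitution is applied to many strings.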

Melo answered 23/2, 2016 at 14:14 Comment(1)

This solution answers OP's stated question, but beware that it won't remove non printable characters that are included in ASCII which I think is what OP intended to ask. – Royalroyalist 15/6, 2018 at 0:32

You may use the following code to remove non-ASCII characters:

import re
str = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]',r'', str)
print(result)

This will return

123456790 ABC#%? .()

Grecize answered 30/7, 2019 at 22:27 Comment(2)

Could you explain more on the regex you used? r'[^\x00-\x7f]' – Fabe 14/11, 2022 at 23:41

You should not assign a variable to str because that is a built-in python type. – Vermicelli 19/9, 2023 at 20:25

Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&\'()*+,-./ but includes several others e.g. []{}.

Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?

Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:

this is line 1
this is line 2

the result would be 'this is line 1this is line 2' ... is that what you really want?

A better solution would include:

a better name for the filter function than onlyascii

recognition that a filter function merely needs to return a truthy value if the argument is to be retained:

def filter_func(char):
    return char == '\n' or 32 <= ord(char) <= 126
# and later:
filtered_data = filter(filter_func, data).lower()
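On Python 3, filter returns an iterator rather than a string, so calling .lower() on its result raises an AttributeError. A sketch of the same predicate adapted for Python 3 (the sample data is made up for illustration):

```python
def filter_func(char):
    # Keep newlines and the printable ASCII range (space through tilde).
    return char == '\n' or 32 <= ord(char) <= 126

# Hypothetical sample input containing a non-ASCII letter and a BELL character.
data = "This is line 1\nThis is line 2 \xe5\x07.\n"
filtered_data = ''.join(filter(filter_func, data)).lower()
print(repr(filtered_data))
```

Because the predicate keeps '\n', the two lines stay separate instead of being run together.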

Krahling answered 31/12, 2011 at 22:2 Comment(1)

This answer is very helpful to those of us coming in to ask something similar to the OP, and your proposed answer is helpfully pythonic. I do, however, find it strange that there isn't a more efficient solution to the problem as you interpreted it (which I often run into) - character by character, this takes a very long time in a very large file. – Morrison 12/1, 2014 at 22:50

Working my way through Fluent Python (Ramalho) - highly recommended. List comprehension one-ish-liners inspired by Chapter 2:

onlyascii = ''.join([s for s in data if ord(s) < 128])
onlymatch = ''.join([s for s in data if s in
              'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])

Kekkonen answered 14/9, 2017 at 18:27 Comment(1)

This would not allow for standard ASCII symbols, such as bullet points, degrees symbol, copyright symbol, Yen symbol, etc. Also, your first example includes non-printable symbols, such as BELL, which is undesirable. – Diplomacy 13/4, 2020 at 5:35

If you want printable ascii characters you probably should correct your code to:

if ord(char) < 32 or ord(char) > 126: return ''

This is equivalent to string.printable (answer from @jterrace), except that it leaves out tabs and line breaks ('\t', '\n', '\x0b', '\x0c' and '\r'). Note that it doesn't correspond to the range in your question.
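The difference between the two definitions can be checked directly. A quick sketch (the variable names are only for illustration):

```python
import string

ascii_range = {chr(i) for i in range(32, 127)}   # ord 32..126 inclusive
printable = set(string.printable)

# Characters in string.printable that the 32..126 range leaves out.
extra = printable - ascii_range
print(sorted(extra))
```

Every character in the 32..126 range is in string.printable; the set difference shows which whitespace characters the range check drops.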

Sociometry answered 31/12, 2011 at 18:50 Comment(5)

Slightly simpler: lambda x: 32 <= ord(x) <= 126 – Popovich 31/12, 2011 at 18:59

that's not the same as string.printable because it leaves out string.whitespace, although that might be what the OP wants, depends on things like \n and \t. – Popovich 31/12, 2011 at 19:2

@Popovich right, includes space (ord 32) but no returns and tabs – Sociometry 31/12, 2011 at 19:7

yeah, just commenting on "this is equivalent to string.printable", but not true – Popovich 31/12, 2011 at 19:8

I edited the answer, thanks! the OP question is misleading if you do not read it carefully. – Sociometry 31/12, 2011 at 19:12

-1

This function keeps only printable ASCII characters. It accepts either str or bytes input and returns the same type:

from string import printable

def getOnlyCharacters(texts):
    # Remember the input type so the result can be returned as the same type.
    is_bytes = isinstance(texts, bytes)
    if is_bytes:
        # Drop byte sequences that are not valid UTF-8.
        texts = texts.decode('utf-8', 'ignore')

    # Keep only the characters listed in string.printable.
    result = ''.join(ch for ch in texts if ch in printable)

    if is_bytes:
        result = result.encode('utf-8')

    return result

text = '�Ahm�����ed Sheri��'
result = getOnlyCharacters(text)

print(result)
#input --> �Ahm�����ed Sheri��
#output --> Ahmed Sheri

Etruria answered 30/1, 2023 at 22:44 Comment(1)

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Delogu 1/2, 2023 at 17:23
