How can I remove non-ASCII characters but leave periods and spaces?
L

8

134

I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:

def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char

def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data

How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.

Lettuce answered 31/12, 2011 at 18:23 Comment(2)
Thanks (sincerely) for the clarification John. I understood that spaces and periods are ASCII characters. However, I was removing both of them unintentionally while trying to remove only non-ASCII characters. I see how my question might have implied otherwise.Lettuce
@PoliticalEconomist: Your problem is still very under-specified. See my answer.Krahling
P
230

You can filter all characters from the string that are not printable using string.printable, like this:

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

string.printable on my machine contains:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter returns an iterator rather than a string. To get a string back, join the result:

''.join(filter(lambda x: x in printable, s))
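
For large files, the comments below suggest building the set once, passing its membership test straight to filter, and processing the file in chunks. Here is a minimal Python 3 sketch of that idea. The names keep_printable and filter_file, the 64 KiB chunk size, and the UTF-8 input encoding are assumptions made for illustration; they are not part of the original answer.

```python
import string

# Build the lookup set once and reuse it for every chunk.
PRINTABLE = frozenset(string.printable)


def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    # Passing the bound __contains__ method avoids a Python-level
    # lambda call for each character.
    return ''.join(filter(PRINTABLE.__contains__, text))


def filter_file(file_path, chunk_size=64 * 1024, encoding='utf-8'):
    """Read a text file in chunks and return its printable-ASCII content.

    chunk_size and encoding are illustrative defaults (assumptions).
    """
    pieces = []
    # errors='replace' turns undecodable bytes into U+FFFD, which is
    # not printable ASCII and is therefore filtered out below.
    with open(file_path, 'r', encoding=encoding, errors='replace') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            pieces.append(keep_printable(chunk))
    return ''.join(pieces)
```

Spaces and periods are part of string.printable, so both survive the filter. Whether this is actually faster than the lambda version depends on the data and should be measured.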
Popovich answered 31/12, 2011 at 18:29 Comment(23)
chr(127) in string.printable ?Sociometry
what's up with those printable chars that are below ordinal 48 ?Sociometry
chr(127) in string.printable == FalsePopovich
Do you mean 0b and 0c? They are part of string.whitespace.Popovich
yes, and from the OP: if ord(char) < 48 or ord(char) > 127. About my second comment, I am referring to '*', '(', and other printable characters which are eliminated by the OP...Sociometry
Yeah, I was extrapolating that the OP probably meant all printable characters, rather than what was actually said, but might not be the case.Popovich
Thanks! I understand now. Sorry for the confusion - jterrace correctly interpreted my question.Lettuce
this is also great for just filtering to digits - filter(lambda x: x in string.digits, s)Danyel
This is incredibly slow in a large file. Any suggestions?Morrison
@Morrison create a set(string.printable) and re-use it for the filtering. Also don't filter the whole file at once - do it in chunks of 8K-512KPopovich
The only problem with using filter is that it returns an iterable. If you need a string back (as I did because I needed this when doing list compression) then do this: ''.join(filter(lambda x: x in string.printable, s)).Njord
@Njord - comment is python 3 specific, but very useful. Thanks!Entertainer
Why not use regular expression: re.sub(r'[^\x00-\x7f]',r'', your-non-ascii-string) . See this thread https://mcmap.net/q/98527/-replace-non-ascii-characters-with-a-single-spaceMelo
This is the most compatible way of doing the OP's task, I tested in from Python 2.6 to Python 3.5.Bridgeboard
@NoamManos this was 4-5 times faster for me than the join...filter...lambda solution, thanks.Bigwig
I suspect changing lambda x: x in printable to printable.__contains__ would make it run faster; the lambda means more Python level code execution, while directly passing the built-in membership test method removes per character byte code execution.Cumulative
PyLint Complains on the use of filter when using the above code. Given that list comprehensions seem to be preferred would using ''.join(x for x in s if x in printable) be a) equivalent, and b) any better?Tisdale
Edit: I realise the above is a generator expression, but does the same apply?Tisdale
@Tisdale - it's most likely equivalent, but I'd have to profile it to know for surePopovich
@Jonny, the result is the same; the timing differs (compare them only if this turns out to be a bottleneck). The generator expression is easier on the eye: the fewer tools involved, the faster the reading comprehension. You may want to add a line break before if and indent the second line so that if starts just after the ( on the first line.Sheliasheline
Am I the only one this doesn't work for? Why wouldn't those characters be included in the printable list, like 0 or x for example?Caliph
@CharlesSmith - those are escape sequencesPopovich
When I assign the value to a variable it works fine, whereas when I read it from a file the filtering has no effect. I don't know why. Any ideas?Melanymelaphyre
T
119

An easy way to change to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:

    >>> s = u'Good bye in Swedish is Hej d\xe5'
    >>> s = s.encode('ascii', errors='ignore')
    >>> print s
    Good bye in Swedish is Hej d

Edit:

Python3: str -> bytes -> str

>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'

Python2: unicode -> str -> unicode

>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'

Python2: str -> unicode -> str (decode and encode in reverse order)

>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
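
Several comments below report a UnicodeDecodeError when the text comes from a file. In Python 3 this can usually be avoided by decoding the file with its real encoding first and only then dropping the non-ASCII characters. The sketch below assumes the file is UTF-8; the helper names to_ascii and read_as_ascii are hypothetical.

```python
def to_ascii(text):
    """Drop every character that has no ASCII encoding."""
    # str -> bytes (non-ASCII characters are skipped) -> str
    return text.encode('ascii', errors='ignore').decode('ascii')


def read_as_ascii(file_path, encoding='utf-8'):
    """Read a text file and return only its ASCII characters.

    The 'utf-8' default is an assumption; pass the file's real encoding.
    """
    with open(file_path, 'r', encoding=encoding) as f:
        return to_ascii(f.read())
```

Spaces, periods and newlines are ASCII, so they are kept. ASCII control characters such as \x00 are kept as well, because this approach only removes characters above code point 127.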
Tennyson answered 25/8, 2013 at 15:50 Comment(4)
I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 27Morrison
I got that error when I put the actual unicode character in the string via copy paste. When you specify a string as u'thestring' encode works correctly.Sot
Works only on Py3, but it's elegant.Bridgeboard
For those who are getting the same error as @Morrison : you should first .decode() the string, and only after that encode. For example s.decode('utf-8').encode('ascii', errors='ignore')Yann
M
42

According to @artfulrobot, this should be faster than filter and lambda:

import re
re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)

See more examples in the thread "Replace non-ASCII characters with a single space".
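
As a rough sketch, the same substitution can be wrapped in a function with a precompiled pattern. The second pattern is a variation that is not part of this answer: it also removes ASCII control characters and keeps only space through tilde plus the newline. All names here are chosen for illustration.

```python
import re

# Anything outside code points 0..127.
NON_ASCII = re.compile(r'[^\x00-\x7f]')

# Variation (assumption): anything that is not printable ASCII
# (space through tilde) or a newline.
NON_PRINTABLE = re.compile(r'[^\x20-\x7e\n]')


def strip_non_ascii(text, replacement=''):
    """Replace every non-ASCII character with replacement."""
    return NON_ASCII.sub(replacement, text)


def strip_non_printable(text):
    """Remove everything except printable ASCII and newlines."""
    return NON_PRINTABLE.sub('', text)
```

Passing replacement=' ' gives the behaviour named in the linked thread's title, where each non-ASCII character becomes a single space.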

Melo answered 23/2, 2016 at 14:14 Comment(1)
This solution answers OP's stated question, but beware that it won't remove non printable characters that are included in ASCII which I think is what OP intended to ask.Royalroyalist
G
8

You may use the following code to remove non-ASCII characters (for example, letters that are not part of the English alphabet):

import re
str = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]',r'', str)
print(result)

This prints:

123456790 ABC#%? .()
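
The pattern is a negated character class: square brackets match one character from a set, a leading ^ inverts the set, and \x00-\x7f is the range of code points 0 to 127, which is exactly ASCII. So the pattern matches any single character that is not ASCII. The short sketch below, with variable names chosen for illustration, shows which characters the pattern matches and what is left after removing them.

```python
import re

text = "123456790 ABC#%? .(朱惠英)"
pattern = re.compile(r'[^\x00-\x7f]')

# Every character the pattern matches, i.e. what will be removed.
removed = pattern.findall(text)

# The text with those characters replaced by the empty string.
kept = pattern.sub('', text)

print(removed)
print(kept)
```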

Grecize answered 30/7, 2019 at 22:27 Comment(2)
Could you explain more on the regex you used? r'[^\x00-\x7f]'Fabe
You should not name a variable str, because that shadows the built-in Python type.Vermicelli
K
6

Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&\'()*+,-./ but includes several others e.g. []{}.

Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?

Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:

this is line 1
this is line 2

the result would be 'this is line 1this is line 2' ... is that what you really want?

A greater solution would include:

  1. a better name for the filter function than onlyascii
  2. recognition that a filter function merely needs to return a truthy value if the argument is to be retained:

    def filter_func(char):
        return char == '\n' or 32 <= ord(char) <= 126
    # and later:
    filtered_data = filter(filter_func, data).lower()
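
    # Note: calling .lower() directly on the result of filter() relies on
    # Python 2, where filtering a str returns a str.

In Python 3, filter returns an iterator, so the result has to be joined before .lower() can be called. A minimal sketch of the same predicate for Python 3, with the function names is_wanted and clean chosen for illustration:

```python
def is_wanted(char):
    """True for a newline or a printable ASCII character (space to tilde)."""
    return char == '\n' or 32 <= ord(char) <= 126


def clean(data):
    """Keep newlines and printable ASCII, then lowercase the result."""
    return ''.join(filter(is_wanted, data)).lower()
```

With this predicate, the two-line example above keeps its line break instead of collapsing into one line.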
    
Krahling answered 31/12, 2011 at 22:2 Comment(1)
This answer is very helpful to those of us coming in to ask something similar to the OP, and your proposed answer is helpfully pythonic. I do, however, find it strange that there isn't a more efficient solution to the problem as you interpreted it (which I often run into) - character by character, this takes a very long time in a very large file.Morrison
K
3

Working my way through Fluent Python (Ramalho) - highly recommended. List comprehension one-ish-liners inspired by Chapter 2:

onlyascii = ''.join([s for s in data if ord(s) < 128])  # ASCII is code points 0..127
onlymatch = ''.join([s for s in data if s in
              'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])
Kekkonen answered 14/9, 2017 at 18:27 Comment(1)
This would not allow for standard ASCII symbols, such as bullet points, degrees symbol, copyright symbol, Yen symbol, etc. Also, your first example includes non-printable symbols, such as BELL, which is undesirable.Diplomacy
S
1

If you want only printable ASCII characters, you should probably correct your code to:

if ord(char) < 32 or ord(char) > 126: return ''

This is equivalent to filtering with string.printable (the answer from @jterrace), except that it also drops tabs and line breaks ('\t', '\n', '\x0b', '\x0c' and '\r'). Note that it does not correspond to the range in your question.
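
The difference between the two filters can be checked directly. This small sketch, with variable names chosen for illustration, compares the characters in the range 32 to 126 with string.printable:

```python
import string

# Characters kept by the range test 32 <= ord(char) <= 126.
in_range = {chr(i) for i in range(32, 127)}

# Characters kept by the string.printable filter.
printable = set(string.printable)

# Characters that only one of the two filters keeps.
only_in_printable = printable - in_range
only_in_range = in_range - printable

print(sorted(only_in_printable))
print(sorted(only_in_range))
```

The first set holds the five whitespace characters other than the space, and the second set is empty, so the range test is a strict subset of string.printable.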

Sociometry answered 31/12, 2011 at 18:50 Comment(5)
Slightly simpler: lambda x: 32 <= ord(x) <= 126Popovich
that's not the same as string.printable because it leaves out string.whitespace, although that might be what the OP wants, depends on things like \n and \t.Popovich
@Popovich right, includes space (ord 32) but no returns and tabsSociometry
yeah, just commenting on "this is equivalent to string.printable", but not truePopovich
I edited the answer, thanks! the OP question is misleading if you do not read it carefully.Sociometry
E
-1

This function keeps only printable ASCII characters and accepts either str or bytes input, returning the same type it was given:

from string import printable

def getOnlyCharacters(texts):
    _type = None
    result = ''
    
    if type(texts).__name__ == 'bytes':
        _type = 'bytes'
        texts = texts.decode('utf-8','ignore')
    else:
        _type = 'str'
        texts = bytes(texts, 'utf-8').decode('utf-8', 'ignore')

    texts = str(texts)
    for text in texts:
        if text in printable:
            result += text
            
    if _type == 'bytes':
        result = result.encode('utf-8')

    return result

text = '�Ahm�����ed Sheri��'
result = getOnlyCharacters(text)

print(result)
#input --> �Ahm�����ed Sheri��
#output --> Ahmed Sheri
Etruria answered 30/1, 2023 at 22:44 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Delogu
