How to remove extended ASCII using Python?

In trying to fix up a PML (Palm Markup Language) file, it appears that my test file has non-ASCII characters, which cause MakeBook to complain. The solution would be to strip out all the non-ASCII chars in the PML.

So in attempting to fix this in python, I have

import unicodedata, fileinput

for line in fileinput.input():
    print unicodedata.normalize('NFKD', line).encode('ascii','ignore')

However, this results in an error that line must be "unicode, not str". Here's a file fragment.

\B1a\B \tintense, disordered and often destructive rage†.†.†.\t

Not quite sure how to properly pass line in to be processed at this point.
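
For reference, the failing step is that fileinput yields byte strings, while unicodedata.normalize needs text. A minimal Python 3 sketch of the intended pipeline, decoding first (the sample bytes and the UTF-8 guess are assumptions, not taken from the real file):

```python
import unicodedata

# A simplified fragment of the file as raw bytes; the daggers
# (U+2020) are UTF-8 encoded here -- the real file's encoding
# is an assumption and may well be something else.
raw = b"intense, disordered rage\xe2\x80\xa0.\xe2\x80\xa0."

# Decode bytes to text *first*, then normalize and drop non-ASCII.
text = raw.decode("utf-8")
cleaned = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
print(cleaned)  # the daggers are simply dropped
```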

Carswell answered 6/11, 2009 at 5:54 Comment(3)
Do you want to filter out any character whose ASCII value is larger than 255? – Sajovich
Strictly speaking, there's no such thing as Extended ASCII. ASCII defines values from 0 to 127. Anything higher than that can only be interpreted arbitrarily. Perhaps you should use the term non-ASCII characters. – Alhambra
Related: Safe escape function for terminal output #437976 – Gannon

Try print line.decode('iso-8859-1').encode('ascii', 'ignore') -- that should be much closer to what you want.

Palpate answered 6/11, 2009 at 6:08 Comment(5)
This seems to work although MakeBook is now complaining about illegal control codes. – Carswell
@Jauder, you can of course remove control codes too, for example after the above clean = ''.join(c for c in line if ord(c) >= 32) (removes ALL control codes including newline and carriage return -- adjust to taste, we can't really do it for you without knowing WHAT control codes you want to remove!-). – Palpate
@Alex, if I knew, I would =). Trouble is that I'm working with just a Java prog without source available that only emits a cryptic error message. gist.github.com/227882 – Carswell
But ideally, I would want to remove spurious control codes while keeping the LF/CR. – Carswell
@Jauder, fine, but I don't know which ones are "spurious". What about: spurious = set(chr(c) for c in range(32)) - set('\r\n\t') and of course clean = ''.join(c for c in line if c not in spurious), then interactively adjust spurious by empirically trying until it is exactly the set of characters you need to remove. – Palpate
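
Putting the pieces of this comment thread together, a sketch in Python 3 terms (the sample bytes and the ISO-8859-1 guess are illustrative assumptions):

```python
# Drop non-ASCII via a decode/encode round trip, then strip control
# codes except the whitespace we want to keep (CR, LF, tab).
raw = b"rage\x86.\x07bell\r\n"          # \x86 and \x07 are sample junk bytes

text = raw.decode("iso-8859-1")          # assumed source encoding
ascii_only = text.encode("ascii", "ignore").decode("ascii")

spurious = {chr(c) for c in range(32)} - set("\r\n\t")
clean = "".join(c for c in ascii_only if c not in spurious)
print(repr(clean))                       # \x86 and the BEL are gone, CRLF kept
```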

You would like to treat line as ASCII-encoded data, so the answer is to decode it to text using the ascii codec:

line.decode('ascii')

This will raise errors for data that is not in fact ASCII-encoded. This is how to ignore those errors:

line.decode('ascii', 'ignore')

This gives you text, in the form of a unicode instance. If you would rather work with (ASCII-encoded) data than with text, you can re-encode it to get back a str or bytes instance (depending on your version of Python):

line.decode('ascii', 'ignore').encode('ascii')

Trustee answered 6/11, 2009 at 6:17 Comment(0)

To drop non-ASCII characters use line.decode(your_file_encoding).encode('ascii', 'ignore'). But you would probably be better off using PML escape sequences for them:

import re

def escape_unicode(m):
    return '\\U%04x' % ord(m.group())

non_ascii = re.compile(u'[\x80-\uFFFF]', re.U)

line = u'\\B1a\\B \\tintense, disordered and often destructive rage\u2020.\u2020.\u2020.\\t'
print non_ascii.sub(escape_unicode, line)

This outputs \B1a\B \tintense, disordered and often destructive rage\U2020.\U2020.\U2020.\t.

Dropping non-ASCII and control characters with regular expression is easy too (this can be safely used after escaping):

regexp = re.compile('[^\x09\x0A\x0D\x20-\x7F]')
regexp.sub('', line)
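
The same whitelist regex runs unchanged in Python 3; a quick sketch with a made-up sample line:

```python
import re

# Keep only tab, LF, CR, and printable ASCII; drop everything else.
regexp = re.compile('[^\x09\x0A\x0D\x20-\x7F]')

line = 'rage\u2020.\x07.\t ok\r\n'       # a dagger and a BEL as sample junk
clean = regexp.sub('', line)
print(repr(clean))                        # tab, space, and CRLF survive
```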
Upshaw answered 6/11, 2009 at 11:02 Comment(0)

When reading from a file in Python you're getting byte strings, aka "str" in Python 2.x and earlier. You need to convert these to the "unicode" type using the decode method. E.g.:

line = line.decode('latin1')

Replace 'latin1' with the correct encoding.
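
In Python 3 terms, the same conversion goes from bytes to str ('latin1' here is still just a placeholder for the file's real encoding):

```python
# Sample raw bytes standing in for a line read in binary mode.
raw = b"destructive rage\x86"

line = raw.decode("latin1")   # bytes -> str (i.e. Unicode text)
print(type(line).__name__)
```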

Rrhoea answered 6/11, 2009 at 6:4 Comment(0)
