Best way to remove '\xad' in Python?

I'm trying to build a corpus from the .txt file found at this link. I believe the instances of \xad are supposed to be 'soft hyphens', but they do not appear to be read correctly under UTF-8 encoding. I've tried encoding the .txt file as iso8859-15, using the code:

with open('Harry Potter 3 - The Prisoner Of Azkaban.txt', 'r',
          encoding='iso8859-15') as myfile:
    data = myfile.read().replace('\n', '')

data2 = data.split(' ')

This returns an array of 'words', but '\xad' remains attached to many entries in data2. I've tried

data_clean = data.replace('\\xad', '')

and

data_clean = data.replace('\\xad|\\xad\\xad','')

but this doesn't seem to remove the instances of '\xad'. Has anyone run into a similar problem before? Ideally I'd like to encode this data as UTF-8 to avail of the nltk library, but it won't read the file with UTF-8 encoding, as I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 471: invalid start byte

Any help would be greatly appreciated!

Additional context: This is a recreational project with the aim of being able to generate stories based on the txt file. Everything I've generated thus far has been permeated with '\xad', which ruins the fun!

Heredity asked 22/8, 2018 at 23:07. Comments (5):
You don't have the character sequence backslash-x-a-d in your strings; you have actual soft hyphens. If you're seeing backslash-x-a-d in your printed output, you're probably doing something wrong, like printing lists of strings instead of printing strings, or something else that would use the repr of your strings. – Footrope
(They should really be regular hyphens instead of soft hyphens, but that's a different issue.) – Footrope
So, why have you tried encoding the file as iso8859-15? Do you need iso8859-15 bytes? If so, why are you trying to read them as UTF-8? – Descartes
@Footrope Ac-tu-al-ly, be-ing re-gu-lar hy-phens would pro-ba-bly be annoy-ing, un-less you want hy-phens in-sert-ed into al-most half the words in the no-vel. Probably better to just have nothing, and if someone wants to render it to a book, trust their hyphenation dictionary. – Descartes
@Descartes: No, the file has soft hyphens where it actually needs regular hyphens, like in "eagle-feather quill" and "jet-black hair". None of the soft hyphens I've found in the file are in syllable breaks. – Footrope

Your file almost certainly has actual U+00AD soft-hyphen characters in it.

These are characters that mark places where a word could be split when fitting lines to a page. The idea is that the soft hyphen is invisible if the word doesn't need to be split, but is printed the same as a normal U+2010 hyphen if it does.
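You can see this behavior in an interactive session (illustrative only; exactly how the soft hyphen renders depends on your terminal):

>>> s = 'every\xadone'
>>> print(s)   # the soft hyphen usually renders invisibly
everyone
>>> s          # but the repr shows it as an escape
'every\xadone'
>>> len(s)     # and it is still a real character in the string
9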

Since you don't care about rendering this text in a book with nicely flowing text, you're never going to hyphenate anything, so you just want to remove these characters.

The way to do this is not to fiddle with the encoding. Just remove them from the Unicode text, using whichever of these you find most readable:

data = data.replace('\xad', '')
data = data.replace('\u00ad', '')
data = data.replace('\N{SOFT HYPHEN}', '')

Notice the single backslash. We're not replacing a literal backslash, x, a, d, we're replacing a literal soft-hyphen character, that is, the character whose code point is hex 0xad.
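The difference is easy to check in a quick session (illustrative, not from the original answer):

>>> len('\xad')     # one character: the soft hyphen itself
1
>>> len('\\xad')    # four characters: backslash, x, a, d
4
>>> '\xad' == '\u00ad' == '\N{SOFT HYPHEN}'   # three spellings of the same character
True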

You can either do this to the whole file before splitting into words, or do it once per word after splitting.
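Putting it together, here is a minimal sketch of the whole read-clean-split pipeline (file name taken from the question; this assumes the file really is ISO-8859-15, and it swaps the newline replacement for a space so words at the ends of lines don't get glued together):

with open('Harry Potter 3 - The Prisoner Of Azkaban.txt', 'r',
          encoding='iso8859-15') as myfile:
    # read everything, turning newlines into spaces
    data = myfile.read().replace('\n', ' ')

# strip the soft hyphens before splitting into words
data = data.replace('\xad', '')
data2 = data.split(' ')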


Meanwhile, you seem to be confused about what encodings are and what to do with them:

I've tried encoding the .txt file as iso8859-15

No, you've tried decoding the file as ISO-8859-15. It's not clear why you tried ISO-8859-15 in the first place. But, since the ISO-8859-15 encoding for the character '\xad' is the byte b'\xad', maybe that's correct.
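You can check that mapping directly (an illustrative one-liner, not from the original answer):

>>> b'\xad'.decode('iso8859-15')
'\xad'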

Ideally I'd like to encode this data as UTF-8 to avail of the nltk library

But NLTK doesn't want UTF-8 bytes, it wants Unicode strings. You don't need to encode it for that.
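For example, a minimal sketch of handing the cleaned text straight to NLTK (this assumes nltk is installed; word_tokenize takes a str and may need a one-time download of its tokenizer models):

import nltk
# nltk.download('punkt')  # one-time tokenizer-model download, if you don't already have it

tokens = nltk.word_tokenize(data.replace('\xad', ''))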

Plus, you're not trying to encode your Unicode text to UTF-8, you're trying to decode your bytes from UTF-8. If that's not what those bytes are… if you're lucky, you'll get an error like this one; if not, you'll get mojibake that you don't notice until you've screwed up a 500GB corpus and thrown away the original data.[1]


[1] UTF-8 is specifically designed so you'll get early errors whenever possible. In this case, reading ISO-8859-15 text with soft hyphens as if it were UTF-8 raises exactly the error you're seeing, while reading UTF-8 text with soft hyphens as if it were ISO-8859-15 silently succeeds, leaving a spurious 'Â' character before each soft hyphen. The error is usually more helpful.
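That asymmetry is easy to reproduce (an illustrative session, not from the original answer):

>>> 'soft\xadhyphen'.encode('iso8859-15').decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 4: invalid start byte
>>> 'soft\xadhyphen'.encode('utf-8').decode('iso8859-15')
'softÂ\xadhyphen'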

Descartes answered 22/8, 2018 at 23:40. Comment (1):
Thanks. I tried decodings other than UTF-8 because they seemed to work better (they didn't produce the same error as when I tried to decode the file with UTF-8). I think UTF-8 was simply the default decoding for the nltk functions, so I did indeed mix up the two concepts of encoding and decoding. – Heredity
