How to skip the character causing UnicodeDecodeError in textract, as with errors="replace"?

I am trying to convert all readable text in a PDF file into a string using textract. It works for most files, but for some it raises a UnicodeDecodeError. I want to skip the problematic characters.

I looked for a way to solve this with errors="ignore" or errors="replace", but I couldn't find how to pass either one.

This is the line that raises the error (it is inside a for loop that works through each PDF in folder_name):

text_of_the_pdf = textract.process(os.path.join(self.folder_name, each))
    text_of_the_pdf = textract.process(os.path.join(self.folder_name, each))
  File "/Users/aaron/PycharmProjects/PDFParser/venv/lib/python3.6/site-packages/textract/parsers/__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "/Users/aaron/PycharmProjects/PDFParser/venv/lib/python3.6/site-packages/textract/parsers/utils.py", line 47, in process
    unicode_string = self.decode(byte_string)
  File "/Users/aaron/PycharmProjects/PDFParser/venv/lib/python3.6/site-packages/textract/parsers/utils.py", line 65, in decode
    return text.decode(result['encoding'])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3227: character maps to <undefined>
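
For reference, the failure in the last frame of the traceback can be reproduced with the standard library alone. The sketch below is only an illustration: the sample bytes are made up, and decode_leniently is a hypothetical helper name, not part of textract. It shows what errors="replace" and errors="ignore" do at the bytes.decode level; whether textract exposes a way to pass such an option is not established here.

```python
# Made-up sample bytes; 0x9d is the byte reported in the traceback and
# has no mapping in the cp1254 code page.
raw = b"abc\x9ddef"


def decode_leniently(data, encoding="cp1254", errors="replace"):
    """Hypothetical helper: decode bytes without raising on unmappable bytes."""
    return data.decode(encoding, errors=errors)


# Strict decoding (the default) raises, as in the traceback.
try:
    raw.decode("cp1254")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for the bad byte; "ignore" drops it.
replaced = decode_leniently(raw)
ignored = decode_leniently(raw, errors="ignore")
```

With "replace" the position of the bad byte stays visible in the output, while "ignore" silently removes it.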
Alible answered 25/10, 2019 at 11:51

Comment (1):
Why don't you just use try/except? Something like: for file_path in file_paths: try: textract.process(file_path) except: continue (comment by Lecture)
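
The try/except idea from the comment could be sketched roughly as follows. The function name extract_all and its extract parameter are assumptions made for illustration: extract stands for any callable that takes a path and returns text, such as textract.process. Note that this skips the whole file that fails, not just the offending character.

```python
import os


def extract_all(folder_name, extract):
    """Run extract(path) on every file in folder_name.

    extract is a placeholder for any callable taking a path and returning
    the extracted text (for example textract.process). Files that raise
    UnicodeDecodeError are recorded and skipped entirely.
    """
    texts = {}
    skipped = []
    for name in sorted(os.listdir(folder_name)):
        path = os.path.join(folder_name, name)
        try:
            texts[name] = extract(path)
        except UnicodeDecodeError:
            # Catch only the decode error so unrelated failures still surface.
            skipped.append(name)
    return texts, skipped
```

Catching UnicodeDecodeError specifically, rather than using a bare except, keeps other problems (missing files, parser crashes) from being hidden, and the skipped list makes it easy to see which PDFs need another look.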
