What I have as input: the raw bytes of a docx document, base64-encoded.
What I am trying to achieve: extract text from this document for further processing.
I tried to follow this answer: extracting text from MS word files in python
My code fragment:
base64_bytes = input.encode('utf-8')
decoded_data = base64.decodebytes(base64_bytes)
document = Document(decoded_data)
docText = '\n\n'.join([paragraph.text.encode('utf-8') for paragraph in document.paragraphs])
The document = Document(decoded_data) line gives me the following error: AttributeError: 'bytes' object has no attribute 'seek'
The decoded_data is in the following format: b'PK\x03\x04\x14\x00\x08\x08\x08\x00\x87@CP\x00...
How should I format the raw data to extract text from docx?
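A minimal sketch of one likely fix, assuming Document here is docx.Document from the python-docx package, which accepts either a file path or a file-like object. The error suggests that it tries to call seek() on whatever it receives, and a plain bytes object has no such method, so wrapping the decoded bytes in io.BytesIO should give it the stream it expects. The names decode_to_stream, extract_docx_text and b64_payload are hypothetical and chosen only for illustration. Separately, on Python 3 the .encode('utf-8') call inside the join would raise a TypeError, because str.join expects str items, so it is dropped here.

```python
import base64
import io


def decode_to_stream(b64_payload: str) -> io.BytesIO:
    """Decode a base64 string and wrap the result in a seekable in-memory stream."""
    raw = base64.b64decode(b64_payload)
    # io.BytesIO provides read() and seek(), which a bare bytes object lacks.
    return io.BytesIO(raw)


def extract_docx_text(b64_payload: str) -> str:
    """Return the paragraph text of a base64-encoded docx file."""
    # Assumption: python-docx is installed. It is imported here so that
    # decode_to_stream() stays usable without it.
    from docx import Document

    document = Document(decode_to_stream(b64_payload))
    # paragraph.text is already a str, so no extra encoding is needed.
    return "\n\n".join(paragraph.text for paragraph in document.paragraphs)
```

Calling extract_docx_text(b64_payload) with the base64 string from the question should then return the paragraph text. Text inside tables, headers and footers is not part of document.paragraphs and would need separate handling.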
input.encode('utf-8'). Is this your actual code? Because this is trying to encode the function object input as UTF-8 – Decurvedseek
", your question says "code". Which is it? 2) What exactly is Document and what kind of argument does it expect? – Anticipation