Reading .doc file in Python using antiword in Windows (also .docx)

Asked 7/8, 2018 at 12:49 Answered 31/1, 2021 at 17:42

I tried reading a .doc file like this:

with open('file.doc', errors='ignore') as f:
    text = f.read()

It did read the file, but with a huge amount of junk. I can't strip the junk because I don't know where it starts or ends.

I also tried installing the textract module, which claims to read any file format, but I ran into many dependency issues installing it on Windows.

So I did it with the antiword command-line utility instead; my answer is below.

Kaela answered 7/8, 2018 at 12:49 Comment(2)

doc is an obsolete binary format. docx is a zip file containing XML documents. You can't just read either of them as if they were text files – Athene 7/8, 2018 at 13:0

@PanagiotisKanavos I have to do a text classification task with ML, based on the content of the files. I have files with .pdf, .doc, .docx and .txt extensions. I did this to get the text content from the files; am I wrong? If so, how am I supposed to classify the text if I cannot read it from the files? Please clarify. – Kaela 8/8, 2018 at 9:21
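
To illustrate the point from the comments that a .docx is a zip archive of XML documents, here is a rough sketch using only the standard library. The function name docx_plain_text is hypothetical, and the tag stripping is deliberately crude (it assumes the body text lives in word/document.xml and ignores headers, footnotes and tables); a dedicated library such as docx2txt is the more robust choice.

```python
import html
import re
import zipfile


def docx_plain_text(docx_source):
    """Rough text extraction from a .docx path or file-like object.

    Hypothetical helper for illustration only: it reads the main
    document part and strips every XML tag.
    """
    with zipfile.ZipFile(docx_source) as archive:
        xml = archive.read('word/document.xml').decode('utf-8')
    # Turn paragraph ends into newlines, then drop all remaining tags.
    xml = xml.replace('</w:p>', '\n')
    text = re.sub(r'<[^>]+>', '', xml)
    # Decode XML entities such as &amp; and trim surrounding whitespace.
    return html.unescape(text).strip()
```

This shows why open('file.docx').read() returns junk: the readable text only appears after the archive is unzipped and the XML is parsed. The older binary .doc format has no such structure, which is why a tool like antiword is needed for it.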

You can use the antiword command-line utility for this. Many of you have probably tried it already, but I still wanted to share.

Download antiword from here

Extract the antiword folder to C:\ and add the path C:\antiword to your PATH environment variable.

Here is a sample of how to use it, handling docx and doc files:

import os
import docx2txt

def get_doc_text(filepath, file):
    full_path = os.path.join(filepath, file)
    if file.endswith('.docx'):
        return docx2txt.process(full_path)
    elif file.endswith('.doc'):
        # antiword prints plain text; redirect it to a temporary file named
        # like the .doc with an 'x' appended (it is not a real .docx file)
        docx_file = full_path + 'x'
        if not os.path.exists(docx_file):
            # quote both paths so names containing spaces still work
            os.system('antiword "' + full_path + '" > "' + docx_file + '"')
            with open(docx_file) as f:
                text = f.read()
            os.remove(docx_file)  # the file was only needed for reading
        else:
            # a genuine .docx with the same base name already exists,
            # so do not overwrite it
            print('Info: a .docx file with the same name as the .doc exists, so it cannot be read')
            text = ''
        return text
    return ''  # any other extension is not handled here

Now call this function:

filepath = "D:\\input\\"
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)

This could be a good alternative way to read .doc files in Python on Windows.

Hope it helps, Thanks.

Kaela answered 7/8, 2018 at 12:49 Comment(3)

It seems like it would be simpler to use subprocess.check_output here and get the output from antiword directly, rather than saving it as docx. From my usage, it seems like antiword isn't able to convert a doc file to docx. Have you found differently? – Ramsgate 9/7, 2019 at 17:32

The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. This module intends to replace several older modules and functions such as os.system. In my case conversion to .docx file was mandatory because I did not find any way to directly read from them. – Kaela 3/9, 2019 at 8:33

This works on 64-bit windows, so it must be a 64-bit version of antiword. But it's not linked from the official site. – Upturn 31/1, 2021 at 3:34
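
The subprocess.check_output suggestion from the comments could look roughly like this. It is a minimal sketch, assuming antiword is on PATH and that its output decodes as UTF-8 (undecodable bytes are replaced rather than raising). The function name read_doc_with_antiword and the run parameter are hypothetical; run exists only so the function can be exercised without antiword installed.

```python
import subprocess


def read_doc_with_antiword(doc_path, run=subprocess.check_output):
    """Return the text antiword prints for doc_path, without a temp file.

    `run` defaults to subprocess.check_output and is injectable purely
    for testing (an assumption of this sketch, not part of antiword).
    """
    # Passing the command as a list avoids shell quoting problems
    # with spaces in the path.
    raw = run(['antiword', doc_path])
    # Assumption: treat the output as UTF-8 and replace bad bytes.
    return raw.decode('utf-8', errors='replace')
```

With the default runner, a call such as read_doc_with_antiword('D:\\input\\file.doc') raises subprocess.CalledProcessError if antiword exits with an error, and FileNotFoundError if antiword is not on PATH.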

Mithilesh's example is good, but it's simpler to use textract directly once you have antiword installed. Download antiword and extract the antiword folder to C:\. Then add the antiword folder to your PATH environment variable (instructions for adding to PATH here). Open a new terminal or command console so that the updated PATH is picked up. Install textract with pip install textract.

Then you can use textract (which uses antiword for .doc files) like this:

import textract
raw = textract.process('filename.doc')  # returns a bytestring
text = raw.decode('utf-8')  # converts from bytestring to string

If you are getting errors, try running the command antiword from a terminal/console to make sure it works. Also be sure the filepath to the .doc file is correct (e.g. use os.path.exists('filename.doc')).
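
The two checks above can be combined into a small pre-flight helper. This is only a sketch: the function name check_doc_setup is hypothetical, and shutil.which only confirms that an executable is found on PATH, not that it runs correctly.

```python
import os
import shutil


def check_doc_setup(doc_path, tool='antiword'):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(tool) is None:
        problems.append(tool + ' was not found on PATH')
    if not os.path.exists(doc_path):
        problems.append('file does not exist: ' + doc_path)
    return problems
```

For example, print(check_doc_setup('filename.doc')) before calling textract.process lists anything that needs fixing; an empty list means both checks passed.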

Upturn answered 31/1, 2021 at 17:42 Comment(0)
