How to convert .docx to .txt in Python

5

6

I would like to convert a large batch of MS Word files to plain text. I have no idea how to do it in Python. I found the following code online. My path is local and all file names follow the pattern cx-xxx (e.g. c1-000, c1-001, c2-000, c2-001):

from docx import Document
import os

def convertDocxToText(path):
    # Convert every .docx file in `path` to a .txt file with the same base name
    for d in os.listdir(path):
        base, fileExtension = os.path.splitext(d)
        if fileExtension == ".docx":
            docxFilename = os.path.join(path, d)
            print(docxFilename)
            document = Document(docxFilename)
            textFilename = os.path.join(path, base + ".txt")
            with open(textFilename, "w", encoding="utf-8") as textFile:
                for para in document.paragraphs:
                    # One paragraph per line; str is already Unicode in Python 3
                    textFile.write(para.text + "\n")

path = "/home/python/resumes/"
convertDocxToText(path)
Vanadinite answered 12/7, 2020 at 10:1 Comment(0)
14

Convert docx to txt with pypandoc:

import pypandoc

# Example file:
docxFilename = 'somefile.docx'
output = pypandoc.convert_file(docxFilename, 'plain', outputfile="somefile.txt")
assert output == ""

See the official documentation here:

https://pypi.org/project/pypandoc/

Puggree answered 12/7, 2020 at 10:9 Comment(1)
Use "plain" instead of "txt" - Elishaelision
5

You can also use the library docx2txt in Python. Here's an example:

I use glob to iterate over all DOCX files in the folder. Note: I slice the ".docx" extension off the original name in order to re-use it in the TXT filename.

If there's anything I've forgotten to explain, tag me and I'll edit it in.

import docx2txt
import glob

directory = glob.glob('C:/folder_name/*.docx')

for file_name in directory:
    with open(file_name, 'rb') as infile:
        with open(file_name[:-5]+'.txt', 'w', encoding='utf-8') as outfile:
            doc = docx2txt.process(infile)
            outfile.write(doc)

print("=========")
print("All done!")
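
The slice file_name[:-5] above relies on every name ending in exactly ".docx". A minimal sketch of the same renaming step with pathlib, which swaps the final extension whatever its length; the helper name txt_path_for is hypothetical and not part of docx2txt:

```python
from pathlib import Path


def txt_path_for(docx_path):
    """Return the .txt path that sits next to the given .docx file."""
    # with_suffix replaces only the final extension, so "c1-000.docx"
    # becomes "c1-000.txt" and dots elsewhere in the name are preserved.
    return Path(docx_path).with_suffix(".txt")


if __name__ == "__main__":
    print(txt_path_for("C:/folder_name/c1-000.docx"))
```

In the loop above, open(txt_path_for(file_name), 'w', encoding='utf-8') could then stand in for the sliced name.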
Photoelectric answered 24/1, 2023 at 16:33 Comment(1)
There was no need to use "wt" instead of "w" (edited out just now). Text mode is the default, so it's slightly verbose and potentially misleading for learners if they think they need to specify it in a simple case like this (where the code doesn't change to another mode).Photoelectric
0
import os
from docx import Document

# Path to the folder containing .docx files
input_folder = "d:/doc"

# Path to the folder where .txt files will be saved (created if it does not exist)
output_folder = "d:/doc/text"
os.makedirs(output_folder, exist_ok=True)

# Get a list of all .docx files in the input folder
files = [f for f in os.listdir(input_folder) if f.endswith(".docx")]

# Loop through each .docx file and convert it to .txt
for file in files:
    docx_path = os.path.join(input_folder, file)
    txt_path = os.path.join(output_folder, os.path.splitext(file)[0] + ".txt")

    doc = Document(docx_path)
    content = [p.text for p in doc.paragraphs]

    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write("\n".join(content))

print("Conversion complete!")
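
As far as I know, doc.paragraphs in python-docx lists only body-level paragraphs, so text inside tables does not reach the .txt file. A .docx file is a ZIP archive whose main text lives in word/document.xml, so a rough, dependency-free sketch can walk that XML directly. The function name docx_paragraphs is hypothetical, and the sketch ignores tabs, line breaks, headers, footers and footnotes:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_path):
    """Return the text of every paragraph in word/document.xml, in order."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    # iter() walks the whole tree, so paragraphs nested in table cells
    # are included as well as top-level ones.
    for para in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

Writing "\n".join(docx_paragraphs(docx_path)) to the output file would then mirror the loop above while also picking up table text.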
Widdershins answered 18/6, 2023 at 4:54 Comment(0)
0

textract works not only with Word files; you can use it with PDFs too. On macOS, follow the brew-based installation steps in the documentation.

https://textract.readthedocs.io

import textract

def textract_text_from_file(file_path):
    text = textract.process(file_path)
    return text.decode()
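
The function above handles a single file, while the question asks about a whole folder. A minimal sketch of a batch wrapper that accepts any extractor taking a path and returning a string, such as textract_text_from_file; the name convert_folder and its parameters are hypothetical. It records failures instead of stopping at the first unreadable file:

```python
from pathlib import Path


def convert_folder(input_dir, output_dir, extract, suffixes=(".docx",)):
    """Write one .txt file per matching input file; return the failures."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failures = []
    for source in sorted(Path(input_dir).iterdir()):
        # Skip sub-folders and files whose extension is not requested.
        if not source.is_file() or source.suffix.lower() not in suffixes:
            continue
        try:
            text = extract(str(source))
        except Exception as exc:
            # Keep going so one bad file does not stop the whole batch.
            failures.append((source.name, exc))
            continue
        target = out_dir / (source.stem + ".txt")
        target.write_text(text, encoding="utf-8")
    return failures
```

A call might look like convert_folder("/home/python/resumes", "/home/python/resumes/txt", textract_text_from_file). Note that two inputs sharing a base name (say a.docx and a.pdf) would write to the same .txt file if both suffixes are enabled.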
Anastice answered 12/9, 2023 at 8:26 Comment(1)
You might want to add some extra details, not just a link + code but one or two sentences to give more context to your answer - Biosynthesis
-1

GroupDocs.Conversion Cloud SDK for Python supports conversion between 50+ file formats. Its free plan provides 150 API calls per month.

# Import module
import groupdocs_conversion_cloud
from shutil import copyfile

# Get your client_id and client_key at https://dashboard.groupdocs.cloud (free registration is required).
client_id = "xxxxx-xxxx-xxxx-xxxx-xxxxxxxx"
client_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Create instance of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(client_id, client_key)

try:
    # Convert DOCX to TXT
    # Prepare request
    request = groupdocs_conversion_cloud.ConvertDocumentDirectRequest("txt", "C:/Temp/sample.docx")

    # Convert
    result = convert_api.convert_document_direct(request)
    copyfile(result, 'C:/Temp/sample.txt')

except groupdocs_conversion_cloud.ApiException as e:
    print("Exception when calling convert_document_direct: {0}".format(e.message))
Cirenaica answered 24/11, 2020 at 13:6 Comment(0)
