How to extract text from PDFs in folders with Python and save it in a DataFrame?
I have many folders, each containing a couple of PDF files (other file types such as .xlsx or .doc are there as well). My goal is to extract the text of every PDF per folder and create a data frame where each record is the folder name and each column holds the text content of one PDF file in that folder, as a string.

I managed to extract text from one PDF file with the tika package (code below), but I cannot write a loop that iterates over the other PDFs in the folder, or over the other folders, to build a structured dataframe.

# import the parser object from tika
from tika import parser

# opening the pdf file
parsed_pdf = parser.from_file("document_1.pdf")

# saving the content of the pdf
# parsed_pdf['content'] returns the extracted text as a string
data = parsed_pdf['content']

# printing the content
print(data)

# <class 'str'>
print(type(data))

The desired output should look like this:

Folder_Name  pdf1              pdf2
17534        text of the pdf1  text of the pdf2
63546        text of the pdf1  text of the pdf1
26374        text of the pdf1  -
Treharne answered 16/2, 2021 at 12:47 Comment(0)
If you want to find all the PDFs in a directory and its subdirectories, you can use os.walk and glob; see Recursive sub folder search and return files in a list python. I've gone for a slightly longer form so it is easier for beginners to follow what is happening.

Then, for each file, call Apache Tika and save the text to the next row of a Pandas DataFrame.

#!/usr/bin/python3

import os, glob
from tika import parser 
from pandas import DataFrame

# What file extension to find, and where to look from
ext = "*.pdf"
PATH = "."

# Find all the files with that extension
files = []
for dirpath, dirnames, filenames in os.walk(PATH):
    files += glob.glob(os.path.join(dirpath, ext))

# Create a Pandas Dataframe to hold the filenames and the text
df = DataFrame(columns=("filename","text"))

# Process each file in turn, parsing with Tika and storing in the dataframe
for idx, filename in enumerate(files):
    data = parser.from_file(filename)
    text = data["content"]
    df.loc[idx] = [filename, text]

# For debugging, print what we found
print(df)
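The question asked for one row per folder with a column per PDF, rather than one row per file. The filename/text frame built above can be reshaped into that layout. A minimal sketch, using small hard-coded strings as a stand-in for the parsed results (the column-naming scheme pdf1, pdf2, ... is an assumption about the desired output):

```python
import os

import pandas as pd

# Stand-in for the filename/text DataFrame produced by the Tika loop
df = pd.DataFrame({
    "filename": ["17534/a.pdf", "17534/b.pdf", "63546/a.pdf"],
    "text": ["text of a", "text of b", "text of c"],
})

# Derive the folder of each file, and label the PDFs within each
# folder sequentially as pdf1, pdf2, ...
df["folder"] = df["filename"].map(os.path.dirname)
df["col"] = "pdf" + (df.groupby("folder").cumcount() + 1).astype(str)

# One row per folder, one column per PDF; missing cells become NaN
wide = df.pivot(index="folder", columns="col", values="text")
print(wide)
```

Folders with fewer PDFs than the widest folder simply get NaN in the extra columns, matching the "-" placeholder in the desired output.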
Downcast answered 16/2, 2021 at 14:48 Comment(2)
This did the job beautifully! In case of including another file format (.doc), can ext = "*.pdf" be adjusted accordingly?Treharne
Yes, just define two extensions and repeat the glob + save-matching-files step for eachDowncast
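The suggestion in the comment above can be sketched as follows: walk the tree once and run glob for each extension in every directory (the exact pair of extensions is just an example):

```python
import glob
import os

# Where to look from, and which extensions to collect
PATH = "."
exts = ("*.pdf", "*.doc")

# Walk the tree once, matching every extension in each directory
files = []
for dirpath, dirnames, filenames in os.walk(PATH):
    for ext in exts:
        files += glob.glob(os.path.join(dirpath, ext))

print(files)
```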

It is extremely easy to get a list of all PDFs on Unix.

import os

# save the paths of all PDFs in a single string
a = os.popen(r"du -a|awk '{print $2}'|grep '.*\.pdf$'").read()[2:-1]
print(a)

On my computer the output was:

[luca@artix tmp]$ python3 forum.py
a.pdf
./foo/test.pdf

You can just do something like

for line in a.split('\n'):
    print(line, line.split('/'))

and you'll know the folder of each PDF. I hope this helps.
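As the comments on this answer point out, the same listing can be done with Python's built-ins, with no external du/awk/grep processes. A portable sketch using pathlib:

```python
from pathlib import Path

# Recursively collect every PDF under the current directory
pdfs = sorted(Path(".").rglob("*.pdf"))

# Each result is a Path, so the containing folder comes for free
for p in pdfs:
    print(p.parent, p.name)
```

Path.rglob matches in subdirectories at any depth, so it covers the same cases as the du -a pipeline above, and it also works unchanged on Windows.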

Tortosa answered 16/2, 2021 at 12:52 Comment(5)
du and grep seem like overkill and are not very portable... Why not use something like os.listdir?Downcast
You can make a du script in python and grep script in python pretty easily.Tortosa
If you are running on Windows: du, grep and awk are written in C, so they are very portable; just include du.exe, grep.exe and awk.exe in your folder. It will work on both Windows and LinuxTortosa
If Unix tools are really needed, something like find or ls would be a lot simpler. However, Python has built-in support for listing directories, so you need to explain why external tools are required and why the built-in support cannot be usedDowncast
Using built-ins could be better and more portable, so please write your own answerTortosa

© 2022 - 2024 — McMap. All rights reserved.