I have many folders, each containing a couple of PDF files (along with other file types such as .xlsx or .doc). My goal is to extract the text of every PDF in each folder and build a data frame in which each row corresponds to a folder (identified by its "Folder Name") and each column holds the text content of one PDF file in that folder, as a string.
I managed to extract text from one PDF file with the tika package (code below), but I cannot write a loop that iterates over the other PDFs in the folder, or over the other folders, to build a structured data frame.
# import the parser module from tika
from tika import parser
# opening pdf file
parsed_pdf = parser.from_file("ducument_1.pdf")
# save the extracted text of the PDF
# parser.from_file returns a dict; parsed_pdf['content'] holds the text as a string
# (parsed_pdf['metadata'] holds the file's metadata)
data = parsed_pdf['content']
# print the extracted text
print(data)
# print its type: <class 'str'>
print(type(data))
The desired output should look like this:
Folder_Name | pdf1 | pdf2 |
---|---|---|
17534 | text of the pdf1 | text of the pdf2 |
63546 | text of the pdf1 | text of the pdf2 |
26374 | text of the pdf1 | - |
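For reference, here is a minimal sketch of the kind of loop I am after. It assumes all the folders sit directly under one root directory and that the columns are simply numbered pdf1, pdf2, ... in file-name order. The function names (`tika_text`, `collect_pdf_texts`, `to_dataframe`) are made up for illustration, and the extraction step is passed in as a parameter so it can be swapped for something other than tika.

```python
from pathlib import Path


def tika_text(pdf_path):
    """Return the extracted text of one PDF, or "" if tika returns no content."""
    # Imported here so the folder-walking logic works without tika installed.
    from tika import parser
    parsed = parser.from_file(str(pdf_path))
    return parsed.get("content") or ""


def collect_pdf_texts(root, extract=tika_text):
    """Build one record per sub-folder of `root`:
    {"Folder_Name": name, "pdf1": text, "pdf2": text, ...}."""
    rows = []
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        # Keep only PDFs; .xlsx, .doc and other file types are skipped.
        pdfs = sorted(f for f in folder.iterdir()
                      if f.is_file() and f.suffix.lower() == ".pdf")
        row = {"Folder_Name": folder.name}
        for i, pdf in enumerate(pdfs, start=1):
            row[f"pdf{i}"] = extract(pdf)
        rows.append(row)
    return rows


def to_dataframe(rows):
    """One row per folder; folders with fewer PDFs get NaN in the extra columns."""
    import pandas as pd
    return pd.DataFrame(rows)
```

It would be called as `df = to_dataframe(collect_pdf_texts("path/to/root"))`, where the path is a placeholder. Folders with fewer PDFs than the others end up with NaN in the unused columns, which `df.fillna("-")` would turn into the "-" shown in the table above.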