"Not enough space" error when running a for loop over 13K pdf documents

I'm running a for loop over 13K pdf files: it reads each file, pre-processes the text, finds similarities, and writes the results to a txt file. However, when I run the loop it gives an error:

Error in poppler_pdf_text(loadfile(pdf), opw, upw) : Not enough space

What could be the reason?

  1. I tried increasing memory.limit(); that is not the issue either.
  2. I tried deleting hidden files in the folder, such as Thumbs.db, but the same issue appears again.
  3. I remove the pdf object (rm(pdf)) at every iteration.

folder_path <- "C: ...."
## get vector with all pdf names
pdf_folder <- list.files(folder_path)

## for loop over all pdf documents
for (s in seq_along(pdf_folder)) {

   ## choose one pdf document from the vector of file names
   pdf_document_name <- pdf_folder[s]

   ## read the pdf document into a data.frame (read_pdf() is from textreadr)
   pdf <- read_pdf(paste0(folder_path, "/", pdf_document_name))

   print(s)

   rm(pdf)

} ## end of for loop

# Error: 

Error in poppler_pdf_text(loadfile(pdf), opw, upw) : Not enough space

The expected outcome is to read all pdf documents in the original path.

Goofy answered 12/7, 2019 at 17:19 Comment(6)
Not sure what is generating that error, but how much free space is on the underlying hard drive? – Diaconicum
Do you know the particular file it fails on? Maybe it's an issue with that particular file. – Lerner
In line with what @Lerner said, you can try inserting a counter in your loop right before print(s), like so: cat("counter: ", s). Then you'll be able to see where the loop fails and investigate that pdf file. Even though it seems that this is a memory issue, you can see how many files your computer can handle and chunk the loop into a few parts so that you don't run out of memory running the entire thing at once. – Hamlett
The pdf files are totally machine readable. – Goofy
@Diaconicum On C: I have around 1 GB of free space. – Goofy
Try not only rm(pdf), but also garbage collection with gc(). A good practice here would also be to save your results to a local file every 100 files. – Gildea
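
A minimal sketch that combines these suggestions (a visible counter, explicit gc() after rm(pdf), and a checkpoint file every 100 iterations); the results placeholder and the checkpoint file name are assumptions, not part of the original code:

library(textreadr)

folder_path <- "C: ...."
pdf_folder  <- list.files(folder_path)

results <- vector("list", length(pdf_folder))

for (s in seq_along(pdf_folder)) {

   cat("counter: ", s, "\n")

   pdf <- read_pdf(paste0(folder_path, "/", pdf_folder[s]))

   ## ... pre-process text, compute similarities, collect output ...
   results[[s]] <- nrow(pdf)                     ## placeholder result

   rm(pdf)
   gc()                                          ## run garbage collection after dropping the object

   ## checkpoint so a crash does not lose everything processed so far
   if (s %% 100 == 0) saveRDS(results, "results_checkpoint.rds")
}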

I was able to reproduce this error with the following:

  • Image-based pdf (16,702 pages, 161,277 KB)
  • R v3.5.3 64-bit
  • textreadr v0.90
  • pdftools v2.2
  • tesseract v4.0
  • Windows 10 64-bit
  • 16 GB RAM

This is resolved by updating the pdftools package to v2.3.1.
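
To confirm which version you have and pull the newer release, the standard utils calls are enough:

packageVersion("pdftools")       ## check the installed version
install.packages("pdftools")     ## update to the current CRAN release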

large_pdf_file <- "path/to/file.pdf"

system.time(test <- textreadr::read_pdf(large_pdf_file))
#    user  system elapsed
#  165.64    0.42  166.17

dim(test)
# [1] 519871      3

The problem appears to be a memory leak in the poppler library, which is used by the pdftools package.

The Task Manager shows a huge increase in RAM while textreadr::read_pdf is reading a large image-based pdf file.
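
If you prefer to watch this from inside R rather than Task Manager, one rough check (a sketch; memory held by poppler's C++ layer may not show up in these numbers) is to bracket the call with gc():

gc(reset = TRUE)                              ## reset the "max used" counters
test <- textreadr::read_pdf(large_pdf_file)
gc()                                          ## the "max used" column shows the peak during the read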

If you insist on using an older version of pdftools, some users have reported success with the workaround below; however, I tried it using the same large pdf file as before and received this error:

pdf <- callr::r(function(){
    textreadr::read_pdf('filename.pdf')
})
   
Error in value[[3L]](cond) : 
  callr subprocess failed: could not start R, exited with non-zero status,
has crashed or was killed
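
If you still want the subprocess route for a 13K-file loop, one possible refinement (untested here, and pdf_files is a placeholder vector of full paths) is to wrap each call in tryCatch() so a crashing subprocess skips that file instead of stopping the whole run:

read_pdf_safely <- function(path) {
  tryCatch(
    callr::r(function(p) textreadr::read_pdf(p), args = list(p = path)),
    error = function(e) {
      message("subprocess failed on: ", path)
      NULL                                      ## return NULL for files whose subprocess crashes
    }
  )
}

pdf_texts <- lapply(pdf_files, read_pdf_safely)
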
Lathe answered 13/7, 2019 at 2:11 Comment(0)

Python has generator functions, which can iterate over a large number of documents without holding them all in memory; you could try the same approach. I am not sure whether your code is in Python. Even if it is not, you can call into Python and run only this piece of the work there. Python also has the difflib library, which can compare documents with a single line of code.

Please refer to the video below.

https://www.youtube.com/watch?v=bD05uGo_sVI

Dichlorodiphenyltrichloroethane answered 22/9, 2020 at 16:12 Comment(0)
