How to launch a pdftk subprocess while in wsgi?

I need to launch a pdftk process while serving a web request in Django, and wait for it to finish. My current pdftk code looks like this:

proc = subprocess.Popen(["/usr/bin/pdftk", 
                         "/tmp/infile1.pdf", 
                         "/tmp/infile2.pdf", 
                         "cat", "output", "/tmp/outfile.pdf"])    
proc.communicate()

This works fine, as long as I'm executing under the dev server (running as user www-data). But as soon as I switch to mod_wsgi, changing nothing else, the code hangs at proc.communicate(), and "outfile.pdf" is left as an open file handle of zero length.

I've tried several variants of the subprocess invocation (as well as plain old os.system). Setting stdin/stdout/stderr to PIPE or to various file handles changes nothing. Using "shell=True" prevents proc.communicate() from hanging, but then pdftk fails to create the output file, under both the dev server and mod_wsgi. This discussion seems to indicate there might be some deeper voodoo going on with OS signals and pdftk that I don't understand.

Are there any workarounds to get a subprocess call like this to work properly under wsgi? I'm avoiding using PyPDF to combine pdf files, because I have to combine large enough numbers of files (several hundred) that it runs out of memory (PyPDF needs to keep every source pdf file open in memory while combining them).

I'm doing this on a recent Ubuntu, with Python 2.6 and 2.7.

Trying answered 25/9, 2011 at 3:26 Comment(0)

Try using absolute file system paths for the input and output files. The current working directory under Apache will not be the same directory as under runserver and could be anything.


Second attempt after eliminating the obvious.

The pdftk program is a Java program that relies on being able to generate and receive the SIGPWR signal to trigger garbage collection or perform other actions. The problem is that under Apache/mod_wsgi in daemon mode, signals are blocked within the request handler threads so that they are received only by the main thread, which watches for process shutdown trigger events. When you fork a process to run pdftk, it unfortunately inherits the blocked sigmask from the request handler thread. This impedes the Java garbage collection process and causes pdftk to fail in strange ways.
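The inheritance described above can be observed with a small sketch. It does not involve pdftk or mod_wsgi: it blocks SIGUSR1 (a stand-in chosen for portability, not the SIGPWR signal mentioned above) in the calling thread, starts a child Python process, and asks the child whether that signal is blocked in its own mask. The function names are made up for this illustration, and it assumes a Unix system with Python 3.3 or later.

```python
import signal
import subprocess
import sys

# Code run in the child: print whether SIGUSR1 is blocked in its signal mask.
# Passing an empty list to SIG_BLOCK changes nothing and returns the current mask.
CHILD_CODE = (
    "import signal; "
    "print(signal.SIGUSR1 in signal.pthread_sigmask(signal.SIG_BLOCK, []))"
)


def child_has_sigusr1_blocked():
    """Start a child interpreter and report whether SIGUSR1 is blocked in it."""
    out = subprocess.check_output([sys.executable, "-c", CHILD_CODE])
    return out.strip() == b"True"


def demo_inherited_mask():
    """Block SIGUSR1 in this thread, then check what a child process sees."""
    # Stand-in for the blocked signals of a mod_wsgi request handler thread.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGUSR1])
    try:
        return child_has_sigusr1_blocked()
    finally:
        # Restore the parent's original mask whatever happens.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


if __name__ == "__main__":
    print("child inherited blocked SIGUSR1:", demo_inherited_mask())
```

The child never touches its own mask, so whatever it reports was inherited from the thread that spawned it.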

The only solution for this is to use Celery and have the front end submit a job to the Celery queue for celeryd to then fork and execute pdftk. Because this is then done from a process created distinct from Apache, you will not have this issue.
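A minimal sketch of that arrangement might look like the following. The module layout, the app name "pdf_tasks", the broker and result backend URLs, the function names and the timeout are all assumptions for illustration; only the pdftk command line comes from the question. The Celery import is guarded purely so the sketch can be loaded where Celery is not installed.

```python
import subprocess

try:
    from celery import Celery
except ImportError:  # allows the sketch to load without Celery present
    Celery = None

PDFTK = "/usr/bin/pdftk"  # path taken from the question


def build_cat_command(inputs, output):
    """Return the pdftk argument list that concatenates inputs into output."""
    return [PDFTK, *inputs, "cat", "output", output]


def merge_pdfs(inputs, output):
    """Run pdftk and wait for it; meant to execute inside the Celery worker."""
    subprocess.run(build_cat_command(inputs, output), check=True, timeout=300)
    return output


if Celery is not None:
    # Placeholder broker/backend URLs (assumptions); use your own settings.
    app = Celery("pdf_tasks", broker="amqp://localhost//", backend="rpc://")
    # Register the plain function as a task executed by the worker process.
    merge_pdfs_task = app.task(merge_pdfs)
```

From the Django view, something like merge_pdfs_task.delay(inputs, output).get(timeout=300) would submit the job and block until the worker has finished. This keeps everything within one response cycle while pdftk itself runs in a process that was not forked from Apache.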

For more gory details Google for mod_wsgi and pdftk, in particular in Google Groups.

http://groups.google.com/group/modwsgi/search?group=modwsgi&q=pdftk&qt_g=Search+this+group

Manaus answered 25/9, 2011 at 4:33 Comment(3)
I am in fact using absolute paths, thanks. I've updated the example code to reflect this. The issue remains, unfortunately. – Trying
Thanks, forking to celery worked. I executed the celery task synchronously (using task.delay().get()) so that it could happen within a single response cycle, which gets me the desired result. – Trying
Wow, I couldn't find this solution anywhere else, but this is exactly what I had to do to get PDFTK to work. THANKS SO MUCH! – Poussin

Update: Merging Two PDFs Together Using pdftk on Python 3:

It's been several years since this question was posted (2011). The original poster said that os.system didn't work for them when they were running older versions of Python:

  • Python 2.6 and
  • Python 2.7

On Python 3.4, os.system worked for me:

  • import os
  • os.system("pdftk " + template_file + " fill_form " + data_file + " output " + export_file)

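Python 3.5 adds subprocess.run, which takes the command as a list of arguments (a single string only works with shell=True):

  • import subprocess
  • subprocess.run(["pdftk", template_file, "fill_form", data_file, "output", export_file])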

  • I used absolute paths for my files:

    • template_file = "/var/www/myproject/static/"

I ran this with Django 1.10, with the resulting output being saved to export_file.
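As a sketch of the subprocess.run variant: passing the arguments as a list rather than one concatenated string needs no shell and copes with spaces in file paths. The helper names below are hypothetical, and the path containing a space is invented to show the difference.

```python
import shlex
import subprocess


def fill_form_command(template_file, data_file, export_file):
    """Build the argument list for `pdftk <template> fill_form <data> output <out>`."""
    return ["pdftk", template_file, "fill_form", data_file, "output", export_file]


def run_fill_form(template_file, data_file, export_file):
    """Run pdftk without a shell; raises CalledProcessError on a non-zero exit."""
    cmd = fill_form_command(template_file, data_file, export_file)
    return subprocess.run(cmd, check=True)


# Hypothetical paths; the template path deliberately contains a space.
cmd = fill_form_command(
    "/var/www/my project/template.pdf",
    "/var/www/myproject/data.fdf",
    "/var/www/myproject/pdf_output.pdf",
)

# shlex.join (Python 3.8+) shows how the list would need to be quoted for a shell.
print(shlex.join(cmd))
```

With string concatenation and os.system, the space in the template path would split it into two arguments; in the list form it stays a single argument.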

How to Merge Two PDFs and Display PDF Output:

from django.http import HttpResponse, HttpResponseNotFound
from django.core.files.storage import FileSystemStorage
from fdfgen import forge_fdf
import os

template_file = "/var/www/myproject/template.pdf"
data_file = "/var/www/myproject/data.fdf"
export_file = "/var/www/myproject/pdf_output.pdf"

# The view name is illustrative; the return statements need an enclosing function.
def fill_pdf(request):
    # organization_name, address_line_1, request_date and amount are assumed
    # to have been obtained earlier in the view (their source is not shown).
    fields = {}
    fields['organization_name'] = organization_name
    fields['address_line_1'] = address_line_1
    fields['request_date'] = request_date
    fields['amount'] = amount
    field_list = [(field, fields[field]) for field in fields]

    # Write the form data as an FDF file for pdftk to read.
    fdf = forge_fdf("", field_list, [], [], [])
    with open(data_file, "wb") as fdf_file:
        fdf_file.write(fdf)

    # os.system blocks until pdftk exits, so no sleep is needed afterwards.
    os.system("pdftk " + template_file + " fill_form " + data_file + " output " + export_file)

    # FileSystemStorage resolves paths against MEDIA_ROOT; this assumes
    # export_file is located under it.
    fs = FileSystemStorage()
    if fs.exists(export_file):
        with fs.open(export_file) as pdf:
            return HttpResponse(pdf, content_type='application/pdf; charset=utf-8')
    else:
        return HttpResponseNotFound('The requested PDF was not found on our server.')

Misusage answered 24/3, 2017 at 12:39 Comment(0)
