Batch fill PDF forms from python or bash
R

4

18

I have a PDF form that needs to be filled out many times (it's a timesheet, to be exact). Since I don't want to do this by hand, I'm looking for a way to fill the forms out with a Python script, or with tools that can be called from a bash script.

Does anyone have experience with this?

Roo answered 7/5, 2012 at 3:13 Comment(1)
See #1891070 - Tenpin
B
17

For Python you'll need the fdfgen library and pdftk.

@Hugh Bothwell's comment is 100% correct so I'll extend that answer with a working implementation.

If you're on Windows, you'll also need to make sure both Python and pdftk are on the system PATH (unless you want to type out their full folder paths).
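As a rough sketch of that PATH check, the snippet below uses only the standard library to see whether an executable can be found before the batch run starts. The helper name find_tool and its extra_dirs parameter are made up for this illustration; they are not part of fdfgen or pdftk.

```python
import os
import shutil


def find_tool(name, extra_dirs=()):
    """Return the full path of executable `name`, or None if it is not found.

    `extra_dirs` (hypothetical parameter) lists additional folders to search,
    for example a PDFtk install folder that is not on PATH.
    """
    search_path = os.pathsep.join([os.environ.get("PATH", ""), *extra_dirs])
    return shutil.which(name, path=search_path)
```

Calling find_tool("pdftk") before the loop and stopping with a clear message when it returns None avoids a batch of silent failures.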

Here's the code to auto-batch-fill a collection of PDF forms from a CSV data file:

import csv
import os
import subprocess

from fdfgen import forge_fdf

# Configuration: adjust these to match your own files.
filename_prefix = "NVC"
csv_file = "NVC.csv"
pdf_file = "NVC.pdf"
tmp_file = "tmp.fdf"
output_folder = './output/'

def process_csv(file):
    """Return one list of (field_name, value) tuples per data row."""
    data = []
    with open(file, newline='') as f:
        csv_data = csv.reader(f)
        # The first row holds the PDF field names.
        headers = next(csv_data, [])
        for row in csv_data:
            data.append(list(zip(headers, row)))
    return data

def form_fill(fields):
    fdf = forge_fdf("", fields, [], [], [])
    # forge_fdf returns bytes on Python 3, so write in binary mode.
    with open(tmp_file, "wb") as fdf_file:
        fdf_file.write(fdf)
    # The value of the second column becomes part of the output file name.
    output_file = '{0}{1} {2}.pdf'.format(output_folder, filename_prefix, fields[1][1])
    # An argument list avoids shell quoting problems with spaces in names.
    subprocess.call(['pdftk', pdf_file, 'fill_form', tmp_file,
                     'output', output_file, 'dont_ask'])
    os.remove(tmp_file)

data = process_csv(csv_file)
print('Generating Forms:')
print('-----------------------')
for fields in data:
    # Skip rows whose first column is 'Yes'.
    if fields[0][1] == 'Yes':
        continue
    print('{0} {1} created...'.format(filename_prefix, fields[1][1]))
    form_fill(fields)

Note: It shouldn't be rocket-surgery to figure out how to customize this. The initial variable declarations contain the custom configuration.

In the CSV, each column of the first row must contain the name of the corresponding field in the PDF file. Any columns that don't have a corresponding field in the template will be ignored.
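To make the expected CSV shape concrete, here is a small standard-library sketch. The column names Done, Name and Hours and the two data rows are invented for illustration. It mirrors what the script does with each row: the header row supplies the field names, a row whose first column is 'Yes' is skipped, and the second column ends up in the output file name.

```python
import csv
import io

# Hypothetical CSV content; real column names must match the PDF field names.
SAMPLE_CSV = """Done,Name,Hours
No,Alice,40
Yes,Bob,38
"""


def rows_to_fields(text):
    """Turn CSV text into one list of (field_name, value) tuples per row."""
    reader = csv.reader(io.StringIO(text))
    headers = next(reader, [])
    return [list(zip(headers, row)) for row in reader]


rows = rows_to_fields(SAMPLE_CSV)
# Same filter as the script: skip rows whose first column is 'Yes'.
to_fill = [r for r in rows if r[0][1] != 'Yes']
for fields in to_fill:
    print('would create: NVC {0}.pdf'.format(fields[1][1]))
```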

In the PDF template, just create editable fields where you want your data to fill and make sure the names match up with the CSV data.
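One way to check that the names match is to compare the CSV headers with the field names pdftk reports. The sketch below assumes the usual "FieldName: ..." lines printed by pdftk's dump_data_fields operation; the helper names are made up for this example, and subprocess.run with capture_output needs Python 3.7+.

```python
import subprocess


def parse_field_names(dump_text):
    """Extract field names from `pdftk <file> dump_data_fields` output."""
    names = []
    for line in dump_text.splitlines():
        if line.startswith("FieldName:"):
            names.append(line.split(":", 1)[1].strip())
    return names


def pdf_field_names(pdf_path):
    """Run pdftk (must be on PATH) and return the form field names."""
    result = subprocess.run(["pdftk", pdf_path, "dump_data_fields"],
                            check=True, capture_output=True, text=True)
    return parse_field_names(result.stdout)


def unmatched_headers(csv_headers, field_names):
    """Return the CSV headers that have no field of the same name."""
    known = set(field_names)
    return [h for h in csv_headers if h not in known]
```

Headers returned by unmatched_headers are the columns that will be silently ignored when the form is filled.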

For this specific configuration, just put this file in the same folder as your NVC.csv, NVC.pdf, and a folder named 'output'. Run it and it automagically does the rest.

Beautician answered 10/1, 2013 at 2:37 Comment(3)
This works beautifully. The only thing I had to add was the path to PDFtk: os.environ['PATH'] += os.pathsep + 'C:\\Program Files (x86)\\PDFtk\\bin;' - Pinfeather
I needed to replace fdf_file = open(tmp_file,"w") with fdf_file = open(tmp_file,"wb") to make it work. - Cowman
The code runs, but I can't really see any data in the output PDF. Any ideas? - Gastrectomy
R
17

A much faster version that needs neither pdftk nor fdfgen, in pure Python 3.6+ (it uses the PdfFileReader/PdfFileWriter API of PyPDF2 releases before 3.0):

# -*- coding: utf-8 -*-

from collections import OrderedDict
from PyPDF2 import PdfFileWriter, PdfFileReader


def _getFields(obj, tree=None, retval=None, fileobj=None):
    """
    Extracts field data if this PDF contains interactive form fields.
    The *tree* and *retval* parameters are for recursive use.

    :param fileobj: A file object (usually a text file) to write
        a report to on all interactive form fields found.
    :return: A dictionary where each key is a field name, and each
        value is a :class:`Field<PyPDF2.generic.Field>` object. By
        default, the mapping name is used for keys.
    :rtype: dict, or ``None`` if form data could not be located.
    """
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                       '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = OrderedDict()
        catalog = obj.trailer["/Root"]
        # get the AcroForm tree
        if "/AcroForm" in catalog:
            tree = catalog["/AcroForm"]
        else:
            return None
    if tree is None:
        return retval

    obj._checkKids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            # Tree is a field
            obj._buildField(tree, retval, fileobj, fieldAttributes)
            break

    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.getObject()
            obj._buildField(field, retval, fileobj, fieldAttributes)

    return retval


def get_form_fields(infile):
    with open(infile, 'rb') as f:
        # _getFields returns None when the PDF has no AcroForm
        fields = _getFields(PdfFileReader(f)) or {}
        return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())


def update_form_values(infile, outfile, newvals=None):
    if not newvals:
        # No values given: fill every field with its own index, name and value
        newvals = {k: f'#{n} {k}={v}'
                   for n, (k, v) in enumerate(get_form_fields(infile).items())}

    # The input file must stay open until the writer has written its output
    with open(infile, 'rb') as f:
        pdf = PdfFileReader(f)
        writer = PdfFileWriter()

        for i in range(pdf.getNumPages()):
            page = pdf.getPage(i)
            try:
                writer.updatePageFormFieldValues(page, newvals)
            except Exception as e:
                # The update failed for this page; report it and keep the page
                print(repr(e))
            writer.addPage(page)

        with open(outfile, 'wb') as out:
            writer.write(out)


if __name__ == '__main__':
    from pprint import pprint

    pdf_file_name = '2PagesFormExample.pdf'

    pprint(get_form_fields(pdf_file_name))

    update_form_values(pdf_file_name, 'out-' + pdf_file_name)  # enumerate & fill the fields with their own names
    update_form_values(pdf_file_name, 'out2-' + pdf_file_name,
                       {'my_fieldname_1': 'My Value',
                        'my_fieldname_2': 'My Another 💎alue'})  # update the form fields
Royroyal answered 28/4, 2017 at 12:34 Comment(8)
It shows a syntax error here: {k: f'#{i} {k}={v}'. I'm using Python 3.5. Is that the reason? - Raber
f-strings require Python 3.6+. Workaround: {k: "#{i} {k}={v}".format(**locals())} - Vierno
Thank you so much for this. One tip for others reading: open the original, make all the hardcoded changes you want, and save it. (This allows easy editing of signatures and checkboxes.) Then programmatically edit only the fields you want edited. - Ddene
Unfortunately it seems that, after running this script and printing the result, the changes are not applied, though I do see them in Preview on my Mac. - Ddene
I copied exactly the same code and updated the source file name. It prints out all the fields but doesn't update anything in the output PDF file. Any suggestions? - Manyplies
If the filled values are hidden and only show up when you click on them in Acrobat, see the discussion at: github.com/mstamy2/PyPDF2/issues/355 - Ardent
If the filled values are hidden, you have not initialized the PDF writer object correctly. Try calling pdf_writer.cloneReaderDocumentRoot(pdf_reader) directly after creating the writer object. (This summarizes issue #355 shared by @YifeiH above.) - Piece
This works really well, but it doesn't handle checkboxes. Any ideas? - Celebration
A
0

Replace Original File

import os

# Fill the form into a new file, then swap it in place of the original
os.system('pdftk "original.pdf" fill_form "data.fdf" output "output.pdf"')
os.remove("data.fdf")
os.remove("original.pdf")
os.rename("output.pdf", "original.pdf")
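The snippet above deletes original.pdf without first checking that pdftk actually produced output.pdf. A slightly more defensive sketch, using only the standard library, could look like the following; the function name replace_original is invented for this example.

```python
import os


def replace_original(original, filled):
    """Move `filled` over `original`, but only if `filled` looks usable."""
    if not os.path.isfile(filled) or os.path.getsize(filled) == 0:
        # Leave the original untouched when the fill step produced nothing.
        raise RuntimeError("filled file is missing or empty: " + filled)
    # os.replace overwrites the destination in a single step.
    os.replace(filled, original)
```

After the pdftk call, replace_original("original.pdf", "output.pdf") would take the place of the os.remove and os.rename pair for the PDF.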
Almazan answered 4/3, 2016 at 12:42 Comment(1)
It was possibly meant to be a comment on an answer above. - Hoedown
E
0

I wrote a library called fillpdf, built on pdfrw, pdf2image, Pillow, and PyPDF2. Install it with pip install fillpdf; its poppler dependency can be installed with conda install -c conda-forge poppler.

Basic usage:

from fillpdf import fillpdfs

fillpdfs.get_form_fields("blank.pdf")

# Returns a dictionary of fields.
# Set the values in the returned dictionary and save it to a variable.
# For radio boxes: 'Off' = not filled, 'Yes' = filled.

data_dict = {
    'Text2': 'Name',
    'Text4': 'LastName',
    'box': 'Yes',
}

fillpdfs.write_fillable_pdf('blank.pdf', 'new.pdf', data_dict)

# If you want it flattened:
fillpdfs.flatten_pdf('new.pdf', 'newflat.pdf')
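For the batch case from the question, a loop over CSV rows can call the fill function once per row. The sketch below is only an illustration: batch_fill, its parameters, and the idea of naming each output after one column are assumptions, not part of fillpdf. The fill function is passed in as write_pdf and is expected to take (input_pdf, output_pdf, data_dict), as fillpdfs.write_fillable_pdf does above.

```python
import csv
import os


def batch_fill(csv_path, template_pdf, out_dir, write_pdf, name_column):
    """Fill `template_pdf` once per CSV row and return the output paths.

    The CSV header row must hold the PDF field names. `name_column`
    (hypothetical parameter) selects the column used for the file name.
    """
    os.makedirs(out_dir, exist_ok=True)
    outputs = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            out_path = os.path.join(out_dir, "{0}.pdf".format(row[name_column]))
            write_pdf(template_pdf, out_path, dict(row))
            outputs.append(out_path)
    return outputs
```

With fillpdf installed, a call might look like batch_fill('data.csv', 'blank.pdf', 'output', fillpdfs.write_fillable_pdf, 'Text2'), where data.csv is a hypothetical file whose headers match the form's field names.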

More info here: https://github.com/t-houssian/fillpdf

If some fields don't fill, you can use fitz (pip install PyMuPDF) and PyPDF2 (pip install PyPDF2) as follows, altering the points as needed:

import fitz
from PyPDF2 import PdfFileReader

file_handle = fitz.open('blank.pdf')
pdf = PdfFileReader(open('blank.pdf', 'rb'))
box = pdf.getPage(0).mediaBox
# Convert to float, since PyPDF2 may return Decimal-based numbers
w = float(box.getWidth())
h = float(box.getHeight())

# For images: draw image.png inside a rectangle on the last page
image_rectangle = fitz.Rect((w/2)-200, h-255, (w/2)-100, h-118)
last_index = pdf.getNumPages() - 1
last_page = file_handle[last_index]
last_page._wrapContents()
last_page.insertImage(image_rectangle, filename='image.png')

# For text: the point is where the text starts on the page
last_page.insertText(fitz.Point((w/2)-247, h-478), 'John Smith', fontsize=14, fontname="times-bold")
file_handle.save('newpdf.pdf')
Eldest answered 25/3, 2021 at 22:18 Comment(2)
This seems to be exactly what I want, but I noticed it does not fill drop-downs. Is that in the future plans? Thanks! - Lambertson
@RafaelSantos Good idea, I will add it to the future plans! - Eldest
