Writing metadata to a pdf using pyobjc

I'm trying to write metadata to a pdf file using the following python code:

from Foundation import *
from Quartz import *

url = NSURL.fileURLWithPath_("test.pdf")
pdfdoc = PDFDocument.alloc().initWithURL_(url)
assert pdfdoc, "failed to create document"

print("reading pdf file")

attrs = {}
attrs[PDFDocumentTitleAttribute] = "THIS IS THE TITLE"
attrs[PDFDocumentAuthorAttribute] = "A. Author and B. Author"

PDFDocumentTitleAttribute = "test"

pdfdoc.setDocumentAttributes_(attrs)
pdfdoc.writeToFile_("mynewfile.pdf")   

print("pdf made")

This appears to work fine (no errors on the console); however, when I examine the metadata of the new file, it is as follows:

PdfID0: 242b7e252f1d3fdd89b35751b3f72d3
PdfID1: 242b7e252f1d3fdd89b35751b3f72d3
NumberOfPages: 4

and the original file had the following metadata:

InfoKey: Creator
InfoValue: PScript5.dll Version 5.2.2
InfoKey: Title
InfoValue: Microsoft Word - PROGRESS  ON  THE  GABION  HOUSE Compressed.doc
InfoKey: Producer
InfoValue: GPL Ghostscript 8.15
InfoKey: Author
InfoValue: PWK
InfoKey: ModDate
InfoValue: D:20101021193627-05'00'
InfoKey: CreationDate
InfoValue: D:20101008152350Z
PdfID0: d5fd6d3960122ba72117db6c4d46cefa
PdfID1: 24bade63285c641b11a8248ada9f19
NumberOfPages: 4

So there are two problems: the new metadata is not being written, and the existing metadata is being cleared. What do I need to do to get this to work? My objective is to add metadata that reference management systems can import.

Alten answered 4/11, 2010 at 19:21 Comment(0)

Mark is on the right track, but there are a few peculiarities that should be accounted for.

First, he is correct that pdfdoc.documentAttributes is an NSDictionary that contains the document metadata. You would like to modify that, but note that documentAttributes gives you an NSDictionary, which is immutable. You have to convert it to an NSMutableDictionary as follows:

attrs = NSMutableDictionary.alloc().initWithDictionary_(pdfdoc.documentAttributes())

Now you can modify attrs as you did. There is no need to write PDFDocument.PDFDocumentTitleAttribute as Mark suggested; that won't work, because PDFDocumentTitleAttribute is declared as a module-level constant. Just use it as you did in your own code.

Here is the full code that works for me:

from Foundation import *
from Quartz import *

# Open the existing PDF
url = NSURL.fileURLWithPath_("test.pdf")
pdfdoc = PDFDocument.alloc().initWithURL_(url)

# Start from a mutable copy of the existing attributes, then change what you need
attrs = NSMutableDictionary.alloc().initWithDictionary_(pdfdoc.documentAttributes())
attrs[PDFDocumentTitleAttribute] = "THIS IS THE TITLE"
attrs[PDFDocumentAuthorAttribute] = "A. Author and B. Author"

# Write the merged attributes back and save under a new name
pdfdoc.setDocumentAttributes_(attrs)
pdfdoc.writeToFile_("mynewfile.pdf")
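
The same steps can be wrapped in a small function. This is only a sketch under a few assumptions: overlay and update_pdf_metadata are hypothetical names (not part of PyObjC or PDFKit), the Quartz import is deferred so that the merging helper can be tried without PyObjC installed, and I have assumed documentAttributes() may return None for a file with no metadata.

```python
def overlay(target, updates):
    """Copy every key/value from `updates` into the mutable mapping `target`.

    Works on a plain dict and, by item assignment, on an NSMutableDictionary.
    """
    for key, value in updates.items():
        target[key] = value
    return target


def update_pdf_metadata(src_path, dst_path, updates):
    """Hypothetical helper: save src_path as dst_path with `updates` merged in.

    Needs PyObjC on macOS, which is why the imports live inside the function.
    """
    from Foundation import NSURL, NSMutableDictionary
    from Quartz import PDFDocument

    url = NSURL.fileURLWithPath_(src_path)
    pdfdoc = PDFDocument.alloc().initWithURL_(url)
    if pdfdoc is None:
        raise IOError("failed to open %s" % src_path)

    # Assumption: a document without metadata may hand back None here.
    existing = pdfdoc.documentAttributes()
    if existing is None:
        attrs = NSMutableDictionary.alloc().init()
    else:
        attrs = NSMutableDictionary.alloc().initWithDictionary_(existing)

    # Keep the old attributes and change only the requested ones.
    pdfdoc.setDocumentAttributes_(overlay(attrs, updates))
    if not pdfdoc.writeToFile_(dst_path):
        raise IOError("failed to write %s" % dst_path)
```

You would call it with the Quartz constants as keys, for example update_pdf_metadata("test.pdf", "mynewfile.pdf", {PDFDocumentTitleAttribute: "THIS IS THE TITLE"}).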
Tanatanach answered 12/11, 2010 at 23:12 Comment(5)
Thanks for the answer, Tamás. Just one question, where can I find what the other attributes are? I'm trying to hard-code metadata from a .bib file with the PDFs and I don't know if there are any limits on what I can store. - Alten
Just look them up in the documentation of PDFKit on Apple's homepage; the list of constants you are looking for is here (sorry for the long link): developer.apple.com/library/mac/#documentation/GraphicsImaging/… . Look for the Constants section and "Document Attribute Keys" within it. - Middleoftheroader
I've been a bit slow trying out this code - but I can't get it to run beyond the line <url = NSURL.fileURLWithPath_("test.pdf")>. Is there anything version specific about <PDFDocument.alloc().initWithUrl_(url)>? - Alten
Exact error message: "pdfdoc = PDFDocument.alloc().initWithUrl_(url) AttributeError: 'PDFDocument' object has no attribute 'initWithUrl_'" - Alten
Ermmm... I made a booboo when copying the source code from my Python terminal. It should be initWithURL_ and not initWithUrl_. Note the capitalization. - Middleoftheroader
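
As a follow-up to the .bib question in the comments above, here is a rough sketch of how a BibTeX entry might be mapped onto PDFKit's document attribute keys. The mapping itself is my own assumption (in particular abstract to subject), bib_to_attribute_names is a hypothetical helper, and the constants are referred to by name so the snippet runs without PyObjC. As far as I know, the keywords attribute expects a list of strings rather than one string, so the sketch splits on commas.

```python
# Hypothetical mapping from BibTeX field names to the *names* of PDFKit
# document attribute constants. Fields without an entry here are skipped.
BIBTEX_TO_PDFKIT = {
    "title": "PDFDocumentTitleAttribute",
    "author": "PDFDocumentAuthorAttribute",
    "abstract": "PDFDocumentSubjectAttribute",   # assumption: abstract -> subject
    "keywords": "PDFDocumentKeywordsAttribute",
}


def bib_to_attribute_names(entry):
    """Translate a dict of BibTeX fields into {constant name: value}."""
    result = {}
    for field, value in entry.items():
        field = field.lower()
        name = BIBTEX_TO_PDFKIT.get(field)
        if name is None:
            continue  # no matching document attribute key in this sketch
        if field == "keywords":
            # Split a comma-separated keyword string into a list of strings.
            value = [k.strip() for k in value.split(",") if k.strip()]
        result[name] = value
    return result
```

On a Mac the names can then be resolved to the real constants with something like attrs[getattr(Quartz, name)] = value after import Quartz.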

DISCLAIMER: I'm utterly new to Python, but an old hand at PDF.

To avoid smashing all the existing attributes, you need to start attrs with pdfDoc.documentAttributes, not {}. setDocumentAttributes is almost certainly an overwrite rather than a merge (given your output here).

Second, all the PDFDocument*Attribute constants are part of PDFDocument. My Python ignorance is undoubtedly showing, but shouldn't you be referencing them as attributes rather than as bare variables? Like this:

attrs[PDFDocument.PDFDocumentTitleAttribute] = "THIS IS THE TITLE"

That you can assign to PDFDocumentTitleAttribute leads me to believe it's not a constant.

If I'm right, your attrs will have tried to assign numerous values to a null key. My Python is weak, so I don't know how you'd check that. Examining attrs prior to calling pdfDoc.setDocumentAttributes_() should be revealing.
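
One way to do that check, sketched in plain Python. The string "Title" is only a stand-in I made up for whatever value the real constant from Quartz holds, so this shows the Python side of things and nothing about PDFKit itself:

```python
# Stand-in for the constant imported from Quartz (assumed value, for illustration).
PDFDocumentTitleAttribute = "Title"

attrs = {}
attrs[PDFDocumentTitleAttribute] = "THIS IS THE TITLE"

# Rebinding the name afterwards only changes the Python variable;
# the key already stored in attrs is left untouched.
PDFDocumentTitleAttribute = "test"

# Dump the dictionary to see which keys were actually used.
for key, value in attrs.items():
    print(repr(key), "->", repr(value))
```

If a key had come out as None, it would show up in this listing.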

Hampshire answered 9/11, 2010 at 22:27 Comment(2)
Thanks for your suggestion Mark. I'm just trying to understand the first part of your comment - should it be pdfDoc.documentAttributes = {} or pdfDoc.documentAttributes.attrs = {}? - Alten
attrs = pdfdoc.documentAttributes - Hampshire
