Remove PDF metadata (removing complete PDF metadata)

I want to remove metadata from PDF files. I have already tried "exiftool", "pdftk" and "qpdf" (the method proposed in https://gist.github.com/hubgit/6078384). These tools claim to remove the metadata, but it is still there: running "grep -a metadata_fieldname file.pdf" on the result still retrieves the metadata value.

Is there a way to completely delete the metadata from PDF files, i.e. to delete all the objects that contain metadata?

I am using Ubuntu. When I create a PDF file with a LaTeX tool (e.g. pdfTeX) or LibreOffice, the tool automatically writes Producer, Creator and sometimes the full banner into the metadata of the PDF file. This is the information I want to remove (basically the metadata stored by the tool that created the PDF).

Erenow answered 18/3, 2020 at 11:49 Comment(5)
What OS are you on, and what tools are you looking for? There are undoubtedly GUI apps that can edit (and thus remove) the metadata, and it may be possible to use Python libraries. Have you tried Coherent PDF? community.coherentpdf.com - Prot
Please define exactly the type of metadata you want to remove. Do you only mean the metadata as specified in the PDF specification (i.e. in metadata streams associated with the document or a component of the document, and in the document information dictionary associated with the document)? Or do you also mean custom metadata added by programs in their own proprietary ways? PDF is a very flexible format and allows custom additions, so such custom metadata can take many forms not recognized by tools trying to remove metadata. Probably you should share your example PDF and the metadata key... - Homozygote
In the case of exiftool, its docs on PDFs specifically state that the changes it makes are reversible unless the file has been relinearized. Additionally, since you are searching the raw file, some of the data you are finding may be embedded in one of the objects inside the PDF, such as a font or JPEG image. You should check whether that might be the case. - Kelcie
Probably you should share your example PDF and the metadata key.. - Homozygote
@Homozygote you were right, the metadata I was seeing after using qpdf and pdftk was indeed associated with embedded objects and images. Thanks for the clarification :-) - Erenow

To blank all entries of the PDF information dictionary using pdftk in an Ubuntu terminal, you can use the following command:

pdftk file.pdf  dump_data |sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | pdftk file.pdf update_info - output file_no_meta.pdf

Here file.pdf is the source file and file_no_meta.pdf is the output file.
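
For readers who want to see what the sed expression does to the dump_data text, here is a minimal sketch in Python (standard library only). The function name and the sample text are made up for illustration; the sample only assumes the InfoKey:/InfoValue: line layout that the sed pattern itself targets.

```python
import re

def blank_info_values(dump_text: str) -> str:
    """Blank the value of every InfoValue line, like the sed expression above."""
    # [ \t] rather than \s, so the match can never run past the end of a line.
    return re.sub(r"(InfoValue:)[ \t].*", r"\1 ", dump_text)

# Hypothetical excerpt of dump_data output.
sample = "InfoKey: Producer\nInfoValue: TCPDF 6.0.040\nNumberOfPages: 4\n"
print(blank_info_values(sample))
```

Only the InfoValue lines change; every other line of the dump passes through untouched.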

Next, use the following command to remove XMP metadata:

exiftool -all:all= -overwrite_original file_no_meta.pdf

Finally, use the following command on your terminal to check for the file metadata again:

pdfinfo file_no_meta.pdf
Naiad answered 3/10, 2020 at 13:6 Comment(4)
This does not remove XMP metadata. I have just used pdftk to change the Title InfoKey, and discovered that the process stream I have to use reads the XMP metadata if present, where the title was left unchanged. - Titmouse
Yes, I understand what you mean. In that case, you can use exiftool to remove the XMP metadata. I just added the command that removes the XMP metadata from a file. Please check it and see. - Naiad
@Titmouse I hope my suggestion worked for you? Afterwards, you can choose the best answer that matches your question. Thanks! - Naiad
That was not my question :-) -- in my own case I had to rewrite the XMP metadata, so I just extracted the XML object and reinjected it into the PDF stream using xmlparse + iText. - Titmouse

You can use pdftk to strip all Info and XMP metadata from a document by copying its pages into a new PDF, like this:

pdftk A=mydoc.pdf cat A output mydoc.no_metadata.pdf
Aftertaste answered 18/7, 2021 at 12:7 Comment(1)
This solved a problem I faced while uploading a LaTeX-generated PDF to arXiv. - Lotetgaronne

I ran some tests:

1. pdftk

pdftk 1.pdf cat output 1-pdftk.pdf

This approach seems to remove the metadata.

2. exiftool + qpdf

I found this solution on a lot of websites.

exiftool -all:all= foo.pdf
qpdf --linearize foo.pdf bar.pdf

This approach also seems to produce a clean file.

Now let's try them out.

Tests

bin> pdfinfo 1.pdf
Title:           SCHEDA DI VALUTAZIONE A4 PRIMARIA CON OBIETTIVI
Subject:         SCHEDA DI VALUTAZIONE A4 PRIMARIA CON OBIETTIVI
Creator:         Registro Elettronico Nuvola
Producer:        TCPDF 6.0.040 (http://www.tcpdf.org)
CreationDate:    Fri Feb 23 16:03:22 2024 ora solare Europa occidentale
ModDate:         Fri Feb 23 16:03:22 2024 ora solare Europa occidentale
Custom Metadata: no
Metadata Stream: yes
[..]
File size:       152186 bytes
Optimized:       no
PDF version:     1.7


bin> exiftool -all:all 1.pdf
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LC_ALL = "C.UTF-8",
        LANG = (unset)
    are supported and installed on your system.
perl: warning: Falling back to the system default locale ("Italian_Italy.1252").
ExifTool Version Number         : 12.44
File Name                       : 1.pdf
Directory                       : .
File Size                       : 152 kB
[...]
XMP Toolkit                     : Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04
Format                          : application/pdf
Title                           : SCHEDA DI VALUTAZIONE A4 PRIMARIA CON OBIETTIVI
[...]
Description                     : SCHEDA DI VALUTAZIONE A4 PRIMARIA CON OBIETTIVI
Subject                         :  TCPDF
Create Date                     : 2024:02:23 16:03:22+01:00
Creator Tool                    : Registro Elettronico Nuvola
Modify Date                     : 2024:02:23 16:03:22+01:00
Metadata Date                   : 2024:02:23 16:03:22+01:00
Keywords                        :  TCPDF
Producer                        : TCPDF 6.0.040 (http://www.tcpdf.org)
[...]




bin> pdftk 1.pdf cat output 1-pdftk.pdf

bin> exiftool -all:all= 1.pdf
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LC_ALL = "C.UTF-8",
        LANG = (unset)
    are supported and installed on your system.
perl: warning: Falling back to the system default locale ("Italian_Italy.1252").
Warning: [minor] ExifTool PDF edits are reversible. Deleted tags may be recovered! - 1.pdf
    1 image files updated


bin> qpdf --linearize 1-exiftool.pdf 1-exiftool-qpdf.pdf

Then let's see if there is still metadata:

bin> pdfinfo 1-pdftk.pdf
Creator:         pdftk 2.02 - www.pdftk.com
Producer:        itext-paulo-155 (itextpdf.sf.net-lowagie.com)
CreationDate:    Mon Jun 17 11:24:07 2024 ora legale Europa occidentale
ModDate:         Mon Jun 17 11:24:07 2024 ora legale Europa occidentale
Custom Metadata: no
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           4
Encrypted:       no
Page size:       595.276 x 841.89 pts (A4)
Page rot:        0
File size:       147006 bytes
Optimized:       no
PDF version:     1.7

bin> pdfinfo 1-exiftool.pdf
Custom Metadata: no
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           4
Encrypted:       no
Page size:       595.276 x 841.89 pts (A4)
Page rot:        0
File size:       152658 bytes
Optimized:       no
PDF version:     1.7

bin>exiftool.exe -all:all 1-pdftk.pdf
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LC_ALL = "C.UTF-8",
        LANG = (unset)
    are supported and installed on your system.
perl: warning: Falling back to the system default locale ("Italian_Italy.1252").
ExifTool Version Number         : 12.44
File Name                       : 1-pdftk.pdf
Directory                       : .
File Size                       : 147 kB
File Modification Date/Time     : 2024:06:17 11:24:07+02:00
File Access Date/Time           : 2024:06:17 11:30:16+02:00
File Creation Date/Time         : 2024:06:17 11:24:07+02:00
File Permissions                : -rw-rw-rw-
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.7
Linearized                      : No
Creator                         : pdftk 2.02 - www.pdftk.com
Producer                        : itext-paulo-155 (itextpdf.sf.net-lowagie.com)
Modify Date                     : 2024:06:17 09:24:07Z
Create Date                     : 2024:06:17 09:24:07Z
Page Count                      : 4

bin>exiftool.exe -all:all 1-exiftool.pdf
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LC_ALL = "C.UTF-8",
        LANG = (unset)
    are supported and installed on your system.
perl: warning: Falling back to the system default locale ("Italian_Italy.1252").
ExifTool Version Number         : 12.44
File Name                       : 1-exiftool.pdf
Directory                       : .
File Size                       : 153 kB
File Modification Date/Time     : 2024:06:17 11:26:13+02:00
File Access Date/Time           : 2024:06:17 11:30:22+02:00
File Creation Date/Time         : 2024:06:17 11:05:26+02:00
File Permissions                : -rw-rw-rw-
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.7
Linearized                      : No
PDF Version                     : 1.7
Page Count                      : 4
Page Layout                     : SinglePage
Page Mode                       : UseNone

They look clean, but...


$ strings 1.pdf|grep -i valutazione
                                        <rdf:li xml:lang="x-default">SCHEDA DI VALUTAZIONE A4 PRIMARIA CON OBIETTIVI</rdf:li>
                                        <rdf:li xml:lang="x-default">SCHEDA DI VALUTAZIONE A4 PRIMARIA CON OBIETTIVI</rdf:li>

$ strings 1-pdftk.pdf|grep -i valutazione

$ strings 1-exiftool.pdf|grep -i valutazione
                                        <rdf:li xml:lang="x-default">SCHEDA DI VALUTAZIONE A4 PRIMARIA CON OBIETTIVI</rdf:li>
                                        <rdf:li xml:lang="x-default">SCHEDA DI VALUTAZIONE A4 PRIMARIA CON OBIETTIVI</rdf:li>

$ strings 1-exiftool-qpdf.pdf|grep -i valutazione

$ du -hs 1*.pdf
152K    1.pdf
152K    1-exiftool.pdf
148K    1-exiftool-qpdf.pdf
144K    1-pdftk.pdf

Note on pdftk

Strangely, the pdftk command suggested by imranayari here does not work as well as the simple pdftk 1.pdf cat output 1-pdftk.pdf:

$ pdftk 1.pdf cat output 1-pdftk.pdf

$ pdftk 1.pdf dump_data|sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | pdftk 1.pdf update_info - output 1-pdftk-rewrite.pdf

Now, pdftk FILENAME dump_data shows both files as clean, but if we investigate with strings we see:

$ strings 1-pdftk.pdf |grep -i valutazione

$ strings 1-pdftk-rewrite.pdf |grep -i valutazione
                                        <rdf:li xml:lang="x-default">SCHEDA DI VALUTAZIONE A4 PRIMARIA CON OBIETTIVI</rdf:li>
                                        <rdf:li xml:lang="x-default">SCHEDA DI VALUTAZIONE A4 PRIMARIA CON OBIETTIVI</rdf:li>
Amie answered 17/6, 2024 at 10:9 Comment(0)

There are more than just two locations for metadata within a PDF. Thus all the answers that attempt to remove ALL metadata will usually retain some.

The best dedicated tool is probably Coherent cpdf, which can use Ghostscript to fix and regenerate the file first and thus remove many of the embedded metadata objects. It cannot remove everything, or else images with embedded metadata would be destroyed.

The simplest invocation is cpdf -remove-metadata in.pdf -o out.pdf

I put the word "private" into many locations within a PDF to make testing simple; clearly, just parsing the file as if it were text will find some of them. The words "Google Inc" are also embedded as 16-bit encoded metadata.

>type PrivateMetaData.pdf |find /c /i "private"
12
>type PrivateMetaData.pdf |find /c /i "G o o g l e"
0
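
A byte-level search can look for both encodings at once. The following is a minimal sketch in Python (standard library only); the function name and the sample bytes are hypothetical, the search term is assumed to be plain ASCII, and the 16-bit text is assumed to be big-endian UTF-16.

```python
def count_hits(data: bytes, term: str) -> dict:
    """Count case-insensitive hits of term as 8-bit text and as UTF-16BE text."""
    lowered = data.lower()  # bytes.lower() only changes ASCII letters
    needle = term.lower()
    return {
        "8-bit": lowered.count(needle.encode("ascii")),
        "16-bit": lowered.count(needle.encode("utf-16-be")),
    }

# Hypothetical bytes: two 8-bit "private" entries plus "Google Inc" stored as UTF-16BE.
sample = b"/Author (Private) /Subject (private) " + "Google Inc".encode("utf-16-be")
print(count_hits(sample, "private"))
print(count_hits(sample, "google"))
```

On a real file the bytes would come from something like open("PrivateMetaData.pdf", "rb").read(). Text inside compressed streams is still invisible to this kind of search.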

Plain-text parsing does not see 16-bit encoded text. However, when exiftool is used to test the file, it reports 17 instances of the word "private". Let's try to reduce the 12 found above.

exiftool -all:all= -overwrite_original privatemetadata.pdf  
Warning: [minor] ExifTool PDF edits are reversible. Deleted tags may be recovered! - privatemetadata.pdf
    1 image files updated

and test again

>type PrivateMetaData.pdf |find /c /i "G o o g l e"
0

>type PrivateMetaData.pdf |find /c /i "private"
15

So I checked the file content: "Google" was not removed, and there are now more instances of "private" than before exiftool was used. Hence the warning: exiftool COMPOUNDS the PDF XMP data and never removes it!
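
The growth is consistent with exiftool appending its changes to the end of the file as an incremental update instead of rewriting the file, which is also why its warning says the edits are reversible. One rough way to spot this is to count the %%EOF markers in the raw bytes, since each appended update ends with its own marker. A minimal sketch in Python follows; the function name and the sample bytes are hypothetical, and linearized files can also contain more than one marker, so treat the count as a hint and not as proof.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; more than one hints at appended (incremental) updates."""
    return pdf_bytes.count(b"%%EOF")

# Hypothetical byte strings standing in for a file before and after an appended update.
original = b"%PDF-1.7\n...objects...\ntrailer\n<< >>\nstartxref\n0\n%%EOF\n"
updated = original + b"...new objects...\ntrailer\n<< >>\nstartxref\n0\n%%EOF\n"
print(count_eof_markers(original), count_eof_markers(updated))
```

For a real file, pass the result of open("privatemetadata.pdf", "rb").read() before and after running exiftool.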

So I tried my suggestion above to remove as much metadata as possible:

>cpdf -remove-metadata privatemetadata.pdf -o metaout.pdf
For non-commercial use only
To purchase a license visit http://www.coherentpdf.com/
>type metaout.pdf |find /c "private"
2

Well, that is far better, but some still remains, in places I know are harder to clean or in standard entries. Exiftool will not normally remove these either.

cpdf metaout.pdf -info|find "private"
For non-commercial use only
To purchase a license visit http://www.coherentpdf.com/

Author: private
Subject: private
Keywords: private contains google image

Those need to be altered individually. The last simple count was two entries, but the file had not been decoded, so let's check again.

type decoded.pdf |find /c "private"
10

So some were hiding in encoded data such as scripts, bookmarks and many other key PDF objects.
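
To illustrate why the count changes after decoding, here is a minimal sketch in Python (standard library only) that inflates Flate-compressed streams found in the raw bytes and searches inside them. The function name and the one-object fragment are hypothetical, and the regular expression is a crude stand-in for a real PDF parser: it ignores other filters, encryption and object streams.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each stream object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def count_in_flate_streams(pdf_bytes: bytes, term: bytes) -> int:
    """Count case-insensitive hits of term inside streams that inflate with zlib."""
    needle = term.lower()
    hits = 0
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            decoded = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image), so skip it
        hits += decoded.lower().count(needle)
    return hits

# Hypothetical one-object fragment, just enough to exercise the function.
hidden = zlib.compress(b"/Title (private notes)")
fragment = b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n" + hidden + b"\nendstream\nendobj\n"
print(count_in_flate_streams(fragment, b"private"))
```

Streams stored without compression are skipped here, because a plain search of the raw file already covers them.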

What is the best solution?

Answer:

  • 1 Decompress the file with qpdf into a pure text format, then
  • 2 use a plain text editor to redact all the observed metadata entries.
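
The answer does not give the qpdf invocation for step 1. One plausible choice, stated here as an assumption, is qpdf's QDF mode with object streams disabled, which writes streams uncompressed so that they can be read in a text editor. A minimal Python wrapper might look like this (the function names are hypothetical, and qpdf must be installed and on PATH):

```python
import shutil
import subprocess

def qdf_command(src: str, dst: str) -> list:
    """Build a qpdf command line that writes an uncompressed, editor-friendly copy."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]

def decode_pdf(src: str, dst: str) -> None:
    """Run qpdf if it is installed; raise if it is missing or exits with an error."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf not found on PATH")
    subprocess.run(qdf_command(src, dst), check=True)

print(" ".join(qdf_command("metaout.pdf", "decoded.pdf")))
```

The file names in the last line are taken from the commands above purely as an example; decode_pdf("metaout.pdf", "decoded.pdf") would run the same command.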

We can still see the Google metadata, which cannot be removed without destroying the embedded JPEG image. We can also see copyright metadata for fonts, which likewise cannot be removed without hex-editor redaction.

Atheroma answered 17/6, 2024 at 15:51 Comment(1)
I have changed it back. I am happy to be corrected on any of my edits. I've not come across that spelling variation before. (Will flag these comments as NLN). - Terret

For

pdftk A=mydoc.pdf cat A output mydoc.no_metadata.pdf

to work, you need an older version of pdftk.

pdftk-java messes things up.

Redletter answered 29/7, 2021 at 0:36 Comment(0)
