How to unlock a "secured" (read-protected) PDF in Python?
I

10

32

In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying:

File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

When I open this PDF with Acrobat Pro, it turns out to be secured (or "read protected"). From this link, however, I read that there is a multitude of services that can easily disable this read protection (for example pdfunlock.com). Diving into the pdfminer source, I see that the error above is raised on these lines:

if check_extractable and not doc.is_extractable:
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)

Since there's a multitude of services that can disable this read protection within a second, I presume it is really easy to do. It seems that .is_extractable is a simple attribute of the doc, but I don't think it is as simple as changing .is_extractable to True.

Does anybody know how I can disable the read protection on a pdf using Python? All tips are welcome!

================================================

Below you will find the code with which I currently extract the text from PDFs that are not read-protected.

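# Python 2 imports (the traceback above shows cStringIO in use)
from cStringIO import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def getTextFromPDF(rawFile):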
    resourceManager = PDFResourceManager(caching=True)
    outfp = StringIO()
    device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
    interpreter = PDFPageInterpreter(resourceManager, device)

    fileData = StringIO()
    fileData.write(rawFile)
    for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
        interpreter.process_page(page)
    fileData.close()
    device.close()

    result = outfp.getvalue()

    outfp.close()
    return result
Iceni answered 28/1, 2015 at 13:2 Comment(3)
Have you tried changing .is_extractable to True? There's actually a reasonable chance that it would work.Dewittdewlap
Did you try passing a password? for page in PDFPage.get_pages(fileData, set(), maxpages=0, password=password, caching=True, check_extractable=True):Greggs
See my post below. This behavior has changed in pdfminer.six, which shows a warning instead of raising an error.Salmonberry
A
60

Take a look at pikepdf, which is based on qpdf. Opening a PDF with it and saving it again writes a copy without the encryption, which makes the text extractable.

Code for Reference:

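import pikepdf

def remove_password_from_pdf(filename, password=""):
    # The with-block closes the source file once the copy is written.
    with pikepdf.open(filename, password=password) as pdf:
        # Saving without an encryption argument writes an unencrypted copy.
        pdf.save("pdf_file_with_no_password.pdf")

if __name__ == "__main__":
    remove_password_from_pdf(filename="/path/to/file")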
Antalya answered 14/11, 2018 at 14:19 Comment(3)
If the pdf is password protected the password can be set with pikepdf.open('unextractable.pdf', password='thepassword')Eiland
How do you close the original pdf, since I can't delete it until I exit python repl?Metronymic
It doesn't work for me; I got a PasswordError.Rankin
H
29

As far as I know, in most cases the full content of the PDF is actually encrypted, using the password as the encryption key, and so simply setting .is_extractable to True isn't going to help you.

Per this thread:

Does a library exist to remove passwords from PDFs programmatically?

I would recommend removing the read-protection with a command-line tool such as qpdf (easily installable, e.g. on Ubuntu use apt-get install qpdf if you don't have it already):

qpdf --password=PASSWORD --decrypt SECURED.pdf UNSECURED.pdf

Then open the unlocked file with pdfminer and do your stuff.
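If you want to drive that step from Python rather than the shell, a minimal sketch could look like the following. The function names are made up for illustration, and it assumes qpdf is on your PATH. It also assumes qpdf's documented exit statuses (0 for success, 3 for "succeeded with warnings"), so check that against your installed version.

```python
import subprocess


def build_qpdf_decrypt_command(src, dst, password=""):
    """Return the argument list for: qpdf --password=... --decrypt SRC DST."""
    return ["qpdf", f"--password={password}", "--decrypt", src, dst]


def decrypt_with_qpdf(src, dst, password=""):
    """Run qpdf and raise if it reports an error.

    Assumption: exit status 3 means qpdf wrote the output but printed
    warnings, so it is treated as success here.
    """
    command = build_qpdf_decrypt_command(src, dst, password)
    result = subprocess.run(command)
    if result.returncode not in (0, 3):
        raise subprocess.CalledProcessError(result.returncode, command)
```

Calling decrypt_with_qpdf("SECURED.pdf", "UNSECURED.pdf", "PASSWORD") would then mirror the command line above. Passing the arguments as a list avoids shell quoting problems with unusual passwords or file names.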

For a pure-Python solution, you can try using PyPDF2 and its .decrypt() method, but it doesn't work with all types of encryption, so really, you're better off just using qpdf - see:

https://github.com/mstamy2/PyPDF2/issues/53

Hooray answered 17/9, 2015 at 0:7 Comment(2)
PyPDF2 now supports a lot more. I think it should actually now support decrypting any pdfConclusion
PyPDF2 is deprecated. The project moved to pypdfConclusion
C
8

I used the code below with pikepdf and was able to overwrite the original file.

import pikepdf

pdf = pikepdf.open('filepath', allow_overwriting_input=True)
pdf.save('filepath')
Colonist answered 15/8, 2020 at 3:37 Comment(0)
F
3

In my case there was no password, but simply setting check_extractable=False circumvented the PDFTextExtractionNotAllowed exception for a problematic file (that opened fine in other viewers).
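For reference, here is a rough sketch of what that can look like. It is written for Python 3 with pdfminer.six rather than the Python 2 pdfminer from the question, and the function name is only illustrative. The pdfminer imports sit inside the function so the snippet loads even where the library is not installed.

```python
from io import BytesIO, StringIO


def extract_text_ignoring_flag(raw_pdf):
    """Extract text from PDF bytes without enforcing the 'no extraction' flag."""
    # Third-party imports (pdfminer.six), done lazily on purpose.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    manager = PDFResourceManager()
    out = StringIO()
    device = TextConverter(manager, out, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, device)

    # check_extractable=False skips the check that raises
    # PDFTextExtractionNotAllowed.
    for page in PDFPage.get_pages(BytesIO(raw_pdf), check_extractable=False):
        interpreter.process_page(page)

    device.close()
    return out.getvalue()
```

This only helps when the file can actually be read without a password. If the content is encrypted with a password you have not supplied, skipping the check will not make the text readable.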

Fortyish answered 19/7, 2017 at 6:7 Comment(2)
Best answer when the error is thrown while the file is neither encrypted nor password protected.Thorma
Can you please share an example of this?Lorola
A
2

The 'check_extractable=True' argument is by design. Some PDFs explicitly disallow text extraction, and PDFMiner honors that directive. You can override it (by passing check_extractable=False), but do so at your own risk.

Affiliate answered 15/5, 2019 at 9:32 Comment(0)
S
2

Full disclosure: I am one of the maintainers of pdfminer.six, a community-maintained version of pdfminer for Python 3.

This issue was fixed in 2020 by disabling check_extractable by default. pdfminer.six now shows a warning instead of raising an error.

Similar question and answer here.

Salmonberry answered 12/9, 2021 at 12:4 Comment(0)
C
2

pikepdf didn't work for me. I found a solution using PyPDF2 that decrypts all PDF files in the current working directory.

import os
from PyPDF2 import PdfReader, PdfWriter

def remove_encryption_from_pdf(input_path, output_path):
    with open(input_path, "rb") as file:
        reader = PdfReader(file)
        if reader.is_encrypted:
            writer = PdfWriter()
            for page in reader.pages:
                writer.add_page(page)
            with open(output_path, "wb") as output_pdf:
                writer.write(output_pdf)

if __name__ == "__main__":
    directory_path = os.getcwd()  # get current directory path
    
    for filename in os.listdir(directory_path):
        if filename.endswith('.pdf'):
            input_path = os.path.join(directory_path, filename)
            output_path = os.path.join(directory_path, "decrypted_" + filename)
            print(f"Processing {filename}")  # print the file name
            try:
                remove_encryption_from_pdf(input_path, output_path)
                print(f"Encryption removed from {filename}")
            except Exception as e:
                print(f"Failed to remove encryption from {filename}. Error: {e}")
Countdown answered 3/7, 2023 at 12:22 Comment(2)
This is the error message I got when I tried to do this : Processing MyPDF.pdf Failed to remove encryption from MyPDF.pdf. Error: PyCryptodome is required for AES algorithm I assume this module PyCryptodome is pip installable, please edit answer to add this dependency.Abixah
I recently tried it again, all I needed to do was 'pip3 install PyPDF2' in mac terminalCountdown
P
0

If you want to unlock all pdf files in a folder without renaming them, you may use this code:

import glob, os, pikepdf

p = os.getcwd()
for file in glob.glob('*.pdf'):
   file_path = os.path.join(p, file).replace('\\','/')
   init_pdf = pikepdf.open(file_path)
   new_pdf = pikepdf.new()
   new_pdf.pages.extend(init_pdf.pages)
   new_pdf.save(str(file))

By default, pikepdf does not let you overwrite the file you opened by saving it under the same name. Instead, copy the pages into a newly created empty PDF and save that.

Pompadour answered 19/4, 2020 at 17:18 Comment(1)
this worked for me like a charm. I only had to add password param to init_pdf = pikepdf.open(file_path, password='mypass')Unsling
W
0

If you've forgotten the password to your PDF, below is a generic script which tries a LOT of password combinations on the same PDF. It uses pikepdf, but you can update the function check_password to use something else.

Usage example:

I used this when I had forgotten a password on a bank PDF. I knew that my bank always encrypts this kind of PDF with the same password structure:

  1. Total length = 8
  2. First 4 characters = each an uppercase letter.
  3. Last 4 characters = each a digit.

I call the script as follows:

check_passwords(
    pdf_file_path='/Users/my_name/Downloads/XXXXXXXX.pdf',
    combination=[
        ALPHABET_UPPERCASE,
        ALPHABET_UPPERCASE,
        ALPHABET_UPPERCASE,
        ALPHABET_UPPERCASE,
        NUMBER,
        NUMBER,
        NUMBER,
        NUMBER,
    ]
)
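Before starting a run like this, it helps to know how many candidates the pattern produces. The sketch below is not part of the script that follows, and its helper names are made up for illustration. It counts the combinations for the 4-letters-plus-4-digits pattern and shows the order in which itertools.product yields them.

```python
import math
from itertools import islice, product

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS = "0123456789"


def search_space_size(combination):
    """Number of passwords: the product of the alphabet sizes per position."""
    return math.prod(len(chars) for chars in combination)


def first_candidates(combination, n):
    """The first n passwords in the order itertools.product generates them."""
    return ["".join(chars) for chars in islice(product(*combination), n)]


pattern = [UPPER] * 4 + [DIGITS] * 4
print(search_space_size(pattern))    # 4569760000 (26**4 * 10**4)
print(first_candidates(pattern, 3))  # ['AAAA0000', 'AAAA0001', 'AAAA0002']
```

About 4.6 billion candidates is a lot for a pure-Python loop, so any extra constraint you remember about the password is worth encoding in the pattern.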

Password-checking script:

(Requires Python3.8, with libraries numpy and pikepdf)

from typing import *
from itertools import product
import time, pikepdf, math, numpy as np
from pikepdf import PasswordError

ALPHABET_UPPERCASE: Sequence[str] = tuple('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
ALPHABET_LOWERCASE: Sequence[str] = tuple('abcdefghijklmnopqrstuvwxyz')
NUMBER: Sequence[str] = tuple('0123456789')

def as_list(l):
    if isinstance(l, (list, tuple, set, np.ndarray)):
        l = list(l)
    else:
        l = [l]
    return l

def human_readable_numbers(n, decimals: int = 0):
    n = round(n)
    if n < 1000:
        return str(n)
    names = ['', 'thousand', 'million', 'billion', 'trillion', 'quadrillion']
    n = float(n)
    idx = max(0,min(len(names)-1,
                        int(math.floor(0 if n == 0 else math.log10(abs(n))/3))))

    return f'{n/10**(3*idx):.{decimals}f} {names[idx]}'

def check_password(pdf_file_path: str, password: str) -> bool:
    ## You can modify this function to use something other than pikepdf.
    ## This function should return True on success, and False on password failure.
    try:
        # The with-block closes the file again after a successful open.
        with pikepdf.open(pdf_file_path, password=password):
            return True
    except PasswordError:
        return False


def check_passwords(pdf_file_path, combination, log_freq: int = int(1e4)):
    combination = [tuple(as_list(c)) for c in combination]
    print(f'Trying all combinations:')
    for i, c in enumerate(combination):
        print(f"{i}) {c}")
    num_passwords: int = math.prod(len(x) for x in combination)
    passwords = product(*combination)
    success: bool | str = False
    count: int = 0
    start: float = time.perf_counter()
    for password in passwords:
        password = ''.join(password)
        if check_password(pdf_file_path, password=password):
            success = password
            print(f'SUCCESS with password "{password}"')
            break
        count += 1
        if count % int(log_freq) == 0:
            now = time.perf_counter()
            print(f'Tried {human_readable_numbers(count)} ({100*count/num_passwords:.1f}%) of {human_readable_numbers(num_passwords)} passwords in {(now-start):.3f} seconds ({human_readable_numbers(count/(now-start))} passwords/sec). Latest password tried: "{password}"')
    end: float = time.perf_counter()
    msg: str = f'Tried {count} passwords in {1000*(end-start):.3f}ms ({count/(end-start):.3f} passwords/sec). '
    msg += f"Correct password: {success}" if success is not False else f"All {num_passwords} passwords failed."
    print(msg)

Comments

  1. Obviously, don't use this to break into PDFs which are not your own. I hold no responsibility over how you use this script or any consequences of using it.
  2. A lot of optimizations can be made.
    • Right now check_password uses pikepdf, which loads the file from disk for every "check". This is really slow; ideally it should run against an in-memory copy. I haven't figured out a way to do that, though.
    • You can probably speed this up a LOT by calling qpdf directly using C++, which is much better than Python for this kind of stuff.
    • I would avoid multi-processing here, since we're calling the same qpdf binary (which is normally a system-wide installation), which might become the bottleneck.
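One possible way to get the in-memory variant is sketched below, untested for speed. pikepdf.open accepts a seekable, readable stream as well as a file name, so the file can be read once and wrapped in a fresh io.BytesIO for each attempt. The function name make_in_memory_checker is made up for illustration, and pikepdf is imported inside the function so the snippet loads without it.

```python
import io


def make_in_memory_checker(pdf_file_path):
    """Read the PDF from disk once; return a function that tests one password."""
    # Third-party dependency, imported lazily on purpose.
    import pikepdf

    with open(pdf_file_path, "rb") as f:
        pdf_bytes = f.read()

    def check_password(password):
        try:
            # A fresh BytesIO per attempt, so every open starts at offset 0.
            with pikepdf.open(io.BytesIO(pdf_bytes), password=password):
                return True
        except pikepdf.PasswordError:
            return False

    return check_password
```

The returned function takes only the password, so to use it with check_passwords you would replace the call check_password(pdf_file_path, password=password) with a call to the checker. Whether this is noticeably faster depends on how much of the time goes to disk reads rather than to the password check itself.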
Waterbuck answered 27/12, 2022 at 8:46 Comment(1)
To add, an approach which would probably work (but I haven't tried) is to make ~N copies of the PDF, and then run ~N processes to read from disk in parallel, where N can be set up to the number of CPUs you have on your machine. This should speed things up and maximize your disk throughput.Waterbuck
R
-1

I faced the same problem parsing a secured PDF, and it was resolved by using the pikepdf library. I tried this library in my Jupyter notebook and on Windows, where it gave errors, but it worked smoothly on Ubuntu.

Ramtil answered 17/4, 2020 at 7:39 Comment(0)
