How can I get all text from a PDF in Swift?

S

5

I have a PDF document and would like to extract all its text. I tried the following:

import Quartz

let url = NSBundle.mainBundle().URLForResource("test", withExtension: "pdf")
let pdf = PDFDocument(URL: url)
print(pdf.string())

It does get the text, but the order of the extracted lines is completely mixed up compared to opening the PDF in Adobe and using Edit > Select All, Copy, Paste.

How can I get the same result in Swift as opening the PDF and doing Select All, Copy, Paste?

Sphacelus answered 15/5, 2016 at 16:27 Comment(2)

I couldn't find string() on the pdf instance. Is it gone? – Heiser 30/5, 2017 at 11:6

@Heiser it is a computed property string (Swift) – Predicant 8/11, 2020 at 5:38

I

4

That is unfortunately not possible.
At least not without some major work on your part. And it certainly is not possible in a general manner for all PDFs.

PDFs are (generally) a one-way street.
They were created to display text in the same way on every system without any difference and for printers to print a document without the printer having to know all fonts and stuff.

Extracting text is non-trivial and only possible for PDFs where the rendered page is accompanied by text (which is not required). All text information present in the PDF is coupled with location information that determines where it is to be shown.

If you have a table shown in the PDF where the left column contains the names of the entries and the right column contains their contents, both of those columns can be represented as completely different blocks of text which only appear to have some link between each other due to their placement next to each other.

What the framework / your code would have to do is determine what parts of text that are visually linked are also logically linked and belong together. That is not (yet) possible. The reason you and I can read and understand and group the PDF is that in some fields our brain is still far better than computers.
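To make the problem concrete, here is a minimal sketch of the kind of position-based grouping an extractor would have to attempt. It is not what PDFKit or Acrobat does internally; `TextFragment`, `naiveReadingOrder`, and the `tolerance` value are hypothetical names and numbers chosen for illustration.

```swift
import Foundation

// Hypothetical stand-in for a piece of text plus its position on the page.
struct TextFragment {
    let text: String
    let x: Double   // left edge, in page points
    let y: Double   // baseline, measured from the bottom of the page (PDF convention)
}

/// Groups fragments into lines by baseline, then orders each line left to right.
/// `tolerance` is an assumed threshold for treating two baselines as the same line.
func naiveReadingOrder(_ fragments: [TextFragment], tolerance: Double = 2.0) -> [String] {
    // Top of the page first: a larger y comes earlier.
    let sorted = fragments.sorted { $0.y > $1.y }
    var lines: [[TextFragment]] = []
    for fragment in sorted {
        if let anchor = lines.last?.first, abs(anchor.y - fragment.y) <= tolerance {
            lines[lines.count - 1].append(fragment)
        } else {
            lines.append([fragment])
        }
    }
    return lines.map { line in
        line.sorted { $0.x < $1.x }.map(\.text).joined(separator: " ")
    }
}

// A two-column table stored column by column, as described above.
let cells = [
    TextFragment(text: "Name", x: 50, y: 700),
    TextFragment(text: "Age", x: 50, y: 680),
    TextFragment(text: "Alice", x: 200, y: 700),
    TextFragment(text: "30", x: 200, y: 680),
]
print(naiveReadingOrder(cells))  // ["Name Alice", "Age 30"]
```

This heuristic happens to work for a simple table, but it gives the wrong answer for two-column prose, where merging by baseline interleaves the columns. Telling those two cases apart is the part that cannot be done reliably from positions alone.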

Final note because it might cause confusion: It is certainly possible that Adobe and Apple both do some of this grouping already and achieve a good result, but it is still not perfect. The PDF I just tested was pretty mangled up after extracting the text via the Mac Preview.

Interspace answered 15/5, 2016 at 16:37 Comment(10)

That's unfortunate! Do you know how I could cut out a section of the PDF? It does have columns. Then I could cut into sections and again try to use 'pdf.string'. – Sphacelus 15/5, 2016 at 18:26

@CenTinel I do not know that, no. But I know you can cut out sides and take the string only from that. There is plenty of functionality in the docs of PDFDocument, you might want to read through that site and google for interesting keywords you hit. – Interspace 15/5, 2016 at 18:40

OK, I managed to make selection rectangles across the PDF using pdf.pageAtIndex(x).selectionForRect(somerect), but this is also completely jumbled up :( – Sphacelus 15/5, 2016 at 19:13

@luk2302, I couldn't find string() function anywhere on pdf of type CGPDFDocument? Where is it actually? – Heiser 30/5, 2017 at 11:8

@Heiser no idea, CGPDFDocument is not the same as PDFDocument. – Interspace 30/5, 2017 at 11:21

So what's PDFDocument? Is it a custom class? – Heiser 30/5, 2017 at 11:26

@Heiser have you tried googling it? developer.apple.com/reference/quartz/pdfdocument – Interspace 30/5, 2017 at 11:27

@luk2302, this is amazing. Thanks! I never thought this class would also be in the docs. So is there any difference between CGPDFDocument and PDFDocument? Why should we use CGPDFDocument over PDFDocument? – Heiser 30/5, 2017 at 11:29

@Heiser one is CoreGraphics, one is Quartz, one is mac-only, one is cross-platform. – Interspace 30/5, 2017 at 11:30

CGPDFDocument (and similar classes) are the original Core Graphics classes, created at the dawn of OS X. PDFDocument (and similar) are part of PDFKit, a slightly higher level API introduced around Snow Leopard. Generally, PDFKit is faster and offers easier methods for doing stuff. – Sinistrocular 23/2, 2018 at 17:1
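For reference, the rectangle approach tried in the comments above could be written against the current PDFKit API roughly as follows. This is only a sketch: the function name `twoColumnText` is hypothetical, and the 50/50 column split is an assumption that real layouts will rarely match. As reported above, the text inside each rectangle may still come out jumbled.

```swift
import PDFKit

/// Reads a page as two equal-width columns, left column first.
/// The 50/50 split is an assumption; real layouts need their own rectangles.
func twoColumnText(of page: PDFPage) -> String {
    let bounds = page.bounds(for: .mediaBox)
    let half = bounds.width / 2
    let left = CGRect(x: bounds.minX, y: bounds.minY,
                      width: half, height: bounds.height)
    let right = CGRect(x: bounds.minX + half, y: bounds.minY,
                       width: half, height: bounds.height)
    // selection(for:) returns the text inside a rectangle given in page space.
    return [left, right]
        .compactMap { page.selection(for: $0)?.string }
        .joined(separator: "\n")
}
```

A caller would loop over `pdf.page(at:)` for each zero-based index and concatenate the per-page results.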

R

8

If you want only text content:

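import PDFKit

extension String {
    /// Treats the string as a file path and returns the PDF's text,
    /// or nil if the file cannot be opened or contains no text.
    func readPDF() -> String? {
        let url = URL(fileURLWithPath: self)
        return PDFDocument(url: url)?.string
    }
}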

Rena answered 21/2, 2017 at 3:29 Comment(0)

N

6

I did it with this:

if let pdf = PDFDocument(url: url) {
    let pageCount = pdf.pageCount
    let documentContent = NSMutableAttributedString()

    for i in 0 ..< pageCount { // page indexes are zero-based
        guard let page = pdf.page(at: i) else { continue }
        guard let pageContent = page.attributedString else { continue }
        documentContent.append(pageContent)
    }
}

Hope it helps.

Nereen answered 20/3, 2020 at 11:52 Comment(2)

I know that PDF page numbering starts at 1, but the quick help says "indexes are zero based", so I changed the 1 to 0 and it worked perfectly fine. – Durance 17/6, 2021 at 23:40

With import PDFKit and let url = URL(fileURLWithPath: "file.pdf") it worked for me and for my test document the result was identical to manually selecting the entire document in macOS Preview and copy and pasting into a text editor. – Barrier 24/10, 2022 at 19:19

C

1

Here's an option using PDFKit:

import Cocoa
import Quartz

func pdfToText(fromPDF: String) -> String {
    let docContent = NSMutableAttributedString()
    // Look up the PDF in the app bundle; return an empty string if it is missing or unreadable.
    guard let url = Bundle.main.url(forResource: fromPDF, withExtension: "pdf"),
          let pdf = PDFDocument(url: url) else { return docContent.string }

    // Page indexes are zero-based.
    for i in 0 ..< pdf.pageCount {
        guard let page = pdf.page(at: i),
              let pageContent = page.attributedString else { continue }
        docContent.append(pageContent)
    }

    return docContent.string
}

let pdfString = pdfToText(fromPDF: "documentName")

This gives you the option to get the PDF content as an attributed string. If you're just after the plain text, you can get it by appending .string to the result, as I did in the example above.

cf. Paul Hudson's snippet

Chorale answered 4/6, 2019 at 9:11 Comment(1)

I know that PDF page numbering starts at 1, but the quick help says "indexes are zero based", so I changed the 1 to 0 and it worked perfectly fine. – Durance 18/6, 2021 at 7:37

S

0

Apple's documentation for the PDFDocument class says that string is "a convenience method, equivalent to creating a selection object for the entire document and then invoking the PDFSelection class’s string method."

So you should get the same results using it as copying and pasting in Preview.
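Spelled out in code, the equivalence described in that documentation quote would look something like the sketch below. The function name `textViaSelection` is hypothetical; `selectionForEntireDocument` and `string` are the PDFKit properties the quote refers to.

```swift
import PDFKit

/// Returns the document text via an explicit whole-document selection,
/// which per the documentation quoted above should match `PDFDocument.string`.
func textViaSelection(at url: URL) -> String? {
    // PDFDocument(url:) is failable; a missing or unreadable file yields nil.
    guard let pdf = PDFDocument(url: url) else { return nil }
    return pdf.selectionForEntireDocument?.string
}
```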

Adobe's Acrobat may use some other routine to create a more logically useful flow, but you can't access that programmatically in macOS.

Sinistrocular answered 23/2, 2018 at 17:5 Comment(0)
