PDFBox IOException: End of File, expected line


I am trying to extract text from a PDF that is already uploaded to a server and accessed through a link, using PDFBox and Selenium. I used this tutorial as a source: http://www.seleniumeasy.com/selenium-tutorials/how-to-extract-pdf-text-and-verify-using-selenium-webdriver-java

public String function(String pdf_url) {
    PDFTextStripper pdfStripper = null;
    PDDocument pDoc;
    COSDocument cDoc;
    String parsedText = "";
    try {
        URL url = new URL(pdf_url);
        BufferedInputStream file = new BufferedInputStream(url.openStream());
        PDFParser parser = new PDFParser(file);
        parser.parse();
        cDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(1);

        pDoc = new PDDocument(cDoc);
        parsedText = pdfStripper.getText(pDoc);

    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    return parsedText;
}

Error: End-of-File expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1519)
at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:372)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:186)
at scripts.Script.grabPDF_Text(Script.java:94)
at scripts.Script.main(Script.java:817)

Why am I getting this error?

Outlet answered 20/6, 2018 at 17:28 Comment(16)

what is the return type of this method ? – Polytonality 20/6, 2018 at 17:50

String, sorry I omitted it on accident. I updated the original post. – Outlet 20/6, 2018 at 17:56

I tried this and it works; the only difference is that I read the PDF from a local file. What does pdf_url contain? – Polytonality 20/6, 2018 at 18:21

It's a link to a PDF that is directly uploaded to the server. I access the pdf using the link and it shows like if I opened it using acrobat. This is a String that I passed in from the main. Thanks for testing it out yourself. It's kinda like this file: adobe.com/support/products/enterprise/knowledgecenter/media/… – Outlet 20/6, 2018 at 18:29

Would it be easier if I just downloaded the file and read it that way, since you confirmed that it works? – Outlet 20/6, 2018 at 18:32

This code works for a PDF at a URL as well. I think there must be something wrong with your PDF. Anyway, try downloading the file and reading it locally. – Polytonality 20/6, 2018 at 18:37

Also, find out on which line you're getting exception. – Polytonality 20/6, 2018 at 18:41

The exception occurs on this line: parser.parse(); – Outlet 20/6, 2018 at 19:11

Check your URL for any unwanted characters or spaces. You might have to debug that. – Polytonality 20/6, 2018 at 19:32

An "End-of-File" during PDFParser.parseHeader sounds like an empty file (or nearly so). – Lemuellemuela 20/6, 2018 at 19:34

@Polytonality Can I see how you passed in the URL for both your examples to work on your end? – Outlet 20/6, 2018 at 19:36

@Lemuellemuela The page is not empty though. – Outlet 20/6, 2018 at 19:37

The best would be to save what you get with url.openStream() into a file to see what's really there. I concur with mkl that your file is empty. – Dannie 20/6, 2018 at 19:39

@Outlet - Added as an answer – Polytonality 20/6, 2018 at 20:20

Thanks everyone, it turns out the problem is these particular PDFs. They are incompatible or in some strange format, because I'm able to read other PDFs. – Outlet 20/6, 2018 at 20:25

I was trying to parse hundreds of files kept in one directory and was getting the same error from PDDocument.load(). By mistake there was one zero-byte non-PDF file in that directory :) – Guthrey 29/4, 2019 at 7:50
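
The comments above suggest looking at what url.openStream() really returns before handing it to the parser. A minimal sketch of that check, using only the JDK, could look like the following. The class and method names (PdfHeaderCheck, readAll, looksLikePdf) are made up for illustration and are not part of PDFBox. The header test is deliberately strict, so a file with stray bytes before the "%PDF-" marker would be reported as not a PDF here even if a lenient parser might accept it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: inspects the raw bytes behind a URL.
class PdfHeaderCheck {

    // Reads the whole stream into memory so it can be inspected before parsing.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // True only if the data begins with the "%PDF-" marker (strict check at byte 0).
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false; // empty or too short to contain a header
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PdfHeaderCheck <pdf-url>");
            return;
        }
        try (InputStream in = new URL(args[0]).openStream()) {
            byte[] data = readAll(in);
            System.out.println("bytes received: " + data.length);
            System.out.println("starts with %PDF-: " + looksLikePdf(data));
        }
    }
}
```

If the byte count is 0, or the content turns out to be an HTML login or error page, the parser never sees a PDF header, which would be consistent with an End-of-File error raised in parseHeader.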

Here is the example you asked for, passing the URL in as PDFURL:

String PDFURL = "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf";
function(PDFURL);

public String function(String pdf_url)
{
 //Exact same code as yours
}

To read the PDF from a local file instead, replace the URL and BufferedInputStream with:

 File file = new File(pdf_url);
 PDFParser parser = new PDFParser(new FileInputStream(file));

Hope this helps

Polytonality answered 20/6, 2018 at 20:19 Comment(1)

new PDFParser() is outdated. Use PDDocument.load(). – Dannie 21/6, 2018 at 7:43

Check whether any of the files has a size of 0 KB. You can also try loading the document with:

try (final PDDocument document = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly())){
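
The 0 KB check can be done before any file reaches the parser. The sketch below is one way to do it with the JDK alone. EmptyFileFilter and nonEmptyFiles are hypothetical names chosen for this example, and the assumption is that all candidate files sit directly in one directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: skips zero-byte files so they are never passed to the PDF parser.
class EmptyFileFilter {

    // Returns the regular files directly inside dir that contain at least one byte, sorted by path.
    static List<Path> nonEmptyFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> sizeOf(p) > 0)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Files.size throws a checked exception, so wrap it for use inside the stream.
    private static long sizeOf(Path p) {
        try {
            return Files.size(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : nonEmptyFiles(dir)) {
            System.out.println("would parse: " + p);
        }
    }
}
```

This only rules out empty files. A non-empty file that is not a PDF would still need a separate check before loading.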

Silverts answered 25/2, 2020 at 11:48 Comment(0)

If the first parameter is a FileInputStream, verify that it is positioned at the beginning of the file.

To be sure, always reset the channel position to the start:

if(inputStream instanceof FileInputStream)
{
    FileInputStream fis = (FileInputStream) inputStream;
    fis.getChannel().position(0);
}
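
To see the effect of that reset in isolation, here is a small self-contained sketch. StreamRewind and rewindIfFileStream are names invented for this example. It relies on the documented JDK behavior that changing the position of the channel returned by FileInputStream.getChannel() also changes the stream's read position.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper wrapping the reset shown above.
class StreamRewind {

    // Moves a FileInputStream back to byte 0; other stream types are left untouched.
    static void rewindIfFileStream(InputStream in) throws IOException {
        if (in instanceof FileInputStream) {
            ((FileInputStream) in).getChannel().position(0);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: StreamRewind <file>");
            return;
        }
        try (FileInputStream in = new FileInputStream(args[0])) {
            in.read(new byte[16]); // simulate earlier code that already consumed some bytes
            rewindIfFileStream(in);
            System.out.println("position after rewind: " + in.getChannel().position());
        }
    }
}
```

A stream that has already been read to the end looks empty to the parser, so rewinding before loading avoids that case for file-backed streams. Streams obtained from a URL cannot be rewound this way.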

Zeniazenith answered 27/7, 2023 at 10:16 Comment(0)
