java.io.IOException: Error: End-of-File, expected line Issue with PDFBox

I am trying to read the text of a PDF that is opened in the browser.

After clicking the 'Print' button, the URL below opens in a new tab.

https://myappurl.com/employees/2Jb_rpRC710XGvs8xHSOmHE9_LGkL97j/details/listprint.pdf?ids%5B%5D=2Jb_rpRC711lmIvMaBdxnzJj_ZfipcXW

I have run the same program against other web URLs and it works fine. I used the same code that is used here (Extract PDF text).

I am using the following versions of PDFBox:

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>1.8.9</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>fontbox</artifactId>
        <version>1.8.9</version>
    </dependency>

Below is the code that works fine with other URLs:

public boolean verifyPDFContent(String strURL, String reqTextInPDF) {

    boolean flag = false;

    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    String parsedText = null;

    try {
        URL url = new URL(strURL);
        BufferedInputStream file = new BufferedInputStream(url.openStream());
        PDFParser parser = new PDFParser(file);

        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(1);

        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
    } catch (MalformedURLException e2) {
        System.err.println("URL string could not be parsed "+e2.getMessage());
    } catch (IOException e) {
        System.err.println("Unable to open PDF Parser. " + e.getMessage());
    } finally {
        // Release the documents whether or not parsing succeeded.
        try {
            if (pdDoc != null)
                pdDoc.close();
            else if (cosDoc != null)
                cosDoc.close();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }

    System.out.println("+++++++++++++++++");
    System.out.println(parsedText);
    System.out.println("+++++++++++++++++");

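    // parsedText stays null when parsing failed, so guard against a NullPointerException.
    if (parsedText != null && parsedText.contains(reqTextInPDF)) {
        flag = true;
    }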

    return flag;
}

And below is the stack trace of the exception that I am getting:

java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1517)
at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:372)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:186)
at com.kareo.utils.PDFManager.getPDFContent(PDFManager.java:26)

I have added an image that I took while debugging at the URL and the file. Please help me out. Is this something to do with 'https'?

Christychristye answered 13/4, 2015 at 18:30 Comment(10)
Are you sure that the input file is a PDF created using PDF creation software? It is common for PDFs to be just a converted image, in which case you need an OCR implementation.Carr
The correct code is PDDocument doc = PDDocument.load() or (better) .loadNonSeq(). I can't tell if that is the cause of the problem. The error message indicates that %PDF is missing. You should verify that url.openStream() really returns a PDF file content.Confession
@Invexity It is opened as a PDF. I was able to download it to my local machine and read it, but I was not able to read it from the URL.Christychristye
@TilmanHausherr Exactly, I get the error at `parser.parse();`. I have now updated the question with an image from debugging, in case it helps in some way.Christychristye
The image also indicates that the stream is empty. To check this, read your https stream into a byte array and see what size is read. Downloading with a browser may not be the same as reading with java. (proxy ?)Confession
@Dev Raj Did you find the solution to your problem?Shandashandee
@DevRaj Did you find the solution?Grishilda
@DevRaj Did you find the solution?Ormazd
#34871770 - Try this one.Opisthognathous
Nothing was wrong in my code. I resolved it by finding that the PDFs I was merging were corrupted/unable to open.Hagridden
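
One of the comments above suggests reading the https stream into a byte array and checking how many bytes arrive, since the error message indicates that the `%PDF` header is missing. A minimal sketch of that check follows. The class name `PdfStreamCheck` and its methods are hypothetical, the sample data stands in for `url.openStream()`, and the header test is deliberately strict (it only looks at the very first bytes).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of PDFBox.
class PdfStreamCheck {

    /** Reads the whole stream into memory so its size and header can be inspected. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        return out.toByteArray();
    }

    /** Returns true if the data begins with the "%PDF" marker. */
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data (assumption); in the question this would be url.openStream().
        InputStream in = new ByteArrayInputStream(
                "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII));
        byte[] data = readFully(in);
        System.out.println("Bytes read: " + data.length);
        System.out.println("Starts with %PDF: " + looksLikePdf(data));
    }
}
```

If the byte count is zero, or the content turns out to be something like an HTML login page, the problem probably lies in how the URL is fetched (session, proxy, redirect) rather than in the parsing code.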

A stream is like a pipe: once the data has flowed past, it cannot be read again. To read the same data more than once, you can:

1. Convert the input stream to a file.

public void useInputStreamTwiceBySaveToDisk(InputStream inputStream) { 
    String desPath = "test001.bin";
    try (BufferedInputStream is = new BufferedInputStream(inputStream);
         BufferedOutputStream os = new BufferedOutputStream(new FileOutputStream(desPath))) { 
        int len;
        byte[] buffer = new byte[1024];
        while ((len = is.read(buffer)) != -1) { 
            os.write(buffer, 0, len);
        }
    } catch (IOException e) { 
        e.printStackTrace();
    }
    
    File file = new File(desPath);
    StringBuilder sb = new StringBuilder();
    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file))) { 
        int len;
        byte[] buffer = new byte[1024];
        while ((len = is.read(buffer)) != -1) { 
            sb.append(new String(buffer, 0, len));
        }
        System.out.println(sb.toString());
    } catch (IOException e) { 
        e.printStackTrace();
    }
}

2. Convert the input stream to a byte array.

public void useInputStreamTwiceSaveToByteArrayOutputStream(InputStream inputStream) { 
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    try { 
        byte[] buffer = new byte[1024];
        int len;
        while ((len = inputStream.read(buffer)) != -1) { 
            outputStream.write(buffer, 0, len);
        }
    } catch (IOException e) { 
        e.printStackTrace();
    }
    // first read of the data (printInputStreamData is a helper that is not shown here)
    InputStream inputStream1 = new ByteArrayInputStream(outputStream.toByteArray());
    printInputStreamData(inputStream1);
    // second read InputStream
    InputStream inputStream2 = new ByteArrayInputStream(outputStream.toByteArray());
    printInputStreamData(inputStream2);
}
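
The method above calls `printInputStreamData`, which is not defined anywhere in this answer. A minimal sketch of what it might look like, assuming it simply prints the stream's content as UTF-8 text; the class name `StreamPrinter` and the `readToString` helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical holder class for the missing helper.
class StreamPrinter {

    /** Collects all bytes first, then decodes once so multi-byte characters are not split. */
    static String readToString(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Assumed behaviour of the helper used above: print the stream's content. */
    static void printInputStreamData(InputStream in) {
        try {
            System.out.println(readToString(in));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        byte[] data = "sample data".getBytes(StandardCharsets.UTF_8);
        // Each ByteArrayInputStream is a fresh view of the same bytes.
        printInputStreamData(new ByteArrayInputStream(data));
        printInputStreamData(new ByteArrayInputStream(data));
    }
}
```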

3. Mark and reset the input stream.

public void useInputStreamTwiceByUseMarkAndReset(InputStream inputStream) { 
    StringBuilder sb = new StringBuilder();
    try (BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream, 10)) { 
        byte[] buffer = new byte[1024];
        // Mark the current position. The read limit is the maximum value of an integer,
        // so the mark stays valid however many bytes are read before reset().
        bufferedInputStream.mark(Integer.MAX_VALUE);
        int len;
        while ((len = bufferedInputStream.read(buffer)) != -1) { 
            sb.append(new String(buffer, 0, len));
        }
        System.out.println(sb.toString());
        // After the first call, explicitly call the reset method to reset the flow
        bufferedInputStream.reset();
        // Read the second stream
        sb = new StringBuilder();
        int len1;
        while ((len1 = bufferedInputStream.read(buffer)) != -1) { 
            sb.append(new String(buffer, 0, len1));
        }
        System.out.println(sb.toString());
    } catch (IOException e) { 
        e.printStackTrace();
    }
}

Then you can repeat the read operation on the same input stream as many times as needed.

Confide answered 4/8, 2022 at 8:34 Comment(1)
This does not solve the issue of the question and is only loosely related to its topic.Laity
