I am trying to extract the text of a PDF that is opened in the browser.
After clicking the 'Print' button, the URL below opens in a new tab.
https://myappurl.com/employees/2Jb_rpRC710XGvs8xHSOmHE9_LGkL97j/details/listprint.pdf?ids%5B%5D=2Jb_rpRC711lmIvMaBdxnzJj_ZfipcXW
I have run the same program against other web URLs and it works fine. The code is the same as in the linked post (Extract PDF text).
I am using the following versions of PDFBox and FontBox.
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.9</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>fontbox</artifactId>
    <version>1.8.9</version>
</dependency>
Below is the code, which works fine with other URLs:
import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public boolean verifyPDFContent(String strURL, String reqTextInPDF) {
    boolean flag = false;
    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    String parsedText = null;
    try {
        // Download the PDF and parse it (the exception is thrown in parse())
        URL url = new URL(strURL);
        BufferedInputStream file = new BufferedInputStream(url.openStream());
        PDFParser parser = new PDFParser(file);
        parser.parse();
        cosDoc = parser.getDocument();
        // Extract the text of the first page only
        pdfStripper = new PDFTextStripper();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(1);
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
    } catch (MalformedURLException e2) {
        System.err.println("URL string could not be parsed " + e2.getMessage());
    } catch (IOException e) {
        System.err.println("Unable to open PDF Parser. " + e.getMessage());
        try {
            if (cosDoc != null)
                cosDoc.close();
            if (pdDoc != null)
                pdDoc.close();
        } catch (Exception e1) {
            e1.printStackTrace();
        }
    }
    System.out.println("+++++++++++++++++");
    System.out.println(parsedText);
    System.out.println("+++++++++++++++++");
    // parsedText stays null when parsing fails, so guard against it
    if (parsedText != null && parsedText.contains(reqTextInPDF)) {
        flag = true;
    }
    return flag;
}
Below is the stack trace of the exception that I am getting:
java.io.IOException: Error: End-of-File, expected line
    at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1517)
    at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:372)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:186)
    at com.kareo.utils.PDFManager.getPDFContent(PDFManager.java:26)
I have added a screenshot that I took while debugging, showing the URL and file values. Please help me out. Is this related to 'https'?
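Since the exception is thrown in `parseHeader`, the parser apparently received a stream that was empty or did not begin with a PDF header. One possible cause, which is only an assumption and not confirmed here, is that the server needs the browser's login session and returns something other than the PDF to a plain `url.openStream()` call. A minimal sketch for checking this is below. The class `PdfHeaderCheck` and the method `startsWithPdfHeader` are hypothetical names made up for illustration, and the check assumes the `%PDF-` marker sits at the very start of the stream, which is typical but not guaranteed. The demo uses in-memory bytes; in the real program the stream from `url.openStream()` would be passed instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PdfHeaderCheck {

    // Hypothetical helper: returns true if the stream begins with "%PDF-".
    // Note that it consumes the first bytes of the stream it is given.
    static boolean startsWithPdfHeader(InputStream in) throws IOException {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        byte[] head = new byte[magic.length];
        int read = 0;
        // read() may return fewer bytes than requested, so loop until
        // the buffer is full or the stream ends
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                break; // end of stream before the header was complete
            }
            read += n;
        }
        return read == magic.length && Arrays.equals(head, magic);
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfLike = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlLike = "<html>login</html>".getBytes(StandardCharsets.US_ASCII);
        byte[] empty = new byte[0];

        System.out.println(startsWithPdfHeader(new ByteArrayInputStream(pdfLike)));  // true
        System.out.println(startsWithPdfHeader(new ByteArrayInputStream(htmlLike))); // false
        System.out.println(startsWithPdfHeader(new ByteArrayInputStream(empty)));    // false
    }
}
```

If the check returns false for the failing URL, the problem is probably in what the server sends back (for example a login page or an empty body) rather than in PDFBox or in 'https' as such.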