Parsing PDF files [closed]

I need to split a large PDF document into smaller files based on the content of the file. We use BCL easyPDF to manipulate PDF files. easyPDF can split a PDF document at a given page number, but it cannot split the document based on its content. It also does not appear to have a search function for determining where a piece of content is located (if I am wrong about that, please let me know).

Can someone tell me how I can find the location of text in a PDF file using .NET?

Thanks

Sweetie answered 3/5, 2012 at 18:19 Comment(4)
Yes, but it should be (and is) a community where we can help people who may still be learning the ins and outs of a language or protocol. We can try to point them in the right direction. - Jackqueline
Isn't PDF a sort of binary file? You cannot just parse it as text. A library is required. - Pooka
I start out my year with my usual complaint: why is this off topic? I know the rules say it is, but it's very useful. Many of the preserved "best" questions (which I see you can no longer find) are of this nature. They represent the accumulated advice (a.k.a. wisdom) of many experienced devs. - Piedadpiedmont
SKDocument.CreatePdf - Northward

You might try Docotic.Pdf library for your task.

The library can extract text from PDFs (with or without formatting).

Or you could retrieve the collection of words in a PDF together with their bounding rectangles. This should help you find the location of the text in a file.
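Once a library has given you the text of each page, splitting by content comes down to working out page ranges, which a page-number splitter (such as the one the question says easyPDF provides) can then consume. Below is a minimal sketch of that step only. It assumes the text has already been extracted into one string per page; `SplitPlanner`, `PlanRanges`, and the sample "Invoice #" marker are hypothetical names chosen for illustration, and no real PDF library API is called.

```csharp
using System;
using System.Collections.Generic;

public static class SplitPlanner
{
    // Returns 1-based, inclusive page ranges. A new range starts on every
    // page (after the first) whose text contains the marker, so any pages
    // before the first marker end up in the leading range.
    public static List<(int First, int Last)> PlanRanges(
        IReadOnlyList<string> pageTexts, string marker)
    {
        var ranges = new List<(int First, int Last)>();
        if (pageTexts.Count == 0)
            return ranges;

        int start = 1;
        for (int page = 2; page <= pageTexts.Count; page++)
        {
            // IndexOf with a StringComparison is available on every .NET version.
            bool startsNewPart = pageTexts[page - 1]
                .IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0;

            if (startsNewPart)
            {
                ranges.Add((start, page - 1)); // close the previous part
                start = page;
            }
        }

        ranges.Add((start, pageTexts.Count)); // close the final part
        return ranges;
    }

    public static void Main()
    {
        // Hypothetical per-page text, standing in for what a
        // text-extraction library would return for a five-page file.
        var pages = new[]
        {
            "Invoice #1001 ...", "continued",
            "Invoice #1002 ...",
            "Invoice #1003 ...", "continued"
        };

        foreach (var (first, last) in PlanRanges(pages, "Invoice #"))
            Console.WriteLine($"pages {first}-{last}");
        // Prints:
        // pages 1-2
        // pages 3-3
        // pages 4-5
    }
}
```

Each resulting range can then be handed to whatever page-based split call your PDF library offers. If you need to split on where the text sits within a page rather than merely whether it appears, the word bounding rectangles mentioned above would replace the simple substring check.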

Disclaimer: I work for the vendor of the library.

Genuflect answered 4/5, 2012 at 15:45 Comment(1)
NOTE: As Bobrovsky mentions, this is a commercial product. Its price is non-trivial (though appropriate for what it does). - Knowledgeable

You need a PDF library in .NET such as iText.Net.

Procaine answered 3/5, 2012 at 18:23 Comment(0)

Take a look at this question. It links to some libraries that may satisfy your requirements:

How to programatically search a PDF document in c#

Jackqueline answered 3/5, 2012 at 18:22 Comment(0)