Extracting information from PDFs of research papers [closed]

I need a mechanism for extracting bibliographic metadata from PDF documents, to save people from entering it by hand or cutting and pasting it.

At the very least, the title and abstract. The list of authors and their affiliations would be good. Extracting the references would be amazing.

Ideally this would be an open source solution.

The problem is that not all PDFs encode the text, and many that do fail to preserve the logical order of the text, so just running pdf2text gives you line 1 of column 1, line 1 of column 2, line 2 of column 1, etc.

I know there are a lot of libraries. It's identifying the title, abstract, authors, etc. in the document that I need to solve. This is never going to be possible every time, but 80% would save a lot of human effort.

Abroad answered 28/11, 2009 at 19:3 Comment(7)
Is this question related to any language and/or platform?Weasand
General UNIX platform, the more cross-platform the better. The main tool (EPrints) is MySQL, Perl, Apache, but it could shell out if needed. Ideally this should run fast enough that it provides near-instant results.Abroad
The bounty is for an answer which can take a PDF file and return me a data structure containing at least title and abstract, and is zero-cost software. It would make many university librarians very happy. Ideally also date, conference details (if any), and references. In UTF-8, while I'm being unreasonably optimistic.Abroad
Even if you could get all the text, how would you identify titles/abstracts? In the case when OCR is needed?Transmutation
Can you point out a link to a PDF containing such 'bibliographic metadata' as you have in mind?Cockup
This question is now also discussed at tex.sx: How to automatically generate BibTeX dataDeadfall
I think pdfextract looks useful github.com/Crossref/pdfextract.Ludivinaludlew

We ran a contest to solve this problem at Dev8D in London in February 2010, and a nice little GPL tool was created as a result. We've not yet integrated it into our systems, but it's out there in the world.

https://code.google.com/p/pdfssa4met/

Abroad answered 19/8, 2010 at 15:31 Comment(1)
I cannot recommend it: first, you need an obscure binary, pdftoxml.linux.exe.1.2.4; the pdftoxml project seems not to have a proper build system for generating binaries on your own. Moreover, you need to register at opencalais.com for a special API key. Sorry, this is all not convenient, and I'd rather try pdftotext or Google Scholar.Poling

I'm only allowed one link per posting so this is it: pdfinfo Linux manual page

This might get the title and authors. Look at the bottom of the manual page; there's a link to www.foolabs.com/xpdf, where the program's source code can be found, as well as binaries for various platforms.
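
If you shell out to it, here is a minimal Python sketch that parses pdfinfo's key/value output (assuming pdfinfo is on your PATH; the helper name is mine):

    import subprocess

    def pdf_info(path):
        """Run pdfinfo and return its 'Key: value' output as a dict."""
        out = subprocess.run(
            ["pdfinfo", path], capture_output=True, text=True, check=True
        ).stdout
        fields = {}
        for line in out.splitlines():
            key, _, value = line.partition(":")
            if value:
                fields[key.strip()] = value.strip()
        return fields

    meta = pdf_info("paper.pdf")
    print(meta.get("Title"), meta.get("Author"))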

To pull out bibliographic references, look at cb2bib:

cb2Bib is a free, open source, and multiplatform application for rapidly extracting unformatted, or unstandardized bibliographic references from email alerts, journal Web pages, and PDF files.

You might also want to check the discussion forums at www.zotero.org where this topic has been discussed.

Unalloyed answered 5/12, 2009 at 4:8 Comment(2)
I think the basic problem you're running into is that unless you're dealing with an e-publisher or a very organized company, you'll get marginally useful information out of the PDF metadata. So what it sounds like you're really after is a product that identifies and outputs the following from UNSTRUCTURED text: 1) author(s), 2) abstract, 3) bibliography information. This text can be easily extracted from a PDF (and often many other file formats), and there are many open source solutions for that. It seems cb2bib might be a good starting point, as it should help in the bibliography arena.Botzow
+1 for cb2bib, it is a great tool (even if not fully automated).Skid

Might be a tad simplistic, but Googling "bibtex + paper title" usually gets you a formatted BibTeX entry from the ACM, CiteSeer, or other such reference tracking sites. Of course this is assuming the paper isn't from a non-computing journal :D

-- EDIT --

I have a feeling you won't find a ready-made solution for this; you might want to write to citation trackers such as CiteSeer, the ACM, and Google Scholar to get ideas for what they have done. There are tons of others, and you might find their implementations are not closed source, just not available in a published form. There is tons of research material on the subject.

The research team I am part of has looked at such problems, and we have come to the conclusion that hand-written extraction algorithms or machine learning are the way to do it. Hand-written algorithms are probably your best bet.

This is quite a hard problem due to the amount of variation possible. I suggest normalizing the PDFs to text (which you can get from any of the dozens of programmatic PDF libraries). You then need to implement custom text-scraping algorithms.

I would start backward from the end of the PDF and look at what sort of citation keys exist -- e.g., [1], [author-year], (author-year) -- and then try to parse the following sentence. You will probably have to write code to normalize the text you get from a library (removing extra whitespace and such). I would only look for citation keys as the first word of a line, and only in the last 10 pages of each document -- the first word must have a key delimiter, e.g., '[' or '('. If no keys can be found in 10 pages, then ignore the PDF and flag it for human intervention.
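
A rough Python sketch of that backward scan (the regexes for [1]-style and author-year keys are illustrative guesses, and a line budget stands in for the 10-page cutoff; tune both for your corpus):

    import re

    # Illustrative patterns: [1] ..., [Smith, 2001] ..., (Smith 2001) ...
    KEY_PATTERNS = [
        re.compile(r"^\[\d+\]\s+(?P<entry>.+)"),
        re.compile(r"^[\[(][A-Z][A-Za-z-]+,?\s*\d{4}[\])]\s+(?P<entry>.+)"),
    ]

    def scan_references(lines, max_lines=600):
        """Scan the tail of the normalized text, keeping lines whose
        first word looks like a citation key."""
        refs = []
        for line in reversed(lines[-max_lines:]):
            line = " ".join(line.split())  # collapse stray whitespace
            for pattern in KEY_PATTERNS:
                match = pattern.match(line)
                if match:
                    refs.append(match.group("entry"))
                    break
        refs.reverse()  # restore document order
        return refs     # empty result -> flag the PDF for a human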

You might want a library that you can further programmatically consult for formatting metadata within citations -- e.g., italics have a special meaning.

I think you might end up spending quite some time getting to a working solution, and will then face a continual process of tuning and adding to the scraping algorithms/engine.

Pulp answered 28/11, 2009 at 19:39 Comment(2)
Nice idea, but I'm working on a system for putting research PDFs online, so it's the thing providing the BibTeX!Abroad
I've already gotten that far. I was hoping there might be some packaged solution. It's a research-level problem :(Abroad

CERMINE - Content ExtRactor and MINEr

Described in the paper: TKACZYK, Dominika, et al. CERMINE: automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR), 2015, 18.4: 317-335.

Mainly written in Java and available as open source on GitHub.
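
If you want to drive it from a script, a small wrapper can shell out to the CERMINE jar. A sketch, assuming the jar name/version below (check the project's README for the current invocation):

    import subprocess

    # Assumed jar name/version -- see CERMINE's README for the real one.
    CERMINE_JAR = "cermine-impl-1.13-jar-with-dependencies.jar"

    def run_cermine(pdf_dir):
        """Run CERMINE over every PDF in pdf_dir; it writes the extracted
        metadata as XML files alongside the PDFs."""
        subprocess.run(
            ["java", "-cp", CERMINE_JAR,
             "pl.edu.icm.cermine.ContentExtractor",
             "-path", pdf_dir],
            check=True,
        )

    run_cermine("papers/")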

Skid answered 23/11, 2015 at 10:4 Comment(3)
Why is this voted down?Deadfall
@Deadfall :-) Who knows...Skid
I have used CERMINE with good results, as it looks at the content of your PDF too! Not many applications do this. Of course, your PDF files need to be OCRed beforehand for it to work.Abrahamabrahams

In this case I would recommend TET from PDFlib.

If you need to get a quick feel for what it can do, take a look at the TET Cookbook

This is not an open source solution, but it's currently the best option in my opinion. It's not platform-dependent, and it has a rich set of language bindings and commercial backing.

I would be happy if someone pointed me to an equivalent or better open source alternative.

To extract text you would use the TET_xxx() functions and to query metadata you can use the pcos_xxx() functions.

You can also use the command-line tool to generate an XML file containing all the information you need:

tet --tetml word file.pdf

There are examples on how to process TETML with XSLT in the TET Cookbook

What’s included in TETML?

TETML output is encoded in UTF-8 (on zSeries with USS or MVS: EBCDIC-UTF-8, see www.unicode.org/reports/tr16), and includes the following information:

  - general document information and metadata
  - text contents of each page (words or paragraphs)
  - glyph information (font name, size, coordinates)
  - structure information, e.g. tables
  - information about placed images on the page
  - resource information, i.e. fonts, colorspaces, and images
  - error messages if an exception occurred during PDF processing

Weasand answered 28/11, 2009 at 19:36 Comment(0)

Another Java library to try would be PDFBox. PDFs are really designed to be viewed and printed, so you definitely want a library to do some of the heavy lifting for you. Even so, you might have to do a little gluing of text pieces back together to get the data you want extracted. Good luck!

Snifter answered 28/11, 2009 at 19:20 Comment(0)

Just found pdftk... it's amazing; it comes in a binary distribution for Win/Lin/Mac as well as source.

In fact, I solved my other problem (look at my profile, I asked then answered another pdf question .. can't link due to 1 link limitation).

It can do PDF metadata extraction; for example, this will return the line containing the title:

 pdftk test.pdf dump_data | grep -A 1 "InfoKey: Title" | grep "InfoValue"

It can dump the title, author, mod-date, and even bookmarks and page numbers (my test PDF had bookmarks)... obviously a bit of work will be needed to properly grep the output, but I think this should fit your needs.

If your PDFs don't have metadata (i.e., no "Abstract" metadata), you can cat the text using a different tool like pdf2text and use some grep tricks like the above. If your PDFs are not OCR'd, you have a much bigger problem, and ad-hoc querying of the PDFs will be painfully slow (best to OCR them first).
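
As an illustration, here is a Python sketch of such a grep trick (assuming pdftotext is installed; the heading regex is a guess you will need to tune per corpus):

    import re
    import subprocess

    def extract_abstract(pdf_path):
        """Dump the text layer with pdftotext, then grab the block after
        an 'Abstract' heading (a heuristic, not a guarantee)."""
        text = subprocess.run(
            ["pdftotext", pdf_path, "-"],  # "-" writes the text to stdout
            capture_output=True, text=True, check=True,
        ).stdout
        match = re.search(
            r"Abstract\s*[:.]?\s*(.+?)(?:\n\s*\n|\n(?:1\.?\s*)?Introduction\b)",
            text, re.DOTALL | re.IGNORECASE,
        )
        return match.group(1).strip() if match else None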

Regardless, I would recommend you build an index of your documents instead of having each query scan the file metadata/text.

Spelldown answered 8/12, 2009 at 2:10 Comment(2)
Only extracts the metadata embedded by the creating software. I need the bibliographic metadata. This can't get me the abstract. I know I have a big problem; that's why I asked the question. Looks like there's no solution available :( Google Scholar clearly has a way, but I've not got their resources.Abroad
I'm pretty sure there's no pre-packaged solution for your problem. However, using tools like pdftk, pdf2txt, and some Perl/shell scripting should give you that 80-90% coverage (assuming you don't have to OCR them first). I think it's a bit unfair to post this bounty without sample data, because there is no way to solve this without examining the corpus of data. Even commercial or pre-packaged solutions will likely need to know some details of what your content looks like, or you will need to configure/test repeatedly until you get good coverage.Spelldown

Take a look at iText. It is a Java library that will let you read PDFs. You will still face the problem of finding the right data, but the library provides formatting and layout information that might be usable to infer purpose.

Fivepenny answered 28/11, 2009 at 19:14 Comment(0)

PyPDF might be of help. It provides an extensive API for reading and writing the content of (un-encrypted) PDF files, and it's written in Python, an easy language to work with.
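
A minimal sketch with today's pypdf package (the original pyPdf this answer dates from had a different, older API):

    from pypdf import PdfReader

    reader = PdfReader("paper.pdf")

    # Embedded document info -- often sparse or wrong, as noted elsewhere here.
    info = reader.metadata
    print("Title: ", info.title if info else None)
    print("Author:", info.author if info else None)

    # Crude fallback: the title is usually near the top of page 1.
    first_page = reader.pages[0].extract_text() or ""
    lines = [ln.strip() for ln in first_page.splitlines() if ln.strip()]
    print("First line of page 1:", lines[0] if lines else "(no text layer)")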

Enormous answered 28/11, 2009 at 19:26 Comment(0)

Have a look at this research paper: Accurate Information Extraction from Research Papers using Conditional Random Fields.

You might want to use an open-source package like Stanford NER to get started on CRFs.

Or perhaps you could try importing them (the research papers) into Mendeley. Apparently, it should extract the necessary information for you.

Hope this helps.

Barramunda answered 21/4, 2010 at 10:10 Comment(0)

Here is what I do using Linux and cb2bib.

  1. Open up cb2bib and make sure that the clipboard connection is ON, and that your reference database is loaded.
  2. Find your paper on Google Scholar.
  3. Click 'import to bibtex' underneath the paper.
  4. Select (highlight) everything on the next page (i.e., the BibTeX code).
  5. It should now appear formatted in cb2bib.
  6. Optionally, now press network search (the globe icon) to add additional info.
  7. Press save in cb2bib to add the paper to your reference database.

Repeat this for all the papers. In the absence of a method that reliably extracts metadata from PDFs, I think this is the easiest solution I have found.

Iambus answered 21/5, 2013 at 17:17 Comment(1)
+1 for cb2bib, it is a great tool (even if not fully automated).Skid

I recommend gscholar in combination with pdftotext.

Although PDF provides metadata, it is seldom populated with correct content. Often "None" or "Adobe-Photoshop" or other useless strings are in place of the title field, for example. That is why none of the above tools may derive correct information from PDFs, as the title might be anywhere in the document. Another example: many conference-proceedings papers also carry the title of the conference or the names of the editors, which confuses automatic extraction tools. The results are then dead wrong when you are interested in the real authors of the paper.

So I suggest a semi-automatic approach involving google scholar.

  1. Render the PDF to text, so you can extract the author and title.
  2. Copy and paste some of this info and query Google Scholar. To automate this, I employ the cool Python script gscholar.py.

So in real life this is what I do:

me@box> pdftotext 10.1.1.90.711.pdf - | head
Computational Geometry 23 (2002) 183–194
www.elsevier.com/locate/comgeo

Voronoi diagrams on the sphere ✩
Hyeon-Suk Na a , Chung-Nim Lee a , Otfried Cheong b,∗
a Department of Mathematics, Pohang University of Science and Technology, South Korea
b Institute of Information and Computing Sciences, Utrecht University, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands

Received 28 June 2001; received in revised form 6 September 2001; accepted 12 February 2002
Communicated by J.-R. Sack
me@box> gscholar.py "Voronoi diagrams on the sphere Hyeon-Suk" 
@article{na2002voronoi,
  title={Voronoi diagrams on the sphere},
  author={Na, Hyeon-Suk and Lee, Chung-Nim and Cheong, Otfried},
  journal={Computational Geometry},
  volume={23},
  number={2},
  pages={183--194},
  year={2002},
  publisher={Elsevier}
}

EDIT: Be careful, you might encounter CAPTCHAs. Another great script is bibfetch.

Poling answered 28/10, 2013 at 8:2 Comment(0)
