Seeking citation parser

5

I need a parser that will scan scholarly texts, extract citations, and parse those citations into their component parts (author, title, publication date, etc.).

I've tried Paracite, but it is abominably slow and doesn't produce high-quality results.

Any language is OK, but Java is preferred.

Srinagar answered 16/9, 2011 at 11:32 Comment(0)

6

Take a look at ParsCit:

This is the home page of the ParsCit project, which performs two tasks: 1) reference string parsing, sometimes also called citation parsing or citation extraction, and 2) logical structure parsing of scientific documents. It is architected as a supervised machine learning procedure that uses Conditional Random Fields as its learning mechanism. You can download the code below, parse strings online, or send batch jobs to our web service. The code contains the training data, the feature generator, and shell scripts to connect the system to a web service (used on this web site).
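To make the "feature generator" idea concrete, here is a minimal Java sketch of the kind of per-token features a CRF-based reference parser might compute before training or tagging. The class name, the feature names, and the feature set itself are assumptions made up for illustration; they are not ParsCit's actual features.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: per-token features of the kind a CRF tagger could consume.
// The feature names are invented for illustration and are not ParsCit's.
final class TokenFeatures {

    // Returns one feature list per whitespace-separated token of the reference string.
    static List<List<String>> extract(String reference) {
        List<List<String>> rows = new ArrayList<>();
        String trimmed = reference.trim();
        if (trimmed.isEmpty()) {
            return rows;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String tok = tokens[i];
            List<String> f = new ArrayList<>();
            f.add("word=" + tok.toLowerCase(Locale.ROOT));
            if (Character.isUpperCase(tok.charAt(0))) f.add("initCap");
            if (tok.matches(".*\\d.*")) f.add("hasDigit");
            // A four-digit year, optionally parenthesized and followed by punctuation.
            if (tok.matches("\\(?(19|20)\\d{2}\\)?[.,;]?")) f.add("yearLike");
            // A single capital letter plus period, e.g. an author initial.
            if (tok.matches("[A-Z]\\.,?")) f.add("initial");
            if (tok.endsWith(",")) f.add("endsComma");
            if (tok.endsWith(".")) f.add("endsPeriod");
            // Relative position in the string, bucketed into quarters (0..3).
            f.add("pos=" + (4 * i / tokens.length));
            rows.add(f);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up reference string, used only to show the output shape.
        String ref = "Smith, J. (2009). Parsing citations. Journal of Examples, 12(3), 45-67.";
        for (List<String> row : extract(ref)) {
            System.out.println(String.join(" ", row));
        }
    }
}
```

Each printed row would then be paired with a label (author, title, date, and so on) in the training data, and the CRF toolkit learns weights over these features.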

Master answered 16/9, 2011 at 11:53 Comment(1)

Thanks, that link also leads to some other interesting projects in the same domain. I'll check them out! – Srinagar 16/9, 2011 at 11:56

2

We recently faced a similar problem and ended up writing our own parser based on ParsCit, but using Wapiti instead of CRF++ for the conditional random fields model. As Mike mentions above, the problem with ML-based parsers is getting good tagged training data; for this we wrote a visual editor that lets you tag the results (and save them as training data). This approach works pretty well for parsing bibliographies.

If anyone is interested, we've made both the parser and the editor available at anystyle.io.

Sisterhood answered 20/5, 2014 at 10:12 Comment(0)

1

A list of projects is here: https://forums.zotero.org/discussion/1211/

cb2bib uses regexes: http://www.molspaces.com/cb2bib/

Citeseer uses a big list of author names and titles. You can have a look at their publication list.

Here is another project, though it is written in Python: https://code.google.com/p/pdfssa4met/

Also see this Stack Overflow question: Extracting information from PDFs of research papers
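For a feel of what the regex approach involves, here is a minimal Java sketch that handles exactly one APA-like layout ("Authors (Year). Title. Rest"). The pattern, the class name, and the sample strings are assumptions invented for illustration; this is not how cb2bib itself is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of regex-based citation parsing for a single layout.
final class RegexCitation {

    // Matches "Authors (YYYY). Title. Rest" and nothing else.
    private static final Pattern APA_LIKE = Pattern.compile(
        "^(?<authors>.+?)\\s+\\((?<year>\\d{4})\\)\\.\\s+(?<title>.+?)\\.\\s+(?<rest>.+)$");

    // Returns the extracted fields, or an empty map when the layout does not match.
    static Map<String, String> parse(String citation) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = APA_LIKE.matcher(citation.trim());
        if (m.matches()) {
            fields.put("authors", m.group("authors"));
            fields.put("year", m.group("year"));
            fields.put("title", m.group("title"));
            fields.put("rest", m.group("rest"));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Both strings are made-up examples.
        System.out.println(parse(
            "Smith, J. (2009). Parsing citations. Journal of Examples, 12(3), 45-67."));
        System.out.println(parse(
            "J. Smith, \"Parsing citations,\" Journal of Examples, vol. 12, pp. 45-67, 2009."));
    }
}
```

The first string matches the pattern, while the second (an IEEE-like layout) yields an empty map. Every additional citation style needs its own pattern, and a title containing a period followed by a space would be cut short by this one.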

Mervinmerwin answered 5/10, 2013 at 15:48 Comment(1)

Thanks, Max. We ended up coding our own HMM-based statistical recognizer. I think the regex approach is just too brittle. The difficulty now is getting good tagged training data. I suspect Citeseer's list could help. – Srinagar 7/10, 2013 at 0:33
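As a rough illustration of the HMM idea (not the recognizer described in the comment above), here is a toy Java sketch that labels tokens with Viterbi decoding. The three states, the three observation classes, and every probability are hand-picked assumptions; a real system would estimate them from tagged training data and use far richer observations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy HMM citation tagger. All numbers below are invented for illustration.
final class ToyCitationHmm {
    static final String[] STATES = {"AUTHOR", "YEAR", "TITLE"};

    // Observation classes: 0 = capitalized token, 1 = year-like token, 2 = anything else.
    static final double[] START = {0.8, 0.1, 0.1};
    static final double[][] TRANS = {   // TRANS[from][to]
        {0.70, 0.20, 0.10},
        {0.05, 0.05, 0.90},
        {0.05, 0.05, 0.90}
    };
    static final double[][] EMIT = {    // EMIT[state][observation class]
        {0.80, 0.05, 0.15},
        {0.05, 0.90, 0.05},
        {0.30, 0.05, 0.65}
    };

    static int observe(String token) {
        if (token.matches("\\(?(19|20)\\d{2}\\)?[.,;]?")) return 1;
        if (Character.isUpperCase(token.charAt(0))) return 0;
        return 2;
    }

    // Viterbi decoding in log space; returns one state label per token.
    static List<String> tag(String reference) {
        String trimmed = reference.trim();
        if (trimmed.isEmpty()) return new ArrayList<>();
        String[] tokens = trimmed.split("\\s+");
        int n = tokens.length, k = STATES.length;
        double[][] score = new double[n][k];
        int[][] back = new int[n][k];

        int o = observe(tokens[0]);
        for (int s = 0; s < k; s++) {
            score[0][s] = Math.log(START[s]) + Math.log(EMIT[s][o]);
        }
        for (int t = 1; t < n; t++) {
            o = observe(tokens[t]);
            for (int s = 0; s < k; s++) {
                int best = 0;
                for (int p = 1; p < k; p++) {
                    if (score[t - 1][p] + Math.log(TRANS[p][s])
                            > score[t - 1][best] + Math.log(TRANS[best][s])) {
                        best = p;
                    }
                }
                back[t][s] = best;
                score[t][s] = score[t - 1][best] + Math.log(TRANS[best][s])
                        + Math.log(EMIT[s][o]);
            }
        }
        // Pick the best final state, then follow the back-pointers.
        int last = 0;
        for (int s = 1; s < k; s++) {
            if (score[n - 1][s] > score[n - 1][last]) last = s;
        }
        String[] labels = new String[n];
        for (int t = n - 1; t >= 0; t--) {
            labels[t] = STATES[last];
            last = back[t][last];
        }
        return Arrays.asList(labels);
    }

    public static void main(String[] args) {
        // Made-up reference fragment.
        System.out.println(tag("Smith, J. (2009). Parsing scholarly citations."));
    }
}
```

With these hand-set numbers, the sample string is labeled AUTHOR for the two name tokens, YEAR for "(2009).", and TITLE for the rest. The quality of such a model depends almost entirely on the tagged data used to estimate the tables, which is the difficulty the comment points out.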

1

You can also try this little tool for parsing academic citations into fields:

http://citationparser.com

Citationparser.com is still in beta, but the 2017 version works well, especially for journal articles and also for monographs and book chapters.

The list doesn't have to be in one style; it can be a mixture of different official or unofficial styles.

You can walk through the references and check for full text, or you can export them as an EndNote file (.ENL). I developed this tool only for smaller lists of a few hundred titles. If you paste a list with more than 1000 titles, it will run much slower.

Siamang answered 16/1, 2017 at 12:43 Comment(0)

0

You could try looking into an indexing/searching library like Lucene.

Bukovina answered 16/9, 2011 at 11:38 Comment(1)

Thanks, I'm familiar with Lucene, but it doesn't really address this problem specifically. – Srinagar 16/9, 2011 at 11:46
