While I generally agree with Nikita that no particular CRF toolset is the source of the low accuracy, and that this is an issue with the solution approach, I'm not sure that the two-stage approach demonstrated by Park et al., while very accurate and effective once complete, is a practical approach to your problem.
First, the "two stages" referred to in the paper are a paired SVM and CRF, which are not easy to set up on the fly if this is not your main area of study. Each requires training on labelled data and a degree of tuning.
Second, it is unlikely that your actual data (based on your description above) is as differentially structured as what this particular solution was designed to cope with while still maintaining high accuracy, in which case this level of supervised learning is unnecessary.
If I may propose a domain-specific solution with many of the same features that should be far easier to implement in whatever tool you're using: I would try a (restricted) semantic-tree approach that is semi-supervised, specifically exception- (error-) advised.
Instead of an English sentence, your data molecule is a bibliographic entry. The parts of this molecule that must be present are the author part, the title part, the date part, and the publisher part; there may also be other data parts (page number, volume ID, etc.).
Some of these parts may be nested inside one another (e.g. the page number inside the publisher part) or appear in a varied order while still being operationally valid, which is a good indicator for using semantic trees.
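As a minimal sketch of what such a tree might look like (the part labels and the sample entry here are illustrative assumptions, not from any particular tool):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Part:
    """A labelled span of a bibliographic entry; children hold nested parts."""
    label: str                  # e.g. "author", "title", "date", "publisher"
    text: str
    children: List["Part"] = field(default_factory=list)

# A parsed entry is just a tree of labelled spans; note the page part
# nested inside the publisher part, in line with the nesting described above.
entry = Part("entry", 'Blow, J. (2001). "A Title". Acme Press, pp. 10-20.', [
    Part("author", "Blow, J."),
    Part("date", "(2001)"),
    Part("title", '"A Title"'),
    Part("publisher", "Acme Press, pp. 10-20.", [
        Part("pages", "pp. 10-20."),
    ]),
])
```

The point of the tree is that a varied ordering of the same parts is just a reordering of children, not a different structure to learn from scratch.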
Further, each part, although variable, has unique characteristics: the author part uses personal-name formats (e.g. Blow, J. or James, et al.); the title part is quoted or italicized and has standard sentence structure; the date part uses date formats, is often enclosed in parentheses, etc. This means you need less overall training than for tokenized, unstructured analysis; in the end, that is less learning for your program.
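Those per-part characteristics can often be captured with a handful of patterns before any learning happens at all. A rough sketch (these regexes are illustrative assumptions; tune them against your own data):

```python
import re

# Illustrative patterns only -- adapt to the formats in your corpus.
PART_PATTERNS = {
    "author": re.compile(r"^[A-Z][a-z]+,\s*[A-Z]\.?(,?\s*et al\.?)?"),  # Blow, J., et al.
    "date":   re.compile(r"\((19|20)\d{2}\)"),                          # (1999), (2023)
    "title":  re.compile(r'"[^"]+"'),                                   # quoted title
    "pages":  re.compile(r"pp?\.\s*\d+(\s*-\s*\d+)?"),                  # p. 12 / pp. 10-20
}

def label_spans(entry: str):
    """Return (label, matched_text) for every part pattern that fires."""
    return [(label, m.group(0))
            for label, pat in PART_PATTERNS.items()
            if (m := pat.search(entry))]
```

For example, `label_spans('Blow, J., et al. (2001). "A Title". Acme Press, pp. 10-20.')` picks out all four parts, which then become candidate nodes for the tree.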
Additionally, there are structural relations that can be learned to improve accuracy, for example: the date part often sits at the end or separates key sections; the author part usually comes at the beginning, or else after the title; and so on. This is further supported by the fact that many associations and publishers have their own way of formatting such references, and these conventions can be learned by relation without much training data.
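Even without training data, those positional tendencies can be encoded as a simple tie-breaking score. A minimal sketch, with expected positions I've assumed for illustration rather than learned from any corpus:

```python
def positional_score(label: str, start: int, entry_length: int) -> float:
    """Score how plausible a label is at a given offset within the entry.

    The expected relative positions below are assumed heuristics:
    authors tend to lead, dates come early, publishers come late.
    """
    pos = start / max(entry_length, 1)
    expected = {"author": 0.0, "date": 0.25, "title": 0.5, "publisher": 0.8}
    return 1.0 - abs(pos - expected.get(label, 0.5))
```

When two patterns fire on the same span (say, an author-like token inside a title), the label with the higher positional score wins; with labelled examples, the `expected` table itself can be re-estimated per publisher.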
So, to sum up: by segmenting the parts and doing structured learning, you reduce the pattern matching within each sub-part, and the learning is relegated to relational patterns, which are more reliable, since that is how we construct such entries as humans.
Also, there are a ton of tools for this sort of domain-specific semantic learning:
http://www.semantic-measures-library.org/
http://wiki.opensemanticframework.org/index.php/Ontology_Tools
Hope that helps :)