CiteSeerX search API

Is there a way to access CiteSeerX programmatically (e.g. search by author and/or title)? Surprisingly, I cannot find anything relevant; surely others are also trying to get scholarly article metadata without resorting to scraping?

EDIT: note that CiteSeerX supports OAI-PMH, but that seems to be an API geared towards digital libraries keeping up to date with each other ("content dissemination") and does not specifically support search. Moreover, the CiteSeerX information on that page is very sparse and even says "Currently, there are difficulties with the OAI".

There is another SO question about the CiteSeerX API (though not specifically about search); its two answers do not resolve the problem (one talks about Mendeley, a different piece of software, and the other says OAI-PMH implementations are free to offer extensions to the minimal spec).

Alternatively, can anyone suggest a good way to obtain citations from authors/titles programmatically?

Reset answered 29/12, 2012 at 19:56 Comment(5)
JabRef has CiteSeerX support. Look at their Git repository to see how they do it: jabref.sourceforge.net/download.php Is JabRef possibly the answer to your real problem, i.e. reference management? - Lennon
I would suggest scraping their webpage and writing your own XQuery engine to be able to do that reliably. - Madancy
Thanks for JabRef, @marek-cruz. Yes, I see that they scrape too (CiteSeerXFetcher.java). I'm surprised that CiteSeerX doesn't have an API (and that they don't clearly state the situation on their site, one way or the other). - Reset
My own XQuery expression, I presume, @Madancy :) I will try to see if I can reuse JabRef in my scripts (it does have a batch mode). - Reset
@Reset You're welcome. For the record, here is the JabRef implementation in case someone else needs it: sourceforge.net/p/jabref/code/ci/… - Lennon

As suggested by one of the commenters, I tried jabref first:

jabref -n -f "citeseer:title:(lessons from) author:(Brewer)"

However, jabref does not seem to realize that the query string itself needs to contain colons, and so it throws an error.

For search results, I ended up scraping the CiteSeerX results with Python's BeautifulSoup:

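from urllib.request import urlopen  # Python 3; the original snippet used Python 2's urllib2
from bs4 import BeautifulSoup

# author_last and title are the search inputs, e.g. "Brewer" and "lessons from"
url = "http://citeseerx.ist.psu.edu/search?q="
q = "title%3A%28{1}%29+author%3A%28{0}%29&submit=Search&sort=cite&t=doc"
url += q.format(author_last, title.replace(" ", "+"))
soup = BeautifulSoup(urlopen(url).read(), "html.parser")
# First hit: the first <div> inside <div id="result_list">
result = soup.html.body("div", id="result_list")[0].div
title = result.h3.a.string.strip()
authors = result("span", "authors")[0].string
authors = authors[len("by "):].strip()  # drop the leading "by "
date = result("span", "pubyear")[0].string.strip(", ")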
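
Hand-written percent escapes such as %3A and %28 are easy to get wrong, so here is a minimal sketch that lets the standard library do the encoding. The helper name build_search_url is made up for illustration, and the endpoint and parameter names are simply copied from the snippet above; the site may no longer answer at that address.

```python
from urllib.parse import urlencode

# Endpoint and parameters copied from the scraping snippet; not a documented API.
SEARCH_URL = "http://citeseerx.ist.psu.edu/search"


def build_search_url(author_last, title):
    """Return the search URL for a title/author query (hypothetical helper)."""
    # Same query syntax as the jabref attempt: title:(...) author:(...)
    query = "title:({}) author:({})".format(title, author_last)
    params = [("q", query), ("submit", "Search"), ("sort", "cite"), ("t", "doc")]
    # urlencode escapes ":" as %3A, "(" as %28, ")" as %29 and spaces as "+"
    return SEARCH_URL + "?" + urlencode(params)


print(build_search_url("Brewer", "lessons from"))
```

The resulting string can be passed to urlopen in place of the manually formatted URL.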

It is possible to get a document ID from the results (the misleadingly named "doi=..." part of the summary link URL) and then pass it to the CiteSeerX OAI engine to get Dublin Core XML (e.g. http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:CiteSeerX.psu:10.1.1.42.2177); however, that XML ends up containing multiple dc:date elements, which makes it less useful than the scrape output.
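
As a rough sketch of that route: the code below pulls the "doi=..." value out of a summary link, builds the GetRecord URL in the form shown above, and lists every dc:date found in a response. The function names, the "/viewdoc/summary?doi=..." link shape and the SAMPLE record are assumptions made up for illustration, not real CiteSeerX output; only the OAI URL format and the Dublin Core namespace are taken as given.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlencode, urlparse

OAI_URL = "http://citeseerx.ist.psu.edu/oai2"
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}  # Dublin Core namespace


def oai_record_url(summary_link):
    """Build a GetRecord URL from a link carrying a doi=... parameter."""
    doc_id = parse_qs(urlparse(summary_link).query)["doi"][0]
    params = {
        "verb": "GetRecord",
        "metadataPrefix": "oai_dc",
        "identifier": "oai:CiteSeerX.psu:" + doc_id,
    }
    # safe=":" keeps the colons in the identifier unescaped, as in the URL above
    return OAI_URL + "?" + urlencode(params, safe=":")


def dc_dates(xml_text):
    """Return the text of every dc:date element, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//dc:date", DC_NS)]


# Invented record, only to show the multiple-date situation.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title</dc:title>
  <dc:date>2009-04-19</dc:date>
  <dc:date>2000</dc:date>
</record>"""

print(oai_record_url("/viewdoc/summary?doi=10.1.1.42.2177"))
print(dc_dates(SAMPLE))
```

Which of the returned dates is the publication date still has to be decided by the caller, which is the inconvenience described above.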

Too bad CiteSeerX makes people resort to scraping in spite of all the open archives / open access rhetoric.

Reset answered 31/12, 2012 at 14:3 Comment(1)
I wouldn't worry too much about it, as CiteSeerX's old links have all been offline again since at least the start of 2024. So if you can't trust them to keep those old links working, why would you trust their new ones? In another 20 years (if they survive) they may decide to abandon those too... - Ligature
