Is there a way to download bibtex from Google Scholar using PHP

Hi, is there a way to download BibTeX entries from Google Scholar using PHP, without having to download each one manually? For example, setting a search value like "research" and then automatically downloading the related BibTeX entries from the result links through code.

Any help would be appreciated. I tried fetching the HTML page, but the "Import into BibTeX" link is missing from the retrieved page contents.

My code:

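<?php
// Use plain "&" separators: "&amp;" belongs in HTML, not in a URL passed to file_get_contents().
$url = 'http://scholar.google.com/scholar?q=honors+college&hl=en&btnG=Search&as_sdt=1%2C4&as_sdtp=on';
$needle = 'Import into bibtex';
$contents = file_get_contents($url);
echo $contents;
// stripos() is case-insensitive, so "bibtex" also matches "BibTeX".
if ($contents !== false && stripos($contents, $needle) !== false) {
    echo 'found';
} else {
    echo 'not found';
}
?>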
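
Hand-written query strings are easy to get wrong: HTML-escaped `&amp;` separators or stray spaces send different parameters than intended. A minimal sketch of building the same search URL with PHP's http_build_query(), assuming the parameter names and values from the URL in the question. It only builds the URL and does not make the "Import into BibTeX" link appear.

```php
<?php
// Query parameters taken from the URL in the question.
$params = [
    "q"       => "honors college",
    "hl"      => "en",
    "btnG"    => "Search",
    "as_sdt"  => "1,4",
    "as_sdtp" => "on",
];

// Pass "&" explicitly so the separator does not depend on the
// arg_separator.output ini setting.
$url = "http://scholar.google.com/scholar?" . http_build_query($params, "", "&");

echo $url, "\n";
// prints: http://scholar.google.com/scholar?q=honors+college&hl=en&btnG=Search&as_sdt=1%2C4&as_sdtp=on
```

Spaces are encoded as `+` and the comma as `%2C`, matching the hand-written query string.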
Anissaanita answered 21/11, 2011 at 20:0 Comment(2)
A lot of Google's web-based interfaces are heavily JavaScript dependent, which your screen scraper can't handle. You'd have to figure out what's happening in the background to replicate it via scripting. - Dietetics
I think the "Import into BibTeX" link is only displayed when you're logged in. Try to log in to Google (which I don't know how to do programmatically) and then fetch the Scholar page. - Walrus

The short answer is no, you cannot do this.

Google does not provide APIs for Search or Scholar and enforces strict rate limiting. The problem is that for each BibTeX entry you need two additional requests on top of the one for the query (one for the 'import link' and a final one to get the actual BibTeX entry content).

I wrote a script that scrapes Google Scholar results, finds the BibTeX links, and saves the results. However, because of the rate limit it is not viable: it gets blocked almost instantly.

The code can be viewed here: https://gist.github.com/Tessmore/11099509. It is free to use, but at your own risk.

Cree answered 19/4, 2014 at 22:28 Comment(0)

As Tessmore said, you can't do this directly. You can make it work, though, by using the Google Scholar Organic Results API from SerpApi, which bypasses quota limits and blocks from search engines, so you don't have to think about how to reduce the chance of being blocked.

Install google-search-results-php package first via composer:

$ composer require serpapi/google-search-results-php:2.0

Code to integrate and full example in the online IDE:

<?php
ini_set("display_errors", 1);
ini_set("display_startup_errors", 1);
error_reporting(E_ALL);

require __DIR__ . "/vendor/autoload.php";

function getResultIds () {
    $result_ids = array();

    $params = [
        "engine" => "google_scholar", // parsing engine
        "q" => "biology"              // search query
    ];
    
    $search = new GoogleSearch(getenv("API_KEY"));
    $response = $search->get_json($params);
    
    foreach ($response->organic_results as $result) {
        // print_r($result->result_id);
        
        array_push($result_ids, $result->result_id);
    }

    return $result_ids;
}

function getBibtexData () {
    $bibtex_data = array();

    foreach (getResultIds() as $result_id) {
        $params = [
            "engine" => "google_scholar_cite",  // parsing engine
            "q" => $result_id
        ];
    
        $search = new GoogleSearch(getenv("API_KEY"));
        $response = $search->get_json($params);

        foreach ($response->links as $result) {
            if ($result->name === "BibTeX") {
                array_push($bibtex_data, $result->link);
            }
        }
    }
    
    return $bibtex_data;
}

print_r(json_encode(getBibtexData(), JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES));
?>

Output:

[
    "https://scholar.googleusercontent.com/scholar.bib?q=info:KNJ0p4CbwgoJ:scholar.google.com/&output=citation&scisdr=CgXjqB_WGAA:AAGBfm0AAAAAYkm8amenawYn_EBidiCQT5QBh0L1KJEX&scisig=AAGBfm0AAAAAYkm8at9X4P3eIWKUCOc6UriCEDKVsQE0&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:6zRLFbcxtREJ:scholar.google.com/&output=citation&scisdr=CgWhqfi6GAA:AAGBfm0AAAAAYkm8bDoIhTlfTkQFCOzYGax54Bst576o&scisig=AAGBfm0AAAAAYkm8bMe_7Nq4e4pB5lg_eR9jmeGrO8ek&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:6Yb0qOX88FMJ:scholar.google.com/&output=citation&scisdr=CgXn_4MdGAA:AAGBfm0AAAAAYkm8bi8ypCZcFDNEQZYZeoSlvx-U1OSk&scisig=AAGBfm0AAAAAYkm8bnFMnwTWGfkfJDCNEx0C4n-aQwql&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:HFdEElNr3IgJ:scholar.google.com/&output=citation&scisdr=CgXKCFpQGAA:AAGBfm0AAAAAYkm8byukcQCl4WHQx-nSNp2pC1gUFSKG&scisig=AAGBfm0AAAAAYkm8b8EReTVkLwtxfth_pjwMyyY3dqts&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:bs-D_MeC14YJ:scholar.google.com/&output=citation&scisdr=CgXEUXwWGAA:AAGBfm0AAAAAYkm8bwwfMNJrffe16EaGypsem9JlmGTi&scisig=AAGBfm0AAAAAYkm8b6nWlPOQL63fXg6dV2U-JQbpyQyS&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:Rn1qFVLRfKwJ:scholar.google.com/&output=citation&scisdr=CgU-HswkGAA:AAGBfm0AAAAAYkm8cHE1YRK23eHV8nzF89Eem-Bsuz72&scisig=AAGBfm0AAAAAYkm8cDEj8ZrzZjAo2bNX-tjYYYJYQZay&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:d8thHtTwq6YJ:scholar.google.com/&output=citation&scisdr=CgXj7oe9GAA:AAGBfm0AAAAAYkm8cTYamCKGKImjdg5MQdgbxUIIHAEY&scisig=AAGBfm0AAAAAYkm8cTcop1ceKzKYvKAKtvlSQ1EdEtSN&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:IUmhOhGaDaEJ:scholar.google.com/&output=citation&scisdr=CgU0qZ2_GAA:AAGBfm0AAAAAYkm8ctCPwoihZkjbNcdEqSnwa0J3jwDy&scisig=AAGBfm0AAAAAYkm8cingBcYnEp8YRqFDFdN-FAEBgDT7&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:PWsf8O5OMQEJ:scholar.google.com/&output=citation&scisdr=CgVBAJxXGAA:AAGBfm0AAAAAYkm8c3CDKQG0Wh_lWsXU_DZxEJkwZz5y&scisig=AAGBfm0AAAAAYkm8c6I-HjAxD1Gy6FLFDRdxH_qU4OBr&scisf=4&ct=citation&cd=-1&hl=en",
    "https://scholar.googleusercontent.com/scholar.bib?q=info:yGvgHH8ROuIJ:scholar.google.com/&output=citation&scisdr=CgXFuhOkGAA:AAGBfm0AAAAAYkm8dD0rcSR4LQF8GgTxx865BADtXNDN&scisig=AAGBfm0AAAAAYkm8dIQhodz3rHF9IUdaCSRlhdudACNQ&scisf=4&ct=citation&cd=-1&hl=en"
]

BibTeX data from the first URL:

@article{woese2004new,
  title={A new biology for a new century},
  author={Woese, Carl R},
  journal={Microbiology and molecular biology reviews},
  volume={68},
  number={2},
  pages={173--186},
  year={2004},
  publisher={Am Soc Microbiol}
}
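
The comments below suggest the links stop working after a while, so each one probably has to be fetched right after it is generated. A minimal sketch of that step, followed by a simple parser for the downloaded entry. Both `fetchBibtex()` and `parseBibtexEntry()` are hypothetical helper names, not part of SerpApi or its PHP package, and the parser only handles flat `name={value}` fields like the ones in the entry above.

```php
<?php
// Hypothetical helper: download one BibTeX link immediately.
// Returns the raw BibTeX text, or null on failure (for example a 403).
function fetchBibtex(string $url): ?string {
    $context = stream_context_create(["http" => ["timeout" => 10]]);
    // file_get_contents() returns false on HTTP errors; @ hides the warning.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Hypothetical helper: parse a single BibTeX entry into an array.
// Assumption: values are wrapped in one pair of braces with no nested braces.
function parseBibtexEntry(string $bibtex): ?array {
    // Entry header, e.g. "@article{woese2004new,"
    if (!preg_match('/@(\w+)\s*\{\s*([^,\s]+)\s*,/', $bibtex, $head)) {
        return null;
    }
    $entry = ["type" => strtolower($head[1]), "key" => $head[2], "fields" => []];

    // Fields, e.g. "title={A new biology for a new century}"
    preg_match_all('/(\w+)\s*=\s*\{([^{}]*)\}/', $bibtex, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $entry["fields"][strtolower($m[1])] = $m[2];
    }
    return $entry;
}

// Offline demonstration using the entry shown above (no network request).
$sample = <<<'BIB'
@article{woese2004new,
  title={A new biology for a new century},
  author={Woese, Carl R},
  journal={Microbiology and molecular biology reviews},
  volume={68},
  number={2},
  pages={173--186},
  year={2004},
  publisher={Am Soc Microbiol}
}
BIB;

$entry = parseBibtexEntry($sample);
echo $entry["key"], ": ", $entry["fields"]["title"], "\n";
// prints: woese2004new: A new biology for a new century
```

In practice, `fetchBibtex()` would be called on each link inside the loop that collects them, rather than after the whole list has been built. A `null` result would then indicate an expired or blocked link.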

Disclaimer: I work for SerpApi.

Aboveground answered 16/3, 2022 at 10:59 Comment(12)
When trying to download the BibTeX by following the links, I still run into a 403. - Mazzola
@Mazzola could you check one more time on Replit? The reason could be that I changed my API key, which is stored in the Replit env file. I also added a GIF showing the code being executed. - Aboveground
The issue is not with the code execution, it's when actually trying to retrieve the BibTeX entries using the URLs it generates. I can even paste one of your URLs (https:\/\/scholar.googleusercontent.com\/scholar.bib?q=info:YnWp49O_RTMJ:scholar.google.com\/&output=citation&scisdr=CgXCiln7GAA:AAGBfm0AAAAAYjHB1PjuGwPWg-Oc1PTDkki_-3T_pD2o&scisig=AAGBfm0AAAAAYjHB1OoX_TdI3yhMKMvdA1dCMdNG0sfZ&scisf=4&ct=citation&cd=-1&hl=en) in my browser now and get a 403. This was the first one I tried today. I wonder whether Google blocks requests for BibTeX that do not request the citation list first? - Mazzola
@Mazzola my bad, I forgot to add JSON_UNESCAPED_SLASHES so that json_encode() does not escape /. You can try to run it one more time, or have a look at the attached GIF above. Thank you for the clarification. - Aboveground
That does not change the result: if I just copy-paste a URL you extract (scholar.googleusercontent.com/…) into the browser, I get a 403. - Mazzola
@Mazzola I don't get it. I just tried it one more time: ran it on Replit and opened each URL from the terminal output. Every link returned a 200 with BibTeX data. One guess is that those links expire after some time, or it is something else. I updated the GIF to show actually clicking on the first URL from the terminal output. - Aboveground
That must be it! I have to use the link right away. Thanks!! - Mazzola
@Mazzola Of course, hope it helps ;) - Aboveground
There does seem to be some sort of counter too: I can only download a few citations a day before getting a 403... - Mazzola
@Mazzola if you're using SerpApi, it shouldn't have any sort of limits. It uses dedicated proxies and a captcha solver. Feel free to open an issue (if using SerpApi) with the detailed problem. Here's how to report an issue. You'll get a faster solution that way than in the comments here. - Aboveground
I am using SerpApi to get the BibTeX link, but to retrieve the BibTeX itself I then use that link directly. Is there a way to retrieve the BibTeX (i.e., get the content from the links that you print in your script) through SerpApi? - Mazzola
@Mazzola A late reply. Currently, it's not available. There's an open issue on the SerpApi public-roadmap, and I wrote a workaround for it that makes another request using requests. - Aboveground
