I am using PyCharm Community Edition 2020.3.2, scholarly version 1.0.2, and Tor version 1.0.0. I am trying to scrape 700 articles to find their citation counts. Google Scholar has blocked me from using search_pubs (a scholarly function), although another scholarly function, search_author, still works fine. In the beginning, search_pubs worked properly. I tried this code, followed by the citation-count loop sketched after it:
from scholarly import scholarly
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')
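For context, this is roughly the loop I was running over all 700 titles. I am assuming here that each result behaves like a dict with a 'num_citations' field, which matches what I saw while search_pubs still worked:
from scholarly import scholarly

titles = ['Large Batch Optimization for Deep Learning: Training BERT in 76 minutes']  # ... plus ~700 more
citations = {}
for title in titles:
    result = next(scholarly.search_pubs(title))  # take the top search hit
    citations[title] = result['num_citations']   # citation count, assuming the dict layout I saw
print(citations)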
After a few trials, it showed the error below.
Traceback (most recent call last):
File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\IPython\core\interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-9-3bbcfb742cb5>", line 1, in <module>
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')
File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_scholarly.py", line 121, in search_pubs
return self.__nav.search_publications(url)
File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_navigator.py", line 256, in search_publications
return _SearchScholarIterator(self, url)
File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\publication_parser.py", line 53, in __init__
self._load_url(url)
File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\publication_parser.py", line 58, in _load_url
self._soup = self._nav._get_soup(url)
File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_navigator.py", line 200, in _get_soup
html = self._get_page('https://scholar.google.com{0}'.format(url))
File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_navigator.py", line 152, in _get_page
raise Exception("Cannot fetch the page from Google Scholar.")
Exception: Cannot fetch the page from Google Scholar.
Then I figured out the reason: Google wants me to solve a CAPTCHA before it will keep serving results. Many people suggest using a proxy, since Google has blocked my IP. I tried to change the proxy using FreeProxies():
from scholarly import scholarly, ProxyGenerator
pg = ProxyGenerator()
pg.FreeProxies()
scholarly.use_proxy(pg)
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')
This does not work, and PyCharm freezes for a long time. I also checked whether the proxy setup ever reports success, as sketched below.
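If I read the scholarly docs correctly, the proxy setup methods return a success flag, so I tried printing it (this is my reading of the API, not something the errors confirm):
from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
success = pg.FreeProxies()  # should be True once a working proxy is found, per my reading of the docs
print(success)              # never reached for me; the call itself hangs
scholarly.use_proxy(pg)
Then I installed Tor (pip install Tor) and tried again: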
from scholarly import scholarly, ProxyGenerator
pg = ProxyGenerator()
pg.Tor_External(tor_sock_port=9050, tor_control_port=9051, tor_password="scholarly_password")
scholarly.use_proxy(pg)
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')
It does not work either. To rule out Tor itself, I checked that something is actually listening on the SOCKS port, as sketched below.
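This check is my own addition, plain sockets rather than anything from scholarly, just to confirm the Tor SOCKS port is reachable:
import socket

# Sanity check: is anything listening on the Tor SOCKS port?
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(3)
port_open = sock.connect_ex(('127.0.0.1', 9050)) == 0  # connect_ex returns 0 on success
sock.close()
print('Tor SOCKS port open:', port_open)
With that checked, I tried SingleProxy():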
from scholarly import scholarly, ProxyGenerator
pg = ProxyGenerator()
pg.SingleProxy(https='socks5://127.0.0.1:9050', http='socks5://127.0.0.1:9050')
scholarly.use_proxy(pg)
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')
It also does not work. As a last check, I looked at whether my traffic actually goes out through the SOCKS proxy at all, as sketched below.
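This check is also my own addition (it needs pip install requests[socks]): it asks an IP-echo service which address the traffic comes from, with and without the proxy:
import requests

# My own check, not part of scholarly: compare the visible IP with and
# without the SOCKS proxy (requires: pip install requests[socks]).
proxies = {'http': 'socks5://127.0.0.1:9050',
           'https': 'socks5://127.0.0.1:9050'}
print(requests.get('https://api.ipify.org', timeout=10).text)                   # direct IP
print(requests.get('https://api.ipify.org', proxies=proxies, timeout=10).text)  # IP via the proxy
I have never tried Luminati, since I am not familiar with it. If anyone knows a solution, please help!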