Python Download file with Pandas / Urllib
I am trying to download a CSV file with Python 3.x. The path of the file is: https://www.nseindia.com/content/fo/fo_mktlots.csv

I have found three ways to do it, but only one of them works. I wanted to know why the others fail, or what I am doing wrong.

  1. Method 1: (Unsuccessful)

    import pandas as pd
    
    mytable = pd.read_table("https://www.nseindia.com/content/fo/fo_mktlots.csv",sep=",")
    print(mytable)
    

    But I am getting the following error :

    - HTTPError: HTTP Error 403: Forbidden
    
  2. Method 2: (Unsuccessful)

    from urllib.request import Request, urlopen
    
    url='https://www.nseindia.com/content/fo/fo_mktlots.csv'
    
    url_request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(url_request).read()
    

    I got the same error as before:

     - HTTPError: HTTP Error 403: Forbidden
    
  3. Method 3: (Successful)

    import requests
    import pandas as pd
    from io import StringIO  # needed for StringIO below
    
    url = 'https://www.nseindia.com/content/fo/fo_mktlots.csv'
    
    r = requests.get(url)
    df = pd.read_csv(StringIO(r.text))
    

I am also able to open the file with Excel VBA as below:

Workbooks.Open Filename:="https://www.nseindia.com/content/fo/fo_mktlots.csv"

Also, is there any other method to do the same?

Resound answered 29/1, 2017 at 8:57 Comment(1)
Sniffing the request with Wireshark points to an "Encrypted Alert" when using your second script. Maybe you will have to configure your socket more deeply before making the request. – Cote
The website tries to prevent content scraping.

The issue is not what you are doing wrong; it is how the web server is configured and how it behaves in various situations.

To get past the scraping protection, send well-defined HTTP request headers. The best way to do so is to send the complete set of headers a real web browser would.

Here it works with a minimal set:

>>> myHeaders = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36', 'Referer': 'https://www.nseindia.com', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
>>> url_request = Request(url, headers=myHeaders)
>>> html = urlopen(url_request).read()
>>> len(html)
42864
>>> 

You can also pass the urllib response object straight to pandas:

>>> import pandas as pd
...
>>> url_request = Request(url, headers=myHeaders)
>>> data = urlopen(url_request)
>>> my_table = pd.read_table(data)
>>> len(my_table)
187
Hairline answered 29/1, 2017 at 13:24 Comment(7)
Thanks! That worked. Will I need to update 'myHeaders' every few months or so, when the browser version changes? Or does that also depend on how the web server is configured? Any idea how it would be done with 'Method 1'? – Resound
Edited: pass urllib to pandas. – Hairline
Could you please elaborate? I googled a lot and couldn't find anything on how to pass urllib to pandas. I couldn't find any parameter in pd.read_csv either. Sorry for my ignorance :( – Resound
docs ... 'or any object with a read() method' – Hairline
Woo-hoo! Thanks, got it! I wasn't aware of this property of the 'filepath_or_buffer' parameter. Highly appreciate the quick response. – Resound
Lastly, any idea if I will need to update 'myHeaders' every few months or so, when the browser version changes? Or does that also depend on how the web server is configured? – Resound
For now it seems fine. No need to change anything ... just when it fails. – Hairline
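To illustrate the "any object with a read() method" point from the docs: `filepath_or_buffer` accepts any file-like object, so an in-memory buffer works just like the urllib response above. A minimal sketch with made-up lot-size data (the symbols and numbers are illustrative, not real NSE values):

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for the downloaded fo_mktlots.csv.
csv_text = "symbol,lot_size\nNIFTY,75\nBANKNIFTY,25\n"

# io.StringIO exposes a read() method, so pandas treats it like an open file.
buffer = io.StringIO(csv_text)
df = pd.read_csv(buffer)
print(df)
print(len(df))  # 2 rows
```

The same mechanism is why `pd.read_table(urlopen(url_request))` works: the HTTP response object also has a `read()` method.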
Since pandas 1.2, it is possible to tune the underlying reader by adding header fields as dictionary keys to the storage_options parameter of read_table (and read_csv). So by invoking it with

import pandas as pd


url = ''
storage_options = {'User-Agent': 'Mozilla/5.0'}
df = pd.read_table(url, storage_options=storage_options)

the library will include the User-Agent header in the request, so you don't have to set it up externally before invoking read_table.
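The mechanism can be demonstrated without hitting the real site. In the sketch below, a throwaway local HTTP server (a stand-in for a scraping-protected site, not NSE's actual behaviour) refuses any client whose User-Agent does not look like a browser; pandas' default urllib agent is rejected, while the same call with `storage_options` succeeds (assumes pandas >= 1.2):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import pandas as pd

# Made-up CSV body standing in for fo_mktlots.csv.
CSV_BODY = b"symbol,lot_size\nNIFTY,75\nBANKNIFTY,25\n"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Mimic a scraping guard: refuse clients without a browser-like UA.
        if "Mozilla" not in self.headers.get("User-Agent", ""):
            self.send_error(403, "Forbidden")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(CSV_BODY)))
        self.end_headers()
        self.wfile.write(CSV_BODY)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging

# Port 0 lets the OS pick a free port.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/fo_mktlots.csv"

# Without storage_options, urllib's "Python-urllib" agent is refused.
try:
    pd.read_csv(url)
except Exception as exc:
    print("no headers ->", exc)

# With storage_options, pandas sends the header and the read succeeds.
df = pd.read_csv(url, storage_options={"User-Agent": "Mozilla/5.0"})
server.shutdown()
print(df)
```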

Disenthrone answered 17/8, 2021 at 11:55 Comment(0)
