Download file from Blob URL with Python
I wish to have my Python script download the Master data (Download, XLSX) Excel file from this Frankfurt stock exchange webpage.

When I try to retrieve it with urllib or wget, it turns out that the URL leads to a Blob, and the downloaded file is only 289 bytes and unreadable.

http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx

I'm entirely unfamiliar with Blobs and have these questions:

  • Can the file "behind the Blob" be successfully retrieved using Python?

  • If so, is it necessary to uncover the "true" URL behind the Blob – if there is such a thing – and how? My concern here is that the link above won't be static but will actually change often.

Mccomas answered 15/9, 2016 at 18:0 Comment(0)
That 289-byte file is probably the HTML of a 403 Forbidden page. This happens because the server rejects requests whose code does not specify a user agent.

Python 3

# python3
import urllib.request as request

url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
# fake user agent of Safari
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = request.Request(url, headers={'User-Agent': fake_useragent})
f = request.urlopen(r)

# print or write
print(f.read())

Python 2

# python2
import urllib2

url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
# fake user agent of safari
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'

r = urllib2.Request(url, headers={'User-Agent': fake_useragent})
f = urllib2.urlopen(r)

print(f.read())
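To save the download to disk instead of printing it, a minimal Python 3 sketch using the same URL and fake user agent as above (the output filename is arbitrary):

```python
import shutil
import urllib.request as request

url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'

req = request.Request(url, headers={'User-Agent': fake_useragent})
# stream the response body straight into a local file
with request.urlopen(req) as response, open('All-tradable-ETFs.xlsx', 'wb') as out:
    shutil.copyfileobj(response, out)
```

`shutil.copyfileobj` copies in chunks, so large files are not held in memory all at once.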
Undeviating answered 15/9, 2016 at 18:15 Comment(2)
Thanks for the answer. I need to be able to download the file to disk (preferably while being able to overwrite), not just read it.Mccomas
That is just an example. Once you call f.read(), you can write the bytes into another file. The point is using a fake user agent to retrieve the Excel file; after that, it is ordinary file manipulation.Undeviating
from bs4 import BeautifulSoup
import requests
import re

url = 'http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs'
html = requests.get(url)
page = BeautifulSoup(html.content, 'html.parser')
reg = re.compile('Master data')
find = page.find('span', text=reg)  # find the <span> labelled "Master data"
file_url = 'http://www.xetra.com' + find.parent['href']  # its parent <a> holds the link
file = requests.get(file_url)
with open(r'C:\Users\user\Downloads\file.xlsx', 'wb') as ff:
    ff.write(file.content)

I recommend requests and BeautifulSoup; both are good libraries.
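Because the blob URL is re-scraped from the page on every run, it does not matter if it changes. If the visible "Master data" label ever changes, the link can instead be located by its file extension; a sketch (assumption: the page has exactly one anchor ending in `.xlsx`, which may not hold if more files are added):

```python
from bs4 import BeautifulSoup
import requests

url = 'http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs'
page = BeautifulSoup(requests.get(url).content, 'html.parser')

# collect every href that points at an .xlsx file
xlsx_links = [a['href'] for a in page.find_all('a', href=True)
              if a['href'].endswith('.xlsx')]
```

This trades sensitivity to the label text for sensitivity to the number of spreadsheet links on the page; neither approach survives an arbitrary redesign.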

Haynor answered 15/9, 2016 at 18:33 Comment(3)
I'm considering this solution. Will it be more robust in case the "blob URL" changes? (Do they?)Mccomas
Of course, as long as the structure of the page does not change.@WinterflagsHaynor
It seems that this doesn't work if the url is like "blob:blabla.com....". Any idea how to handle this? Also, why did you pass "span" as an argument to page.find("span", text=reg) ?Morgan
For me, the target download URL looks like: blob:https://jdc.xxxx.com/2c21275b-c1ef-487a-a378-1672cd2d0760

I tried writing the original response to a local .xlsx file and found that it worked.

import requests

# `data` is the JSON payload the endpoint expects (defined elsewhere in my script)
r = requests.post('http://jdc.xxxx.com/rawreq/api/handler/desktop/export-rat-view?flag=jdc',
                  json=data, headers={'content-type': 'application/json'})
file_name = 'file_name.xlsx'
with open(file_name, 'wb') as f:
    for chunk in r.iter_content(100000):
        f.write(chunk)
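The same chunked-write pattern works for a plain GET, which is what the original question needs; a sketch using the URL from the question (the user agent and chunk size are arbitrary choices):

```python
import requests

url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'
headers = {'User-Agent': 'Mozilla/5.0'}  # some servers reject requests without one

r = requests.get(url, headers=headers, stream=True)
r.raise_for_status()  # fail early on 403/404 instead of saving an error page
with open('All-tradable-ETFs.xlsx', 'wb') as f:
    for chunk in r.iter_content(chunk_size=100_000):
        f.write(chunk)
```

`stream=True` defers downloading the body until `iter_content` is called, so the file never has to fit in memory, and `raise_for_status` catches the 403 case that produces the 289-byte error page.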
Grishilda answered 4/7, 2022 at 11:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.