Python download multiple files from links on pages

I'm trying to download all the PGNs from this site.

I think I have to use urlopen to open each url and then use urlretrieve to download each pgn by accessing it from the download button near the bottom of each game. Do I have to create a new BeautifulSoup object for each game? I'm also unsure of how urlretrieve works.

import urllib
from urllib.request import urlopen, urlretrieve, quote
from bs4 import BeautifulSoup

url = 'http://www.chessgames.com/perl/chesscollection?cid=1014492'
u = urlopen(url)
html = u.read().decode('utf-8')

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    urlopen('http://chessgames.com'+link.get('href'))
Approbation answered 17/9, 2017 at 12:26

There is no short answer to your question. I will show you a complete solution and walk through the code.

First, import the necessary modules:

from bs4 import BeautifulSoup
import requests
import re

Next, get the index page and create a BeautifulSoup object:

req = requests.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
soup = BeautifulSoup(req.text, "lxml")

I strongly advise using the lxml parser rather than the common html.parser. After that, you should prepare the list of game links:

pages = soup.find_all('a', href=re.compile(r'.*chessgame\?.*'))

You can do this by searching for links whose href contains the word 'chessgame'. Now, prepare a function that will download the files for you:

def download_file(url):
    # Derive a local filename from the last path segment, dropping the query string
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            # Write the response body to disk in chunks
            for chunk in r:
                f.write(chunk)

Finally, repeat the previous steps for each game page to build the links for the file downloader:

host = 'http://www.chessgames.com'
for page in pages:
    url = host + page.get('href')
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    # The download link's anchor text contains the word 'download'
    file_link = soup.find('a', text=re.compile(r'.*download.*'))
    file_url = host + file_link.get('href')
    download_file(file_url)

(first you search for the link whose text contains 'download', then construct the full URL by concatenating the hostname and the path, and finally download the file)

I hope you can use this code without modification!

Shin answered 18/9, 2017 at 9:13
Why should I use requests and not urllib, by the way? Can you do the same thing with urllib? – Approbation
Of course, you can use urllib as you did in your question, but using requests is good practice (see the sketch after these comments for a side-by-side comparison). You can obtain additional information here. – Shin
What do these two lines of code do: req = requests.get(url) and soup = BeautifulSoup(req.text, "lxml")? I understand that the first line makes a request, but I don't really understand what that means. And what does .text do in the BeautifulSoup constructor? I understand that it gets all the child strings, but again I'm not sure what that means in this context. – Approbation
req = requests.get(url) gives you an object of type requests.models.Response. It wraps the HTTP response itself along with a number of attributes and methods for working with it. One of those attributes is text, which gives you the raw HTML of the page that was received. You pass that HTML to the BeautifulSoup constructor as an argument (soup = BeautifulSoup(req.text, "lxml")) to get a BeautifulSoup object, whose methods let you search for particular tags and more. – Shin
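
To make the two comments above concrete, here is a minimal sketch (an illustration, not code from either answer) that fetches the same index page both with urllib, as in the question, and with requests, then hands the HTML to BeautifulSoup:

from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup

url = "http://www.chessgames.com/perl/chesscollection?cid=1014492"

# With urllib: urlopen returns a response object whose read() yields raw bytes,
# so you decode them yourself before handing the HTML to a parser.
html_via_urllib = urlopen(url).read().decode("utf-8")

# With requests: get() returns a Response object; its .text attribute is the body
# already decoded to a string using the encoding the server declared.
response = requests.get(url)
html_via_requests = response.text

# Either string can be parsed the same way.
soup = BeautifulSoup(html_via_requests, "lxml")
print(soup.title.string if soup.title else "no <title> found")

Both approaches end up with the same string of HTML; requests simply handles the decoding for you and offers a richer Response object along the way.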

The accepted answer is fantastic but the task is embarrassingly parallel; there's no need to retrieve these sub-pages and files one at a time. This answer shows how to speed things up.

The first step is to use requests.Session() when sending multiple requests to a single host. Quoting Advanced Usage: Session Objects from the requests docs:

The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).
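
As a minimal illustration of the quoted passage (a sketch, not the final script below; the header value and URL list are just placeholders), reusing one Session for several requests to the same host lets the pooled TCP connection be reused instead of re-established for every request:

import requests

urls = [
    "http://www.chessgames.com/perl/chesscollection?cid=1014492",
    # ...more pages on the same host would go here
]

# One Session shared by all requests: headers are set once, cookies persist,
# and the underlying TCP connection is reused via urllib3's connection pooling.
with requests.Session() as session:
    session.headers["User-Agent"] = "Mozilla/5.0"  # placeholder header value
    for url in urls:
        response = session.get(url)
        response.raise_for_status()
        print(url, len(response.text))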

Next, asyncio, multiprocessing, or multithreading are available to parallelize the workload. Each has tradeoffs with respect to the task at hand, and which you choose is likely best determined by benchmarking and profiling. This page offers great examples of all three.

For the purposes of this post, I'll show multithreading. The GIL shouldn't be much of a bottleneck because the tasks are mostly IO-bound: each thread spends most of its time waiting on an in-flight request to come back. While a thread is blocked on IO, the interpreter can switch to a thread that is parsing HTML or doing other CPU-bound work.

Here's the code:

import os
import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def download_pgn(task):
    session, host, page, destination_path = task
    response = session.get(host + page)
    response.raise_for_status()

    # Locate the "download" link on the game page and derive a filename from it
    soup = BeautifulSoup(response.text, "lxml")
    game_url = host + soup.find("a", text="download").get("href")
    filename = re.search(r"\w+\.pgn", game_url).group()
    path = os.path.join(destination_path, filename)
    response = session.get(game_url, stream=True)
    response.raise_for_status()

    # Stream the PGN to disk in chunks to keep memory usage low
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

def main():
    host = "http://www.chessgames.com"
    url_to_scrape = host + "/perl/chesscollection?cid=1014492"
    destination_path = "pgns"
    max_workers = 8

    if not os.path.exists(destination_path):
        os.makedirs(destination_path)

    with requests.Session() as session:
        session.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
        response = session.get(url_to_scrape)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        pages = soup.find_all("a", href=re.compile(r".*chessgame\?.*"))
        tasks = [
            (session, host, page.get("href"), destination_path) 
            for page in pages
        ]

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            pool.map(download_pgn, tasks)

if __name__ == "__main__":
    main()

I used response.iter_content here, which is unnecessary for such tiny text files, but it generalizes the code so it handles larger files in a memory-friendly way.

Results from a rough benchmark (the first request is a bottleneck):

max workers   session?   seconds
1             no         126
1             yes        111
8             no          24
8             yes         22
32            yes         16
Unreserve answered 1/2, 2021 at 16:43
