Only download a part of the document using python requests
Asked Answered
I

2

17

I'm writing a web scraper using python-requests.

Each page is over 1MB, but the actual data I need to extract is very early on in the document's flow, so I'm wasting time downloading a lot of unnecessary data.

If possible I would like to stop the download as soon as the required data appears in the document's source code, in order to save time.

For example, I only want to extract the text in the "abc" Div, the rest of the document is useless:

<html>
<head>
<title>My site</title>
</head>
<body>

<div id="abc">blah blah...</div>

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris fermentum molestie ligula, a pharetra eros mollis ut.</p>
<p>Quisque auctor volutpat lobortis. Vestibulum pellentesque lacus sapien, quis vulputate enim mollis a. Vestibulum ultrices fermentum urna ac sodales.</p>
<p>Nunc sit amet augue at dolor fermentum ultrices. Curabitur faucibus porttitor vehicula. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>Etiam sed leo at ipsum blandit dignissim ut a est.</p>

</body>
</html>

Currently I'm simply doing:

r = requests.get(URL)
Ilke answered 12/5, 2014 at 6:30 Comment(0)
T
32

What you want to use here is called Range HTTP Header.

See: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html (Specifically the bit on Range).

See also API Docs on Custom Headers

Example:

from requests import get


url = "http://download.thinkbroadband.com/5MB.zip"
headers = {"Range": "bytes=0-100"}  # first 100 bytes

r = get(url, headers=headers)
Tandem answered 12/5, 2014 at 6:39 Comment(4)
I can confirm, that e.g. requests is serving in this situation very well.Bybee
What if the byte-ranges are not supported by the server? This will fetch the entire content.Oleneolenka
What if a server is malicious and disrespects the range?Caravansary
This only works if the server accepts the Range header, otherwise it will download all the file.Lenrow
O
11

I landed here from the question: Open first N characters of a url file with Python . However, I don't think that's a strict duplicate since it doesn't explicitly mention in the title whether it is mandatory to use the requests module or not. Also, it may be the case that range bytes are not supported by the server where the request is to be made, for whatever reason. In that case, I'd rather simply talk HTTP directly:

#!/usr/bin/env python

import socket
import time

TCP_HOST = 'stackoverflow.com' # This is the host we are going to query
TCP_PORT = 80 # This is the standard port for HTTP protocol
MAX_LIMIT = 1024 # This is the maximum size of the info we want in bytes

# Create the string to talk HTTP/1.1
MESSAGE = \
"GET /questions/23602412/only-download-a-part-of-the-document-using-python-requests HTTP/1.1\r\n" \
"HOST: stackoverflow.com\r\n" \
"User-Agent: Custom/0.0.1\r\n" \
"Accept: */*\r\n\n"

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) # Create a socket
s.connect((TCP_HOST, TCP_PORT)) # Connect to remote socket at given address
s.send(MESSAGE) # Let's begin the transaction

time.sleep(0.1) # Machines are involved, but... oh, well!

# Keep reading from socket till max limit is reached
curr_size = 0
data = ""
while curr_size < MAX_LIMIT:
    data += s.recv(MAX_LIMIT - curr_size)
    curr_size += len(data)

s.close() # Mark the socket as closed

# Everyone likes a happy ending!
print data + "\n"
print "Length of received data:", len(data)

Sample run:

$ python sample.py
HTTP/1.1 200 OK
Cache-Control: private
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
X-Request-Guid: 3098c32c-3423-4e8a-9c7e-6dd530acdf8c
Content-Length: 73444
Accept-Ranges: bytes
Date: Fri, 05 Aug 2016 03:21:55 GMT
Via: 1.1 varnish
Connection: keep-alive
X-Served-By: cache-sin6926-SIN
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1470367315.724674,VS0,VE246
X-DNS-Prefetch-Control: off
Set-Cookie: prov=c33383b6-3a4d-730f-02b9-0eab064b3487; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly

<!DOCTYPE html>
<html itemscope itemtype="http://schema.org/QAPage">
<head>

<title>http - Only download a part of the document using python requests - Stack Overflow</title>
    <link rel="shortcut icon" href="//cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
    <link rel="apple-touch-icon image_src" href="//cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
    <link rel="search" type="application/open

Length of received data: 1024
Oleneolenka answered 5/8, 2016 at 3:22 Comment(2)
I appreciate the Answer, is there any way to get N to M bytes from a page instead of first N bytes?Gesso
@M.S.P You can do that, provided the server supports request with range bytes, as mentioned in the answer by James.Oleneolenka

© 2022 - 2024 — McMap. All rights reserved.