How do I fix a ValueError: read of closed file exception?
This simple Python 3 script:

import urllib.request

host = "scholar.google.com"
link = "/scholar.bib?q=info:K7uZdMSvdQ0J:scholar.google.com/&output=citation&hl=en&as_sdt=1,14&ct=citation&cd=0"
url = "http://" + host + link
filename = "cite0.bib"
print(url)
urllib.request.urlretrieve(url, filename)

raises this exception:

Traceback (most recent call last):
  File "C:\Users\ricardo\Desktop\Google-Scholar\BibTex\test2.py", line 8, in <module>
    urllib.request.urlretrieve(url, filename)
  File "C:\Python32\lib\urllib\request.py", line 150, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "C:\Python32\lib\urllib\request.py", line 1597, in retrieve
    block = fp.read(bs)
ValueError: read of closed file

I thought this might be a temporary problem, so I added some simple exception handling like so:

import random
import time
import urllib.request

host = "scholar.google.com"
link = "/scholar.bib?q=info:K7uZdMSvdQ0J:scholar.google.com/&output=citation&hl=en&as_sdt=1,14&ct=citation&cd=0"
url = "http://" + host + link
filename = "cite0.bib"
print(url)
while True:
    try:
        print("Downloading...")
        time.sleep(random.randint(0, 5))
        urllib.request.urlretrieve(url, filename)
        break
    except ValueError:
        pass

but this just prints Downloading... ad infinitum.
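A minimal variant of that retry loop which logs the caught exception and caps the number of attempts makes the failure visible instead of looping forever; this is only a sketch, and the cap of 5 attempts is arbitrary:

import random
import time
import urllib.request

url = "http://scholar.google.com/scholar.bib?q=info:K7uZdMSvdQ0J:scholar.google.com/&output=citation&hl=en&as_sdt=1,14&ct=citation&cd=0"
filename = "cite0.bib"

for attempt in range(1, 6):  # arbitrary cap of 5 attempts
    try:
        print("Downloading (attempt %d)..." % attempt)
        time.sleep(random.randint(0, 5))
        urllib.request.urlretrieve(url, filename)
        break
    except ValueError as exc:
        # Log the exception instead of silently retrying.
        print("Attempt %d failed: %s" % (attempt, exc))
else:
    print("Giving up.")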

Counterpoint answered 17/7, 2012 at 22:13 Comment(3)
If you look in http://scholar.google.com/robots.txt you can see that Google forbids automated downloads of this page. And if you try using wget you will get a 403 Forbidden error. I suspect this is also happening to your script. – Hyperpyrexia
@senderle There isn't an API, so I'm parsing it manually. – Counterpoint
@senderle, most likely you need to send a cookie to get the content. – Punic
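If missing cookies are indeed the problem, a request built with urllib.request.Request lets you send them explicitly; a sketch, where both header values are placeholders rather than values known to work for Google Scholar:

import urllib.request

url = "http://scholar.google.com/scholar.bib?q=info:K7uZdMSvdQ0J:scholar.google.com/&output=citation&hl=en&as_sdt=1,14&ct=citation&cd=0"

# Placeholder headers: a browser-like User-Agent and a cookie value copied
# from an interactive browser session.
req = urllib.request.Request(url, headers={
    "User-Agent": "Mozilla/5.0",
    "Cookie": "PLACEHOLDER=VALUE",
})

with urllib.request.urlopen(req) as resp:
    data = resp.read()

with open("cite0.bib", "wb") as fo:
    fo.write(data)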
Your URL returns a 403 error code, and apparently urllib.request.urlretrieve is not good at detecting HTTP errors, because it uses urllib.request.FancyURLopener, which tries to swallow the error by returning an urlinfo instead of raising it.

As for the fix: if you still want to use urlretrieve, you can override FancyURLopener like this (the code also reproduces the error so you can see it being raised):

import urllib.request
from urllib.request import FancyURLopener


class FixFancyURLOpener(FancyURLopener):

    def http_error_default(self, url, fp, errcode, errmsg, headers):
        # Raise on 403 instead of silently returning the error page,
        # so the failure is visible to the caller.
        if errcode == 403:
            raise ValueError("403")
        return super(FixFancyURLOpener, self).http_error_default(
            url, fp, errcode, errmsg, headers
        )

# Monkey patch: replace the class urlretrieve uses to build its opener.
urllib.request.FancyURLopener = FixFancyURLOpener

url = "http://scholar.google.com/scholar.bib?q=info:K7uZdMSvdQ0J:scholar.google.com/&output=citation&hl=en&as_sdt=1,14&ct=citation&cd=0"
urllib.request.urlretrieve(url, "cite0.bib")
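With the patch installed, a 403 surfaces as the ValueError raised in http_error_default rather than as the later "read of closed file", so the caller can handle it; for example, wrapping the same call (reusing url from the snippet above) in a try/except:

try:
    urllib.request.urlretrieve(url, "cite0.bib")
except ValueError as exc:
    # The 403 raised by FixFancyURLOpener lands here explicitly.
    print("Download refused:", exc)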

Otherwise, and this is what I recommend, you can use urllib.request.urlopen like so:

fp = urllib.request.urlopen('http://scholar.google.com/scholar.bib?q=info:K7uZdMSvdQ0J:scholar.google.com/&output=citation&hl=en&as_sdt=1,14&ct=citation&cd=0')
with open("cite0.bib", "wb") as fo:  # "wb" because fp.read() returns bytes
    fo.write(fp.read())
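Note that urlopen raises urllib.error.HTTPError for a 403 instead of swallowing it, so the failure can be handled explicitly; a minimal sketch of that:

import urllib.error
import urllib.request

url = "http://scholar.google.com/scholar.bib?q=info:K7uZdMSvdQ0J:scholar.google.com/&output=citation&hl=en&as_sdt=1,14&ct=citation&cd=0"

try:
    with urllib.request.urlopen(url) as fp:
        data = fp.read()
except urllib.error.HTTPError as err:
    # Any HTTP error status (403 here) is reported with its code.
    print("HTTP error %d: %s" % (err.code, err))
else:
    with open("cite0.bib", "wb") as fo:
        fo.write(data)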
Sixfold answered 17/7, 2012 at 22:56 Comment(1)
Thanks for the help. +1 and the accept for the monkey patching and general help, even though I've since realised, per the comments above, that robots.txt disallows downloading those files. I completely forgot to check that. – Counterpoint
If you run your app on managed cloud infrastructure or behind a managed security service, check for limitations they may impose. This happened to me: cloud providers sometimes enforce a whitelist of sites that can be reached.
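One quick way to check for such a restriction is to compare a request to the blocked target with a request to a site you expect to be reachable from the same environment; a rough sketch, with both URLs as placeholders:

import urllib.error
import urllib.request

TARGET_URL = "http://scholar.google.com/"   # the site your app needs
CONTROL_URL = "http://example.com/"         # a site you expect to be allowed

for name, url in [("target", TARGET_URL), ("control", CONTROL_URL)]:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print("%s: reachable (HTTP %d)" % (name, resp.getcode()))
    except (urllib.error.URLError, OSError) as exc:
        # Failure only for the target suggests an outbound whitelist or
        # proxy restriction rather than a bug in the code.
        print("%s: blocked or unreachable (%s)" % (name, exc))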

Tervalent answered 21/1, 2020 at 14:19 Comment(0)
