feedparser with timeout

My code got stuck on this function call:

feedparser.parse("http://...")

This worked before. The URL doesn't even open in a browser now. How would you handle this case? Is there a timeout option? I'd like to continue as if nothing had happened (just print a message or log the issue).

Otherdirected answered 19/3, 2012 at 15:12 Comment(0)
17

You can specify a timeout globally using socket.setdefaulttimeout().

The timeout limits how long an individual socket operation may last -- feedparser.parse() may perform many socket operations, so the total time spent on DNS lookup, establishing the TCP connection, and sending/receiving data may be much longer. See Read timeout using either urllib2 or any other http library.
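
A minimal sketch of that approach (the URL and the 10-second value are placeholders); note that feedparser records fetch errors on the result instead of raising them:

import socket
import feedparser

# Apply a 10-second timeout to every new socket -- a process-wide side effect
socket.setdefaulttimeout(10)

feed = feedparser.parse("http://example.com/rss")
if feed.bozo:
    # feedparser stores fetch/parse problems here instead of raising
    print("Feed error:", feed.bozo_exception)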

Shepp answered 19/3, 2012 at 15:22 Comment(1)
OK, I used it, but I can't tell whether it works, because the URL that was loading endlessly is responsive again.Otherdirected
24

Use the Python requests library for network IO, and feedparser for parsing only:

import logging
from io import BytesIO

import feedparser
import requests

logger = logging.getLogger(__name__)

def fetch_feed(rss_feed):
    # Do the request using the requests library, with a timeout
    try:
        resp = requests.get(rss_feed, timeout=20.0)
    except requests.ReadTimeout:
        logger.warning("Timeout when reading RSS %s", rss_feed)
        return None

    # Put the body into a memory stream for Universal Feed Parser
    content = BytesIO(resp.content)

    # Parse the content
    return feedparser.parse(content)
Stung answered 5/9, 2016 at 12:5 Comment(2)
It is better than specifying the global timeout, but it might not fix the issue, for the reason pointed out in my answer (requests.get() may block for much longer than the timeout value). Follow the link for details.Shepp
I like this solution. I have HTTP settings that work really well for my purposes, but I wanted feedparser for the variations I find in RSS feeds. This allows me to do both. Thanks!Bat
7

According to the author's recommendation[1], you should use the requests library to make the HTTP request and pass the result to feedparser for parsing.

[1] https://github.com/kurtmckee/feedparser/pull/80
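
A minimal sketch of that recommendation (the feed URL is a placeholder); requests also accepts a (connect, read) timeout tuple:

import feedparser
import requests

url = "http://example.com/rss"
try:
    # 3 seconds to establish the connection, 10 seconds between received bytes
    resp = requests.get(url, timeout=(3.0, 10.0))
    resp.raise_for_status()
except requests.RequestException as exc:
    print("Could not fetch feed:", exc)
else:
    feed = feedparser.parse(resp.content)  # parsing only, no network IO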

Mesothorium answered 8/7, 2020 at 8:26 Comment(1)
The custom HTTP client in the code was replaced with requests directly in 2023. And while that brings a default timeout of 10 seconds, the value is not configurable.Peppy
1

If you want a quick workaround, you can monkey-patch feedparser to use the requests library with a proper timeout instead. This also fixes the HTTPS certificate issues I had with feedparser's default URL-opening implementation. This is how I do it:

# assumes "import requests"; this targets feedparser's pre-6.x internal API
feedparser._open_resource = lambda *args, **kwargs: feedparser._StringIO(requests.get(args[0], timeout=5).content)

Update: on feedparser 6.x and above, use the following:

# assumes "import requests" and a "headers" dict of your own request headers
feedparser.api._open_resource = lambda *args, **kwargs: requests.get(args[0], headers=headers, timeout=5).content
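
With the patch in place, a requests error raised inside the lambda propagates out of parse(), so a timeout can be caught directly (placeholder URL):

try:
    feed = feedparser.parse("http://example.com/rss")
except requests.RequestException as exc:
    print("Feed fetch failed:", exc)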

Credit for this approach goes to darklow.

Bashful answered 17/1 at 15:55 Comment(0)
