Haskell http-conduit web-scraping daemon crashes with out of memory error

I've written a daemon in Haskell that scrapes information from a webpage every 5 minutes.

The daemon originally ran fine for about 50 minutes, but then it unexpectedly died with out of memory (requested 1048576 bytes). Every time I ran it, it died after the same amount of time. When I set it to sleep only 30 seconds instead, it died after 8 minutes.

I realized the code to scrape the website was incredibly memory inefficient (going from about 30M while sleeping to 250M while parsing 9M of html), so I rewrote it so that it now uses only about 15M extra while parsing. Thinking the problem was fixed, I ran the daemon overnight, and when I woke up it was using less memory than it had been the night before. I thought I was done, but roughly 20 hours after it had started, it crashed with the same error.

I started looking into GHC profiling but I wasn't able to get that to work. Next I started experimenting with RTS options: I tried -H64m to set the default heap size larger than my program was using, and -K<size> to shrink the maximum stack size to see if that would make it crash sooner.
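
For reference, RTS flags like these are passed at run time after compiling with -rtsopts; roughly as follows, assuming the source file is named after the binary in the output below (-s additionally prints GC statistics when the program exits):

$ ghc --make -rtsopts bannerstalkerd.hs
$ ./bannerstalkerd +RTS -H64m -K8m -s -RTS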

Despite every change I've made, the daemon still seems to crash after a constant number of iterations. Making the parsing more memory efficient raised that number, but it still crashes. This doesn't make sense to me, because none of these runs has even come close to using all of my memory, much less swap space. The heap size is supposed to be unlimited by default, shrinking the stack size didn't make a difference, and all my ulimits are either unlimited or significantly higher than what the daemon is using.

In the original code I pinpointed the crash to somewhere in the html parsing, but I haven't done the same for the more memory efficient version because a 20-hour run takes so long. I don't know whether this would even be useful to know, because no specific part of the program seems broken; it runs successfully for dozens of iterations before crashing.

Out of ideas, I even looked through the GHC source code for this error. It appears to be a failed call to mmap, which wasn't very helpful because I assume that isn't the root of the problem.

(Edit: code rewritten and moved to end of post)

I'm pretty new at Haskell, so I'm hoping this is some quirk of lazy evaluation or something else that has a quick fix. Otherwise, I'm fresh out of ideas.

I'm using GHC version 7.4.2 on FreeBSD 9.1.

Edit:

Replacing the downloading with static html got rid of the problem, so I've narrowed it down to how I'm using http-conduit. I've edited the code below to include my networking code. The Hackage docs recommend sharing a Manager, so I've done that. They also say that for http you have to explicitly close connections, but I don't think I need to do that for httpLbs.

Here's my code.

import Control.Concurrent (threadDelay)
import Control.Monad.IO.Class (liftIO)
import qualified Data.Text as T
import qualified Data.ByteString.Lazy as BL
import Text.Regex.PCRE
import Network.HTTP.Conduit

main :: IO ()
main = do
    manager <- newManager def
    daemonLoop manager

daemonLoop :: Manager -> IO ()
daemonLoop manager = do
    rows <- scrapeWebpage manager
    putStrLn $ "number of rows parsed: " ++ (show $ length rows)
    doSleep
    daemonLoop manager

-- Assumed definition: sleep for 5 minutes between iterations.
doSleep :: IO ()
doSleep = threadDelay (5 * 60 * 1000000)

scrapeWebpage :: Manager -> IO [[BL.ByteString]]
scrapeWebpage manager = do
    putStrLn "before makeRequest"
    html <- makeRequest manager
    -- Force evaluation of html.
    putStrLn $ "html length: " ++ (show $ BL.length html)
    putStrLn "after makeRequest"
    -- Breaks ~10M html table into 2d list of bytestrings.
    -- Max memory usage is about 45M, which is about 15M more than when sleeping.
    return $ map tail $ html =~ pattern
    where
        pattern :: BL.ByteString
        pattern = BL.concat $ replicate 12 "<td[^>]*>([^<]+)</td>\\s*"

makeRequest :: Manager -> IO BL.ByteString
makeRequest manager = runResourceT $ do
    defReq <- parseUrl url
    let request = urlEncodedBody params $ defReq
                    -- Don't throw errors for bad statuses.
                    { checkStatus = \_ _ -> Nothing
                    -- 1 minute.
                    , responseTimeout = Just 60000000
                    }
    response <- httpLbs request manager
    return $ responseBody response

and its output:

before makeRequest
html length: 1555212
after makeRequest
number of rows parsed: 3608
...
before makeRequest
html length: 1555212
after makeRequest
bannerstalkerd: out of memory (requested 2097152 bytes)

Getting rid of the regex computation fixed the problem, but the error seems to happen after the networking and during the regex, presumably because of something I'm doing wrong with http-conduit. Any ideas?
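
One thing I haven't tried yet is running the regex over a strict ByteString instead of a lazy one. An untested sketch, where parseRows is just a hypothetical stand-in for the parsing step above:

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL
import Text.Regex.PCRE

-- Untested variant of the parsing step: flatten the lazy html into
-- one strict buffer before matching, so the match results can't
-- retain a chain of lazy chunks.
parseRows :: BL.ByteString -> [[B.ByteString]]
parseRows html = map tail (strictHtml =~ pattern)
  where
    strictHtml = B.concat (BL.toChunks html)
    pattern :: B.ByteString
    pattern = BC.pack (concat (replicate 12 "<td[^>]*>([^<]+)</td>\\s*"))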

Also, when I try to compile with profiling enabled I get this error:

Could not find module `Network.HTTP.Conduit'
Perhaps you haven't installed the profiling libraries for package `http-conduit-1.8.9'?

Indeed, I have not installed profiling libraries for http-conduit and I don't know how.
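
From what I've read, the usual recipe is apparently to reinstall the package with profiling enabled, something like:

cabal install --reinstall --enable-library-profiling http-conduit

or to set library-profiling: True in ~/.cabal/config so that future installs build profiling libraries as well, but I haven't tried it yet.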

Roomful answered 25/2, 2013 at 2:09. Comments (8):
Can you replace the whole db with a lazy text file to see if it really is the db? (Ought)
I actually removed the entire database part of it and it still has the same problem. I'll edit the post to reflect that. (Roomful)
Replace the downloading part with something fixed, like let page = "<html></html>". (Ought)
It could be that you have a loitering problem. (Treviso)
Profile the program. (Ebro)
I narrowed down the problem to the way I'm using http-conduit. In my edit I explained why I'm unable to compile the program with profiling. (Roomful)
Try import qualified Data.ByteString.Char8 (strict). (Ought)
Also try html = pattern. (Ought)

So you've found yourself a leak. By tweaking compiler options and memory settings you can only postpone the moment your program crashes; you cannot eliminate the source of the problem, so no matter what you set there, you will still run out of memory eventually.

I recommend that you carefully walk through all the non-pure code, primarily the parts working with resources. Check whether all resources get released correctly. Check whether you have accumulating state, like a growing unbounded channel. And, of course, as wisely suggested by n.m., profile it.
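
As a contrived sketch of what accumulating state can look like (nothing specific to your code): a lazy accumulator builds one thunk per iteration and can exhaust memory when compiled without optimizations, while a bang pattern forces it on every step so the loop runs in constant space:

{-# LANGUAGE BangPatterns #-}

-- Lazy accumulator: each call wraps another (+) thunk around acc,
-- so the unevaluated chain grows with n.
sumLeaky :: Int -> Int -> Int
sumLeaky acc 0 = acc
sumLeaky acc n = sumLeaky (acc + n) (n - 1)

-- Strict accumulator: the bang forces acc on every call, keeping
-- memory use flat.
sumStrict :: Int -> Int -> Int
sumStrict !acc 0 = acc
sumStrict !acc n = sumStrict (acc + n) (n - 1)

main :: IO ()
main = print (sumStrict 0 10000000)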

I have a scraper that parses pages without pausing and downloads files, and it does it all concurrently. I've never seen it use more than ~60M of memory. I've been compiling it with GHC 7.4.2, GHC 7.6.1, and GHC 7.6.2 and had problems with none of them.

It should be noted that the root of your problem may also be in the libraries you're using. In my scraper I use http-conduit, http-conduit-browser, HandsomeSoup and HXT.

Need answered 25/2, 2013 at 7:44. Comments (3):
It sounds like you're on the right track. I'm using http-conduit and regex-pcre for the web scraping, and I've edited in all the code I'm using. My program never uses more than about 45M, but it still dies for some reason. My http-conduit code is pretty bare bones, and I don't see where I could be mishandling resources. (Roomful)
@Nikita Is it possible to share your code? I'm also doing web scraping with Haskell and would like to learn from it. (Somnambulism)
@osager Sorry, it's a private project. However, there are plenty of tutorials out there. (Need)

I ended up solving my own problem. It seems to be a GHC bug on FreeBSD. I submitted a bug report and switched to Linux, and now it's been running flawlessly for the last few days.

Roomful answered 3/3, 2013 at 3:23. Comments (0)
