So I am writing a packet sniffing app. Basically I wanted it to sniff for tcp sessions, and then parse them to see if they are http, and if they are, and if they have the right content type, etc, save them as a file on my hard drive.
So, to that end, I wanted it to be efficient. Since the current http library is string based, and I will be dealing with large files, and I only really needed to parse http responses, I decided to roll my own in attoparsec.
When I finished my program, I found that when I was parsing a 9 meg http response with a wav file in it, when I profiled it, it was allocating a gig of memory when it was trying to parse out the body of the http response. When I look at the HTTP.prof I see some lines:
httpBody Main 362 1 0.0 0.0 93.8 99.3 take Data.Attoparsec.Internal 366 1201 0.0 0.0 93.8 99.3 takeWith Data.Attoparsec.Internal 367 3603 0.0 0.0 93.8 99.3 demandInput Data.Attoparsec.Internal 375 293 0.0 0.0 93.8 99.2 prompt Data.Attoparsec.Internal 378 293 0.0 0.0 93.8 99.2 +++ Data.Attoparsec.Internal 380 586 93.8 99.2 93.8 99.2
So as you can see, somewhere within httpbody, take is called 1201 times, causing 500+ (+++) concatenations of bytestrings, which causes an absurd amount of memory allocation.
Here's the code. N is just the content length of the http response, if there is one. If there isn't one it just tries to take everything.
I wanted it to return a lazy bytestring of 1000 or so character bytestrings, but even if I change it to just take n and return a strict bytestring, it still has those allocations in it (and it uses 14 gig of memory).
httpBody n = do
x <- if n > 0
then AC.take n
else AC.takeWhile (\_ -> True)
if B.length x == 0
then return Nothing
else return (Just x)
I was reading a blog by the guy who did combinatorrent and he was having the same issue, but I never heard of a resolution. Has anyone ever run across this problem before or found a solution?
Edit: Okay, well I left this up the entire day and got nothing. After researching the problem I don't think there is a way to do it without adding a lazy bytestring accessor to attoparsec. I also looked at all the other libraries and they either lacked bytestrings or other things.
So I found a workaround. If you think about an http request, it goes headers, newline, newline, body. Since the body is last, and parsing returns a tuple with both what you parsed and what is remaining of the bytestring, I can skip parsing the body inside attoparsec and instead pluck the body straight off the bytestring that is left.
parseHTTPs bs = if P.length results == 0
then Nothing
else Just results
where results = foldParse(bs, [])
foldParse (bs,rs) = case ACL.parse httpResponse bs of
ACL.Done rest r -> addBody (rest,rs) r
otherwise -> rs
addBody (rest,rs) http = foldParse (rest', rs')
where
contentlength = ((read . BU.toString) (maybe "0" id (hdrContentLength (rspHeaders http))))
rest' = BL.drop contentlength rest
rs' = rs ++ [http { rspBody = body' }]
body'
| contentlength == 0 = Just rest
| BL.length rest == 0 = Nothing
| otherwise = Just (BL.take contentlength rest)
httpResponse = do
(code, desc) <- statusLine
hdrs <- many header
endOfLine
-- body <- httpBody ((read . BU.toString) (maybe "0" id (hdrContentLength parsedHeaders)))
return Response { rspCode = code, rspReason = desc, rspHeaders = parseHeaders hdrs, rspBody = undefined }
It is a little messy, but ultimately it works fast and allocates nothing more than I wanted. So basically you fold over the bytestring collecting http data structures, then in between collections, I check the content length of the structure I just got, pull an appropriate amount from the remaining bytestring, and then continue on if there is any bytestring left.
Edit: I actually finished up this project. Works like a charm. I isn't cabalized properly but if someone wants to view the entire source, you can find it at https://github.com/onmach/Audio-Sniffer.