Read lines from a file inside a zip archive using Haskell's zip-conduit

As the title says, I'd like to be able to read lines from a file that is inside a zip archive, using zip-conduit (the zip files I'm dealing with are very big, so I need to be able to do this in constant memory). I grok the very basic idea of conduits, but have never used them in anger, and am feeling quite stuck as to where to start. I've read the conduits tutorial, but I'm having trouble matching that up with my problem.

The zip-conduit documentation says one can source from a zip archive via something like the following:

import qualified Data.Conduit.Binary as CB
import Codec.Archive.Zip

withArchive archivePath $ do
    name:_ <- entryNames
    sourceEntry name $ CB.sinkFile name

I presume what I need to do is write something in place of CB.sinkFile. Data.Conduit.Text has a lines function — could this be used in some way to get the lines out of the file?

I would really appreciate a simple example, say using putStrLn to write out the lines of a simple text file that is archived inside a zip file. Thanks in advance.

Kaylyn answered 21/11, 2013 at 17:1 Comment(0)
6

Michael's answer but with zip-conduit:

import           Control.Monad.IO.Class (liftIO)
import           Data.Conduit
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Text as CT
import           Codec.Archive.Zip

main :: IO ()
main = withArchive "input.zip" $ do
  n:_ <- entryNames
  sourceEntry n
     $ CT.decode CT.utf8
    =$ CT.lines
    =$ CL.mapM_ (\t -> liftIO $ putStrLn $ "Got a line: " ++ show t)
Deuterogamy answered 22/11, 2013 at 20:3 Comment(1)
Thanks very much, this makes much more sense. Using conduit has made my code much cleaner. - Kaylyn
1

Here's a simple example:

import           Control.Monad.IO.Class (liftIO)
import           Data.Conduit
import qualified Data.Conduit.Binary    as CB
import qualified Data.Conduit.List      as CL
import qualified Data.Conduit.Text      as CT

main :: IO ()
main = runResourceT
     $ CB.sourceFile "input.txt"
    $$ CT.decode CT.utf8
    =$ CT.lines
    =$ CL.mapM_ (\t -> liftIO $ putStrLn $ "Got a line: " ++ show t)

You can also view and experiment on FP Haskell Center.

Envenom answered 21/11, 2013 at 18:49 Comment(2)
Thanks for taking the time to reply, Michael. Your example demonstrates the general use of conduits (which I understand from the conduits tutorial), but doesn't illustrate how one would use zip-conduit as outlined in my question, and I'm afraid I'm too dumb to jump immediately from your example to a solution. Further help really would be appreciated! - Kaylyn
The example does not work: getting "Variable not in scope: main :: [GHC.Types.Char] -> t" - Kezer
1

Here is a simple example:

import Data.ByteString (ByteString)
import Data.Conduit
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Binary as CB
import Codec.Archive.Zip
import System.Environment (getArgs)

-- Collects every chunk of the entry into a list, all held in memory.
sink :: Monad m => Sink ByteString m [ByteString]
sink = CL.consume

main :: IO ()
main = do
    [archivePath] <- getArgs
    res <- withArchive archivePath $ do
        name:_ <- entryNames      -- use the first entry in the archive
        source <- getSource name
        runResourceT $ source $$ sink

    print res

You can either process the data as it comes through in the sink function (consuming as needed using CL, CB functions), or since the data is being returned lazily, you can modify the data in res.
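
As a rough sketch of the first option (processing the data as it comes through), the code below decodes the bytes, splits them into lines, and appends a string to each line as it streams past. Note that CL.consume gathers every chunk into a list, and those chunks are arbitrary pieces of the entry rather than lines. Several things here are assumptions made for illustration: chunks is a hypothetical in-memory stand-in for an archive entry, tagLines and the " [seen]" suffix are made-up names, and the code uses runConduit and .| from newer conduit releases (1.3 or later, plus conduit-extra), which play the roles of $$ and =$ above. The signature of tagLines is fixed to IO for simplicity and would need to be generalized before it could run inside withArchive.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.ByteString   (ByteString)
import           Data.Conduit
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Text as CT
import           Data.Text         (Text)
import qualified Data.Text         as T
import qualified Data.Text.IO      as TIO

-- Hypothetical stand-in for the bytes of an archive entry. The chunk
-- boundaries deliberately fall in the middle of lines, as they would
-- with a real streaming source.
chunks :: [ByteString]
chunks = ["first li", "ne\nsecond line\nthi", "rd line\n"]

-- Decode UTF-8, split the text into lines, and append a suffix to
-- each line. Lines are handled one at a time as they arrive.
tagLines :: Text -> ConduitT ByteString Text IO ()
tagLines suffix =
     CT.decode CT.utf8
  .| CT.lines
  .| CL.map (`T.append` suffix)

main :: IO ()
main = runConduit
     $ CL.sourceList chunks
    .| tagLines " [seen]"
    .| CL.mapM_ TIO.putStrLn
```

Because each line is printed and then dropped, memory use depends on the longest line rather than on the size of the whole entry.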

Godart answered 21/11, 2013 at 19:12 Comment(1)
Thanks for this, @jamshidh. But, how exactly would I process things in the sink function? If I convert res to a list of strings, and process these, I find that there are far fewer elements in this list than there should be. The number of items appears to be limited by the memory they require. I’m not sure that res is lazy (the documentation says that CL.consume puts all values into memory). How would one modify your sink function to trivially process each line (e.g., append a given string to each line)? - Kaylyn
