Streaming recursive descent of a directory in Haskell

Asked 10/1, 2013 at 13:38 Answered 20/10, 2017 at 0:0

Solved lazy-evaluation directory-structure haskell

I am trying to do a recursive descent of a directory structure using Haskell. I would like to only retrieve the child directories and files as needed (lazily).

I wrote the following code, but when I run it, the trace shows that all directories are visited before the first file:

module Main where

import Control.Monad ( forM, forM_, liftM )
import Debug.Trace ( trace )
import System.Directory ( doesDirectoryExist, getDirectoryContents )
import System.Environment ( getArgs )
import System.FilePath ( (</>) )

-- From Real World Haskell, p. 214
getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topPath = do
  names <- getDirectoryContents topPath
  let
    properNames =
      filter (`notElem` [".", ".."]) $
      trace ("Processing " ++ topPath) names
  paths <- forM properNames $ \name -> do
    let path = topPath </> name
    isDirectory <- doesDirectoryExist path
    if isDirectory
      then getRecursiveContents path
      else return [path]
  return (concat paths)

main :: IO ()
main = do
  [path] <- getArgs
  files <- getRecursiveContents path
  forM_ files $ \file -> putStrLn $ "Found file " ++ file

How can I interleave the file processing with the descent? Is the problem that the files <- getRecursiveContents path action gets performed before the following forM_ in main?

Haystack answered 10/1, 2013 at 13:38 Comment(3)

The later section called "Another way of looking at traversal" in the "Searching the filesystem" chapter of Real World Haskell also provides a more flexible way of navigating the file system which uses a fold and an iterator. – Delarosa 10/1, 2013 at 14:23

I (obviously) took the function getRecursiveContents from RWH. I did not see the later section. I'll take a look. Thanks. – Haystack 10/1, 2013 at 14:33

You may want to check out hackage.haskell.org/package/FilePather – Sandra 10/1, 2013 at 15:54

This is exactly the kind of problem that iteratees/coroutines were designed to solve.

You can easily do this with pipes. The only change I made to your getRecursiveContents was to make it a Producer of FilePaths and to respond with the file name instead of returning it. This lets downstream handle the file name immediately instead of waiting for getRecursiveContents complete.

module Main where

import Control.Monad ( forM_, liftM )
import Control.Proxy
import System.Directory ( doesDirectoryExist, getDirectoryContents )
import System.Environment ( getArgs )
import System.FilePath ( (</>) )

getRecursiveContents :: (Proxy p) => FilePath -> () -> Producer p FilePath IO ()
getRecursiveContents topPath () = runIdentityP $ do
  names <- lift $ getDirectoryContents topPath
  let properNames = filter (`notElem` [".", ".."]) names
  forM_ properNames $ \name -> do
    let path = topPath </> name
    isDirectory <- lift $ doesDirectoryExist path
    if isDirectory
      then getRecursiveContents path ()
      else respond path

main :: IO ()
main = do
    [path] <- getArgs
    runProxy $
            getRecursiveContents path
        >-> useD (\file -> putStrLn $ "Found file " ++ file)

This prints out each file immediately as it traverses the tree, and it does not require lazy IO. It's also very easy to change what you do with the file names, since all you have to do is switch out the useD stage with your actual file handling logic.

To learn more about pipes, I highly recommend you read Control.Proxy.Tutorial.

Colophony answered 10/1, 2013 at 15:46 Comment(1)

I updated the code for the current API of Pipes 4 instead of Pipes 3 but it's too long to paste here, so I gisted it: gist.github.com/FranklinChen/133cb61af931a08bbe20 – Rossen 20/10, 2015 at 23:15

Using lazy IO / unsafe... is not a good way to go. Lazy IO causes many problems, including unclosed resources and executing impure actions within pure code. (See also The problem with lazy I/O on Haskell Wiki.)

A safe way is to use some iteratee/enumerator library. (Replacing problematic lazy IO was the motivation for developing these concepts.) Your getRecursiveContents would become a source of data (AKA enumerator). And the data will be consumed by some iterator. (See also Enumerator and iteratee on Haskell wiki.)

There is a tutorial on the enumerator library that just gives an example of traversing and filtering directory tree, implementing a simple find utility. It implements method

enumDir :: FilePath -> Enumerator FilePath IO b

which is basically just what you need. I believe you will find it interesting.

Also there is a nice article explaining iteratees in The Monad Reader, Issue 16: Iteratee: Teaching an Old Fold New Tricks by John W. Lato, the author of the iteratee library.

Today many people prefer newer libraries such as pipes. You may be interested in a comparison: What are the pros and cons of Enumerators vs. Conduits vs. Pipes?.

Bearden answered 10/1, 2013 at 14:49 Comment(1)

I have added all of the references you gave to my Instapaper account and will read them after work. Thanks. – Haystack 10/1, 2013 at 16:3

Thanks to the comment by Niklas B., here is the solution that I have:

module Main where

import Control.Monad ( forM, forM_, liftM )
import Debug.Trace ( trace )
import System.Directory ( doesDirectoryExist, getDirectoryContents )
import System.Environment ( getArgs )
import System.FilePath ( (</>) )
import System.IO.Unsafe ( unsafeInterleaveIO )

-- From Real World Haskell, p. 214
getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topPath = do
  names <- unsafeInterleaveIO $ getDirectoryContents topPath
  let
    properNames =
      filter (`notElem` [".", ".."]) $
      trace ("Processing " ++ topPath) names
  paths <- forM properNames $ \name -> do
    let path = topPath </> name
    isDirectory <- doesDirectoryExist path
    if isDirectory
      then unsafeInterleaveIO $ getRecursiveContents path
      else return [path]
  return (concat paths)

main :: IO ()
main = do
  [path] <- getArgs
  files <- unsafeInterleaveIO $ getRecursiveContents path
  forM_ files $ \file -> putStrLn $ "Found file " ++ file

Is there a better way?

Haystack answered 10/1, 2013 at 14:17 Comment(0)

I was recently looking at a very similar problem, where I'm trying to do a somewhat complicated search using the IO monad, stopping after I find the file I'm interested in. While the solutions using libraries like Enumerator, Conduit, etc. seem to be the best you could do at the time those answers were posted, I just learned IO became an instance of Alternative in GHC's base library about a year ago, which opens up some new possibilities. Here's the code I wrote to try it out:

import Control.Applicative (empty)
import Data.Foldable (asum)
import Data.List (isSuffixOf)
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath ((</>))

searchFiles :: (FilePath -> IO a) -> FilePath -> IO a
searchFiles f fp = do
    isDir <- doesDirectoryExist fp
    if isDir
        then do
            entries <- listDirectory fp
            asum $ map (searchFiles f . (fp </>)) entries
        else f fp

matchFile :: String -> FilePath -> IO ()
matchFile name fp
    | name `isSuffixOf` fp = putStrLn $ "Found " ++ fp
    | otherwise = empty

The searchFiles function does a depth-first search of a directory tree, stopping when it finds what you're looking for, as determined by the function passed as the first argument. The matchFile function is just there to show how to construct a suitable function to use as the first argument for searchFiles; in real life you'd probably do something more complicated.

The interesting thing here is that now you can use empty to make an IO computation "give up" without returning a result, and you can chain computations together with asum (which is just foldr (<|>) empty) to keep trying computations until one of them succeeds.

I find it a little unnerving that the type signature of an IO action no longer reflects the fact that it may deliberately not produce a result, but it sure simplifies the code. I was previously trying to use types like IO (Maybe a), but doing so made it very hard to compose actions.

IMHO there's no longer much reason to use a type like IO (Maybe a), but if you need to interface with code that uses a type like that, it's easy to convert between the two types. To convert IO a to IO (Maybe a), you can just use Control.Applicative.optional, and going the other way, you can use something like this:

maybeEmpty :: IO (Maybe a) -> IO a
maybeEmpty m = m >>= maybe empty pure

Chrisman answered 20/10, 2017 at 0:0 Comment(0)

Recommended topics

Hot tags