Recursively search directories for all files matching name criteria in Haskell

I'm relatively inexperienced in Haskell and I wanted to improve, so for a learning project of mine I have the following requirements:

  • I want to search starting from a specified top directory, not necessarily an absolute path.
  • I want to find all files of a given extension, say .md.
  • I want to not search hidden directories, say toplevel/.excluded.
  • I want to be able to ignore hidden files like gedit produces .filename.md.swp.
  • I want to end up with a complete list of files as the result of my function.

I searched all over SO. Here's what I have so far:

import qualified System.FilePath.Find as SFF
import qualified Filesystem.Path.CurrentOS as FP

srcFolderName = "src"
outFolderName = "output"
resFolderName = "res"

ffNotHidden :: SFF.FindClause Bool
ffNotHidden = SFF.fileName SFF./~? ".?*"

ffIsMD :: SFF.FindClause Bool
ffIsMD = SFF.extension SFF.==? ".md" SFF.&&? SFF.fileName SFF./~? ".?*"

findMarkdownSources :: FilePath -> IO [FilePath]
findMarkdownSources filePath = do
    paths <- SFF.find ffNotHidden ffIsMD filePath
    return paths

This doesn't work. Using printf-style debugging in findMarkdownSources, I can verify that filePath is correct, e.g. "/home/user/testdata" (the printed value includes the quotes, in case that tells you something). The list paths is always empty. I'm absolutely certain there are markdown files in the directory I have specified (find /path/to/dir -name "*.md" finds them).

I therefore have some specific questions.

  1. Is there a reason (filters incorrect) for example, why this code should not work?
  2. There are a number of ways to do this in haskell. It seems there are at least six packages (fileman, system.directory, system.filepath.find) dedicated to this. Here's some questions where something like this is answered:

    1. Streaming recursive descent of a directory in Haskell
    2. Is there some directory walker in Haskell?
    3. avoid recursion into specifc folder using filemanip

    Each one has about three unique ways to achieve what I want to achieve, so, we're nearly at 10 ways to do it...

  3. Is there a specific way I should be doing this? If so why? If it helps, once I have my file list, I'm going to walk the entire thing, open and parse each file.

If it helps, I'm reasonably comfortable with basic haskell, but you'll need to slow down if we start getting too heavy with monads and applicative functors (I don't use haskell enough for this to stay in my head). I find the haskell docs on hackage incomprehensible, though.

Tallou answered 6/8, 2018 at 16:33 Comment(2)
The documentation for GlobPattern doesn't mention supporting ?; perhaps that is part of the problem.Pelota
@DanielWagner thanks OK I'll try this. The answer I have looks pretty good as well, I'll try that tomorrow also.Tallou

so, we're nearly at 10 ways to do it...

Here's yet another way to do it, using functions from the directory, filepath and extra packages, but not too much monad wizardry:

import Control.Monad (foldM)
import System.Directory (doesDirectoryExist, listDirectory) -- from "directory"
import System.FilePath ((</>), FilePath) -- from "filepath"
import Control.Monad.Extra (partitionM) -- from the "extra" package

traverseDir :: (FilePath -> Bool) -> (b -> FilePath -> IO b) -> b -> FilePath -> IO b
traverseDir validDir transition =
    let go state dirPath =
            do names <- listDirectory dirPath
               let paths = map (dirPath </>) names
               (dirPaths, filePaths) <- partitionM doesDirectoryExist paths
               state' <- foldM transition state filePaths -- process current dir
               foldM go state' (filter validDir dirPaths) -- process subdirs
     in go

The idea is that the user passes a FilePath -> Bool function to filter unwanted directories; also an initial state b and a transition function b -> FilePath -> IO b that processes file names, updates the b state and possibly has some side effects. Notice that the type of the state is chosen by the caller, who might put useful things there.
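To illustrate that the state can be something richer than a list, here is a minimal sketch of a transition function that counts files per extension. The names ExtCounts and countByExtension are hypothetical, and Data.Map.Strict comes from the containers package, which is an extra assumption on top of the imports above.

```haskell
import qualified Data.Map.Strict as Map
import System.FilePath (takeExtension)

-- Hypothetical state type: number of files seen per extension.
type ExtCounts = Map.Map String Int

-- Transition function: bump the counter for this file's extension.
-- Files without an extension are counted under the empty string.
countByExtension :: ExtCounts -> FilePath -> IO ExtCounts
countByExtension counts path =
    pure (Map.insertWith (+) (takeExtension path) 1 counts)

-- Intended use with traverseDir from above:
--   traverseDir (const True) countByExtension Map.empty "/tmp/somedir"
```

The traversal itself does not change; only the initial state and the transition function do.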

If we only want to print file names as they are produced, we can do something like this:

traverseDir (\_ -> True) (\() path -> print path) () "/tmp/somedir"

We are using () as a dummy state because we don't really need it here.

If we want to accumulate the files into a list, we can do it like this:

traverseDir (\_ -> True) (\fs f -> pure (f : fs)) [] "/tmp/somedir" 

And what if we want to filter some files? We would need to tweak the transition function we pass to traverseDir so that it ignores them.
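As a rough sketch of that tweak for the requirements in the question (only .md files, no hidden files or directories), something like the following could work. The helper names isVisible, isVisibleMd and collectMd are made up for illustration, and "hidden" is assumed to mean "name starts with a dot".

```haskell
import Data.List (isPrefixOf)
import System.FilePath (takeExtension, takeFileName)

-- Hypothetical helper: True when the last path component
-- does not start with a dot.
isVisible :: FilePath -> Bool
isVisible path = not ("." `isPrefixOf` takeFileName path)

-- Hypothetical helper: visible files whose extension is exactly ".md".
isVisibleMd :: FilePath -> Bool
isVisibleMd path = isVisible path && takeExtension path == ".md"

-- Transition function: prepend the file only if it passes the predicate.
collectMd :: [FilePath] -> FilePath -> IO [FilePath]
collectMd acc path
    | isVisibleMd path = pure (path : acc)
    | otherwise        = pure acc

-- Intended use with traverseDir from above:
--   traverseDir isVisible collectMd [] "/tmp/somedir"
```

Because files are prepended, the resulting list is in reverse visiting order; apply reverse to the result if the order matters. Note also that traverseDir only applies the directory filter to subdirectories, not to the starting directory itself.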

Microcrystalline answered 6/8, 2018 at 18:7 Comment(0)

I tested your code on my machine, and it seems to work fine. Here is some example data:

$ find test/data
test/data
test/data/look-a-md-file.md
test/data/another-dir
test/data/another-dir/shown.md
test/data/.not-shown.md
test/data/also-not-shown.md.bkp
test/data/.hidden
test/data/some-dir
test/data/some-dir/shown.md
test/data/some-dir/.ahother-hidden
test/data/some-dir/.ahother-hidden/im-hidden.md

Running your function will result in:

ghci> findMarkdownSources "test"
["test/data/another-dir/shown.md","test/data/look-a-md-file.md","test/data/some-dir/shown.md"]

I've tested this with an absolute path, and it also works. Are you sure you have passed a valid path? You'll get an empty list if the path is not valid (although you also get a warning).
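One way to rule out a bad path before searching is to resolve it and check that it is a directory. This is only a sketch: resolveRoot is a hypothetical helper, built on doesDirectoryExist and makeAbsolute from the directory package.

```haskell
import System.Directory (doesDirectoryExist, makeAbsolute)

-- Hypothetical helper: turn the root into an absolute path and
-- confirm that it names an existing directory.
resolveRoot :: FilePath -> IO (Either String FilePath)
resolveRoot root = do
    absRoot <- makeAbsolute root
    exists  <- doesDirectoryExist absRoot
    pure $ if exists
        then Right absRoot
        else Left ("not a directory: " ++ absRoot)
```

Printing the Left message shows the path that was actually looked up, which helps when a relative path is resolved against an unexpected working directory.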

Note that your code could be simplified as follows:

module Traversals.FileManip where

import           System.FilePath.Find (extension, fileName, find, (&&?), (/~?),
                                       (==?))

findMdSources :: FilePath -> IO [FilePath]
findMdSources fp = find isVisible (isMdFile &&? isVisible) fp
    where
      isMdFile = extension ==? ".md"
      isVisible = fileName /~? ".?*"

And you can even remove the fp parameter, but I'm leaving it here for the sake of clarity.

I prefer to import explicitly so that I know where each function comes from (since I don't know of any Haskell IDE with advanced symbol navigation).

However, note that this solution uses unsafe interleaved IO, which is not recommended.

So regarding your questions 2 and 3, I would recommend a streaming solution, like pipes or conduit. Sticking to this kind of solution will reduce your options (just like sticking to pure functional programming languages reduced my options for programming languages ;)). Here you have an example of how pipes can be used to walk a directory.

Here is the code in case you want to try this out.

Kasiekask answered 8/8, 2018 at 11:8 Comment(0)
