Recursive globs. ** or */** globstar different on zsh, Bash, Python, and Ruby

Suppose you have this directory tree:

$ tree /tmp/test
/tmp/test
├── dir_a
│   ├── dir a\012file with CR
│   ├── dir a file with spaces
│   └── sub a directory
│       └── with a file in it
├── dir_b
│   ├── dir b\012file with CR
│   └── dir b file with spaces
├── dir_c
│   ├── \012
│   ├── dir c\012file with CR and *
│   └── dir c file with space and *
├── file_1
├── file_2
└── file_3

4 directories, 11 files

(HERE is a script to produce that tree. The \012 is a \n, included to make the scripting more challenging. There is a .hidden file in there too.)
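Since the linked script is not reproduced here, the following is a minimal Python sketch that recreates the tree as listed by tree above. The function name build_test_tree is hypothetical, and the location of the .hidden file (dir_a here) is an assumption: the question only says that one exists. It is POSIX-only, because the names contain newlines and *.

```python
import tempfile
from pathlib import Path

# Relative paths of the 11 visible files shown by `tree`, plus one hidden file.
# "\n" is the newline that `tree` prints as \012.
FILES = [
    "dir_a/dir a\nfile with CR",
    "dir_a/dir a file with spaces",
    "dir_a/sub a directory/with a file in it",
    "dir_b/dir b\nfile with CR",
    "dir_b/dir b file with spaces",
    "dir_c/\n",
    "dir_c/dir c\nfile with CR and *",
    "dir_c/dir c file with space and *",
    "file_1",
    "file_2",
    "file_3",
    "dir_a/.hidden",  # assumption: the original location of the hidden file is not stated
]


def build_test_tree(root: Path) -> Path:
    """Create the example tree under root (hypothetical helper) and return root."""
    for rel in FILES:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    return root


if __name__ == "__main__":
    # Build into a fresh temporary directory rather than overwriting /tmp/test.
    print(build_test_tree(Path(tempfile.mkdtemp())))
```

To match the question exactly, pass Path("/tmp/test") instead of a temporary directory.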

There seem to be substantial implementation differences in recursive globbing between Bash 5.1, zsh 5.8, Python 3.10 pathlib, the Python glob module with recursion enabled, and Ruby 3.0.

This also assumes that shopt -s globstar is set in Bash, and that cwd (the current working directory) is /tmp/test for this example.

This is what Bash does:

  1. * Just the files and directories in cwd, i.e. 3 directories, 3 files
  2. ** All files and directories in the tree rooted at cwd, but not cwd itself -- 4 directories, 11 files
  3. **/ Only directories in the tree rooted at cwd, not including cwd -- 4 and 0
  4. */** All directories below cwd and all files EXCEPT the files in cwd -- 4 and 8, since recursion only starts in the subdirectories
  5. **/* Same as ** -- 4 and 11
  6. **/*/ Only directories in the tree -- 4 and 0
  7. */**/* Only directories below the second level and files below the first -- 1 and 8

If I run this script under Bash 5.1 and zsh 5.8, the results are different:

# no shebang - execute with appropriate shell
# BTW - this is how you count the result since ls -1 ** | wc -l is incorrect 
# if the file name has \n in it.
cd /tmp/test || exit
[ -n "$BASH_VERSION" ] && shopt -s globstar
[ -n "$ZSH_VERSION" ] && setopt GLOBSTARSHORT # see table
dc=0; fc=0
for f in **; do              # the glob there is the only thing being changed
    [ -d "$f" ] && (( dc++ ))
    [ -f "$f" ] && (( fc++ ))
    printf "%d, %d \"%s\"\n" $dc $fc "$f"
done
printf "%d directories, %d files\n" "$dc" "$fc"

Results (expressed as X,Y for X directories and Y files for that example directory using the referenced glob. By inspection or by running these scripts you can see what is visited by the glob.):

glob     Bash   zsh    zsh GLOBSTARSHORT   pathlib   python glob   ruby
*        3,3    3,3    3,3                 3,3       3,3           3,3
**       4,11   3,3    4,11                5,0‡      5,11‡         3,3
**/      4,0    4,0    4,0                 5,0‡      5,0‡          5,0‡
*/**     4,8    1,7    1,8                 4,0       4,8           1,7
**/*     4,11   4,11   4,11                4,12†     4,11          4,11
**/*/    4,0    4,0    4,0                 4,12†     4,0           4,0
*/**/*   1,8    1,8    1,8                 1,9†      1,8           1,8

‡ Directory count of 5 means the cwd is returned too.

† Python pathlib globs hidden files; the others do not.

Python script:

from pathlib import Path 
import glob 

tg="**/*"  # change this glob for testing

fc=dc=0
for fn in Path("/tmp/test").glob(tg):
    print(fn)
    if fn.is_file():
        fc=fc+1
    elif fn.is_dir():
        dc=dc+1

print(f"pathlib {dc} directories, {fc} files\n\n")      

fc=dc=0
for sfn in glob.glob(f"/tmp/test/{tg}", recursive=True):
    print(sfn)
    if Path(sfn).is_file():
        fc=fc+1
    elif Path(sfn).is_dir():
        dc=dc+1

print(f"glob.glob {dc} directories, {fc} files") 
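The two loops above test one pattern at a time. As a rough sketch, the same counting can be wrapped in a helper and run over all seven patterns to fill both Python columns of the table in one go. The names PATTERNS, count_matches, glob_module_counts and pathlib_counts are hypothetical. On the example tree the glob.glob counts should reproduce the "python glob" column; the pathlib counts may not match the table on a newer interpreter, because pathlib's handling of ** and of trailing slashes has changed between Python releases.

```python
import glob
import os
from pathlib import Path

# The seven globs compared in the table.
PATTERNS = ["*", "**", "**/", "*/**", "**/*", "**/*/", "*/**/*"]


def count_matches(paths):
    """Return (directories, files) for an iterable of matched paths."""
    dirs = files = 0
    for p in paths:
        if os.path.isdir(p):
            dirs += 1
        elif os.path.isfile(p):
            files += 1
    return dirs, files


def glob_module_counts(root, pattern):
    """Counts for glob.glob with recursive=True."""
    # glob.escape keeps any magic characters in root itself literal.
    full = os.path.join(glob.escape(str(root)), pattern)
    return count_matches(glob.glob(full, recursive=True))


def pathlib_counts(root, pattern):
    """Counts for pathlib.Path.glob."""
    return count_matches(Path(root).glob(pattern))


if __name__ == "__main__":
    root = "/tmp/test"
    if os.path.isdir(root):
        for pattern in PATTERNS:
            g = "%d,%d" % glob_module_counts(root, pattern)
            p = "%d,%d" % pathlib_counts(root, pattern)
            print(f"{pattern:<8} glob.glob {g:<6} pathlib {p}")
    else:
        print(f"{root} does not exist; build the example tree first")
```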

Ruby script:

dc=fc=0
Dir.glob("/tmp/test/**/").
    each{ |f| p f; File.directory?(f) ? dc=dc+1 : (fc=fc+1 if File.file?(f)) }

puts "#{dc} directories, #{fc} files"

So the only globs that all agree on (other than the hidden file) are *, **/* and */**/*.

Documentation:

  1. Bash: two adjacent ‘*’s used as a single pattern will match all files and zero or more directories and subdirectories.

  2. zsh: a) setopt GLOBSTARSHORT sets **.c to be equivalent to **/*.c and b) ‘**/’ is equivalent to ‘(*/)#’; note that this therefore matches files in the current directory as well as subdirectories.

  3. pathlib: ** which means “this directory and all subdirectories, recursively”

  4. python glob: If recursive is true, the pattern ** will match any files and zero or more directories, subdirectories and symbolic links to directories. If the pattern is followed by an os.sep or os.altsep then files will not match.

  5. ruby: ** Matches directories recursively if followed by /. If this path segment contains any other characters, it is the same as the usual *.
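The glob module wording in item 4 can be checked on a tiny throwaway tree. This is only a sketch: the helper name relative_matches and the file names are made up for illustration. It shows that ** behaves like * unless recursive=True is passed, and that a trailing separator restricts the matches to directories.

```python
import glob
import os
import tempfile


def relative_matches(root, pattern, recursive):
    """Return sorted matches of pattern under root, relative to root."""
    full = os.path.join(glob.escape(root), pattern)
    return sorted(os.path.relpath(m, root)
                  for m in glob.glob(full, recursive=recursive))


# Throwaway tree: top.txt, a/inner.txt and an empty directory a/b.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
for rel in ("top.txt", os.path.join("a", "inner.txt")):
    open(os.path.join(root, rel), "w").close()

# Without recursive=True, ** is just an ordinary *.
plain = relative_matches(root, "**", recursive=False)
print(plain)      # ['a', 'top.txt']

# With recursive=True, files and directories at every level match,
# and the root itself is returned too (shown as '.').
deep = relative_matches(root, "**", recursive=True)
print(deep)       # ['.', 'a', 'a/b', 'a/inner.txt', 'top.txt']

# Followed by a separator, only directories match.
dirs_only = relative_matches(root, "**/", recursive=True)
print(dirs_only)  # ['.', 'a', 'a/b']
```

The '.' entries correspond to the directory count of 5 in the table, where the starting directory is returned as well.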

Questions:

  1. Are my assumptions about what each glob is supposed to do correct?

  2. Why is Bash the only one that is recursive with **? (If you add setopt GLOBSTARSHORT to zsh, the result is similar with **.)

  3. Is it reasonable to tell yourself that **/* works for all?

Hutch answered 6/1, 2022 at 21:16 Comment(12)
According to the pathlib documentation, "Patterns are the same as for fnmatch, with the addition of “**” which means “this directory and all subdirectories, recursively”.", so "**" should only give you directories and subdirectories, not files. Similarly for glob.glob(). -- Impute
bash and zsh document ** differently: it matches all files in bash, but only directories in zsh. (zsh has an additional option--of course--for allowing shorthand like **.c to be the same as **/*.c.) -- Pissed
@chepner: Yes, I read this. So it is expected from Bash. I am more curious about the others. My expectations were set by Bash, but it is also puzzling that 100% of the others would get it wrong. Is there another 'authoritative' way of recursive globbing? -- Hutch
It's not wrong: it's nonstandard, so every language is free to do whatever it wants. -- Pissed
I disagree with the claim that the others are "wrong". If they behave as documented, they are correct! -- Impute
The Python documentation for glob("**") states the pattern "**" will match any files and zero or more directories, subdirectories and symbolic links to directories which is functionally the same as the Bash documentation. So -- is Python wrong with **? -- Hutch
The docs for Python glob.glob() are irrelevant when what you have tested is pathlib.Path.glob(). The docs for the latter say that ** means “this directory and all subdirectories, recursively”, which appears to be consistent with the behavior you observe. -- Flopeared
@Hutch Just tested glob.glob("**", recursive=True), and it's giving me all files in all (sub)directories. -- Impute
Replacing Path("/tmp/test").glob("**") with glob.glob("/tmp/test/**") I now get 3,3 so they are not even consistent between those two modules... -- Hutch
And glob.glob("**/", recursive=1) gives me just (sub)directories. I think these two globs were implemented by different people who didn't talk to each other! :) Or read each other's documentation. Or maybe one didn't like the implementation choices made by the other... -- Impute
There is no standard behavior. Read the documentation for each function or language to see how they define the behavior of **, and code accordingly. To answer your first question, no, your assumptions are not correct. -- Pissed
In short, it's messy. @dawg, it might be worth adding an extra column in your question to distinguish between these two glob functions in Python. -- Impute

A file glob pattern works differently depending on the program doing the globbing. There are many other possible programs that do globbing (e.g. Perl, git, fish shell, Windows’ cmd.exe, PowerShell, csh, tcsh, tcl, etc). As you have found, specifications and behavior vary. You ask if your assumptions are correct; this is a broad question that is difficult to answer, but your research and testing look thorough to me. That said, I don’t think it is productive to attempt to quantify and describe behavior across all of these programs, since any generalizations you discover are of limited utility. Personally, when a program says that it accepts a “glob”, all I assume is that simple patterns work (a single * with text before or after), and if I need something more complex I consult specific documentation (which, sadly, is often lacking or difficult to find).

You might also find https://github.com/begin/globbing useful, since it tries to document commonly supported globbing syntax.
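When the exact set of visited entries matters more than the glob syntax, one option in Python is to skip ** entirely and walk the tree explicitly. The sketch below is an illustration, not a recommendation from any of the documentation quoted above; the function name walk_counts and its two flags are hypothetical. Unlike the scripts in the question, it counts every non-directory entry as a file and counts symbolic links to directories as directories without descending into them.

```python
import os


def walk_counts(root, include_root=False, include_hidden=True):
    """Count (directories, files) below root using os.walk instead of a glob."""
    dirs = 1 if include_root else 0
    files = 0
    for _, dirnames, filenames in os.walk(root):
        if not include_hidden:
            # Editing dirnames in place also stops os.walk from descending
            # into hidden directories.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
            filenames = [f for f in filenames if not f.startswith(".")]
        dirs += len(dirnames)
        files += len(filenames)
    return dirs, files


if __name__ == "__main__":
    print("%d directories, %d files" % walk_counts("/tmp/test"))
```

The two flags make explicit the choices that the glob implementations disagree on: whether the starting directory is included and whether hidden entries are visited.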

Carruthers answered 7/1, 2022 at 14:46 Comment(0)
