How do you use globbing in perl for a one-liner with many files, avoiding xargs/find/etc

Asked 12/3, 2023 at 18:43 Answered 1/4, 2023 at 17:6

When there are too many matching files, shells like bash break if you include a glob pattern on the commandline like

perl -pi -e 's/hi/bye/' too_many_files*

You can work around this with xargs, gnu parallel, or find, but for complex commands, these can be difficult to get right in terms of quoting, and they can also be less efficient than running perl once.

Is there a way to use perl's built-in globbing support for something like this? (which does not work)

perl -pi -e 's/hi/bye/' 'manyfiles*' # <-- Does not work.

Bifurcate answered 12/3, 2023 at 18:43 Comment(7)

What is the actual pattern to build the filelist? ... Must it be a glob? – Catherincatherina 12/3, 2023 at 19:37

Note this isn't a bash problem, but an issue for operating systems that don't allow arbitrarily long command lines. – Puton 12/3, 2023 at 19:45

"difficult to get right in terms of quoting" -- what do you mean? xargs doesn't require changes in quoting. In printf '%s\0' too_many_files* | xargs -0 perl -pi -e 's/hi/bye/', the perl command is quoted exactly the same way as your original code. – Decree 12/3, 2023 at 20:0

@CharlesDuffy I'm not able to reconstruct a good example right now, though I believe I've run into problems, perhaps with nesting of different kinds of quotes, where the single perl command works on one file, but not when called from outside like this. parallel -q has sometimes been handy, and sometimes insufficient. – Bifurcate 12/3, 2023 at 20:50

parallel tries to be "smart" in ways that make quoting difficult. I strongly advise against its use. For background, the thread at lists.gnu.org/archive/html/bug-parallel/2015-05/msg00005.html, and the Unix & Linux question unix.stackexchange.com/questions/349483/… are good places to start. xargs is much less clever, does fewer things behind your back, and generally leans towards making only easily-foreseeable kinds of trouble. – Decree 12/3, 2023 at 21:20

Re "these can be difficult to get right in terms of quoting", huh? not at all – Diabolism 12/3, 2023 at 23:47

@CharlesDuffy I have indeed been assuming that parallel and xargs would have similar issues, so I'll look into your suggestions. I'm still very glad to have learned the BEGIN/glob trick, though! – Bifurcate 13/3, 2023 at 1:58

As noted in this answer, you can use a BEGIN block to have perl (rather than the shell) expand the file list:

slightly modified from the original:

Leave globbing to perl instead of bash, which has limitations,
perl -pi -e 'BEGIN{ @ARGV = glob(pop) } s/#//g' "*"
or, when there are spaces in the globbed directory,
perl -MFile::Glob=bsd_glob -pi -e 'BEGIN{ @ARGV = bsd_glob(pop) } s/#//g' "*"

For more about glob vs bsd_glob, see this post.

(This is intentionally duplicated as I had trouble finding the answer quickly with search terms I had in mind.)

Bifurcate answered 12/3, 2023 at 18:43 Comment(0)

It is very simple to use Perl's glob for the filelist and then process files

perl -MPath::Tiny -we'path($_)->edit_lines( sub { s/hi/bye/ } ) for glob "files*"'

Here I use Path::Tiny with its edit_lines (see edit) while there are yet other tools for this.

A call to glob in the list context^† returns a list and we iterate over it, editing each file in a single statement. If there is more to do, or you'd rather open a file by hand and go through its lines, then you can put that code in a do block

perl -wE'do { ...process file... } for glob "files*"'

If there are so many files that the shell has a problem with the list, then you may want to avoid building it in Perl as well, which is what glob does in list context. (Even though Perl has no such limit, and the performance impact is not disastrous.)

There are a number of ways to take those files one by one -- using File::Find, or things like readdir in scalar context or Path::Tiny::iterator. These return every entry, so you have to select the ones you want with a regex, which means "translating" the glob, but that shouldn't be a problem given how simple glob patterns are. Or use glob in scalar context.^†

If there may be spaces in entry names use File::Glob -- merely adding use File::Glob qw(:bsd_glob); to the program will use that instead of glob. With a one-liner one can do

perl -MFile::Glob=":bsd_glob" -wE'...'

In this case you can also suppress the sorting that glob does by using the GLOB_NOSORT flag, which is good to do for large filelists unless you need them sorted. Since glob itself takes no flags, we then have to actually call bsd_glob in its place

perl -MFile::Glob=":bsd_glob" -wE'do { ... } for bsd_glob("files*", GLOB_NOSORT)'

Tools other than glob, like ones mentioned above, don't care about spaces in names.

^† Another way is to use glob in the scalar context, when it acts as an iterator, and process each file as it is returned

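perl -wE'while (glob "files*") { ... }'

(Each file name is assigned to $_ in turn. A do { ... } while glob "files*" loop would instead run its block once before the first name has been fetched.)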

This avoids building the potentially huge list upfront.

Catherincatherina answered 12/3, 2023 at 19:42 Comment(2)

If using File::Glob directly, might as well also tell it not to sort the files (Unless you really want them sorted, of course) to save some work. – Sulphanilamide 12/3, 2023 at 22:40

@Sulphanilamide Indeed, a good point to avoid sorting on large filelists, thank you. Edited.. – Catherincatherina 13/3, 2023 at 3:28

Contrary to your claims, find and/or xargs don't require any special or complex quoting. And contrary to your claims, they don't have to be inefficient either.

Here are some ways to use find and/or xargs:

find -maxdepth 1 -name 'too_many_files*' -exec               perl -pe'...' -i~ {} +
find -maxdepth 1 -name 'too_many_files*' -exec               perl -pe'...' -i~ {} \;
find -maxdepth 1 -name 'too_many_files*'         | xargs -r  perl -pe'...' -i~
find -maxdepth 1 -name 'too_many_files*' -print0 | xargs -r0 perl -pe'...' -i~

The first requires GNU find.

The second launches a new perl process for each file, so it's slower than the others, which launch perl the minimum number of times.

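The third doesn't support line feeds in the file names. Without -0, xargs also splits its input on blanks and treats quotes and backslashes specially, so names containing those are affected too.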

That said, you could also use

perl -MFile::Glob=bsd_glob -pe'
   BEGIN { @ARGV = map bsd_glob($_), @ARGV }
   s/hi/bye/;
' 'too_many_files*'

Diabolism answered 12/3, 2023 at 23:51 Comment(0)

You can work around this with xargs, gnu parallel, or find, but for complex commands, these can be difficult to get right in terms of quoting

Yep. GNU Parallel provides --shellquote to help with quoting:

$ parallel --shellquote
parallel: Warning: Input is read from the terminal. You are either an expert
parallel: Warning: (in which case: YOU ARE AWESOME!) or maybe you forgot
parallel: Warning: ::: or :::: or -a or to pipe data into parallel. If so
parallel: Warning: consider going through the tutorial: man parallel_tutorial
parallel: Warning: Press CTRL-D to exit.
perl -pi -e 's/"Hi Joe"/"Bye \$\$\$"/'  <-- pasted
'perl -pi -e '"'"'s/"Hi Joe"/"Bye \$\$\$"/'"'"
<CTRL-D>

You can then use:

ls | parallel 'perl -pi -e '"'"'s/"Hi Joe"/"Bye \$\$\$"/'"'"

But that is not very readable.

Instead I recommend defining a bash function. It improves readability greatly - especially if the code is bigger than a one-liner:

doit() {
  perl -pi -e 's/"Hi Joe"/"Bye \$\$\$"/' "$1"
  # do other stuff
}
export -f doit

ls | parallel doit

Olivero answered 1/4, 2023 at 17:6 Comment(0)
