Loop over files in HDFS directory

2

14

I need to loop over all CSV files in a Hadoop file system. I can list all of the files in an HDFS directory with

> hadoop fs -ls /path/to/directory
Found 2 items
drwxr-xr-x   - hadoop hadoop          2 2016-10-12 16:20 /path/to/directory/tmp
-rwxr-xr-x   3 hadoop hadoop 4691945927 2016-10-12 19:37 /path/to/directory/myfile.csv

and can loop over all files in a standard directory with

for filename in /path/to/another/directory/*.csv; do echo $filename; done

but how can I combine the two? I've tried

for filename in `hadoop fs -ls /path/to/directory | grep csv`; do echo $filename; done

but that gives me some nonsense like

Found
2
items
drwxr-xr-x

hadoop
hadoop
2    
2016-10-12
....
Aitken answered 13/10, 2016 at 1:22 Comment(2)
hadoop fs -ls /path/to/directory | grep csv should give you whole lines of standard output, not just filenames. - Nichols
See another question for a nice way to do a loop: #28685971 - Rellia
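To illustrate the first comment: an unquoted command substitution is split on whitespace, so every column of a listing line becomes its own loop item. The snippet below is only a sketch; it uses the file line from the question as a plain string (the variable names are mine) so it runs without a cluster.

```shell
# One line of `hadoop fs -ls` output, copied from the question.
line='-rwxr-xr-x   3 hadoop hadoop 4691945927 2016-10-12 19:37 /path/to/directory/myfile.csv'

# Unquoted expansion is split on whitespace, exactly like the backtick loop
# in the question, so each column is visited separately.
count=0
for word in $line; do
    echo "$word"
    count=$((count + 1))
done

# Number of separate loop items produced by this single line.
echo "$count"
```

Only the last of those items is the path, which is why the loop needs to pick out that column before iterating.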
14

This should work:

for filename in `hadoop fs -ls /path/to/directory | awk '{print $NF}' | grep '\.csv$' | tr '\n' ' '`
do echo "$filename"; done
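A rough sketch of what each stage of that pipeline does, assuming the listing looks like the one in the question. The listing is stored in a variable here (my own stand-in for the real hadoop fs -ls call) so the steps can be tried anywhere.

```shell
# Sample listing copied from the question; a stand-in for real `hadoop fs -ls` output.
listing='Found 2 items
drwxr-xr-x   - hadoop hadoop          2 2016-10-12 16:20 /path/to/directory/tmp
-rwxr-xr-x   3 hadoop hadoop 4691945927 2016-10-12 19:37 /path/to/directory/myfile.csv'

# awk prints only the last whitespace-separated field of each line (the path);
# grep then keeps only the lines that end in a literal ".csv".
csv_paths=$(printf '%s\n' "$listing" | awk '{print $NF}' | grep '\.csv$')

# The unquoted expansion is split on whitespace, giving one path per iteration.
for filename in $csv_paths; do
    echo "$filename"
done
```

Because both awk and the loop split on whitespace, this approach presumably breaks for paths that contain spaces.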
Radiophotograph answered 13/10, 2016 at 2:0 Comment(5)
This works like a charm! But it prints the entire path to the file. How can I shorten it so that it prints only the file name? - Coreligionist
For anyone looking for a similar solution, use 'cut' to get the substring: $(echo $filename | cut -f4 -d/) (the field number depends on how deep the path is). - Coreligionist
See #965553 for a shorter way. - Radiophotograph
It would be great if someone could explain how this works. - Fisticuffs
It works for me when I run it in the shell, but when I run it through a script, the loop runs only once. The output is a single string that contains the full filename of each file in the directory. The tr operation replaces each newline character with a space and turns the lines of the -ls output into a single space-separated line. How can I fix this? - Shiite
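A possible way around both issues raised in these comments, offered as a sketch rather than a confirmed fix: read the paths line by line instead of relying on word splitting, and strip the directory part with a parameter expansion instead of counting fields for cut. list_dir is a hypothetical stand-in for hadoop fs -ls /path/to/directory, and other.csv is an invented second file for the demonstration.

```shell
# Hypothetical stand-in for `hadoop fs -ls /path/to/directory`,
# so the sketch runs without a cluster. other.csv is made up.
list_dir() {
    printf '%s\n' \
        'Found 2 items' \
        '-rwxr-xr-x   3 hadoop hadoop 4691945927 2016-10-12 19:37 /path/to/directory/myfile.csv' \
        '-rwxr-xr-x   3 hadoop hadoop 1024 2016-10-12 19:40 /path/to/directory/other.csv'
}

names=""
# Read one path per line; no tr step and no dependence on word splitting.
while IFS= read -r path; do
    # Remove everything up to and including the last "/" to get the file name.
    name=${path##*/}
    echo "$name"
    names="$names$name "
done <<EOF
$(list_dir | awk '{print $NF}' | grep '\.csv$')
EOF
```

Feeding the loop from a here-document keeps it in the current shell, so variables set inside the loop (such as names above) are still available afterwards.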
3

The -C option will display only the file paths.

for filename in $(hadoop fs -ls -C '/path/to/directory/*.csv'); do
    echo "${filename}"
done
Ambagious answered 16/8, 2021 at 18:2 Comment(0)
