I attempted to implement the accepted solution above. Unfortunately, it only partially worked for me; I ran into three real-world problems.
First, hdfs didn't have enough RAM to load up and print all the files.
Second, even when hdfs could print all the files, awk could only handle ~8,300 records before it broke.
Third, the performance was abysmal: as implemented, it was deleting ~10 files per minute, which wasn't useful because I was generating ~240 files per minute.
So my final solution was this:
# Scratch file to collect the paths of everything old enough to delete
tmpfile=$(mktemp)
# List the directory with a 2 GB client heap, keep the date/time/path columns,
# and write out the paths of files older than 35 days
HADOOP_CLIENT_OPTS="-Xmx2g" hdfs dfs -ls /path/to/directory | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=35*24*60; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; close(cmd); DIFF=NOW-WHEN; if(DIFF > LAST){ print $3 } }' > "$tmpfile"
# Delete everything that was collected, then clean up
hdfs dfs -rm -r $(cat "$tmpfile")
rm "$tmpfile"
I don't know if there are additional limits on this solution, but it handles 50,000+ records in a timely fashion.
EDIT: Interestingly, I ran into this issue again, and on the remove I had to batch my deletes, as the hdfs rm statement couldn't take more than ~32,000 inputs.
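For anyone who hits the same ceiling, one way to batch (a sketch, not the exact commands I ran) is to let xargs split the path list into chunks; the chunk size of 1,000 below is arbitrary, just keep it well under ~32,000:
# Remove the collected paths 1,000 at a time instead of in one giant command
xargs -n 1000 hdfs dfs -rm -r < "$tmpfile"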
hdfs dfs -ls -R /path/to/directory | egrep .txt$ is a good start. – Sempiternal
hdfs dfs -rm in a loop... In other words, it needs to be scripted. – Sempiternal
x in Hadoop 2.x? – Sempiternal
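A minimal sketch of the "rm in a loop" idea from these comments (the .txt filter and path are just the commenter's example; note that invoking hdfs once per file is slow, roughly the ~10 deletes per minute mentioned above):
# List recursively, keep only .txt files, and delete them one at a time
hdfs dfs -ls -R /path/to/directory | egrep '\.txt$' | awk '{print $8}' | while read -r path; do
    hdfs dfs -rm "$path"
done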