Delete files older than 10 days on HDFS
Is there a way to delete files older than 10 days on HDFS?

In Linux I would use:

find /path/to/directory/ -type f -mtime +10 -name '*.txt' -execdir rm -- {} \;

Is there a way to do this on HDFS? (Deletion to be done based on file creation date)

Valgus asked 29/5, 2017 at 5:15 Comment(18)
There is no find command, but hdfs dfs -ls -R /path/to/directory | egrep .txt$ is a good start.Sempiternal
@cricket_007 but how do we do the older than 'x' days?Valgus
You'd have to cut out the date portion of the standard output, store that filtered file list, and run hdfs dfs -rm in a loop... In other words, it needs to be scripted.Sempiternal
See: data retention: third optionSempiternal
Which hadoop version are you using?Smallage
@cricket_007 looks like that's the only way.Valgus
@GauravDave Hadoop2Valgus
I think he meant, what is x in Hadoop 2.x?Sempiternal
I use this scriptAurangzeb
@Aurangzeb Thanks, shall try this :)Valgus
@cricket_007 Hadoop 2.7.3Valgus
Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See What topics can I ask about here in the Help Center. Perhaps Super User or Unix & Linux Stack Exchange would be a better place to ask.Fruitless
@Fruitless Why would you say this isn't a topic to be asked here? Super User and Unix & Linux don't cover HDFS; this is a framework-oriented question about how a developer could delete something using a tool (or command).Valgus
Stack Overflow is a site for programming and development questions. As asked, deleting files or repairing your filesystem is not on-topic. Ask at another site. If you don't like the suggested sites, then try one of Hadoop's mailing lists.Fruitless
@Fruitless I wanted to delete old files on HDFS, not do h/w administration! If a mailing list is the best option for all questions, why do we have SO?Valgus
You have SO for your programming and development questions.Fruitless
@Fruitless Please go back to the link you pointed to in the earlier comment and read through it. Questions related to a framework/tool commonly used by programmers (in this case, Hadoop) are apt for SO. Reference: SO on-topicValgus
You missed the other part. The part about "... and is a practical, answerable problem that is unique to software development". Deleting files and fixing your filesystem has nothing to do with programming or development. There are more appropriate sites to learn how to delete files and run commands.Fruitless

Solution 1: Using multiple commands as answered by daemon12

hdfs dfs -ls /file/Path | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) }}'

Solution 2: Using a shell script

today=`date +'%s'`
hdfs dfs -ls /file/Path/ | grep "^d" | while read line ; do
  dir_date=$(echo ${line} | awk '{print $6}')    # modification date (column 6)
  filePath=$(echo ${line} | awk '{print $8}')    # full path (column 8)
  # Age of the entry in days
  difference=$(( ( ${today} - $(date -d ${dir_date} +%s) ) / ( 24*60*60 ) ))

  if [ ${difference} -gt 10 ]; then
    hdfs dfs -rm -r $filePath
  fi
done
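
A side note (my addition, not part of the original answer): grep "^d" keeps only directories, because a directory's permission string starts with d. To target regular files instead, swap the filter:

hdfs dfs -ls /file/Path/ | grep "^-"    # a regular file's permission string starts with "-"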
Valgus answered 29/5, 2017 at 17:59 Comment(2)
The grep "^d" would return only directories (pending approval of my edit). It may be advisable to use something like ... | grep "/file/Path/" | ... to avoid header lines in the processed output, as well as grep -v "^d" to avoid directories, if that is necessary.Arrio
fix $(date -d ${dir_date} +%s) to $(date -d ${dir_date} +'%s')Windpollinated

How about this:

hdfs dfs -ls /tmp | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) }}'

A detailed description is here.
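
Since the link above may no longer be reachable, here is the same pipeline spread over multiple lines with comments. It is only a readable restatement of the one-liner, with one addition of mine: a close(cmd) call, so the per-record date pipe is closed instead of accumulating open file descriptors.

hdfs dfs -ls /tmp |      # one entry per line: perms, repl, owner, group, size, date, time, path
  tr -s " " |            # squeeze runs of spaces so cut sees single-space fields
  cut -d' ' -f6-8 |      # keep fields 6-8: date, time, path
  grep "^[0-9]" |        # drop the "Found N items" header line
  awk 'BEGIN { MIN=14400; LAST=60*MIN;       # 14400 minutes = 10 days, in seconds
               "date +%s" | getline NOW }    # current time as a Unix epoch
       { cmd = "date -d'\''" $1 " " $2 "'\'' +%s"    # epoch time of this entry
         cmd | getline WHEN; close(cmd)
         if (NOW - WHEN > LAST) {
           print "Deleting: " $3
           system("hdfs dfs -rm -r " $3)
         }
       }'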

Dickie answered 29/5, 2017 at 9:50 Comment(2)
And skipTrash??Bandurria
Yes, if the user wants to delete the files without moving them to trash.Dickie

Yes, you can try with HdfsFindTool:

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar \
  org.apache.solr.hadoop.HdfsFindTool \
  -find /pathhodir -mtime +10 -name ^.*\.txt$ \
  | xargs hdfs dfs -rm -r -skipTrash
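
It can be worth previewing the matches before wiring them into the delete; the same command without the xargs stage only lists what would be removed:

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar \
  org.apache.solr.hadoop.HdfsFindTool \
  -find /pathhodir -mtime +10 -name ^.*\.txt$    # list only; nothing is deleted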
Bandurria answered 29/5, 2017 at 11:45 Comment(2)
For those of us not using CDH, how do we get that?Sempiternal
This is reallyyy slow.Kaseykasha

I attempted to implement the accepted solution above.

Unfortunately, it only partially worked for me. I ran into three real-world problems.

First, hdfs didn't have enough RAM to load up and print all the files.

Second, even when hdfs could print all the files, awk could only handle ~8300 records before it broke.

Third, the performance was abysmal. When implemented, it was deleting ~10 files per minute, which wasn't useful because I was generating ~240 files per minute.

So my final solution was this:

tmpfile=$(mktemp)
HADOOP_CLIENT_OPTS="-Xmx2g" hdfs dfs -ls /path/to/directory | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=35*24*60; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; close(cmd); DIFF=NOW-WHEN; if(DIFF > LAST){ print $3 }}' > $tmpfile
hdfs dfs -rm -r $(cat $tmpfile)
rm "$tmpfile"

I don't know if there are additional limits on this solution but it handles 50,000+ records in a timely fashion.

EDIT: Interestingly, I ran into this issue again, and on the remove I had to batch my deletes, as the hdfs rm statement couldn't take more than ~32,000 inputs.
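
For reference, a minimal way to batch those deletes (the batch size of 1000 is my arbitrary choice, not something from the original run) is to feed the temp file through xargs instead of a single command substitution:

# Delete at most 1000 paths per hdfs invocation, staying well
# under the ~32,000-argument limit observed above.
xargs -n 1000 hdfs dfs -rm -r < "$tmpfile"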

Defaulter answered 14/7, 2020 at 21:37 Comment(0)
target_date=$(date -d "-10 days" +'%Y-%m-%d %H:%M')
hdfs dfs -ls -t /file/Path | awk -v dateA="$target_date" '{ if (NF >= 8 && $6" "$7 < dateA) print $8 }' | xargs -I% hdfs dfs -rm "%"
Fula answered 18/6, 2020 at 4:12 Comment(0)
today=`date +'%s'`
days_to_keep=10

# Loop through files
hdfs dfs -ls -R /file/Path/ | while read f; do
  # Get File Date and File Name
  file_date=`echo $f | awk '{print $6}'`
  file_name=`echo $f | awk '{print $8}'`

  # Calculate Days Difference
  difference=$(( ($today - $(date -d $file_date +%s)) / (24 * 60 * 60) ))
  if [ $difference -gt $days_to_keep ]; then
    echo "Deleting $file_name it is older than $days_to_keep and is dated $file_date."
    hdfs dfs -rm -r $file_name
  fi
done
Jointless answered 22/3, 2021 at 3:10 Comment(1)
Is there a difference between this and the similar answer above?Valgus

Thanks to @Ani Menon's answer:

hdfs dfs -ls /file/Path | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) }}'

which I updated to:

EXPIRED_DAY=80; PP='a hdfs path'; hdfs dfs -ls "${PP}" | tr -s ' ' | cut -d' ' -f6-8 | grep "^[0-9]" | awk -v EXPIRED_DAY=${EXPIRED_DAY} 'BEGIN{LAST=24*60*60*EXPIRED_DAY; "date +%s" | getline NOW} {cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF>LAST){print "\""$3"\""}}' | xargs -n 100 hdfs dfs -rm -r

with these changes:

  1. Parameterized the expiry in days (EXPIRED_DAY) and the path (PP).
  2. Quoted each printed path ({print "\""$3"\""}) so paths containing spaces are not split.
  3. Batched the removal, passing up to 100 paths per hdfs call via xargs -n 100, which is more efficient than one call per path.

You can also run just the listing part of the command to preview what would be removed:

EXPIRED_DAY=80; PP='a hdfs path'; hdfs dfs -ls "${PP}" | tr -s ' ' | cut -d' ' -f6-8 | grep "^[0-9]" | awk -v EXPIRED_DAY=${EXPIRED_DAY} 'BEGIN{LAST=24*60*60*EXPIRED_DAY; "date +%s" | getline NOW} {cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF>LAST){print "\""$3"\""}}'

Pandich answered 7/9, 2023 at 6:37 Comment(0)

Just to add another variation of the previously submitted answers (I don't claim originality). The script can be modified to remove sub-folders or files recursively.

#!/bin/bash
function hdfs-list-older-than () {
  # list all content | keep regular files only (drop sub-folders) | filter by modification datetime and print the path
  hdfs dfs -ls $1 | grep ^- | awk -v d=$2 -v t=$3 '{if($6 < d || ($6 == d && $7 < t)){print $8}}'
}
hdfs-list-older-than $1 `date -d "-10 days" +'%Y-%m-%d %H:%M'` | xargs hdfs dfs -rm
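
A usage sketch (the file name cleanup.sh is hypothetical):

chmod +x cleanup.sh
./cleanup.sh /path/to/directory    # removes files under /path/to/directory older than 10 days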
Zirkle answered 6/12, 2021 at 1:26 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Herrod
