Delete files older than 10 days on HDFS
Is there a way to delete files older than 10 days on HDFS?

In Linux I would use:

find /path/to/directory/ -type f -mtime +10 -name '*.txt' -execdir rm -- {} \;

Is there a way to do this on HDFS? (Deletion to be done based on file creation date)

Valgus asked 29/5, 2017 at 5:15 Comment(18)
There is no find command, but hdfs dfs -ls -R /path/to/directory | egrep .txt$ is a good start.Sempiternal
@cricket_007 but how do we do the older than 'x' days?Valgus
You'd have to cut out the date portion of the standard output, store that filtered file list, and run hdfs dfs -rm in a loop... In other words, it needs to be scripted.Sempiternal
See: data retention: third optionSempiternal
Which hadoop version are you using?Smallage
@cricket_007 looks like that's the only way.Valgus
@GauravDave Hadoop2Valgus
I think he meant, what is x in Hadoop 2.x?Sempiternal
I use this scriptAurangzeb
@Aurangzeb Thanks, shall try this :)Valgus
@cricket_007 Hadoop 2.7.3Valgus
Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See What topics can I ask about here in the Help Center. Perhaps Super User or Unix & Linux Stack Exchange would be a better place to ask.Fruitless
@Fruitless Why would you say this isn't a topic to be asked here? Super User and Unix & Linux don't cover HDFS; this is a framework-oriented question about how a developer could delete something using a tool (or command).Valgus
Stack Overflow is a site for programming and development questions. As asked, deleting files or repairing your filesystem is not on-topic. Ask at another site. If you don't like the suggested sites, then try one of Hadoop's mailing lists.Fruitless
@Fruitless I wanted to delete old files on HDFS, not do h/w administration! If a mailing list is the best option for all questions, why do we have SO?Valgus
You have SO for your programming and development questions.Fruitless
@Fruitless Please go back to the link you pointed to in the earlier comment and read through it. Questions related to a framework/tool commonly used by programmers (in this case, Hadoop) are apt for SO. Reference: SO on-topicValgus
You missed the other part. The part about "... and is a practical, answerable problem that is unique to software development". Deleting files and fixing your filesystem has nothing to do with programming or development. There are more appropriate sites to learn how to delete files and run commands.Fruitless

Solution 1: Using multiple commands as answered by daemon12

hdfs dfs -ls /file/Path | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) }}'

Solution 2: Using a shell script

today=`date +'%s'`
hdfs dfs -ls /file/Path/ | grep "^d" | while read line ; do
  dir_date=$(echo ${line} | awk '{print $6}')    # modification date (column 6)
  filePath=$(echo ${line} | awk '{print $8}')    # full path (column 8)
  # Age of the entry in days
  difference=$(( ( ${today} - $(date -d ${dir_date} +%s) ) / ( 24*60*60 ) ))

  if [ ${difference} -gt 10 ]; then
    hdfs dfs -rm -r $filePath
  fi
done
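
A side note (my addition, not part of the original answer): grep "^d" keeps only directories, because a directory's permission string starts with d. To target regular files instead, swap the filter:

hdfs dfs -ls /file/Path/ | grep "^-"    # a regular file's permission string starts with "-"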
Valgus answered 29/5, 2017 at 17:59 Comment(2)
The grep "^d" would return only directories (pending approval of my edit). It may be advisable to use something like ... | grep "/file/Path/" | ... to avoid header lines in the processed output, as well as grep -v "^d" to avoid directories, if that is necessary.Arrio
fix $(date -d ${dir_date} +%s) to $(date -d ${dir_date} +'%s')Windpollinated

How about this:

hdfs dfs -ls /tmp | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) }}'

A detailed description is here.
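
Since the link above may no longer be reachable, here is the same pipeline spread over multiple lines with comments. It is only a readable restatement of the one-liner, with one addition of mine: a close(cmd) call, so the per-record date pipe is closed instead of accumulating open file descriptors.

hdfs dfs -ls /tmp |      # one entry per line: perms, repl, owner, group, size, date, time, path
  tr -s " " |            # squeeze runs of spaces so cut sees single-space fields
  cut -d' ' -f6-8 |      # keep fields 6-8: date, time, path
  grep "^[0-9]" |        # drop the "Found N items" header line
  awk 'BEGIN { MIN=14400; LAST=60*MIN;       # 14400 minutes = 10 days, in seconds
               "date +%s" | getline NOW }    # current time as a Unix epoch
       { cmd = "date -d'\''" $1 " " $2 "'\'' +%s"    # epoch time of this entry
         cmd | getline WHEN; close(cmd)
         if (NOW - WHEN > LAST) {
           print "Deleting: " $3
           system("hdfs dfs -rm -r " $3)
         }
       }'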

Dickie answered 29/5, 2017 at 9:50 Comment(2)
And skipTrash??Bandurria
Yes, if the user wants to delete the files without moving them to trash.Dickie

Yes, you can try with HdfsFindTool:

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar \
  org.apache.solr.hadoop.HdfsFindTool \
  -find /pathhodir -mtime +10 -name ^.*\.txt$ \
  | xargs hdfs dfs -rm -r -skipTrash
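
It can be worth previewing the matches before wiring them into the delete; the same command without the xargs stage only lists what would be removed:

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar \
  org.apache.solr.hadoop.HdfsFindTool \
  -find /pathhodir -mtime +10 -name ^.*\.txt$    # list only; nothing is deleted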
Bandurria answered 29/5, 2017 at 11:45 Comment(2)
For those of us not using CDH, how do we get that?Sempiternal
This is reallyyy slow.Kaseykasha

I attempted to implement the accepted solution above.

Unfortunately, it only partially worked for me. I ran into three real-world problems.

First, hdfs didn't have enough RAM to load up and print all the files.

Second, even when hdfs could print all the files, awk could only handle ~8300 records before it broke.

Third, the performance was abysmal. When implemented, it was deleting ~10 files per minute, which wasn't useful because I was generating ~240 files per minute.

So my final solution was this:

tmpfile=$(mktemp)
HADOOP_CLIENT_OPTS="-Xmx2g" hdfs dfs -ls /path/to/directory | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=35*24*60; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; close(cmd); DIFF=NOW-WHEN; if(DIFF > LAST){ print $3 }}' > $tmpfile
hdfs dfs -rm -r $(cat $tmpfile)
rm "$tmpfile"

I don't know if there are additional limits on this solution but it handles 50,000+ records in a timely fashion.

EDIT: Interestingly, I ran into this issue again, and on the remove I had to batch my deletes, as the hdfs rm statement couldn't take more than ~32,000 inputs.
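
For reference, a minimal way to batch those deletes (the batch size of 1000 is my arbitrary choice, not something from the original run) is to feed the temp file through xargs instead of a single command substitution:

# Delete at most 1000 paths per hdfs invocation, staying well
# under the ~32,000-argument limit observed above.
xargs -n 1000 hdfs dfs -rm -r < "$tmpfile"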

Defaulter answered 14/7, 2020 at 21:37 Comment(0)
target_date=$(date -d "-10 days" +'%Y-%m-%d %H:%M')
hdfs dfs -ls -t /file/Path | awk -v dateA="$target_date" '{ if (NF >= 8 && $6" "$7 < dateA) print $8 }' | xargs -I% hdfs dfs -rm "%"
Fula answered 18/6, 2020 at 4:12 Comment(0)
today=`date +'%s'`
days_to_keep=10

# Loop through files
hdfs dfs -ls -R /file/Path/ | while read f; do
  # Get File Date and File Name
  file_date=`echo $f | awk '{print $6}'`
  file_name=`echo $f | awk '{print $8}'`

  # Calculate Days Difference
  difference=$(( ($today - $(date -d $file_date +%s)) / (24 * 60 * 60) ))
  if [ $difference -gt $days_to_keep ]; then
    echo "Deleting $file_name it is older than $days_to_keep and is dated $file_date."
    hdfs dfs -rm -r $file_name
  fi
done
Jointless answered 22/3, 2021 at 3:10 Comment(1)
Is there a difference between this and the similar answer above?Valgus

Thanks to @Ani Menon's answer:

hdfs dfs -ls /file/Path | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) }}'

which I updated to:

EXPIRED_DAY=80; PP='a hdfs path'; hdfs dfs -ls "${PP}" | tr -s ' ' | cut -d' ' -f6-8 | grep "^[0-9]" | awk -v EXPIRED_DAY=${EXPIRED_DAY} 'BEGIN{LAST=24*60*60*EXPIRED_DAY; "date +%s" | getline NOW} {cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF>LAST){print "\""$3"\""}}' | xargs -n 100 hdfs dfs -rm -r

with these changes:

  1. Parameterized the expiry in days (EXPIRED_DAY) and the path (PP).
  2. Quoted each printed path ({print "\""$3"\""}) so paths containing spaces are not split.
  3. Batched the removal, passing up to 100 paths per hdfs call via xargs -n 100, which is more efficient than one call per path.

You can also run just the listing part of the command to preview what would be removed:

EXPIRED_DAY=80; PP='a hdfs path'; hdfs dfs -ls "${PP}" | tr -s ' ' | cut -d' ' -f6-8 | grep "^[0-9]" | awk -v EXPIRED_DAY=${EXPIRED_DAY} 'BEGIN{LAST=24*60*60*EXPIRED_DAY; "date +%s" | getline NOW} {cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF>LAST){print "\""$3"\""}}'

Pandich answered 7/9, 2023 at 6:37 Comment(0)

Just to add another variation of the previously submitted answers (I don't claim originality). The script can be modified to remove sub-folders or files recursively.

#!/bin/bash
function hdfs-list-older-than () {
  # list all content | keep regular files only (drop sub-folders) | filter by modification datetime and print the path
  hdfs dfs -ls $1 | grep ^- | awk -v d=$2 -v t=$3 '{if($6 < d || ($6 == d && $7 < t)){print $8}}'
}
hdfs-list-older-than $1 `date -d "-10 days" +'%Y-%m-%d %H:%M'` | xargs hdfs dfs -rm
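
A usage sketch (the file name cleanup.sh is hypothetical):

chmod +x cleanup.sh
./cleanup.sh /path/to/directory    # removes files under /path/to/directory older than 10 days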
Zirkle answered 6/12, 2021 at 1:26 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Herrod
