BASH - Tell if duplicate lines exist (y/n)
B

4

8

I am writing a script to manipulate a text file.

The first thing I want to do is check whether duplicate entries exist and, if so, ask the user whether they want to keep or remove them.

I know how to display duplicate lines if they exist, but what I want to learn is just to get a yes/no answer to the question "Do duplicates exist?"

It seems uniq returns 0 whether or not duplicates were found, as long as the command completed without issues.

Is there a command that I can put in an if-statement just to tell me whether duplicate lines exist?

My file is very simple: it is just values in a single column.

Boothe answered 18/3, 2014 at 23:7 Comment(2)
If you're not against using Vim to filter text files manually, I recommend the HighlightRepeats method in stackoverflow.com/questions/1268032. I often use it to filter duplicate files/folders, then apply shell commands on the filtered file.Reside
@F.X Thanks for your reply. I would like to accomplish this with some lines within my script.Boothe
O
3

You can use awk combined with the boolean || operator:

# Ask the question only if awk found a duplicate
awk 'a[$0]++{exit 1}' test.txt || (
    echo -n "remove duplicates? [y/n] "
    read answer
    # Remove duplicates if the answer was "y". `[` is shorthand for the
    # test command; see `help [`. Plain `uniq` only collapses adjacent
    # repeats, so awk is used to drop repeats anywhere in the file.
    [ "$answer" = "y" ] && awk '!a[$0]++' test.txt > test.uniq.txt
)

The block after the || is only executed if the awk command exits with a non-zero status. Here awk exits with 1 as soon as it sees a line for the second time, meaning it found a duplicate.

For a basic understanding, here is the same logic written with an if block:

awk 'a[$0]++{exit 1}' test.txt

# $? contains the exit status of the last command
if [ $? -ne 0 ] ; then
    echo -n "remove duplicates? [y/n] "
    read answer
    # check answer
    if [ "$answer" = "y" ] ; then
        # plain uniq only collapses adjacent repeats, so use awk instead
        awk '!a[$0]++' test.txt > test.uniq.txt
    fi
fi

Note that [ ] are not just brackets as in other programming languages: [ is a synonym for the bash builtin command test, and ] is its required last argument. Run help [ (or help test) for the details.
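The exit status can also be tested directly in the if, without going through $?. The following is only a sketch under a few assumptions: dedup_keep_order is a hypothetical helper name (it is not from the answer above), and the sample file is created just so the snippet runs on its own.

```shell
# dedup_keep_order is a hypothetical helper name, not from the answer above.
# It prints each distinct line once, in order of first appearance.
dedup_keep_order() {
    awk '!seen[$0]++' "$1"
}

# Small sample file so the sketch can be run as-is.
sample=$(mktemp)
printf '%s\n' apple banana apple cherry > "$sample"

# "!" inverts awk's exit status, so the branch runs when a repeat was found.
if ! awk 'a[$0]++{exit 1}' "$sample"; then
    echo "duplicates found; de-duplicated copy:"
    dedup_keep_order "$sample"
fi
rm -f "$sample"
```

Unlike uniq, the awk filter does not need sorted input, so the original line order is kept.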

Outrun answered 18/3, 2014 at 23:17 Comment(1)
Thanks for your help. I will give a try to your code.Boothe
C
11

I'd probably use awk to do this but, for the sake of variety, here is a brief pipe to accomplish the same thing:

$ { sort | uniq -d | grep . -qc; } < noduplicates.txt; echo $?
1
$ { sort | uniq -d | grep . -qc; } < duplicates.txt; echo $?
0

sort and uniq -d together make sure that only duplicated lines (which don't have to be adjacent) are printed to stdout. grep . then succeeds only if at least one such line came through, and it exits with 1 if nothing matched, which is the useful part here. -q silences the output so the pipeline can be used quietly in a script. With -q in effect, -c (which would otherwise print a line count, emulating wc -l) has nothing to print and can be dropped.

has_duplicates()
{
  {
    sort | uniq -d | grep . -qc
  } < "$1"
}

if has_duplicates myfile.txt; then
  echo "myfile.txt has duplicate lines"
else
  echo "myfile.txt has no duplicate lines"
fi
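
One caveat with grep . is that the pattern needs at least one character, so a file whose only repeated line is an empty line is reported as having no duplicates. A possible variant, sketched here under the hypothetical name has_duplicates_any, matches on the start-of-line anchor instead, which is also present on empty lines:

```shell
# has_duplicates_any is a hypothetical name, not from the answer above.
# It succeeds when any line of the file "$1" is repeated, including
# empty lines.
has_duplicates_any() {
    # '^' matches every line uniq -d prints, even an empty one;
    # grep -q exits 1 when uniq -d printed nothing at all.
    sort "$1" | uniq -d | grep -q '^'
}

# Sample file whose only repeated line is the empty line.
sample=$(mktemp)
printf 'a\n\nb\n\n' > "$sample"

if has_duplicates_any "$sample"; then
    echo "duplicate lines found"
else
    echo "no duplicate lines"
fi
rm -f "$sample"
```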
Cairns answered 19/3, 2014 at 7:33 Comment(0)
K
1

You can get a "unique: yes/no" answer using this awk one-liner (it prints no when duplicates exist and yes when every line is unique):

awk '!seen[$0]{seen[$0]++; i++} END{print (NR>i)?"no":"yes"}' file
  • awk uses an array of unique lines called seen.
  • Every time we add a new element to seen, we increment the counter i.
  • Finally, in the END block, we compare the number of records (NR) with the number of unique records (i) using (NR>i)?
  • If the condition is true, there are duplicate records and we print no; otherwise we print yes.
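
Since the question asks for something that fits in an if-statement, the same counting idea can report through the exit status instead of printing. This is only a sketch: all_unique and the counter name distinct are hypothetical, and the sample file exists just to make the snippet runnable.

```shell
# all_unique is a hypothetical name, not from the answer above.
# It succeeds (exit status 0) when every line of the file "$1" is distinct.
all_unique() {
    # distinct counts first occurrences; NR is the total number of lines.
    # exit (NR > distinct) yields status 1 when some line was repeated.
    awk '!seen[$0]++ { distinct++ } END { exit (NR > distinct) }' "$1"
}

sample=$(mktemp)
printf '%s\n' one two one > "$sample"

if all_unique "$sample"; then
    echo "no duplicates"
else
    echo "duplicates exist"
fi
rm -f "$sample"
```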
Kizzykjersti answered 18/3, 2014 at 23:17 Comment(2)
Thanks for your reply. Can you please explain to me how your line works?Boothe
Yes sure added explanation.Kizzykjersti
S
1

A quick bash solution:

#!/bin/bash

INPUT_FILE=words

declare -A a
# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line ; do
    [ "${a[$line]}" = 'nonempty' ] && duplicates=yes && break
    a[$line]=nonempty
done < "$INPUT_FILE"

removeDuplicates() {
    sort -u "$INPUT_FILE" > "$INPUT_FILE.tmp"
    mv "$INPUT_FILE.tmp" "$INPUT_FILE"
}

# Only ask, and only touch the file, when a duplicate was actually found
if [ "$duplicates" = yes ]; then
    echo -n "Keep duplicates? [Y/n] "
    read keepDuplicates
    [ "$keepDuplicates" != "Y" ] && removeDuplicates
fi

The script reads INPUT_FILE line by line and stores each line in the associative array a as the key, with the string nonempty as the value. Before storing the value, it first checks whether the key is already there. If it is, a duplicate has been found, so the script sets the duplicates flag and breaks out of the loop.

Later it checks whether the flag is set and, if so, asks the user whether to keep the duplicates. If they answer anything other than Y, it calls the removeDuplicates function, which uses sort -u to remove the duplicates (note that this also sorts the file). ${a[$line]} evaluates to the value of the associative array a for the key $line. [ "$duplicates" = yes ] is bash builtin syntax for a test. If the test succeeds, whatever follows && is evaluated.

But note that the awk solutions will likely be faster so you may want to use them if you expect to process bigger files.

Simulate answered 18/3, 2014 at 23:34 Comment(2)
Thanks jkbkot! Can you please give me a brief explanation of how this code works? I am a rookie :)Boothe
@Boothe no problem, added explanation. Btw, upvoting is good enough as thanks ;) also, try to accept one of the answers to keep the site organized. Happy coding!Simulate
