Comparing two files in the Linux terminal
A

13

206

There are two files, "a.txt" and "b.txt", and both contain a list of words. I want to find the words that are in "a.txt" but not in "b.txt".

I need an efficient algorithm because I am comparing two dictionaries.

Afterworld answered 24/1, 2013 at 11:54 Comment(3)
Is diff a.txt b.txt not enough?Northcutt
Can the words occur several times in each file? Can you sort the files?Cornellcornelle
I need only the words that are present in "a.txt" and not present in "b.txt".Afterworld
A
2

Here is my solution:

mkdir -p ~/temp ~/results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
# Count the words in each dictionary
wc -l < ~/temp/american-english-dictionary > ~/results/count-american-english-dictionary
wc -l < ~/temp/british-english-dictionary > ~/results/count-british-english-dictionary
# Words present in both dictionaries
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
# Words present in only one of the dictionaries
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
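
If only the words unique to one file are needed, the intermediate common-english file can probably be skipped, because grep -Fxvf can take the other word list directly as its pattern file. A minimal sketch, assuming two small sample files in a temporary directory (the file names and words are made up for illustration and are not the dictionaries above):

```shell
# Hypothetical sample data in a throwaway directory
tmpdir=$(mktemp -d)
printf '%s\n' apple banana cherry colour > "$tmpdir/a.txt"
printf '%s\n' banana cherry color > "$tmpdir/b.txt"

# -F: fixed strings, -x: match whole lines only,
# -v: invert the match, -f: read the patterns from a file.
# Prints the lines of a.txt that do not occur anywhere in b.txt.
only_in_a=$(grep -Fxvf "$tmpdir/b.txt" "$tmpdir/a.txt")
printf '%s\n' "$only_in_a"

rm -rf "$tmpdir"
```

The order of the lines in a.txt is preserved, and neither file has to be sorted.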
Afterworld answered 24/1, 2013 at 13:28 Comment(2)
Did you try any of the other solutions? Was one of them useful to you? Your question is generic enough to draw in many users, but your answer is too specific for my taste... For my particular case, sdiff -s file1 file2 was useful.Germanium
@Germanium my solution does not use the sdiff command. It uses only Linux built-in commands to solve the problem.Afterworld
R
407

If you have vim installed, try this:

vimdiff file1 file2

or

vim -d file1 file2

You will find it fantastic.

Rule answered 13/2, 2014 at 9:10 Comment(5)
Your answer is awesome, but my teacher required me to not use any library function :PAfterworld
The colored text marks the parts that differ between the two files. @zygimantusRule
This solution is awesome; it's just a pity that it is vim, because it is sometimes unnecessarily complex.Efflorescence
How do I exit vimdiff file1 file2?Narthex
Press ESC, then type a colon followed by q, like this: ":q" (or ":qa" to close both windows at once).Poirier
T
83

Sort them and use comm:

comm -23 <(sort a.txt) <(sort b.txt)

comm compares (sorted) input files and by default outputs three columns: lines that are unique to a, lines that are unique to b, and lines that are present in both. By specifying -1, -2 and/or -3 you can suppress the corresponding column. Therefore comm -23 a b lists only the entries that are unique to a. I use the <(...) process substitution syntax (available in bash) to sort the files on the fly; if they are already sorted, you don't need this.
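
To make the three columns concrete, here is a small sketch that selects each column in turn. The sample words and the temporary file names are assumptions for illustration. It sorts into temporary files rather than using <(...), so that it also runs in shells without process substitution:

```shell
# Hypothetical sample data in a throwaway directory
tmpdir=$(mktemp -d)
printf '%s\n' pear apple fig > "$tmpdir/a.txt"
printf '%s\n' fig kiwi apple > "$tmpdir/b.txt"

# comm expects both inputs sorted with the same collation;
# LC_ALL=C makes the ordering explicit and consistent.
LC_ALL=C sort "$tmpdir/a.txt" > "$tmpdir/a.sorted"
LC_ALL=C sort "$tmpdir/b.txt" > "$tmpdir/b.sorted"

only_a=$(LC_ALL=C comm -23 "$tmpdir/a.sorted" "$tmpdir/b.sorted")  # column 1: unique to a
only_b=$(LC_ALL=C comm -13 "$tmpdir/a.sorted" "$tmpdir/b.sorted")  # column 2: unique to b
both=$(LC_ALL=C comm -12 "$tmpdir/a.sorted" "$tmpdir/b.sorted")    # column 3: in both files

printf 'only in a: %s\n' "$only_a"
printf 'only in b: %s\n' "$only_b"
printf 'in both:\n%s\n' "$both"

rm -rf "$tmpdir"
```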

Thereafter answered 24/1, 2013 at 11:56 Comment(4)
I have added my own answer using only grep commands. Could you tell me whether it is more efficient?Afterworld
@AliImran, comm is more efficient because it does the job in a single run, without storing the entire file in memory. Since you're using dictionaries that are most likely already sorted you don't even need to sort them. Using grep -f file1 file2 on the other hand will load the entire file1 into memory and compare each line in file2 with all of those entries, which is much less efficient. It's mostly useful for small, unsorted -f file1.Thereafter
Thanks @AndersJohansson for sharing the "comm" command. It's nifty indeed. I frequently have to do outer joins between files and this does the trick.Swacked
Pay attention to newline characters... I just found that \n is also included in the comparison.Pantelegraph
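
One way line endings can affect the result is when one file uses Windows-style CRLF endings and the other uses plain LF. Whether that is the case the comment above had in mind is an assumption. The sketch below uses made-up sample files to show the effect and one possible cleanup with tr:

```shell
# Hypothetical sample data in a throwaway directory
tmpdir=$(mktemp -d)
printf 'one\r\ntwo\r\n' > "$tmpdir/a.txt"   # CRLF line endings
printf 'one\ntwo\n' > "$tmpdir/b.txt"       # LF line endings

# The invisible \r makes every line of a.txt look unique
raw_count=$(LC_ALL=C comm -23 "$tmpdir/a.txt" "$tmpdir/b.txt" | wc -l | tr -d ' ')

# Strip the carriage returns first, then compare again
tr -d '\r' < "$tmpdir/a.txt" > "$tmpdir/a.clean"
clean_diff=$(LC_ALL=C comm -23 "$tmpdir/a.clean" "$tmpdir/b.txt")

printf 'unique lines before cleanup: %s\n' "$raw_count"
printf 'unique lines after cleanup: [%s]\n' "$clean_diff"

rm -rf "$tmpdir"
```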
Y
48

If you prefer the diff output style from git diff, you can use it with the --no-index flag to compare files not in a git repository:

git diff --no-index a.txt b.txt

Using a couple of files with around 200k file name strings in each, I benchmarked (with the built-in time command) this approach against some of the other answers here:

git diff --no-index a.txt b.txt
# ~1.2s

comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s

diff a.txt b.txt
# ~2.6s

sdiff a.txt b.txt
# ~2.7s

vimdiff a.txt b.txt
# ~3.2s

comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.


Update 2018-03-25: You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:

This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.

Youngs answered 15/10, 2017 at 14:16 Comment(0)
O
37

Try sdiff (man sdiff)

sdiff -s file1 file2
Outbalance answered 27/12, 2014 at 12:22 Comment(0)
W
34

You can use the diff tool in Linux to compare two files. The --changed-group-format and --unchanged-group-format options let you filter the output down to the data you need.

Each of these options accepts one of the following three values to select what is printed for that group:

  • '%<' outputs the lines from FILE1

  • '%>' outputs the lines from FILE2

  • '' (empty string) outputs nothing for that group.

E.g.: diff --changed-group-format="%<" --unchanged-group-format="" file1.txt file2.txt

[root@vmoracle11 tmp]# cat file1.txt 
test one
test two
test three
test four
test eight
[root@vmoracle11 tmp]# cat file2.txt 
test one
test three
test nine
[root@vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt 
test two
test four
test eight
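
The reverse direction works the same way. The sketch below recreates the two sample files from the session above in a temporary directory and uses '%>' to print only the lines that come from file2.txt. It assumes GNU diff, since the group-format options are not part of every diff implementation:

```shell
# Recreate the sample files from the session above in a throwaway directory
tmpdir=$(mktemp -d)
printf '%s\n' 'test one' 'test two' 'test three' 'test four' 'test eight' > "$tmpdir/file1.txt"
printf '%s\n' 'test one' 'test three' 'test nine' > "$tmpdir/file2.txt"

# diff exits with status 1 when the files differ, so "|| true"
# keeps a strict (set -e) script from stopping here.
only_in_file2=$(diff --changed-group-format='%>' --unchanged-group-format='' \
    "$tmpdir/file1.txt" "$tmpdir/file2.txt" || true)
printf '%s\n' "$only_in_file2"

rm -rf "$tmpdir"
```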
Wack answered 24/1, 2013 at 11:57 Comment(0)
T
9

You can also use colordiff, which displays the output of diff with colors.

About vimdiff: it allows you to compare files over SSH, for example:

vimdiff /var/log/secure scp://192.168.1.25/var/log/secure

Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html

Telekinesis answered 16/5, 2016 at 8:18 Comment(0)
S
8

Also, do not forget about mcdiff - Internal diff viewer of GNU Midnight Commander.

For example:

mcdiff file1 file2

Enjoy!

Szymanowski answered 6/6, 2018 at 12:34 Comment(0)
M
4

Use comm -13 (requires sorted files) to list the lines that appear only in the second file:

$ cat file1
one
two
three

$ cat file2
one
two
three
four

$ comm -13 <(sort file1) <(sort file2)
four
Mcelroy answered 24/1, 2013 at 11:58 Comment(0)
B
3

You can also use:

sdiff file1 file2

This displays the differences side by side within your terminal.

Benthamism answered 11/2, 2021 at 18:8 Comment(0)
P
2
diff a.txt b.txt | grep '^<'

You can then pipe the result to cut for a clean output:

diff a.txt b.txt | grep '^<' | cut -c 3-
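
The filtering and trimming can also be done in a single sed step. This is a sketch with made-up sample files, not the original author's command. The pattern is anchored with ^ so that a word containing "<" is not mistaken for a diff marker:

```shell
# Hypothetical sample data in a throwaway directory
tmpdir=$(mktemp -d)
printf '%s\n' alpha beta 'a < b' gamma > "$tmpdir/a.txt"
printf '%s\n' alpha gamma > "$tmpdir/b.txt"

# Keep only the lines diff marks as coming from a.txt ("< ..."),
# and strip the two-character "< " prefix in the same step.
only_in_a=$(diff "$tmpdir/a.txt" "$tmpdir/b.txt" | sed -n 's/^< //p')
printf '%s\n' "$only_in_a"

rm -rf "$tmpdir"
```

Keep in mind that diff compares the files position by position, so a word that merely sits at a different place in b.txt can still be reported. Sorting both files first avoids that.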
Parmentier answered 10/12, 2021 at 0:4 Comment(0)
M
0

You can use cmp.

cmp file1.c file2.c

Example (the -b option prints the differing bytes):

$ cmp -b quine2.c quine3.c
quine2.c quine3.c differ: byte 13, line 1 is  15 ^M  12 ^J

Be sure to check out the man page for cmp.
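
In a script, cmp is often used only for its exit status. The sketch below uses the -s (silent) option with made-up sample files. It only tells you whether the files are identical, not which words differ:

```shell
# Hypothetical sample data in a throwaway directory
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/one.txt"
printf 'hello\n' > "$tmpdir/two.txt"
printf 'hello!\n' > "$tmpdir/three.txt"

# cmp -s prints nothing; exit status 0 means identical, 1 means different
if cmp -s "$tmpdir/one.txt" "$tmpdir/two.txt"; then same_12=yes; else same_12=no; fi
if cmp -s "$tmpdir/one.txt" "$tmpdir/three.txt"; then same_13=yes; else same_13=no; fi

printf 'one vs two identical: %s\n' "$same_12"
printf 'one vs three identical: %s\n' "$same_13"

rm -rf "$tmpdir"
```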

Milch answered 22/4 at 1:1 Comment(0)
L
-1

You can use awk for this. Test files:

$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one

The awk:

$ awk '
NR==FNR {                    # true only while reading the first file (b.txt)
    seen[$0]                 # record the word as a key in the seen array
    next                     # skip to the next line of b.txt
}                            # everything below runs for a.txt (any later file)
!($0 in seen)' b.txt a.txt   # print the word if it was not seen in b.txt

Duplicates are output:

four
four

To avoid duplicates, add each newly met word in a.txt to the seen hash:

$ awk '
NR==FNR {
    seen[$0]
    next
}
!($0 in seen) {              # if the word is not in seen
    seen[$0]                 # add it to seen so later duplicates are skipped
    print                    # and output it
}' b.txt a.txt

Output:

four
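
The same deduplicating logic can be condensed into a one-liner using the common !seen[$0]++ idiom. This is a sketch that recreates the test files above in a temporary directory. Like the longer version, it assumes b.txt is not empty, because NR==FNR is the only thing that tells the two files apart:

```shell
# Recreate the test files from above in a throwaway directory
tmpdir=$(mktemp -d)
printf '%s\n' one two three four four > "$tmpdir/a.txt"
printf '%s\n' three two one > "$tmpdir/b.txt"

# First file (b.txt): mark each word as seen.
# Second file (a.txt): print a word only the first time it is met
# and only if b.txt did not already mark it.
unique_once=$(awk 'NR==FNR { seen[$0] = 1; next } !seen[$0]++' \
    "$tmpdir/b.txt" "$tmpdir/a.txt")
printf '%s\n' "$unique_once"

rm -rf "$tmpdir"
```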

If the word lists are comma-separated, like:

$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three

you have to do a couple of extra laps (for loops):

awk -F, '                    # comma-separated input
NR==FNR {
    for(i=1;i<=NF;i++)       # loop all comma-separated fields
        seen[$i]
    next
}
{
    for(i=1;i<=NF;i++)
        if(!($i in seen)) {
             seen[$i]        # this time we buffer output (below):
             buffer=buffer (buffer==""?"":",") $i
        }
    if(buffer!="") {         # output non-empty buffers after each record in a.txt
        print buffer
        buffer=""
    }
}' b.txt a.txt

Output this time:

four
five,six
Lallage answered 3/10, 2019 at 8:4 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.