Even after `sort`, `uniq` is still repeating some values

Reference file: http://snap.stanford.edu/data/wiki-Vote.txt.gz

(It is a gzip-compressed file that contains a single file called Wiki-Vote.txt.)

The first few lines of the file, as printed by `head -n 10 Wiki-Vote.txt`, are:

# Directed graph (each unordered pair of nodes is saved once): Wiki-Vote.txt 
# Wikipedia voting on promotion to administratorship (till January 2008). Directed edge A->B means user A voted on B becoming Wikipedia administrator.
# Nodes: 7115 Edges: 103689
# FromNodeId    ToNodeId
     30          1412
     30          3352
     30          5254
     30          5543
     30          7478
     3            28

I want to find the number of nodes in the graph (although it is already given in line 3). I ran the following command:

awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort | uniq | wc -l

Explanation:

`/^#/` matches all the lines that start with `#`, and `!/^#/` matches the lines that do not.
`awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt` prints the first and second columns of every matched line, each value on its own line.
`| sort` pipes that output to `sort`.
`| uniq` should keep only the unique values, but it does not.
`| wc -l` counts the remaining lines, and the count is wrong.

The result of the above command is 8491, which is not 7115 (as stated in line 3). I don't know why `uniq` repeats values. I can tell that it does because the output of `awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort -i | uniq | tail` contains repeated values.

Could someone please run the command and confirm that I am not the only one getting the wrong answer, and help me figure out why I am getting this result?

$ awk '!/^#/ { print $1; print $2; }' ./wiki-Vote.txt | sort | uniq | tail | hexdump -C
00000000  39 39 32 0a 39 39 33 0a  39 39 33 0d 0a 39 39 34  |992.993.993..994|
#                                           ^^ HERE
00000010  0a 39 39 34 0d 0a 39 39  35 0d 0a 39 39 36 0a 39  |.994..995..996.9|
#                     ^^              ^^
00000020  39 38 0a 39 39 39 0a 39  39 39 0d 0a              |98.999.999..|
#                                        ^^
0000002c
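The `0d` bytes marked in the hexdump are carriage returns. Some values apparently reach `uniq` as `993` and others as `993\r`, and because those are different strings, both are kept. The following is a minimal sketch of one possible workaround, assuming the stray carriage returns are the only difference between the duplicated values: delete them with `tr -d '\r'` before extracting the columns. The file `sample.txt` is a made-up stand-in created for illustration, not data taken from Wiki-Vote.txt.

```shell
# Hypothetical stand-in for Wiki-Vote.txt: two data lines end in CRLF, two in LF.
printf '# FromNodeId\tToNodeId\n30\t1412\r\n30\t3352\n3\t28\r\n28\t30\n' > sample.txt

# Original pipeline: "28" and "28\r" are different strings, so both survive uniq.
with_cr=$(awk '!/^#/ { print $1; print $2 }' sample.txt | sort | uniq | wc -l)

# Delete every carriage return first, then run the same pipeline.
without_cr=$(tr -d '\r' < sample.txt | awk '!/^#/ { print $1; print $2 }' | sort | uniq | wc -l)

echo "distinct values with CRs:    $with_cr"
echo "distinct values without CRs: $without_cr"
```

On the real file, the same idea would be `tr -d '\r' < Wiki-Vote.txt | awk '!/^#/ { print $1; print $2 }' | sort -u | wc -l`, where `sort -u` stands in for `sort | uniq`.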
