Reference file: http://snap.stanford.edu/data/wiki-Vote.txt.gz
(It is a gzip-compressed file that contains a file called Wiki-Vote.txt.)
The first few lines of the file, as shown by head -n 10 Wiki-Vote.txt, are:
# Directed graph (each unordered pair of nodes is saved once): Wiki-Vote.txt
# Wikipedia voting on promotion to administratorship (till January 2008). Directed edge A->B means user A voted on B becoming Wikipedia administrator.
# Nodes: 7115 Edges: 103689
# FromNodeId ToNodeId
30 1412
30 3352
30 5254
30 5543
30 7478
3 28
I want to find the number of nodes in the graph (although it's already given in line 3). I ran the following command:
awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort | uniq | wc -l
Explanation:
/^#/ matches all the lines that start with #, and !/^#/ matches the lines that don't.
awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt prints the first and second columns of all the matched lines, each value on its own line.
| sort pipes that output to sort.
| uniq should display only the unique values, but it doesn't.
| wc -l counts the resulting lines, and the count is wrong.
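To show what I expect the pipeline to do, here is a minimal sketch of the same steps on a tiny made-up edge list. The function name count_nodes and the sample data are hypothetical and only for illustration; they are not part of the dataset. On this sample the pipeline gives the count I expect:

```shell
# Hypothetical helper: count the distinct node IDs in an edge-list file
# whose comment lines start with '#'.
count_nodes() {
    awk '!/^#/ { print $1; print $2 }' "$1" | sort | uniq | wc -l
}

# Made-up sample in the same layout as Wiki-Vote.txt (not real data).
sample=$(mktemp)
printf '# FromNodeId ToNodeId\n30 1412\n30 3352\n3 30\n' > "$sample"

# The distinct IDs are 3, 30, 1412 and 3352, so this prints 4.
count_nodes "$sample"

rm -f "$sample"
```

So on clean input the awk, sort, uniq, wc chain behaves the way I described it.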
The result of the command on Wiki-Vote.txt is 8491, which is not 7115 (as mentioned in line 3). I don't know why uniq repeats values. I can tell that it does because awk '!/^#/ { print $1; print $2; }' Wiki-Vote.txt | sort -i | uniq | tail returns:
992
993
993
994
994
995
996
998
999
999
This output contains repeated values. Could someone please run the command and tell me whether I am the only one getting the wrong answer, and help me figure out why I am getting this result?