Given a file that looks like this:
1440927 1
1727557 3
1440927 2
9917156 4
The first field is an ID in range(0, 200000000), and the second field is a type in range(1, 5). Types 1 and 2 belong to a common category S1, while types 3 and 4 belong to S2. A single ID may have several records with different types. The file is about 200 MB in size.
The problem is to count the number of IDs that have a record of type 1 or 2, and the number of IDs that have a record of type 3 or 4.
My code:
import bitarray

def gen(path):
    for line in open(path):
        tmp = line.split()
        id = int(tmp[0])
        yield id, int(tmp[1])

max_id = 200000000
S1 = bitarray.bitarray(max_id)
S2 = bitarray.bitarray(max_id)
S1.setall(False)  # a fresh bitarray holds arbitrary bits, so clear both first
S2.setall(False)
for id, type in gen(path):
    if type != 3 and type != 4:
        S1[id] = True
    else:
        S2[id] = True
print S1.count(), S2.count()
Although it gives the answer, I think it runs a little slowly. What should I do to make it run faster?
EDIT:
There are duplicated records in the file, and I only need to distinguish between S1 (types 1 and 2) and S2 (types 3 and 4). For example, 1440927 1 and 1440927 2 are counted only once, not twice, because both belong to S1. So I have to store the IDs.
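To see why storing the IDs handles the duplicates, here is a minimal illustration using plain Python sets on the four sample records from the top of the question (the `records` list is hard-coded here purely for demonstration):

```python
# The four sample records from the question, as (id, type) pairs.
records = [(1440927, 1), (1727557, 3), (1440927, 2), (9917156, 4)]

# A set keeps each ID at most once per category, so duplicates collapse.
s1 = {id_ for id_, type_ in records if type_ in (1, 2)}
s2 = {id_ for id_, type_ in records if type_ in (3, 4)}

print(len(s1), len(s2))  # 1440927 appears twice but counts once in S1
```

This is exactly the deduplication the bitarray approach gives: setting the same bit twice has no further effect, just as adding the same ID to a set twice does not.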
You don't need the intermediate id = int(tmp[0]); use yield int(tmp[0]), ... instead. You could use if type <= 2 instead of the two comparisons. And you could remove the generator entirely and inline the code in a with open( ... ) as f: block. Give it a try. And the comment below has a good point too, about the bitarray ^^ – Sherburne
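Put together, the commenter's suggestions (no separate generator, a single `type <= 2` test, and a `with` block) might look like the sketch below. It uses plain sets rather than bitarrays so it runs without the `bitarray` package; for the real 200,000,000-ID range the question's two bitarrays would use far less memory. The name `count_categories` is hypothetical, not from the original post.

```python
def count_categories(path):
    # IDs seen per category; adding an ID twice has no effect,
    # so duplicated records are counted only once.
    s1_ids, s2_ids = set(), set()
    with open(path) as f:
        for line in f:
            fields = line.split()
            id_, type_ = int(fields[0]), int(fields[1])
            if type_ <= 2:      # types 1 and 2 -> S1
                s1_ids.add(id_)
            else:               # types 3 and 4 -> S2
                s2_ids.add(id_)
    return len(s1_ids), len(s2_ids)
```

Inlining the parsing avoids a generator frame per line, and the single comparison replaces the two inequality tests, though the dominant cost for a 200 MB file is likely the line splitting and `int` conversion themselves.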