How to improve performance of this counting program?

Given a file that looks like this:

1440927 1
1727557 3
1440927 2
9917156 4

The first field is an ID in range(0, 200000000). The second field represents a type, which is in range(1, 5). Type 1 and type 2 belong to a common category S1, while type 3 and type 4 belong to S2. A single ID may have several records with different types. The file is about 200MB in size.

The problem is to count the number of IDs that have a record of type 1 or 2, and the number of IDs that have a record of type 3 or 4.

My code:

import bitarray

def gen(path):
    for line in open(path):
        tmp = line.split()
        id = int(tmp[0])
        yield id, int(tmp[1])

max_id = 200000000
S1 = bitarray.bitarray(max_id)
S2 = bitarray.bitarray(max_id)
S1.setall(False)  # bitarray does not initialize its bits, so clear them first
S2.setall(False)
for id, type in gen(path):
    if type != 3 and type != 4:
        S1[id] = True
    else:
        S2[id] = True

print S1.count(), S2.count()

Although it gives the answer, I think it runs a little slowly. What should I do to make it run faster?

EDIT: There are duplicated records in the file, and I only need to distinguish between S1 (types 1 and 2) and S2 (types 3 and 4). For example, 1440927 1 and 1440927 2 are counted only once, not twice, because they both belong to S1. So I have to store the IDs.
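For the four sample lines above, the expected output is therefore 1 2: only 1440927 has a type 1 or 2 record, while 1727557 and 9917156 have type 3 or 4 records.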

Odrick answered 6/12, 2011 at 9:36 Comment(4)
You could use a profiler. You could remove id = int(...) and yield int(tmp[0]), int(tmp[1]) directly. You could use if type <= 2 instead of two comparisons. And you could remove the generator entirely and inline the code in a with open(...) as f: block (see the sketch after these comments). Give it a try. And the comment below has a good point too, about the bitarray ^^Sherburne
Is there any reason you use the bitarray to mark the indices? Otherwise you could simply increase a counter instead of setting the entries to "True". This should give you a performance increase.Milliard
+1 on using a profiler. Where is the bottleneck? Is it the allocation of S1 and S2? Also, consider these questions: Are (almost) all numbers in 0-200000000 present? If not, consider another datatype. Can each id be present multiple times? If not, consider ditching the arrays completely and just use a counter. Or maybe this is a problem where you already have an optimal solution. For really large files your bottleneck may well be disk I/O which will require you to buy better disks to optimize.Borrero
@Milliard I have to store the IDs because there are duplicated records. For example, in the file sample 1440927 should be counted only once, not twice, because types 1 and 2 both belong to S1.Odrick
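Putting the suggestions from the first comment together, a rough sketch of the inlined version might look like this (my own illustration, not code from the thread; the file path is a placeholder):

import bitarray

path = "input.txt"  # placeholder for the real input file
max_id = 200000000

S1 = bitarray.bitarray(max_id)
S2 = bitarray.bitarray(max_id)
S1.setall(False)  # start with all bits cleared
S2.setall(False)

with open(path) as f:
    for line in f:
        id_str, type_str = line.split()
        if int(type_str) <= 2:        # types 1 and 2 -> S1
            S1[int(id_str)] = True
        else:                         # types 3 and 4 -> S2
            S2[int(id_str)] = True

print S1.count(), S2.count()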

If there is enough memory, you could use a dict instead of bitarray.bitarray. It could be faster:

S1, S2 = {}, {}  # dicts are slightly faster than `set()`
with open(path) as f:
    for i, line in enumerate(f, 1):
        id, sep, type = line.rstrip().partition(" ")  # strip the newline so the type compares cleanly
        if type == "1" or type == "2":
            S1[id] = True
        elif type == "3" or type == "4":
            S2[id] = True
        else:
            print "WARNING: unknown type: %r in line %d: %r" % (type, i, line)
print len(S1), len(S2)
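For comparison, the same idea expressed with set(), which stores only the keys (a sketch, not part of the original answer; whether dict or set is faster here is best settled with a profiler):

S1, S2 = set(), set()
with open(path) as f:
    for line in f:
        id, type = line.split()
        if type in ("1", "2"):
            S1.add(id)
        elif type in ("3", "4"):
            S2.add(id)
print len(S1), len(S2)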

Or you could try to sort the lines first:

import itertools

def gettype(line):
    return line[-1]  # the type is the last character of each line

S1, S2 = set(), set()
with open(path) as f:
    lines = f.read().splitlines()

lines.sort(key=gettype)
for type, group in itertools.groupby(lines, gettype):
    ids = (line.partition(" ")[0] for line in group)
    if type == "1" or type == "2":
        S1.update(ids)  # sets, so an ID seen under both type 1 and type 2 is counted once
    elif type == "3" or type == "4":
        S2.update(ids)
    else:
        assert 0, (type, list(ids))

print len(S1), len(S2)

The asymptotic complexity of the second approach is worse: it sorts all the lines, so it is O(n log n) instead of a single O(n) pass.

You could use line_profiler to find out where your bottleneck is.
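A minimal sketch of how line_profiler is typically driven (assuming the counting code is wrapped in a function and saved as, say, count_ids.py; the file names are placeholders):

@profile            # `profile` is injected by kernprof at run time; no import needed
def count(path):
    S1, S2 = {}, {}
    with open(path) as f:
        for line in f:
            id, sep, type = line.rstrip().partition(" ")
            if type in ("1", "2"):
                S1[id] = True
            elif type in ("3", "4"):
                S2[id] = True
    return len(S1), len(S2)

print count("data.txt")  # placeholder input file

Running kernprof -l -v count_ids.py (kernprof.py -l -v on older installs) then prints per-line timings for count().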

Omura answered 6/12, 2011 at 10:7 Comment(0)

You're using an iterator over the file, which means that you only buffer a few lines at a time. Every time the buffer is empty, the disk needs to seek and your program has to wait.

200MB easily fits into your memory, so reading all the lines up front will speed things up:

def gen(path):
    # load all the lines into memory at once
    lines = open(path).readlines()
    split = (line.split() for line in lines)
    return ((int(x), int(y)) for x, y in split)
Hohenlinden answered 6/12, 2011 at 9:56 Comment(3)
It looks as if you're using 600MB in your solution.Sherburne
@hochl: Ok I changed the listcomprehension to a generator expression. Now it should use 200MB to store the lines.Hohenlinden
You can't be certain whether for line in f.readlines() or for line in f is faster unless a profiler says so. The file iterator uses READAHEAD_BUFSIZE (8192 bytes), which means hundreds of lines at a time in this case.Omura
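A quick way to check that claim on your own data is a small timeit comparison (my own sketch; the file name is a placeholder):

import timeit

def iterate(path="data.txt"):
    with open(path) as f:
        for line in f:               # buffered file iterator
            pass

def read_all(path="data.txt"):
    with open(path) as f:
        for line in f.readlines():   # slurp the whole file first
            pass

print timeit.timeit(iterate, number=3)
print timeit.timeit(read_all, number=3)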

Are you tied to Python?

egrep -e "[12]$" filename.txt | cut -d " " -f 1 | sort -u | wc -l

egrep -e "[34]$" filename.txt | cut -d " " -f 1 | sort -u | wc -l

These two commands count the number of distinct first fields (IDs) among lines of filename.txt ending in "1" or "2", and among lines ending in "3" or "4", respectively, ignoring duplicate first fields.

Probably faster than Python…

Dissected answered 6/12, 2011 at 9:51 Comment(5)
uniq requires sorted input, which the OP doesn't have. You could add a sort to the pipeline...Homogeneous
Are you tied to Python? vs. Are you tied to Linux? :)Colb
@warvariuc: My windows desktop has a grep -E available on the commandline... what's your point?Decrepit
@MattH, my point was: what's better - to be tied to a separate program, or do everything in Python?Colb
@warvariuc: I'd be inclined to say that the right tool for the right job is the best approach.Decrepit