I have a large fixed-width file being read into pandas in chunks of 10,000 lines. The file is too large to fit into memory in its entirety, which is why it has to be read in chunks. This works great for everything except removing duplicates, because the duplicates can obviously end up in different chunks.
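A stripped-down version of the chunked read looks roughly like this; the path, column positions, and names are placeholders, not my real ~500-column layout:

```python
import pandas as pd

# Placeholder layout: the real file has ~500 columns, these positions are made up.
colspecs = [(0, 10), (10, 12), (12, 20)]
names = ["id", "eligibility", "other_field"]

# Read the fixed-width file 10,000 rows at a time.
reader = pd.read_fwf("data.fwf", colspecs=colspecs, names=names, chunksize=10000)
for chunk in reader:
    # per-chunk processing happens here; printing the shape is just a stand-in
    print(chunk.shape)
```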
My first attempt at deduplicating the file was to bring in just the two columns needed for the job and build a list of rows not to read. Reading in just those two columns (out of about 500) easily fits in memory, and I was able to use the id column to find duplicates and an eligibility column to decide which of the two or three rows with the same id to keep. I then used the skiprows argument of read_fwf() to skip those rows.
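Here is roughly what that first attempt looked like; the column names, positions, and the "highest eligibility wins" rule are stand-ins for my actual logic:

```python
import pandas as pd

# Pass 1: read only the two key fields (positions/names are assumptions).
keys = pd.read_fwf("data.fwf", colspecs=[(0, 10), (10, 12)],
                   names=["id", "eligibility"])

# Decide which row to keep per id -- here "highest eligibility wins",
# standing in for the real tie-breaking rule.
keep = keys.sort_values("eligibility", ascending=False).drop_duplicates("id")

# Everything not kept becomes a row number to skip on the full read.
skiprows = keys.index.difference(keep.index).tolist()  # offset by 1 if the file has a header line

# Pass 2 (the part that fails): chunked read of all ~500 columns with skiprows.
# reader = pd.read_fwf("data.fwf", colspecs=full_colspecs, names=full_names,
#                      skiprows=skiprows, iterator=True, chunksize=10000)
```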
The problem I ran into is that the pandas fixed-width file reader doesn't work with `skiprows=[list]` and `iterator=True` at the same time.
So, how do I deduplicate a file being processed in chunks?