I have a large fixed-width file being read into pandas in chunks of 10,000 lines. The file is too large to fit into memory in its entirety, which is why it has to be read in chunks. This works great for everything except removing duplicates, because the duplicates can obviously end up in different chunks.
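A stripped-down version of the chunked read looks roughly like this; the path, column positions, and names are placeholders, not my real ~500-column layout:

```python
import pandas as pd

# Placeholder layout: the real file has ~500 columns, these positions are made up.
colspecs = [(0, 10), (10, 12), (12, 20)]
names = ["id", "eligibility", "other_field"]

# Read the fixed-width file 10,000 rows at a time.
reader = pd.read_fwf("data.fwf", colspecs=colspecs, names=names, chunksize=10000)
for chunk in reader:
    # per-chunk processing happens here; printing the shape is just a stand-in
    print(chunk.shape)
```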
My first attempt at deduplicating the file was to bring in just the two columns needed for the job and build a list of rows not to read. Reading in just those two columns (out of about 500) easily fits in memory, and I was able to use the id column to find duplicates and an eligibility column to decide which of the two or three rows with the same id to keep. I then used the skiprows argument of read_fwf() to skip those rows.
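Here is roughly what that first attempt looked like; the column names, positions, and the "highest eligibility wins" rule are stand-ins for my actual logic:

```python
import pandas as pd

# Pass 1: read only the two key fields (positions/names are assumptions).
keys = pd.read_fwf("data.fwf", colspecs=[(0, 10), (10, 12)],
                   names=["id", "eligibility"])

# Decide which row to keep per id -- here "highest eligibility wins",
# standing in for the real tie-breaking rule.
keep = keys.sort_values("eligibility", ascending=False).drop_duplicates("id")

# Everything not kept becomes a row number to skip on the full read.
skiprows = keys.index.difference(keep.index).tolist()  # offset by 1 if the file has a header line

# Pass 2 (the part that fails): chunked read of all ~500 columns with skiprows.
# reader = pd.read_fwf("data.fwf", colspecs=full_colspecs, names=full_names,
#                      skiprows=skiprows, iterator=True, chunksize=10000)
```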
The problem I ran into is that the pandas fixed-width file reader doesn't work with `skiprows=[list]` and `iterator=True` at the same time.
So, how do I deduplicate a file being processed in chunks?