I have a rather large fixed-width file (~30M rows, 4 GB). When I attempted to create a DataFrame using pandas read_fwf(), it only loaded a portion of the file. Has anyone had a similar issue with this parser not reading the entire contents of a file?
import pandas as pd

file_name = r"C:\....\file.txt"
fwidths = [3, 7, 9, 11, 51, 51]
df = pd.read_fwf(file_name, widths=fwidths,
                 names=['col0', 'col1', 'col2', 'col3', 'col4', 'col5'])
print(df.shape)  # <30M rows
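One way I can think of to narrow down where the parser stops is to read the file in chunks and tally the rows as they come in (a rough sketch, assuming read_fwf accepts the same chunksize parameter as read_csv, which it does in current pandas; the chunk size of 1,000,000 is arbitrary):

import pandas as pd

file_name = r"C:\....\file.txt"
fwidths = [3, 7, 9, 11, 51, 51]
cols = ['col0', 'col1', 'col2', 'col3', 'col4', 'col5']

# Read in 1M-row chunks and print a running row count; if read_fwf
# silently stops early, the final total shows roughly where.
total = 0
for chunk in pd.read_fwf(file_name, widths=fwidths, names=cols,
                         chunksize=1000000):
    total += len(chunk)
    print(total)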
If I naively read the file into a single column using read_csv(), the entire file is read into memory and there is no data loss.
import pandas as pd

file_name = r"C:\....\file.txt"
# Arbitrary delimiter; the file doesn't contain pipes, so every line
# lands in a single column.
df = pd.read_csv(file_name, delimiter="|", names=['col0'])
print(df.shape)  # ~30M rows
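Since read_csv does pick up every row, a fallback I'm considering is to read everything as one string column (as above) and carve out the fixed-width fields myself with pandas' .str slicing (a sketch, not a tested solution; the offsets are just the cumulative sums of my field widths):

import pandas as pd

file_name = r"C:\....\file.txt"
fwidths = [3, 7, 9, 11, 51, 51]
cols = ['col0', 'col1', 'col2', 'col3', 'col4', 'col5']

# Read each line as one string, then slice fields by character
# position (offsets are cumulative widths: 0, 3, 10, 19, 30, 81, 132).
raw = pd.read_csv(file_name, delimiter="|", names=['line'])
offsets = [0]
for w in fwidths:
    offsets.append(offsets[-1] + w)
df = pd.DataFrame({
    name: raw['line'].str[start:end]
    for name, start, end in zip(cols, offsets, offsets[1:])
})
print(df.shape)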
Of course, without seeing the contents or format of the file this could be something on my end, but I wanted to see if anyone else has run into this before. As a sanity check I tested a couple of rows deep in the file and they all appear to be formatted correctly (further verified when I was able to load the file into an Oracle DB with Talend using the same specs).
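The next sanity check I'd run is scanning the whole file for lines whose length doesn't match the expected record width (a plain-Python sketch, assuming one record per line and a total width equal to the sum of my field widths, i.e. 132 characters):

# Flag any line whose length differs from the expected record width;
# a stray short/long line or an embedded EOF-like character could
# explain a parser stopping early.
expected = sum([3, 7, 9, 11, 51, 51])
with open(r"C:\....\file.txt") as f:
    for i, line in enumerate(f, 1):
        n = len(line.rstrip('\r\n'))
        if n != expected:
            print(i, n)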
Let me know if anyone has any ideas; it would be great to run everything in Python and avoid going back and forth between tools once I start developing analytics.