Rows are lost when reading this tab-separated file with pandas read_csv

Asked 24/2, 2016 at 9:32 Answered 25/2, 2016 at 9:18

I have a .text file in the following format, where the fields (index number, name, and message) are separated by tabs (\t):

712 ben     Battle of the Books
713 james   i used to be in TOM
714 tomy    i was in BOB once
715 ben Tournaments of Minds
716 tommy    Also the Lion in the upcoming school play
717 tommy   Can you guess
718 tommy    P
...

which I read with read_csv into a data frame:

chat = pd.read_csv("f.text", sep="\t", header=None, usecols=[2])

But the data frame has only 9812 rows, while the original file has more than 12428 lines (only 21 of them empty). This is quite weird. Do you have any idea why? Thanks.

Hortensiahorter answered 24/2, 2016 at 9:32 Comment(7)

Can you post a download link to your data? It's difficult to answer here without posting guesses, which is counter-productive. – Navarro 24/2, 2016 at 9:34

Very weird. Maybe the lineterminator parameter of read_csv is needed. Or you can try adding index_col=None. How do you check the length of df? With print(len(df))? – Acrosstheboard 24/2, 2016 at 9:43

@Acrosstheboard I just print df. It shows the row count under the table. Same result with len(df). – Hortensiahorter 24/2, 2016 at 10:2

Hmmm, interesting. If you omit usecols, is the length still wrong? – Acrosstheboard 24/2, 2016 at 10:11

@Acrosstheboard Yes. When I print line by line, I get 12428 lines. – Hortensiahorter 24/2, 2016 at 11:32

Hmmm, try skipping rows, like chat = pd.read_csv("f.text", skiprows=9810, sep="\t", header=None, usecols=[2]), then maybe check the columns with print(df.columns) and the index with print(df.index). – Acrosstheboard 24/2, 2016 at 11:35

@Acrosstheboard And I got the remaining rows! What happened!? – Hortensiahorter 24/2, 2016 at 11:39

I think you need to add the quoting parameter:

import csv

import pandas as pd

# QUOTE_NONE makes the parser treat quote characters as ordinary text
chat = pd.read_csv("f.text", sep="\t", header=None, usecols=[2], quoting=csv.QUOTE_NONE)
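
A likely explanation, assuming the chat messages contain stray double quotes: by default read_csv treats " as a quote character, so a field that starts with " keeps absorbing text, including tabs and newlines, until the next ". Every line in between ends up inside one cell, and the row count drops. Here is a minimal sketch that reproduces this with made-up data held in memory (the sample rows and the variable names are assumptions for illustration, not the asker's actual file):

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the message on row 714 opens a double quote
# that is only closed at the end of row 716.
sample = (
    "712\tben\tBattle of the Books\n"
    "713\tjames\ti used to be in TOM\n"
    "714\ttomy\t\"i was in BOB once\n"
    "715\tben\tTournaments of Minds\n"
    "716\ttommy\tCan you guess\"\n"
    "717\ttommy\tP\n"
)

# Default quoting: the quoted field spans three physical lines.
default = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# QUOTE_NONE: every physical line becomes one row.
no_quotes = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None, quoting=csv.QUOTE_NONE
)

print(len(default))     # 4
print(len(no_quotes))   # 6
print(repr(default.iloc[2, 2]))  # rows 715 and 716 are inside this cell
```

Note that with quoting=csv.QUOTE_NONE the quote characters stay in the data as literal text, which is usually what you want for free-form chat messages.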

Acrosstheboard answered 25/2, 2016 at 9:18 Comment(2)

jezrael can you actually explain why this works, i.e. why the unquoted read dropped lines? Otherwise it's not a reusable resource to other users. – Nicaragua 19/10, 2019 at 5:37

OMG, this saved me! It looks like the default behavior for read_csv() expects everything to be wrapped in quotes. But if it is a tab-separated file with no quotes, then you need to specify that, otherwise the data parsing goes awry. – Gomulka 9/3, 2021 at 0:42
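
To check whether stray quotes are really the cause in a given file, one rough approach is to list the lines that contain an odd number of double quotes. This is only a heuristic sketch: the helper name lines_with_odd_quotes is made up for illustration, and the sample lines stand in for the real file.

```python
def lines_with_odd_quotes(lines, quotechar='"'):
    """Return (line_number, text) pairs whose quote count is odd."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        if line.count(quotechar) % 2 == 1:
            suspects.append((number, line.rstrip("\n")))
    return suspects


# Hypothetical sample lines; for a real file, pass an open file object,
# e.g. open("f.text", encoding="utf-8") (the encoding is an assumption).
sample_lines = [
    "712\tben\tBattle of the Books\n",
    "713\tjames\ti used to be in TOM\n",
    "714\ttomy\t\"i was in BOB once\n",
    "715\tben\tTournaments of Minds\n",
    "716\ttommy\tCan you guess\"\n",
]

suspects = lines_with_odd_quotes(sample_lines)
for number, text in suspects:
    print(number, repr(text))
```

If the first suspect line sits near the point where rows start going missing, unbalanced quotes are a plausible culprit and quoting=csv.QUOTE_NONE is worth trying.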
