Rows are lost when reading this tab-separated file with pandas read_csv

I have a .text file in the following format, where the fields (index number, name, and message) are separated by \t (tab-separated):

712 ben     Battle of the Books
713 james   i used to be in TOM
714 tomy    i was in BOB once
715 ben Tournaments of Minds
716 tommy    Also the Lion in the upcoming school play
717 tommy   Can you guess
718 tommy    P
...

which I read with read_csv into a data frame:

 import pandas as pd
 chat = pd.read_csv("f.text", sep="\t", header=None, usecols=[2])

But the data frame has only 9812 rows, while the original file has more than 12428 lines (only 21 of them empty). It is quite weird. Do you have any idea why? Thanks.

Hortensiahorter asked 24/2, 2016 at 9:32 Comments (7)
Can you post a download link to your data? It is difficult to answer here without posting guesses, which is counter-productive. - Navarro
Very weird. Maybe the lineterminator parameter of read_csv is needed. Or you can try adding index_col=None. How do you check the length of df? With print len(df)? - Acrosstheboard
@Acrosstheboard I just print df; it shows the number of rows under the table. Same result with len(df). - Hortensiahorter
Hmmm, interesting. If you omit usecols, is the length still wrong? - Acrosstheboard
@Acrosstheboard Yes. When I print line by line, I get 12428 lines. - Hortensiahorter
Hmmm, try skipping rows, like chat = pd.read_csv("f.text", skiprows=9810, sep="\t", header=None, usecols=[2]), then maybe check the columns with print df.columns and the index with print df.index. - Acrosstheboard
@Acrosstheboard And I got the remaining rows! What happened!? - Hortensiahorter
A
20

I think you need to add the quoting parameter:

import csv
import pandas as pd

# QUOTE_NONE tells the parser to treat double quotes as ordinary characters
chat = pd.read_csv("f.text", sep="\t", header=None, usecols=[2], quoting=csv.QUOTE_NONE)
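
A likely explanation, offered as an assumption since the actual file was never posted: by default read_csv uses quoting=csv.QUOTE_MINIMAL with " as the quote character. If a message field begins with a stray double quote, the parser treats everything up to the next double quote as one quoted field, including tabs and newlines, so several lines collapse into a single row. The sketch below reproduces that effect on made-up sample data (the rows and the placement of the quotes are hypothetical, not taken from the asker's file):

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the message on row 713 starts with a stray double quote,
# and the next double quote only appears at the end of row 715.
sample = (
    '712\tben\tBattle of the Books\n'
    '713\tjames\t"i used to be in TOM\n'
    '714\ttomy\ti was in BOB once\n'
    '715\tben\tTournaments of Minds"\n'
    '716\ttommy\tCan you guess\n'
)

# Default quoting: the text between the two quotes is read as one field,
# so the lines in between are merged into a single row.
default_df = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# QUOTE_NONE: double quotes are kept as literal characters, one row per line.
literal_df = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None, quoting=csv.QUOTE_NONE
)

print("rows with default quoting:", len(default_df))
print("rows with QUOTE_NONE:", len(literal_df))
```

If this is what happened in f.text, the "missing" rows are not dropped but folded into the message column of other rows, which would also fit the observation that skipping the first rows brings the remaining ones back.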
Acrosstheboard answered 25/2, 2016 at 9:18 Comments (2)
jezrael, can you actually explain why this works, i.e. why the unquoted read dropped lines? Otherwise it's not a reusable resource for other users. - Nicaragua
OMG, this saved me! It looks like the default behavior of read_csv() expects everything to be wrapped in quotes. But if it is a tab-separated file with no quotes, then you need to specify that; otherwise the data parsing goes awry. - Gomulka
