Pandas - write Multiindex rows with to_csv

About

Asked 27/6, 2013 at 17:25 Answered 27/6, 2013 at 17:36

I am using to_csv to write a Multiindex DataFrame to csv files. The csv file has one column that contains the multiindexes in tuples, like:

('a', 'x')
('a', 'y')
('a', 'z')
('b', 'x')
('b', 'y')
('b', 'z')

However, I want to be able to output the Multiindex to two columns instead of one column of tuples, such as:

a, x
 , y
 , z
b, x
 , y
 , z

It looks like tupleize_cols can achieve this for columns, but there is no such option for the rows. Is there a way to achieve this?

Hazelton answered 27/6, 2013 at 17:25 Comment(1)

tupleize_cols is for a multi-index on columns (its coming in 0.12); prob just reset_index().set_index(['idxa','idxb']).to_csv() is your best bet (the specify index_col=['idxa','idxb'] on read-back – Moslemism 27/6, 2013 at 17:31

I think this will do it

In [3]: df = DataFrame(dict(A = 'foo', B = 'bar', value = 1),index=range(5)).set_index(['A','B'])

In [4]: df
Out[4]: 
         value
A   B         
foo bar      1
    bar      1
    bar      1
    bar      1
    bar      1

In [5]: df.to_csv('test.csv')

In [6]: !cat test.csv
A,B,value
foo,bar,1
foo,bar,1
foo,bar,1
foo,bar,1
foo,bar,1

In [7]: pd.read_csv('test.csv',index_col=[0,1])
Out[7]: 
         value
A   B         
foo bar      1
    bar      1
    bar      1
    bar      1
    bar      1

To write with the index duplication (kind of a hack though)

In [27]: x = df.reset_index()

In [28]: mask = df.index.to_series().duplicated()

In [29]: mask
Out[29]: 
A    B  
foo  bar    False
     bar     True
     bar     True
     bar     True
     bar     True
dtype: bool

In [30]: x.loc[mask.values,['A','B']] = ''

In [31]: x
Out[31]: 
     A    B  value
0  foo  bar      1
1                1
2                1
3                1
4                1

In [32]: x.to_csv('test.csv')

In [33]: !cat test.csv
,A,B,value
0,foo,bar,1
1,,,1
2,,,1
3,,,1
4,,,1

Read back is a bit tricky actually

In [37]: pd.read_csv('test.csv',index_col=0).ffill().set_index(['A','B'])
Out[37]: 
         value
A   B         
foo bar      1
    bar      1
    bar      1
    bar      1
    bar      1

Moslemism answered 27/6, 2013 at 17:36 Comment(6)

Sorry maybe I didn't explain clearly. What I hope to achieve is to remove duplicate entries in the csv file. In your example, I hope to have 'foo' to appear only once in the csv file, so that when you cat test.csv, you would see one 'foo' in the first row and blank in the subsequent rows. – Hazelton 27/6, 2013 at 17:41

updated...though this is pretty non-standard and not sure this will generalize – Moslemism 27/6, 2013 at 17:49

let me ask you, is this format of csv useful/common? e.g. non-duplicating indexes? – Moslemism 27/6, 2013 at 17:57

I don't know if it's common, but in my case it's useful/convenient as a visual representation. Imagine if I need to print out the csv table as some kind of report. Suppose there are 100 rows and the first level of multiindex has 3 groups. The non-duplicating indexes will make it very easy to visually identify the 3 groups. – Hazelton 27/6, 2013 at 19:1

ok....added this solution to the cookbook in any event (will be updated later today): pandas.pydata.org/pandas-docs/dev/cookbook.html#csv – Moslemism 27/6, 2013 at 19:22

If the original dataframe had an actual null value in the value column, the last example, on reading back the dataframe, would wrongly duplicate a previous value on top of it – Antistrophe 1/9, 2016 at 21:42

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags