Pandas Multiindex to CSV without duplicate Index

Asked 12/7, 2018 at 19:31 Answered 5/6, 2022 at 15:11

For the life of me I can't figure out how to remove the duplicate index items created when writing a multiindex dataframe to CSV. While there is this answer out there, it doesn't apply to me per se because my second level has all different values.

This is a chunk of the dataframe I have, it just goes on for each month

Month Measure
JAN   KWH          
      Max KW       
      Day/Hour     
      Peak End Use 
      Peak Pct     
FEB   KWH          
      Max KW       
      Day/Hour     
      Peak End Use 
      Peak Pct

But in CSV it looks something like

JAN KWH
JAN Max KW
JAN Day/Hour
JAN Peak End Use
JAN Peak Pct
FEB KWH
FEB Max KW
FEB Day/Hour
FEB Peak End Use
FEB Peak Pct

I hope for it to look exactly just like the dataframe so that means in CSV it would look like

Jan,KWH
   ,Max KW
   ,Day/Hour

and so on

I have tried df.index.to_series().duplicates() but it doesn't work because each of my index pairs are not duplicates. I've also tried using index.tolist() then looping through to change with no luck.

What's the right approach here?

Thanks in advance!

Corotto answered 12/7, 2018 at 19:31 Comment(5)

Can you provide an expected CSV output? The index would be the combination of the first and second columns and what you've shown is that exactly, so I don't see any duplicates. – Slusher 12/7, 2018 at 21:53

Sorry I should've been more clear. I had hoped that the output would look exactly like how the dataframe does where there's only one of first index and multiple of the other. – Corotto 13/7, 2018 at 14:21

I think if you write it that way, you will no longer have a recognizable CSV file. That is, if you were to write to your multi-index CSV format, then read that CSV file, it would give you a totally different result. That is what most of the answer you linked seems to address -- how to get Pandas to read the file and format as expected. – Slusher 13/7, 2018 at 15:44

I won't be reading it back with the script at all. All I want is to be extracting certain information from a few thousand page long simulation out put. The first character can just be a space no? – Corotto 13/7, 2018 at 17:33

Ah. Got it. Then it's much simpler than you think. You can just use file.write(str(df)). – Slusher 13/7, 2018 at 19:0

It seems pandas does not provide directly such functionality, perhaps to ensure that the generated csv files can be read back, as mentioned in the comments above. I faced with the same problem when generating a table for report in rst format. Finally, I decided to process the generated csv by an external script. The script replaces the repeating values in columns with spaces, thus achieving desired result. Note: this script assumes commas as csv separators.

# Clean csv with multiindices
from sys import argv

# Positional command line arguments:
#   1: original csv file
#   2: number of columns to check (optional)

with open(argv[1], 'r') as f:
  if len(argv) > 2:
    n = int(argv[2])
    i0 = ('', ) * n
  else:
    n = None
    i0 = None
  for l in f:
    if n is None:
        # define n from the 1-st line
        i1 = l.split(',')
        n = len(i1)
        i0 = ('', ) * n
    i1 = l.split(',', n)
    nl = ()
    for s0, s1 in zip(i0, i1):
        if s0 == s1:
            e = ' ' * len(s0)
        else:
            e = s1
        nl += (e, )
    nl += tuple(i1[n:])
    print ','.join(nl),
    i0 = i1[:n]

Salmagundi answered 11/1, 2019 at 0:36 Comment(0)

Based on this answer to a newer question, you can move the index levels to columns and use DataFrame.duplicated to set duplicate index values to NaN:

def prepare_multiindex_for_export(df):
  new_df = df.copy()

  for i in range(df.index.nlevels, 0, -1):
    new_df = new_df.sort_index(level=i-1)

  replace_cols = dict()
  
  for i in range(new_df.index.nlevels):
    idx = new_df.index.get_level_values(i)

    new_df.insert(i, idx.name, idx)

    replace_cols[idx.name] = new_df[idx.name].where(~new_df.duplicated(subset=new_df.index.names[:i+1]))

  for col, ser in replace_cols.items():
    new_df[col] = ser

  return new_df.reset_index(drop=True)

df = pd.DataFrame([
  {'Last': 'Doe', 'First': 'John', 'Age': 43},
  {'Last': 'Doe', 'First': 'Jane', 'Age':42},
  {'Last': 'Blow', 'First': 'John', 'Age': 17},
  {'Last': 'Blow', 'First': 'Bob', 'Age': 20},
  {'Last': 'Doe', 'First': 'Ben', 'Age': 30}]).set_index(['Last', 'First'])

>>> df
            Age
Last First
Doe  John    43
     Jane    42
Blow John    17
     Bob     20
Doe  Ben     30

>>> prepare_multiindex_for_export(df)
   Last First  Age
0  Blow   Bob   20
1   NaN  John   17
2   Doe   Ben   30
3   NaN  Jane   42
4   NaN  John   43

>>> print(prepare_multiindex_for_export(df).to_csv(index=False))
Last,First,Age
Blow,Bob,20
,John,17
Doe,Ben,30
,Jane,42
,John,43

Note: If your index has NaN values in it, you may need to modify

Gormless answered 5/6, 2022 at 15:11 Comment(0)

Recommended topics

Hot tags