Python Pandas read_excel dtype str replace nan by blank ('') when reading or when writing via to_csv
Asked Answered
K

3

28

Python version: Python 2.7.13 :: Anaconda custom (64-bit) Pandas version: pandas 0.20.2

Hello,

I have a quite simple requirement. I would like to read an excel file and write a specific sheet to a csv file. Blank values in the source Excel file should be treated / written as blank when writing the csv file. However, my blank records are always written as 'nan' to the output file. (without the quotes)

I read the Excel file via method

read_excel(xlsx, sheetname='sheet1', dtype = str)

I am specifying dtype because I have some columns that are numbers but should be treated as string. (Otherwise they might lose leading 0s etc) i.e. I would like to read the exact value from every cell.

Now I write the output .csv file via to_csv(output_file,index=False,mode='wb',sep=',',encoding='utf-8')

However, my result csv file contains nan for all blank cells from the excel file.

What am I missing? I already tried .fillna('', inplace=True) function but it seems to be doing nothing to my data. I also tried to add parameter na_rep ='' to the to_csv method but without success.

Thanks for any help!

Addendum: Please find hereafter a reproducible example.

Please find hereafter a reproducible example code. Please first create a new Excel file with 2 columns with the following content: COLUMNA COLUMNB COLUMNC 01 test 02 test
03 test

(I saved this Excel file to c:\test.xls Please note that 1st and 3rd row for column B as well as the 2nd row for Column C is blank/empty)

Now here is my code:

import pandas as pd
xlsx = pd.ExcelFile('c:\\test.xlsx')
df = pd.read_excel(xlsx, sheetname='Sheet1', dtype = str)
df.fillna('', inplace=True)
df.to_csv('c:\\test.csv', index=False,mode='wb',sep=',',encoding='utf-8', na_rep ='')

My result is:
COLUMNA,COLUMNB,COLUMNC
01,nan,test
02,test,nan
03,nan,test

My desired result would be:
COLUMNA,COLUMNB,COLUMNC
01,,test
02,test,
03,,test

Kerge answered 17/7, 2017 at 15:38 Comment(5)
What do you mean by "blank"?Varela
df.fillna didn't work?Kabul
Please provide a reproducible example.Varela
By blank I mean an empty cell in Excel (when reading) gets written to 'nan' (without quotes) df.fillna('', inplace=True) didn't work. I will try to work on a reproducible example.Kerge
@Kerge So it is a nan string?Kabul
K
32

Since you are dealing with nan strings, you can use the replace function:

df = pd.DataFrame({'Col1' : ['nan', 'foo', 'bar', 'baz', 'nan', 'test']})
df.replace('nan', '')

   Col1
0      
1   foo
2   bar
3   baz
4      
5  test

All 'nan' string values will be replaced by the empty string ''. replace is not in-place, so make sure you assign it back:

df = df.replace('nan', '')

You can then write it to your file using to_csv.


If you are actually looking to fill NaN values with blank, use fillna:

df = df.fillna('')    
Kabul answered 17/7, 2017 at 15:40 Comment(7)
Thank you very much, this works! Would you be able to explain why fillna() doesn't work?Kerge
@Kerge df.fillna explicitly looks for np.nan (it is a float number with a special bit representation), whereas 'nan' is just a string. Also, you can consider marking this answer accepted if it helped. Thanks!Kabul
Thank you. Is there any specific reason/added value that read_excel would replace blank/empty cells with 'nan' string instead of leaving them blank?Kerge
@Kerge A little searching, and I found it's a bug apparently: #16158439Kabul
I didn't like the idea of replacing 'nan' with '' since that doesn't seem to make the distinction between legitimate 'nan' strings and empty cells, but df = pd.read_excel.fillna('') worked well, thanks.Gerardogeratology
This method will replace all "nan" string, so it's not safe. Don't use it.Eschar
That's what the question asks, and that is the answer providedKabul
B
13

I had the same issue, and the options for replace or fillna did not work. (df.info() showed that the fields where NaN was displayed were float64). The na_filter = False option worked for me. It would work like this:

df = pd.read_excel("myexcelfile.xlsx", na_filter = False)

You can read more about the na_filter option in the pandas read_excel documentation.

Boethius answered 8/8, 2022 at 2:47 Comment(1)
This should be the accepted option - it is faster and actually provides what is needed (the nan fields show up as empty strings). Thank you!Optime
M
6

I had the same problem and used the option keep_default_na=False in read_excel. df = pd.read_excel(content, sheet_name=sheet, header=[0], keep_default_na=False)

Mosque answered 8/2, 2022 at 13:38 Comment(3)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Glaydsglaze
This does not really answer the question. If you have a different question, you can ask it by clicking Ask Question. To get notified when this question gets new answers, you can follow this question. Once you have enough reputation, you can also add a bounty to draw more attention to this question. - From ReviewShirleyshirlie
very useful thoughPreconize

© 2022 - 2024 — McMap. All rights reserved.