I read the emails about EU-to-US decimal mark conversion; they helped a lot, but I still feel I need some help from the experts .. My data comes from an ERP system with numbers in a format like "1'000'000,32", and I would simply like to convert them into something like "1000000.32" for further processing in Pandas.
My current solution to obtain the US format from the EU format looks like this:
...
# read_csv and merge, clean .. different CSV files
# result = merge(some_DataFrame_EU_format, ...)
...
result.to_csv(path, sep=';')
result = read_csv(path, sep=';', converters={'column_name': lambda x: float(x.replace('.', '').replace(',', '.'))})
...
result.to_csv(path, sep=';')
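(As an aside, read_csv can also parse such numbers directly through its thousands and decimal arguments, which might avoid the converter and one round trip entirely. The below is only a sketch, assuming the thousands separator really is '.' as in the converter above; for the "1'000'000,32" ERP example it would be thousands="'".)

# Sketch: let read_csv handle the EU formatting itself via 'thousands' and
# 'decimal'; 'path' is the same file path used in the snippet above.
result = read_csv(path, sep=';', thousands='.', decimal=',')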
I had the feeling this was a slow way to swap the ',' for '.' because of the read_csv and to_csv round trip (and the disk ..), so I wanted to try the .replace method directly on the DataFrame to save some processing time.
My initial attempt was something like the code below (which I read elsewhere here on the forum ..):
result['column_name'] = result['column_name'].replace('.', '')
result['column_name'] = result['column_name'].replace(',', '.')
result['column_name'] = result['column_name'].astype(float)
This did not work and resulted in an 'invalid literal for float' error.
So I moved to:
for i in range(len(result)):
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace('.', '')
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace(',', '.')
result['column_name'] = result['column_name'].astype(float)
The above worked .. but, to my surprise, it turned out to be about 3 times slower than the read_csv/converters solution. Chaining the .replace calls, as below, helped somewhat:
for i in range(len(result)):
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace('.', '').replace(',', '.')
result['column_name'] = result['column_name'].astype(float)
I read the fine manuals .. and I know that read_csv is optimized .. but I really did not expect a read/write/read/write cycle to be three times faster than a for loop!
Do you think it is worth working more on this? Any suggestions? Or is it better to stay with the repeated write/read/write approach?
My file is about 30k lines x 150 columns; the read/write/read(convert)/write approach takes about 18 seconds, while the .ix loop takes over 52 seconds with the first kind of loop (and about 32 with the chained .replace).
What is your experience converting DataFrames from EU to US format? Any suggested methods to improve this? What about 'mapping' or 'locale'? Might they be faster?
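By 'mapping' I mean something along these lines — just a rough, untested sketch on the same 'column_name' as above, not timed:

def eu_to_us(value):
    # drop the thousands separator, then swap the decimal comma for a dot
    return float(value.replace('.', '').replace(',', '.'))

# apply the converter element-wise instead of looping over row positions
result['column_name'] = result['column_name'].map(eu_to_us)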
Thank you so much, Fabio.
P.S. I realize I was 'verbose' and not 'pythonic' enough .. sorry, sorry .. I'm still learning ... :-)
df['column'].replace('.','').replace(',','.').astype(float)
should be much faster as it's done as array ops – Sappington
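(For reference, a sketch of that vectorized idea: Series.replace on its own only matches whole cell values, which is also why the earlier non-loop attempt raised 'invalid literal for float'. With a reasonably recent pandas, substring replacement goes through the .str accessor; the column name and values below are placeholders only.)

import pandas as pd

# Vectorized EU -> US conversion with the .str accessor; Series.replace
# alone only matches entire cell values, hence the earlier error.
demo = pd.DataFrame({'column_name': ['1.000.000,32', '2.500,00']})
demo['column_name'] = (demo['column_name']
                       .str.replace('.', '', regex=False)   # drop thousands separator
                       .str.replace(',', '.', regex=False)  # decimal comma -> dot
                       .astype(float))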