pd.read_feather problems with decimal / thousands separator and rounding problems for floats
Asked Answered
M

2

6

I'd like to use .ftr files to quickly analyze hundreds of tables. Unfortunately I have some problems with decimal and thousands separator, similar to that post, just that read_feather does not allow for decimal=',', thousands='.' options. I've tried the following approaches:

df['numberofx'] = (
    df['numberofx']
    .apply(lambda x: x.str.replace(".","", regex=True)
                      .str.replace(",",".", regex=True))

resulting in

AttributeError: 'str' object has no attribute 'str'

when I change it to

df['numberofx'] = (
    df['numberofx']
    .apply(lambda x: x.replace(".","").replace(",","."))

I receive some strange (rounding) mistakes in the results, like 22359999999999998 instead of 2236 for some numbers that are higher than 1k. All below 1k are 10 times the real result, which is probably because of deleting the "." of the float and creating an int of that number.

Trying

df['numberofx'] = df['numberofx'].str.replace('.', '', regex=True)

also leads to some strange behavior in the results, as some numbers are going in the 10^12 and others remain at 10^3 as they should.

Here is how I create my .ftr files from multiple Excel files. I know I could simply create DataFrames from the Excel files but that would slowdown my daily calculations to much.

How can I solve that issue?


EDIT: The issue seems to come from reading in an excel file as df with non US standard regarding decimal and thousands separator and than saving it as feather. using pd.read_excel(f, encoding='utf-8', decimal=',', thousands='.') options for reading in the excel file solved my issue. That leads to the next question:

why does saving floats in a feather file lead to strange rounding errors like changing 2.236 to 2.2359999999999998?

Morrissey answered 20/5, 2020 at 9:30 Comment(4)
just to be clear you want to convert 1.000.000,10 to 1000000.10 on your numberofx column ?Yaakov
all the values in that column should be integers. Somehow the values are stored as float instead. I think the issue is because of European vs. US dezimal signs. Therefore all digits that are greater than 1k are converted to e.g. 2.236. I'd like to receive the original integer valuesMorrissey
you sure that column in the dataframe contains float and not string type ? right ? to see type of column execute that line --> df.dtypes['numberofx']Yaakov
You are right, it is stored as object, including the decimal and thousands separator sign.Morrissey
M
0

As mentioned in my edit here is what solved my initial problem:

path = r"pathname\*_somename*.xlsx"
file_list = glob.glob(path)
for f in file_list:
    df = pd.read_excel(f, encoding='utf-8', decimal=',', thousands='.')
    for col in df.columns:
            w= (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
            if len(df[w]) > 0:

                df[col] = df[col].astype(str)

            if df[col].dtype == list:
                df[col] = df[col].astype(str)
    pathname = f[:-4] + "ftr"
    df.to_feather(pathname)
df.head()

I had to add the decimal=',', thousands='.' option for reading in an excel file, which I later saved as feather. So the problem did not arise when working with .ftr files but before. The rounding problems seem to come from saving numbers with different decimal and thousand separators as .ftr files.

Morrissey answered 1/6, 2020 at 8:44 Comment(0)
Y
2

the problem in your code is that :

when you check your column type in dataframe ( Panda ) you gonna find :

df.dtypes['numberofx']

result : type object

so suggested solution is to try :

df['numberofx'] = df['numberofx'].apply(pd.to_numeric, errors='coerce')

Another way to fix this problem is to convert your values to float :

def coerce_to_float(val):
    try:
       return float(val)
    except ValueError:
       return val

df['numberofx']= df['numberofx'].applymap(lambda x: coerce_to_float(x))

to avoid that type of float '4.806105e+12' here is a sample Sample :

df = pd.DataFrame({'numberofx':['4806105017087','4806105017087','CN414149']})
print (df)
              ID
0  4806105017087
1  4806105017087
2       CN414149

print (pd.to_numeric(df['numberofx'], errors='coerce'))
0    4.806105e+12
1    4.806105e+12
2             NaN
Name: ID, dtype: float64

df['numberofx'] = pd.to_numeric(df['numberofx'], errors='coerce').fillna(0).astype(np.int64)
print (df['numberofx'])
              ID
0  4806105017087
1  4806105017087
2              0
Yaakov answered 25/5, 2020 at 13:33 Comment(1)
thank you @Zine Mahmoud! Unfortunately all figures that are supposed to be above 1k and are some float (e.g. 2.236) due to some formatting problems, are simply cut after the point (so e.g. 2 instead of 2236)Morrissey
M
0

As mentioned in my edit here is what solved my initial problem:

path = r"pathname\*_somename*.xlsx"
file_list = glob.glob(path)
for f in file_list:
    df = pd.read_excel(f, encoding='utf-8', decimal=',', thousands='.')
    for col in df.columns:
            w= (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
            if len(df[w]) > 0:

                df[col] = df[col].astype(str)

            if df[col].dtype == list:
                df[col] = df[col].astype(str)
    pathname = f[:-4] + "ftr"
    df.to_feather(pathname)
df.head()

I had to add the decimal=',', thousands='.' option for reading in an excel file, which I later saved as feather. So the problem did not arise when working with .ftr files but before. The rounding problems seem to come from saving numbers with different decimal and thousand separators as .ftr files.

Morrissey answered 1/6, 2020 at 8:44 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.