Converting excel to feather format with python
Asked Answered
A

2

2

I have a (daily growing) list of around 100 big excel files, which I analyse in Python. As I have to run several loops over all the files, my analysis are getting slower and slower. Therefore I'd like to convert all excel files into feather format (like once a week). Is there a clever way to do that? What I have tried so far:

path = r"filepath\*_name*.xlsx"
file_list = glob.glob(path)
for f in file_list:
    df = pd.read_excel(f, encoding='utf-8')
    df[['boola', 'boolb']] = dfa[['boola', 'boolb']].astype(int)
    pathname = f[:-5] + ".ftr"
    df.to_feather(pathname)

But I'm getting the following error message:

ArrowInvalid: ('Could not convert stringa with type str: tried to convert to boolean', "Conversion failed for column stringb with type object")
Arceliaarceneaux answered 6/5, 2020 at 14:43 Comment(1)
Looks like you're trying to write an incompatible datatype to feather, have you tried explicitly casting the flagged column to an acceptable datatype? Similar to https://mcmap.net/q/1280157/-error-when-trying-to-write-dataframe-to-feather-does-feather-support-list-columnsForesail
A
1

Here is what solved my problem:

path = r"pathname\*_somename*.xlsx"
file_list = glob.glob(path)
for f in file_list:
    df = pd.read_excel(f, encoding='utf-8', decimal=',', thousands='.')
    for col in df.columns:
            w= (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
            if len(df[w]) > 0:

                df[col] = df[col].astype(str)

            if df[col].dtype == list:
                df[col] = df[col].astype(str)
    pathname = f[:-4] + "ftr"
    df.to_feather(pathname)
df.head()

the , decimal=',', thousands='.' part was necessary because my input file was formatted in European standard, i.e. using comma as a decimal separator and a dot as thousands separator

Arceliaarceneaux answered 8/5, 2020 at 11:37 Comment(0)
D
0

Actually, you are getting this issue because the columns named "stringa,stringb" has some characters that feather cannot determine and he tried to convert to other types which returned error, So my solution for the same problem I faced before was to actually convert the columns to string first and replace characters that caused the error also:

import pandas as pd
import os
path = 'c://examplepath//'
files = [file for file in os.listdir(path)]
for file in files:
     df = pd.read_excel(path+file)
     df['column'] = df['column'].astype(str)
     df['column'] = df['column'].replace('old charecter causing error','new charecter').astype(str)
     df.to_feather(path+file.split('.')[0]+'.feather')

N.B I don't think pd.read_excel needs the parameter encoding as per documentation.

Distinguished answered 9/9, 2021 at 20:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.