Error when trying to write DataFrame to feather. Does feather support list columns?
C

3

8

I'm working with both R and Python and I want to write one of my pandas DataFrames as a feather so I can work with it more easily in R. However, when I try to write it as a feather, I get the following error:

ArrowInvalid: trying to convert NumPy type float64 but got float32

I double-checked my column types and they are already float64:

In[1]
df.dtypes

Out[1]
id         Object
cluster    int64
vector_x   float64
vector_y   float64

I get the same error regardless of using feather.write_dataframe(df, "path/df.feather") or df.to_feather("path/df.feather").

I saw this on GitHub but didn't understand if it was related or not: https://issues.apache.org/jira/browse/ARROW-1345 and https://github.com/apache/arrow/issues/1430

In the end, I can just save it as a csv and change the columns in R (or just do the whole analysis in Python), but I was hoping to use this.

Edit 1:

Still having the same issue despite the great advice below so updating what I've tried.

df[['vector_x', 'vector_y', 'cluster']] = df[['vector_x', 'vector_y', 'cluster']].astype(float)

df[['doc_id', 'text']] = df[['doc_id', 'text']].astype(str)

df[['doc_vector', 'doc_vectors_2d']] = df[['doc_vector', 'doc_vectors_2d']].astype(list)

df.dtypes

Out[1]:
doc_id           object
text             object
doc_vector       object
cluster          float64
doc_vectors_2d   object
vector_x         float64
vector_y         float64
dtype: object

Edit 2:

After much searching, it appears that the issue is that my cluster column is a list type made up of int64 integers. So I guess the real question is, does the feather format support lists?

Edit 3:

Just to tie this in a bow, feather does not support nested data types like lists, at least not yet.

Carboxylate answered 24/1, 2019 at 20:39 Comment(1)
Did storing the lists as strings work?Silicone
M
2
  • Luckily, I found the reason for my feather IO error here.
  • I also found the solution: DataFrame.to_feather and pandas.read_feather are both built on pyarrow, and pyarrow has supported columns that contain lists as values since 2019.

Solution:

pip install --upgrade pyarrow  # my version is 1.0.0 and it works

Then keep using df.to_feather("Filename") and pd.read_feather("Filename").

Marauding answered 5/8, 2020 at 7:41 Comment(0)
D
6

The problem in your case is the id Object column. Its values are Python objects, which cannot be represented in a language-neutral format. Thus feather (actually the underlying Apache Arrow / pyarrow) tries to guess the DataType of the id column. The guess is based on the first objects it sees in the column. These are float64 NumPy scalars. Later, you have float32 scalars. Instead of coercing them to some common type, Arrow is strict about types and fails.

You should be able to work around this problem by ensuring that all columns have a non-object dtype with df['id'] = df['id'].astype(float).
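To see which object columns mix Python types before writing, something like the following sketch may help. The helper name `mixed_type_columns` and the sample data are hypothetical, not part of pandas or pyarrow. It only reports the types it finds and does not convert anything.

```python
import numpy as np
import pandas as pd


def mixed_type_columns(df):
    """Return {column: set of type names} for object columns holding more than one Python type."""
    report = {}
    for name in df.columns[df.dtypes == object]:
        kinds = {type(value).__name__ for value in df[name]}
        if len(kinds) > 1:
            report[name] = kinds
    return report


# Made-up example: an object column that starts with a float64 scalar and later holds a float32.
df = pd.DataFrame({
    "id": pd.Series([np.float64(1.0), np.float32(2.0)], dtype=object),
    "cluster": [0, 1],
})

mixed = mixed_type_columns(df)
print(mixed)
```

Any column that shows up in the result is a candidate for an explicit `astype(...)` call before writing.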

Doge answered 25/1, 2019 at 17:4 Comment(4)
the id column is a string column--is there a type that I can convert it to that is compatible with arrow?Carboxylate
Yes, you can ensure that all objects in the column are of the same Python type by using df['id'] = df['id'].astype(str).Doge
still having the same issue--looks like I'm not actually converting it from an object . . . I updated the question with the code I've tried.Carboxylate
Try df['id'] = df['id'].astype('string')Harrelson
C
6

After much research, the simple answer is that feather does not support list (or other nested data type) columns.
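If upgrading is not an option, one workaround (the one hinted at in the comment on the question) is to store each list as a string and parse it back after reading. A rough sketch using JSON follows. The column names and values are invented for illustration, and it assumes the cells are plain Python lists. NumPy arrays would need `.tolist()` first.

```python
import json

import pandas as pd

# Hypothetical frame with a list-valued column.
df = pd.DataFrame({
    "doc_id": ["a", "b"],
    "doc_vector": [[0.1, 0.2], [0.3, 0.4]],
})

# Encode each list as a JSON string so the column holds only plain strings.
encoded = df.assign(doc_vector=df["doc_vector"].map(json.dumps))

# ... write `encoded` to feather or csv here, then read it back ...

# Decode the strings back into Python lists.
decoded = encoded.assign(doc_vector=encoded["doc_vector"].map(json.loads))
print(decoded["doc_vector"].tolist())
```

On the R side, the strings would need to be parsed with a JSON reader of your choice. This keeps the file format simple at the cost of an extra decode step.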

Carboxylate answered 26/1, 2020 at 21:47 Comment(1)
here it says feather V2 supports list columns but in practice I noticed that it didn't... <ursalabs.org/blog/2020-feather-v2>Banc
