Cannot save pandas dataframe to parquet with lists of floats as cell values

I have a dataframe with a structure like this:

                                               Column1                                            Column2
0    (0.00030271668219938874, 0.0002655923890415579...  (0.0016430083196610212, 0.0014970217598602176,...
1    (0.00015607803652528673, 0.0001314736582571640...  (0.0022136708721518517, 0.0014974646037444472,...
2    (0.011317798867821693, 0.011339936405420303, 0...  (0.004868391435593367, 0.004406007472425699, 0...
3    (3.94578673876822e-05, 3.075833956245333e-05, ...  (0.0075020878575742245, 0.0096737677231431, 0....
4    (0.0004926157998852432, 0.0003811710048466921,...  (0.010351942852139473, 0.008231297135353088, 0...
..                                                 ...                                                ...
130  (0.011190211400389671, 0.011337820440530777, 0...  (0.010182800702750683, 0.011351295746862888, 0...
131  (0.006286659277975559, 0.007315031252801418, 0...  (0.02104150503873825, 0.02531484328210354, 0.0...
132  (0.0022791570518165827, 0.0025983047671616077,...  (0.008847278542816639, 0.009222050197422504, 0...
133  (0.0007059817435219884, 0.0009831463685259223,...  (0.0028264704160392284, 0.0029402063228189945,...
134  (0.0018992726691067219, 0.002058899961411953, ...  (0.0019639385864138603, 0.002009353833273053, ...

[135 rows x 2 columns]

where each cell holds a list/tuple of some float values:

type(psd_res.data_frame['Column1'][0])
<class 'tuple'>
type(psd_res.data_frame['Column1'][0][0])
<class 'numpy.float64'>

(The tuple in every cell contains the same number of entries.)

When I try to save the dataframe as parquet with the fastparquet engine, I get this error:

Can't infer object conversion type: 0    (0.00030271668219938874, 0.0002655923890415579...
1    (0.00015607803652528673, 0.0001314736582571640...
...

Name: Column1, dtype: object

Full stack trace: https://pastebin.com/8Myu8hNV

I also tried the other engine, pyarrow, which fails with:

pyarrow.lib.ArrowInvalid: ('Could not convert (0.00030271668219938874, ..., 0.0002464042045176029)
  with type tuple: did not recognize Python value type when inferring an Arrow data type', 
  'Conversion failed for column UO-Pumpe with type object')

I then found this thread: https://github.com/dask/fastparquet/issues/458. It seems to be a bug in fastparquet, but according to the thread it should work with pyarrow, which also fails for me.

I also tried a few things I found, such as infer_objects() and astype(float), but nothing has worked so far.

Does anyone know how I can save my dataframe to parquet?

Sciamachy answered 25/3, 2021 at 14:4 Comment(0)

The cells of your dataframe contain tuples of floats. This is an unusual data type, and arrow cannot infer a column type from it.

So you need to give arrow a little help to figure out the type of your data. To do so, provide the schema of your table explicitly:

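import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "column1": [(1.0, 2.0), (3.0, 4.0, 5.0)]
    }
)
# Declare the column explicitly as a list of float64 values.
schema = pa.schema([pa.field('column1', pa.list_(pa.float64()))])
# With the pyarrow engine, the schema keyword is passed on to pyarrow.
df.to_parquet('/tmp/hello.pq', schema=schema)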

Note that if the cells held lists of floats instead of tuples, it would have worked without an explicit schema:

df = pd.DataFrame(
    {
        "column1": [[1.0, 2.0], [3.0, 4.0, 5.0]]
    }
)
df.to_parquet('/tmp/hello.pq')
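If the dataframe already contains tuples, one way to get to the list form is to convert the cells before writing. The helper below is a hypothetical sketch (tuples_to_lists is not a pandas function); it copies the dataframe and replaces every tuple in object columns with a list.

```python
import pandas as pd


def tuples_to_lists(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df in which every tuple cell is replaced by a list."""
    out = df.copy()
    for col in out.columns:
        # Only object columns can hold Python tuples.
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: list(v) if isinstance(v, tuple) else v
            )
    return out


# Hypothetical sample data with tuple cells.
df = pd.DataFrame({"Column1": [(1.0, 2.0), (3.0, 4.0, 5.0)]})
converted = tuples_to_lists(df)
print(type(converted["Column1"][0]))
```

The converted dataframe can then be written with converted.to_parquet('/tmp/hello.pq') as in the example above. For large data this adds a Python-level pass over every cell, so the explicit schema may be the cheaper option; that is worth measuring rather than assuming.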
Verily answered 25/3, 2021 at 18:43 Comment(3)
Thanks, I will try this tomorrow and give some feedback. - Sciamachy
I just tried it. I had to make some changes because pandas converted my list of lists into a series of tuples by itself. I now use the pandas.DataFrame.from_dict method to keep the lists, and with lists it works, as you mentioned. Because I have a huge amount of data, I will also try the schema method and see which performs better. - Sciamachy
@Sciamachy What was the outcome of your experiments? - Eeg
