pyarrow.lib.ArrowInvalid: ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type')

Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code:

import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

data = [
    Player('Jack', 21, 'm'),
    Player('Ryan', 18, 'm'),
    Player('Jane', 35, 'f'),
]
df = pd.DataFrame(data, columns=['player'])
print(pa.Table.from_pandas(df))

we get the error:

pyarrow.lib.ArrowInvalid: ('Could not convert <Jack (21)> with type Player: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column 0 with type object')

The same error occurs when using:

df.to_parquet('players.pq')

Is it possible for pyarrow to fall back to serializing these Python objects using pickle? Or is there a better solution? The pyarrow.Table will eventually be written to disk using pyarrow.parquet.write_table().

  • Using Python 3.8.0, pandas 0.25.3, pyarrow 0.13.0.
  • pandas.DataFrame.to_parquet() does not support a MultiIndex, so a solution using pq.write_table(pa.Table.from_pandas(df)) is preferred.

Thank you!

Baskerville answered 7/1, 2020 at 22:7 Comment(2)
Can you please open a JIRA issue with Apache Arrow? We don't really engage with users or developers on StackOverflow. github.com/apache/arrow/blob/master/CONTRIBUTING.md - Juggler
Did you ever figure this out? - Coastline

My suggestion is to serialize the data before inserting it into the DataFrame.

Best option - Use dataclass (python >=3.7)

Define the Player class as a dataclass using the decorator, then serialize each instance to JSON with dataclasses.asdict() and json.dumps(). A dataclass instance is not converted to JSON automatically when it is placed in a DataFrame, so the conversion has to be explicit.

import json
import pandas as pd
from dataclasses import dataclass, asdict

@dataclass
class PlayerV2:
    name: str
    age: int
    gender: str

    def __repr__(self):
        return f'<{self.name} ({self.age})>'


dataV2 = [
    PlayerV2(name='Jack', age=21, gender='m'),
    PlayerV2(name='Ryan', age=18, gender='m'),
    PlayerV2(name='Jane', age=35, gender='f'),
]

# asdict() turns each instance into a plain dict, which json.dumps() can serialize
df_v2 = pd.DataFrame([json.dumps(asdict(p)) for p in dataV2], columns=['player'])
print(df_v2)

# The object's attributes can still be recovered by deserializing the record
json.loads(df_v2["player"][0])['name']

Manually serialize the object (python < 3.7)

Define a serialization function in the Player class and serialize each instance before creating the DataFrame.

import pandas as pd
import json

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'
    
    # JSON serialization; if you really need pickle, you can use it here instead
    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__)

# Serialize the objects before inserting them into the DataFrame
data = [
    Player('Jack', 21, 'm').toJSON(),
    Player('Ryan', 18, 'm').toJSON(),
    Player('Jane', 35, 'f').toJSON(),
]
df = pd.DataFrame(data, columns=['player'])

# All the data is stored in the player column as serialized JSON
print(df)

# The object's attributes can still be recovered by deserializing the record
json.loads(df["player"][0])['name']
Biostatics answered 15/4, 2021 at 10:47 Comment(0)

As I understand it, the problem is the type of the values: the column holds Player objects, which pyarrow cannot map to an Arrow data type. Storing the formatted string instead works:

import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def other(self):
        return f'<{self.name} ({self.age})>'

data = [
    Player('Jack', 21, 'm').other(),
    Player('Ryan', 18, 'm').other(),
    Player('Jane', 35, 'f').other(),
]
df = pd.DataFrame(data, columns=['player'])
print(df)
        player
0  <Jack (21)>
1  <Ryan (18)>
2  <Jane (35)>

print(pa.Table.from_pandas(df))

pyarrow.Table
player: string
Hi answered 25/1, 2021 at 16:29 Comment(0)

I'm not sure whether Parquet can store the <string (int)> representation directly, but the conversion does work on dicts and lists.

For a Python class, use the object's __dict__ attribute to get a dictionary representation of the instance.

For example, the following works:

from dataclasses import dataclass
import pandas as pd
import pyarrow as pa

@dataclass
class Player:
  name: str
  age: int
  gender: str

players = [
  {"name": "player1", "age": 12, "gender": "f"},
  {"name": "player2", "age": 22, "gender": "m"},
  {"name": "player3", "age": 18, "gender": "m"}
]
df = pd.DataFrame()
df["players"] = [Player(**r).__dict__ for r in players]

pa.Table.from_pandas(df)
Trishtrisha answered 25/3, 2022 at 14:23 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - Conquian

Another option is to extend pandas with your own custom Dtype. Pandas gives a fair bit of documentation on how to create an extension Dtype, and you can look at the base class for more details, and existing extensions for examples.

That said, it's a bit involved, and if all you are looking to do is work around the "could not convert" error and get your data printed or saved to parquet, I'd recommend some form of pre-serializing as mentioned in other answers, or implement __str__ on your class, then convert the column type to str. While you're at it, since you'll be using __str__ for its intended purpose, you can improve your __repr__ to return a string that looks like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). Putting it all together, it'll be something like:

import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'Player("{self.name}", {self.age}, "{self.gender}")'

    def __str__(self):
        return f'<{self.name} ({self.age})>'


data = [
    Player('Jack', 21, 'm'),
    Player('Ryan', 18, 'm'),
    Player('Jane', 35, 'f'),
]
df = pd.DataFrame(data, columns=['player'])
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].astype('str')

print(pa.Table.from_pandas(df))
df.to_parquet('players.pq')
print([repr(d) for d in data])

This gives the output:

pyarrow.Table
player: string
----
player: [["<Jack (21)>","<Ryan (18)>","<Jane (35)>"]]
# No output from to_parquet b/c there was no error
['Player("Jack", 21, "m")', 'Player("Ryan", 18, "m")', 'Player("Jane", 35, "f")']

Naturally, if you want to keep the original DataFrame around with the original types, you'll want to change those column types on a copy instead of the original.

Slovenly answered 11/2, 2023 at 1:52 Comment(0)
