PySpark and Protobuf Deserialization UDF Problem
Asked Answered
H

1

19

I'm getting this error

Can't pickle <class 'google.protobuf.pyext._message.CMessage'>: it's not found as google.protobuf.pyext._message.CMessage

when I try to create a UDF in PySpark. Apparently, it uses CloudPickle to serialize the command however, I'm aware that protobuf messages contains C++ implementations, meaning it cannot be pickled.

I've tried finding a way to perhaps override CloudPickleSerializer, however, I wasn't able to find a way.

Here's my example code:

from MyProject.Proto import MyProtoMessage
from google.protobuf.json_format import MessageToJson
import pyspark.sql.functions as F

def proto_deserialize(body):
  msg = MyProtoMessage()
  msg.ParseFromString(body)
  return MessageToJson(msg)

from_proto = F.udf(lambda s: proto_deserialize(s))

base.withColumn("content", from_proto(F.col("body")))

Thanks in advance.

Hanes answered 3/5, 2020 at 14:41 Comment(2)
This seems to work fine with pyspark 3.2.1 and protoc 3.19.4.Dilks
im getting the same error. PicklingError: Could not serialize object: TypeError: cannot pickle 'google._upb._message.Descriptor' object. pyspark 3.3.0 i am wondering if it has something to do with converting from a C# proto file with BCL in itLeodora
D
1

From version 3.4.0 pyspark includes a mechanism to deserlialize protobuf binary messages where you can specify the path to the descriptor file of the message. In 3.5.0 an argument to allow passing the descriptor binary was added. This way you avoid UDFs overhead and errors :)

from pyspark.sql.protobuf.functions import from_protobuf
spark = ...
descriptor_file_path = ...

# Assume we have a protobuf message called MyMessage and that
# the dataframe has the binary data of the messages in a column
# called `data`
df = spark.createDataFrame(...)
df.withColumn(
    'deserialized',
    from_frotobuf(
        data='data',
        messageName='MyMessage',
        descFilePath=descriptor_file_path,
        # binaryDescriptorSet=... # If you rather provide the binary descriptor
    )
)

For more, the docs are here.

Dorton answered 25/1, 2024 at 12:56 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.