Why is the creation of Python protobuf messages so slow?

Say I have a message defined in test.proto as:

syntax = "proto3";

message TestMessage {
    int64 id = 1;
    string title = 2;
    string subtitle = 3;
    string description = 4;
}

And I use protoc to convert it to Python like so:

protoc --python_out=. test.proto

timeit for PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python:

from test_pb2 import TestMessage

%%timeit
tm = TestMessage()
tm.id = 1
tm.title = 'test title'
tm.subtitle = 'test subtitle'
tm.description = 'this is a test description'

6.75 µs ± 152 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

timeit for PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp:

1.6 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
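Note that the environment variable has to be set before google.protobuf is first imported, since it is read once at import time. Which backend actually loaded can be checked at runtime with an internal module (a sketch; api_implementation is not a public, stable API and may change between releases):

```python
import os

# Must be set before the first `google.protobuf` import; the variable is
# read once at import time, so setting it later has no effect.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "cpp"

try:
    # Internal module: not a stable API, but handy for a quick check.
    from google.protobuf.internal import api_implementation
    print(api_implementation.Type())  # e.g. 'cpp', 'python', or 'upb'
except ImportError:
    print("protobuf is not installed in this environment")
```

If the C++ extension is not available, the library falls back to the pure-Python implementation, so the timings above should always be sanity-checked this way.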

Compare that to just a dict:

%%timeit
tm = dict(
    id=1,
    title='test title',
    subtitle='test subtitle',
    description='this is a test description'
)

308 ns ± 2.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
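For anyone reproducing these numbers outside IPython, the same measurement can be made with the stdlib timeit module (a sketch using the dict case, since it needs no generated code):

```python
import timeit

def build_dict():
    # The dict stand-in from the comparison above.
    return dict(
        id=1,
        title='test title',
        subtitle='test subtitle',
        description='this is a test description',
    )

n = 100_000
seconds = timeit.timeit(build_dict, number=n)
print(f"{seconds / n * 1e9:.0f} ns per loop")
```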

And that is only for one message; even with the cpp implementation, building the full set of messages in my project takes about 10.6 µs.

Is there a way to make this faster? Perhaps by compiling the generated module (test_pb2)?

Pomiculture asked 13/5, 2020 at 21:49 · Comments (7)
Protocol buffers are widely used and pretty well optimized already, so I doubt it. Also, you don't really "compile" a Python source file; you could use a different interpreter if you needed to (PyPy, etc.). But in any case, do you have reason to believe that serialization specifically is a bottleneck in your application? — Alfie

@Alfie I was thinking there might be a way to output C++ and call those messages from Python by building with setup.py somehow. It's a bottleneck for me because I'm parsing millions of rows of data into proto messages, and it's taking 15+ hours. — Pomiculture

Do you mean write a C++ executable to do the serialization, and then call that from Python? If so, that would be more expensive than what you have (you need to get the data from Python to C++, which is... serialization, plus process overhead). Have you tried the standard tools for parallelizing CPU-bound work, like ProcessPoolExecutor, joblib, or similar? — Alfie
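A minimal sketch of the parallelization suggested here, using only the standard library; serialize_batch is a hypothetical worker standing in for the real message-building code:

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

def chunked(rows, size):
    # Yield lists of up to `size` rows so each task amortizes the
    # per-task process and pickling overhead.
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch

def serialize_batch(batch):
    # Hypothetical worker: in the real project this would build a
    # TestMessage per row and return its SerializeToString() bytes.
    return [repr(row).encode() for row in batch]

if __name__ == "__main__":
    rows = ({"id": i, "title": f"title {i}"} for i in range(1000))
    with ProcessPoolExecutor() as pool:
        # Large chunks matter: sending rows one at a time would spend
        # more on inter-process transfer than on serialization itself.
        for encoded in pool.map(serialize_batch, chunked(rows, 250)):
            pass  # write `encoded` out here
```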
@Alfie I found this example, which might be what I'm looking for: yz.mit.edu/wp/fast-native-c-protocol-buffers-from-python — Pomiculture

What protobuf and Python versions are you using? — Scrawny

@Scrawny I'm using Python 3.8 and protobuf 3.9.2. — Pomiculture

Hey, @BrendanMartin. Did you solve this issue? — Bane

© 2022 - 2024 — McMap. All rights reserved.