How can I use pandas in Apache Beam? I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. The Apache Beam documentation is not clear on this either. I checked but couldn't find any pandas integration in Apache Beam. Can anyone point me to the relevant documentation?
There's some confusion going on here. pandas is "supported", in the sense that you can use the pandas library the same way you'd be using it without Apache Beam, and the same way you can use any other library from your Beam pipeline as long as you specify the proper dependencies. It is also "supported" in the sense that it is bundled as a dependency by default so you don't have to specify it yourself. For example, you can write a DoFn that performs some computation using pandas for every element; a separate computation for each element, performed by Beam in parallel over all elements.
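To make the "pandas inside a per-element function" idea concrete, here is a minimal sketch. It assumes each element is a small batch of records (a list of dicts); the names summarize_batch and run_demo and the user/amount columns are made up for illustration. beam.Map wraps the function in a DoFn for you.

```python
import pandas as pd


def summarize_batch(records):
    """Summarise ONE element (a list of dicts) with ordinary pandas."""
    frame = pd.DataFrame(records)
    totals = frame.groupby("user")["amount"].sum()
    # Return plain Python types so the result is easy for Beam to encode.
    return {user: float(total) for user, total in totals.items()}


def run_demo():
    """Apply summarize_batch to every element of a PCollection."""
    # Imported here so summarize_batch can be tried without Beam installed.
    import apache_beam as beam

    # Hypothetical input: two elements, each a batch of records.
    batches = [
        [{"user": "a", "amount": 1.0}, {"user": "a", "amount": 2.5}],
        [{"user": "b", "amount": 4.0}],
    ]
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.Create(batches)
            | beam.Map(summarize_batch)  # one pandas computation per element
            | beam.Map(print)
        )
```

Each call to summarize_batch only sees its own element, so this is not a pandas operation over the whole PCollection; anything that needs to combine elements still has to go through Beam transforms.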
It is not supported in the sense that Apache Beam currently provides no special integration with it, e.g. you can't use a PCollection as a pandas dataframe, or vice versa. A PCollection does not physically contain any data (this should be particularly clear for streaming pipelines) - it is just a placeholder node in Beam's execution plan.
That said, a pandas-like API for working with Beam PCollections would certainly be a good idea, and would simplify learning Beam for many existing pandas users, but I don't think anybody is working on implementing this currently. However, the Beam community is currently discussing the idea of adding schemas to PCollections, which is a step in this direction.
As well as using Pandas directly from DoFns, Beam now has an API to manipulate PCollections as Dataframes. See https://s.apache.org/simpler-python-pipelines-2020 for more details.
pandas is supported in the Dataflow SDK for Python 2.x. At the time of writing, workers have pandas v0.18.1 pre-installed, so you should not have any issue with that. Stack Overflow does not accept questions that ask the community to point you to external documentation and/or tutorials, so maybe you should first try an implementation yourself, and then come back with more information about what is or isn't failing and what you achieved before running into an error.
In any case, if what you want to achieve is a left join, maybe you can also have a look at the CoGroupByKey transform type, which is documented in the Apache Beam documentation. It is used to perform relational joins of several PCollections with a common key type. In that same page, you will be able to find some examples, which use CoGroupByKey and ParDo to join the contents of several data objects.
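For the left join on multiple columns from the question, one possible approach is to use a tuple of the join columns as the key. The following is only a sketch under some assumptions: rows are plain dicts, and the helper names (composite_key, emit_left_join, run_left_join) and the country/customer_id columns are hypothetical.

```python
def composite_key(row, columns):
    """Build a tuple key from several columns of a dict-shaped row."""
    return tuple(row[c] for c in columns)


def emit_left_join(grouped):
    """Turn one CoGroupByKey result into left-join output rows.

    `grouped` has the shape (key, {"left": [...], "right": [...]}).
    """
    _key, sides = grouped
    right_rows = list(sides["right"])
    for left_row in sides["left"]:
        if not right_rows:
            # No match on the right: keep the left row on its own.
            yield dict(left_row)
        for right_row in right_rows:
            # Left values win if both sides share a column name.
            yield {**right_row, **left_row}


def run_left_join(orders, customers, columns=("country", "customer_id")):
    """Left-join `orders` with `customers` on several columns."""
    # Imported here so the helpers above can be tried without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        left = (
            pipeline
            | "CreateLeft" >> beam.Create(orders)
            | "KeyLeft" >> beam.Map(lambda r: (composite_key(r, columns), r))
        )
        right = (
            pipeline
            | "CreateRight" >> beam.Create(customers)
            | "KeyRight" >> beam.Map(lambda r: (composite_key(r, columns), r))
        )
        (
            {"left": left, "right": right}
            | "Group" >> beam.CoGroupByKey()
            | "Join" >> beam.FlatMap(emit_left_join)
            | "Print" >> beam.Map(print)
        )
```

Keys that only appear on the right produce no output, which is what makes this a left join. Unmatched left rows simply lack the right-hand columns here; if you need explicit None values you would have to pass the expected column names in yourself.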