Convert Avro in Kafka to Parquet directly into S3
I have topics in Kafka that are stored in Avro format. I would like to consume the entire topic (which at time of receipt will not change any messages) and convert it into Parquet, saving directly on S3.

I currently do this, but it requires consuming the messages from Kafka one at a time and processing them on a local machine: convert them to a Parquet file, and once the entire topic is consumed and the Parquet file is completely written, close the writer and then initiate an S3 multipart upload. Or, for short: | Avro in Kafka -> convert to Parquet locally -> copy file to S3 |

What I'd like to do instead is | Avro in Kafka -> parquet in S3 |

One of the caveats is that the Kafka topic name isn't static; it needs to be fed in as an argument, used once, and then never used again.

I've looked into Alpakka and it seems like it might be possible, but it's unclear; I haven't seen any examples. Any suggestions?

Outbuilding answered 13/6, 2019 at 14:28 Comment(2)
Possible duplicate of Parquet Output From Kafka Connect. (Combe)
You just described Kafka Connect :)

Kafka Connect is part of Apache Kafka, and the S3 sink connector plugin does exactly this. At the moment, though, Parquet support for that connector is still in development.

For a primer on Kafka Connect, see http://rmoff.dev/ksldn19-kafka-connect
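As a sketch, a sink from your topic into S3 is just a JSON config submitted to the Connect REST API. This assumes Confluent's kafka-connect-s3 plugin is installed; the connector name, topic, bucket, and region below are placeholders:

```json
{
  "name": "s3-sink-my-topic",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my-topic",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
    "flush.size": "10000"
  }
}
```

This also fits the dynamic-topic caveat: since the connector is created by a REST call, the topic name can be supplied per invocation and the connector deleted once the topic is drained.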

Frigg answered 13/6, 2019 at 14:57 Comment(0)

Try adding "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat" to the connector config in the PUT request you send to the Kafka Connect REST API when you set up your connector.

You can find more details here.
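For illustration, the PUT request might look like the following; the connector name, endpoint, topic, and bucket are placeholders, and ParquetFormat requires schema-aware data (e.g. Avro with a schema registry):

```shell
curl -X PUT http://localhost:8083/connectors/s3-sink-my-topic/config \
  -H "Content-Type: application/json" \
  -d '{
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my-topic",
    "s3.bucket.name": "my-bucket",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "10000"
  }'
```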

Ohare answered 2/10, 2022 at 15:2 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.