Using Kafka for Data Integration with Updates & Deletes

So a little background - we have a large number of data sources, ranging from RDBMSs to S3 files. We would like to synchronize and integrate this data with various other data warehouses, databases, etc.

At first, this seemed like the canonical use case for Kafka. We would like to stream the data changes through Kafka to the target data stores. In our test case we are capturing the changes with Oracle GoldenGate and successfully pushing them to a Kafka topic. However, pushing these changes through to the target data stores has proven challenging.

I realize that this would work very well if we were just appending new data to the Kafka topics. We could cache the changes and write them to the various target data stores. However, this is not the case. We will be updating, deleting, modifying partitions, etc., and the logic for handling all of this seems much more complicated.

We tried using staging tables and joins to update/delete the data, but I feel that would quickly become unwieldy.

This brings me to my question - are there any different approaches we could take to handle these operations? Or should we move in a totally different direction?

Any suggestions/help is much appreciated. Thank you!

Paginate answered 3/5, 2016 at 21:53 Comment(8)
Think like Hadoop's HDFS -- append only. Instead of deleting records, post a second record to the topic to unwind the first. Maybe it seems inefficient, but only because you aren't thinking like a hard drive. Hard drives like to read long stripes of data -- like reading from a Kafka partition start to finish. Hard drives don't like to hop around and read a record here, a record there. It's faster to read 10,000 records and throw out 9,990 of them than it is to move the disk head around 10 times to find the exact 10 records you want. A lot faster.Yuille
This is more than worth the read: kafka.apache.org/08/design.htmlYuille
Thanks a ton for your response @DavidGriffin. This is great stuff - I'm still making my way through the documentation you suggested. So in the case that the log has a stream of data source changes, how would you suggest these changes be propagated to the data warehouse? In this case, it would be inefficient to run all of the operations on the database, correct? I figure at some point the queue would need to be filtered down to the desired db operations.Paginate
Recommended read: confluent.io/blog/stream-data-platform-1 and confluent.io/blog/stream-data-platform-2Rune
Thanks @MatthiasJ.Sax. I also watched Jay Kreps' Kafka introduction. It's very helpful, but does he explain how to load the output from Kafka into a data warehouse such as Redshift or HDFS (updates and deletes, not just appends)? I may have missed it, but he never really seems to explain that particular step.Paginate
In HDFS you can only append. This is independent of Kafka; it's how HDFS is designed. For a database, you can of course do updates. To avoid replaying a huge log, log compaction is the Kafka feature you are looking for.Rune
Btw: you might also be interested in "Kafka Connect".Rune
Thanks @MatthiasJ.Sax The only problem with Kafka Connect is that it requires a key. Unfortunately, we do not have a key that we can use on the majority of our tables.Paginate

There are 3 approaches you can take:

  1. Full dump load
  2. Incremental dump load
  3. Binlog replication

Full dump load

Periodically, dump your RDBMS data source table into a file, and load that into the data warehouse, replacing the previous version. This approach is mostly useful for small tables, but it is very simple to implement and easily supports updates and deletes to the data.
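
For illustration, a minimal sketch of one way to do the "replace" step in the warehouse, assuming the dump file has already been loaded into a hypothetical my_table_staging table (the table names and exact syntax are illustrative and depend on your warehouse):

-- Atomically swap the freshly loaded dump in for the old copy.
BEGIN;
ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_staging RENAME TO my_table;
COMMIT;
-- Clean up the previous version once the swap has committed.
DROP TABLE my_table_old;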

Incremental dump load

Periodically, get the records that changed since your last query, and send them to be loaded to the data warehouse. Something along the lines of

SELECT *
FROM my_table
WHERE last_update > #{last_import}

This approach is slightly more complex to implement, because you have to maintain state ("last_import" in the snippet above), and it does not support deletes. It can be extended to support deletes, but that makes it more complicated. Another disadvantage of this approach is that it requires your tables to have a last_update column.
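
As a rough sketch, the fetched increment could then be applied with an upsert once it has been loaded into a staging table. MERGE syntax varies by warehouse, and my_table_staging, id, col_a and col_b are made-up names for illustration:

-- Upsert the changed rows (loaded into my_table_staging) into the warehouse copy.
MERGE INTO my_table t
USING my_table_staging s
ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET col_a = s.col_a, col_b = s.col_b, last_update = s.last_update
WHEN NOT MATCHED THEN
    INSERT (id, col_a, col_b, last_update)
    VALUES (s.id, s.col_a, s.col_b, s.last_update);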

Binlog replication

Write a program that continuously listens to the binlog of your RDBMS and sends these updates to be loaded into an intermediate table in the data warehouse, containing the updated values of the row and whether it is a delete or an update/insert. Then write a query that periodically consolidates these updates to create a table that mirrors the original table. The idea behind this consolidation process is to select, for each id, the latest version as seen across all the updates, or in the previous version of the consolidated table.
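
As a hedged sketch of that consolidation query, assuming the intermediate table is called my_table_updates and that both it and the consolidated table carry a change_seq column (binlog position) and an op_type column; all of these names are illustrative:

-- For each id, keep only the most recent version seen across the previous
-- consolidated table and the new batch of binlog updates, and drop ids whose
-- most recent operation was a delete.
CREATE TABLE my_table_new AS
SELECT id, col_a, col_b, change_seq, op_type
FROM (
    SELECT u.*,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_seq DESC) AS rn
    FROM (
        SELECT id, col_a, col_b, change_seq, op_type FROM my_table
        UNION ALL
        SELECT id, col_a, col_b, change_seq, op_type FROM my_table_updates
    ) u
) ranked
WHERE rn = 1
  AND op_type <> 'delete';

The resulting my_table_new can then be swapped in for the previous consolidated table, and the intermediate table truncated for the next batch.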

This approach is slightly more complex to implement, but allows achieving high performance even on large tables and supports updates and deletes.

Kafka is relevant to this approach in that it can be used as a pipeline for the row updates between the binlog listener and the load into the data warehouse's intermediate table.


You can read more about these different replication approaches in this blog post.

Disclosure: I work at Alooma (a co-worker wrote the blog post linked above, and we provide data pipelines as a service, solving problems like this).

Suggs answered 16/6, 2016 at 12:28 Comment(0)
