Using Kafka for Data Integration with Updates & Deletes

So a little background - we have a large number of data sources, ranging from RDBMSs to S3 files. We would like to synchronize and integrate this data with various other data warehouses, databases, etc.

At first, this seemed like the canonical use case for Kafka. We would like to stream the data changes through Kafka to the target data stores. In our test case we are capturing the changes with Oracle GoldenGate and successfully pushing them to a Kafka topic. However, pushing these changes through to the target data stores has proven challenging.

I realize that this would work very well if we were just appending new data to the Kafka topics. We could cache the changes and write them to the various target data stores. However, this is not the case. We will be updating, deleting, modifying partitions, etc., and the logic for handling all of this seems much more complicated.

We tried using staging tables and joins to update/delete the data, but I feel that would quickly become unwieldy.

This brings me to my question - are there any different approaches we could take to handle these operations? Or should we move in a totally different direction?

Any suggestions/help is much appreciated. Thank you!

Paginate answered 3/5, 2016 at 21:53 Comment(8)
Think like Hadoop's HDFS -- append only. Instead of deleting records, post a second record to the topic to unwind the first. Maybe it seems inefficient, but only because you aren't thinking like a hard drive. Hard drives like to read long stripes of data -- like reading from a Kafka partition start to finish. Hard drives don't like to hop around and read a record here, a record there. It's faster to read 10,000 records and throw out 9,990 of them than it is to move the disk head around 10 times to find the exact 10 records you want. A lot faster.Yuille
This is more than worth the read: kafka.apache.org/08/design.htmlYuille
Thanks a ton for your response @DavidGriffin. This is great stuff - I'm still making my way through the documentation you suggested. So in the case that the log has a stream of data source changes, how would you suggest these changes be propagated to the data warehouse? In this case, it would be inefficient to run all of the operations on the database, correct? I figure at some point the queue would need to be filtered down to the desired db operations.Paginate
Recommended read: confluent.io/blog/stream-data-platform-1 and confluent.io/blog/stream-data-platform-2Rune
Thanks @MatthiasJ.Sax. I also watched Jay Kreps' Kafka introduction. It's very helpful, but does he explain how to load the output from Kafka into a data warehouse such as Redshift or HDFS (updates and deletes, not just appends)? I may have missed it, but he never really seems to explain that particular step.Paginate
In HDFS you can only append. This is independent of Kafka; it's how HDFS is designed. For a database, you can of course do updates. To avoid replaying a huge log, log compaction is the Kafka feature you are looking for.Rune
Btw: you might also be interested in "Kafka Connect".Rune
Thanks @MatthiasJ.Sax The only problem with Kafka Connect is that it requires a key. Unfortunately, we do not have a key that we can use on the majority of our tables.Paginate

There are 3 approaches you can take:

  1. Full dump load
  2. Incremental dump load
  3. Binlog replication

Full dump load

Periodically, dump your RDBMS data source table into a file, and load that into the data warehouse, replacing the previous version. This approach is mostly useful for small tables, but it is very simple to implement and easily supports updates and deletes to the data.
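
For illustration, a minimal sketch of one way to do the "replace" step in the warehouse, assuming the dump file has already been loaded into a hypothetical my_table_staging table (the table names and exact syntax are illustrative and depend on your warehouse):

-- Atomically swap the freshly loaded dump in for the old copy.
BEGIN;
ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_staging RENAME TO my_table;
COMMIT;
-- Clean up the previous version once the swap has committed.
DROP TABLE my_table_old;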

Incremental dump load

Periodically, get the records that changed since your last query, and send them to be loaded to the data warehouse. Something along the lines of

SELECT *
FROM my_table
WHERE last_update > #{last_import}

This approach is slightly more complex to implement, because you have to maintain state ("last_import" in the snippet above), and it does not support deletes. It can be extended to support deletes, but that makes it more complicated. Another disadvantage of this approach is that it requires your tables to have a last_update column.
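
As a rough sketch, the fetched increment could then be applied with an upsert once it has been loaded into a staging table. MERGE syntax varies by warehouse, and my_table_staging, id, col_a and col_b are made-up names for illustration:

-- Upsert the changed rows (loaded into my_table_staging) into the warehouse copy.
MERGE INTO my_table t
USING my_table_staging s
ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET col_a = s.col_a, col_b = s.col_b, last_update = s.last_update
WHEN NOT MATCHED THEN
    INSERT (id, col_a, col_b, last_update)
    VALUES (s.id, s.col_a, s.col_b, s.last_update);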

Binlog replication

Write a program that continuously listens to the binlog of your RDBMS and sends these updates to be loaded into an intermediate table in the data warehouse, containing the updated values of the row and whether it is a delete or an update/insert. Then write a query that periodically consolidates these updates to create a table that mirrors the original table. The idea behind this consolidation process is to select, for each id, the latest version as seen across all the updates, or in the previous version of the consolidated table.
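
As a hedged sketch of that consolidation query, assuming the intermediate table is called my_table_updates and that both it and the consolidated table carry a change_seq column (binlog position) and an op_type column; all of these names are illustrative:

-- For each id, keep only the most recent version seen across the previous
-- consolidated table and the new batch of binlog updates, and drop ids whose
-- most recent operation was a delete.
CREATE TABLE my_table_new AS
SELECT id, col_a, col_b, change_seq, op_type
FROM (
    SELECT u.*,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_seq DESC) AS rn
    FROM (
        SELECT id, col_a, col_b, change_seq, op_type FROM my_table
        UNION ALL
        SELECT id, col_a, col_b, change_seq, op_type FROM my_table_updates
    ) u
) ranked
WHERE rn = 1
  AND op_type <> 'delete';

The resulting my_table_new can then be swapped in for the previous consolidated table, and the intermediate table truncated for the next batch.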

This approach is slightly more complex to implement, but allows achieving high performance even on large tables and supports updates and deletes.

Kafka is relevant to this approach in that it can be used as a pipeline for the row updates between the binlog listener and the load into the data warehouse's intermediate table.


You can read more about these different replication approaches in this blog post.

Disclosure: I work at Alooma (a co-worker wrote the blog post linked above, and we provide data pipelines as a service, solving problems like this).

Suggs answered 16/6, 2016 at 12:28 Comment(0)
