I have a streaming process that reads data from Kafka, processes it with Spark, and writes the results to Cassandra.
This will run on a cluster of 3 to 5 nodes. My plan is to deploy Spark, Kafka and Cassandra on every node of the cluster.
I want to enforce data locality as much as possible. By that I mean that each Spark node reads ONLY the Kafka data that lives on that node, processes it locally (there are no shuffling transformations in my pipelines), and writes the results to the Cassandra instance on the same node.
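To make the setup concrete, here is a simplified sketch of the kind of pipeline I have in mind (the topic, keyspace, table and column names are just placeholders). It uses the spark-streaming-kafka-0-10 direct stream with the PreferBrokers location strategy, which, as far as I understand, schedules each Kafka partition's tasks on the executor that shares a host with that partition's leader broker, and then writes out through the Spark-Cassandra Connector:

```scala
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferBrokers
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

object LocalPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-to-cassandra")
      // each executor contacts the Cassandra node it is co-located with
      .set("spark.cassandra.connection.host", "localhost")

    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.GROUP_ID_CONFIG -> "local-pipeline"
    )

    // PreferBrokers schedules each Kafka partition on the executor running on
    // the same host as that partition's broker (brokers and executors are
    // co-located in this deployment).
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferBrokers,
      Subscribe[String, String](Seq("events"), kafkaParams)
    )

    // map-only transformation (no shuffle), then write to Cassandra
    stream
      .map(record => (record.key, record.value))
      .saveToCassandra("my_keyspace", "events", SomeColumns("id", "payload"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```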
So, my questions are the following:
1) In order for the same topic to be stored on multiple nodes, do I need to partition the Kafka topic, i.e. give it at least one partition per broker? (See the first sketch after this list.)
2) Do I need to synchronize (i.e. make identical) the Kafka partitioner and the Cassandra partitioner, so that data arriving at a Kafka partition on node X is guaranteed to be stored in Cassandra on that same node?
3) Are there any other things I should pay special attention to in the Spark pipeline? I am using the Spark-Cassandra Connector, which should exploit data locality, so that each Spark task reads the data stored on its own node. (See the second sketch after this list.)
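Regarding question 1, what I have in mind is creating the topic with as many partitions as there are brokers, so that every node hosts a share of the topic. A minimal sketch using the Kafka AdminClient (topic name, partition count and broker address are placeholders):

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateTopic {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    val admin = AdminClient.create(props)
    try {
      // one partition per broker so every node hosts part of the topic;
      // replication factor 1 keeps each partition's data on a single node
      val topic = new NewTopic("events", 5, 1.toShort)
      admin.createTopics(Collections.singleton(topic)).all().get()
    } finally {
      admin.close()
    }
  }
}
```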
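And for question 3, these are the connector settings I am assuming matter for locality in a co-located deployment (again just a sketch; the property names come from the Spark-Cassandra Connector configuration reference, the values are my assumptions):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // contact the Cassandra node running on the same host as the Spark process
  .set("spark.cassandra.connection.host", "localhost")
  // LOCAL_ONE should keep reads and writes on a single (ideally local) replica
  .set("spark.cassandra.input.consistency.level", "LOCAL_ONE")
  .set("spark.cassandra.output.consistency.level", "LOCAL_ONE")
```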
Any blog posts or articles explaining how this should be done would be much appreciated.
Regards,
Srdjan