How to set up Spark with a multi-node Cassandra cluster?
First of all, I am not using the DSE Cassandra. I am building this on my own and using Microsoft Azure to host the servers.

I have a 2-node Cassandra cluster, I've managed to set up Spark on a single node but I couldn't find any online resources about setting it up on a multi-node cluster.

This is not a duplicate of "how to setup spark Cassandra multi node cluster?"

To set it up on a single node, I've followed this tutorial "Setup Spark with Cassandra Connector".

Longboat answered 10/8, 2017 at 14:38

You have two high level tasks here:

  1. set up Spark (single node or cluster);
  2. set up Cassandra (single node or cluster).

These tasks are different and unrelated (unless we are talking about data locality). How to set up Spark as a cluster is described in the Architecture overview. Generally there are two deployment types: standalone, where you set up Spark on the hosts directly, or via a task scheduler (YARN, Mesos); which to pick depends on your requirements. Since you built everything yourself, I assume you will use a standalone installation.

The difference from a single-node setup is network communication. By default Spark binds to localhost; more commonly it uses FQDN names, so you should configure them in /etc/hosts and check hostname -f, or try IPs instead. Take a look at this page, which lists all the ports needed for communication between nodes. All of these ports should be open and reachable between the nodes. Be aware that by default Spark uses TorrentBroadcastFactory with random ports.
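As a rough sketch of the standalone setup described above (the hostnames, IPs, and ports here are example assumptions, not values from the question; substitute your own Azure hosts):

```shell
# On every node, make the FQDNs resolvable, e.g. in /etc/hosts:
#   10.0.0.4  spark-node1.example.internal  spark-node1
#   10.0.0.5  spark-node2.example.internal  spark-node2

# conf/spark-env.sh on the master node (example host name):
export SPARK_MASTER_HOST=spark-node1.example.internal
export SPARK_MASTER_PORT=7077        # fixed master port
export SPARK_MASTER_WEBUI_PORT=8080  # master web UI

# Start the master on node1, then a worker on each node
# (start-slave.sh in older Spark releases, start-worker.sh in newer ones):
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://spark-node1.example.internal:7077
```

Because broadcast and shuffle pick random ports by default, either open the ephemeral port range between the cluster nodes or pin the ports explicitly with settings such as spark.driver.port and spark.blockManager.port.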

For Cassandra, see these docs: 1, 2, and tutorials 3, etc. You will most likely need 4. You could also run Cassandra inside Mesos using Docker containers.
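For the two-node Cassandra cluster itself, the essential cassandra.yaml settings look roughly like this (the cluster name and addresses are example assumptions):

```yaml
# cassandra.yaml on each node
cluster_name: 'MyCluster'            # must be identical on both nodes
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.4"            # at least one seed, e.g. node1's IP
listen_address: 10.0.0.4             # this node's own IP (10.0.0.5 on node2)
rpc_address: 10.0.0.4                # address clients/the connector use
endpoint_snitch: GossipingPropertyFileSnitch
```

From Spark's side, you then point the Spark Cassandra Connector at any node via spark.cassandra.connection.host, for example (connector version is illustrative): spark-shell --conf spark.cassandra.connection.host=10.0.0.4 --packages datastax:spark-cassandra-connector:2.0.5-s_2.11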

p.s. If data locality is your concern, you will have to build something yourself, because neither Mesos nor YARN handles scheduling Spark jobs for partitioned data close to the Cassandra partitions.

Intermission answered 7/9, 2017 at 17:31
