Is HDFS necessary for Spark workloads?

HDFS is not strictly necessary, but recommendations to use it appear in some places.

To help evaluate whether the effort of getting HDFS running is worthwhile:

What are the benefits of using HDFS for Spark workloads?

Encrata answered 19/9, 2015 at 14:12 Comment(7)
Well, do you need to store any data? – Miculek
@SeanOwen Haha, yes. But can Spark not just write to the hosts' filesystem, say EXT4? – Encrata
@Encrata Automatic resilience, automatic distribution, and integration with other tools that run nicely on HDFS, to name a few. I also think HDFS is engineered to reduce disk access, which may be a bottleneck for applications with big datasets on a non-distributed filesystem (in case you cannot cache it in Spark). – Eisenberg
Yes, you can store to local storage, but what use is that in a distributed computation framework? – Miculek
@kaktusito Distribution and resilience are solid reasons in support of HDFS. – Encrata
@SeanOwen True, but one may have an initial test environment that is a single machine, in which case the extra effort may not be worthwhile for now. When it comes time to scale out (that is, when scale-out has greater utility than scale-up), the effort can be spent getting HDFS running. – Encrata
Yes, local storage works fine for unit testing Spark jobs. You still need something for production. – Miculek
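
A minimal sketch of the local-vs-HDFS point from the comments above: in a single-machine test setup, Spark can read straight from the local filesystem, and moving to HDFS later is mostly a change of URI scheme. The paths, namenode host, and port below are illustrative, not taken from the question.

```scala
import org.apache.spark.sql.SparkSession

object LocalVsHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("local-vs-hdfs")
      .master("local[*]")          // single-machine testing; omit when submitting to a real cluster
      .getOrCreate()

    // Local filesystem (e.g. EXT4) -- fine for unit tests on one host.
    val localDs = spark.read.textFile("file:///tmp/input.txt")

    // HDFS -- same API, different scheme, once a cluster is in place.
    // val hdfsDs = spark.read.textFile("hdfs://namenode:8020/data/input.txt")

    println(localDs.count())
    spark.stop()
  }
}
```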

The shortest answer is: "No, you don't need it." You can analyse data even without HDFS, but of course you then need to replicate the data on all your nodes.

The long answer is quite counterintuitive, and I'm still trying to understand it with the help of the Stack Overflow community:

Spark local vs HDFS performance
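
As a hedged aside on the "replicate the data on all your nodes" point: for small side files, one alternative to copying them to every node by hand is to let Spark ship them to the executors itself via addFile. This is only a sketch; the file name is illustrative.

```scala
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

object ShipLocalFile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ship-local-file").getOrCreate()
    val sc = spark.sparkContext

    // Distribute a file from the driver to every executor's working directory.
    sc.addFile("/tmp/lookup.csv")

    val rdd = sc.parallelize(1 to 4).map { i =>
      // Each executor resolves its own local copy of the shipped file.
      val path = SparkFiles.get("lookup.csv")
      s"task $i sees $path"
    }
    rdd.collect().foreach(println)
    spark.stop()
  }
}
```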

Adopted answered 14/1, 2016 at 7:53 Comment(0)

HDFS (or any distributed filesystem) makes distributing your data much simpler. With a local filesystem you would have to partition/copy the data by hand to the individual nodes and be aware of the data distribution when running your jobs. In addition, HDFS handles node failures for you. Because of the integration between Spark and HDFS, Spark knows about the data distribution, so it will try to schedule tasks on the same nodes where the required data resides.
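
A small sketch of what that locality information looks like in practice, assuming data read from HDFS (the namenode address and path are placeholders): each partition of an HDFS-backed RDD carries the hosts holding the corresponding block as its preferred locations, which is what the scheduler uses for data-local task placement.

```scala
import org.apache.spark.sql.SparkSession

object LocalityProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("locality-probe").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.textFile("hdfs://namenode:8020/data/events.log")

    // For each partition, print the hosts the scheduler would prefer to run its task on.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }
    spark.stop()
  }
}
```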

Second: which problems exactly did you face with the setup instructions?

BTW: if you are just looking for an easy setup on AWS, DCOS allows you to install HDFS with a single command...

Frump answered 21/9, 2015 at 11:54 Comment(0)

You could go with a Cloudera or Hortonworks distro and load up an entire stack very easily. CDH is typically used with YARN, though I find it much more difficult to configure Mesos in CDH. Hortonworks is much easier to customize.

HDFS is great because its DataNodes give you data locality (process where the data is), and shuffling/data transfer is very expensive. HDFS also naturally splits files into blocks, which allows Spark to partition on those blocks (128 MB blocks by default; you can change this).
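
A hedged sketch of the block/partition relationship, with an illustrative HDFS path: by default, textFile produces roughly one input partition per HDFS block, so a multi-gigabyte file lands as many partitions without any manual splitting.

```scala
import org.apache.spark.sql.SparkSession

object BlockPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("block-partitions").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.textFile("hdfs://namenode:8020/data/big.csv")
    // Roughly one partition per HDFS block (file size / 128 MB with the default block size).
    println(s"partitions: ${rdd.getNumPartitions}")
    spark.stop()
  }
}
```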

You could use S3 and Redshift.

See here: https://github.com/databricks/spark-redshift
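Below is a rough sketch following the spark-redshift project's documented usage: read a Redshift table into a DataFrame, using S3 as the temporary staging area instead of HDFS. The JDBC URL, table name, and bucket are placeholders, and the S3 scheme (s3a vs s3n) depends on your Hadoop version.

```scala
import org.apache.spark.sql.SparkSession

object RedshiftRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("redshift-read").getOrCreate()

    val df = spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://redshifthost:5439/mydb?user=me&password=secret")
      .option("dbtable", "my_table")
      .option("tempdir", "s3a://my-bucket/tmp/")   // S3 staging area, no HDFS required
      .load()

    df.show()
    spark.stop()
  }
}
```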

Farreaching answered 19/9, 2015 at 15:53 Comment(0)
