Can Apache Spark run without Hadoop?

Are there any dependencies between Spark and Hadoop?

If not, are there any features I'll miss when I run Spark without Hadoop?

Iridescence answered 15/8, 2015 at 6:51 Comment(0)

Spark can run without Hadoop, but some of its functionality relies on Hadoop's code (e.g., handling of Parquet files). We're running Spark on Mesos and S3, which was a little tricky to set up but works really well once done (you can read a summary of what was needed to set it up properly here).

(Edit) Note: since version 2.3.0, Spark has also added native support for Kubernetes.
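For context, pointing Spark at S3 generally means adding the S3A connector and supplying credentials. A minimal sketch (the `hadoop-aws` version must match the Hadoop version your Spark build targets; the bucket name and keys are placeholders):

```shell
# Hypothetical launch: pull in the S3A connector and supply credentials.
# org.apache.hadoop:hadoop-aws must match the Hadoop version Spark is built against.
./bin/spark-shell \
  --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY

# Then, inside the shell:
#   spark.read.parquet("s3a://your-bucket/some/path/")
```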

Scholl answered 15/8, 2015 at 12:0 Comment(0)

By default, Spark does not have a storage mechanism.

To store data, it needs a fast and scalable file system. You can use S3, HDFS, or any other file system. Hadoop is an economical option because it runs on low-cost commodity hardware.

Additionally, if you use Tachyon (now Alluxio), it will boost performance together with Hadoop. Using Hadoop for Apache Spark processing is highly recommended.

[Image: dead link]

Trev answered 15/8, 2015 at 15:25 Comment(1)
The link is dead. If the image was not created by you, state its origin.Nebulize

As per the Spark documentation, Spark can run without Hadoop.

You can run it in standalone mode without any external resource manager.

But if you want to run in a multi-node setup, you need a resource manager like YARN or Mesos and a distributed file system like HDFS, S3, etc.

Mustache answered 18/8, 2015 at 4:45 Comment(0)

Yes, Spark can run without Hadoop. All core Spark features will continue to work, but you'll miss things like easily distributing all your files (code as well as data) to all the nodes in the cluster via HDFS, etc.
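For example, without HDFS you can still ship side files with a job via `spark-submit`'s `--files` flag, though they are copied from the driver rather than pulled from a shared file system (the file, host, and script names here are hypothetical):

```shell
# Hypothetical: distribute a local file to every executor's working directory.
./bin/spark-submit \
  --master spark://master-host:7077 \
  --files /local/path/lookup.csv \
  my_job.py
```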

Ridgeway answered 15/8, 2015 at 7:28 Comment(0)

Yes, you can install Spark without Hadoop, though it can be a little tricky. You can refer to Arnon's link about using Parquet to configure S3 as the data storage: http://arnon.me/2015/08/spark-parquet-s3/

Spark only does processing, and it uses dynamic memory to perform the task, but to store the data you need some data-storage system. Here Hadoop comes into play with Spark: it provides the storage for Spark. One more reason for using Hadoop with Spark is that both are open source, and they integrate with each other easily compared to other data-storage systems. For other storage like S3, configuration is tricky, as mentioned in the link above.

But Hadoop also has its own processing unit, called MapReduce.

Alba answered 17/1, 2016 at 0:47 Comment(0)

Yes, of course. Spark is an independent computation framework. Hadoop is a distributed storage system (HDFS) with a MapReduce computation framework. Spark can get data from HDFS, as well as from any other data source, such as traditional databases (JDBC), Kafka, or even the local disk.
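As a sketch of that flexibility, a job can be launched against purely local resources, with no HDFS anywhere (the script name and input path are placeholders):

```shell
# Hypothetical: run a job in local mode, reading the input from the local disk.
./bin/spark-submit --master "local[*]" my_job.py file:///tmp/input.csv
```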

Talanian answered 18/8, 2015 at 15:12 Comment(0)

TL;DR

Use local (single-node) mode or standalone (cluster) mode to run Spark without Hadoop, but it still needs the Hadoop dependencies for logging and some file processing.

Running Spark on Windows is strongly discouraged!


Local mode

Spark has many run modes. One of them is called local, and it runs without Hadoop dependencies.

So, here is the first question: How can I tell Spark we want to run in local mode?

After reading the official documentation, I gave it a try on my Linux OS:

  1. You must install Java and Scala. That isn't the core content here, so I'll skip it.

  2. Download the Spark package. There are two types of package: "without Hadoop" and "Hadoop integrated". The most important thing to understand is that "without Hadoop" does not mean it runs without Hadoop; it is just not bundled with Hadoop, so you can bundle it with your custom Hadoop! Spark can run without Hadoop (HDFS and YARN), but you still need Hadoop dependency JARs such as the Parquet and Avro SerDe classes, so I strongly recommend using the "integrated" package (if you choose the "without Hadoop" package, you will find some logging dependencies like Log4j and SLF4J and other common utility classes missing, but all of this is bundled with the Hadoop-integrated package)!

  3. Run in local mode

The simplest way is to just run the shell, and you will see the welcome log:

# The same as ./bin/spark-shell --master local[*]
./bin/spark-shell
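Relatedly, if you do choose the "without Hadoop" package from step 2, the Spark documentation describes pointing it at an existing Hadoop installation through `SPARK_DIST_CLASSPATH` in `conf/spark-env.sh` (this sketch assumes a local Hadoop install whose `hadoop` binary is on your `PATH`):

```shell
# conf/spark-env.sh -- tell a "Hadoop free" Spark build where Hadoop's JARs live.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```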

Stand-alone mode

The same as above, but with a different step 3.

# Start up the cluster
# If you want to run it in the foreground
# export SPARK_NO_DAEMONIZE=true
./sbin/start-master.sh
# Run this on each of your workers
./sbin/start-worker.sh spark://VMS110109:7077

# Submit job or just shell
./bin/spark-shell spark://VMS110109:7077

On Windows?

I know many people run Spark on Windows just for study, but things are so different on Windows that it is really strongly not recommended.

The most important thing is to download winutils.exe from here and configure the system variable HADOOP_HOME to point to where winutils is located.
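Concretely, that setup might look like this on the Windows command line (the `C:\hadoop` location is an assumption; `winutils.exe` goes in its `bin` subdirectory):

```cmd
:: Hypothetical: winutils.exe was placed at C:\hadoop\bin\winutils.exe
setx HADOOP_HOME "C:\hadoop"
setx PATH "%PATH%;C:\hadoop\bin"
```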

At the moment, 3.2.1 is the latest release version of Spark, but a bug exists. You will get an exception like `Illegal character in path at index 32: spark://xxxxxx:63293/D:\classe` when running `./bin/spark-shell.cmd`. Starting up a standalone cluster first and then using `./bin/spark-shell.cmd`, or using a lower version, can temporarily work around this.

For more details and solutions, you can refer to here.

Forsaken answered 28/3, 2022 at 15:9 Comment(0)

Yes, Spark can run without Hadoop. You can install Spark on your local machine without Hadoop. But the Spark distribution ships with pre-built Hadoop libraries, which are used when installing it on your local machine.

Corabella answered 21/1, 2020 at 13:54 Comment(0)

You can run Spark without Hadoop, but on Windows Spark depends on Hadoop's winutils, so some features may not work there. Also, if you want to read Hive tables from Spark, then you need Hadoop.

Deauville answered 15/7, 2021 at 5:24 Comment(0)

The Apache Spark framework doesn't contain a default file system for storing data, so it often uses Apache Hadoop, which contains a distributed file system that's economical; major companies also use Apache Hadoop, which is why Spark is commonly paired with the Hadoop file system.

Apache Spark is a data-processing technology; the big-data ecosystem has thousands of technologies, and Spark can work with many of them for storage.

But a distributed file system is really helpful for Spark processing, because it helps produce output faster than other storage options.

If Spark is used standalone with only local storage, it produces output slowly compared to a distributed file system.

So, in the end, Spark can run on many storage technologies, but output is slower compared to the Hadoop file system.

Anglonorman answered 30/4, 2023 at 3:46 Comment(0)

No. It requires a full-blown Hadoop installation to start working ("Provide self-contained deployment, not tightly-coupled with Hadoop").

Sanguine answered 9/10, 2015 at 9:40 Comment(4)
This is incorrect, it works fine without Hadoop in current versions.Lefton
@ChrisChambers Would you care to elaborate? Comment on that issue says "In fact, Spark does require Hadoop classes no matter what", and on downloads page there are only options to either a pre-built for a specific Hadoop version or one with user-provided Hadoop. And docs say "Spark uses Hadoop client libraries for HDFS and YARN." and this dependency doesn't seem to be optional.Pacheco
@Pacheco correct. I just tried executing the 'User provided Hadoop' download artifact and immediately get a stack trace. I also wish for Spark's classpath to be decoupled from core Hadoop classes. But for prototyping and testing purposes, I take no issue other than the size of the download (120 something MB) all in all. Oh well. Cheers!Bechuana
Stack trace in question: $ ./spark-shell Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:118) at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefault at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 7 more Bechuana

© 2022 - 2024 — McMap. All rights reserved.