"No Filesystem for Scheme: gs" when running spark job locally

I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder).

When running the job locally on my Mac machine, I am getting the following error:

5932 [main] ERROR com.doit.customer.dataconverter.Phase1 - Job for date: 2014_09_23 failed with error: No FileSystem for scheme: gs

I know that two things need to be done in order for gs:// paths to be supported. One is to install the GCS connector, and the other is to add the following setup to the core-site.xml of the Hadoop installation:

<property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>
     The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
    </description>
</property>

I think my problem comes from the fact that I am not sure where exactly each piece needs to be configured in this local mode. In the IntelliJ project I am using Maven, so I imported the Spark library as follows:

<dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.0</version>
    <exclusions>
        <exclusion>  <!-- declare the exclusion here -->
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>

and Hadoop 1.2.1 as follows:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
</dependency>
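
As for the GCS connector itself, I assume it also has to show up as a dependency. I would guess something along these lines, though I am not sure which version matches Hadoop 1.2.1, so the coordinates below are only my guess:

<dependency> <!-- GCS connector; version is a guess, not confirmed for Hadoop 1.2.1 -->
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>1.6.0-hadoop1</version>
</dependency>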

The thing is, I am not sure where the Hadoop location is configured for Spark, or where the Hadoop conf is picked up, so I may be adding the properties to the wrong Hadoop installation. Also, does anything need to be restarted after modifying the files? As far as I can see, there is no Hadoop service running on my machine.

Dihedron asked 5/1, 2015 at 15:41 Comment(0)

In Scala, add the following config to the SparkContext's hadoopConfiguration:

val conf = sc.hadoopConfiguration
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
Clockwork answered 13/3, 2017 at 5:50 Comment(2)
Very elegant. You will probably have to include some adequate dependency for the latest google cloud storage connector to make that possible.Aric
Is there a similar solution for this problem on Spark + Java?Emlynne

There are a couple of ways to help Spark pick up the relevant Hadoop configuration, both of which involve modifying ${SPARK_INSTALL_DIR}/conf:

  1. Copy or symlink your ${HADOOP_HOME}/conf/core-site.xml into ${SPARK_INSTALL_DIR}/conf/core-site.xml. For example, when bdutil installs onto a VM, it runs:

    ln -s ${HADOOP_CONF_DIR}/core-site.xml ${SPARK_INSTALL_DIR}/conf/core-site.xml
    

The older Spark docs explain that this gets the XML files included in Spark's classpath automatically: https://spark.apache.org/docs/0.9.1/hadoop-third-party-distributions.html

  2. Add an entry to ${SPARK_INSTALL_DIR}/conf/spark-env.sh with:

    export HADOOP_CONF_DIR=/full/path/to/your/hadoop/conf/dir
    

The newer Spark docs seem to indicate that this is the preferred method going forward: https://spark.apache.org/docs/1.1.0/hadoop-third-party-distributions.html

Therm answered 7/1, 2015 at 7:6 Comment(4)
But what is the Spark install dir when I use the Spark Maven component?Dihedron
Ah, I see, if you're running straight out of your Maven project, you actually just need to make the core-site.xml (and probably also hdfs-site.xml) available in the classpath as mentioned elsewhere through the normal Maven means, namely by adding the two files to your src/main/resources directory. Edit: Pressed enter too early, here's a link to a blog post describing the similar case of Hadoop-only configuration with Maven: jayunit100.blogspot.com/2013/06/…Therm
After adding the core-site.xml/hdfs-site.xml to the classpath, now I get the following error upon doing sc = new JavaSparkContext(conf); - java.lang.ClassNotFoundException: org.apache.hadoop.fs.LocalFileSystem. I am getting this, even though I have hadoop-core.jar version 1.2.1 in my classpath.Dihedron
If you're running using mvn exec:java then indeed you'd expect the dependencies to be correctly present, but if you're doing mvn package and just running the jarfile, you have to explicitly ensure the right dependencies on your classpath. Commonly, you may want to build an "uberjar" which bundles all the transitive dependencies into a single jar that can be run without having to deal with classpaths. See this page: maven.apache.org/plugins/maven-shade-plugin/examples/… - the second example is similar to what you need, you can try copy/pasting into your pom.xmlTherm

I can't say what's wrong, but here's what I would try.

  • Try setting fs.gs.project.id: <property><name>fs.gs.project.id</name><value>my-little-project</value></property>
  • Print sc.hadoopConfiguration.get("fs.gs.impl") to make sure your core-site.xml is getting loaded. Print it in the driver and also in the executors: println(x); rdd.foreachPartition { _ => println(x) } (see the sketch after this list).
  • Make sure the GCS jar is sent to the executors (sparkConf.setJars(...)). I don't think this would matter in local mode (it's all one JVM, right?) but you never know.
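
For illustration, a rough version of that check, assuming an existing SparkContext sc and some RDD rdd (on the executor side I build a fresh Configuration so it reflects what that JVM actually picks up from its classpath):

    // Driver side: did core-site.xml (or a programmatic conf.set) take effect?
    println("driver fs.gs.impl = " + sc.hadoopConfiguration.get("fs.gs.impl"))

    // Executor side: a fresh Configuration loads core-site.xml from that JVM's classpath
    rdd.foreachPartition { _ =>
      val executorConf = new org.apache.hadoop.conf.Configuration()
      println("executor fs.gs.impl = " + executorConf.get("fs.gs.impl"))
    }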

Nothing but your program needs to be restarted. There is no Hadoop process. In local and standalone modes Spark only uses Hadoop as a library, and only for IO I think.

Dyslalia answered 5/1, 2015 at 22:14 Comment(4)
I tried your suggestions. It seems that adding the project id property did not have any effect. Regarding fs.gs.impl, I can confirm the value is null, so that's probably the cause of the problem, but I am not sure why. I even tried setting it in code: conf.set("fs.gs.impl", com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.class.getName()); but it didn't change a thing. Is there a call in the API I can make to get the Hadoop folder path? Maybe it points to the wrong Hadoop distribution, not the one I set the conf onDihedron
I think either core-site.xml or conf/core-site.xml needs to be on the classpath.Dyslalia
After adding the core-site.xml/hdfs-site.xml to the classpath, now I get the following error upon doing sc = new JavaSparkContext(conf); - java.lang.ClassNotFoundException: org.apache.hadoop.fs.LocalFileSystem. I am getting this, even though I have hadoop-core.jar version 1.2.1 in my classpath.Dihedron
In my project that class comes from hadoop-common-2.2.0.jar.Dyslalia

You can apply these settings directly on the Spark reader/writer as follows:

  spark
    .read
    .option("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .option("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .option("google.cloud.auth.service.account.enable", "true")
    .option("google.cloud.auth.service.account.json.keyfile", "<path-to-json-keyfile.json>")
    .option("header", true)
    .csv("gs://<bucket>/<path-to-csv-file>")
    .show(10, false)
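
The same presumably applies when writing; a hedged sketch of the writer side (df, the bucket and the output path are placeholders):

  df
    .write
    .option("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .option("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .option("google.cloud.auth.service.account.enable", "true")
    .option("google.cloud.auth.service.account.json.keyfile", "<path-to-json-keyfile.json>")
    .option("header", true)
    .csv("gs://<bucket>/<output-path>")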

Also add the relevant jar dependency to your build.sbt (or whichever build tool you use); check https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector for the latest version:

"com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-2.2.6" classifier "shaded"

See GCS Connector and Google Cloud Storage connector for non-dataproc clusters

Swirly answered 6/6, 2022 at 17:55 Comment(0)
