Pyspark Failed to find data source: kafka
I am working on Kafka streaming and trying to integrate it with Apache Spark, but when I run the code I get the error below.

This is the command I am using.

df_TR = Spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "taxirides").load()

ERROR:

Py4JJavaError: An error occurred while calling o77.load.: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html

How can I resolve this?

NOTE: I am running this in a Jupyter Notebook.

import findspark
findspark.init('/home/karan/spark-2.1.0-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
Spark = SparkSession.builder.appName('KafkaStreaming').getOrCreate()
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

Everything runs fine up to this point (the code above).

df_TR = Spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "taxirides").load()

This is where things go wrong (the line above).

The blog post I am following: https://www.adaltas.com/en/2019/04/18/spark-streaming-data-pipelines-with-structured-streaming/

Coadjutor answered 6/11, 2019 at 4:53

Edit

Using spark.jars.packages works more reliably than PYSPARK_SUBMIT_ARGS; see the sketch below.

Ref - PySpark - NoClassDefFoundError: kafka/common/TopicAndPartition
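
A minimal sketch of that approach, assuming a fresh Python process so the config is applied before the JVM starts (session name reused from the question):

import findspark
findspark.init()

from pyspark.sql import SparkSession

# packages are resolved when the JVM launches, so set the coordinate
# before getOrCreate()
spark = SparkSession.builder \
    .appName('KafkaStreaming') \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0') \
    .getOrCreate()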


It's not clear how you ran the code. If you keep reading the blog post, you will see:

spark-submit \
  ...
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
  sstreaming-spark-out.py

It seems you missed adding the --packages flag.

In Jupyter, you could add this

import os

# must be set before pyspark is imported, since the JVM reads it at launch;
# the trailing 'pyspark-shell' is required for PySpark to accept the arguments
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 pyspark-shell'

# initialize spark (findspark must run before pyspark is imported)
import findspark
findspark.init()
import pyspark

Note: the _2.11:2.4.0 suffix needs to align with your Scala and Spark versions. Based on the question, yours should be Spark 2.1.0, so the suffix for that install would be _2.11:2.1.0.
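
Putting it together for the asker's setup, a sketch with the coordinate aligned to Spark 2.1.0 (path and topic name taken from the question):

import os

# coordinate aligned with the Spark 2.1.0 install from the question;
# 'pyspark-shell' must stay at the end of the submit args
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 pyspark-shell'

import findspark
findspark.init('/home/karan/spark-2.1.0-bin-hadoop2.7')

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('KafkaStreaming').getOrCreate()

# the original readStream call should now resolve the "kafka" source
df_TR = (spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "taxirides")
         .load())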

Dunant answered 6/11, 2019 at 5:40 Comment(2)
After adding the import os snippet, I am getting another error now. Py4JJavaError: An error occurred while calling o27.load. : java.lang.ClassNotFoundException: Failed to find data source: kafka.Coadjutor
@PKernel This is because the version of spark-sql-kafka does not match the spark version you are currently running.Prem
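
A quick way to check for the mismatch described above (an illustrative snippet, not part of the original comments): print the running Spark version and match the package suffix to it.

print(spark.version)  # e.g. '2.1.0' -> use spark-sql-kafka-0-10_2.11:2.1.0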
