spark-streaming Questions
3
Solved
I am building a Spark Structured Streaming application where I am doing a batch-stream join, and the source for the batch data gets updated periodically.
So I am planning to do a persist/unpersist...
Rizzo asked 11/2, 2021 at 12:32
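A minimal sketch of that persist/unpersist refresh, assuming a Parquet batch source refreshed inside foreachBatch; the paths, the rate-source stand-in for the stream, the join key, and the refresh cadence are all illustrative, not from the question:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().getOrCreate()

// Placeholder streaming side: a rate source renamed to expose a "key" column.
val streamDf = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "key")

// Hypothetical batch side, persisted so each micro-batch reuses the cache.
var staticDf: DataFrame = spark.read.parquet("/data/batch").persist()

streamDf.writeStream.foreachBatch { (batch: DataFrame, batchId: Long) =>
  if (batchId % 100 == 0) {                 // arbitrary refresh cadence
    staticDf.unpersist()                    // drop the stale cached copy
    staticDf = spark.read.parquet("/data/batch").persist()
  }
  batch.join(staticDf, "key").write.mode("append").parquet("/data/out")
}.start()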
3
Solved
I want to write one large DataFrame with repartitioning, so I want to calculate the number of partitions for my source DataFrame:
numberofpartition = {size of dataframe / default_blocksize}
How to c...
Crept asked 21/4, 2020 at 7:45
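On Spark 2.4+, one way to fill in that formula is the optimizer's size estimate; `df` is the question's DataFrame and the output path is illustrative:

// sizeInBytes is only an estimate, so treat the result as a rough target.
val sizeInBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes
val defaultBlockSize = 128L * 1024 * 1024   // the common HDFS default
val numPartitions = math.max(1, (sizeInBytes / defaultBlockSize).toInt)

df.repartition(numPartitions).write.parquet("/tmp/out")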
5
I have a Spark Streaming job which has been running continuously. How do I stop the job gracefully? I have read the usual recommendations about attaching a shutdown hook in the job monitoring and send...
Intyre asked 15/9, 2015 at 9:41
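A sketch of the marker-file variant of that recommendation; the marker path and poll interval are arbitrary choices, not a fixed API:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf(), Seconds(10))
// ... DStream setup elided ...
ssc.start()

var stopped = false
while (!stopped) {
  stopped = ssc.awaitTerminationOrTimeout(10000L)   // poll every 10 s
  val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
  if (!stopped && fs.exists(new Path("/tmp/shutdown-marker"))) {  // hypothetical marker
    ssc.stop(stopSparkContext = true, stopGracefully = true)      // drain in-flight batches
    stopped = true
  }
}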
4
I am trying to write data to an S3 bucket from my local computer:
spark = SparkSession.builder \
    .appName('application') \
    .config("spark.hadoop.fs.s3a.access.key", configuration.AWS_AC...
Stephanestephani asked 20/3, 2022 at 11:10
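For reference, the usual s3a settings in Scala (the excerpt is PySpark, but the config keys are identical); the credential lookup is a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("application")
  .master("local[*]")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))     // placeholder
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY")) // placeholder
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .getOrCreate()

Writing from a local machine also needs the hadoop-aws jar (plus its matching AWS SDK) on the classpath, built against the same Hadoop version as Spark.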
2
I need to know the file name of the input file that is streamed from the input dir.
Below is the Spark file-streaming code, in Scala:
object FileStreamExample {
def main(args: Array[St...
Chairborne asked 13/10, 2019 at 9:42
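The standard tool for this is the input_file_name() function, which tags each row with the path of the file it came from; the input directory below is a placeholder:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().getOrCreate()

val streamed = spark.readStream
  .format("text")                              // text source has a fixed schema
  .load("/data/input")                         // hypothetical input dir
  .withColumn("fileName", input_file_name())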
2
Solved
I keep getting the following exception very frequently and I wonder why this is happening. After researching I found I could do .set("spark.submit.deployMode", "nio"); but that did not work eit...
Hankow asked 6/9, 2016 at 10:59
1
I have a Spark Streaming application that reads data from multiple Kafka topics. Each topic has a different type of data, and thus requires a different processing pipeline.
My initial solution was...
Laclos asked 2/4, 2017 at 11:8
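One common shape for this (a sketch, with placeholder brokers and topic names): subscribe once and branch on the topic column the Kafka source provides, giving each branch its own sink. Note that each started query still consumes the subscription independently, so one stream per topic is an equally valid layout.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")   // placeholder
  .option("subscribe", "topicA,topicB")             // placeholder topics
  .load()

// The Kafka source exposes a `topic` column to branch on.
val pipelineA = raw.filter($"topic" === "topicA")   // topicA-specific processing
val pipelineB = raw.filter($"topic" === "topicB")   // topicB-specific processing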
3
Solved
I am using Spark 1.5.2. I need to run a Spark Streaming job with Kafka as the streaming source. I need to read from multiple topics within Kafka and process each topic differently.
Is it a good idea...
Forth asked 23/12, 2015 at 7:24
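With the DStream API of that era (spark-streaming-kafka for Kafka 0.8), the simple direct stream returns plain (key, value) pairs without topic metadata, so one direct stream per topic is often the cleanest way to process each differently; brokers and topics below are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf(), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "host:9092")   // placeholder

val topicA = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topicA"))
val topicB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topicB"))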
4
I'm facing an issue related to Kafka.
My current service (the producer) sends messages to a Kafka topic (events). The service uses kafka_2.12 v1.0.0 and is written in Java.
I'm tryin...
Dishabille asked 23/7, 2018 at 20:44
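The excerpt cuts off mid-sentence, but if the goal is to read that events topic from Spark, a minimal Structured Streaming consumer would look like this (the broker address is a placeholder, and the spark-sql-kafka connector version must match the Spark build):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")   // placeholder
  .option("subscribe", "events")                    // the producer's topic
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")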
5
When I run the following test, it throws "Cannot call methods on a stopped SparkContext". The possible problem is that I use TestSuiteBase and a streaming Spark context. At the line val gridEvalsRDD ...
Corium asked 27/4, 2016 at 8:52
2
Solved
I need to modify a class by adding two new parameters. This class is serialized with Kryo.
I'm currently persisting the information related to this class, among other things, as an RDD, every time ...
Silicosis asked 23/8, 2016 at 15:22
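One option going forward, sketched below with a hypothetical stand-in class: register the class with Kryo's CompatibleFieldSerializer, which tolerates added or removed fields at some performance cost. It only helps for data written with this serializer; objects already persisted with the default FieldSerializer will not deserialize into the changed class.

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.CompatibleFieldSerializer
import org.apache.spark.serializer.KryoRegistrator

case class MyClass(a: Int, b: String)   // stand-in for the persisted class

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    kryo.register(classOf[MyClass],
      new CompatibleFieldSerializer(kryo, classOf[MyClass]))
}

// enabled with:
//   spark.serializer=org.apache.spark.serializer.KryoSerializer
//   spark.kryo.registrator=<your package>.MyRegistrator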
1
Solved
I am trying to read a kafka stream and save it to Hive as a table.
The consumer code is:
import org.apache.spark.sql.{DataFrame, Dataset, SaveMode, SparkSession}
import org.apache.spark.sql.functi...
Competitor asked 29/3, 2023 at 17:33
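A sketch of the usual Kafka-to-Hive shape on Spark 2.4+: land each micro-batch with foreachBatch. Broker, topic, table, and checkpoint path are placeholders.

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")   // placeholder
  .option("subscribe", "mytopic")                   // placeholder
  .load()

kafkaDf.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch.write.mode(SaveMode.Append).saveAsTable("mydb.mytable")  // placeholder
  }
  .option("checkpointLocation", "/tmp/chk")         // placeholder
  .start()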
4
I have a simple Spark application running in cluster mode.
val funcGSSNFilterHeader = (x: String) => {
  println(!x.contains("servedMSISDN"))
  !x.contains("servedMSISDN")
}
val ssc = new Stream...
Recapitulation asked 5/9, 2016 at 4:47
4
Solved
I have a scenario in my project where I am reading Kafka topic messages using spark-sql 2.4.1. I am able to process the data using Structured Streaming. Once the data is received and a...
Thule asked 10/6, 2019 at 10:23
7
I'm processing events using DataFrames converted from a stream of JSON events, which eventually get written out in Parquet format.
However, some of the JSON events contain spaces in the keys, which...
Salutation asked 4/7, 2016 at 19:26
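Parquet rejects spaces (along with ,;{}()\n\t=) in field names, so the usual fix is renaming before the write; this sketch handles only top-level columns (`df` is the question's DataFrame, the path illustrative), and nested JSON keys would need a schema rewrite instead:

val cleaned = df.columns.foldLeft(df) { (d, c) =>
  d.withColumnRenamed(c, c.replaceAll("\\s+", "_"))   // spaces -> underscores
}
cleaned.write.parquet("/tmp/out")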
1
I am looking to see if there is something like the AWS Glue "bookmark" in Spark. I know there is checkpointing in Spark, which works well on an individual data source. In Glue we could use bookmark ...
Kaolin asked 14/9, 2021 at 6:59
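The closest built-in analogue is a Structured Streaming checkpoint, which records which files/offsets have already been processed; with Trigger.Once the job behaves like a bookmarked batch run. Paths and schema below are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()
val schema = new StructType().add("id", LongType).add("payload", StringType)  // hypothetical

spark.readStream
  .schema(schema)
  .json("s3a://bucket/in")                            // placeholder
  .writeStream
  .format("parquet")
  .option("path", "s3a://bucket/out")                 // placeholder
  .option("checkpointLocation", "s3a://bucket/chk")   // the "bookmark" lives here
  .trigger(Trigger.Once())                            // run-once, batch-like behaviour
  .start()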
2
I am trying to submit spark-submit but its failing with as weird message.
Error: Could not find or load main class org.apache.spark.launcher.Main
/opt/spark/bin/spark-class: line 96: CMD: bad arr...
Hannus asked 3/8, 2020 at 17:7
4
I am using Spark 1.3.0 with the Python API. While transforming huge DataFrames, I cache many DFs for faster execution:
df1.cache()
df2.cache()
Once the use of a certain DataFrame is over and it is no longer ...
Eddo asked 26/8, 2015 at 5:40
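The release call is simply unpersist(); shown in Scala here, but the PySpark method has the same name:

df1.unpersist()                  // lazily releases df1's cached blocks
df2.unpersist(blocking = true)   // optionally wait until the blocks are gone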
3
Trying to create a test for the Spark Structured Streaming writeStream function, as shown below:
SparkSession spark = SparkSession.builder().master("local").appName("spark session").getOrCreate()
val lakeD...
Auriscope asked 18/7, 2018 at 17:39
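A sketch of one way to test writeStream: feed the query from a MemoryStream and collect results in an in-memory sink (the query name and test data are arbitrary):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().master("local").appName("spark session").getOrCreate()
import spark.implicits._
implicit val sqlCtx = spark.sqlContext

val source = MemoryStream[String]
source.addData("a", "b")

val query = source.toDF().writeStream
  .format("memory")
  .queryName("result")            // results land in an in-memory table
  .start()
query.processAllAvailable()
assert(spark.table("result").count() == 2)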
1
I have a simple Spark job that streams data to a Delta table.
The table is pretty small and is not partitioned.
A lot of small parquet files are created.
As recommended in the documentation (https:...
Coastguardsman asked 12/8, 2021 at 13:22
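The compaction recipe from the Delta documentation rewrites the table into fewer files without changing its contents (`spark` being an existing session); the path and target file count below are placeholders:

val path = "/delta/mytable"                // placeholder
spark.read.format("delta").load(path)
  .repartition(4)                          // target file count, workload-dependent
  .write
  .option("dataChange", "false")           // downstream streams ignore the rewrite
  .format("delta")
  .mode("overwrite")
  .save(path)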
1
Solved
I'm doing window-based sorting in Spark Structured Streaming:
val filterWindow: WindowSpec = Window
  .partitionBy("key")
  .orderBy($"time")
controlDataFrame = controlDat...
Fronniah asked 22/11, 2021 at 7:39
2
Solved
I know there are already many threads on 'spark streaming connection refused' issues, but most of them are on Linux or at least point to HDFS. I am running this on my local laptop with Windows...
Edina asked 26/7, 2015 at 1:40
1
In my Spark job, I tried to overwrite a table in each micro-batch of Structured Streaming:
batchDF.write.mode(SaveMode.Overwrite).saveAsTable("mytable")
It generated the following error.
...
Selfexpression asked 19/9, 2020 at 9:33
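Without the full error this is a guess, but the usual route is doing the overwrite inside foreachBatch and pinning the table format explicitly, since saveAsTable's default source can clash with an existing table; the rate source stands in for the real stream and all names are placeholders:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val streamDf = spark.readStream.format("rate").load()   // stand-in stream

streamDf.writeStream
  .foreachBatch { (batchDF: DataFrame, _: Long) =>
    batchDF.write
      .mode(SaveMode.Overwrite)
      .format("parquet")                                // assumed table format
      .saveAsTable("mytable")
  }
  .option("checkpointLocation", "/tmp/chk")             // placeholder
  .start()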
1
Given the following series of events:
df1 = read
df2 = df1.action
df3 = df1.action
df2a = df2.action
df2b = df2.action
df3a = df3.action
df3b = df3.action
df4 = union(df2a, df2b, df3a, df3b)
df4.col...
Sabu asked 20/9, 2021 at 12:17
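For a lineage like that, every branch of the final union re-reads df1 unless the shared upstream frames are persisted; a sketch with hypothetical transformations standing in for the elided actions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df1 = spark.read.parquet("/data/in").cache()   // cache the shared root
val df2 = df1.filter($"x" > 0).cache()             // hypothetical "action"
val df3 = df1.filter($"x" <= 0).cache()
val df2a = df2.select($"x" * 2)
val df2b = df2.select($"x" * 3)
val df3a = df3.select($"x" + 1)
val df3b = df3.select($"x" - 1)
val df4 = df2a.union(df2b).union(df3a).union(df3b)
df4.collect()   // without the caches this plan scans df1 four times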
3
I am monitoring a Spark executor JVM for an OutOfMemoryException. I used JConsole to connect to the executor JVM. The following is a snapshot from JConsole:
In the image, used memory is shown as 3.8G and co...
Sauter asked 4/1, 2017 at 16:25