spark-streaming Questions
3
Solved
I am building a Spark Structured Streaming application where I am doing a batch-stream join, and the source for the batch data gets updated periodically.
So I am planning to do a persist/unpersist...
Rizzo asked 11/2, 2021 at 12:32
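A minimal sketch of that persist/unpersist refresh, assuming a Parquet batch source refreshed inside foreachBatch; the paths, the rate-source stand-in for the stream, the join key, and the refresh cadence are all illustrative, not from the question:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().getOrCreate()

// Placeholder streaming side: a rate source renamed to expose a "key" column.
val streamDf = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "key")

// Hypothetical batch side, persisted so each micro-batch reuses the cache.
var staticDf: DataFrame = spark.read.parquet("/data/batch").persist()

streamDf.writeStream.foreachBatch { (batch: DataFrame, batchId: Long) =>
  if (batchId % 100 == 0) {                 // arbitrary refresh cadence
    staticDf.unpersist()                    // drop the stale cached copy
    staticDf = spark.read.parquet("/data/batch").persist()
  }
  batch.join(staticDf, "key").write.mode("append").parquet("/data/out")
}.start()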
3
Solved
I want to write one large DataFrame with repartitioning, so I want to calculate the number of partitions for my source DataFrame:
numberofpartition = {size of dataframe / default_blocksize}
How to c...
Crept asked 21/4, 2020 at 7:45
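On Spark 2.4+, one way to fill in that formula is the optimizer's size estimate; `df` is the question's DataFrame and the output path is illustrative:

// sizeInBytes is only an estimate, so treat the result as a rough target.
val sizeInBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes
val defaultBlockSize = 128L * 1024 * 1024   // the common HDFS default
val numPartitions = math.max(1, (sizeInBytes / defaultBlockSize).toInt)

df.repartition(numPartitions).write.parquet("/tmp/out")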
5
I have a Spark Streaming job which has been running continuously. How do I stop the job gracefully? I have read the usual recommendations about attaching a shutdown hook in the job monitoring and send...
Intyre asked 15/9, 2015 at 9:41
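A sketch of the marker-file variant of that recommendation; the marker path and poll interval are arbitrary choices, not a fixed API:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf(), Seconds(10))
// ... DStream setup elided ...
ssc.start()

var stopped = false
while (!stopped) {
  stopped = ssc.awaitTerminationOrTimeout(10000L)   // poll every 10 s
  val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
  if (!stopped && fs.exists(new Path("/tmp/shutdown-marker"))) {  // hypothetical marker
    ssc.stop(stopSparkContext = true, stopGracefully = true)      // drain in-flight batches
    stopped = true
  }
}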
4
I am trying to write data to an S3 bucket from my local computer:
spark = SparkSession.builder \
    .appName('application') \
    .config("spark.hadoop.fs.s3a.access.key", configuration.AWS_AC...
Stephanestephani asked 20/3, 2022 at 11:10
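For reference, the usual s3a settings in Scala (the excerpt is PySpark, but the config keys are identical); the credential lookup is a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("application")
  .master("local[*]")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))     // placeholder
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY")) // placeholder
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .getOrCreate()

Writing from a local machine also needs the hadoop-aws jar (plus its matching AWS SDK) on the classpath, built against the same Hadoop version as Spark.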
2
I need to know the file name of the input file that is streamed from the input dir.
Below is the Spark file-streaming code, in Scala:
object FileStreamExample {
def main(args: Array[St...
Chairborne asked 13/10, 2019 at 9:42
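The standard tool for this is the input_file_name() function, which tags each row with the path of the file it came from; the input directory below is a placeholder:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().getOrCreate()

val streamed = spark.readStream
  .format("text")                              // text source has a fixed schema
  .load("/data/input")                         // hypothetical input dir
  .withColumn("fileName", input_file_name())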
2
Solved
I keep getting the following exception very frequently and I wonder why this is happening. After researching I found I could do .set("spark.submit.deployMode", "nio"); but that did not work eit...
Hankow asked 6/9, 2016 at 10:59
1
I have a Spark Streaming application that reads data from multiple Kafka topics. Each topic has a different type of data, and thus requires a different processing pipeline.
My initial solution was...
Laclos asked 2/4, 2017 at 11:8
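One common shape for this (a sketch, with placeholder brokers and topic names): subscribe once and branch on the topic column the Kafka source provides, giving each branch its own sink. Note that each started query still consumes the subscription independently, so one stream per topic is an equally valid layout.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")   // placeholder
  .option("subscribe", "topicA,topicB")             // placeholder topics
  .load()

// The Kafka source exposes a `topic` column to branch on.
val pipelineA = raw.filter($"topic" === "topicA")   // topicA-specific processing
val pipelineB = raw.filter($"topic" === "topicB")   // topicB-specific processing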
3
Solved
I am using Spark 1.5.2. I need to run a Spark Streaming job with Kafka as the streaming source. I need to read from multiple topics within Kafka and process each topic differently.
Is it a good idea...
Forth asked 23/12, 2015 at 7:24
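With the DStream API of that era (spark-streaming-kafka for Kafka 0.8), the simple direct stream returns plain (key, value) pairs without topic metadata, so one direct stream per topic is often the cleanest way to process each differently; brokers and topics below are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf(), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "host:9092")   // placeholder

val topicA = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topicA"))
val topicB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topicB"))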
4
I'm facing an issue related to Kafka.
My current service (the producer) sends messages to a Kafka topic (events). The service uses kafka_2.12 v1.0.0 and is written in Java.
I'm tryin...
Dishabille asked 23/7, 2018 at 20:44
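The excerpt cuts off mid-sentence, but if the goal is to read that events topic from Spark, a minimal Structured Streaming consumer would look like this (the broker address is a placeholder, and the spark-sql-kafka connector version must match the Spark build):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")   // placeholder
  .option("subscribe", "events")                    // the producer's topic
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")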
5
When I run the following test, it throws "Cannot call methods on a stopped SparkContext". The possible problem is that I use TestSuiteBase and a streaming Spark context. At the line val gridEvalsRDD ...
Corium asked 27/4, 2016 at 8:52
2
Solved
I need to modify a class by adding two new parameters. This class is serialized with Kryo.
I'm currently persisting the information related to this class, among other things, as an RDD, every time ...
Silicosis asked 23/8, 2016 at 15:22
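One option going forward, sketched below with a hypothetical stand-in class: register the class with Kryo's CompatibleFieldSerializer, which tolerates added or removed fields at some performance cost. It only helps for data written with this serializer; objects already persisted with the default FieldSerializer will not deserialize into the changed class.

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.CompatibleFieldSerializer
import org.apache.spark.serializer.KryoRegistrator

case class MyClass(a: Int, b: String)   // stand-in for the persisted class

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    kryo.register(classOf[MyClass],
      new CompatibleFieldSerializer(kryo, classOf[MyClass]))
}

// enabled with:
//   spark.serializer=org.apache.spark.serializer.KryoSerializer
//   spark.kryo.registrator=<your package>.MyRegistrator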
1
Solved
I am trying to read a kafka stream and save it to Hive as a table.
The consumer code is:
import org.apache.spark.sql.{DataFrame, Dataset, SaveMode, SparkSession}
import org.apache.spark.sql.functi...
Competitor asked 29/3, 2023 at 17:33
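A sketch of the usual Kafka-to-Hive shape on Spark 2.4+: land each micro-batch with foreachBatch. Broker, topic, table, and checkpoint path are placeholders.

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")   // placeholder
  .option("subscribe", "mytopic")                   // placeholder
  .load()

kafkaDf.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch.write.mode(SaveMode.Append).saveAsTable("mydb.mytable")  // placeholder
  }
  .option("checkpointLocation", "/tmp/chk")         // placeholder
  .start()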
4
I have a simple Spark application running in cluster mode.
val funcGSSNFilterHeader = (x: String) => {
  println(!x.contains("servedMSISDN"))
  !x.contains("servedMSISDN")
}
val ssc = new Stream...
Recapitulation asked 5/9, 2016 at 4:47
4
Solved
I have a scenario in my project where I am reading Kafka topic messages using spark-sql 2.4.1. I am able to process the data using Structured Streaming. Once the data is received and a...
Thule asked 10/6, 2019 at 10:23
7
I'm processing events using DataFrames converted from a stream of JSON events, which eventually get written out in Parquet format.
However, some of the JSON events contain spaces in the keys, which...
Salutation asked 4/7, 2016 at 19:26
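Parquet rejects spaces (along with ,;{}()\n\t=) in field names, so the usual fix is renaming before the write; this sketch handles only top-level columns (`df` is the question's DataFrame, the path illustrative), and nested JSON keys would need a schema rewrite instead:

val cleaned = df.columns.foldLeft(df) { (d, c) =>
  d.withColumnRenamed(c, c.replaceAll("\\s+", "_"))   // spaces -> underscores
}
cleaned.write.parquet("/tmp/out")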
1
I am looking to see if there is something like the AWS Glue "bookmark" in Spark. I know there is checkpointing in Spark, which works well on an individual data source. In Glue we could use bookmark ...
Kaolin asked 14/9, 2021 at 6:59
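The closest built-in analogue is a Structured Streaming checkpoint, which records which files/offsets have already been processed; with Trigger.Once the job behaves like a bookmarked batch run. Paths and schema below are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()
val schema = new StructType().add("id", LongType).add("payload", StringType)  // hypothetical

spark.readStream
  .schema(schema)
  .json("s3a://bucket/in")                            // placeholder
  .writeStream
  .format("parquet")
  .option("path", "s3a://bucket/out")                 // placeholder
  .option("checkpointLocation", "s3a://bucket/chk")   // the "bookmark" lives here
  .trigger(Trigger.Once())                            // run-once, batch-like behaviour
  .start()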
2
I am trying to submit spark-submit but its failing with as weird message.
Error: Could not find or load main class org.apache.spark.launcher.Main
/opt/spark/bin/spark-class: line 96: CMD: bad arr...
Hannus asked 3/8, 2020 at 17:7
4
I am using Spark 1.3.0 with the Python API. While transforming huge DataFrames, I cache many DFs for faster execution:
df1.cache()
df2.cache()
Once the use of a certain DataFrame is over and it is no longer ...
Eddo asked 26/8, 2015 at 5:40
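The release call is simply unpersist(); shown in Scala here, but the PySpark method has the same name:

df1.unpersist()                  // lazily releases df1's cached blocks
df2.unpersist(blocking = true)   // optionally wait until the blocks are gone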
3
Trying to create a test for the Spark Structured Streaming writeStream function, as shown below:
SparkSession spark = SparkSession.builder().master("local").appName("spark session").getOrCreate()
val lakeD...
Auriscope asked 18/7, 2018 at 17:39
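A sketch of one way to test writeStream: feed the query from a MemoryStream and collect results in an in-memory sink (the query name and test data are arbitrary):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().master("local").appName("spark session").getOrCreate()
import spark.implicits._
implicit val sqlCtx = spark.sqlContext

val source = MemoryStream[String]
source.addData("a", "b")

val query = source.toDF().writeStream
  .format("memory")
  .queryName("result")            // results land in an in-memory table
  .start()
query.processAllAvailable()
assert(spark.table("result").count() == 2)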
1
I have a simple Spark job that streams data to a Delta table.
The table is pretty small and is not partitioned.
A lot of small parquet files are created.
As recommended in the documentation (https:...
Coastguardsman asked 12/8, 2021 at 13:22
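The compaction recipe from the Delta documentation rewrites the table into fewer files without changing its contents (`spark` being an existing session); the path and target file count below are placeholders:

val path = "/delta/mytable"                // placeholder
spark.read.format("delta").load(path)
  .repartition(4)                          // target file count, workload-dependent
  .write
  .option("dataChange", "false")           // downstream streams ignore the rewrite
  .format("delta")
  .mode("overwrite")
  .save(path)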
1
Solved
I'm doing window-based sorting in Spark Structured Streaming:
val filterWindow: WindowSpec = Window
  .partitionBy("key")
  .orderBy($"time")
controlDataFrame = controlDat...
Fronniah asked 22/11, 2021 at 7:39
2
Solved
I know there are already many threads on 'spark streaming connection refused' issues, but most of them are on Linux or at least point to HDFS. I am running this on my local laptop with Windows...
Edina asked 26/7, 2015 at 1:40
1
In my Spark job, I tried to overwrite a table in each micro-batch of Structured Streaming:
batchDF.write.mode(SaveMode.Overwrite).saveAsTable("mytable")
It generated the following error.
...
Selfexpression asked 19/9, 2020 at 9:33
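Without the full error this is a guess, but the usual route is doing the overwrite inside foreachBatch and pinning the table format explicitly, since saveAsTable's default source can clash with an existing table; the rate source stands in for the real stream and all names are placeholders:

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val streamDf = spark.readStream.format("rate").load()   // stand-in stream

streamDf.writeStream
  .foreachBatch { (batchDF: DataFrame, _: Long) =>
    batchDF.write
      .mode(SaveMode.Overwrite)
      .format("parquet")                                // assumed table format
      .saveAsTable("mytable")
  }
  .option("checkpointLocation", "/tmp/chk")             // placeholder
  .start()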
1
Given the following series of events:
df1 = read
df2 = df1.action
df3 = df1.action
df2a = df2.action
df2b = df2.action
df3a = df3.action
df3b = df3.action
df4 = union(df2a, df2b, df3a, df3b)
df4.col...
Sabu asked 20/9, 2021 at 12:17
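For a lineage like that, every branch of the final union re-reads df1 unless the shared upstream frames are persisted; a sketch with hypothetical transformations standing in for the elided actions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df1 = spark.read.parquet("/data/in").cache()   // cache the shared root
val df2 = df1.filter($"x" > 0).cache()             // hypothetical "action"
val df3 = df1.filter($"x" <= 0).cache()
val df2a = df2.select($"x" * 2)
val df2b = df2.select($"x" * 3)
val df3a = df3.select($"x" + 1)
val df3b = df3.select($"x" - 1)
val df4 = df2a.union(df2b).union(df3a).union(df3b)
df4.collect()   // without the caches this plan scans df1 four times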
3
I am monitoring a Spark executor JVM for an OutOfMemoryException. I used JConsole to connect to the executor JVM. The following is a snapshot from JConsole:
In the image, used memory is shown as 3.8G and co...
Sauter asked 4/1, 2017 at 16:25