apache-spark-sql Questions

2

Solved

PS. There's a similar question here, but that one is for Maven and my project uses sbt. First up, some required information: Spark installed version: 2.4.0, Scala installed version: 2.11.12. I'm try...
Pelagic asked 18/4, 2019 at 20:14
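
The excerpt is cut off, but the versions it names (Spark 2.4.0, Scala 2.11.12) point at a dependency/compatibility setup in sbt. A minimal build.sbt sketch for that combination, assuming the project only needs the spark-sql module:

scalaVersion := "2.11.12"
// Spark 2.4.0 artifacts exist for Scala 2.11 and 2.12; %% picks the 2.11 build here
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"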

2

I'm looking for a client JDBC driver that supports Spark SQL. I have been using Jupyter so far to run SQL statements on Spark (running on HDInsight) and I'd like to be able to connect using JDBC s...
Essie asked 9/6, 2016 at 18:27
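
A common route here (not the only one) is to expose Spark SQL through the Spark Thrift Server, which speaks the HiveServer2 protocol, and then connect with the Hive JDBC driver. A hedged Scala sketch; the host, port and credentials are placeholders, and the Hive JDBC driver must be on the client classpath:

import java.sql.DriverManager

// Placeholder connection details; the Thrift Server listens on port 10000 by default
val conn = DriverManager.getConnection("jdbc:hive2://your-spark-host:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("SELECT 1 AS ok")
while (rs.next()) println(rs.getInt("ok"))
conn.close()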

3

Solved

I have a DataFrame with different columns where one of the columns is an array of structs: +----------+---------+--------------------------------------+ |id |title | values| +----------+---------+-...
Bigamist asked 17/6, 2018 at 22:43
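
The excerpt is truncated, but a common way to work with an array-of-structs column like values is to explode it and then address the struct fields. A sketch assuming a spark-shell session; the sample data uses a tuple for the struct, so its fields come out as _1/_2 rather than the question's real field names:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "t1", Seq(("a", 10), ("b", 20)))
).toDF("id", "title", "values")

// explode turns each array element into its own row; "value.*" then flattens the struct fields
df.withColumn("value", explode($"values"))
  .select("id", "title", "value.*")
  .show(false)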

3

Solved

I am building a Spark Structured Streaming application where I am doing a batch-stream join, and the source for the batch data gets updated periodically. So I am planning to do a persist/unpersist...
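
The question plans to refresh the static (batch) side by re-reading it and swapping persist/unpersist. A rough sketch of that idea, assuming a spark-shell session, a hypothetical parquet path, and that the swap is triggered by your own refresh logic:

// Hypothetical path; this only illustrates the persist/unpersist swap the question describes
var staticDf = spark.read.parquet("/data/dim/customers").persist()

def refreshStatic(): Unit = {
  val refreshed = spark.read.parquet("/data/dim/customers").persist()
  refreshed.count()            // materialize the new cache before dropping the old one
  staticDf.unpersist()
  staticDf = refreshed
}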

2

Currently Spark has two implementations of Row: import org.apache.spark.sql.Row and import org.apache.spark.sql.catalyst.InternalRow. What is the need for both of them? Do they represent the sa...
Disrespect asked 2/2, 2017 at 22:8
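
In broad strokes, org.apache.spark.sql.Row is the external, user-facing row holding ordinary JVM objects, while catalyst.InternalRow is the engine's internal representation built on Spark's own types such as UTF8String. A small illustration; the values are arbitrary:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// External API: plain JVM values
val external = Row(1, "abc")
println(external.getString(1))      // a java.lang.String

// Internal (Catalyst) representation: engine-friendly types like UTF8String
val internal = InternalRow(1, UTF8String.fromString("abc"))
println(internal.getUTF8String(1))  // an org.apache.spark.unsafe.types.UTF8String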

1

How can an InfluxDB database (which has streaming data coming in) be used as a source for Spark Streaming? Also, is it possible to use InfluxDB instead of Spark SQL for performing computations on dat...
Animalist asked 31/5, 2018 at 10:3

2

Solved

I am trying to create a dataframe from the following list: data = [(1,'abc','2020-08-20 10:00:00', 'I'), (1,'abc','2020-08-20 10:01:00', 'U'), (1,'abc','2020-08-21 10:02:00', 'U'), (2,'pqr','2020-0...
Wilterdink asked 19/7, 2021 at 5:56
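
The excerpt looks like PySpark, but the same construction in Scala is a Seq of tuples turned into a DataFrame, with the timestamp string cast afterwards. The column names here (id, name, ts, op) are guesses, since the original list has no header:

import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(
  (1, "abc", "2020-08-20 10:00:00", "I"),
  (1, "abc", "2020-08-20 10:01:00", "U"),
  (1, "abc", "2020-08-21 10:02:00", "U")
)

val df = data.toDF("id", "name", "ts", "op")
  .withColumn("ts", to_timestamp($"ts"))   // parse the string into a real timestamp
df.printSchema()
df.show(false)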

3

Solved

Spark SQL documentation specifies that join() supports the following join types: Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, and left...
Insurmountable asked 2/10, 2017 at 9:37
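
For reference, the join type is just the third argument to join. A small sketch contrasting a few of the listed types, assuming a spark-shell session and invented sample tables:

import spark.implicits._

val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")
val right = Seq((1, "x")).toDF("id", "w")

left.join(right, Seq("id"), "inner").show()       // only matching rows, columns from both sides
left.join(right, Seq("id"), "left_outer").show()  // all left rows, nulls where no match
left.join(right, Seq("id"), "left_semi").show()   // matching left rows, left columns only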

2

Solved

I would like to perform an action on a single column. Unfortunately, after I transform that column, it is no longer part of the dataframe it came from but a Column object. As such, it cannot ...
Quidnunc asked 19/2, 2016 at 0:32
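
The usual way around this is not to hold the transformed Column on its own but to attach it back with withColumn (or select). A small sketch with an invented numeric column:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, 2.0), (2, 3.5)).toDF("id", "amount")
// the transformed column only becomes part of a DataFrame once withColumn/select binds it
val out = df.withColumn("amount_rounded", round($"amount"))
out.show()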

4

I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried. #initialize pyspark import findsp...
Muller asked 22/2, 2020 at 16:43
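
Spark's JDBC data source is built around reading query results, so a common workaround is to open a plain JDBC connection on the driver and call the procedure with a CallableStatement. A hedged Scala sketch; the SQL Server URL, credentials and procedure name are placeholders, and the sqlserver JDBC driver must be on the cluster classpath:

import java.sql.DriverManager

val url = "jdbc:sqlserver://your-server.database.windows.net:1433;databaseName=your-db"  // placeholder
val conn = DriverManager.getConnection(url, "user", "password")
try {
  // {call ...} is the standard JDBC escape syntax for stored procedures
  val stmt = conn.prepareCall("{call dbo.my_procedure(?)}")   // hypothetical procedure
  stmt.setInt(1, 42)
  stmt.execute()
} finally {
  conn.close()
}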

7

Solved

I feel like I must be missing something obvious here, but I can't seem to dynamically set a variable value in Spark SQL. Let's say I have two tables, tableSrc and tableBuilder, and I'm creating ta...
Regeniaregensburg asked 11/12, 2019 at 0:25
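
When the query is issued from Scala (or Python) rather than from pure SQL, one simple workaround is to pull the scalar out first and splice it into the next statement with string interpolation. A sketch reusing the question's tableSrc/tableBuilder names; the column and the destination table name are made up:

// Hypothetical: grab a single value, then reuse it in the next query
val maxTs = spark.sql("SELECT max(updated_at) FROM tableSrc").first().get(0)
spark.sql(s"""
  CREATE TABLE tableDest AS
  SELECT * FROM tableBuilder WHERE updated_at <= '$maxTs'
""")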

1

I would like to fully understand the meaning of the information about min/med/max. For example: scan time total(min, med, max) 34m(3.1s, 10.8s, 15.1s) means that, of all cores, the min scan time is ...
Gasometer asked 23/11, 2019 at 19:52

6

Solved

I have a data frame (df). To show its schema I use: from pyspark.sql.functions import * df1.printSchema() And I get the following result: #root # |-- name: string (nullable = true) # |-- ag...
Gilda asked 7/2, 2018 at 21:6

3

Solved

I would like to write an encoder for a Row type in a Dataset, for a map operation that I am doing. Essentially, I do not understand how to write encoders. Below is an example of a map operation: In...
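
For mapping to untyped Row values, Spark ships an encoder factory built from the output StructType: RowEncoder. A minimal sketch in the Spark 2.x style, assuming a spark-shell session and an invented output schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "v")

val outSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("v_upper", StringType)
))

// map to Row needs an explicit Encoder[Row]; RowEncoder derives one from the schema
val mapped = df.map(r => Row(r.getInt(0), r.getString(1).toUpperCase))(RowEncoder(outSchema))
mapped.show()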

3

Solved

I am trying to save a DataFrame to S3 in pyspark in Spark1.4 using DataFrameWriter df = sqlContext.read.format("json").load("s3a://somefile") df_writer = pyspark.sql.DataFrameWriter(df) df_writer...
Breezeway asked 16/6, 2015 at 18:4
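
DataFrameWriter isn't normally constructed by hand; it is reached through df.write. A short sketch with a placeholder bucket and path:

// df is the DataFrame loaded from s3a://somefile in the question
df.write
  .format("json")
  .mode("overwrite")                      // or "append", etc.
  .save("s3a://your-bucket/output/path")  // placeholder destination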

3

Solved

I want to write out one large dataframe with repartitioning, so I want to calculate the number of partitions for my source dataframe: numberofpartition = {size of dataframe / default_blocksize}. How to c...
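
One hedged way to get the inputs for that formula (on Spark 2.3+): take Catalyst's size estimate for the DataFrame, which is an estimate rather than an exact on-disk size, and divide by the block size you care about, here assumed to be the common 128 MB default. df is the source DataFrame from the question:

val sizeInBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes   // BigInt estimate
val blockSize   = 128L * 1024 * 1024                                  // assumed 128 MB block size
val numberOfPartitions = ((sizeInBytes + blockSize - 1) / blockSize).toInt.max(1)

df.repartition(numberOfPartitions)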

7

Solved

I would like to include null values in an Apache Spark join. Spark doesn't include rows with null by default. Here is the default Spark behavior. val numbersDf = Seq( ("123"), ("456"), (null),...
Locally asked 18/1, 2017 at 20:21
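
The operator usually pointed to here is <=> (eqNullSafe), which treats two nulls as equal, so null keys join to each other. A sketch mirroring the excerpt's data, with made-up column names and a spark-shell session assumed:

import spark.implicits._

val numbersDf = Seq("123", "456", null).toDF("number")
val lettersDf = Seq(("123", "abc"), (null, "zzz")).toDF("number", "letter")

// a plain === join drops the null keys; <=> keeps them
numbersDf.join(lettersDf, numbersDf("number") <=> lettersDf("number")).show()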

2

Solved

I would like to compute the maximum of a subset of columns for each row and add it as a new column to the existing DataFrame. I managed to do this in a very awkward way: def add_colmax(df,subset_c...
Spellbinder asked 29/11, 2016 at 19:54
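
A more direct route than building this by hand is the built-in greatest function, which takes a row-wise maximum across the given columns (least is the counterpart). A Scala sketch with invented columns; the question itself is PySpark, where pyspark.sql.functions.greatest works the same way:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, 5, 3), (2, 1, 9)).toDF("a", "b", "c")
val subsetColumns = Seq("b", "c")                       // the subset to take the max over

val out = df.withColumn("max_of_subset", greatest(subsetColumns.map(col): _*))
out.show()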

4

Solved

Given the following code: import java.sql.Date import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions._ object SortQuestion extends App{ val spark = SparkSession.builder().ap...
Dace asked 5/4, 2018 at 11:34

2

Solved

I want to count how many records are true in a column of a grouped Spark dataframe, but I don't know how to do that in Python. For example, I have data with region, salary and IsUnemployed ...
Adai asked 18/2, 2016 at 22:28
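
One way to count the true values per group is to count only the rows where the flag holds, for example with count over a when, or equivalently by summing the boolean cast to an int. A Scala sketch with invented sample rows; the question is in PySpark, where the same functions exist:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("north", 50000, true),
  ("north", 40000, false),
  ("south", 30000, true)
).toDF("region", "salary", "IsUnemployed")

df.groupBy("region")
  .agg(count(when($"IsUnemployed", true)).as("unemployed"))   // nulls produced by `when` are not counted
  .show()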

2

I'm using Spark 2.0.1. df.show() +--------+------+---+-----+-----+----+ |Survived|Pclass|Sex|SibSp|Parch|Fare| +--------+------+---+-----+-----+----+ | 0.0| 3.0|1.0| 1.0| 0.0| 7.3| | 1.0| 1.0|0....
Kennethkennett asked 14/12, 2018 at 22:24

5

Solved

I have two DataFrames in Spark SQL (D1 and D2). I am trying to inner join them, D1.join(D2, "some column"), and get back only the data of D1, not the complete data set. Both D1 and D2 are ha...
Striking asked 2/8, 2016 at 13:2
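
If only D1's rows and columns are wanted, a left_semi join expresses that directly; alternatively one can do the inner join and re-select D1's columns. A sketch with invented tables joined on an id column, since the excerpt does not show the real join column:

import spark.implicits._

val D1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v1")
val D2 = Seq((1, "x"), (3, "y")).toDF("id", "v2")

// left_semi keeps only D1's columns, for D1 rows that have a match in D2
D1.join(D2, Seq("id"), "left_semi").show()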

3

Solved

I have a scenario where I need to compare two different tables, source and destination, from two separate remote Hive servers. Can we use two SparkSessions, something like I tried below: val spark = Spa...
Altimetry asked 6/7, 2017 at 12:43

4

I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]). I am using Spark 1.4.1, which is set up on a very powerful machine (2 CPU,...
Prescribe asked 24/8, 2015 at 17:36
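
With a single JDBC connection the whole table comes through one partition, which is usually the bottleneck for a read of this size. Spark's JDBC reader can split the read if given a numeric partition column and bounds. A hedged sketch against the Spark 1.4-era sqlContext entry point; the URL, table, column and bounds are placeholders, and the Teradata JDBC driver must be on the classpath:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")

// Spark issues numPartitions parallel range queries over the partition column
val df = sqlContext.read.jdbc(
  "jdbc:teradata://your-host/DATABASE=your_db",  // placeholder URL
  "big_table",                                   // placeholder table
  "id",                                          // numeric partition column (assumed)
  1L,                                            // lowerBound
  100000000L,                                    // upperBound
  32,                                            // numPartitions
  props)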

3

Solved

Having this schema: root |-- Elems: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- Elem: integer (nullable = true) | | |-- Desc: string (nullable = true) How can w...
Holdall asked 14/1, 2019 at 19:57
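
Given that schema, the struct fields can be reached straight through the array (Elems.Elem yields an array of the integers), or the array can be exploded first. A sketch assuming a spark-shell session; the Item case class is hypothetical and just reproduces the element struct:

import org.apache.spark.sql.functions._
import spark.implicits._

case class Item(Elem: Int, Desc: String)             // matches the element struct in the schema
val df = Seq(Tuple1(Seq(Item(1, "a"), Item(2, "b")))).toDF("Elems")

df.select($"Elems.Elem", $"Elems.Desc").show(false)   // arrays of the field values

df.select(explode($"Elems").as("e"))
  .select($"e.Elem", $"e.Desc")                       // one row per array element
  .show()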
