apache-spark-sql Questions

2

Solved

PS. There's a similar question here, but that one is for Maven and my project uses sbt. First up, some required information: Spark installed version: 2.4.0, Scala installed version: 2.11.12. I'm try...
Pelagic asked 18/4, 2019 at 20:14
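
The excerpt is cut off, but the versions it names (Spark 2.4.0, Scala 2.11.12) point at a dependency/compatibility setup in sbt. A minimal build.sbt sketch for that combination, assuming the project only needs the spark-sql module:

scalaVersion := "2.11.12"
// Spark 2.4.0 artifacts exist for Scala 2.11 and 2.12; %% picks the 2.11 build here
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"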

2

I'm looking for a client JDBC driver that supports Spark SQL. I have been using Jupyter so far to run SQL statements on Spark (running on HDInsight) and I'd like to be able to connect using JDBC s...
Essie asked 9/6, 2016 at 18:27
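
A common route here (not the only one) is to expose Spark SQL through the Spark Thrift Server, which speaks the HiveServer2 protocol, and then connect with the Hive JDBC driver. A hedged Scala sketch; the host, port and credentials are placeholders, and the Hive JDBC driver must be on the client classpath:

import java.sql.DriverManager

// Placeholder connection details; the Thrift Server listens on port 10000 by default
val conn = DriverManager.getConnection("jdbc:hive2://your-spark-host:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("SELECT 1 AS ok")
while (rs.next()) println(rs.getInt("ok"))
conn.close()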

3

Solved

I have a DataFrame with different columns where one of the columns is an array of structs: +----------+---------+--------------------------------------+ |id |title | values| +----------+---------+-...
Bigamist asked 17/6, 2018 at 22:43
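
The excerpt is truncated, but a common way to work with an array-of-structs column like values is to explode it and then address the struct fields. A sketch assuming a spark-shell session; the sample data uses a tuple for the struct, so its fields come out as _1/_2 rather than the question's real field names:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "t1", Seq(("a", 10), ("b", 20)))
).toDF("id", "title", "values")

// explode turns each array element into its own row; "value.*" then flattens the struct fields
df.withColumn("value", explode($"values"))
  .select("id", "title", "value.*")
  .show(false)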

3

Solved

I am building a Spark Structured Streaming application where I am doing a batch-stream join, and the source for the batch data gets updated periodically. So I am planning to do a persist/unpersist...
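
The question plans to refresh the static (batch) side by re-reading it and swapping persist/unpersist. A rough sketch of that idea, assuming a spark-shell session, a hypothetical parquet path, and that the swap is triggered by your own refresh logic:

// Hypothetical path; this only illustrates the persist/unpersist swap the question describes
var staticDf = spark.read.parquet("/data/dim/customers").persist()

def refreshStatic(): Unit = {
  val refreshed = spark.read.parquet("/data/dim/customers").persist()
  refreshed.count()            // materialize the new cache before dropping the old one
  staticDf.unpersist()
  staticDf = refreshed
}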

2

Currently Spark has two implementations of Row: import org.apache.spark.sql.Row and import org.apache.spark.sql.catalyst.InternalRow. What is the need for both of them? Do they represent the sa...
Disrespect asked 2/2, 2017 at 22:8
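
In broad strokes, org.apache.spark.sql.Row is the external, user-facing row holding ordinary JVM objects, while catalyst.InternalRow is the engine's internal representation built on Spark's own types such as UTF8String. A small illustration; the values are arbitrary:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// External API: plain JVM values
val external = Row(1, "abc")
println(external.getString(1))      // a java.lang.String

// Internal (Catalyst) representation: engine-friendly types like UTF8String
val internal = InternalRow(1, UTF8String.fromString("abc"))
println(internal.getUTF8String(1))  // an org.apache.spark.unsafe.types.UTF8String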

1

How can an InfluxDB database (which has streaming data coming in) be used as a source for Spark Streaming? Also, is it possible to use InfluxDB instead of Spark SQL for performing computations on dat...
Animalist asked 31/5, 2018 at 10:3

2

Solved

I am trying to create a dataframe from the following list: data = [(1,'abc','2020-08-20 10:00:00', 'I'), (1,'abc','2020-08-20 10:01:00', 'U'), (1,'abc','2020-08-21 10:02:00', 'U'), (2,'pqr','2020-0...
Wilterdink asked 19/7, 2021 at 5:56
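
The excerpt looks like PySpark, but the same construction in Scala is a Seq of tuples turned into a DataFrame, with the timestamp string cast afterwards. The column names here (id, name, ts, op) are guesses, since the original list has no header:

import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(
  (1, "abc", "2020-08-20 10:00:00", "I"),
  (1, "abc", "2020-08-20 10:01:00", "U"),
  (1, "abc", "2020-08-21 10:02:00", "U")
)

val df = data.toDF("id", "name", "ts", "op")
  .withColumn("ts", to_timestamp($"ts"))   // parse the string into a real timestamp
df.printSchema()
df.show(false)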

3

Solved

Spark SQL documentation specifies that join() supports the following join types: Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, and left...
Insurmountable asked 2/10, 2017 at 9:37
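
For reference, the join type is just the third argument to join. A small sketch contrasting a few of the listed types, assuming a spark-shell session and invented sample tables:

import spark.implicits._

val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")
val right = Seq((1, "x")).toDF("id", "w")

left.join(right, Seq("id"), "inner").show()       // only matching rows, columns from both sides
left.join(right, Seq("id"), "left_outer").show()  // all left rows, nulls where no match
left.join(right, Seq("id"), "left_semi").show()   // matching left rows, left columns only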

2

Solved

I would like to perform an action on a single column. Unfortunately, after I transform that column, it is no longer part of the dataframe it came from but a Column object. As such, it cannot ...
Quidnunc asked 19/2, 2016 at 0:32
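
The usual way around this is not to hold the transformed Column on its own but to attach it back with withColumn (or select). A small sketch with an invented numeric column:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, 2.0), (2, 3.5)).toDF("id", "amount")
// the transformed column only becomes part of a DataFrame once withColumn/select binds it
val out = df.withColumn("amount_rounded", round($"amount"))
out.show()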

4

I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried. #initialize pyspark import findsp...
Muller asked 22/2, 2020 at 16:43
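
Spark's JDBC data source is built around reading query results, so a common workaround is to open a plain JDBC connection on the driver and call the procedure with a CallableStatement. A hedged Scala sketch; the SQL Server URL, credentials and procedure name are placeholders, and the sqlserver JDBC driver must be on the cluster classpath:

import java.sql.DriverManager

val url = "jdbc:sqlserver://your-server.database.windows.net:1433;databaseName=your-db"  // placeholder
val conn = DriverManager.getConnection(url, "user", "password")
try {
  // {call ...} is the standard JDBC escape syntax for stored procedures
  val stmt = conn.prepareCall("{call dbo.my_procedure(?)}")   // hypothetical procedure
  stmt.setInt(1, 42)
  stmt.execute()
} finally {
  conn.close()
}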

7

Solved

I feel like I must be missing something obvious here, but I can't seem to dynamically set a variable value in Spark SQL. Let's say I have two tables, tableSrc and tableBuilder, and I'm creating ta...
Regeniaregensburg asked 11/12, 2019 at 0:25
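
When the query is issued from Scala (or Python) rather than from pure SQL, one simple workaround is to pull the scalar out first and splice it into the next statement with string interpolation. A sketch reusing the question's tableSrc/tableBuilder names; the column and the destination table name are made up:

// Hypothetical: grab a single value, then reuse it in the next query
val maxTs = spark.sql("SELECT max(updated_at) FROM tableSrc").first().get(0)
spark.sql(s"""
  CREATE TABLE tableDest AS
  SELECT * FROM tableBuilder WHERE updated_at <= '$maxTs'
""")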

1

I would like to fully understand the meaning of the information about min/med/max. For example: scan time total(min, med, max) 34m(3.1s, 10.8s, 15.1s) means that, of all cores, the min scan time is ...
Gasometer asked 23/11, 2019 at 19:52

6

Solved

I have a data frame (df). To show its schema I use: from pyspark.sql.functions import * df1.printSchema() And I get the following result: #root # |-- name: string (nullable = true) # |-- ag...
Gilda asked 7/2, 2018 at 21:6

3

Solved

I would like to write an encoder for a Row type in a Dataset, for a map operation that I am doing. Essentially, I do not understand how to write encoders. Below is an example of a map operation: In...
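
For mapping to untyped Row values, Spark ships an encoder factory built from the output StructType: RowEncoder. A minimal sketch in the Spark 2.x style, assuming a spark-shell session and an invented output schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "v")

val outSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("v_upper", StringType)
))

// map to Row needs an explicit Encoder[Row]; RowEncoder derives one from the schema
val mapped = df.map(r => Row(r.getInt(0), r.getString(1).toUpperCase))(RowEncoder(outSchema))
mapped.show()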

3

Solved

I am trying to save a DataFrame to S3 in pyspark in Spark1.4 using DataFrameWriter df = sqlContext.read.format("json").load("s3a://somefile") df_writer = pyspark.sql.DataFrameWriter(df) df_writer...
Breezeway asked 16/6, 2015 at 18:4
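
DataFrameWriter isn't normally constructed by hand; it is reached through df.write. A short sketch with a placeholder bucket and path:

// df is the DataFrame loaded from s3a://somefile in the question
df.write
  .format("json")
  .mode("overwrite")                      // or "append", etc.
  .save("s3a://your-bucket/output/path")  // placeholder destination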

3

Solved

I want to write out one large dataframe with repartitioning, so I want to calculate the number of partitions for my source dataframe: numberofpartition = {size of dataframe / default_blocksize}. How to c...
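
One hedged way to get the inputs for that formula (on Spark 2.3+): take Catalyst's size estimate for the DataFrame, which is an estimate rather than an exact on-disk size, and divide by the block size you care about, here assumed to be the common 128 MB default. df is the source DataFrame from the question:

val sizeInBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes   // BigInt estimate
val blockSize   = 128L * 1024 * 1024                                  // assumed 128 MB block size
val numberOfPartitions = ((sizeInBytes + blockSize - 1) / blockSize).toInt.max(1)

df.repartition(numberOfPartitions)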

7

Solved

I would like to include null values in an Apache Spark join. Spark doesn't include rows with null by default. Here is the default Spark behavior. val numbersDf = Seq( ("123"), ("456"), (null),...
Locally asked 18/1, 2017 at 20:21
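
The operator usually pointed to here is <=> (eqNullSafe), which treats two nulls as equal, so null keys join to each other. A sketch mirroring the excerpt's data, with made-up column names and a spark-shell session assumed:

import spark.implicits._

val numbersDf = Seq("123", "456", null).toDF("number")
val lettersDf = Seq(("123", "abc"), (null, "zzz")).toDF("number", "letter")

// a plain === join drops the null keys; <=> keeps them
numbersDf.join(lettersDf, numbersDf("number") <=> lettersDf("number")).show()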

2

Solved

I would like to compute the maximum of a subset of columns for each row and add it as a new column to the existing DataFrame. I managed to do this in a very awkward way: def add_colmax(df,subset_c...
Spellbinder asked 29/11, 2016 at 19:54
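
A more direct route than building this by hand is the built-in greatest function, which takes a row-wise maximum across the given columns (least is the counterpart). A Scala sketch with invented columns; the question itself is PySpark, where pyspark.sql.functions.greatest works the same way:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, 5, 3), (2, 1, 9)).toDF("a", "b", "c")
val subsetColumns = Seq("b", "c")                       // the subset to take the max over

val out = df.withColumn("max_of_subset", greatest(subsetColumns.map(col): _*))
out.show()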

4

Solved

Given the following code: import java.sql.Date import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions._ object SortQuestion extends App{ val spark = SparkSession.builder().ap...
Dace asked 5/4, 2018 at 11:34

2

Solved

I want to count how many records are true in a column of a grouped Spark dataframe, but I don't know how to do that in Python. For example, I have data with region, salary and IsUnemployed ...
Adai asked 18/2, 2016 at 22:28
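
One way to count the true values per group is to count only the rows where the flag holds, for example with count over a when, or equivalently by summing the boolean cast to an int. A Scala sketch with invented sample rows; the question is in PySpark, where the same functions exist:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("north", 50000, true),
  ("north", 40000, false),
  ("south", 30000, true)
).toDF("region", "salary", "IsUnemployed")

df.groupBy("region")
  .agg(count(when($"IsUnemployed", true)).as("unemployed"))   // nulls produced by `when` are not counted
  .show()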

2

I'm using Spark 2.0.1. df.show() +--------+------+---+-----+-----+----+ |Survived|Pclass|Sex|SibSp|Parch|Fare| +--------+------+---+-----+-----+----+ | 0.0| 3.0|1.0| 1.0| 0.0| 7.3| | 1.0| 1.0|0....
Kennethkennett asked 14/12, 2018 at 22:24

5

Solved

I have two DataFrames in Spark SQL (D1 and D2). I am trying to inner join them, D1.join(D2, "some column"), and get back only the data of D1, not the complete data set. Both D1 and D2 are ha...
Striking asked 2/8, 2016 at 13:2
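
If only D1's rows and columns are wanted, a left_semi join expresses that directly; alternatively one can do the inner join and re-select D1's columns. A sketch with invented tables joined on an id column, since the excerpt does not show the real join column:

import spark.implicits._

val D1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v1")
val D2 = Seq((1, "x"), (3, "y")).toDF("id", "v2")

// left_semi keeps only D1's columns, for D1 rows that have a match in D2
D1.join(D2, Seq("id"), "left_semi").show()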

3

Solved

I have a scenario where I need to compare two different tables, source and destination, from two separate remote Hive servers. Can we use two SparkSessions, something like I tried below: val spark = Spa...
Altimetry asked 6/7, 2017 at 12:43

4

I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]). I am using Spark 1.4.1, which is set up on a very powerful machine (2 CPU,...
Prescribe asked 24/8, 2015 at 17:36
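
With a single JDBC connection the whole table comes through one partition, which is usually the bottleneck for a read of this size. Spark's JDBC reader can split the read if given a numeric partition column and bounds. A hedged sketch against the Spark 1.4-era sqlContext entry point; the URL, table, column and bounds are placeholders, and the Teradata JDBC driver must be on the classpath:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")

// Spark issues numPartitions parallel range queries over the partition column
val df = sqlContext.read.jdbc(
  "jdbc:teradata://your-host/DATABASE=your_db",  // placeholder URL
  "big_table",                                   // placeholder table
  "id",                                          // numeric partition column (assumed)
  1L,                                            // lowerBound
  100000000L,                                    // upperBound
  32,                                            // numPartitions
  props)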

3

Solved

Having this schema: root |-- Elems: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- Elem: integer (nullable = true) | | |-- Desc: string (nullable = true) How can w...
Holdall asked 14/1, 2019 at 19:57
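
Given that schema, the struct fields can be reached straight through the array (Elems.Elem yields an array of the integers), or the array can be exploded first. A sketch assuming a spark-shell session; the Item case class is hypothetical and just reproduces the element struct:

import org.apache.spark.sql.functions._
import spark.implicits._

case class Item(Elem: Int, Desc: String)             // matches the element struct in the schema
val df = Seq(Tuple1(Seq(Item(1, "a"), Item(2, "b")))).toDF("Elems")

df.select($"Elems.Elem", $"Elems.Desc").show(false)   // arrays of the field values

df.select(explode($"Elems").as("e"))
  .select($"e.Elem", $"e.Desc")                       // one row per array element
  .show()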
