apache-spark-sql Questions
2
Solved
PS. There's a similar question here, but that is in mvn and my project is in sbt.
First up, some required information:
Spark Installed Version: 2.4.0
Scala Installed Version: 2.11.12
I'm try...
Pelagic asked 18/4, 2019 at 20:14
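The excerpt cuts off, but for Spark 2.4.0 on Scala 2.11.12 the build.sbt wiring usually looks like the sketch below; the "provided" scope is an assumption that the app is launched with spark-submit.

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // "provided": the cluster supplies these jars at runtime; drop the
  // qualifier if you run the app locally with `sbt run`
  "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.0" % "provided"
)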
2
I'm looking for a client JDBC driver that supports Spark SQL.
I have been using Jupyter so far to run SQL statements on Spark (running on HDInsight) and I'd like to be able to connect using JDBC s...
Essie asked 9/6, 2016 at 18:27
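For context, Spark ships a Thrift JDBC/ODBC server (started with sbin/start-thriftserver.sh) that speaks the HiveServer2 protocol, so the stock Hive JDBC driver works as the client. A minimal sketch; host, port, and credentials are placeholders.

import java.sql.DriverManager

// Register the Hive JDBC driver and connect to the Spark Thrift Server
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://thrift-host:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
conn.close()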
3
Solved
I have a DataFrame with different columns, where one of the columns is an array of structs:
+----------+---------+--------------------------------------+
|id |title | values|
+----------+---------+-...
Bigamist asked 17/6, 2018 at 22:43
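A common way to flatten such a column is explode; a minimal sketch, assuming the DataFrame is named df and the array column is `values` as in the truncated header.

import org.apache.spark.sql.functions.{col, explode}

// explode produces one output row per element of the `values` array;
// the struct's own fields can then be pulled out with star expansion
val flattened = df
  .withColumn("value", explode(col("values")))
  .select(col("id"), col("title"), col("value.*"))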
3
Solved
I am building a Spark Structured Streaming application where I am doing a batch-stream join. And the source for the batch data gets updated periodically.
So, I am planning to do a persist/unpersist...
Rizzo asked 11/2, 2021 at 12:32
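The usual shape of that pattern is to keep the cached static side behind a mutable reference and join inside foreachBatch, so each micro-batch sees the latest swap. A minimal sketch (Spark 3.x); streamDf, the paths, and the join key `id` are placeholders.

import org.apache.spark.sql.DataFrame

// Mutable handle to the cached batch-side data (placeholder path)
@volatile var staticDf: DataFrame = spark.read.parquet("/data/batch").cache()

// Call this when the batch source is known to have changed
def refreshStatic(): Unit = {
  val fresh = spark.read.parquet("/data/batch").cache()
  fresh.count()            // materialize the new cache before swapping
  val old = staticDf
  staticDf = fresh
  old.unpersist()
}

// Joining inside foreachBatch means each micro-batch reads the current
// staticDf reference rather than a plan captured at query start
streamDf.writeStream.foreachBatch { (batch: DataFrame, _: Long) =>
  batch.join(staticDf, "id").write.mode("append").parquet("/data/out")
}.start()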
2
Currently, Spark has two implementations of Row:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
What is the need to have both of them? Do they represent the sa...
Disrespect asked 2/2, 2017 at 22:8
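Roughly: Row is the external API backed by ordinary JVM objects, while InternalRow is Catalyst's internal binary representation. A minimal sketch showing where each surfaces:

import org.apache.spark.sql.Row

val df = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "v")

val external: Array[Row] = df.collect()   // user-facing Row objects
val internal = df.queryExecution.toRdd    // RDD[InternalRow], engine-side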
1
How can an InfluxDB database (which has streaming data coming in) be used as a source for Spark Streaming?
Also, is it possible to use InfluxDB instead of Spark SQL for performing computations on dat...
Animalist asked 31/5, 2018 at 10:3
2
Solved
I am trying to create a DataFrame from the following list:
data = [(1,'abc','2020-08-20 10:00:00', 'I'),
(1,'abc','2020-08-20 10:01:00', 'U'),
(1,'abc','2020-08-21 10:02:00', 'U'),
(2,'pqr','2020-0...
Wilterdink asked 19/7, 2021 at 5:56
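The excerpt is PySpark; the same construction in Scala looks like the sketch below. Column names are assumptions, since the excerpt cuts off before they appear.

import spark.implicits._

val data = Seq(
  (1, "abc", "2020-08-20 10:00:00", "I"),
  (1, "abc", "2020-08-20 10:01:00", "U"),
  (1, "abc", "2020-08-21 10:02:00", "U")
)
val df = data.toDF("id", "name", "event_time", "op")   // assumed column names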
3
Solved
The Spark SQL documentation specifies that join() supports the following join types:
Must be one of: inner, cross, outer, full, full_outer, left,
left_outer, right, right_outer, left_semi, and left...
Insurmountable asked 2/10, 2017 at 9:37
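For reference, the join type is passed as a plain string; a minimal sketch with two hypothetical DataFrames, d1 and d2, sharing an `id` column.

// Any of the documented type strings can go in the third argument
val joined = d1.join(d2, Seq("id"), "left_outer")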
2
Solved
I would like to perform an action on a single column.
Unfortunately, after I transform that column, it is no longer part of the DataFrame it came from but a Column object. As such, it cannot ...
Quidnunc asked 19/2, 2016 at 0:32
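A Column produced by a transformation can be reattached with withColumn; a minimal sketch, assuming a numeric column named `price` (hypothetical).

import org.apache.spark.sql.functions.col

// withColumn binds the Column expression back to the DataFrame,
// adding (or replacing) a column named "price_doubled"
val result = df.withColumn("price_doubled", col("price") * 2)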
4
I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried.
#initialize pyspark
import findsp...
Muller asked 22/2, 2020 at 16:43
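Spark's JDBC data source only issues queries, so a stored procedure is typically invoked through a plain JDBC connection on the driver. A minimal sketch (Scala rather than the excerpt's PySpark; the Azure SQL server, database, credentials, and procedure name are all placeholders).

import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb",
  "user", "password")
try {
  // prepareCall is the standard JDBC entry point for stored procedures
  val stmt = conn.prepareCall("{call dbo.my_procedure(?)}")
  stmt.setInt(1, 42)
  stmt.execute()
} finally conn.close()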
7
Solved
I feel like I must be missing something obvious here, but I can't seem to dynamically set a variable value in Spark SQL.
Let's say I have two tables, tableSrc and tableBuilder, and I'm creating ta...
Regeniaregensburg asked 11/12, 2019 at 0:25
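One common workaround is to compute the value with one query and splice it into the next statement via driver-side string interpolation; a minimal Scala sketch using the excerpt's tableSrc (the date column and destination table names are placeholders).

// Pull the "variable" value back to the driver as a plain Scala value
val maxDate = spark.sql("SELECT max(date_col) FROM tableSrc").first().get(0)

// Interpolate it into the next SQL string
spark.sql(s"CREATE TABLE tableDest AS SELECT * FROM tableSrc WHERE date_col = '$maxDate'")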
1
I would like to fully understand the meaning of the information about min/med/max.
For example:
scan time total(min, med, max)
34m(3.1s, 10.8s, 15.1s)
Does that mean that, of all cores, the min scan time is ...
Gasometer asked 23/11, 2019 at 19:52
6
Solved
I have a DataFrame (df).
To show its schema I use:
from pyspark.sql.functions import *
df1.printSchema()
And I get the following result:
#root
# |-- name: string (nullable = true)
# |-- ag...
Gilda asked 7/2, 2018 at 21:6
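The excerpt cuts off, but for what it's worth, the printed tree is also available as a StructType object that can be walked programmatically. A minimal sketch (Scala, though the excerpt is PySpark):

val fields = df.schema.fields   // Array[StructField]
fields.foreach(f => println(s"${f.name}: ${f.dataType} (nullable = ${f.nullable})"))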
3
Solved
I would like to write an encoder for a Row type in a Dataset, for a map operation that I am doing. Essentially, I do not understand how to write encoders.
Below is an example of a map operation:
In...
Pitching asked 5/4, 2017 at 18:13
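For Row-typed map operations, Spark ships an encoder factory; a minimal sketch using the Spark 2.x API, which matches the question's era. The map body is a placeholder identity transform.

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// RowEncoder builds an Encoder[Row] from a schema; here the output
// schema is assumed to match the input
implicit val encoder = RowEncoder(df.schema)
val mapped = df.map(row => Row.fromSeq(row.toSeq))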
3
Solved
I am trying to save a DataFrame to S3 in PySpark on Spark 1.4 using DataFrameWriter:
df = sqlContext.read.format("json").load("s3a://somefile")
df_writer = pyspark.sql.DataFrameWriter(df)
df_writer...
Breezeway asked 16/6, 2015 at 18:4
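In the DataFrame API the writer isn't constructed by hand; it comes from df.write. A minimal sketch with a placeholder output path:

df.write.mode("overwrite").format("json").save("s3a://some-bucket/output/")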
3
Solved
I want to write one large DataFrame with repartitioning, so I want to calculate the number of partitions for my source DataFrame.
numberofpartition = {size of dataframe/default_blocksize}
How to c...
Crept asked 21/4, 2020 at 7:45
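A sketch of one estimation approach (Spark 2.3+): Catalyst's plan statistics for the size, with an assumed 128 MB block size; substitute your filesystem's default.

val sizeInBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes
val blockSize = 128L * 1024 * 1024   // assumption; use your FS block size
val numberofpartition = math.max(1, (sizeInBytes / blockSize).toInt + 1)
df.repartition(numberofpartition).write.parquet("/data/out")   // placeholder path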
7
Solved
I would like to include null values in an Apache Spark join. Spark doesn't include rows with null by default.
Here is the default Spark behavior.
val numbersDf = Seq(
("123"),
("456"),
(null),...
Locally asked 18/1, 2017 at 20:21
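The null-safe equality operator <=> treats two nulls as equal, so null keys survive the join. A minimal sketch; the column names and the second DataFrame (lettersDf) are assumptions, since the excerpt shows only numbersDf.

val joined = numbersDf.join(lettersDf, numbersDf("numbers") <=> lettersDf("letters"))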
2
Solved
I would like to compute the maximum of a subset of columns for each row and add it as a new column to the existing DataFrame.
I managed to do this in a very awkward way:
def add_colmax(df,subset_c...
Spellbinder asked 29/11, 2016 at 19:54
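The built-in greatest function does this directly; a minimal sketch mirroring the add_colmax helper from the excerpt (Scala rather than the excerpt's Python).

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, greatest}

// greatest() computes a row-wise maximum over two or more columns
def addColMax(df: DataFrame, subsetCols: Seq[String], outName: String): DataFrame =
  df.withColumn(outName, greatest(subsetCols.map(col): _*))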
4
Solved
Given the following code:
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object SortQuestion extends App{
val spark = SparkSession.builder().ap...
Dace asked 5/4, 2018 at 11:34
2
Solved
I want to count how many records are true in a column of a grouped Spark DataFrame, but I don't know how to do that in Python. For example, I have data with region, salary and IsUnemployed ...
Adai asked 18/2, 2016 at 22:28
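Casting the boolean to int turns "count the trues" into a plain sum; a minimal sketch (Scala rather than Python, column names taken from the excerpt).

import org.apache.spark.sql.functions.{col, sum}

val counts = df.groupBy("region")
  .agg(sum(col("IsUnemployed").cast("int")).alias("unemployed_count"))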
2
I'm using Spark 2.0.1:
df.show()
+--------+------+---+-----+-----+----+
|Survived|Pclass|Sex|SibSp|Parch|Fare|
+--------+------+---+-----+-----+----+
| 0.0| 3.0|1.0| 1.0| 0.0| 7.3|
| 1.0| 1.0|0....
Kennethkennett asked 14/12, 2018 at 22:24
5
Solved
I have two DataFrames in Spark SQL (D1 and D2).
I am trying to inner join them, D1.join(D2, "some column"),
and get back data from only D1, not the complete data set.
Both D1 and D2 are ha...
Striking asked 2/8, 2016 at 13:2
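A left_semi join does exactly this: it returns only D1's columns, keeping the D1 rows that have a match in D2. Sketched with the excerpt's placeholder join key:

val d1Only = D1.join(D2, Seq("some column"), "left_semi")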
3
Solved
I have a scenario where I need to compare two different tables, source and destination, from two separate remote Hive servers. Can we use two SparkSessions, something like I tried below:
val spark = Spa...
Altimetry asked 6/7, 2017 at 12:43
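For context, SparkSessions in one JVM share a single SparkContext and Hive metastore client, so two sessions can't point at two different metastores. One workaround, sketched with placeholder host and table names, is to pull one side over the Hive JDBC interface instead:

// Read the second server's table through the Hive JDBC driver
val destDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://remote-hive-2:10000/default")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "destination_table")
  .load()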
4
I am trying to access a mid-sized Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]).
I am using Spark 1.4.1, set up on a very powerful machine (2 cpu,...
Prescribe asked 24/8, 2015 at 17:36
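Without partitioning options, Spark pulls the whole table through a single JDBC connection. A minimal sketch for the Spark 1.4 API; the URL, table name, and the bounds of the numeric partition column are placeholders.

val df = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:teradata://host/DATABASE=mydb",
  "driver"          -> "com.teradata.jdbc.TeraDriver",
  "dbtable"         -> "big_table",
  "partitionColumn" -> "id",          // numeric column to split reads on
  "lowerBound"      -> "1",
  "upperBound"      -> "100000000",
  "numPartitions"   -> "32"           // parallel connections/partitions
)).load()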
3
Solved
Having this schema:
root
|-- Elems: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Elem: integer (nullable = true)
| | |-- Desc: string (nullable = true)
How can w...
Holdall asked 14/1, 2019 at 19:57
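Dot syntax on an array-of-structs column projects the field out of every element, yielding an array of that field's type; explode gives one row per struct instead. A minimal sketch, assuming the DataFrame is named df:

import org.apache.spark.sql.functions.{col, explode}

df.select(col("Elems.Elem"))   // array<int>, one array per input row
df.select(explode(col("Elems")).as("e"))
  .select(col("e.Elem"), col("e.Desc"))   // one row per struct element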