apache-spark-sql Questions
3
Solved
I am loading some data into Spark with a wrapper function:
def load_data(filename):
    df = sqlContext.read.format("com.databricks.spark.csv")\
        .option("delimiter", "\t")\
        .option("header", "f...
Quota asked 5/10, 2016 at 7:50
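A minimal sketch of such a wrapper on Spark 2.x+, where the CSV reader is built in and the databricks package is no longer needed; the truncated header option is assumed to be "false":

def load_data(filename):
    # built-in CSV reader; "spark" is the SparkSession
    return (spark.read
            .option("delimiter", "\t")
            .option("header", "false")  # assumption: the question's option is truncated above
            .csv(filename))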
7
I have a pandas data frame my_df, and my_df.dtypes gives us:
ts int64
fieldA object
fieldB object
fieldC object
fieldD object
fieldE object
dtype: object
Then I am trying to convert the pandas d...
Bevon asked 9/11, 2016 at 23:11
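For reference, the usual conversion is a single call, sketched here under the assumption that the object columns hold strings (mixed-type object columns are the most common cause of a failed conversion):

# force object columns to str so Spark can infer a clean schema
clean = my_df.astype({c: str for c in ["fieldA", "fieldB", "fieldC", "fieldD", "fieldE"]})
spark_df = spark.createDataFrame(clean)
spark_df.printSchema()  # ts -> long, the rest -> string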
2
This is my dataframe. I'm trying to drop the duplicate columns with the same name, using their index:
df = spark.createDataFrame([(1,2,3,4,5)],['c','b','a','a','b'])
df.show()
Output:
+---+---+---+---+---+...
Ascribe asked 18/12, 2019 at 18:35
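Since the duplicate names make name-based selection ambiguous, one workable sketch renames every column by position first, then keeps only the first occurrence of each name:

# make the names unique by appending the position
renamed = df.toDF(*[f"{c}_{i}" for i, c in enumerate(df.columns)])
# keep the first occurrence of each original name
keep = [f"{c}_{i}" for i, c in enumerate(df.columns) if c not in df.columns[:i]]
df_dedup = renamed.select(keep).toDF(*dict.fromkeys(df.columns))
df_dedup.show()  # columns c, b, a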
2
Using Apache Spark (or pyspark) I can read/load a text file into a spark dataframe and load that dataframe into a sql db, as follows:
df = spark.read.csv("MyFilePath/MyDataFile.txt", sep=...
Ottie asked 7/7, 2022 at 2:13
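A sketch of the write side using Spark's JDBC sink; the URL, table name, and credentials below are placeholders, not values from the question:

df = spark.read.csv("MyFilePath/MyDataFile.txt", sep="\t", header=True)  # sep/header assumed
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://host:5432/mydb")  # placeholder connection string
   .option("dbtable", "my_table")                      # placeholder target table
   .option("user", "user")
   .option("password", "secret")
   .mode("append")
   .save())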
11
Solved
When I create a DataFrame from a JSON file in Spark SQL, how can I tell whether a given column exists before calling .select?
Example JSON schema:
{
"a": {
"b": 1,
"c": 2
}
}
This is what I want ...
Easiness asked 9/3, 2016 at 22:40
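One common pattern: top-level columns are listed in df.columns, and nested fields can be checked through the schema. A sketch against the example JSON above:

if "a" in df.columns:                     # top-level check
    nested = df.schema["a"].dataType      # StructType of the "a" object
    if "b" in nested.names:               # nested check
        df.select("a.b").show()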
9
Solved
I have a text file on HDFS and I want to convert it to a Data Frame in Spark.
I am using the Spark Context to load the file and then trying to generate individual columns from it.
val myFile...
Praemunire asked 21/4, 2016 at 10:6
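The question is in Scala; a PySpark equivalent sketch, assuming a tab-delimited file and hypothetical column names:

df = (spark.read
      .option("delimiter", "\t")          # assumed separator
      .csv("hdfs:///path/to/myFile"))     # placeholder path
df = df.toDF("col1", "col2", "col3")      # hypothetical names; must match the column count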
4
Solved
I run a query on Databricks:
DROP TABLE IF EXISTS dublicates_hotels;
CREATE TABLE IF NOT EXISTS dublicates_hotels
...
I'm trying to understand why I receive the following error:
Error in SQL stat...
Nkrumah asked 13/10, 2021 at 7:51
4
Solved
val columnName=Seq("col1","col2",....."coln");
Is there a way to do a dataframe.select operation to get a dataframe containing only the columns specified?
I know I can do dataframe.select("col...
Halflife asked 21/3, 2016 at 12:59
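In Scala the idiomatic call is dataframe.select(columnName.head, columnName.tail: _*); the PySpark counterpart simply unpacks the list:

column_names = ["col1", "col2", "coln"]   # the Seq from the question
df_subset = df.select(*column_names)      # unpack the list into select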
2
Solved
I am trying to anonymize/hash a nested column, but haven't been successful. The schema looks something like this:
|-- abc: struct (nullable = true)
|    |-- xyz: struct (nullable = true)
|    |    |-- abc123...
Salisbarry asked 7/1, 2022 at 15:15
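On Spark 3.1+ this can be done without rebuilding the whole struct, via Column.withField; a sketch using the field names from the schema above and sha2 as the hash:

from pyspark.sql import functions as F

hashed = F.sha2(F.col("abc.xyz.abc123").cast("string"), 256)
df2 = df.withColumn(
    "abc",
    F.col("abc").withField("xyz", F.col("abc.xyz").withField("abc123", hashed)))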
3
Solved
I am trying to do something very simple: update the value of a nested column; however, I cannot figure out how.
Environment:
Apache Spark 2.4.5
Databricks 6.4
Python 3.7
dataDF = [
(('Jon','','Smith'),'1580-01-06'...
Theomania asked 7/12, 2020 at 11:2
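Spark 2.4.5 predates Column.withField, so the usual workaround is to rebuild the struct with the one leaf replaced. The struct and field names below are assumptions, since they are truncated in the question:

from pyspark.sql import functions as F

# assuming a struct column "name" with fields first/middle/last
df2 = df.withColumn("name", F.struct(
    F.lit("Jonathan").alias("first"),       # the new value
    F.col("name.middle").alias("middle"),   # unchanged leaves carried over
    F.col("name.last").alias("last")))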
6
Solved
How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in the range (1, 100)?
I know how to create a DataFrame manually, but I cannot automate it:
val...
Loach asked 7/2, 2018 at 8:38
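The question asks for Scala; the same idea in PySpark, for comparison, is a range plus rand():

from pyspark.sql import functions as F

# 100 rows, 3 columns of uniform ints in [1, 99]; adjust the arithmetic
# if the range (1, 100) should include 100
df = spark.range(100).select(
    *[(F.rand(seed=i) * 99 + 1).cast("int").alias(f"col{i}") for i in range(1, 4)])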
2
Solved
There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is:
The Spark Driver node (sp...
Goatsbeard asked 8/9, 2016 at 0:57
2
I am trying to execute a simple MySQL query using Apache Spark and create a data frame. But for some reason Spark appends 'WHERE 1=0' to the end of the query I want to execute and throws an ...
Pumpernickel asked 16/2, 2018 at 12:42
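For context, Spark issues SELECT * FROM (<your query>) WHERE 1=0 once, just to fetch the schema without pulling rows; the error usually means the text passed as dbtable is not a valid table name or aliased subquery. A sketch of the subquery form (connection details are placeholders):

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")              # placeholder
      .option("dbtable", "(SELECT id, name FROM users) AS t")  # subquery needs an alias
      .option("user", "u")
      .option("password", "p")
      .load())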
8
Solved
The data looks like this -
+-----------+-----------+-----------------------------+
|         id|      point|                         data|
+-----------+-----------+-----------------------------+
|        abc|          6|{"key1":"124", "key2": "345"...
Willow asked 27/6, 2018 at 19:38
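One common approach is to parse the JSON string column with from_json; a sketch that treats data as a flat string-to-string map (the column names come from the snippet above):

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

parsed = df.withColumn("data", F.from_json("data", MapType(StringType(), StringType())))
parsed.select("id", "point", F.col("data")["key1"].alias("key1")).show()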
6
I want to read a JSON or XML file in pyspark. If my file is split across multiple lines, as in
rdd = sc.textFile(json or xml)
Input
{
" employees":
[
{
"firstName":"John",
"...
Celanese asked 25/5, 2015 at 20:0
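sc.textFile reads one line per record, which breaks pretty-printed JSON; the JSON reader has a multiLine option for records that span lines. A sketch (the file name is a placeholder):

df = spark.read.option("multiLine", True).json("employees.json")
df.show()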
5
I have another question that is related to the split function.
I am new to Spark/Scala.
below is the sample data frame -
+-------------------+---------+
|             VALUES|Delimiter|
+-------------------+--...
Addict asked 14/7, 2021 at 15:41
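In the Python API the pattern argument of split() must be a literal, so a per-row delimiter column is usually handled through expr(), where the SQL split accepts any expression; a sketch using the column names from the frame above:

from pyspark.sql import functions as F

df2 = df.withColumn("split_values", F.expr("split(VALUES, Delimiter)"))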
6
Solved
I have a Dataframe that I am trying to flatten. As part of the process, I want to explode it, so if I have a column of arrays, each value of the array will be used to create a separate row. For ins...
Fingertip asked 28/9, 2016 at 5:57
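For reference, explode is the standard tool here: each array element becomes its own row and the remaining columns are duplicated. A sketch with a hypothetical array column:

from pyspark.sql import functions as F

df2 = df.withColumn("value", F.explode(F.col("arr")))  # "arr" is a hypothetical array column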
4
I use the sqlContext.read.parquet function in PySpark to read parquet files every day. The data has a timestamp column. They changed the timestamp field from 2019-08-26T00:00:13.600+0000 to 2019-0...
Portis asked 28/8, 2019 at 20:54
4
How can I replicate this code to get the dataframe size in pyspark?
scala> val df = spark.range(10)
scala> print(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats...
Ramrod asked 3/6, 2020 at 13:31
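A commonly cited PySpark translation goes through the JVM objects; it relies on internal APIs, so it is version-sensitive (this shape matches roughly Spark 2.4-3.0):

size_in_bytes = (spark._jsparkSession.sessionState()
                 .executePlan(df._jdf.queryExecution().logical())
                 .optimizedPlan().stats().sizeInBytes())

On Spark 3.0+ the public df.explain(mode="cost") prints the same statistics.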
6
Solved
I am trying to find a reliable way to compute the size (in bytes) of a Spark dataframe programmatically.
The reason is that I would like to have a method to compute an "optimal" number of partiti...
Mention asked 26/3, 2018 at 13:18
8
Solved
I have a Spark DataFrame that has one column that has lots of zeros and very few ones (only 0.01% of ones).
I'd like to take a random subsample but a stratified one - so that it keeps the ratio o...
Neolithic asked 4/12, 2017 at 16:27
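sampleBy is the usual tool: it applies a per-class fraction, so using the same fraction for both classes keeps the 0/1 ratio approximately intact. A sketch assuming the column is named "label":

# keep ~10% of each class; fractions and seed are illustrative
sampled = df.sampleBy("label", fractions={0: 0.1, 1: 0.1}, seed=42)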
3
I am trying to split the Dataset into different Datasets based on the Manufacturer column contents. It is very slow. Please suggest a way to improve the code, so that it can execute faster and reduce th...
Maloney asked 7/3, 2017 at 10:30
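Filtering once per manufacturer launches one job per distinct value; a single partitioned write is usually much faster. A sketch (the output path is a placeholder):

(df.write
   .partitionBy("Manufacturer")        # one subdirectory per manufacturer, single pass
   .mode("overwrite")
   .parquet("/tmp/by_manufacturer"))   # placeholder path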
13
Solved
As far as I know, multiple columns in a Spark DataFrame can have the same name, as shown in the dataframe snapshot below:
[
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}),...
Ramtil asked 18/11, 2015 at 11:16
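Aliases are the usual way to disambiguate after a join; a sketch using the column names from the snapshot (df2 and the join key are assumptions):

from pyspark.sql import functions as F

left = df.alias("l")
right = df2.alias("r")                   # hypothetical second frame
joined = left.join(right, F.col("l.a") == F.col("r.a"))
joined.select(F.col("l.f")).show()       # picks the left-hand "f" unambiguously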
3
Solved
I want to select all columns in a table except StudentAddress and hence I wrote following query:
select `(StudentAddress)?+.+` from student;
It gives the following error in the SQuirreL SQL client.
org....
Flu asked 26/4, 2017 at 21:1
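The backquoted-regex column syntax only works when the parser option is switched on (Spark 2.3+); a sketch:

spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
spark.sql("SELECT `(StudentAddress)?+.+` FROM student").show()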
5
I want to delete data from a Delta file in Databricks.
I'm using these commands:
Ex:
PR=spark.read.format('delta').options(header=True).load('/mnt/landing/Base_Tables/EventHistory/')
PR.write.format(...
Dissatisfaction asked 7/12, 2020 at 10:3
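Rather than reading, filtering, and overwriting, Delta supports in-place deletes; a sketch with a hypothetical predicate:

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, '/mnt/landing/Base_Tables/EventHistory/')
dt.delete("event_date < '2020-01-01'")   # hypothetical delete condition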