apache-spark-sql Questions

3

Solved

I am loading some data into Spark with a wrapper function: def load_data( filename ): df = sqlContext.read.format("com.databricks.spark.csv")\ .option("delimiter", "\t")\ .option("header", "f...
Quota asked 5/10, 2016 at 7:50
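
A minimal sketch of such a wrapper (assuming the truncated option value is "false"; sqlContext here is a Spark 1.x SQLContext and spark-csv is the external Databricks package):

def load_data(filename):
    # Tab-delimited file with no header row (the header value is assumed).
    return (sqlContext.read.format("com.databricks.spark.csv")
            .option("delimiter", "\t")
            .option("header", "false")
            .load(filename))

# On Spark 2+, the built-in reader does the same job:
# df = spark.read.option("sep", "\t").option("header", "false").csv(filename)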

7

I have a pandas data frame my_df, and my_df.dtypes gives us: ts int64 fieldA object fieldB object fieldC object fieldD object fieldE object dtype: object Then I am trying to convert the pandas d...
Bevon asked 9/11, 2016 at 23:11
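
A sketch of one common fix, not necessarily the accepted answer: object-dtype columns often carry mixed types that break schema inference, so cast them to str or pass an explicit schema:

from pyspark.sql.types import StructType, StructField, LongType, StringType

# Field names come from the dtypes listing; treating every object column
# as a string is an assumption.
schema = StructType(
    [StructField("ts", LongType(), True)]
    + [StructField(f, StringType(), True)
       for f in ["fieldA", "fieldB", "fieldC", "fieldD", "fieldE"]]
)
clean = my_df.astype({c: str for c in my_df.columns if c != "ts"})
spark_df = spark.createDataFrame(clean, schema=schema)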

2

This is my dataframe. I'm trying to drop the duplicate columns with the same name using their index: df = spark.createDataFrame([(1,2,3,4,5)],['c','b','a','a','b']) df.show() Output: +---+---+---+---+---+...
Ascribe asked 18/12, 2019 at 18:35
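
One workaround (a sketch, not necessarily the thread's answer): selecting a duplicated name is ambiguous, so rename every column to a positionally unique alias first, then keep only the first occurrence of each original name:

df = spark.createDataFrame([(1, 2, 3, 4, 5)], ['c', 'b', 'a', 'a', 'b'])

# Make every column addressable by giving it a unique, index-based alias.
uniq = df.toDF(*[f"{c}__{i}" for i, c in enumerate(df.columns)])

seen, keep, names = set(), [], []
for i, c in enumerate(df.columns):
    if c not in seen:            # keep the first occurrence of each name
        seen.add(c)
        keep.append(uniq.columns[i])
        names.append(c)

result = uniq.select(keep).toDF(*names)
result.show()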

2

Using Apache Spark (or pyspark) I can read/load a text file into a spark dataframe and load that dataframe into a sql db, as follows: df = spark.read.csv("MyFilePath/MyDataFile.txt", sep=...
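
For the write side, the JDBC writer closes the loop (a sketch; the URL, table name, and credentials are placeholders, and the separator is a guess since the excerpt is cut off):

df = spark.read.csv("MyFilePath/MyDataFile.txt", sep="\t", header=True, inferSchema=True)

(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://dbhost:5432/mydb")   # hypothetical URL
   .option("dbtable", "my_table")                         # hypothetical table
   .option("user", "user")
   .option("password", "password")
   .mode("append")
   .save())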

11

Solved

When I create a DataFrame from a JSON file in Spark SQL, how can I tell if a given column exists before calling .select? Example JSON schema: { "a": { "b": 1, "c": 2 } } This is what I want ...
Easiness asked 9/3, 2016 at 22:40
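
A widely used approach (sketch) is to attempt the select and catch the analysis error, which also handles nested paths such as "a.b":

from pyspark.sql.utils import AnalysisException

def has_column(df, path):
    # Spark raises AnalysisException if the (possibly nested) column
    # cannot be resolved against the DataFrame's schema.
    try:
        df.select(path)
        return True
    except AnalysisException:
        return False

# has_column(df, "a.b")  -> True for the example schema above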

9

Solved

I have a text file on HDFS and I want to convert it to a DataFrame in Spark. I am using the Spark Context to load the file and then try to generate individual columns from that file. val myFile...
Praemunire asked 21/4, 2016 at 10:6
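
The usual pattern, shown here as a PySpark sketch (the Scala version is analogous), is to split each line and map it to a tuple before calling toDF; the delimiter and the three-column layout are assumptions:

rdd = sc.textFile("hdfs:///path/to/myfile.txt")   # hypothetical path
df = (rdd.map(lambda line: line.split(","))
         .map(lambda p: (p[0], p[1], p[2]))
         .toDF(["col1", "col2", "col3"]))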

4

Solved

I run a query on Databricks: DROP TABLE IF EXISTS dublicates_hotels; CREATE TABLE IF NOT EXISTS dublicates_hotels ... I'm trying to understand why I receive the following error: Error in SQL stat...
Nkrumah asked 13/10, 2021 at 7:51
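
The error text is cut off above, so this is not a diagnosis, but for reference the statement shape that works on Databricks is either a full column definition or a CTAS (sketch; the query body and source table are hypothetical):

spark.sql("DROP TABLE IF EXISTS dublicates_hotels")
spark.sql("""
    CREATE TABLE IF NOT EXISTS dublicates_hotels AS
    SELECT name, COUNT(*) AS cnt   -- hypothetical body; the original is truncated
    FROM hotels
    GROUP BY name
    HAVING COUNT(*) > 1
""")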

4

Solved

val columnName=Seq("col1","col2",....."coln"); Is there a way to do a dataframe.select operation to get a dataframe containing only the columns specified? I know I can do dataframe.select("col...
Halflife asked 21/3, 2016 at 12:59
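
In Scala the idiom is dataframe.select(columnName.map(col): _*); the PySpark equivalent (sketch) simply unpacks the list:

from pyspark.sql.functions import col

column_names = ["col1", "col2", "coln"]   # mirrors the question's Seq
selected = dataframe.select(*[col(c) for c in column_names])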

2

Solved

I am trying to anonymize/hash a nested column, but haven't been successful. The schema looks something like this:
|-- abc: struct (nullable = true)
|    |-- xyz: struct (nullable = true)
|    |    |-- abc123...
Salisbarry asked 7/1, 2022 at 15:15
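
On Spark 3.1+ withField can address nested fields with a dotted path, which keeps the rest of the struct intact (a sketch; the leaf name "abc123" is read off the truncated schema, so treat it as an assumption):

from pyspark.sql import functions as F

hashed = df.withColumn(
    "abc",
    F.col("abc").withField("xyz.abc123",                   # assumed leaf name
                           F.sha2(F.col("abc.xyz.abc123"), 256))
)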

3

Solved

I am trying to do something very simple: update the value of a nested column; however, I cannot figure out how. Environment: Apache Spark 2.4.5 Databricks 6.4 Python 3.7 dataDF = [ (('Jon','','Smith'),'1580-01-06'...
Theomania asked 7/12, 2020 at 11:2
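
Spark 2.4.5 predates withField, so the usual workaround (sketch) is to rebuild the struct, copying the untouched fields and overriding the one being changed; the field names below are assumptions based on the tuple in the excerpt:

from pyspark.sql import functions as F

# Assumed schema: name struct<first,middle,last>, plus a date string.
df2 = df.withColumn(
    "name",
    F.struct(
        F.col("name.first").alias("first"),
        F.lit("M.").alias("middle"),         # the updated value (hypothetical)
        F.col("name.last").alias("last"),
    )
)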

6

Solved

How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in range (1, 100)? I know how to create a DataFrame manually, but I cannot automate it: val...
Loach asked 7/2, 2018 at 8:38
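
A PySpark sketch of the same idea (the question asks for Scala, where spark.range plus rand works the same way):

from pyspark.sql import functions as F

# 100 rows, 3 columns of uniform random integers in [1, 100].
df = spark.range(100).select(
    *[(F.floor(F.rand(seed=i) * 100) + 1).cast("int").alias(f"col{i}")
      for i in range(1, 4)]
)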

2

Solved

There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is: The Spark Driver node (sp...
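
For concreteness, here is how those knobs typically surface in a session config (values are illustrative only):

from pyspark.sql import SparkSession

# One driver JVM plus 4 executor JVMs with 2 task slots (cores) each
# => up to 8 tasks running concurrently across the cluster.
spark = (SparkSession.builder
         .appName("parallelism-demo")
         .config("spark.executor.instances", "4")
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "4g")
         .getOrCreate())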

2

I am trying to execute a simple mysql query using Apache Spark and create a data frame. But for some reason Spark appends 'WHERE 1=0' at the end of the query which I want to execute and throws an ...
Pumpernickel asked 16/2, 2018 at 12:42
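
The WHERE 1=0 probe is how the JDBC source fetches the result schema without pulling any rows; it fails when a bare SQL statement is passed where a table name is expected. Wrapping the query as an aliased derived table is the usual fix (sketch; the URL and query are hypothetical):

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb")
      .option("dbtable", "(SELECT id, name FROM users) AS t")
      .option("user", "user")
      .option("password", "password")
      .load())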

8

Solved

The data looks like this:
+---+-----+-----------------------------+
| id|point|                         data|
+---+-----+-----------------------------+
|abc|    6|{"key1":"124", "key2": "345"...
Willow asked 27/6, 2018 at 19:38
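
A sketch for pulling those keys out of the JSON string column with from_json (key1/key2 appear in the sample row; any further keys are truncated above):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

json_schema = StructType([
    StructField("key1", StringType(), True),
    StructField("key2", StringType(), True),
])

parsed = (df.withColumn("data", F.from_json("data", json_schema))
            .select("id", "point", "data.key1", "data.key2"))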

6

I want to read a json or xml file in pyspark. If my file is split across multiple lines: rdd = sc.textFile(json or xml) Input: { " employees": [ { "firstName":"John", "...
Celanese asked 25/5, 2015 at 20:0
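
Rather than sc.textFile, the DataFrame reader can ingest a JSON document that spans multiple lines directly (sketch; the file path is hypothetical):

# multiLine tells Spark the file is one JSON document, not one object per line.
df = spark.read.option("multiLine", True).json("employees.json")
df.printSchema()
df.show(truncate=False)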

5

I have another question that is related to the split function. I am new to Spark/Scala. Below is the sample data frame:
+-------------------+---------+
|             VALUES|Delimiter|
+-------------------+--...
Addict asked 14/7, 2021 at 15:41
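
Because F.split takes its pattern as a literal, a per-row delimiter column is usually routed through expr, where both arguments can be columns (sketch):

from pyspark.sql import functions as F

# Note: the delimiter is treated as a regex, so metacharacters such as |
# may need escaping before the split.
df = df.withColumn("split_values", F.expr("split(VALUES, Delimiter)"))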

6

Solved

I have a Dataframe that I am trying to flatten. As part of the process, I want to explode it, so if I have a column of arrays, each value of the array will be used to create a separate row. For ins...
Fingertip asked 28/9, 2016 at 5:57
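
The core move (sketch, with hypothetical column names) is explode, which emits one output row per array element:

from pyspark.sql import functions as F

# Each element of the "tags" array becomes its own row.
flat = df.withColumn("tag", F.explode("tags")).drop("tags")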

4

I use the sqlContext.read.parquet function in PySpark to read the parquet files every day. The data has a timestamp column. They changed the timestamp field from 2019-08-26T00:00:13.600+0000 to 2019-0...
Portis asked 28/8, 2019 at 20:54
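
One defensive pattern (a sketch only, since the question's failure mode is cut off) is to read the column as a string and parse it with an explicit pattern matching the old values:

from pyspark.sql import functions as F

# Pattern matches values like 2019-08-26T00:00:13.600+0000.
df = df.withColumn(
    "ts_parsed",
    F.to_timestamp(F.col("timestamp").cast("string"),
                   "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
)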

4

How can I replicate this code to get the dataframe size in pyspark? scala> val df = spark.range(10) scala> print(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats...
Ramrod asked 3/6, 2020 at 13:31
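
Going through the underlying JVM objects reproduces the Scala call (sketch; this leans on py4j internals, not a public PySpark API):

df = spark.range(10)

# _jdf exposes the JVM Dataset; the chain mirrors the Scala snippet above.
size_in_bytes = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print(size_in_bytes)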

6

Solved

I am trying to find a reliable way to compute the size (in bytes) of a Spark dataframe programmatically. The reason is that I would like to have a method to compute an "optimal" number of partiti...
Mention asked 26/3, 2018 at 13:18
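
Once a byte size has been estimated (for instance via the plan statistics in the previous sketch), the partition arithmetic itself is simple (128 MB per partition is a common rule of thumb, not a Spark constant):

# Hypothetical: size_in_bytes estimated beforehand.
target_bytes = 128 * 1024 * 1024
num_partitions = max(1, int(size_in_bytes / target_bytes))
df = df.repartition(num_partitions)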

8

Solved

I have a Spark DataFrame that has one column that has lots of zeros and very few ones (only 0.01% of ones). I'd like to take a random subsample but a stratified one - so that it keeps the ratio o...
Neolithic asked 4/12, 2017 at 16:27
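
sampleBy keeps a per-stratum fraction, so giving both classes the same fraction preserves the 0/1 ratio in expectation (sketch; "label" is a hypothetical column name):

# Keep ~10% of each class; the 0:1 ratio is preserved in expectation.
sample = df.sampleBy("label", fractions={0: 0.1, 1: 0.1}, seed=42)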

3

I am trying to split the Dataset into different Datasets based on the Manufacturer column contents. It is very slow. Please suggest a way to improve the code, so that it can execute faster and reduce th...
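
Filtering once per manufacturer rescans the data for every value; a single partitioned write scans it once in total and is the usual fix (sketch; the output path is hypothetical):

# One pass over the data; each manufacturer lands in its own subdirectory.
(df.write
   .partitionBy("Manufacturer")
   .mode("overwrite")
   .parquet("/output/by_manufacturer"))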

13

Solved

As I know, in a Spark DataFrame multiple columns can have the same name, as shown in the dataframe snapshot below: [ Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}),...
Ramtil asked 18/11, 2015 at 11:16
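
Aliasing each side before the join keeps duplicated names addressable afterwards (sketch, with hypothetical dataframes df1 and df2):

from pyspark.sql import functions as F

joined = df1.alias("l").join(df2.alias("r"), F.col("l.a") == F.col("r.a"))
# Disambiguate through the alias, then rename to taste.
result = joined.select(F.col("l.f").alias("f_left"),
                       F.col("r.f").alias("f_right"))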

3

Solved

I want to select all columns in a table except StudentAddress and hence I wrote the following query: select `(StudentAddress)?+.+` from student; It gives the following error in the Squirrel SQL client. org....
Flu asked 26/4, 2017 at 21:1
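
That backquoted pattern is only treated as a regex when quoted regex column names are switched on (sketch):

spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")
# Now the backquoted token is a regex over column names, i.e. "everything
# except StudentAddress":
spark.sql("SELECT `(StudentAddress)?+.+` FROM student").show()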

5

I want to delete data from a delta file in databricks. I'm using these commands, e.g.: PR=spark.read.format('delta').options(header=True).load('/mnt/landing/Base_Tables/EventHistory/') PR.write.format(...
Dissatisfaction asked 7/12, 2020 at 10:3
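
Rather than read-filter-overwrite, Delta Lake has a first-class delete (sketch; the predicate is hypothetical):

from delta.tables import DeltaTable

pr = DeltaTable.forPath(spark, "/mnt/landing/Base_Tables/EventHistory/")
pr.delete("event_date < '2020-01-01'")   # hypothetical predicate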
