apache-spark-sql Questions

3

Solved

I am loading some data into Spark with a wrapper function: def load_data( filename ): df = sqlContext.read.format("com.databricks.spark.csv")\ .option("delimiter", "\t")\ .option("header", "f...
Quota asked 5/10, 2016 at 7:50
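
A minimal sketch of such a wrapper (assuming the truncated option value is "false"; sqlContext here is a Spark 1.x SQLContext and spark-csv is the external Databricks package):

def load_data(filename):
    # Tab-delimited file with no header row (the header value is assumed).
    return (sqlContext.read.format("com.databricks.spark.csv")
            .option("delimiter", "\t")
            .option("header", "false")
            .load(filename))

# On Spark 2+, the built-in reader does the same job:
# df = spark.read.option("sep", "\t").option("header", "false").csv(filename)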

7

I have a pandas data frame my_df, and my_df.dtypes gives us: ts int64 fieldA object fieldB object fieldC object fieldD object fieldE object dtype: object Then I am trying to convert the pandas d...
Bevon asked 9/11, 2016 at 23:11
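
A sketch of one common fix, not necessarily the accepted answer: object-dtype columns often carry mixed types that break schema inference, so cast them to str or pass an explicit schema:

from pyspark.sql.types import StructType, StructField, LongType, StringType

# Field names come from the dtypes listing; treating every object column
# as a string is an assumption.
schema = StructType(
    [StructField("ts", LongType(), True)]
    + [StructField(f, StringType(), True)
       for f in ["fieldA", "fieldB", "fieldC", "fieldD", "fieldE"]]
)
clean = my_df.astype({c: str for c in my_df.columns if c != "ts"})
spark_df = spark.createDataFrame(clean, schema=schema)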

2

This is my dataframe. I'm trying to drop the duplicate columns with the same name using their index: df = spark.createDataFrame([(1,2,3,4,5)],['c','b','a','a','b']) df.show() Output: +---+---+---+---+---+...
Ascribe asked 18/12, 2019 at 18:35
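
One workaround (a sketch, not necessarily the thread's answer): selecting a duplicated name is ambiguous, so rename every column to a positionally unique alias first, then keep only the first occurrence of each original name:

df = spark.createDataFrame([(1, 2, 3, 4, 5)], ['c', 'b', 'a', 'a', 'b'])

# Make every column addressable by giving it a unique, index-based alias.
uniq = df.toDF(*[f"{c}__{i}" for i, c in enumerate(df.columns)])

seen, keep, names = set(), [], []
for i, c in enumerate(df.columns):
    if c not in seen:            # keep the first occurrence of each name
        seen.add(c)
        keep.append(uniq.columns[i])
        names.append(c)

result = uniq.select(keep).toDF(*names)
result.show()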

2

Using Apache Spark (or pyspark) I can read/load a text file into a spark dataframe and load that dataframe into a sql db, as follows: df = spark.read.csv("MyFilePath/MyDataFile.txt", sep=...
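
For the write side, the JDBC writer closes the loop (a sketch; the URL, table name, and credentials are placeholders, and the separator is a guess since the excerpt is cut off):

df = spark.read.csv("MyFilePath/MyDataFile.txt", sep="\t", header=True, inferSchema=True)

(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://dbhost:5432/mydb")   # hypothetical URL
   .option("dbtable", "my_table")                         # hypothetical table
   .option("user", "user")
   .option("password", "password")
   .mode("append")
   .save())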

11

Solved

When I create a DataFrame from a JSON file in Spark SQL, how can I tell if a given column exists before calling .select? Example JSON schema: { "a": { "b": 1, "c": 2 } } This is what I want ...
Easiness asked 9/3, 2016 at 22:40
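
A widely used approach (sketch) is to attempt the select and catch the analysis error, which also handles nested paths such as "a.b":

from pyspark.sql.utils import AnalysisException

def has_column(df, path):
    # Spark raises AnalysisException if the (possibly nested) column
    # cannot be resolved against the DataFrame's schema.
    try:
        df.select(path)
        return True
    except AnalysisException:
        return False

# has_column(df, "a.b")  -> True for the example schema above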

9

Solved

I have a text file on HDFS and I want to convert it to a DataFrame in Spark. I am using the Spark Context to load the file and then try to generate individual columns from that file. val myFile...
Praemunire asked 21/4, 2016 at 10:6
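
The usual pattern, shown here as a PySpark sketch (the Scala version is analogous), is to split each line and map it to a tuple before calling toDF; the delimiter and the three-column layout are assumptions:

rdd = sc.textFile("hdfs:///path/to/myfile.txt")   # hypothetical path
df = (rdd.map(lambda line: line.split(","))
         .map(lambda p: (p[0], p[1], p[2]))
         .toDF(["col1", "col2", "col3"]))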

4

Solved

I run a query on Databricks: DROP TABLE IF EXISTS dublicates_hotels; CREATE TABLE IF NOT EXISTS dublicates_hotels ... I'm trying to understand why I receive the following error: Error in SQL stat...
Nkrumah asked 13/10, 2021 at 7:51
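
The error text is cut off above, so this is not a diagnosis, but for reference the statement shape that works on Databricks is either a full column definition or a CTAS (sketch; the query body and source table are hypothetical):

spark.sql("DROP TABLE IF EXISTS dublicates_hotels")
spark.sql("""
    CREATE TABLE IF NOT EXISTS dublicates_hotels AS
    SELECT name, COUNT(*) AS cnt   -- hypothetical body; the original is truncated
    FROM hotels
    GROUP BY name
    HAVING COUNT(*) > 1
""")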

4

Solved

val columnName=Seq("col1","col2",....."coln"); Is there a way to do a dataframe.select operation to get a dataframe containing only the columns specified? I know I can do dataframe.select("col...
Halflife asked 21/3, 2016 at 12:59
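
In Scala the idiom is dataframe.select(columnName.map(col): _*); the PySpark equivalent (sketch) simply unpacks the list:

from pyspark.sql.functions import col

column_names = ["col1", "col2", "coln"]   # mirrors the question's Seq
selected = dataframe.select(*[col(c) for c in column_names])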

2

Solved

I am trying to anonymize/hash a nested column, but haven't been successful. The schema looks something like this:
|-- abc: struct (nullable = true)
|    |-- xyz: struct (nullable = true)
|    |    |-- abc123...
Salisbarry asked 7/1, 2022 at 15:15
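
On Spark 3.1+ withField can address nested fields with a dotted path, which keeps the rest of the struct intact (a sketch; the leaf name "abc123" is read off the truncated schema, so treat it as an assumption):

from pyspark.sql import functions as F

hashed = df.withColumn(
    "abc",
    F.col("abc").withField("xyz.abc123",                   # assumed leaf name
                           F.sha2(F.col("abc.xyz.abc123"), 256))
)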

3

Solved

I am trying to do something very simple: update the value of a nested column; however, I cannot figure out how. Environment: Apache Spark 2.4.5 Databricks 6.4 Python 3.7 dataDF = [ (('Jon','','Smith'),'1580-01-06'...
Theomania asked 7/12, 2020 at 11:2
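
Spark 2.4.5 predates withField, so the usual workaround (sketch) is to rebuild the struct, copying the untouched fields and overriding the one being changed; the field names below are assumptions based on the tuple in the excerpt:

from pyspark.sql import functions as F

# Assumed schema: name struct<first,middle,last>, plus a date string.
df2 = df.withColumn(
    "name",
    F.struct(
        F.col("name.first").alias("first"),
        F.lit("M.").alias("middle"),         # the updated value (hypothetical)
        F.col("name.last").alias("last"),
    )
)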

6

Solved

How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in range (1, 100)? I know how to create a DataFrame manually, but I cannot automate it: val...
Loach asked 7/2, 2018 at 8:38
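
A PySpark sketch of the same idea (the question asks for Scala, where spark.range plus rand works the same way):

from pyspark.sql import functions as F

# 100 rows, 3 columns of uniform random integers in [1, 100].
df = spark.range(100).select(
    *[(F.floor(F.rand(seed=i) * 100) + 1).cast("int").alias(f"col{i}")
      for i in range(1, 4)]
)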

2

Solved

There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is: The Spark Driver node (sp...
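
For concreteness, here is how those knobs typically surface in a session config (values are illustrative only):

from pyspark.sql import SparkSession

# One driver JVM plus 4 executor JVMs with 2 task slots (cores) each
# => up to 8 tasks running concurrently across the cluster.
spark = (SparkSession.builder
         .appName("parallelism-demo")
         .config("spark.executor.instances", "4")
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "4g")
         .getOrCreate())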

2

I am trying to execute a simple mysql query using Apache Spark and create a data frame. But for some reason Spark appends 'WHERE 1=0' at the end of the query which I want to execute and throws an ...
Pumpernickel asked 16/2, 2018 at 12:42
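
The WHERE 1=0 probe is how the JDBC source fetches the result schema without pulling any rows; it fails when a bare SQL statement is passed where a table name is expected. Wrapping the query as an aliased derived table is the usual fix (sketch; the URL and query are hypothetical):

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb")
      .option("dbtable", "(SELECT id, name FROM users) AS t")
      .option("user", "user")
      .option("password", "password")
      .load())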

8

Solved

The data looks like this:
+---+-----+-----------------------------+
| id|point|                         data|
+---+-----+-----------------------------+
|abc|    6|{"key1":"124", "key2": "345"...
Willow asked 27/6, 2018 at 19:38
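
A sketch for pulling those keys out of the JSON string column with from_json (key1/key2 appear in the sample row; any further keys are truncated above):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

json_schema = StructType([
    StructField("key1", StringType(), True),
    StructField("key2", StringType(), True),
])

parsed = (df.withColumn("data", F.from_json("data", json_schema))
            .select("id", "point", "data.key1", "data.key2"))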

6

I want to read a json or xml file in pyspark. If my file is split across multiple lines: rdd = sc.textFile(json or xml) Input: { " employees": [ { "firstName":"John", "...
Celanese asked 25/5, 2015 at 20:0
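
Rather than sc.textFile, the DataFrame reader can ingest a JSON document that spans multiple lines directly (sketch; the file path is hypothetical):

# multiLine tells Spark the file is one JSON document, not one object per line.
df = spark.read.option("multiLine", True).json("employees.json")
df.printSchema()
df.show(truncate=False)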

5

I have another question that is related to the split function. I am new to Spark/Scala. Below is the sample data frame:
+-------------------+---------+
|             VALUES|Delimiter|
+-------------------+--...
Addict asked 14/7, 2021 at 15:41
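
Because F.split takes its pattern as a literal, a per-row delimiter column is usually routed through expr, where both arguments can be columns (sketch):

from pyspark.sql import functions as F

# Note: the delimiter is treated as a regex, so metacharacters such as |
# may need escaping before the split.
df = df.withColumn("split_values", F.expr("split(VALUES, Delimiter)"))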

6

Solved

I have a Dataframe that I am trying to flatten. As part of the process, I want to explode it, so if I have a column of arrays, each value of the array will be used to create a separate row. For ins...
Fingertip asked 28/9, 2016 at 5:57
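
The core move (sketch, with hypothetical column names) is explode, which emits one output row per array element:

from pyspark.sql import functions as F

# Each element of the "tags" array becomes its own row.
flat = df.withColumn("tag", F.explode("tags")).drop("tags")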

4

I use the sqlContext.read.parquet function in PySpark to read the parquet files every day. The data has a timestamp column. They changed the timestamp field from 2019-08-26T00:00:13.600+0000 to 2019-0...
Portis asked 28/8, 2019 at 20:54
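
One defensive pattern (a sketch only, since the question's failure mode is cut off) is to read the column as a string and parse it with an explicit pattern matching the old values:

from pyspark.sql import functions as F

# Pattern matches values like 2019-08-26T00:00:13.600+0000.
df = df.withColumn(
    "ts_parsed",
    F.to_timestamp(F.col("timestamp").cast("string"),
                   "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
)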

4

How can I replicate this code to get the dataframe size in pyspark? scala> val df = spark.range(10) scala> print(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats...
Ramrod asked 3/6, 2020 at 13:31
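
Going through the underlying JVM objects reproduces the Scala call (sketch; this leans on py4j internals, not a public PySpark API):

df = spark.range(10)

# _jdf exposes the JVM Dataset; the chain mirrors the Scala snippet above.
size_in_bytes = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print(size_in_bytes)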

6

Solved

I am trying to find a reliable way to compute the size (in bytes) of a Spark dataframe programmatically. The reason is that I would like to have a method to compute an "optimal" number of partiti...
Mention asked 26/3, 2018 at 13:18
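
Once a byte size has been estimated (for instance via the plan statistics in the previous sketch), the partition arithmetic itself is simple (128 MB per partition is a common rule of thumb, not a Spark constant):

# Hypothetical: size_in_bytes estimated beforehand.
target_bytes = 128 * 1024 * 1024
num_partitions = max(1, int(size_in_bytes / target_bytes))
df = df.repartition(num_partitions)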

8

Solved

I have a Spark DataFrame that has one column that has lots of zeros and very few ones (only 0.01% of ones). I'd like to take a random subsample but a stratified one - so that it keeps the ratio o...
Neolithic asked 4/12, 2017 at 16:27
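
sampleBy keeps a per-stratum fraction, so giving both classes the same fraction preserves the 0/1 ratio in expectation (sketch; "label" is a hypothetical column name):

# Keep ~10% of each class; the 0:1 ratio is preserved in expectation.
sample = df.sampleBy("label", fractions={0: 0.1, 1: 0.1}, seed=42)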

3

I am trying to split the Dataset into different Datasets based on the Manufacturer column contents. It is very slow. Please suggest a way to improve the code, so that it can execute faster and reduce th...
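
Filtering once per manufacturer rescans the data for every value; a single partitioned write scans it once in total and is the usual fix (sketch; the output path is hypothetical):

# One pass over the data; each manufacturer lands in its own subdirectory.
(df.write
   .partitionBy("Manufacturer")
   .mode("overwrite")
   .parquet("/output/by_manufacturer"))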

13

Solved

As I know, in a Spark DataFrame multiple columns can have the same name, as shown in the dataframe snapshot below: [ Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}),...
Ramtil asked 18/11, 2015 at 11:16
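
Aliasing each side before the join keeps duplicated names addressable afterwards (sketch, with hypothetical dataframes df1 and df2):

from pyspark.sql import functions as F

joined = df1.alias("l").join(df2.alias("r"), F.col("l.a") == F.col("r.a"))
# Disambiguate through the alias, then rename to taste.
result = joined.select(F.col("l.f").alias("f_left"),
                       F.col("r.f").alias("f_right"))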

3

Solved

I want to select all columns in a table except StudentAddress and hence I wrote the following query: select `(StudentAddress)?+.+` from student; It gives the following error in the Squirrel SQL client. org....
Flu asked 26/4, 2017 at 21:1
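
That backquoted pattern is only treated as a regex when quoted regex column names are switched on (sketch):

spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")
# Now the backquoted token is a regex over column names, i.e. "everything
# except StudentAddress":
spark.sql("SELECT `(StudentAddress)?+.+` FROM student").show()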

5

I want to delete data from a delta file in databricks. I'm using these commands, e.g.: PR=spark.read.format('delta').options(header=True).load('/mnt/landing/Base_Tables/EventHistory/') PR.write.format(...
Dissatisfaction asked 7/12, 2020 at 10:3
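
Rather than read-filter-overwrite, Delta Lake has a first-class delete (sketch; the predicate is hypothetical):

from delta.tables import DeltaTable

pr = DeltaTable.forPath(spark, "/mnt/landing/Base_Tables/EventHistory/")
pr.delete("event_date < '2020-01-01'")   # hypothetical predicate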
