pyspark Questions
3
Solved
I have data with the below schema. I want all of the columns to be sorted alphabetically, in a PySpark DataFrame.
root
|-- _id: string (nullable = true)
|-- first_name: string (nullable...
Swatow asked 6/9, 2019 at 11:54
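A minimal sketch of one way to do this, assuming the goal is simply reordering columns (the sample DataFrame below is a hypothetical stand-in for the one with the schema above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the DataFrame in the question
df = spark.createDataFrame([("1", "Alice")], ["_id", "first_name"])

# select() with the sorted column names rewrites the projection alphabetically
df_sorted = df.select(sorted(df.columns))
df_sorted.printSchema()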
3
Solved
I have a Dataframe with different columns where one of the columns is an array of structs:
+----------+---------+--------------------------------------+
|id |title | values|
+----------+---------+-...
Bigamist asked 17/6, 2018 at 22:43
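The excerpt cuts off before stating the goal; one common operation on an array-of-structs column is flattening it with explode, sketched here with hypothetical data and field names (assumes an active SparkSession bound to spark):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, "a", [("x", 10), ("y", 20)])],
    "id INT, title STRING, values ARRAY<STRUCT<k: STRING, v: INT>>",
)

# explode() turns each struct in the array into its own row,
# after which the struct's fields can be selected directly
flat = df.select("id", "title", F.explode("values").alias("value"))
flat.select("id", "title", "value.k", "value.v").show()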
3
Solved
I'm pretty new to coding so I apologize for this being a stupid question. I'm writing a Spark function that takes in a file path and file type and creates a dataframe. If the input is invalid, I want...
Denticulation asked 21/5, 2020 at 19:36
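A minimal sketch of the usual pattern, with an assumed set of valid file types and a hypothetical function name (assumes an active SparkSession bound to spark):

SUPPORTED_TYPES = {"csv", "json", "parquet"}  # assumed valid file types

def load_dataframe(path: str, file_type: str):
    """Return a DataFrame for the given path, or raise on an invalid type."""
    if file_type not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {file_type!r}")
    return spark.read.format(file_type).load(path)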
3
Solved
This has a different answer from those given in the post above.
I am getting an error that reads
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manua...
Proust asked 2/11, 2018 at 16:54
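This error typically means Spark found no readable Parquet files at the path (an empty or wrong directory), so there is nothing to infer a schema from. A sketch of the explicit-schema workaround, with a hypothetical field and path:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("_id", StringType(), True)])  # assumed field

# With an explicit schema, Spark no longer needs to infer one,
# though an empty path will still yield an empty DataFrame
df = spark.read.schema(schema).parquet("/path/to/parquet")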
2
Solved
I am trying to create a dataframe from the following list:
data = [(1,'abc','2020-08-20 10:00:00', 'I'),
(1,'abc','2020-08-20 10:01:00', 'U'),
(1,'abc','2020-08-21 10:02:00', 'U'),
(2,'pqr','2020-0...
Wilterdink asked 19/7, 2021 at 5:56
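A minimal sketch, with column names assumed since the excerpt does not give them:

from pyspark.sql import functions as F

data = [  # small stand-in for the list in the question
    (1, "abc", "2020-08-20 10:00:00", "I"),
    (1, "abc", "2020-08-20 10:01:00", "U"),
]

df = spark.createDataFrame(data, ["id", "name", "event_time", "op"])

# Cast the string timestamps to a proper timestamp column
df = df.withColumn("event_time", F.to_timestamp("event_time"))
df.printSchema()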
2
Solved
I would like to perform an action on a single column.
Unfortunately, after I transform that column, it is no longer part of the dataframe it came from but a Column object. As such, it cannot ...
Quidnunc asked 19/2, 2016 at 0:32
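A short sketch of the standard fix: a transformed column is a Column expression, and withColumn (or select) attaches it back to a DataFrame. Names below are hypothetical:

from pyspark.sql import functions as F

df = spark.createDataFrame([("alice",)], ["name"])

upper_col = F.upper(F.col("name"))      # a Column expression, not a DataFrame

df2 = df.withColumn("name", upper_col)  # attach it back: a DataFrame again
df2.show()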
4
I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried.
#initialize pyspark
import findsp...
Muller asked 22/2, 2020 at 16:43
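Spark's JDBC data source only runs queries, not stored procedures, so a commonly suggested workaround is to call the JDBC driver directly through the JVM gateway. A sketch with placeholder connection details and a hypothetical procedure name:

# Assumes an active SparkSession bound to spark (as on Databricks)
jdbc_url = "jdbc:sqlserver://<server>:1433;database=<db>"

driver_manager = spark.sparkContext._gateway.jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url, "<user>", "<password>")
try:
    stmt = conn.prepareCall("{call dbo.my_procedure(?)}")  # hypothetical proc
    stmt.setString(1, "some_argument")
    stmt.execute()
finally:
    conn.close()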
3
I have 7 classes and the total number of records is 115, and I wanted to run a Random Forest model over this data. But the data is not enough to get high accuracy, so I wanted to apply oversampl...
Paulettepauley asked 26/12, 2018 at 20:31
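A naive oversampling sketch: sample each minority class with replacement until it roughly matches the majority class. The label column name and the tiny DataFrame are assumptions; note that with only 115 records, duplicated rows can easily leak between train and test splits:

from pyspark.sql import functions as F

df = spark.createDataFrame([(0, 1.0), (0, 2.0), (1, 3.0)], ["label", "feature"])

counts = {r["label"]: r["count"] for r in df.groupBy("label").count().collect()}
max_count = max(counts.values())

parts = []
for label, n in counts.items():
    part = df.filter(F.col("label") == label)
    if n < max_count:
        # fraction > 1 is allowed when sampling with replacement
        part = part.sample(withReplacement=True, fraction=max_count / n, seed=42)
    parts.append(part)

balanced = parts[0]
for p in parts[1:]:
    balanced = balanced.unionByName(p)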
3
I'm trying to set the log level in a pyspark job. I'm not using the spark shell, so I can't just do what it advises and call sc.setLogLevel(newLevel), since I don't have an sc object.
A lot of sou...
Scare asked 27/3, 2018 at 22:26
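In a submitted job there is no pre-built sc, but the context is reachable from the SparkSession you create yourself:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-job").getOrCreate()
spark.sparkContext.setLogLevel("WARN")  # same call, via the session's context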
8
Solved
What is the correct way to install the delta module in Python?
In the example they import the module
from delta.tables import *
but I did not find the correct way to install the module in my v...
Querist asked 17/12, 2019 at 11:37
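The route documented in the Delta Lake quickstart (assuming a reasonably recent release) is pip install delta-spark plus a helper that wires the jars into the session builder:

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

from delta.tables import *  # now resolvable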
2
I am able to run pyspark and run a script in a Jupyter notebook.
But when I try to run the file from the terminal using spark-submit, I get this error:
Error executing Jupyter command file path [Errn...
Wallaby asked 30/9, 2017 at 23:16
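The message suggests PYSPARK_DRIVER_PYTHON is set to jupyter, so spark-submit tries to launch the driver through Jupyter. A sketch of clearing those variables for the submit call (the script name is a placeholder):

import os
import subprocess

env = {k: v for k, v in os.environ.items()
       if k not in ("PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS")}

# spark-submit now starts a plain Python driver instead of Jupyter
subprocess.run(["spark-submit", "my_script.py"], env=env, check=True)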
4
On an AWS EMR cluster, I'm trying to write a query result to parquet using Pyspark but face the following error:
Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields ar...
Wreckful asked 10/1, 2020 at 1:13
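The Hive Parquet writer rejects empty maps and arrays; one commonly suggested workaround is converting them to NULL before writing. A sketch with a hypothetical map column:

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, {"a": 1}), (2, {})],
                           "id INT, tags MAP<STRING, INT>")

# when() without otherwise() yields NULL, replacing the empty map
df_clean = df.withColumn("tags",
                         F.when(F.size("tags") > 0, F.col("tags")))
df_clean.write.mode("overwrite").parquet("s3://bucket/out")  # illustrative path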
2
Spark memory overhead has been asked about multiple times on SO, and I went through most of those questions. However, after going through multiple blogs, I got confused.
Below are the questions I have
whether ...
Shrewmouse asked 24/8, 2020 at 12:39
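The sub-questions are cut off, but as a reference point: overhead memory is configured separately from the JVM heap, e.g. (values are illustrative only):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "8g")           # JVM heap per executor
    .config("spark.executor.memoryOverhead", "2g")   # off-heap: Python workers, buffers
    .getOrCreate()
)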
6
Solved
I have a data frame (df).
To show its schema I use:
from pyspark.sql.functions import *
df1.printSchema()
And I get the following result:
#root
# |-- name: string (nullable = true)
# |-- ag...
Gilda asked 7/2, 2018 at 21:6
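The excerpt ends before the actual question; if the goal is to work with the schema rather than just print it, it is available programmatically (df1 recreated here as a hypothetical stand-in matching the shown fields):

df1 = spark.createDataFrame([("Alice", 30)], ["name", "age"])

schema = df1.schema               # StructType object, reusable in reads
print(df1.schema.simpleString())  # compact one-line form
print(df1.schema.json())          # JSON form, easy to store and reload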
2
Solved
Delta table delete operation is given here for Python and SQL, and truncate using SQL is given here. But I cannot find the documentation for Python truncate table.
How to do it for delta table in D...
Muriel asked 13/5, 2021 at 10:58
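A sketch of the closest Python-API equivalent: DeltaTable.delete() with no predicate removes every row (path and table name are placeholders):

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/mnt/delta/my_table")
dt.delete()  # no condition: deletes all rows, like TRUNCATE

# Alternatively, SQL can be issued from Python (supported on Databricks):
spark.sql("TRUNCATE TABLE my_delta_table")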
3
I am using foreachBatch in pyspark structured streaming to write each microbatch to SQL Server using JDBC. I need to use the same process for several tables, and I'd like to reuse the same writer f...
Chirm asked 3/5, 2019 at 16:12
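One way to reuse a single writer across tables is to fix the table name with functools.partial, leaving the (df, epoch_id) signature foreachBatch expects. Connection details and names are placeholders:

from functools import partial

def write_to_sql(batch_df, batch_id, table_name):
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://<server>;database=<db>")
        .option("dbtable", table_name)
        .mode("append")
        .save())

query = (stream_df.writeStream  # stream_df: the streaming DataFrame in question
         .foreachBatch(partial(write_to_sql, table_name="dbo.events"))
         .start())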
3
Solved
I am trying to save a DataFrame to S3 in PySpark in Spark 1.4 using DataFrameWriter
df = sqlContext.read.format("json").load("s3a://somefile")
df_writer = pyspark.sql.DataFrameWriter(df)
df_writer...
Breezeway asked 16/6, 2015 at 18:4
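In Spark 1.4 the writer is not meant to be constructed directly; every DataFrame exposes one through the .write property:

df = sqlContext.read.format("json").load("s3a://somefile")

# .write returns a DataFrameWriter already bound to df
df.write.format("json").mode("overwrite").save("s3a://somebucket/out")  # bucket is illustrative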
3
I am trying to load a table from an SQLite .db file stored on a local disk. Is there any way to do this in PySpark?
My solution works but is not elegant. I read the table using Pandas through sqlite...
Quince asked 16/8, 2016 at 22:16
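A sketch of the pure-JDBC route, assuming the sqlite-jdbc driver jar is supplied (e.g. via spark-submit --jars); the path and table name are placeholders:

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlite:/path/to/local.db")
      .option("dbtable", "my_table")
      .option("driver", "org.sqlite.JDBC")
      .load())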
2
Solved
I want to know exactly what I can do in spark without triggering the computation of the spark RDD/DataFrame.
It's my understanding that only actions trigger the execution of the transformations in ...
Cusp asked 8/7, 2024 at 21:20
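A small sketch of the distinction: transformations and even plan inspection stay lazy; only an action runs a job:

from pyspark.sql import functions as F

df = spark.range(10)

# Transformations only build the logical plan; nothing executes yet
planned = df.filter(F.col("id") % 2 == 0).withColumn("sq", F.col("id") ** 2)

planned.explain()  # printing the plan still does not trigger execution

planned.count()    # an action: this finally runs the job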
1
I am working with a large dataset that includes multiple unique groups of data identified by a date and a group ID. Each group contains multiple IDs, each with several attributes. Here’s a simplifi...
7
I use this method to write a csv file, but it generates multiple part files. That is not what I want; I need it in one file. And I also found another post using Scala to force everyth...
Quern asked 12/4, 2016 at 13:21
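The usual single-file workaround is coalesce(1), which funnels everything through one task, so it only suits output small enough for a single executor:

(df.coalesce(1)
   .write
   .option("header", "true")
   .mode("overwrite")
   .csv("/tmp/out"))  # still a directory; it will contain one part file

Spark always writes a directory of part files; the lone part file inside can be renamed afterwards if a bare filename is required.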
3
Solved
I am trying to create a new column by adding two existing columns in my dataframe.
Original dataframe
╔══════╦══════╗
║ cola ║ colb ║
╠══════╬══════╣
║ 1    ║ 1    ║
║ null ║ 3    ║
║ 2    ║ null ║
║ 4    ║ 2 ...
Chinch asked 18/10, 2018 at 1:47
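A sketch of the standard fix: NULL + anything is NULL, so coalesce() substitutes 0 before the addition:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 1), (None, 3), (2, None), (4, 2)], ["cola", "colb"]
)

df = df.withColumn(
    "total", F.coalesce("cola", F.lit(0)) + F.coalesce("colb", F.lit(0))
)
df.show()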
4
I am getting the below error while running a pyspark program in PyCharm:
Error:
java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file ...
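A commonly suggested fix on Windows is pointing Spark at the interpreter PyCharm itself is running, before the session is created:

import os
import sys

os.environ["PYSPARK_PYTHON"] = sys.executable         # interpreter for workers
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable  # interpreter for the driver

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()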
4
Solved
I have access to an HDFS file system and can see parquet files with
hadoop fs -ls /user/foo
How can I copy those parquet files to my local system and convert them to csv so I can use them? The fi...
Mclaurin asked 9/9, 2016 at 21:29
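One route that stays inside PySpark: read the Parquet directory from HDFS and write CSV to a local path (paths are illustrative, and the file:// output must fit on one machine):

df = spark.read.parquet("hdfs:///user/foo")
df.coalesce(1).write.option("header", "true").csv("file:///tmp/foo_csv")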
4
Let us consider the following PySpark code
my_df = (spark.read.format("csv")
.option("header","true")
.option("inferSchema", "true")
.load(my_data_p...
Burnie asked 28/3, 2022 at 4:27