pyspark Questions
3
Solved
I have data with the below schema. I want all of the columns to be sorted alphabetically, in a PySpark DataFrame.
root
|-- _id: string (nullable = true)
|-- first_name: string (nullable...
Swatow asked 6/9, 2019 at 11:54
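A minimal sketch of one way to do this, assuming the goal is simply reordering columns (the sample DataFrame below is a hypothetical stand-in for the one with the schema above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the DataFrame in the question
df = spark.createDataFrame([("1", "Alice")], ["_id", "first_name"])

# select() with the sorted column names rewrites the projection alphabetically
df_sorted = df.select(sorted(df.columns))
df_sorted.printSchema()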
3
Solved
I have a Dataframe with different columns where one of the columns is an array of structs:
+----------+---------+--------------------------------------+
|id |title | values|
+----------+---------+-...
Bigamist asked 17/6, 2018 at 22:43
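The excerpt cuts off before stating the goal; one common operation on an array-of-structs column is flattening it with explode, sketched here with hypothetical data and field names (assumes an active SparkSession bound to spark):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, "a", [("x", 10), ("y", 20)])],
    "id INT, title STRING, values ARRAY<STRUCT<k: STRING, v: INT>>",
)

# explode() turns each struct in the array into its own row,
# after which the struct's fields can be selected directly
flat = df.select("id", "title", F.explode("values").alias("value"))
flat.select("id", "title", "value.k", "value.v").show()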
3
Solved
I'm pretty new to coding so I apologize for this being a stupid question. I'm writing a Spark function that takes in a file path and file type and creates a dataframe. If the input is invalid, I want...
Denticulation asked 21/5, 2020 at 19:36
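A minimal sketch of the usual pattern, with an assumed set of valid file types and a hypothetical function name (assumes an active SparkSession bound to spark):

SUPPORTED_TYPES = {"csv", "json", "parquet"}  # assumed valid file types

def load_dataframe(path: str, file_type: str):
    """Return a DataFrame for the given path, or raise on an invalid type."""
    if file_type not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {file_type!r}")
    return spark.read.format(file_type).load(path)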
3
Solved
This has a different answer from those given in the post above.
I am getting an error that reads
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manua...
Proust asked 2/11, 2018 at 16:54
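This error typically means Spark found no readable Parquet files at the path (an empty or wrong directory), so there is nothing to infer a schema from. A sketch of the explicit-schema workaround, with a hypothetical field and path:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("_id", StringType(), True)])  # assumed field

# With an explicit schema, Spark no longer needs to infer one,
# though an empty path will still yield an empty DataFrame
df = spark.read.schema(schema).parquet("/path/to/parquet")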
2
Solved
I am trying to create a dataframe from the following list:
data = [(1,'abc','2020-08-20 10:00:00', 'I'),
(1,'abc','2020-08-20 10:01:00', 'U'),
(1,'abc','2020-08-21 10:02:00', 'U'),
(2,'pqr','2020-0...
Wilterdink asked 19/7, 2021 at 5:56
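A minimal sketch, with column names assumed since the excerpt does not give them:

from pyspark.sql import functions as F

data = [  # small stand-in for the list in the question
    (1, "abc", "2020-08-20 10:00:00", "I"),
    (1, "abc", "2020-08-20 10:01:00", "U"),
]

df = spark.createDataFrame(data, ["id", "name", "event_time", "op"])

# Cast the string timestamps to a proper timestamp column
df = df.withColumn("event_time", F.to_timestamp("event_time"))
df.printSchema()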
2
Solved
I would like to perform an action on a single column.
Unfortunately, after I transform that column, it is no longer part of the dataframe it came from but a Column object. As such, it cannot ...
Quidnunc asked 19/2, 2016 at 0:32
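A short sketch of the standard fix: a transformed column is a Column expression, and withColumn (or select) attaches it back to a DataFrame. Names below are hypothetical:

from pyspark.sql import functions as F

df = spark.createDataFrame([("alice",)], ["name"])

upper_col = F.upper(F.col("name"))      # a Column expression, not a DataFrame

df2 = df.withColumn("name", upper_col)  # attach it back: a DataFrame again
df2.show()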
4
I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried.
#initialize pyspark
import findsp...
Muller asked 22/2, 2020 at 16:43
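Spark's JDBC data source only runs queries, not stored procedures, so a commonly suggested workaround is to call the JDBC driver directly through the JVM gateway. A sketch with placeholder connection details and a hypothetical procedure name:

# Assumes an active SparkSession bound to spark (as on Databricks)
jdbc_url = "jdbc:sqlserver://<server>:1433;database=<db>"

driver_manager = spark.sparkContext._gateway.jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url, "<user>", "<password>")
try:
    stmt = conn.prepareCall("{call dbo.my_procedure(?)}")  # hypothetical proc
    stmt.setString(1, "some_argument")
    stmt.execute()
finally:
    conn.close()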
3
I have 7 classes and the total number of records is 115, and I wanted to run a Random Forest model over this data. But the data is not enough to get high accuracy, so I wanted to apply oversampl...
Paulettepauley asked 26/12, 2018 at 20:31
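A naive oversampling sketch: sample each minority class with replacement until it roughly matches the majority class. The label column name and the tiny DataFrame are assumptions; note that with only 115 records, duplicated rows can easily leak between train and test splits:

from pyspark.sql import functions as F

df = spark.createDataFrame([(0, 1.0), (0, 2.0), (1, 3.0)], ["label", "feature"])

counts = {r["label"]: r["count"] for r in df.groupBy("label").count().collect()}
max_count = max(counts.values())

parts = []
for label, n in counts.items():
    part = df.filter(F.col("label") == label)
    if n < max_count:
        # fraction > 1 is allowed when sampling with replacement
        part = part.sample(withReplacement=True, fraction=max_count / n, seed=42)
    parts.append(part)

balanced = parts[0]
for p in parts[1:]:
    balanced = balanced.unionByName(p)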
3
I'm trying to set the log level in a pyspark job. I'm not using the spark shell, so I can't just do what it advises and call sc.setLogLevel(newLevel), since I don't have an sc object.
A lot of sou...
Scare asked 27/3, 2018 at 22:26
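In a submitted job there is no pre-built sc, but the context is reachable from the SparkSession you create yourself:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-job").getOrCreate()
spark.sparkContext.setLogLevel("WARN")  # same call, via the session's context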
8
Solved
What is the correct way to install the delta module in Python?
In the example they import the module
from delta.tables import *
but I did not find the correct way to install the module in my v...
Querist asked 17/12, 2019 at 11:37
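The route documented in the Delta Lake quickstart (assuming a reasonably recent release) is pip install delta-spark plus a helper that wires the jars into the session builder:

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

from delta.tables import *  # now resolvable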
2
I am able to run pyspark and run a script in a Jupyter notebook.
But when I try to run the file from the terminal using spark-submit, I get this error:
Error executing Jupyter command file path [Errn...
Wallaby asked 30/9, 2017 at 23:16
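The message suggests PYSPARK_DRIVER_PYTHON is set to jupyter, so spark-submit tries to launch the driver through Jupyter. A sketch of clearing those variables for the submit call (the script name is a placeholder):

import os
import subprocess

env = {k: v for k, v in os.environ.items()
       if k not in ("PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS")}

# spark-submit now starts a plain Python driver instead of Jupyter
subprocess.run(["spark-submit", "my_script.py"], env=env, check=True)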
4
On an AWS EMR cluster, I'm trying to write a query result to parquet using Pyspark but face the following error:
Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields ar...
Wreckful asked 10/1, 2020 at 1:13
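The Hive Parquet writer rejects empty maps and arrays; one commonly suggested workaround is converting them to NULL before writing. A sketch with a hypothetical map column:

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, {"a": 1}), (2, {})],
                           "id INT, tags MAP<STRING, INT>")

# when() without otherwise() yields NULL, replacing the empty map
df_clean = df.withColumn("tags",
                         F.when(F.size("tags") > 0, F.col("tags")))
df_clean.write.mode("overwrite").parquet("s3://bucket/out")  # illustrative path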
2
Spark memory overhead has been asked about multiple times on SO, and I went through most of those questions. However, after going through multiple blogs, I got confused.
Below are the questions I have
whether ...
Shrewmouse asked 24/8, 2020 at 12:39
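The sub-questions are cut off, but as a reference point: overhead memory is configured separately from the JVM heap, e.g. (values are illustrative only):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "8g")           # JVM heap per executor
    .config("spark.executor.memoryOverhead", "2g")   # off-heap: Python workers, buffers
    .getOrCreate()
)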
6
Solved
I have a data frame (df).
To show its schema I use:
from pyspark.sql.functions import *
df1.printSchema()
And I get the following result:
#root
# |-- name: string (nullable = true)
# |-- ag...
Gilda asked 7/2, 2018 at 21:6
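The excerpt ends before the actual question; if the goal is to work with the schema rather than just print it, it is available programmatically (df1 recreated here as a hypothetical stand-in matching the shown fields):

df1 = spark.createDataFrame([("Alice", 30)], ["name", "age"])

schema = df1.schema               # StructType object, reusable in reads
print(df1.schema.simpleString())  # compact one-line form
print(df1.schema.json())          # JSON form, easy to store and reload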
2
Solved
Delta table delete operation is given here for Python and SQL, and truncate using SQL is given here. But I cannot find the documentation for Python truncate table.
How to do it for delta table in D...
Muriel asked 13/5, 2021 at 10:58
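A sketch of the closest Python-API equivalent: DeltaTable.delete() with no predicate removes every row (path and table name are placeholders):

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/mnt/delta/my_table")
dt.delete()  # no condition: deletes all rows, like TRUNCATE

# Alternatively, SQL can be issued from Python (supported on Databricks):
spark.sql("TRUNCATE TABLE my_delta_table")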
3
I am using foreachBatch in pyspark structured streaming to write each microbatch to SQL Server using JDBC. I need to use the same process for several tables, and I'd like to reuse the same writer f...
Chirm asked 3/5, 2019 at 16:12
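One way to reuse a single writer across tables is to fix the table name with functools.partial, leaving the (df, epoch_id) signature foreachBatch expects. Connection details and names are placeholders:

from functools import partial

def write_to_sql(batch_df, batch_id, table_name):
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://<server>;database=<db>")
        .option("dbtable", table_name)
        .mode("append")
        .save())

query = (stream_df.writeStream  # stream_df: the streaming DataFrame in question
         .foreachBatch(partial(write_to_sql, table_name="dbo.events"))
         .start())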
3
Solved
I am trying to save a DataFrame to S3 in PySpark in Spark 1.4 using DataFrameWriter
df = sqlContext.read.format("json").load("s3a://somefile")
df_writer = pyspark.sql.DataFrameWriter(df)
df_writer...
Breezeway asked 16/6, 2015 at 18:4
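In Spark 1.4 the writer is not meant to be constructed directly; every DataFrame exposes one through the .write property:

df = sqlContext.read.format("json").load("s3a://somefile")

# .write returns a DataFrameWriter already bound to df
df.write.format("json").mode("overwrite").save("s3a://somebucket/out")  # bucket is illustrative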
3
I am trying to load a table from an SQLite .db file stored on a local disk. Is there any way to do this in PySpark?
My solution works but is not elegant. I read the table using Pandas through sqlite...
Quince asked 16/8, 2016 at 22:16
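A sketch of the pure-JDBC route, assuming the sqlite-jdbc driver jar is supplied (e.g. via spark-submit --jars); the path and table name are placeholders:

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlite:/path/to/local.db")
      .option("dbtable", "my_table")
      .option("driver", "org.sqlite.JDBC")
      .load())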
2
Solved
I want to know exactly what I can do in spark without triggering the computation of the spark RDD/DataFrame.
It's my understanding that only actions trigger the execution of the transformations in ...
Cusp asked 8/7, 2024 at 21:20
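A small sketch of the distinction: transformations and even plan inspection stay lazy; only an action runs a job:

from pyspark.sql import functions as F

df = spark.range(10)

# Transformations only build the logical plan; nothing executes yet
planned = df.filter(F.col("id") % 2 == 0).withColumn("sq", F.col("id") ** 2)

planned.explain()  # printing the plan still does not trigger execution

planned.count()    # an action: this finally runs the job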
1
I am working with a large dataset that includes multiple unique groups of data identified by a date and a group ID. Each group contains multiple IDs, each with several attributes. Here’s a simplifi...
7
I use this method to write a csv file, but it generates multiple part files. That is not what I want; I need it in one file. And I also found another post using Scala to force everyth...
Quern asked 12/4, 2016 at 13:21
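The usual single-file workaround is coalesce(1), which funnels everything through one task, so it only suits output small enough for a single executor:

(df.coalesce(1)
   .write
   .option("header", "true")
   .mode("overwrite")
   .csv("/tmp/out"))  # still a directory; it will contain one part file

Spark always writes a directory of part files; the lone part file inside can be renamed afterwards if a bare filename is required.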
3
Solved
I am trying to create a new column by adding two existing columns in my dataframe.
Original dataframe
╔══════╦══════╗
║ cola ║ colb ║
╠══════╬══════╣
║ 1    ║ 1    ║
║ null ║ 3    ║
║ 2    ║ null ║
║ 4    ║ 2 ...
Chinch asked 18/10, 2018 at 1:47
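A sketch of the standard fix: NULL + anything is NULL, so coalesce() substitutes 0 before the addition:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 1), (None, 3), (2, None), (4, 2)], ["cola", "colb"]
)

df = df.withColumn(
    "total", F.coalesce("cola", F.lit(0)) + F.coalesce("colb", F.lit(0))
)
df.show()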
4
I am getting the below error while running a pyspark program in PyCharm:
Error:
java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file ...
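A commonly suggested fix on Windows is pointing Spark at the interpreter PyCharm itself is running, before the session is created:

import os
import sys

os.environ["PYSPARK_PYTHON"] = sys.executable         # interpreter for workers
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable  # interpreter for the driver

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()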
4
Solved
I have access to an HDFS file system and can see parquet files with
hadoop fs -ls /user/foo
How can I copy those parquet files to my local system and convert them to csv so I can use them? The fi...
Mclaurin asked 9/9, 2016 at 21:29
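One route that stays inside PySpark: read the Parquet directory from HDFS and write CSV to a local path (paths are illustrative, and the file:// output must fit on one machine):

df = spark.read.parquet("hdfs:///user/foo")
df.coalesce(1).write.option("header", "true").csv("file:///tmp/foo_csv")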
4
Let us consider the following PySpark code
my_df = (spark.read.format("csv")
.option("header","true")
.option("inferSchema", "true")
.load(my_data_p...
Burnie asked 28/3, 2022 at 4:27