pyspark Questions
2
Solved
I'm trying to use the toPandas() function of pyspark on a simple dataframe with an id column (int), a score column (float) and a "pass" column (boolean).
My problem is that whenever I cal...
Ibert asked 16/2, 2023 at 19:32
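A minimal sketch of the conversion in question, assuming an existing spark session and hypothetical values for the three columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# hypothetical rows matching the described schema: id (int), score (float), pass (boolean)
df = spark.createDataFrame([(1, 0.9, True), (2, 0.4, False)], ["id", "score", "pass"])
pdf = df.toPandas()  # collects every row to the driver as a pandas DataFrame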
4
Solved
I am writing a parquet file from a Spark DataFrame the following way:
df.write.parquet("path/myfile.parquet", mode = "overwrite", compression="gzip")
This creates a folder with multiple files in...
Brahui asked 15/1, 2019 at 15:20
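If a single output file is the goal, one commonly suggested workaround (a sketch, not the only option) is to collapse the dataframe to one partition before writing; Spark still creates a folder, but it contains a single part file:

# coalesce(1) forces one partition, hence one part file inside the output folder
df.coalesce(1).write.parquet("path/myfile.parquet", mode="overwrite", compression="gzip")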
8
Solved
I am trying to manually create a pyspark dataframe given certain data:
row_in = [(1566429545575348), (40.353977), (-111.701859)]
rdd = sc.parallelize(row_in)
schema = StructType(
    [
        StructField("...
Stunner asked 16/9, 2019 at 15:11
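The likely pitfall in the snippet is that (1566429545575348) is a plain scalar, not a one-element tuple, so the three values arrive as three separate rows. A sketch of one fix, assuming the values form a single row and using hypothetical field names:

from pyspark.sql.types import StructType, StructField, LongType, DoubleType

schema = StructType([
    StructField("ts", LongType(), True),    # field names here are assumptions
    StructField("lat", DoubleType(), True),
    StructField("lon", DoubleType(), True),
])
row_in = [(1566429545575348, 40.353977, -111.701859)]  # one row as one tuple
df = spark.createDataFrame(row_in, schema)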
2
Solved
I would like to compute the maximum of a subset of columns for each row and add it as a new column for the existing Dataframe.
I managed to do this in a very awkward way:
def add_colmax(df,subset_c...
Spellbinder asked 29/11, 2016 at 19:54
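A less awkward alternative, assuming the subset columns are comparable types, is the built-in row-wise maximum pyspark.sql.functions.greatest:

from pyspark.sql import functions as F

subset_cols = ["col_a", "col_b", "col_c"]  # hypothetical subset of columns
df = df.withColumn("max_of_subset", F.greatest(*[F.col(c) for c in subset_cols]))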
6
Solved
I just installed pyspark 2.2.0 using conda (using python v3.6 on windows 7 64bit, java v1.8)
$conda install pyspark
It downloaded and seemed to install correctly with no errors. Now when I run p...
Conspiracy asked 20/10, 2017 at 12:58
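Since the error text is cut off, this is only an assumption, but a frequent cause on Windows is that the interpreter cannot locate the Spark installation; findspark is a commonly suggested remedy:

import findspark
findspark.init()  # locate SPARK_HOME before pyspark is imported

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()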
3
Solved
I have researched this a lot, but I am unable to find a way to add multiple columns to a PySpark DataFrame at specific positions.
I have the dataframe that looks like this:
Custo...
Ayo asked 27/3, 2019 at 16:41
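withColumn always appends at the end, so one sketch (with hypothetical new columns x and y placed after the first column) is to add the columns and then fix the order with select:

from pyspark.sql import functions as F

df = df.withColumn("x", F.lit(0)).withColumn("y", F.lit(0))  # hypothetical columns
cols = [c for c in df.columns if c not in ("x", "y")]
df = df.select(cols[0], "x", "y", *cols[1:])  # select() controls column order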
1
Solved
I encountered an issue while trying to store JSON data as a Delta Lake table using PySpark and Delta Lake.
Here's my code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructT...
Glasgo asked 7/6, 2024 at 5:28
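The usual setup, sketched here with the delta-spark pip package (an assumption, since the code is truncated), registers the Delta extensions on the session and writes with format("delta"):

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.read.json("data.json")  # hypothetical input path
df.write.format("delta").mode("overwrite").save("/tmp/delta/my_table")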
1
I have a Red Hat system in AWS running Spark on top of HDFS. Now I want to access PySpark from my local machine, i.e., interactive Python.
So, I installed Spyder-Py2 to connect to the remote AWS ma...
Seeto asked 7/3, 2016 at 7:58
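A hedged sketch of the intended setup: point a local SparkSession at the remote standalone master (host and port below are placeholders, and the master must be reachable from the local machine):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("spark://<aws-host>:7077")  # placeholder master URL
    .appName("remote-session")
    .getOrCreate())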
7
Solved
I have a pandas data frame which I want to convert into a Spark data frame. Usually I use the code below to create a Spark data frame from pandas, but all of a sudden I started to get the error below. I ...
Supersaturated asked 4/4, 2023 at 7:32
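Without the full traceback this is only a guess, but a common failure from that period is pandas 2.0 removing DataFrame.iteritems, which older pyspark releases call during conversion; the conversion itself is just:

import pandas as pd

pdf = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})  # hypothetical pandas frame
sdf = spark.createDataFrame(pdf)  # breaks with pandas>=2.0 on older pyspark
# commonly cited workarounds: upgrade pyspark, or pin pandas below 2.0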
7
How to find the size (in MB) of a dataframe in pyspark:
df = spark.read.json("/Filestore/tables/test.json")
I want to find the size of df or of test.json.
Noteworthy asked 16/6, 2020 at 15:15
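There is no public size-in-MB API; one hedged sketch leans on the optimizer's size estimate, which is an internal interface and may change between versions:

df = spark.read.json("/Filestore/tables/test.json")
df.cache().count()  # materialize so the estimate reflects actual data
size_bytes = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print(size_bytes / (1024 * 1024), "MB")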
2
Solved
I want to count how many records are true in a column of a grouped Spark dataframe, but I don't know how to do that in Python. For example, I have data with a region, a salary and an IsUnemployed ...
Adai asked 18/2, 2016 at 22:28
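A minimal sketch, assuming a count of True values per region is wanted: cast the boolean to an int and sum it inside the aggregation:

from pyspark.sql import functions as F

result = (df.groupBy("region")
    .agg(F.sum(F.col("IsUnemployed").cast("int")).alias("unemployed_count")))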
2
Solved
I want to limit the rate when fetching data from kafka. My code looks like:
df = spark.read.format('kafka') \
.option("kafka.bootstrap.servers",'...')\
.option("subscribe",'A') \
.option("start...
Overstuffed asked 26/6, 2018 at 0:17
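For rate limiting, the knob usually cited is maxOffsetsPerTrigger, which caps how many offsets each micro-batch consumes; note it applies to the streaming reader, while the snippet above uses the batch reader:

df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "...")
    .option("subscribe", "A")
    .option("maxOffsetsPerTrigger", 10000)  # cap records fetched per trigger
    .load())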
2
Solved
This recent blog post from Databricks https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html says that the only change needed to a pandas program to run it under pyspar...
Audraaudras asked 26/10, 2021 at 21:17
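The change the post describes is essentially the import: the pandas API on Spark ships as pyspark.pandas in Spark 3.2+, so a sketch of the swap (with a hypothetical file) looks like:

# before: import pandas as pd
import pyspark.pandas as ps  # Spark 3.2+ successor to the koalas package

psdf = ps.read_csv("data.csv")  # hypothetical file; same pandas-style API
psdf["total"] = psdf["a"] + psdf["b"]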
3
Solved
Starting to use PySpark on Databricks, and I see I can import pyspark.pandas alongside pandas. What is the difference?
I assume it's not like koalas, right?
5
Solved
I create my pyspark dataframe:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, BinaryType, ArrayType, StringType, TimestampType
input_schema = StructTyp...
Monsour asked 21/4, 2023 at 11:21
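A hypothetical completion of the truncated schema, only to illustrate how the imported types compose (field names are assumptions):

input_schema = StructType([
    StructField("id", StringType(), True),
    StructField("payload", BinaryType(), True),
    StructField("tags", ArrayType(StringType()), True),
    StructField("created_at", TimestampType(), True),
])
df = spark.createDataFrame([], schema=input_schema)  # empty frame with this schema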
2
I'm using spark 2.0.1,
df.show()
+--------+------+---+-----+-----+----+
|Survived|Pclass|Sex|SibSp|Parch|Fare|
+--------+------+---+-----+-----+----+
| 0.0| 3.0|1.0| 1.0| 0.0| 7.3|
| 1.0| 1.0|0....
Kennethkennett asked 14/12, 2018 at 22:24
4
I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]).
I am using Spark 1.4.1, and it is set up on a very powerful machine (2 cpu,...
Prescribe asked 24/8, 2015 at 17:36
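The usual fix for large JDBC reads, sketched with placeholder values, is to partition the read so it does not flow through a single connection:

df = (sqlContext.read.format("jdbc")
    .option("url", "jdbc:teradata://<host>/...")  # placeholder URL
    .option("dbtable", "my_table")                # hypothetical table
    .option("partitionColumn", "id")              # assumed numeric column
    .option("lowerBound", "1")
    .option("upperBound", "100000000")
    .option("numPartitions", "16")                # parallel JDBC connections
    .load())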
3
Solved
I am loading some data into Spark with a wrapper function:
def load_data(filename):
    df = sqlContext.read.format("com.databricks.spark.csv")\
        .option("delimiter", "\t")\
        .option("header", "f...
Quota asked 5/10, 2016 at 7:50
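A hypothetical completion of the wrapper, assuming it should return the dataframe; the option values past the truncation are guesses:

def load_data(filename):
    df = (sqlContext.read.format("com.databricks.spark.csv")
        .option("delimiter", "\t")
        .option("header", "false")      # assumed value; the original is cut off
        .option("inferSchema", "true")  # assumed option
        .load(filename))
    return df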
7
I have a pandas data frame my_df, and my_df.dtypes gives us:
ts int64
fieldA object
fieldB object
fieldC object
fieldD object
fieldE object
dtype: object
Then I am trying to convert the pandas d...
Bevon asked 9/11, 2016 at 23:11
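Object dtypes give Spark nothing reliable to infer from, so one hedged sketch passes an explicit schema (the string types for fieldA..fieldE are assumptions):

from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([StructField("ts", LongType(), True)] + [
    StructField(name, StringType(), True)
    for name in ["fieldA", "fieldB", "fieldC", "fieldD", "fieldE"]
])
sdf = spark.createDataFrame(my_df, schema=schema)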
2
This is my dataframe; I'm trying to drop the duplicate columns with the same name using their index:
df = spark.createDataFrame([(1,2,3,4,5)],['c','b','a','a','b'])
df.show()
Output:
+---+---+---+---+---+...
Ascribe asked 18/12, 2019 at 18:35
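Spark cannot select an ambiguous name, so one sketch renames every column positionally and then keeps only the first occurrence of each original name:

# give each column a unique positional name, then keep first occurrences
uniq = df.toDF(*["{}_{}".format(c, i) for i, c in enumerate(df.columns)])
first = {}
for i, c in enumerate(df.columns):
    first.setdefault(c, "{}_{}".format(c, i))
result = uniq.select(*first.values()).toDF(*first.keys())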
2
Using Apache Spark (or pyspark) I can read/load a text file into a spark dataframe and load that dataframe into a sql db, as follows:
df = spark.read.csv("MyFilePath/MyDataFile.txt", sep=...
Ottie asked 7/7, 2022 at 2:13
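The write side, sketched with the generic JDBC writer (URL, table and credentials are placeholders; the matching JDBC driver must be on the classpath):

(df.write.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<db>")  # placeholder connection
    .option("dbtable", "my_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())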
6
So, I am trying to create a Spark session in Python 2.7 using the following:
#Initialize SparkSession and SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkContext
#Crea...
Galcha asked 9/5, 2017 at 7:20
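A minimal sketch of the session creation; SparkSession wraps SparkContext, so building a separate context is usually unnecessary:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("my-app")       # hypothetical app name
    .getOrCreate())
sc = spark.sparkContext      # the underlying SparkContext, if needed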
6
I am currently employed as a Junior Data Developer and recently saw a post saying that Azure Synapse can now create SQL tables from Delta tables. I tried creating an SQL table from a Delta table wh...
Mckeever asked 26/2, 2021 at 13:12
2
I have two datasets stored as parquet files with schemas as below:
Dataset 1:
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|  v1|  v3|
|  2|  v2|  v4|
+---+----+----+
Dataset 2:
+---+----+----+
| id|col3|col4|
+---+----+----+
|  1|  v5|  v7|
|  2|  v6|  v8|
+---+----+----+
I want to join the two dat...
Broch asked 9/4, 2024 at 13:0
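A sketch of the join being described, assuming an inner join on id and placeholder paths:

df1 = spark.read.parquet("dataset1.parquet")  # placeholder paths
df2 = spark.read.parquet("dataset2.parquet")
joined = df1.join(df2, on="id", how="inner")  # columns: id, col1, col2, col3, col4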
2
This question is similar to the one asked here, but the answer does not help me clearly understand what user memory in Spark actually is.
Can you help me understand with an example? Like, an examp...
Scrape asked 26/11, 2022 at 22:38
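For orientation (paraphrasing the Spark tuning docs rather than the linked answer): user memory is the heap left over outside Spark's unified region, roughly (heap - 300MB reserved) * (1 - spark.memory.fraction), and it holds user-created objects such as data structures built inside UDFs. A sketch of the relevant knobs:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.executor.memory", "4g")
    .config("spark.memory.fraction", "0.6")  # default; the remainder is user memory
    .getOrCreate())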