pyspark Questions
2
Solved
I'm trying to use the toPandas() function of pyspark on a simple dataframe with an id column (int), a score column (float) and a "pass" column (boolean).
My problem is that whenever I cal...
Ibert asked 16/2, 2023 at 19:32
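A minimal sketch of the conversion in question, assuming an existing spark session and hypothetical values for the three columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# hypothetical rows matching the described schema: id (int), score (float), pass (boolean)
df = spark.createDataFrame([(1, 0.9, True), (2, 0.4, False)], ["id", "score", "pass"])
pdf = df.toPandas()  # collects every row to the driver as a pandas DataFrame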
4
Solved
I am writing a parquet file from a Spark DataFrame the following way:
df.write.parquet("path/myfile.parquet", mode = "overwrite", compression="gzip")
This creates a folder with multiple files in...
Brahui asked 15/1, 2019 at 15:20
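If a single output file is the goal, one commonly suggested workaround (a sketch, not the only option) is to collapse the dataframe to one partition before writing; Spark still creates a folder, but it contains a single part file:

# coalesce(1) forces one partition, hence one part file inside the output folder
df.coalesce(1).write.parquet("path/myfile.parquet", mode="overwrite", compression="gzip")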
8
Solved
I am trying to manually create a pyspark dataframe given certain data:
row_in = [(1566429545575348), (40.353977), (-111.701859)]
rdd = sc.parallelize(row_in)
schema = StructType(
    [
        StructField("...
Stunner asked 16/9, 2019 at 15:11
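The likely pitfall in the snippet is that (1566429545575348) is a plain scalar, not a one-element tuple, so the three values arrive as three separate rows. A sketch of one fix, assuming the values form a single row and using hypothetical field names:

from pyspark.sql.types import StructType, StructField, LongType, DoubleType

schema = StructType([
    StructField("ts", LongType(), True),    # field names here are assumptions
    StructField("lat", DoubleType(), True),
    StructField("lon", DoubleType(), True),
])
row_in = [(1566429545575348, 40.353977, -111.701859)]  # one row as one tuple
df = spark.createDataFrame(row_in, schema)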
2
Solved
I would like to compute the maximum of a subset of columns for each row and add it as a new column for the existing Dataframe.
I managed to do this in a very awkward way:
def add_colmax(df,subset_c...
Spellbinder asked 29/11, 2016 at 19:54
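A less awkward alternative, assuming the subset columns are comparable types, is the built-in row-wise maximum pyspark.sql.functions.greatest:

from pyspark.sql import functions as F

subset_cols = ["col_a", "col_b", "col_c"]  # hypothetical subset of columns
df = df.withColumn("max_of_subset", F.greatest(*[F.col(c) for c in subset_cols]))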
6
Solved
I just installed pyspark 2.2.0 using conda (using python v3.6 on windows 7 64bit, java v1.8)
$conda install pyspark
It downloaded and seemed to install correctly with no errors. Now when I run p...
Conspiracy asked 20/10, 2017 at 12:58
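Since the error text is cut off, this is only an assumption, but a frequent cause on Windows is that the interpreter cannot locate the Spark installation; findspark is a commonly suggested remedy:

import findspark
findspark.init()  # locate SPARK_HOME before pyspark is imported

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()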
3
Solved
I have researched this a lot, but I am unable to find a way to add multiple columns to a PySpark DataFrame at specific positions.
I have the dataframe that looks like this:
Custo...
Ayo asked 27/3, 2019 at 16:41
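withColumn always appends at the end, so one sketch (with hypothetical new columns x and y placed after the first column) is to add the columns and then fix the order with select:

from pyspark.sql import functions as F

df = df.withColumn("x", F.lit(0)).withColumn("y", F.lit(0))  # hypothetical columns
cols = [c for c in df.columns if c not in ("x", "y")]
df = df.select(cols[0], "x", "y", *cols[1:])  # select() controls column order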
1
Solved
I encountered an issue while trying to store JSON data as a Delta Lake table using PySpark and Delta Lake.
Here's my code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructT...
Glasgo asked 7/6, 2024 at 5:28
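The usual setup, sketched here with the delta-spark pip package (an assumption, since the code is truncated), registers the Delta extensions on the session and writes with format("delta"):

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.read.json("data.json")  # hypothetical input path
df.write.format("delta").mode("overwrite").save("/tmp/delta/my_table")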
1
I have a Red Hat system in AWS running Spark on top of HDFS. Now I want to access PySpark from my local machine, i.e., interactive Python.
So, I installed Spyder-Py2 to connect to the remote AWS ma...
Seeto asked 7/3, 2016 at 7:58
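A hedged sketch of the intended setup: point a local SparkSession at the remote standalone master (host and port below are placeholders, and the master must be reachable from the local machine):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("spark://<aws-host>:7077")  # placeholder master URL
    .appName("remote-session")
    .getOrCreate())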
7
Solved
I have a pandas data frame which I want to convert into a Spark data frame. Usually I use the code below to create a Spark data frame from pandas, but all of a sudden I started to get the error below. I ...
Supersaturated asked 4/4, 2023 at 7:32
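Without the full traceback this is only a guess, but a common failure from that period is pandas 2.0 removing DataFrame.iteritems, which older pyspark releases call during conversion; the conversion itself is just:

import pandas as pd

pdf = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})  # hypothetical pandas frame
sdf = spark.createDataFrame(pdf)  # breaks with pandas>=2.0 on older pyspark
# commonly cited workarounds: upgrade pyspark, or pin pandas below 2.0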
7
How to find the size (in MB) of a dataframe in pyspark:
df = spark.read.json("/Filestore/tables/test.json")
I want to find the size of df or of test.json.
Noteworthy asked 16/6, 2020 at 15:15
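There is no public size-in-MB API; one hedged sketch leans on the optimizer's size estimate, which is an internal interface and may change between versions:

df = spark.read.json("/Filestore/tables/test.json")
df.cache().count()  # materialize so the estimate reflects actual data
size_bytes = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print(size_bytes / (1024 * 1024), "MB")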
2
Solved
I want to count how many records are true in a column of a grouped Spark dataframe, but I don't know how to do that in Python. For example, I have data with a region, a salary and an IsUnemployed ...
Adai asked 18/2, 2016 at 22:28
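A minimal sketch, assuming a count of True values per region is wanted: cast the boolean to an int and sum it inside the aggregation:

from pyspark.sql import functions as F

result = (df.groupBy("region")
    .agg(F.sum(F.col("IsUnemployed").cast("int")).alias("unemployed_count")))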
2
Solved
I want to limit the rate when fetching data from kafka. My code looks like:
df = spark.read.format('kafka') \
.option("kafka.bootstrap.servers",'...')\
.option("subscribe",'A') \
.option("start...
Overstuffed asked 26/6, 2018 at 0:17
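For rate limiting, the knob usually cited is maxOffsetsPerTrigger, which caps how many offsets each micro-batch consumes; note it applies to the streaming reader, while the snippet above uses the batch reader:

df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "...")
    .option("subscribe", "A")
    .option("maxOffsetsPerTrigger", 10000)  # cap records fetched per trigger
    .load())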
2
Solved
This recent blog post from Databricks https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html says that the only change needed to a pandas program to run it under pyspar...
Audraaudras asked 26/10, 2021 at 21:17
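The change the post describes is essentially the import: the pandas API on Spark ships as pyspark.pandas in Spark 3.2+, so a sketch of the swap (with a hypothetical file) looks like:

# before: import pandas as pd
import pyspark.pandas as ps  # Spark 3.2+ successor to the koalas package

psdf = ps.read_csv("data.csv")  # hypothetical file; same pandas-style API
psdf["total"] = psdf["a"] + psdf["b"]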
3
Solved
Starting to use PySpark on Databricks, and I see I can import pyspark.pandas alongside pandas. What is the difference?
I assume it's not like koalas, right?
5
Solved
I create my pyspark dataframe:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, BinaryType, ArrayType, StringType, TimestampType
input_schema = StructTyp...
Monsour asked 21/4, 2023 at 11:21
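A hypothetical completion of the truncated schema, only to illustrate how the imported types compose (field names are assumptions):

input_schema = StructType([
    StructField("id", StringType(), True),
    StructField("payload", BinaryType(), True),
    StructField("tags", ArrayType(StringType()), True),
    StructField("created_at", TimestampType(), True),
])
df = spark.createDataFrame([], schema=input_schema)  # empty frame with this schema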
2
I'm using spark 2.0.1,
df.show()
+--------+------+---+-----+-----+----+
|Survived|Pclass|Sex|SibSp|Parch|Fare|
+--------+------+---+-----+-----+----+
| 0.0| 3.0|1.0| 1.0| 0.0| 7.3|
| 1.0| 1.0|0....
Kennethkennett asked 14/12, 2018 at 22:24
4
I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]).
I am using Spark 1.4.1, and it is set up on a very powerful machine (2 cpu,...
Prescribe asked 24/8, 2015 at 17:36
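The usual fix for large JDBC reads, sketched with placeholder values, is to partition the read so it does not flow through a single connection:

df = (sqlContext.read.format("jdbc")
    .option("url", "jdbc:teradata://<host>/...")  # placeholder URL
    .option("dbtable", "my_table")                # hypothetical table
    .option("partitionColumn", "id")              # assumed numeric column
    .option("lowerBound", "1")
    .option("upperBound", "100000000")
    .option("numPartitions", "16")                # parallel JDBC connections
    .load())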
3
Solved
I am loading some data into Spark with a wrapper function:
def load_data(filename):
    df = sqlContext.read.format("com.databricks.spark.csv")\
        .option("delimiter", "\t")\
        .option("header", "f...
Quota asked 5/10, 2016 at 7:50
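A hypothetical completion of the wrapper, assuming it should return the dataframe; the option values past the truncation are guesses:

def load_data(filename):
    df = (sqlContext.read.format("com.databricks.spark.csv")
        .option("delimiter", "\t")
        .option("header", "false")      # assumed value; the original is cut off
        .option("inferSchema", "true")  # assumed option
        .load(filename))
    return df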
7
I have a pandas data frame my_df, and my_df.dtypes gives us:
ts int64
fieldA object
fieldB object
fieldC object
fieldD object
fieldE object
dtype: object
Then I am trying to convert the pandas d...
Bevon asked 9/11, 2016 at 23:11
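Object dtypes give Spark nothing reliable to infer from, so one hedged sketch passes an explicit schema (the string types for fieldA..fieldE are assumptions):

from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([StructField("ts", LongType(), True)] + [
    StructField(name, StringType(), True)
    for name in ["fieldA", "fieldB", "fieldC", "fieldD", "fieldE"]
])
sdf = spark.createDataFrame(my_df, schema=schema)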
2
This is my dataframe; I'm trying to drop the duplicate columns with the same name using their index:
df = spark.createDataFrame([(1,2,3,4,5)],['c','b','a','a','b'])
df.show()
Output:
+---+---+---+---+---+...
Ascribe asked 18/12, 2019 at 18:35
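Spark cannot select an ambiguous name, so one sketch renames every column positionally and then keeps only the first occurrence of each original name:

# give each column a unique positional name, then keep first occurrences
uniq = df.toDF(*["{}_{}".format(c, i) for i, c in enumerate(df.columns)])
first = {}
for i, c in enumerate(df.columns):
    first.setdefault(c, "{}_{}".format(c, i))
result = uniq.select(*first.values()).toDF(*first.keys())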
2
Using Apache Spark (or pyspark) I can read/load a text file into a spark dataframe and load that dataframe into a sql db, as follows:
df = spark.read.csv("MyFilePath/MyDataFile.txt", sep=...
Ottie asked 7/7, 2022 at 2:13
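The write side, sketched with the generic JDBC writer (URL, table and credentials are placeholders; the matching JDBC driver must be on the classpath):

(df.write.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<db>")  # placeholder connection
    .option("dbtable", "my_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())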
6
So, I am trying to create a Spark session in Python 2.7 using the following:
#Initialize SparkSession and SparkContext
from pyspark.sql import SparkSession
from pyspark import SparkContext
#Crea...
Galcha asked 9/5, 2017 at 7:20
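A minimal sketch of the session creation; SparkSession wraps SparkContext, so building a separate context is usually unnecessary:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("my-app")       # hypothetical app name
    .getOrCreate())
sc = spark.sparkContext      # the underlying SparkContext, if needed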
6
I am currently employed as a Junior Data Developer and recently saw a post saying that Azure Synapse can now create SQL tables from Delta tables. I tried creating an SQL table from a Delta table wh...
Mckeever asked 26/2, 2021 at 13:12
2
I have two datasets stored as parquet files with schemas as below:
Dataset 1:
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|  v1|  v3|
|  2|  v2|  v4|
+---+----+----+
Dataset 2:
+---+----+----+
| id|col3|col4|
+---+----+----+
|  1|  v5|  v7|
|  2|  v6|  v8|
+---+----+----+
I want to join the two dat...
Broch asked 9/4, 2024 at 13:0
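A sketch of the join being described, assuming an inner join on id and placeholder paths:

df1 = spark.read.parquet("dataset1.parquet")  # placeholder paths
df2 = spark.read.parquet("dataset2.parquet")
joined = df1.join(df2, on="id", how="inner")  # columns: id, col1, col2, col3, col4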
2
This question is similar to the one asked here, but the answer does not help me clearly understand what user memory in Spark actually is.
Can you help me understand with an example? Like, an examp...
Scrape asked 26/11, 2022 at 22:38
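For orientation (paraphrasing the Spark tuning docs rather than the linked answer): user memory is the heap left over outside Spark's unified region, roughly (heap - 300MB reserved) * (1 - spark.memory.fraction), and it holds user-created objects such as data structures built inside UDFs. A sketch of the relevant knobs:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.executor.memory", "4g")
    .config("spark.memory.fraction", "0.6")  # default; the remainder is user memory
    .getOrCreate())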