apache-spark-sql - 3

3

extracting numpy array from Pyspark Dataframe

numpy apache-spark pyspark apache-spark-sql apache-spark-mllib

Krems asked 8/2, 2017 at 14:42

3

Solved

Spark Parquet read error : java.io.EOFException: Reached the end of stream with XXXXX bytes left to read

While reading Parquet files in Spark, you may face the problem below. App > Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 faile...

apache-spark apache-spark-sql parquet

Tamandua asked 30/10, 2019 at 6:14

3

How to get the number of records written (using DataFrameWriter's save operation)?

Is there any way to get the number of records written when using Spark to save records? While I know it isn't in the spec currently, I'd like to be able to do something like: val count = df.write....

scala apache-spark apache-spark-sql

Ulita asked 12/5, 2017 at 9:30

3

Solved

Comparison operator in PySpark (not equal/ !=)

I am trying to obtain all rows in a dataframe where two flags are set to '1', and subsequently all those where only one of the two is set to '1' and the other is NOT EQUAL to '1'. With the following s...

sql apache-spark pyspark null apache-spark-sql

Kirstinkirstyn asked 24/8, 2016 at 10:36

5

Using pyspark, how do I read multiple JSON documents on a single line in a file into a dataframe?

Using Spark 2.3, I know I can read a file of JSON documents like this: {'key': 'val1'} {'key': 'val2'} with this: spark.read.json('filename'). How can I read the following into a dataframe wh...

apache-spark dataframe pyspark apache-spark-sql

Midgett asked 12/7, 2018 at 20:52

4

Solved

How to get Kafka offsets for structured query for manual and reliable offset management?

Spark 2.2 introduced a Kafka source for Structured Streaming. As I understand it, it relies on an HDFS checkpoint directory to store offsets and guarantee "exactly-once" message delivery. But old do...

apache-spark apache-kafka apache-spark-sql offset spark-structured-streaming

Dorsey asked 11/9, 2017 at 10:7

5

Solved

GroupBy column and filter rows with maximum value in Pyspark

I am almost certain this has been asked before, but a search through Stack Overflow did not answer my question. It is not a duplicate of [2], since I want the maximum value, not the most frequent item. I a...

python apache-spark pyspark apache-spark-sql

Argon asked 16/2, 2018 at 15:31

5

Pyspark. spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, java.net.SocketException: Connection reset

I am new to PySpark, and I'm trying to run multiple time series in Prophet with PySpark (as distributed computing, because I have hundreds of time series to predict), but I get the error below. import t...

python apache-spark pyspark apache-spark-sql

Petes asked 15/11, 2021 at 11:37

3

Py4JException: Constructor org.apache.spark.sql.SparkSession([class org.apache.spark.SparkContext, class java.util.HashMap]) does not exist

I am trying to run a Spark session in a Jupyter Notebook on an EC2 Linux machine via Visual Studio Code. My code looks as follows: from pyspark.sql import SparkSession spark = SparkSession.build...

python apache-spark pyspark apache-spark-sql jupyter-notebook

Libnah asked 5/7, 2022 at 18:25

4

Solved

How to add a nested column to a DataFrame

scala apache-spark apache-spark-sql

Villainage asked 1/3, 2018 at 9:49

2

Access AWS Glue from local Spark

Is there any way to run local-master Spark SQL queries against AWS Glue? I launch this code on my local PC: SparkSession.builder() .master("local") .enableHiveSupport() .config("hive.metastore.c...

amazon-web-services apache-spark apache-spark-sql aws-glue

Distant asked 15/9, 2018 at 12:49

12

Solved

pyspark show dataframe as table with horizontal scroll in ipython notebook

A pyspark.sql.DataFrame displays messily with DataFrame.show(): lines wrap instead of scrolling. It displays properly with pandas.DataFrame.head, though. I tried these options: import IPython IPython.auto_scro...

pandas pyspark ipython jupyter-notebook apache-spark-sql

Zohara asked 15/4, 2017 at 14:17

3

pyspark - get consistent random value across Spark sessions

I want to add a column of random values to a dataframe (has an id for each row) for something I am testing. I am struggling to get reproducible results across Spark sessions - same random value aga...

apache-spark random pyspark apache-spark-sql

Andryc asked 27/11, 2019 at 20:21

4

Solved

Pandas dataframe to Spark dataframe "Can not merge type error"

I have CSV data and created a Pandas dataframe using read_csv, forcing all columns to string. When I then try to create a Spark dataframe from the Pandas dataframe, I get the error message below. f...

pandas apache-spark dataframe pyspark apache-spark-sql

Kohler asked 5/8, 2016 at 17:8

5

Solved

Can I change the nullability of a column in my Spark dataframe?

I have a StructField in a dataframe that is not nullable. Simple example: import pyspark.sql.functions as F from pyspark.sql.types import * l = [('Alice', 1)] df = sqlContext.createDataFrame(l, ['...

python pyspark apache-spark-sql

Klatt asked 6/9, 2017 at 10:6

7

Can unix_timestamp() return unix time in milliseconds in Apache Spark?

I'm trying to get the unix time from a timestamp field in milliseconds (13 digits) but currently it returns in seconds (10 digits). scala> var df = Seq("2017-01-18 11:00:00.000", "2017-01-18 1...

apache-spark apache-spark-sql unix-timestamp

Nonmaterial asked 14/2, 2017 at 23:10

2

Solved

SPARK SQL Equivalent of Qualify + Row_number statements

Does anyone know the best way for Apache Spark SQL to achieve the same results as the standard SQL qualify() + rnk or row_number statements? For example: I have a Spark Dataframe called statemen...

sql apache-spark apache-spark-sql window-functions row-number

Diseuse asked 21/7, 2015 at 20:22

1

moduleNotFoundError in Pyspark running with Spark-Submit

I have ETL code written in PySpark, and two bash scripts to run it. When I use this script, it runs without any problems: #!/bin/bash cd /root/Desktop/project rm e...

apache-spark pyspark apache-spark-sql

Bluecoat asked 18/1 at 6:12

3

spark 2.4 Parquet column cannot be converted in file, Column: [Impressions], Expected: bigint, Found: BINARY

I'm facing a weird issue that I cannot understand. I have source data with a column "Impressions" that is sometimes a bigint / sometimes a string (when I manually explore the data). The HIVE sche...

apache-spark pyspark apache-spark-sql

Marc asked 28/11, 2019 at 21:5

2

How to check if data is cached in dataframe or not yet cached due to lazy execution in Pyspark?

My question is a little different from the other questions I could find on Stack Overflow. I need to know whether the data has already been retrieved and stored in a dataframe, or whether that is yet to happen. I am doing ...

pyspark apache-spark-sql

Ariosto asked 22/7, 2020 at 12:57

7

Solved

Pyspark: Replacing value in a column by searching a dictionary

I'm a newbie in PySpark. I have a Spark DataFrame df that has a column 'device_type'. I want to replace every value that is "Tablet" or "Phone" with "Phone", and replace "PC" with "Desktop". In ...

python apache-spark dataframe pyspark apache-spark-sql

Scarificator asked 15/5, 2017 at 9:45

4

Solved

How to handle changing parquet schema in Apache Spark

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of s3://bucketName/prefix/YYYY/MM/DD/) but I cannot read the data in AWS EMR Spark from different dates becaus...

apache-spark apache-spark-sql parquet amazon-emr

Cuckoopint asked 2/12, 2016 at 7:52

7

to_date fails to parse date in Spark 3.0

I am trying to parse a date using to_date(), but I get the following exception. SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '12/1/2010 8:26'...

apache-spark pyspark apache-spark-sql spark3

Pergolesi asked 16/7, 2020 at 21:44

2

Solved

PySpark: DataFrame - Convert Struct to Array

apache-spark pyspark apache-spark-sql

Candlestand asked 3/12, 2017 at 8:24

11

How to create an empty DataFrame with a specified schema?

I want to create a DataFrame with a specified schema in Scala. I have tried using a JSON read (that is, reading an empty file), but I don't think that's the best practice.

dataframe scala apache-spark apache-spark-sql schema

Bathyal asked 17/7, 2015 at 13:58
