apache-spark-sql Questions

3

I have a dataframe gi_man_df where group can be n: +------------------+-----------------+--------+--------------+ | group | number|rand_int| rand_double| +------------------+-----------------+----...

3

Solved

While reading parquet files in spark, if you face the below problem. App > Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 faile...
Tamandua asked 30/10, 2019 at 6:14

3

Is there any way to get the number of records written when using spark to save records? While I know it isn't in the spec currently, I'd like to be able to do something like: val count = df.write....
Ulita asked 12/5, 2017 at 9:30

3

Solved

I am trying to obtain all rows in a dataframe where two flags are set to '1', and subsequently all those where only one of the two is set to '1' and the other is NOT EQUAL to '1'. With the following s...
Kirstinkirstyn asked 24/8, 2016 at 10:36

5

Using Spark 2.3, I know I can read a file of JSON documents like this: {'key': 'val1'} {'key': 'val2'} With this: spark.read.json('filename') How can I read the following into a dataframe wh...
Midgett asked 12/7, 2018 at 20:52

4

Solved

Spark 2.2 introduced a Kafka structured streaming source. As I understand it, it relies on an HDFS checkpoint directory to store offsets and guarantee "exactly-once" message delivery. But old do...

5

Solved

I am almost certain this has been asked before, but a search through Stack Overflow did not answer my question. Not a duplicate of [2] since I want the maximum value, not the most frequent item. I a...
Argon asked 16/2, 2018 at 15:31

5

I am new to PySpark, and I'm trying to run multiple time series in Prophet with PySpark (as distributed computing, because I have hundreds of time series to predict), but I get the error below. import t...
Petes asked 15/11, 2021 at 11:37

3

I am trying to run a Spark session in a Jupyter Notebook on an EC2 Linux machine via Visual Studio Code. My code looks as follows: from pyspark.sql import SparkSession spark = SparkSession.build...

4

Solved

I have a dataframe df with the following schema: root |-- city_name: string (nullable = true) |-- person: struct (nullable = true) | |-- age: long (nullable = true) | |-- name: string (nullabl...
Villainage asked 1/3, 2018 at 9:49

2

Is there any way to run local master Spark SQL queries against AWS Glue? I launch this code on my local PC: SparkSession.builder() .master("local") .enableHiveSupport() .config("hive.metastore.c...

12

Solved

A pyspark.sql.DataFrame displays messily with DataFrame.show(): lines wrap instead of scrolling. But it displays properly with pandas.DataFrame.head. I tried these options: import IPython IPython.auto_scro...
Zohara asked 15/4, 2017 at 14:17

3

I want to add a column of random values to a dataframe (which has an id for each row) for something I am testing. I am struggling to get reproducible results across Spark sessions - same random value aga...
Andryc asked 27/11, 2019 at 20:21

4

Solved

I have CSV data and created a Pandas dataframe using read_csv, forcing all columns to string. Then when I try to create a Spark dataframe from the Pandas dataframe, I get the error message below. f...

5

Solved

I have a StructField in a dataframe that is not nullable. Simple example: import pyspark.sql.functions as F from pyspark.sql.types import * l = [('Alice', 1)] df = sqlContext.createDataFrame(l, ['...
Klatt asked 6/9, 2017 at 10:6

7

I'm trying to get the unix time from a timestamp field in milliseconds (13 digits) but currently it returns in seconds (10 digits). scala> var df = Seq("2017-01-18 11:00:00.000", "2017-01-18 1...
Nonmaterial asked 14/2, 2017 at 23:10

2

Solved

Does anyone know the best way for Apache Spark SQL to achieve the same results as the standard SQL qualify() + rnk or row_number statements? For example: I have a Spark Dataframe called statemen...

1

I have ETL code written in PySpark. I have two bash scripts to run the code. When I use this script, it runs without any problems: #!/bin/bash cd /root/Desktop/project rm e...
Bluecoat asked 18/1 at 6:12

3

I'm facing a weird issue that I cannot understand. I have source data with a column "Impressions" that is sometimes a bigint / sometimes a string (when I manually explore the data). The HIVE sche...
Marc asked 28/11, 2019 at 21:5

2

My question is a little different from the other questions I could find on Stack Overflow. I need to know whether the data has already been retrieved and stored in a dataframe or whether that is yet to happen. I am doing ...
Ariosto asked 22/7, 2020 at 12:57

7

Solved

I'm a newbie in PySpark. I have a Spark DataFrame df that has a column 'device_type'. I want to replace every value that is "Tablet" or "Phone" with "Phone", and replace "PC" with "Desktop". In ...
Scarificator asked 15/5, 2017 at 9:45

4

Solved

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of s3://bucketName/prefix/YYYY/MM/DD/) but I cannot read the data in AWS EMR Spark from different dates becaus...
Cuckoopint asked 2/12, 2016 at 7:52

7

I am trying to parse a date using to_date() but I get the following exception. SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '12/1/2010 8:26'...
Pergolesi asked 16/7, 2020 at 21:44

2

Solved

I have a dataframe in the following structure: root |-- index: long (nullable = true) |-- text: string (nullable = true) |-- topicDistribution: struct (nullable = true) | |-- type: long (nulla...
Candlestand asked 3/12, 2017 at 8:24

11

I want to create a DataFrame with a specified schema in Scala. I have tried to use JSON read (I mean reading an empty file) but I don't think that's the best practice.
Bathyal asked 17/7, 2015 at 13:58
