apache-spark-sql - 4

5

Solved

How to select all columns instead of hard coding each one?

A PySpark Dataframe is in following format : To just access the stddev row of columns c1,c2,c3 I use : df.describe().createOrReplaceTempView("table1") df2 = sqlContext.sql("SELECT c1 AS f1, c...

apache-spark pyspark apache-spark-sql

Wivinah asked 22/2, 2017 at 0:49

1

Converting data frame columns in category type in pyspark

I have a data frame df and there I want to convert some columns into category type. Using pandas I can do it like below way: for col in categorical_collist: df[col] = df[col].astype('category') ...

pandas apache-spark pyspark types apache-spark-sql

Vino asked 8/9, 2020 at 7:41

8

Solved

How to explode multiple columns of a dataframe in pyspark

I have a dataframe which consists lists in columns similar to the following. The length of the lists in all columns is not same. Name Age Subjects Grades [Bob] [16] [Maths,Physics,Chemistry] [A,B,C...

python dataframe apache-spark pyspark apache-spark-sql

Jobi asked 28/6, 2018 at 12:19

1

How to workaround this case of lateral join with Spark SQL?

I have a lateral join defined in this way: select A.id, B.value from A left join lateral ( select value from B where B.id = A.id limit 1 ) as X on true; that has the particular point of having...

apache-spark apache-spark-sql lateral-join

Superfuse asked 29/6, 2020 at 7:52

26

Solved

How to change dataframe column names in PySpark?

I come from pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command: df.columns = new_co...

python apache-spark pyspark apache-spark-sql rename

Monteverdi asked 3/12, 2015 at 22:21

2

Solved

Run a sql query on a PySpark DataFrame

I am using Databricks and I already have loaded some DataTables. However, I have a complex SQL query that I want to operate on these data tables, and I wonder if i could avoid translating it in p...

apache-spark-sql

Conductance asked 7/8, 2019 at 10:43

4

Solved

How to expire state of dropDuplicates in structured streaming to avoid OOM?

I want to count the unique access for each day using spark structured streaming, so I use the following code .dropDuplicates("uuid") and in the next day the state maintained for today should be ...

apache-spark duplicates apache-spark-sql out-of-memory spark-structured-streaming

Obed asked 3/8, 2017 at 3:27

3

Solved

Filter Spark DataFrame by checking if value is in a list, with other criteria

As a simplified example, I tried to filter a Spark DataFrame with following code: val xdf = sqlContext.createDataFrame(Seq( ("A", 1), ("B", 2), ("C", 3) )).toDF("name", "cnt") xdf.filter($"cnt" &...

scala apache-spark apache-spark-sql

Wink asked 29/11, 2015 at 9:55

2

Unable to read from s3 bucket using spark

val spark = SparkSession .builder() .appName("try1") .master("local") .getOrCreate() val df = spark.read .json("s3n://BUCKET-NAME/FOLDER/FILE.json") .select($"uid").show(5) I have given th...

scala amazon-web-services apache-spark amazon-s3 apache-spark-sql

Shumate asked 16/6, 2017 at 12:43

3

How to convert a column in H2OFrame to a python list?

I've read the PythonBooklet.pdf by H2O.ai and the python API documentation, but still can't find a clean way to do this. I know I can do either of the following: Convert H2OFrame to Spark DataFra...

apache-spark apache-spark-sql h2o

Toh asked 3/4, 2017 at 16:5

11

Solved

Spark saveAsTextFile() results in Mkdirs failed to create for half of the directory

I am currently running a Java Spark Application in tomcat and receiving the following exception: Caused by: java.io.IOException: Mkdirs failed to create file:/opt/folder/tmp/file.json/_temporary/0...

java tomcat apache-spark apache-spark-sql

Aksoyn asked 3/3, 2016 at 17:13

2

Solved

What does "Determining location of DBIO file fragments..." mean, and how do I speed it up?

When running simple SQL commands in Databricks, sometimes I get the message: Determining location of DBIO file fragments. This operation can take some time. What does this mean, and how do I ...

apache-spark-sql databricks

Jerold asked 30/11, 2019 at 20:11

4

Solved

Count Non Null values in column in PySpark

I have a dataframe which contains null values: from pyspark.sql import functions as F df = spark.createDataFrame( [(125, '2012-10-10', 'tv'), (20, '2012-10-10', 'phone'), (40, '2012-10-10', 'tv...

apache-spark pyspark apache-spark-sql count null

Gowrie asked 5/2, 2018 at 21:50

3

Solved

How to run Spark SQL Thrift Server in local mode and connect to Delta using JDBC

I'd like connect to Delta using JDBC and would like to run the Spark Thrift Server (STS) in local mode to kick the tyres. I start STS using the following command: $SPARK_HOME/sbin/start-thriftserve...

apache-spark apache-spark-sql delta-lake spark-thriftserver

Sprue asked 6/11, 2021 at 8:8

2

Solved

Memory issue when importing parquet files in Spark

I am trying to query data from parquet files in Scala Spark (1.5), including a query of 2 million rows ("variants" in the following code). val sqlContext = new org.apache.spark.sql.SQLContext(sc) ...

scala apache-spark apache-spark-sql parquet

Laryngeal asked 22/3, 2016 at 1:1

8

Pyspark: show histogram of a data frame column

In pandas data frame, I am using the following code to plot histogram of a column: my_df.hist(column = 'field_1') Is there something that can achieve the same goal in pyspark data frame? (I am i...

python pyspark apache-spark-sql jupyter-notebook

Lahomalahore asked 25/8, 2016 at 20:35

2

Solved

How to use date_add with two columns in pyspark?

apache-spark pyspark apache-spark-sql

Machiavellian asked 26/11, 2019 at 1:22

2

Solved

Where is the reference for options for writing or reading per format?

I use Spark 1.6.1. We are trying to write an ORC file to HDFS using HiveContext and DataFrameWriter. While we can use df.write().orc(<path>) we would rather do something like df.write()...

apache-spark apache-spark-sql apache-spark-1.6

Coldblooded asked 5/6, 2017 at 8:44

3

AttributeError: 'NoneType' object has no attribute 'sc'

Excuse me.Today i want to run a program about how to create DataFrame with sqlContext in Pyspark.The result is a AttributeError,which is"AttributeError: 'NoneType' object has no attribute 'sc'" My...

pyspark apache-spark-sql

Serrated asked 28/11, 2016 at 7:59

3

Solved

Spark 'limit' does not run in parallel?

I have a simple join where I limit on of the sides. In the explain plan I see that before the limit is executed there is an ExchangeSingle operation, indeed I see that at this stage there is only o...

apache-spark pyspark apache-spark-sql

Togoland asked 22/7, 2018 at 13:56

5

Solved

Why does join fail with "java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]"?

I am using Spark 1.5. I have two dataframes of the form: scala> libriFirstTable50Plus3DF res1: org.apache.spark.sql.DataFrame = [basket_id: string, family_id: int] scala> linkPersonItemLes...

scala apache-spark join apache-spark-sql

Whole asked 13/12, 2016 at 14:43

2

Solved

Count number of duplicate rows in SPARKSQL

I have requirement where i need to count number of duplicate rows in SparkSQL for Hive tables. from pyspark import SparkContext, SparkConf from pyspark.sql import HiveContext from pyspark.sql.type...

pyspark apache-spark-sql

Vhf asked 1/2, 2018 at 2:53

2

Solved

PySpark aggregation function for "any value"

I have a PySpark Dataframe with an A field, few B fields that dependent on A (A->B) and C fields that I want to aggregate per each A. For example: A | B | C ---------- A | 1 | 6 A | 1 | 7 B | 2...

python apache-spark pyspark apache-spark-sql coalesce

Dispend asked 25/2, 2018 at 12:49

11

Solved

Renaming columns for PySpark DataFrame aggregates

I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df that I am aggregating: (df.groupBy("group") .agg({"money":"sum"}) .show(100) ) This ...

dataframe apache-spark pyspark apache-spark-sql

Bearskin asked 1/5, 2015 at 14:1

3

How to infer schema of serialized JSON column in Spark SQL?

I have a table where there is 1 column which is serialized JSON. I want to apply schema inference on this JSON column. I don't know schema to pass as input for JSON extraction (e.g: from_json funct...

json apache-spark pyspark apache-spark-sql

Cortex asked 29/8, 2021 at 16:13

apache-spark-sql Questions

Recommended topics

Hot tags