apache-spark-sql Questions

5

Solved

A PySpark DataFrame is in the following format. To access just the stddev row of columns c1, c2, c3 I use: df.describe().createOrReplaceTempView("table1") df2 = sqlContext.sql("SELECT c1 AS f1, c...
Wivinah asked 22/2, 2017 at 0:49
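
A minimal sketch of a view-free alternative: describe() already exposes a "summary" column that can be filtered directly (the column names c1..c3 come from the excerpt):

    # Filter describe() output down to the stddev row, no temp view needed.
    stats = df.describe()
    stddev_row = stats.filter(stats["summary"] == "stddev").select("c1", "c2", "c3")
    stddev_row.show()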

1

I have a data frame df in which I want to convert some columns to category type. Using pandas I can do it as follows: for col in categorical_collist: df[col] = df[col].astype('category') ...
Vino asked 8/9, 2020 at 7:41
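
Spark has no pandas-style category dtype; a common stand-in, sketched here under that assumption, is StringIndexer, which encodes each string column as numeric category indices:

    from pyspark.ml.feature import StringIndexer

    # categorical_collist is the asker's list of column names.
    for col in categorical_collist:
        indexer = StringIndexer(inputCol=col, outputCol=col + "_idx")
        df = indexer.fit(df).transform(df)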

8

Solved

I have a dataframe whose columns contain lists, similar to the following; the lists are not all the same length. Name Age Subjects Grades [Bob] [16] [Maths,Physics,Chemistry] [A,B,C...
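
One common approach, sketched here assuming Spark 2.4+ (arrays_zip pads the shorter lists with null):

    from pyspark.sql import functions as F

    # Zip the per-row lists together, explode into one row per element,
    # then unpack the struct fields.
    exploded = (df
        .withColumn("tmp", F.arrays_zip("Name", "Age", "Subjects", "Grades"))
        .withColumn("tmp", F.explode("tmp"))
        .select("tmp.Name", "tmp.Age", "tmp.Subjects", "tmp.Grades"))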

1

I have a lateral join defined in this way: select A.id, B.value from A left join lateral ( select value from B where B.id = A.id limit 1 ) as X on true; it has the particular feature of having...
Superfuse asked 29/6, 2020 at 7:52
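
Spark only gained LATERAL joins in 3.2, so here is a sketch of the same "one matching B row per A row" semantics via a window function (the ordering is an assumption, since LIMIT 1 without ORDER BY picks an arbitrary row):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Keep exactly one B row per id, then left-join it back to A.
    w = Window.partitionBy("id").orderBy("value")
    first_b = (B.withColumn("rn", F.row_number().over(w))
                 .filter(F.col("rn") == 1)
                 .drop("rn"))
    result = A.join(first_b, on="id", how="left").select("id", "value")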

26

Solved

I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful with the command: df.columns = new_co...
Monteverdi asked 3/12, 2015 at 22:21
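
The closest Spark analogue is toDF, sketched here with placeholder names:

    # Rename every column at once, in the spirit of pandas' df.columns = new_cols.
    new_cols = ["id", "name", "value"]  # placeholder names
    df = df.toDF(*new_cols)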

2

Solved

I am using Databricks and have already loaded some data tables. However, I have a complex SQL query that I want to run against these tables, and I wonder if I could avoid translating it to p...
Conductance asked 7/8, 2019 at 10:43
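
A sketch of the usual pattern: register each DataFrame as a temporary view and run the SQL unchanged (view and column names here are placeholders):

    # Expose the loaded tables to the SQL engine, then query them as-is.
    df1.createOrReplaceTempView("table1")
    df2.createOrReplaceTempView("table2")
    result = spark.sql("""
        SELECT t1.key, t2.value
        FROM table1 t1
        JOIN table2 t2 ON t1.key = t2.key
    """)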

4

Solved

I want to count the unique accesses for each day using Spark Structured Streaming, so I use the following code: .dropDuplicates("uuid"), and on the next day the state maintained for today should be ...
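
A sketch of one way to keep the deduplication state from growing past a day: watermark on an event-time column (assumed here to be eventTime) and include the day in the dedup key:

    from pyspark.sql import functions as F

    # Old state is dropped once the watermark passes the end of the day.
    deduped = (events
        .withWatermark("eventTime", "1 day")
        .withColumn("day", F.to_date("eventTime"))
        .dropDuplicates(["uuid", "day"]))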

3

Solved

As a simplified example, I tried to filter a Spark DataFrame with the following code: val xdf = sqlContext.createDataFrame(Seq( ("A", 1), ("B", 2), ("C", 3) )).toDF("name", "cnt") xdf.filter($"cnt" &...
Wink asked 29/11, 2015 at 9:55
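
For reference, the same frame and filter in PySpark (the excerpt's comparison operator is cut off, so a simple > is assumed):

    xdf = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["name", "cnt"])
    xdf.filter(xdf["cnt"] > 1).show()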

2

val spark = SparkSession .builder() .appName("try1") .master("local") .getOrCreate() val df = spark.read .json("s3n://BUCKET-NAME/FOLDER/FILE.json") .select($"uid").show(5) I have given th...
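
Assuming the truncated part concerns S3 credentials, here is a sketch of wiring them into the Hadoop configuration for the s3n connector (note also that .show() returns None, so it should not be part of the df assignment):

    # Set s3n credentials before reading; then read and display separately.
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY")
    hconf.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY")
    df = spark.read.json("s3n://BUCKET-NAME/FOLDER/FILE.json")
    df.select("uid").show(5)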

3

I've read the PythonBooklet.pdf by H2O.ai and the Python API documentation, but still can't find a clean way to do this. I know I can do either of the following: Convert H2OFrame to Spark DataFra...
Toh asked 3/4, 2017 at 16:5
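
A sketch using Sparkling Water's H2OContext, whose as_spark_frame call converts an H2OFrame without going through pandas (treat the exact API as an assumption for your version):

    from pysparkling import H2OContext

    # Bridge H2O and Spark, then convert the frame directly.
    hc = H2OContext.getOrCreate(spark)
    spark_df = hc.as_spark_frame(h2o_frame)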

11

Solved

I am currently running a Java Spark application in Tomcat and receiving the following exception: Caused by: java.io.IOException: Mkdirs failed to create file:/opt/folder/tmp/file.json/_temporary/0...
Aksoyn asked 3/3, 2016 at 17:13

2

Solved

When running simple SQL commands in Databricks, sometimes I get the message: Determining location of DBIO file fragments. This operation can take some time. What does this mean, and how do I ...
Jerold asked 30/11, 2019 at 20:11

4

Solved

I have a dataframe which contains null values: from pyspark.sql import functions as F df = spark.createDataFrame( [(125, '2012-10-10', 'tv'), (20, '2012-10-10', 'phone'), (40, '2012-10-10', 'tv...
Gowrie asked 5/2, 2018 at 21:50
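
A sketch of one common treatment, assuming the goal is to replace the nulls with a default (the column name is a placeholder, since the excerpt cuts off before the schema):

    # Replace nulls in a numeric column with 0; fillna leaves other columns alone.
    df_filled = df.fillna(0, subset=["sales"])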

3

Solved

I'd like to connect to Delta using JDBC and would like to run the Spark Thrift Server (STS) in local mode to kick the tyres. I start STS using the following command: $SPARK_HOME/sbin/start-thriftserve...

2

Solved

I am trying to query data from parquet files in Scala Spark (1.5), including a query of 2 million rows ("variants" in the following code). val sqlContext = new org.apache.spark.sql.SQLContext(sc) ...
Laryngeal asked 22/3, 2016 at 1:1

8

In a pandas data frame, I am using the following code to plot a histogram of a column: my_df.hist(column='field_1') Is there something that can achieve the same goal with a PySpark data frame? (I am i...
Lahomalahore asked 25/8, 2016 at 20:35
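
A sketch of one route: compute the buckets distributed via the RDD histogram helper, then plot the small summary locally (matplotlib assumed available):

    import matplotlib.pyplot as plt

    # histogram() returns bucket boundaries and per-bucket counts.
    buckets, counts = my_df.select("field_1").rdd.flatMap(lambda r: r).histogram(20)
    widths = [b - a for a, b in zip(buckets, buckets[1:])]
    plt.bar(buckets[:-1], counts, width=widths, align="edge")
    plt.show()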

2

Solved

I have a dataframe with some columns: +------------+--------+----------+----------+ |country_name| ID_user|birth_date| psdt| +------------+--------+----------+----------+ | Россия|16460783| 486|19...
Machiavellian asked 26/11, 2019 at 1:22

2

Solved

I use Spark 1.6.1. We are trying to write an ORC file to HDFS using HiveContext and DataFrameWriter. While we can use df.write().orc(<path>) we would rather do something like df.write()...
Coldblooded asked 5/6, 2017 at 8:44
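
Presumably the goal is the explicit format()/option() spelling; a sketch (the option name is an assumption for ORC writers, and may differ on Spark 1.6):

    # Equivalent writer call with format() and options spelled out.
    (df.write
       .format("orc")
       .option("compression", "zlib")
       .mode("overwrite")
       .save("hdfs:///path/to/output"))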

3

Excuse me. Today I want to run a program about how to create a DataFrame with sqlContext in PySpark. The result is an AttributeError: "AttributeError: 'NoneType' object has no attribute 'sc'". My...
Serrated asked 28/11, 2016 at 7:59
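
That error usually means the SparkContext was None when SQLContext(sc) was built; a minimal sketch of the usual fix:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # Create (or reuse) a SparkContext first, then hand it to SQLContext.
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])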

3

Solved

I have a simple join where I limit one of the sides. In the explain plan I see that before the limit is executed there is an ExchangeSingle operation; indeed, I see that at this stage there is only o...
Togoland asked 22/7, 2018 at 13:56
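
For context, limit collapses its input to a single partition, which is the exchange in the plan; a sketch of the common workaround of repartitioning the limited side before the join (the numbers are placeholders):

    # Restore parallelism after the single-partition limit.
    limited = df.limit(1000).repartition(200)
    result = limited.join(other, "key")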

5

Solved

I am using Spark 1.5. I have two dataframes of the form: scala> libriFirstTable50Plus3DF res1: org.apache.spark.sql.DataFrame = [basket_id: string, family_id: int] scala> linkPersonItemLes...
Whole asked 13/12, 2016 at 14:43
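
If one side is small, a broadcast hint is the usual lever; a sketch in PySpark (the second table's name is truncated in the excerpt, so small_df stands in for it):

    from pyspark.sql.functions import broadcast

    # Ship the small table to every executor instead of shuffling both sides.
    joined = libriFirstTable50Plus3DF.join(broadcast(small_df), "basket_id")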

2

Solved

I have a requirement where I need to count the number of duplicate rows in Spark SQL for Hive tables. from pyspark import SparkContext, SparkConf from pyspark.sql import HiveContext from pyspark.sql.type...
Vhf asked 1/2, 2018 at 2:53
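
A sketch of one way: group on every column and keep the groups that occur more than once (the DataFrame name is a placeholder):

    from pyspark.sql import functions as F

    # Rows whose full-column combination appears more than once.
    dupes = df.groupBy(df.columns).count().filter(F.col("count") > 1)
    # Total number of surplus (duplicate) rows.
    dupes.agg(F.sum(F.col("count") - 1).alias("duplicate_rows")).show()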

2

Solved

I have a PySpark DataFrame with an A field, a few B fields that depend on A (A->B), and C fields that I want to aggregate per A. For example: A | B | C ---------- A | 1 | 6 A | 1 | 7 B | 2...
Dispend asked 25/2, 2018 at 12:49
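
Since B is functionally dependent on A, first() is a safe pick inside the group; a sketch (sum for C is an assumption about the desired aggregate):

    from pyspark.sql import functions as F

    # One row per A, carrying its (unique) B and the aggregated C.
    result = df.groupBy("A").agg(F.first("B").alias("B"), F.sum("C").alias("C"))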

11

Solved

I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df that I am aggregating: (df.groupBy("group") .agg({"money":"sum"}) .show(100) ) This ...
Bearskin asked 1/5, 2015 at 14:1
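
The dict form of agg names the output column sum(money); a sketch of the usual alias-based fix:

    from pyspark.sql import functions as F

    # alias() controls the aggregated column's name.
    (df.groupBy("group")
       .agg(F.sum("money").alias("money"))
       .show(100))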

3

I have a table with one column that is serialized JSON. I want to apply schema inference to this JSON column. I don't know the schema to pass as input for JSON extraction (e.g., the from_json funct...
Cortex asked 29/8, 2021 at 16:13
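
A sketch of one way to infer the schema: feed the serialized strings back through spark.read.json, then parse the column with from_json (json_col is an assumed column name):

    from pyspark.sql import functions as F

    # Infer the schema from the JSON strings themselves.
    inferred = spark.read.json(df.rdd.map(lambda r: r["json_col"])).schema
    parsed = df.withColumn("parsed", F.from_json("json_col", inferred))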
