apache-spark-sql Questions
5
Solved
A PySpark DataFrame is in the following format:
To access just the stddev row of columns c1, c2, and c3, I use:
df.describe().createOrReplaceTempView("table1")
df2 = sqlContext.sql("SELECT c1 AS f1, c...
Wivinah asked 22/2, 2017 at 0:49
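For reference, describe() itself returns a small DataFrame whose "summary" column labels each statistics row, so the stddev row can also be taken without a temp view; a minimal sketch, assuming columns c1, c2, c3 as above:

from pyspark.sql import functions as F

# describe() yields rows labelled count, mean, stddev, min, max
# in its "summary" column; filter on that label directly.
stats = df.describe("c1", "c2", "c3")
stddev_row = stats.filter(F.col("summary") == "stddev").select("c1", "c2", "c3")
stddev_row.show()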
1
I have a data frame df in which I want to convert some columns to category type. In pandas I can do it as follows:
for col in categorical_collist:
    df[col] = df[col].astype('category')
...
Vino asked 8/9, 2020 at 7:41
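Spark has no pandas-style category dtype; the closest analogue is StringIndexer from spark.ml, which maps each distinct value to a numeric index. A sketch, assuming the same categorical_collist and a Spark DataFrame df:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# One indexer per categorical column; each writes a numeric <col>_idx.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical_collist]
df_indexed = Pipeline(stages=indexers).fit(df).transform(df)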
8
Solved
I have a dataframe whose columns contain lists, similar to the following. The lengths of the lists are not the same across columns.
Name Age Subjects Grades
[Bob] [16] [Maths,Physics,Chemistry] [A,B,C...
Jobi asked 28/6, 2018 at 12:19
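One common approach (Spark 2.4+) zips the parallel arrays and explodes once; a sketch assuming the column names from the sample and that Subjects and Grades have matching lengths:

from pyspark.sql import functions as F

# arrays_zip pairs Subjects and Grades by position; one explode then
# yields a row per pair. Name and Age are single-element lists here.
flat = (df.withColumn("tmp", F.arrays_zip("Subjects", "Grades"))
          .withColumn("tmp", F.explode("tmp"))
          .select(F.col("Name")[0].alias("Name"),
                  F.col("Age")[0].alias("Age"),
                  F.col("tmp.Subjects").alias("Subject"),
                  F.col("tmp.Grades").alias("Grade")))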
1
I have a lateral join defined in this way:
select A.id, B.value
from A
left join lateral (
    select value
    from B
    where B.id = A.id
    limit 1
) as X on true;
that has the particular point of having...
Superfuse asked 29/6, 2020 at 7:52
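On engines without LATERAL support (older Spark SQL releases, for instance), the per-id LIMIT 1 can be emulated with a window function; a sketch, assuming A and B are registered as views:

# ROW_NUMBER picks one row of B per id, mirroring LIMIT 1; the
# ORDER BY inside the window is only an arbitrary tie-breaker.
result = spark.sql("""
    SELECT A.id, X.value
    FROM A
    LEFT JOIN (
        SELECT id, value,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY value) AS rn
        FROM B
    ) X ON X.id = A.id AND X.rn = 1
""")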
26
Solved
I come from a pandas background and am used to reading data from CSV files into a dataframe and then changing the column names to something useful with the simple command:
df.columns = new_co...
Monteverdi asked 3/12, 2015 at 22:21
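The closest one-liner to the pandas assignment is DataFrame.toDF, which renames every column positionally; a sketch with hypothetical names:

# new_cols must contain exactly one name per existing column.
new_cols = ["id", "name", "value"]  # hypothetical
df = df.toDF(*new_cols)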
2
Solved
I am using Databricks and have already loaded some DataTables.
However, I have a complex SQL query that I want to run against these data tables, and I wonder if I could avoid translating it in p...
Conductance asked 7/8, 2019 at 10:43
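DataFrames can be exposed to SQL by registering temp views, after which a complex query can run untranslated; a sketch with hypothetical frame and view names:

# Register each loaded DataFrame under the name the SQL expects.
orders_df.createOrReplaceTempView("orders")
customers_df.createOrReplaceTempView("customers")
result = spark.sql("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name
""")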
4
Solved
I want to count the unique accesses for each day using Spark Structured Streaming, so I use the following code
.dropDuplicates("uuid")
and on the next day, the state maintained for today should be ...
Obed asked 3/8, 2017 at 3:27
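The documented way to keep dropDuplicates state from growing forever is to pair it with a watermark so per-key state can expire; a sketch assuming a hypothetical event-time column eventTime:

from pyspark.sql import functions as F

# State for uuids older than the watermark threshold is dropped,
# so each day's deduplication does not accumulate indefinitely.
daily_unique = (stream_df
    .withColumn("date", F.to_date("eventTime"))
    .withWatermark("eventTime", "1 day")
    .dropDuplicates(["uuid", "date"]))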
3
Solved
As a simplified example, I tried to filter a Spark DataFrame with the following code:
val xdf = sqlContext.createDataFrame(Seq(
  ("A", 1), ("B", 2), ("C", 3)
)).toDF("name", "cnt")
xdf.filter($"cnt" &...
Wink asked 29/11, 2015 at 9:55
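The excerpt cuts off mid-expression; for reference, an equivalent column-expression filter in PySpark on the same toy frame (the > comparison is only illustrative, since the original predicate is truncated):

from pyspark.sql import functions as F

xdf = spark.createDataFrame([("A", 1), ("B", 2), ("C", 3)], ["name", "cnt"])
# The predicate is a Column expression, not a Python boolean.
xdf.filter(F.col("cnt") > 1).show()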
2
val spark = SparkSession
  .builder()
  .appName("try1")
  .master("local")
  .getOrCreate()
val df = spark.read
  .json("s3n://BUCKET-NAME/FOLDER/FILE.json")
  .select($"uid").show(5)
I have given th...
Shumate asked 16/6, 2017 at 12:43
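The excerpt breaks off where the credentials are presumably supplied; one common pattern, sketched here with the s3a connector in place of s3n and placeholder keys, passes the Hadoop settings through the session builder:

from pyspark.sql import SparkSession

# spark.hadoop.* conf keys are forwarded to the Hadoop S3 connector.
spark = (SparkSession.builder
    .appName("try1")
    .master("local")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .getOrCreate())
spark.read.json("s3a://BUCKET-NAME/FOLDER/FILE.json").select("uid").show(5)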
3
I've read the PythonBooklet.pdf by H2O.ai and the Python API documentation, but I still can't find a clean way to do this. I know I can do either of the following:
Convert H2OFrame to Spark DataFra...
Toh asked 3/4, 2017 at 16:5
11
Solved
I am currently running a Java Spark application in Tomcat and receiving the following exception:
Caused by: java.io.IOException: Mkdirs failed to create file:/opt/folder/tmp/file.json/_temporary/0...
Aksoyn asked 3/3, 2016 at 17:13
2
Solved
What does "Determining location of DBIO file fragments..." mean, and how do I speed it up?
When running simple SQL commands in Databricks, sometimes I get the message:
Determining location of DBIO file fragments. This operation can take
some time.
What does this mean, and how do I ...
Jerold asked 30/11, 2019 at 20:11
4
Solved
I have a dataframe which contains null values:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(125, '2012-10-10', 'tv'),
     (20, '2012-10-10', 'phone'),
     (40, '2012-10-10', 'tv...
Gowrie asked 5/2, 2018 at 21:50
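The excerpt ends before the actual question, so only generic null handling can be shown; a sketch with a hypothetical column name category:

from pyspark.sql import functions as F

df.filter(F.col("category").isNull()).show()      # inspect rows with nulls
df_filled = df.na.fill({"category": "unknown"})   # replace nulls with a default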
3
Solved
I'd like to connect to Delta using JDBC and would like to run the Spark Thrift Server (STS) in local mode to kick the tyres.
I start STS using the following command:
$SPARK_HOME/sbin/start-thriftserve...
Sprue asked 6/11, 2021 at 8:8
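For a local smoke test, Delta support essentially amounts to the Delta package plus two SQL extension settings; the sketch below shows them on a plain PySpark session, and the same values can be passed to start-thriftserver.sh as --packages/--conf flags (the Delta version pinned here is only an example):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local[*]")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0")  # example version
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())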
2
Solved
I am trying to query data from parquet files in Scala Spark (1.5), including a query of 2 million rows ("variants" in the following code).
val sqlContext = new org.apache.spark.sql.SQLContext(sc) ...
Laryngeal asked 22/3, 2016 at 1:1
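For context, the Spark 1.x pattern for querying parquet through SQLContext looks like this (the path and table name are placeholders):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
# Load the parquet files and expose them to SQL under a temp table.
variants = sqlContext.read.parquet("/path/to/variants.parquet")
variants.registerTempTable("variants")
sample = sqlContext.sql("SELECT * FROM variants LIMIT 10")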
8
In a pandas data frame, I am using the following code to plot a histogram of a column:
my_df.hist(column = 'field_1')
Is there something that can achieve the same goal with a pyspark data frame? (I am i...
Lahomalahore asked 25/8, 2016 at 20:35
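PySpark DataFrames have no .hist(); one workable route computes the bucket counts distributed via RDD.histogram and plots the small summary locally. A sketch, assuming matplotlib is available on the driver:

import matplotlib.pyplot as plt

# Only the bucket edges and counts are brought back to the driver.
values = my_df.select("field_1").rdd.flatMap(lambda row: row)
edges, counts = values.histogram(20)
widths = [b - a for a, b in zip(edges, edges[1:])]
plt.bar(edges[:-1], counts, width=widths, align="edge")
plt.xlabel("field_1")
plt.show()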
2
Solved
I have a dataframe with some columns:
+------------+--------+----------+----------+
|country_name| ID_user|birth_date| psdt|
+------------+--------+----------+----------+
| Россия|16460783| 486|19...
Machiavellian asked 26/11, 2019 at 1:22
2
Solved
I use Spark 1.6.1.
We are trying to write an ORC file to HDFS using HiveContext and DataFrameWriter. While we can use
df.write().orc(<path>)
we would rather do something like
df.write()...
Coldblooded asked 5/6, 2017 at 8:44
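The excerpt ends before the desired alternative; for reference, the two standard spellings of an ORC write (PySpark syntax; the Java form is df.write().format("orc").save(path)):

# Both produce the same ORC output; format/save separates the format
# choice from the destination, which some pipelines prefer.
df.write.orc("/path/on/hdfs")
df.write.format("orc").save("/path/on/hdfs")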
3
Excuse me. Today I want to run a program that creates a DataFrame with sqlContext in PySpark. The result is an AttributeError: "AttributeError: 'NoneType' object has no attribute 'sc'"
My...
Serrated asked 28/11, 2016 at 7:59
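That error typically means SQLContext was constructed while no live SparkContext existed; a minimal working initialization for that vintage of PySpark might look like this:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Create the SparkContext first; SQLContext needs it as its 'sc'.
sc = SparkContext(appName="example")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()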
3
Solved
I have a simple join where I limit one of the sides. In the explain plan I see that before the limit is executed there is an ExchangeSingle operation; indeed, I see that at this stage there is only o...
Togoland asked 22/7, 2018 at 13:56
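The plan can be reproduced with a small sketch (df1 and df2 are hypothetical): limit needs a global cut-off, which forces LocalLimit, then an Exchange to a single partition, then GlobalLimit before the join sees the data.

# The explain output should show the single-partition exchange
# introduced by limit on the left input of the join.
df1.limit(100).join(df2, "id").explain()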
5
Solved
I am using Spark 1.5.
I have two dataframes of the form:
scala> libriFirstTable50Plus3DF
res1: org.apache.spark.sql.DataFrame = [basket_id: string, family_id: int]
scala> linkPersonItemLes...
Whole asked 13/12, 2016 at 14:43
2
Solved
I have a requirement where I need to count the number of duplicate rows in SparkSQL for Hive tables.
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.type...
Vhf asked 1/2, 2018 at 2:53
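A standard approach, sketched here with a hypothetical table name, groups on every column so that any group with count > 1 is a duplicated row:

from pyspark.sql import functions as F

df = spark.table("my_hive_table")  # hypothetical
dupes = df.groupBy(df.columns).count().filter(F.col("count") > 1)
# Number of surplus rows beyond the first copy of each duplicate:
dupes.agg(F.sum(F.col("count") - 1)).show()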
2
Solved
I have a PySpark DataFrame with an A field, a few B fields that depend on A (A -> B), and C fields that I want to aggregate per each A. For example:
A | B | C
----------
A | 1 | 6
A | 1 | 7
B | 2...
Dispend asked 25/2, 2018 at 12:49
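Because B is functionally determined by A, it can either join the grouping key or ride along via first(); a sketch where the aggregation over C (sum) is an assumption:

from pyspark.sql import functions as F

# Grouping by (A, B) yields the same groups as grouping by A alone.
result = df.groupBy("A", "B").agg(F.sum("C").alias("C_sum"))
# Alternative: group by A only and carry B with first().
result2 = df.groupBy("A").agg(F.first("B").alias("B"), F.sum("C").alias("C_sum"))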
11
Solved
I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df that I am aggregating:
(df.groupBy("group")
.agg({"money":"sum"})
.show(100)
)
This ...
Bearskin asked 1/5, 2015 at 14:1
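The dict form of agg fixes the output column name to sum(money); the functions API allows an explicit alias instead:

from pyspark.sql import functions as F

(df.groupBy("group")
   .agg(F.sum("money").alias("money"))
   .show(100))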
3
I have a table with one column that is serialized JSON. I want to apply schema inference to this JSON column. I don't know the schema to pass as input for JSON extraction (e.g. the from_json funct...
Cortex asked 29/8, 2021 at 16:13
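Two common techniques when the schema is unknown up front, sketched with a hypothetical column name payload: re-read the column as a JSON dataset to infer a schema, or (Spark 2.4+) derive one from a sample value with schema_of_json:

from pyspark.sql import functions as F

# Option 1: infer the schema by treating the column as a JSON dataset.
inferred = spark.read.json(df.select("payload").rdd.map(lambda row: row[0]))
parsed = df.withColumn("parsed", F.from_json("payload", inferred.schema))

# Option 2 (Spark 2.4+): derive the schema from a single sample value.
sample = df.select("payload").first()[0]
parsed2 = df.withColumn("parsed", F.from_json("payload", F.schema_of_json(F.lit(sample))))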