first_value windowing function in pyspark - McMap

About

first_value windowing function in pyspark

Asked 1/2, 2016 at 23:12 Answered 2/2, 2016 at 0:16

Solved apache-spark pyspark apache-spark-sql window-functions

O

1

7

I am using pyspark 1.5 getting my data from Hive tables and trying to use windowing functions.

According to this there exists an analytic function called firstValue that will give me the first non-null value for a given window. I know this exists in Hive but I can not find this in pyspark anywhere.

Is there a way to implement this given that pyspark won't allow UserDefinedAggregateFunctions (UDAFs)?

Overhang answered 1/2, 2016 at 23:12 Comment(1)

See this link. 1.6 has this, can we replicate? – Overhang 1/2, 2016 at 23:59

O

16

Spark >= 2.0:

first takes an optional ignorenulls argument which can mimic the behavior of first_value:

df.select(col("k"), first("v", True).over(w).alias("fv"))

Spark < 2.0:

Available function is called first and can be used as follows:

df = sc.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3)
]).toDF(["k", "v"])

w = Window().partitionBy("k").orderBy("v")

df.select(col("k"), first("v").over(w).alias("fv"))

but if you want to ignore nulls you'll have to use Hive UDFs directly:

df.registerTempTable("df")

sqlContext.sql("""
    SELECT k, first_value(v, TRUE) OVER (PARTITION BY k ORDER BY v)
    FROM df""")

Overhead answered 2/2, 2016 at 0:16 Comment(1)

Good call on using Hive queries. – Overhang 2/2, 2016 at 21:54

Recommended topics

#Godot #Unity #Godot 4.X #Mongodb

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

© 2022 - 2024 — McMap. All rights reserved.