I've found a solution that works without additional coding by using a Window here. So Jeff was right, there is a solution. Full code below; I'll briefly explain what it does, and for more details just look at the blog.
from pyspark.sql import Window
from pyspark.sql.functions import last
import sys
# define the window
window = Window.orderBy('time')\
.rowsBetween(-sys.maxsize, 0)
# define the forward-filled column
filled_column_temperature = last(df6['temperature'], ignorenulls=True).over(window)
# do the fill
spark_df_filled = df6.withColumn('temperature_filled', filled_column_temperature)
So the idea is to define a sliding Window (more on sliding windows here) through the data which always contains the current row and ALL previous ones:
window = Window.orderBy('time')\
.rowsBetween(-sys.maxsize, 0)
Note that we sort by time, so the data is in the correct order. Also note that using "-sys.maxsize" ensures that the window always includes all previous data and continuously grows as it traverses the data top-down, but there might be more efficient solutions.
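One likely cleaner option (my own suggestion, not part of the original answer) is to use Spark's built-in frame boundaries Window.unboundedPreceding and Window.currentRow instead of -sys.maxsize, which drops the dependency on sys. A minimal sketch:
# same frame as above, expressed with PySpark's built-in boundaries
from pyspark.sql import Window
window = Window.orderBy('time')\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)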
Using the "last" function, we are always addressing the last row in that window. By passing "ignorenulls=True" we define that if the current row is null, then the function will return the most recent (last) non-null value in the window. Otherwise the actual row's value is used.
Done.
Window, but I'm actually working my way through that conceptually right now. Although if your data is large enough to need a cluster, why impute these instead of dropping the observations? Keep in mind when you impute that you're making up data that doesn't exist - it has its uses, but you should still avoid it if you can. – Blayze
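If dropping rather than imputing is acceptable, as the comment suggests, a minimal sketch (again assuming the hypothetical df6 above) would be:
# drop rows where 'temperature' is null instead of forward-filling them
spark_df_dropped = df6.dropna(subset=['temperature'])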