My question is a little different from the other questions I could find on Stack Overflow. I need to know whether the data has already been retrieved and stored in a DataFrame, or whether that is yet to happen.
I am doing something like this:
df1=spark.table("sourceDB.Table1")
df1.cache()
Now, as you might be aware, the data has not yet been read from the source table at this point because of lazy evaluation. So I need an expression here that evaluates to False.
After some time, I perform an operation that requires the data to be retrieved from the source. For example:
from pyspark.sql import functions as F
df1.groupBy("col3").agg(F.sum("col1").alias("sum_of_col1")).select("sum_of_col1","col3").filter("sum_of_col1 >= 100").show()
At this point, the data must have been read and stored in the cache for df1. So I need an expression here that evaluates to True.
Is there any way to achieve this? I believe df1.is_cached will not help in this situation, because it is already True right after df1.cache() is called, before any data has been read.
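To make the distinction concrete, here is a toy model in plain Python of the behaviour I am after. It is not Spark code: `LazyFrame`, `is_materialized`, and the loader function are hypothetical names made up for illustration only. In this sketch, `is_cached` flips as soon as `cache()` is called, while the flag I am looking for only flips after the first action has actually read the data.

```python
# Toy model in plain Python, NOT Spark: LazyFrame and its attributes are hypothetical.
class LazyFrame:
    def __init__(self, loader):
        self._loader = loader    # function that reads the source data
        self.is_cached = False   # "marked for caching" flag
        self._data = None        # filled in by the first action, if caching was requested

    def cache(self):
        # Only records the intent to cache; nothing is read here.
        self.is_cached = True
        return self

    @property
    def is_materialized(self):
        # The kind of expression the question is asking for.
        return self._data is not None

    def count(self):
        # An action: triggers the read and keeps the result if caching was requested.
        data = self._data if self._data is not None else self._loader()
        if self.is_cached:
            self._data = data
        return len(data)


df1 = LazyFrame(lambda: [1, 2, 3]).cache()
before = (df1.is_cached, df1.is_materialized)  # marked for caching, nothing read yet
df1.count()                                    # first action reads the data
after = (df1.is_cached, df1.is_materialized)   # marked and now populated
print(before, after)
```

What I want is the Spark equivalent of `is_materialized` in this sketch: something that is False after `df1.cache()` and True only once an action has populated the cache.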