I have a problem inside a PySpark UDF and I want to print the number of the row that causes it.
I tried to count the rows using Python's equivalent of a "static variable" (a function attribute), so that the counter is incremented each time the UDF is called with a new row. However, it does not work:
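
    import pyspark.sql.functions as F
    from pyspark.sql.types import StringType

    def myF(input):
        # Function attribute used as a "static" call counter
        myF.lineNumber += 1
        # somethingBad and res are placeholders for the real logic
        if (somethingBad):
            print(myF.lineNumber)
        return res

    myF.lineNumber = 0
    myF_udf = F.udf(myF, StringType())
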
How can I count the number of times a UDF is called, in order to find the number of the row causing the problem in PySpark?
lineNumber. – Straggle
struct with 2 cols: "good result" and "bad result", and then count the bad or good results ... – Straggle
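
The suggestion in the last comment could be sketched roughly as follows. This is only a minimal sketch under assumptions, not a tested answer: `risky_transform`, `safe_transform`, `find_bad_rows`, and the column names `value`, `result`, `good_result` and `bad_result` are all hypothetical names chosen for illustration, and `risky_transform` merely stands in for whatever the real UDF does.

```python
def risky_transform(value):
    """Hypothetical stand-in for the real per-row logic; raises on bad input."""
    return str(int(value) * 2)


def safe_transform(value):
    """Return (good_result, bad_result); exactly one of the two is None."""
    try:
        return (risky_transform(value), None)
    except Exception as exc:
        # Keep the offending input and the error so the row can be identified.
        return (None, f"{value!r}: {exc}")


def find_bad_rows(df, column="value"):
    """Apply safe_transform as a struct-returning UDF and keep only failures.

    Assumes `df` is a pyspark.sql.DataFrame with a string column `column`.
    """
    # Imported here so the pure-Python helpers above work without Spark.
    import pyspark.sql.functions as F
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType([
        StructField("good_result", StringType(), True),
        StructField("bad_result", StringType(), True),
    ])
    safe_udf = F.udf(safe_transform, schema)
    with_result = df.withColumn("result", safe_udf(F.col(column)))
    # Rows where the UDF failed carry a non-null bad_result field.
    return with_result.filter(F.col("result.bad_result").isNotNull())
```

With this shape, `find_bad_rows(df).count()` would give the number of failing rows and `find_bad_rows(df).show()` would display the inputs that caused them. Identifying rows by their content rather than by a counter sidesteps the fact that a counter incremented inside a UDF lives separately in each executor's Python worker, and that DataFrame rows have no stable position unless the data is explicitly ordered.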