I'm using a Zeppelin v0.7.3 notebook to run PySpark scripts. In one paragraph, I am running a script that writes data from a dataframe to a parquet file in a Blob folder, partitioned per country. The dataframe has 99,452,829 rows. When the script reaches 1 hour of runtime, it fails with the following error:

Error with 400 StatusCode: "requirement failed: Session isn't active."
My default interpreter for the notebook is jdbc
. I have read about timeoutlifecyclemanager
and added in the interpreter setting zeppelin.interpreter.lifecyclemanager.timeout.threshold
and set it to 7200000
but still encountered the error after it reaches 1 hour runtime at 33% processing completion.
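
For reference, these are the lifecycle manager properties as I added them (key = value). The threshold value is the one from my settings; the other two entries reflect my reading of the TimeoutLifecycleManager documentation, so the exact names and their applicability to v0.7.3 may be off:

zeppelin.interpreter.lifecyclemanager.class = org.apache.zeppelin.interpreter.lifecycle.TimeoutLifecycleManager
zeppelin.interpreter.lifecyclemanager.timeout.checkinterval = 60000
zeppelin.interpreter.lifecyclemanager.timeout.threshold = 7200000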
I checked the Blob folder after the 1-hour timeout, and the parquet files had been written successfully, partitioned per country as expected.
The script I am running to write the dataframe to a parquet file in Blob is below:
trdpn_cntry_fct_denom_df.write \
    .format("parquet") \
    .partitionBy("CNTRY_ID") \
    .mode("overwrite") \
    .save("wasbs://[email protected]/cbls/hdi/trdpn_cntry_fct_denom_df.parquet")
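
In case it is useful, this is roughly how the output could be read back to confirm the CNTRY_ID partitioning from Spark (just a sketch against the same path, using the spark session that the pyspark interpreter provides; I only inspected the folder structure in Blob itself):

# Read the partitioned parquet back and list the distinct country partitions (sketch only)
verify_df = spark.read.parquet("wasbs://[email protected]/cbls/hdi/trdpn_cntry_fct_denom_df.parquet")
verify_df.select("CNTRY_ID").distinct().show()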
Is this a Zeppelin timeout issue? How can the timeout be extended to allow more than 1 hour of runtime? Thanks for the help.