Spark pool taking time to start in Azure Synapse Analytics

I have created 3 different notebooks using PySpark code in Azure Synapse Analytics. The notebooks run on a Spark pool, and there is only one Spark pool for all 3 notebooks. When these 3 notebooks run individually, the Spark pool starts for each notebook by default.

The issue I am facing is related to the Spark pool: it takes 10 minutes to start for each notebook. The pool is assigned 4 vCores and 1 executor. Can somebody please help me understand how to speed up the start of the Spark pool in Azure Synapse Analytics?

Abhor asked 25/11, 2020 at 3:27 Comment(6)
If my answer is useful for you, could you please accept it as the answer? It may help more people who have a similar issue. (Godfree)
Did you visit the Spark pausing settings and set the number of idle minutes to whatever time you want? It is not clear why the Spark pool starts every time for each notebook. (Separable)
Have you found a fix for this? I'm also having the same issue. (Mccrary)
Yes, you do not have to split the cells unless you need to change the language for coding. (Abhor)
@kshitizsinha So in your notebooks, you only have one cell? How much time was saved after you did that? (Mccrary)
Before merging the code into a single cell it was taking 10-12 minutes. After merging, Spark starts in approximately 2-3 minutes. (Abhor)
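
One way to avoid paying the start-up cost once per notebook, sketched below with Synapse's standard mssparkutils notebook API, is to run all three notebooks from a single driver notebook so they share one Spark session. The notebook names and the 600-second timeout are placeholder assumptions:

    # Driver notebook: runs the other notebooks inside the current Spark
    # session, so the pool start-up cost is paid only once.
    # Notebook names and the 600-second timeout are placeholders.
    from notebookutils import mssparkutils

    mssparkutils.notebook.run("Notebook1", 600)
    mssparkutils.notebook.run("Notebook2", 600)
    mssparkutils.notebook.run("Notebook3", 600)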

I have this problem a lot too. It takes 4-5 minutes in my experience as well.

If it takes longer, make sure you publish (save) your notebook first, then reload the page. Sometimes that refreshes the underlying Livy session.

Tournament answered 16/10, 2022 at 0:16 Comment(0)

If you turn on "dynamically allocate executors" for the Spark pool, the startup time seems to drop to around 90 seconds (from 3-5 minutes). However, I haven't found a way to shrink it further or to keep the Spark pool alive.
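
Session-level Spark settings can also be requested from a notebook with the %%configure magic, run before the session starts. A minimal sketch; the executor bounds are illustrative, and whether they take effect can depend on the pool's own settings:

    %%configure -f
    {
        "conf": {
            "spark.dynamicAllocation.enabled": "true",
            "spark.dynamicAllocation.minExecutors": "1",
            "spark.dynamicAllocation.maxExecutors": "4"
        }
    }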

Voltmeter answered 13/10, 2023 at 17:09 Comment(0)

The performance of your Apache Spark pool jobs depends on multiple factors, including:

  • How your data is stored
  • How the cluster is configured (Small, Medium, Large)
  • The operations that are used when processing the data

Common challenges you might face include:

  • Memory constraints due to improperly sized executors
  • Long-running operations
  • Tasks that result in Cartesian operations

There are also many optimizations that can help you overcome these challenges, such as caching and allowing for data skew.
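
As a minimal illustration of the caching point (the storage path and column name are placeholders), in a Synapse notebook where the spark session is pre-created:

    # Cache a DataFrame that several actions reuse, so the source files are
    # read and parsed only once. Path and column name are placeholders.
    df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/data/")
    df.cache()

    total = df.count()                              # first action fills the cache
    positive = df.filter(df["amount"] > 0).count()  # served from the cache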

The article Optimize Apache Spark jobs (preview) in Azure Synapse Analytics describes common Spark job optimizations and recommendations.

Godfree answered 27/11, 2020 at 6:47 Comment(1)
Unfortunately this answer does not even take the question into account. You are describing the performance of the cluster, not its initialisation time, which I have personally found abysmally slow... (Tasks that take 5 s to perform have to wait over 3 minutes for Spark itself to spin up?) (Stanch)
