Reducing provisioning time of Vertex AI Training (Custom Training Job)

I'm using the Vertex AI custom training feature on Google Cloud Platform (GCP) to train a model. Every time I trigger a training job, it takes about 10 minutes before training actually starts because of provisioning time.

Is there any way to reduce the provisioning time of Vertex AI custom training jobs? Thanks :)

Elenore answered 30/5, 2021 at 11:13 Comment(8)
What are the requirements of your training cluster? GPUs? TPUs? Number of CPUs? And how long does your training job take? - Paedogenesis
I tried to run with GPUs (specifically an a2-highgpu-1g machine with one A100). The actual training job takes 40 minutes, plus 10 minutes of provisioning time. - Elenore
Without a GPU, it takes about 3 minutes to start, so I assume it takes longer with a GPU. I don't think there is a way to speed up the process. In your case, provisioning accounts for 20% of the total job time (10 of 50 minutes), which is huge. Did you try setting up the same VM on Compute Engine to see whether provisioning is quicker? If it is, you can launch your container from a startup script and run your training directly on Compute Engine instead of Vertex AI, but that requires more technical/IaaS skills than Vertex AI does. - Paedogenesis
What is your code structure (Python code or a custom container image)? How are you loading the data (is it in Cloud Storage, or are you storing it with your code)? Are you using any libraries? - Pointtopoint
I used a custom container based on a prebuilt PyTorch image. The data is loaded from GCS. The code is contained in the Docker image. - Elenore
Many things might cause this; it depends on how much information you can share. What is the location of the nodes, and what is the GPU utilization? Which region and which machine type were used for training? See Pre-built containers for custom training. - Pointtopoint
@Pointtopoint I used the us-central1-b location for the custom training job. Utilization is about 80%. - Elenore
Honestly, I doubt that it can be reduced. As per "Streamline your ML training workflow with Vertex AI": "The training job will automatically provision computing resources, and de-provision those resources when the job is complete. There is no worrying about leaving a high-performance virtual machine configuration running." However, if you want to be 100% sure, you could file an issue with the Google team using the Issue Tracker. - Pointtopoint

You can now provision the resources once, keep them running, and reuse them for subsequent runs. This feature is called a persistent resource and was not available when you asked the question (it entered public preview in November 2023 and became generally available in May 2024).

See the documentation.
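
To make the idea concrete, here is a rough sketch of the two request bodies involved: one that describes a persistent resource with a single A100 pool, and one for a custom job that points at it. It only builds and prints JSON and does not call any API. The field names (`resourcePools`, `machineSpec`, `persistentResourceId`, `workerPoolSpecs`) follow my understanding of the Vertex AI REST API and should be checked against the documentation. The resource ID, display name and image URI are hypothetical placeholders.

```python
import json

# Hypothetical identifiers, used for illustration only.
PERSISTENT_RESOURCE_ID = "a100-pool"
IMAGE_URI = "us-docker.pkg.dev/my-project/training/pytorch-train:latest"


def persistent_resource_body(machine_type, accelerator_type,
                             accelerator_count, replica_count):
    """Build a request body describing a persistent resource with one pool."""
    return {
        "resourcePools": [
            {
                "machineSpec": {
                    "machineType": machine_type,
                    "acceleratorType": accelerator_type,
                    "acceleratorCount": accelerator_count,
                },
                "replicaCount": replica_count,
            }
        ]
    }


def custom_job_body(display_name, image_uri, persistent_resource_id, pool):
    """Build a custom job body that targets an existing persistent resource.

    Assumption: the job's machine spec has to match a pool of the
    persistent resource, so it is copied from that pool here.
    """
    return {
        "displayName": display_name,
        "jobSpec": {
            "persistentResourceId": persistent_resource_id,
            "workerPoolSpecs": [
                {
                    "machineSpec": dict(pool["machineSpec"]),
                    "replicaCount": 1,
                    "containerSpec": {"imageUri": image_uri},
                }
            ],
        },
    }


# Same machine shape as in the question: a2-highgpu-1g with one A100.
resource = persistent_resource_body(
    "a2-highgpu-1g", "NVIDIA_TESLA_A100", 1, replica_count=1
)
job = custom_job_body(
    "pytorch-training-run",
    IMAGE_URI,
    PERSISTENT_RESOURCE_ID,
    resource["resourcePools"][0],
)

print(json.dumps(resource, indent=2))
print(json.dumps(job, indent=2))
```

Keep in mind that a persistent resource keeps its machines running between jobs, so you pay for idle time until you delete it. That is the trade-off for skipping the provisioning wait on each run.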

Mcgrody answered 22/8, 2024 at 16:55 Comment(0)
