Reducing provisioning time of Vertex AI Training (Custom Training Job)
I'm using the Vertex AI custom training feature on Google Cloud Platform (GCP) to train a model, but every time I trigger training, it takes about 10 minutes of provisioning time before training actually starts.

Is there any way to reduce the provisioning time of Vertex AI's custom training jobs? Thanks :)
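
For reference, I trigger the job roughly like this with the Vertex AI Python SDK (the project, container URI, and display name below are placeholders, not my real values):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Vertex AI provisions a fresh machine for this job, runs the container,
# and de-provisions the machine when the job finishes.
job = aiplatform.CustomContainerTrainingJob(
    display_name="pytorch-training",
    container_uri="us-docker.pkg.dev/my-project/my-repo/trainer:latest",
)
job.run(
    replica_count=1,
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
)
```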

Elenore answered 30/5, 2021 at 11:13 Comment(8)
What are the requirements of your training cluster? GPUs? TPUs? Number of CPUs? And how long does your training job take?Paedogenesis
I tried to run with GPUs (specifically an a2-highgpu-1g machine with 1 x A100). The actual training job takes 40 minutes, plus 10 minutes of provisioning time.Elenore
Without a GPU, it takes about 3 minutes to start; I assume it takes longer with a GPU, and I think there is no way to speed up the process. In your case, the warmup takes 20% of your training job, which is huge. Did you try setting up the same VM on Compute Engine to see if provisioning was quicker? If so, you could start your container at boot (startup script) and run your training directly on Compute Engine rather than on Vertex AI (see the sketch after these comments), but that requires more technical/IaaS skills than Vertex AI.Paedogenesis
What is your code structure (Python code or a custom container image)? How are you loading data (is it in Cloud Storage, or do you store it with your code)? Are you using any libraries?Pointtopoint
I used a custom container based on a prebuilt PyTorch image. I use GCS for data loading, and the code is contained in the Docker image.Elenore
Many things might cause this; it depends on how much information you can share. What is the location of the nodes, and what is the GPU utilization? Which region was used for training, and which machine - Pre-built containers for custom training?Pointtopoint
@Pointtopoint I used the us-central1-b location for the custom training job. Utilization is about 80%.Elenore
Honestly, I doubt that it can be reduced, as per Streamline your ML training workflow with Vertex AI: "The training job will automatically provision computing resources, and de-provision those resources when the job is complete. There is no worrying about leaving a high-performance virtual machine configuration running." However, if you want to be 100% sure, you could create an issue for the Google team using the Issue Tracker.Pointtopoint
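
A rough sketch of the Compute Engine alternative suggested in the comments above, using the google-cloud-compute client. The project, zone, boot image, and container URI are placeholder assumptions; note that A2 machine types come with the A100 already attached, so no explicit accelerator config is needed:

```python
from google.cloud import compute_v1

# Hypothetical values; substitute your own project and zone.
PROJECT = "my-project"
ZONE = "us-central1-b"

# The startup script pulls and runs the training container as soon as
# the VM boots, replacing Vertex AI's separate provisioning step.
startup_script = """#!/bin/bash
docker run --gpus all us-docker.pkg.dev/my-project/my-repo/trainer:latest
"""

instance = compute_v1.Instance(
    name="pytorch-trainer",
    # a2-highgpu-1g includes 1 x A100 by definition.
    machine_type=f"zones/{ZONE}/machineTypes/a2-highgpu-1g",
    disks=[
        compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                # A GPU-ready image, e.g. from the Deep Learning VM family.
                source_image="projects/deeplearning-platform-release/global/images/family/common-cu121",
            ),
        )
    ],
    metadata=compute_v1.Metadata(
        items=[compute_v1.Items(key="startup-script", value=startup_script)]
    ),
    # GPU VMs cannot live-migrate, so maintenance must terminate them.
    scheduling=compute_v1.Scheduling(on_host_maintenance="TERMINATE"),
)

operation = compute_v1.InstancesClient().insert(
    project=PROJECT, zone=ZONE, instance_resource=instance
)
operation.result()  # block until the VM is created
```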
You can now provision the resources once, keep them running, and reuse them for subsequent runs. This feature is called a persistent resource and was not available when you asked the question (it entered public preview in November 2023 and became generally available in May 2024).

See the documentation.
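
A minimal sketch of that flow, assuming a recent google-cloud-aiplatform SDK (the persistent_resource_id parameter on run() shipped alongside the GA release) and placeholder project and resource names; check the documentation above for the exact parameters:

```python
from google.cloud import aiplatform, aiplatform_v1

PROJECT = "my-project"       # placeholder
REGION = "us-central1"
RESOURCE_ID = "my-persistent-resource"

# One-time setup: create a persistent resource that keeps an A100 VM warm.
client = aiplatform_v1.PersistentResourceServiceClient(
    client_options={"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}
)
client.create_persistent_resource(
    parent=f"projects/{PROJECT}/locations/{REGION}",
    persistent_resource_id=RESOURCE_ID,
    persistent_resource=aiplatform_v1.PersistentResource(
        resource_pools=[
            aiplatform_v1.ResourcePool(
                machine_spec=aiplatform_v1.MachineSpec(
                    machine_type="a2-highgpu-1g",
                    accelerator_type=aiplatform_v1.AcceleratorType.NVIDIA_TESLA_A100,
                    accelerator_count=1,
                ),
                replica_count=1,
            )
        ]
    ),
).result()  # long-running operation; blocks until the pool is ready

# Every subsequent training job targets the warm pool instead of
# provisioning fresh hardware.
aiplatform.init(project=PROJECT, location=REGION)
job = aiplatform.CustomContainerTrainingJob(
    display_name="pytorch-training",
    container_uri="us-docker.pkg.dev/my-project/my-repo/trainer:latest",
)
job.run(
    replica_count=1,
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    persistent_resource_id=RESOURCE_ID,  # reuse the warm pool
)
```

Keep in mind that the persistent resource is billed for as long as it runs, even while idle, so delete it when you are done.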

Mcgrody answered 22/8 at 16:55 Comment(0)
