I am unclear about Vertex AI pricing for model predictions. In the documentation, under the heading "More about automatic scaling of prediction nodes", one of the points mentioned is:
"If you choose automatic scaling, the number of nodes scales automatically, and can scale down to zero for no-traffic durations"
The example provided later in the documentation also seems to suggest that during a period with no traffic, zero nodes are in use. However, when I create an Endpoint in Vertex AI, under the Autoscaling heading it says:
"Autoscaling: If you set a minimum and maximum, compute nodes will scale to meet traffic demand within those boundaries"
A value of 0 under "Minimum number of compute nodes" is not allowed, so you have to enter 1 or greater, and it is mentioned that:
"Default is 1. If set to 1 or more, then compute resources will continuously run even without traffic demand. This can increase cost but avoid dropped requests due to node initialization."
My question is: what happens when I select autoscaling by setting the minimum to 1 and the maximum to, say, 10? Does 1 node always run continuously, or does it scale down to 0 nodes under no-traffic conditions, as the documentation suggests?
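For reference, this is roughly the equivalent deployment with the Python SDK rather than the console; the project, region, model ID, and machine type below are placeholders, not my actual values:

```python
from google.cloud import aiplatform

# Placeholder project/region for illustration only.
aiplatform.init(project="my-project", location="us-central1")

# Placeholder ID of an already-uploaded model.
model = aiplatform.Model(model_name="1234567890")

# Autoscaling bounds: minimum 1 node, maximum 10 nodes.
# The console does not accept 0 for the minimum, so 1 is the floor here too.
endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=10,
)
```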
To test, I deployed an Endpoint with autoscaling (min and max both set to 1) and then sent a prediction request; the response was almost immediate, suggesting the node was already up. I repeated this after about an hour and the response was again immediate, suggesting that the node probably never shut down. Also, if scaling down to 0 nodes is indeed possible, is it even practical under strict latency requirements, i.e., what latency can we expect when starting up from 0 nodes?
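For context, here is a minimal sketch of how such a timing check can be done with the Python SDK; the endpoint ID and the instance payload are placeholders, and the exact payload shape depends on the deployed model:

```python
import time
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Placeholder endpoint ID.
endpoint = aiplatform.Endpoint(endpoint_name="9876543210")

# Time a single online prediction request end to end.
start = time.perf_counter()
response = endpoint.predict(instances=[[0.1, 0.2, 0.3]])  # payload depends on the model
print(f"Prediction latency: {time.perf_counter() - start:.2f} s")
```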