Yes, it is possible to do so with Flexible Resource Scheduling in Cloud Dataflow (docs). Note that there are some things to consider:
- Delayed execution: jobs are scheduled and not executed right away (you can see a new `QUEUED` status for your Dataflow jobs). They are run opportunistically when resources are available within a six-hour window. This makes FlexRS suitable for reducing the cost of non-time-critical workloads. Also, be sure to validate your code before sending the job.
- Batch jobs: as of now FlexRS only accepts batch jobs and requires autoscaling to be enabled: you cannot set `autoscalingAlgorithm=NONE`.
- Dataflow Shuffle: it needs to be enabled. With Shuffle enabled, no data is stored on persistent disks attached to the VMs, so when a preemption happens and resources are claimed back there is no need to redistribute the data.
- Regions: following from the previous item, only regions where Dataflow Shuffle is supported can be selected. The list is available here; turn-up for new regions will be announced in the release notes. As of now, the zone is automatically chosen within the region.
- Machine types: FlexRS currently supports `n1-standard-2` (default) and `n1-highmem-16`.
- SDK: requires 2.12.0 or newer for Java or Python.
- Quota: quota is reserved upfront (i.e. queued jobs also consume quota).
In order to run it, use `--flexRSGoal=COST_OPTIMIZED` and make sure the rest of the parameters conform to the FlexRS requirements above; a minimal sketch follows.
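As an illustration only, here is a minimal Python sketch that pulls those requirements together. Note that in the Python SDK the flag is spelled `--flexrs_goal` (`--flexRSGoal` is the Java form), and the project, bucket and worker count below are placeholder values, not ones taken from the docs:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket values; replace with your own.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                  # hypothetical project ID
    "--region=europe-west1",                 # must be a Dataflow Shuffle region
    "--temp_location=gs://my-bucket/temp",   # hypothetical bucket
    "--flexrs_goal=COST_OPTIMIZED",          # enables FlexRS (Java: --flexRSGoal)
    "--max_num_workers=10",                  # autoscaling stays enabled (no NONE)
    "--machine_type=n1-standard-2",          # one of the supported FlexRS types
])

# A trivial batch pipeline; FlexRS only accepts batch jobs,
# so the submitted job will sit in the QUEUED state until it runs.
with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["a", "b", "c"])
     | "Upper" >> beam.Map(str.upper))
```

The same flags can equally be passed on the command line when launching the pipeline script, and the job should show as `QUEUED` in the console until Dataflow picks it up.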
A uniform discount rate is applied to FlexRS jobs; you can compare pricing details in the following link.
Note that you might see a Beta disclaimer in the non-English documentation but, as clarified in the release notes, it's Generally Available.