Gradle Support for GCP Dataflow Templates?

According to Google's Dataflow documentation, Dataflow job template creation is "currently limited to Java and Maven." However, the documentation for Java across GCP's Dataflow site is... messy, to say the least. The 1.x and 2.x versions of Dataflow are pretty far apart in terms of details, and I have some specific code requirements that lock me into the 2.0.0r3 codebase, so I'm pretty much required to use Apache Beam. Apache is -- understandably -- quite dedicated to Maven, but institutionally my company has thrown the bulk of its weight behind Gradle, so much so that it migrated all of its Java projects over to Gradle last year and has pushed back against re-introducing Maven.

However, now we seem to be at an impasse: we have a specific goal of centralizing a lot of our back-end data gathering in GCP's Dataflow, and GCP Dataflow doesn't appear to have formal support for Gradle. If it does, it's not in the official documentation.

Is it actually technically feasible to build Dataflow templates with Gradle, and the issue is simply that Google's docs haven't been updated to cover it? Is there a technical reason why Gradle can't do what's being done with Maven? Is there a better guide for working with GCP Dataflow than the docs on Google's and Apache's websites? I haven't worked with Maven archetypes before, and all the searches I've done for "gradle archetypes" turn up results that are, at best, over a year old. Most of the information points to forum discussions from 2014 and Gradle 1.7rc3, but we're on 3.5. This feels like it ought to be a solved problem, but for the life of me I can't find any current information on it online.

Parka asked 28/4, 2017 at 2:07 Comment(0)

There's absolutely nothing stopping you from writing your Dataflow application/pipeline in Java and using Gradle to build it.

Gradle will simply produce an application distribution (e.g. ./gradlew clean distTar), which you then extract and run with the --runner=TemplatingDataflowPipelineRunner --dataflowJobFile=gs://... parameters.

It's just a runnable Java application.
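For concreteness, here's a minimal build.gradle sketch using the application plugin; the package name and Beam version are placeholders of mine, not from the original project:

apply plugin: 'java'
apply plugin: 'application'

// Placeholder entry point; swap in your own pipeline class.
mainClassName = 'com.example.MyPipeline'

repositories {
    mavenCentral()
}

dependencies {
    // Illustrative versions; pin these to the SDK release you're locked to.
    compile 'org.apache.beam:beam-sdks-java-core:2.5.0'
    runtime 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.5.0'
}

Running ./gradlew clean distTar then leaves a tarball under build/distributions/ whose bin/ launcher accepts the runner and template flags above. Note that TemplatingDataflowPipelineRunner is from the 1.x Dataflow SDK; on Beam 2.x the equivalent is --runner=DataflowRunner together with --templateLocation=gs://... (as in the staging answer further down).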

The template and all the binaries will then be uploaded to GCS, and you can execute the pipeline through the console, CLI or even Cloud Functions.

You don't even need to use Gradle. You could just run it locally and the template/binaries will be uploaded. But I'd imagine you are using a build server like Jenkins.

Maybe the Dataflow docs should read "Note: Template creation is currently limited to Java", because this feature is not available in the Python SDK yet.

Scantling answered 28/4, 2017 at 2:49 Comment(4)
Wow! Thank you! You're correct; the plan is to integrate it into Distelli with our other pipelines, but this one was eluding me. I'll try this first thing tomorrow.Parka
Great. Consider upvoting and accepting my answer if it helped you.Scantling
Gah. Sorry. I tried upvoting the comment but my rep is too low to show it. I sat down, got halfway into the validation, and then got ducked into endless meetings and forgot to come back here and close the loop. Thank you for the ping. Answer worked.Parka
This answer doesn't seem up to date (for instance, TemplatingDataflowPipelineRunner doesn't seem to exist). Would you be willing to update it?Nuthouse

Commandline to Run Cloud Dataflow Job With Gradle

Generic Execution

$ gradle clean execute -DmainClass=com.foo.bar.myfolder.MyPipeline -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://my-bucket/tmpdataflow" -Pdataflow-runner

Specific Example

$ gradle clean execute -DmainClass=com.foo.bar.myfolder.MySpannerPipeline -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://my-bucket/tmpdataflow --spannerInstanceId=fooInstance --spannerDatabaseId=barDatabase" -Pdataflow-runner

Explanation of Commandline

  1. gradle clean execute uses the execute task, which lets us easily pass command-line flags to the Dataflow pipeline. The clean task removes cached builds.

  2. -DmainClass= specifies the Java main class, since we have multiple pipelines in a single folder. Without this, Gradle doesn't know what the main class is or where to pass the args. Note: your build.gradle file must include the execute task shown below.

  3. -Dexec.args= specifies the execution arguments, which will be passed to the pipeline. Note: your build.gradle file must include the execute task shown below.

  4. --runner=DataflowRunner and -Pdataflow-runner ensure that the Google Cloud Dataflow runner is used and not the local DirectRunner (see the note after the build.gradle explanation below).

  5. --spannerInstanceId= and --spannerDatabaseId= are just pipeline-specific flags; your pipeline probably won't have them.

build.gradle contents (NOTE: You need to populate your specific dependencies)

apply plugin: 'java'
apply plugin: 'maven'
apply plugin: 'application'

group = 'com.foo.bar'
version = '0.3'

mainClassName = System.getProperty("mainClass")

sourceCompatibility = 1.8
targetCompatibility = 1.8

repositories {
    maven { url "https://repository.apache.org/content/repositories/snapshots/" }
    maven { url "https://repo.maven.apache.org/maven2" }
}

dependencies {
    compile group: 'org.apache.beam', name: 'beam-sdks-java-core', version:'2.5.0'
    // Insert your build deps for your Beam Dataflow project here
    runtime group: 'org.apache.beam', name: 'beam-runners-direct-java', version:'2.5.0'
    runtime group: 'org.apache.beam', name: 'beam-runners-google-cloud-dataflow-java', version:'2.5.0'
}

task execute (type:JavaExec) {
    main = System.getProperty("mainClass")
    classpath = sourceSets.main.runtimeClasspath
    systemProperties System.getProperties()
    args System.getProperty("exec.args").split()
}

Explanation of build.gradle

  1. We use the task execute (type:JavaExec) in order to easily pass runtime flags into the Java Dataflow pipeline program. For example, we can specify the main class (since we have more than one pipeline in the same folder) and we can pass specific Dataflow arguments (i.e., specific PipelineOptions).

  2. The line of build.gradle that reads runtime group: 'org.apache.beam', name: 'beam-runners-google-cloud-dataflow-java', version:'2.5.0' is very important. It provides the Cloud Dataflow runner that allows you to execute pipelines in Google Cloud Platform.
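One caveat: nothing in the build.gradle above actually reads a dataflow-runner property, so -Pdataflow-runner is effectively a no-op here (in Beam's Maven quickstart it activates a profile that adds the Dataflow runner dependency). If you want the flag to control which runner ends up on the classpath, a minimal sketch, assuming the rest of the build file stays as shown, is:

dependencies {
    compile group: 'org.apache.beam', name: 'beam-sdks-java-core', version: '2.5.0'
    runtime group: 'org.apache.beam', name: 'beam-runners-direct-java', version: '2.5.0'
    // Pull in the Dataflow runner only when -Pdataflow-runner is passed on the command line.
    if (project.hasProperty('dataflow-runner')) {
        runtime group: 'org.apache.beam', name: 'beam-runners-google-cloud-dataflow-java', version: '2.5.0'
    }
}

Either way, it is --runner=DataflowRunner inside -Dexec.args that actually selects the runner at execution time.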

Categorize answered 31/7, 2018 at 17:45 Comment(2)
Wow, thanks for the detailed answer! That's exactly what I was looking for.Kravits
In addition, I found a Colab notebook that walks you through the Apache Beam example using Gradle for the build. It is pretty good.Kravits

Update: 7th December 2020

Dataflow templates can also be staged using Gradle.

To stage a template:

Here are the mandatory parameters:

  • project
  • region
  • gcpTempLocation (good to have if you don't have bucket-creation access; if not given, one will be created automatically)
  • stagingLocation
  • templateLocation

Here is the sample command line in gradle:

gradle clean execute -D mainClass=com.something.mainclassname -D exec.args="--runner=DataflowRunner --project=<project_id> --region=<region_name> --gcpTempLocation=gs://bucket/somefolder --stagingLocation=gs://bucket/somefolder --templateLocation=gs://bucket/somefolder"

Assumptions:

  • GOOGLE_APPLICATION_CREDENTIALS environment variable is set with a service account key.

  • gradle is installed.

  • JAVA_HOME environment variable is set.

  • Bare minimum dependencies are added (a build.gradle sketch using these dependencies follows this list).

    • compile 'org.apache.beam:beam-sdks-java-core:2.22.0'
    • compile 'org.apache.beam:beam-sdks-java-io-google-cloud-platform:2.22.0'
    • compile 'org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:2.22.0'
    • compile 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.22.0'
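Putting those assumptions together, a minimal build.gradle sketch could look like the following; only the dependency coordinates come from the original post, while the rest (the execute task pattern from the earlier answer, mavenCentral, the java plugin) is assumed:

apply plugin: 'java'

repositories {
    mavenCentral()
}

dependencies {
    // Versions as listed above; on newer Gradle versions use 'implementation' instead of 'compile'.
    compile 'org.apache.beam:beam-sdks-java-core:2.22.0'
    compile 'org.apache.beam:beam-sdks-java-io-google-cloud-platform:2.22.0'
    compile 'org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:2.22.0'
    compile 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.22.0'
}

// -DmainClass picks the pipeline class; -Dexec.args carries the staging flags shown above.
task execute(type: JavaExec) {
    main = System.getProperty("mainClass")
    classpath = sourceSets.main.runtimeClasspath
    systemProperties System.getProperties()
    args System.getProperty("exec.args").split()
}

Once the run with --templateLocation completes, the template file is written to that GCS path and can be launched from the console or the CLI.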
Gainsborough answered 7/12, 2020 at 8:10 Comment(0)
