Running an Apache Beam/Google Cloud Dataflow job from a maven-built jar
Asked Answered
N

2

6

I have a Google Cloud Dataflow job that I'm running from IntelliJ IDEA using the following command string:

compile exec:java -Dexec.mainClass=com.mygroup.mainclass "-Dexec.args=--..."

It runs fine from here, I want to deploy this onto a local server to be run automatically at build time. The args specify pipeline options; at any given build we need to start three different jobs using this pipeline and having to recompile three times is wasteful. So I'm using mvn package to produce a jar file in the following manner:

clean compile assembly:single

The problem is, when I run the pipeline via java -jar my_pipeline-1.0-SNAPSHOT-jar-with-dependencies.jar --args in the project's target directory, I'm getting an exception:

Exception in thread "main" java.lang.IllegalArgumentException:
Unknown 'runner' specified 'DataflowRunner', supported pipeline runners [DirectRunner]

Posts on the Beam mailing list suggest passing -Pdataflow-runner to set the classpath at execution time, but I haven't found a way to make that work if I'm just calling the jar. I've tried specifying the profile during the compile and package steps, but that hasn't helped.

Here's my pom.xml. It might be a little messy; I've been trying the scattershot approach to trying to make this work but nothing has stuck to the wall yet.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.mygroup</groupId>
    <artifactId>data_dump</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <slf4j.version>1.7.5</slf4j.version>
        <junit.version>4.8.2</junit.version>
        <hamcrest.version>1.3</hamcrest.version>
        <mockito.version>1.10.19</mockito.version>
        <bigquery.version>v2-rev312-1.22.0</bigquery.version>
        <powermock.version>1.6.6</powermock.version>
        <beam.version>2.0.0</beam.version>
    </properties>
    <build>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>com.mygroup.mainclass</mainClass>
                            <addClasspath>true</addClasspath>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.3</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <!-- Build an executable JAR -->
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>3.0.2</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <classpathPrefix>.</classpathPrefix>
                            <mainClass>com.mygroup.mainclass</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/**</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>cobertura-maven-plugin</artifactId>
                <configuration>
                    <aggregate>true</aggregate>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>clean</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <goals>
                            <goal>java</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <executable>java</executable>
                    <arguments>
                    </arguments>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <profiles>
        <profile>
            <id>direct-runner</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <!-- Makes the DirectRunner available when running a pipeline. -->
            <dependencies>
                <dependency>
                    <groupId>org.apache.beam</groupId>
                    <artifactId>beam-runners-direct-java</artifactId>
                    <version>${beam.version}</version>
                    <scope>runtime</scope>
                </dependency>
            </dependencies>
        </profile>
        <profile>
            <id>dataflow-runner</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <dependencies>
                <dependency>
                    <groupId>com.google.cloud.dataflow</groupId>
                    <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
                    <version>${beam.version}</version>
                </dependency>
            </dependencies>
        </profile>
        <profile>
            <id>flink-runner</id>
            <!-- Makes the FlinkRunner available when running a pipeline. -->
            <dependencies>
                <dependency>
                    <groupId>org.apache.beam</groupId>
                    <artifactId>beam-runners-flink_2.10</artifactId>
                    <version>${beam.version}</version>
                    <scope>runtime</scope>
                </dependency>
            </dependencies>
            <build>
                <plugins>
                    <plugin>
                        <groupId>org.apache.maven.plugins</groupId>
                        <artifactId>maven-shade-plugin</artifactId>
                    </plugin>
                </plugins>
            </build>
        </profile>
    </profiles>

    <dependencies>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-core</artifactId>
            <version>${beam.version}</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-bigquery</artifactId>
            <version>0.18.0-beta</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-storage</artifactId>
            <version>1.0.0</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud.dataflow</groupId>
            <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
            <version>${beam.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-direct-java</artifactId>
            <version>${beam.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
            <version>${beam.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
            <version>${beam.version}</version>
            <scope>runtime</scope>
        </dependency>

        <!-- slf4j API frontend binding with JUL backend -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-jdk14</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.hamcrest</groupId>
            <artifactId>hamcrest-library</artifactId>
            <version>1.3</version>
        </dependency>
        <dependency>
            <groupId>com.microsoft.azure</groupId>
            <artifactId>azure-storage</artifactId>
            <version>5.0.0</version>
        </dependency>
        <dependency>
            <groupId>com.esotericsoftware.yamlbeans</groupId>
            <artifactId>yamlbeans</artifactId>
            <version>1.08</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-collections4</artifactId>
            <version>4.1</version>
        </dependency>
        <dependency>
            <groupId>org.powermock</groupId>
            <artifactId>powermock-module-junit4</artifactId>
            <version>${powermock.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.powermock</groupId>
            <artifactId>powermock-api-easymock</artifactId>
            <version>${powermock.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.easymock</groupId>
            <artifactId>easymock</artifactId>
            <version>3.4</version>
        </dependency>
    </dependencies>
    <repositories>
        <repository>
            <id>apache-maven-repo</id>
            <name>Apache Nightlies</name>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>
</project>
Nichani answered 27/7, 2017 at 17:12 Comment(0)
N
6

Okay, I've solved this. There were a couple of things wrong with my pom.

First, the dependencies were in the wrong order. I moved the beam-runners-google-cloud-dataflow-java dependency to the top of the list and the error I was receiving went away.

I further encountered another exception:

Caused by: java.lang.IllegalStateException: Unable to find registrar for gs.

Following the instructions in this question and building with package instead of assembly, I was able to get my job to start. Whew!

Nichani answered 27/7, 2017 at 20:54 Comment(1)
I was facing the similar issue and had to do this step only to resolve it. "I moved the beam-runners-google-cloud-dataflow-java dependency to the top of the list and the error I was receiving went away." ( beam version : 2.2.4.0)Allwein
T
0

I could fix it by adding activation tag to the dataflow-runner profile, seems missing in WordCount example 2.25 I test.

<activation>
  <activeByDefault>true</activeByDefault>
</activation>
Tristan answered 18/12, 2020 at 2:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.