TensorFlow Extended (TFX): Clarify Beam, Airflow and Kubeflow usage
Asked Answered
J

1

6

I'm hoping someone can clarify the relationship between TensorFlow and its dependencies (Beam, AirFlow, Flink,etc)

I'm referencing the main TFX page: https://www.tensorflow.org/tfx/guide#creating_a_tfx_pipeline_with_airflow ,etc.

In the examples, I see three variants: https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi_pipeline taxi_pipeline_flink.py, taxi_pipeline_kubeflow.py, taxi_pipeline_simple.py

BEAM Example?

There is no "BEAM" example and little describing its use.

Is it correct to assume that taxi_pipeline_simple.py would run even if airflow wasn't installed? I think not since it uses "AirflowDAGRunner". If not, then can you run TFX with only BEAM and its runner? If so, why no example of that?

Flink Example

In taxi_pipeline_flink.py, AirflowDAGRunner is used. I assume that is using AirFlow as an orchestrator which in turn uses Flink as its executor. Correct?

Airflow Example

The page states that BEAM is a required dependency, yet airflow doesn't have beam as one of its executors. It only has SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor, and KubernetesExecutor. Therefore, is BEAM only needed when not using Airflow? When using airflow, what is the purpose of beam if it is required?

Thank you for any insights.

Jacintajacinth answered 17/5, 2019 at 20:6 Comment(4)
feel this is a tfx specific question, did you try their user group?Ornamental
Very much so :) TensorFlow make stackoverflow their de-facto Q/A forum. I think the reason no response is TFX is fairly new and for many it is overkill.Jacintajacinth
Is the question still up-to-date? Right now, I can see a taxi_pipeline_beam.py example and it does NOT use airflow (like expected). Also, regarding beam and airflow: tfx ALWAYS uses Beam, in this case for data manipulation in some components. Then, you have orchestration tools, and Beam can also be used as one of those, as well as Airflow.Rissole
Putting the comment as a more complete answer below..Rissole
R
6

A) In order to run TFX pipelines, you need orchestrators. Examples are Apache Airflow, Kubeflow Pipelines and Apache Beam.

B) Apache Beam is ALSO (and maybe mainly) used for distributed data processing in some TFX components. Therefore, Apache Beam is necessary with any orchestrators you choose (even if you don't use Apache Beam as orchestrator!)

Answering your points:

1) BEAM Example - Right now there is a Beam example at https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_beam.py. As you correctly expected, there is no AirflowDAGRunner there, since this example does not use Airflow as orchestrator.

2) Airflow example - BEAM is a required dependency because of the reason stated above: BEAM is always used by TFX for distributed data processing in some components. Therefore, even with Airflow (or any other) as orchestrator, you need BEAM.

3) Flink example - at the moment, I cannot find this example anywhere (probably due to changes to the link since you posted), but it is possible that Flink would be used as a runner, while Airflow is the orchestrator. However, I couldn't find mentions to Flink in Airflow's documentation.

Hope it helps to some extent.

Rissole answered 23/7, 2019 at 10:10 Comment(1)
Isn't the reason there may not be a good Flink example because Beam is (or, can be) an abstraction over Flink? Therefore any Beam example is sorta implicitly a Flink example (if your Beam is setup to use Flink as it's runner). So in this way, projects like TF can just "speak" Beam, but we can set this up to run on Flink, so the project does not need Flink-specific documentation (thus affirming the value of choosing the abstraction layer... no longer need a DataFlow example plus Flink, plus Spark...)Coagulase

© 2022 - 2024 — McMap. All rights reserved.