Apache Spark and NiFi Integration
I want to send a NiFi flowfile to Spark, do some transformations in Spark, and send the result back to NiFi so that I can do further operations in NiFi. I don't want the flowfile written to a database or HDFS just to trigger a Spark job. I want to send the flowfile directly to Spark and receive the result directly back in NiFi. I tried using the ExecuteSparkInteractive processor in NiFi but I am stuck. Any examples would be helpful.

Jevons answered 31/10, 2018 at 6:17 Comment(0)
You can't send data directly to Spark unless you use Spark Streaming. With traditional batch execution, Spark needs to read the data from some type of storage such as HDFS. The purpose of ExecuteSparkInteractive is to trigger a Spark job to run on data that has already been delivered to HDFS.

If you want to go the streaming route then there are two options...

1) Directly integrate NiFi with Spark streaming

https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
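A minimal receiver-side sketch of option 1, based on the `nifi-spark-receiver` module described in that post. The NiFi URL, the output port name "Data for Spark", and the batch interval are placeholders for your own setup, and the flow must expose a site-to-site output port with that name:

```scala
import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.nifi.spark.NiFiReceiver
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NiFiToSpark {
  def main(args: Array[String]): Unit = {
    // Site-to-site config pointing at a NiFi output port (URL and port name are placeholders)
    val clientConfig = new SiteToSiteClient.Builder()
      .url("http://localhost:8080/nifi")   // your NiFi instance
      .portName("Data for Spark")          // the output port in your flow
      .buildConfig()

    val ssc = new StreamingContext(new SparkConf().setAppName("NiFiToSpark"), Seconds(10))

    // Each record is a NiFiDataPacket carrying the flowfile content and attributes
    val packets = ssc.receiverStream(new NiFiReceiver(clientConfig, StorageLevel.MEMORY_ONLY))
    packets.map(p => new String(p.getContent)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note this only covers NiFi-to-Spark; getting results back to NiFi is the harder half, which is part of why the Kafka option below is attractive.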

2) Use Kafka to integrate NiFi and Spark

NiFi writes to a Kafka topic, Spark reads from a Kafka topic, Spark writes back to a Kafka topic, NiFi reads from a Kafka topic. This approach would probably be the best option.
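The Spark side of option 2 can be sketched with Structured Streaming's built-in Kafka source and sink. The topic names `nifi-in`/`nifi-out`, the broker address, the transformation, and the checkpoint path are all placeholders; NiFi's PublishKafka and ConsumeKafka processors would sit on either end:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

object NiFiKafkaRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("NiFiKafkaRoundTrip").getOrCreate()

    // Read records that NiFi published to the input topic (placeholder names throughout)
    val in = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "nifi-in")
      .load()

    // Example transformation: upper-case the message body
    val transformed = in
      .selectExpr("CAST(value AS STRING) AS value")
      .withColumn("value", upper(col("value")))

    // Write results to the output topic that NiFi's ConsumeKafka reads from
    val query = transformed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "nifi-out")
      .option("checkpointLocation", "/tmp/nifi-kafka-checkpoint")
      .start()

    query.awaitTermination()
  }
}
```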

Decurion answered 31/10, 2018 at 13:21 Comment(7)
Hey Bryan, what do you think about implementing a service/processor that acts as a broker and delivers flowfiles to subscribers?Depurative
@Depurative I think that is already what is happening in option #1 above right? NiFi makes flow files available via a s2s output port and then Spark streaming retrieves the flow files from the output portDecurion
Hi @Bryan, What about sending data back to Nifi from Spark as a flowfile?Jevons
I don't think Spark offers a "sink" that can be implemented, although I could be wrong, but I assume you could implement something like the other streaming frameworks that send data back to NiFi via site to site - github.com/apache/flink/blob/master/flink-connectors/…Decurion
I would imagine Spark already has a way to write back to Kafka which is why I think the Kafka approach is the best optionDecurion
Your answer makes sense. A lot of managers in the IT industry do not understand the technology and its purpose, so they blatantly ask devs to use whatever latest technology is available to allure potential customersPereyra
I am using approach #1 to send the Nifi Data to Spark. I have set the value of nifi.remote.input.socket.port, but the data is getting stuck in the Nifi Spark output queue. Can you suggest some way to resolve this.Phox
This might help: you can do everything in NiFi by following the steps below:

  1. Use ListSFTP to list files from the landing location.
  2. Use the UpdateAttribute processor to assign the absolute file path to an attribute. You can use this attribute in your Spark code, since the processor in the next step supports Expression Language.
  3. Use the ExecuteSparkInteractive processor. Here you can write Spark code (in Python, Scala, or Java) that reads your input file from the landing location (using the absolute path attribute from step 2) without it flowing through as a NiFi flowfile, and performs operations/transformations on that file (use spark.read... to read the file into a DataFrame). You may write your output to either a Hive external table or a temporary HDFS location.
  4. Use the FetchHDFS processor to read the file from the temporary HDFS location and continue with your further NiFi operations.

Here, you need a Livy setup to run Spark code from NiFi (through ExecuteSparkInteractive). You may look at how to set up Livy and the NiFi controller services needed to use Livy within NiFi.
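Steps 2 and 3 above can be sketched as the snippet you might paste into ExecuteSparkInteractive's Code property. The attribute name `absolute.file.path`, the CSV format, the transformation, and the output path are placeholder assumptions; NiFi evaluates the Expression Language reference before submitting the snippet to Livy, and Livy's interactive session already provides `spark`:

```scala
// ${absolute.file.path} is filled in by NiFi Expression Language before
// submission to Livy (the attribute name from step 2 is a placeholder)
val input = spark.read.option("header", "true").csv("${absolute.file.path}")

// Example transformation: drop rows whose columns are all null
val cleaned = input.na.drop("all")

// Write to a temporary HDFS location for the next NiFi processor to pick up
cleaned.write.mode("overwrite").parquet("/tmp/nifi-spark-output")
```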

Good Luck!!

Draw answered 22/6, 2019 at 20:50 Comment(0)
