Execute python scripts in Azure DataFactory
I have my data stored in blobs, and I have written a Python script to do some computations and create another CSV. How can I execute this in Azure Data Factory?

Denazify answered 11/9, 2018 at 7:46 Comment(0)

You could use an Azure Data Factory V2 Custom Activity for this. A Custom Activity lets you run an arbitrary command on an Azure Batch pool, so you can invoke your Python script directly.

Please refer to this sample on GitHub.
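For illustration, a minimal sketch of what such a Custom Activity definition could look like. The linked service names, folder path, and script name here are placeholder assumptions, not values from the sample; it presumes an Azure Batch linked service backed by a pool with Python installed, and the script uploaded to the blob container referenced by folderPath:

```json
{
    "name": "RunPythonScript",
    "type": "Custom",
    "linkedServiceName": {
        "referenceName": "AzureBatchLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "command": "python main.py",
        "resourceLinkedService": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "folderPath": "customactivity/scripts"
    }
}
```

Data Factory copies the contents of folderPath to the working directory on the Batch node before running the command, which is why `python main.py` can refer to the script by name.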

Nilson answered 11/9, 2018 at 7:58 Comment(4)
So I have to add a Batch service, then create a custom activity, link it to that Batch service, and execute the .py file like we do in cmd? Or am I wrong anywhere?Denazify
@Denazify Yes, you could follow the sample in the doc in my answer.Nilson
@Denazify If you think my answer helps you, you could mark it as the answer. Thanks a lot.Nilson
@JayGong Sorry to dig this up after such a long time. I tried to follow the tutorial and there is just one thing that doesn't work. After generalizing the VM (sudo waagent -deprovision+user) it seems to lose all the information about the Python packages I installed. Any ideas how to overcome that?Artemis

Another option is using a DatabricksSparkPython activity. This makes sense if you want to scale out, but it could require some code modifications for PySpark support. A prerequisite, of course, is an Azure Databricks workspace. You have to upload your script to DBFS and can then trigger it via Azure Data Factory. The following example triggers the script pi.py:

{
    "activity": {
        "name": "MyActivity",
        "description": "MyActivity description",
        "type": "DatabricksSparkPython",
        "linkedServiceName": {
            "referenceName": "MyDatabricksLinkedservice",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "pythonFile": "dbfs:/docs/pi.py",
            "parameters": [
                "10"
            ],
            "libraries": [
                {
                    "pypi": {
                        "package": "tensorflow"
                    }
                }
            ]
        }
    }
}

See the Documentation for more details.
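As a sketch of the upload step, assuming the Databricks CLI is installed and configured with a workspace token (the local filename and DBFS path simply mirror the example above):

```shell
# Copy the local script to DBFS so the DatabricksSparkPython
# activity can reference it as dbfs:/docs/pi.py.
# Assumes `databricks configure --token` has already been run.
databricks fs cp pi.py dbfs:/docs/pi.py
```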

Monstrance answered 11/9, 2018 at 8:25 Comment(1)
Although valid, it's kind of overkill to create a Databricks cluster just to run a simple Python script that creates a CSV. Only choose this option if you REALLY do heavy calculations. Otherwise it's a waste of time and resources...Sharpsighted

© 2022 - 2024 — McMap. All rights reserved.