Flake8 linting for Databricks Python code in GitHub using workflows
I have my Databricks Python code in GitHub. I set up a basic workflow to lint the Python code using flake8. This fails because the names that are implicitly available to my script when it runs on Databricks (like spark, sc, dbutils, getArgument, etc.) are not available when flake8 lints it outside Databricks (in a GitHub Ubuntu VM).

How can I lint Databricks notebooks in GitHub using flake8?

E.g. errors I get:

test.py:1:1: F821 undefined name 'dbutils'
test.py:3:11: F821 undefined name 'getArgument'
test.py:5:1: F821 undefined name 'dbutils'
test.py:7:11: F821 undefined name 'spark'

My notebook in GitHub:

dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")

jdbcurl = getArgument("my_jdbcurl")

dbutils.fs.ls(".")

df_node = spark.read.format("jdbc")\
  .option("driver", "org.mariadb.jdbc.Driver")\
  .option("url", jdbcurl)\
  .option("dbtable", "my_table")\
  .option("user", "my_username")\
  .option("password", "my_pswd")\
  .load()

My .github/workflows/lint.yml:

on:
  pull_request:
    branches: [ master ]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v1
      with:
        python-version: 3.8
    - run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Lint with flake8
      run: |
        pip install flake8
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
Floriated answered 3/4, 2020 at 19:56 Comment(2)
You should find out how Databricks invokes flake8, including what dependencies it provides. That will tell you how you should invoke flake8 in GitHub Actions.Unswear
@bk2204, I didn't quite get that. In this case it's GitHub invoking flake8, not Databricks.Floriated

TL;DR

Don't use the built-in variable dbutils in code that needs to run both locally (IDE, unit tests, ...) and on Databricks (production). Create your own instance of the DBUtils class instead.


Here is what we ended up doing:

We created a new dbk_utils.py:

from pyspark.sql import SparkSession

def get_dbutils(spark: SparkSession):
    try:
        # On a Databricks cluster (or with databricks-connect) this import succeeds
        from pyspark.dbutils import DBUtils
        return DBUtils(spark)

    except ModuleNotFoundError:
        # Otherwise fall back to the dbutils instance in the notebook's IPython namespace
        import IPython
        return IPython.get_ipython().user_ns["dbutils"]

And update the code that uses dbutils to use this utility:

from pyspark.sql import SparkSession
from dbk_utils import get_dbutils

# Create the session explicitly so this also runs outside a notebook
spark = SparkSession.builder.getOrCreate()
my_dbutils = get_dbutils(spark)

my_dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")
my_dbutils.fs.ls(".")

jdbcurl = my_dbutils.widgets.get("my_jdbcurl")  # widgets.get replaces the notebook-global getArgument

df_node = spark.read.format("jdbc")\
  .option("driver", "org.mariadb.jdbc.Driver")\
  .option("url", jdbcurl)\
  .option("dbtable", "my_table")\
  .option("user", "my_username")\
  .option("password", "my_pswd")\
  .load()

If you're trying to do unit testing as well, the same idea applies: pass dbutils into your functions so a test can substitute a stub.
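A minimal sketch (the function and test names here are illustrative, not from the original post) that stubs dbutils with unittest.mock so the widget logic can be tested without a cluster:

# test_notebook_logic.py -- illustrative sketch; needs only pytest and the stdlib
from unittest.mock import MagicMock

def read_jdbc_url(my_dbutils):
    # The code under test takes dbutils as a parameter instead of a global
    my_dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")
    return my_dbutils.widgets.get("my_jdbcurl")

def test_read_jdbc_url():
    fake_dbutils = MagicMock()
    fake_dbutils.widgets.get.return_value = "jdbc:mariadb://localhost/db"
    assert read_jdbc_url(fake_dbutils) == "jdbc:mariadb://localhost/db"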

Floriated answered 8/5, 2023 at 19:26 Comment(0)

One thing you can do is this:

from pyspark.sql import SparkSession


spark = SparkSession.builder.getOrCreate()

This will work with or without Databricks, in normal Python or in the pyspark client.

To detect whether you are in a file or in a Databricks notebook, you can run:

try:
    __file__
    print("We are in a file, like in our IDE or being tested by flake8.")
except NameError:
    print("We are in a Databricks notebook. Act accordingly.")

You could then conditionally initialize or create dummy variables for display() and other tools.
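A minimal sketch of that conditional setup (IN_NOTEBOOK and the display stub are illustrative names, not part of Databricks' API):

try:
    __file__  # only defined when running from a file
    IN_NOTEBOOK = False
except NameError:
    IN_NOTEBOOK = True

if not IN_NOTEBOOK:
    def display(df):
        # Local stand-in for the notebook display() helper
        df.show()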

This is only a partial solution; I am working on a better one and will keep this answer updated.

Disappear answered 10/8, 2021 at 9:43 Comment(2)
how do we handle dbutils?Anatolian
@JamesOwers: from pyspark.dbutils import DBUtils; dbutils = DBUtils() works in normal pyspark, on Databricks via databricks-connect, or in Databricks notebooks.Disappear

You can add --builtins=dbutils,spark,display to tell flake8 to ignore names that are predefined in the Databricks notebook environment.
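
For example, adapting the lint step from the question (which names you whitelist is up to you; these come from the F821 errors above plus display):

flake8 . --count --select=E9,F63,F7,F82 --builtins=dbutils,sc,spark,display,getArgument --show-source --statistics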

Dosi answered 26/1, 2023 at 18:46 Comment(2)
But you will hit the next issue: functions imported with %run are flagged as undefined. Curious if anyone has a solution to that. So far I've tried flake8_nb and nbqa; both seem to require that the imported notebook has an .ipynb file extension, but Databricks syntax leaves that extension out.Dosi
Added a solution/answer. Also, --builtins=dbutils,spark,display would satisfy flake8, but in the long term, in production code, you'll need to run unit tests and so on, so you'll end up creating local spark sessions anyway.Floriated

In my opinion, no linter works for every use case, so here is what I do: I use a pre-commit hook and ignore rule F821.

# Flake rules: https://lintlyci.github.io/Flake8Rules/
- repo: https://gitlab.com/pycqa/flake8
  rev: 3.8.4
  hooks:
    - id: flake8
      exclude: (^docs/)
      additional_dependencies: [flake8-typing-imports==1.7.0]
      # F821 undefined name
      args:
        [
          "--max-line-length=127",
          "--config=setup.cfg",
          "--ignore=F821",
        ]

To match your syntax, add the --ignore flag:

flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --ignore=C901,F821 --statistics
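
If you'd rather keep F821 active for ordinary modules and suppress it only for notebook code, flake8's --per-file-ignores option can scope the exception (the notebooks/ path is an assumption about your repo layout):

flake8 . --count --max-line-length=127 --per-file-ignores="notebooks/*.py:F821" --statistics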
Knighthood answered 10/8, 2021 at 22:9 Comment(0)
