I personally prefer to package my code, and copy the *.whl package to DBFS, where I can install the tested package and import it.
Edit: To be more explicit.
The notebook used in production can't be modified without breaking the production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then I replace the production notebook with my new notebook.
This can be solved either by having separate DEV/TST/PRD environments or by having versioned packages that can be modified in isolation. I'll clarify later on.
My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks, if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined.
Is there a way to do efficient and systematic testing?
Yes. By combining the versioned-package method I mentioned with databricks-connect, you can use your IDE, implement tests, and have proper git integration.
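
To illustrate the testing point, here is a minimal sketch of a packaged function with a pytest-style test that can be run from an IDE. The function `filter_by_process_date` and the `event_date` field are hypothetical names chosen for this example, and plain Python rows are used instead of a Spark DataFrame to keep the sketch self-contained; with pyspark installed locally, the same test shape would apply to a small DataFrame.

```python
from datetime import date


def filter_by_process_date(rows, process_date):
    """Keep only the rows whose 'event_date' equals process_date.

    Hypothetical example of logic that lives in the versioned package
    instead of in a notebook cell.
    """
    return [row for row in rows if row["event_date"] == process_date]


def test_keeps_only_matching_rows():
    # A few hand-written rows stand in for the small anonymized sample.
    rows = [
        {"id": 1, "event_date": date(2020, 1, 1)},
        {"id": 2, "event_date": date(2020, 1, 2)},
    ]
    result = filter_by_process_date(rows, date(2020, 1, 1))
    assert [row["id"] for row in result] == [1]
```

Because the logic sits in an ordinary module, a test runner such as pytest can collect `test_keeps_only_matching_rows` without a cluster being involved.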
Git integration is very simple, but this is not my main concern.
Built-in git integration is actually very poor when working in bigger teams. You can't develop in the same notebook simultaneously, as there's a flat and linear accumulation of changes that are shared with your colleagues. Besides that, linking and unlinking repositories is prone to human error, which can cause your notebooks to be synchronized into the wrong folders and runs to break because notebooks can't be imported. I advise you to use my packaging solution here as well.
The packaging solution works as follows:
- On your desktop, install pyspark
- Download some anonymized data to work with
- Develop your code with small bits of data, writing unit tests
- When ready to test on big data, uninstall pyspark, install databricks-connect
- When performance and integration are sufficient, push code to your remote repo
- Create a build pipeline that runs automated tests, and builds the versioned package
- Create a release pipeline that copies the versioned package to DBFS
- In a "runner notebook" accept "process_date" and "data folder/filepath" as arguments, and import modules from your versioned package
- Pass the arguments to your module to run your tested code
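
The last two steps could look roughly like the sketch below. Everything named here is an assumption made for illustration: `RunArgs`, `parse_run_args`, and `run` are hypothetical, `run` merely stands in for the entry point of your versioned package, and `/mnt/data/input` is a made-up path. In a real runner notebook the raw strings would typically come from widgets rather than literals.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class RunArgs:
    """Validated arguments for one run (hypothetical container)."""
    process_date: date
    data_path: str


def parse_run_args(raw_process_date: str, raw_data_path: str) -> RunArgs:
    """Validate the raw string arguments handed to the runner notebook."""
    # strptime raises ValueError if the date is not in YYYY-MM-DD form.
    process_date = datetime.strptime(raw_process_date, "%Y-%m-%d").date()
    data_path = raw_data_path.strip()
    if not data_path:
        raise ValueError("data_path must not be empty")
    return RunArgs(process_date=process_date, data_path=data_path)


def run(args: RunArgs) -> str:
    """Stand-in for the packaged entry point; a real one would process data."""
    return f"processing {args.data_path} for {args.process_date.isoformat()}"


# In a Databricks notebook the raw values would usually be read from widgets:
#   raw_date = dbutils.widgets.get("process_date")
#   raw_path = dbutils.widgets.get("data_path")
# Literal values (assumed for this sketch) keep the example runnable anywhere.
args = parse_run_args("2020-01-31", "/mnt/data/input")
print(run(args))
```

Keeping the notebook this thin means that everything worth testing lives in the package, and the notebook itself only wires arguments to tested code.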