I've been developing Spark jobs for several years on on-premises clusters, and our team recently moved to the Google Cloud Platform, which lets us leverage BigQuery among other things.
The thing is, I now often find myself writing processing steps in SQL rather than in PySpark, since SQL is:
- easier to reason about (less verbose; see the sketch after this list)
- easier to maintain (SQL vs Scala/Python code)
- easy to run directly in the web UI if needed
- fast without having to reason much about partitioning, caching, and so on...
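For instance, here is a minimal, hypothetical comparison (all table, bucket, and column names are made up) of what a simple daily aggregation looks like as a BigQuery SQL statement versus the equivalent PySpark code:

```python
# Hypothetical example: daily event counts per user.
# In BigQuery this is a single statement:
#
#   SELECT user_id, DATE(event_ts) AS day, COUNT(*) AS n_events
#   FROM `my_project.my_dataset.events`
#   GROUP BY user_id, day
#
# The PySpark equivalent needs a session, an explicit read, and column expressions
# (the gs:// path assumes the GCS connector is available, e.g. on Dataproc):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("daily_event_counts").getOrCreate()

events = spark.read.parquet("gs://my-bucket/events/")
daily_counts = (
    events
    .withColumn("day", F.to_date("event_ts"))
    .groupBy("user_id", "day")
    .agg(F.count("*").alias("n_events"))
)
daily_counts.show()
```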
In the end, I only use Spark when I need to do something I can't express in SQL.
To be clear, my workflow often looks like this (rough sketch after the list):
- preprocessing (previously in Spark, now in SQL)
- feature engineering (previously in Spark, now mainly in SQL)
- machine learning model and predictions (Spark ML)
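Concretely, the hybrid setup looks roughly like the sketch below. It assumes the spark-bigquery connector is available (e.g. on Dataproc); the table, feature, and label names are placeholders:

```python
# Rough sketch of the hybrid workflow: SQL handles steps 1-2, Spark ML handles step 3.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("bq_features_to_spark_ml").getOrCreate()

# Steps 1-2: preprocessing and feature engineering already materialized in BigQuery
# via SQL; Spark just reads the resulting table through the spark-bigquery connector.
features = (
    spark.read.format("bigquery")
    .load("my_project.my_dataset.training_features")
)

# Step 3: model training and predictions stay in Spark ML.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(features)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
predictions = model.transform(train)
predictions.select("label", "prediction").show()
```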
Am I missing something? Are there any downsides to using BigQuery this way instead of Spark?
Thanks