BigQuery replaced most of my Spark jobs, am I missing something? [closed]

I've been developing Spark jobs for some years on on-premises clusters, and our team recently moved to the Google Cloud Platform, which lets us leverage BigQuery and related services.

The thing is, I now often find myself writing processing steps in SQL more than in PySpark, since SQL is:

  • easier to reason about (less verbose)
  • easier to maintain (SQL vs. Scala/Python code)
  • easy to run from the GUI if needed
  • fast, without having to really reason about partitioning, caching, and so on

In the end, I only use Spark when I've got something to do that I can't express using SQL.

To be clear, my workflow often looks like this (a sketch follows the list):

  • preprocessing (previously in Spark, now in SQL)
  • feature engineering (previously in Spark, now mainly in SQL)
  • machine learning model and predictions (Spark ML)
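
For illustration, here's a stripped-down sketch of that last step, assuming the SQL preprocessing has already materialized a feature table (all table and column names are made up):

    # Sketch: load features prepared in BigQuery SQL, then train a Spark ML model.
    # Table and column names are hypothetical; the read relies on the
    # spark-bigquery connector, which is preinstalled on Dataproc.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("bq-features-to-spark-ml").getOrCreate()

    # Feature table produced by the SQL preprocessing/feature-engineering steps.
    features = (
        spark.read.format("bigquery")
        .option("table", "my-project.my_dataset.training_features")
        .load()
    )

    # Spark ML expects a single vector column, so assemble the feature columns.
    assembler = VectorAssembler(
        inputCols=["feature_a", "feature_b", "feature_c"],
        outputCol="features",
    )
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, lr]).fit(features)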

Am I missing something? Are there any downsides to using BigQuery this way instead of Spark?

Thanks

Crackup answered 7/5, 2019 at 12:41. Comments (5):
Not in my opinion, but you are asking for an opinion, which is why I voted to close. – Chiasmus
I'm not looking for an opinion; I'm looking for the pros and cons of using BigQuery instead of Spark. Maybe there are things I'm not considering, or drawbacks I'm not aware of. – Hitch
I don't think there are drawbacks. Perhaps keeping Spark code in a version control system is easier than SQL, but that is probably a matter of opinion/choice as well. – Assorted
Maybe you can try to see if BQML is able to replace Spark ML :) – Purlin
@CARREAU Clément: I'd appreciate your suggestions since you have already done this. We do our ingestion in the Spark layer, where incoming data gets joined with one or more dimension tables (in some cases, 3 or 4). I am also thinking of shifting that part to the BQ layer, though it would take a lot of effort. I'm expecting big savings in YARN memory and CPU consumption. How was your experience? – Timbuktu

I'm using PySpark (on GCP Dataproc) and BigQuery, and we have jobs in both. I'll summarize my view of the pros and cons of each system, with the caveat that your environment could be different, so something I list as a pro might not be one for you.

Pros of Spark:

  • better testing of the code; it's simpler to build unit tests and run them with mocked data and classes than to do the same with BigQuery (see the test sketch after this list)
  • it's possible to use SQL (Spark SQL) for operations and even to combine operations over different data sources (databases, files, BQ); see the join sketch after this list
  • we have JSON files in a format that BigQuery cannot parse, even though the files themselves are valid JSON
  • more complicated logic can be implemented naturally in some cases, for example traversing arrays in nested fields and other involved calculations
  • better custom monitoring is possible; when we need to check specific metrics in the pipeline, we can emit them (to StatsD, etc.) more easily
  • more natural for CI/CD processes
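
To illustrate the testing point, here is a minimal sketch of a unit test for a Spark transformation, run locally against made-up data (pytest-style; the function and column names are just for illustration):

    # Sketch of a unit test for a Spark transformation with mocked data.
    # Function and column names are made up for illustration.
    import pytest
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F


    def add_total(df):
        # The transformation under test: a simple derived column.
        return df.withColumn("total", F.col("price") * F.col("qty"))


    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


    def test_add_total(spark):
        df = spark.createDataFrame([(2.0, 3), (5.0, 1)], ["price", "qty"])
        result = {row["total"] for row in add_total(df).collect()}
        assert result == {6.0, 5.0}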
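
And for the mixed-sources point, a sketch of joining a Parquet file with a BigQuery table in one Spark SQL query (paths and table names are hypothetical; the BQ read requires the spark-bigquery connector):

    # Sketch: combine a Parquet file and a BigQuery table in one Spark SQL query.
    # Paths and table names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mixed-sources").getOrCreate()

    # Register each source as a temp view, whatever its origin.
    spark.read.parquet("gs://my-bucket/events/").createOrReplaceTempView("events")
    (
        spark.read.format("bigquery")
        .option("table", "my-project.my_dataset.users")
        .load()
        .createOrReplaceTempView("users")
    )

    joined = spark.sql("""
        SELECT u.country, COUNT(*) AS n_events
        FROM events e
        JOIN users u ON e.user_id = u.id
        GROUP BY u.country
    """)
    joined.show()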

Pros of BigQuery (all assuming the data is already available in BigQuery):

  • simplicity of SQL, when all data is available in a convenient format (see the sketch after this list)
  • DBAs who are not familiar with Python/Scala can still contribute (because they know SQL)
  • awesome infrastructure behind the scenes, very performant
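
As an illustration of that simplicity, a typical aggregation step is a single query that you can run from the GUI, or from code via the official client; a sketch with made-up table and column names:

    # Sketch: run an aggregation query with the google-cloud-bigquery client.
    # Table and column names are made up for illustration.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses default project and credentials

    sql = """
        SELECT
          user_id,
          COUNT(*) AS n_orders,
          AVG(amount) AS avg_amount
        FROM `my-project.my_dataset.orders`
        GROUP BY user_id
    """

    for row in client.query(sql).result():
        print(row["user_id"], row["n_orders"], row["avg_amount"])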

With both approaches it's possible to check results quickly in a GUI; for example, a Jupyter notebook lets you run PySpark interactively. I can't comment on the ML-related aspects, though.

Dunfermline answered 21/2, 2023 at 22:54

A con I can see is the additional time the Hadoop cluster needs to spin up and finish a job. Making a direct request to BigQuery reduces that overhead.

If your tasks need parallel processing, I would recommend using Spark, but if your app mainly accesses BQ, you might want to use the BQ Client Libraries and split your current tasks:

  • BigQuery Client Libraries. They are optimized for connecting to BQ. Here is a QuickStart, and you can use different programming languages such as Python or Java, among others.

  • Spark jobs. If you still need to perform transformations in Spark and read the data from BQ, you can use the Dataproc BQ connector. The connector is installed on Dataproc by default, but you can also install it on-premises so that you can keep running your Spark ML jobs against BQ data (see the sketch after this list). In case it helps, you might also want to look at GCP services like AutoML, BQ ML, and AI Platform Notebooks; they are specialized services for machine learning and AI.
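
As a sketch of that second option: outside Dataproc you attach the connector yourself, and can then read from and write back to BQ (the package version, bucket, and table names below are illustrative; check the connector docs for the current release):

    # Sketch: using the spark-bigquery connector outside Dataproc.
    # The connector version, bucket, and table names are illustrative.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("spark-bq")
        # On Dataproc the connector is preinstalled; elsewhere, pull it in:
        .config(
            "spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1",
        )
        .getOrCreate()
    )

    df = (
        spark.read.format("bigquery")
        .option("table", "my-project.my_dataset.raw_events")
        .load()
    )

    # ... Spark / Spark ML transformations here ...

    # Writing back to BQ needs a staging bucket (or the direct write method).
    (
        df.write.format("bigquery")
        .option("table", "my-project.my_dataset.scored_events")
        .option("temporaryGcsBucket", "my-staging-bucket")
        .mode("overwrite")
        .save()
    )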

Saluki answered 31/7, 2019 at 23:26
