Read from BigQuery into Spark in an efficient way?

When using the BigQuery Connector to read data from BigQuery, I found that it first copies all the data to Google Cloud Storage and then reads it into Spark in parallel. For big tables the copy stage takes a very long time. Is there a more efficient way to read data from BigQuery into Spark?
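
For context, a minimal PySpark sketch of the kind of read I mean, via the Hadoop BigQuery connector (project, dataset, table and bucket names below are placeholders):

    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-read-via-gcs").getOrCreate()
    sc = spark.sparkContext

    conf = {
        "mapred.bq.project.id": "my-project",                        # placeholder project
        "mapred.bq.gcs.bucket": "my-staging-bucket",                 # bucket used for the export stage
        "mapred.bq.temp.gcs.path": "gs://my-staging-bucket/tmp/bq",  # placeholder temp path
        "mapred.bq.input.project.id": "my-project",
        "mapred.bq.input.dataset.id": "my_dataset",
        "mapred.bq.input.table.id": "my_table",
    }

    # Each record arrives as (row id, JSON string); the slow part is the
    # export to GCS that happens before Spark can read these splits in parallel.
    table_data = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )
    rows = table_data.map(lambda kv: json.loads(kv[1]))
    print(rows.take(3))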

Another question: reading from BigQuery consists of two stages (copying to GCS, then reading from GCS in parallel). Is the copy stage affected by the Spark cluster size, or does it take a fixed amount of time?

Humber answered 4/1, 2017 at 10:57 Comment(1)
I updated my answer for the 2nd part.Basrelief

Maybe a Googler will correct me, but AFAIK that's the only way. This is because under the hood it also uses the BigQuery Connector for Hadoop, which according to the docs:

The BigQuery connector for Hadoop downloads data into your Google Cloud Storage bucket before running a Hadoop job.

As a side note, this is also true when using Dataflow - it too performs an export of BigQuery table(s) to GCS first and then reads them in parallel.

WRT whether the copying stage (which is essentially a BigQuery export job) is influenced by your Spark cluster size, or whether it takes a fixed time: no, it isn't influenced by your cluster. BigQuery export jobs are nondeterministic in duration, and BigQuery uses its own resources for the export to GCS, i.e. not your Spark cluster's.
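
To make the two stages concrete, here is a rough sketch of doing them by hand (project, table and bucket names are placeholders). Stage 1 runs entirely on BigQuery's side, so only stage 2 scales with your Spark cluster:

    from google.cloud import bigquery
    from pyspark.sql import SparkSession

    # Stage 1: a BigQuery extract (export) job. This runs on BigQuery's own
    # resources; the size of your Spark cluster has no effect on it.
    bq = bigquery.Client(project="my-project")                    # placeholder project
    destination = "gs://my-staging-bucket/export/my_table-*.avro" # placeholder bucket
    extract_job = bq.extract_table(
        "my-project.my_dataset.my_table",                         # placeholder table
        destination,
        job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
    )
    extract_job.result()  # wait for the export to finish

    # Stage 2: Spark reads the exported shards in parallel. This is the part
    # that scales with the cluster (needs the spark-avro package available).
    spark = SparkSession.builder.appName("bq-export-then-read").getOrCreate()
    df = spark.read.format("avro").load(destination)
    print(df.count())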

Basrelief answered 4/1, 2017 at 11:47 Comment(3)
Hey, I have a 3.83 TB table in BigQuery. I am trying to connect Spark to this table using the Spark connector, but it shows no sharded files created even after waiting a long time (10-15 mins). Is it expected to remain idle for so long? I am able to read small tables using the same connector.Luting
@abhishekjha - I would suggest using BQ's compute capabilities instead of bringing such huge data into Spark; you can trigger BQ query jobs from Spark pipelines. If you are going to perform this 4 TB operation once daily, you will end up paying a substantial Storage API cost, and a lot of fine tuning is required at the Spark layer to deal with TBs of data.Triatomic
If we are reading a small table or just a few columns via Spark jobs, which would you prefer: prepare and run SQL at the BQ layer, or access the data using the BQ APIs?Horsy

spark-bigquery-connector uses the BigQuery Storage API, which is super fast.
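
A minimal sketch of such a read (the table name is a placeholder; the connector jar has to be available, e.g. via --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:<version>):

    from pyspark.sql import SparkSession

    # Reads go through the BigQuery Storage API, with no intermediate
    # export to GCS. Table name is a placeholder.
    spark = SparkSession.builder.appName("bq-storage-api-read").getOrCreate()

    df = (
        spark.read.format("bigquery")
        .option("table", "my-project.my_dataset.my_table")
        .load()
    )
    df.printSchema()
    df.show(5)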

Stephanstephana answered 27/2, 2020 at 6:42 Comment(0)

I strongly suggest verifying whether you really need to move data from BQ storage into the Spark engine at all.
BQ comes with its own compute and storage capabilities, so what is stopping you from leveraging native BQ compute? It is free if you are on the fixed-slot billing model, and native BQ compute is in no way inferior to Spark's computation capabilities. If you have pipelines in Spark beyond ingestion, prefer to move pre-aggregation, enrichment and ETL directly into BQ: it will perform better, cost less and be easier to manage. BQ is a serverless service, so you don't need to predict the number of nodes required to process the data if volumes change abruptly.
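
As a rough sketch of what that looks like in practice (project, table and column names below are hypothetical): run the aggregation as a BQ query job and hand only the small result to Spark.

    from google.cloud import bigquery
    from pyspark.sql import SparkSession

    # Let BigQuery do the heavy aggregation; only the small result set
    # ever reaches Spark. Names are hypothetical.
    bq = bigquery.Client(project="my-project")
    result = bq.query(
        """
        SELECT customer_id, SUM(amount) AS total_amount
        FROM `my-project.my_dataset.my_table`
        GROUP BY customer_id
        """
    ).to_dataframe()  # needs pandas; fine here because the result is small

    spark = SparkSession.builder.appName("bq-pushdown").getOrCreate()
    df = spark.createDataFrame(result)  # continue the rest of the pipeline in Spark
    df.show(5)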

Another downside with Spark is cost:

  1. Storage API usage adds a lot of cost if you are working with large datasets (Dataproc/Dataflow use the Storage API to read data from BigQuery).
  2. Dataproc node cost.
  3. Dataproc service cost.
  4. Optionally, BQ slot cost is wasted, as you won't be using the slots.
Triatomic answered 10/4, 2021 at 10:10 Comment(0)
