How to integrate Google Cloud SQL with Google Big Query
Asked Answered
E

5

12

I am designing a solution in which Google Cloud SQL will be used to store all data from the regular functioning of the app(kind of OLTP data). The data is expected to grow over time into pretty large size. The data itself is relational in nature and hence we have chosen Cloud SQL instead of Cloud Datastore.

This data needs to be fed into Big Query for analytics and this needs to be near real-time analytics (as the best case), although realistically some lag can be expected. But I am trying to design a solution which reduces this lag to minimum possible.

My question has 3 parts -

  1. Should I use Cloud SQL for storing data and then move it to BigQuery or change the basic design itself and use BigQuery for storing the data initially as well? Is BigQuery suitable for use for regular, low-latency OLTP workloads?(I don't think so - is my assumption correct?)

  2. What is the recommended/best practice for loading Cloud SQL data into BigQuery and have this integration work near real-time?

  3. Is Cloud Dataflow a good option? If I connect Cloud SQL to Cloud DataFlow and further to BigQuery - will it work? Or is there any other way to achieve this which is better(as asked in question 2)?

Eadmund answered 22/9, 2017 at 17:13 Comment(0)
H
15

Take a look at how WePay does this:

The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).

With Airflow they manage to keep BigQuery synchronized to their MySQL database every 15 minutes.

Headrest answered 22/9, 2017 at 17:35 Comment(0)
A
14

BigQuery supports Cloud SQL federated queries which lets you directly query Cloud SQL database from BigQuery. To keep Cloud SQL table in sync with BigQuery, you can write a simple script with following query to sync two tables every hour.

INSERT
   demo.customers (column1)
SELECT
   *
FROM
   EXTERNAL_QUERY(
      "project.us.connection",
      "SELECT column1 FROM mysql_table WHERE timestamp > ${timestamp};");

Just remember replace the ${timestamp} with the current timestamp - 1 hour.

Arose answered 12/9, 2019 at 0:27 Comment(0)
L
0

Another method would be to split the write process to CloudSQL and to Cloud Pub/Sub and then have a Dataflow reader to stream into BigQuery. This works well when you have materially different target schema for your BigQuery tables - which is common when denormalizing your relational data.

The upside is that you can reduce overall latency to say a few seconds; however, the main downside is that if your transactional data is highly mutating you will have to create a versioning scheme to track changes.

Lenwood answered 20/12, 2018 at 17:0 Comment(0)
D
0

Google has provided a reference article on this subject related to using a change data capture tool to identify the changed data and only pushing that.

This makes some assumptions that may not work for you:

  • willingness to learn debezium
  • willingness to let GCP connect to your source MySQL database

If those work for your situation it seems like a good solution.

Draftsman answered 22/7, 2021 at 13:4 Comment(0)
P
0

I think you can use federated queries as one possible solution:

A federated query is a way to send a query statement to an external database and get the result back as a temporary table. Federated queries use the BigQuery Connection API to establish a connection with the external database. In your standard SQL query, you use the EXTERNAL_QUERY function to send a query statement to the external database, using that database's SQL dialect. The results are converted to BigQuery standard SQL data types.

You can use federated queries with the following external databases:

Cloud Spanner Cloud SQL After the initial one-time setup, you can write a query with the EXTERNAL_QUERY SQL function.

I leave you the documentation so you can implement it on your project: https://cloud.google.com/bigquery/docs/federated-queries-intro

Phyllis answered 2/6, 2022 at 21:14 Comment(1)
This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From ReviewCoverlet

© 2022 - 2024 — McMap. All rights reserved.