How to batch load custom Avro data generated from another source?

The Cloud Spanner docs say that Spanner can export/import Avro format. Can this path also be used for batch ingestion of Avro data generated from another source? The docs seem to suggest it can only import Avro data that was also generated by Spanner.

I ran a quick export job and took a look at the generated files. The manifest and schema look pretty straightforward. I figured I would post here in case this rabbit hole is deep.

table manifest file (people-manifest.json)

{
  "files": [{
    "name": "people.avro-00000-of-00001",
    "md5": "HsMZeZFnKd06MVkmiG42Ag=="
  }]
}

export manifest file (spanner-export.json)

{
  "tables": [{
    "name": "people",
    "manifestFile": "people-manifest.json"
  }]
}

Avro schema (embedded in the data file)

    {"type":"record",
    "name":"people",
    "namespace":
    "spannerexport","
    fields":[
{"name":"fullName",
"type":["null","string"],
"sqlType":"STRING(MAX)"},{"name":"memberId",
"type":"long",
"sqlType":"INT64"}
],
    "googleStorage":"CloudSpanner",
    "spannerPrimaryKey":"`memberId` ASC",
    "spannerParent":"",
    "spannerPrimaryKey_0":"`memberId` ASC",
    "googleFormatVersion":"1.0.0"}    
Grimsley answered 14/8, 2018 at 14:45 Comment(1)
Have you tried what you are asking for, or are you just guessing whether it might or might not be possible?Afteryears

In response to your question: yes! There are two ways to ingest Avro data into Cloud Spanner.

Method 1

If you place Avro files in a Google Cloud Storage bucket arranged as a Cloud Spanner export operation would arrange them, and you generate a manifest formatted the way Cloud Spanner expects, then the import functionality in the Cloud Spanner web interface will work. There may be a lot of tedious formatting work here, which is why the official documentation states that this "import process supports only Avro files exported from Cloud Spanner".
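To make that layout concrete, here is a minimal sketch (not part of the original answer) that generates the two manifest files for Avro data produced elsewhere. It assumes the importer accepts exactly the layout shown in the question: a top-level spanner-export.json listing the tables, a per-table manifest listing the data files, and an "md5" value that is the base64-encoded MD5 digest of each data file (that digest interpretation is an assumption). File and table names are placeholders, and the Avro data files themselves would still need the Spanner-specific schema properties shown in the question (sqlType, spannerPrimaryKey, googleFormatVersion).

import base64
import hashlib
import json
from pathlib import Path

def write_manifests(export_dir, table, avro_files):
    """Write <table>-manifest.json and spanner-export.json next to the Avro files."""
    export_path = Path(export_dir)

    # Per-table manifest (e.g. people-manifest.json): one entry per data file,
    # with what appears to be a base64-encoded MD5 digest (an assumption).
    entries = []
    for name in avro_files:
        digest = hashlib.md5((export_path / name).read_bytes()).digest()
        entries.append({"name": name, "md5": base64.b64encode(digest).decode()})
    (export_path / f"{table}-manifest.json").write_text(
        json.dumps({"files": entries}, indent=2))

    # Top-level export manifest (spanner-export.json): points at each table manifest.
    (export_path / "spanner-export.json").write_text(
        json.dumps({"tables": [{"name": table,
                                "manifestFile": f"{table}-manifest.json"}]},
                   indent=2))

if __name__ == "__main__":
    # Placeholder names for illustration only.
    write_manifests("export", "people", ["people.avro-00000-of-00001"])

Upload the resulting directory to a Cloud Storage bucket and point the Cloud Spanner import UI at it.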

Method 2

Instead of executing the import/export job through the Cloud Spanner web console and relying on the Avro manifest and data files being perfectly formatted, slightly modify the code in either of two public GitHub repositories under the Google Cloud Platform organization that provide import/export (or backup/restore, or export/ingest) functionality for moving data in Avro format into Google Cloud Spanner: (1) Dataflow Templates, especially this file; (2) Pontem, especially this file.

Both of these repositories contain Dataflow jobs that let you move data into and out of Cloud Spanner using the Avro format, and each has its own means of parsing an Avro schema for input (i.e., moving data from Avro into Cloud Spanner). Since your use case is input (ingesting Avro-formatted data into Cloud Spanner), you need to modify the Avro parsing code to fit your specific schema and then execute the Cloud Dataflow job from the command line locally on your machine (the job is then uploaded to Google Cloud Platform).

If you are not familiar with Cloud Dataflow, it is a tool for defining and running jobs with large data sets.
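If the volume is modest and you only want to get custom Avro data into an existing Spanner table, a non-Dataflow route also exists: read the records with an Avro library and batch-insert them with the Cloud Spanner client library. The sketch below is not the Dataflow approach described in this answer, just a minimal illustration; the instance, database, and file names are placeholders, and the table and column names are taken from the question's schema.

from fastavro import reader            # pip install fastavro
from google.cloud import spanner       # pip install google-cloud-spanner

def load_avro(path, instance_id, database_id):
    """Batch-insert rows from one Avro data file into an existing Spanner table."""
    database = spanner.Client().instance(instance_id).database(database_id)
    with open(path, "rb") as f, database.batch() as batch:
        rows = [(rec["memberId"], rec["fullName"]) for rec in reader(f)]
        # One batch commits atomically; chunk very large files into several
        # batches to stay under Spanner's per-commit mutation limits.
        batch.insert(table="people",
                     columns=("memberId", "fullName"),
                     values=rows)

if __name__ == "__main__":
    load_avro("people.avro-00000-of-00001", "my-instance", "my-db")

This bypasses the manifest format entirely, at the cost of doing the work client-side rather than in Dataflow.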

Iapetus answered 14/8, 2018 at 17:14 Comment(4)
I have tried to implement Method 1 as you proposed by taking an Avro export from BigQuery and creating the spanner-export.json and manifest file, but when I imported it into Spanner only 1 row was loaded instead of 600. Did you successfully try out this method? Is there any additional step in between?Erlina
@PhilippSh - I did not do ingestion from BigQuery, but I have done it from other formats via Avro. When you look at the Cloud Dataflow job that is generated by the Google Cloud Spanner import UI, what errors (if any) do you see? In other words, when you do Method 1 (cloud.google.com/spanner/docs/import), what do you see in the corresponding Cloud Dataflow import job generated from the import in the Cloud Spanner UI?Iapetus
I reproduced this as well and what Philipp says is true. It doesn't raise any kind of error, it just loads only one row.Afteryears
@Afteryears - Do you have the GCP folder publicly available with the data and the schema?Iapetus

As the documentation specifically states that importing only supports Avro files initially exported from Spanner [1], I've raised a feature request for this, which you can track here.

[1] https://cloud.google.com/spanner/docs/import

Shit answered 27/8, 2018 at 13:14 Comment(0)
