Storing sentence embeddings in Google Cloud BigQuery


Asked 3/6, 2021 at 21:34 Answered 10/6, 2021 at 7:41

google-cloud-platform google-bigquery embedding huggingface-transformers

I am looking for a way to store embeddings generated by a language model (such as T5) in Google BigQuery.

The embeddings are in the form of NumPy arrays or tensors.

I found three approaches:

TFRecord: write the embeddings to a TFRecord file and store it in Cloud Storage
Convert the NumPy array to a string and store it in a STRING column of a table
Store it in a column with mode REPEATED (I am not sure whether this preserves the order of the embedding vector entries)

I hope somebody can give some suggestions or point out other approaches.

Many thanks

Divisible answered 3/6, 2021 at 21:34 Comment(2)

Store it as a serialized or JSON-encoded value in a STRING column. – Bogeyman 4/6, 2021 at 7:45

Which method should I use to serialise the array? np.array2string() or ndarray.tobytes()? – Divisible 4/6, 2021 at 9:21
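
On the serialisation question in the comment above, here is a minimal sketch of two round trips for a string column, assuming 1-D float32 embeddings. The helper names are hypothetical and not from the thread. Note that np.array2string() is meant for display: by default it rounds to 8 digits and abbreviates arrays with more than 1000 elements using "...", so it is not a safe storage format. Raw tobytes() output drops dtype and shape, so both have to be known when reading it back.

```python
import base64
import json

import numpy as np


def to_json_string(vec):
    """Encode a 1-D array as a JSON list, e.g. '[0.5, -1.25]'."""
    return json.dumps(np.asarray(vec).tolist())


def from_json_string(text, dtype=np.float32):
    """Decode a JSON list back into a NumPy array of the given dtype."""
    return np.array(json.loads(text), dtype=dtype)


def to_base64(vec):
    """Encode the raw float32 bytes as ASCII text (more compact than JSON)."""
    raw = np.asarray(vec, dtype=np.float32).tobytes()
    return base64.b64encode(raw).decode("ascii")


def from_base64(text):
    """Decode base64 text back into a 1-D float32 array."""
    return np.frombuffer(base64.b64decode(text), dtype=np.float32)
```

The JSON form stays readable in the BigQuery console; the base64 form is smaller but opaque to SQL.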

Arrays are first-class citizens in BigQuery - see https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays

The mode REPEATED means that the column is an array.

E.g. a column of type STRING in mode REPEATED means that this column can only contain arrays of type string.

The order of elements is preserved. So I guess you just want to directly store your arrays as arrays in BQ.
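
A minimal sketch of storing embeddings directly as arrays with the google-cloud-bigquery Python client might look like the following. The table ID, column names, and helper functions are assumptions made for illustration, and the upload function needs valid Google Cloud credentials to run. Embeddings are assumed to be flat (1-D); a tensor would first have to be converted to a NumPy array or list.

```python
# Hypothetical table ID; replace with your own project, dataset and table.
TABLE_ID = "my-project.my_dataset.sentence_embeddings"


def embedding_to_row(sentence_id, text, embedding):
    """Build a JSON-ready row from a 1-D embedding (NumPy array or list)."""
    return {
        "id": sentence_id,
        "text": text,
        # Plain Python floats are JSON-serialisable; element order is kept.
        "embedding": [float(x) for x in embedding],
    }


def upload_rows(rows, table_id=TABLE_ID):
    """Create the table if needed and stream the rows into it."""
    # Imported here so the row helper above works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    schema = [
        bigquery.SchemaField("id", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("text", "STRING"),
        # FLOAT is the API name for SQL FLOAT64; REPEATED makes it an array.
        bigquery.SchemaField("embedding", "FLOAT", mode="REPEATED"),
    ]
    table = client.create_table(
        bigquery.Table(table_id, schema=schema), exists_ok=True
    )
    errors = client.insert_rows_json(table, rows)
    if errors:
        raise RuntimeError(f"BigQuery rejected some rows: {errors}")
```

Usage would be along the lines of upload_rows([embedding_to_row(1, "hello world", vector)]), where vector is the model output for that sentence.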

In case you want to operate on those arrays later using SQL, have a look at UNNEST(<array>), which turns an array into a table so you can run SQL directly on its elements (using lateral joins or just a subquery).
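
As an illustration of UNNEST in a subquery, the sketch below computes the dot product between each stored embedding and a query vector passed as an array parameter. The table ID, column names, and the parameter name query_vec are assumptions; WITH OFFSET exposes each element's position so the two arrays can be joined index by index. The pure-Python dot_product function only mirrors what the SQL computes.

```python
# Hypothetical table with columns id (INT64) and embedding (ARRAY<FLOAT64>).
TABLE_ID = "my-project.my_dataset.sentence_embeddings"

DOT_PRODUCT_SQL = f"""
SELECT
  t.id,
  (SELECT SUM(e * q)
   FROM UNNEST(t.embedding) AS e WITH OFFSET AS i
   JOIN UNNEST(@query_vec) AS q WITH OFFSET AS j
     ON i = j) AS dot_product
FROM `{TABLE_ID}` AS t
ORDER BY dot_product DESC
LIMIT 5
"""


def dot_product(a, b):
    """Local reference for the SQL subquery: sum of element-wise products."""
    return sum(float(x) * float(y) for x, y in zip(a, b))


def top_matches(query_vector):
    """Run the query in BigQuery; needs credentials and the client library."""
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter(
                "query_vec", "FLOAT64", [float(x) for x in query_vector]
            )
        ]
    )
    return list(client.query(DOT_PRODUCT_SQL, job_config=job_config).result())
```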

Parcel answered 10/6, 2021 at 7:41 Comment(1)

How do you actually implement this in BigQuery? I think this is the original poster's question. – Diedrediefenbaker 27/6, 2021 at 23:10
