Storing sentence embeddings in Google Cloud BigQuery


I am looking for a way to store embeddings generated by a language model (such as T5) in Google BigQuery.

The embeddings are in the form of NumPy arrays or tensors.

I found three approaches:

  1. TFRecord: write the embeddings to a TFRecord file and store it in Cloud Storage.
  2. Convert the NumPy array to a string and store it in a STRING column of a table.
  3. Store it in a column with mode REPEATED. (I am not sure whether this preserves the order of the embedding vector's entries.)

I hope somebody can offer suggestions or other approaches.

Many thanks

Divisible answered 3/6, 2021 at 21:34 Comment(2)
Store it as a serialized or JSON-encoded value in a STRING column. Bogeyman
Which method should I use to serialize the array: np.array2string() or np.tobytes()? Divisible
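
On the serialization question in the comment above, here is a minimal sketch of two round-trippable options, assuming 1-D float32 embeddings. The helper names are made up for illustration. np.array2string() is meant for display: with default print options it abbreviates arrays of more than 1000 elements with "..." and rounds to 8 digits, so it is probably not a safe storage format. Raw tobytes() output keeps the values exactly but drops dtype and shape, so those have to be known when reading back.

```python
import base64
import json

import numpy as np


def to_json_string(vec: np.ndarray) -> str:
    """Serialize a 1-D embedding as a JSON list (readable, portable)."""
    return json.dumps(vec.astype(float).tolist())


def from_json_string(text: str, dtype=np.float32) -> np.ndarray:
    """Parse a JSON list back into a NumPy array of the given dtype."""
    return np.asarray(json.loads(text), dtype=dtype)


def to_base64_string(vec: np.ndarray) -> str:
    """Serialize the raw float32 bytes; dtype and shape are NOT stored."""
    return base64.b64encode(vec.astype(np.float32).tobytes()).decode("ascii")


def from_base64_string(text: str, dtype=np.float32) -> np.ndarray:
    """Decode base64 text back into a 1-D array; dtype must match the writer."""
    return np.frombuffer(base64.b64decode(text), dtype=dtype)


# Hypothetical 768-dimensional embedding used only to exercise the round trip.
embedding = np.linspace(-1.0, 1.0, num=768, dtype=np.float32)
restored_json = from_json_string(to_json_string(embedding))
restored_b64 = from_base64_string(to_base64_string(embedding))
```

The JSON form is larger but can be inspected and parsed in SQL; the base64 form is more compact but opaque to queries.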

Arrays are first-class citizens in BigQuery; see https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays

The mode REPEATED means that the column is an array.

For example, a column of type STRING with mode REPEATED holds an array of strings in each row. For embeddings, the element type would be FLOAT64.

The order of elements is preserved, so you can store your arrays directly as arrays in BigQuery.

In case you want to operate on those arrays later using SQL, have a look at UNNEST(<array>), which turns an array into a table so you can run SQL directly on its elements (using lateral joins or just a subquery).
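
A rough sketch of how this could look from Python, assuming the google-cloud-bigquery client library is installed and credentials are configured. The column names (`sentence`, `embedding`), the helper names and the table ID are hypothetical. Each embedding is converted to a plain list of floats and loaded into a FLOAT64 column with mode REPEATED.

```python
from typing import Any, Dict, List, Sequence

# Schema in BigQuery's JSON schema format; column names are hypothetical.
EMBEDDING_SCHEMA: List[Dict[str, str]] = [
    {"name": "sentence", "type": "STRING", "mode": "REQUIRED"},
    {"name": "embedding", "type": "FLOAT64", "mode": "REPEATED"},
]


def embedding_to_row(sentence: str, vector: Sequence[float]) -> Dict[str, Any]:
    """Build one JSON-serializable row; works for lists and 1-D NumPy arrays."""
    return {"sentence": sentence, "embedding": [float(x) for x in vector]}


def load_embeddings(table_id: str, rows: List[Dict[str, Any]]) -> None:
    """Load rows with a load job; needs google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # imported lazily so the helpers above run without it

    client = bigquery.Client()
    schema = [
        bigquery.SchemaField(f["name"], f["type"], mode=f["mode"])
        for f in EMBEDDING_SCHEMA
    ]
    job_config = bigquery.LoadJobConfig(schema=schema)
    job = client.load_table_from_json(rows, table_id, job_config=job_config)
    job.result()  # wait for the load job to finish


rows = [embedding_to_row("hello world", [0.1, -0.2, 0.3])]
# load_embeddings("my_project.my_dataset.embeddings", rows)  # hypothetical table ID
```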
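
As an illustration of UNNEST on stored embeddings, the sketch below builds a query that computes a dot product between each stored vector and a query vector by pairing elements on their array offset. It assumes a table with hypothetical columns `sentence` (STRING) and `embedding` (REPEATED FLOAT64), a hypothetical table ID, and the google-cloud-bigquery client library; treat it as a starting point rather than a tuned similarity search.

```python
from typing import List, Sequence


def build_dot_product_sql(table_id: str, limit: int = 5) -> str:
    """Return SQL that ranks rows by dot product with the @query array parameter."""
    return f"""
SELECT
  t.sentence,
  (
    SELECT SUM(e * q)
    FROM UNNEST(t.embedding) AS e WITH OFFSET i
    JOIN UNNEST(@query) AS q WITH OFFSET j
      ON i = j
  ) AS dot_product
FROM `{table_id}` AS t
ORDER BY dot_product DESC
LIMIT {int(limit)}
"""


def top_matches(table_id: str, query_vector: Sequence[float]) -> List[dict]:
    """Run the query; needs google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # imported lazily so the SQL builder runs without it

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter(
                "query", "FLOAT64", [float(x) for x in query_vector]
            )
        ]
    )
    result = client.query(build_dot_product_sql(table_id), job_config=job_config).result()
    return [dict(row.items()) for row in result]


sql = build_dot_product_sql("my_project.my_dataset.embeddings")  # hypothetical table ID
```

WITH OFFSET exposes each element's position, which is what lets the two arrays be joined element by element.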

Parcel answered 10/6, 2021 at 7:41 Comment(1)
How do you actually implement this in BigQuery? I think this is the original poster's question. Diedrediefenbaker
