We're trying to build a pipeline that takes data from BigQuery, runs through TensorFlow Transform, before training in TensorFlow.
The pipeline is up and running, but we're having difficulty with null values in BigQuery.
We're using Beam to load from BigQuery:
raw_data = (pipeline
| '{}_read_from_bq'.format(step) >> beam.io.Read(
beam.io.BigQuerySource(query=source_query,
use_standard_sql=True,
)))
I'm playing with the dataset metadata, trying FixedLenFeature
and VarLenFeature
for various columns:
# Categorical feature schema
categorical_features = {
column_name: tf.io.FixedLenFeature([], tf.string) for column_name in categorical_columns
}
raw_data_schema.update(categorical_features)
# Numerical feature schema
numerical_features = {
column_name: tf.io.VarLenFeature(tf.float32) for column_name in numerical_columns
}
raw_data_schema.update(numerical_features)
# Create dataset_metadata given raw_data_schema
raw_metadata = dataset_metadata.DatasetMetadata(
schema_utils.schema_from_feature_spec(raw_data_schema))
As expected, if you try and feed a BigQuery NULL into a FixedLenFeature
, it breaks.
However, when I try to feed strings or integers a VarLenFeature
, it breaks too. This seems to be because VarLenFeature expects a list, but BigQuerySource gives a Python primitive. The exact point where it breaks is here (error is from when I tried with an integer):
File "/usr/local/lib/python3.7/site-packages/tensorflow_transform/impl_helper.py", line 157, in <listcomp>
indices = [range(len(value)) for value in values]
TypeError: object of type 'int' has no len()
[while running 'train_transform/AnalyzeDataset/ApplySavedModel[Phase0]/ApplySavedModel/ApplySavedModel']
When I try VarLenFeature with my string inputs, e.g. "UK", the output is a SparseTensor like this:
SparseTensorValue(indices=[(0, 0), (0, 1)], values=['U', 'K'], dense_shape=(1, 2))
So it seems like I need to be passing a list into VarLenFeature for this to work, but BigQuerySource does not do this by default.
Is there a simple way of achieving this? Or am I totally missing the mark on reading nullable columns from BigQuery?
Thank you very much in advance!
OPTIONAL_NUMERIC_FEATURE_KEYS
- it doesn't seem to work well for us in this scenario though... We will implement imputation in BQ until we have time to figure out the VarLenFeature issues :) – Ikeda