Feeding nullable data from BigQuery into Tensorflow Transform

We're trying to build a pipeline that takes data from BigQuery, runs through TensorFlow Transform, before training in TensorFlow.

The pipeline is up and running, but we're having difficulty with null values in BigQuery.

We're using Beam to load from BigQuery:

    raw_data = (pipeline
                | '{}_read_from_bq'.format(step) >> beam.io.Read(
                    beam.io.BigQuerySource(query=source_query,
                                           use_standard_sql=True,
                                           )))

I'm playing with the dataset metadata, trying FixedLenFeature and VarLenFeature for various columns:

    # Categorical feature schema
    categorical_features = {
        column_name: tf.io.FixedLenFeature([], tf.string) for column_name in categorical_columns
    }
    raw_data_schema.update(categorical_features)

    # Numerical feature schema
    numerical_features = {
        column_name: tf.io.VarLenFeature(tf.float32) for column_name in numerical_columns
    }
    raw_data_schema.update(numerical_features)

    # Create dataset_metadata given raw_data_schema
    raw_metadata = dataset_metadata.DatasetMetadata(
        schema_utils.schema_from_feature_spec(raw_data_schema))

As expected, if you try and feed a BigQuery NULL into a FixedLenFeature, it breaks.

However, when I try to feed strings or integers a VarLenFeature, it breaks too. This seems to be because VarLenFeature expects a list, but BigQuerySource gives a Python primitive. The exact point where it breaks is here (error is from when I tried with an integer):

File "/usr/local/lib/python3.7/site-packages/tensorflow_transform/impl_helper.py", line 157, in <listcomp>
indices = [range(len(value)) for value in values]
TypeError: object of type 'int' has no len()
[while running 'train_transform/AnalyzeDataset/ApplySavedModel[Phase0]/ApplySavedModel/ApplySavedModel']

When I try VarLenFeature with my string inputs, e.g. "UK", the output is a SparseTensor like this:

SparseTensorValue(indices=[(0, 0), (0, 1)], values=['U', 'K'], dense_shape=(1, 2))

So it seems like I need to be passing a list into VarLenFeature for this to work, but BigQuerySource does not do this by default.

Is there a simple way of achieving this? Or am I totally missing the mark on reading nullable columns from BigQuery?

Thank you very much in advance!

You might need to handle NULL(missing) values by yourself. For numerical columns, you could replace NULLs with mean or median. For categorical columns (STRING), you could use some default value like an empty STRING or a new value as a missing value indicator.

I'm not very familiar with VarLenFeature, but you can probably replace NULLs (NULL imputation) in the source_query. Something like:

IFNULL(col, col_mean) AS col_imputed

The downside is that you will have to calculate col_mean first using sql and fill it here as a constant. Another thing is you will need to remember this mean and apply the same mean in prediction as it's not part of tf.transform (your graph).

Bigquery itself has BQML as an ML platform. They do support TRANSFORM and automatic imputation. Maybe you could also take a look :)

Recommended topics

Hot tags