Feeding nullable data from BigQuery into Tensorflow Transform
Asked Answered
I

1

6

We're trying to build a pipeline that takes data from BigQuery, runs through TensorFlow Transform, before training in TensorFlow.

The pipeline is up and running, but we're having difficulty with null values in BigQuery.

We're using Beam to load from BigQuery:

    raw_data = (pipeline
                | '{}_read_from_bq'.format(step) >> beam.io.Read(
                    beam.io.BigQuerySource(query=source_query,
                                           use_standard_sql=True,
                                           )))

I'm playing with the dataset metadata, trying FixedLenFeature and VarLenFeature for various columns:

    # Categorical feature schema
    categorical_features = {
        column_name: tf.io.FixedLenFeature([], tf.string) for column_name in categorical_columns
    }
    raw_data_schema.update(categorical_features)

    # Numerical feature schema
    numerical_features = {
        column_name: tf.io.VarLenFeature(tf.float32) for column_name in numerical_columns
    }
    raw_data_schema.update(numerical_features)

    # Create dataset_metadata given raw_data_schema
    raw_metadata = dataset_metadata.DatasetMetadata(
        schema_utils.schema_from_feature_spec(raw_data_schema))

As expected, if you try and feed a BigQuery NULL into a FixedLenFeature, it breaks.

However, when I try to feed strings or integers a VarLenFeature, it breaks too. This seems to be because VarLenFeature expects a list, but BigQuerySource gives a Python primitive. The exact point where it breaks is here (error is from when I tried with an integer):

File "/usr/local/lib/python3.7/site-packages/tensorflow_transform/impl_helper.py", line 157, in <listcomp>
indices = [range(len(value)) for value in values]
TypeError: object of type 'int' has no len()
[while running 'train_transform/AnalyzeDataset/ApplySavedModel[Phase0]/ApplySavedModel/ApplySavedModel']

When I try VarLenFeature with my string inputs, e.g. "UK", the output is a SparseTensor like this:

SparseTensorValue(indices=[(0, 0), (0, 1)], values=['U', 'K'], dense_shape=(1, 2))

So it seems like I need to be passing a list into VarLenFeature for this to work, but BigQuerySource does not do this by default.

Is there a simple way of achieving this? Or am I totally missing the mark on reading nullable columns from BigQuery?

Thank you very much in advance!

Ikeda answered 22/1, 2020 at 16:50 Comment(0)
H
4

You might need to handle NULL(missing) values by yourself. For numerical columns, you could replace NULLs with mean or median. For categorical columns (STRING), you could use some default value like an empty STRING or a new value as a missing value indicator.

I'm not very familiar with VarLenFeature, but you can probably replace NULLs (NULL imputation) in the source_query. Something like:

IFNULL(col, col_mean) AS col_imputed

The downside is that you will have to calculate col_mean first using sql and fill it here as a constant. Another thing is you will need to remember this mean and apply the same mean in prediction as it's not part of tf.transform (your graph).

Bigquery itself has BQML as an ML platform. They do support TRANSFORM and automatic imputation. Maybe you could also take a look :)

Habanera answered 23/1, 2020 at 8:25 Comment(2)
Thanks very much for your response! We are implementing this as a fallback option. The aspiration is to implement imputation in Tensorflow Transform as it seems like an intended use case, at least from the "census" example: tensorflow.org/tfx/tutorials/transform/… see how they treat their OPTIONAL_NUMERIC_FEATURE_KEYS - it doesn't seem to work well for us in this scenario though... We will implement imputation in BQ until we have time to figure out the VarLenFeature issues :)Ikeda
We have settled on a compromise based on a few experiments and some further feedback. (1) Where imputation can be performed in BQ with the caveats you mention, we use this method: IFNULL(...) in SQL, and a FixedLenFeature. (2) Where we need TFT to impute, we operate on the columns in BQ to transform to arrays: CASE WHEN col IS NULL THEN NULL ELSE ARRAY(SELECT col) END AS col, and use a VarLenFeature.Ikeda

© 2022 - 2024 — McMap. All rights reserved.