Chunk TensorFlow dataset records into multiple records

I have an unbatched tensorflow dataset that looks like this:

ds = ...
for record in ds.take(3):
    print('data shape={}'.format(record['data'].shape))

-> data shape=(512, 512, 87)
-> data shape=(512, 512, 277)
-> data shape=(512, 512, 133)

I want to feed the data to my network in chunks of depth 5 along the last dimension. In the example above, the tensor of shape (512, 512, 87) would be divided into 17 tensors of shape (512, 512, 5). The final 2 slices along the last dimension (tensor[:, :, 85:87]) should be discarded.

For example:

chunked_ds = ...
for record in chunked_ds.take(1):
    print('chunked data shape={}'.format(record['data'].shape))

-> chunked data shape=(512, 512, 5)

How can I get from ds to chunked_ds? tf.data.Dataset.window() looks like what I need but I cannot get this working.

Lettyletup asked 7/5, 2021 at 17:49

Comment (Labonte): Hi, can you please share a dataset on which this operation is intended to be done? Some dummy dataset would do.

This can actually be done using tf.data.Dataset-only operations:

data = tf.random.normal( shape=[ 10 , 512 , 512 , 87 ] )
ds = tf.data.Dataset.from_tensor_slices( ( data ) )
chunk_size = 5
chunked_ds = ds.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(tf.transpose(x, perm=[2, 0, 1])).batch(chunk_size, drop_remainder=True)) \
                    .map(lambda rec: tf.transpose(rec, perm=[1, 2, 0]))

What is going on there:

First, we treat each record as a separate Dataset and permute it so that the last dimension becomes the batch dimension (flat_map will flatten the inner datasets back into Tensors):

.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(tf.transpose(x, perm=[2, 0, 1]))
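
To illustrate the effect of the transpose, here is a standalone sketch using a single dummy record of the same shape as in the question:

import tensorflow as tf

x = tf.random.normal(shape=[512, 512, 87])
print(tf.transpose(x, perm=[2, 0, 1]).shape)  # (87, 512, 512)
# from_tensor_slices then slices along this first dimension,
# yielding 87 tensors of shape (512, 512)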

Then we batch it by 5 (the chunk size), dropping the remainder:

.batch(chunk_size, drop_remainder=True))
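
The drop_remainder behaviour can be checked in isolation, for instance with a plain range dataset standing in for the 87 depth slices (a small sketch, not part of the solution above):

batched = tf.data.Dataset.range(87).batch(5, drop_remainder=True)
print(batched.cardinality().numpy())  # 17 batches of 5; the last 2 elements are dropped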

Finally, re-permute the tensors so that we have 512x512 at the beginning:

.map(lambda rec: tf.transpose(rec, perm=[1, 2, 0]))
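
To verify the end result (assuming data, ds and chunked_ds as defined above):

for rec in chunked_ds.take(2):
    print(rec.shape)               # (512, 512, 5)
# each record of depth 87 yields 87 // 5 = 17 chunks,
# so the 10 dummy samples give 170 chunks in total
print(sum(1 for _ in chunked_ds))  # 170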
Chokecherry answered 16/5, 2021 at 13:27

Comment (Lettyletup): Perfect. This solution is great!

In order to express my solution, I'll first create a dummy dataset with 10 samples, each of shape [ 512 , 512 , 87 ]:

data = tf.random.normal( shape=[ 10 , 512 , 512 , 87 ] )
ds = tf.data.Dataset.from_tensor_slices( ( data ) )

On executing the below code,

for record in ds.take( 3 ):
    print( record.shape )

We get the output,

(512, 512, 87)
(512, 512, 87)
(512, 512, 87)

For convenience, I have created a dataset in which the length of the last dimension is constant, i.e. 87 (unlike in your dataset, where it varies). The solution, however, is independent of the length of the last dimension.

The solution,

# chunk/window size
chunk_depth = 5

# array to store the chunks
chunks = []

# Iterating through each sample in ds ( Note: ds.as_numpy_iterator() returns NumPy arrays )
for sample in ds.as_numpy_iterator():
    # Length of the last dimension
    feature_size = sample.shape[ 2 ]
    # No. of chunks that can be produced
    num_chunks = feature_size // chunk_depth
    # Perform slicing along the last dimension, storing the "chunks" in the chunks array.
    for i in range( 0 , num_chunks * chunk_depth , chunk_depth ):
        chunk = sample[ : , : , i : i + chunk_depth ]
        chunks.append( chunk )

# Convert array -> tf.data.Dataset
chunked_ds = tf.data.Dataset.from_tensor_slices( ( chunks ) )

The output of the below code,

for sample in chunked_ds.take( 1 ):
    print( sample.shape )

is as expected in the question,

(512, 512, 5)
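
As an additional sanity check (assuming the corrected loop above), the total number of chunks can also be inspected:

print(len(chunks))  # 170, i.e. 10 samples * (87 // 5) chunks each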

The solution is available as a Colab notebook.

Stokehold answered 15/5, 2021 at 14:03
