How to extract data/labels back from TensorFlow dataset

There are plenty of examples how to create and use TensorFlow datasets, e.g.

dataset = tf.data.Dataset.from_tensor_slices((images, labels))

My question is how to get the data/labels back from a TF dataset in NumPy form. In other words, what would be the reverse of the line above, i.e., given a TF dataset, how do I get back the images and labels from it?

Gib answered 20/5, 2019 at 18:53 Comment(0)

In case your tf.data.Dataset is batched, the following code will retrieve all the y labels:

y = np.concatenate([y for x, y in ds], axis=0)

Quick explanation: [y for x, y in ds] is a list comprehension in Python. If the dataset is batched, this expression loops through each batch, puts each batch's y (a 1-D TF tensor) into a list, and returns the list. Then np.concatenate takes this list of 1-D tensors (implicitly casting them to NumPy) and stacks them along axis 0 to produce a single long vector. In summary, it just converts a bunch of little 1-D vectors into one long vector.

Note: if your y is more complex, this answer will need some minor modification.
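A small extension of the same idea, in case you want both the images and the labels back (a sketch, assuming each dataset element is an (x, y) batch; iterating twice is fine since tf.data datasets can be re-iterated):

import numpy as np

x = np.concatenate([x for x, y in ds], axis=0)
y = np.concatenate([y for x, y in ds], axis=0)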

Diskin answered 9/7, 2020 at 20:30 Comment(5)
Elegant and pythonic! +1 – Verine
@TimMironov Thanks. I could also have used _ for the x in that one-liner. Actually, I think there's a downside if you want to extract both x and y. I haven't yet figured out if you can do it in a similar one-liner. – Diskin
As a beginner Python user, this answer is extremely opaque to me. Not saying it's a bad answer, but I think it could use a bit more context or explanation. – Margotmargrave
It's black magic to me but it works great – Margotmargrave
A couple of people want an explanation, so I have updated the answer. The key is to know what a list comprehension is, and to read the numpy concatenate documentation. It is by no means black magic compared to other stuff. – Diskin

Supposing our tf.data.Dataset is called train_dataset, with eager execution on (the default in TF 2.x), you can retrieve images and labels like this:

for images, labels in train_dataset.take(1):  # only take first element of dataset
    numpy_images = images.numpy()
    numpy_labels = labels.numpy()
  • the inline operation .numpy() converts tf.Tensors into NumPy arrays
  • if you want to retrieve more elements of the dataset, just increase the number inside the take method; if you want all elements, just insert -1 (see the sketch below)
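Building on that, a minimal sketch that collects the whole dataset into NumPy arrays (assuming each element is a single (image, label) pair with consistent shapes):

import numpy as np

numpy_images = np.stack([img.numpy() for img, _ in train_dataset.take(-1)])
numpy_labels = np.stack([lbl.numpy() for _, lbl in train_dataset.take(-1)])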
Extender answered 27/8, 2019 at 14:25 Comment(1)
It should be noted that this method will return count batches of images in some cases (when the dataset is batched), instead of individual images. – Flirtatious

If you are OK with keeping the images and labels as tf.Tensors, you can do

images, labels = tuple(zip(*dataset))

Think of the effect of the dataset as zip(images, labels). When we want to get images and labels back, we can simply unzip it.

If you need the numpy array version, convert them using np.array():

images = np.array(images)
labels = np.array(labels)
Lorrainelorrayne answered 29/12, 2020 at 22:6 Comment(3)
This caused my program to crash on a dataset with ~20,000 images and 12 GB of RAM. – Margotmargrave
Do you need the data all at once? If not, it may be a good idea to load them in batches. – Lorrainelorrayne
Thanks! Putting * and zip consecutively seems to resolve the error: (images,), (labels,) = zip(*training_batches.take(1)). It removes this error for me: ValueError: not enough values to unpack (expected 2, got 1) – Zingg

I think we get a good example here:

https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb#scrollTo=BC4pEXtkp4K-

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

# where mnist_train is a tf dataset
mnist_train = tfds.load(name="mnist", split=tfds.Split.TRAIN)
assert isinstance(mnist_train, tf.data.Dataset)

mnist_example, = mnist_train.take(1)
image, label = mnist_example["image"], mnist_example["label"]

plt.imshow(image.numpy()[:, :, 0].astype(np.float32), cmap=plt.get_cmap("gray"))
print("Label: %d" % label.numpy())

So each individual element of the dataset can be accessed like a dictionary. Presumably different datasets have different field names (Boston housing won't have 'image' and 'label', but might have 'features' and 'target' or 'price'):

cnn = tfds.load(name="cnn_dailymail", split=tfds.Split.TRAIN)
assert isinstance(cnn, tf.data.Dataset)
cnn_ex, = cnn.take(1)
print(cnn_ex)

returns a dict() with keys ['article', 'highlight'] with numpy strings inside.
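Rather than guessing the field names, you can ask tfds for them directly. A small sketch using with_info=True:

import tensorflow_datasets as tfds

# with_info=True also returns a DatasetInfo object describing the features
ds, info = tfds.load(name="mnist", split="train", with_info=True)
print(info.features)  # e.g. a FeaturesDict with 'image' and 'label'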

Redbird answered 20/5, 2019 at 19:22 Comment(0)

You can use the TF Dataset method unbatch() to unbatch the dataset; then you can easily retrieve the data and the labels from it:

ds_labels=[]
for images, labels in ds.unbatch():
    ds_labels.append(labels) # or labels.numpy().argmax() for int labels

Or in one line:

ds_labels = [labels for _, labels in ds.unbatch()]
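If you want a NumPy array instead of a list of tensors, a small variant (assuming scalar labels):

import numpy as np

ds_labels = np.array([labels.numpy() for _, labels in ds.unbatch()])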
Ichthyornis answered 2/1, 2022 at 14:47 Comment(1)
Why is unbatching it necessary? Can't I just iterate the batches? If I do I get a BatchDataset object, which doesn't act like a tensor at all. – Crosier

Here is my own solution to the problem:

def dataset2numpy(dataset, steps=1):
    """Helper function to get data/labels back from a TF dataset (TF 1.x graph mode)."""
    iterator = dataset.make_one_shot_iterator()
    next_val = iterator.get_next()
    with tf.Session() as sess:
        for _ in range(steps):
            inputs, labels = sess.run(next_val)
            yield inputs, labels

Please note that this function will yield the inputs/labels of the dataset one batch at a time. The steps argument controls how many batches will be taken out of the dataset. Note that make_one_shot_iterator() and tf.Session are TF 1.x APIs; in TF 2.x, use one of the eager approaches above.
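A usage sketch (TF 1.x, assuming dataset is batched):

for inputs, labels in dataset2numpy(dataset, steps=2):
    print(inputs.shape, labels.shape)  # one batch per iteration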

Gib answered 21/5, 2019 at 12:50 Comment(0)

This worked for me

features = np.array([list(x[0].numpy()) for x in list(ds_test)])
labels = np.array([x[1].numpy() for x in list(ds_test)])

# NOTE: how ds_test was created
# (as_supervised=True makes each element a (features, label) tuple,
# which is what the x[0]/x[1] indexing above relies on)
iris, iris_info = tfds.load('iris', with_info=True, as_supervised=True)
ds_orig = iris['train']
ds_orig = ds_orig.shuffle(150, reshuffle_each_iteration=False)
ds_train = ds_orig.take(100)
ds_test = ds_orig.skip(100)
Atropine answered 9/6, 2020 at 9:35 Comment(0)
import numpy as np
import tensorflow as tf

# Three batches of two samples each: features are (2, 2), labels are (2, 1)
batched_features = tf.constant([[[1, 3], [2, 3]],
                                [[2, 1], [1, 2]],
                                [[3, 3], [3, 2]]], shape=(3, 2, 2))
batched_labels = tf.constant([[0, 0],
                              [1, 1],
                              [0, 1]], shape=(3, 2, 1))
dataset = tf.data.Dataset.from_tensor_slices((batched_features, batched_labels))

# Concatenate the label batches back into one array, then count each class
classes = np.concatenate([y for x, y in dataset], axis=0)
unique = np.unique(classes, return_counts=True)
labels_dict = dict(zip(unique[0], unique[1]))
print(classes)
print(labels_dict)
# {0: 3, 1: 3}
Jiva answered 19/7, 2021 at 6:12 Comment(1)
While this might answer the question, if possible you should edit your answer to include a short explanation of how this code block answers the question. This helps to provide context, and makes your answer much more useful for future readers. – Femi

TensorFlow's get_single_element() is finally around, and it can be used to extract data and labels back from datasets.

This avoids the need to create and use an iterator via .map() or iter(), which could be costly for big datasets.

get_single_element() returns a tensor (or a tuple or dict of tensors) encapsulating all the members of the dataset. To use it, we first need to batch all the members of the dataset into a single element.

This can be used to get features as a tensor-array, or features and labels as a tuple or dictionary (of tensor-arrays), depending upon how the original dataset was created.

Check this answer on SO for an example that unpacks features and labels into a tuple of tensor-arrays.
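A minimal sketch of the idea (assuming the images and labels from the question, and TF >= 2.6, where Dataset.get_single_element() is available):

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
# Batch the entire dataset into one element, then pull that element out
images_t, labels_t = dataset.batch(len(images)).get_single_element()
images_np, labels_np = images_t.numpy(), labels_t.numpy()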

Moguel answered 19/8, 2021 at 16:39 Comment(0)

https://www.tensorflow.org/tutorials/images/classification

import matplotlib.pyplot as plt

# Assumes train_ds and class_names are defined as in the linked tutorial
plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):  # one batch
  for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(images[i].numpy().astype("uint8"))
    plt.title(class_names[labels[i]])
    plt.axis("off")
Dissident answered 27/8, 2021 at 17:58 Comment(0)

You can use the map function.

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map

images = dataset.map(lambda images, labels: images)
labels = dataset.map(lambda images, labels: labels)
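Note that map() returns new tf.data.Dataset objects, not arrays. To materialize them as NumPy, a small sketch (assuming an unbatched dataset):

import numpy as np

images_np = np.stack(list(images.as_numpy_iterator()))
labels_np = np.array(list(labels.as_numpy_iterator()))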
Theda answered 17/11, 2022 at 10:28 Comment(0)

A solution that worked for me (not mentioned elsewhere, as of now):

Let's say I have a dataset named 'dataset'.

To iterate over the batches in the dataset:

dataset.as_numpy_iterator()

To get a list of all batches in the dataset:

list(dataset.as_numpy_iterator())

To get the first batch in the dataset (as a list [data, labels]):

list(dataset.as_numpy_iterator())[0]

To get the 'labels' from the first batch in the dataset:

list(dataset.as_numpy_iterator())[0][1]

And so on ..
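Putting those steps together (a sketch, assuming a batched dataset of (data, labels) tuples). Note that list(...) materializes the whole dataset in memory, so for large datasets prefer next(...) or plain iteration:

# First batch, without materializing everything
data, labels = next(dataset.as_numpy_iterator())

# Or loop over all batches lazily
for data, labels in dataset.as_numpy_iterator():
    ...  # process each NumPy batch here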

Kattegat answered 15/5, 2023 at 15:48 Comment(0)

For TensorFlow 2.12.0 and a text dataset:

Load dataset

import tensorflow_datasets as tfds

(ds_train, ds_test), ds_info = tfds.load('imdb_reviews', with_info=True,
                                         split=['train', 'test'],
                                         data_dir="your_dir\\tensorflow_datasets\\")

Extracting data and label

for i, example in enumerate(ds_train.take(5)):  # 'example' avoids shadowing built-in dict
    print(ds_info.features['label'].int2str(example["label"].numpy()))
    print(example["text"].numpy())
Austenite answered 27/6, 2023 at 7:40 Comment(0)

Assuming that both images and labels are tensors, the following code should work with TensorFlow >= 2.12.0:

import numpy as np
import tensorflow as tf

# Define a function to extract images and labels from the dataset
def preprocess_data(image, label):
    # You can add any preprocessing steps here if required
    return image, label

# Apply preprocessing function to dataset
dataset = dataset.map(preprocess_data)

# Split dataset into training and validation sets
train_size = int(0.8 * len(dataset))
train_dataset = dataset.take(train_size)
val_dataset = dataset.skip(train_size)

# Convert training and validation datasets into numpy arrays
X_train = []
y_train = []
for image, label in train_dataset:
    X_train.append(image.numpy())
    y_train.append(label.numpy())

X_train = np.array(X_train)
y_train = np.array(y_train)

# Example of dataset shapes
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

Note: you can skip the preprocessing function entirely if no preprocessing is required.

Prediction answered 14/3 at 6:59 Comment(0)
