TensorFlow create dataset from numpy array
TensorFlow has a nice built-in way to store data. It is used, for example, to store the MNIST data in the tutorial:

>>> mnist
<tensorflow.examples.tutorials.mnist.input_data.read_data_sets.<locals>.DataSets object at 0x10f930630>

Suppose I have input and output numpy arrays:

>>> x = np.random.normal(0,1, (100, 10))
>>> y = np.random.randint(0, 2, 100)

How can I transform them into a TensorFlow dataset?

I want to use functions like next_batch.

Buttonhole answered 18/12, 2015 at 17:35 Comment(0)

The Dataset object is only part of the MNIST tutorial, not the main TensorFlow library.

You can see where it is defined here:

GitHub Link

The constructor accepts images and labels arguments, so presumably you can pass your own values there.
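If all you need is next_batch-style access over your own arrays, the behavior is easy to replicate without the tutorial code at all. The sketch below is a hypothetical, NumPy-only helper (the class name `ArrayDataset` is mine, modeled loosely on the tutorial's DataSet) that serves shuffled batches:

```python
import numpy as np

class ArrayDataset:
    """Minimal next_batch-style wrapper over two NumPy arrays (hypothetical helper)."""

    def __init__(self, images, labels):
        assert images.shape[0] == labels.shape[0]
        self._images = images
        self._labels = labels
        self._num_examples = images.shape[0]
        self._index = 0

    def next_batch(self, batch_size):
        # When the current epoch is exhausted, reshuffle and start over.
        if self._index + batch_size > self._num_examples:
            perm = np.random.permutation(self._num_examples)
            self._images = self._images[perm]
            self._labels = self._labels[perm]
            self._index = 0
        start, end = self._index, self._index + batch_size
        self._index = end
        return self._images[start:end], self._labels[start:end]

# Same shapes as in the question.
x = np.random.normal(0, 1, (100, 10))
y = np.random.randint(0, 2, 100)
ds = ArrayDataset(x, y)
xb, yb = ds.next_batch(32)
print(xb.shape, yb.shape)  # (32, 10) (32,)
```

Note that the slicing here copies each batch out of the arrays, which is the overhead the comment below is concerned about.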

Considerable answered 18/12, 2015 at 17:47 Comment(2)
OK, thanks, I suspected this. I think it would be a helpful tool as part of the main library. AFAIK any batch operation on a numpy array requires a copy of the data, which may lead to a slower algorithm. – Buttonhole
The philosophy is that TensorFlow should be just a core math library, and other open-source libraries can provide the additional abstractions used for machine learning, similar to Theano, which has libraries like Pylearn2 built on top. If you want to avoid copy operations, you can use the queue-based data access functionality rather than feeding placeholders. – Considerable

Recently, TensorFlow added a feature to its Dataset API to consume NumPy arrays. See here for details.

Here is the snippet that I copied from there:

import numpy as np
import tensorflow as tf

# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
# [Other transformations on `dataset`...]
dataset = ...
iterator = dataset.make_initializable_iterator()

sess = tf.Session()
sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})
Berkshire answered 12/4, 2018 at 21:52 Comment(4)
In TF2 this does not work anymore. Do you know what the recommended way is in TF2? – Verina
For TF2, please check this link. – Berkshire
@Berkshire do you know how this can be done if your dataset does not fit into memory? – Whaler
You mean your data set is in NumPy format, but it cannot be loaded into memory? If this is the case, this solution may help. – Berkshire
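For the does-not-fit-in-memory case raised above, one NumPy-level option is to memory-map the file with `np.load(..., mmap_mode='r')`, so only the slices you request are read from disk. The sketch below is an assumption-laden illustration (the file path and array shapes are made up; a small file stands in for a large one):

```python
import os
import tempfile
import numpy as np

# Create a sample .npy file on disk (stand-in for a dataset too large for RAM).
path = os.path.join(tempfile.mkdtemp(), "features.npy")
np.save(path, np.random.normal(0, 1, (1000, 10)).astype(np.float32))

# Memory-map instead of loading: the full array is never read into RAM at once.
features = np.load(path, mmap_mode="r")

def batches(array, batch_size):
    # Yield successive batches; each slice is pulled from disk on demand
    # and copied into a regular in-memory array.
    for start in range(0, array.shape[0], batch_size):
        yield np.asarray(array[start:start + batch_size])

first = next(batches(features, 32))
print(first.shape)  # (32, 10)
```

The memory-mapped array can also be fed batch-by-batch to a placeholder, or wrapped in a generator for `tf.data.Dataset.from_generator`.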

As an alternative, you may use the function tf.train.batch() to create batches of your data and at the same time eliminate the use of tf.placeholder. Refer to the documentation for more details. Note that tf.train.batch() is queue-based, so you need to start the queue runners (e.g. with tf.train.start_queue_runners()) before fetching batches in a session.

>>> images = tf.constant(X, dtype=tf.float32) # X is a np.array
>>> labels = tf.constant(y, dtype=tf.int32)   # y is a np.array
>>> batch_images, batch_labels = tf.train.batch([images, labels], batch_size=32, capacity=300, enqueue_many=True)
Perfunctory answered 4/11, 2017 at 14:42 Comment(0)