Feeding .npy (numpy files) into tensorflow data pipeline
TensorFlow seems to lack a reader for ".npy" files. How can I read my data files into the new tf.data.Dataset pipeline? My data doesn't fit in memory.

Each object is saved in a separate ".npy" file. Each file contains two different ndarrays as features and a scalar as their label.

Unbalance answered 20/2, 2018 at 16:8 Comment(0)

Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy Arrays section of the docs:

Consuming NumPy arrays

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().

# Load the training data into two NumPy arrays, for example using `np.load()`.
# (Note: `np.load()` only acts as a context manager for `.npz` archives; for a
# plain `.npy` file it returns the array directly, with no `with` block needed.)
with np.load("/var/data/training_data.npy") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

If the file doesn't fit into memory, it seems the only recommended approach is to first convert the npy data into the TFRecord format and then use a TFRecord dataset, which can be streamed without being fully loaded into memory.

Here is a post with some instructions.

FWIW, it seems crazy to me that a TFRecord dataset cannot be instantiated from a directory name or npy file name(s) directly, but it appears to be a limitation of plain TensorFlow.
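
For illustration, a minimal conversion sketch might look like the following. The companion labels file, the (num_examples, num_features) float32 layout, and all paths are assumptions made up for this example, not something the question specifies.

import numpy as np
import tensorflow as tf

def npy_to_tfrecord(features_path, labels_path, tfrecord_path):
    # Memory-map the feature array so it is never loaded fully into RAM.
    features = np.load(features_path, mmap_mode='r')  # assumed shape: (num_examples, num_features), float32
    labels = np.load(labels_path)                     # assumed shape: (num_examples,), integer labels
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        for row, label in zip(features, labels):
            example = tf.train.Example(features=tf.train.Features(feature={
                'features': tf.train.Feature(
                    float_list=tf.train.FloatList(value=row.astype(np.float32).tolist())),
                'label': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
            }))
            writer.write(example.SerializeToString())

npy_to_tfrecord('/var/data/features.npy', '/var/data/labels.npy', '/var/data/training_data.tfrecord')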

If you can split the single large npy file into smaller files that each roughly represent one batch for training, then you could write a custom data generator in Keras that would yield only the data needed for the current batch.
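
A rough sketch of such a generator, assuming you have already split the data into one feature file and one label file per batch (the path lists are placeholders):

import numpy as np
from tensorflow import keras

class NpyBatchSequence(keras.utils.Sequence):
    """Yields one pre-saved batch per step. Each .npy file is assumed to hold
    exactly one batch of features (or labels); adjust to your on-disk layout."""

    def __init__(self, feature_paths, label_paths):
        self.feature_paths = feature_paths
        self.label_paths = label_paths

    def __len__(self):
        return len(self.feature_paths)

    def __getitem__(self, idx):
        x = np.load(self.feature_paths[idx])
        y = np.load(self.label_paths[idx])
        return x, y

# model.fit(NpyBatchSequence(feature_paths, label_paths), epochs=10)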

In general, if your dataset cannot fit in memory, storing it as one single large npy file makes it very hard to work with; preferably you should first reformat the data, either as TFRecord or as multiple npy files, and then use other methods.

Ewe answered 20/2, 2018 at 16:14 Comment(5)
I have seen that guide, but unfortunately, it doesn't fit in memory!Unbalance
Thank you very much, but converting my numpy files to TFRecord is the last thing I want to do since I have around 5,000,000 files and it would take a long time. I think I will go with the Keras generator idea. Thanks again!Unbalance
Each file of your 5,000,000 files doesn't fit into memory?Ewe
I'm in a similar situation to the OP: I have about a million small files, and using a simple Keras generator worked like a charm. Unfortunately, it doesn't work well with multiprocessing and is slower than the tf.data APIs, so I ended up converting the whole dataset to TFRecord files; performance increased quite a bit over the Keras generator. But that's just me; it could be different for other situations.Hulburt
I have faced a similar situation to @jackz314, but in my case the loading speed does not increase.Largo

It is actually possible to read NPY files directly with TensorFlow instead of going through TFRecords. The key pieces are tf.data.FixedLengthRecordDataset and tf.io.decode_raw, along with a look at the documentation of the NPY format. For simplicity, let's suppose that you are given a float32 NPY file containing an array with shape (N, K), and that you know the number of features K beforehand, as well as the fact that it is a float32 array. An NPY file is just a binary file with a small header followed by the raw array data (object arrays are different, but we are considering numbers here). In short, you can find the size of this header with a function like this:

def npy_header_offset(npy_path):
    with open(str(npy_path), 'rb') as f:
        if f.read(6) != b'\x93NUMPY':
            raise ValueError('Invalid NPY file.')
        version_major, version_minor = f.read(2)
        if version_major == 1:
            header_len_size = 2
        elif version_major == 2:
            header_len_size = 4
        else:
            raise ValueError('Unknown NPY file version {}.{}.'.format(version_major, version_minor))
        header_len = sum(b << (8 * i) for i, b in enumerate(f.read(header_len_size)))
        header = f.read(header_len)
        if not header.endswith(b'\n'):
            raise ValueError('Invalid NPY file.')
        return f.tell()

With this you can create a dataset like this:

import tensorflow as tf

npy_file = 'my_file.npy'
num_features = ...
dtype = tf.float32
header_offset = npy_header_offset(npy_file)
dataset = tf.data.FixedLengthRecordDataset([npy_file], num_features * dtype.size, header_bytes=header_offset)

Each element of this dataset contains a long string of bytes representing a single example. You can now decode it to obtain an actual array:

dataset = dataset.map(lambda s: tf.io.decode_raw(s, dtype))

The elements will have indeterminate shape, though, because TensorFlow does not keep track of the length of the strings. You can just enforce the shape since you know the number of features:

dataset = dataset.map(lambda s: tf.reshape(tf.io.decode_raw(s, dtype), (num_features,)))

Similarly, you can choose to perform this step after batching, or combine it in whatever way you feel like.
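
For instance, decoding after batching instead of before could look like this (the batch size is arbitrary):

batch_size = 32  # arbitrary
dataset = tf.data.FixedLengthRecordDataset([npy_file], num_features * dtype.size,
                                           header_bytes=header_offset)
dataset = dataset.batch(batch_size)
# decode_raw on a batch of equal-length strings yields a (batch, num_features) tensor
dataset = dataset.map(lambda s: tf.reshape(tf.io.decode_raw(s, dtype), (-1, num_features)))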

The limitation is that you have to know the number of features in advance. It is possible to extract it from the NumPy header, though it is a bit of a pain and hardly doable from within TensorFlow, so the file names would need to be known in advance. Another limitation is that, as it stands, the solution requires you to use either only one file per dataset or files that have the same header size, although if you know that all the arrays have the same size that should actually be the case.
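
If you do want to read the shape from the header up front (in plain Python, before building the pipeline), NumPy's own header helpers can do it; a small sketch:

import numpy as np

def npy_shape_and_dtype(npy_path):
    # Reads only the small NPY header, not the array data.
    with open(npy_path, 'rb') as f:
        version = np.lib.format.read_magic(f)
        if version == (1, 0):
            shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
        else:
            shape, fortran_order, dtype = np.lib.format.read_array_header_2_0(f)
        return shape, dtype

# e.g. num_features = npy_shape_and_dtype(npy_file)[0][1], given the (N, K) layout above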

Admittedly, if one considers this kind of approach, it may just be better to have a pure binary file without headers, and either hard-code the number of features or read it from a different source...

Gob answered 19/6, 2018 at 16:18 Comment(2)
As of tensorflow 2.0, tf.decode_raw has been moved to tf.io.decode_raw. tensorflow.org/api_docs/python/tf/io/decode_raw?hl=enOccupant
Could you give a direction how I can modify your code to feed pickled numpy arrays files that are organized in separate folders which stand for their classes? Here is my full question - #74614610Malaise

You can do it with tf.py_func; see the example here. The parse function simply decodes the filename from bytes to a string and calls np.load.

Update: something like this:

import numpy as np
import tensorflow as tf

def read_npy_file(item):
    # `item` arrives as bytes inside tf.py_func, so decode it to a path string.
    data = np.load(item.decode())
    return data.astype(np.float32)

file_list = ['/foo/bar.npy', '/foo/baz.npy']

dataset = tf.data.Dataset.from_tensor_slices(file_list)

dataset = dataset.map(
        lambda item: tuple(tf.py_func(read_npy_file, [item], [tf.float32,])))
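
Note that tf.py_func hides the static shape from TensorFlow, so downstream ops may see an unknown shape; if you know it, you can pin it afterwards. The (1000,) below is just a placeholder for whatever your arrays actually are:

dataset = dataset.map(lambda arr: tf.ensure_shape(arr, (1000,)))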
Concepcionconcept answered 24/3, 2018 at 0:3 Comment(5)
"The parse function would simply decode the filename from bytes to string and call np.load." can you please provide a code for this?Unbalance
Er will that be slow...?Orometer
Yes confirming this is very slow. Lots of overhead added after Python reads Numpy.Largo
@Orometer slow with respect to what ?Concepcionconcept
it worked for me by just changing 2nd line as data = np.load(item.numpy().decode())Unbeknown

Problem setup

I had a folder with images that were being fed into an InceptionV3 model for extraction of features. This seemed to be a huge bottleneck for the entire process. As a workaround, I extracted features from each image and then stored them on disk in a .npy format.

Now I had two folders, one for the images and one for the corresponding .npy files. There was an evident problem with the loading of .npy files in the tf.data.Dataset pipeline.

Workaround

I came across TensorFlow's official tutorial on Show, Attend and Tell, which had a great workaround for the problem this thread (and I) was having.

Load numpy files

First off, we need to create a mapping function that accepts the .npy file name and returns the NumPy array.

# Load the numpy files
def map_func(feature_path):
  feature = np.load(feature_path)
  return feature

Use the tf.numpy_function

With tf.numpy_function we can wrap any Python function and use it as a TensorFlow op. The wrapped function receives NumPy objects (which is exactly what we want).

We create a tf.data.Dataset with the list of all the .npy filenames.

dataset = tf.data.Dataset.from_tensor_slices(feature_paths)

We then use the map function of the tf.data.Dataset API to do the rest of our task.

# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item: tf.numpy_function(
          map_func, [item], tf.float16),
          num_parallel_calls=tf.data.AUTOTUNE)
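
From here the pipeline can be finished in the usual way; the buffer and batch sizes below are arbitrary placeholders:

dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE)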
Darla answered 24/3, 2021 at 2:30 Comment(0)
