How to load big dataset from CSV into keras

I'm trying to use Keras with TensorFlow to train a network based on the SURF features that I obtained from several images. I have all these features stored in a CSV file with the following columns:

 [ID, Code, PointX, PointY, Desc1, ..., Desc64]

The "ID" column is an autoincremental index created by pandas when I store all the values. The "Code" column is the label of the point, this would be just a number that I got by pairing the actual code (which is a string) with a number. "PointX/Y" are the coordinates of the point found in an image of a given class, and "Desc#" is the float value of the corresponding descriptor of that point.

The CSV file contains all the keypoints and descriptors found in all 20,000 images. This gives a total size of almost 60 GB on disk, which I obviously can't fit into memory.

I've been trying to load chunks of the file using pandas, put all the values in a numpy array, and then fit my model (a Sequential model with only 3 layers). I've used the following code to do so:

import numpy as np
import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv("surf_kps.csv", chunksize=chunksize):
    dataset_chunk = chunk.to_numpy(dtype=np.float32, copy=False)
    print(dataset_chunk)
    # Split the chunk into features (the descriptor columns) and labels (the "Code" column)
    X = dataset_chunk[:, 9:]
    Y = dataset_chunk[:, 1]
    # Train model on this chunk
    model.fit(x=X, y=Y, batch_size=200, epochs=20)
    # Evaluate model on the same chunk
    scores = model.evaluate(X, Y)
    print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))

This works fine with the first chunk, but when the loop gets to the next chunk, accuracy and loss get stuck at 0.

Is the way I'm trying to load all this information wrong?

Thanks in advance!

------ EDIT ------

Ok, now I made a simple generator like this:

import numpy as np

def read_csv(filename):
    with open(filename, 'r') as f:
        # Iterate over lines lazily so the whole file is never loaded into memory
        for line in f:
            record = line.rstrip().split(',')
            features = [np.float32(n) for n in record[9:73]]  # the 64 descriptor columns
            label = int(record[1])                            # the "Code" column
            print("features: ", type(features[0]), " ", type(label))
            yield np.array(features), label

and use fit_generator with it:

tf_ds = read_csv("mini_surf_kps.csv")
model.fit_generator(tf_ds, steps_per_epoch=1000, epochs=20)

I don't know why, but I keep getting an error just before the first epoch starts:

ValueError: Error when checking input: expected dense_input to have shape (64,) but got array with shape (1,)

The first layer of the model has input_dim=64, and the features array yielded by the generator also has shape (64,).

Shandishandie asked 24/7, 2019 at 19:34

I think it is better to use tf.data.Dataset; this may help:
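
For example, here is a rough sketch of what that could look like for the CSV from the question (this assumes the layout implied by the question, with 73 columns, the label in column 1 and the 64 descriptors in columns 9-72; adjust the indices and column count to your actual file):

import tensorflow as tf

NUM_COLUMNS = 73  # assumed; taken from the record[9:73] slicing in the question

def parse_line(line):
    # Parse one CSV line, reading every column as float32
    fields = tf.io.decode_csv(line, record_defaults=[[0.0]] * NUM_COLUMNS)
    features = tf.stack(fields[9:73])     # the 64 SURF descriptor values
    label = tf.cast(fields[1], tf.int32)  # the "Code" column
    return features, label

dataset = (
    tf.data.TextLineDataset("surf_kps.csv")
    .skip(1)  # skip the header row written by pandas, if there is one
    .map(parse_line, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .shuffle(10000)
    .batch(200)
    .prefetch(1)
)

model.fit(dataset, epochs=20)

This streams the file line by line, so the 60 GB never has to fit in memory, and batching and shuffling are handled by the pipeline instead of by hand.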

Taenia answered 24/7, 2019 at 23:12

If you are using TF 2.0, you could verify that the contents of the dataset are right. You can simply do this with:

print(next(iter(tf_ds)))

to see the first element of the dataset and check if it matches the input expected by the model.
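
For instance, a quick sketch using the read_csv generator from the question (the tf_ds above):

import numpy as np

features, label = next(iter(tf_ds))
print(np.asarray(features).shape)  # the model expects each sample to have shape (64,)
print(label)

If the printed shape is not (64,), then the generator (or the way its output is batched) is what needs fixing.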

Hinze answered 29/7, 2019 at 14:41
Comment from Shandishandie: Great, I've found that I was feeding the network with a transposed array; that's where I was failing. Thanks!
