Number of examples in each tfrecord

I am running the sample.sh script in Google Cloud Shell to call the preprocess step below on a set of images, following the steps of the flowers example.

https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/flowers/trainer/preprocess.py

Preprocessing completed successfully on both the eval set and the train set, but the generated .tfrecord.gz files do not seem to match the number of images in eval_set.csv and train_set.csv.

For example, the file name eval-00000-of-00157.tfrecord.gz suggests there are 158 tfrecords, while eval_set.csv has 35227 rows. Every row includes a valid image_url (all of the images are uploaded to Storage) and a valid label.

I would like to know if there is a way to monitor and control the number of images per tfrecord in the preprocess.py config.

Thanks

Update: I got this worked out. The following counts the records across all files:

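import os
import tensorflow as tf
from tensorflow.python.lib.io import file_io

# The files are gzip-compressed, so the reader needs matching options.
options = tf.python_io.TFRecordOptions(
    compression_type=tf.python_io.TFRecordCompressionType.GZIP)

# 'url/path' is a placeholder for the directory holding the .tfrecord.gz files.
pattern = os.path.join('url/path', '*.tfrecord.gz')

# Count every record in every matching file.
print(sum(1 for f in file_io.get_matching_files(pattern)
          for example in tf.python_io.tf_record_iterator(f, options=options)))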

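import os
import tensorflow as tf
from tensorflow.python.lib.io import file_io

# Gzip options are required because the files end in .tfrecord.gz.
options = tf.python_io.TFRecordOptions(
    compression_type=tf.python_io.TFRecordCompressionType.GZIP)
files = os.path.join('gs://my_bucket/my_dir', 'eval-*.tfrecord.gz')
print(sum(1 for f in file_io.get_matching_files(files)
          for example in tf.python_io.tf_record_iterator(f, options=options)))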
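
To monitor the count per file rather than only the total, something like the following might help. It is a minimal sketch that does not use TensorFlow at all: it assumes the files have been copied to local disk (it cannot read gs:// paths) and relies on the TFRecord framing of an 8-byte little-endian length, a 4-byte length CRC, the payload, and a 4-byte payload CRC. The CRCs are skipped rather than verified, and the names count_tfrecords and count_per_shard are made up for this illustration.

```python
import gzip
import struct


def count_tfrecords(path):
    """Count records in one gzip-compressed TFRecord file on local disk."""
    count = 0
    with gzip.open(path, 'rb') as f:
        while True:
            header = f.read(8)  # little-endian uint64: payload length
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError('truncated record header in %s' % path)
            (length,) = struct.unpack('<Q', header)
            # Skip the 4-byte length CRC, the payload, and the 4-byte payload CRC.
            # The CRCs are NOT verified in this sketch.
            skipped = f.read(4 + length + 4)
            if len(skipped) != length + 8:
                raise ValueError('truncated record in %s' % path)
            count += 1
    return count


def count_per_shard(paths):
    """Return {path: record count} so uneven or empty files are easy to spot."""
    return {p: count_tfrecords(p) for p in sorted(paths)}
```

Calling count_per_shard(glob.glob('eval-*.tfrecord.gz')) on a local copy gives the count for each file, and summing the values should give the same total as the snippets above if every row in the csv was written.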
