TensorFlow Dataset API: input pipeline with Parquet files

Asked 7/8, 2018 at 17:34 Answered 24/3, 2023 at 14:47

I am trying to design an input pipeline with the Dataset API. My data is stored in Parquet files. What is a good way to add them to my pipeline?

Hixon answered 7/8, 2018 at 17:34 Comment(0)

We have released Petastorm, an open source library that allows you to use Apache Parquet files directly through the TensorFlow Dataset API.

Here is a small example:

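    import tensorflow as tf
    from petastorm.reader import Reader
    from petastorm.tf_utils import make_petastorm_dataset

    # Open the Parquet store; newer Petastorm releases expose make_reader() for this.
    with Reader('hdfs://.../some/hdfs/path') as reader:
        dataset = make_petastorm_dataset(reader)
        # TensorFlow 1.x graph-mode iteration.
        iterator = dataset.make_one_shot_iterator()
        tensor = iterator.get_next()
        with tf.Session() as sess:
            sample = sess.run(tensor)
            print(sample.id)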

Motch answered 21/9, 2018 at 20:17 Comment(0)

Maybe a little late, but it looks like this is now available through TensorFlow I/O (the tensorflow-io package), which plugs into the TensorFlow Dataset API:

https://www.tensorflow.org/io/api_docs/python/tfio/experimental/IODataset#from_parquet

Graniteware answered 24/3, 2023 at 14:47 Comment(0)
