How do I read only part of a column from a Parquet file using Parquet.net?
I am using Parquet.Net to read Parquet files, but the only way I can see to read data from a file is:

//get the first group
Parquet.ParquetRowGroupReader rowGroup = myParquet.OpenRowGroupReader(0);

//get the first column
Parquet.Data.DataColumn col1 = rowGroup.ReadColumn(myParquet.Schema.GetDataFields()[0]);

This lets me get the first column from the first row group, but the problem is that the first row group can contain something like 4 million rows, and ReadColumn will read all 4 million values.

How do I tell ReadColumn that I only want it to read, say, the first 100 rows? Reading all 4 million rows wastes memory and file read time.

I actually got an out-of-memory error until I changed my code to resize that 4-million-value array down to my 100 rows after reading each column.
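(Roughly, that workaround looks like the sketch below; `col1` is the `DataColumn` from the snippet above, and the column's physical element type is assumed here to be `double`.)

```csharp
// Read the whole column, then immediately shrink the array to the rows
// I actually need, so the 4-million-element array can be collected.
// Assumption: the column's physical type is double.
double[] data = (double[])col1.Data;
Array.Resize(ref data, 100); // keep only the first 100 values
```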

I don't necessarily need row-based access; I can work with columns. I just don't need a whole row group's worth of values in each column. Is this possible? If row-based access is better, how does one use it? The Parquet.Net project site doesn't give any examples and only talks about tables.

Pachysandra answered 21/7, 2020 at 1:3 Comment(0)
The parquet-dotnet documentation recommends not writing more than 5,000 rows into a single row group for performance reasons, even though at the bottom of the same page it says row groups are designed to hold around 50,000 rows on average:

It's not recommended to have more than 5'000 rows in a single row group for performance reasons

My team works with 100,000 rows per row group. It may depend on what you are storing, but 4,000,000 records in one row group of a column does sound like too much.

So, to answer your question: to read only part of a column, make your row groups smaller when writing the file, and then read only as many row groups as you need. If you want only 100 records, read the first row group and take the first 100 values from it; reasonably sized row groups are very fast to read.
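As a sketch (reusing the `myParquet` reader and schema access shown in the question, and assuming reasonably sized row groups), collecting the first 100 values could look like:

```csharp
// Collect values row group by row group, stopping once we have enough,
// so we never materialize more than one row group's worth of data at a time.
var field = myParquet.Schema.GetDataFields()[0];
var values = new List<object>(100);

for (int g = 0; g < myParquet.RowGroupCount && values.Count < 100; g++)
{
    using (var rowGroup = myParquet.OpenRowGroupReader(g))
    {
        Parquet.Data.DataColumn col = rowGroup.ReadColumn(field);
        foreach (var v in col.Data)
        {
            if (values.Count == 100) break;
            values.Add(v);
        }
    }
}
```

With small row groups this reads at most one group past the 100th value, instead of 4 million values in one go.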

Tithonus answered 9/12, 2021 at 12:42 Comment(4)
1) You assume that we're writing those files rather than someone else. 2) Even if the data is split into smaller row groups, how do I find a row group with column values in a required range without reading the whole column of every row group? – Opt
1) Yes, if you can't control the size of the row groups then Parquet.Net is probably not the right library for reading these files. 2) Same; I don't think you can do that with Parquet.Net. Maybe look into Spark or other Parquet reading options; it may not be possible in .NET currently. – Tithonus
This is really strange. It's not difficult to implement. – Opt
I think it may be because Parquet is a column-based format that heavily uses compression to save storage space. The columns contain row groups, but I think they may be compressed together with their own metadata, so a row group may not be readable until fully decompressed? I don't fully understand how the format works. Maybe there's a way with the DataColumnReader mentioned in another answer to read it row by row despite that. – Tithonus
According to the source code, this capability exists in DataColumnReader, but that is an internal class and thus not directly usable.

ParquetRowGroupReader uses it inside its ReadColumn method but exposes no such options.

What you can do in practice is copy the whole DataColumnReader class and use it directly, but this could breed future compatibility issues.

If the problem can wait, I'd recommend copying the class and then opening an issue plus a pull request against the library with the enhanced class, so the copied class can eventually be removed.

Topdrawer answered 14/12, 2021 at 17:47 Comment(1)
Thank you. I made do by eating the read time and memory usage, as in the question. If I ever need to work on my project again (which will probably be never) I'll try it! But I'm not familiar with opening issues and making pull requests either, so someone else will probably get around to doing it before I do. – Pachysandra
ParquetSharp should be able to do that. It's a wrapper around the Apache Parquet C++ library (part of the Arrow project), but it supports Windows, Linux and macOS.

using System;
using ParquetSharp;

using (var reader = new ParquetFileReader(path))
using (var group = reader.RowGroup(0))
{
  // You can use the logical reader for automatic conversion to a fitting CLR type
  // here `double` as an example
  // (unfortunately this does not work well with complex schemas IME)
  const int batchSize = 4000;
  Span<double> buffer = new double[batchSize];
  var howManyRead = group.Column(0).LogicalReader<double>().ReadBatch(buffer);

  // or if you want raw Parquet (with Dremel data and physical type)
  var resultObject = group.Column(0).Apply(new Visitor());
}

class Visitor : IColumnReaderVisitor<object>
{
    public object OnColumnReader<TValue>(ColumnReader<TValue> columnReader)
      where TValue : unmanaged
    {
        // TValue will be the physical Parquet type
        const int batchSize = 200000;
        var buffer = new TValue[batchSize];
        var definitionLevels = new short[batchSize];
        var repetitionLevels = new short[batchSize];
        long valuesRead;
        var levelsRead = columnReader.ReadBatch(batchSize,
                                                definitionLevels, repetitionLevels,
                                                buffer, out valuesRead);
        // Return stuff you are interested in here, will be `resultObject` above
        return new object(); 
    }
}
Osteoplastic answered 3/4, 2022 at 15:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.