Read Parquet Files using Apache Arrow
I have some Parquet files that I've written in Python using PyArrow (Apache Arrow):

pyarrow.parquet.write_table(table, "example.parquet")

Now I want to read these files (and preferably get an Arrow Table) using a Java program.

In Python, I can simply use the following to get an Arrow Table from my Parquet file:

table = pyarrow.parquet.read_table("example.parquet")

Is there an equivalent and easy solution in Java?

I couldn't find any good, working examples or any useful documentation for Java (only for Python), and some examples don't list all the required Maven dependencies. I also don't want to use a Hadoop file system; I just want to read local files.

Note: I also found out that I can't use "Apache Avro", because my Parquet files contain column names with the symbols [, ] and $, which are invalid characters in Apache Avro.

Also, please provide the Maven dependencies if your solution uses Maven.


I am on Windows and using Eclipse.


Update (November 2020): I never found a suitable solution and just stuck with Python for my use case.
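For anyone landing here later: newer Apache Arrow Java releases ship an arrow-dataset module that can read local Parquet files without Spark or Avro, so it does not hit the invalid-column-name problem. A sketch, assuming the Maven artifacts org.apache.arrow:arrow-dataset and org.apache.arrow:arrow-memory-netty are on the classpath (the exact scanner API has varied between Arrow versions, and the file URI and batch size below are placeholders):

```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;

public class ReadParquet {
    public static void main(String[] args) throws Exception {
        String uri = "file:///C:/data/example.parquet"; // placeholder path

        try (BufferAllocator allocator = new RootAllocator()) {
            DatasetFactory factory = new FileSystemDatasetFactory(
                    allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
            try (Dataset dataset = factory.finish();
                 Scanner scanner = dataset.newScan(new ScanOptions(/*batchSize=*/ 32768));
                 ArrowReader reader = scanner.scanBatches()) {
                while (reader.loadNextBatch()) {
                    // Each batch arrives as a VectorSchemaRoot (Arrow's columnar
                    // batch); there is no PyArrow-style Table object in Java.
                    VectorSchemaRoot root = reader.getVectorSchemaRoot();
                    System.out.println(root.rowCount() + " rows in this batch");
                }
            }
        }
    }
}
```

Note that this module did not exist in a usable form when the question was asked in May 2020, which is consistent with the update above.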

Weatherly answered 27/5, 2020 at 15:42 Comment(2)
The PyArrow Table object is not part of the Apache Arrow specification and was not implemented in Java. I am trying to find a solution too. In the meantime I implemented it with Spark 3.0.1, reading the Parquet files there instead. I keep looking for a framework-independent solution.Ineffaceable
Perhaps Dremio (github.com/dremio/dremio-oss) can provide a solution.Ineffaceable
-3

It's somewhat overkill, but you can use Spark.

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
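If the Spark dependency is acceptable, reading a local Parquet file from Java is short. A sketch, assuming spark-sql (e.g. org.apache.spark:spark-sql_2.12) is on the classpath; the file path is a placeholder:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkParquetExample {
    public static void main(String[] args) {
        // Local-mode session: no cluster required.
        SparkSession spark = SparkSession.builder()
                .appName("read-parquet")
                .master("local[*]")
                .getOrCreate();

        // Spark reads Parquet natively; a plain local path works.
        Dataset<Row> df = spark.read().parquet("example.parquet");
        df.printSchema();
        df.show();

        spark.stop();
    }
}
```

One caveat for the asker's setup: on Windows, Spark has historically also needed the winutils.exe Hadoop binaries to be present, so this is not entirely Hadoop-free.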

Luck answered 27/5, 2020 at 16:13 Comment(0)
