Parquet without Hadoop?

I want to use Parquet as columnar storage in one of my projects, but I don't want to depend on the Hadoop/HDFS libraries. Is it possible to use Parquet outside of HDFS? Or what is the minimum dependency?

Lundgren answered 26/3, 2015 at 13:35 Comment(1)

This is most certainly possible now: #50933929 – Subdivide 23/6, 2018 at 2:27

Investigating the same question, I found that it is apparently not possible for the moment. I found this GitHub issue, which proposes decoupling Parquet from the Hadoop API. Apparently it has not been done yet.

In the Apache Jira I found an issue that asks for a way to read a Parquet file outside Hadoop. It is unresolved at the time of writing.

EDIT:

Issues are no longer tracked on GitHub (the first link above is dead). A newer issue I found is on Apache's Jira, with the following headline:

make it easy to read and write parquet files in java without depending on hadoop

Tracey answered 24/7, 2015 at 11:24 Comment(1)

This was written in 2015 and updated in 2018. It is 2020 and still no joy. – Certificate 20/3, 2020 at 12:46

Since Parquet is just a file format, it is obviously possible to decouple it from the Hadoop ecosystem. The simplest approach I could find is through Apache Arrow; see here for a Python example.

Here is a small excerpt from the official PyArrow docs:

Writing

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')

Reading

pq.read_table('example.parquet', columns=['one', 'three'])
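To illustrate the "it is just a file format" point, here is a minimal sketch that checks the outer framing of a Parquet file with nothing but the Python standard library. It relies on the layout described in the Parquet format specification: a 4-byte PAR1 magic at the start, and at the end a 4-byte little-endian footer length followed by PAR1 again. The function name parquet_footer_length is my own, and the sketch assumes a regular file without footer encryption. It does not decode the Thrift-encoded metadata itself; for that you still want a library such as PyArrow.

```python
import os
import struct

MAGIC = b"PAR1"  # marker at both ends of a regular (non-encrypted-footer) Parquet file


def parquet_footer_length(path):
    """Return the size in bytes of the file metadata (footer) of a Parquet file.

    Hypothetical helper for illustration: raises ValueError if the file
    does not have the expected PAR1 framing.
    """
    size = os.path.getsize(path)
    # Smallest possible framing: leading magic + footer length + trailing magic.
    if size < 12:
        raise ValueError("file too small to be Parquet")
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)  # last 8 bytes: footer length, then magic
        tail = f.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    (footer_len,) = struct.unpack("<I", tail[:4])  # little-endian uint32
    if footer_len > size - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len
```

Calling parquet_footer_length('example.parquet') on the file written above should return a positive number of bytes; the exact value depends on the writer and its version.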

EDIT:

With Pandas directly

It is also possible to use pandas directly to read and write DataFrames. This makes it as simple as my_df.to_parquet("myfile.parquet") and my_df = pd.read_parquet("myfile.parquet")

Fieldfare answered 27/2, 2019 at 21:55 Comment(0)

You don't need HDFS/Hadoop to consume Parquet files. There are different ways to consume Parquet:

You could access it using Apache Spark.
If you are on AWS, you can load or access it directly from Redshift or Athena.
If you are on Azure, you can load or access it from SQL Data Warehouse or SQL Server.
Similar options exist on GCP.

Fariss answered 30/1, 2020 at 11:23 Comment(1)

All those solutions will use Hadoop jars to read it, though. But they abstract it away and make it really painless – Littlejohn 30/1, 2020 at 11:28

Late to the party, but I've been working on something that should make this possible: https://github.com/jmd1011/parquet-readers.

This is still under development, but a final implementation should be out within a month or two of writing this.

Edit: Months later, and still working on this! It is under active development, just taking longer than expected.

Grandfather answered 28/6, 2016 at 18:2 Comment(0)

What type of data do you have in Parquet? You don't need HDFS to read Parquet files; it is definitely not a prerequisite. We use Parquet files at Incorta for our staging tables. We do not ship with a dependency on HDFS, although you can store the files on HDFS if you want. Obviously, we at Incorta can read directly from the Parquet files, but you can also use Apache Drill to connect: use file:/// as the connection rather than hdfs:///. See below for an example.

To read or write Parquet data, you need to include the Parquet format in the storage plugin format definitions. The dfs plugin definition includes the Parquet format.

{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "json_files": {
      "location": "/incorta/tenants/demo//drill/json/",
      "writable": false,
      "defaultInputFormat": "json"
    }
  }
}

Arnaldo answered 22/12, 2016 at 7:32 Comment(0)

Nowadays you don't need to rely on Hadoop as heavily as before.

Please see my other post: How to view Apache Parquet file in Windows?

Subdivide answered 24/6, 2018 at 13:32 Comment(0)
