How to view Apache Parquet file in Windows? [closed]

I couldn't find any plain English explanations regarding Apache Parquet files. Such as:

  1. What are they?
  2. Do I need Hadoop or HDFS to view/create/store them?
  3. How can I create parquet files?
  4. How can I view parquet files?

Any help regarding these questions is appreciated.

Overplay answered 19/6, 2018 at 16:55 Comment(4)
Windows utility to open and view Parquet files: github.com/mukunku/ParquetViewer – Overplay
DuckDB CLI tool to view parquet data or schema: https://mcmap.net/q/179544/-inspect-parquet-from-command-line – Cozenage
Try using Tevis, a web-app tool with a rich interface for viewing local parquet files as big as 1 GB. Website and demo here. – Trigeminal
Windows/Linux/Mac parquet viewer with the ability to query via SQL: timestored.com/qstudio/parquet-file-viewer – Palestra

Score: 94

What is Apache Parquet?

Apache Parquet is a binary file format that stores data in a columnar fashion. Data inside a Parquet file is similar to an RDBMS style table where you have columns and rows. But instead of accessing the data one row at a time, you typically access it one column at a time.

Apache Parquet is one of the modern big data storage formats. It has several advantages, some of which are:

  • Columnar storage: efficient data retrieval, efficient compression, etc.
  • Metadata is stored at the end of the file: this allows Parquet files to be generated from a stream of data (common in big data scenarios)
  • Supported by all Apache big data products
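
Because the format is columnar, readers can fetch a subset of columns without scanning entire rows. A minimal sketch in Python (assuming the pandas and pyarrow packages, a file named example.parquet, and a hypothetical column named price):

import pandas as pd

# read only the 'price' column; the other columns are never deserialized
df = pd.read_parquet('example.parquet', columns=['price'])
print(df.head())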

Do I need Hadoop or HDFS?

No. Parquet files can be stored in any file system, not just HDFS. As mentioned above, it is a file format. So it's just like any other file: it has a name and a .parquet extension. What will usually happen in big data environments, though, is that one dataset will be split (or partitioned) into multiple parquet files for even more efficiency.

All Apache big data products support Parquet files by default. That is why it might seem like it can only exist in the Apache ecosystem.

How can I create/read Parquet Files?

As mentioned, all current Apache big data products such as Hadoop, Hive, Spark, etc. support Parquet files by default.

So it's possible to leverage these systems to generate or read Parquet data. But this is far from practical. Imagine that in order to read or create a CSV file you had to install Hadoop/HDFS + Hive and configure them. Luckily there are other solutions.

To create your own parquet files: you don't need a cluster or anything from the Hadoop ecosystem; libraries exist for most languages (for example pyarrow or pandas in Python, and parquet-dotnet in .NET). A minimal pyarrow sketch follows below.

To view parquet file contents: on Windows you can use the ParquetViewer utility linked in the comments above (github.com/mukunku/ParquetViewer), a .NET-based GUI.
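
A minimal sketch using pyarrow directly (assuming the pyarrow package is installed; the file name example.parquet is just an illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# build a small in-memory table and write it out as Parquet
table = pa.table({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
pq.write_table(table, 'example.parquet')

# read it back and print the contents
print(pq.read_table('example.parquet').to_pandas())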

Are there other methods?

Possibly. But not many exist, and they mostly aren't well documented. This is due to Parquet being a very complicated file format (I could not even find a formal definition). The ones I've listed are the only ones I'm aware of as I write this response.

Overplay answered 19/6, 2018 at 16:55 Comment(3)
I couldn't find any information about the file extension for Parquet files elsewhere. I think I'll go with .parquet ;) – Argyle
ParquetViewer has been able to open almost none of my files :( – Totem
@ShaharPrish I would open an issue ticket in the repo with some sample files. – Overplay

Score: 43

This is possible now through Apache Arrow, which helps to simplify communication/transfer between different data formats; see my answer here or the official docs in the case of Python.

Basically this allows you to quickly read/write parquet files in a pandas DataFrame-like fashion, giving you the benefits of using notebooks to view and handle such files as if they were regular CSV files.

EDIT:

As an example, given the latest version of Pandas, make sure pyarrow is installed:
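
pip install pyarrow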

Then you can simply use pandas to manipulate parquet files:

import pandas as pd

# read
df = pd.read_parquet('myfile.parquet')

# write
df.to_parquet('my_newfile.parquet')

# inspect the first rows
df.head()
Hyperon answered 27/2, 2019 at 22:5 Comment(2)
As of 2022, this is the only way to read parquet files on macOS, as parquet-tools is deprecated and even the jar file is not working. – Tsushima
I have created qv which has builds for macOS, Windows and Linux... – Aves

Score: 25

Do I need Hadoop or HDFS to view/create/store them?

No. This can be done using a library from your favorite language. For example, with Python you can use PyArrow, FastParquet, or pandas; a minimal sketch follows.
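
A hedged sketch (assuming the pandas and fastparquet packages are installed and a local file named data.parquet exists):

import pandas as pd

# ask pandas for the fastparquet engine explicitly;
# pandas defaults to pyarrow when it is installed
df = pd.read_parquet('data.parquet', engine='fastparquet')
print(df.dtypes)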

How can I view parquet files? How can I create parquet files?

(GUI option for Windows, Linux, macOS)

You can use DBeaver to view parquet data, view metadata and statistics, run SQL queries on one or multiple files, generate new parquet files, etc.

DBeaver leverages the DuckDB driver to perform operations on parquet files.

Simply create an in-memory instance of DuckDB using DBeaver and run queries like those mentioned in this document.

Here is a YouTube video that explains this: https://youtu.be/j9_YmAKSHoA

Alternative: DuckDB CLI tool usage
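
For example, a hedged sketch (assuming the duckdb CLI is installed and a local file named data.parquet exists; DuckDB can query parquet files directly by path):

-- inside the duckdb shell: show the first rows
SELECT * FROM 'data.parquet' LIMIT 10;

-- show the inferred schema
DESCRIBE SELECT * FROM 'data.parquet';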

Cozenage answered 10/10, 2022 at 4:53 Comment(3)
The DBeaver solution worked immediately. It took me longer to watch the part of the video I needed. Steps for those who don't want to watch: Connect to DuckDB > Set Path to :memory: > select * from "d:\folder\file_00.parquet" – Schatz
Weird that the DBeaver + DuckDB solution doesn't work for me on mac – Bier
Definitely the easiest and quickest way of viewing parquet data if you already have DBeaver installed – Proximo

Score: 7

In addition to @Sal's extensive answer, there is one further question I encountered in this context:

How can I access the data in a parquet file with SQL?

As we are still in the Windows context here, I don't know of many ways to do that. The best results were achieved by using Spark as the SQL engine, with Python as the interface to Spark. However, I assume that the Zeppelin environment works as well, but I haven't tried that out myself yet.

There is a very well done guide by Michael Garlanyk on installing the Spark/Python combination.

Once set up, I'm able to interact with parquets through:

from os import walk
from pyspark.sql import SparkSession

# a SparkSession is the entry point for both reading parquet and running SQL
spark = SparkSession.builder.getOrCreate()

parquetdir = r'C:\PATH\TO\YOUR\PARQUET\FILES'

# Getting all parquet files in a dir.
# There might be easier ways to access single parquets, but I had nested dirs
dirpath, dirnames, filenames = next(walk(parquetdir), (None, [], []))

# for each parquet file, i.e. table in our database, spark creates a tempview with
# the respective table name equal to the parquet filename (minus the .parquet suffix)
print('New tables available: \n')

for parquet in filenames:
    print(parquet[:-8])
    spark.read.parquet(parquetdir + '\\' + parquet).createOrReplaceTempView(parquet[:-8])

Once your parquets are loaded this way, you can interact with them through the PySpark API, e.g. via:

my_test_query = spark.sql("""
select
  field1,
  field2
from parquetfilename1
where
  field1 = 'something'
""")

my_test_query.show()
Numerology answered 13/2, 2019 at 21:26 Comment(0)

Score: 6

Here's a quick "hack" to view single-table parquet files using Python on Windows (I use Anaconda Python):

  • Install pyarrow package https://pypi.org/project/pyarrow/

  • Install pandasgui package https://pypi.org/project/pandasgui/

  • Create this simple script parquet_viewer.py:

    import pandas as pd
    from pandasgui import show
    import sys
    import os
    
    dfs = {}
    # load each file given on the command line into its own DataFrame
    for fn in sys.argv[1:]:
        dfs[os.path.basename(fn)] = pd.read_parquet(fn)
    show(**dfs)
    
  • Associate the .parquet file extension by running these commands as administrator (of course you need to adapt the paths to your Python installation):

    assoc .parquet=parquetfile
    ftype parquetfile="c:\Python3\python.exe" "<path to>\parquet_viewer.py" "%1"
    

This will allow you to open parquet files compressed with compression formats (e.g. Zstd) not supported by the .NET viewer in @Sal's answer.

Nikkinikkie answered 21/6, 2021 at 12:21 Comment(1)
Use where python to find the path to python. Run a DOS Admin prompt (not Powershell). If there is a pre-existing file association, right click on any .parquet file, select Open With ... Choose Another App and select parquetfile. It's very slow with 100MB+ files. – Eye

Score: 5

Maybe this is too late for the thread, but here is a complement for anyone who wants to view Parquet files with a desktop application running on macOS or Linux.
There is a desktop application to view Parquet and also other binary-format data like ORC and AVRO. It's a pure Java application, so it can run on Linux, Mac and also Windows. Please check Bigdata File Viewer for details.

It supports complex data types like array, map, etc.

Lasandralasater answered 9/2, 2020 at 17:37 Comment(3)
I cannot read big files (parquet about 116MB): it hangs and the file is not shown... – Rentroll
@DavideScicolone Thanks for your feedback. Could you please submit an issue at the git repo and tell us where we can download a file that you can't open? – Lasandralasater
I created an issue on GitHub because I cannot read my parquet files: "INT96 is not implemented". They are files generated from pyspark 2.4.3 – Insincerity

Score: 3

On a Mac, if we want to view the content, we can install parquet-tools:

  • brew install parquet-tools
  • parquet-tools head filename

We can always read the parquet file into a dataframe in Spark and see the content, as sketched below.
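
For example, a minimal sketch (assuming pyspark is installed and a local file named example.parquet exists):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# load the parquet file into a DataFrame and print the first rows
df = spark.read.parquet('example.parquet')
df.show(5)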

Parquet is a columnar format, more suitable for analytical environments: write once, read many. Parquet files are better suited to read-intensive applications.

Cushion answered 14/12, 2020 at 2:5 Comment(2)
Thanks for the info. It's indeed worth mentioning that Parquet files are immutable, so to make any changes to the file contents a whole new file needs to be created. Write once, read many therefore makes the most sense, although it is possible to optimize writes by partitioning the data into separate parquet files based on a certain key. – Overplay
Note that as of 2024, this is instead brew install parquet-cli && parquet head file.parquet – Paisley

Score: 1

This link allows you to view small parquet files: http://parquet-viewer-online.com/

It was originally submitted by Rodrigo Lozano. This site is based on the github project here: https://github.com/elastacloud/parquet-dotnet

Dream answered 1/10, 2021 at 20:57 Comment(0)

Score: 1

You can view Parquet files on Windows / macOS / Linux by having DBeaver connect to an Apache Drill instance through the latter's JDBC interface:

  1. Download Apache Drill

    Choose the links for "non-Hadoop environments".

    Click either on "Find an Apache Mirror" or "Direct File Download", not on "Client Drivers (ODBC/JDBC)"

  2. Extract the file

    tar -xzvf apache-drill-1.20.2.tar.gz

  3. cd into the extracted folder and run Apache Drill in embedded mode:

    cd apache-drill-1.20.2/
    bin/drill-embedded
    

    You should end up at a prompt saying apache drill> with no errors.

    Make sure the server is running by connecting from your web browser to the web interface of Apache Drill at http://localhost:8047/.

  4. Download DBeaver

  5. From DBeaver, click on File -> New, under the "DBeaver" folder, select "Database Connection", then click "Next"

  6. Select "Apache Drill" and click "Next"

  7. In the connection settings, under the "Main" tab:

    In "Connect by:" select "URL"

    In "JDBC URL:", write "jdbc:drill:drillbit=localhost"

    In username, write "admin"

    In password, write "admin"

    Click "OK"

  8. Now to view your parquet database, click on "SQL Editor" -> "Open SQL script", write:

    SELECT * 
    FROM `dfs`.`[PATH_TO_PARQUET_FILE]` 
    LIMIT 10;
    

    and click the play button.

Done!

Pede answered 15/10, 2022 at 11:32 Comment(0)

Score: 0

There is a plugin for Excel that allows you to connect to parquet files, but it is behind a paywall:

https://www.cdata.com/drivers/parquet/excel/

Egerton answered 21/7, 2022 at 19:48 Comment(0)

Score: -2

You can view it with a WebAssembly app, completely in the browser: https://parquetdbg.aloneguid.uk/

Haplo answered 26/1, 2023 at 20:10 Comment(0)
