Get schema of parquet file in Python

Is there any python library that can be used to just get the schema of a parquet file?

Currently we load the parquet file into a dataframe in Spark and get the schema from the dataframe to display in some UI of the application. But initializing the Spark context, loading the dataframe, and getting the schema from it is a time-consuming activity, so I am looking for an alternative way to get just the schema.
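
For context, here is a minimal sketch of the Spark approach described above, which is what we want to avoid because it spins up a Spark context just to read the schema (it assumes an existing SparkSession named spark and a placeholder HDFS path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # slow: starts the Spark context / JVM
df = spark.read.parquet("hdfs:///path/to/file.parquet")  # placeholder path
print(df.schema)  # the StructType displayed in the UI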

Searby asked 10/1, 2017 at 10:54 Comment(3)
Is the file in HDFS or not? – Tile
Spark does not need to load the whole dataset to get the schema. Getting the schema from a parquet file should be instant. – Readable
@Thiago Baldim - Yes, it is in HDFS only. – Searby

This function returns the schema of a local URI representing a parquet file. The schema is returned as a usable Pandas dataframe. The function does not read the whole file, just the schema.

import pandas as pd
import pyarrow.parquet


def read_parquet_schema_df(uri: str) -> pd.DataFrame:
    """Return a Pandas dataframe corresponding to the schema of a local URI of a parquet file.

    The returned dataframe has the columns: column, pa_dtype
    """
    # Ref: https://stackoverflow.com/a/64288036/
    schema = pyarrow.parquet.read_schema(uri, memory_map=True)
    schema = pd.DataFrame(({"column": name, "pa_dtype": str(pa_dtype)} for name, pa_dtype in zip(schema.names, schema.types)))
    schema = schema.reindex(columns=["column", "pa_dtype"], fill_value=pd.NA)  # Ensures columns in case the parquet file has an empty dataframe.
    return schema

It was tested with the following versions of the third-party packages used:

$ pip list | egrep 'pandas|pyarrow'
pandas             1.1.3
pyarrow            1.0.1
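
A quick usage sketch (the file name here is just a placeholder):

schema_df = read_parquet_schema_df("example.parquet")
print(schema_df)  # one row per column, with the Arrow dtype as a string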
Oboe answered 9/10, 2020 at 22:23 Comment(2)
This returns a dataframe of the schema, not a struct type. You can't pass the schema to readStream: "Argument schema should be a str or struct type, got DataFrame." – Gefen
When streaming you call readStream, but it won't accept what read_schema puts out, which is not a StructType. I added an answer below that creates a StructType that can be passed to pyspark's readStream. – Gefen

This is supported by using pyarrow (https://github.com/apache/arrow/).

from pyarrow.parquet import ParquetFile
# Source is either the filename or an Arrow file handle (which could be on HDFS)
ParquetFile(source).metadata
ParquetFile(source).schema

Note: We merged the code for this only yesterday, so you need to build it from source, see https://github.com/apache/arrow/commit/f44b6a3b91a15461804dd7877840a557caa52e4e
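
If you need the schema as plain name/type pairs rather than the printed text, one option is to go through the Arrow schema; a small sketch, assuming a pyarrow version where ParquetFile exposes schema_arrow (older versions can use ParquetFile(source).schema.to_arrow_schema() instead):

from pyarrow.parquet import ParquetFile

arrow_schema = ParquetFile(source).schema_arrow  # a pyarrow.Schema
columns = {name: str(dtype) for name, dtype in zip(arrow_schema.names, arrow_schema.types)}
print(columns)  # e.g. {"some_column": "int64", ...}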

Teferi answered 10/1, 2017 at 12:18 Comment(4)
Thank you. Looks like the build travis-ci.org/apache/arrow/jobs/190525227 status is green. Can you let me know where to get the build from? Otherwise, can you point me to the documentation for how to build arrow? – Searby
This works, but can't the response be returned as a dict or array instead of plain text? – Distraction
It doesn't look like a schema - I thought it would be something like "column": "type" values. – Steam
@ElesinOlalekanFuad, hopefully we'll see this soon. – Ibert

In addition to the answer by @mehdio: in case your parquet is a directory (e.g. parquet generated by Spark), you can read the schema / column names like this:

import pyarrow.parquet as pq
pfile = pq.read_table("file.parquet")
print("Column names: {}".format(pfile.column_names))
print("Schema: {}".format(pfile.schema))
Horsemint answered 3/7, 2020 at 10:45 Comment(1)
What if the file is too big to be read into memory? – Xanthine

There's now an easier way, using the read_schema method. Note that for Spark-written files the Spark schema is stored in the returned schema's metadata as a JSON bytes literal, so you need an extra step to convert it into a proper Python dict.

from pyarrow.parquet import read_schema
import json

schema = read_schema(source)
schema_dict = json.loads(schema.metadata[b'org.apache.spark.sql.parquet.row.metadata'])['fields']
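
As the comments below point out, the Spark metadata key only exists for Spark-written files, so a more defensive sketch (the fallback structure here is just an illustration) could be:

meta = schema.metadata or {}
spark_meta = meta.get(b'org.apache.spark.sql.parquet.row.metadata')
if spark_meta is not None:
    fields = json.loads(spark_meta)['fields']
else:
    # No Spark metadata: fall back to the plain Arrow schema
    fields = [{"name": n, "type": str(t)} for n, t in zip(schema.names, schema.types)]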
Conjuration answered 3/11, 2019 at 16:39 Comment(6)
Is this possible with AWS S3 as the source? I have not been able to get it to work except by reading it in with ParquetDataset and then accessing the schema attribute. – Blague
Yes it does; I'm actually using it with S3. Use a buffer via io, see the example at https://mcmap.net/q/242097/-how-to-read-a-list-of-parquet-files-from-s3-as-a-pandas-dataframe-using-pyarrow – Conjuration
This requires reading the data into memory though, correct? I'm trying to simply get the schemas from parquet objects in S3 and compare them. – Blague
Yes, correct. You then have 2 options: 1) pick one parquet file that shouldn't be big (a few MB) and that's okay, or 2) expose your data as an Athena table and use boto3 to get the schema. – Conjuration
Does not work for parquet generated by Spark: the error message is ArrowIOError: Cannot open for reading: path 'file.parquet' is a directory. – Horsemint
Doesn't work in all cases: "'NoneType' object is not subscriptable". – Gefen

The simplest and lightest way I could find to retrieve a schema is using the fastparquet library:

from fastparquet import ParquetFile

pf = ParquetFile('file.parquet')
print(pf.schema)
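
For just the column names and their dtypes, fastparquet also exposes the following attributes (a sketch; attribute names as I recall them from recent fastparquet versions):

print(pf.columns)  # list of column names
print(pf.dtypes)   # mapping of column name to dtype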
Chalmer answered 1/4, 2022 at 18:30 Comment(1)
A code-only answer is not high quality. While this code may be useful, you can improve it by saying why it works, how it works, when it should be used, and what its limitations are. Please edit your answer to include an explanation and a link to relevant documentation. – Socialist

As other commenters have mentioned, PyArrow is the easiest way to grab the schema of a Parquet file with Python. My answer goes into more detail about the schema returned by PyArrow and the metadata stored in Parquet files.

import pyarrow.parquet as pq

table = pq.read_table(path)
table.schema # returns the schema

Here's how to create a PyArrow schema (this is the object that's returned by table.schema):

import pyarrow as pa

pa.schema([
    pa.field("id", pa.int64(), True),
    pa.field("last_name", pa.string(), True),
    pa.field("position", pa.string(), True)])

Each PyArrow Field has name, type, nullable, and metadata properties. PyArrow can also write custom file / column metadata to Parquet files.

The type property is for PyArrow DataType objects. pa.int64() and pa.string() are examples of PyArrow DataTypes.

Make sure you understand column-level metadata such as min / max statistics. That will help you understand cool features like the predicate pushdown filtering that Parquet files enable in big data systems.
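
For example, those per-column min / max statistics can be read from the Parquet footer without loading any row data (a sketch; path is a placeholder, as above):

import pyarrow.parquet as pq

col_meta = pq.ParquetFile(path).metadata.row_group(0).column(0)  # first column of the first row group
print(col_meta.statistics)  # min / max / null_count, or None if the writer skipped statistics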

Judicative answered 21/9, 2020 at 1:53 Comment(2)
@Oboe - predicate pushdown is available in Spark via the query optimizer, not from lazy evaluation. – Judicative
Why is predicate pushdown not automatic in PyArrow like it is in SQL? – Oboe

Polars provides a dedicated method for parsing the schema of a parquet file without loading the actual data:

import polars as pl
schema = pl.read_parquet_schema("file.parquet")
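
As far as I know, the result is a plain dict mapping column names to Polars dtypes, so it can be inspected directly:

for name, dtype in schema.items():
    print(name, dtype)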
Outright answered 28/5, 2023 at 1:40 Comment(0)

What pyarrow.parquet.read_schema returns is pyarrow-specific and can't be passed to pyspark's readStream. Here is a solution that creates a pyspark StructType:

import pyarrow as pa
import pyarrow.parquet
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DecimalType

pa_schema = pyarrow.parquet.read_schema(file_path)

# Map concrete PyArrow types to PySpark types; extend this as needed for your data
type_mapping = {
    pa.string(): StringType(),
    pa.int32(): IntegerType(),
    pa.float64(): DoubleType(),
}

# Convert the PyArrow schema to a PySpark StructType
spark_fields = []
for field in pa_schema:
    if pa.types.is_decimal(field.type):
        # Decimal types are parameterized, so carry over precision and scale explicitly
        field_type = DecimalType(field.type.precision, field.type.scale)
    else:
        # Fall back to StringType for any type not in the mapping
        field_type = type_mapping.get(field.type, StringType())
    spark_fields.append(StructField(field.name, field_type, nullable=True))

spark_schema = StructType(spark_fields)
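
The resulting spark_schema can then be handed to the streaming reader, for example (a sketch assuming an existing SparkSession named spark and a placeholder input directory):

stream_df = spark.readStream.schema(spark_schema).parquet("/path/to/input/")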
Gefen answered 6/3 at 18:29 Comment(0)
