Choosing the right file format is important for building performant data applications. The concepts outlined in this post carry over to Pandas, Dask, Spark, and Presto / AWS Athena.
Column pruning
Column pruning is a big performance improvement that's possible for column-based file formats (Parquet, ORC) and not possible for row-based file formats (CSV, Avro).
Suppose you have a dataset with 100 columns and want to read two of them into a DataFrame. Here's how you can perform this with Pandas if the data is stored in a Parquet file.
import pandas as pd
pd.read_parquet('some_file.parquet', columns=['id', 'firstname'])
Parquet is a columnar file format, so Pandas can grab the columns relevant for the query and can skip the other columns. This is a massive performance improvement.
If the data is stored in a CSV file, you can read it like this:
import pandas as pd
pd.read_csv('some_file.csv', usecols=['id', 'firstname'])
usecols can't skip over entire columns because of the row-based nature of the CSV file format.
Spark doesn't require users to explicitly list the columns that'll be used in a query. Spark builds up an execution plan and will automatically leverage column pruning whenever possible. Of course, column pruning is only possible when the underlying file format is column oriented.
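Here's a minimal PySpark sketch of column pruning in action (the file path and column names are placeholders). Selecting just two columns lets Spark limit the Parquet scan to those columns, which shows up in the physical plan's ReadSchema.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark only scans the id and firstname column chunks of the Parquet file
df = spark.read.parquet('some_file.parquet').select('id', 'firstname')

# The physical plan's ReadSchema should list only the two selected columns
df.explain()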
Popularity
Spark and Pandas have built-in readers and writers for CSV, JSON, ORC, Parquet, and text files. Neither has a built-in reader for Avro.
Avro is popular within the Hadoop ecosystem. Parquet has gained significant traction outside of the Hadoop ecosystem. For example, the Delta Lake project is being built on Parquet files.
Arrow is an important project that makes it easy to work with Parquet files with a variety of different languages (C, C++, Go, Java, JavaScript, MATLAB, Python, R, Ruby, Rust), but doesn't support Avro. Parquet files are easier to work with because they are supported by so many different projects.
Schema
Parquet stores the file schema in the file metadata. CSV files don't store any file metadata, so readers either need to be supplied with the schema or need to infer it. Supplying a schema is tedious and inferring one is error-prone and expensive.
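As a quick sketch, PyArrow can read the schema straight out of a Parquet file's footer without touching any row data (the file name is a placeholder):

import pyarrow.parquet as pq

# The schema is stored in the file footer, so no data pages are read
print(pq.read_schema('some_file.parquet'))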
Avro also stores the data schema in the file itself. Having schema in the files is a huge advantage and is one of the reasons why a modern data project should not rely on JSON or CSV.
Column metadata
Parquet stores metadata statistics for each column and lets users add their own column metadata as well.
The min / max column value metadata allows for Parquet predicate pushdown filtering that's supported by the Dask & Spark cluster computing frameworks.
Here's how to fetch the column statistics with PyArrow.
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('some_file.parquet')
print(parquet_file.metadata.row_group(0).column(1).statistics)
<pyarrow._parquet.Statistics object at 0x11ac17eb0>
has_min_max: True
min: 1
max: 9
null_count: 0
distinct_count: 0
num_values: 3
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
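These min / max statistics are what make predicate pushdown possible: a reader can skip entire row groups whose value ranges can't satisfy a filter. Here's a sketch using PyArrow's filters argument (the column name and threshold are made up):

import pyarrow.parquet as pq

# Row groups whose id range falls entirely outside the filter can be skipped
table = pq.read_table('some_file.parquet', filters=[('id', '>', 5)])
print(table.num_rows)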
Complex column types
Parquet allows for complex column types like arrays, dictionaries, and nested schemas. There isn't a reliable method to store complex types in simple file formats like CSVs.
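As a sketch, PyArrow will write array and nested struct columns straight to Parquet (the data below is made up):

import pyarrow as pa
import pyarrow.parquet as pq

# tags is a list<string> column and address is a nested struct column
table = pa.table({
    'id': [1, 2],
    'tags': [['red', 'blue'], ['green']],
    'address': [{'city': 'NYC', 'zip': '10001'},
                {'city': 'SF', 'zip': '94105'}],
})
pq.write_table(table, 'complex_types.parquet')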
Compression
Columnar file formats group values of the same type together, so they're easier to compress. This CSV file is relatively hard to compress:
first_name,age
ken,30
felicia,36
mia,2
The same data is easier to compress when values of the same type sit next to each other, as they do in a columnar layout:
ken,felicia,mia
30,36,2
Parquet files are most commonly compressed with the Snappy compression algorithm. Snappy-compressed files are splittable and quick to inflate. Big data systems want to reduce file size on disk, but they also want to make it quick to inflate the files and run analytical queries.
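With Pandas, the codec is a single argument when writing (Snappy is already the default for the PyArrow engine; gzip is shown only as a comparison that trades read speed for smaller files):

import pandas as pd

df = pd.DataFrame({'first_name': ['ken', 'felicia', 'mia'], 'age': [30, 36, 2]})

# Snappy: splittable and fast to inflate, the usual default
df.to_parquet('people_snappy.parquet', compression='snappy')

# gzip: smaller files, slower to inflate
df.to_parquet('people_gzip.parquet', compression='gzip')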
Mutable nature of file
Parquet files are immutable; CSV files are mutable.
Adding a row to a CSV file is easy. You can't easily add a row to a Parquet file.
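A rough sketch of the difference, with made-up file names. Appending to a CSV is a single write in append mode, while "appending" to Parquet usually means writing an additional file alongside the existing ones and reading the directory as one dataset:

import os
import pandas as pd

new_row = pd.DataFrame({'id': [4], 'firstname': ['sasha']})

# CSV: append the row to the end of the existing file
new_row.to_csv('some_file.csv', mode='a', header=False, index=False)

# Parquet: write a new file into the dataset directory rather than editing a file in place
os.makedirs('some_dataset', exist_ok=True)
new_row.to_parquet('some_dataset/part-004.parquet')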
Data lakes
In a big data environment, you'll be working with hundreds or thousands of Parquet files. Disk partitioning, avoiding files that are too big, and compacting files that are too small are all important. The optimal disk layout depends on your query patterns.
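As one example, Pandas (via PyArrow) can write a disk-partitioned layout directly; the column and directory names here are made up, and the right partitioning column depends on how the data is queried:

import pandas as pd

df = pd.DataFrame({
    'event_date': ['2023-01-01', '2023-01-01', '2023-01-02'],
    'value': [10, 20, 30],
})

# Writes some_lake/event_date=2023-01-01/... and some_lake/event_date=2023-01-02/...
df.to_parquet('some_lake', partition_cols=['event_date'])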