Parquet File Format

Category: Interoperability
Platform: Databricks, Azure Synapse Analytics, Generic Data Lake

Context

Data products are stored as files on Azure Data Lake Storage Gen2 (Data Product Storage).

To ensure interoperability and consistent usage patterns, we want to agree on a common file format.

We assume that data products will frequently be combined across domains.

We use Apache Parquet for data products.

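- Low storage and IO costs
- Fast querying and processing
- Software engineers need to learn the Parquet file format.
- Append only: Parquet files cannot be modified in place, so changes are written as new files.
- Binary: efficient storage, IO optimized
- Column-oriented: queries read only the columns they need, which keeps scans, aggregations, and JOINs across data products efficient.
- (Limited) metadata included: the schema and basic statistics are stored in the file itself.
- Machine-readable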

Follow-Up Questions

- Partitioning
- How to document the schema?
- Timestamp format
- How to model mutable data (like master and reference data)?

All major data platforms come with Parquet support out of the box.
Automated testing: Periodically query all data products and try to deserialize the latest file of each.
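
The periodic check could start with a cheap structural pre-check like the sketch below. It uses only the Python standard library and relies on the Parquet file layout (a `PAR1` magic number at the start and end of the file, with a 4-byte little-endian footer length before the trailing magic). The directory layout (one folder per data product under a locally accessible root) and all function names are assumptions made for this illustration. A full test would additionally deserialize the file with a Parquet reader.

```python
import struct
from pathlib import Path
from typing import Optional

PARQUET_MAGIC = b"PAR1"


def latest_parquet_file(product_dir: Path) -> Optional[Path]:
    """Return the most recently modified .parquet file below product_dir."""
    files = list(product_dir.rglob("*.parquet"))
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def has_parquet_framing(path: Path) -> bool:
    """Check the magic bytes and that the footer length fits into the file."""
    size = path.stat().st_size
    if size < 12:  # 4 (magic) + 4 (footer length) + 4 (magic)
        return False
    with path.open("rb") as f:
        head = f.read(4)
        f.seek(-8, 2)  # last 8 bytes: footer length followed by magic
        tail = f.read(8)
    footer_len = struct.unpack("<I", tail[:4])[0]
    return (
        head == PARQUET_MAGIC
        and tail[4:] == PARQUET_MAGIC
        and 0 < footer_len <= size - 12
    )


def check_data_products(root: Path) -> dict:
    """Map each data product folder to True/False, or None if it has no file."""
    results = {}
    for product_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        latest = latest_parquet_file(product_dir)
        results[product_dir.name] = (
            None if latest is None else has_parquet_framing(latest)
        )
    return results
```

Run on a schedule, `check_data_products` would flag data products whose latest file is truncated or not a Parquet file at all. It does not validate the schema or the column data.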
