JSON File Format

Category: Interoperability
Platform: Databricks, Azure Synapse Analytics, Generic Data Lake

Context

Data products are stored as files on Azure Data Lake Storage Gen2 (Data Product Storage).

To ensure interoperability and consistent usage patterns, we want to agree on a common file format.

We assume that data products will frequently be combined across domains.

Decision

We use JSON as the file format for data products.

Entries are separated by a newline (NDJSON, newline-delimited JSON).
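
As a minimal sketch of the newline-delimited layout, the Python snippet below writes one JSON object per line and reads the objects back. The helper names `write_ndjson` and `read_ndjson` and the sample records are illustrative assumptions, not part of any platform API.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def write_ndjson(records: Iterable[dict], out: TextIO) -> None:
    """Write each record as one compact JSON document followed by a newline."""
    for record in records:
        # json.dumps without indent escapes newlines inside strings,
        # so every record stays on a single physical line.
        out.write(json.dumps(record, separators=(",", ":")) + "\n")


def read_ndjson(src: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    for line in src:
        if line.strip():
            yield json.loads(line)


# Hypothetical sample records with a nested object and an array.
records = [
    {"order_id": 1, "customer": {"id": "c-1"}, "items": ["a", "b"]},
    {"order_id": 2, "customer": {"id": "c-2"}, "items": []},
]

buffer = io.StringIO()
write_ndjson(records, buffer)
buffer.seek(0)
assert list(read_ndjson(buffer)) == records
```

Because each line is a complete JSON document, a consumer can process a file line by line without loading it as a whole.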

Consequences

- Supports complex structures, such as arrays and objects.
- No schema has to be managed in order to write data.
- JSON is a simple format, known to all engineers.
- Widely supported by tools (such as Kafka connectors), which makes data ingestion simple.
- Expensive I/O and retrieval costs when querying data sets, because the files are not compressed.
- Full reads make JOIN operations slow and expensive compared to column-oriented file formats.

Follow-Up Questions
- Partitioning
- How to document the schema?
- Timestamp format

Automated testing: Periodically query all data products and try to deserialize the latest file of each.
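
Such a check could look roughly like the following sketch. It assumes that a data product's files are reachable as a local or mounted directory and that the most recently modified `*.json` file counts as the "latest" one. Both are assumptions, as are the helper names `latest_file` and `check_ndjson_file`.

```python
import json
from pathlib import Path
from typing import List, Optional


def latest_file(directory: Path, pattern: str = "*.json") -> Optional[Path]:
    """Return the most recently modified file matching pattern, or None."""
    files = [p for p in directory.glob(pattern) if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def check_ndjson_file(path: Path) -> List[str]:
    """Try to deserialize every line; return a list of error messages."""
    errors = []
    with path.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"{path.name}:{number}: {exc.msg}")
    return errors
```

A scheduled job could call `latest_file` for each data product directory and fail when `check_ndjson_file` returns a non-empty list.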
