Data Mesh Governance / Policies / Interoperability / Data Product Specification

Agile Lab Data Product Specification

Category: Interoperability

Context

How do we specify the syntax and semantics of data products in a standardized way?

Decision

We specify data products with Agile Lab’s Data Product Specification.

Example

id: urn:dmb:dp:my_domain:my_data_product:1
name: my data product
fullyQualifiedName: My Data Product
description: this data product is representing the xxx functional entity
kind: dataproduct
domain: my_domain
version: 1.0.0
environment: development
dataProductOwner: tom_smith_corp.com
dataProductOwnerDisplayName: Tom Smith
email: mailto:distribution_list@corp.com
ownerGroup: dataproduct1_corp.com
devGroup: dataproduct1_dev_corp.com
informationSLA: 2WD
status: DRAFT
maturity: Strategic
billing: {}
tags: []
specific: {}
components:
  - id: urn:dmb:cmp:my_domain:my_data_product:1:my_raw_s3_port
    name: my raw s3 port
    fullyQualifiedName: My Raw S3 Port
    description: s3 raw output port
    kind: outputport
    version: 1.0.1
    infrastructureTemplateId: microservice-id-1
    useCaseTemplateId: template-id-1
    dependsOn: []
    platform: CDP on AWS
    technology: s3_cdp
    outputPortType: Files
    creationDate: 05-04-2022T16:53:00.000Z
    startDate:
    processDescription: this output port is generated by a Spark Job scheduled every day at 2AM and it lasts for approx 2 hours
    dataContract:
      schema: []
      SLA:
        intervalOfChange: 1 hours
        timeliness: 1 minutes 
        upTime: 99.9%
      termsAndConditions: only usable in development environment
      endpoint: https://myurl/development/my_domain/my_data_product/1.0.0/my_raw_s3_port
    dataSharingAgreements:
      purpose: this output port want to provide a rich set of profitability KPIs related to the customer
      billing: 5$ for each full scan
      security: In order to consume this output port an additional security check with compliance must be done
      intendedUsage: the dataset is huge so it is recommended to extract maximum 1 year of data and to use these KPIs in the marketing or sales domain, but not for customer care
      limitations: is not possible to use this data without a compliance check
      lifeCycle: the maximum retention is 10 years, and eviction is happening on the first of january
      confidentiality: if you want to store this data somewhere else, PII columns must be masked    
    tags:
      - tagFQN: experimental
        source: Tag
        labelType: Manual
        state: Confirmed
      - tagFQN: structured
        source: Tag
        labelType: Manual
        state: Confirmed
    sampleData: {}
    semanticLinking: {}
    specific:
      directory: history
      bucket: ms-datamesh-s3
  - id: urn:dmb:cmp:my_domain:my_data_product:1:my_view_impala_port
    name: my view impala port
    fullyQualifiedName: My View Impala Port
    description: impala view output port
    kind: outputPort
    version: 1.1.0
    infrastructureTemplateId: microservice-id-2
    useCaseTemplateId: template-id-2
    dependsOn: [urn:dmb:cmp:my_domain:my_data_product:1:my_raw_s3_port]
    platform: CDP on AWS
    technology: impala_cdp
    outputPortType: SQL
    creationDate: 05-04-2022T17:00:00.000Z
    startDate:
    processDescription:
    dataContract:
      schema:
        - name: employeeId
          dataType: string
          description: global addressable identifier for an employee.
          constraint: PRIMARY_KEY
          tags:
            - tagFQN: GlobalAddressableIdentifier
              source: Tag
              labelType: Manual
              state: Confirmed
        - name: first_name
          dataType: string
          description: employee's first name
          constraint: NOT_NULL
          tags:
            - tagFQN: PII
              source: Tag
              labelType: Manual
              state: Confirmed
        - name: last_name
          dataType: string
          description: employee's last name
          constraint: NOT_NULL
          tags:
            - tagFQN: PII
              source: Tag
              labelType: Manual
              state: Confirmed
        - name: birthdate
          dataType: date
          description: employee's birthdate
          constraint: NOT_NULL
          tags: []
        - name: gender
          dataType: string
          description: employee's gender
          constraint: NOT_NULL
          tags: []
        - name: residential_address
          dataType: struct
          description: employee's residential address
          constraint: NOT_NULL
          tags:
            - tagFQN: PII
              source: Tag
              labelType: Manual
              state: Confirmed
        - name: first_hire_date
          dataType: date
          description: the date of his/her first hire in mybank. No matter is a temporary or permanent contract
          constraint: NOT_NULL
          tags: []
        - name: last_working_date
          dataType: date
          description: the last day the employee worked for mybank
          constraint: NULL
          tags: []
        - name: last_update
          dataType: date
          description: the last date the record has been updated
          constraint: NULL
          tags: []
        - name: businessTs
          dataType: timestamp
          description: the business timestamp, to be leveraged for time-travelling
          constraint: NOT_NULL
          tags: []
        - name: writeTs
          dataType: timestamp
          description: the technical (write) timestamp, to be leveraged for time-travelling
          constraint: NOT_NULL
          tags: []
      SLA:
        intervalOfChange: 1 hours
        timeliness: 1 minutes
        upTime: 99.9%
      termsAndConditions: only usable in development environment
      endpoint: https://myurl/development/my_domain/my_data_product/1.0.0/my_raw_s3_port
      biTempBusinessTs: businessTs
      biTempWriteTs: writeTs
    dataSharingAgreements:
      purpose: this output port want to provide a rich set of profitability KPIs related to the customer
      billing: 5$ for each full scan
      security: In order to consume this output port an additional security check with compliance must be done
      intendedUsage: the dataset is huge so it is recommended to extract maximum 1 year of data and to use these KPIs in the marketing or sales domain, but not for customer care
      limitations: is not possible to use this data without a compliance check
      lifeCycle: the maximum retention is 10 years, and eviction is happening on the first of january
      confidentiality: if you want to store this data somewhere else, PII columns must be masked    
    tags: []
    sampleData:
      columns:
        - name
        - surname
      rows:
        - - Jace
          - Beleren
        - - Gideon
          - Jura
        - - Chandra
          - Nalaar
    semanticLinking: {}
    specific:
      database: my_database
      table: my_table
      location: /my_path
      schema:
        firstName: string
        lastName: string
      format: PARQUET
  - id: urn:dmb:cmp:my_domain:my_data_product:1:my_spark_workload
    name: my spark workload
    fullyQualifiedName: My Spark workload
    description: spark batch workload
    kind: workload
    version: 1.1.1
    infrastructureTemplateId: microservice-id-3
    useCaseTemplateId: template-id-3
    platform: CDP on AWS
    technology: spark
    workloadType: batch
    connectionType: DataPipeline
    tags: []
    readsFrom: [urn:dmb:ex:mainframe_db2_database]
    specific:
      artifactory: ms-datamesh-s3
      artefact: /path/to/my/spark/workload.jar
      service: my_cdp_service
      cluster: my_cde_cluster
      className: com.mycompany.MySparkApp
      args:
       - arg1
       - arg2
      driverCores: 1
      driverMemory: 4g
      executorCores: 4
      executorMemory: 4g
      numExecutors: 3
      schedule:
        cronExpression: 0 0 0,22 ? * * *
  - id: urn:dmb:cmp:my_domain:my_data_product:1:my_observability
    name: my observability
    fullyQualifiedName: My Observability
    description: observability for my data product
    kind: observability
    infrastructureTemplateId: microservice-id-4
    useCaseTemplateId: template-id-4
    version: 1.1.1
    endpoint: http://develop/my_domain/my_data_product/1.0.0/obs
    completeness:
    dataProfiling:
    freshness:
    availability:
    dataQuality:
    specific:
      restApiName: obs_api
      stageName: data_mesh
      bucket: ms-datamesh-s3
      obsEndpoint:
        - artifact: path/to/my/obs_dq.jar
          handler: com.mycompany.MyHandler::handleRequest
          lambdaname: my_data_product_obs_dq
          awsResourceName: my_data_product_obs_dq
          awsResourcePath: /data_quality
        - artifact: path/to/my/obs_workload.jar
          handler: com.mycompany.MyHandler::handleRequest
          lambdaname: my_data_product_obs_workload
          awsResourceName: my_data_product_obs_workload
          awsResourcePath: /workload       

Source: https://github.com/agile-lab-dev/Data-Product-Specification/blob/main/example.yaml

Consequences

Automation