Ensuring 𝗗𝗮𝘁𝗮 𝗤𝘂𝗮𝗹𝗶𝘁𝘆 𝗶𝗻 𝗠𝗮𝗰𝗵𝗶𝗻𝗲 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗦𝘆𝘀𝘁𝗲𝗺𝘀 is as important as ever. How do we do it?

Data Quality in Machine Learning

It is extremely important to ensure Data Quality upstream of ML Training and Inference Pipelines; trying to do it inside the pipelines themselves will inevitably fail once you work at scale.

Data Contracts can be leveraged for this purpose.

A Data Contract is an agreement between Data Producers and Data Consumers about the qualities to be met by the Data being produced.

A Data Contract should hold the following (non-exhaustive) metadata; a minimal sketch of such a contract follows the list:

  • 👉 Schema Definition.
  • 👉 Schema Version.
  • 👉 SLA metadata.
  • 👉 Semantics.
  • 👉 Lineage.
  • 👉 …
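
As a rough illustration, such a contract can live in version control as a small structured definition. The sketch below uses Python with a JSON Schema payload; the stream name, fields, and SLA values are invented for the example rather than taken from any standard contract format.

```python
# Hypothetical Data Contract for a "user_events" stream.
# All names and thresholds are illustrative, not a standard format.
USER_EVENTS_CONTRACT = {
    "name": "user_events",
    "schema_version": "1.2.0",
    "schema": {  # Schema Definition, expressed as JSON Schema
        "type": "object",
        "required": ["event_id", "user_id", "event_type", "occurred_at"],
        "properties": {
            "event_id": {"type": "string"},
            "user_id": {"type": "integer"},
            "event_type": {"type": "string", "enum": ["click", "view", "purchase"]},
            "occurred_at": {"type": "string", "format": "date-time"},
        },
    },
    "sla": {  # SLA metadata, checked downstream on a schedule
        "max_delivery_delay_minutes": 15,
        "max_null_ratio": 0.01,
    },
    "semantics": {"occurred_at": "UTC timestamp of the user action"},
    "lineage": {
        "producer": "web-activity-service",
        "consumers": ["feature-store", "data-warehouse"],
    },
}
```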

Example Architecture Enforcing Data Contracts:

  1. Schema changes are implemented in version control; once approved, they are pushed to the Applications generating the Data, the Databases holding the Data, and a central Data Contract Registry.

𝗜𝗺𝗽𝗼𝗿𝘁𝗮𝗻𝘁: Ideally you should be enforcing the Data Contract at this stage, when producing the Data. Data Validation steps further downstream are Detection and Prevention mechanisms that stop low-quality data from reaching downstream systems, but there might be a significant delay before those checks can run, by which point data may already be irreversibly corrupted or lost.
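
As a minimal sketch of produce-time enforcement, assuming the illustrative contract above and the `jsonschema` package, a producer could refuse to emit any record that violates its contract; the `producer` object is a stand-in for whichever Kafka client the application uses.

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema


def produce_event(producer, event: dict, contract: dict) -> None:
    """Validate an event against its Data Contract before it leaves the producer."""
    try:
        # "schema" holds the JSON Schema from the contract sketch above.
        validate(instance=event, schema=contract["schema"])
    except ValidationError as err:
        # Reject at the source instead of letting low-quality data travel downstream.
        raise ValueError(
            f"Event violates contract {contract['name']} "
            f"v{contract['schema_version']}: {err.message}"
        ) from err
    # Placeholder produce call, e.g. confluent_kafka.Producer.produce(topic, value=...).
    producer.produce(
        topic=f"raw.{contract['name']}",
        value=json.dumps(event).encode("utf-8"),
    )
```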

  2. Applications push generated Data to Kafka Topics:
    • Events emitted directly by the Application Services. 👉 This also includes IoT Fleets and Website Activity Tracking.
    • Raw Data Topics for CDC streams.
  3. A Flink Application(s) consumes Data from the Raw Data Topics and validates it against schemas in the Contract Registry (see the routing sketch after this list).
  4. Data that does not meet the contract is pushed to a Dead Letter Topic.
  5. Data that meets the contract is pushed to a Validated Data Topic.
  6. Data from the Validated Data Topic is pushed to Object Storage for additional Validation.
  7. On a schedule, Data in Object Storage is validated against additional SLAs in the Data Contracts and pushed to the Data Warehouse to be Transformed and Modeled for Analytical purposes (see the SLA-check sketch after this list).
  8. Modeled and Curated Data is pushed to the Feature Store System for further Feature Engineering.
    • Real-Time Features are ingested into the Feature Store directly from the Validated Data Topic (step 5). 👉 Ensuring Data Quality here is complicated since checks against SLAs are hard to perform.
  9. High Quality Data is used in Machine Learning Training Pipelines.
  10. The same Data is used for Feature Serving in Inference.
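
The validation step (3-5) boils down to: look up the schema for the stream in the Contract Registry, validate each record, and route it to either the Dead Letter Topic or the Validated Data Topic. Below is a minimal Python sketch of that routing logic rather than actual Flink code; `registry.get_schema` and the `producer` object are hypothetical placeholders for your own clients. In a real Flink job the same logic would typically sit in a processing function, with a side output feeding the Dead Letter Topic.

```python
import json

from jsonschema import Draft7Validator


def route_record(raw_value: bytes, registry, producer, contract_name: str) -> None:
    """Validate one raw record against its registered schema and route it."""
    # Placeholder lookup in the central Data Contract Registry.
    schema = registry.get_schema(contract_name)
    record = json.loads(raw_value)
    violations = [e.message for e in Draft7Validator(schema).iter_errors(record)]
    if violations:
        # Contract violated: park the record, with the reasons, in the Dead Letter Topic.
        payload = {"contract": contract_name, "violations": violations, "record": record}
        producer.produce(
            topic=f"dead-letter.{contract_name}",
            value=json.dumps(payload).encode("utf-8"),
        )
    else:
        # Contract met: forward the record unchanged to the Validated Data Topic.
        producer.produce(topic=f"validated.{contract_name}", value=raw_value)
```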
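
For the scheduled SLA validation in step 7, the idea is to compute batch-level metrics and compare them against the SLA section of the contract before loading the batch into the Data Warehouse. A sketch with pandas, reusing the invented `sla` fields from the contract sketch above:

```python
import pandas as pd


def meets_sla(batch: pd.DataFrame, contract: dict) -> bool:
    """Batch-level SLA checks, run on a schedule before loading to the Warehouse."""
    sla = contract["sla"]

    # Freshness: the newest event in the batch must not exceed the contracted delay.
    newest = pd.to_datetime(batch["occurred_at"], utc=True).max()
    delay_minutes = (pd.Timestamp.now(tz="UTC") - newest).total_seconds() / 60
    if delay_minutes > sla["max_delivery_delay_minutes"]:
        return False

    # Completeness: the overall share of missing values must stay within the contract.
    null_ratio = batch.isna().sum().sum() / batch.size
    return null_ratio <= sla["max_null_ratio"]
```
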
This post is licensed under CC BY 4.0 by the author.