The many layers of data lineage

ref: Borja Vazquez

Having a map showing how data evolves from its sources to its destination is the dream of any organisation: a map where we can navigate through our data assets without first having to navigate between loads of tools and colleagues.


Performance layer

  • What’s making this reporting model finish so late?
  • What’s the delay introduced to reporting models if this model is merged in production?
  • Overlaying performance data on a Directed Acyclic Graph (DAG) gives a much richer understanding of how a single task can affect the performance of the whole data pipeline.
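
One way to picture the performance layer is as run times attached to the nodes of the lineage graph. The sketch below is illustrative only: the model names, the run times and the assumption that independent models run fully in parallel are all hypothetical. It computes the earliest finish time of each model and traces the chain of upstream models that makes a reporting model finish late.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical lineage: model name -> set of upstream models it depends on.
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "int_orders_enriched": {"stg_orders", "stg_customers"},
    "rpt_revenue": {"int_orders_enriched"},
}

# Hypothetical performance layer: run time of each model in minutes.
runtime_min = {
    "stg_orders": 12,
    "stg_customers": 3,
    "int_orders_enriched": 20,
    "rpt_revenue": 5,
}


def finish_times(deps, runtime):
    """Earliest finish time per model, assuming unlimited parallelism."""
    finish = {}
    slowest_parent = {}
    # static_order() yields every model after all of its upstream models.
    for node in TopologicalSorter(deps).static_order():
        parents = deps.get(node, set())
        parent = max(parents, key=lambda p: finish[p], default=None)
        start = finish[parent] if parent is not None else 0
        finish[node] = start + runtime[node]
        slowest_parent[node] = parent
    return finish, slowest_parent


def critical_path(target, slowest_parent):
    """Walk back from target through the slowest upstream model at each step."""
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = slowest_parent[node]
    return list(reversed(path))


finish, slowest_parent = finish_times(deps, runtime_min)
late_chain = critical_path("rpt_revenue", slowest_parent)
```

Under these assumptions, the delay a new model would introduce can be estimated by adding it to `deps` and `runtime_min` and comparing the finish time of the reporting model before and after.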


Usage layer

  • Are there any unused models in the warehouse?
  • Is this column being used at all?
  • Having a clear understanding of who is using the data, and how, is one of the first steps to keeping a healthy, fit-for-purpose warehouse.
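
The usage questions above can be sketched as a simple aggregation over an access log. Everything here is an assumption made for illustration: the log format (user, model, columns read), the model names and the idea that such a log is available at all. In practice, the source would be whatever query history the warehouse exposes.

```python
from collections import Counter

# Hypothetical catalogue: model name -> columns it exposes.
warehouse_columns = {
    "rpt_revenue": {"order_id", "amount", "currency"},
    "rpt_churn": {"customer_id", "churned_at"},
}

# Hypothetical access log entries: (user, model, columns read).
access_log = [
    ("ana", "rpt_revenue", {"order_id", "amount"}),
    ("ben", "rpt_revenue", {"amount"}),
]


def usage_report(columns_by_model, log):
    """Return unused models, unused columns of used models, and users per model."""
    model_hits = Counter()
    column_hits = Counter()
    users = {}
    for user, model, cols in log:
        model_hits[model] += 1
        users.setdefault(model, set()).add(user)
        for col in cols:
            column_hits[(model, col)] += 1

    unused_models = sorted(m for m in columns_by_model if model_hits[m] == 0)
    unused_columns = sorted(
        (m, c)
        for m, cols in columns_by_model.items()
        for c in cols
        if model_hits[m] > 0 and column_hits[(m, c)] == 0
    )
    return unused_models, unused_columns, users


unused_models, unused_columns, users = usage_report(warehouse_columns, access_log)
```

Columns of models that are never queried are deliberately left out of `unused_columns`, since the whole model is already reported as unused.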


Data quality layer

  • What’s blocking the execution of my models?
  • Which models are lacking tests? And documentation?
  • We should think about the data quality layer as a set of layers portraying how good a model is: Are the tables and columns documented? Are the data models meeting their service-level agreements (SLAs)? Do they have a defined owner?
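
As a minimal sketch of that idea, each quality question can become a named check evaluated against a model's metadata. The `ModelMetadata` fields below are hypothetical and chosen only to mirror the questions in this section. The SLA is assumed, for simplicity, to be an hour of the day by which the model must have landed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelMetadata:
    """Hypothetical metadata record for one model in the warehouse."""
    name: str
    description: str = ""
    owner: Optional[str] = None
    test_count: int = 0
    undocumented_columns: int = 0
    landed_hour: Optional[float] = None  # hour of day the last run finished
    sla_hour: Optional[float] = None     # hour of day it is expected by


def quality_checks(m):
    """One boolean per quality sub-layer; True means the check passes."""
    return {
        "documented": bool(m.description) and m.undocumented_columns == 0,
        "tested": m.test_count > 0,
        "owned": m.owner is not None,
        # A model without an SLA cannot miss it.
        "meets_sla": m.sla_hour is None
        or (m.landed_hour is not None and m.landed_hour <= m.sla_hour),
    }


def failing_checks(m):
    """Names of the checks that the model does not pass."""
    return [name for name, ok in quality_checks(m).items() if not ok]


revenue = ModelMetadata(
    name="rpt_revenue",
    description="Daily revenue per order",
    owner="finance-data",
    test_count=0,
    landed_hour=9.5,
    sla_hour=8.0,
)
problems = failing_checks(revenue)
```

Keeping the checks separate, rather than collapsing them into a single score, lets each one be drawn as its own overlay on the lineage map.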


Ability to zoom into the right granularity

  • In an ideal lineage map, we can browse through the different granularity levels of data lineage while keeping the quality layer as we drill down.
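
One way to think about zooming is that the finest level (columns) is stored and the coarser levels are derived from it. The sketch below assumes hypothetical column identifiers of the form "model.column" and a single hypothetical quality flag per column. It rolls column-level edges up to model-level edges and carries the quality flag along, so the same overlay is available at both zoom levels.

```python
# Hypothetical column-level lineage: (upstream column, downstream column).
column_edges = [
    ("stg_orders.amount", "int_orders.amount_eur"),
    ("stg_orders.currency", "int_orders.amount_eur"),
    ("int_orders.amount_eur", "rpt_revenue.total_eur"),
]

# Hypothetical quality layer at column level: does the column have a test?
column_tested = {
    "stg_orders.amount": True,
    "stg_orders.currency": False,
    "int_orders.amount_eur": True,
    "rpt_revenue.total_eur": True,
}


def model_of(column_id):
    """Extract the model name from a 'model.column' identifier."""
    return column_id.split(".", 1)[0]


def roll_up_edges(edges):
    """Collapse column-level edges into distinct model-level edges."""
    return sorted(
        {(model_of(up), model_of(down)) for up, down in edges
         if model_of(up) != model_of(down)}
    )


def roll_up_quality(column_flags):
    """A model passes only if every one of its columns passes."""
    result = {}
    for column_id, ok in column_flags.items():
        model = model_of(column_id)
        result[model] = result.get(model, True) and ok
    return result


model_edges = roll_up_edges(column_edges)
model_tested = roll_up_quality(column_tested)
```

The "all columns must pass" rule is just one possible aggregation. A share of passing columns would work equally well if a softer signal is preferred at the model level.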