The many layers of data lineage
ref: Borja Vazquez
A map showing how data evolves from its sources to its destination is the dream of any organisation: a map that lets us navigate our data assets without first having to navigate between loads of tools and colleagues.
Performance layer
- What’s making this reporting model finish so late?
- How much delay would reporting models see if this model were merged into production?
- Overlaying performance data on a Directed Acyclic Graph (DAG) gives a much richer understanding of how a single task can affect the performance of the whole data pipeline.
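
As a rough illustration of overlaying run times on a DAG, the sketch below computes when each model can finish at the earliest and which upstream chain is holding a reporting model back. The model names, durations, and the helper functions `finish_times` and `critical_path` are all hypothetical, and the sketch assumes models run as soon as their dependencies finish (unlimited parallelism), which a real orchestrator may not guarantee.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each model maps to the set of models it depends on.
DEPS = {
    "stg_orders": set(),
    "stg_customers": set(),
    "int_orders_enriched": {"stg_orders", "stg_customers"},
    "rpt_revenue": {"int_orders_enriched"},
}

# Hypothetical run durations in minutes, e.g. taken from orchestrator logs.
DURATION = {
    "stg_orders": 5,
    "stg_customers": 12,
    "int_orders_enriched": 8,
    "rpt_revenue": 3,
}


def finish_times(deps, duration):
    """Return (finish, blocker): the earliest finish time of each model and
    the upstream model it waited on longest (None for source models)."""
    finish, blocker = {}, {}
    # static_order() yields every model after all of its dependencies.
    for model in TopologicalSorter(deps).static_order():
        parents = deps.get(model, ())
        slowest = max(parents, key=lambda p: finish[p], default=None)
        start = finish[slowest] if slowest is not None else 0
        finish[model] = start + duration[model]
        blocker[model] = slowest
    return finish, blocker


def critical_path(model, blocker):
    """Follow the blockers back from `model` to the source that sets its finish time."""
    path = [model]
    while blocker[path[-1]] is not None:
        path.append(blocker[path[-1]])
    return path[::-1]


finish, blocker = finish_times(DEPS, DURATION)
print(finish["rpt_revenue"])                  # 23
print(critical_path("rpt_revenue", blocker))  # ['stg_customers', 'int_orders_enriched', 'rpt_revenue']
```

To estimate the delay a new model would introduce, one option is to add it to copies of `DEPS` and `DURATION`, rerun `finish_times`, and compare the reporting model's finish time before and after.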
Usage layer
- Are there any unused models in the warehouse?
- Is this column being used at all?
- Having a clear understanding of who is using the data, and how, is one of the first steps to keeping a healthy, fit-for-purpose warehouse.
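
The usage questions above can be sketched as a comparison between what exists in the warehouse and what is actually read. The catalog, the access log, and the functions `unused_assets` and `readers_by_model` are hypothetical; in practice the access records would have to be derived from the warehouse's query history, whose format varies by vendor.

```python
# Hypothetical inventory of the warehouse: model -> list of columns.
CATALOG = {
    "rpt_revenue": ["order_date", "total", "currency"],
    "rpt_churn": ["customer_id", "churned_at"],
    "stg_orders": ["order_id", "amount"],
}

# Hypothetical access records parsed from query history: (user, model, column).
ACCESS_LOG = [
    ("ana", "rpt_revenue", "order_date"),
    ("ana", "rpt_revenue", "total"),
    ("li", "rpt_revenue", "total"),
    ("li", "stg_orders", "amount"),
]


def unused_assets(catalog, access_log):
    """Return models never read, plus unread columns of models that are read."""
    used_models = {model for _, model, _ in access_log}
    used_columns = {(model, column) for _, model, column in access_log}
    unused_models = sorted(m for m in catalog if m not in used_models)
    unused_columns = sorted(
        (m, c)
        for m, columns in catalog.items()
        if m in used_models
        for c in columns
        if (m, c) not in used_columns
    )
    return unused_models, unused_columns


def readers_by_model(access_log):
    """Return model -> set of users who read it (the 'who' of the usage layer)."""
    readers = {}
    for user, model, _ in access_log:
        readers.setdefault(model, set()).add(user)
    return readers


unused_models, unused_columns = unused_assets(CATALOG, ACCESS_LOG)
print(unused_models)   # ['rpt_churn']
print(unused_columns)  # [('rpt_revenue', 'currency'), ('stg_orders', 'order_id')]
```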
Data quality layer
- What’s blocking the execution of my models?
- Which models are lacking tests? And documentation?
- We should think of the data quality layer as a set of layers portraying how good a model is: Are the tables and columns documented? Are the data models meeting their service-level agreements (SLAs)? Do they have a defined owner?
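
One way to picture these quality checks is as a set of boolean flags per model. The sketch below is a minimal illustration, not a real tool's schema: the `ModelMeta` fields, the flag names, and the sample models are all assumptions, and the SLA is simplified to "finished within an agreed number of minutes after the pipeline starts".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelMeta:
    """Hypothetical metadata record for one model."""
    name: str
    description: str = ""
    owner: Optional[str] = None
    tests: list = field(default_factory=list)
    columns: dict = field(default_factory=dict)  # column name -> description
    sla_minutes: Optional[int] = None            # agreed latest finish time
    finished_minutes: Optional[int] = None       # observed finish time


def quality_flags(m):
    """Return one boolean per quality layer for a model."""
    return {
        # Documented means the model and every one of its columns has a description.
        "documented": bool(m.description) and all(m.columns.values()),
        "tested": len(m.tests) > 0,
        "owned": m.owner is not None,
        # A model with no SLA or no observed run is treated as not meeting it.
        "meets_sla": (
            m.sla_minutes is not None
            and m.finished_minutes is not None
            and m.finished_minutes <= m.sla_minutes
        ),
    }


def failing(models):
    """Return model name -> list of quality layers it fails."""
    return {m.name: [k for k, ok in quality_flags(m).items() if not ok] for m in models}


models = [
    ModelMeta("rpt_revenue", "Daily revenue", "finance-data",
              ["not_null_total"], {"total": "Revenue in EUR"}, 60, 23),
    ModelMeta("rpt_churn", columns={"customer_id": ""}),
]
print(failing(models))
# {'rpt_revenue': [], 'rpt_churn': ['documented', 'tested', 'owned', 'meets_sla']}
```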
Ability to zoom into the right granularity
- In an ideal lineage map, we can browse through the different granularity levels of data lineage while keeping the quality layer visible as we drill down.
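
Moving between granularity levels can be sketched by storing lineage at the finest level (columns) and rolling it up to models on demand. The edges, the quality statuses, and the functions `roll_up` and `drill_down` below are hypothetical, and the sketch assumes column references follow a `model.column` naming scheme.

```python
# Hypothetical column-level lineage edges: (upstream, downstream) as "model.column".
COLUMN_EDGES = [
    ("stg_orders.amount", "int_orders_enriched.amount"),
    ("stg_customers.country", "int_orders_enriched.country"),
    ("int_orders_enriched.amount", "rpt_revenue.total"),
    ("int_orders_enriched.country", "rpt_revenue.country"),
]

# Hypothetical quality status per model, carried along at every zoom level.
QUALITY = {
    "stg_orders": "ok",
    "stg_customers": "untested",
    "int_orders_enriched": "ok",
    "rpt_revenue": "ok",
}


def model_of(column_ref):
    """Extract the model name from a 'model.column' reference."""
    return column_ref.split(".", 1)[0]


def roll_up(column_edges):
    """Zoom out: collapse column-level edges into distinct model-level edges."""
    return sorted(
        {(model_of(src), model_of(dst)) for src, dst in column_edges
         if model_of(src) != model_of(dst)}
    )


def drill_down(column_edges, upstream, downstream, quality):
    """Zoom in: column-level edges between two models, each tagged with the
    upstream model's quality status so the quality layer stays visible."""
    return [
        (src, dst, quality.get(upstream, "unknown"))
        for src, dst in column_edges
        if model_of(src) == upstream and model_of(dst) == downstream
    ]


print(roll_up(COLUMN_EDGES))
print(drill_down(COLUMN_EDGES, "stg_customers", "int_orders_enriched", QUALITY))
```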