In the second part of this two-part episode of Data+AI Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Kapil Surlaker, VP of Engineering and Head of Data at LinkedIn. In part one, they covered LinkedIn’s challenges related to metadata management and data access APIs. This second part dives deep into data quality.
Kapil has 20+ years of experience in data and infrastructure, at large companies such as Oracle and at multiple startups. At LinkedIn, Kapil has been leading the next generation of big data infrastructure, platforms, tools, and applications to empower data scientists, AI engineers, and app developers to extract value from data. Kapil’s team has been at the forefront of innovation, driving multiple open source initiatives such as Pinot, Gobblin, and DataHub. Here are some key talking points from the second part of their insightful chat.
What Does “Data Quality” Mean?
- Data quality spans many dimensions and can be really broad in its coverage. It answers the questions: Is your data accurate? Do you have data integrity, validity, security, timeliness, and completeness? Is the data consistent? Is it interpretable?
- LinkedIn measures tons of different metrics, aiming to understand the health of the business, products, and systems.
- When KPIs differ from the norm, it becomes a question of data quality.
- The root cause of poor data quality can differ for each dimension of quality.
- For example, when there is metric inconsistency, you must ask yourself if you have an accurate source of truth for your metrics.
- Timeliness and completeness problems often happen as a result of infrastructure issues. In complex ecosystems you have a lot of moving parts, so a lot can go wrong that impacts timeliness and completeness.
- Data quality problems often don’t manifest themselves as data quality problems in obvious ways. It takes time to monitor for these issues, detect them, analyze their root cause, and remediate them.
- When assessing and improving data quality, it often helps to categorize it in three buckets. There is (1) the monitoring and observability aspect, (2) anomaly detection and root cause analysis for anomalies, and (3) preventing data quality issues in the first place. The last bucket is the best-case scenario.
- In any complex ecosystem, when something goes wrong in a pipeline, or in a single stage of a dataset or data flow, it can have real consequences for the entire downstream chain. So your goal is to detect problems as close to the source as possible.
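To make the timeliness and completeness points above concrete, here is a minimal sketch of a check that could run at the stage where a dataset partition lands, so problems surface before downstream consumers read it. It is not LinkedIn's implementation: `PartitionStats`, `check_partition`, the 95% completeness threshold, and the example partition name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PartitionStats:
    """Hypothetical summary of one dataset partition as it lands."""
    name: str
    row_count: int
    landed_at: datetime


def check_partition(stats, expected_rows, due_by, min_ratio=0.95):
    """Return a list of problems; an empty list means the partition looks healthy."""
    problems = []
    # Completeness: compare the landed row count to an expected baseline.
    if stats.row_count < expected_rows * min_ratio:
        problems.append(
            f"{stats.name}: incomplete ({stats.row_count} of ~{expected_rows} rows)"
        )
    # Timeliness: the partition should land no later than its deadline.
    if stats.landed_at > due_by:
        problems.append(f"{stats.name}: late by {stats.landed_at - due_by}")
    return problems


# Example: a partition that is both short on rows and 30 minutes late.
due = datetime(2021, 1, 1, 6, 0, tzinfo=timezone.utc)
late_partition = PartitionStats(
    "page_views/2021-01-01", 900, due + timedelta(minutes=30)
)
for problem in check_partition(late_partition, expected_rows=1000, due_by=due):
    print(problem)
```

Running a check like this at every stage, rather than only on the final output, is one way to catch an issue near its source instead of after it has spread downstream.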
How Does LinkedIn Maintain Data Quality?
Unified Metrics Platform
- It is important to track the evolving schema of your datasets, along with signals for when something goes awry, as markers of data quality.
- At LinkedIn they built a platform for metric definition and the entire life cycle management of those metrics, called the Unified Metrics Platform. The platform processes all of LinkedIn’s critical metrics – to the point that if it’s not produced by the platform, it wouldn’t be considered a trusted metric. The Unified Metrics Platform defines their source of truth.
- The company turned to machine learning techniques to improve the detection of anomalies and alerting based on user feedback.
- The overall metric you’re monitoring may show no significant deviation, yet the subspaces within that metric can deviate significantly. To solve this problem, LinkedIn leveraged algorithms to automatically build structures based on the nature of the data itself and build a multi-dimensional data cube.
- When you’re unable to pinpoint the root cause of a deviation, it becomes a matter of identifying the space of the key drivers which might impact the particular metric. You narrow that space, present it to users for their feedback, and then continuously refine the system.
- To detect issues based on the known properties of the data, LinkedIn built a system called Data Sentinel. This system takes the developer’s knowledge about the dataset, specified as declarative rules, and then takes on the responsibility of generating the code to perform data validations.
- LinkedIn is considering making Data Sentinel open source in the future.
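The Data Sentinel idea, where a developer states what should hold for a dataset and the system derives the checks, can be illustrated with a small sketch. This is not Data Sentinel's actual specification format or API, which the episode does not describe. The rule format, the check names, and the `build_validator` and `validate` helpers are all hypothetical, and the sketch builds validator functions at runtime rather than generating code.

```python
# Hypothetical declarative specification: the developer states what should
# hold for the dataset, not how to check it.
RULES = [
    {"column": "member_id", "check": "not_null"},
    {"column": "member_id", "check": "unique"},
    {"column": "session_seconds", "check": "range", "min": 0, "max": 86400},
]


def build_validator(rule):
    """Turn one declarative rule into a function(rows) -> list of violations."""
    col, check = rule["column"], rule["check"]

    if check == "not_null":
        def not_null(rows):
            return [f"{col} is null in row {i}"
                    for i, row in enumerate(rows) if row.get(col) is None]
        return not_null

    if check == "unique":
        def unique(rows):
            seen, violations = set(), []
            for i, row in enumerate(rows):
                value = row.get(col)
                if value is None:
                    continue  # nulls are the not_null rule's concern
                if value in seen:
                    violations.append(f"{col} duplicates value {value!r} in row {i}")
                seen.add(value)
            return violations
        return unique

    if check == "range":
        lo, hi = rule["min"], rule["max"]
        def in_range(rows):
            return [f"{col}={row[col]} outside [{lo}, {hi}] in row {i}"
                    for i, row in enumerate(rows)
                    if row.get(col) is not None and not lo <= row[col] <= hi]
        return in_range

    raise ValueError(f"unknown check: {check}")


def validate(rows, rules):
    """Apply every rule to the rows and collect all violations."""
    violations = []
    for rule in rules:
        violations.extend(build_validator(rule)(rows))
    return violations


rows = [
    {"member_id": 1, "session_seconds": 120},
    {"member_id": 1, "session_seconds": -5},
    {"member_id": None, "session_seconds": 30},
]
for violation in validate(rows, RULES):
    print(violation)
```

Keeping the rules as plain data means they can be reviewed and versioned alongside the dataset's schema, while the checking logic lives in one place.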
Building a Data Quality Team
- At LinkedIn, team members take the time to treat event schemas as code: schemas have to be reviewed and checked. The same goes for metrics. This requires collaboration between different teams, who are constantly talking to each other and coming together to improve not just tools, but also processes.
- What is accepted as state-of-the-art today is almost guaranteed not to be state-of-the-art tomorrow. So when hiring for a data quality or data management team, it is important to look for people who are naturally curious and have a growth mindset.
If you’re interested in any of the topics discussed here, Sandeep and Kapil talked about even more in the full podcast. Be sure to check out Part 1 of this blog post, the full episode and the rest of the Data+AI Battlescars podcast series!