Simplifying Data Management at LinkedIn – Metadata Management and APIs

In the first of this two-part episode of Data+AI Battlescars, Sandeep Uttamchandani, Unravel Data’s CDO, speaks with Kapil Surlaker, VP of Engineering and Head of Data at LinkedIn. In this first part, they cover LinkedIn’s challenges related to Metadata Management and Data Access APIs. Part 2 will dive deep into data quality.

Kapil has 20+ years of experience in data and infrastructure, both at large companies such as Oracle and at multiple startups. At LinkedIn, he has been leading the next generation of big data infrastructure, platforms, tools, and applications that empower data scientists, AI engineers, and app developers to extract value from data. His team has been at the forefront of innovation, driving multiple open source initiatives such as Pinot, Gobblin, and DataHub. Here are some key talking points from the first part of their insightful chat.

Metadata Management

The Problem: So Much Data

  • LinkedIn manages over a billion data points and over 50 billion relationships.
  • As the number of datasets skyrocketed at LinkedIn, the company found that internal users were spending an inordinate amount of time searching for and trying to understand hundreds of thousands, if not millions, of datasets.
  • LinkedIn could no longer rely on manually generated information about the datasets. The company needed a central metadata repository and a metadata management strategy.

Solution 1: WhereHows

  • The company initiated an in-depth analysis of its Hadoop data lake and asked questions such as: What are the datasets? Where do they come from? Who produced the dataset? Who owns it? What are the associated SLAs? What other datasets does a particular dataset depend on?
  • Similar questions were asked about the jobs: What are the inputs and outputs? Who owns the jobs?
  • The first step was the development of WhereHows, an open source central metadata repository that captures metadata across these diverse datasets and provides a search engine on top (a sketch of the kind of record such a repository might hold follows below).
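
To make the idea concrete, here is a minimal sketch of the kind of record a central metadata repository might keep per dataset, covering the ownership, lineage, and SLA questions above. The schema and field names are illustrative assumptions, not WhereHows' actual model.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DatasetMetadata:
    """Hypothetical record a central metadata repository might keep per dataset."""
    name: str                     # e.g. "tracking.page_views"
    platform: str                 # e.g. "hdfs", "kafka"
    owners: List[str]             # teams accountable for the dataset
    upstream_datasets: List[str]  # lineage: datasets this one is derived from
    producing_job: Optional[str]  # job that writes this dataset
    sla_hours: Optional[int]      # expected freshness, if an SLA is defined
    description: str = ""

# A search index over these records is what lets users find and understand
# datasets without relying on manually maintained documentation.
catalog = [
    DatasetMetadata(
        name="tracking.page_views",
        platform="hdfs",
        owners=["data-infra"],
        upstream_datasets=["kafka.PageViewEvent"],
        producing_job="page_views_etl",
        sla_hours=6,
        description="Daily page view events, partitioned by date.",
    )
]

def search(keyword: str) -> List[DatasetMetadata]:
    """Naive keyword match standing in for the repository's search engine."""
    kw = keyword.lower()
    return [d for d in catalog if kw in d.name.lower() or kw in d.description.lower()]

print([d.name for d in search("page")])
```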

Solution 2: Pegasus

  • LinkedIn discovered that it was not enough just to focus on the datasets and the jobs. The human element had to be accounted for as well. A broader view was necessary, accommodating both static and dynamic metadata.
  • In order to expand the capabilities of the metadata model, the company realized it needed to take a “push” approach rather than a metadata “scraping” approach.
  • The company then built a library called Pegasus to create a push-based model that improved both efficiency and latency (a rough sketch of the push idea follows below).
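
The difference between scraping and pushing is easiest to see in code. The sketch below is a hypothetical illustration of the push idea, not the actual Pegasus library: the job that produces a dataset emits a metadata change event at write time, instead of a crawler periodically polling every source system.

```python
import json
import time
from typing import Callable, Dict

def publish(event: Dict) -> None:
    """Hypothetical transport; in practice this might be a Kafka topic or an HTTP endpoint."""
    print("metadata event:", json.dumps(event))

def emit_metadata_change(dataset: str, change: Dict,
                         publish_fn: Callable[[Dict], None] = publish) -> None:
    """Called by the producer at write time, so the catalog is updated with low
    latency instead of waiting for the next scheduled scrape of every source."""
    publish_fn({
        "entity": dataset,
        "change": change,
        "emitted_at": time.time(),
    })

# Example: an ETL job registers a schema change as part of its own run.
emit_metadata_change(
    "tracking.page_views",
    {"type": "SCHEMA_UPDATED", "added_fields": ["session_id"]},
)
```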

The Final Solution: DataHub

  • Kapil’s team then found that metadata also needs to be queryable through online APIs, so that other services can integrate with it (see the sketch after this list).
  • The team went back to the drawing board and re-architected the system from the ground up based on these learnings.
  • The result was DataHub, the latest version of the company’s open source metadata management system, released last year.
  • A major benefit of this approach is that the metadata can now drive many other functions across the organization that depend on access to it.
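
As a rough illustration of what “metadata as an online service” means, the sketch below queries a metadata API over HTTP. The endpoint and response shape are assumptions for illustration, not DataHub’s actual API; the point is that other services can call the catalog programmatically at request time rather than reading a static dump.

```python
from typing import List

import requests

METADATA_API = "https://metadata.example.com/api"  # hypothetical endpoint

def get_dataset_owners(dataset_urn: str) -> List[str]:
    """Fetch ownership for a dataset so another service (for example an
    access-control or data-quality system) can act on it online."""
    resp = requests.get(f"{METADATA_API}/datasets/{dataset_urn}/owners", timeout=10)
    resp.raise_for_status()
    return resp.json().get("owners", [])

# e.g. a compliance service checks ownership before approving an export job
print(get_dataset_owners("urn:example:dataset:tracking.page_views"))
```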

Data Access APIs

  • LinkedIn recently completely rebuilt its mobile experience, user tracking, and associated data models, essentially needing to “change engines in mid-flight.”
  • The company needed a data API to meet this challenge. One did not exist, however, so they created a data access API named “Dali API” to provide an access layer to offline data.
  • The company used Hive to build the infrastructure and get to market quickly. But in hindsight, using Hive added some dependencies.
  • So LinkedIn built a new, more flexible library called Coral and made it available as open source software. This removed the Hive dependency, and the company now benefits from improvements the community makes to Coral (a conceptual sketch of such an access layer follows below).
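
The sketch below illustrates the general shape of such a data access layer rather than the actual Dali or Coral APIs: consumers ask for a logical dataset by name, and the layer resolves the physical location and format behind the scenes, so those details can change without breaking readers.

```python
from typing import Dict

# Hypothetical registry mapping logical dataset names to physical details.
# This indirection is what lets paths, formats, or engines change underneath
# without breaking downstream consumers.
_DATASET_REGISTRY: Dict[str, Dict[str, str]] = {
    "tracking.page_views": {
        "path": "hdfs://cluster/data/tracking/page_views",
        "format": "orc",
    },
}

def read_dataset(name: str) -> Dict[str, str]:
    """Resolve a logical dataset name to its current physical representation;
    callers never hard-code storage paths or file formats."""
    entry = _DATASET_REGISTRY[name]
    print(f"reading {name} as {entry['format']} from {entry['path']}")
    # A real implementation would return a DataFrame or record iterator here.
    return entry

read_dataset("tracking.page_views")
```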

If you’re interested in any of the topics discussed here, Sandeep and Kapil talked about even more in the full podcast. Keep an eye out for takeaways from the second part of this chat, and be sure to check out Part 2 of this blog post, the full episode and the rest of the Data+AI Battlescars podcast series!