At the DataOps Unleashed 2022 conference, Luis Carlos Cruz Huertas, Head of Technology Infrastructure & Automation at DBS Bank, discussed how the bank has developed a framework whereby they translate millions of telemetry data points into actionable recommendations and even self-healing capabilities.
The session, “Beyond observability and how to get there,” opens with Dr. Shivnath Babu, Unravel Co-Founder and CTO, setting the context for observability challenges in the modern data stack (how simple applications grow to become so complex) and walking through the successive stages of what’s “beyond observability.” How can we go from extracting and correlating events, logs, metrics, and traces from applications, datasets, infrastructure, and tenants to get to self-healing systems?
Then Luis shows how DBS is doing just that—how they leverage what he calls “cognitive capabilities” to deliver faster root cause analysis insights and automated corrective actions.
Why DBS went beyond observability
“When you have an always-on environment where banking applications are fully needed, you come to a point where observability [by itself] doesn’t cut it. You come to a point where your NCI [negative customer impact] truly becomes a key valuable indicator on how your systems are relating. We’re no longer in the game of measuring systems to be able to monitor, we’re measuring systems with the intent to provide a better customer experience,” says Luis.
Given the complexity of the bank’s IT ecosystem (not just its data stack), DBS made a strategic decision to not focus on tools developed by third-party vendors but rather build an “overarching brain” that could collect and understand the metrics from the diversity of tools in place without forcing the organization to rip and replace for something new. The objective was to speed root cause analysis across the board, provide less “noise,” reduce manual effort (toil), and get to proactive, predictive alerting on emerging issues.
How DBS built its cognitive capability platform
“You have applications that are collecting different telemetry through different systems or different log collectors—node exporters, metric beats, file beats, you can have an ELK stack. But ultimately what you want to do is create an open platform that you can ingest all this data,” Luis says. And for that you need three elements, he explains:
- a historical repository, where you can collect and cross-check data
- a real-time time-series database, because time becomes the de facto metadata to identify a critical incident and its correlations
- a log aggregator
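As a rough illustration of how these three elements might sit behind a single ingestion point, here is a minimal Python sketch. It is not DBS's implementation: the `TelemetryPlatform` class, its in-memory stores, and the record fields (`kind`, `name`, `ts`, `value`, `source`, `message`) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TelemetryPlatform:
    """Toy in-memory stand-ins for the three stores (all names are hypothetical)."""

    historical: list = field(default_factory=list)   # every record, kept for cross-checks
    time_series: dict = field(default_factory=dict)  # metric name -> [(timestamp, value)]
    logs: list = field(default_factory=list)         # (timestamp, source, message)

    def ingest(self, record: dict) -> None:
        """Route one telemetry record to the matching store."""
        kind = record.get("kind")
        if kind == "metric":
            points = self.time_series.setdefault(record["name"], [])
            points.append((record["ts"], record["value"]))
        elif kind == "log":
            self.logs.append((record["ts"], record["source"], record["message"]))
        else:
            raise ValueError(f"unknown telemetry kind: {kind!r}")
        # Only records that were routed successfully reach the historical repository.
        self.historical.append(record)

    def logs_near(self, ts: float, window: float) -> list:
        """Return log entries within `window` seconds of `ts`."""
        return [entry for entry in self.logs if abs(entry[0] - ts) <= window]


platform = TelemetryPlatform()
platform.ingest({"kind": "metric", "name": "cpu", "ts": 100.0, "value": 0.93})
platform.ingest({"kind": "log", "source": "payments", "ts": 102.0, "message": "timeout"})
print(platform.logs_near(100.0, window=5.0))
```

The `logs_near` lookup reflects the point about time acting as the de facto metadata: the timestamp is the key used to line up a metric spike with the log lines around it.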
Luis notes that one of the questions he gets asked constantly is, “How do you define the ingestion?”
“We do it all based on metadata. We define the metadata before it actually gets ingested. And then we park that into the [system] data lake. On top of that we provide an [ML-driven] analytical engine. Ultimately, what our system does is basically provide a recommendation to our site reliability engineers. And it gives them a list of elements, saying I’ve identified this set of errors or incidents that have happened over the last month and are repetitive and continuous. And then our site reliability engineers need to marry that with our automation engine. You build the right scripting—the right automation—to properly fix and remediate the problem. So that every time an incident is identified, it maps Incident A to Automation B to get Outcome C.”
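The "Incident A to Automation B to get Outcome C" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain frequency count stands in for the ML-driven analytical engine, and the incident signatures, the `AUTOMATIONS` table, and the remediation functions are all hypothetical.

```python
from collections import Counter


# Hypothetical remediation scripts that an SRE might write.
def restart_service(incident: dict) -> str:
    return f"restarted {incident['service']}"


def clear_temp_files(incident: dict) -> str:
    return f"cleared temp files on {incident['service']}"


# Hypothetical table mapping an incident signature to its automation.
AUTOMATIONS = {
    "service_unresponsive": restart_service,
    "disk_full": clear_temp_files,
}


def recurring_signatures(incidents: list, min_count: int = 3) -> list:
    """List signatures seen at least `min_count` times, most frequent first.

    A simple count replaces the ML engine here; the output plays the role
    of the recommendation list handed to site reliability engineers.
    """
    counts = Counter(incident["signature"] for incident in incidents)
    return [(sig, n) for sig, n in counts.most_common() if n >= min_count]


def remediate(incident: dict):
    """Run the mapped automation, or return None to leave it for a human."""
    action = AUTOMATIONS.get(incident["signature"])
    if action is None:
        return None
    return action(incident)


history = [
    {"signature": "disk_full", "service": "ledger"},
    {"signature": "disk_full", "service": "ledger"},
    {"signature": "cert_expired", "service": "gateway"},
    {"signature": "disk_full", "service": "reports"},
]
print(recurring_signatures(history))
print(remediate({"signature": "disk_full", "service": "ledger"}))
```

Returning `None` for an unmapped signature keeps a human in the loop for anything the engineers have not yet paired with a script.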
Luis adds that with Unravel, the telemetry data has already been correlated. He says, “Unravel is huge for us. I don’t need to worry about marrying the correlation. I can just consume it right away.”
Luis concludes: “So in the end, we’re not changing tools, we’re collecting the metrics from all of the tools. We are providing a higher overarching mapping of the data being collected across all the tools in our environment, mapping them through metadata, and leveraging that to provide the right ML.”
The bottom line? DBS Bank is able to go “beyond observability” and leverage machine learning to get closer to the ultimate goal of a self-healing system that doesn’t require any human intervention at all.
The full presentation is available on demand from DataOps Unleashed.