As organizations run more data applications and pipelines in the cloud, they look for ways to avoid the hidden costs of cloud adoption and migration. Teams seek to maximize business results through cost visibility, forecast accuracy, and financial predictability.
Watch the breakout session video from Data Teams Summit and see how organizations apply agile and lean principles using the FinOps framework to boost efficiency, productivity, and innovation. Transcript available below.
Hi, and welcome to this session, Maximize Business Results with FinOps. I’m Clinton Ford, director of Product Marketing at Unravel, and I’m joined today by Thiago Gil, an ambassador from the FinOps Foundation and a KubeCon + CloudNativeCon 2021/2022 Kubernetes AI Day Program Committee member. Great to have you with us today, Thiago. Thank you.
Now, if you have any questions during our session, please feel free to put those in the Q&A box, or visit the Unravel booth after this session. Happy to answer your questions.
So today, we’ll talk about some of the challenges that you face with cloud adoption and how FinOps empowers you and your team to harness those investments and maximize business results, and we’ll share some success stories from companies who are applying these principles. Then Thiago is going to share the state of production machine learning.
So among the challenges that you face, the first is visibility. Simply understanding and deciphering the cloud bill can be an enormous hurdle, and forecasting spend can be really difficult to do accurately.
The second is optimization: how to optimize your costs once you get visibility. There are complex dependencies, as you know, within your data pipeline. So making adjustments to resources can have downstream effects, and you don’t want to interrupt the flow of those pipelines.
Finally, governance. So governing that cloud spending is hard. With over 200 services and over 600 instance types on AWS alone, it’s difficult to define what good looks like. The result is that on average, organizations report their public cloud spend is over budget by 13%, and they expect cloud spending to increase by 29% this year.
Observability is key here, because it unlocks several important benefits. First, visibility. Just getting full visibility to understand where spending is going and how teams are tracking towards their budgets.
Second, granularity: seeing the spending details by data team, by pipeline, by data application or product or division. And third, forecasting: seeing those trends and being able to project out accurately to help forecast future spending and profitability.
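As a rough illustration of granularity and forecasting, the sketch below groups spend by team and projects the next day's total with a simple least-squares line. The records, field layout, and function names are hypothetical, and a straight-line fit is only one assumed forecasting method, not one named in the session.

```python
from collections import defaultdict

# Hypothetical daily cost records: (day, team, pipeline, cost in USD).
RECORDS = [
    (1, "analytics", "etl_orders", 120.0),
    (1, "ml", "train_ranker", 300.0),
    (2, "analytics", "etl_orders", 130.0),
    (2, "ml", "train_ranker", 320.0),
    (3, "analytics", "etl_orders", 140.0),
    (3, "ml", "train_ranker", 340.0),
]


def spend_by(records, key_index):
    """Sum cost grouped by the field at key_index (0=day, 1=team, 2=pipeline)."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[3]
    return dict(totals)


def linear_forecast(points, target_day):
    """Fit a least-squares line through (day, total) points and extrapolate."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sum(
        (x - mean_x) ** 2 for x, _ in points
    )
    return mean_y + slope * (target_day - mean_x)


by_team = spend_by(RECORDS, 1)                      # granularity: spend per team
daily_totals = sorted(spend_by(RECORDS, 0).items())  # trend: total spend per day
forecast_day_4 = linear_forecast(daily_totals, 4)    # projection for the next day
print(by_team, forecast_day_4)
```

The same grouping function works for pipelines by passing index 2, which is the kind of drill-down described above.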
So data management represents approximately 40% of the typical cloud bill, and data management services are the fastest growing category of cloud service spending. It’s also driving a lot of the incredible revenue growth that we’ve seen, and innovation in products.
When you combine the best of DataOps and the best of FinOps, you get DataFinOps, and DataFinOps empowers data engineers and business teams to make better choices about your cloud usage. It helps you get the most from your modern data stack investments.
A FinOps approach, though, isn’t just about slashing costs. Although you’ll almost invariably wind up saving money, it’s about empowering data engineers and business teams to make better choices about their cloud usage, and derive the most value from their modern data stack investments.
Managing costs consists of three iterative phases. The first is getting visibility into where the money is going, measuring what’s happening in your cloud environment, understanding what’s going on in a workload-aware context. Once you have that observability, next you can optimize. You begin to see patterns emerge where you can eliminate waste, remove inefficiencies and actually make things better. And then you can go from reactive problem solving to proactive problem preventing, sustaining iterative improvements by automating guardrails and enabling self-service optimization.
So each phase builds upon the previous one to create a virtuous cycle of continuous improvement and empowerment for individual team members, regardless of their expertise, to make better decisions about their cloud usage while still hitting their SLAs and driving results. In essence, this shifts the budget left, pulling accountability for managing costs forward.
So now let me share a few examples of FinOps success. A global company in the healthcare industry discovered they were spending twice their target spending for Amazon EMR. They could manually reduce their infrastructure spending without using observability tools, but each time they did, they saw job failures happen as a result. They wanted to understand the reason why cost was so far above their expected range.
Using observability tools, they were able to identify the root cause for the high costs and reduce them without failures.
Using a FinOps approach, they were able to improve EMR efficiency by 50% to achieve their total cost of ownership goals. Using the FinOps framework, their data analytics environment became much easier to manage. They used the best practices from optimizing their own cloud infrastructure to help onboard other teams, and so they were able to improve the time to value across the entire company.
A global web analytics company used a FinOps approach to get the level of fidelity that they needed to reduce their infrastructure costs by 30% in just six months. They started by tagging the AWS infrastructure that powered their products, such as EC2 instances, EBS volumes, RDS instances and network traffic.
The next step was to look by product and understand where they could get the biggest wins. As they looked across roughly 50 different internal projects, they were able to save more than five million per year, and after running some initial analysis, they realized that more than 6,000 data sources were not connected to any destinations at all, or were sent to destinations with expired credentials.
They were wasting $20,000 per month on unused data infrastructure. The team provided daily reporting on their top cost drivers visualized in a dashboard, and then using this information, they boosted margins by 20% in just 90 days.
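The kind of check described in this example, finding data sources that feed no usable destination, can be sketched as follows. The class names, fields, and sample figures are hypothetical and are not taken from the company's actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Destination:
    name: str
    credentials_expired: bool = False


@dataclass
class DataSource:
    name: str
    monthly_cost: float  # hypothetical cost attributed through resource tags
    destinations: list = field(default_factory=list)


def is_wasted(source):
    """True when no destination can receive the data.

    all() over an empty list is True, so a source with no destinations
    counts as wasted, as does one whose destinations all have expired
    credentials.
    """
    return all(d.credentials_expired for d in source.destinations)


SOURCES = [
    DataSource("clicks", 50.0, [Destination("warehouse")]),
    DataSource("legacy_logs", 30.0),
    DataSource("old_events", 20.0, [Destination("s3_archive", credentials_expired=True)]),
]

wasted = [s.name for s in SOURCES if is_wasted(s)]
wasted_monthly_cost = sum(s.monthly_cost for s in SOURCES if is_wasted(s))
print(wasted, wasted_monthly_cost)
```

A report like this, refreshed daily, is one assumed way to feed the top-cost-driver dashboard the team relied on.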
All right, with that, let’s hand it over to Thiago to give us an update on the state of production machine learning. Thiago, over to you.
Thank you, Clinton. Let’s talk about the state of production ML. This includes understanding the challenges and the best practices for deploying, scaling and managing ML models in production environments, and how FinOps principles and Kubernetes can help organizations optimize and manage the costs associated with their ML workloads, and improve the efficiency, scalability and cost effectiveness of their models while aligning them with the business objectives.
ML is moving to Kubernetes because it provides a flexible and scalable platform for deploying and managing machine learning models… Kubernetes [inaudible 00:07:29] getting resources such as CPU and memory to match the demands of our workloads. Additionally, Kubernetes provides features such as automatic rollbacks, self-healing and service discovery, which are useful in managing and deploying ML models in a production environment.
The FinOps framework, which includes principles such as team collaboration, ownership of cloud usage, a centralized team for financial operations, real-time reporting, decisions driven by business value, and taking advantage of the variable cost model of the cloud, can relate to Kubernetes in several ways.
Kubernetes can also be used to allocate costs to specific teams or projects and to track and optimize the performance and cost of workloads in real time. By having a centralized team for financial operations and collaboration among teams, organizations can make better decisions driven by business value, and take advantage of the variable cost model of the cloud by only paying for the resources they use.
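One simple way to picture cost allocation on a shared cluster is to split the cluster's hourly cost in proportion to each team's CPU requests and report unrequested capacity as idle. The sketch below assumes that model; the team labels, numbers, and the `allocate_cost` function are hypothetical, and real allocation would usually also weigh memory, GPUs and actual usage.

```python
from collections import defaultdict


def allocate_cost(pods, cluster_cpus, cluster_hourly_cost):
    """Split hourly cluster cost by requested CPU; the remainder is idle cost.

    pods is a list of (team, cpu_request) pairs, where team would come from
    a namespace or label in a real cluster (an assumption of this sketch).
    """
    cost_per_cpu = cluster_hourly_cost / cluster_cpus
    allocation = defaultdict(float)
    requested = 0.0
    for team, cpu_request in pods:
        allocation[team] += cpu_request * cost_per_cpu
        requested += cpu_request
    # Capacity nobody requested is still paid for, so surface it explicitly.
    allocation["idle"] = (cluster_cpus - requested) * cost_per_cpu
    return dict(allocation)


PODS = [("team-a", 4.0), ("team-b", 2.0), ("team-a", 2.0)]
hourly = allocate_cost(PODS, cluster_cpus=16, cluster_hourly_cost=8.0)
print(hourly)
```

Reporting idle cost separately keeps the per-team numbers honest and makes over-provisioning visible to the central FinOps team.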
FinOps principles such as optimization, automation, cost allocation, and monitoring and metrics can be applied to ML workloads running on Kubernetes to improve their efficiency, scalability and cost effectiveness.
Kubernetes, by its nature, allows for cloud-agnostic workloads. It means that workloads deployed on Kubernetes can run on any cloud provider or on premises. This allows for more flexibility in terms of where ML workloads are deployed and can help to avoid vendor lock-in.
FinOps can help DataOps teams identify and eliminate unnecessary expenses, such as redundant or underutilized resources. This can include optimizing cloud infrastructure costs, negotiating better pricing for services and licenses, and identifying opportunities to recycle or repurpose existing resources.
FinOps can help DataOps teams develop financial plans that align with business goals and priorities, such as investing in new technologies or expanding data capabilities.
By setting clear financial objectives and budgets, DataOps teams can make more informed decisions about how to allocate resources and minimize costs.
FinOps can help data teams automate financial processes, such as invoice and payment tracking, to reduce the time and effort needed to manage these tasks. This can free up DataOps team members to focus on more strategic tasks, such as data modeling and analysis. FinOps helps DataOps teams track financial performance and identify areas for improvement. This can include monitoring key financial metrics, such as cost per data unit or return on investment, to identify opportunities to reduce costs and improve efficiency.
A FinOps team, also known as a Cloud Cost Center of Excellence, is a centralized team within an organization that is responsible for managing and optimizing the financial aspects of the organization’s cloud infrastructure. This team typically has a broad remit that includes monitoring and analyzing cloud usage and cost, developing and implementing policies and best practices, collaborating with teams across the organization, providing guidance and support, providing real-time reporting, and continuously monitoring and adapting to changes in cloud pricing and services. The goal of this team is to provide a centralized point of control and expertise for all cloud-related financial matters, ensuring that the organization’s cloud usage is optimized, cost-effective, and aligned with the overall business objectives.
A product mindset focuses on delivering value to the end user and the business, which helps data teams better align their efforts with the organization’s goals and priorities.
Changing the mindset from projects to products can help improve collaboration. When FinOps teams adopt a product mindset, it leads to better collaboration between the teams responsible for creating and maintaining the products. Cost transparency allows an organization to clearly understand and track the costs associated with its operations, including its cloud infrastructure, by providing visibility into cloud usage, costs and performance metrics, forecasting future costs, making data-driven decisions, allocating costs, improving collaboration, and communicating and aligning cloud usage with overall business objectives.
When moving workloads to the cloud, organizations may discover hidden costs related to Kubernetes, such as the cost of managing and scaling the cluster, the cost of running the control plane itself, and the cost of networking and storage. These hidden costs can arise from not fully understanding the pricing model of cloud providers, not properly monitoring or managing usage of cloud resources, or not properly planning for data transfer or storage costs.
Applications that require different amounts of computational power can be placed on a spectrum. Some applications, like training large AI models, require a lot of processing power to keep GPUs fully utilized during training by batch processing hundreds of data samples in parallel. However, other applications may only require a small amount of processing power, leading to underutilization of the computational power of GPUs.
When it comes to GPU resources, Kubernetes does not have native support for GPU allocation, and must rely on third-party solutions, such as a Kubernetes device plugin, to provide this functionality. These solutions add an extra layer of complexity to resource allocation and scaling, as they require additional configuration and management.
Additionally, GPUs are not as easily shareable as CPU resources, and have more complex life cycles. They have to be allocated and deallocated to specific pods, and that has to be managed by Kubernetes itself. This can lead to situations where the GPU resources are not being fully utilized, or where multiple pods are trying to access the same GPU resources at the same time, resulting in contention and performance issues.
So why do we need real-time observability? Sometimes data teams do not realize until it’s too late that GPU memory and CPU limits and requests are not treated the same way.
The Prius effect refers to the change in driving behavior observed in some drivers of the Toyota Prius hybrid car, who altered their driving style to reduce fuel consumption after receiving real-time feedback on their gasoline consumption.
Observability by design on ML workloads, which includes collecting and monitoring key metrics, logging, tracing, setting up alerts, and running automated experiments, allows teams to gain insights into the performance, behavior and impact of their ML models, make data-driven decisions to improve their performance and reliability, and align with FinOps principles such as cost optimization, forecasting, budgeting, cost allocation, and decision making based on cost-benefit analysis, all of which can help organizations optimize and manage the costs associated with their ML workloads.
With real-time visibility into the performance and resource usage of AI and ML workloads, organizations can proactively identify and address [inaudible 00:18:05] and make more informed decisions about how to optimize the cost of running these workloads in the cloud. They can understand how GPU resources are being consumed by different workloads and make informed decisions about scaling and allocating resources to optimize costs. They can even find, troubleshoot and correct GPU scheduling issues, such as GPU starvation or GPU oversubscription, that can cause workloads to consume more resources than necessary.
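To make the GPU observability point concrete, here is a minimal sketch that takes utilization samples and flags GPUs that look underutilized or that are shared by more than one workload. The sample data, the 30% threshold, and all names are assumptions for illustration; in practice the samples would come from whatever GPU metrics source the cluster exposes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (gpu_id, pod_name, utilization percent).
SAMPLES = [
    ("gpu-0", "train-a", 95),
    ("gpu-0", "train-a", 91),
    ("gpu-1", "infer-b", 10),
    ("gpu-1", "infer-b", 14),
    ("gpu-2", "infer-c", 40),
    ("gpu-2", "infer-d", 50),
]

UNDERUTILIZED_BELOW = 30  # assumed threshold, tune per workload


def summarize(samples):
    """Return (average utilization per GPU, set of pods seen per GPU)."""
    utilization = defaultdict(list)
    pods = defaultdict(set)
    for gpu_id, pod_name, percent in samples:
        utilization[gpu_id].append(percent)
        pods[gpu_id].add(pod_name)
    averages = {gpu: mean(values) for gpu, values in utilization.items()}
    return averages, dict(pods)


averages, pods_per_gpu = summarize(SAMPLES)
underutilized = sorted(g for g, avg in averages.items() if avg < UNDERUTILIZED_BELOW)
shared = sorted(g for g, names in pods_per_gpu.items() if len(names) > 1)
print(underutilized, shared)
```

Underutilized GPUs are candidates for consolidation, while shared GPUs are where contention between workloads is most likely to show up.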
Fantastic. Thank you so much, Thiago. It’s been great having you here today, and we look forward to answering all of your questions. Feel free to enter them in the chat below, or head over to the Unravel booth. We’d be happy to visit with you. Thanks.