As discussed in the past, the advanced FinOps engineer stops focusing only on the visible waste and starts focusing on workloads that are "always on, always utilized". We talked about adding value to what is running and connecting it to the business, to understand whether it generates business or revenue.
But we never talked about how to get from point A to point B.
As a FinOps engineer, when you ask someone why they are doing what they are doing, there will always be a reason, and it will always be convincing. So how do we approach this? After all, the reason will probably not be "I don't know, I want to waste money".
In this blog post, I would like to describe how we break down a justification to understand whether it actually holds. This matters because, when we approach optimization in an organization, our immediate instinct is to run toward the waste and monitor the "FinOps best practices". But a lot of the optimization we can do comes from challenging existing workloads that are truly utilized, but not actually needed (or shouldn't be).
Solution-driven solutions
In many companies, we find a terrible habit: people try to fit a problem to a solution they have already chosen, instead of defining the problem well enough to choose the right solution. In an ideal world, we would approach it by:
Defining the challenges we're trying to solve
Describing the current state
Describing the desired state
Explaining how the desired state solves our challenges
Building a solutions matrix that, for each solution, describes its features and how it solves each challenge, as well as its TCO
Reverse-engineering a problem from an already chosen solution leads people to select a solution that is not necessarily optimal. The most efficient way I have found to pinpoint such cases is to ask people what they are using the solution for. If their answer is a description of the solution's capabilities and what it does, it's a classic case of a solution-driven solution.
What we expect to hear is which pain this solution solves for us. That is the right answer. If someone justifies what's running with a short description of the product instead of why it's running in the first place, it's a clear indication that the justification isn't strong enough and should be discussed.
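As a rough illustration of the solutions matrix described above, here is a minimal Python sketch. The challenges, solution names, and TCO figures are all hypothetical placeholders, and `rank_solutions` is a name made up for this example. The idea is simply to start from the challenges and only then compare the solutions that cover all of them by cost.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Solution:
    name: str
    monthly_tco: float        # hypothetical total cost of ownership per month
    solves: Dict[str, bool]   # challenge -> does this solution address it?


def rank_solutions(challenges: List[str], solutions: List[Solution]) -> List[Solution]:
    """Keep only the solutions that cover every challenge, cheapest first."""
    complete = [
        s for s in solutions
        if all(s.solves.get(challenge, False) for challenge in challenges)
    ]
    return sorted(complete, key=lambda s: s.monthly_tco)


# Illustrative inputs only; none of these numbers come from a real environment.
challenges = ["debug incidents from the last 3 days", "search logs by request id"]
solutions = [
    Solution("bigger disks", 900.0, {challenges[0]: True, challenges[1]: False}),
    Solution("central log store", 400.0, {challenges[0]: True, challenges[1]: True}),
    Solution("object storage + query", 250.0, {challenges[0]: True, challenges[1]: True}),
]

ranked = rank_solutions(challenges, solutions)
for s in ranked:
    print(f"{s.name}: {s.monthly_tco:.2f} per month")
```

A solution that only matches a description of its own features, and not the listed challenges, drops out of the ranking no matter how familiar it is to the team.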
Success Criteria
Once we understand that there is some justification to run a workload, we want to understand how strong that justification is and whether the answer holds water. To put a success score on the justification, we need success criteria we can measure. Granted, not everything is straightforward, but we strive to define them nevertheless. Once we measure the success criteria, we can measure the ROI of the workload and make sure, over time, that it still maintains a positive ROI.
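One simple way to track this over time is sketched below, assuming we can attach some estimate of generated value to each period. It uses the common definition ROI = (value - cost) / cost; the monthly figures and the `months_with_negative_roi` helper are hypothetical and only show the shape of the check.

```python
from typing import List, Tuple


def roi(value_generated: float, cost: float) -> float:
    """Return on investment as a ratio: (value - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (value_generated - cost) / cost


def months_with_negative_roi(history: List[Tuple[str, float, float]]) -> List[str]:
    """Return the labels of periods where the workload cost more than it returned."""
    return [label for label, value, cost in history if roi(value, cost) < 0]


# Hypothetical (label, value generated, cost) records for one workload.
history = [
    ("month 1", 1500.0, 1000.0),
    ("month 2", 1200.0, 1000.0),
    ("month 3", 800.0, 1000.0),
]

flagged = months_with_negative_roi(history)
print(flagged)
```

The hard part in practice is agreeing on the value estimate, not the arithmetic, which is why the success criteria have to be defined first.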
Value
We talked about adding value to our workloads. We have to understand if what we're doing generates value, and how we can measure it, or at least have the ability to map and reduce redundant data/infra.
OK, enough philosophy - how do we translate this into practice?
Example I - We have huge EBS volumes in our application layer that are 95% in use
Justification - We need to save logs --> describing the functionality of storage --> solution-driven solution. We insist on understanding the challenge and the pain it solves for us --> "We want to be able to debug incidents that happened in the last 3 days"
Success Criteria - Well-defined TTLs, plus a review of other methods or tools for storing logs that can keep the same retention.
Value - Understand which logs are being written, for which teams, and what information they offer. We might find a lot of data is written without any additional "debug" value.
Optimization potential - huge.
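To get a feel for the gap in Example I, a rough sizing sketch can help. It assumes hypothetical numbers (a 2,000 GB volume, 50 GB of logs written per day, a 1.5x headroom factor); none of them come from a real environment, and the function names are made up for illustration.

```python
def required_gb(daily_log_gb: float, retention_days: int, headroom: float = 1.5) -> float:
    """Storage needed to keep `retention_days` of logs, with a safety margin."""
    if daily_log_gb < 0 or retention_days < 0 or headroom < 1:
        raise ValueError("inputs must be non-negative and headroom at least 1")
    return daily_log_gb * retention_days * headroom


def reclaimable_gb(provisioned_gb: float, needed_gb: float) -> float:
    """Capacity that is provisioned but not needed for the stated retention."""
    return max(0.0, provisioned_gb - needed_gb)


# Assumed inputs: the stated need is "debug incidents from the last 3 days".
provisioned = 2000.0
needed = required_gb(daily_log_gb=50.0, retention_days=3)
spare = reclaimable_gb(provisioned, needed)

print(f"needed: {needed:.0f} GB, reclaimable: {spare:.0f} GB")
```

A volume can be 95% in use and still be mostly reclaimable once the retention requirement is written down, which is exactly the kind of workload that never shows up in a waste report.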
Example II - We need to increase the capacity of this DynamoDB table - a huge amount of throttling
Justification - We have a lot of throttling in our DynamoDB table so we want to provision additional capacity units --> 👍
Success Criteria - reduce throttling, while remaining within a good ratio between provisioned and consumed units
Value - Understand the type of data used in DynamoDB, and if this is the right solution for this use-case
Optimization potential
Throttling caused by a hot partition? We can redesign our tables and optimize the capacity units to keep the ratio --> huge optimization potential
Throttling caused by increased activity? --> we're optimized
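The two branches above can be turned into a first-pass triage, sketched here under a simplifying assumption: if the table is throttled while its overall consumed-to-provisioned ratio is still low, uneven key access (a hot partition) is a likely suspect, whereas throttling near full utilization points to real growth. The 0.7 threshold and the `classify_throttling` function are hypothetical, and the inputs are assumed to be already collected from monitoring.

```python
def classify_throttling(
    provisioned_units: float,
    consumed_units: float,
    throttled_requests: int,
    hot_threshold: float = 0.7,  # assumed cut-off, tune per workload
) -> str:
    """Rough triage of why a provisioned-capacity table is being throttled."""
    if provisioned_units <= 0:
        raise ValueError("provisioned_units must be positive")
    if throttled_requests == 0:
        return "no throttling"
    utilization = consumed_units / provisioned_units
    if utilization < hot_threshold:
        # Throttled although most of the capacity is idle: check key distribution.
        return "possible hot partition"
    # Throttled while capacity is nearly exhausted: demand really grew.
    return "increased activity"


print(classify_throttling(1000, 400, 120))
print(classify_throttling(1000, 950, 120))
```

This is only a starting point for the conversation; confirming a hot partition still requires looking at how requests are distributed across partition keys.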
Summary
Asking the right question is not enough. We have to acknowledge that the answers we get are not always well thought out. We can't take anything as justified "because we were told so". We have to connect it to measurable data and understand what we're trying to solve, how well the solution solves it, and what can be done better.
There will always be a justification for what people do. We need to look beyond it to truly achieve financial optimization.