Mastering Cloud Cost Optimization: Frameworks for Success

This is the second article of the blog series, “Mastering Cloud Cost Optimization.”

In this article, we will share the frameworks employed by organizations that were successful with their digital transformation and cloud optimization journey.

The cloud spend optimization initiative (and challenge)

In 2017, RightScale’s (now Flexera) State of the Cloud report listed “Optimizing existing cloud spend” as the top initiative for cloud users (53%) for the first time, replacing the 2016 top initiative of “Moving more workloads to cloud.”

The cost optimization initiative has remained number one in every report since, and in 2020, 73% listed it as their top initiative.

So why do organizations still rank cloud cost savings as both their top cloud initiative and their top challenge? As you probably already know from your own experience, it isn’t as easy as it sounds.

In our previous article, “Mastering Cloud Cost Optimization: The Principles,” we covered the main cloud cost optimization challenges and the core principles needed to achieve a well-architected and continuously optimized cloud environment. Before proceeding, make sure to read it for more details and context.

It’s time for action

For years, organizations attempted to achieve cost optimization by focusing on reporting. This included chargeback reports, long Excel spreadsheets and dashboards with pretty charts and graphs.

Sadly, this approach rarely works. Staring at data will not reduce a cloud bill, nor will sending reports back and forth.

To be clear, this is not to minimize the importance of cost visibility and reporting. Cost visibility is a critical foundation of any organization’s cloud cost optimization strategy and is required to establish accountability, but it is not enough. To optimize a cloud environment, you must act, and again, this is easier said than done.

To execute actions, one needs to understand what actions to take, when to take them and what the implications of those actions will be beyond just cost savings.

We have seen and talked to organizations that created an internal process to identify, analyze and execute cloud optimization actions. Many admitted the process is time-consuming, cumbersome, manual and not scalable, especially in large cloud environments where their efforts had limited impact.

The solution? Automation: the ability to execute optimization actions without any human intervention. From a technical perspective, automation is not hard, especially in public clouds, which offer well-documented and robust APIs. The two main challenges with achieving automated cloud optimization are complexity and trust.

Multidimensional complexities

Complexity refers to the process of generating an accurate and actionable cloud optimization action, such as rightsizing a virtual machine (VM), a Platform-as-a-Service (PaaS) resource or even a container.

For example, to properly resize a single workload, you first need to observe the utilization of its resources across multiple metrics (e.g., CPU, memory, IOPS and network), including monitoring application performance metrics like response times and transactions. The next step is to analyze all the data and determine the best target instance type/SKU out of a massive, ever-growing catalog of configuration options (and prices) offered by the cloud vendor.

Then, once the target configuration has been identified, additional constraints must be considered, such as organization policies, OS driver requirements (e.g., NVMe or ENA on AWS) and storage type support (e.g., EBS optimized/Azure Premium LRS). This image illustrates the multiple dimensions that should be considered when scaling a workload:
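To make the selection step more concrete, here is a minimal Python sketch of how a rightsizing decision could be framed: filter a catalog by observed peak demand plus headroom, apply constraints, then take the cheapest match. Everything in it is an assumption made for illustration: the InstanceType and Workload fields, the pick_target helper, the 20% headroom and the catalog names and prices are hypothetical and do not come from any cloud vendor. A real analysis would cover far more dimensions (IOPS, network, application response times, driver support) than the two metrics shown.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class InstanceType:
    # Hypothetical catalog entry; real catalogs carry many more attributes.
    name: str
    vcpus: int
    memory_gib: float
    hourly_price: float
    supports_premium_storage: bool


@dataclass(frozen=True)
class Workload:
    # Observed peak (or high-percentile) demand over the analysis period.
    peak_vcpus_used: float
    peak_memory_gib: float
    needs_premium_storage: bool


def pick_target(
    workload: Workload,
    catalog: Iterable[InstanceType],
    headroom: float = 1.2,             # assumed 20% safety margin
    allowed: Optional[Set[str]] = None,  # stand-in for an organization policy
) -> Optional[InstanceType]:
    """Return the cheapest type that covers demand plus headroom and
    satisfies the constraints, or None if nothing qualifies."""
    candidates = [
        t for t in catalog
        if t.vcpus >= workload.peak_vcpus_used * headroom
        and t.memory_gib >= workload.peak_memory_gib * headroom
        and (t.supports_premium_storage or not workload.needs_premium_storage)
        and (allowed is None or t.name in allowed)
    ]
    return min(candidates, key=lambda t: t.hourly_price, default=None)


# Made-up catalog: names and prices are placeholders, not vendor data.
CATALOG = [
    InstanceType("small", 2, 4.0, 0.05, False),
    InstanceType("medium", 4, 16.0, 0.10, True),
    InstanceType("large", 8, 32.0, 0.20, True),
]

workload = Workload(peak_vcpus_used=1.5, peak_memory_gib=6.0,
                    needs_premium_storage=True)
target = pick_target(workload, CATALOG)
print(target.name if target else "no suitable type")
```

Even in this toy form, the cheapest entry is rejected on memory and storage support rather than on CPU, which is the point of looking at several dimensions at once instead of a single metric.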

The journey to automation requires trust

To get organizations to agree to automate actions, you must earn their trust that the actions are accurate, safe and will not hurt the performance of the applications, especially in production.

Trust requires time and a structured approach; it is a journey with multiple stages, and it is closely aligned with the public cloud maturity model:

Start with visibility: The first step is to gain visibility into the entire cloud environment and get a sense of the optimization opportunity at hand. This step includes identifying and aggregating all accounts and subscriptions, understanding the overall spend and the commitments made to the cloud providers, and tagging and labeling the different workloads based on their purpose, owner and environment (e.g., prod, test, dev). This must be done across all subscriptions/accounts.
Tackle the low-hanging fruit first: We recommend starting with the path of least resistance: terminating unused resources such as idle or unnecessary VMs, load balancers, public IPs, unattached volumes and old snapshots. Significant savings can be gained at this stage.
Purchase one-year reservations for production: While focusing on non-prod, consider purchasing one-year reservations for production. The reason is that optimization takes time; there is no way around it. By purchasing one-year reserved instances or savings plans, you will be able to save 30-40% as you hone your more advanced optimization skills on the non-prod estate. The reason for choosing one-year over three-year terms is that the goal is to build a more sophisticated optimization plan for production during the first year, which will include scaling the production workloads to their optimal size (i.e., rightsizing) and then buying new reservations based on the optimized instance type/SKU.
Implement scheduled suspension: Suspending non-prod workloads after hours can yield instant and rather substantial savings. For example, suspending workloads between 6 PM and 6 AM every day can reduce compute costs by about 50%, and the savings will be even higher if workloads are also suspended during weekends and holidays.
Execute IaaS scaling in non-prod environments: At this stage, the savings are noticeable, and many teams are eager to find more. We recommend leveraging the business units’ motivation and tackling the non-prod environments with scale actions. We created a maturity curve that focuses exclusively on this stage since it is critical for the success of the optimization efforts:
- Start with manual action execution: Review every scale action to validate its accuracy and scrutinize it with a handful of stakeholders from various business units (e.g., stakeholders from IT/Cloud Ops, the application team and finance). Execute the action and validate the impact. Take one step at a time and increase the number of actions executed as confidence grows.
- Approval workflows: The next step is to implement an approval workflow with your ITSM solution (such as ServiceNow). The optimization scale actions should be routed to the appropriate owner to approve, reject or suggest an adjustment to address elements that were not considered or available when the action was generated. For example, “the suggested instance type is not ideal for this workload since we are planning to double the transactions it will process starting next week.”
- Maintenance/change windows: As for when to execute the scale actions, start by defining a weekly change window in which all approved scale actions will be executed. Over time, expand the scope and frequency of the change windows. Many of our customers use daily change windows to execute scale actions against non-prod workloads, and the most mature ones have moved to full real-time automation, which is the goal.
- Purchase reservations for non-prod: After the majority of the long-term non-prod workloads have been optimized to their ideal compute configuration, you can now purchase reservations to obtain additional savings.
- Focus on production: It is time to tackle the production workloads. Leverage all the lessons learned from non-prod and apply them to production, following the above steps.
Enable real-time automation: As mentioned, some of our mature customers have enabled real-time automation. Some were able to do so faster than others since they modernized their applications (more on that in the next section).
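
The scheduled-suspension figures in the steps above can be checked with a little arithmetic. The short Python sketch below estimates what share of a 168-hour week a workload spends suspended. It rests on assumptions that are not part of the article: compute is billed on demand in proportion to running hours, the same nightly window applies on every day that is not fully suspended, and holidays are ignored. The function name suspended_fraction is hypothetical. Storage and other charges that continue while a workload is suspended are not modeled, so the actual reduction in the total bill will be smaller.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def suspended_fraction(off_hours_per_day: float, suspend_weekends: bool) -> float:
    """Share of a week during which the workload is suspended.

    off_hours_per_day: nightly suspension window, e.g. 12 for 6 PM to 6 AM.
    suspend_weekends:  if True, Saturday and Sunday are suspended entirely
                       and the nightly window applies to the 5 weekdays only.
    """
    if not 0 <= off_hours_per_day <= 24:
        raise ValueError("off_hours_per_day must be between 0 and 24")
    if suspend_weekends:
        suspended_hours = off_hours_per_day * 5 + 2 * 24
    else:
        suspended_hours = off_hours_per_day * 7
    return suspended_hours / HOURS_PER_WEEK


nightly_only = suspended_fraction(12, suspend_weekends=False)
with_weekends = suspended_fraction(12, suspend_weekends=True)
print(f"Nightly only:       {nightly_only:.0%} of compute hours avoided")
print(f"Nights + weekends:  {with_weekends:.0%} of compute hours avoided")
```

Under these assumptions, a 12-hour nightly window avoids half of the weekly compute hours (84 of 168), and adding full weekends raises that to 108 of 168 hours, or roughly 64%.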

Application modernization and cost efficiency

Since scale actions on the cloud are disruptive, not all workloads can be resized often; some require a graceful shutdown of specific services as part of the scaling process. When an application is modernized to leverage cloud-native architectures and PaaS services, it unlocks the ability to take optimization actions in real time and leverage automation without any impact on the application.

Therefore, it is critical that organizations, in parallel with their continuous optimization initiatives, invest in application modernization and architect their applications for cost efficiency by leveraging PaaS services and cloud-native technologies such as containers and functions (i.e., serverless).

Stay tuned for the next article in this blog series, “Mastering Cloud Cost Optimization: Cloud Cost Models & Discounts Overview.” Leveraging the correct cloud cost model for workloads is one of the most effective methods to reduce cloud costs. The upcoming blog post will provide an overview of the available cost and discount models on the cloud and when to use them.

Author

Asena Hertz

Product Marketing at Turbonomic