Five ways to monitor and control AWS cloud costs
Many IT teams find that their AWS spending grows less efficient as “clutter” builds up in their accounts. The good news is that both AWS and a small army of third-party providers have developed tools to help engineers discover the causes of these inefficiencies.
While there are several “easier” fixes, such as purchasing Reserved Instances and eliminating unused resources, the real issue is usually far more complex. Unplanned costs are frequently the result of nonstandard deployments that stem from unclear or absent development processes, poor organisation, or the absence of automated deployment and configuration tools.
Controlling AWS costs is no simple task in enterprises with highly distributed teams, unpredictable legacy applications, and complex lines of dependency. Here are some strategies Logicworks engineers use to keep our clients’ costs down:
1. Cloudcheckr and Trusted Advisor
The first step in controlling AWS costs is to gather historical cost/usage data and set up an interface where this data can be viewed easily.
There are many third-party and native AWS resources that provide consolidated monitoring as well as recommendations for potential cost savings, using tools like scheduled runtimes and parking calendars to take advantage of the best prices for On-Demand instances.
Cloudcheckr is a sophisticated cloud management tool that is especially useful in enforcing standard policies and alerting developers if any resources are launched outside of that configuration. It also has features like cost heat maps and detailed billing analysis to give managers full visibility into their environments. When unusual costs appear in an AWS bill, Cloudcheckr is the first place to look.
Trusted Advisor is a native AWS resource available with Business-level support. Its primary function is to recommend cost-saving opportunities and, like Cloudcheckr, it also provides availability, security, and fault tolerance recommendations. Even simple tuning of CPU usage and provisioned IOPS can add up to significant savings; Oscar Health recently reported that it saw 20% savings after using Trusted Advisor for just one hour.
Last year, Amazon also launched the Cost Explorer tool, a simple graphical interface displaying the most common cost queries: monthly cost by service, monthly cost by linked account, and daily spend. This level of detail might be suitable for upper management and finance teams, as it does not include particularly detailed technical data.
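The three queries named above are simple aggregations over billing line items. The following is a minimal sketch of them, not a reproduction of how Cost Explorer works internally; the record layout, the `LINE_ITEMS` figures, and the function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical billing line items:
# (date "YYYY-MM-DD", service, linked account, cost in USD).
# The figures are invented for illustration only.
LINE_ITEMS = [
    ("2015-03-01", "EC2", "prod", 120.0),
    ("2015-03-01", "S3", "prod", 15.5),
    ("2015-03-02", "EC2", "dev", 40.0),
    ("2015-04-01", "EC2", "prod", 130.0),
    ("2015-04-01", "RDS", "dev", 60.0),
]


def monthly_cost_by(items, field):
    """Sum cost per (month, key), where field is 'service' or 'account'."""
    index = {"service": 1, "account": 2}[field]
    totals = defaultdict(float)
    for item in items:
        month = item[0][:7]  # keep only "YYYY-MM"
        totals[(month, item[index])] += item[3]
    return dict(totals)


def daily_spend(items):
    """Sum cost per calendar day across all services and accounts."""
    totals = defaultdict(float)
    for date, _service, _account, cost in items:
        totals[date] += cost
    return dict(totals)


if __name__ == "__main__":
    for key, cost in sorted(monthly_cost_by(LINE_ITEMS, "service").items()):
        print(key, cost)
```

In practice the line items would come from the detailed billing report rather than a hard-coded list.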
2. Reserved instances
The most obvious way to control compute cost is to purchase reserved EC2 instances for a term of one or three years, paying all upfront, partially upfront, or nothing upfront. Customers can see savings of over 50% on reserved instances vs. on-demand instances.
However, reserved instances have several complications. First, it is not a simple matter to predict one or three years of usage when an enterprise has been on AWS for the same amount of time or less. Second, businesses that are attracted to the pay-as-you-go cloud model are wary of capital costs that hark back to long-term contracts and sunk costs. It can also be difficult to find extra capacity of certain instance types on the marketplace, and enterprises might find this a complicated and costly procedure in any case.
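Whether a reservation pays off depends on how many hours the instance actually runs. The sketch below compares the two pricing models over a term and computes the break-even utilisation. All prices are hypothetical placeholders, not AWS list prices, and the model assumes the reserved hourly fee is charged for every hour of the term whether or not the instance is running.

```python
HOURS_PER_YEAR = 8760


def reserved_vs_on_demand(on_demand_hourly, upfront, reserved_hourly,
                          years=1, utilisation=1.0):
    """Return (on_demand_total, reserved_total, savings) over the term.

    utilisation is the fraction of hours the instance actually runs.
    Assumption: on-demand is billed only for running hours, while the
    reserved hourly fee is billed for every hour of the term.
    """
    hours = HOURS_PER_YEAR * years
    on_demand_total = on_demand_hourly * hours * utilisation
    reserved_total = upfront + reserved_hourly * hours
    return on_demand_total, reserved_total, on_demand_total - reserved_total


def break_even_utilisation(on_demand_hourly, upfront, reserved_hourly, years=1):
    """Fraction of the term an instance must run for the RI to break even."""
    hours = HOURS_PER_YEAR * years
    return (upfront + reserved_hourly * hours) / (on_demand_hourly * hours)


if __name__ == "__main__":
    # Hypothetical prices: $0.10/h on-demand, $200 upfront plus $0.02/h reserved.
    print(reserved_vs_on_demand(0.10, 200.0, 0.02))
    print(break_even_utilisation(0.10, 200.0, 0.02))
```

With these placeholder prices, an instance that runs less than the break-even fraction of the year is cheaper on-demand, which is why the reservation decision hinges on stable usage.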
Companies can still get value out of reserved instances by following certain best practices:
- Buy reserved capacity to cover the minimum or average sustained usage: the minimum number of instances necessary to keep the application running, or the instances that are historically always running.
- To figure out average sustained usage, use tools like Cloudcheckr and Trusted Advisor (explored above) to audit your usage history. Cloudcheckr will recommend reserved instance purchases based on those figures, which can be especially helpful if you do not want to comb through years of data across multiple applications.
- Focus first on what will achieve the highest savings with rapid ROI; this lowers the potential impact of future unused resources. The best use-cases for reserved instances are applications with very stable usage patterns.
- For larger enterprises, use a single individual, financial team, and AWS account to purchase reserved instances across the entire organisation. This allows for a centralised reserved instance hub so that resources that are not used on one application/team can be taken up by other projects internally.
- Consolidated accounts can purchase reserved instances more effectively when instance families are also consolidated. Reservations cannot be moved between accounts, but they can be moved within RI families. Reservations can be changed at any time from one size to another within a family. The fewer families an organisation maintains, the more ways an RI can be applied. However, as explored below, the cost efficiencies gained by choosing a more recently released, more specialised instance type could outweigh the benefits of consolidating families to make the RI process smoother.
- Many EC2 instances are underutilised. Experiment with a small number of RIs on stable applications, but you may find better value, without upfront costs, by choosing smaller instance sizes and by scheduling On-Demand instances more carefully.
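The "minimum sustained usage" in the first bullet can be estimated from usage history. The following is a minimal sketch, assuming you already have a list of hourly running-instance counts for one instance family (for example, exported from a monitoring tool). The function name and the `min_coverage` parameter are hypothetical, and the sample counts are invented.

```python
import math


def reservation_baseline(hourly_counts, min_coverage=1.0):
    """Largest N such that at least N instances were running during at
    least min_coverage (0 < min_coverage <= 1) of the sampled hours.

    min_coverage=1.0 gives the always-on floor; lower values reserve more
    aggressively and accept that some reservations will sit idle.
    """
    if not hourly_counts:
        return 0
    descending = sorted(hourly_counts, reverse=True)
    # Number of sampled hours that must have at least N instances running.
    required_hours = max(1, math.ceil(min_coverage * len(descending)))
    return descending[required_hours - 1]


if __name__ == "__main__":
    counts = [4, 4, 3, 2, 6, 8, 10, 4]  # invented sample data
    print(reservation_baseline(counts))        # always-on floor
    print(reservation_baseline(counts, 0.75))  # running 75% of the time
```

Starting from the always-on floor and raising coverage gradually matches the advice above to experiment with a small number of RIs first.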
3. Spot instances
Spot instances allow customers to set the maximum price for compute on EC2. This is great for running background jobs more cheaply, processing large data loads in off-peak times, etc. Those familiar with certain CPC bid rules in advertising may recognise the model.
The issue is that a spot instance might be terminated when you are 90% through a job if the spot price for that instance rises above your bid. An architecture that does not plan for this can see the cost of the spot instance wasted. Bid prices need to change dynamically, but without exceeding on-demand prices. Best practice is to set up an Auto Scaling group that contains only spot instances; CloudWatch can watch the current spot rate and, when the price falls to or below the bid, scale up the group as long as it stays within the parameters of the request. Then create a second Auto Scaling group with on-demand instances (the minimum to keep the lights on), and place an ELB in front of both groups so that requests are served by either the spot group or the on-demand group. If the bid needed to win spot capacity would exceed the on-demand price, create a new launch configuration and set the min_size of the spot Auto Scaling group to 0. Sanket Dangi outlines this process here.
Engineers can also use this process to make background jobs run faster: spot instances supplement a scheduled runtime whenever the spot price is below a certain figure, minimising the impact on end users and potentially saving money relative to reserved and on-demand instances.
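The decision logic behind the two Auto Scaling groups can be sketched in a few lines. This is an illustration of the rule only, not the process from the referenced write-up: the function and field names are hypothetical, it makes no AWS API calls, and it assumes the on-demand group takes over the full desired capacity whenever spot capacity is unavailable.

```python
from dataclasses import dataclass


@dataclass
class GroupSizes:
    spot_min: int        # minimum size of the spot-only Auto Scaling group
    on_demand_min: int   # minimum size of the on-demand Auto Scaling group


def plan_group_sizes(spot_price, bid_price, on_demand_price, desired, baseline):
    """Decide minimum sizes for the spot and on-demand groups.

    baseline is the on-demand minimum needed to keep the lights on;
    desired is the total number of instances wanted right now.
    """
    # Never bid above the on-demand price.
    effective_bid = min(bid_price, on_demand_price)
    if spot_price > effective_bid:
        # Spot instances would be terminated or never launched: empty the
        # spot group and let on-demand carry the load (an assumption here).
        return GroupSizes(spot_min=0, on_demand_min=max(desired, baseline))
    # Spot is affordable: keep the on-demand floor, fill the rest with spot.
    return GroupSizes(spot_min=max(desired - baseline, 0), on_demand_min=baseline)


if __name__ == "__main__":
    print(plan_group_sizes(0.03, 0.05, 0.10, desired=10, baseline=2))
    print(plan_group_sizes(0.08, 0.05, 0.10, desired=10, baseline=2))
```

A scheduled job or alarm handler could run this check and apply the resulting sizes to the two groups.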
For those not interested in writing custom scripts, Amazon recently acquired ClusterK, which reallocates workloads to on-demand resources when spot instances terminate and will “opportunistically rebalance” to spot instances when the price fits. This dramatically expands the use case for spot instances beyond background applications to mission-critical apps and services.
4. Organise and automate
As IT teams evolve to a more service-oriented structure, highly distributed teams will increasingly have more autonomy over provisioning resources without the red tape and extensive time delay of traditional IT environments. While this is a crucial characteristic of any DevOps team, if it is implemented without the accompanying automation and process best practices, decentralised teams have the potential to produce convoluted and non-standard security rules, configurations, storage volumes, etc. and therefore drive up costs.
The answer to many of these concerns is CloudFormation. The more time an IT team spends in AWS, the more crucial it is that the team use CloudFormation. Enterprises deploying on AWS without CloudFormation are not truly taking advantage of all the features of AWS, and are exposing themselves to both security and cost risks as multiple developers deploy nonstandard code that is forgotten or never updated.
CloudFormation allows infrastructure staff to bake in security, network, and instance family/size configurations, so that the process of deploying instances is not only faster but also less risky. Used in combination with a configuration management tool like Puppet, it becomes possible to bring up instances that are ready to go in a matter of minutes. Puppet manifests also provide canonical reference points if anything does not go as planned, and Puppet maintains the correct configuration even if that means reverting to an earlier version. For example, a custom fact can report which security groups an instance is running in, and a manifest can automatically associate the instance with specific groups as needed. This can significantly lower the risk of downtime associated with faulty deploys. CloudFormation can also dictate which families of instances should be used, if it is important to leverage previously purchased RIs or to provide the flexibility to do so at a later point.
Granted, maintenance of these templates requires a significant amount of staff time, and can initially feel like a step backwards in terms of cost efficiency. CloudFormation takes some time to learn. But investing the time and resources will greatly improve a team’s ability to deploy quickly and will encourage consistency within an AWS account. Clutter builds up in any environment, but this can be significantly reduced when a team automates configuration and deployment.
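As an illustration of baking instance-type and security-group choices into a template, the sketch below builds a minimal CloudFormation template as a Python dictionary and serialises it to JSON. The allowed instance types, the resource name `AppServer`, and the placeholder security group and AMI IDs are assumptions chosen for the example; a real template would carry far more configuration.

```python
import json

# Hypothetical policy: the only instance types the infrastructure team
# allows, for example the families covered by existing reservations.
ALLOWED_INSTANCE_TYPES = ["m3.medium", "m3.large", "c3.large"]


def build_template(security_group_ids, image_id):
    """Return a minimal template that restricts the instance type."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Standard app server (illustrative sketch)",
        "Parameters": {
            "InstanceType": {
                "Type": "String",
                "Default": ALLOWED_INSTANCE_TYPES[0],
                # Stack creation is rejected for values outside this list.
                "AllowedValues": ALLOWED_INSTANCE_TYPES,
            }
        },
        "Resources": {
            "AppServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": image_id,
                    "InstanceType": {"Ref": "InstanceType"},
                    "SecurityGroupIds": security_group_ids,
                },
            }
        },
    }


def is_allowed(template, instance_type):
    """Local pre-check mirroring the template's AllowedValues constraint."""
    return instance_type in template["Parameters"]["InstanceType"]["AllowedValues"]


# Placeholder IDs, not real resources.
template = build_template(["sg-00000000"], "ami-00000000")
template_json = json.dumps(template, indent=2)
```

Because the allowed types live in one place, changing the approved families is a single edit that every subsequent deployment picks up.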
5. Instance types and resource optimisation
Amazon is constantly delivering new products and services. Most of these have been added as a direct result of customer comments about cost or resource efficiencies, and it is well worth keeping on top of these releases to discover whether the cost savings outweigh the cost of implementing a new solution. If the team is using CloudFormation, adopting a new instance type or service may be easier.
New instance types often have cost-savings potential. For instance, last year Amazon launched new T2 instances, which provide low-cost, stable baseline processing power and the ability to build up “CPU credits” during quiet periods to use automatically during busy times. This is particularly convenient for bursty applications with rare spikes, like small databases and development tools.
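To see whether a bursty workload fits this credit model, it can help to simulate the credit balance hour by hour. The following is a rough sketch for a single-vCPU instance, taking one credit to mean one vCPU at 100% for one minute. The default earn rate and balance cap are assumptions modelled on the smallest T2 sizes, so check AWS's documentation for the figures that apply to a given instance type.

```python
def simulate_cpu_credits(hourly_utilisation, earn_per_hour=6.0,
                         max_balance=144.0, start_balance=0.0):
    """Track the CPU credit balance for a single-vCPU burstable instance.

    hourly_utilisation is a sequence of average CPU utilisation values
    (0.0 to 1.0), one per hour. An hour at utilisation u spends u * 60
    credits. Returns (balance after each hour, hours that ran out of
    credits and would have been throttled to baseline).
    """
    balance = start_balance
    balances = []
    throttled_hours = 0
    for utilisation in hourly_utilisation:
        balance += earn_per_hour - utilisation * 60.0
        if balance < 0:
            # Demand exceeded the available credits during this hour.
            throttled_hours += 1
            balance = 0.0
        balance = min(balance, max_balance)
        balances.append(balance)
    return balances, throttled_hours


if __name__ == "__main__":
    # Four quiet hours, one hour at 50% CPU, then quiet again.
    print(simulate_cpu_credits([0.0, 0.0, 0.0, 0.0, 0.5, 0.0]))
```

A workload whose simulated balance regularly hits zero is probably better served by a fixed-performance instance type.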
A number of Amazon’s new features over the last year have related to price transparency, including the pricing tiers of reserved instances, so it appears safe to expect more services that offer additional cost efficiencies in the next several years.