The IT Cost Box

KEY POINTS:

In this post, I introduce a simple model for thinking about IT costs in three dimensions.

The model presents IT costs as the volume of a box, where the three sides of the box represent the three drivers of IT cost: unit costs, size, and time. You can optimize expenses along all three dimensions.

Unit costs describe the price of units of resources needed to run the systems (e.g., compute, storage, or network resources). Size represents the number of resource units required to run the system. Time describes how the size changes over time.

Intro

Running IT systems always costs money. And architects are expected to help keep these costs under control. In this post, I discuss a simple model for thinking about IT costs in three dimensions. I want to help facilitate discussions around IT costs with this model by providing structure and a common vocabulary.

Figure 1 illustrates this model, which I borrow from Gregor Hohpe’s book Cloud Strategy. The model presents IT costs as the volume of a box. The three sides of the box represent the three drivers of IT costs:

Unit costs, describing the price of units of resources needed to run the systems (e.g., compute, storage, or network resources).
Size, representing the number of resource units required to run the system.
Time, describing how the size changes over time.

Figure 1: The IT Cost Box (credit Gregor Hohpe, Cloud Strategy):
Cost[$] = size [units] * time [hours] * unit cost [$/unit/hour]

To calculate the IT system costs, you need to know the price of your units (e.g., the price of VMs in the cloud), the number of these units your system needs now, and how this resource usage changes over time.
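As a minimal sketch of this formula, the Python snippet below computes the cost once for a constant size and once for a size that changes hour by hour. The function names, the price of $0.50 per VM-hour, and the VM counts are hypothetical values chosen only for illustration.

```python
def it_cost(size_units, hours, unit_cost_per_hour):
    """Cost[$] = size [units] * time [hours] * unit cost [$/unit/hour]."""
    return size_units * hours * unit_cost_per_hour


def it_cost_over_time(sizes_per_hour, unit_cost_per_hour):
    """Sum the cost hour by hour when the size changes over time."""
    return sum(size * unit_cost_per_hour for size in sizes_per_hour)


# Hypothetical numbers: 4 VMs running for a full day at $0.50 per VM-hour.
static_cost = it_cost(4, 24, 0.5)

# Same day, but only 1 VM is needed during 12 quiet hours.
elastic_cost = it_cost_over_time([4] * 12 + [1] * 12, 0.5)

print(static_cost, elastic_cost)
```

The second function makes the time dimension explicit: the box does not have to be a perfect cuboid, because the size can differ in every hour.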

You can optimize IT costs around all of these three dimensions.

Optimizing For Unit Costs

Unit costs describe the price of the smallest piece of a resource you can use. What exactly counts as a unit depends on the infrastructure you are using. In our private cloud, examples of units are processors, storage disks, and network bandwidth. We usually express unit prices as the cost of using some amount of a resource per unit of time. For example, running a virtual machine may cost you 10 dollars per day.

In a public cloud, when we use higher-level managed services like hosted databases and machine learning tools, a unit is frequently some discrete action you can perform on a service, e.g., a read or an update. In that case, we usually express unit costs as a price per action.

In general, as system developers, we have a limited number of options for optimizing unit costs. If we run all systems in the same data center with the same hardware and VM options, we can only choose the number of VMs but not their characteristics. However, there are some possibilities to optimize for unit costs.

First, you need to be clear about what resources you need. Avoid being so defensive or greedy that you use far more resources than necessary. If you have several options for VMs, for example, then you need to choose the one most suited to your system's tasks. Public cloud providers offer many combinations of CPU and RAM so that you can deploy compute-intensive or memory-intensive applications on virtual machines that match their needs more closely. It makes no sense to always use the most expensive processors suitable for complex machine learning tasks, or to use costly SSD disks to store backup data accessed once a year. Having a good understanding of your workloads is the key to selecting instances suited for them.
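The selection step can be sketched in a few lines of Python: given a catalog of instance types, pick the cheapest one that still satisfies the workload's CPU and RAM needs. The catalog entries, names, and prices below are invented for illustration and do not correspond to any real provider's offering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    ram_gb: int
    hourly_price: float


# Hypothetical catalog; real providers publish their own types and prices.
CATALOG = [
    InstanceType("small", 2, 4, 0.05),
    InstanceType("compute-optimized", 8, 16, 0.30),
    InstanceType("memory-optimized", 4, 64, 0.40),
    InstanceType("ml-ready", 16, 128, 2.50),
]


def cheapest_fit(catalog, min_vcpus, min_ram_gb):
    """Return the cheapest instance type that meets the requirements, or None."""
    candidates = [
        i for i in catalog
        if i.vcpus >= min_vcpus and i.ram_gb >= min_ram_gb
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda i: i.hourly_price)


# A simple batch job that needs 2 vCPUs and 4 GB of RAM.
print(cheapest_fit(CATALOG, 2, 4))
```

The point of the sketch is that the requirements come first. Without measured workload needs, the default tends to drift toward the largest and most expensive type.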

Figure 1: Do you need a Ferrari to commute to your work in a city with frequent traffic jams and a speed limit of 50km/h? Similarly, do you need machine learning ready VMs for simple non-real-time data processing tasks?

Second, an option that is sometimes available is trading stability for cost. For instance, Google Cloud offers Preemptible VMs. These virtual machines provide the same options as regular VMs. The difference is that they last for at most 24 hours, and Google may terminate them at any time. But if your applications are fault-tolerant and can withstand possible instance termination, then preemptible instances can reduce your costs by up to 80%.
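A rough way to reason about this trade-off is to account for the work that has to be redone after terminations. The sketch below is a simplified model, not a provider formula: the function name and the rework_fraction parameter (the share of work repeated because of terminations) are assumptions introduced for illustration.

```python
def preemptible_cost(regular_cost, discount, rework_fraction):
    """Estimate the cost of running a job on preemptible VMs.

    regular_cost:    cost of the job on regular VMs
    discount:        fraction off the regular price (0.8 means 80% cheaper)
    rework_fraction: assumed share of work redone after terminations
    """
    if not 0 <= discount <= 1:
        raise ValueError("discount must be between 0 and 1")
    if rework_fraction < 0:
        raise ValueError("rework_fraction must not be negative")
    return regular_cost * (1 - discount) * (1 + rework_fraction)


# Hypothetical job: $100 on regular VMs, 80% discount, 10% of work redone.
print(preemptible_cost(100.0, 0.8, 0.1))
```

Under these assumed numbers the job remains far cheaper than on regular VMs, but the saving shrinks as the amount of repeated work grows.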

The last way to reduce unit costs is through sound financial management. If you bundle your expenses, you may get a discount. For instance, customers in the public cloud typically get lower unit prices once their usage exceeds some threshold or when they commit to a minimum usage level.

Optimizing For Size

Optimizing cost for size means using as few resource units as possible to run your system.

Figure 2: Most cars carry just one or two persons, while they can seat at least four comfortably. Similarly, running your VMs at a 25% utilization level means you need roughly four times as many VMs as you would at full utilization.

We can generally optimize for size by being more precise in defining resource needs or by aggregating workloads to use resources more efficiently.

The key to the first approach to optimizing for size lies in a good understanding of your usage patterns. Good instrumentation and monitoring are critical. A high level of automation also helps as it enables you to add or remove resources quickly. Optimizing for size by better utilizing individual resources may be complicated as having too few resources may cause your system to perform poorly or even crash. Reducing redundancy also needs to be done carefully, as more redundancy may increase your system’s uptime.
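The effect of utilization on size can be sketched with a small calculation. The function below is a simplified model that assumes load spreads evenly across identical VMs. The load and capacity figures are hypothetical.

```python
import math


def vms_needed(peak_load, capacity_per_vm, target_utilization):
    """Number of VMs needed to serve peak_load at a target utilization.

    peak_load and capacity_per_vm use the same unit (e.g., requests/second).
    target_utilization is a fraction between 0 (exclusive) and 1 (inclusive).
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return math.ceil(peak_load / (capacity_per_vm * target_utilization))


# Hypothetical system: 1000 requests/second, each VM handles 100 at most.
for utilization in (0.25, 0.75, 1.0):
    print(utilization, vms_needed(1000, 100, utilization))
```

Running every VM at full utilization is rarely a realistic target, since some headroom protects against load spikes. The calculation only shows how quickly the unit count grows as utilization drops.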

Another approach you can apply to optimize for size is aggregation of workloads. Aggregation can create unarguable economic benefits in terms of investment required for service resource capacity, utilization of that capacity, and the delivered cost. However, these advantages need to be traded off against other factors, such as response time or data networking costs.

Optimizing For Time

Optimizing for time means introducing elasticity or auto-scaling in your system design so that resource creation and termination follow your usage pattern as closely as possible, ideally in real time.

A typical development test environment, for instance, is utilized around 50 hours a week, less than a third of the 168 hours in a week. Stopping or downscaling these environments when they’re not needed can cut your cost dramatically. Similarly, our customers are much more active during the day than at night.

Figure 3: You may need fewer carriages during the night.
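The saving from stopping a test environment outside working hours is simple arithmetic, sketched below. The hourly rate of $1 is a hypothetical placeholder, and the model assumes a stopped environment costs nothing, which ignores charges such as storage that may continue while VMs are stopped.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_cost(hours_running, hourly_rate):
    """Weekly cost of an environment billed only while it is running."""
    return hours_running * hourly_rate


hourly_rate = 1.0  # hypothetical price of the whole environment per hour

always_on = weekly_cost(HOURS_PER_WEEK, hourly_rate)
scheduled = weekly_cost(50, hourly_rate)  # running only when in use
saving_fraction = 1 - scheduled / always_on

print(f"always on: ${always_on:.2f}, scheduled: ${scheduled:.2f}")
print(f"saving: {saving_fraction:.0%}")
```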

Looking Holistically

Any optimization has its costs and risks. If you have a system that consumes resources inefficiently, you may reduce its costs if you invest in making the system more efficient. However, such investment may sometimes be more expensive than the eventual saving that it can bring during the system’s lifetime.
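A simple break-even check makes this comparison concrete. The sketch below ignores discounting and the risk of the optimization failing. The function names and the figures are hypothetical.

```python
def break_even_months(investment, monthly_saving):
    """Months until the cumulative saving equals the investment, or None."""
    if monthly_saving <= 0:
        return None
    return investment / monthly_saving


def worth_it(investment, monthly_saving, remaining_lifetime_months):
    """True if the saving over the remaining lifetime exceeds the investment."""
    return monthly_saving * remaining_lifetime_months > investment


# Hypothetical optimization: costs $12,000 and saves $1,000 per month.
print(break_even_months(12_000, 1_000))
print(worth_it(12_000, 1_000, remaining_lifetime_months=24))
print(worth_it(12_000, 1_000, remaining_lifetime_months=6))
```

With these numbers, the optimization pays off for a system that will run for two more years but not for one that will be retired in six months.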

Another topic to think about is where it makes sense to optimize first. If your system has irregular usage patterns, then optimizing for time may be the most cost-effective investment. If your system has a constantly high load and you can keep resources statically allocated, it makes more sense to reduce the number of units you need or to use cheaper units.

To Probe Further

Gregor Hohpe (2020): Cloud Strategy, Chapter “Cloud Savings Have To Be Earned”
Joe Weinman: Articles on Cloud Economics