Jan 16, 2013

How do you measure cloud resiliency?

What are the most effective methods or tools for measuring cloud resiliency?

Here's an article with some metrics.

Measuring the Performance of Your Cloud

"Cloud performance is in the eye of the beholder. If you are a cloud consumer, you simply expect your “service level agreement” to be met. This includes overall performance, elasticity, security and resiliency. The tricky job goes to the cloud provider. The provider has to make the cloud’s resources appear instantaneous, always on, and infinitely abundant. Hence, this article is predominantly focused on performance from the cloud provider’s perspective.

In the IBM cloud world, virtualization and automation are the technological keys to the illusion of "instant, always and infinite". Hence, many of these areas are focused on the performance and impact of virtualization, and the automated provisioning system behind it. The ability to express service level agreements (SLAs) against your virtual environment is another key ingredient to measuring performance, allowing the cloud to service the most important workloads first, and degrade others gracefully.

Here is a quick introduction to these 6 areas of cloud performance. For each area, I try to provide a simple definition, give some insight as to why the area is important, and provide high-level thoughts on how one might set up experiments to quantify performance. I am boldly assuming that the cloud is mostly used to run enterprise web applications, so the examples follow the behavior of this workload style. I am hoping in the future to have a more scholarly perspective, but until then, here is the high-level overview…"

Hello Friends,

Cloud computing aims to allow easy access to shared computing resources while freeing your organization from the capital outlays of owning its own IT infrastructure. But how do you know your cloud infrastructure will be resilient in the real world against rapidly evolving cyber attacks, a growing volume of users, and high-stress application load? You must ensure that your vital cloud-based services can handle your unique business application traffic, peak user load, and security attacks. For the health of your business, you must harden resiliency by finding the optimal balance of performance, security, and stability of the cloud services that are crucial to your business operations.
Understanding how to find that balance requires precise insight into the performance of every element of your infrastructure under your own mix of conditions. As they try to gather that insight, some cloud vendors have relied on legacy tools deployed across hundreds of servers in cloud environments. The net result: an amalgamation of tools and workarounds that is hugely expensive yet brittle—and that cannot scale to address the task at hand.

Thanks and Regards,
Agili Ron



To add a little to what Christopher posted, here is a nice slideshow about measuring private cloud resiliency. It boils down to three steps:

1. Functional testing.

2. Enhance functional testing with load.

3. Testing performance and security under load. 


There are many vendors offering products designed to measure your cloud's ability to run under catastrophic circumstances. But here's how IBM Developer describes the general process:


"The metrics around resiliency all relate to keeping your cloud running under adverse conditions. Denial-of-Service attacks, runaway processes, and failed hardware resources are examples of Security, Isolation, and Resiliency respectively. A cloud should be able to quickly react to issues within the 'hard' or 'soft' aspects of the environment by moving workloads to working areas of the cloud and quickly failing over to other virtual environments. A robust enterprise cloud should also support disaster recovery features, allowing your cloud to be linked to another cloud in an active/passive or active/active setup. One can imagine a performance experiment to measure Resiliency being similar to the Elasticity tests. However, instead of the cloud reacting to a breach in SLA, the cloud must now react to a system failure. For example, unplug the 'blade' running the Apache Day Trader workload, to simulate a hardware error. Measure how long it takes the cloud to react to this breach and return the response time back to 2 seconds or better. Similarly, your cloud must support isolation such that if one tenant’s virtual system is 'running amok' another tenant will not be disturbed. To test this scenario, we create a runaway process that continually allocates memory or disk space. While this is happening we measure the performance of a second tenant to see if we notice any ill effects from its neighboring tenant. Also watch the system vital signs to confirm that the runaway tenant is 'capped'. Your cloud must still perform while under a denial of service attack. Hence, another test involves setting up a denial of service attack by opening up Port 80 and sending bogus HTTP traffic. The cloud should employ an application firewall that filters Port 80, looks for ill-formed HTTP requests, and denies them access to the Cloud’s network."

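For anyone who wants to turn the "unplug the blade" experiment from the quoted article into a number, here is a rough Python sketch. It assumes you already log (timestamp, response time) pairs from a load generator while the test runs. The function name recovery_time, the stable_count parameter, and the sample data are all made up for illustration; only the 2-second target comes from the quote. Treat it as one possible way to score the test, not as IBM's method.

```python
from typing import Optional, Sequence, Tuple

# One sample = (seconds since test start, observed response time in seconds).
Sample = Tuple[float, float]


def recovery_time(
    samples: Sequence[Sample],
    failure_at: float,
    sla_seconds: float = 2.0,   # target taken from the quoted article
    stable_count: int = 3,      # assumption: 3 good samples in a row = recovered
) -> Optional[float]:
    """Seconds from the injected failure until response time is back within
    the SLA for `stable_count` consecutive samples.

    Returns 0.0 if the SLA was never breached after the failure, and None
    if it was breached but never recovered within the recorded samples.
    """
    breached = False
    run_start = None
    run_len = 0
    for t, rt in samples:
        if t < failure_at:
            continue                # ignore samples taken before the failure
        if rt > sla_seconds:
            breached = True         # SLA violated, any good run is reset
            run_len = 0
            continue
        if not breached:
            continue                # failure has not shown up in latency yet
        if run_len == 0:
            run_start = t           # first good sample of a candidate run
        run_len += 1
        if run_len >= stable_count:
            return run_start - failure_at
    return None if breached else 0.0


# Invented measurements: blade pulled at t=12s, latency spikes, then settles.
DEMO_SAMPLES = [
    (0, 0.8), (5, 0.9), (10, 1.0), (15, 6.5), (20, 9.0), (25, 4.2),
    (30, 1.9), (35, 2.4), (40, 1.8), (45, 1.5), (50, 1.2),
]

print(recovery_time(DEMO_SAMPLES, failure_at=12))
```

Requiring several good samples in a row keeps a single lucky response (the one at t=30 in the demo data) from being counted as a recovery. The same function could be pointed at the second tenant's latency log during the runaway-process test to see whether the neighbor ever breached the SLA at all.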