Test Resiliency Using Failure Injection

Failure injection (also known as chaos testing) is an effective and essential method to validate and understand the resiliency of your workload and is a recommended practice of the AWS Well-Architected Reliability Pillar. Here you will initiate various failure scenarios and assess how your system reacts.

Preparation

Before testing, please prepare the following:

  1. Confirm that the selected AWS Region is the one you chose when you deployed your WebApp

    • You will use the AWS Console to assess the impact of the tests

    • Throughout this lab, make sure you are in the correct Region. For example, the following screenshot shows the desired Region, assuming your WebApp was deployed to the Ohio Region

      [Screenshot: SelectOhio]

  2. Get VPC ID

    • A VPC (Amazon Virtual Private Cloud) is a logically isolated section of the AWS Cloud where you have deployed the resources for your service
    • For these tests you will need to know the VPC ID of the VPC you created as part of deploying the service
    • Navigate to the VPC management console: https://console.aws.amazon.com/vpc
    • In the left pane, click Your VPCs
    • 1 - Tick the checkbox next to WebApp1-VPC
    • 2 - Copy the VPC ID

    [Screenshot: GetVpcId]

    • Save the VPC ID. You will use it later wherever <vpc-id> appears in a command

  3. Get familiar with the service website

    1. Point a web browser at the URL you saved earlier
      • If you do not recall it, open the WebApp1-Static stack, click the Outputs tab, and open the WebsiteURL value in your web browser
    2. Note the instance_id (begins with i-). This is the EC2 instance serving this request
    3. Refresh the website several times, watching these values
    4. Note that the values change. You have deployed one web server in each of three Availability Zones.
      • The AWS Elastic Load Balancer (ELB) sends your request to any of these three healthy instances.
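
The VPC ID from step 2 can also be looked up programmatically. The following is a minimal sketch, not part of the lab scripts. It assumes that boto3 is installed, that AWS credentials and the default Region are configured for the Region where you deployed the WebApp, and that the name WebApp1-VPC shown in the console is stored as the VPC's Name tag. The function names find_vpc_id and lookup_vpc_id are hypothetical.

```python
def find_vpc_id(describe_vpcs_response, name):
    """Return the VpcId of the first VPC whose Name tag equals `name`, else None."""
    for vpc in describe_vpcs_response.get("Vpcs", []):
        # Tags arrive as a list of {"Key": ..., "Value": ...} dictionaries
        tags = {tag["Key"]: tag["Value"] for tag in vpc.get("Tags", [])}
        if tags.get("Name") == name:
            return vpc["VpcId"]
    return None


def lookup_vpc_id(name="WebApp1-VPC"):
    """Query EC2 for the VPC. Needs boto3, credentials and a default Region."""
    import boto3  # imported lazily so find_vpc_id stays usable without boto3

    ec2 = boto3.client("ec2")
    # Assumption: the console name is the value of the VPC's Name tag
    response = ec2.describe_vpcs(
        Filters=[{"Name": "tag:Name", "Values": [name]}]
    )
    return find_vpc_id(response, name)
```

Calling lookup_vpc_id() should return the same value you copied from the console, or None if no VPC carries that Name tag in the current Region.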

3.1 EC2 failure injection

This failure injection will simulate a critical problem with one of the three web servers used by your service.

  1. Navigate to the EC2 console at https://console.aws.amazon.com/ec2 and click Instances in the left pane.

  2. There are three EC2 instances with a name beginning with WebApp1. For these EC2 instances note:

    1. Each has a unique Instance ID
    2. There is one instance in each Availability Zone
    3. All instances are healthy

    [Screenshot: EC2InitialCheck]

  3. Open two more console views in separate tabs or windows: from the left pane, open Target Groups and Auto Scaling Groups, each in its own tab. You now have three console views open

    [Screenshot: NavToTargetGroupAndScalingGroup]

  4. To fail one of the EC2 instances, run one (and only one) of the scripts or programs below, replacing <vpc-id> with your VPC ID as the command-line argument. Choose the language that you set up your environment for.

    Language      Command
    Bash          ./fail_instance.sh <vpc-id>
    Python        python fail_instance.py <vpc-id>
    Java          java -jar app-resiliency-1.0.jar EC2 <vpc-id>
    C#            .\AppResiliency EC2 <vpc-id>
    PowerShell    .\fail_instance.ps1 <vpc-id>
  5. The specific output will vary based on the command used, but will include the ID of the EC2 instance and an indicator of success. Here is the output for the Bash command. Note that CurrentState is shutting-down.

     $ ./fail_instance.sh vpc-04f8541d10ed81c80
     Terminating i-0710435abc631eab3
     {
         "TerminatingInstances": [
             {
                 "CurrentState": {
                     "Code": 32,
                     "Name": "shutting-down"
                 },
                 "InstanceId": "i-0710435abc631eab3",
                 "PreviousState": {
                     "Code": 16,
                     "Name": "running"
                 }
             }
         ]
     }
    
  6. Go to the EC2 Instances console that you already have open (or open a new one at https://console.aws.amazon.com/ec2)

    • Refresh it. (Note: it is usually more efficient to use the refresh button in the console than to refresh the browser) [Screenshot: RefreshButton]

    • Observe the status of the instance reported by the script. In the screenshot below it is shutting down, as reported by the script, and will ultimately transition to terminated.

      [Screenshot: EC2ShuttingDown]
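
The failure injection in steps 4 and 5 can be sketched in Python as follows. This is an illustration of the idea only, not the lab's fail_instance.py, whose actual selection logic may differ. It assumes boto3 is installed and credentials are configured. The helper names pick_running_instance, summarize_termination and fail_instance are hypothetical.

```python
def pick_running_instance(describe_instances_response):
    """Return the ID of the first running instance in the response, else None."""
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance.get("State", {}).get("Name") == "running":
                return instance["InstanceId"]
    return None


def summarize_termination(terminate_response):
    """Return (instance_id, previous_state, current_state) for each instance."""
    return [
        (
            item["InstanceId"],
            item["PreviousState"]["Name"],
            item["CurrentState"]["Name"],
        )
        for item in terminate_response.get("TerminatingInstances", [])
    ]


def fail_instance(vpc_id):
    """Terminate one running instance in the VPC. This is destructive."""
    import boto3  # imported lazily so the helpers above work without boto3

    ec2 = boto3.client("ec2")
    response = ec2.describe_instances(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_id = pick_running_instance(response)
    if instance_id is None:
        return []  # nothing to terminate
    print("Terminating", instance_id)
    result = ec2.terminate_instances(InstanceIds=[instance_id])
    return summarize_termination(result)
```

Applied to the Bash output shown in step 5, summarize_termination yields one tuple recording the transition from running to shutting-down for the terminated instance.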

3.2 System response to EC2 instance failure

Watch how the service responds. Note how AWS systems help maintain service availability. Check whether there is any period of unavailability and, if so, how long it lasts.

3.2.1 System availability

Refresh the service website several times. Note the following:

  • Website remains available
  • The remaining two EC2 instances are handling all the requests (as per the displayed instance_id)
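
Refreshing by hand gives a rough impression of availability. To measure how long any unavailability lasts, a small probe can help. The following is a minimal sketch using only the Python standard library. It treats an HTTP 200 response as "available", which is an assumption about the service website, and the function names probe, watch and longest_outage are hypothetical.

```python
import http.client
import time
import urllib.request


def probe(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are OSError subclasses
        return False


def watch(url, duration=60.0, interval=1.0):
    """Probe the URL repeatedly and return a list of (timestamp, ok) samples."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe(url)))
        time.sleep(interval)
    return samples


def longest_outage(samples):
    """Return the longest run of failures, in seconds.

    An outage runs from its first failed sample to the next successful
    sample, or to the last sample if the service never recovered.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

Start watch with your WebsiteURL value just before running the failure script, then pass the samples to longest_outage. A result of 0.0 means every probe succeeded. The resolution is limited by the probe interval.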

3.2.2 Load balancing

Load balancing ensures service requests are not routed to unhealthy resources, such as the failed EC2 instance.

  1. Go to the Target Groups console you already have open (or open a new one from the left pane of the EC2 console)

    • If there is more than one target group, select the one whose name begins with WebAp
  2. Click on the Targets tab and observe:

    • Status of the instances in the group. The load balancer will only send traffic to healthy instances.

    • When Auto Scaling launches a new instance, it is automatically added to the load balancer target group.

    • In the screenshot below, the unhealthy instance is the newly added one. The load balancer will not send traffic to it until it has finished initializing. It will ultimately transition to healthy and then start receiving traffic.

    • Note that the new instance was started in the same Availability Zone as the failed one. Amazon EC2 Auto Scaling automatically maintains balance across all of the Availability Zones that you specify.

      [Screenshot: TargetGroups]

  3. From the same console, now click on the Monitoring tab and view metrics such as Unhealthy hosts and Healthy hosts

    [Screenshot: TargetGroupsMonitoring]
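
The target status seen on the Targets tab can also be read through the API. The following is a minimal sketch, assuming boto3 is installed and that you have copied the ARN of your target group from the console. The function names are hypothetical. The response shape follows the Elastic Load Balancing describe_target_health call.

```python
from collections import Counter


def count_target_states(target_health_response):
    """Count targets per health state, for example healthy or draining."""
    return Counter(
        description["TargetHealth"]["State"]
        for description in target_health_response.get("TargetHealthDescriptions", [])
    )


def targets_in_state(target_health_response, state):
    """Return the IDs of all targets currently in the given state."""
    return [
        description["Target"]["Id"]
        for description in target_health_response.get("TargetHealthDescriptions", [])
        if description["TargetHealth"]["State"] == state
    ]


def fetch_target_health(target_group_arn):
    """Fetch current target health. Needs boto3 and configured credentials."""
    import boto3  # imported lazily so the helpers above work without boto3

    elbv2 = boto3.client("elbv2")
    return elbv2.describe_target_health(TargetGroupArn=target_group_arn)
```

Calling count_target_states on the result of fetch_target_health a few times after the failure should show the healthy count dip and then return to three once the replacement instance passes its health checks.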

3.2.3 Auto scaling

Auto scaling ensures we have the capacity necessary to meet customer demand. The auto scaling for this service is a simple configuration that ensures at least three EC2 instances are running. More complex configurations that respond to CPU or network load are also possible using AWS.

  1. Go to the Auto Scaling Groups console you already have open (or open a new one from the left pane of the EC2 console)

    • If there is more than one Auto Scaling group, select the one whose name starts with WebApp1
  2. Click on the Activity History tab and observe:

    • The screenshot below shows that instances were successfully started at 17:25

    • At 19:29 the instance targeted by the script was put into the draining state and a new instance ending in …62640 was started, but was still initializing. The new instance will ultimately transition to Successful status

      [Screenshot: AutoScalingGroup]
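
The activity history and the spread of instances across Availability Zones can be summarized from the Auto Scaling API as well. The following is a minimal sketch, assuming boto3 is installed and that you pass the full name of your Auto Scaling group as shown in the console. The function names are hypothetical. The response shapes follow the describe_scaling_activities and describe_auto_scaling_groups calls.

```python
from collections import Counter


def activity_lines(scaling_activities_response):
    """Return one 'StatusCode: Description' line per scaling activity."""
    return [
        f'{activity["StatusCode"]}: {activity["Description"]}'
        for activity in scaling_activities_response.get("Activities", [])
    ]


def in_service_per_zone(auto_scaling_groups_response):
    """Count InService instances per Availability Zone across the groups."""
    counts = Counter()
    for group in auto_scaling_groups_response.get("AutoScalingGroups", []):
        for instance in group.get("Instances", []):
            if instance.get("LifecycleState") == "InService":
                counts[instance["AvailabilityZone"]] += 1
    return counts


def fetch_group_status(group_name):
    """Fetch activities and group details. Needs boto3 and credentials."""
    import boto3  # imported lazily so the helpers above work without boto3

    autoscaling = boto3.client("autoscaling")
    activities = autoscaling.describe_scaling_activities(
        AutoScalingGroupName=group_name
    )
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    return activity_lines(activities), in_service_per_zone(groups)
```

Once the replacement instance is in service, in_service_per_zone should again report one instance in each of the three Availability Zones.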

Draining allows existing, in-flight requests made to an instance to complete, but the load balancer will not send any new requests to that instance. Learn more: After the lab, see this blog post for more information on draining.

Learn more: After the lab, see Auto Scaling Groups to learn more about how Auto Scaling groups are set up and how they distribute instances, and Dynamic Scaling for Amazon EC2 Auto Scaling for more details on setting up auto scaling that responds to demand.

3.2.4 EC2 failure injection - conclusion

Deploying multiple servers behind Elastic Load Balancing enables a service to withstand the loss of a server with no disruption to availability, because user traffic is automatically routed to the healthy servers. Amazon EC2 Auto Scaling ensures unhealthy hosts are removed and replaced with healthy ones to maintain high availability.

Availability Zones (AZs) are isolated sets of resources within a Region, each with redundant power, networking, and connectivity, housed in separate facilities. Each Availability Zone is isolated, but the Availability Zones in a Region are connected through low-latency links. AWS provides you with the flexibility to place instances and store data across multiple Availability Zones within each AWS Region for high resiliency.

Learn more: After the lab, see this whitepaper on Regions and Availability Zones.