Test Resiliency Using EC2 Failure Injection

4.1 EC2 failure injection

This failure injection will simulate a critical problem with one of the three web servers used by your service.

In Chaos Engineering we always start with a hypothesis. For this experiment the hypothesis is:

Hypothesis: If one EC2 instance dies, then availability will not be impacted

  1. [Optional] Before starting, view the deployment state machine in the AWS Step Functions console to verify that the deployment has reached the stage where you can start testing:

    • single region: WaitForWebApp shows completed (green)
    • multi region: WaitForWebApp1 shows completed (green)
  2. Navigate to the EC2 console at http://console.aws.amazon.com/ec2 and click Instances in the left pane.

  3. There are three EC2 instances with a name beginning with WebServerforResiliency. For these EC2 instances note:

    1. Each has a unique Instance ID
    2. All instances are running and healthy
    3. There is one instance per Availability Zone

    EC2InitialCheck

  4. Open two more console views in separate tabs or windows: from the left pane, open Target Groups and Auto Scaling Groups, each in its own tab. You now have three console views open.

    NavToTargetGroupAndScalingGroup

  5. To fail one of the EC2 instances, run one (and only one) of the scripts or programs below, replacing <vpc-id> with your VPC ID as the command line argument. Choose the language that you set up your environment for.

    Language      Command
    Bash          ./fail_instance.sh <vpc-id>
    Python        python3 fail_instance.py <vpc-id>
    Java          java -jar app-resiliency-1.0.jar EC2 <vpc-id>
    C#            .\AppResiliency EC2 <vpc-id>
    PowerShell    .\fail_instance.ps1 <vpc-id>
  6. The specific output will vary based on the command used, but it will include the ID of the EC2 instance and an indicator of success. Here is the output for the Bash command. Note that the CurrentState is shutting-down.

     $ ./fail_instance.sh vpc-04f8541d10ed81c80
     Terminating i-0710435abc631eab3
     {
         "TerminatingInstances": [
             {
                 "CurrentState": {
                     "Code": 32,
                     "Name": "shutting-down"
                 },
                 "InstanceId": "i-0710435abc631eab3",
                 "PreviousState": {
                     "Code": 16,
                     "Name": "running"
                 }
             }
         ]
     }
    
  7. Go to the EC2 Instances console which you already have open (or open a new one)

    • Refresh it. (Note: it is usually more efficient to use the refresh button in the console than to refresh the browser.)

      RefreshButton

    • Observe the status of the instance reported by the script. In the screenshot below it is shutting down, as reported by the script, and will ultimately transition to terminated.

      EC2ShuttingDown
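
The source of the lab's scripts is not reproduced here. As a rough illustration of what a failure-injection script of this kind does, the following Python sketch (not the lab's actual fail_instance.py) lists the running instances in a VPC and terminates one of them. The function names are hypothetical, the ec2 argument is assumed to be a boto3 EC2 client, and pagination of describe_instances is ignored on the assumption that the VPC holds only a handful of instances.

```python
import random


def running_instance_ids(ec2, vpc_id):
    """Return the IDs of all running instances in the given VPC.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Only the first page of results is read (fine for a few instances).
    """
    response = ec2.describe_instances(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


def fail_one_instance(ec2, vpc_id, choose=random.choice):
    """Terminate one running instance in the VPC; return its new state name."""
    instance_ids = running_instance_ids(ec2, vpc_id)
    if not instance_ids:
        raise RuntimeError(f"No running instances found in {vpc_id}")
    victim = choose(instance_ids)  # random by default
    print(f"Terminating {victim}")
    response = ec2.terminate_instances(InstanceIds=[victim])
    return response["TerminatingInstances"][0]["CurrentState"]["Name"]
```

With boto3 installed and credentials configured, calling fail_one_instance(boto3.client("ec2"), "<vpc-id>") would be expected to return shutting-down, matching the CurrentState in the sample output above. Passing the client in as an argument keeps the logic testable without touching a real account.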

4.2 System response to EC2 instance failure

Watch how the service responds. Note how AWS systems help maintain service availability. Test whether there is any loss of availability and, if so, for how long.

4.2.1 System availability

Refresh the service website several times. Note the following:

  • Website remains available
  • The remaining two EC2 instances are handling all the requests (as per the displayed instance_id)
  • Also note the availability_zone value when you refresh. You can see that requests are being handled by the EC2 instances in only two Availability Zones, while the EC2 instance in the third zone is being replaced
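
Refreshing by hand can also be approximated with a small polling loop. The sketch below is an illustration only: it uses just the Python standard library, treats any HTTP 200 response as "available", and the function names, attempt count, and interval are assumptions rather than anything the lab prescribes.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5):
    """Return the HTTP status code of a GET request, or None if it failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except OSError:
        return None  # connection refused, DNS failure, timeout, ...


def measure_availability(url, attempts=30, interval=2.0,
                         fetch=fetch_status, sleep=time.sleep):
    """Poll the URL `attempts` times; return (successes, failures)."""
    successes = failures = 0
    for attempt in range(attempts):
        if fetch(url) == 200:
            successes += 1
        else:
            failures += 1
        if attempt < attempts - 1:
            sleep(interval)  # no need to wait after the final attempt
    return successes, failures
```

Running measure_availability with the service website URL while the instance is being replaced gives a rough count of failed requests. If the hypothesis holds, the failure count should be zero.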

Availability can also be measured programmatically using Amazon CloudWatch Synthetics canaries. A canary has already been created as part of the WebServersforResiliencyTesting stack. It is configured to send a simple GET request to the application endpoint at one-minute intervals.

  • Go to the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation
  • Click on the WebServersforResiliencyTesting stack
  • Click on the Outputs tab
  • Open the URL for WorkloadAvailability in a new window
  • View canary run data to see if there are any failed runs due to workload unavailability
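
The same canary run data can be read through the API. The following is a minimal sketch, assuming synthetics is a boto3 CloudWatch Synthetics client (boto3.client("synthetics")) and that you have looked up the canary's name in the console; the function name is hypothetical, and only a single page of results is read.

```python
def summarize_canary_runs(synthetics, canary_name, max_results=50):
    """Count recent canary runs by state (e.g. PASSED, FAILED, RUNNING).

    `synthetics` is assumed to be a boto3 Synthetics client.
    `canary_name` must be looked up; the lab text does not state it.
    """
    response = synthetics.get_canary_runs(
        Name=canary_name, MaxResults=max_results
    )
    counts = {}
    for run in response["CanaryRuns"]:
        state = run["Status"]["State"]
        counts[state] = counts.get(state, 0) + 1
    return counts
```

Any FAILED count during the experiment window would indicate that the canary saw the workload as unavailable for at least one of its one-minute checks.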

4.2.2 Load balancing

Load balancing ensures service requests are not routed to unhealthy resources, such as the failed EC2 instance.

  1. Go to the Target Groups console you already have open (or open a new one)

    • If there is more than one target group, select the one with the Load Balancer named ResiliencyTestLoadBalancer
  2. Click on the Targets tab and observe:

    • Status of the instances in the group. The load balancer will only send traffic to healthy instances.

    • When Auto Scaling launches a new instance, it is automatically added to the load balancer target group.

  3. From the same console, now click on the Monitoring tab and view metrics such as Unhealthy hosts and Healthy hosts.
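
The target status shown on the Targets tab can also be queried from code. This is a sketch under assumptions: elbv2 is taken to be a boto3 Elastic Load Balancing v2 client (boto3.client("elbv2")), the target group ARN is something you copy from the console, and the helper names are made up for illustration.

```python
def target_states(elbv2, target_group_arn):
    """Map each registered target ID to its health state.

    `elbv2` is assumed to be a boto3 ELBv2 client. States include
    values such as "initial", "healthy", "unhealthy", and "draining".
    """
    response = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    return {
        description["Target"]["Id"]: description["TargetHealth"]["State"]
        for description in response["TargetHealthDescriptions"]
    }


def healthy_targets(states):
    """Return the sorted IDs of targets currently reported as healthy."""
    return sorted(
        target_id for target_id, state in states.items() if state == "healthy"
    )
```

Calling target_states repeatedly during the experiment should show the terminated instance leaving the group and a replacement instance appearing, first as initial and then as healthy.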

4.2.3 Auto scaling

Auto scaling ensures we have the capacity necessary to meet customer demand. The auto scaling for this service is a simple configuration that ensures at least three EC2 instances are running. More complex configurations in response to CPU or network load are also possible using AWS.

  1. Go to the Auto Scaling Groups console you already have open (or open a new one)

    • If there is more than one auto scaling group, select the one with the name that starts with WebServersforResiliencyTesting
  2. Click on the Activity tab and observe the sequence of events


Auto Scaling helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your workload.

  • Learn more: After the lab, see Auto Scaling Groups to learn more about how Auto Scaling groups are set up and how they distribute instances
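
The sequence of events on the Activity tab is also available through the API. The sketch below assumes autoscaling is a boto3 Auto Scaling client (boto3.client("autoscaling")) and that you pass the full name of the group, which the lab only describes as starting with WebServersforResiliencyTesting; the function name is hypothetical.

```python
def recent_activities(autoscaling, group_name, max_records=10):
    """Return (start time, status, description) for recent scaling activities.

    `autoscaling` is assumed to be a boto3 Auto Scaling client.
    The order of the entries is whatever the API returns.
    """
    response = autoscaling.describe_scaling_activities(
        AutoScalingGroupName=group_name, MaxRecords=max_records
    )
    return [
        (activity["StartTime"], activity["StatusCode"], activity["Description"])
        for activity in response["Activities"]
    ]
```

After the failure injection, the returned entries would be expected to include one activity describing the termination of the failed instance and one describing the launch of its replacement.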

4.3 [Optional] EC2 failure injection using AWS Fault Injection Simulator (FIS)

You can also use AWS FIS to simulate failure of an EC2 instance. This step is optional. You will get experience using AWS FIS later during the RDS failure experiment and application failure experiments.

If you are running this lab as part of a live workshop, skip this step and come back to it later if you wish.

Click here for instructions to simulate EC2 instance failure using AWS FIS:

4.4 EC2 failure injection - conclusion

By deploying multiple servers and using Elastic Load Balancing, the workload can suffer the loss of a server without any disruption to availability. This is because user traffic is automatically routed to the healthy servers, and Amazon EC2 Auto Scaling ensures that unhealthy hosts are removed and replaced with healthy ones to maintain high availability.

Our hypothesis is confirmed:

Hypothesis: If one EC2 instance dies, then availability will not be impacted