Test Resiliency Using EC2 Failure Injection

4.1 EC2 failure injection

This failure injection will simulate a critical problem with one of the three web servers used by your service.

  1. Before starting, view the deployment state machine in the AWS Step Functions console to verify the deployment has reached the stage where you can start testing:

    • single region: WaitForWebApp shows completed (green)
    • multi region: WaitForWebApp1 shows completed (green)
  2. Navigate to the EC2 console at http://console.aws.amazon.com/ec2 and click Instances in the left pane.

  3. There are three EC2 instances with a name beginning with WebServerforResiliency. For these EC2 instances note:

    1. Each has a unique Instance ID
    2. There is one instance per Availability Zone
    3. All instances are healthy

    [Screenshot: EC2InitialCheck]

  4. Open two more console views in separate tabs or windows. From the left pane, open Target Groups and Auto Scaling Groups in separate tabs. You now have three console views open.

    [Screenshot: NavToTargetGroupAndScalingGroup]

  5. To fail one of the EC2 instances, run one (and only one) of the scripts or programs below, passing the VPC ID as the command-line argument in place of <vpc-id>. Choose the language that you set up your environment for.

    Language     Command
    Bash         ./fail_instance.sh <vpc-id>
    Python       python fail_instance.py <vpc-id>
    Java         java -jar app-resiliency-1.0.jar EC2 <vpc-id>
    C#           .\AppResiliency EC2 <vpc-id>
    PowerShell   .\fail_instance.ps1 <vpc-id>
  6. The specific output will vary based on the command used, but will include the ID of the EC2 instance and an indicator of success. Here is the output for the Bash command. Note that CurrentState is shutting-down.

     $ ./fail_instance.sh vpc-04f8541d10ed81c80
     Terminating i-0710435abc631eab3
     {
         "TerminatingInstances": [
             {
                 "CurrentState": {
                     "Code": 32,
                     "Name": "shutting-down"
                 },
                 "InstanceId": "i-0710435abc631eab3",
                 "PreviousState": {
                     "Code": 16,
                     "Name": "running"
                 }
             }
         ]
     }
    
  7. Go to the EC2 Instances console that you already have open (or open a new one).

    • Refresh it. (Note: it is usually more efficient to use the refresh button in the console than to refresh the browser.)

      [Screenshot: RefreshButton]

    • Observe the status of the instance reported by the script. In the screen cap below it is shutting-down, as reported by the script, and it will ultimately transition to terminated.

      [Screenshot: EC2ShuttingDown]
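
The lab does not show the source of the failure-injection scripts, so the following is only a minimal sketch of what a script like fail_instance.py might do, assuming it finds a running instance in the given VPC and terminates it. The function names are hypothetical; the EC2 calls (describe_instances, terminate_instances) are standard boto3 client methods. The client is passed in as an argument so the logic can be exercised without AWS access.

```python
def find_running_instances(ec2, vpc_id):
    """Return the IDs of running instances in the given VPC.

    `ec2` is assumed to be a boto3 EC2 client (or a compatible stub).
    Pagination is ignored, which is only reasonable for a small lab VPC.
    """
    response = ec2.describe_instances(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


def fail_one_instance(ec2, vpc_id):
    """Terminate the first running instance found and return the API response."""
    instance_ids = find_running_instances(ec2, vpc_id)
    if not instance_ids:
        raise RuntimeError(f"No running instances found in {vpc_id}")
    target = instance_ids[0]
    print(f"Terminating {target}")
    # The response carries a "TerminatingInstances" list with the previous
    # and current state of each instance, as in the Bash output above.
    return ec2.terminate_instances(InstanceIds=[target])
```

Against a real account, the entry point would be something like fail_one_instance(boto3.client("ec2"), "<vpc-id>"), assuming boto3 is installed and credentials are configured.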

4.2 System response to EC2 instance failure

Watch how the service responds. Note how AWS systems help maintain service availability. Test whether there is any period of unavailability and, if so, how long it lasts.

4.2.1 System availability

Refresh the service website several times. Note the following:

  • Website remains available
  • The remaining two EC2 instances are handling all the requests (as per the displayed instance_id)
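
Refreshing by hand gives only a rough impression. As an optional aid, the sketch below polls a URL at a fixed interval and reports how many checks failed and the longest run of consecutive failures. It is an illustration, not part of the lab: the helper names are hypothetical and the service URL must be supplied by you.

```python
import time
import urllib.request


def fetch_ok(url, timeout=5):
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError, HTTPError and timeouts are OSError subclasses
        return False


def probe(check, attempts, delay=1.0, sleep=time.sleep):
    """Call check() `attempts` times, pausing `delay` seconds between calls.

    Returns (failures, longest_streak), where longest_streak is the largest
    number of consecutive failed checks.
    """
    failures = 0
    streak = 0
    longest = 0
    for attempt in range(attempts):
        if check():
            streak = 0
        else:
            failures += 1
            streak += 1
            longest = max(longest, streak)
        if attempt < attempts - 1:
            sleep(delay)
    return failures, longest
```

For example, probe(lambda: fetch_ok("<service-website-url>"), attempts=60) checks once per second for a minute. Multiplying the longest streak by the delay gives an approximate upper bound on the longest outage observed.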

4.2.2 Load balancing

Load balancing ensures service requests are not routed to unhealthy resources, such as the failed EC2 instance.

  1. Go to the Target Groups console that you already have open (or open a new one).

    • If there is more than one target group, select the one with the Load Balancer named ResiliencyTestLoadBalancer
  2. Click on the Targets tab and observe:

    • Status of the instances in the group. The load balancer will only send traffic to healthy instances.

    • When Auto Scaling launches a new instance, it is automatically added to the load balancer target group.

    • In the screen cap below, the unhealthy instance is the newly added one. The load balancer will not send traffic to it until it has finished initializing. It will ultimately transition to healthy and then start receiving traffic.

    • Note the new instance was started in the same Availability Zone as the failed one. Amazon EC2 Auto Scaling automatically maintains balance across all of the Availability Zones that you specify.

      [Screenshot: TargetGroups]

  3. From the same console, now click on the Monitoring tab and view metrics such as Unhealthy hosts and Healthy hosts.

    [Screenshot: TargetGroupsMonitoring]
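
The target status shown in the Targets tab can also be read programmatically. The sketch below groups target instance IDs by health state using the Elastic Load Balancing v2 describe_target_health call; the function name is hypothetical, and the target group ARN is something you would look up in your own account. The client is passed in so the grouping logic can be tried with a stub.

```python
from collections import defaultdict


def targets_by_state(elbv2, target_group_arn):
    """Group target IDs by health state (for example healthy, initial, draining).

    `elbv2` is assumed to be a boto3 "elbv2" client or a compatible stub.
    """
    response = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    grouped = defaultdict(list)
    for description in response["TargetHealthDescriptions"]:
        state = description["TargetHealth"]["State"]
        grouped[state].append(description["Target"]["Id"])
    return dict(grouped)
```

Shortly after the failure injection, one would expect two instance IDs under healthy and the replacement instance under a non-healthy state until it passes its health checks.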

4.2.3 Auto scaling

Auto scaling ensures we have the capacity necessary to meet customer demand. The auto scaling for this service is a simple configuration that ensures at least three EC2 instances are running. More complex configurations that respond to CPU or network load are also possible using AWS.

  1. Go to the Auto Scaling Groups console that you already have open (or open a new one).

    • If there is more than one auto scaling group, select the one with the name that starts with WebServersforResiliencyTesting
  2. Click on the Activity History tab and observe:

    • The screen cap below shows that all three instances were successfully started at 17:25

    • At 19:29 the instance targeted by the script was put in draining state, and a new instance ending in …62640 was started but was still initializing. The new instance will ultimately transition to Successful status.

      [Screenshot: AutoScalingGroup]
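
The same activity history is available through the Auto Scaling API. The sketch below lists recent activities for a group using describe_scaling_activities; the function name is hypothetical and the group name must match the one in your account. As with the console view, each activity carries a start time, a status code, and a description.

```python
def recent_activities(autoscaling, group_name, limit=10):
    """Return (start_time, status_code, description) for recent scaling activities.

    `autoscaling` is assumed to be a boto3 "autoscaling" client or a stub.
    """
    response = autoscaling.describe_scaling_activities(
        AutoScalingGroupName=group_name,
        MaxRecords=limit,
    )
    return [
        (activity["StartTime"], activity["StatusCode"], activity["Description"])
        for activity in response["Activities"]
    ]
```

Calling this a few times after the failure injection should show the terminated instance's activity and the replacement launch progressing toward the Successful status.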

Draining allows existing, in-flight requests made to an instance to complete, but the load balancer will not send any new requests to the instance. Learn more: After the lab, see this blog post for more information on draining.

Learn more: After the lab, see Auto Scaling Groups to learn more about how auto scaling groups are set up and how they distribute instances, and Dynamic Scaling for Amazon EC2 Auto Scaling for more details on setting up auto scaling that responds to demand.

4.2.4 EC2 failure injection - conclusion

Deploying multiple servers behind Elastic Load Balancing enables a service to suffer the loss of a server with no availability disruption, because user traffic is automatically routed to the healthy servers. Amazon EC2 Auto Scaling ensures unhealthy hosts are removed and replaced with healthy ones to maintain high availability.