Lab complete!
Now that you have completed this lab, make sure to update your Well-Architected review if you have implemented these changes in your workload.
Click here to access the Well-Architected Tool
This failure injection will simulate a critical failure of the web server running on the EC2 instances using FIS.
In Chaos Engineering we always start with a hypothesis. For this experiment the hypothesis is:
Hypothesis: If the server process on a single instance is killed, then availability will not be impacted
WaitForWebApp shows completed (green)
WaitForWebApp1 shows completed (green)
Navigate to the FIS console at http://console.aws.amazon.com/fis and click Experiment templates in the left pane.
Click on Create experiment template to define the type of failure you want to inject.
Enter "Experiment template for application resiliency testing" for Description and "App-resiliency-testing" for Name. For IAM role select WALab-FIS-role.
Scroll down to Actions and click Add action.
Enter "kill-webserver" for the Name. Under Action type select aws:ssm:send-command/AWSFIS-Run-Kill-Process. Under documentParameters enter {"ProcessName":"python3","Signal":"SIGKILL"}. For duration select Minutes and then enter 2 in the text box next to it. Click Save.
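For readers who prefer to see the action as data, the console settings above correspond roughly to the following action entry in an FIS experiment template. This is a minimal sketch, not the lab's own tooling: the helper name build_kill_process_action and the region "us-east-2" are assumptions made for illustration, and the target name is the console default mentioned in the next step.

```python
import json


def build_kill_process_action(region, target_name="Instances-Target-1"):
    """Return an FIS action entry mirroring the console settings (hypothetical helper)."""
    document_parameters = {"ProcessName": "python3", "Signal": "SIGKILL"}
    return {
        "kill-webserver": {
            "actionId": "aws:ssm:send-command",
            "parameters": {
                # AWS-owned SSM documents have an empty account field in the ARN.
                "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-Kill-Process",
                # FIS expects the document parameters as a JSON string, not a dict.
                "documentParameters": json.dumps(document_parameters),
                # ISO 8601 duration for the 2 minutes entered in the console.
                "duration": "PT2M",
            },
            # "Instances" is the target key used by aws:ssm:send-command.
            "targets": {"Instances": target_name},
        }
    }


# The region is an assumption; use the region where the lab is deployed.
action = build_kill_process_action("us-east-2")
print(json.dumps(action, indent=2))
```

Note that documentParameters is serialized to a string, which is why the console asks for the JSON text directly.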
Scroll down to Targets and click Edit next to Instances-Target-1 (aws:ec2:instance).
Under Target method, select Resource tags and filters. Select Count for Selection mode and enter 1 under Number of resources. This ensures that FIS will only kill the web server on one instance.
Scroll down to Resource tags and click Add new tag. Enter "Workshop" for Key and "AWSWellArchitectedReliability300-ResiliencyofEC2RDSandS3" for Value. This is the same tag that is on the EC2 instances used in this lab.
For Resource filters click Add new filter. Enter "State.Name" for Attribute path and "running" for Values. This ensures FIS targets a running instance. Click Save.
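The target settings can also be written down as the data structure FIS stores for a target. The sketch below assumes the standard FIS template layout; the helper name build_instance_target is hypothetical, while the tag, filter, and count values are the ones entered in the console steps.

```python
import json


def build_instance_target(tag_key, tag_value, count=1):
    """Return an FIS target entry for tagged, running EC2 instances (hypothetical helper)."""
    return {
        "Instances-Target-1": {
            "resourceType": "aws:ec2:instance",
            # Only instances carrying this tag are candidates.
            "resourceTags": {tag_key: tag_value},
            # Restrict candidates to instances whose state is "running".
            "filters": [{"path": "State.Name", "values": ["running"]}],
            # Count mode with 1 resource: FIS picks a single matching instance.
            "selectionMode": f"COUNT({count})",
        }
    }


target = build_instance_target(
    "Workshop",
    "AWSWellArchitectedReliability300-ResiliencyofEC2RDSandS3",
)
print(json.dumps(target, indent=2))
```

Because the selection mode is COUNT(1), the instance that receives the kill command is chosen by FIS from the matching set, so repeated runs may hit different Availability Zones.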
Under Stop condition, you can choose a CloudWatch alarm that stops the experiment when a threshold is breached. For this lab, you can leave this blank.
Click Create experiment template.
In the warning pop-up, confirm that you want to create the experiment template without a stop condition by entering "create" in the text box. Click Create experiment template.
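The template created in the console can be approximated as a request for the FIS CreateExperimentTemplate API. The following is a sketch under stated assumptions: build_template_params is a hypothetical helper, the account ID in the role ARN is a placeholder, and the optional alarm ARN illustrates the stop condition that this lab leaves blank. The actions and targets arguments are the dictionaries describing the kill-webserver action and the Instances-Target-1 target.

```python
import json
import uuid


def build_template_params(role_arn, actions, targets, alarm_arn=None):
    """Assemble keyword arguments for FIS CreateExperimentTemplate (hypothetical helper)."""
    if alarm_arn is None:
        # Equivalent to leaving Stop condition blank in the console.
        stop_conditions = [{"source": "none"}]
    else:
        # Stop the experiment if this CloudWatch alarm goes into the ALARM state.
        stop_conditions = [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Experiment template for application resiliency testing",
        "roleArn": role_arn,
        "stopConditions": stop_conditions,
        "targets": targets,
        "actions": actions,
        # Assumption: the console Name field is stored as a Name tag.
        "tags": {"Name": "App-resiliency-testing"},
    }


params = build_template_params(
    role_arn="arn:aws:iam::123456789012:role/WALab-FIS-role",  # placeholder account ID
    actions={},  # fill in with the kill-webserver action entry
    targets={},  # fill in with the Instances-Target-1 target entry
)
print(json.dumps(params["stopConditions"]))

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("fis").create_experiment_template(**params)
```

Outside of a lab, supplying an alarm ARN is usually the safer choice, since it lets FIS halt the experiment automatically if the workload degrades more than expected.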
Click on Experiment templates from the menu on the left.
Select the experiment template App-resiliency-testing and click Actions. Select Start experiment.
You can choose to add a tag to the experiment if you wish to do so.
Click Start experiment.
In the pop-up, type "start" and click Start experiment.
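Starting the experiment and waiting for it to finish can also be scripted. The sketch below assumes a boto3 FIS client is passed in as fis; run_experiment and the polling interval are illustrative choices, not part of the lab. The client is injected so the function can be exercised with a stub when boto3 or AWS credentials are not available.

```python
import time
import uuid

# Experiment states after which no further progress is expected.
TERMINAL_STATES = {"completed", "stopped", "failed"}


def run_experiment(fis, template_id, poll_seconds=10, sleep=time.sleep):
    """Start an FIS experiment and poll until it reaches a terminal state.

    `fis` is expected to behave like boto3.client("fis").
    Returns a tuple of (experiment id, final status).
    """
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    while True:
        state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATES:
            return experiment_id, state["status"]
        sleep(poll_seconds)


# Example usage (requires boto3 and AWS credentials; the template ID is a placeholder):
#   import boto3
#   run_experiment(boto3.client("fis"), "EXT-your-template-id")
```

A final status of completed only means the fault was injected as configured; whether the hypothesis held still has to be judged from the service's behavior.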
The instances launched as part of this lab run simple Python web servers. This experiment uses AWS Systems Manager to run a command on the selected instance. In this workshop, the command is the AWSFIS-Run-Kill-Process document, which sends SIGKILL to the python3 process. When the experiment runs, the web server process is killed on one of the instances, which can then no longer handle requests. Watch how the service responds and note how AWS services help maintain availability. Check whether the service becomes unavailable at any point and, if so, for how long.
Refresh the service website several times. Note the instance_id and availability_zone values each time you refresh. You can see that requests are being handled by the EC2 instances in only two Availability Zones, while the EC2 instance in the third zone is being replaced. This can also be verified by viewing the canary run data.
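Refreshing by hand gives an impression of availability; a small polling script can put a number on it. This is a minimal sketch using only the Python standard library. The function names, the one-second interval, and the choice to count only HTTP 200 responses as successes are assumptions for illustration, and the URL in the usage comment is a placeholder for your own service website.

```python
import http.client
import time
import urllib.request


def fetch_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP error responses.
        return None


def measure_availability(url, attempts, interval=1.0, fetch=fetch_status, sleep=time.sleep):
    """Request url `attempts` times and return the fraction answered with HTTP 200."""
    successes = 0
    for attempt in range(attempts):
        if fetch(url) == 200:
            successes += 1
        if attempt < attempts - 1:
            sleep(interval)
    return successes / attempts


# Example usage while the experiment is running (placeholder URL):
#   ratio = measure_availability("http://your-load-balancer-dns-name/", attempts=120)
#   print(f"availability: {ratio:.1%}")
```

A result of 1.0 over the experiment window supports the hypothesis; anything lower indicates requests that failed, for example while the load balancer was still routing to the instance whose server process had been killed.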
WebServersforResiliencyTesting stack
Load balancing and Auto Scaling work here much the way they did for the EC2 failure injection experiment.
In this section, you simulated an application-level failure in which the web server process running the application was killed using FIS and SSM. Although there was no infrastructure failure, your workload was able to detect and correct the issue by replacing the EC2 instance. Deploying multiple servers behind Elastic Load Balancing enables a service to suffer the loss of a server with no availability disruption, because user traffic is automatically routed to the healthy servers. Amazon EC2 Auto Scaling ensures unhealthy hosts are removed and replaced with healthy ones to maintain high availability.
Our hypothesis is confirmed:
Hypothesis: If the server process on a single instance is killed, then availability will not be impacted