Lab complete!
Now that you have completed this lab, make sure to update your Well-Architected review if you have implemented these changes in your workload.
Click here to access the Well-Architected Tool
In the scenario used in the lab, the application has a known issue that is triggered by passing a “bad” query string. When such a request is received, the application crashes on the EC2 instance that handles the request, and that instance becomes unresponsive. The “bad” query string is the parameter bug with a value of true. The development team is aware of this bug and is working on a fix; however, the issue exists today, and customers might accidentally or intentionally trigger it. This is referred to as a “poison pill”: a bug or issue which, when introduced into a system, could compromise the functionality of the system.
Imagine a situation where a customer accidentally triggers the bug, causing the application to shut down on the instance where the request was received.
&bug=true
You should see an Internal Server Error response in the browser, indicating that the application has stopped working as expected on the instance that processed this request.
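As a minimal sketch of how the poison-pill request is formed, the following Python snippet appends bug=true to a customer URL. The host name and the name=Alpha parameter are placeholders assumed for illustration; use a customer URL from your own CloudFormation stack Outputs instead.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_poison_pill(url: str) -> str:
    """Return the URL with the bug=true query parameter appended."""
    parts = urlsplit(url)
    # Keep any query parameters that are already present on the URL.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("bug", "true"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical customer URL (assumption, not the real lab endpoint).
example_url = "http://example-alb.example.com/?name=Alpha"
print(add_poison_pill(example_url))
```

Opening the resulting URL in a browser is equivalent to adding &bug=true by hand.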
At this point, there are 7 healthy instances still available so other customers are not impacted. You can verify this:
Note: If you see a response that says “This site can’t be reached”, please make sure you are using the URL obtained from the outputs section of the CloudFormation stack and not the sample URL provided in this lab guide.
Customer Alpha, not aware of this bug in the application, will retry the request.
This new request is routed to one of the 7 remaining healthy instances. The bug is triggered again and another instance goes down, leaving only 6 healthy instances.
This process continues with customer Alpha retrying requests until all instances are unhealthy.
In this situation, a buggy request made by one customer has taken down every instance on the backend, resulting in complete downtime. The scope of impact is as wide as it can be: 100% of customers are affected.
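The retry cascade described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the fleet has 8 instances (inferred from the 7 that remain healthy after the first failure), every poison-pill request takes down exactly one instance, and the load balancer stops routing to a failed instance before the next retry arrives. The function name is hypothetical.

```python
import random


def simulate_poison_retries(total_instances: int = 8, seed: int = 0) -> list[int]:
    """Return the number of healthy instances left after each poison-pill request."""
    rng = random.Random(seed)
    healthy = set(range(total_instances))
    history = []
    while healthy:
        # The request is routed to one of the instances that is still healthy.
        target = rng.choice(sorted(healthy))
        # The bug crashes the application on that instance.
        healthy.remove(target)
        history.append(len(healthy))
    return history


print(simulate_poison_retries())  # [7, 6, 5, 4, 3, 2, 1, 0]
```

With no isolation between customers, the number of retries needed to cause a full outage equals the number of instances in the fleet.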
You can look at the AvailabilityDashboard to see the impact of the failure introduced by customer Alpha across all customers.
Switch to the tab that has the AvailabilityDashboard opened. (You can also retrieve the URL from the CloudFormation stack Outputs).
You can see that the introduction of the poison pill and the subsequent retries by customer Alpha have impacted all other customers: the canaries for these customers are unable to make successful requests to the workload.
As previously mentioned, the development team is aware of this bug within the application and is working on a fix; however, the fix will not be rolled out for several weeks or months. The team has identified the root cause of the issue and provided a temporary manual fix for it. Whenever this issue is encountered, the Operations team executes the temporary fix to bring the application back up again. They have codified this process into a Systems Manager Document and use Systems Manager to apply the fix to their fleet when outages occur. The Systems Manager Document restarts the application on the selected instances.
Go to the Outputs section of the CloudFormation stack and open the link for “SSMDocument”. This will take you to the Systems Manager console.
Click on Run command, which will open a new tab in the browser.
Scroll down to the Targets section and select Specify instance tags
Enter Workload for the tag key and WALab-shuffle-sharding for the tag value. Click Add.
Scroll down to the Output options section and uncheck the box next to Enable an S3 bucket. This will prevent Systems Manager from writing log files based on the command execution to S3.
Click on Run.
You should see the command execution succeed in a few seconds.
Once the command has finished execution, you can go back to the application and test it to verify it is working as expected by using one of the customer URLs obtained from the CloudFormation stack Outputs.
Review the AvailabilityDashboard to make sure canary requests are succeeding and normal functionality has returned to all customers. You should see that SuccessPercent has returned to 100 for all customers.
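The Run command steps above could also be scripted. The following is a hedged sketch using the boto3 SSM client's send_command call, targeting instances by the Workload tag used in the console steps. The document name is not given in this guide, so it is passed in as a parameter; take the real name from the “SSMDocument” output of your stack. The helper function names are hypothetical. No S3 output bucket is specified, which mirrors unchecking Enable an S3 bucket.

```python
def build_send_command_args(document_name: str) -> dict:
    """Build send_command arguments targeting the lab's tagged instances."""
    return {
        "DocumentName": document_name,
        # Same tag key and value as entered under "Specify instance tags".
        "Targets": [
            {"Key": "tag:Workload", "Values": ["WALab-shuffle-sharding"]}
        ],
    }


def run_fix(document_name: str) -> str:
    """Send the command and return its command ID (requires AWS credentials)."""
    import boto3  # imported here so the helper above works without boto3

    ssm = boto3.client("ssm")
    response = ssm.send_command(**build_send_command_args(document_name))
    return response["Command"]["CommandId"]
```

Calling run_fix with the document name from your stack starts the same command execution as clicking Run in the console; the returned command ID can be used to look up the execution status in Systems Manager.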