Lab complete!
Now that you have completed this lab, make sure to update your Well-Architected review if you have implemented these changes in your workload.
Click here to access the Well-Architected Tool
You will now introduce the poison pill into the workload by including the bug query-string with your requests and see how the updated workload architecture handles it. As in the previous case, imagine that customer Alpha triggered the bug in the application again.
Include the query-string parameter bug with a value of true and make a request as customer Alpha. The modified URL should look like this: http://shuffle-alb-1p2xbmzo541rr-1602891463.us-east-1.elb.amazonaws.com/?name=Alpha&bug=true (substitute your own URL from the CloudFormation stack Outputs).
This should result in an Internal Server Error response in the browser, indicating that the application has stopped working as expected on the instance that processed this request.
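The same URL can be assembled in a script rather than typed by hand. The following is a minimal sketch using only the Python standard library; the helper name `build_request_url` and the example host are assumptions for illustration and are not part of the lab.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_request_url(base_url: str, name: str, bug: bool = False) -> str:
    """Return base_url with a name query parameter and, optionally, bug=true."""
    params = {"name": name}
    if bug:
        params["bug"] = "true"
    scheme, netloc, path, _, _ = urlsplit(base_url)
    # Use "/" when the base URL has no path so the result matches the lab's URL shape.
    return urlunsplit((scheme, netloc, path or "/", urlencode(params), ""))


if __name__ == "__main__":
    # Placeholder host: substitute the URL from your CloudFormation stack Outputs.
    print(build_request_url("http://example-alb.us-east-1.elb.amazonaws.com", "Alpha", bug=True))
```

Passing the resulting string to `urllib.request.urlopen` would send the request; note that an error response such as a 500 or 502 raises `urllib.error.HTTPError` rather than returning normally.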
Just like before, customer Alpha, not aware of this bug in the application, will retry the request. Refresh the page to simulate the retry; the request reaches the other instance in Alpha's shard and triggers the bug there as well.
All requests to this shard now fail because it has no healthy instances left. No matter how many times you refresh the page, you will see a 502 Bad Gateway, showing that customer Alpha is experiencing complete downtime. At this point, the overall capacity of the fleet has decreased from 8 EC2 instances to 6.
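Because of shuffle sharding, the other customers are either unaffected or see only limited impact.
Unavailability is confined to a single shard, so only customer Alpha experiences downtime. As the mapping below shows, Bravo and Hotel each share one worker with Alpha and lose that worker, but each still has one healthy worker; the remaining customers share no workers with Alpha.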
Customer Name | Workers |
---|---|
Alpha | Worker-1 and Worker-2 |
Bravo | Worker-2 and Worker-3 |
Charlie | Worker-3 and Worker-4 |
Delta | Worker-4 and Worker-5 |
Echo | Worker-5 and Worker-6 |
Foxtrot | Worker-6 and Worker-7 |
Golf | Worker-7 and Worker-8 |
Hotel | Worker-8 and Worker-1 |
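The effect of losing Worker-1 and Worker-2 on each customer can be worked out directly from the table. The sketch below is an illustration only: the mapping is copied from the table, while the function name `classify_impact` and the labels "down", "degraded", and "unaffected" are assumptions made here, not terms from the lab.

```python
# Customer-to-worker mapping copied from the table above.
SHARDS = {
    "Alpha": {"Worker-1", "Worker-2"},
    "Bravo": {"Worker-2", "Worker-3"},
    "Charlie": {"Worker-3", "Worker-4"},
    "Delta": {"Worker-4", "Worker-5"},
    "Echo": {"Worker-5", "Worker-6"},
    "Foxtrot": {"Worker-6", "Worker-7"},
    "Golf": {"Worker-7", "Worker-8"},
    "Hotel": {"Worker-8", "Worker-1"},
}


def classify_impact(shards, failed_workers):
    """Label each customer by how many of its workers remain healthy."""
    impact = {}
    for customer, workers in shards.items():
        healthy = workers - failed_workers
        if not healthy:
            impact[customer] = "down"        # no worker left in the shard
        elif healthy != workers:
            impact[customer] = "degraded"    # reduced capacity, still served
        else:
            impact[customer] = "unaffected"
    return impact


if __name__ == "__main__":
    # Workers taken out by customer Alpha's poison pill and retry.
    for customer, status in classify_impact(SHARDS, {"Worker-1", "Worker-2"}).items():
        print(f"{customer}: {status}")
```

Under these assumptions only Alpha is reported as down, Bravo and Hotel as degraded, and the other five customers as unaffected.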
In a shuffle sharded system, the scope of impact of failures can be calculated using the following formula:
The formula can be expanded to calculate the number of unique combinations that can exist given the number of workers and the number of workers per shard, also referred to as shard size. The calculation is performed using factorials.
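Written out, and assuming the scope of impact is taken as one shard's share of all unique shards, the two formulas can be expressed as follows. The symbols $n$ (number of workers) and $r$ (shard size) are chosen here for illustration.

```latex
\[
\text{scope of impact} = \frac{1}{\text{number of unique shards}},
\qquad
\text{number of unique shards} = \binom{n}{r} = \frac{n!}{r!\,(n-r)!}
\]
```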
For example, if there were 100 workers and each shard were assigned a unique combination of 5 workers, there would be 75,287,520 possible shards, so the failure of any 1 shard would impact only about 0.0000013% of customers.
With this shuffle sharded architecture, the scope of impact is further reduced by the combination of Workers used to generate shards. Here, with eight shards, if a customer experiences a problem, the shard hosting them and the Workers mapped to that shard might be impacted. However, that shard represents only a fraction of the overall service. Since this is just a lab, we kept it simple with only eight shards; with more shards, the scope of impact decreases further. Adding more shards requires adding more capacity (more Workers). With a higher number of Workers, a much larger number of unique combinations becomes possible, and the scope of impact of failures shrinks rapidly as the fleet grows.
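The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch that assumes the scope of impact is one divided by the number of unique worker combinations; the function name `scope_of_impact` is hypothetical.

```python
from math import comb


def scope_of_impact(workers: int, shard_size: int) -> float:
    """Fraction of all unique shards that a single shard represents."""
    return 1 / comb(workers, shard_size)


if __name__ == "__main__":
    print(comb(100, 5))                      # 75287520 unique shards
    print(f"{scope_of_impact(100, 5):.7%}")  # 0.0000013%
    print(comb(8, 2))                        # 28 possible pairs of 8 workers
```

For the fleet in this lab, 8 workers with a shard size of 2 allow 28 unique pairs, of which the lab uses 8.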
You can look at the AvailabilityDashboard to see the impact of the failure introduced by customer Alpha across all customers.
Switch to the tab where the AvailabilityDashboard is open. (You can also retrieve the URL from the CloudFormation stack Outputs.)
You can see that the introduction of the poison pill and the subsequent retries by customer Alpha have not impacted any other customer.
Note: This is optional and does not need to be completed if you are planning on tearing down this lab as described in the next section. If you are planning on testing this lab further, please follow the instructions below to fix the application on the EC2 instances.