Impact of failures with shuffle sharding

Break the application

You will now introduce the poison pill into the workload by including the bug query-string with your requests and see how the updated workload architecture handles it. As in the previous case, imagine that customer Alpha triggered the bug in the application again.

  1. Include the query-string bug with a value of true and make a request as customer Alpha. The modified URL should look like this - http://shuffle-alb-1p2xbmzo541rr-1602891463.us-east-1.elb.amazonaws.com/?name=Alpha&bug=true

  2. This should result in an Internal Server Error response on the browser indicating that the application has stopped working as expected on the instance that processed this request

    PoisonPill

  3. Just like before, customer Alpha, not aware of this bug in the application, will retry the request. Refresh the page to simulate this as you did before. This request is routed to the other healthy instance in the shard for customer Alpha. The bug is triggered again and the other instance goes down as well. The entire shard is now affected.

  4. All requests to this shard will now fail because there are no healthy instances in the shard. No matter how many times the page is refreshed, you will see a 502 Bad Gateway for customer Alpha showing that customer Alpha is experiencing complete downtime. At this point, the overall capacity of the fleet has decresed from 4 EC2 instances to 2 EC2 instances.

    502BadGateway

  5. Due to shuffle sharding, all of the remaining customers are unaffected or have limited impact. Send requests as the following customers and refresh each request multiple times. You should notice that all customers will now receive a response, although some customers will only get responses from a single EC2 instance while others get it from 2 different EC2 instances.

  6. The impact is localized to a specific shard and only customer Alpha is affected. Customers that have a shared EC2 instance with customer Alpha will only have 1 EC2 instance available to respond to requests. While this might lead to some degree of degradation for those customers, it is still an improvement over complete downtime. The scope of impact has now been reduced so that only 12.5% of customers are affected by the failure induced by the poison pill. With larger fleet and shard sizes, the number of combinations will increase resulting in customers having different degrees of degradation i.e. some customers will only have a fraction of their overall shard capacity affected instead of complete downtime.

    Customer NameWorkers
    AlphaWorker-1 and Worker-2
    BravoWorker-2 and Worker-3
    CharlieWorker-3 and Worker-4
    DeltaWorker-4 and Worker-5
    EchoWorker-5 and Worker-6
    FoxtrotWorker-6 and Worker-7
    GolfWorker-7 and Worker-8
    HotelWorker-8 and Worker-1

ShuffleShardedFlowBrokenNodes

In a shuffle sharded system, the scope of impact of failures can be calculated using the following formula:

ScopeShuffleSharding-1

The formula can be expanded to calculate the number of unique combinations that can exist given the number of workers and the number of workers per shard, also referred to as shard size. The calculation is performed using factorials.

ScopeShuffleSharding-2

ScopeShuffleSharding-3

For example if there were 100 workers, and we assign a unique combination of 5 workers to a shard, then the failure of any 1 shard will only impact 0.0000013% of customers.

ScopeShuffleSharding-4

ScopeShuffleSharding-5

With this shuffle sharded architecture, the scope of impact is further reduced by the combination of Workers used to generate shards. Here with six shards, if a customer experiences a problem, then the shard hosting them as well as the Workers mapped to that shard might be impacted. However, that shard represents only a fraction of the overall service. Since this is just a lab we kept it simple with only six shards, but with more shards, the scope of impact decreases further. Adding more shards requires adding more capacity (more workers). With higher number of Workers, it is possible to achieve a higher number of unique combinations resulting in exponential improvement of the scope of impact of failures.

Fix the application

Note: This is optional and does not need to be completed if you are planning on tearing down this lab as described in the next section. If you are planning on testing this lab further, please follow the instructions below to fix the application on the EC2 instances.

Click here for instructions to fix the application: