Impact of failures with shuffle sharding

Break the application

You will now introduce the poison pill into the workload by including the bug query string in your requests and see how the updated workload architecture handles it. As in the previous case, imagine that customer Alpha triggered the bug in the application again.
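
The steps below use the browser, but if you prefer to script the request, here is a minimal sketch using Python's requests library. The hostname is a placeholder; substitute the ALB URL for customer Alpha from your own CloudFormation stack Outputs.

```python
import requests

# Placeholder hostname: replace with the ALB URL for customer Alpha from your
# CloudFormation stack Outputs.
ALB_URL = "http://shuffle-alb-EXAMPLE.us-east-1.elb.amazonaws.com/"

# Send the poison-pill request as customer Alpha (bug=true triggers the failure).
response = requests.get(ALB_URL, params={"name": "Alpha", "bug": "true"}, timeout=10)

print(response.status_code)  # expect an Internal Server Error (500) once the bug fires
print(response.text[:200])   # first part of the response body
```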

  1. Include the bug query string with a value of true and make a request as customer Alpha. The modified URL should look like this: http://shuffle-alb-1p2xbmzo541rr-1602891463.us-east-1.elb.amazonaws.com/?name=Alpha&bug=true (but use your own URL from the CloudFormation stack Outputs).

  2. This should result in an Internal Server Error response in the browser, indicating that the application has stopped working as expected on the instance that processed this request.

    PoisonPill

  3. Just like before, customer Alpha, not aware of this bug in the application, will retry the request.

    • Refresh the page to simulate this as you did before.
    • This request is routed to the other healthy instance in the shard for customer Alpha.
    • The bug is triggered again and the other instance goes down as well.
    • The entire shard is now affected.
  4. All requests to this shard will now fail because there are no healthy instances left in the shard. No matter how many times the page is refreshed, you will see a 502 Bad Gateway error, showing that customer Alpha is experiencing complete downtime. At this point, the overall capacity of the fleet has decreased from 8 EC2 instances to 6 EC2 instances.

    502BadGateway

  5. Due to shuffle sharding, the other customers are either unaffected or experience only limited impact.

    • Verify this by making requests using the URLs for these customers (obtained from the CloudFormation stack Outputs): Bravo, Charlie, Delta, Echo, Foxtrot, Golf, and Hotel.
    • Refresh each request multiple times. You should notice that all customers except Alpha still receive a response.
    • Notice that customers Bravo and Hotel only get responses from a single EC2 instance, while the other customers get responses from 2 different EC2 instances.
  6. The impact is localized to a specific shard and only customer Alpha experiences unavailability.

    • Customers Bravo and Hotel, which share an EC2 instance with customer Alpha, will still have one healthy EC2 instance available to respond to their requests.
    • While this might lead to some degree of degradation for those customers, it is still an improvement over complete downtime.
    • The scope of impact has now been reduced so that only 12.5% of customers (1 of 8) are affected by the failure induced by the poison pill.
    • Note that this improvement in the scope of impact was achieved without increasing capacity or taking any manual action.
    • With larger fleet and shard sizes, the number of combinations increases, resulting in customers experiencing different degrees of degradation; that is, some customers will have only a fraction of their shard capacity affected instead of complete downtime. The short simulation after the table and diagram below illustrates this for the lab's eight-worker fleet.
    Customer Name    Workers
    Alpha            Worker-1 and Worker-2
    Bravo            Worker-2 and Worker-3
    Charlie          Worker-3 and Worker-4
    Delta            Worker-4 and Worker-5
    Echo             Worker-5 and Worker-6
    Foxtrot          Worker-6 and Worker-7
    Golf             Worker-7 and Worker-8
    Hotel            Worker-8 and Worker-1

ShuffleShardedFlowBrokenNodes
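
The table above is enough to reason about the blast radius without touching the live environment. The sketch below is a minimal simulation of it: it hard-codes the shard assignments from the table, marks Worker-1 and Worker-2 (customer Alpha's shard) as failed, and reports how much capacity each customer has left. The mapping comes from the table; everything else is illustrative.

```python
# Shard assignments from the table above: each customer maps to a shard of 2 workers.
SHARDS = {
    "Alpha":   {"Worker-1", "Worker-2"},
    "Bravo":   {"Worker-2", "Worker-3"},
    "Charlie": {"Worker-3", "Worker-4"},
    "Delta":   {"Worker-4", "Worker-5"},
    "Echo":    {"Worker-5", "Worker-6"},
    "Foxtrot": {"Worker-6", "Worker-7"},
    "Golf":    {"Worker-7", "Worker-8"},
    "Hotel":   {"Worker-8", "Worker-1"},
}

# Customer Alpha's poison pill takes down both workers in Alpha's shard.
failed_workers = SHARDS["Alpha"]

fully_down = 0
for customer, shard in SHARDS.items():
    healthy = shard - failed_workers
    if not healthy:
        fully_down += 1
    status = "DOWN" if not healthy else f"{len(healthy)}/{len(shard)} workers healthy"
    print(f"{customer:8s} {status}")

# Only Alpha loses its entire shard: 1 of 8 customers, a 12.5% scope of impact.
print(f"Scope of impact: {fully_down}/{len(SHARDS)} = {fully_down / len(SHARDS):.1%}")
```

Running it shows Alpha fully down, Bravo and Hotel each reduced to one healthy worker, and the remaining five customers untouched, which matches what you observed in the browser.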

In a shuffle sharded system, the scope of impact of failures can be calculated using the following formula:

scope of impact = 1 / (number of unique shard combinations)

The formula can be expanded to calculate the number of unique combinations that can exist given the number of workers and the number of workers per shard, also referred to as shard size. The calculation is performed using factorials.

number of combinations = (number of workers)! / ((shard size)! × (number of workers - shard size)!)

scope of impact = ((shard size)! × (number of workers - shard size)!) / (number of workers)!

For example, if there were 100 workers and we assigned a unique combination of 5 workers to each shard, then the failure of any one shard would impact only 0.0000013% of customers.

number of combinations = 100! / (5! × 95!) = 75,287,520

scope of impact = 1 / 75,287,520 ≈ 0.0000013%

With this shuffle sharded architecture, the scope of impact is further reduced by the combination of Workers used to generate shards. Here, with eight shards, if a customer experiences a problem, then the shard hosting them, as well as the Workers mapped to that shard, might be impacted. However, that shard represents only a fraction of the overall service. Since this is just a lab, we kept it simple with only eight shards, but with more shards the scope of impact decreases further. Adding more shards requires adding more capacity (more Workers). With a higher number of Workers, it is possible to achieve a higher number of unique combinations, resulting in an exponential improvement in the scope of impact of failures.
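
As a quick check of the arithmetic above, the sketch below uses Python's standard math.comb to compute the number of unique shard combinations and the resulting scope of impact for the 100-worker example, and to show how the blast radius shrinks as the fleet grows:

```python
from math import comb

def scope_of_impact(workers: int, shard_size: int) -> float:
    """Fraction of customers affected by the loss of one shard, assuming each
    customer is assigned a unique combination of workers."""
    return 1 / comb(workers, shard_size)

# The 100-worker example from above: 100! / (5! * 95!) unique combinations.
print(comb(100, 5))                       # 75287520
print(f"{scope_of_impact(100, 5):.7%}")   # 0.0000013%

# Growing the fleet (shard size fixed at 5) shrinks the blast radius rapidly.
for workers in (10, 20, 50, 100):
    print(f"{workers:3d} workers: {comb(workers, 5):>9d} combinations, "
          f"scope of impact {scope_of_impact(workers, 5):.7%}")
```

Note that comb(8, 2) is 28, so this lab's fleet could support up to 28 distinct two-worker shards; only eight of them are used (one per customer), which is why the observed scope of impact here is 1 in 8 customers rather than 1 in 28.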

Verify workload availability

You can look at the AvailabilityDashboard to see the impact of the failure introduced by customer Alpha across all customers.

  1. Switch to the tab in which the AvailabilityDashboard opened. (You can also retrieve its URL from the CloudFormation stack Outputs.)

  2. You can see that the introduction of the poison pill and the subsequent retries by customer Alpha have not impacted any other customer.

    • Notice that the impact is localized to a single shard; the other customers are not affected.
    • With shuffle sharding, only 12.5% of customers (1 of 8) are impacted.
    • NOTE: You might have to wait a few minutes for the dashboard to update.

ImpactDashboardShuffleSharded
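
If you want a rough per-customer availability number without waiting for the dashboard to refresh, a small polling script is sketched below. It assumes each customer URL follows the same ?name=<customer> pattern as the Alpha URL used earlier; the hostname is a placeholder, so substitute your own from the CloudFormation stack Outputs.

```python
import requests

# Placeholder: replace with the ALB DNS name from your CloudFormation stack Outputs.
ALB_URL = "http://shuffle-alb-EXAMPLE.us-east-1.elb.amazonaws.com/"
CUSTOMERS = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel"]
ATTEMPTS = 10

for customer in CUSTOMERS:
    ok = 0
    for _ in range(ATTEMPTS):
        try:
            resp = requests.get(ALB_URL, params={"name": customer}, timeout=5)
            if resp.status_code == 200:
                ok += 1
        except requests.RequestException:
            pass  # treat connection errors as unavailability
    print(f"{customer:8s} {ok}/{ATTEMPTS} successful ({ok / ATTEMPTS:.0%})")
```

With Alpha's shard down, you would expect Alpha at 0% and the other customers at or near 100%, matching the 12.5% scope of impact shown on the dashboard.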

Fix the application

Note: This step is optional and does not need to be completed if you plan to tear down this lab as described in the next section. If you plan to test this lab further, follow the instructions below to fix the application on the EC2 instances.

Click here for instructions to fix the application: