Test Resiliency Using RDS Failure Injection

5.1 RDS failure injection

This failure injection will simulate a critical failure of the Amazon RDS DB instance.

In Chaos Engineering we always start with a hypothesis. For this experiment the hypothesis is:

Hypothesis: If the primary RDS instance dies, then availability will not be impacted

  1. Before starting, view the deployment state machine in the AWS Step Functions console to verify that the deployment has reached the stage where you can start testing:

    • single region: WaitForMultiAZDB shows completed (green)
    • multi region: both WaitForRDSRRStack1 and CheckRDSRRStatus1 show completed (green)
  2. Before you initiate the failure simulation, refresh the service website several times. Every time the image is loaded, the website writes a record to the Amazon RDS database.

  3. Click on “click here to go to other page” to see the latest ten entries in the Amazon RDS DB.

    1. The DB table shows “hits” on your image web page
    2. These include requests you may make as well as load balancer health checks
    3. Refresh and note that new data is constantly being written to the table
    4. Click on “click here to go to other page” again to return to the image web page
  4. Go to the RDS Dashboard in the AWS Console at http://console.aws.amazon.com/rds

  5. From the RDS dashboard

    • Click on “DB Instances (n/40)”
    • Click on the DB identifier for your database (if you have more than one database, refer to the VPC ID to find the one for this workshop)
    • If running the multi-region deployment, select the DB instance with Role=Master
  6. Look at the configured values. Note the following:

    • Value of the Status field is Available
    • Region & AZ shows the AZ for your primary DB instance
    • Select the Configuration tab: Multi-AZ is enabled, and Secondary Zone shows the AZ for your standby DB instance
  7. To fail over the RDS instance, run one (and only one) of the scripts/programs below, replacing <vpc-id> with your VPC ID as the command line argument. (Choose the language that you set up your environment for.)

    Language     Command
    Bash         ./failover_rds.sh <vpc-id>
    Python       python3 fail_rds.py <vpc-id>
    Java         java -jar app-resiliency-1.0.jar RDS <vpc-id>
    C#           .\AppResiliency RDS <vpc-id>
    PowerShell   .\failover_rds.ps1 <vpc-id>
  8. The specific output will vary based on the command used, but will include some indication that your Amazon RDS database is being failed over, for example: Failing over mdk29lg78789zt
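
To make the failover step less opaque, here is a minimal Python sketch of what such a script could do. It is not the workshop's actual fail_rds.py. It assumes boto3 is installed and AWS credentials are configured, and the function names find_multi_az_instance and fail_over are hypothetical. The sketch looks up the Multi-AZ DB instance in the given VPC and reboots it with a forced failover.

```python
"""Sketch: force a Multi-AZ failover of the RDS instance in a given VPC."""


def find_multi_az_instance(db_instances, vpc_id):
    """Return the identifier of the first Multi-AZ instance in vpc_id, or None.

    db_instances is the 'DBInstances' list from an RDS describe_db_instances
    response. Hypothetical helper, not part of the workshop scripts.
    """
    for db in db_instances:
        in_vpc = db.get("DBSubnetGroup", {}).get("VpcId") == vpc_id
        if in_vpc and db.get("MultiAZ"):
            return db["DBInstanceIdentifier"]
    return None


def fail_over(vpc_id):
    """Reboot the matching DB instance with a forced failover to the standby."""
    import boto3  # imported here so the helper above works without AWS access

    rds = boto3.client("rds")
    # Only the first page of results is read, which is enough for a small lab account.
    response = rds.describe_db_instances()
    db_id = find_multi_az_instance(response["DBInstances"], vpc_id)
    if db_id is None:
        raise SystemExit(f"No Multi-AZ DB instance found in {vpc_id}")
    print(f"Failing over {db_id}")
    # ForceFailover=True is the API equivalent of "Reboot with failover".
    rds.reboot_db_instance(DBInstanceIdentifier=db_id, ForceFailover=True)
```

Calling fail_over with your VPC ID would trigger the same kind of event as the scripts in the table above. Keeping the lookup in a separate function lets you check the selection logic without touching AWS.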

5.2 System response to RDS instance failure

Watch how the service responds. Note how AWS systems help maintain service availability. Test whether there is any unavailability and, if so, how long it lasts.

5.2.1 System availability

  1. The website is not available. Some errors you might see reported:

    • No Response / Timeout: Request was successfully sent to EC2 server, but server no longer has connection to an active database
    • 504 Gateway Time-out: Amazon Elastic Load Balancer did not get a response from the server. This can happen when it has removed the servers that are unable to respond and added new ones, but the new ones have not yet finished initialization, and there are no healthy hosts to receive the request
    • 502 Bad Gateway: The Amazon Elastic Load Balancer got a bad response from the server
    • An error you will not see is “This site can’t be reached”. This is because the Elastic Load Balancer has a node in each of the three Availability Zones and is always available to serve requests.
  2. Continue on to the next steps, periodically returning to attempt to refresh the website.
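
Refreshing the page by hand gives only a rough picture of the outage. The following sketch, using only the Python standard library, shows one way to measure it. The function names probe, longest_outage, and watch are hypothetical, and the URL you pass in would be your own load balancer address. Any answer other than HTTP 200 is treated as unavailable.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status code, or None when no usable response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 502 or 504 returned by the load balancer
    except (urllib.error.URLError, OSError):
        return None  # timeout, refused connection, or DNS failure


def longest_outage(samples):
    """Longest gap in seconds from the first failed probe to the next success.

    samples is a time-ordered list of (timestamp_seconds, status) pairs.
    An outage still in progress at the last sample is not counted.
    """
    longest = 0.0
    outage_start = None
    for ts, status in samples:
        if status == 200:
            if outage_start is not None:
                longest = max(longest, ts - outage_start)
                outage_start = None
        elif outage_start is None:
            outage_start = ts
    return longest


def watch(url, duration=600, interval=5):
    """Probe url every interval seconds for duration seconds, then summarize."""
    samples = []
    end = time.time() + duration
    while time.time() < end:
        samples.append((time.time(), probe(url)))
        time.sleep(interval)
    return longest_outage(samples)
```

The measured value is only as precise as the probe interval, so a five-second interval can misjudge the outage by a few seconds at each end.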

5.2.2 Failover to standby

  1. On the database console Configuration tab
    1. Refresh and note the values of the Status field. It will ultimately return to Available when the failover is complete.

    2. Note the AZs for the primary and standby instances. They have swapped, as the standby has now taken over primary responsibility and the former primary has been restarted. (After an RDS failover it can take several minutes for the console to update, even though the failover has completed.)

    3. From the AWS RDS console, click on the Logs & events tab and scroll down to Recent events. You should see entries like those below. (Note: you may need to page over to the most recent events.) In this case failover took less than a minute.

       Mon, 11 Oct 2021 19:53:37 GMT - Multi-AZ instance failover started.
       Mon, 11 Oct 2021 19:53:45 GMT - DB instance restarted
       Mon, 11 Oct 2021 19:54:21 GMT - Multi-AZ instance failover completed
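
The "less than a minute" figure can be checked directly from the event timestamps. This small sketch parses the three sample lines above with the Python standard library. The function name failover_seconds is hypothetical, and the parsing assumes each line has the form "timestamp - message" as shown.

```python
from email.utils import parsedate_to_datetime

EVENTS = [
    "Mon, 11 Oct 2021 19:53:37 GMT - Multi-AZ instance failover started.",
    "Mon, 11 Oct 2021 19:53:45 GMT - DB instance restarted",
    "Mon, 11 Oct 2021 19:54:21 GMT - Multi-AZ instance failover completed",
]


def failover_seconds(events):
    """Seconds between the 'failover started' and 'failover completed' events."""
    started = completed = None
    for line in events:
        # Split on the first " - ", which separates the timestamp from the message.
        stamp, _, message = line.partition(" - ")
        when = parsedate_to_datetime(stamp)
        if "failover started" in message:
            started = when
        elif "failover completed" in message:
            completed = when
    if started is None or completed is None:
        raise ValueError("missing failover start or completion event")
    return (completed - started).total_seconds()


print(failover_seconds(EVENTS))  # 44.0
```

For these sample events the failover took 44 seconds, from 19:53:37 to 19:54:21.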
      

5.2.3 EC2 server replacement

  1. From the AWS RDS console, click on the Monitoring tab and look at DB connections

    • As the failover happens, none of the three existing servers can connect to the DB

    • AWS Auto Scaling detects this (any server not returning an HTTP 200 status is deemed unhealthy), and replaces the three EC2 instances with new ones that establish new connections to the new RDS primary instance

    • The graph shows an unavailability period of about four minutes until at least one DB connection is re-established


  2. [optional] Go to the Auto Scaling group and AWS Elastic Load Balancer Target group consoles to see how EC2 instances and traffic routing were handled

5.2.4 RDS failure injection - results

  • Amazon RDS database failover took less than a minute
  • AWS Auto Scaling took four minutes to detect that the instances were unhealthy and to start up new ones. This resulted in a four-minute unavailability event.

Our availability requirements specify that downtime be under one minute. Therefore our hypothesis is not confirmed:

Hypothesis: If the primary RDS instance dies, then availability will not be impacted

Chaos Engineering uses the scientific method. We ran the experiment, and in the verify step found that our hypothesis was not confirmed. The next step is therefore to improve the system and run the experiment again.


5.3 RDS failure injection - improving resiliency

In this section you reduce the unavailability time from four minutes to under one minute.

You observed before that failover of the RDS instance itself takes under one minute. However, the servers you are running are configured such that they cannot recognize that the IP address behind the RDS instance DNS name has changed from the primary to the standby. Availability is only regained once the servers fail to reach the primary, are marked unhealthy, and are then replaced. This accounts for the four-minute delay. In this part of the lab you will update the server code to be more resilient to RDS failover. The new code re-establishes the connection to the database, and therefore uses the new DNS record to connect to the RDS instance.
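
The reconnect idea can be sketched as follows. This is not the contents of server_with_reconnect.py, only an illustration under stated assumptions. The class name ReconnectingClient and the connect factory are hypothetical, and ConnectionError stands in for whatever exception your database driver raises when the connection is lost.

```python
import time


class ReconnectingClient:
    """Wraps a connection factory and reconnects when an operation fails."""

    def __init__(self, connect, retries=3, delay=1.0):
        self._connect = connect  # hypothetical factory, e.g. a DB driver's connect call
        self._retries = retries
        self._delay = delay
        self._conn = None

    def run(self, operation):
        """Call operation(connection), reconnecting and retrying on failure."""
        last_error = None
        for attempt in range(self._retries + 1):
            try:
                if self._conn is None:
                    # A fresh connection looks up the DB endpoint name again,
                    # so it can reach whichever instance is now the primary.
                    self._conn = self._connect()
                return operation(self._conn)
            except ConnectionError as err:  # stand-in for the driver's own error type
                last_error = err
                self._conn = None  # drop the stale connection
                if attempt < self._retries:
                    time.sleep(self._delay)
        raise last_error
```

With a wrapper like this, a server would recover as soon as the standby is promoted, instead of waiting to be marked unhealthy and replaced. In practice the retry count and delay should cover the failover time you observed, which was under a minute.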

Use either the Express Steps or Detailed Steps below:

Express Steps
  1. Go to the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation
  2. For the WebServersForResiliencyTesting CloudFormation stack
    1. Update the stack and select Use current template
    2. Change the BootObject parameter to server_with_reconnect.py
Detailed Steps
Click here for detailed steps for updating the CloudFormation stack:

When you see UPDATE_COMPLETE_CLEANUP_IN_PROGRESS you may continue. There is no need to wait.

  • This update deploys three new EC2 instances in a new Auto Scaling group. For a while you may still see the three old instances running, before they are drained and terminated.
  • There may be a short period of unavailability. Make sure the web site is available before continuing.

Now you will re-run the experiment as per the steps below:

  • Previously we used a custom script. For this run of the experiment, we will show how to use AWS Fault Injection Simulator (FIS)

5.4 RDS failure injection using AWS Fault Injection Simulator (FIS)

As in section 5.1, you will simulate a critical failure of the Amazon RDS DB instance, but using FIS.

We would not normally change our execution approach as part of the “improve / experiment” cycle. However, for this lab it is illustrative to see the different ways that the experiment can be executed.

5.4.1 Create experiment template

  1. Go to the RDS Dashboard in the AWS Console at http://console.aws.amazon.com/rds

  2. From the RDS dashboard

    • Click on “DB Instances (n/40)”
    • Click on the DB identifier for your database (if you have more than one database, refer to the VPC ID to find the one for this workshop)
    • If running the multi-region deployment, select the DB instance with Role=Master
  3. Look at the configured values. Note the following:

    • Value of the Status field is Available
    • Region & AZ shows the AZ for your primary DB instance
    • Select the Configuration tab: Multi-AZ is enabled, and Secondary Zone shows the AZ for your standby DB instance
    • Select the Tags tab: Note the Value for the Workshop tag
  4. Navigate to the FIS console and click Experiment templates in the left pane.

  5. Click on Create experiment template to define the type of failure you want to inject.


  6. Enter “Experiment template for RDS resiliency testing” for Description and “RDS-resiliency-testing” for Name. For IAM role select WALab-FIS-role.


  7. Scroll down to Actions and click Add action.


  8. Enter reboot-database for the Name. Under Action type select aws:rds:reboot-db-instances. Enter true under forceFailover - optional and click Save.


  9. Scroll down to Targets and click Edit next to DBInstances-Target-1 (aws:rds:db).


  10. Under Target method, select Resource tags and filters. Select Count for Selection mode and enter 1 under Number of resources. This ensures that FIS will only reboot one RDS DB instance.

  11. Scroll down to Resource tags and click Add new tag. Enter Workshop for Key and AWSWellArchitectedReliability300-ResiliencyofEC2RDSandS3 for Value. These are the same tags that are on the RDS DB instance used in this lab.


  12. Under Stop condition, you can use CloudWatch Alarms to stop a running experiment when certain thresholds are met. In this lab the failure is a single point-in-time event (with no duration), so you can leave this blank.

  13. Click Create experiment template.

  14. In the warning pop-up, confirm that you want to create the experiment template without a stop condition by entering create in the text box. Click Create experiment template.
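
For readers who prefer to script the console steps, here is a rough boto3 equivalent of the template defined above. It is a sketch under several assumptions: boto3 is installed with credentials configured, the function names build_template and create_template are hypothetical, the role ARN you pass in is the ARN of WALab-FIS-role in your own account, and the console's Name field is assumed to correspond to a Name tag.

```python
import uuid

WORKSHOP_TAG = "AWSWellArchitectedReliability300-ResiliencyofEC2RDSandS3"


def build_template(role_arn):
    """Keyword arguments for the FIS create_experiment_template call."""
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Experiment template for RDS resiliency testing",
        "roleArn": role_arn,
        # Assumption: the console's Name field is stored as a Name tag.
        "tags": {"Name": "RDS-resiliency-testing"},
        # No stop condition, matching the blank Stop condition in the console.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "DBInstances-Target-1": {
                "resourceType": "aws:rds:db",
                "resourceTags": {"Workshop": WORKSHOP_TAG},
                "selectionMode": "COUNT(1)",  # reboot only one DB instance
            }
        },
        "actions": {
            "reboot-database": {
                "actionId": "aws:rds:reboot-db-instances",
                "parameters": {"forceFailover": "true"},
                "targets": {"DBInstances": "DBInstances-Target-1"},
            }
        },
    }


def create_template(role_arn):
    """Create the template in FIS and return its ID."""
    import boto3  # imported here so build_template works without AWS access

    fis = boto3.client("fis")
    response = fis.create_experiment_template(**build_template(role_arn))
    return response["experimentTemplate"]["id"]
```

Building the arguments in a separate function makes it easy to review the target and action settings before anything is sent to AWS.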


5.4.2 Run the experiment

  1. Click on Experiment templates from the menu on the left.

  2. Select the experiment template RDS-resiliency-testing and click Actions. Select Start experiment.


  3. You can choose to add a tag to the experiment if you wish to do so.

  4. Click Start experiment.


  5. In the pop-up, type start and click Start experiment.


  6. Check the website availability. Re-check every 20-30 seconds.

  7. Revisit section 5.2 to observe the system response to the RDS instance failure.

    • At a minimum, return to the RDS console, go to the Logs & events tab, and look at the most recent events to verify that a failover has occurred.
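
Starting the experiment can also be scripted. The sketch below assumes boto3 with configured credentials and an experiment template ID taken from the FIS console. The function names wait_until_done and run_experiment are hypothetical, and the set of statuses treated as terminal is an assumption of this sketch.

```python
import time
import uuid

# Statuses this sketch treats as final (an assumption, not an exhaustive list).
TERMINAL = {"completed", "stopped", "failed"}


def wait_until_done(get_status, interval=5.0, max_polls=60):
    """Poll get_status() until it returns a terminal experiment status."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL:
            return status
        time.sleep(interval)
    raise TimeoutError("experiment did not finish in time")


def run_experiment(template_id):
    """Start an FIS experiment from a template and wait for it to finish."""
    import boto3  # imported here so wait_until_done works without AWS access

    fis = boto3.client("fis")
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    return wait_until_done(
        lambda: fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    )
```

A completed experiment only means the reboot action was issued. You still need to check website availability and the RDS events to confirm how the system responded.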

5.4.3 RDS failure injection, second experiment - results

  • You will observe that the unavailability time is now under one minute
  • What else is different compared to the previous time the RDS instance failed over?

5.5 RDS failure injection - conclusion

After making the necessary improvements, now our hypothesis is confirmed:

Hypothesis: If the primary RDS instance dies, then availability will not be impacted


Resources

Learn more: After the lab see High Availability (Multi-AZ) for Amazon RDS for more details on high availability and failover support for DB instances using Multi-AZ deployments.

High Availability (Multi-AZ) for Amazon RDS

The primary DB instance switches over automatically to the standby replica if any of the following conditions occur:

  • An Availability Zone outage
  • The primary DB instance fails
  • The DB instance’s server type is changed
  • The operating system of the DB instance is undergoing software patching
  • A manual failover of the DB instance was initiated using Reboot with failover