This failure injection will simulate a critical failure of the Amazon RDS DB instance.
In Chaos Engineering we always start with a hypothesis. For this experiment the hypothesis is:
Hypothesis: If the primary RDS instance dies, then steady state will be maintained. Steady state is defined here as downtime less than one minute.
[Optional] Before starting, view the deployment state machine in the AWS Step Functions console to verify the deployment has reached the stage where you can start testing:
- WaitForMultiAZDB shows completed (green)
- WaitForRDSRRStack1 and CheckRDSRRStatus1 show completed (green)

Before you initiate the failure simulation, refresh the service website several times. Every time the image is loaded, the website writes a record to the Amazon RDS database.
Click on click here to go to other page; it will show the latest ten entries in the Amazon RDS database.
Go to the RDS Dashboard in the AWS Console at http://console.aws.amazon.com/rds
From the RDS dashboard, look at the configured values. Note the following:
To fail over the RDS instance, use the VPC ID as the command line argument, replacing <vpc-id> in one (and only one) of the scripts/programs below (choose the language that you set up your environment for):
| Language | Command |
| --- | --- |
| Bash | `./failover_rds.sh <vpc-id>` |
| Python | `python3 fail_rds.py <vpc-id>` |
| Java | `java -jar app-resiliency-1.0.jar RDS <vpc-id>` |
| C# | `.\AppResiliency RDS <vpc-id>` |
| PowerShell | `.\failover_rds.ps1 <vpc-id>` |
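All of these scripts do essentially the same thing: find the Multi-AZ DB instance in the given VPC and reboot it with failover forced. A minimal Python sketch of that idea using boto3 (the actual lab scripts may differ in filtering and error handling):

```python
import sys
import boto3

def failover_rds(vpc_id: str) -> None:
    rds = boto3.client("rds")
    # Find DB instances whose subnet group belongs to the given VPC
    for db in rds.describe_db_instances()["DBInstances"]:
        if db.get("DBSubnetGroup", {}).get("VpcId") == vpc_id:
            print(f"Failing over {db['DBInstanceIdentifier']}")
            # ForceFailover=True makes the reboot promote the standby replica
            rds.reboot_db_instance(
                DBInstanceIdentifier=db["DBInstanceIdentifier"],
                ForceFailover=True,
            )

if __name__ == "__main__":
    failover_rds(sys.argv[1])  # e.g. python3 fail_rds.py vpc-0123456789abcdef0
```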
The specific output will vary based on the command used, but will include some indication that your Amazon RDS database is being failed over: Failing over mdk29lg78789zt
Watch how the service responds. Note how AWS systems help maintain service availability. Check whether there is any period of unavailability and, if so, how long it lasts; a simple polling sketch follows below.
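One easy way to bracket the outage window is to poll the site and timestamp each response. A rough sketch; the URL is a placeholder, substitute your lab's website URL:

```python
import time
import urllib.request

URL = "http://<your-website-url>/"  # placeholder: use the lab website URL

while True:
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            status = resp.status  # 200 while the site is healthy
    except Exception as exc:
        status = f"DOWN ({exc})"  # connection errors or HTTP 5xx during failover
    print(time.strftime("%H:%M:%S"), status)
    time.sleep(5)  # poll every five seconds to bound the outage duration
```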
The website is not available. Some errors you might see reported:
This can also be verified by viewing the canary run data for the WebServersforResiliencyTesting stack. Continue on to the next steps, periodically returning to refresh the website or view the canary runs.
Refresh and note the values of the Status field. It will ultimately return to Available when the failover is complete.
Note the AZs for the primary and standby instances. They have swapped: the standby has now taken over primary responsibility, and the former primary has been restarted. (After RDS failover it can take several minutes for the console to update as shown below; the failover, however, has completed.)
From the AWS RDS console, click on the Logs & events tab and scroll down to Recent events. You should see entries like those below. (Note: you may need to page over to the most recent events.) In this case failover took less than a minute.
Mon, 11 Oct 2021 19:53:37 GMT - Multi-AZ instance failover started.
Mon, 11 Oct 2021 19:53:45 GMT - DB instance restarted
Mon, 11 Oct 2021 19:54:21 GMT - Multi-AZ instance failover completed
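The same events are also available programmatically. A quick sketch using the boto3 RDS API:

```python
import boto3

rds = boto3.client("rds")
# Fetch DB instance events from the last hour (Duration is in minutes)
events = rds.describe_events(SourceType="db-instance", Duration=60)
for event in events["Events"]:
    print(event["Date"], "-", event["Message"])
```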
From the AWS RDS console, click on the Monitoring tab and look at DB connections
As the failover happens, none of the three existing servers can connect to the DB.
AWS Auto Scaling detects this (any server not returning an HTTP 200 status is deemed unhealthy) and replaces the three EC2 instances with new ones that establish new connections to the new RDS primary instance.
The graph shows an unavailability period of about four minutes until at least one DB connection is re-established
[Optional] Go to the Auto Scaling group and Elastic Load Balancing target group consoles to see how EC2 instances and traffic routing were handled; a programmatic way to check the same health-check settings is sketched below.
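If you prefer the SDK, a sketch along these lines shows each Auto Scaling group's health-check configuration; when HealthCheckType is ELB, instances failing the load balancer's HTTP health check are terminated and replaced:

```python
import boto3

asg = boto3.client("autoscaling")
# List each group's health-check settings; the lab's group comes from
# the WebServersforResiliencyTesting stack
for group in asg.describe_auto_scaling_groups()["AutoScalingGroups"]:
    print(group["AutoScalingGroupName"],
          group["HealthCheckType"],         # "ELB" -> use load balancer checks
          group["HealthCheckGracePeriod"])  # seconds before checks begin
```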
Our availability requirement is that downtime be less than one minute. Therefore our hypothesis is not confirmed:
Hypothesis: If the primary RDS instance dies, then steady state will be maintained. Steady state is defined here as downtime less than one minute.
Chaos Engineering uses the scientific method. We ran the experiment and, in the verify step, found that our hypothesis was not confirmed; therefore the next step is to improve and run the experiment again.
In this section you reduce the unavailability time from four minutes to under one minute.
You observed earlier that failover of the RDS instance itself takes under one minute. However, the servers you are running are configured such that they cannot recognize that the IP address behind the RDS instance's DNS name has changed from the primary to the standby. Availability is only regained once the servers fail to reach the primary, are marked unhealthy, and are then replaced. This accounts for the four-minute delay. In this part of the lab you will update the server code to be more resilient to RDS failover. The new code will re-establish the connection to the database, and therefore use the new DNS record to connect to the RDS instance.
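The essence of the fix is to treat the database connection as disposable: on a connection error, discard it and reconnect, which forces a fresh DNS lookup of the RDS endpoint. A minimal sketch of that pattern, assuming a MySQL-compatible driver such as PyMySQL (the actual server_with_reconnect.py may be structured differently):

```python
import pymysql  # assumption: a MySQL-compatible driver; the lab code may differ

class ReconnectingDB:
    """Run queries, reconnecting on failure so a fresh DNS lookup
    picks up the new primary after an RDS failover."""

    def __init__(self, host, user, password, database):
        self._args = dict(host=host, user=user, password=password,
                          database=database, connect_timeout=3)
        self._conn = None

    def query(self, sql, params=()):
        for attempt in (1, 2):
            try:
                if self._conn is None:
                    # A new connection re-resolves the RDS endpoint's DNS name
                    self._conn = pymysql.connect(**self._args)
                with self._conn.cursor() as cur:
                    cur.execute(sql, params)
                    self._conn.commit()
                    return cur.fetchall()
            except pymysql.err.OperationalError:
                # Stale connection to the old primary: drop it and retry once
                self._conn = None
                if attempt == 2:
                    raise
```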
Use either the Express Steps or Detailed Steps below to deploy the updated server code (server_with_reconnect.py):
When you see UPDATE_COMPLETE_CLEANUP_IN_PROGRESS you may continue. There is no need to wait.
Now you will re-run the experiment as per the steps below:
As in section 5.1, you will simulate a critical failure of the Amazon RDS DB instance, but this time using AWS Fault Injection Simulator (FIS).
We would not normally change our execution approach as part of the “improve / experiment” cycle. However, for this lab it is illustrative to see the different ways that the experiment can be executed.
Go to the RDS Dashboard in the AWS Console at http://console.aws.amazon.com/rds
From the RDS dashboard, look at the configured values. Note the Workshop tag.
Navigate to the FIS console and click Experiment templates in the left pane.
Click on Create experiment template to define the type of failure you want to inject.
Enter Experiment template for RDS resiliency testing for Description and RDS-resiliency-testing for Name. For IAM role select WALab-FIS-role.
Scroll down to Actions and click Add action.
Enter reboot-database for the Name. Under Action type select aws:rds:reboot-db-instances. Enter true under forceFailover - optional and click Save.
Scroll down to Targets and click Edit next to DBInstances-Target-1 (aws:rds:db).
Under Target method, select Resource tags and filters. Select Count for Selection mode and enter 1 under Number of resources. This ensures that FIS will only reboot one RDS DB instance.
Scroll down to Resource tags and click Add new tag. Enter Workshop for Key and AWSWellArchitectedReliability300-ResiliencyofEC2RDSandS3 for Value. These are the same tags that are on the RDS DB instance used in this lab.
Under Stop condition, you can choose to stop a running experiment when certain thresholds are met, using CloudWatch alarms. For this lab the fault is a single point-in-time event (with no duration), so you can leave this blank.
Click Create experiment template.
In the warning pop-up, confirm that you want to create the experiment template without a stop condition by entering create in the text box. Click Create experiment template.
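For reference, the console steps above correspond roughly to a single FIS API call. A hedged boto3 sketch (the role ARN account ID is a placeholder for your own):

```python
import boto3

fis = boto3.client("fis")
template = fis.create_experiment_template(
    description="Experiment template for RDS resiliency testing",
    roleArn="arn:aws:iam::<account-id>:role/WALab-FIS-role",  # placeholder account
    actions={
        "reboot-database": {
            "actionId": "aws:rds:reboot-db-instances",
            "parameters": {"forceFailover": "true"},
            "targets": {"DBInstances": "DBInstances-Target-1"},
        }
    },
    targets={
        "DBInstances-Target-1": {
            "resourceType": "aws:rds:db",
            "resourceTags": {
                "Workshop": "AWSWellArchitectedReliability300-ResiliencyofEC2RDSandS3"
            },
            "selectionMode": "COUNT(1)",  # reboot at most one DB instance
        }
    },
    stopConditions=[{"source": "none"}],  # no stop condition, as in the console
    tags={"Name": "RDS-resiliency-testing"},
)
print(template["experimentTemplate"]["id"])
```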
Click on Experiment templates from the menu on the left.
Select the experiment template RDS-resiliency-testing and click Actions. Select Start experiment.
You can choose to add a tag to the experiment if you wish to do so.
Click Start experiment.
In the pop-up, type start and click Start experiment.
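The console's Start experiment action likewise maps to one API call; a sketch (the template ID is illustrative, use the one returned when your template was created):

```python
import boto3

fis = boto3.client("fis")
# Start the experiment from its template (illustrative template ID)
experiment = fis.start_experiment(experimentTemplateId="EXT1a2b3c4d5e6f7")
print(experiment["experiment"]["state"]["status"])  # e.g. "initiating"
```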
Check the website availability. Re-check every 20-30 seconds.
Revisit section 5.2 to observe the system response to the RDS instance failure.
After making these improvements and re-running the experiment, our hypothesis is now confirmed:
Hypothesis: If the primary RDS instance dies, then steady state will be maintained. Steady state is defined here as downtime less than one minute.
Learn more: After the lab see High Availability (Multi-AZ) for Amazon RDS for more details on high availability and failover support for DB instances using Multi-AZ deployments.
The primary DB instance switches over automatically to the standby replica if any of the following conditions occur:
- An Availability Zone outage
- The primary DB instance fails
- The DB instance’s server type is changed
- The operating system of the DB instance is undergoing software patching
- A manual failover of the DB instance was initiated using Reboot with failover
Now that you have completed this lab, make sure to update your Well-Architected review if you have implemented these changes in your workload.
Click here to access the Well-Architected Tool