For the next part of the lab, restore access to the getRecommendation API on the RecommendationService.
Previously you simulated a failure of the service dependency. Now you will simulate a failure on a single server (of the three servers running). You will simulate a fault on this server that prevents only it from calling the otherwise healthy service dependency.
Navigate to the EC2 Instances console
Select only the EC2 instance in Availability Zone us-east-2c
Click Action > Instance Settings > Attach/Replace IAM Role
From IAM role, click WebApp1-EC2-noDDB-Role-HealthCheckLab
This will return you to the EC2 Instances console. Observe under IAM Instance Profile Name (it is one of the displayed columns) which IAM roles each EC2 instance has attached
The IAM role attached to an EC2 instance determines what permissions it has to access AWS resources. You changed the role of the us-east-2c instance to one that is almost the same as the other two, except it does not have access to DynamoDB. Since DynamoDB is used to mock our service dependency, the us-east-2c server no longer has access to the service dependency (RecommendationService). Stale credentials are an actual fault that servers might experience. Your actions above simulate stale (invalid) credentials on the us-east-2c server.
Observe the website behavior now
The service dependency RecommendationServiceEnabled is still healthy
It is the server in us-east-2c that is unhealthy - it has stale credentials
The service would deliver a better experience if it:
|Well-Architected for Reliability: Best practices|
|Make services stateless where possible: Services should either not require state, or should offload state such that between different client requests, there is no dependence on locally stored data on disk or in memory. This enables servers to be replaced at will without causing an availability impact. Amazon ElastiCache or Amazon DynamoDB are good destinations for offloaded state.|
|Automate healing on all layers: Upon detection of a failure, use automated capabilities to perform actions to remediate. Ability to restart is an important tool to remediate failures. As discussed previously for distributed systems, a best practice is to make services stateless where possible. This prevents loss of data or availability on restart. In the cloud, you can (and generally should) replace the entire resource (for example, EC2 instance, or Lambda function) as part of the restart. The restart itself is a simple and reliable way to recover from failure. Many different types of failures occur in workloads. Failures can occur in hardware, software, communications, and operations. Rather than constructing novel mechanisms to trap, identify, and correct each of the different types of failures, map many different categories of failures to the same recovery strategy. An instance might fail due to hardware failure, an operating system bug, memory leak, or other causes. Rather than building custom remediation for each situation, treat any of them as an instance failure. Terminate the instance, and allow AWS Auto Scaling to replace it. Later, carry out the analysis on the failed resource out of band.|
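The idea of mapping many failure causes to a single recovery strategy can be illustrated with a small simulation. This is only a sketch: the `Instance` class, the `launch_instance` and `heal` functions, and the `i-0001` style IDs are hypothetical and do not call any AWS API. In the lab, AWS Auto Scaling performs the replacement.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional, Tuple

_next_id = count(1)  # hypothetical ID generator, not real EC2 instance IDs


@dataclass
class Instance:
    instance_id: str
    failure_cause: Optional[str] = None  # None means the instance is healthy


def launch_instance() -> Instance:
    """Stand-in for Auto Scaling launching a fresh instance."""
    return Instance(instance_id=f"i-{next(_next_id):04d}")


def heal(fleet: List[Instance]) -> Tuple[List[Instance], List[Instance]]:
    """Replace every failed instance, whatever the cause of the failure.

    Returns the healed fleet and the failed instances, which are kept
    aside so the root cause can be analyzed later, out of band.
    """
    healed: List[Instance] = []
    failed: List[Instance] = []
    for instance in fleet:
        if instance.failure_cause is None:
            healed.append(instance)
        else:
            # Same remedy for hardware faults, OS bugs, memory leaks, etc.
            failed.append(instance)
            healed.append(launch_instance())
    return healed, failed
```

Note that `heal` never inspects the value of `failure_cause` beyond checking whether it is set, which is the point of the practice: one remediation path for every kind of instance failure.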
From the Target Groups console, click on the Health checks tab
Choose one of the options below (Option 1 - Expert or Option 2 - Assisted) to improve the code and add the deep health check.
You may choose this option, or skip to Option 2 - Assisted option
This option requires that you have access to place a file in a location accessible via HTTP/HTTPS through a URL. For example, a publicly readable S3 bucket, a gist (use the raw option to get the URL), or your private web server.
/healthcheck should in turn make a test call to RecommendationService using User ID 0
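A deep health check of this kind might be sketched as follows. This is a minimal illustration, not the lab's server_healthcheck.py: the function names `get_recommendation` and `deep_healthcheck`, the canned response, and the message strings are all assumptions. In the lab the dependency call is a DynamoDB read, which is replaced here by a stand-in so the sketch runs on its own.

```python
from typing import Callable, Dict, Tuple


def get_recommendation(user_id: str) -> Dict[str, str]:
    """Hypothetical stand-in for the call to RecommendationService."""
    # Canned data in place of the real dependency call.
    return {"recommendation": "example"}


def deep_healthcheck(
    call_dependency: Callable[[str], Dict[str, str]] = get_recommendation,
) -> Tuple[int, str]:
    """Return (HTTP status, message) for a /healthcheck request."""
    try:
        # Test call to the dependency using User ID 0
        result = call_dependency("0")
    except Exception as exc:
        # Any failure to reach the dependency marks this server unhealthy
        return 503, f"dependency call failed: {exc}"
    if not result:
        # An empty response is treated as a failure too
        return 503, "dependency returned no data"
    return 200, "ok"
```

Returning 503 rather than 200 when the test call fails is what lets the load balancer detect a server that is running but cannot reach its dependency.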
If you completed the Option 1 - Expert option, then skip the Option 2 - Assisted option section and continue with 3.4.3 Health check code
Healthcheck request in the comments. What will this code do now if called on this health check URL?
Navigate to the AWS CloudFormation console
Click on the HealthCheckLab stack
Leave Use current template selected and click Next
Find the ServerCodeUrl parameter and enter the following:
Click Next until the last page
At the bottom of the page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names
Click Update stack
Click on Events, and click the refresh icon to observe the stack progress
This is the health check code from server_healthcheck.py . The Option 2 - Assisted option uses this code. If you used the Option 1 - Expert option, you can consult this code as a guide.
The CloudFormation stack update reset the EC2 instance IAM roles, so the system is back to its original no-fault state. You will re-introduce the single-server fault and observe the new behavior.
Refresh the web service multiple times and note all three servers are functioning without error
Copy the URL of the web service to a new tab and append /healthcheck to the end of the URL
The new URL should look like:
Refresh several times and observe the health check on the three servers
Note the check is successful - the check now includes a call to the RecommendationService (the DynamoDB table)
Go to the Target Groups console, click on the Targets tab, and note the health status as per the ELB health checks.
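The repeated manual refreshes of the /healthcheck URL can also be scripted. The sketch below uses only the Python standard library; the function names `fetch_status` and `poll_healthcheck` are hypothetical, and the URL you pass in is your own web service URL with /healthcheck appended.

```python
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is still what we want
        return err.code


def poll_healthcheck(
    url: str,
    attempts: int = 10,
    fetch: Callable[[str], int] = fetch_status,
) -> Counter:
    """Request url `attempts` times and count the status codes seen."""
    return Counter(fetch(url) for _ in range(attempts))
```

With all three servers healthy, every request should be counted under status 200. Connection-level errors (for example, an unreachable host) are not handled in this sketch and will raise `urllib.error.URLError`.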
To re-introduce the stale credentials fault, again change the IAM role for the EC2 instance in us-east-2c to WebApp1-EC2-noDDB-Role-HealthCheckLab
Go to the Target Groups console, click on the Targets tab, and note the health status as per the ELB health checks (remember to refresh)
Note that the server in us-east-2c is now failing the health check with HTTP code 503 Service Unavailable
The ELB has identified the us-east-2c server as unhealthy and will not route traffic to it
This is known as fail-closed behavior
Refresh the web service multiple times and note that it is nevertheless still functioning without error
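The routing behavior observed in these steps can be approximated with a toy model. This is a simplified sketch, not how the ELB is implemented: the function names are hypothetical, the Availability Zone names other than us-east-2c are assumed for illustration, and plain round-robin stands in for the load balancer's actual routing algorithm.

```python
from itertools import cycle
from typing import Dict, List


def healthy_targets(health: Dict[str, int]) -> List[str]:
    """Keep only targets whose last health check returned HTTP 200."""
    return [name for name, status in health.items() if status == 200]


def route_requests(health: Dict[str, int], n_requests: int) -> List[str]:
    """Distribute n_requests round-robin across healthy targets only."""
    targets = healthy_targets(health)
    if not targets:
        # Simplification: a real load balancer may behave differently here
        raise RuntimeError("no healthy targets")
    rotation = cycle(targets)
    return [next(rotation) for _ in range(n_requests)]


# Last health check result per server; us-east-2c has the stale credentials
health = {"us-east-2a": 200, "us-east-2b": 200, "us-east-2c": 503}
print(route_requests(health, 4))
# prints ['us-east-2a', 'us-east-2b', 'us-east-2a', 'us-east-2b']
```

Because the us-east-2c server is excluded from the rotation, users see no errors even though one of the three servers cannot reach its dependency.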
|Well-Architected for Reliability: Best practices|
|Monitor all components of the workload to detect failures: Continuously monitor the health of your workload so that you and your automated systems are aware of degradation or complete failure as soon as they occur.|
|Failover to healthy resources: Ensure that if a resource failure occurs, healthy resources can continue to serve requests.|
|Well-Architected for Reliability: Health Checks|
|The load balancer will only route traffic to healthy application instances. The health check needs to be at the data plane/application layer, indicating the capability of the application on the instance. This check should not be against the control plane. A health check URL for the web application will be present and configured for use by the load balancer.|