Now that you have completed this lab, make sure to update your Well-Architected review if you have implemented these changes in your workload.
Click here to access the Well-Architected Tool
The efficiency of issue resolution within an Operations team is directly linked to the team's tenure and experience. An Operator with prior knowledge of a particular issue has a head start in reaching resolution, because they already understand which logs and metrics were used in previous situations. Whilst this is valuable to an Operations group, it also represents a single point of failure and a scalability challenge.
This is where playbooks become important. Playbooks are a documented set of predefined steps, which are run to identify an issue. The result of each step can be used to either call more steps to run, or alternatively to trigger manual intervention.
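The branching described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the lab's implementation: the step names, the context keys and the 5% threshold are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A step inspects the shared context and returns (finding, next), where next is
# the name of another step, MANUAL to hand over to a person, or None when done.
# Every step name, context key and threshold here is hypothetical.
MANUAL = "MANUAL"
StepResult = Tuple[str, Optional[str]]


def check_alarm(ctx: dict) -> StepResult:
    if ctx.get("alarm_state") == "ALARM":
        return "alarm is firing", "check_error_rate"
    return "alarm is not firing", None


def check_error_rate(ctx: dict) -> StepResult:
    if ctx.get("error_rate", 0.0) > 0.05:
        return "error rate above 5%", MANUAL
    return "error rate within limits", None


STEPS: Dict[str, Callable[[dict], StepResult]] = {
    "check_alarm": check_alarm,
    "check_error_rate": check_error_rate,
}


def run_playbook(ctx: dict, first_step: str = "check_alarm") -> Tuple[List[str], bool]:
    """Run steps until one finishes or asks for manual intervention."""
    findings: List[str] = []
    next_step: Optional[str] = first_step
    while next_step is not None and next_step != MANUAL:
        finding, next_step = STEPS[next_step](ctx)
        findings.append(finding)
    # The boolean tells the caller whether a person needs to take over.
    return findings, next_step == MANUAL
```

Calling run_playbook({"alarm_state": "ALARM", "error_rate": 0.2}) walks both steps and reports that manual intervention is needed, while an alarm in the OK state stops after the first step.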
Automating playbook activities wherever possible is critical to reducing the time to respond to an incident.
The AWS Cloud offers multiple services you can use to build an automated playbook, one of which is AWS Systems Manager.
AWS Systems Manager offers an automation document capability (known within Systems Manager as runbooks), which allows you to create a series of executable steps to orchestrate your investigation and remediation. AWS Systems Manager Automation Documents allow a user to run custom scripts, call AWS service APIs, or even run remote commands on cloud or on-premises compute instances.
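To make the shape of an automation document more concrete, here is a minimal sketch that builds one as a Python dictionary and prints it as JSON. It is not the document used in this lab: the step names, the AlarmName parameter and the inline script are hypothetical, and the Runtime value is an assumption that should be checked against the runtimes Systems Manager currently supports.

```python
import json

# Hypothetical two-step runbook: one step calls an AWS API, one runs a script.
runbook = {
    "schemaVersion": "0.3",  # schema version used by Automation runbooks
    "description": "Hypothetical example playbook",
    "assumeRole": "{{ AutomationAssumeRole }}",
    "parameters": {
        "AutomationAssumeRole": {"type": "String"},
        "AlarmName": {"type": "String"},  # hypothetical input
    },
    "mainSteps": [
        {
            "name": "DescribeAlarm",  # hypothetical step name
            "action": "aws:executeAwsApi",
            "inputs": {
                "Service": "cloudwatch",
                "Api": "DescribeAlarms",
                "AlarmNames": ["{{ AlarmName }}"],
            },
        },
        {
            "name": "Summarise",  # hypothetical step name
            "action": "aws:executeScript",
            "inputs": {
                "Runtime": "python3.8",  # assumption: verify supported runtimes
                "Handler": "handler",
                "Script": "def handler(events, context):\n    return {'alarm': events['AlarmName']}\n",
                "InputPayload": {"AlarmName": "{{ AlarmName }}"},
            },
        },
    ],
}

print(json.dumps(runbook, indent=2))
```

The printed JSON is the kind of content you would supply when creating an Automation document; the lab's own templates remain the source of truth for the real playbooks.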
In this section, acting as a Systems Operator, you will create an automated playbook to assist your investigation.
The Systems Manager Automation Document you are building needs permissions to run the investigation and remediation steps. You will create an IAM role that the automation assumes to perform the playbook activities. To simplify the deployment process, a CloudFormation template has been provided that you can deploy via the console or the AWS CLI. Please choose one of the following two deployment steps:
Once you have deployed the CloudFormation stack above, go to the IAM Console.
On the side menu, click on Roles and locate the IAM role named AutomationRole.
Take note of the ARN of the role, as we will need it later in the lab.
In preparation for the investigation, you need to know all services and resources associated with the issue. The email notification that is sent does not contain any resource information. To gather this information, we will build a playbook that acquires all related resources using our CloudWatch alarm ARN as a reference.
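Because the alarm ARN is the only input, a natural first step is to split it into its parts. The sketch below assumes the standard CloudWatch alarm ARN layout (arn:partition:cloudwatch:region:account-id:alarm:alarm-name); the function and type names are hypothetical, and the lab's playbook may extract this information differently.

```python
from typing import NamedTuple


class AlarmArn(NamedTuple):
    partition: str
    region: str
    account_id: str
    alarm_name: str


def parse_alarm_arn(arn: str) -> AlarmArn:
    """Split a CloudWatch alarm ARN into its components."""
    # Limit the split to 6 so that colons inside the alarm name are preserved.
    parts = arn.split(":", 6)
    if (
        len(parts) != 7
        or parts[0] != "arn"
        or parts[2] != "cloudwatch"
        or parts[5] != "alarm"
    ):
        raise ValueError(f"not a CloudWatch alarm ARN: {arn!r}")
    return AlarmArn(
        partition=parts[1],
        region=parts[3],
        account_id=parts[4],
        alarm_name=parts[6],
    )
```

With the region and alarm name in hand, a playbook step can look up the alarm and follow it to the resources behind it.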
Codifying your playbook with AWS Systems Manager allows for maximum code reusability. This reduces the overhead of rewriting code that has identical objectives.
Note: Follow these steps to build and run the playbook. Select a guide to deploy using either the AWS console, the AWS CLI, or a CloudFormation template.
Once the automation document is created, you can test it.
Click on Playbook-Gather-Resources, then click on Execute Automation to run your playbook.
In the previous step, you created a playbook that finds all related AWS resources in the application. In this step, you will create a playbook that interrogates those resources and captures recent metrics and logs, to look for insights and better understand the root cause of the issue.
In practice, the actions a playbook can take to investigate vary with the scenario presented by the issue. The purpose of this lab is to showcase how you can use a playbook to aid investigation, rather than to advise on a specific action path.
Therefore, in this lab we will assume an example scenario. The playbook will look at the metrics and logs of the ELB, ECS and RDS services in the resource list. The playbook will then highlight the metrics and logs that are considered outside normal operational thresholds.
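The highlighting step can be pictured as a simple range check. In the sketch below the metric names are taken from the report discussed later in this lab, but the threshold values and the function name are hypothetical; the lab's playbook defines its own limits.

```python
from typing import Dict, List, Tuple

# Hypothetical (low, high) bounds for normal operation; not the lab's values.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "TargetResponseTime": (0.0, 1.0),  # seconds
    "CPUUtilization": (0.0, 80.0),     # percent
    "ELB504Count": (0.0, 0.0),         # any 504 is treated as abnormal
}


def out_of_range(metrics: Dict[str, float]) -> List[str]:
    """Return the names of metrics that fall outside their normal bounds."""
    flagged: List[str] = []
    for name, value in sorted(metrics.items()):
        bounds = THRESHOLDS.get(name)
        if bounds is None:
            continue  # no threshold defined, so nothing to compare against
        low, high = bounds
        if value < low or value > high:
            flagged.append(name)
    return flagged
```

For example, a response time of 5.2 seconds with twelve 504 errors and 45% CPU would flag ELB504Count and TargetResponseTime but not CPUUtilization.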
Please follow the below instructions to build this playbook:
Note: We will deploy this playbook via a CloudFormation template to simplify deployment. Please follow the steps below to deploy the CloudFormation template via the CLI or the console.
When the document is created, you can go ahead and run a quick test.
You can find the newly created document under the Owned by me tab of the Document resource in the Systems Manager console.
Click on the playbook called Playbook-Investigate-Application-Resources and click on Execute Automation to run the playbook.
Paste the resource list you noted from the output of the previous playbook (refer to section 3.1 Building the “Gather-Resources” Playbook) under Resources and click on Execute.
Under Executed Steps you should be able to see each of the steps the playbook ran. If you view the content of the document, you will be able to see the code and find out what each step does.
For simplicity, we have created a list of the output and a description for each step. Expand the list below to view.
Wait until all steps are completed successfully.
So far we have 2 separate playbooks. The first playbook gathers the list of resources associated with the application. The second playbook queries the relevant resources and investigates the appropriate logs and metrics.
In this step we will automate our playbooks further by creating a parent playbook that orchestrates the 2 investigative playbooks. We will also add a step that sends a notification to our Developers and System Owners.
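One way to picture the parent playbook is as a runbook whose steps call the two child playbooks with aws:executeAutomation and then publish to SNS with aws:executeAwsApi. The sketch below is an assumption about the shape, not the lab's template: the step names and the child parameter wiring are hypothetical, and the type conversions between step outputs and inputs would need checking against the real documents. The small helper only verifies that each step refers to outputs of steps that run before it.

```python
import json
import re

# Hypothetical parent runbook; child document names come from this lab,
# everything else (step names, parameter wiring) is an assumption.
parent_runbook = {
    "schemaVersion": "0.3",
    "assumeRole": "{{ AutomationAssumeRole }}",
    "parameters": {
        "AutomationAssumeRole": {"type": "String"},
        "AlarmARN": {"type": "String"},
        "SNSTopicArn": {"type": "String"},
    },
    "mainSteps": [
        {
            "name": "GatherResources",
            "action": "aws:executeAutomation",
            "inputs": {
                "DocumentName": "Playbook-Gather-Resources",
                "RuntimeParameters": {"AlarmARN": ["{{ AlarmARN }}"]},
            },
        },
        {
            "name": "InvestigateResources",
            "action": "aws:executeAutomation",
            "inputs": {
                "DocumentName": "Playbook-Investigate-Application-Resources",
                "RuntimeParameters": {"Resources": ["{{ GatherResources.Output }}"]},
            },
        },
        {
            "name": "Notify",
            "action": "aws:executeAwsApi",
            "inputs": {
                "Service": "sns",
                "Api": "Publish",
                "TopicArn": "{{ SNSTopicArn }}",
                "Message": "{{ InvestigateResources.Output }}",
            },
        },
    ],
}


def undefined_references(doc: dict) -> list:
    """List '<step> -> <ref>' pairs where a step uses output of a later or unknown step."""
    seen = set()
    missing = []
    for step in doc["mainSteps"]:
        text = json.dumps(step["inputs"])
        for ref in re.findall(r"\{\{\s*(\w+)\.Output\s*\}\}", text):
            if ref not in seen:
                missing.append(f"{step['name']} -> {ref}")
        seen.add(step["name"])
    return missing
```

An empty list from undefined_references(parent_runbook) indicates that the steps are ordered so that every referenced output already exists when it is needed.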
Follow the instructions below to build the parent Playbook.
Note: Select a step-by-step guide below to build the parent playbook using either the AWS console or a CloudFormation template.
You can now run the playbook to discover the result of the investigation.
Go to the Output section of the deployed CloudFormation stack walab-ops-sample-application and take note of the output values below.
Go to the Systems Manager Automation document we created in the previous step,
and then run the playbook, passing the alarm ARN as the AlarmARN input value along with the SNSTopicArn.
walab-ops-sample-application stack and copy, paste the value of OutputSystemEventTopicArn
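If you prefer to start the execution from a script instead of the console, the inputs can be assembled as shown below. This is a hedged sketch: the helper name is hypothetical, the document name is left as a placeholder because it depends on what you created, and the real playbook may require further parameters (for example a role ARN). Systems Manager expects each parameter value as a list of strings.

```python
from typing import Dict, List


def build_execution_request(
    document_name: str, alarm_arn: str, sns_topic_arn: str
) -> Dict[str, object]:
    """Assemble keyword arguments for starting an automation execution."""
    parameters: Dict[str, List[str]] = {
        "AlarmARN": [alarm_arn],
        "SNSTopicArn": [sns_topic_arn],
    }
    return {"DocumentName": document_name, "Parameters": parameters}


request = build_execution_request(
    "<parent-playbook-name>",  # placeholder: use the name of your document
    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-alarm",  # example value
    "arn:aws:sns:us-east-1:123456789012:example-topic",  # example value
)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("ssm").start_automation_execution(**request)
```

Keeping the request construction separate from the API call makes it easy to review the inputs before anything runs against your account.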
In the generated report you should see a large number of ELB504Count errors and a high TargetResponseTime from the load balancer. This explains the delay we are seeing from our canary alarm.
If you then look at the ECS summary, you will notice that the ECS TaskRunningCount is only 1, with a relatively high average CPUUtilization. The script calculates the average of the maximum values reported for the ECS service over the last 6 minutes. If you do not see a CPUUtilization value in the JSON, you can confirm the CPU usage by going to the ECS service console and clicking on the Metrics tab.
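The CPUUtilization figure in the ECS summary can be read as follows: take the Maximum statistic for each one-minute period, keep the most recent six, and average them. The sketch below assumes one datapoint per minute in chronological order; the function name is hypothetical and the lab's script may differ in detail.

```python
from typing import Optional, Sequence


def average_of_maxima(
    per_minute_max: Sequence[float], window: int = 6
) -> Optional[float]:
    """Average the most recent `window` per-minute maximum values."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = list(per_minute_max)[-window:]
    if not recent:
        # No datapoints were returned, so there is no value to report.
        return None
    return sum(recent) / len(recent)
```

Returning None for an empty series mirrors the case where no CPUUtilization value appears in the report, which is when checking the Metrics tab in the console is useful.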
Therefore, it is likely that the immediate cause of the latency is a resource constraint at the application API level running in ECS. If we increase the number of tasks in the ECS service, the application should be relieved of some of the CPU utilization pressure.
With all of this information provided by our playbook findings, we should be able to determine the next course of action to remediate the issue.
This concludes Section 3 of this lab. Click on the link below to move on to the next section and build the remediation runbook.