Build & Run Remediation Runbook

In contrast to playbooks, runbooks are procedures that accomplish specific tasks to achieve an outcome. In the previous section, you identified a CPU utilization issue, which occurs because only one ECS task is running in the cluster. This could be remediated through the use of auto scaling.

However, implementing this requires preparation and planning. When an incident occurs, operations teams should have a defined escalation path for the issue. Depending on the criticality of the system, they should also be equipped to do what is necessary to protect system availability while the escalation occurs.

In this section, you will build an automated runbook to remediate the CPU utilization issue by increasing the number of tasks in the ECS cluster. Your automated runbook will notify the owner of the workload and give them the option to intercept the scale-up action should they choose not to proceed.

Action items in this section:

  1. You will build a runbook to scale up the ECS cluster, with an approval mechanism.
  2. You will execute the runbook and observe the recovery of your application.

4.0 Building the “Approval-Gate” runbook.

In this section you will build a reusable runbook, which gives the owner the ability to approve or deny remediation actions within a defined waiting period. If the wait time is exceeded and a decision has not been made, the runbook will automatically approve the action.

We will achieve this through the use of a Systems Manager Automation document, which works as follows:

  1. The Approval-Gate runbook executes a separate document called the Approve-Timer.

  2. The Approve-Timer runbook will then wait for a preconfigured amount of time and send an approve signal to the Approval-Gate runbook.

  3. Meanwhile, the Approval-Gate runbook sends an approval request to the workload owner via a designated SNS topic.

  • If the owner chooses to approve, the Approval-Gate runbook will continue to the next step.
  • If the owner declines the approval, the runbook will fail, blocking further steps.
  • However, if the owner does not respond within the preconfigured wait time, the Approve-Timer runbook will automatically approve the request.
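
The timer side of this flow can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the lab's actual Approve-Timer document: the function name `approve_after_timeout` and its parameters are hypothetical, and it assumes an SSM client such as `boto3.client("ssm")` is passed in. The only AWS call used is `send_automation_signal`, which sends an `Approve` signal to an execution that is waiting on an approval step.

```python
import time


def approve_after_timeout(ssm_client, execution_id, wait_seconds, sleep=time.sleep):
    """Wait, then auto-approve the Approval-Gate execution (hypothetical helper).

    ssm_client   -- assumed to be boto3.client("ssm") or a compatible object
    execution_id -- ID of the Approval-Gate automation execution to approve
    wait_seconds -- the preconfigured wait time, in seconds
    sleep        -- injectable for testing; defaults to time.sleep
    """
    # Give the workload owner time to approve or deny the request.
    sleep(wait_seconds)
    # Send the approve signal on the owner's behalf.
    ssm_client.send_automation_signal(
        AutomationExecutionId=execution_id,
        SignalType="Approve",
        Payload={"Comment": ["Approved automatically after the wait time elapsed"]},
    )
```

If the owner has already approved or denied the request, the execution is no longer waiting for a signal and this call can be expected to fail, so a real timer would likely need to handle that error.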

Follow the instructions below to build the runbook:

Note: Select a step-by-step guide below to build the runbook using either the AWS console or CloudFormation template.

Click here for Console step by step
Click here for CloudFormation deployment steps
Click here for CloudFormation CLI deployment step

4.1 Building the “ECS-Scale-Up” runbook.

Next, you are going to build the ECS-Scale-Up runbook, which will do the following:

  1. Run the Approval-Gate runbook which you created previously.
  2. Wait for the Approval-Gate runbook to complete.
  3. Once the Approval-Gate runbook completes successfully, increase the number of ECS tasks in the cluster.
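
The final scale-up step comes down to changing the desired task count of the ECS service. The following is a minimal sketch of that one action, assuming an ECS client such as `boto3.client("ecs")` is passed in; the helper name `scale_up_service` is hypothetical and is not taken from the lab's runbook.

```python
def scale_up_service(ecs_client, cluster, service, desired_count):
    """Set the desired task count of an ECS service (hypothetical helper).

    ecs_client is assumed to be boto3.client("ecs") or a compatible object.
    Returns the desired count reported back by the service.
    """
    if desired_count < 0:
        raise ValueError("desired_count must not be negative")
    # update_service changes the number of tasks ECS tries to keep running.
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=desired_count,
    )
    return response["service"]["desiredCount"]
```

In the runbook, this call would only be reached after the Approval-Gate step has completed successfully.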

Follow the steps below to build the runbook.

Note: Select a step-by-step guide below to build the runbook using either the AWS console or CloudFormation template.

Click here for Console step by step
Click here for CloudFormation Console deployment step
Click here for CloudFormation CLI deployment step

4.2 Executing the remediation runbook.

Now, let's run the runbook you created above to remediate the issue.

  1. Go to the AWS CloudFormation console.

  2. Click on the stack named walab-ops-sample-application.

  3. Click on the Outputs tab and take note of the following output values. You will need these values to execute the runbook.

    • OutputECSCluster
    • OutputECSService
    • OutputSystemOwnersTopicArn

  4. If you are currently using an IAM user or role to log in to your AWS Console, take note of its ARN. You will need this ARN when executing the runbook to restrict who can approve or deny the request.

    To find your current IAM user ARN, go to the IAM console, click Users in the left side menu, then click on your user name. For an IAM role, go to the IAM console, click Roles in the left side menu, then click on the name of the role you are using.

    The ARN is shown on the summary page. Take note of the ARN value, and proceed to the next step.
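
If you prefer to collect the stack outputs programmatically, the sketch below turns a CloudFormation `describe_stacks` response into a simple dictionary. It is only an illustration: the helper name `stack_outputs` is hypothetical, and the topic ARN in the sample response is a made-up placeholder, not a value from your account.

```python
def stack_outputs(describe_stacks_response):
    """Map OutputKey to OutputValue for the first stack in the response."""
    stack = describe_stacks_response["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


# Sample response in the shape returned by describe_stacks.
# The topic ARN is a placeholder for illustration only.
sample_response = {
    "Stacks": [
        {
            "StackName": "walab-ops-sample-application",
            "Outputs": [
                {"OutputKey": "OutputECSCluster", "OutputValue": "mysecretword-cluster"},
                {"OutputKey": "OutputECSService", "OutputValue": "mysecretword-service"},
                {
                    "OutputKey": "OutputSystemOwnersTopicArn",
                    "OutputValue": "arn:aws:sns:us-east-1:111122223333:example-topic",
                },
            ],
        }
    ]
}

outputs = stack_outputs(sample_response)

# With boto3 installed and credentials configured (an assumption), the same
# helper could be fed a live response:
#   import boto3
#   cfn = boto3.client("cloudformation")
#   outputs = stack_outputs(cfn.describe_stacks(StackName="walab-ops-sample-application"))
```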

  1. Go to the Systems Manager Automation console, click on Documents under Shared Resources, then locate and click the automation document called Runbook-ECS-Scale-Up.

  2. Then click Execute automation.

  3. Fill in the Input parameters with the values below.

    • For ECSServiceName, enter the value of OutputECSService that you noted from the CloudFormation stack outputs.

    • For ECSClusterName, enter the value of OutputECSCluster that you noted from the CloudFormation stack outputs.

    • For ApproverArn, enter the IAM user or role ARN that you noted earlier.

    • For ECSDesiredCount, enter 100 to increase the task number to 100.

    • For NotificationMessage, enter any message that can help the approver make an informed decision when approving or denying the requested action.

      For example:

      Hello, your mysecretword app is experiencing performance degradation. To maintain a quality customer experience, we will manually scale up the supporting cluster. This action will be taken approximately 10 minutes after this message is generated unless you deny the action within that period.
      
    • For NotificationTopicArn, enter the value of OutputSystemOwnersTopicArn that you noted from the CloudFormation stack outputs.

    • For Timer, specify PT5M (5 minutes) or another value in ISO 8601 duration format.

  4. Click Execute to run the runbook.

  5. Once the runbook is running, you will receive an email with instructions to approve or deny the request at the email address subscribed to the owners SNS topic. Follow the link in the email while signed in as the user or role whose ARN you entered for ApproverArn. The link will take you to the SSM Console, where you can approve or deny the request.

    If you approve, the runbook will continue. If you ignore the email, the request will be automatically approved after the Timer set in the runbook expires. If you deny, the runbook will fail and no action will be taken.

  6. Once the runbook completes, you can see that the ECS task count increased to the value specified.

  7. Go to the ECS console, click on Clusters, and select mysecretword-cluster.

  8. Click on the mysecretword-service Service, and you will see the number of running tasks increasing to 100 and the average CPUUtilization decreasing.

  9. Subsequently, you will see the API response time return to normal and the CloudWatch Alarm return to an OK state.

    You can check both using your CloudWatch Console, following the steps you ran in section 2.1 Observing the alarm being triggered.
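
The console steps above can also be approximated from a script. The sketch below assembles the same input parameters and starts the runbook. It is a rough illustration under stated assumptions: the helper names `build_parameters` and `start_scale_up` are hypothetical, an SSM client such as `boto3.client("ssm")` is assumed to be passed in, and the Timer check is a simplified pattern that only accepts time-based durations such as PT5M or PT1H30M rather than the full ISO 8601 duration grammar.

```python
import re

# Simplified check: time-only ISO 8601 durations such as PT5M or PT1H30M.
_DURATION = re.compile(r"^PT(?=\d)(?:\d+H)?(?:\d+M)?(?:\d+S)?$")


def build_parameters(service, cluster, approver_arn, desired_count,
                     message, topic_arn, timer="PT5M"):
    """Build the runbook's input parameters (hypothetical helper)."""
    if not _DURATION.match(timer):
        raise ValueError(f"Timer is not a supported ISO 8601 duration: {timer!r}")
    # Automation parameters are passed as lists of strings.
    return {
        "ECSServiceName": [service],
        "ECSClusterName": [cluster],
        "ApproverArn": [approver_arn],
        "ECSDesiredCount": [str(desired_count)],
        "NotificationMessage": [message],
        "NotificationTopicArn": [topic_arn],
        "Timer": [timer],
    }


def start_scale_up(ssm_client, parameters):
    """Start the Runbook-ECS-Scale-Up automation and return its execution ID."""
    response = ssm_client.start_automation_execution(
        DocumentName="Runbook-ECS-Scale-Up",
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]
```

The approval email and the approve or deny decision work the same way regardless of whether the runbook is started from the console or from a script.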

Congratulations!

You have now completed the Automating operations with Playbooks and Runbooks lab. Click on the link below to clean up the lab resources.