Operational Excellence Improvement Plans
OPS 5 - How do you reduce defects, ease remediation, and improve flow into production?
Current Risk: HIGH
Test and validate changes
Test and validate changes: Changes should be tested and the results validated at all lifecycle stages (for example, development, test, and production). Use testing results to confirm new features and mitigate the risk and impact of failed deployments. Automate testing and validation to ensure consistency of review, to reduce errors caused by manual processes, and to reduce the level of effort.
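The automated validation described above can be sketched as a simple promotion gate. This is an illustrative example only, not an AWS API: the stage names and result structure are assumptions.

```python
# Illustrative only: a minimal promotion gate that refuses to advance a
# deployment to the next lifecycle stage unless every validation passed.

def can_promote(results: dict[str, bool]) -> bool:
    """Return True only if there is at least one check and all checks passed."""
    return bool(results) and all(results.values())

checks = {"unit_tests": True, "integration_tests": True, "smoke_test": False}
# The failed smoke test blocks promotion to production.
print(can_promote(checks))
```

In a real pipeline, a gate like this would run automatically between stages rather than being invoked by hand.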
What is AWS CodeBuild?
Local build support for AWS CodeBuild
Share design standards
Share design standards: Share existing best practices, design standards, checklists, operating procedures, and guidance and governance requirements across teams to reduce complexity and maximize the benefits from development efforts. Ensure that procedures exist to request changes, additions, and exceptions to design standards to support continual improvement and innovation. Ensure that teams are aware of published content so that they can take advantage of it and limit rework and wasted effort.
Delegating access to your AWS environment
Share an AWS CodeCommit repository
Easy authorization of AWS Lambda functions
Sharing an AMI with specific AWS accounts
Speed template sharing with an AWS CloudFormation designer URL
Using AWS Lambda with Amazon SNS
Implement practices to improve code quality
Implement practices to improve code quality: Implement practices to improve code quality and minimize defects and the risk of deploying them. Examples include test-driven development, pair programming, code reviews, and standards adoption.
Make frequent, small, reversible changes
Make frequent, small, reversible changes: Frequent, small, and reversible changes reduce the scope and impact of a change. This eases troubleshooting, enables faster remediation, and provides the option to roll back a change. It also increases the rate at which you can deliver value to the business.
Use version control
Use version control: Maintain assets in version-controlled repositories. Doing so supports tracking changes, deploying new versions, detecting changes to existing versions, and reverting to prior versions (for example, rolling back to a known good state in the event of a failure). Integrate the version control capabilities of your configuration management systems into your procedures.
Introduction to AWS CodeCommit
What is AWS CodeCommit?
Fully automate integration and deployment
Use build and deployment management systems: Use build and deployment management systems to track and implement change, to reduce errors caused by manual processes, and to reduce the level of effort. Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation. This reduces lead time and enables increased frequency of change.
What is AWS CodeBuild?
Continuous integration best practices for software development
Slalom: CI/CD for serverless applications on AWS
Introduction to AWS CodeDeploy - automated software deployment with Amazon Web Services
What is AWS CodeDeploy?
Security Improvement Plans
SEC 2 - How do you manage identities for people and machines?
Current Risk: HIGH
Use temporary credentials
Implement least privilege policies: Assign access policies with least privilege to IAM groups and roles, reflecting the role or function that you have defined for each user.
Grant least privilege
Remove unnecessary permissions: Implement least privilege by removing permissions that are unnecessary.
Reducing policy scope by viewing user activity
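A least-privilege identity-based policy can be sketched as a JSON document built in Python. The bucket name below is a hypothetical placeholder; the policy grammar (Version, Statement, Effect, Action, Resource) is the real IAM format.

```python
import json

# A sketch of a least-privilege policy: it allows only the two S3 read
# actions this role actually needs, scoped to one specific bucket.
# The bucket name "example-reports" is a hypothetical placeholder.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-reports",
                "arn:aws:s3:::example-reports/*",
            ],
        }
    ],
}
print(json.dumps(policy, indent=2))
```

Starting from a narrow policy like this and adding actions only when a need is demonstrated is easier to audit than trimming a broad policy down later.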
View role access
Consider permissions boundaries: A permissions boundary is an advanced feature that uses a managed policy to set the maximum permissions that an identity-based policy can grant to an IAM entity. An entity's permissions boundary allows it to perform only the actions that are allowed by both its identity-based policies and its permissions boundaries.
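The intersection behavior of a permissions boundary can be modeled in a few lines. This is a hypothetical simplification: real IAM evaluation also considers resources, conditions, and explicit denies, while this sketch compares action names only.

```python
# Hypothetical model of permissions-boundary evaluation: the effective
# permissions are the intersection of what the identity-based policy and
# the boundary both allow. Actions only; not full IAM policy logic.

def effective_actions(identity_allows: set[str], boundary_allows: set[str]) -> set[str]:
    """An action is effectively allowed only if both policies allow it."""
    return identity_allows & boundary_allows

identity = {"s3:GetObject", "s3:PutObject", "iam:CreateUser"}
boundary = {"s3:GetObject", "s3:PutObject", "s3:ListBucket"}
print(sorted(effective_actions(identity, boundary)))
# → ['s3:GetObject', 's3:PutObject']  (iam:CreateUser is outside the boundary)
```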
Lab: IAM permissions boundaries delegating role creation
Consider resource tags for permissions: You can use tags to control access to your AWS resources that support tagging. You can also tag IAM users and roles to control what they can access.
Lab: IAM tag based access control for EC2
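A tag-based statement might look like the following sketch. The tag key and value are hypothetical; `aws:ResourceTag` is the real IAM condition key for matching tags on the target resource.

```python
import json

# A sketch of tag-based access control: the statement allows starting and
# stopping only EC2 instances tagged Team=analytics. Tag key/value are
# hypothetical examples.
statement = {
    "Effect": "Allow",
    "Action": ["ec2:StartInstances", "ec2:StopInstances"],
    "Resource": "arn:aws:ec2:*:*:instance/*",
    "Condition": {"StringEquals": {"aws:ResourceTag/Team": "analytics"}},
}
print(json.dumps(statement, indent=2))
```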
Attribute-based access control (ABAC)
Store and use secrets securely
Use AWS Secrets Manager: AWS Secrets Manager is an AWS service that makes it easier for you to manage secrets. Secrets can be database credentials, passwords, third-party API keys, and even arbitrary text.
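Retrieving a secret typically looks like the sketch below. The client is injected so the function can be exercised with an offline stub here; in production you would pass `boto3.client("secretsmanager")`. The secret name and its JSON shape are hypothetical; `get_secret_value` is the real Secrets Manager API call.

```python
import json

# A sketch of reading a database credential from AWS Secrets Manager.
# The secret name "prod/db" and its JSON fields are hypothetical examples.

def get_db_credentials(client, secret_id: str) -> dict:
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])

class StubSecretsClient:
    """Stands in for the real boto3 client so the sketch runs offline."""
    def get_secret_value(self, SecretId):
        return {"SecretString": json.dumps({"username": "app", "password": "s3cret"})}

creds = get_db_credentials(StubSecretsClient(), "prod/db")
print(creds["username"])
```

Fetching credentials at runtime like this, rather than embedding them in code or configuration, also lets Secrets Manager rotate them without redeploying the application.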
AWS Secrets Manager
Rely on a centralized identity provider
Centralize administrative access: Create an IAM identity provider entity to establish a trust relationship between your AWS account and your identity provider (IdP). IAM supports IdPs that are compatible with OpenID Connect (OIDC) or SAML 2.0 (Security Assertion Markup Language 2.0).
Identity Providers and Federation
Centralize application access: Consider Amazon Cognito for centralizing application access. It lets you add user sign-up, sign-in, and access control to your web and mobile apps quickly and easily. Amazon Cognito scales to millions of users and supports sign-in with social identity providers, such as Facebook, Google, and Amazon, and with enterprise identity providers via SAML 2.0.
Amazon Cognito
Remove old IAM users and groups: After you start using an identity provider (IdP), remove IAM users and groups that are no longer required.
Finding unused credentials
Deleting an IAM group
Audit and rotate credentials periodically
Regularly audit credentials: Use credential reports and IAM Access Analyzer to audit IAM credentials and permissions.
IAM Access Analyzer
Getting credential report
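Part of such an audit can be automated by parsing the credential report. The sketch below is simplified: the real report is a CSV with many more columns, and the rows here are fabricated sample data.

```python
import csv
import io
from datetime import datetime, timezone

# A simplified audit of an IAM credential report: flag users whose
# password was last used more than 90 days ago. The real report has more
# columns (access keys, MFA, etc.); these sample rows are fabricated.

REPORT = """user,password_last_used
alice,2024-06-01T12:00:00+00:00
bob,2023-01-15T08:30:00+00:00
"""

def stale_users(report_csv: str, now: datetime, max_age_days: int = 90) -> list[str]:
    stale = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        last_used = datetime.fromisoformat(row["password_last_used"])
        if (now - last_used).days > max_age_days:
            stale.append(row["user"])
    return stale

print(stale_users(REPORT, now=datetime(2024, 7, 1, tzinfo=timezone.utc)))
```

Running a check like this on a schedule turns credential review from an occasional manual task into a routine report.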
Lab: Automated IAM user cleanup
Use access levels to review IAM permissions: To improve the security of your AWS account, regularly review and monitor each of your IAM policies. Make sure that your policies grant the least privilege that is needed to perform only the necessary actions.
Use access levels to review IAM permissions
Consider automating IAM resource creation and updates: AWS CloudFormation can be used to automate the deployment of IAM resources, including roles and policies, reducing human error because the templates can be verified and version controlled.
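An IAM role defined in a CloudFormation template can be sketched as the JSON document below, built in Python. The role name is hypothetical; `AWS::IAM::Role` and the assume-role policy structure are the real CloudFormation forms.

```python
import json

# A sketch of a CloudFormation template that creates an IAM role, so the
# role is version-controlled and repeatably deployed. The logical name
# "ReadOnlyAuditRole" is a hypothetical example.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "ReadOnlyAuditRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {
                "AssumeRolePolicyDocument": {
                    "Version": "2012-10-17",
                    "Statement": [{
                        "Effect": "Allow",
                        "Principal": {"Service": "lambda.amazonaws.com"},
                        "Action": "sts:AssumeRole",
                    }],
                },
                "ManagedPolicyArns": ["arn:aws:iam::aws:policy/ReadOnlyAccess"],
            },
        }
    },
}
print(json.dumps(template, indent=2))
```

Because the template is plain text, changes to roles and policies can go through the same review and version-control process as application code.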
Lab: Automated deployment of IAM groups and roles
Leverage user groups and attributes
If you are using AWS Single Sign-On (SSO), configure groups: AWS SSO provides you with the ability to configure groups of users and assign each group the desired level of permissions.
AWS Single Sign-On - Manage Identities
Learn about attribute-based access control (ABAC): Attribute-based access control (ABAC) is an authorization strategy that defines permissions based on attributes.
What Is ABAC for AWS?
Lab: IAM Tag Based Access Control for EC2
Reliability Improvement Plans
REL 2 - How do you plan your network topology?
Current Risk: HIGH
Provision redundant connectivity between private networks in the cloud and on-premises environments
Ensure that you have highly available connectivity between AWS and your on-premises environment: Use multiple AWS Direct Connect (DX) connections or VPN tunnels between separately deployed private networks. Use multiple DX locations for high availability. If using multiple AWS Regions, ensure redundancy in at least two of them. You might want to evaluate AWS Marketplace appliances that terminate VPNs. If you use AWS Marketplace appliances, deploy redundant instances for high availability in different Availability Zones.
Ensure IP subnet allocation accounts for expansion and availability
Plan your network to accommodate growth, regulatory compliance, and integration with others: Growth can be underestimated, regulatory compliance can change, and acquisitions or private network connections can be difficult to implement without proper planning.
- Select relevant AWS accounts and Regions based on your service, latency, regulatory, and disaster recovery (DR) requirements
- Identify your needs for regional VPC deployments
- Identify the size of the VPCs
- Make VPCs as large as possible. The initial CIDR block allocated to your VPC cannot be changed or deleted, but you can add additional non-overlapping CIDR blocks to the VPC. This, however, may fragment your address ranges
- Allow for use of Elastic Load Balancers, Auto Scaling groups, concurrent AWS Lambda invocations, and service endpoints
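CIDR sizing can be reasoned about with the standard library. The ranges below are hypothetical examples; note that AWS additionally reserves five addresses per subnet, which this sketch ignores.

```python
import ipaddress

# Sizing a VPC CIDR block: a /16 leaves ample room for subnets and
# growth, while a smaller block is quickly exhausted.
vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)            # 65536 addresses in the VPC range

# Carving the /16 into /20 subnets yields 16 subnets of 4096 addresses.
subnets = list(vpc.subnets(new_prefix=20))
print(len(subnets))                 # 16
```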
Prefer hub-and-spoke topologies over many-to-many mesh
Prefer hub-and-spoke topologies over many-to-many mesh: If more than two network address spaces (VPCs, on-premises networks) are connected via VPC peering, AWS Direct Connect, or VPN, then use a hub-and-spoke model like that provided by AWS Transit Gateway.
- For only two such networks, you can simply connect them to each other, but as the number of networks grows, the complexity of such meshed connections becomes untenable. AWS Transit Gateway provides an easy-to-maintain hub-and-spoke model, allowing routing of traffic across your multiple networks.
What Is a Transit Gateway?
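The scaling argument above can be made concrete: a full mesh needs n(n-1)/2 pairwise connections, while a hub-and-spoke design needs only n attachments, one per network.

```python
# Why meshes become untenable: connection counts for n networks.

def mesh_links(n: int) -> int:
    """Pairwise peering connections in a full mesh."""
    return n * (n - 1) // 2

def hub_and_spoke_links(n: int) -> int:
    """One attachment per network to the hub (e.g. a transit gateway)."""
    return n

for n in (2, 5, 10):
    print(n, mesh_links(n), hub_and_spoke_links(n))
# At n=10 the mesh already needs 45 connections versus 10 attachments.
```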
Enforce non-overlapping private IP address ranges in all private address spaces where they are connected
Monitor and manage your CIDR use: Evaluate your potential usage on AWS, add CIDR ranges to existing VPCs, and create VPCs to allow planned growth in usage.
- Capture current CIDR consumption (for example, VPCs and subnets)
- Use service API operations to collect current CIDR consumption
- Capture your current subnet usage
- Use service API operations to collect subnets per VPC in each Region
DescribeSubnets
- Record the current usage
- Determine if you created any overlapping IP ranges
- Calculate the spare capacity
- Note overlapping IP ranges: You can either migrate to a new range of addresses or use Network Address Translation (NAT) appliances from AWS Marketplace if you need to connect the overlapping ranges.
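The overlap check in the steps above can be automated with the standard library. The network names and CIDRs below are hypothetical examples.

```python
import ipaddress

# Detecting overlapping private ranges before connecting networks.
ranges = {
    "vpc-a": ipaddress.ip_network("10.0.0.0/16"),
    "vpc-b": ipaddress.ip_network("10.1.0.0/16"),
    "on-prem": ipaddress.ip_network("10.0.128.0/20"),
}

# Compare every pair of ranges; overlaps() does the containment check.
overlaps = [
    (a, b)
    for i, (a, na) in enumerate(ranges.items())
    for b, nb in list(ranges.items())[i + 1:]
    if na.overlaps(nb)
]
print(overlaps)
# → [('vpc-a', 'on-prem')]: the on-premises range sits inside vpc-a's CIDR.
```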
Performance Efficiency Improvement Plans
PERF 2 - How do you select your compute solution?
Current Risk: HIGH
Understand the available compute configuration options
Learn about available configuration options: For your selected compute option, use the available configuration options to optimize for your performance requirements. Utilize the AWS Nitro System to enable full consumption of the compute and memory resources of the host hardware. Dedicated Nitro Cards enable high-speed networking, high-speed EBS, and I/O acceleration.
AWS Nitro System
Collect compute-related metrics
Collect compute-related metrics: Amazon CloudWatch can collect metrics across the compute resources in your environment. Use a combination of CloudWatch and other metrics-recording tools to track the system-level metrics within your workload. Record data such as CPU usage levels, memory, disk I/O, and network to gain insight into utilization levels or bottlenecks. This data is crucial to understand how the workload is performing and how effectively it is using resources. Use these metrics as part of a data-driven approach to actively tune and optimize your workload's resources.
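Pulling such metrics can be sketched as below. The client is injected so an offline stub can stand in here; in production you would pass `boto3.client("cloudwatch")`. The instance ID is hypothetical; `get_metric_statistics` and the `AWS/EC2` `CPUUtilization` metric are real CloudWatch APIs.

```python
from datetime import datetime, timedelta, timezone

# A sketch of averaging an instance's CPU utilization over the last hour.

def average_cpu(client, instance_id: str, hours: int = 1) -> float:
    end = datetime.now(timezone.utc)
    stats = client.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,                      # 5-minute datapoints
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    return sum(p["Average"] for p in points) / len(points)

class StubCloudWatch:
    """Returns canned datapoints so the sketch runs offline."""
    def get_metric_statistics(self, **kwargs):
        return {"Datapoints": [{"Average": 40.0}, {"Average": 60.0}]}

print(average_cpu(StubCloudWatch(), "i-0123456789abcdef0"))
```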
Amazon CloudWatch
Determine the required configuration by right-sizing
Modify your workload configuration by right sizing: To optimize both performance and overall efficiency, determine which resources your workload needs. Choose memory-optimized instances for systems that require more memory than CPU, or compute-optimized instances for components that do data processing that is not memory-intensive. Right sizing enables your workload to perform as well as possible while using only the required resources.
Use the available elasticity of resources
Take advantage of elasticity: Elasticity matches the supply of resources you have against the demand for those resources. Instances, containers, and functions provide mechanisms for elasticity, either in combination with automatic scaling or as a feature of the service. Use elasticity in your architecture to ensure that you have sufficient capacity to meet performance requirements at all scales of use. Ensure that the metrics for scaling elastic resources up or down are validated against the type of workload being deployed. If you are deploying a video transcoding application, 100% CPU utilization is expected and should not be your primary metric; instead, you can measure against the queue depth of waiting transcoding jobs to scale your instances. Ensure that workload deployments can handle both scale-up and scale-down events. Scaling down workload components safely is as critical as scaling up resources when demand dictates. Create test scenarios for scale-down events to ensure that the workload behaves as expected.
Re-evaluate compute needs based on metrics
Use a data-driven approach to optimize resources: To achieve maximum performance and efficiency, use the data gathered over time from your workload to tune and optimize your resources. Look at the trends in your workload's usage of current resources and determine where you can make changes to better match your workload's needs. When resources are over-committed, system performance degrades, whereas underutilization results in less efficient use of resources and higher cost.
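One way to turn collected metrics into a right-sizing decision is a simple heuristic like the sketch below. The thresholds are illustrative assumptions, not AWS guidance.

```python
# A hypothetical right-sizing heuristic driven by collected metrics:
# sustained high memory with low CPU suggests a memory-optimized family,
# the inverse suggests compute-optimized, and balanced usage stays on
# general purpose. Thresholds are illustrative only.

def suggest_family(avg_cpu_pct: float, avg_mem_pct: float) -> str:
    if avg_mem_pct > 70 and avg_cpu_pct < 40:
        return "memory-optimized"
    if avg_cpu_pct > 70 and avg_mem_pct < 40:
        return "compute-optimized"
    return "general-purpose"

print(suggest_family(avg_cpu_pct=25, avg_mem_pct=85))   # memory-optimized
print(suggest_family(avg_cpu_pct=80, avg_mem_pct=30))   # compute-optimized
```

In practice you would feed this with utilization percentiles gathered over days or weeks, not single samples, so that transient spikes do not drive the decision.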
Cost Optimization Improvement Plans
COST 4 - How do you decommission resources?
Current Risk: HIGH
Implement a decommissioning process
Create and implement a decommissioning process: Working with the workload developers and owners, build a decommissioning process for the workload and its resources. The process should cover the method to verify whether the workload is in use, and also whether each of the workload's resources is in use. The process also covers the steps necessary to decommission a resource, removing it from service while ensuring compliance with any regulatory requirements. Any associated resources, such as licenses or attached storage, are also covered. Finally, the process provides notification to the workload owners that the decommissioning process has been executed.
Decommission resources
Decommission resources: Using the decommissioning process, decommission each of the resources that have been identified as orphaned.
Decommission resources automatically
Implement AWS Auto Scaling: For resources that are supported, configure them with AWS Auto Scaling.
Getting Started with Amazon EC2 Auto Scaling
Configure CloudWatch to terminate instances: Instances can be configured to terminate using CloudWatch alarms. Using the metrics from the decommissioning process, implement an alarm with an EC2 action. Ensure that you verify the operation in a non-production environment before rolling it out.
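Such an alarm can be sketched as below. The client is injected so an offline stub can capture the call; in production you would pass `boto3.client("cloudwatch")`. The alarm name and thresholds are hypothetical examples; `put_metric_alarm` and the `arn:aws:automate:<region>:ec2:terminate` alarm action are real APIs.

```python
# A sketch of an alarm that terminates an instance after a full day of
# near-zero CPU, i.e. a likely orphaned resource. Verify in a
# non-production environment first.

def create_idle_terminate_alarm(client, instance_id: str, region: str = "us-east-1"):
    client.put_metric_alarm(
        AlarmName=f"terminate-idle-{instance_id}",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        Statistic="Average",
        Period=3600,                   # 1-hour evaluation periods
        EvaluationPeriods=24,          # a full day below the threshold
        Threshold=2.0,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=[f"arn:aws:automate:{region}:ec2:terminate"],
    )

class StubCloudWatch:
    """Captures the call so the sketch runs offline."""
    def put_metric_alarm(self, **kwargs):
        self.last_call = kwargs

cw = StubCloudWatch()
create_idle_terminate_alarm(cw, "i-0123456789abcdef0")
print(cw.last_call["AlarmName"])
```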
Create Alarms to Stop, Terminate, Reboot, or Recover an Instance
Implement code within the workload: You can use the AWS SDK or AWS CLI to decommission workload resources. Implement code within the application that integrates with AWS and terminates or removes resources that are no longer used.
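SDK-driven decommissioning can be sketched as below. The client is injected so an offline stub can stand in; in production you would pass `boto3.client("ec2")`. The instance ID is hypothetical; `terminate_instances` and its `DryRun` parameter are the real EC2 API.

```python
# A sketch of decommissioning orphaned instances with the AWS SDK.
# Keep dry_run=True until the selection logic has been verified.

def decommission(client, instance_ids: list[str], dry_run: bool = True):
    """Terminate the given instances via the EC2 API."""
    return client.terminate_instances(InstanceIds=instance_ids, DryRun=dry_run)

class StubEc2:
    """Stands in for the real boto3 client so the sketch runs offline."""
    def terminate_instances(self, InstanceIds, DryRun):
        return {"TerminatingInstances": [{"InstanceId": i} for i in InstanceIds]}

result = decommission(StubEc2(), ["i-0123456789abcdef0"], dry_run=False)
print(result["TerminatingInstances"][0]["InstanceId"])
```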