Level 300: Optimize Data Pattern using Amazon Redshift Data Sharing

Author

  • Raman Pujani, Solutions Architect.

Introduction

This lab focuses on optimizing data patterns for sustainability, specifically on removing unneeded or redundant data and minimizing data movement across networks.

Goals

At the end of this lab you will:

  • Understand how to identify optimization areas using AWS Well-Architected Sustainability Pillar best practices
  • Baseline and optimize workloads by identifying sustainability key performance indicators (KPIs)
  • Learn how to use the Amazon Redshift Data Sharing feature to implement data management best practices described in AWS Well-Architected Sustainability Pillar

Prerequisites

  • The lab is designed to run in your own AWS account
  • You can launch an Amazon Redshift cluster in the AWS us-east-1 and us-west-1 regions (referred to as the us-east-1 region and us-west-1 region throughout the lab) using Redshift ra3 nodes
    • This lab is written and tested in the us-east-1 and us-west-1 regions. You may be able to run it in other AWS regions of your choice where Redshift ra3 nodes are available, but results may vary.

NOTE: You will be billed for any applicable AWS resources used to complete this lab that are not covered by the AWS Free Tier. Amazon Redshift ra3 nodes are not part of the Amazon Redshift free trial or the AWS Free Tier. If you decide to stop the lab at any point, please revisit the clean up instructions at the end so you stop incurring costs (e.g. for storage in Amazon S3).

Lab duration

Estimated time required to complete this lab is 90 minutes.

Workload details

AnyCompany (a fictional event management organization) runs a central data warehouse environment on Amazon Redshift in the us-east-1 region, which is used by various departments in the organization for their respective storage, analytical processing, and reporting. The marketing department is the top consumer of the data warehouse and has data engineers, analysts, and scientists based on the US west coast. The marketing team has implemented its own Amazon Redshift cluster in the us-west-1 (consumer) region, which is refreshed nightly by restoring an Amazon Redshift snapshot received from the us-east-1 (producer) region. Because the marketing team's analytical processing consumes significant resources and is integrated with their west-coast, on-premises downstream applications, they perform their analytical processing in the us-west-1 region, while other departments use the data warehouse hosted in the us-east-1 region. This requires storing a redundant dataset in the us-west-1 region and transferring large amounts of data (via the nightly ETL feed) over the network between AWS regions.

This is not a sustainability-friendly implementation, and it can be optimized using AWS Well-Architected Sustainability Pillar best practices for data patterns. Also, with this approach, the insights generated by the marketing department are not based on live data.

Workload optimization for sustainability

In this case, optimization areas include:

  • The marketing team currently uses the full dataset, although only a subset is required for their analytical processing. Using only the required data reduces storage requirements in us-west-1.
  • Reducing the amount of data copied between the us-east-1 and us-west-1 regions reduces network traffic.

Technical solution

By introducing the Amazon Redshift data sharing feature, the marketing department can optimize its implementation for sustainability, avoiding redundant storage and reducing data transfer between AWS regions. Data sharing enables instant, granular, and fast data access across Amazon Redshift clusters without the need to copy or move data. With data sharing, you have live access to data, so your users can see the most up-to-date and consistent information as it is updated in Amazon Redshift clusters.
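To make the mechanics concrete, the sketch below builds the kind of SQL statements a datashare setup involves: the producer cluster creates and populates a datashare and grants it to the consumer's namespace, and the consumer creates a database from it. The object names and namespace IDs are illustrative placeholders, not values from this lab, and cross-region sharing involves additional authorization steps not shown here.

```python
# Sketch of the SQL involved in Redshift data sharing. All names and
# namespace IDs are hypothetical placeholders for illustration only.

def producer_statements(share, schema, consumer_namespace):
    """DDL run on the producer (us-east-1) cluster."""
    return [
        f"CREATE DATASHARE {share};",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {share} ADD ALL TABLES IN SCHEMA {schema};",
        f"GRANT USAGE ON DATASHARE {share} TO NAMESPACE '{consumer_namespace}';",
    ]

def consumer_statement(db, share, producer_namespace):
    """DDL run on the consumer (us-west-1) cluster."""
    return (f"CREATE DATABASE {db} FROM DATASHARE {share} "
            f"OF NAMESPACE '{producer_namespace}';")

# Example: share a hypothetical "marketing" schema with the consumer cluster.
for stmt in producer_statements("marketing_share", "marketing",
                                "11111111-2222-3333-4444-555555555555"):
    print(stmt)
print(consumer_statement("marketing_db", "marketing_share",
                         "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"))
```

After this setup, the consumer cluster queries the shared tables directly instead of restoring a nightly snapshot.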

Redshift environment before implementing Data Sharing feature

Both the producer and consumer clusters store 640 MB each, so the total storage consumed is 1280 MB: Before implementing Redshift Data Sharing

Redshift environment after implementing Data Sharing feature

The producer cluster stores 640 MB while the consumer cluster stores 0 MB, so the total storage consumed is 640 MB: After implementing Redshift Data Sharing
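A quick back-of-the-envelope calculation shows the proxy-metric improvement, using the 640 MB dataset size from this lab. Note that after enabling data sharing, cross-region queries still transfer the data they actually read; what is eliminated is the bulk nightly snapshot copy.

```python
# Compare storage and bulk-transfer proxy metrics before and after
# data sharing, using the 640 MB dataset size from this lab.

DATASET_MB = 640

before_storage_mb = 2 * DATASET_MB       # producer copy + redundant consumer copy
after_storage_mb = DATASET_MB            # producer only; consumer reads the share

before_nightly_transfer_mb = DATASET_MB  # full snapshot copied across regions nightly
after_nightly_transfer_mb = 0            # no bulk replication; queries fetch only what they read

storage_savings_pct = 100 * (before_storage_mb - after_storage_mb) / before_storage_mb
print(f"Storage: {before_storage_mb} MB -> {after_storage_mb} MB "
      f"({storage_savings_pct:.0f}% reduction)")
```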

Sustainability improvement process

The improvement goals of this lab are to:

  • Eliminate waste, low utilization, and idle or unused resources
  • Maximize the value from resources consumed

This lab use case focuses on removing unneeded or redundant data and minimizing data movement across networks. For more details, refer to the Sustainability Pillar Whitepaper, which explains the iterative process that evaluates, prioritizes, tests, and deploys sustainability-focused improvements for cloud workloads.

To evaluate specific improvements, understand the resources provisioned by your workload to complete a unit of work. Evaluate potential improvements, and estimate their potential impact, the cost to implement, and the associated risks. To measure improvements over time, first understand what you have provisioned in AWS and how those resources are being consumed.

Refer to the Sustainability Pillar Whitepaper for a detailed understanding of evaluating specific improvements. At a high level:

  • Use proxy metrics to measure the resources provisioned to achieve business outcomes. (To derive metrics from AWS Cost and Usage Reports, check out this Well-Architected Lab.)

    Proxy Metrics

    For this lab, we will use these proxy metrics:

    • Total data storage used (MB provisioned)
    • Total data transfer over network (MB transferred)

    To find out how much storage is used by the us-west-1 region Redshift cluster, and how much data is transferred over the network between the producer (us-east-1) and consumer (us-west-1) clusters across regions for data replication:
  • Select business metrics to quantify the achievement of business outcomes. Your business metrics should reflect the value provided by your workload, for example, the number of simultaneous active users, API calls served, or the number of transactions completed. For this lab, we will use the total number of events held (the business outcome) as the business metric.

  • To calculate a sustainability key performance indicator (KPI), divide the provisioned resources by the business outcomes achieved to determine the provisioned resources per unit of work:

    Sustainability KPI

    Our improvement goal is to:

    • Reduce total storage used, and data transfer over the network for all events
    • Reduce per event provisioned resources

Steps: