Sanchit Dilip Jain/Perform Workload Isolation using Amazon Redshift Data Sharing 💡

Introduction

What is Amazon Redshift Data Sharing?
- Amazon Redshift Data Sharing is a feature that enables Redshift cluster owners to share live data across different Amazon Redshift clusters, accounts, and regions.
- With data sharing, users can easily and securely share data with other Redshift users, without having to copy or move the data.
What are Amazon Redshift Data Sharing Benefits?
- Data sharing is a great way to improve your security and performance, as well as save money.
  - Enhanced Security - By using Amazon Redshift data sharing, you can ensure that only authorized users have access to sensitive data by limiting the number of tables they can access. This helps prevent accidental or malicious disclosure of confidential information during queries and analysis.
  - Improved Performance - Data sharing improves query performance by allowing multiple users with different privileges to access the same table simultaneously without affecting one another’s results or causing errors in the query plan generated by SQL queries run against it (for example, if two people were trying to run a join on two distinct tables).
  - Cost savings - Sharing data across accounts and regions can help to reduce data storage costs and eliminate the need to move data.

Demo

Objective:
- In this lab, we will share the data between the ETL cluster (used solely for ETL processing) and the isolated BI cluster to provide the read workload isolation and an optional charge-back for costs.
- It will help in sizing each redshift cluster individually to its price and performance requirements and onboard new workloads quickly.
- Please find below the sample diagram for this lab:
Prerequisite - For this demo, we will perform some prerequisites via provisioning an AWS Cloudformation
- Download the Cloudformation from this URL
- Visit AWS Cloudformation console, upload the CloudFormation template, and click on Launch Stack
- The CloudFormation stack will create the necessary resources required for the demo. Check the CloudFormation console and wait for the status CREATE_COMPLETE as shown below
- The cloud formation stack will automatically create tables mimicking the TPC Benchmark data model (an industry-standard test data set) and load the data in etl-redshift-cluster
- Go to Amazon Redshift Console by clicking here
- Select etl-redshift-cluster under Provisioned Clusters from the Redshift console
- Select Query Editor V2 to query etl-redshift-cluster. Configure Query Editor v2, leave settings as default, and press Configure Account
- Let’s connect to our database by selecting etl-redshift-cluster. Select temporary credentials on the list of options and connect using these credentials Database name: dev Database user: awsuser
- Once connected, TPC benchmark tables are already available in the Redshift clusters with data loaded into tables. To show this, click on dev/public and then drop down Tables
- Create a View on customer table to hide sensitive customer information
  CREATE VIEW customer_view AS SELECT C_CUSTKEY, C_NATIONKEY, C_MKTSEGMENT, C_COMMENT FROM customer;
Create a DataShare
- Switch back to the Amazon Redshift console by clicking here
- Select etl-redshift-cluster and click on Datashares option to create a datashare
- Connect to the database under Databases created in my namespace using the following temporary credentials: Database name: dev Database user: awsuser
- Select Create datashare, and Under the Datashare information tab,
  - select the Datashare type as Datashare
  - provide the datashare name as tpc_share
  - select the database name as a dev.
  - Turn On the Publicly accessible setting. This option specifies whether you can share the data with clusters that are publicly accessible
- Under Datashare objects, click on Add and provide the following details
  - select the Schema as Public
  - Object types as Tables and views
  - Select Add specific objects from schema
  - Select all tables except customer table and Click Add. Please Note: We are sharing customer_view instead to hide sensitive customer information
  - Once added, all objects are available under Datashare objects
  - Under Data Consumers, Select the Add namespaces to the datashare and tick the bi-redshift-cluser namespace. Amazon Redshift cluster namespace is the unique ID of the Amazon Redshift cluster. Click Create datashare
  - Once datashare is created, it appears under Datashares created in my namespace
Consuming and Querying DataShare - Follow the below steps to consume the datashare on bi-redshift-cluster
- Click bi-redshift-cluster under the Amazon Redshift Clusters tab
- Click on the Datashares tab and Connect to the database under Databases created in my namespace using the following temporary credentials: Database name: dev Database user: awsuser
- Under Datashare from other namespaces and AWS accounts, tpc_share datashare is already visible. Click on Create database from datashare
- Specify the Database name as tpc and click on Create
- Once the database is created from datashare (you will see a green banner at the top saying tpc is successfully created), it can be queried from bi-redshift-cluster. Connect to bi-redshift-cluster using Query Editor v2
- Click on bi-redshift-cluster and connect using the following credentials Database name: dev Database user: awsuser
- Once connected, Select the database as tpc and schema as public. All shared objects appear
- Now execute the below query using a three-part notation on the bi-redshift-cluster to access shared data.
```
SELECT c_mktsegment, o_orderpriority, sum(o_totalprice)
FROM tpc.public.customer_view c
JOIN tpc.public.orders o on c_custkey = o_custkey
GROUP BY c_mktsegment, o_orderpriority;
```
Workload Isolation
- As the data sharing feature is supported with RA3 node type, which separates the compute and storage using Amazon Redshift Managed Storage for storing data.
- This enables us to pause the producer redshift cluster and consume the data via the consumer redshift cluster.
- Pause the etl-redshift-cluster and query the same query via bi-redshift-cluster again:
- To pause the etl-redshift-cluster, click on the cluster name on the main redshift console. Under the ‘Actions’ dropdown, click on ‘Pause.’
Note: The status will be Modifying for a few minutes and then change into Paused
- Please rerun the below SQL query on bi-redshift-cluster once the etl-redshift-cluster is paused:
```
SELECT c_mktsegment, o_orderpriority, sum(o_totalprice)
FROM tpc.public.customer_view c
JOIN tpc.public.orders o on c_custkey = o_custkey
GROUP BY c_mktsegment, o_orderpriority;
```
Note: During the time producer redshift cluster is performing Pause operation, consumer redshift cluster will not be able to query the datashare. Once the cluster is Paused, then consumer redshift cluster will be able to query the datashare

Resources

Visit this page to find the latest documentation.