Perform Workload Isolation using Amazon Redshift Data Sharing
Introduction
What is Amazon Redshift Data Sharing?
- Amazon Redshift Data Sharing is a feature that enables Redshift cluster owners to share live data across different Amazon Redshift clusters, accounts, and regions.
- With data sharing, users can easily and securely share data with other Redshift users, without having to copy or move the data.
What are Amazon Redshift Data Sharing Benefits?
- Data sharing is a great way to improve your security and performance, as well as save money.
- Enhanced Security - By using Amazon Redshift data sharing, you can ensure that only authorized users have access to sensitive data by limiting the number of tables they can access. This helps prevent accidental or malicious disclosure of confidential information during queries and analysis.
- Improved Performance - Data sharing improves query performance by allowing multiple users with different privileges to access the same table simultaneously without affecting one another’s results or causing errors in the query plan generated by SQL queries run against it (for example, if two people were trying to run a join on two distinct tables).
- Cost savings - Sharing data across accounts and regions can help to reduce data storage costs and eliminate the need to move data.
- Data sharing is a great way to improve your security and performance, as well as save money.
Demo
Objective:
In this lab, we will share the data between the ETL cluster (used solely for ETL processing) and the isolated BI cluster to provide the read workload isolation and an optional charge-back for costs.
It will help in sizing each redshift cluster individually to its price and performance requirements and onboard new workloads quickly.
Please find below the sample diagram for this lab:
Prerequisite - For this demo, we will perform some prerequisites via provisioning an AWS Cloudformation
Download the Cloudformation from this URL
Visit AWS Cloudformation console, upload the CloudFormation template, and click on Launch Stack
The CloudFormation stack will create the necessary resources required for the demo. Check the CloudFormation console and wait for the status CREATE_COMPLETE as shown below
The cloud formation stack will automatically create tables mimicking the TPC Benchmark data model (an industry-standard test data set) and load the data in etl-redshift-cluster
Go to Amazon Redshift Console by clicking here
Select etl-redshift-cluster under Provisioned Clusters from the Redshift console
Select Query Editor V2 to query etl-redshift-cluster. Configure Query Editor v2, leave settings as default, and press Configure Account
Let’s connect to our database by selecting etl-redshift-cluster. Select temporary credentials on the list of options and connect using these credentials Database name: dev Database user: awsuser
Once connected, TPC benchmark tables are already available in the Redshift clusters with data loaded into tables. To show this, click on dev/public and then drop down Tables
Create a View on customer table to hide sensitive customer information
CREATE VIEW customer_view AS SELECT C_CUSTKEY, C_NATIONKEY, C_MKTSEGMENT, C_COMMENT FROM customer;
Create a DataShare
- Switch back to the Amazon Redshift console by clicking here
- Select etl-redshift-cluster and click on Datashares option to create a datashare
Connect to the database under Databases created in my namespace using the following temporary credentials: Database name: dev Database user: awsuser
Select Create datashare, and Under the Datashare information tab,
select the Datashare type as Datashare
provide the datashare name as tpc_share
select the database name as a dev.
Turn On the Publicly accessible setting. This option specifies whether you can share the data with clusters that are publicly accessible
Under Datashare objects, click on Add and provide the following details
select the Schema as Public
Object types as Tables and views
Select Add specific objects from schema
Select all tables except customer table and Click Add. Please Note: We are sharing customer_view instead to hide sensitive customer information
Once added, all objects are available under Datashare objects
Under Data Consumers, Select the Add namespaces to the datashare and tick the bi-redshift-cluser namespace. Amazon Redshift cluster namespace is the unique ID of the Amazon Redshift cluster. Click Create datashare
Once datashare is created, it appears under Datashares created in my namespace
Consuming and Querying DataShare - Follow the below steps to consume the datashare on bi-redshift-cluster
Click bi-redshift-cluster under the Amazon Redshift Clusters tab
Click on the Datashares tab and Connect to the database under Databases created in my namespace using the following temporary credentials: Database name: dev Database user: awsuser
Under Datashare from other namespaces and AWS accounts, tpc_share datashare is already visible. Click on Create database from datashare
Specify the Database name as tpc and click on Create
Once the database is created from datashare (you will see a green banner at the top saying tpc is successfully created), it can be queried from bi-redshift-cluster. Connect to bi-redshift-cluster using Query Editor v2
Click on bi-redshift-cluster and connect using the following credentials Database name: dev Database user: awsuser
Once connected, Select the database as tpc and schema as public. All shared objects appear
Now execute the below query using a three-part notation on the bi-redshift-cluster to access shared data.
SELECT c_mktsegment, o_orderpriority, sum(o_totalprice) FROM tpc.public.customer_view c JOIN tpc.public.orders o on c_custkey = o_custkey GROUP BY c_mktsegment, o_orderpriority;
Workload Isolation
As the data sharing feature is supported with RA3 node type, which separates the compute and storage using Amazon Redshift Managed Storage for storing data.
This enables us to pause the producer redshift cluster and consume the data via the consumer redshift cluster.
Pause the etl-redshift-cluster and query the same query via bi-redshift-cluster again:
To pause the etl-redshift-cluster, click on the cluster name on the main redshift console. Under the ‘Actions’ dropdown, click on ‘Pause.’
Note: The status will be Modifying for a few minutes and then change into Paused
Please rerun the below SQL query on bi-redshift-cluster once the etl-redshift-cluster is paused:
SELECT c_mktsegment, o_orderpriority, sum(o_totalprice) FROM tpc.public.customer_view c JOIN tpc.public.orders o on c_custkey = o_custkey GROUP BY c_mktsegment, o_orderpriority;
Note: During the time producer redshift cluster is performing Pause operation, consumer redshift cluster will not be able to query the datashare. Once the cluster is Paused, then consumer redshift cluster will be able to query the datashare
Resources
- Visit this page to find the latest documentation.