Amazon EMR Serverless - RDS Hive MetaStore Integration
Introduction
What is Amazon EMR?
- Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto.
- Understanding clusters and nodes
- A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. An instance in the cluster is called a node, and each node has a role within the cluster, referred to as the node type.
- The node types in Amazon EMR are as follows:
- Master node: A node that manages the cluster by running software components to coordinate data distribution and tasks among other nodes for processing.
- Core node: A node with software components that run tasks and store data in your cluster’s Hadoop Distributed File System (HDFS).
- Task node: A node with software components that only runs tasks and does not store data in HDFS.
What is Amazon EMR Serverless?
- Amazon EMR Serverless is a serverless option on Amazon EMR that makes it easy to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers.
- Amazon EMR Serverless automatically provisions and scales the compute and memory resources required by your applications, and you pay only for the resources that the applications use.
RDS Hive MetaStore Integration
- A Hive metastore is a centralized location that stores structural information about your tables, including schemas, partition names, and data types. With EMR Serverless, you can persist this table metadata in a metastore that your jobs can access.
- Note that metastores are not auto-initialized when you create an EMR Serverless application or submit a job on EMR Serverless. If your metastore is not initialized with a schema for your Hive version, use the Hive Schema Tool, as sketched below.
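If you need to initialize the schema manually, the following is a minimal sketch using Hive's schematool. The binary path assumes Hive is installed on a host that can reach the RDS instance (for example, an EMR on EC2 cluster); the JDBC endpoint and credentials are placeholders, not values from this demo.

# Minimal sketch: initialize the Hive metastore schema in the RDS/MariaDB database.
# The path, endpoint, user, and password below are placeholders -- substitute your own values.
/usr/lib/hive/bin/schematool \
  -dbType mysql \
  -initSchema \
  -url "jdbc:mariadb://<your-rds-endpoint>:3306/hivemetastore" \
  -driver org.mariadb.jdbc.Driver \
  -userName <your-db-user> \
  -passWord <your-db-password>

In this demo, the CloudFormation template takes care of schema initialization, so the command above is shown only for reference.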
Demo
1. Prerequisites
Objective: Before scheduling the Spark job in EMR Serverless, we need to create the required AWS resources via an AWS CloudFormation template. Below are the resources we will provision:
- Set up the VPC network by deploying the following resources:
- Public and private subnets
- Internet gateway
- NAT gateways (one in each Availability Zone)
- Security group to access the Hive metastore from the EMR Serverless application
- An Amazon Simple Storage Service (Amazon S3) bucket
- Initialize the Hive metastore schema for the EMR release version 6.6.6
- Set up an EMR Serverless job execution IAM role with permissions to access the MySQL environment and the S3 bucket, and a trust policy that allows EMR Serverless to assume the role
- An EMR Serverless application on release 6.8.0 with the Apache Spark runtime (a minimal CLI sketch follows this list)
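The CloudFormation template creates the application for you; for reference, the following is a hedged sketch of creating an equivalent application with the AWS CLI (the application name is hypothetical).

# Creates a Spark application on EMR release 6.8.0; "emr-hms-demo" is a hypothetical name.
# Note: reaching the RDS metastore also requires a network configuration
# (private subnets and the security group), which is omitted here and handled by the stack.
aws emr-serverless create-application \
  --name emr-hms-demo \
  --type SPARK \
  --release-label emr-6.8.0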
Below are the steps to perform this demo:
Provision the AWS CloudFormation stack
Download the CloudFormation template from this URL.
Visit the AWS CloudFormation console and upload the CloudFormation template.
For EC2 Key Pair, select the key pair (.pem file) from the dropdown menu. Leave all the other options as default and deploy the stack.
The CloudFormation stack creates the resources required for the demo. Check the CloudFormation console and wait until the stack status is CREATE_COMPLETE.
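Optionally, you can watch the stack status from CloudShell instead of the console. The sketch below assumes the stack is named emr-hive-metastore, the name referenced later in this demo; substitute your own stack name if it differs.

# Optional: check the stack status from the CLI.
aws cloudformation describe-stacks \
  --stack-name emr-hive-metastore \
  --query 'Stacks[0].StackStatus' \
  --output text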
- Download the dataset for the demo and upload it to the Amazon S3 bucket provisioned by AWS CloudFormation in the previous step, under the input folder.
- Download the script for the demo and upload it to the same Amazon S3 bucket, under the script folder (example upload commands follow this list).
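If you prefer to upload from the CLI, the following is a hedged sketch. The dataset file name is a placeholder, since it depends on what you downloaded; the script name spark-nyctaxi.py matches the entry point used later in this demo, and S3_BUCKET is assumed to be set to the bucket created by the stack.

# Placeholders: replace <dataset-file> with the downloaded dataset file name,
# and set S3_BUCKET to the bucket created by the CloudFormation stack.
aws s3 cp <dataset-file> "s3://${S3_BUCKET}/input/"
aws s3 cp spark-nyctaxi.py "s3://${S3_BUCKET}/script/spark-nyctaxi.py"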
2. Submit a Spark job with EMR Serverless Application
Objective: In this demo, we submit a Spark job to the EMR Serverless application and verify that the metastore tables contain the expected metadata, such as tables in the TBLS table and columns in the COLUMNS_V2 table.
Below are the steps to perform this demo:
Go back to the Outputs section of the AWS CloudFormation stack deployed in the previous step, then copy and replace the values for the respective keys in the code snippet below.
sudo yum install jq -y
export MariaDBHost=<Replace with EMRServerlessHiveMetastoreDB>
export JOB_ROLE_ARN=<Replace with JobRoleArn>
export S3_BUCKET=<Replace with S3Bucket>
export SPARK_APPLICATION_ID=<Replace with SparkApplicationId>
export SECRET_ID=<Replace with RDSSecretId>
export DBUSER=$(aws secretsmanager get-secret-value --secret-id $SECRET_ID | jq --raw-output '.SecretString' | jq -r .MasterUsername)
export DBPASSWORD=$(aws secretsmanager get-secret-value --secret-id $SECRET_ID | jq --raw-output '.SecretString' | jq -r .MasterUserPassword)
export JDBCDriverClass=org.mariadb.jdbc.Driver
export JDBCDriver=mariadb-connector-java.jar
After the values are replaced, copy the code and execute it in the CloudShell terminal to set the environment variables.
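As an optional sanity check (not part of the original steps), you can confirm that the variables resolved before moving on; the secrets-derived DBUSER should print a real user name, not an empty string.

# Print the non-secret variables; DBUSER should resolve from Secrets Manager.
echo "$MariaDBHost $JOB_ROLE_ARN $S3_BUCKET $SPARK_APPLICATION_ID $DBUSER"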
Upload the MariaDB connector to your workshop S3 bucket by executing the following code:
wget https://downloads.mariadb.com/Connectors/java/connector-java-3.0.7/mariadb-java-client-3.0.7.jar
aws s3 cp mariadb-java-client-3.0.7.jar "s3://${S3_BUCKET}/mariadb-connector-java.jar"
Run the following command to execute the Spark ETL job on your EMR Serverless application. The S3 output location is passed as a parameter to the script via entryPointArguments.
aws emr-serverless start-job-run \
  --application-id $SPARK_APPLICATION_ID \
  --execution-role-arn $JOB_ROLE_ARN \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://'${S3_BUCKET}'/script/spark-nyctaxi.py",
      "entryPointArguments": ["s3://'"${S3_BUCKET}"'/", "s3://'"${S3_BUCKET}"'/input/", "s3://'"${S3_BUCKET}"'/taxi-data-rds-hive-metastore/"],
      "sparkSubmitParameters": "--jars s3://'${S3_BUCKET}'/'${JDBCDriver}' --conf spark.hadoop.javax.jdo.option.ConnectionDriverName='${JDBCDriverClass}' --conf spark.hadoop.javax.jdo.option.ConnectionUserName='${DBUSER}' --conf spark.hadoop.javax.jdo.option.ConnectionPassword='${DBPASSWORD}' --conf spark.hadoop.javax.jdo.option.ConnectionURL=\"jdbc:mariadb://'${MariaDBHost}':3306/hivemetastore\" --conf spark.driver.cores=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=2"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://'${S3_BUCKET}'/sparklogs/"
      }
    }
  }'
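If you prefer to track the run from CloudShell instead of the console, the following is a hedged sketch. It assumes you copy the jobRunId field from the start-job-run response into JOB_RUN_ID, a variable introduced here only for illustration.

# JOB_RUN_ID is illustrative: set it from the jobRunId returned by start-job-run above.
export JOB_RUN_ID=<Replace with the jobRunId from the start-job-run response>
aws emr-serverless get-job-run \
  --application-id $SPARK_APPLICATION_ID \
  --job-run-id $JOB_RUN_ID \
  --query 'jobRun.state' \
  --output text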
Navigate to the EMR console. Check the AWS Region and ensure that you have selected the right Region. Choose EMR Serverless from the left sidebar. A new window pops up showing Amazon EMR Serverless applications. Click on Manage Applications.
Choose the EMR Serverless application ID matching the value you copied from the Outputs section of the emr-hive-metastore CloudFormation stack, that is, the application you submitted the Spark job to.
Check the status of the latest job ID you submitted and wait until its Run status is Success.
Navigate to the S3 console and check the output data created in the S3 bucket (named like emr-hms-[AWS-REGION]-[ACCOUNT-ID], the value you copied from the Outputs section of the CloudFormation stack).
You will notice a new folder named taxi-data-rds-hive-metastore.
You will notice that a Parquet file has been written within the taxi-data-rds-hive-metastore folder; you can also verify this from CloudShell, as shown below.
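The following optional check lists the job output from CloudShell; it assumes S3_BUCKET is still set from the earlier step.

# List the Parquet output written by the Spark job.
aws s3 ls "s3://${S3_BUCKET}/taxi-data-rds-hive-metastore/" --recursive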
Go back to CloudShell to verify the metastore table creation in the MariaDB Hive metastore.
Connect to the hivemetastore database using the following command:
mysql --host $MariaDBHost --user=$DBUSER --password=$DBPASSWORD hivemetastore
You can see the metastore tables that contain different kinds of metastore information, such as tables in the TBLS table and columns in the COLUMNS_V2 table.
Run a SELECT query on the TBLS table to see the table we created in the Spark program.
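The following is a minimal example of such a query. It assumes the environment variables from earlier in this demo are still set; TBL_ID, DB_ID, TBL_NAME, and TBL_TYPE are standard columns of the Hive metastore TBLS table. The SQL inside -e can also be pasted at the mysql prompt opened in the previous step.

# Query the Hive metastore directly from CloudShell.
mysql --host $MariaDBHost --user=$DBUSER --password=$DBPASSWORD hivemetastore \
  -e "SELECT TBL_ID, DB_ID, TBL_NAME, TBL_TYPE FROM TBLS;"

You should see the table created by the Spark program listed in the query result.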
Resources
- Visit Amazon EMR Serverless to find the latest documentation.