Amazon EMR Serverless - Glue Hive MetaStore Integration
Introduction
What is Amazon EMR?
- Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto.
- Understanding clusters and nodes
- A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node, and each node has a role within the cluster, referred to as the node type.
- The node types in Amazon EMR are as follows:
- Master node: A node that manages the cluster by running software components to coordinate data distribution and tasks among other nodes for processing.
- Core node: A node with software components that run tasks and store data in your cluster’s Hadoop Distributed File System (HDFS).
- Task node: A node with software components that only runs tasks and does not store data in HDFS.
What is Amazon EMR Serverless?
- Amazon EMR Serverless is a serverless option on Amazon EMR that makes it easy to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers.
- Amazon EMR Serverless automatically provisions and scales the compute and memory resources required by your applications, and we only pay for the resources that the applications use.
Glue Hive MetaStore Integration
- With Amazon EMR release 5.8.0 or later, we can configure Spark SQL to use the AWS Glue Data Catalog as its metastore.
- AWS Glue Data Catalog provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore.
- This configuration is recommended when we require a persistent metastore or a metastore shared by different clusters, services, applications, or AWS accounts.
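As a minimal sketch of what this configuration can look like for a Spark job, the helper below builds the `--conf` string that points Spark's Hive metastore client at the AWS Glue Data Catalog. The factory class and the `catalogid` property are the settings AWS documents for this integration; the function name `glue_catalog_spark_params` and its parameter are hypothetical names chosen for illustration, and the account ID in the comment is a placeholder.

```python
# Hive client factory class that AWS documents for the Glue Data Catalog integration.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_spark_params(catalog_account_id=None):
    """Build spark-submit --conf flags that select the Glue Data Catalog.

    catalog_account_id is optional (hypothetical parameter name): set it only
    when the catalog lives in a different AWS account than the job.
    """
    confs = {"spark.hadoop.hive.metastore.client.factory.class": GLUE_FACTORY}
    if catalog_account_id:
        confs["spark.hadoop.hive.metastore.glue.catalogid"] = catalog_account_id
    return " ".join(f"--conf {key}={value}" for key, value in confs.items())


# Same-account catalog; pass e.g. "123456789012" (placeholder) for cross-account.
print(glue_catalog_spark_params())
```

The resulting string can be supplied as the Spark submit parameters of a job. Whether it is needed at all depends on how the application and job are configured, so treat it as something to verify against your own setup.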
Demo
1. Prerequisites
Objective: In this demo, we will complete the prerequisites: provisioning an AWS CloudFormation stack and uploading the data and script to an Amazon S3 bucket.
Below are the steps to perform this demo:
Provisioning an AWS CloudFormation stack
Download the CloudFormation template from this URL
Open the AWS CloudFormation console, upload the CloudFormation template, and click Launch Stack.
The CloudFormation stack will create the resources required for the demo. Check the CloudFormation console and wait for the status CREATE_COMPLETE.
Download the dataset for the demo and upload it to the Amazon S3 bucket provisioned by AWS CloudFormation in the previous step, under the input folder.
- Download the script for the demo and upload it to the Amazon S3 bucket provisioned by AWS CloudFormation in the previous step, under the script folder.
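The prerequisite steps above can also be scripted. The following is a rough sketch, assuming boto3 clients are passed in by the caller: it waits for the stack to reach CREATE_COMPLETE, reads the stack outputs, and uploads the two files under the `input` and `script` folders. The stack name, the output key that holds the bucket name, and the dataset file name are not given in this walkthrough, so the values in the usage comment are placeholders.

```python
import os


def stack_outputs(cfn_client, stack_name):
    """Wait for CREATE_COMPLETE, then return the stack outputs as a dict."""
    cfn_client.get_waiter("stack_create_complete").wait(StackName=stack_name)
    stack = cfn_client.describe_stacks(StackName=stack_name)["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


def demo_object_key(local_path, folder):
    """Place a local file under the given folder (prefix) in the bucket."""
    return f"{folder}/{os.path.basename(local_path)}"


def upload_demo_files(s3_client, bucket, dataset_path, script_path):
    """Upload the dataset to input/ and the script to script/; return S3 URIs."""
    uploads = [
        (dataset_path, demo_object_key(dataset_path, "input")),
        (script_path, demo_object_key(script_path, "script")),
    ]
    for local_path, key in uploads:
        s3_client.upload_file(local_path, bucket, key)
    return [f"s3://{bucket}/{key}" for _, key in uploads]


# Usage (placeholders: stack name, output key, and dataset file name):
#   import boto3
#   outputs = stack_outputs(boto3.client("cloudformation"), "YOUR_STACK_NAME")
#   upload_demo_files(boto3.client("s3"), outputs["YOUR_BUCKET_OUTPUT_KEY"],
#                     "YOUR_DATASET_FILE", "spark-nyctaxi.py")
```

Passing the clients in as arguments keeps the helpers easy to try against a stub before pointing them at a real account.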
2. Submit Spark jobs from EMR Studio
Objective: In this demo, we will set up Amazon EMR Serverless by creating an application, then submit an Apache Spark job twice to create a database and tables in AWS Glue and run queries against them.
Below are the steps to perform this demo:
In the AWS Console, under Services, search for EMR. In the EMR console, select EMR Serverless, or go directly to the EMR Serverless console.
Click Create and launch EMR Studio.
- To create an EMR Serverless application, choose Create application.
- Enter my-serverless-application as the name of the application. Choose Spark as the Type and emr-6.8.0 as the release version. Keep the default settings and click Create application.
- You should see the status Created.
- Now you are ready to submit a job. To do this, choose Submit job.
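For readers who prefer scripting over the console, the application-creation steps above map onto the `create_application` and `get_application` calls of the boto3 `emr-serverless` client. This is only a sketch under the assumption that the caller supplies that client; the helper names are hypothetical, while the application name, type, and release version are the ones used in this demo.

```python
def application_request(name="my-serverless-application",
                        release_label="emr-6.8.0"):
    """Request parameters mirroring the console choices in this demo."""
    return {"name": name, "releaseLabel": release_label, "type": "SPARK"}


def create_application(emr_client, request):
    """Create the application and return its ID."""
    response = emr_client.create_application(**request)
    return response["applicationId"]


def application_state(emr_client, application_id):
    """Return the application state, e.g. CREATED once creation finishes."""
    response = emr_client.get_application(applicationId=application_id)
    return response["application"]["state"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   emr = boto3.client("emr-serverless")
#   app_id = create_application(emr, application_request())
#   print(application_state(emr, app_id))
```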
In the Submit job screen, enter the details below:
- Name: word_count
- Runtime role: EMRServerlessS3RuntimeRole
- Script location S3 URI: s3://YOUR_BUCKET/script/spark-nyctaxi.py
- Script arguments: ["s3://YOUR_BUCKET/input/", "s3://YOUR_BUCKET/output/", "tripdata"]
Replace YOUR_BUCKET with the S3 bucket name you noted in the CloudFormation stack Outputs section.
Click Submit job
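The same job submission can be expressed with the `start_job_run` call of the boto3 `emr-serverless` client. The sketch below builds the request from the values entered in the Submit job screen; the helper names are hypothetical, and the role ARN in the usage comment is a placeholder, since the API expects the full ARN of the runtime role rather than just its name.

```python
def build_job_run_request(application_id, execution_role_arn, bucket):
    """Mirror the Submit job form: name, runtime role, script, and arguments."""
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "name": "word_count",
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": f"s3://{bucket}/script/spark-nyctaxi.py",
                "entryPointArguments": [
                    f"s3://{bucket}/input/",
                    f"s3://{bucket}/output/",
                    "tripdata",
                ],
            }
        },
    }


def submit_job(emr_client, request):
    """Start the job run and return its ID."""
    return emr_client.start_job_run(**request)["jobRunId"]


# Usage (placeholders: application ID, account ID in the ARN, bucket name):
#   import boto3
#   emr = boto3.client("emr-serverless")
#   request = build_job_run_request(
#       "YOUR_APPLICATION_ID",
#       "arn:aws:iam::YOUR_ACCOUNT_ID:role/EMRServerlessS3RuntimeRole",
#       "YOUR_BUCKET",
#   )
#   job_run_id = submit_job(emr, request)
```

Note that each script argument is passed as a separate list element, matching the three-element array entered in the console.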
- You can see that the Application status shows as Starting, and the Job run status shows as Pending.
- Once the job finishes, the Run status shows Success.
- You can now verify that the job has written its output to the S3 path we provided as an argument when submitting it. Go to that S3 path to see the CSV files created by the EMR Serverless application.
- You can go to the Athena console to verify that the table was created and query it to check the data.
- Click the three dots next to the table and click Preview table.
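The status check and the table preview can likewise be sketched in code. The example below polls `get_job_run` until the run reaches a terminal state and then starts an Athena query equivalent to a table preview. It assumes boto3 clients are supplied by the caller; the helper names are hypothetical, and the database name, table name, and query results location are placeholders because this walkthrough does not state them.

```python
import time

# Job run states after which no further transitions are expected.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def wait_for_job_run(emr_client, application_id, job_run_id,
                     poll_seconds=30, sleep=time.sleep):
    """Poll the job run until it reaches a terminal state; return that state."""
    while True:
        response = emr_client.get_job_run(
            applicationId=application_id, jobRunId=job_run_id
        )
        state = response["jobRun"]["state"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


def preview_table(athena_client, database, table, results_s3_uri):
    """Start a 10-row preview query and return the query execution ID."""
    query = f'SELECT * FROM "{database}"."{table}" LIMIT 10'
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": results_s3_uri},
    )
    return response["QueryExecutionId"]


# Usage (placeholders throughout):
#   import boto3
#   state = wait_for_job_run(boto3.client("emr-serverless"),
#                            "YOUR_APPLICATION_ID", "YOUR_JOB_RUN_ID")
#   if state == "SUCCESS":
#       preview_table(boto3.client("athena"), "YOUR_DATABASE", "YOUR_TABLE",
#                     "s3://YOUR_BUCKET/athena-results/")
```

The `sleep` parameter exists only so the polling loop can be exercised without real waiting.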
Resources
- Visit this page to find the latest documentation.