Orchestrating Amazon EMR Serverless using Amazon Managed Workflows for Apache Airflow (MWAA)
Introduction
What is Amazon EMR?
- Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto.
- Understanding clusters and nodes
- A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node, and each node has a role within the cluster, referred to as the node type.
- The node types in Amazon EMR are as follows:
- Master node: A node that manages the cluster by running software components to coordinate data distribution and tasks among other nodes for processing.
- Core node: A node with software components that run tasks and store data in your cluster’s Hadoop Distributed File System (HDFS).
- Task node: A node with software components that only runs tasks and does not store data in HDFS.
What is Amazon EMR Serverless?
- Amazon EMR Serverless is a serverless option on Amazon EMR that makes it easy to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers.
- Amazon EMR Serverless automatically provisions and scales the compute and memory resources required by your applications, and you pay only for the resources that the applications use.
What is Amazon Managed Workflows for Apache Airflow (MWAA)?
- Amazon Managed Workflows for Apache Airflow (MWAA) is a managed Apache Airflow service used to extract business insights across an organization by combining, enriching, and transforming data through a series of tasks called a workflow.
- MWAA orchestrates your workflows using Directed Acyclic Graphs (DAGs) written in Python. You provide MWAA with an Amazon Simple Storage Service (Amazon S3) bucket where your DAGs, plugins, and Python requirements reside.
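To illustrate what "directed acyclic graph" means for task ordering, here is a minimal sketch that uses only the Python standard library rather than Airflow itself. The task names are hypothetical and are not taken from the demo DAG; the point is only that each task runs after the tasks it depends on, and that a cycle makes the workflow invalid.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each key is a task, each value is the set of
# tasks that must finish before that task can start.
workflow = {
    "extract": set(),
    "enrich": {"extract"},
    "transform": {"enrich"},
    "report": {"transform"},
}


def execution_order(dependencies):
    """Return one valid run order, or raise ValueError if there is a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # args[1] of CycleError holds the list of nodes forming the cycle.
        raise ValueError(f"not a DAG, cycle found: {exc.args[1]}") from exc


order = execution_order(workflow)
print(order)
```

An Airflow DAG expresses the same idea with operators and dependency declarations; the scheduler then decides when each task actually runs.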
Demo
1. Prerequisites
Objective: In this demo, we set up the prerequisites by provisioning an AWS CloudFormation stack.
Set up the VPC network by deploying the following resources:
- Public and private subnets
- Internet gateway
- NAT gateways (one in each Availability Zone)
- Security group to access Amazon MWAA environment
Create an Amazon Simple Storage Service (Amazon S3) bucket.
Set up an Amazon Managed Workflows for Apache Airflow (MWAA) environment with an IAM role to integrate with EMR Serverless.
Set up an EMR Serverless job execution IAM role with permissions to access the S3 bucket and a trust relationship that allows EMR Serverless to assume the role.
Create an EMR Serverless application with release 6.8.0 and the Apache Spark runtime.
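As a rough sketch of what the last two prerequisites amount to, the snippet below builds a trust policy for the job execution role and the keyword arguments for a boto3 `emr-serverless` `create_application` call. This is not taken from the CloudFormation template, which is assumed to declare equivalent resources; the application name is hypothetical, and the release label `emr-6.8.0` is derived from the version mentioned above.

```python
import json

# Service principal that EMR Serverless uses when assuming a job role.
EMR_SERVERLESS_PRINCIPAL = "emr-serverless.amazonaws.com"


def job_role_trust_policy():
    """Trust policy that lets EMR Serverless assume the job execution role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": EMR_SERVERLESS_PRINCIPAL},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def spark_application_request(name, release_label="emr-6.8.0"):
    """Keyword arguments for the boto3 emr-serverless create_application call."""
    return {"name": name, "releaseLabel": release_label, "type": "SPARK"}


trust_policy_json = json.dumps(job_role_trust_policy(), indent=2)
request = spark_application_request("demo-spark-app")  # hypothetical name
print(trust_policy_json)
print(request)
```

With boto3 installed and credentials configured, the request could be passed as `boto3.client("emr-serverless").create_application(**request)`. In this demo the CloudFormation stack takes care of it, so the sketch is only for orientation.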
Below are the steps to perform this demo:
Provisioning an AWS CloudFormation stack
Download the CloudFormation template from this URL
Visit the AWS CloudFormation console, upload the CloudFormation template, and click Launch Stack.
The CloudFormation stack creates the resources required for the demo. Check the CloudFormation console and wait for the stack status to reach CREATE_COMPLETE.
Download the airflow-dag and upload it to the Amazon S3 bucket provisioned by AWS CloudFormation in the previous step, under the dags folder.
Download the airflow-plugin and upload it to the Amazon S3 bucket provisioned by AWS CloudFormation in the previous step.
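For readers who prefer scripting these steps over using the console, the following is a minimal sketch of the two pieces of logic involved: waiting for the stack to finish, and computing the S3 key for a DAG file under the dags folder. The `get_status` callable and the example file name are assumptions made for illustration; the AWS calls themselves are left out so the snippet runs without credentials.

```python
import time
from pathlib import PurePosixPath


def wait_for_stack(get_status, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll get_status() until the stack reaches CREATE_COMPLETE.

    get_status is any zero-argument callable returning the stack status
    string; it is injected so the loop can be exercised without AWS access.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == "CREATE_COMPLETE":
            return status
        if status != "CREATE_IN_PROGRESS":
            # Any other status (for example a rollback) means creation failed.
            raise RuntimeError(f"stack creation ended in {status}")
        sleep(poll_seconds)
    raise TimeoutError("stack did not reach CREATE_COMPLETE in time")


def dag_object_key(local_path):
    """S3 object key for a DAG file: its file name under the dags/ prefix."""
    return str(PurePosixPath("dags") / PurePosixPath(local_path).name)


print(dag_object_key("/tmp/downloads/example_dag.py"))  # hypothetical file
```

With boto3, `get_status` could wrap `describe_stacks(StackName=...)["Stacks"][0]["StackStatus"]` on a `cloudformation` client (boto3 also ships a `stack_create_complete` waiter), and the upload could use `upload_file(local_path, bucket, key)` on an `s3` client.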
2. Trigger the jobs in Apache Airflow UI
Objective: In this demo, we submit a Spark job to EMR Serverless from the Airflow UI using a DAG and verify the logs in the Airflow console.
Below are the steps to perform this demo:
Navigate to the Managed Apache Airflow console
Click the Open Airflow UI link for the Managed Apache Airflow environment amazon-mwaa-emr-demo-MwaaEnvironment
- Download the airflow-job-config-template and replace the values of application_id, iam_role_arn, and s3bucket with the ones provisioned by CloudFormation in the previous step.
- Upload config.json in the Airflow UI under the Admin dropdown
- You should see the DAG emr_serverless_job in the Apache Airflow UI (it may take 2-3 minutes for the DAG to show up)
- Click the play button for the DAG emr_serverless_job in the Actions column, then click Trigger DAG to schedule the Spark job on EMR Serverless
- Once the status changes from light green to dark green, the job is complete
- Click the DAG emr_serverless_job. In the next window, click the dark green square icon next to start_job, then click the Logs button in the popup window.
- You will be able to see the logs for the Spark job. Notice that the final state of the job is success.
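The configuration and job submission steps above can be sketched in plain Python. This is not the demo's DAG or plugin code: it assumes the config template is a flat JSON object with the keys application_id, iam_role_arn, and s3bucket, that s3bucket holds a bare bucket name, and it uses a hypothetical entry point and log prefix. The request shape follows the boto3 `emr-serverless` `start_job_run` parameters.

```python
import json

# Job run states after which EMR Serverless reports no further change.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def fill_config(template_text, application_id, iam_role_arn, s3bucket):
    """Return config JSON text with the three demo values replaced.

    Assumes the template is a flat JSON object (an assumption, the real
    template layout is not shown in this guide).
    """
    config = json.loads(template_text)
    config.update(
        application_id=application_id,
        iam_role_arn=iam_role_arn,
        s3bucket=s3bucket,
    )
    return json.dumps(config, indent=2)


def start_job_run_request(config, entry_point):
    """Keyword arguments for the boto3 emr-serverless start_job_run call."""
    return {
        "applicationId": config["application_id"],
        "executionRoleArn": config["iam_role_arn"],
        "jobDriver": {"sparkSubmit": {"entryPoint": entry_point}},
        "configurationOverrides": {
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {
                    # Hypothetical log prefix inside the demo bucket.
                    "logUri": f"s3://{config['s3bucket']}/logs/"
                }
            }
        },
    }


def is_finished(state):
    """True once a job run state is terminal (success, failure, or cancel)."""
    return state in TERMINAL_STATES


# All values below are placeholders, not real resources.
template = '{"application_id": "", "iam_role_arn": "", "s3bucket": ""}'
config_text = fill_config(
    template,
    application_id="00example",
    iam_role_arn="arn:aws:iam::123456789012:role/example-job-role",
    s3bucket="example-bucket",
)
job_request = start_job_run_request(
    json.loads(config_text), "s3://example-bucket/scripts/job.py"
)
print(json.dumps(job_request, indent=2))
```

A task such as start_job would typically submit a request like this and then poll the job run state until `is_finished` returns true, which corresponds to the square turning dark green on success.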
Resources
- Visit Amazon EMR Serverless to find the latest documentation.
- Visit Amazon Managed Workflows for Apache Airflow (MWAA) to find the latest documentation.