Big Data on AWS
Big data software development has cemented itself as a firm niche in recent years. The Hadoop ecosystem has been a strong underpinning of big data processing because it offers so many robust tools and out-of-the-box services that address the most fundamental problems of big data processing. AWS has a dedicated service for running MapReduce jobs called Elastic MapReduce (EMR), and in this post I'll show you how to create a basic cluster of EC2 instances and submit a MapReduce job to it. The job will count the words in a series of job requirements.
Getting Started with EMR
To get started with EMR you might consider making an IAM group to manage big data permissions. In any case you'll need the AmazonElasticMapReduceFullAccess and AmazonElasticMapReduceforEC2Role policies to gain access to the EMR console. We'll also use the MapR Distribution for Hadoop. The example given here is more focused on setting up a cluster and running a demo job than on the mechanics of writing custom MapReduce jobs. An EMR cluster is essentially a fleet of EC2 instances, so it's important that you have an EC2 key pair ready to use for your cluster. Your EC2 key pairs are listed in the EC2 dashboard.
Starting the Cluster
While you certainly can use the AWS console to start the cluster, I'm going to start mine from a local terminal. If you choose to do this, it's important that your local terminal is configured with AWS credentials so it can make the necessary API calls. I'm using the command line here for simplicity because it shows all the options I'm using to start the cluster. Running the command
aws emr create-cluster --name "start" --applications Name=Hive Name=Pig Name=MapR,Args=--edition,m7,--version,4.0.2 --ami-version 3.3.2 --use-default-roles --ec2-attributes KeyName=cert --instance-type m3.xlarge --instance-count 3
will magically start your cluster of 3 instances and return the cluster's ID on the command line. If you return to the EMR console you'll see this ID as the cluster starts.
There are a few important details to notice in the command we used. First, notice that KeyName is the name of one of your EC2 key pairs. It's important not to include ".pem" at the end of the key name or the cluster will fail to start. Also of note is that we are using MapR to make Hadoop easier to deal with and ultimately more dependable, because it uses a no-NameNode architecture.
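If you'd rather script the launch than remember the CLI flags, the same request can be made through boto3. This is a sketch under my own assumptions: the helper name `build_job_flow_params` is mine, and the default role names are the ones `--use-default-roles` normally implies.

```python
# Sketch: starting the same EMR cluster with boto3 instead of the AWS CLI.
# The parameters mirror the create-cluster command above; the function
# name and the region are illustrative assumptions.

def build_job_flow_params(key_name, instance_count=3):
    """Build run_job_flow arguments equivalent to the CLI flags above."""
    return {
        "Name": "start",
        "AmiVersion": "3.3.2",
        "Applications": [
            {"Name": "Hive"},
            {"Name": "Pig"},
            {"Name": "MapR", "Args": ["--edition", "m7", "--version", "4.0.2"]},
        ],
        "Instances": {
            "Ec2KeyName": key_name,           # key pair name WITHOUT ".pem"
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": instance_count,
        },
        # What --use-default-roles implies:
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

def start_cluster(key_name):
    """Create the cluster; needs boto3 and configured AWS credentials."""
    import boto3  # imported here so the builder above works offline
    emr = boto3.client("emr", region_name="us-east-1")
    return emr.run_job_flow(**build_job_flow_params(key_name))["JobFlowId"]
```

Calling `start_cluster("cert")` returns the same cluster ID the CLI prints.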
Running a Job
Once your cluster is up and running, the master node will have a security group allowing SSH access. You can find a connection string for that node in the summary of your cluster.
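You can also look the connection details up programmatically. This is a hypothetical sketch: the SSH user `hadoop` and the placeholder cluster ID are my assumptions, not something from the cluster summary itself.

```python
# Sketch: finding the master node's public DNS name with boto3 rather
# than copying it from the console. The "hadoop" SSH user and the key
# file name are assumptions for illustration.

def master_dns(description):
    """Extract MasterPublicDnsName from a describe_cluster response."""
    return description["Cluster"]["MasterPublicDnsName"]

def ssh_hint(cluster_id):
    """Build an SSH command line for the cluster's master node."""
    import boto3  # needs configured AWS credentials when actually called
    emr = boto3.client("emr", region_name="us-east-1")
    desc = emr.describe_cluster(ClusterId=cluster_id)
    return "ssh -i cert.pem hadoop@" + master_dns(desc)
```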
Once you are on the master node, you can run a MapReduce job on the cluster.
Processing Job Ads with MapReduce
To give an example of running a MapReduce job on our cluster, we are going to create a text file containing several job descriptions. We will then run a word-count MapReduce job to find the most relevant skills across all of the proposed jobs. To that end I created a text file in /mapr/MapR_EMR.amazonaws.com/in to which I added the following job descriptions:
Demonstrated experience writing/transforming/extending code using Java 8 enterprise. Demonstrated experience working with a DEVOPs orchestration tool such as Jenkins, Rundeck and Artifactory. Demonstrated experience working with the enterprise Github including Maven and/or Gradle. Demonstrated experience in code coverage, unit testing, mocking, and integration testing. Demonstrated experience with HAL conformant JSON and XML using compliant APIs, OpenSSL, PKI and OAuth protections. Demonstrated experience with a knowledge of Linux variants (CentOS, Redhat etc.). Demonstrated experience with Amazon Web Service. Experience with performing Identity and Access Management (IdAM) architecture definition including the integration of multiple COTS products covering entitlement management, digital policy management, authentication, and authorization. Demonstrated experience applying new technologies and supporting software to make recommendations for architectural enhancements that will improve the capabilities of existing enterprise services suite. Experience defining and developing future software designs and implementations. Experience adjudicating competing priorities, incorporating user requirements into technical project roadmaps/schedule, and providing analysis of alternatives for capability enhancements to better enable customer lead project decisions. Experience applying new technologies and supporting software to make recommendations for enhancements that will improve the capabilities of the existing enterprise IdAM services suite. Experience developing, integrating, or implementing solutions or supporting deliverables consistent with project requirements. Experience providing technical leadership and primary design authority for enterprise level capabilities. Experience with cloud based technologies and design. Experience designing and producing metrics to measure/compare efficiency and effectiveness of alternative architectures.
If you are unfamiliar with "word count," it's the "hello world" of MapReduce. In this case, though, it will find us the most commonly used words in the job requirements and hence tell us the most important skills to look for in prospective resumes. Running the command
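The example jar we run below implements word count in Java, but the logic is small enough to sketch. Here is the same map-and-reduce idea in Python, which you could run locally to sanity-check the results; the function names and the token cleanup are my own choices, not what the jar does internally.

```python
# A sketch of word-count logic in the Hadoop Streaming style: a mapper
# that emits (word, 1) pairs and a reducer that sums them. The light
# punctuation stripping is an illustrative choice.
import sys
from collections import Counter

def map_words(lines):
    """Mapper: emit (word, 1) for every whitespace-separated token."""
    for line in lines:
        for token in line.strip().split():
            word = token.lower().strip(".,()/")
            if word:
                yield word, 1

def reduce_counts(pairs):
    """Reducer: sum the 1s emitted for each distinct word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

if __name__ == "__main__":
    sample = ["Demonstrated experience with Java",
              "Demonstrated experience with Linux"]
    totals = reduce_counts(map_words(sample))
    for word, count in sorted(totals.items(), key=lambda kv: -kv[1]):
        print("%s\t%d" % (word, count))
```

With Hadoop Streaming you would split the mapper and reducer into two scripts reading stdin and writing tab-separated pairs; here they are combined for a local demonstration.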
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /mapr/MapR_EMR.amazonaws.com/in/ /mapr/MapR_EMR.amazonaws.com/out/
will run a "word count" MapReduce job against our job descriptions. Here is an example of successful output:
If we look in the directory /mapr/MapR_EMR.amazonaws.com/out we will see the output of the reducers, with results from each node in the cluster. Here is a sample of the output from one node:
From this we can get a broad sense of what experience is most in demand in the current job market. For example, we see here that "enterprise" experience is highly desired.
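Skimming the raw reducer output by eye gets tedious, so a small post-processing step helps. This is a hypothetical helper under my own assumptions: the stop-word list is illustrative, and the tab-separated "word, count" lines it parses are the standard word-count output format.

```python
# Hypothetical helper: parse the tab-separated "word<TAB>count" lines the
# reducers write to the out/ directory and list the most frequent terms,
# skipping a few illustrative stop words.

def top_terms(lines, n=5, stop_words=("and", "the", "with", "of", "to", "a")):
    """Return the n highest-count (word, count) pairs, minus stop words."""
    pairs = []
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        if word and word.lower() not in stop_words:
            pairs.append((word, int(count)))
    return sorted(pairs, key=lambda p: -p[1])[:n]

if __name__ == "__main__":
    # Illustrative sample lines; in practice you would read a reducer's
    # part file from /mapr/MapR_EMR.amazonaws.com/out.
    sample = ["experience\t47", "the\t61", "enterprise\t12", "java\t3"]
    print(top_terms(sample, n=3))
```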
This has been super cool, but it only scratches the surface of what is possible with big data processing on AWS. You can write your own custom MapReduce jobs, package them as *.jar files, and submit them to the cluster. The AWS infrastructure is powerful and robust. Let's see what you can do with it.