Big data dive amazon emr processing

DATA PROCESSING
WITH AMAZON ELASTIC
MAPREDUCE,
AMAZON AWS USE CASES

Sergey Sverchkov
Project Manager
Altoros Systems
sergey.sverchkov@altoros.com
Skype: sergey.sverchkov

AMAZON EMR – SCALABLE DATA
PROCESSING SERVICE
Amazon EMR service: Amazon EC2 + Amazon S3 + Apache Hadoop
- Cost-Effective
- Automated
- Scalable
- Easy-to-use

MAPREDUCE

• Simple data-parallel programming model designed for
scalability and fault-tolerance

• Pioneered by Google
– Processes 20 petabytes of data per day

• Popularized by open-source Hadoop project
– Used at Yahoo!, Facebook, Amazon, …
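The programming model above can be illustrated with a minimal single-process sketch. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and chosen for illustration only; a real Hadoop job distributes these phases across many machines, which this toy version does not attempt.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record (a line of text) into (key, value) pairs."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["to be or not to be", "to do"]))
```

Because each record is mapped independently and each key is reduced independently, both phases can run in parallel, which is where the scalability of the model comes from.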

AMAZON EC2 SERVICE

• Elastic - Increase or decrease capacity within
minutes, not hours or days
• Completely controlled
• Flexible – multiple instance types
(CPU, memory, storage), operating systems, and
software packages.
• Reliable – 99.95% availability for each Amazon EC2
Region.
• Secure – numerous mechanisms for securing your
compute resources.
• Inexpensive – Reserved Instances and Spot Instances
• Easy to Start.

AMAZON S3 STORAGE

• Write, read, and delete objects containing from 1 byte to
5 terabytes
• Objects are stored in a bucket
• Authentication mechanisms
• Options for secure data upload/download and encryption
of data at rest
• Designed to provide 99.999999999% durability and
99.99% availability of objects over a given year
• Reduced Redundancy Storage (RRS)
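To make the durability and availability percentages concrete, here is a small arithmetic sketch. It assumes the simplest possible reading of the figures (availability as a fraction of the year, durability as an average annual per-object loss rate); the helper names are hypothetical, and the numbers are design targets rather than guarantees.

```python
from decimal import Decimal

MINUTES_PER_YEAR = Decimal(365 * 24 * 60)  # ignores leap years


def failure_fraction(percent):
    """Convert a percentage such as '99.99' into the remaining fraction."""
    return 1 - Decimal(percent) / 100


def downtime_minutes_per_year(availability_percent):
    """Minutes per year during which objects may be unavailable."""
    return failure_fraction(availability_percent) * MINUTES_PER_YEAR


def expected_objects_lost_per_year(durability_percent, stored_objects):
    """Average number of objects lost per year at the given durability."""
    return failure_fraction(durability_percent) * stored_objects


if __name__ == "__main__":
    print(downtime_minutes_per_year("99.99"))
    print(expected_objects_lost_per_year("99.999999999", 10000))
```

Under these assumptions, 99.99% availability leaves room for a little under an hour of unavailability per year, and storing 10,000 objects at 99.999999999% durability corresponds to losing on average one object every 10 million years.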

AMAZON EMR FEATURES

• Web-based interface and command-line tools for running
Hadoop jobs on Amazon EC2
• Data stored in Amazon S3
• Monitors job and shuts down machines after use
• Small extra charge on top of EC2 pricing
• Significantly reduces the complexity of the time-
consuming set-up, management and tuning of Hadoop
clusters

GETTING STARTED – SIGN UP

• Sign up for Amazon EMR / AWS at
http://aws.amazon.com
• You must also be signed up for Amazon S3 and Amazon EC2
• Locate and save AWS credentials:
– AWS Access Key ID
– AWS Secret Access Key
– EC2 Key Pair
• Optionally install on desktop:
– EMR command line client
– S3 command line

GETTING STARTED – SECURITY, TOOLS

EMR JOB FLOW - BASIC STEPS

1. Upload input data to S3
2. Create job flow by defining Map and Reduce
3. Download output data from S3

WORD COUNT – INPUT DATA

• Word count input data size in sample S3 bucket:
./s3cmd du s3://elasticmapreduce/samples/wordcount/input/
19105856 s3://elasticmapreduce/samples/wordcount/input/
• Word count input data files:
./s3cmd ls s3://elasticmapreduce/samples/wordcount/input/
2009-04-02 02:55 2392524 s3://elasticmapreduce/samples/wordcount/input/0001
2009-04-02 02:55 2396618
2009-04-02 02:55 1593915
2009-04-02 02:55 1720885
2009-04-02 02:55 2216895

EMR WORD COUNT SAMPLE
• Starting instances, bootstrapping, running job steps:

EMR WORD COUNT SAMPLE
• Start the word count sample job from EMR command line:
$ ./elastic-mapreduce --create --name "word count commandline test" \
    --stream \
    --input s3n://elasticmapreduce/samples/wordcount/input \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --reducer aggregate \
    --output s3n://test.emr.bucket/wordcount/output2
• The output contains the job flow ID:
Created job flow j-317IN1TUMRQ5B
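The command above pairs a Python mapper with Hadoop's built-in `aggregate` reducer. The sketch below shows what a streaming mapper in that style could look like; it is an illustration written for this deck, not the actual contents of wordSplitter.py. The `LongValueSum:` key prefix is how a mapper tells the aggregate reducer to sum the values for each key. The function names and the word pattern are assumptions.

```python
import io
import re
import sys

# Assumed definition of a "word": a letter followed by letters or digits.
WORD_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def map_line(line):
    """Emit one 'LongValueSum:<word><TAB>1' record per word in the line."""
    for word in WORD_PATTERN.findall(line):
        yield "LongValueSum:" + word.lower() + "\t1"


def main(stream):
    """Read lines from the stream and write mapper records to stdout."""
    for line in stream:
        for record in map_line(line):
            print(record)


if __name__ == "__main__":
    # Demo on an in-memory sample; a real streaming job would call
    # main(sys.stdin) instead, since Hadoop feeds input on standard input.
    main(io.StringIO("Hello EMR hello\n"))
```

Hadoop streaming only requires that the mapper read lines from standard input and write tab-separated key/value lines to standard output, so the same pattern works in any language.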

WORD COUNT – OUTPUT DATA
• Locate and download output data in the specified output S3 bucket:

REAL-WORLD EXAMPLE - GENOTYPING

• Crossbow is a scalable, portable, and automatic Cloud
Computing tool for finding SNPs from short read data.
• Crossbow is designed to be easy to run (a) in "the cloud"
on Amazon's Elastic MapReduce service, (b) on any
Hadoop cluster, or (c) on any single computer, without
Hadoop.
• Open source, available to anyone:
http://bowtie-bio.sourceforge.net/crossbow/

SINGLE-NUCLEOTIDE POLYMORPHISM

• A single-nucleotide polymorphism (SNP, pronounced snip) is a
DNA sequence variation occurring when a single nucleotide – A, T, C
or G – in the genome (or other shared sequence) differs between
members of a biological species or paired chromosomes in an
individual.
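As a toy illustration of the definition, the sketch below compares two already-aligned sequences of equal length and reports the positions where a single nucleotide differs. The function name and the sample sequences are made up for this example; real SNP callers such as Crossbow first align millions of short reads and then apply statistical models, which this sketch does not do.

```python
def find_snps(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Both sequences must already be aligned and
    have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (index + 1, ref_base, sample_base)
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


if __name__ == "__main__":
    # Hypothetical sequences that differ at one position.
    print(find_snps("GATTACA", "GACTACA"))
```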

SNP ANALYSIS IN AMAZON EMR
• Crossbow web interface
http://bowtie-bio.sourceforge.net/crossbow/ui.html

SNP ANALYSIS – DATA IN AMAZON S3

• Data for SNP analysis is uploaded to an Amazon S3 bucket
• Output of the analysis is placed in S3

SNP ANALYSIS – INPUT / OUTPUT DATA
• Input data – single file ~ 1.4GB
@E201_120801:4:1:1208:14983#ACAGTG/1
GAAGGAATAATGAGACCTNACGTTTCTGNNCNNNNNNNNNNNNNNNNNNN
+E201_120801:4:1:1208:14983#ACAGTG/1
gggfggdgfgdgg_e^^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@E201_120801:4:1:1208:6966#ACAGTG/1
GCTGGGATTACAGACACANGCCACCACANNTNNNNNNNNNNNNNNNNNNN
+E201_120801:4:1:1208:6966#ACAGTG/1
• Output data – multiple files
chr1 841900 G A 3 A 68 2 2 G 0 0 0 2 0 1.00000 1.00000 1
chr1 922615 T G 2 G 38 3 3 T 67 1 1 4 0 1.00000 1.00000 0
chr1 1011278 A G 12 G 69 1 1 A 0 0 0 1 0 1.00000 1.00000 1
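The input sample above is in FASTQ format, where each read occupies four lines: an `@` header, the base sequence, a `+` separator line, and a quality string. The sketch below is a minimal reader for that layout; the function name is hypothetical, and it assumes unwrapped four-line records, skipping an incomplete trailing record.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (read_id, sequence, quality) for each four-line FASTQ record."""
    iterator = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(iterator, 4)]
        if len(chunk) < 4:
            return  # end of input, or an incomplete trailing record
        header, sequence, separator, quality = chunk
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], sequence, quality


if __name__ == "__main__":
    # First record from the input sample shown above.
    sample = [
        "@E201_120801:4:1:1208:14983#ACAGTG/1",
        "GAAGGAATAATGAGACCTNACGTTTCTGNNCNNNNNNNNNNNNNNNNNNN",
        "+E201_120801:4:1:1208:14983#ACAGTG/1",
        "gggfggdgfgdgg_e^^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
    ]
    for read_id, sequence, quality in read_fastq(sample):
        # 'N' marks a base the sequencer could not call.
        print(read_id, "uncalled bases:", sequence.count("N"))
```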

SNP ANALYSIS - TIME

• To process 1.4GB on 1 EMR instance – 6 hours
• To process 1.4GB on 2 EMR instances – 4 hours
• To process 1.4GB on 4 EMR instances – 2.5 hours
• Haven’t tried more instances…
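The timings above can be turned into speedup and parallel-efficiency figures with a few lines of arithmetic. The sketch below uses only the three measurements from this slide; treating instance-hours as a rough proxy for cost is an assumption, and the function name is hypothetical.

```python
# Measured wall-clock hours per number of EMR instances (from the slide).
RUNS = {1: 6.0, 2: 4.0, 4: 2.5}


def scaling_table(runs):
    """Return (instances, speedup, efficiency, instance_hours) rows."""
    baseline = runs[1]
    rows = []
    for instances, hours in sorted(runs.items()):
        speedup = baseline / hours          # how many times faster than 1 instance
        efficiency = speedup / instances    # 1.0 would be perfect linear scaling
        instance_hours = instances * hours  # rough proxy for compute cost
        rows.append((instances, speedup, efficiency, instance_hours))
    return rows


if __name__ == "__main__":
    for instances, speedup, efficiency, instance_hours in scaling_table(RUNS):
        print(f"{instances} instance(s): speedup {speedup:.2f}x, "
              f"efficiency {efficiency:.0%}, {instance_hours:.1f} instance-hours")
```

With these numbers, two instances give a 1.5x speedup (75% efficiency) and four give 2.4x (60% efficiency), so the job finishes sooner but consumes more instance-hours in total as instances are added.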

AND MORE CASES FOR AMAZON AWS

Customer 1: successful migration from dedicated hosting to
Amazon:
• 1 EC2 xlarge Linux instance (15 GB, 4 cores, 64-bit) with 4
EBS volumes 250GB in the US West (Northern California) region
• Runs 1 heavy web site with > 1K concurrent users
• Tomcat app server and Oracle SE 11.2
• Amazon Elastic IP for the web site
• Continuous Oracle backup to Amazon S3 through Oracle
Secure Backup for S3
• And it costs the customer only …wow
• < 2 days for LIVE migration over a weekend

AND MORE CASES FOR AMAZON AWS
Customer 2: successful migration from Rackspace to
Amazon:
• Rackspace hosting + service cost $..K, and the service level
was very low. Rackspace server was fixed.
• Migrated to 1 Amazon 2xlarge (34.2 GB, 4 virtual cores) EC2
Windows 2008 R2 instance. > 100 web sites for corporate
customers. 2 EBS volumes 1.5TB
• Amazon Oracle RDS as backend – fully automated Oracle
database with expandable storage.
• 200 GB of user data in RDS.
• Full LIVE migration completed in 48 hours with a DNS
name switch.
• And the budget is significantly lower!!

THANK YOU
QUESTIONS AND DISCUSSION WELCOME…

Sergey Sverchkov
sergey.sverchkov@altoros.com
skype: sergey.sverchkov
