Large Scale Data with Hadoop
Galen Riley and Josh Patterson
Presented at DevChatt 2010

Agenda
- Thinking at Scale
- Hadoop Architecture
- Distributed File System
- MapReduce Programming Model
- Examples

Data is Big
- The Data Deluge (2/25/2010): “Eighteen months ago, Li & Fung, a firm that manages supply chains for retailers, saw 100 gigabytes of information flow through its network each day. Now the amount has increased tenfold.”
- http://www.economist.com/opinion/displaystory.cfm?story_id=15579717

Data is Big
- Sensor data collection: 128 sensors, 10 bytes/sample, 30 samples per second
- 37 GB/day, increasing 10x by 2012
- http://jpatterson.floe.tv/index.php/2009/10/29/the-smartgrid-goes-open-source

Disks are Slow
- Two costs: disk seek and data transfer
- Reading files: a disk seek for every access
- Buffered reads exploit locality, but still seek for every disk page

Disks are Slow
- 10ms seek, 10MB/s transfer
- 1TB file, 100-byte records, 10KB pages: 10 billion entries, 1 billion pages
- 1GB of updates:
  - Seek for each update: 1000 days
  - Seek for each page: 100 days
  - Transfer the entire TB: 1 day

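A rough back-of-envelope check of these figures (a sketch, not from the slides; it reads "seek for each update" as roughly one seek per entry and uses the slide's one-billion-page figure):

public class DiskMath {
  public static void main(String[] args) {
    double seekSec = 0.010;             // 10 ms per seek
    double transferBytesPerSec = 10e6;  // 10 MB/s
    double fileBytes = 1e12;            // 1 TB
    double entries = 1e10;              // 10 billion 100-byte entries
    double pages = 1e9;                 // 1 billion pages (the slide's figure)
    double secPerDay = 86400.0;

    System.out.printf("Seek per entry: %.0f days%n", entries * seekSec / secPerDay);  // ~1,157 days
    System.out.printf("Seek per page:  %.0f days%n", pages * seekSec / secPerDay);    // ~116 days
    System.out.printf("Full transfer:  %.1f days%n",
        fileBytes / transferBytesPerSec / secPerDay);                                 // ~1.2 days
  }
}
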
Disks are Slow
- IDE drive: 75 MB/s transfer, 10ms seek
- SATA drive: 300 MB/s transfer, 8.5ms seek
- SSD: 800 MB/s transfer, 2ms “seek” (1TB = $4k!)

// Sidetrack
- Observation: transfer speed improves at a greater rate than seek speed
- Improvement by treating disks like tapes: seek as little as possible in favor of sequential reads
- Operate at transfer speed
- http://weblogs.java.net/blog/2008/03/18/disks-have-become-tapes

An Idea: Parallelism
- 1 drive at 75 MB/s: 16 days for 100TB
- 1000 drives at 75 GB/s: 22 minutes for 100TB

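The same arithmetic for the 100TB example, sketched out (assumes purely sequential streaming at the quoted transfer rates):

public class ParallelRead {
  public static void main(String[] args) {
    double bytes = 100e12;                          // 100 TB
    double oneDriveSec = bytes / 75e6;              // one drive at 75 MB/s: ~1.33e6 s
    double thousandDrivesSec = oneDriveSec / 1000;  // 1000 drives in parallel: ~1,333 s
    System.out.printf("1 drive: %.1f days, 1000 drives: %.0f minutes%n",
        oneDriveSec / 86400, thousandDrivesSec / 60);  // ~15.4 days vs. ~22 minutes
  }
}
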
A Problem: Parallelism is HardIssuesSynchronizationDeadlockLimited bandwidthTiming issuesApples v. Oranges, but… MPIData distribution, communication between nodes done manually by the programmerConsiderable effort achieving parallelism compared to actual processing
A Problem: Reliability
- Computers are complicated: hard drives, power supplies, overheating

A Problem: Reliability
- 1 machine: 3 years mean time between failures
- 1000 machines: 1 day mean time between failures (roughly 1,000 days of MTBF spread across 1,000 machines)

Requirements
- Backup
- Reliability: partial failure with graceful decline rather than a full halt
- Data recoverability: if a node fails, another picks up its workload
- Node recoverability: a fixed node can rejoin the group without a full group restart
- Scalability: adding resources adds load capacity
- Easy to use

Hadoop: Robust, Cheap, Reliable
- Apache project, open source
- Designed for commodity hardware
- Can lose whole nodes and not lose data
- Includes the MapReduce programming model

Why Commodity Hardware?
- Single large computer systems are expensive and proprietary
- High initial costs, plus vendor lock-in
- Existing methods do not work at petabyte scale
- Solution: scale “out” instead of “up”

Hadoop Distributed File System
- Throughput good, latency bad
- Data coherency: write-once, read-many access model
- Files are broken up into blocks: typically 64MB or 128MB, each replicated on multiple DataNodes on write
- Intelligent client: can find the location of blocks and access data directly from the DataNode

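The slides don't include client code, but a minimal sketch of reading an HDFS file through the Java API looks roughly like this (the path is supplied on the command line and is illustrative; the client library asks the NameNode for block locations and then streams the bytes directly from the DataNodes holding the replicas):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path(args[0]));   // an HDFS path passed on the command line
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);    // stream the file contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
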
HDFS architecture (diagram). Source: http://wiki.apache.org/hadoop/HadoopPresentations?action=AttachFile&do=get&target=hdfs_dhruba.pdf

HDFS: Performance
- Robust in the face of multiple machine failures through aggressive replication of data blocks
- High performance: checksum of 100 TB in 10 minutes (~166 GB/s)
- Built to house petabytes of data

MapReduce
- Simple programming model that abstracts parallel programming complications away from the data processing logic
- Made popular at Google, where it drives their processing systems on thousands of computers in various clusters
- Hadoop provides an open source version of MapReduce

MapReduce Data Flow (diagram)

Using MapReduce
- MapReduce is a programming model for efficient distributed computing
- It works like a Unix pipeline:
  cat input | grep | sort | uniq -c | cat > output
  Input | Map | Shuffle & Sort | Reduce | Output
- Efficiency from streaming through data (reducing seeks) and pipelining
- A good fit for a lot of applications: log processing, web index building

Hadoop in the Field
- Yahoo
- Facebook
- Twitter
- Commercial support available from Cloudera

Hadoop in Your Backyard
- openPDC project at TVA: http://openpdc.codeplex.com
- Cluster is currently 20 nodes with 200TB of physical drive space
- Used for cheap, redundant storage and time series data mining

Examples – Word Count
- The “Hello, World!” of MapReduce
- Map
  - Input:
    foo
    foo bar
  - Output all words in the dataset as { key, value } pairs: {“foo”, 1}, {“foo”, 1}, {“bar”, 1}
- Reduce
  - Input: {“foo”, (1, 1)}, {“bar”, (1)}
  - Output: {“foo”, 2}, {“bar”, 1}

Word Count: Mapper

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

Word Count: Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

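Not shown on the slides: a sketch of the driver that would wire the mapper and reducer above into a job, using the same old-style org.apache.hadoop.mapred API. It assumes MapClass and Reduce are nested inside this WordCount class, and the input/output paths come from the command line.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  // MapClass and Reduce from the previous slides would be defined here.

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);          // types emitted by map and reduce
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);         // summing is associative, so reduce doubles as a combiner
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory; must not already exist

    JobClient.runJob(conf);
  }
}
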
Examples – Stock AnalysisInput dataset: Symbol,Date,Open,High,Low,CloseGOOG,2010-03-19,555.23,568.00,557.28,560.00YHOO,2010-03-19,16.62,16.81,16.34,16.44GOOG,2010-03-18,564.72,568.44,562.96,566.40YHOO,2010-03-18,16.46,16.57,16.32,16.56Interested in biggest delta for each stock
Examples – Stock Analysis
- Map output: {“GOOG”, 10.72}, {“YHOO”, 0.47}, {“GOOG”, 5.48}, {“YHOO”, 0.25}
- Reduce input: {“GOOG”, (10.72, 5.48)}, {“YHOO”, (0.47, 0.25)}
- Reduce output: {“GOOG”, 10.72}, {“YHOO”, 0.47}

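Not from the slides: a sketch of what the map and reduce functions for this job might look like, assuming the delta is the day's high minus its low and using the same old-style API as the word count example (class names like DeltaMapper and MaxDeltaReducer are illustrative).

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class StockDelta {

  public static class DeltaMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, DoubleWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, DoubleWritable> output,
                    Reporter reporter) throws IOException {
      // Each line: Symbol,Date,Open,High,Low,Close
      String[] fields = value.toString().split(",");
      if (fields.length != 6 || fields[0].equals("Symbol")) {
        return;  // skip a header row or malformed line
      }
      double high = Double.parseDouble(fields[3]);
      double low = Double.parseDouble(fields[4]);
      output.collect(new Text(fields[0]), new DoubleWritable(high - low));
    }
  }

  public static class MaxDeltaReducer extends MapReduceBase
      implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    public void reduce(Text key, Iterator<DoubleWritable> values,
                       OutputCollector<Text, DoubleWritable> output,
                       Reporter reporter) throws IOException {
      double max = Double.NEGATIVE_INFINITY;
      while (values.hasNext()) {
        max = Math.max(max, values.next().get());  // keep the biggest delta seen for this symbol
      }
      output.collect(key, new DoubleWritable(max));
    }
  }
}
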
Examples – Time Series AnalysisMap:{pointId, Timestamp + 30s of data}Reduce:Data mining!Classify samples based on training datasetOutput samples that fall into interesting categories, index in database
Other Stuff
- Compatibility with Amazon Elastic Compute Cloud (EC2)
- Hadoop Streaming: MapReduce with anything that uses stdin/stdout
- HBase, distributed column-store database
- Pig, data analysis (transforms, filters, etc.)
- Hive, data warehousing infrastructure
- Mahout, machine learning algorithms

Parting Thoughts
“We don't have better algorithms than anyone else. We just have more data.”
Peter Norvig, co-author of Artificial Intelligence: A Modern Approach, chief scientist at Google

Contact
Galen Riley
http://galenriley.com
@TotallyGreat

Josh Patterson
http://jpatterson.floe.tv
@jpatanooga


Editor's Notes

  • #3 So everyone knows what data processing is, but what do we mean by “scale”?
  • #4 Simply: Data. Is. Big. … So this is the trend. The amount of data we can collect is increasing exponentially, and most companies aren’t capable of handling it. Patterson likes to call this “the data tsunami.” Let’s talk about a real example of this…
  • #5 Okay, so data is big. No big deal, I’ve got a processor with four cores that will chew through anything. However, the speed of my application is constrained by the speed at which I can get data. It is not going to fit in memory, so it’s going to be living on a hard disk. This brings us to problem number 2.
  • #6 Hard drive speed comes from two numbers. Disk seek time: the time to move the read head on a drive to where the data is stored. Data transfer: the speed that I can get information off the disk. Hard drives are wonderful because they are random access devices, and I can get data anywhere off the disk any time I want it just by seeking and reading. Seeking takes a while, though. Fortunately, I can take advantage of locality and read a page of data into a buffer. Let’s look at an example…
  • #7 This example has nice round numbers. I have a fictional hard drive that has a 10ms seek time and a 10 MB/second transfer speed. On it, there’s 1TB of data, made up of 100-byte records in 10K pages. That’s 10 billion entries over a billion pages, and I want to apply a gigabyte of updates to this data set. … So seeks are slow, but I can transfer the whole file in a single day. Again, my application is always bound by the slowest piece in the pipe, so I get the most benefit by speeding that part up – the disk.
  • #8 Here are some real drives. I grabbed the specs that are advertised on Newegg. … Solid state drives are expensive though, $4k per terabyte. I can’t afford to buy a new one every month for my sensor collection.
  • #9 … With this observation in mind, let’s consider treating a hard disk (a random access device) like tape (a sequential device). … So we get closer to 1 day instead of a thousand.
  • #10 I bet a lot of you know where this is going already – parallelism! So let’s say I’ve got one of those IDE drives from earlier. I can sequentially process 100TB of data in 16 days. Alright, let’s get a thousand of them and run them in parallel – 75 GB per second, and I’m done in 22 minutes. Alright, parallel processing solves our problem, let’s call it a day!
  • #11 There’s an issue, though. Parallelism is really, really hard. … And that’s not the only problem, either.
  • #12 Even if you have an OS that you know won’t crash, and code that won’t kill it either, reliability is still an issue because hardware fails.
  • #13 So let’s buy some expensive fault-tolerant hardware…
  • #16 A system that is robust in the face of machine failure. A platform that allows multiple groups to collaborate. A solution that scales linearly with respect to cost. A vision that will not lock us into a single vendor over time.
  • #25 Alright, now we’re going to do some examples. I find it is most useful to look at what happens to the data instead of what the code looks like. Let’s review MapReduce, but think about the data. So I have a big file that I want to process. It is split up in blocks and spread over my cluster. I start a job and Hadoop initiates a bunch of map tasks – this processing occurs where the data already exists. The mapper reads in a part of the file and emits several key/value pairs. These are collected, sorted into buckets based on key, and each bucket goes to a reduce task. Each reducer processes a bucket and outputs the result. Of course, I can chain these steps together. Word count is the ‘Hello World’ of MapReduce. I’m interested in the frequency of words in a dataset. … Word count isn’t entirely silly, by the way. Consider the suggestions that pop up when you start a Google search. What you see is a list of search strings that people use frequently. Think of it as phrase count instead of word count.
  • #26 I’ve got some code here, but I’m going to skip going over it in detail. The slides will be available if you want to pore over it. We talked about MapReduce being accessible for a programmer when compared to an MPI approach, and this is the entire map class for word count.
  • #28 Here’s another example to illustrate that my map process can do more than just read data in and push it back out. Here’s a file with information about stock prices – the ticker symbol, a date, the open price, the high and low prices for the day, and what it closed at. Since we’re talking about big data sets here, I want you to imagine that it’s got every stock for the last 50 years and there’s not enough room on my slide to include it all. I’m interested in volatility or something, so I want the biggest change in price for a particular stock. Let’s look at the data.
  • #29 My mapper reads in a record, filters out the information I’m not interested in (date and open/close prices), and emits the delta for each day.
  • #31 I think that collecting data without doing anything interesting with it is a big sin. So, here’s a business case for someone in the room, perhaps. Say you want to grep through some server logs that you’ve been collecting forever but never got around to doing anything with. Amazon EC2 supports Hadoop, so you can run your job without having to buy any hardware at all. And a list of stuff that is built on top of Hadoop. … You don’t have to write your jobs in Java. I know that I love Python, and I bet you do too. … We’ll be contributing some of our time series stuff to the Mahout project.
  • #32 So let’s conclude with a quote from Peter Norvig that I think justifies our entire presentation.