Large Scale Data with Hadoop
Galen Riley and Josh Patterson
Presented at DevChatt 2010

Agenda
- Thinking at Scale
- Hadoop Architecture
- Distributed File System
- MapReduce Programming Model
- Examples

Data is Big
- The Data Deluge (2/25/2010): “Eighteen months ago, Li & Fung, a firm that manages supply chains for retailers, saw 100 gigabytes of information flow through its network each day. Now the amount has increased tenfold.”
- http://www.economist.com/opinion/displaystory.cfm?story_id=15579717

Data is Big
- Sensor data collection: 128 sensors, 10 bytes/sample, 30 samples per second
- 37 GB/day, increasing 10x by 2012
- http://jpatterson.floe.tv/index.php/2009/10/29/the-smartgrid-goes-open-source

Disks are Slow
- Two costs: disk seek and data transfer
- Reading files: a disk seek for every access
- Buffered reads exploit locality, but still seek for every disk page

Disks are Slow
- 10ms seek, 10MB/s transfer
- 1TB file, 100-byte records, 10KB pages: 10 billion entries, 1 billion pages
- 1GB of updates:
  - Seek for each update: 1000 days
  - Seek for each page: 100 days
  - Transfer the entire TB: 1 day

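A rough back-of-envelope check of these figures (a sketch, not from the slides; it reads "seek for each update" as roughly one seek per entry and uses the slide's one-billion-page figure):

public class DiskMath {
  public static void main(String[] args) {
    double seekSec = 0.010;             // 10 ms per seek
    double transferBytesPerSec = 10e6;  // 10 MB/s
    double fileBytes = 1e12;            // 1 TB
    double entries = 1e10;              // 10 billion 100-byte entries
    double pages = 1e9;                 // 1 billion pages (the slide's figure)
    double secPerDay = 86400.0;

    System.out.printf("Seek per entry: %.0f days%n", entries * seekSec / secPerDay);  // ~1,157 days
    System.out.printf("Seek per page:  %.0f days%n", pages * seekSec / secPerDay);    // ~116 days
    System.out.printf("Full transfer:  %.1f days%n",
        fileBytes / transferBytesPerSec / secPerDay);                                 // ~1.2 days
  }
}
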
Disks are Slow
- IDE drive: 75 MB/s transfer, 10ms seek
- SATA drive: 300 MB/s transfer, 8.5ms seek
- SSD: 800 MB/s transfer, 2ms “seek” (1TB = $4k!)

// Sidetrack
- Observation: transfer speed improves at a greater rate than seek speed
- Improvement by treating disks like tapes: seek as little as possible in favor of sequential reads
- Operate at transfer speed
- http://weblogs.java.net/blog/2008/03/18/disks-have-become-tapes

An Idea: Parallelism
- 1 drive at 75 MB/s: 16 days for 100TB
- 1000 drives at 75 GB/s: 22 minutes for 100TB

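The same arithmetic for the 100TB example, sketched out (assumes purely sequential streaming at the quoted transfer rates):

public class ParallelRead {
  public static void main(String[] args) {
    double bytes = 100e12;                          // 100 TB
    double oneDriveSec = bytes / 75e6;              // one drive at 75 MB/s: ~1.33e6 s
    double thousandDrivesSec = oneDriveSec / 1000;  // 1000 drives in parallel: ~1,333 s
    System.out.printf("1 drive: %.1f days, 1000 drives: %.0f minutes%n",
        oneDriveSec / 86400, thousandDrivesSec / 60);  // ~15.4 days vs. ~22 minutes
  }
}
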
A Problem: Parallelism is HardIssuesSynchronizationDeadlockLimited bandwidthTiming issuesApples v. Oranges, but… MPIData distribution, communication between nodes done manually by the programmerConsiderable effort achieving parallelism compared to actual processing
A Problem: Reliability
- Computers are complicated: hard drives, power supplies, overheating

A Problem: Reliability
- 1 machine: 3 years mean time between failures
- 1000 machines: 1 day mean time between failures (roughly 1,000 days of MTBF spread across 1,000 machines)

Requirements
- Backup
- Reliability: partial failure with graceful decline rather than a full halt
- Data recoverability: if a node fails, another picks up its workload
- Node recoverability: a fixed node can rejoin the group without a full group restart
- Scalability: adding resources adds load capacity
- Easy to use

Hadoop: Robust, Cheap, Reliable
- Apache project, open source
- Designed for commodity hardware
- Can lose whole nodes and not lose data
- Includes the MapReduce programming model

Why Commodity Hardware?
- Single large computer systems are expensive and proprietary
- High initial costs, plus vendor lock-in
- Existing methods do not work at petabyte scale
- Solution: scale “out” instead of “up”

Hadoop Distributed File System
- Throughput good, latency bad
- Data coherency: write-once, read-many access model
- Files are broken up into blocks: typically 64MB or 128MB, each replicated on multiple DataNodes on write
- Intelligent client: can find the location of blocks and access data directly from the DataNode

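The slides don't include client code, but a minimal sketch of reading an HDFS file through the Java API looks roughly like this (the path is supplied on the command line and is illustrative; the client library asks the NameNode for block locations and then streams the bytes directly from the DataNodes holding the replicas):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path(args[0]));   // an HDFS path passed on the command line
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);    // stream the file contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
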
HDFS architecture (diagram). Source: http://wiki.apache.org/hadoop/HadoopPresentations?action=AttachFile&do=get&target=hdfs_dhruba.pdf

HDFS: Performance
- Robust in the face of multiple machine failures through aggressive replication of data blocks
- High performance: checksum of 100 TB in 10 minutes (~166 GB/s)
- Built to house petabytes of data

MapReduce
- Simple programming model that abstracts parallel programming complications away from the data processing logic
- Made popular at Google, where it drives their processing systems on thousands of computers in various clusters
- Hadoop provides an open source version of MapReduce

MapReduce Data Flow (diagram)

Using MapReduce
- MapReduce is a programming model for efficient distributed computing
- It works like a Unix pipeline:
  cat input | grep | sort | uniq -c | cat > output
  Input | Map | Shuffle & Sort | Reduce | Output
- Efficiency from streaming through data (reducing seeks) and pipelining
- A good fit for a lot of applications: log processing, web index building

Hadoop in the Field
- Yahoo
- Facebook
- Twitter
- Commercial support available from Cloudera

Hadoop in Your Backyard
- openPDC project at TVA: http://openpdc.codeplex.com
- Cluster is currently 20 nodes with 200TB of physical drive space
- Used for cheap, redundant storage and time series data mining

Examples – Word Count
- The “Hello, World!” of MapReduce
- Map
  - Input:
    foo
    foo bar
  - Output all words in the dataset as { key, value } pairs: {“foo”, 1}, {“foo”, 1}, {“bar”, 1}
- Reduce
  - Input: {“foo”, (1, 1)}, {“bar”, (1)}
  - Output: {“foo”, 2}, {“bar”, 1}

Word Count: Mapper

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

Word Count: Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

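Not shown on the slides: a sketch of the driver that would wire the mapper and reducer above into a job, using the same old-style org.apache.hadoop.mapred API. It assumes MapClass and Reduce are nested inside this WordCount class, and the input/output paths come from the command line.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  // MapClass and Reduce from the previous slides would be defined here.

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);          // types emitted by map and reduce
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);         // summing is associative, so reduce doubles as a combiner
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory; must not already exist

    JobClient.runJob(conf);
  }
}
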
Examples – Stock AnalysisInput dataset: Symbol,Date,Open,High,Low,CloseGOOG,2010-03-19,555.23,568.00,557.28,560.00YHOO,2010-03-19,16.62,16.81,16.34,16.44GOOG,2010-03-18,564.72,568.44,562.96,566.40YHOO,2010-03-18,16.46,16.57,16.32,16.56Interested in biggest delta for each stock
Examples – Stock Analysis
- Map output: {“GOOG”, 10.72}, {“YHOO”, 0.47}, {“GOOG”, 5.48}, {“YHOO”, 0.25}
- Reduce input: {“GOOG”, (10.72, 5.48)}, {“YHOO”, (0.47, 0.25)}
- Reduce output: {“GOOG”, 10.72}, {“YHOO”, 0.47}

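Not from the slides: a sketch of what the map and reduce functions for this job might look like, assuming the delta is the day's high minus its low and using the same old-style API as the word count example (class names like DeltaMapper and MaxDeltaReducer are illustrative).

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class StockDelta {

  public static class DeltaMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, DoubleWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, DoubleWritable> output,
                    Reporter reporter) throws IOException {
      // Each line: Symbol,Date,Open,High,Low,Close
      String[] fields = value.toString().split(",");
      if (fields.length != 6 || fields[0].equals("Symbol")) {
        return;  // skip a header row or malformed line
      }
      double high = Double.parseDouble(fields[3]);
      double low = Double.parseDouble(fields[4]);
      output.collect(new Text(fields[0]), new DoubleWritable(high - low));
    }
  }

  public static class MaxDeltaReducer extends MapReduceBase
      implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    public void reduce(Text key, Iterator<DoubleWritable> values,
                       OutputCollector<Text, DoubleWritable> output,
                       Reporter reporter) throws IOException {
      double max = Double.NEGATIVE_INFINITY;
      while (values.hasNext()) {
        max = Math.max(max, values.next().get());  // keep the biggest delta seen for this symbol
      }
      output.collect(key, new DoubleWritable(max));
    }
  }
}
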
Examples – Time Series AnalysisMap:{pointId, Timestamp + 30s of data}Reduce:Data mining!Classify samples based on training datasetOutput samples that fall into interesting categories, index in database
Other Stuff
- Compatibility with Amazon Elastic Compute Cloud (EC2)
- Hadoop Streaming: MapReduce with anything that uses stdin/stdout
- HBase, distributed column-store database
- Pig, data analysis (transforms, filters, etc.)
- Hive, data warehousing infrastructure
- Mahout, machine learning algorithms

Parting Thoughts
“We don't have better algorithms than anyone else. We just have more data.”
Peter Norvig, co-author of Artificial Intelligence: A Modern Approach, chief scientist at Google

Contact
Galen Riley
http://galenriley.com
@TotallyGreat

Josh Patterson
http://jpatterson.floe.tv
@jpatanooga


Editor's Notes

  • #3 So everyone knows what data processing is, but what do we mean by “scale”?
  • #4 Simply: Data. Is. Big. … So this is the trend. The amount of data we can collect is increasing exponentially, and most companies aren’t capable of handling it. Patterson likes to call this “the data tsunami.” Let’s talk about a real example of this…
  • #5 Okay, so data is big. No big deal, I’ve got a processor with four cores that will chew through anything. However, the speed of my application is constrained by the speed at which I can get data. It is not going to fit in memory, so it’s going to be living on a hard disk. This brings us to problem number 2.
  • #6 Hard drive speed comes from two numbers. Disk seek time: the time to move the read head on a drive to where the data is stored. Data transfer: the speed that I can get information off the disk. Hard drives are wonderful because they are random access devices, and I can get data anywhere off the disk any time I want it just by seeking and reading. Seeking takes a while, though. Fortunately, I can take advantage of locality and read a page of data into a buffer. Let’s look at an example…
  • #7 This example has nice round numbers. I have a fictional hard drive that has a 10ms seek time and a 10 MB/second transfer speed. On it, there’s 1TB of data, made up of 100-byte records in 10K pages. That’s 10 billion entries over a billion pages, and I want to apply a gigabyte of updates to this data set. … So seeks are slow, but I can transfer the whole file in a single day. Again, my application is always bound by the slowest piece in the pipe, so I get the most benefit by speeding that part up – the disk.
  • #8 Here are some real drives. I grabbed the specs that are advertised on Newegg. … Solid state drives are expensive though, $4k per terabyte. I can’t afford to buy a new one every month for my sensor collection.
  • #9 … With this observation in mind, let’s consider treating a hard disk (a random access device) like tape (a sequential device). … So we get closer to 1 day instead of a thousand.
  • #10 I bet a lot of you know where this is going already – parallelism! So let’s say I’ve got one of those IDE drives from earlier. I can sequentially process 100TB of data in 16 days. Alright, let’s get a thousand of them and run them in parallel – 75 GB per second, and I’m done in 22 minutes. Alright, parallel processing solves our problem, let’s call it a day!
  • #11 There’s an issue, though. Parallelism is really, really hard. … And that’s not the only problem, either.
  • #12 Even if you have an OS that you know won’t crash, and code that won’t kill it either, reliability is still an issue because hardware fails.
  • #13 So let’s buy some expensive fault-tolerant hardware…
  • #16 A system that is robust in the face of machine failure. A platform that allows multiple groups to collaborate. A solution that scales linearly with respect to cost. A vision that will not lock us into a single vendor over time.
  • #25 Alright, now we’re going to do some examples. I find it is most useful to look at what happens to the data instead of what the code looks like. Let’s review MapReduce, but think about the data. So I have a big file that I want to process. It is split up in blocks and spread over my cluster. I start a job and Hadoop initiates a bunch of map tasks – this processing occurs where the data already exists. The mapper reads in a part of the file and emits several key/value pairs. These are collected, sorted into buckets based on key, and each bucket goes to a reduce task. Each reducer processes a bucket and outputs the result. Of course, I can chain these steps together. Word count is the ‘Hello World’ of MapReduce. I’m interested in the frequency of words in a dataset. … Word count isn’t entirely silly, by the way. Consider the suggestions that pop up when you start a Google search. What you see is a list of search strings that people use frequently. Think of it as phrase count instead of word count.
  • #26 I’ve got some code here, but I’m going to skip going over it in detail. The slides will be available if you want to pore over it. We talked about MapReduce being accessible for a programmer when compared to an MPI approach, and this is the entire map class for word count.
  • #28 Here’s another example to illustrate that my map process can do more than just read data in and push it back out. Here’s a file with information about stock prices – the ticker symbol, a date, the open price, the high and low prices for the day, and what it closed at. Since we’re talking about big data sets here, I want you to imagine that it’s got every stock for the last 50 years and there’s not enough room on my slide to include it all. I’m interested in volatility or something, so I want the biggest change in price for a particular stock. Let’s look at the data.
  • #29 My mapper reads in a record, filters out the information I’m not interested in (date and open/close prices), and emits the delta for each day.
  • #31 I think that collecting data without doing anything interesting with it is a big sin. So, here’s a business case for someone in the room, perhaps. Say you want to grep through some server logs that you’ve been collecting forever but never got around to doing anything with. Amazon EC2 supports Hadoop, so you can run your job without having to buy any hardware at all. And a list of stuff that is built on top of Hadoop. … You don’t have to write your jobs in Java. I know that I love Python, and I bet you do too. … We’ll be contributing some of our time series stuff to the Mahout project.
  • #32 So let’s conclude with a quote from Peter Norvig that I think justifies our entire presentation.