OSC2012: Big Data Using Open Source: NetApp Project - Technical
Open source Big Data case study: Building a
platform for remote device support at NetApp
(Part II – Technical)
Topics



                                                     Big Data Perspective

                                                     Case Study: NetApp AutoSupport

                                                     Technology Primer

                                                     Design Overview




Copyright © 2012 Accenture All rights reserved.                                        2
Big Data

         The concept is disruptive. The technology is disruptive. And, markets and
         clients are being impacted.




         1. Wordle for Credit Suisse, Does Size Matter Only?, September 2011


Shifts in Data and Analytics
The changing landscape and required winning strategies are creating shifts
within Big Data collection and analytics

Data Explosion
• Unstructured data is doubling every 3 months
• 2011 saw 47% growth overall
• By 2015, number of networked devices will be 2x global population

Monetization
• Growth of enterprise data monetization services
• Large retailers monetizing own data to provide insights to suppliers

Data-led Innovation
• De-coupling data from applications
• Disparate external data shaping context
• Cost effective mobilization of massive scale data

Social Media
• Growing market for scrubbed, aggregate data from social media and blogs
• Greater focus on data that provides insight in a customer’s digital persona

Technology
• Commodity priced storage and compute
• Emergence of open source and big data technologies solving production problems at scale

Data Mobilization
• Novel approaches to analyze unstructured data creating shorter time from data to insight
• Shift towards data consumption in multiple environments (business apps, mobile, social)


The Big Data Approach

                                                        Treat data as a strategic asset, seek to
                                                        maximize its value to the organization


                                                        Invest in common services, data platforms
                                                        and tools


                                                        Rapidly prototype, deliver, and measure
                                                        value-added data services, evolve over time


Culture
• Data-driven decision making
• Experimentation and continuous improvement with academic rigor
• End-to-end ownership of services
• Sharing of platform, tools and code
Topics



                                                     Big Data Perspective

                                                     Case Study: NetApp AutoSupport

                                                     Technology Primer

                                                     Design Overview




Client Context

                      NetApp, Inc.
                      • Industry: Data storage, data management
                      • 77% of Fortune 500 companies are customers
                      • Creator of Data ONTAP: industry leading storage OS




AutoSupport

                                                                •   Secure automated “call-home” service
                                                                •   Catch issues before they become critical
                                                                •   System monitoring and alerting
                                                                •   RMA requests without customer action
                                                                •   Faster incident management


[Diagram: Storage Devices → AutoSupport Messages → AutoSupport Data Warehouse]




Business Challenges
   • Increase in response times / lower availability for services
   • Incoming data volume doubling every 16 months
   • Proliferation of ad hoc datamarts and point solutions
   • Unable to analyze full AutoSupport contents efficiently

[Diagram: existing AutoSupport architecture. Layer labels, top to bottom: Presentation, Interface, Integrate, Transform, Stage, Extract, Source. Presentation tools: SAP CRM, MyASUP, eBI, STOR, ASUP Tools, Analytics & Mining. Interface and integration components: CRM Module, Rules Module, Java Interface, Jasper, Stored Proc, Rest Interface, various others, each with its own Rules; PMBTA, eBI, BI. Transform: Custom ETL (repeated per stack), DSS. Stage: Xterra DB, PWillows, DW 3, ODS, DW 2, Adhoc DB’s. Extract: Xterra Parser, Light Parser, Parser, Loader, Core Parser, Adhoc Parsers. Source: ASUP Messages, Xterra File, File Storage, SAP CRM STAGE, PNOW, GEO, DRM, HDD, DM.]

[Chart: AutoSupport Flat-File Storage Requirement. Series: Total Usage (TB), Projected Total Usage (TB). Y axis: 0 to 3500. X axis: Jan-05 to Jan-16. Annotation: "Doubles".]
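
The "doubling every 16 months" figure above implies exponential growth, which can be sketched as a short calculation. This is only an illustration of the arithmetic: the 100 TB starting volume is a made-up value, not a number from the slides, and the function names are hypothetical.

```python
import math


def projected_volume(initial_tb: float, months: float,
                     doubling_months: float = 16.0) -> float:
    """Volume after `months`, assuming it doubles every `doubling_months`."""
    return initial_tb * 2 ** (months / doubling_months)


def months_to_reach(initial_tb: float, target_tb: float,
                    doubling_months: float = 16.0) -> float:
    """Months needed to grow from `initial_tb` to `target_tb`."""
    return doubling_months * math.log2(target_tb / initial_tb)


# Hypothetical starting point of 100 TB (not taken from the slides).
after_four_years = projected_volume(100.0, 48)    # three doublings
time_to_tenfold = months_to_reach(100.0, 1000.0)  # roughly 53 months
```

At this rate the stored volume grows about eightfold every four years, which is why the flat-file storage requirement dominates the capacity planning.
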


Solution Design Goals
Improve data access and technology cost effectiveness and performance.

 •    Improve system response times
      and data availability
 •    Expose common data services for
      consumption across business units
 •    Standardize key business metrics
      into common rules repository
 •    Lower operational costs as
      ecosystem continues to scale
 •    Provide more granular analytical
      capabilities


Role of Open Source
                      Platform is composed of open source technologies purpose-built for large-scale
                      storage, processing and analysis




                      [Figure: Actual Big Data Solution Blueprint for a hybrid deployment]




Topics



                                                     Big Data Perspective

                                                     Case Study: NetApp AutoSupport

                                                     Technology Primer

                                                     Design Overview




Technology Primer – Hadoop
Hadoop Distributed Filesystem (HDFS)
• Divides files into smaller “blocks”, stored across machines
• Automated replication, fault tolerance

Hadoop MapReduce
• Parallel processing for large datasets across machines
• Breaks job into tasks, using a simple map() and reduce() paradigm for data flows
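
The block and replication idea can be illustrated with a small sketch. It is a toy model, not HDFS code: the round-robin placement is an assumption made for brevity (real HDFS placement also considers racks and node load), and the block size, node names, and function names are all hypothetical.

```python
from typing import Dict, List


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the size of each block for a file of `file_size` bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Assign every block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# A 150-byte "file" with 64-byte blocks, replicated 3 times over 4 nodes.
blocks = split_into_blocks(150, 64)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], 3)
```

Because each block lives on three different nodes in this sketch, losing any single node still leaves two copies of every block.
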




Technology Primer – MapReduce

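MapReduce (Simple Example – Word Count)

Map(key, value)
Reduce(key, List<value> values)

Input:
    One fish, two fish, red fish, blue fish.

Map Phase (four mappers, m):
    "One fish,"    →  <one,1>   <fish,1>
    "two fish,"    →  <two,1>   <fish,1>
    "red fish,"    →  <red,1>   <fish,1>
    "blue fish."   →  <blue,1>  <fish,1>

Shuffle Phase: pairs are grouped by key and sent to the reducers (r).

Reducer output (three reducers, r):
    <one,1>  <two,1>
    <red,1>  <blue,1>
    <fish,4>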
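
The word count example can be reproduced in a few lines of plain Python. This is a single-process sketch of the map, shuffle, and reduce steps, not Hadoop API code; the function names are chosen for illustration only.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(_key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Emit <word, 1> for every word in one input line."""
    for word in re.findall(r"[a-z]+", value.lower()):
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the shuffle phase does."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: List[str]) -> Dict[str, int]:
    mapped = [pair for i, line in enumerate(lines) for pair in map_fn(i, line)]
    return dict(reduce_fn(k, vs) for k, vs in shuffle(mapped).items())


counts = word_count(["One fish,", "two fish,", "red fish,", "blue fish."])
```

In a real cluster each call to `map_fn` and `reduce_fn` could run on a different machine; only the shuffle step needs to move data between them.
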
Technology Primer – NoSQL

• “Not only” SQL
   • Catch-all term for various non-relational database systems

• Typical areas of differentiation
   • Data model semantics
      • e.g. Database, Document, Key-Value
   • CAP trade-offs
      • Consistency, Availability, Partition-Tolerance
   • Scale-out architecture
      • e.g. Sharding, Distributed hash
   • Query language

Examples: HBase, Cassandra, MongoDB, Neo4j, etc.
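
As one illustration of the "distributed hash" style of scale-out, the sketch below implements a small consistent-hash ring. It is a generic teaching example, not a description of how any of the systems listed above partitions data; the class name, the virtual-node count, and the key format are assumptions.

```python
import bisect
import hashlib
from typing import List


def _hash(key: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes: List[str], vnodes: int = 64) -> None:
        # Each node owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["a", "b", "c"])
owner = ring.node_for("system-42")  # hypothetical row key
```

The useful property of this scheme is that adding a node only moves the keys that the new node takes over; every other key stays on its current node.
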
Topics



                                                     Big Data Perspective

                                                     Case Study: NetApp AutoSupport

                                                     Technology Primer

                                                     Design Overview




Data Pipeline Overview



[Diagram: data pipeline with the components Incoming Messages, Ingestion, Core Data Processing, Data Service Interface, Ad hoc analytics, and ETL]




Data Ingestion
    Technologies
    • Apache Flume, Apache Hadoop, Drools BRMS, JMS
    Capabilities
    • Handle dynamic data volumes
    • Normalization of disparate file formats
    • Real-time aggregation of documents
    • JMS alerts for critical messages

[Diagram: documents from the front-end HTTP/SMTP gateway enter a routing tier (Flume client with a rules engine), then pass through tiers of Flume agents: a parsing tier and an aggregation & sink tier. The sink tier writes aggregated files to HDFS and sends notifications over JMS.]
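
The routing, normalization, and alerting capabilities above can be sketched in plain Python. Nothing here uses the Flume, Drools, or JMS APIs: the `Document` fields, the two input formats, and the in-memory lists standing in for the JMS topic and the HDFS sink are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    system_id: str
    fmt: str       # hypothetical format tag: "kv" or "lines"
    severity: str  # hypothetical severity label
    body: str


def parse_kv(body: str) -> Dict[str, str]:
    """Parse 'a=1;b=2' into a dict."""
    items = (item.split("=", 1) for item in body.split(";") if item.strip())
    return {k.strip(): v.strip() for k, v in items}


def parse_lines(body: str) -> Dict[str, str]:
    """Parse one 'key: value' pair per line into a dict."""
    items = (line.split(":", 1) for line in body.splitlines() if line.strip())
    return {k.strip(): v.strip() for k, v in items}


@dataclass
class Router:
    parsers: Dict[str, Callable[[str], Dict[str, str]]]
    alerts: List[str] = field(default_factory=list)           # stands in for JMS
    sink: List[Dict[str, str]] = field(default_factory=list)  # stands in for HDFS

    def route(self, doc: Document) -> None:
        """Apply the alert rule, then normalize the document via its parser."""
        if doc.severity == "critical":
            self.alerts.append(doc.system_id)
        parse = self.parsers.get(doc.fmt)
        if parse is None:
            raise ValueError(f"no parser registered for format {doc.fmt!r}")
        record = parse(doc.body)
        record["system_id"] = doc.system_id
        self.sink.append(record)


router = Router(parsers={"kv": parse_kv, "lines": parse_lines})
router.route(Document("sys-1", "kv", "info", "disk=ok;fan=ok"))
router.route(Document("sys-2", "lines", "critical", "disk: failed\nfan: ok"))
```

Both input formats end up as the same record shape in the sink, and only the critical document produces an alert.
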

Core Data Processing
Technologies
• MapReduce, HBase, Solr, Avro
Capabilities
• Parallel processing for increased throughput
• Efficient storage of complex data objects in Avro

[Diagram: documents gathered from Flume are processed by a MapReduce job on HDFS. Map tasks parse the text contents; Reduce tasks transform and derive data objects, then write the derived objects to the data stores: Solr (search indexes), HBase (primary storage), and Hive (data warehouse).]
Data Services
 Technologies
 • Apache HBase, Solr, Tomcat
 Capabilities
 • Unified web services API for end
   users
 • Support for complex queries and
   searches across multiple dimensions
   with Solr
 • Access both raw and derived content
   for a given system




Analytics / ETL
 Technologies
 • Apache Hive, Pig, Datameer (Ad hoc analytics)
 • Pentaho (ETL / Data Integration)
 Capabilities
 • Analytical environment for both business analysts and “power
   users”
    • Hive or Pig as higher level query languages
    • Datameer for analytics with a spreadsheet UI
 • ETL through Pentaho MapReduce
          • (runs Pentaho ETL server inside of a MapReduce Job)



Successes and Challenges
  Successes
  • Web service interface contracts simplified integration with
    user tools, allowed for flexibility in internal implementation
  • Open source core allowed for rapid iteration
  • Met or exceeded all SLAs using commodity hardware,
    significantly driving down costs
  Challenges
  • Monitoring a large distributed system requires discipline and
    a strong operations team
  • Shared storage systems and Big Data technologies don’t
    always play well together
  • “Schemaless” systems can become a headache to
    maintain, especially with complex data models

Thank you

                                                  Jonathan Bender
                                                  Consultant, Accenture Technology Labs
                                                  jonathan.bender@accenture.com




