Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Global Sponsor:
Big Data on the Microsoft Platform
Andrew J. Brust, CEO, Blue Badge Insights
With Hadoop, MS BI and the SQL Server stack
Meet Andrew
CEO and Founder, Blue Badge Insights
Big Data blogger for ZDNet
Microsoft Regional Director, MVP
Co-chair VSLive! and 17 years as a speaker
Founder, Microsoft BI User Group of NYC
 • http://www.msbigdatanyc.com
Co-moderator, NYC .NET Developers Group
 • http://www.nycdotnetdev.com
“Redmond Review” columnist for Visual Studio Magazine
brustblog.com, Twitter: @andrewbrust
Read all about it!
My New Blog (bit.ly/bigondata)
Agenda
Big Data, Hadoop and HDInsight
MapReduce
Hive ODBC, BI Stack
Hekaton, NoSQL
SQL Server Parallel Data Warehouse, MPP, PolyBase
What is Big Data?
100s of TB into PB and higher
Involving data from: financial data, sensors, web logs, social media, etc.
Parallel processing often involved
 • Hadoop is emblematic, but other technologies are Big Data too
Processing of data sets too large for transactional databases
 • Analyzing interactions, rather than transactions
 • The three V’s: Volume, Velocity, Variety
Big Data tech sometimes imposed on small data problems
What is Hadoop?
Open source implementation of Google’s MapReduce and GFS (Google File System)
Allows for scale-out processing of petabyte-scale data
 • 1 PB = 1,024 TB
Also distributed storage
Commodity hardware
Can work against flat files, or certain database formats
Native processing involves imperative Java code
Other languages supported through “Streaming”
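To make the "Streaming" bullet concrete: a Streaming mapper or reducer is any program that reads lines and emits tab-separated key/value lines. Below is a minimal Python sketch of that contract. The log format (key in the first space-separated field) and the helper `run_locally` are assumptions made for illustration, not part of the deck.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'key<TAB>1' line per input record.

    Assumption: each record is a space-separated log line whose first
    field is the key to count (a hypothetical format).
    """
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"


def reducer(lines):
    """Sum the counts for each key.

    The reducer receives its input sorted by key, so all lines for one
    key are adjacent and groupby() can collect them.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in one process (hypothetical helper)."""
    return list(reducer(sorted(mapper(lines))))
```

On a cluster, each function would instead be wrapped in a small script that reads `sys.stdin` and prints to standard output, and the two scripts would be handed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options.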
What is HDInsight?
Microsoft’s Hadoop distribution, on Windows
 • Most other distros on Linux
Based on Hortonworks Data Platform (HDP)
Runs on Azure, eventually on Windows Server, and as a sandbox on a dev PC
For .NET devs: .NET SDK for Hadoop, LINQ provider
Demo
HDInsight
The Hadoop Stack
MapReduce, HDFS
Database
RDBMS Import/Export
Query: HiveQL and Pig Latin
Machine Learning/Data Mining
Log file integration
MapReduce, in a Diagram
[Diagram: Input splits feed parallel mapper nodes; mapper output is partitioned by key (K1, K2, K3) to reducer nodes, and each reducer writes its Output.]
A MapReduce Example
• Count by suite, on each floor
• Send per-suite, per platform totals to lobby
• Sort totals by platform
• Send two platform packets to 10th, 20th, 30th floor
• Tally up each platform
• Merge tallies into one spreadsheet
• Collect the tallies
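The building walk-through above can be sketched in a few lines of Python. This is an illustrative simulation only: the survey data, the platform names, and the round-robin assignment of platform packets to the 10th, 20th and 30th floors are all assumptions, chosen so that each reducer floor receives two platforms.

```python
from collections import Counter, defaultdict

# Hypothetical survey data: floor -> suite -> platforms seen in that suite.
BUILDING = {
    1: {"101": ["ios", "android"], "102": ["android", "windows"]},
    2: {"201": ["ios", "ios", "blackberry"], "202": ["symbian", "webos"]},
}

REDUCER_FLOORS = [10, 20, 30]


def map_floor(suites):
    """Map step: count by suite on one floor, emit (platform, count) pairs."""
    for _suite, platforms in suites.items():
        for platform, count in Counter(platforms).items():
            yield platform, count


def shuffle(pairs):
    """Lobby step: sort totals by platform, then deal the platform
    packets out to the reducer floors (round-robin over sorted keys)."""
    packets = defaultdict(list)
    for platform, count in pairs:
        packets[platform].append(count)
    assignment = defaultdict(dict)
    for i, platform in enumerate(sorted(packets)):
        floor = REDUCER_FLOORS[i % len(REDUCER_FLOORS)]
        assignment[floor][platform] = packets[platform]
    return assignment


def reduce_floor(packets):
    """Reduce step: tally up each platform handed to this floor."""
    return {platform: sum(counts) for platform, counts in packets.items()}


def run(building):
    """Run all steps, then collect the tallies and merge them into one dict."""
    pairs = [p for suites in building.values() for p in map_floor(suites)]
    tallies = [reduce_floor(pk) for pk in shuffle(pairs).values()]
    merged = {}
    for tally in tallies:
        merged.update(tally)
    return merged
```

Because every platform is sent to exactly one reducer floor, the final merge is a plain union of the per-floor tallies and needs no further arithmetic.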
MapReduce Options
Pig, Hive, Sqoop, Mahout also generate MapReduce code
Java
JavaScript (“Rhino”)
Other languages, especially Python, via Streaming
C# via Streaming
C# via .NET SDK
Amenities for Visual Studio/.NET
Hortonworks Data Platform for Windows
MRLib (NuGet Package)
LINQ to Hive
OdbcClient + Hive ODBC Driver
Deployment
WebHDFS client
MR code in C#, HadoopJob, MapperBase, ReducerBase
Demo
MapReduce
Hive
Began as Hadoop sub-project
 • Now top-level Apache project
Provides a SQL-like (“HiveQL”) abstraction over MapReduce
Has its own HDFS table file format (and it’s fully schema-bound)
Can also work over HBase
Acts as a bridge to many BI products which expect tabular data
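One way to picture "a SQL-like abstraction over MapReduce" is to put a declarative GROUP BY next to the map and reduce steps it stands for. The sketch below is an analogy, not Hive itself: Python's built-in `sqlite3` stands in for the SQL engine, and the table name `visits` and the sample rows are made up.

```python
import sqlite3
from collections import defaultdict

# Hypothetical rows: (name, platform)
ROWS = [("ann", "ios"), ("bob", "android"), ("cy", "ios"), ("di", "windows")]

# Plain SQL of the kind HiveQL also accepts.
QUERY = ("SELECT platform, COUNT(*) FROM visits "
         "GROUP BY platform ORDER BY platform")


def via_sql(rows):
    """Declarative version, run on an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    try:
        db.execute("CREATE TABLE visits (name TEXT, platform TEXT)")
        db.executemany("INSERT INTO visits VALUES (?, ?)", rows)
        return db.execute(QUERY).fetchall()
    finally:
        db.close()


def via_mapreduce(rows):
    """Imperative version of the same GROUP BY: map each row to
    (platform, 1), group by key, then sum each group."""
    groups = defaultdict(list)
    for _name, platform in rows:          # map + shuffle
        groups[platform].append(1)
    return [(key, sum(ones)) for key, ones in sorted(groups.items())]  # reduce
```

Both functions return the same list of `(platform, count)` tuples. In Hive the translation from the query to the map and reduce stages is generated for you.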
Hive ODBC Consumers
Excel 2010 or 2013 (including via add-in)
PowerPivot
SQL Server Analysis Services, Tabular Mode
SQL Server Reporting Services
ADO.NET OdbcClient provider
LINQ provider
xVelocity Technologies
Formerly known as VertiPaq
PowerPivot, SSAS Tabular, SQL Server columnstore indexes
Implements BI Semantic Model (BISM)
Uses column store technology
 • Compression
 • In-memory
 • Speed
Not a Big Data technology per se, but very useful for analysis of job output
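Why a column store compresses well and aggregates quickly can be shown with a toy example. The sketch below uses two generic techniques, dictionary encoding and run-length encoding. It is not a description of xVelocity's internals, and the sample rows and column names are hypothetical.

```python
# Hypothetical job output: (state, platform, qty)
NAMES = ("state", "platform", "qty")
ROWS = [("NY", "ios", 3), ("NY", "ios", 5), ("NY", "android", 2),
        ("CA", "android", 7)]


def to_columns(rows, names):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def dictionary_encode(column):
    """Replace each value with a small integer id into a sorted dictionary."""
    dictionary = sorted(set(column))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[value] for value in column]


def run_length_encode(ids):
    """Collapse runs of equal ids into [id, run_length] pairs."""
    runs = []
    for i in ids:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return runs


def count_by_value(dictionary, runs):
    """Aggregate directly on the compressed form, without rebuilding rows."""
    totals = {}
    for i, length in runs:
        totals[dictionary[i]] = totals.get(dictionary[i], 0) + length
    return totals
```

Values within one column tend to repeat, so the runs stay short, and a count per value only has to walk the runs rather than every row.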
Power View
Reports on BISM models (PowerPivot, SSAS Tabular)
Hosted in SharePoint 2010, 2013 Enterprise
Also Excel 2013 (but not on ARM/Windows RT)
Interactive data exploration
Demo
Hive ODBC + BI Stack
Project “Hekaton”
In-memory engine for SQL Server transactional workloads
Tables must be declared as in-memory explicitly
In-memory and standard tables can coexist in same db
Stored procs on in-mem tables are compiled to native code
Hekaton and xVelocity are separate
Hekaton ≠ PowerPivot/SSAS Tabular
Hekaton ≠ Columnstore indexes
Compare to SAP HANA
 • In-memory, transactional, analytical, column store
NoSQL
NoSQL databases are non-relational and non- or loosely-schematized
HBase is a NoSQL database, of the wide column variety
 • Hive implements a SQL layer over it
 • HBase not yet in HDInsight
HBase table = HDFS file
Three other NoSQL categories
 • Key-value store, document store, graph database
 • Azure Table Storage is a key-value store NoSQL database
Some of them aren’t really Big Data tools, but market themselves that way anyway
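To illustrate what "loosely schematized" and "wide column" mean in practice, here is a toy in-memory model in Python. It mirrors the general wide-column idea (fixed column families, free-form column qualifiers per row). The class and its method names are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy wide-column table: row key -> column family -> qualifier -> value.

    Column families are fixed when the table is created; qualifiers
    inside a family are free-form and may differ from row to row.
    """

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(lambda: {f: {} for f in self.families})

    def put(self, row_key, family, qualifier, value):
        """Store one cell, rejecting families the table does not define."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family):
        """Return a copy of one family's cells, or {} for an unknown row."""
        if row_key not in self.rows:
            return {}
        return dict(self.rows[row_key][family])
```

Two rows in the same table can carry different sets of columns, which is the main contrast with a relational table whose columns are declared once for all rows.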
SQL Parallel Data Warehouse (PDW)
SQL PDW is a Massively Parallel Processing (MPP) database
Teradata, IBM Netezza, HP Vertica also in this category
It’s an array/cluster of SQL servers made to look like one SQL Server
Available as appliance only
 • Purchase from HP, Dell
 • Server, storage and network all pre-built and configured
Many other MPP products based on PostgreSQL
PDW loosely based on acquired DATAllegro product
 • Implemented MPP with Ingres, written in Java, running on Linux
MapReduce versus MPP
MapReduce
 • Splits preprocessing amongst mapper nodes and aggregation amongst reducers
 • Scales infinitely on commodity hardware
 • Uses locally attached commodity disks on nodes
 • Uses imperative code
 • Processes flat files, wide column tables (HBase), relational tables (Hive)
 • Divide-and-conquer approach, parallel, distributed
MPP
 • Splits query amongst nodes then unifies result sets
 • Scales to high-end assets in the appliance cabinet
 • Uses shared storage (can be more network traffic)
 • Uses SQL
 • Works with relational tables only
 • Divide-and-conquer approach, parallel, distributed
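The MPP side of the comparison ("splits query amongst nodes then unifies result sets") can be sketched with a few in-memory databases. This is a conceptual model only: `sqlite3` connections stand in for the compute nodes, rows are distributed round-robin, and the table name `sales`, the node count, and the sample rows are assumptions. It does not describe how PDW is implemented.

```python
import sqlite3
from collections import Counter

NODE_COUNT = 3  # hypothetical number of compute nodes

# Hypothetical fact rows: (platform, qty)
ROWS = [("ios", 3), ("android", 2), ("ios", 5), ("windows", 1), ("android", 4)]


def make_nodes(rows):
    """Create one in-memory database per node and spread the rows
    across them round-robin, so each node holds only part of the table."""
    nodes = [sqlite3.connect(":memory:") for _ in range(NODE_COUNT)]
    for db in nodes:
        db.execute("CREATE TABLE sales (platform TEXT, qty INTEGER)")
    for i, row in enumerate(rows):
        nodes[i % NODE_COUNT].execute("INSERT INTO sales VALUES (?, ?)", row)
    return nodes


def query_all(nodes):
    """Control-node step: send the same SQL to every node, then unify
    the partial result sets by summing the per-platform subtotals."""
    sql = "SELECT platform, SUM(qty) FROM sales GROUP BY platform"
    totals = Counter()
    for db in nodes:
        for platform, subtotal in db.execute(sql):
            totals[platform] += subtotal
    return dict(totals)
```

The caller writes one SQL statement and sees one result, while each node aggregates only its own slice. In the MapReduce column the equivalent split and merge is written by hand as mapper and reducer code.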
PolyBase
To be included in next version of PDW
Mashup of SQL Server and Hadoop
Enables PDW to address Hadoop data nodes (HDFS) directly
Parallelism managed by PDW
Tables are imported into SQL Server db
 • They are EXTERNAL tables
 • They can participate in joins with standard tables
Resources
MS Big Data/HDInsight
 • http://www.microsoft.com/bigdata
Apache Hadoop
 • http://hadoop.apache.org/
Apache HBase
 • http://hbase.apache.org/
SQL PDW
 • http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx
PolyBase
 • http://gsl.azurewebsites.net/Projects/Polybase.aspx
xVelocity
 • http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/in-memory.aspx
Column store
 • http://en.wikipedia.org/wiki/Column-oriented_DBMS
Power View
 • http://office.microsoft.com/en-us/excel-help/power-view-explore-visualize-and-present-your-data-HA102835634.aspx
Hekaton
 • http://blogs.technet.com/b/dataplatforminsider/archive/2012/11/08/breakthrough-performance-with-in-memory-technologies.aspx
Questions?
Thank You for Attending
