Introduction to Hortonworks Data Platform (HDP)
© Copyright IBM Corporation 2021
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Unit objectives
• Describe the functions and features of HDP.
• List the IBM added value components.
• Describe the purpose and benefits of each added value component.
Introduction to Hortonworks Data Platform (HDP) © Copyright IBM Corporation 2021
Hortonworks Data Platform overview
Topics
• Hortonworks Data Platform overview
• Data flow
• Data access
• Data lifecycle and governance
• Security
• Operations
• Tools
• IBM added value components
Hortonworks Data Platform
• HDP is a platform for data at rest.
• It is a secure, enterprise-ready, open-source Apache Hadoop distribution
that is based on a centralized architecture (YARN).
• HDP has the following attributes:
▪ Open
▪ Central
▪ Interoperable
▪ Enterprise-ready
Hortonworks Data Platform
[Figure: HDP component stack]
• Governance Integration
▪ Data Lifecycle and Governance: Falcon, Atlas
▪ Data workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
• Tools: Zeppelin, Ambari User Views
• Security: Ranger, Knox, Atlas, HDFS Encryption
• Operations: Ambari, Cloudbreak, ZooKeeper, Oozie
• Data Access: Batch (MapReduce), Script (Pig), SQL (Hive), NoSQL (HBase,
Accumulo, Phoenix), Stream (Storm), Search (Solr), In-Mem (Spark),
Others (HAWQ, Db2 Big SQL, Partners), with Tez and Slider
• YARN: Data Operating System
• Data Management: Hadoop Distributed File System (HDFS)
Data flow
Kafka
• Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-
subscribe messaging system.
▪ Used for building real-time data pipelines and streaming apps
• Often used in place of traditional message brokers, such as those based on
JMS and AMQP, because of its higher throughput, reliability, and replication.
• Kafka works in combination with a variety of Hadoop tools:
▪ Apache Storm
▪ Apache HBase
▪ Apache Spark
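The publish-subscribe model described on this slide can be sketched with a toy in-memory broker. This is not the Kafka client API: the class `ToyBroker`, the topic name, and the group names are hypothetical and exist only to show how independent consumers read the same append-only log at their own offsets.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a publish-subscribe log (not the Kafka API)."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> append-only message log
        self._offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic log and return its offset."""
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1

    def poll(self, group, topic):
        """Return messages this group has not seen yet and advance its offset."""
        log = self._topics[topic]
        start = self._offsets[(group, topic)]
        self._offsets[(group, topic)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("clicks", "page=/home")
broker.publish("clicks", "page=/cart")

# Two independent subscribers each receive every message on the topic.
storm_batch = broker.poll("storm-topology", "clicks")
spark_batch = broker.poll("spark-job", "clicks")

# A later poll returns only what was published since the previous poll.
broker.publish("clicks", "page=/pay")
storm_next = broker.poll("storm-topology", "clicks")
```

Messages stay in the log after being read, so a second consumer group can read them independently. A real deployment would add partitioning, replication, and durable storage, which this sketch leaves out.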
Sqoop
• Tool to easily import information from structured databases (Db2,
MySQL, Netezza, Oracle, and more) into your Hadoop cluster, including
HDFS and related Hadoop systems (such as Hive and HBase)
• Can also be used to extract data from Hadoop and export it to relational
databases and enterprise data warehouses
• Helps offload tasks such as ETL from an enterprise data warehouse
to Hadoop for lower-cost, more efficient execution
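As an illustration of an import, the sketch below assembles a `sqoop import` command line without running it. The JDBC URL, user name, table, and target directory are hypothetical placeholders, and the helper `sqoop_import_command` is not part of Sqoop; only the command-line options themselves come from the Sqoop 1 tool.

```python
import shlex


def sqoop_import_command(jdbc_url, username, table, target_dir, num_mappers=1):
    """Build the argv for a `sqoop import` run; nothing is executed here."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # JDBC connection string of the source database
        "--username", username,
        "-P",                           # prompt for the password at run time
        "--table", table,               # source table to import
        "--target-dir", target_dir,     # HDFS directory that receives the data
        "--num-mappers", str(num_mappers),
    ]


# Hypothetical source database and HDFS path, for illustration only.
argv = sqoop_import_command(
    "jdbc:mysql://db.example.com/sales", "etl_user", "orders", "/user/etl/orders", 4
)
command_line = shlex.join(argv)
```

Printing `command_line` gives a string that could be pasted into a shell on a node where Sqoop and the matching JDBC driver are installed.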
Data access
Hive
• Apache Hive is a data warehouse system built on top of Hadoop.
• Hive facilitates easy data summarization, ad-hoc queries, and the
analysis of very large datasets that are stored in Hadoop.
• Hive provides SQL on Hadoop
▪ Provides a SQL-like interface, known as HiveQL (HQL), which allows for
easy querying of data in Hadoop
• Includes HCatalog
▪ Global metadata management layer that exposes Hive table metadata to
other Hadoop applications.
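The kind of summarization query that HiveQL supports can be sketched as follows. To keep the example runnable without a cluster, it uses Python's built-in `sqlite3` as a stand-in engine, which is an assumption of this sketch and not how Hive runs queries. The table `page_views` and its data are hypothetical.

```python
import sqlite3

# sqlite3 stands in for Hive here; Hive would run the query over data in Hadoop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("US", 120), ("DE", 45), ("US", 30), ("DE", 5), ("IN", 80)],
)

# A plain aggregate query: total views per country, largest first.
summary_query = """
    SELECT country, SUM(views) AS total_views
    FROM page_views
    GROUP BY country
    ORDER BY total_views DESC
"""
summary = conn.execute(summary_query).fetchall()
conn.close()
```

The query text uses only `SELECT`, `SUM`, `GROUP BY`, and `ORDER BY`, constructs that HiveQL shares with standard SQL.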
Pig
• Apache Pig is a platform for analyzing large data sets.
• Pig consists of a high-level language called Pig Latin, which was
designed to simplify MapReduce programming.
• Pig's infrastructure layer consists of a compiler that produces
sequences of MapReduce programs from this Pig Latin code that you
write.
• The system optimizes your code and translates it into MapReduce,
so you can focus on semantics rather than efficiency.
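To make the translation idea concrete, the sketch below shows a short Pig Latin word count (in a comment) next to a hand-written map, shuffle, and reduce in plain Python. The Python functions are a simplified illustration of the phases involved, not the code that the Pig compiler generates; the file name `input.txt` and the function names are hypothetical.

```python
from collections import defaultdict

# A Pig Latin word count, shown for comparison:
#   lines   = LOAD 'input.txt' AS (line:chararray);
#   words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
#   grouped = GROUP words BY word;
#   counts  = FOREACH grouped GENERATE group, COUNT(words);


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


word_counts = reduce_phase(shuffle(map_phase(["big data", "big tables"])))
```

The four Pig Latin statements describe what to compute, while the Python version spells out each phase by hand.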
HBase
• Apache HBase is a distributed, scalable, big data store.
• Use Apache HBase when you need random, real-time read/write
access to your big data.
▪ The goal of the HBase project is to be able to handle very large tables of
data that are running on clusters of commodity hardware.
• HBase is modeled after Google's BigTable and provides BigTable-like
capabilities on top of Hadoop and HDFS. HBase is a NoSQL data store.
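The BigTable-style data model behind random read/write access can be sketched as a sorted map from row key to versioned cells. This is a toy model in plain Python, not the HBase client API; the class `ToyTable`, the row keys, and the column names are hypothetical.

```python
class ToyTable:
    """Toy BigTable-style table: row key -> column -> versions (newest first)."""

    def __init__(self):
        self._rows = {}  # row key -> {"family:qualifier": [(timestamp, value), ...]}

    def put(self, row, column, value, timestamp):
        """Write a new version of a cell."""
        versions = self._rows.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest version first

    def get(self, row, column):
        """Random read: return the newest value of one cell, or None."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row, stop_row):
        """Range read over row keys in sorted order, stop_row exclusive."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, {col: vers[0][1] for col, vers in self._rows[row].items()}


table = ToyTable()
table.put("user#1001", "profile:email", "a@example.com", timestamp=1)
table.put("user#1001", "profile:email", "b@example.com", timestamp=2)
table.put("user#1002", "profile:email", "c@example.com", timestamp=1)

latest = table.get("user#1001", "profile:email")
scanned = list(table.scan("user#1000", "user#1002"))
```

Because rows are ordered by key, related rows can be read together with a range scan, and a single cell can still be fetched directly by row key and column.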
Accumulo
• Apache Accumulo is a sorted, distributed key/value store that provides
robust, scalable data storage and retrieval.
• Based on Google’s BigTable and runs on YARN
▪ Think of it as a "highly secure HBase"
• Features:
▪ Server-side programming
▪ Designed to scale
▪ Cell-based access control
▪ Stable
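Cell-based access control can be sketched as a filter that compares the labels on each cell with the authorizations of the reader. This is a simplified model in plain Python, not the Accumulo API: it assumes that every label on a cell is required, whereas real Accumulo visibility expressions can also combine labels with OR and parentheses. The function `visible_cells`, the keys, and the labels are hypothetical.

```python
def visible_cells(cells, authorizations):
    """Return only the cells whose required labels are all held by the reader."""
    auths = set(authorizations)
    return {
        key: value
        for key, (value, required_labels) in cells.items()
        if set(required_labels) <= auths
    }


# Each cell carries its own set of required labels (empty set = visible to all).
cells = {
    ("patient#7", "name"): ("Ada", set()),
    ("patient#7", "diagnosis"): ("flu", {"medical"}),
    ("patient#7", "billing"): ("$120", {"finance", "audit"}),
}

nurse_view = visible_cells(cells, {"medical"})
auditor_view = visible_cells(cells, {"finance", "audit"})
```

Two readers scanning the same row get different results, because the check is applied per cell and not per table or per row.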
Spark
• Apache Spark is a fast and general engine for large-scale data
processing.
• Spark has a variety of advantages including:
▪ Speed
− Runs programs faster than MapReduce by keeping data in memory
▪ Easy to use
− Write apps quickly with Java, Scala, Python, R
▪ Generality
− Can combine SQL, streaming, and complex analytics
▪ Runs on a variety of environments and can access diverse data sources
− Hadoop, Mesos, standalone, cloud…
− HDFS, Cassandra, HBase, S3…
Storm
• Apache Storm is an open source distributed real-time computation
system.
▪ Fast
▪ Scalable
▪ Fault-tolerant
• Used to process large volumes of high-velocity data
• Useful when milliseconds of latency matter and Spark isn't fast enough
▪ Has been benchmarked at over a million tuples processed per second per
node
Data lifecycle and governance
Atlas
• Apache Atlas is a scalable and extensible set of core foundational
governance services
▪ Enables enterprises to effectively and efficiently meet their compliance
requirements within Hadoop
• Exchange metadata with other tools and processes within and outside
of Hadoop
▪ Allows integration with the whole enterprise data ecosystem
• Atlas Features:
▪ Data classification
▪ Centralized auditing
▪ Centralized lineage
▪ Security and policy engine
Security
Ranger
• Centralized security framework to enable, monitor, and manage
comprehensive data security across the Hadoop platform
• Manage fine-grained access control over Hadoop data access
components like Apache Hive and Apache HBase
• The Ranger console makes it easy to manage policies for access to files,
folders, databases, tables, or columns
• Policies can be set for individual users or groups
▪ Policies enforced within Hadoop
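Fine-grained, policy-based access control for users and groups can be sketched as below. This is a toy evaluator in plain Python, not Ranger's policy model or API: the `Policy` class, its fields, and the default-deny rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical column-level policy: who may do what on which resource."""
    database: str
    table: str
    columns: frozenset                      # column names, or {"*"} for all columns
    users: frozenset = frozenset()
    groups: frozenset = frozenset()
    accesses: frozenset = frozenset({"select"})


def is_allowed(policies, user, user_groups, database, table, column, access):
    """Allow the request if any policy matches it; deny otherwise."""
    for policy in policies:
        matches_resource = (
            policy.database == database
            and policy.table == table
            and ("*" in policy.columns or column in policy.columns)
        )
        matches_identity = user in policy.users or bool(policy.groups & set(user_groups))
        if matches_resource and matches_identity and access in policy.accesses:
            return True
    return False  # no matching policy: deny by default (an assumption of this sketch)


policies = [
    Policy("sales", "orders", frozenset({"*"}), groups=frozenset({"finance"})),
    Policy("sales", "orders", frozenset({"order_id", "status"}), users=frozenset({"alice"})),
]

alice_status = is_allowed(policies, "alice", ["support"], "sales", "orders", "status", "select")
alice_amount = is_allowed(policies, "alice", ["support"], "sales", "orders", "amount", "select")
bob_amount = is_allowed(policies, "bob", ["finance"], "sales", "orders", "amount", "select")
```

Here one policy is granted to a group and another to an individual user, and the second one limits access to two columns of the same table.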
Operations
Ambari
• For provisioning, managing, and monitoring Apache Hadoop clusters.
• Provides an intuitive, easy-to-use Hadoop management web UI backed by
its RESTful APIs
• Ambari REST APIs
▪ Allow application developers and system integrators to easily integrate
Hadoop provisioning, management, and monitoring capabilities in their own
applications
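A minimal sketch of how an application might prepare a call to the Ambari REST API is shown below. The host name and the credentials are hypothetical, and the request is only built, not sent. The sketch assumes the default Ambari server port 8080, the `/api/v1` base path, and HTTP basic authentication.

```python
import base64
import urllib.request


def ambari_request(base_url, path, username, password):
    """Build (but do not send) an authenticated GET request for the Ambari REST API."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    req = urllib.request.Request(f"{base_url}/api/v1/{path}")
    req.add_header("Authorization", f"Basic {token}")
    # Ambari expects this header on modifying calls; it is harmless on a GET.
    req.add_header("X-Requested-By", "ambari")
    return req


# Hypothetical Ambari server and credentials, for illustration only.
clusters_request = ambari_request(
    "http://ambari.example.com:8080", "clusters", "admin", "admin"
)
# To send it: urllib.request.urlopen(clusters_request) returns a JSON response.
```

The same helper can address other resources by changing `path`, which keeps the authentication details in one place.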
Cloudbreak
• A tool for provisioning and managing Apache Hadoop clusters in the
cloud
• Automates launching of elastic Hadoop clusters
• Policy-based autoscaling on several cloud infrastructure platforms,
including:
▪ Microsoft Azure
▪ Amazon Web Services
▪ Google Cloud Platform
▪ OpenStack
▪ Platforms that support Docker containers
Oozie
• Oozie is a Java-based workflow scheduler system to manage Apache
Hadoop jobs
• Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions
• Integrated with the Hadoop stack
▪ YARN is its architectural center
▪ Supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop
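The idea of a workflow as a DAG of actions can be sketched with Python's standard `graphlib` module (Python 3.9 or later). Real Oozie workflows are defined in XML, so this is only a model of the ordering constraint; the action names and their dependencies are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each action maps to the set of actions that must finish before it starts.
workflow = {
    "sqoop-import": set(),
    "pig-clean": {"sqoop-import"},
    "hive-aggregate": {"pig-clean"},
    "mapreduce-index": {"pig-clean"},
    "sqoop-export": {"hive-aggregate", "mapreduce-index"},
}

# One valid execution order that respects every dependency.
run_order = list(TopologicalSorter(workflow).static_order())


def is_valid_dag(graph):
    """Return False if the dependencies contain a cycle, so no order exists."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


cyclic_ok = is_valid_dag({"a": {"b"}, "b": {"a"}})
```

The two middle actions do not depend on each other, so a scheduler is free to run them in either order or in parallel. A cyclic definition has no valid order at all, which is why the graph must be acyclic.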