KEMBAR78
Big Data Class - Introduction | PDF | Apache Hadoop | Apache Spark
0% found this document useful (0 votes)
106 views60 pages

Big Data Class - Introduction

The document provides an introduction to big data, including: - The world now has 5 zettabytes of data that is growing exponentially as more data is generated every day from sources like emails, social media posts, searches, videos watched, and photos uploaded. - Big data is defined by its volume, velocity, variety and value. It requires new technologies and processes to manage the huge volumes of structured and unstructured data that is being generated from many different sources. - Examples of big data use cases include consumer behavioral analytics using clickstream data to gain insights and optimize conversions, and predictive analytics using sensor data to predict maintenance needs or operational results.

Uploaded by

Ameek ghuri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
106 views60 pages

Big Data Class - Introduction

The document provides an introduction to big data, including: - The world now has 5 zettabytes of data that is growing exponentially as more data is generated every day from sources like emails, social media posts, searches, videos watched, and photos uploaded. - Big data is defined by its volume, velocity, variety and value. It requires new technologies and processes to manage the huge volumes of structured and unstructured data that is being generated from many different sources. - Examples of big data use cases include consumer behavioral analytics using clickstream data to gain insights and optimize conversions, and predictive analytics using sensor data to predict maintenance needs or operational results.

Uploaded by

Ameek ghuri
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 60

Big Data

Class - 1
Introduction

► We now have 5 zettabyte of Data in the world


► 1000000000000000000000 of Bytes
► Every 1 Billion of computer of 1TB of Storage
► 175 zettabyte by 2025
► 125 zettabyte of DVD Stack can be around the earth 222
times
Introduction

► We now generate  2.5 quintillion bytes of data


everyday
► 1000000000000 Bytes of Data
► Everyday 300BN Emails
► 65BN Whatsapp message
► 26 BN weather forecast request
► Every day 5BN Hours on YouTube and 500000 New Hours on
video content
► Everyday 5BN searches on Google
► 3BN Photos & 300M FB Photos, 750M status
► 165M Hours on NetFlix
Big Data

► 90% of all Data generated in last 18 months


► Historically, a number of the large-scale Internet
search, advertising, and social networking companies
pioneered Big Data hardware and software innovations.
► For example, Google analyzes the clicks, links, and
content on 1.5 trillion page views per day and delivers
search results plus personalized advertising in
milliseconds
Big Data

► Big Data describes a holistic information management


strategy that includes and integrates many new types of
data and data management alongside traditional data.
► While many of the techniques to process and analyze
these data types have existed for some time, it has
been the massive proliferation of data and the lower
cost computing models that have encouraged broader
adoption.
► In addition, Big Data has popularized two foundational
storage and processing technologies: Apache Hadoop
and the NoSQL database
Big Data

► Big Data has also been defined by the four “V”s:


► Volume,
► Velocity,
► Variety,
► Value.

These become a reasonable test to determine whether


you should add Big Data to your information architecture
Volume

► The amount of data


► While volume indicates more data, it is the granular
nature of the data that is unique
► . Big Data requires processing high volumes of
low-density data, that is, data of unknown value, such
as twitter data feeds, clicks on a web page, network
traffic, sensor-enabled equipment capturing data at the
speed of light, and many more.
► It is the task of Big Data to convert low-density data
into high-density data, that is, data that has value. For
some companies, this might be tens of terabytes, for
others it may be hundreds of petabytes.
Velocity

► A fast rate that data is received and perhaps acted upon.


► The highest velocity data normally streams directly into
memory versus being written to disk.
► Some Internet of Things (IoT) applications have health and
safety ramifications that require real-time evaluation and
action.
► Other internet-enabled smart products operate in real-time
or near real-time. As an example, consumer ecommerce
applications seek to combine mobile device location and
personal preferences to make time sensitive offers.
Operationally, mobile application experiences have large
user populations, increased network traffic, and the
expectation for immediate response.
Variety

► New unstructured data types


► Unstructured and semi-structured data types, such as
text, audio, and video require additional processing to
both derive meaning and the supporting metadata
► Once understood, unstructured data has many of the
same requirements as structured data, such as
summarization, lineage, audit ability, and privacy.
► Further complexity arises when data from a known
source changes without notice. Frequent or real-time
schema changes are an enormous burden for both
transaction and analytical environments.
Value

► Data has intrinsic value—but it must be discovered


► There are a range of quantitative and investigative techniques to
derive value from data – from discovering a consumer preference
or sentiment, to making a relevant offer by location, or for
identifying a piece of equipment that is about to fail.
► The technological breakthrough is that the cost of data storage
and compute has exponentially decreased, thus providing an
abundance of data from which statistical sampling and other
techniques become relevant, and meaning can be derived.
► However, finding value also requires new discovery processes
involving clever and insightful analysts, business users, and
executives.
► The real Big Data challenge is a human one, which is learning to
ask the right questions, recognizing patterns, making informed
assumptions, and predicting behavior.
Why Big Data ?

► Data management is getting more complex than it has


ever been before. Big Data is everywhere, on
everyone’s mind, and in many different forms:
advertising, social graphs, news feeds,
recommendations, marketing, healthcare, security,
government, and so on.
► In the last three years, thousands of technologies having
to do with Big Data acquisition, management, and
analytics have emerged; this has given IT teams the
hard task of choosing, without having a comprehensive
methodology to handle the choice most of the time.
How will we make use of the
data?
► Sell new products and services
► Personalize customer experiences
► Sense product maintenance needs
► Predict risk, operational results
► Sell value-added data
Which business processes can
benefit?
► Operational ERP/CRM systems
► BI and Reporting systems
► Predictive analytics, modeling, data mining
Data Ingestion

► Sensor-based real-time events


► Near real-time transaction events
► Real-time analytics
► Near real time analytics
► No immediate analytics
Data Storage

► HDFS (Hadoop plus others)


► File system
► Data Warehouse
► RDBMS
► NoSQL database
Data Processing

► Leave it at the point of capture


► Add minor transformations
► ETL data to analytical platform
► Export data to desktops
Performance

How to maximize speed of ad hoc query, data


transformations, and analytical modeling?
► Analyze and transform data in real-time
► Optimize data structures for intended use
► Use parallel processing
► Increase hardware and memory
► Database configuration and operations
► Dedicate hardware sandboxes
► Analyze data at rest, in-place
What’s Different about Big
Data?
► Big Data introduces new technology, processes, and
skills to your information architecture and the people
that design, operate, and use them
► With new technology, there is a tendency to separate
the new from the old, but we strongly urge you to resist
this strategy. While there are exceptions, the
fundamental expectation is that finding patterns in this
new data enhances your ability to understand your
existing data.
Identifying Big Data
Symptoms
► The two main areas that get people to start thinking
about Big Data are when they start having issues related
to data size and volume; although most of the time
these issues present true and legitimate reasons to think
about Big Data, today, they are not the only reasons to
go this route.
► There are others symptoms that you should also
consider—type of data, for example. How will you
manage to increase various types of data when
traditional data stores, such as SQL databases, expect
you to do the structuring, like creating tables?
Identifying Big Data
Symptoms
► This is not feasible without adding a flexible, schema
less technology that handles new data structures as they
come. When I talk about types of data, you should
imagine unstructured data, graph data, images, videos,
voices, and so on.
► Another symptom comes out of this premise: Big Data is
also about extracting added value information from a
high-volume variety of data. 
► When, previously, there were more read transactions
than write transactions, common caches or databases
were enough when paired with weekly ETL (extract,
transform, load) processing jobs.
Identifying Big Data
Symptoms
► Today that’s not the trend any more. Now, you need an
architecture that is capable of handling data as it comes
through long processing to near real-time processing
jobs.
► The architecture should be distributed and not rely on
the rigid high-performance and expensive mainframe;
instead, it should be based on a more available,
performance driven, and cheaper technology to give it
more flexibility.
Business Use Cases

► In addition to technical and architecture considerations,


you may be facing use cases that are typical Big Data
use cases. Some of them are tied to a specific industry;
others are not specialized and can be applied to various
industries.
► These considerations are generally based on analyzing
application’s logs, such as web access logs, application
server logs, and database logs, but they can also be
based on other types of data sources such as social
network data.
► When you are facing such use cases, you might want to
consider a distributed Big Data architecture if you want
to be able to scale out as your business grows.
Consumer Behavioral
Analytics
► Knowing your customer, or what we usually call the
“360-degree customer view” might be the most popular
Big Data use case.
► This customer view is usually used on e-commerce
websites and starts with an unstructured
click-stream—in other words, it is made up of the active
and passive website navigation actions that a visitor
performs.
► By counting and analyzing the clicks and impressions on
ads or products, you can adapt the visitor’s user
experience depending on their behavior, while keeping
in mind that the goal is to gain insight in order to
optimize the funnel conversion.
Sentiment Analysis

► Companies care about how their image and reputation is


perceived across social networks; they want to minimize
all negative events that might affect their notoriety and
leverage positive events.
► By crawling a large amount of social data in a
near-real-time way, they can extract the feelings and
sentiments of social communities regarding their brand,
and they can identify influential users and contact them
in order to change or empower a trend depending on
the outcome of their interaction with such users.
CRM Onboarding

► You can combine consumer behavioral analytics with


sentiment analysis based on data surrounding the
visitor’s social activities. 
► Companies want to combine these online data sources
with the existing offline data, which is called CRM
(customer relationship management) onboarding, in
order to get better and more accurate customer
segmentation.
► Thus, companies can leverage this segmentation and
build a better targeting system to send
profile-customized offers through marketing actions.
Prediction

► Learning from data has become the main Big Data trend
for the last few years such as in the telecommunication
industry, where prediction router log analysis is
democratized.
► Every time an issue is likely to occur on a device, the
company can predict it and order part to avoid
downtime or lost profits.
► When combined with the previous use cases, you can
use predictive architecture to optimize the product
catalog selection and pricing depending on the user’s
global behavior.
Hadoop Distribution

In a Big Data project that involves Hadoop-related


ecosystem technologies, you have two choices:
► Download the project you need separately and try to
create or assemble the technologies in a coherent,
resilient, and consistent architecture.
► Use one of the most popular Hadoop distributions,
which assemble or create the technologies for you.
Hortonworks and Cloudera are the main actors in this
field. There are a couple of differences between the
two vendors, but for starting a Big Data package, they
are equivalent, as long as you don’t pay attention to the
proprietary add-ons.
Cloudera 

► Cloudera adds a set of in-house components to the


Hadoop-based components;
► these components are designed to give you better
cluster management and search experiences.
Cloudera Components

► Impala: A real-time, parallelized, SQL-based engine that


searches for data in HDFS (Hadoop Distributed File
System) and Base. Impala is considered to be the fastest
querying engine within the Hadoop distribution vendors
market, and it is a direct competitor of Spark from UC
Berkeley.
► Cloudera Manager: This is Cloudera’s console to
manage and deploy Hadoop components within your
Hadoop cluster.
► Hue: A console that lets the user interact with the data
and run scripts for the different Hadoop components
contained in the cluster.
Cloudera Hadoop Distribution

•Orange Hadoop
core stack.
•Pink Hadoop
ecosystem
project.
•Blue
Cloudera-specifi
c components.
Hortonworks HDP

► Hortonworks is 100-percent open source and is used to


package stable components rather than the last version
of the Hadoop project in its distribution.
► It adds a component management console to the stack
that is comparable to Cloudera Manager.
Hortonworks Hadoop
Distribution
Hadoop Distributed File
System (HDFS)
► The data is stored when it is ingested into the Hadoop
cluster. Generally it ends up in a dedicated file system
called HDFS.
► Key Features:
► Distribution
► High-throughput access
► High availability
► Fault tolerance
► Tuning
► Security
► Load balancing
HDFS

► HDFS is the first class citizen for data storage in a Hadoop


cluster. Data is automatically replicated across the cluster
data nodes.

Data Replication
Apache Flume

► When you are looking to produce ingesting logs, I would


highly recommend that you use Apache Flume; it’s designed
to be reliable and highly available and it provides a simple,
flexible, and intuitive programming model based on
streaming data flows. Basically, you can configure a data
pipeline without a single line of code, only through
configuration.
► Flume is composed of sources, channels, and sinks. The
Flume source basically consumes an event from an external
source, such as an Apache Avro source, and stores it into the
channel. The channel is a passive storage system like a
file system; it holds the event until a sink consumes it. The
sink consumes the event, deletes it from the channel, and
distributes it to an external target.
Apache Flume

Flume
Architecture
Apache Flume

► With Flume, the idea is to use it to move different log


files that are generated by the web servers to HDFS
► For example. Remember that we are likely to work on a
distributed architecture that might have load balancers,
HTTP servers, application servers, access logs, and so
on. 
► We can leverage all these assets in different ways and
they can be handled by a Flume pipeline. 
Apache Sqoop

► Sqoop is a project designed to transfer bulk data


between a structured data store and HDFS.
► You can use it to either import data from an external
relational database to HDFS, Hive, or even HBase, or to
export data from your Hadoop cluster to a relational
database or data warehouse.
► Sqoop supports major relational databases such as
Oracle, MySQL, and Postgres. This project saves you
from writing scripts to transfer the data; instead, it
provides you with performance data transfers features.
Apache Sqoop

► Since the data can grow quickly in our relational


database, it’s better to identity fast growing tables
from the beginning and use Sqoop to periodically
transfer the data in Hadoop so it can be analyzed.
► Then, from the moment the data is in Hadoop, it is
combined with other data, and at the end, we can use
Sqoop export to inject the data in our business
intelligence (BI) analytics tools.
Yarn: NextGen MapReduce

► MapReduce was the main processing framework in the first


generation of the Hadoop cluster; it basically grouped sibling
data together (Map) and then aggregated the data in
depending on a specified aggregation operation (Reduce).
► In Hadoop 1.0, users had the option of writing MapReduce
jobs in different languages—Java, Python, Pig, Hive, and so
on. Whatever the users chose as a language, everyone relied
on the same processing model: MapReduce.
► Since Hadoop 2.0 was released, however, a new architecture
has started handling data processing above HDFS. Now
that YARN (Yet Another Resource Negotiator) has been
implemented, others processing models are allowed and
MapReduce has become just one among them. This means
that users now have the ability to use a specific processing
model depending on their particular use case.
Yarn: NextGen MapReduce

YARN
structure
Hive

► Hive, a high-level programming language brings users


the simplicity and power of querying data from HDFS in
a SQL-like way.
► When you use another language rather than using native
MapReduce, the main drawback is the performance.
► Hive is used for batch processing such as long-term
processing job with a low priority.
Spark Streaming

► Spark Streaming lets you write a processing job as you


would do for batch processing in Java, Scale, or Python,
but for processing data as you stream it. 
► This can be really appropriate when you deal with high
throughput data sources such as a social network
(Twitter), clickstream logs, or web access logs.
► Spark Streaming is an extension of Spark, which
leverages its distributed data processing framework and
treats streaming computation as a series of
nondeterministic, micro-batch computations on small
intervals.
► Spark Streaming can get its data from a variety of
sources but when it is combined
Apache Kafka

► Apache Kafka is a distributed publish-subscribe mes


► Kafka is a persistent messaging and high-throughput
system, it supports both queue and topic semantics, and
it uses ZooKeeper to form the cluster nodes.saging
application written by LinkedIn in Scale. 
► Kafka implements the publish-subscribe enterprise
integration pattern and supports parallelism and
enterprise features for performance and improved fault
tolerance.
Apache Kafka

Kafka Partitioned Topic


Example
Spark MLlib

► MLlib enables machine learning for Spark, it leverages


the Spark Direct Acyclic Graph (DAG) execution engine,
and it brings a set of APIs that ease machine learning
integration for Spark.
►  It’s composed of various algorithms that go from basic
statistics, logistic regression, k-means clustering, and
Gaussian mixtures to singular value decomposition and
multinomial naive Bayes.
NoSQL Stores

► NoSQL datastores are fundamental pieces of the data


architecture because they can ingest a very large
amount of data
► It provides scalability and resiliency, and thus high
availability
► NoSQL means Not Only SQL
Couchbase

► Couchbase is a document-oriented NoSQL database that


is easily scalable, provides a flexible model, and is
consistently high performance.
►  It exposes a fast key-value store with managed cache
for sub-millisecond data operations, purpose-built
indexers for fast queries and a powerful query engine
for executing SQL-like queries.
ElasticSearch

► ElasticSearch is a NoSQL technology that is very popular


for its scalable distributed indexing engine and search
features.
► It’s based on Apache Lucene and enables real-time data
analytics and full-text search in your architecture.
► ElasticSearch is part of the ELK platform, which stands
for ElasticSearch + Logstash + Kibana, which is delivered
by Elastic the company.
► The three products work together to provide the best
end-to-end platform for collecting, storing, and
visualizing data
ElasticSearch

► Logstash lets you collect data from many kinds of


sources—such as social data, logs, messages queues, or
sensors—it then supports data enrichment and
transformation, and finally it transports them to an
indexation system such as ElasticSearch.
► ElasticSearch indexes the data in a distributed,
scalable, and resilient system. It’s schemaless and
provides libraries for multiple languages so they can
easily and fatly enable real-time search and analytics in
your application.
► Kibana is a customizable user interface in which you
can build a simple to complex dashboard to explore and
visualize data indexed by ElasticSearch.
Big Data Architecture

► Keeping all the Big Data technology we are going to use in


mind, we can now go forward and build the foundation of
our architecture.
Architecture Overview
► A web application the visitor can use to navigate in a catalog of
products
► A log ingestion application that is designed to pull the logs and
process them
► A learning application for triggering recommendations for our
visitor
► A processing engine that functions as the central processing
cluster for the architecture
► A search engine to pull analytics for our process data
Big Data Architecture
Log Ingestion Application

► The log ingestion application is used to consume


application logs such as web access logs.
► To ease the use case, a generated web access log is
provided and it simulates the behavior of visitors
browsing the product catalog.
► These logs represent the clickstream logs that are used
for long-term processing but also for real-time
recommendation.
Log Ingestion Application

► There can be two options in the architecture:


► the first can be ensured by Flume and can transport the
logs as they come in to our processing application;
► the second can be ensured by ElasticSearch, Logstash, and
Kibana (the ELK platform) to create access analytics.
Log Ingestion Application
Learning Application

► The learning application receives a stream of data and


builds prediction to optimize our recommendation
engine.
► This application uses a basic algorithm to introduce the
concept of machine learning based on Spark MLlib.
Learning Application
Processing Engine

► The processing engine is the heart of the architecture;


it receives data from multiple kinds of source and
delegates the processing to the appropriate model.
Search Engine

► The search engine leverages the data processed by the


processing engine and exposes a dedicated RESTful API
that will be used for analytic purposes.

You might also like