BDA Unit 1 Notes-1 | PDF | Apache Hadoop | Analytics

BDA Unit 1 Notes-1

Big data refers to vast amounts of structured, semi-structured, and unstructured data that organizations collect for advanced analytics and machine learning. It is characterized by the five V's: Volume, Variety, Veracity, Value, and Velocity, and is typically stored in data lakes and processed using cloud-based systems. Analytics architecture plays a crucial role in managing and analyzing this data, supporting data-driven decision-making across various industries while facing challenges such as data quality, integration, and talent gaps.

Uploaded by

Mr. Praneeth

Introduction to Big Data:

Big data is a combination of structured, semi-structured and unstructured data
collected by organizations that can be mined for information and used in machine
learning projects, predictive modeling and other advanced analytics applications.

Systems that process and store big data have become a common component of data
management architectures in organizations, combined with tools that support big
data analytics uses. Big data is often characterized by the three V's: volume,
variety and velocity. These notes use an extended list of five V's, described under
Characteristics of Big Data.

Data Storage and Analysis:

Big data is often stored in a data lake. While data warehouses are commonly built
on relational databases and contain structured data only, data lakes can support
various data types and typically are based on Hadoop clusters, cloud object storage
services, NoSQL databases or other big data platforms.

Many big data environments combine multiple systems in a distributed
architecture; for example, a central data lake might be integrated with other
platforms, including relational databases or a data warehouse. The data in big data
systems may be left in its raw form and then filtered and organized as needed for
particular analytics uses. In other cases, it's preprocessed using data mining tools
and data preparation software so it's ready for applications that are run regularly.

Big data processing places heavy demands on the underlying compute
infrastructure. The required computing power often is provided by clustered
systems that distribute processing workloads across hundreds or thousands of
commodity servers, using technologies like Hadoop and the Spark processing
engine.

Getting that kind of processing capacity in a cost-effective way is a challenge. As a
result, the cloud is a popular location for big data systems. Organizations can
deploy their own cloud-based systems or use managed big-data-as-a-service offerings
from cloud providers. Cloud users can scale up the required number of servers just
long enough to complete big data analytics projects. The business only pays for the
storage and compute time it uses, and the cloud instances can be turned off until
they're needed again.

Characteristics of Big Data:

Big data is data so large that traditional data storage and processing systems
cannot handle it. Many multinational companies use big data technologies to
process their data and run their business. The data flow is estimated to
exceed 150 exabytes per day before replication.

There are five V's of big data that explain its characteristics.

5 V's of Big Data

o Volume
o Veracity
o Variety
o Value
o Velocity

Volume

The name Big Data itself refers to enormous size. Big data is the vast
volume of data generated daily from many sources, such as business processes,
machines, social media platforms, networks, human interactions, and many
more.

For example, Facebook can generate approximately a billion messages, record the
"Like" button about 4.5 billion times, and receive more than 350 million new posts
each day. Big data technologies are designed to handle such large amounts of data.

Variety

Big data can be structured, semi-structured, or unstructured, and it is
collected from different sources. In the past, data was collected only
from databases and spreadsheets, but today it arrives in an array of
forms, such as PDFs, emails, audio, social media posts, photos, and videos.
The data is categorized as below:

Structured data: Structured data follows a fixed schema with all the required
columns defined. It is in tabular form and is stored in a relational database
management system. OLTP (Online Transaction Processing) systems are built to
work with structured data, which is stored in relations, i.e., tables.

Semi-structured: In semi-structured data, the schema is not strictly defined,
e.g., JSON, XML, CSV, TSV, and email. The data carries some organizing
markers, such as tags or delimiters, but it does not have to fit a fixed table.

Unstructured data: Files with no predefined structure, such as log files, audio
files, and image files, are unstructured data. Some organizations have
a great deal of data available but do not know how to derive value from it because
the data is raw.

Quasi-structured data: Textual data with inconsistent formats that can be
formatted with some effort, time, and tools.

Example: Web server logs, i.e., log files created and maintained by a
server that contain a list of activities.
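
The categories above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the table name `orders`, the JSON records, and the log line (written in the common web server log layout) are all made up for this example and do not come from the notes.

```python
import json
import re
import sqlite3

# Structured: rows with a fixed schema in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Asha", 250.0), (2, "Ravi", 400.0)],
)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Semi-structured: self-describing records whose fields can differ.
json_text = '[{"id": 1, "tags": ["new"]}, {"id": 2, "email": "r@example.com"}]'
semi_structured = json.loads(json_text)

# Quasi-structured: free text that needs a pattern to extract fields.
log_line = '10.0.0.7 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120'
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)
match = LOG_PATTERN.match(log_line)
log_fields = match.groupdict() if match else {}

print("total order amount:", total)
print("fields in second JSON record:", sorted(semi_structured[1]))
print("requested path and status:", log_fields["path"], log_fields["status"])
```

The structured data can be queried directly with SQL, the semi-structured records must be inspected field by field, and the log line needs a hand-written pattern before it can be analyzed at all.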

Veracity

Veracity refers to how reliable the data is. Because data arrives from many
sources, it often has to be filtered or translated before it can be trusted.
Veracity also covers the ability to handle and manage data efficiently, which
is essential in business development.
An example of data whose reliability varies is Facebook posts with hashtags.

Value

Value is an essential characteristic of big data. What matters is not the sheer
amount of data that we process or store, but the valuable and reliable data that
we store, process, and analyze.

Velocity

Velocity plays an important role compared to the other characteristics. Velocity
is the speed at which data is created in real time. It covers the speed of
incoming data sets, the rate of change, and bursts of activity. A primary aim of
big data systems is to provide data rapidly when it is demanded.

Big data velocity deals with the speed at which data flows from sources
like application logs, business processes, networks, social media sites, sensors,
mobile devices, etc.
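
To get a feel for velocity, the daily Facebook figures quoted in the Volume section can be converted into per-second rates. This is a rough sketch that assumes the events are spread evenly over the day, which real traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Daily counts as quoted in the Volume section of these notes.
daily_counts = {
    "messages": 1_000_000_000,
    "likes": 4_500_000_000,
    "posts": 350_000_000,
}


def per_second(events_per_day):
    """Average events per second, assuming an even spread over the day."""
    return events_per_day / SECONDS_PER_DAY


rates = {name: per_second(count) for name, count in daily_counts.items()}
for name, rate in rates.items():
    print(f"{name}: about {rate:,.0f} events per second")
```

Even as an average, tens of thousands of events per second is far more than a single serial process is usually expected to ingest, which is why velocity is treated as a defining characteristic.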

Big Data Analytics:

To get valid and relevant results from big data analytics applications, data
scientists and other data analysts must have a detailed understanding of the
available data and a sense of what they're looking for in it. That makes data
preparation, which includes profiling, cleansing, validation and transformation of
data sets, a crucial first step in the analytics process.

Once the data has been gathered and prepared for analysis, various data
science and advanced analytics disciplines can be applied to run different
applications, using tools that provide big data analytics features and capabilities.
Those disciplines include machine learning and its deep learning offshoot,
predictive modeling, data mining, statistical analysis, streaming analytics, text
mining and more.

Using customer data as an example, the different branches of analytics that can be
done with sets of big data include the following:

• Comparative analysis. This examines customer behavior metrics and real-time
customer engagement in order to compare a company's products, services and
branding with those of its competitors.
• Social media listening. This analyzes what people are saying on social media
about a business or product, which can help identify potential problems and
target audiences for marketing campaigns.
• Marketing analytics. This provides information that can be used to improve
marketing campaigns and promotional offers for products, services and
business initiatives.
• Sentiment analysis. All of the data that's gathered on customers can be analyzed
to reveal how they feel about a company or brand, customer satisfaction levels,
potential issues and how customer service could be improved.

Types of Data Analytics


There are four major types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics

Figure: Data Analytics and its Types

Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It
uses data to determine the probable outcome of an event or the likelihood
of a situation occurring. Predictive analytics draws on a variety of statistical
techniques from modeling, machine learning, data mining, and game theory that
analyze current and historical facts to make predictions about future
events. Techniques used for predictive analytics include:
• Linear Regression
• Time Series Analysis and Forecasting
• Data Mining
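
As a small illustration of the first technique in the list, the sketch below fits a straight line to past values by ordinary least squares and uses it to forecast the next value. The monthly sales figures are invented for the example, and the helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up historical data: month number and sales in thousands.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept
print(f"trend: {slope:.1f} per month, forecast for month 6: {forecast_month_6:.1f}")
```

Real predictive models use many more variables and need validation against held-out data, but the idea is the same: learn a relationship from historical facts and apply it to a future point.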

Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to
approach future events. It examines past performance by mining historical data
to understand the causes of success or failure
in the past. Almost all management reporting, such as sales, marketing, operations,
and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to
classify customers or prospects into groups. Unlike a predictive model, which focuses
on predicting the behavior of a single customer, descriptive analytics identifies
many different relationships between customers and products.

Common examples of descriptive analytics are company reports that provide
historical reviews, such as:
• Data Queries
• Reports
• Descriptive Statistics
• Data dashboard
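
Descriptive statistics, one of the examples listed above, can be sketched with Python's standard `statistics` module. The daily sales numbers are invented for illustration.

```python
import statistics

# Made-up daily sales for one week.
daily_sales = [120, 135, 150, 110, 160, 145, 130]

summary = {
    "count": len(daily_sales),
    "total": sum(daily_sales),
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "stdev": statistics.stdev(daily_sales),  # sample standard deviation
    "min": min(daily_sales),
    "max": max(daily_sales),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A report or dashboard is essentially a presentation layer over summaries like this one, computed over much larger historical data sets.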

Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science,
business rules, and machine learning to make a prediction and then suggests a
decision option to take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting
actions that benefit from the predictions and showing the decision maker the
implications of each decision option. Prescriptive analytics not only anticipates
what will happen and when it will happen but also why it will happen. Further,
it can suggest decision options on how to take advantage of a
future opportunity or mitigate a future risk and illustrate the implications of each
decision option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by
using analytics to leverage operational and usage data combined with data of
external factors such as economic data, population demography, etc.

Diagnostic Analytics
In this analysis, we generally use historical data over other data to answer a
question or solve a problem. We look for dependencies and
patterns in the historical data of the particular problem.

For example, companies use this analysis because it gives great insight into a
problem, provided they keep detailed information at their disposal. Otherwise,
data would have to be collected separately for every problem, which would be very
time-consuming. Common techniques used for diagnostic analytics are:
• Data discovery
• Data mining
• Correlations
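
The correlation technique listed above can be sketched as a Pearson correlation coefficient computed by hand. The advertising and sales figures are invented, and the function name `pearson` is only illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up historical data: did sales move together with ad spend?
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 62, 79, 103]

r = pearson(ad_spend, sales)
print(f"correlation between ad spend and sales: {r:.3f}")
```

A value close to 1 or -1 points to a strong linear dependency worth investigating further. Correlation alone does not establish the cause, so it is a starting point for diagnosis rather than the answer.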

Typical Analytics Architecture


Analytics architecture refers to the overall design and structure of an analytical
system or environment, which includes the hardware, software, data, and processes
used to collect, store, analyze, and visualize data.

Key components of Analytics Architecture:

There are several key
components that are typically included in an analytics architecture:
1. Data collection: This refers to the process of gathering data from various
sources, such as sensors, devices, social media, websites, and more.
2. Transformation: Once the data has been collected, it should be cleaned
and transformed before it is stored.
3. Data storage: This refers to the systems and technologies used to store and
manage data, such as databases, data lakes, and data warehouses.
4. Analytics: This refers to the tools and techniques used to analyze and interpret
data, such as statistical analysis, machine learning, and visualization.
Together, these components enable organizations to collect, store, process,
analyze, and visualize data in order to support data-driven decision-making
and drive business value.
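
The four components can be sketched as a tiny pipeline. This is a toy under stated assumptions: the sensor records are invented, a dictionary stands in for a real database or data lake, and the function names are hypothetical.

```python
from collections import defaultdict


def collect():
    """Data collection: pretend these records came from sensors."""
    return [
        {"device": "s1", "temp_c": "21.5"},
        {"device": "s2", "temp_c": "bad-reading"},
        {"device": "s1", "temp_c": "22.5"},
        {"device": "s2", "temp_c": "19.0"},
    ]


def transform(records):
    """Transformation: drop malformed readings and convert types."""
    clean = []
    for record in records:
        try:
            clean.append({"device": record["device"], "temp_c": float(record["temp_c"])})
        except ValueError:
            continue  # skip readings that are not numeric
    return clean


def store(records, storage):
    """Data storage: a dict of lists stands in for a database or data lake."""
    for record in records:
        storage[record["device"]].append(record["temp_c"])
    return storage


def analyze(storage):
    """Analytics: average reading per device."""
    return {device: sum(values) / len(values) for device, values in storage.items()}


storage = store(transform(collect()), defaultdict(list))
averages = analyze(storage)
print(averages)
```

Keeping each stage as a separate function mirrors the architecture: any one stage can be replaced (for example, swapping the dictionary for a data warehouse) without rewriting the others.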

Benefits:

There are several ways in which you can use analytics architecture to benefit your
organization:
1. Support data-driven decision-making: Analytics architecture can be used to
collect, store, and analyze data from a variety of sources, such as transactions,
social media, web analytics, and sensor data. This can help you make more
informed decisions by providing you with insights and patterns that you may
not have been able to detect otherwise.
2. Improve efficiency and effectiveness: By using analytics architecture to
automate tasks such as data integration and data preparation, you can reduce the
time and resources required to analyze data, and focus on more value-added
activities.
3. Enhance customer experiences: Analytics architecture can be used to gather
and analyze customer data, such as demographics, preferences, and behaviors,
to better understand and meet the needs of your customers. This can help you
improve customer satisfaction and loyalty.
4. Optimize business processes: Analytics architecture can be used to analyze
data from business processes, such as supply chain management, to identify
bottlenecks, inefficiencies, and opportunities for improvement. This can help
you optimize your processes and increase efficiency.
5. Identify new opportunities: Analytics architecture can help you discover new
opportunities, such as identifying untapped markets or finding ways to improve
product or service offerings.
Analytics architecture can help you make better use of data to drive business value
and improve your organization’s performance.

Applications of Analytics Architecture

Analytics architecture can be applied in a variety of contexts and industries to
support data-driven decision-making and drive business value. Here are a few
examples of how analytics architecture can be used:
1. Financial services: Analytics architecture can be used to analyze data from
financial transactions, customer data, and market data to identify patterns and
trends, detect fraud, and optimize risk management.
2. Healthcare: Analytics architecture can be used to analyze data from electronic
health records, patient data, and clinical trial data to improve patient outcomes,
reduce costs, and support research.
3. Retail: Analytics architecture can be used to analyze data from customer
transactions, web analytics, and social media to improve customer experiences,
optimize pricing and inventory, and identify new opportunities.
4. Manufacturing: Analytics architecture can be used to analyze data from
production processes, supply chain management, and quality control to
optimize operations, reduce waste, and improve efficiency.
5. Government: Analytics architecture can be used to analyze data from a variety
of sources, such as census data, tax data, and social media data, to support
policy-making, improve public services, and promote transparency.
Analytics architecture can be applied in a wide range of contexts and industries to
support data-driven decision-making and drive business value.

Limitations of Analytics Architecture

There are several limitations to consider when designing and implementing an
analytical architecture:
1. Complexity: Analytical architectures can be complex and require a high level
of technical expertise to design and maintain.
2. Data quality: The quality of the data used in the analytical system can
significantly impact the accuracy and usefulness of the results.
3. Data security: Ensuring the security and privacy of the data used in the
analytical system is critical, especially when working with sensitive or personal
information.
4. Scalability: As the volume and complexity of the data increase, the analytical
system may need to be scaled to handle the increased load. This can be a
challenging and costly task.
5. Integration: Integrating the various components of the analytical system can be
a challenge, especially when working with a diverse set of data sources and
technologies.
6. Cost: Building and maintaining an analytical system can be expensive, due to
the cost of hardware, software, and personnel.
7. Data governance: Ensuring that the data used in the analytical system is
properly governed and compliant with relevant laws and regulations can be a
complex and time-consuming task.
8. Performance: The performance of the analytical system can be impacted by
factors such as the volume and complexity of the data, the quality of the
hardware and software used, and the efficiency of the algorithms and processes
employed.

Advantages of Analytics Architecture


There are several advantages to using an analytical architecture in data-driven
decision-making:
1. Improved accuracy: By using advanced analytical techniques and tools, it is
possible to uncover insights and patterns in the data that may not be apparent
through traditional methods of analysis.
2. Enhanced decision-making: By providing a more complete and accurate view
of the data, an analytical architecture can help decision-makers to make more
informed decisions.
3. Increased efficiency: By automating certain aspects of the analysis process, an
analytical architecture can help to reduce the time and effort required to
generate insights from the data.
4. Improved scalability: An analytical architecture can be designed to handle
large volumes of data and scale as the volume of data increases, enabling
organizations to make data-driven decisions at a larger scale.
5. Enhanced collaboration: An analytical architecture can facilitate collaboration
and communication between different teams and stakeholders, helping to ensure
that everyone has access to the same data and insights.
6. Greater flexibility: An analytical architecture can be designed to be flexible
and adaptable, enabling organizations to easily incorporate new data sources
and technologies as they become available.
7. Improved data governance: An analytical architecture can include
mechanisms for ensuring that the data used in the system is properly governed
and compliant with relevant laws and regulations.
8. Enhanced customer experience: By using data and insights generated through
an analytical architecture, organizations can improve their understanding of their
customers and provide a more personalized and relevant customer experience.

Major Challenges of Big Data Analytics


Some of the major challenges that big data analytics faces include the following:
Uncertainty of Data Management Landscape: Because big data is continuously
expanding, new companies and technologies are developed every day. A big
challenge for companies is to find out which technology works best for them
without introducing new risks and problems.
The Big Data Talent Gap: While big data is growing, very few experts are
available. This is because big data is a complex field, and people who understand
its complexity and intricate nature are few and far between.
Getting data into the big data platform: Data is increasing every single day.
This means that companies have to tackle an ever-growing amount of data on a regular
basis. The scale and variety of data available today can overwhelm any data
practitioner, which is why it is important to make data accessibility simple and
convenient for brand managers and owners.
Need for synchronization across data sources: As data sets become more
diverse, they must be incorporated into an analytical platform. If this is
ignored, it can create gaps and lead to wrong insights and messages.
Getting important insights through the use of Big data analytics: It is
important that companies gain proper insights from big data analytics and
that the correct department has access to this information. A major
challenge in big data analytics is bridging this gap in an effective fashion.

Benefits of big data analytics

Incorporating big data analytics into a business or organisation has several
advantages. These include:
• Cost reduction: Big data can reduce the cost of storing all business data in one place.
Tracking analytics also helps companies find ways to work more efficiently to cut
costs wherever possible.
• Product development: Developing and marketing new products, services, or brands
is much easier when based on data collected from customers' needs and wants. Big
data analytics also helps businesses understand product viability and keep up with
trends.
• Strategic business decisions: The ability to constantly analyse data helps
businesses make better and faster decisions, such as cost and supply chain
optimisation.
• Customer experience: Data-driven algorithms help marketing efforts (targeted ads,
for example) and increase customer satisfaction by delivering an enhanced customer
experience.
• Risk management: Businesses can identify risks by analyzing data patterns and
developing solutions for managing those risks.

Hadoop Ecosystem:

The Hadoop ecosystem is a platform or framework that helps in solving big data
problems.
It comprises different components and services (ingesting, storing, analyzing,
and maintaining data).
Most of the services available in the Hadoop ecosystem supplement the
four core components of Hadoop: HDFS, YARN, MapReduce and
Common.
The Hadoop ecosystem includes both Apache open source projects and a wide
variety of commercial tools and solutions.
Some of the well-known open source examples include Spark, Hive,
Pig, Sqoop and Oozie.

Differences between Business Intelligence and Big Data:

Purpose
Business Intelligence: The purpose of Business Intelligence is to help the business to make better decisions. Business Intelligence helps deliver accurate reports by extracting information directly from the data source.
Big Data: The primary purpose of Big Data is to capture, process, and analyze the data, both structured and unstructured, to improve customer outcomes.

Ecosystem / Components
Business Intelligence: Operation systems, ERP databases, Data Warehouse, Dashboard, etc.
Big Data: Hadoop, Spark, R Server, Hive, HDFS, etc.

Tools
Business Intelligence: Below is the list of tools used for business intelligence. These tools enable businesses to collate, analyze and visualize data to make better business decisions and develop good strategic plans.
• Tableau
• Qlik Sense
• Online analytical processing (OLAP)
• Sisense
• Data Warehousing
• Digital Dashboards and Data Mining
• Microsoft Power BI
• Google Analytics etc.
Big Data: Below is the list of tools used in Big Data. These tools or frameworks store a large amount of data and process them to get insights from data to make good decisions for the business.
• Hadoop
• Spark
• Hive
• Polybase
• Presto
• Cassandra
• Plotly
• Cloudera
• Storm etc.

Characteristics / Properties
Business Intelligence: Below are the six features of Business Intelligence: Location intelligence, Executive Dashboards, "what if" analysis, Interactive reports, Metadata layer, and Ranking reports.
Big Data: Big data can be described by Volume, Variety, Variability, Velocity, and Veracity.

Benefits
Business Intelligence: Below is the list of benefits of Business Intelligence:
• It helps in making better business decisions.
• Faster and more accurate reporting and analysis.
• Improved data quality.
• Reduced costs.
• Increase revenues.
• Improved operational efficiency etc.
Big Data: Below is the list of benefits of Big Data:
• Better decision making.
• Fraud detection.
• Storage, mining, and analysis of data.
• Market prediction and forecasting.
• Improves the service.
• Helps in implementing the new strategies.
• Keep up with customer trends.
• Cost savings.
• Better sales insights, which help in increasing revenues etc.

Applied Fields
Business Intelligence: Social Media, Healthcare, Gaming Industry, Food Industry, etc.
Big Data: The Banking Sector, Entertainment, Social Media, Healthcare, Retail, Wholesale, etc.

Hadoop:
Hadoop is a framework that uses distributed storage and parallel processing to store
and manage big data. It is widely used by data analysts to handle big data,
and its market size continues to grow. There are three main components of Hadoop:
Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
Hadoop MapReduce - Hadoop MapReduce is the processing unit.
Hadoop YARN - Yet Another Resource Negotiator (YARN) is the resource
management unit.

Features of Hadoop
Apache Hadoop is one of the most popular big data tools. It provides
a reliable storage layer (HDFS),
a batch processing engine (MapReduce), and
a resource management layer (YARN).
Important features of Hadoop are given below:
1. Open Source
Apache Hadoop is an open source project, which means its code can be modified
according to business requirements.
2. Distributed Processing
As data is stored in a distributed manner in HDFS across the cluster, data is
processed in parallel on a cluster of nodes.
3. Fault Tolerance
This is one of the most important features of Hadoop. By default, 3 replicas of
each block are stored across the cluster, and this number can be changed as
required. If any node goes down, the data on that node can be recovered
from other nodes. Failures of nodes or
tasks are recovered automatically by the framework. This is how Hadoop is fault
tolerant.

4. Reliability
Because data is replicated in the cluster, it is stored reliably on the cluster of
machines despite machine failures. Even if a machine goes down, the data
remains safely stored.
5. High Availability
Data is highly available and accessible despite hardware failure because multiple
copies of the data exist. If a machine or some hardware crashes, the data can be accessed
from another path.
6. Scalability
Hadoop is highly scalable in that new hardware can easily be added to the
nodes. Hadoop also provides horizontal scalability, which means
new nodes can be added on the fly without any downtime.
7. Economic
Apache Hadoop is not very expensive as it runs on a cluster of commodity
hardware. No specialized machine is needed for it. Hadoop also provides
large cost savings because it is easy to add more nodes on the fly. If
requirements increase, you can add nodes without any downtime
and without much pre-planning.
8. Easy to use
The client does not need to deal with distributed computing; the framework takes care of
all of it. This makes Hadoop easy to use.
9. Data Locality
This is a distinctive feature of Hadoop that lets it handle big data easily.
Hadoop works on the data locality principle, which states: move computation to data
instead of data to computation. When a client submits a MapReduce job,
the job is moved to the data in the cluster rather than bringing the data to the
location where the job was submitted and then processing it.
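
The fault tolerance feature can be illustrated with a small simulation. It is a simplification under stated assumptions: blocks are placed round-robin over four made-up nodes, whereas real HDFS uses rack-aware placement, and the function names are hypothetical.

```python
REPLICATION_FACTOR = 3  # default replica count stated in the notes


def place_blocks(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin.

    This is NOT the real HDFS placement policy, only a stand-in.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for index, block_id in enumerate(block_ids):
        placement[block_id] = [
            nodes[(index + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def surviving_replicas(placement, failed_nodes):
    """For each block, list the replicas that are still on healthy nodes."""
    return {
        block_id: [node for node in replicas if node not in failed_nodes]
        for block_id, replicas in placement.items()
    }


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["blk_1", "blk_2", "blk_3", "blk_4"], nodes)
after_failure = surviving_replicas(placement, failed_nodes={"node2"})

for block_id, replicas in after_failure.items():
    print(block_id, "still readable from", replicas)
```

With three replicas per block, losing one node never removes the last copy of any block, which is the property the Fault Tolerance, Reliability, and High Availability features all rely on.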

Hadoop Assumptions

Hadoop is written with large clusters of computers in mind and is built around the
following assumptions:
• Hardware may fail (as commodity hardware can be used).
• Processing will be run in batches. Thus there is an emphasis on high throughput
as opposed to low latency.
• Applications that run on HDFS have large data sets. A typical file in HDFS is
gigabytes to terabytes in size.
• Applications need a write-once-read-many access model.
• Moving computation is cheaper than moving data.
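
The last assumption can be made concrete with some rough arithmetic. All three figures below (1 TB of input, a 10 MB job, a 1 Gbit/s link) are assumed values chosen for illustration, not measurements, and the estimate ignores protocol overhead and disk speed.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_second):
    """Ideal time to push `size_bytes` over a link, ignoring all overhead."""
    return size_bytes * 8 / bandwidth_bits_per_second


DATA_SIZE_BYTES = 1 * 10**12   # assumed: 1 TB of input data
JOB_SIZE_BYTES = 10 * 10**6    # assumed: 10 MB of job code
NETWORK_BPS = 1 * 10**9        # assumed: 1 Gbit/s network link

move_data = transfer_seconds(DATA_SIZE_BYTES, NETWORK_BPS)
move_code = transfer_seconds(JOB_SIZE_BYTES, NETWORK_BPS)

print(f"moving the data: {move_data:,.0f} s (about {move_data / 3600:.1f} hours)")
print(f"moving the code: {move_code:.2f} s")
```

Under these assumptions, shipping the job to the nodes that already hold the data takes a fraction of a second, while shipping the data to the job takes hours.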

Design Principles of Hadoop

Below are the design principles on which Hadoop works:

a) System shall manage and heal itself
• Automatically and transparently route around failure (Fault Tolerant)
• Speculatively execute redundant tasks if certain nodes are detected to be slow

b) Performance shall scale linearly
• Proportional change in capacity with resource change (Scalability)

c) Computation should move to data
• Lower latency, lower bandwidth (Data Locality)

d) Simple core, modular and extensible (Economical)

Comparison with other systems


The rows below compare an RDBMS with Hadoop.

1. RDBMS: Traditional row-column based databases, basically used for data storage, manipulation and retrieval.
   Hadoop: An open-source software used for storing data and running applications or processes concurrently.

2. RDBMS: Mostly structured data is processed.
   Hadoop: Both structured and unstructured data is processed.

3. RDBMS: It is best suited for an OLTP environment.
   Hadoop: It is best suited for big data.

4. RDBMS: It is less scalable than Hadoop.
   Hadoop: It is highly scalable.

5. RDBMS: Data normalization is required.
   Hadoop: Data normalization is not required.

6. RDBMS: It stores transformed and aggregated data.
   Hadoop: It stores huge volumes of data.

7. RDBMS: It has very low latency in response.
   Hadoop: It has some latency in response.

8. RDBMS: The data schema is static.
   Hadoop: The data schema is dynamic.

9. RDBMS: High data integrity is available.
   Hadoop: Lower data integrity is available than in an RDBMS.

10. RDBMS: Cost is applicable for licensed software.
    Hadoop: Free of cost, as it is open source software.

Hadoop Components:

The Hadoop architecture mainly consists of 4 components:

• MapReduce
• HDFS (Hadoop Distributed File System)
• YARN (Yet Another Resource Negotiator)
• Common Utilities or Hadoop Common

1. MapReduce

MapReduce is a programming model that runs on the
YARN framework. Its major feature is distributed
processing in parallel across a Hadoop cluster, which is what makes Hadoop fast.
When you are dealing with big data, serial processing is no longer practical.
MapReduce has 2 main tasks, which are divided phase-wise: in the first phase, Map is
used, and in the next phase, Reduce is used.
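
The two phases can be sketched in plain Python with the classic word count example. This runs in a single process and does not use Hadoop's actual API. The `shuffle` step, which groups values by key between the two phases, is included because Reduce needs its input grouped, and all function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data needs big clusters", "data moves to clusters"]

# In a real cluster, each line (or block) could be mapped on a different node.
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(word_counts)
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can be spread over many machines, which is what makes the model suitable for parallel processing.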
2. HDFS

HDFS (Hadoop Distributed File System) is used for storage. It is
mainly designed for working on commodity hardware (inexpensive
devices), using a distributed file system design. HDFS is designed to
store data in large blocks rather than in many small blocks.

HDFS provides fault tolerance and high availability to the storage
layer and the other devices present in the Hadoop cluster. The data storage nodes in
HDFS are:

NameNode (Master)

DataNode (Slave)
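
A quick calculation shows what storing data in large blocks means in practice. The 128 MB block size is an assumption here (it is the usual default in Hadoop 2 and later and is configurable); the replication factor of 3 is the default mentioned earlier in these notes. The function names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed default block size; configurable in HDFS
REPLICATION = 3      # default replica count mentioned in the notes


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Approximate cluster storage consumed once every block is replicated."""
    return file_size_mb * replication


for size_mb in (130, 1024, 10_240):
    print(
        f"{size_mb} MB file -> {block_count(size_mb)} blocks, "
        f"about {raw_storage_mb(size_mb)} MB of raw storage"
    )
```

Large blocks keep the number of blocks per file small, which in turn keeps the amount of metadata the NameNode has to track manageable.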

3. YARN (Yet Another Resource Negotiator)

YARN is the framework on which MapReduce works. YARN performs 2 operations:
job scheduling and resource management. The purpose of the job scheduler
is to divide a big task into small jobs so that each job can be assigned to various
slaves in a Hadoop cluster and processing can be maximized. The job scheduler also
keeps track of which jobs are important, which jobs have higher priority, dependencies
between the jobs, and other information such as job timing. The
resource manager manages all the resources that are made available for
running a Hadoop cluster.

Features of YARN
Multi-Tenancy

Scalability

Cluster-Utilization

Compatibility

HADOOP DAEMONS:
A daemon is a background process. Hadoop daemons are the set of processes that
run on Hadoop. Hadoop is a framework written in Java, so all these processes are
Java processes.

Apache Hadoop 2 consists of the following Daemons:

NameNode

DataNode

Secondary Name Node

Resource Manager

Node Manager

NameNode, Secondary NameNode, and Resource Manager run on the Master system,
while the Node Manager and DataNode run on the Slave machines.

1. NameNode

NameNode runs on the Master system. Its primary purpose is to manage all the
MetaData. MetaData is the list of files stored in HDFS (Hadoop Distributed File
System). Data is stored in the form of blocks in a Hadoop cluster, so the MetaData
records the DataNode, that is, the location, on which each block of a file is
stored. All information regarding the logs of the transactions happening in a
Hadoop cluster (when and by whom data was read or written) is also stored in the
MetaData. The MetaData is kept in memory.
Features:

It never stores the data that is present in the file.

As Namenode works on the Master System, the Master system should have good
processing power and more RAM than Slaves.

It stores information about the DataNodes, such as their block IDs and number of
blocks.
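
A minimal sketch of the kind of bookkeeping described above, assuming a simplified in-memory structure. The class NameNodeMetadata and its methods are hypothetical names used only for illustration and do not correspond to Hadoop's internal classes.

```python
from dataclasses import dataclass, field


@dataclass
class NameNodeMetadata:
    """Toy in-memory namespace: file -> block IDs, block ID -> DataNodes."""
    file_blocks: dict = field(default_factory=dict)      # file path -> list of block IDs
    block_locations: dict = field(default_factory=dict)  # block ID -> list of DataNode names

    def add_block(self, path, block_id, datanodes):
        """Record that a new block of `path` is stored on the given DataNodes."""
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = list(datanodes)

    def locate(self, path):
        """Return [(block_id, [datanodes])] for a file; no file contents live here."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]

    def blocks_on(self, datanode):
        """Return the block IDs held by one DataNode."""
        return [b for b, nodes in self.block_locations.items() if datanode in nodes]
```

Note that the structure holds only names and locations, never the bytes of a file, which matches the first feature listed above.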

2. DataNode

DataNode runs on the Slave system. The NameNode instructs the DataNodes where to
store the data. DataNode is a program that runs on the slave system and serves
read/write requests from the client. Because the data is stored on the DataNodes,
they should have a large storage capacity.

3. Secondary NameNode

Secondary NameNode takes periodic (by default hourly) checkpoints of the
NameNode's metadata; it is not a backup of the file data itself. Each checkpoint
is stored in a file named fsimage. If the NameNode fails or crashes, this file can
be transferred to a new system, the MetaData is loaded on that system, a new
Master is created with this MetaData, and the cluster is made to run again
correctly.
This is the benefit of the Secondary NameNode. In Hadoop 2, the High-Availability
and Federation features reduce the importance of the Secondary NameNode. It
periodically reads the MetaData held in the RAM of the NameNode and writes it to
the hard disk.
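
The checkpoint idea can be sketched as replaying a log of namespace changes on top of the last saved image. This is a simplified illustration only: the edit tuples ("create", "add_block", "delete") and the function names are assumptions made for the example, not the real fsimage or edit log formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to a namespace dictionary (path -> block IDs)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = []
    elif op == "add_block":
        namespace[path].append(args[0])
    elif op == "delete":
        namespace.pop(path, None)
    return namespace


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the last image and return the new image."""
    # Copy first so that the previous checkpoint stays intact if the merge fails.
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image
```

After a successful checkpoint, the merged image replaces the old one and the log can start again from empty, so a restart only has to replay the changes made since the last checkpoint.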

4. Resource Manager

Resource Manager is also known as the Global Master Daemon and runs on the
Master system. The Resource Manager manages the resources for the applications
that are running in a Hadoop cluster. The Resource Manager mainly consists of 2
components:

1. ApplicationsManager
2. Scheduler
The ApplicationsManager is responsible for accepting job submissions from clients
and for negotiating the first container on a slave node in the Hadoop cluster to
host the ApplicationMaster. The Scheduler allocates resources to the applications
running in the Hadoop cluster; it is a pure scheduler and does not itself monitor
or track the status of an application.

5. Node Manager

The Node Manager runs on the Slave systems and manages and monitors the
resources, such as memory and disk, within its node. Each Slave node in a Hadoop
cluster has a single NodeManager daemon running on it. It also sends this
monitoring information to the Resource Manager.

Comparing SQL databases and Hadoop

Feature | Hadoop | SQL
Technology | Modern | Traditional
Volume | Usually in petabytes | Usually in gigabytes
Operations | Storage, processing, retrieval and pattern extraction from data | Storage, processing, retrieval and pattern mining of data
Fault Tolerance | Hadoop is highly fault tolerant | SQL has good fault tolerance
Storage | Stores data in the form of key-value pairs, tables, hash maps etc. in distributed systems | Stores structured data in tabular format with fixed schema in cloud
Scaling | Linear | Non-linear
Providers | Cloudera, Hortonworks, AWS etc. provide Hadoop systems | Well-known industry leaders in SQL systems are Microsoft, SAP, Oracle etc.
Data Access | Batch-oriented data access | Interactive and batch-oriented data access
Cost | It is open source and systems can be cost-effectively scaled | It is licensed and costs a fortune to buy a SQL server; moreover, if the system runs out of storage, additional charges also emerge
Time | Statements are executed very quickly | SQL syntax is slow when executed on millions of rows
Optimization | It stores data in HDFS and processes it through MapReduce with huge optimization techniques | It does not have any advanced optimization techniques
Structure | Dynamic schema, capable of storing and processing log data, real-time data, images, videos, sensor data etc. (both structured and unstructured) | Static schema, capable of storing data (fixed schema) in tabular format only (structured)
Data Update | Write data once, read data multiple times | Read and write data multiple times
Integrity | Low | High
Interaction | Hadoop uses JDBC (Java Database Connectivity) to communicate with SQL systems to send and receive data | SQL systems can read and write data to Hadoop systems
Hardware | Uses commodity hardware | Uses proprietary hardware
Training | Learning Hadoop is moderately hard for entry-level as well as seasoned professionals | Learning SQL is easy even for entry-level professionals

Design of HDFS:
HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is mainly
designed to run on commodity hardware (inexpensive devices) and follows a
distributed file system design. HDFS prefers storing data in a small number of
large blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and the
other devices present in the Hadoop cluster. The data storage nodes in HDFS are:

• NameNode (Master)
• DataNode (Slave)
NameNode: The NameNode works as the Master in a Hadoop cluster and guides the
DataNodes (Slaves). It is mainly used for storing the Metadata, i.e. the data
about the data. Metadata can be the transaction logs that keep track of the user’s
activity in a Hadoop cluster.
Metadata can also be the name of the file, its size, and the information about the
location (block number, block IDs) of the DataNodes, which the NameNode stores in
order to find the closest DataNode for faster communication. The NameNode
instructs the DataNodes to perform operations such as delete, create, replicate, etc.
DataNode: DataNodes work as Slaves. They are mainly used for storing the data in
a Hadoop cluster; the number of DataNodes can be from 1 to 500 or even more. The
more DataNodes there are, the more data the Hadoop cluster can store. It is
therefore advised that each DataNode have a high storage capacity so that it can
store a large number of file blocks.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the NameNode and DataNode help users to easily check the
status of the cluster.
• It provides streaming access to file system data.
• HDFS provides file permissions and authentication.

High Level Architecture Of Hadoop

File Block in HDFS: Data in HDFS is always stored in terms of blocks. A single
file is divided into multiple blocks of 128MB, which is the default size and can
also be changed manually.

Let’s understand this concept of breaking a file down into blocks with an example.
Suppose you upload a file of 400MB to HDFS. The file is divided into blocks of
128MB + 128MB + 128MB + 16MB = 400MB. That is, 4 blocks are created, each of
128MB except the last one, which holds the remaining 16MB. Hadoop neither knows
nor cares what data is stored in these blocks, so it may treat the final block as
holding a partial record. In the Linux file system, the size of a file block is
about 4KB, which is far smaller than the default block size in the Hadoop file
system. Hadoop is mainly configured for storing very large data sets, up to
petabytes in size, and this ability to scale is what makes the Hadoop file system
different from other file systems. Nowadays, file blocks of 128MB to 256MB are
typically used in Hadoop.
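
The block arithmetic in the 400MB example can be checked with a short sketch. The function name split_into_blocks is hypothetical; it only reproduces the size calculation and does not touch a real file system.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the list of block sizes (in MB) for a file of the given size."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block only holds the leftover data
    return blocks


if __name__ == "__main__":
    print(split_into_blocks(400))  # [128, 128, 128, 16]
```

Passing block_size_mb=256 to the same function shows how a larger block size reduces the number of blocks the NameNode has to track for the same file.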

File Read in HDFS

Step 1: The client opens the file it wishes to read by calling open() on the
FileSystem object (which for HDFS is an instance of DistributedFileSystem).

Step 2: DistributedFileSystem (DFS) calls the name node, using remote procedure
calls (RPCs), to determine the locations of the first few blocks in the file. For
each block, the name node returns the addresses of the data nodes that have a copy
of that block. The DFS returns an FSDataInputStream to the client for it to read
data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the
data node and name node I/O.

Step 3: The client then calls read() on the stream. DFSInputStream, which has
stored the data node addresses for the first few blocks in the file, then connects
to the first (closest) data node for the first block in the file.

Step 4: Data is streamed from the data node back to the client, which calls read()
repeatedly on the stream.

Step 5: When the end of the block is reached, DFSInputStream closes the
connection to the data node and then finds the best data node for the next block.

Step 6: When the client has finished reading the file, it calls close() on the
FSDataInputStream.
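
The read path can be imitated with a small in-memory simulation. This is a sketch under stated assumptions: ToyNameNode, ToyDataNode, read_file and the distance table are hypothetical names invented for the example, and no real HDFS client API is used.

```python
class ToyNameNode:
    """Holds only metadata: for each file, an ordered list of (block_id, [datanode names])."""

    def __init__(self, block_map):
        self.block_map = block_map

    def get_block_locations(self, path):
        return self.block_map[path]


class ToyDataNode:
    """Holds the actual block contents."""

    def __init__(self, blocks):
        self.blocks = blocks  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes, distance):
    """Read a file block by block, choosing the closest DataNode for each block."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):  # Step 2
        closest = min(holders, key=lambda name: distance[name])   # Steps 3 and 5
        data += datanodes[closest].read_block(block_id)           # Step 4
    return data                                                   # Step 6: reading is done
```

The name node is consulted only for block locations; the block contents themselves always come from a data node.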
File Write in HDFS

Step 1: The client creates the file by calling create() on
DistributedFileSystem (DFS).

Step 2: DFS makes an RPC call to the name node to create a new file in the file
system’s namespace, with no blocks associated with it. The name node performs
various checks to make sure the file doesn’t already exist and that the client has
the right permissions to create the file. If these checks pass, the name node
makes a record of the new file; otherwise, the file can’t be created and the
client is thrown an error, i.e. an IOException.

Step 3: As the client writes data, the DFSOutputStream splits it into packets,
which it writes to an internal queue called the data queue. The data queue is
consumed by the DataStreamer, which is responsible for asking the name node to
allocate new blocks by picking a list of suitable data nodes to store the
replicas.

Step 4: The list of data nodes forms a pipeline. The DataStreamer streams each
packet to the first data node in the pipeline, which stores the packet and
forwards it to the second data node. Similarly, the second data node stores the
packet and forwards it to the third (and last) data node in the pipeline.

Step 5: The DFSOutputStream also maintains an internal queue of packets that are
waiting to be acknowledged by data nodes, called the “ack queue”.

Step 6: When the client has finished writing data, it calls close() on the
stream. This action flushes all the remaining packets to the data node pipeline
and waits for acknowledgments before contacting the name node to signal that the
file is complete.
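
The data queue, pipeline and ack queue can be imitated with a short single-threaded sketch. The function write_file, the use of plain lists as data node stores, and the 4-byte packet size are all assumptions chosen to keep the example small; real HDFS packets are far larger and the stages run concurrently.

```python
from collections import deque


def write_file(data, pipeline, packet_size=4):
    """Push data through a list of data node stores, tracking data and ack queues."""
    # Step 3: split the data into packets and place them on the data queue.
    data_queue = deque(data[i:i + packet_size] for i in range(0, len(data), packet_size))
    ack_queue = deque()

    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)   # Step 5: the packet now awaits acknowledgment
        for node in pipeline:      # Step 4: each node stores the packet, then forwards it
            node.append(packet)
        ack_queue.popleft()        # all nodes in the pipeline have acknowledged it

    # Step 6: the write is complete only when nothing is unsent or unacknowledged.
    return not data_queue and not ack_queue
```

With a pipeline of three lists, every list ends up with the same sequence of packets, which corresponds to three replicas of the block.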
