Big Data Notes
Streaming data refers to data that is continuously generated, usually in high volumes and at high
velocity. A streaming data source would typically consist of continuous timestamped logs that record
events as they happen – such as a user clicking on a link in a web page, or a sensor reporting the current
temperature.
Data streaming refers to the practice of sending, receiving, and processing information in a stream rather than in
discrete batches.
1. Data Production
2. Data Ingestion
3. Data Processing
4. Data Reporting
Streaming architectures must account for the unique characteristics of data streams, which tend to arrive in massive volumes (terabytes to petabytes), are at best semi-structured, and require significant pre-processing and ETL to become useful.
COMPONENTS
Here are some key concepts and components related to streaming data in big data:
1. Data Sources: Streaming data originates from diverse sources including IoT devices, website
clickstreams, social media feeds, server logs, financial transactions, etc. These sources
continuously generate data that needs to be processed and analyzed in real-time.
2. Streaming Platforms: These are the frameworks or platforms that facilitate the ingestion,
processing, and analysis of streaming data. Examples include Apache Kafka, Apache Flink,
Apache Storm, and Amazon Kinesis. These platforms provide capabilities such as fault
tolerance, scalability, and support for various data processing tasks.
3. Data Ingestion: This involves collecting data from various sources and feeding it into the
streaming platform for processing. Ingestion mechanisms must handle large volumes of data
with low latency to ensure that data is processed in near real-time.
4. Processing: Streaming data processing involves performing operations on data streams as they
are received. This includes filtering, aggregating, joining with other data streams, and applying
machine learning models for real-time predictions or anomaly detection.
5. Windowing and Time Handling: Many streaming systems use windowing techniques to
divide streams into finite segments for processing. This allows computations such as
aggregations to be performed over specific time intervals or event counts. Handling time
accurately is crucial for tasks like event ordering and time-based aggregations (a small sketch follows this list).
6. Integration with Batch Processing: Often, streaming data systems need to integrate with batch
processing frameworks like Apache Hadoop or Spark for more complex analytics or to store
processed data in data lakes or warehouses.
7. Scalability and Fault Tolerance: Streaming systems must be scalable to handle increasing
data volumes and fault-tolerant to ensure continuous operation despite hardware failures or
network issues.
8. Data Storage: Depending on the use case, streaming data may be stored temporarily (e.g., in-
memory databases or caches) or persistently (e.g., in data warehouses or data lakes) after
processing.
9. Use Cases: Streaming data is utilized in various applications such as real-time analytics, fraud
detection, monitoring and alerting, recommendation systems, and IoT data processing.
10. Challenges: Managing the velocity and volume of streaming data, ensuring data quality and
consistency, maintaining low latency, and managing the complexity of distributed systems are
some of the challenges associated with streaming data in big data environments.
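As a concrete illustration of the windowing idea in item 5 above, here is a minimal Python sketch that groups timestamped events into fixed-size (tumbling) windows and counts the events in each window. The event tuples and the 60-second window length are assumptions made for the example; a production system would use a stream processor's windowing operators instead.

from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    # Group (timestamp, value) events into fixed-size windows and count events per window.
    windows = defaultdict(int)
    for ts, _ in events:
        window_start = int(ts // window_seconds) * window_seconds  # start of the window this event falls in
        windows[window_start] += 1
    return dict(windows)

# Hypothetical clickstream: (unix_timestamp, clicked_url)
events = [(1000, "/home"), (1030, "/cart"), (1075, "/home"), (1130, "/checkout")]
print(tumbling_window_counts(events))   # {960: 1, 1020: 2, 1080: 1}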
• Stream computing is a computing paradigm that reads data from collections of software or
hardware sensors in stream form and computes continuous data streams.
• Stream computing uses software programs that compute continuous data streams.
• Stream computing uses software algorithm that analyzes the data in real time.
• Stream computing is one effective way to support Big Data by providing extremely low-latency
velocities with massively parallel processing architectures.
• It is becoming the fastest and most efficient way to obtain useful knowledge from Big Data.
By obtaining useful knowledge from big data quickly, stream computing allows organizations to react as soon as problems appear or to predict new trends in the near future.
Application Background
Big data stream computing analyzes and processes data in real time to gain immediate insight; it is typically applied to vast amounts of data that must be processed at high speed. Many application scenarios require big data stream computing. For example, in financial
industries, big data stream computing technologies can be used in risk management, marketing
management, business intelligence, and so on. In the Internet, big data stream computing technologies
can be used in search engines, social networking, and so on. In Internet of things, big data stream
computing technologies can be used in intelligent transportation, environmental monitoring, and so on.
System Architecture for Stream Computing
In big data stream computing environments, stream computing follows a straight-through computing model: the input arrives as real-time data streams, all continuous streams are computed in real time, and the results must also be updated in real time. The volume of data is so high that there is not enough space to store it all, and not all of it needs to be stored; most data is discarded, and only a small portion is permanently stored on hard disks.
Applications:
Financial Services
Telecommunications
Healthcare
IoT
Retail and E-commerce
Media and Entertainment
Transportation and Logistics
Energy and Utilities
Advantages:
Real-Time Insights
Low Latency
Scalability
Efficient Resource Utilization
Continuous Processing
Adaptive and Dynamic
Disadvantages:
Complexity
Data Consistency
Cost
Fault Tolerance
Counting distinct elements is a problem that frequently arises in distributed systems. In general, the size
of the set under consideration (which we will henceforth call the universe) is enormous. For example, if
we build a system to identify denial of service attacks, the set could consist of all IPv4 and IPv6 addresses.
Another common use case is to count the number of unique visitors on popular websites like Twitter or
Facebook. An obvious approach if the number of elements is not very large would be to maintain a Set.
We can check if the set contains the element when a new element arrives. If not, we add the element to
the set. The size of the set would give the number of distinct elements. However, if the number of
elements is vast or we are maintaining counts for multiple streams, it would be infeasible to maintain the
set in memory. Storing the data on disk would be an option if we are only interested in offline
The first algorithm for counting distinct elements is the Flajolet-Martin algorithm, named after the
algorithm's creators. The Flajolet-Martin algorithm is a single pass algorithm. If there are m distinct
elements in a universe comprising n elements, the algorithm runs in O(n) time and O(log(m)) space
complexity.
First, we pick a hash function h that takes stream elements as input and outputs a bit string. The length
of the bit strings is large enough that the range of the hash function is much larger than the size of the universe. We require at least log n bits if there are n elements in the universe.
r(a) is used to denote the number of trailing zeros in the binary representation of h(a) for an element a in
the stream.
def fm_add(B, x):
    y = hash(x) & 0xFFFFFFFF                   # 32-bit, non-negative hash of the element
    r = (y & -y).bit_length() - 1 if y else 0  # r = position of the rightmost set bit of h(x), i.e. its trailing zeros
    return B | (1 << r)                        # record in bitmap B that some element had r trailing zeros

def fm_estimate(B):
    R = 0
    while B & (1 << R):                        # R = position of the rightmost unset bit of the bitmap B
        R += 1
    return 2 ** R                              # Flajolet-Martin estimate of the number of distinct elements
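A short usage sketch, assuming the fm_add and fm_estimate helpers defined above: the bitmap starts at zero, each stream element is folded in, and the estimate is read off at the end. In practice several independent hash functions are used and their estimates are combined to reduce the variance of the result.

stream = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]

B = 0                        # bitmap of observed trailing-zero counts
for element in stream:
    B = fm_add(B, element)

print(fm_estimate(B))        # rough estimate of the 3 distinct addresses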
Estimating Moments :-
• Estimating moments is a generalization of the problem of counting distinct elements in a stream.
The problem, called computing "moments," involves the distribution of frequencies of different
elements in the stream.
• Suppose a stream consists of elements chosen from a universal set. Assume the universal set is ordered so we can speak of the i-th element for any i.
• Let m_i be the number of occurrences of the i-th element for any i. Then the k-th order moment of the stream is the sum over all i of (m_i)^k.
Estimating moments, such as mean, variance, skewness, and kurtosis, involves calculating statistical
properties that describe the distribution of data. Here’s a step-by-step outline of how these moments are
typically estimated:
• Definition: The variance measures the spread of data points around the mean μ = (1/n) Σ_{i=1}^{n} x_i and is defined as
σ² = (1/n) Σ_{i=1}^{n} (x_i − μ)²
• Definition: Skewness measures the asymmetry of the distribution around the mean:
Skewness = [ (1/n) Σ_{i=1}^{n} (x_i − μ)³ ] / [ (1/n) Σ_{i=1}^{n} (x_i − μ)² ]^(3/2)
• Definition: Kurtosis measures the "tailedness" or heaviness of the tails of the distribution:
Kurtosis = [ (1/n) Σ_{i=1}^{n} (x_i − μ)⁴ ] / [ (1/n) Σ_{i=1}^{n} (x_i − μ)² ]² − 3
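A minimal NumPy sketch that evaluates the sample moments above directly from their formulas; the small array of values is made up for the example, and this is a plain batch computation rather than a streaming estimator.

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()                                   # mean
var = np.mean((x - mu) ** 2)                    # variance, (1/n) form
skew = np.mean((x - mu) ** 3) / var ** 1.5      # skewness
kurt = np.mean((x - mu) ** 4) / var ** 2 - 3    # excess kurtosis

print(mu, var, skew, kurt)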
Advantages:
Computational Efficiency
Scalability
Real-Time Analysis
Memory Efficiency
Disadvantages:
Accuracy Trade-offs
Bias and Error
Complexity in Implementation
Assumption Sensitivity
Sampling Issues
The functioning of real-time analytics:
Real-time data analytics tools can either push or pull data. Streaming requires the ability to push huge amounts of fast-moving data. When streaming consumes too many resources and is not practical, data can instead be pulled at intervals that range from seconds to hours. The pull can be timed around business needs that require computing resources, so as not to disrupt operations. Response times for real-time analytics can range from nearly instantaneous to a few seconds or minutes. The components of real-time data analytics include the following:
• Aggregator
• Broker
• Analytics engine
• Stream processor
1. Financial Trading: RTAP is widely used in financial markets to analyze market data, detect
trading opportunities, and execute trades within milliseconds. It helps traders make informed
decisions based on up-to-the-second market trends and sentiment analysis.
2. Network Monitoring and Security: RTAP applications are crucial for monitoring network
traffic in real-time to detect anomalies, intrusions, or potential security threats. It enables
immediate responses to mitigate risks and maintain network integrity.
3. Social Media Analytics: Companies use RTAP to monitor social media platforms for
mentions, sentiment analysis of brand mentions, customer feedback, and trending topics. This
allows them to engage with customers promptly and manage their brand reputation effectively.
4. IoT (Internet of Things) Data Analysis: With the proliferation of IoT devices, RTAP is
essential for analyzing streams of sensor data in real-time. It enables predictive maintenance,
monitoring of equipment performance, and optimizing operational efficiency.
5. Online Gaming and Entertainment: Real-time analytics in gaming applications monitors
player behavior, analyzes gameplay data, and adjusts game mechanics dynamically. It enhances
player experience by personalizing gameplay and detecting cheating or fraud in real-time.
6. Healthcare Monitoring: RTAP is used in healthcare settings for real-time patient monitoring,
analyzing physiological data, and alerting healthcare providers to critical conditions. It enables
proactive intervention and improves patient outcomes.
7. Supply Chain and Logistics: RTAP applications track shipments, monitor inventory levels,
and optimize logistics operations in real-time. It helps businesses streamline their supply chain
processes, reduce costs, and improve delivery efficiency.
8. Customer Experience Management: RTAP analyzes customer interactions across multiple
channels (such as websites, mobile apps, and customer service calls) to provide personalized
recommendations, resolve issues in real-time, and optimize customer satisfaction.
9. Predictive Maintenance: RTAP applications analyze data from machinery and equipment
sensors to predict maintenance needs before failures occur. It minimizes downtime, reduces
maintenance costs, and improves asset reliability.
10. Environmental Monitoring: RTAP is used in environmental monitoring to analyze data from
sensors measuring air quality, water quality, and weather conditions in real-time. It helps
authorities and organizations respond quickly to environmental events or emergencies.
Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative
subset of data points to identify patterns and trends in the larger data set being examined. It enables data
scientists, predictive modelers and other data analysts to work with a smaller, more manageable subset
of data, rather than trying to analyze the entire data population. With a representative sample, they can
build and run analytical models more quickly, while still producing accurate findings.
Data sampling is a widely used statistical approach that can be applied to a range of use cases, such as
analyzing market trends, web traffic or political polls. For example, researchers who use data sampling
don't need to speak with every individual in the U.S. to discover the most common method of
commuting to work. Instead, they can choose a representative subset of data -- such as 1,000 or 10,000
participants -- in the hopes that this number will be sufficient to produce accurate results.
Data sampling enables data scientists and researchers to extrapolate knowledge about the broader
population from a smaller subset of data. By using a representative data sample, they can make
predictions about the larger population with a certain level of confidence, without having to collect and
analyze data from each member of the population.
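A small pandas sketch of simple random sampling, assuming a hypothetical population DataFrame of commuters; the 1,000-row sample mirrors the commuting example above, and the sample's proportions approximate those of the full population.

import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical population of 1,000,000 commuters and their commute mode.
population = pd.DataFrame({
    "commute_mode": rng.choice(["car", "bus", "train", "bike"],
                               size=1_000_000, p=[0.6, 0.2, 0.15, 0.05]),
})

sample = population.sample(n=1_000, random_state=42)          # representative random subset
print(sample["commute_mode"].value_counts(normalize=True))    # estimates the population shares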
What Is Real-Time Sentiment Analysis?
Real-time Sentiment Analysis is a machine learning (ML) technique that automatically recognizes and
extracts the sentiment in a text whenever it occurs. It is most commonly used to analyze brand and
product mentions in live social comments and posts. An important thing to note is that real-time
sentiment analysis can be done only from social media platforms that share live feeds like Twitter does.
The real-time sentiment analysis process uses several ML tasks such as natural language processing,
text analysis, semantic clustering, etc., to identify opinions expressed about brand experiences in live feeds and extract business intelligence from them. Real-time sentiment analysis is an important
artificial intelligence-driven process that is used by organizations for live market research for brand
experience and customer experience analysis purposes.
A real-time sentiment analysis platform needs to be first trained on a data set based on your industry
and needs. Once this is done, the platform performs live sentiment analysis of real-time feeds
effortlessly.
Why is Sentiment Analysis Important?
Sentiment analysis captures the contextual meaning of words to indicate the social sentiment around a brand, and it also helps the business determine whether the product it is manufacturing is going to be in demand in the market or not.
1. Sentiment analysis is required as it stores data in an efficient, cost-friendly way.
2. Sentiment analysis solves real-time issues and can help you handle real-time scenarios.
Data Collection: Data is continuously collected from various sources, such as social media platforms
(Twitter, Facebook), review sites (Amazon, Yelp), news websites, blogs, and forums.
Data Preprocessing: Text data undergoes preprocessing steps like tokenization (breaking text into
words or phrases), lowercasing, removing stop words (common words like "and", "the"), and possibly
stemming or lemmatization (reducing words to their base form).
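A minimal preprocessing sketch in plain Python showing tokenization, lowercasing, and stop-word removal as described above; the tiny stop-word set is an assumption for the example, and a real pipeline would use a full stop-word list plus stemming or lemmatization.

import re

STOP_WORDS = {"and", "the", "a", "is", "to", "of"}    # tiny illustrative list

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())     # lowercase and tokenize
    return [t for t in tokens if t not in STOP_WORDS] # drop stop words

print(preprocess("The battery life of the phone is great and the camera is OK"))
# ['battery', 'life', 'phone', 'great', 'camera', 'ok']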
Feature Extraction: Features relevant to sentiment analysis are extracted. These could include word
frequencies, n-grams (sequences of words), part-of-speech tags, or even embeddings (numerical
representations of words or documents).
• Machine Learning Models: Techniques like Support Vector Machines (SVM), Naive Bayes,
or more advanced neural networks (such as LSTM or Transformer models) are trained on
labeled data to predict sentiment.
• Lexicon-based Approaches: Using predefined sentiment lexicons (dictionaries) where words
are assigned sentiment scores, and then aggregating these scores to determine overall sentiment.
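A toy lexicon-based scorer, assuming a hand-made sentiment dictionary: each token's score is looked up and the scores are summed to decide the label. Real lexicons are far larger and handle negation and intensifiers, but the aggregation idea is the same.

LEXICON = {"great": 2, "good": 1, "ok": 0, "slow": -1, "terrible": -2}  # assumed word scores

def sentiment(tokens):
    score = sum(LEXICON.get(t, 0) for t in tokens)        # aggregate word-level scores
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment(["battery", "life", "great", "camera", "ok"]))   # positive
print(sentiment(["checkout", "was", "slow", "and", "terrible"])) # negative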
Real-Time Processing: Techniques like stream processing (using frameworks like Apache Kafka
or Apache Flink) are employed to handle the continuous flow of incoming data in real-time.
Scalability and Performance: Given the large volume of data, systems need to be scalable. This
often involves distributed computing frameworks (like Apache Spark) and efficient data storage and
retrieval mechanisms (such as NoSQL databases).
Visualization and Reporting: Results are often visualized in dashboards or reported in real-time,
providing insights into trends, spikes in sentiment, or issues that require immediate attention.
Feedback Loop: The system may incorporate a feedback loop where predictions are continuously
refined based on new data and user feedback
Types
Real-time sentiment analysis in big data can be categorized into several types based on the techniques and approaches used, such as the machine learning-based and lexicon-based approaches described above, as well as hybrid approaches that combine the two.
Advantages:
1. Immediate Feedback
2. Enhanced Customer Engagement
3. Crisis Management
4. Competitive Advantage
5. Improved Decision Making
Disadvantages:
1. Accuracy Challenges
2. Scalability Issues
3. Privacy Concerns
4. Bias and Misinterpretation
5. Costs
Stock market prediction in the context of big data analytics refers to the use of large and diverse datasets,
often derived from various sources such as historical stock prices, trading volumes, economic
indicators, news sentiment, and alternative data sources (like social media or satellite imagery), to
forecast future movements or trends in financial markets.
Key aspects of stock market prediction using big data analytics include:
1. Data Collection and Integration: Gathering and combining vast amounts of structured and
unstructured data from multiple sources to create a comprehensive dataset for analysis.
2. Data Preprocessing: Cleaning, filtering, and transforming raw data to ensure accuracy,
consistency, and relevance for predictive modeling.
3. Feature Extraction and Engineering: Identifying and creating informative features
(predictors) from the data that are likely to influence stock prices, such as technical indicators,
economic factors, sentiment scores, or derived metrics.
4. Model Development: Applying machine learning algorithms, statistical models, or advanced
analytics techniques to build predictive models. Common approaches include regression
analysis, time series forecasting, ensemble methods (like random forests or gradient boosting),
and deep learning models (such as recurrent neural networks or convolutional neural networks); a toy sketch of steps 3 to 5 follows this list.
5. Model Training and Validation: Training predictive models using historical data, optimizing
model parameters, and validating model performance using metrics like accuracy, precision,
recall, and profit-loss ratios.
6. Real-Time Processing: Implementing real-time or near-real-time analytics to continuously
update predictions as new data becomes available, enabling timely decision-making in volatile
market conditions.
7. Risk Management and Evaluation: Integrating risk management strategies to assess and
mitigate uncertainties associated with stock market predictions, including evaluating model
robustness and sensitivity to market changes.
8. Deployment and Monitoring: Deploying predictive models into operational environments,
monitoring model performance over time, and adapting models as market conditions evolve or
new data sources emerge.
9. Ethical and Regulatory Considerations: Adhering to ethical standards in data usage, ensuring
data privacy and security, and complying with regulatory requirements governing financial
markets and data management.
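The toy sketch below illustrates steps 3 to 5 on synthetic data: lagged log returns serve as features and an ordinary least-squares model predicts the next return. The generated price series, the three-lag choice, and the linear model are all assumptions for illustration; this is not a usable trading model.

import numpy as np

rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500)))    # hypothetical price series (geometric random walk)
returns = np.diff(np.log(prices))                             # log returns

lags = 3                                                      # use the last three returns as features
X = np.column_stack([returns[i:len(returns) - lags + i] for i in range(lags)])
y = returns[lags:]                                            # next-period return to predict

A = np.column_stack([np.ones(len(X)), X])                     # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)                  # ordinary least squares fit
pred = coef[0] + returns[-lags:] @ coef[1:]                   # one-step-ahead forecast
print("predicted next return:", pred)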
Applications
Stock market prediction using big data analytics finds various applications across different areas of
finance and investment. Here are some key applications:
Advantages:
1. Data-Driven Insights
2. Improved Decision-Making
3. Real-Time Analysis
4. Risk Management
5. Cost Efficiency
6. Personalized Strategies
Disadvantages:
1. Complexity
2. Data Quality Issues
3. Market Uncertainty
4. Regulatory and Ethical Concerns
5. Lack of Human Judgment
UNIT -3
HDFS
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications.
This open source framework works by rapidly transferring data between nodes. It's often used by
companies who need to handle and store big data. HDFS is a key component of many Hadoop systems,
as it provides a means for managing big data, as well as supporting big data analytics.
What is HDFS?
HDFS stands for Hadoop Distributed File System. HDFS operates as a distributed file system designed
to run on commodity hardware.
HDFS is fault-tolerant and designed to be deployed on low-cost, commodity hardware. HDFS provides high-throughput access to application data, is suitable for applications that have large data sets, and enables streaming access to file system data in Apache Hadoop.
• Manage large datasets - Organizing and storing datasets can be a hard task to handle. HDFS is used to manage applications that have to deal with huge datasets. To do this, HDFS can scale to hundreds of nodes per cluster.
• Detecting faults - HDFS should have technology in place to scan and detect faults quickly
and effectively as it includes a large number of commodity hardware. Failure of components is
a common issue.
• Hardware efficiency - When large datasets are involved, moving computation close to where the data is stored reduces network traffic and increases processing speed.
There are five core elements of big data organized by HDFS services:
HDFS components
It's important to know that there are three main components of Hadoop: Hadoop HDFS, Hadoop MapReduce, and Hadoop YARN. Let's take a look at what these components bring to Hadoop:
• Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
• Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop. This software
framework is used to write applications to process vast amounts of data.
• Hadoop YARN - Hadoop YARN is a resource management component of Hadoop. It
processes and runs data for batch, stream, interactive, and graph processing - all of which are
stored in HDFS.
Introduction to Hadoop Streaming
Hadoop Streaming uses UNIX standard streams as the interface between Hadoop and your program so
you can write MapReduce programs in any language that can write to standard output and read
standard input. Hadoop offers a lot of methods to help non-Java development.
The primary mechanisms are Hadoop Pipes which gives a native C++ interface to Hadoop and Hadoop
Streaming which permits any program that uses standard input and output to be used for map tasks and
reduce tasks.
Some of the key features associated with Hadoop Streaming are as follows :
A Hadoop Streaming architecture has eight key parts. They are:
• Input Reader/Format
• Key Value
• Mapper Stream
• Key-Value Pairs
• Reduce Stream
• Output Format
• Map External
• Reduce External
The involvement of these components will be discussed in detail when we explain the working of the
Hadoop streaming.
• Input is read from standard input and the output is emitted to standard output by Mapper
and the Reducer. The utility creates a Map/Reduce job, submits the job to an appropriate
cluster, and monitors the progress of the job until completion.
• When a script is specified for mappers, each mapper task launches the script as a separate process when the mapper is initialized. Mapper task inputs are converted into lines and fed to the script's standard input, and line-oriented outputs are collected from the script's standard output; every line is converted into a key/value pair, which is collected as the output of the mapper.
• Likewise, when a script is specified for reducers, each reducer task launches the script as a separate process when the reducer is initialized. As the reducer task runs, its input key/value pairs are converted into lines and fed to the standard input (STDIN) of the process.
• Each line of the line-oriented outputs is converted into a key/value pair after it is collected from the standard output (STDOUT) of the process, and these pairs are then collected as the output of the reducer.
• Mapper Phase Code

#!/usr/bin/python
import sys

# Read lines from standard input, split each into words,
# and emit "word<TAB>1" for every word.
for myline in sys.stdin:
    myline = myline.strip()
    words = myline.split()
    for myword in words:
        print '%s\t%s' % (myword, 1)

• Make sure this file has execution permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).
• Reducer Phase Code

#!/usr/bin/python
from operator import itemgetter
import sys

current_word = ""
current_count = 0
word = ""

# Input from the mapper arrives sorted by key, so all counts for the
# same word are adjacent and can be summed in a single pass.
for myline in sys.stdin:
    myline = myline.strip()
    word, count = myline.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue   # ignore lines whose count is not a number
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# Emit the count for the last word.
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
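With the mapper and reducer above saved as executable scripts, the streaming job is typically submitted through the hadoop-streaming jar. The jar path below assumes the Hadoop 1.2.1 layout used earlier in these notes, and the input and output HDFS paths are placeholders; if the scripts are not already present on every node, they can be shipped with the -file option.

hadoop jar /home/expert/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input /user/expert/wordcount/input \
    -output /user/expert/wordcount/output \
    -mapper /home/expert/hadoop-1.2.1/mapper.py \
    -reducer /home/expert/hadoop-1.2.1/reducer.py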
Advantages:
1. Scalability
Disadvantages:
1. Performance Overhead
2. Resource Management
3. Security Concerns
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running
on clusters of commodity hardware.
Very large files: “Very large” in this context means files that are hundreds of megabytes, gigabytes, or
terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access : HDFS is built around the idea that the most efficient data processing pattern
is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then
various analyses are performed on that dataset over time.
Commodity hardware : Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s
designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large
clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face
of such failure.
Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation. The Java
abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in Hadoop,
and there are several concrete implementations. Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide filesystem operations. By exposing its filesystem interface
as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS. The HTTP REST
API exposed by the WebHDFS protocol makes it easier for other languages to interact with HDFS.
Note that the HTTP interface is slower than the native Java client, so should be avoided for very large
data transfers if possible.
There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons serve HTTP
requests to clients and via a proxy, which accesses HDFS on the client’s behalf using the usual
DistributedFileSystem API.
In this section, we dig into the Hadoop FileSystem class: the API for interacting with one of Hadoop’s
filesystems.
A file in a Hadoop filesystem is represented by a Hadoop Path object. FileSystem is a general filesystem
API, so the first step is to retrieve an instance for the filesystem we want to use—HDFS, in this case.
There are several static factory methods for getting a FileSystem instance
The Seekable interface permits seeking to a position in the file and provides a query method for the
current offset from the start of the file (getPos()) . Calling seek() with a position that is greater than the
length of the file will result in an IOException.
FSDataOutputStream
package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
public long getPos() throws IOException {
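// implementation elided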
}
}
However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because
HDFS allows only sequential writes to an open file or appends to an already written file. In other words,
there is no support for writing to anywhere other than the end of the file, so there is no value in being
able to seek while writing.
This method creates all of the necessary parent directories if they don't already exist and returns true if it is successful.
File patterns: It is a common requirement to process sets of files in a single operation. For example, a MapReduce job for log processing might analyze a month's worth of files contained in a number of directories. Hadoop provides two FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter)
throws IOException
The globStatus() methods return an array of FileStatus objects whose paths match the supplied pattern,
sorted by path.
What is MapReduce?
MapReduce is a Java-based, distributed execution framework within the Apache Hadoop Ecosystem. It
takes away the complexity of distributed programming by exposing two processing steps that
developers implement: 1) Map and 2) Reduce. In the Mapping step, data is split between parallel
processing tasks. Transformation logic can be applied to each chunk of data. Once completed, the
Reduce phase takes over to handle aggregating data from the Map set. In general, MapReduce
uses Hadoop Distributed File System (HDFS) for both input and output.
How MapReduce Works?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
• The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
• The Reduce task takes the output from the Map as an input and combines those data tuples
(key-value pairs) into a smaller set of tuples.
Let us now take a close look at each of the phases and try to understand their significance; a small word-count sketch follows the list.
• Input Phase − Here we have a Record Reader that translates each record in an input file and
sends the parsed data to the mapper in the form of key-value pairs.
• Map − Map is a user-defined function, which takes a series of key-value pairs and processes
each one of them to generate zero or more key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate
keys.
• Combiner − A combiner is a type of local Reducer that groups similar data from the map phase
into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-
defined code to aggregate the values in a small scope of one mapper. It is not a part of the main
MapReduce algorithm; it is optional.
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the
grouped key-value pairs onto the local machine, where the Reducer is running. The individual
key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys
together so that their values can be iterated easily in the Reducer task.
• Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer
function on each one of them. Here, the data can be aggregated, filtered, and combined in a
number of ways, and it requires a wide range of processing. Once the execution is over, it gives
zero or more key-value pairs to the final step.
• Output Phase − In the output phase, we have an output formatter that translates the final key-
value pairs from the Reducer function and writes them onto a file using a record writer.
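The following small in-memory Python sketch mimics the map, shuffle/sort, and reduce phases described above using word count; a real MapReduce job distributes these steps across a cluster, but the data flow per record is the same.

from itertools import groupby
from operator import itemgetter

records = ["deer bear river", "car car river", "deer car bear"]

# Map: each record becomes a list of (word, 1) pairs.
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle and sort: order intermediate pairs by key so equal keys are adjacent.
mapped.sort(key=itemgetter(0))

# Reduce: sum the values for each key.
reduced = {key: sum(v for _, v in group) for key, group in groupby(mapped, key=itemgetter(0))}
print(reduced)   # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}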
Legacy applications and Hadoop-native tools like Sqoop and Pig still leverage MapReduce today, but there is very little new MapReduce application development, and no significant contributions are being made to it as an open source technology.
Advantages of MapReduce
1. Scalability
2. Flexibility
3. Security and authentication
4. Faster processing of data
5. Very simple programming model
6. Availability and resilient nature
Alternatives to MapReduce
1. Apache Spark
2. Apache Storm
3. Ceph
4. Hydra
5. Google BigQuery
Apache Hadoop is an open source, Java-based software framework and parallel data processing engine.
It enables big data analytics processing tasks to be broken down into smaller tasks that can be performed
in parallel by using an algorithm (like the MapReduce algorithm), and distributing them across a
Hadoop cluster. A Hadoop cluster is a collection of computers, known as nodes, that are networked
together to perform these kinds of parallel computations on big data sets. Unlike other computer
clusters, Hadoop clusters are designed specifically to store and analyze mass amounts of structured and
unstructured data in a distributed computing environment.
A Hadoop cluster is made up of master nodes, worker nodes, and client nodes; the client nodes are responsible for loading the data and fetching the results.
• Master nodes are responsible for storing data in HDFS and overseeing key operations, such as
running parallel computations on the data using MapReduce.
• The worker nodes comprise most of the virtual machines in a Hadoop cluster, and perform the
job of storing the data and running computations. Each worker node runs the DataNode and
TaskTracker services, which are used to receive the instructions from the master nodes.
• Client nodes are in charge of loading the data into the cluster. Client nodes first submit
MapReduce jobs describing how data needs to be processed and then fetch the results once the
processing is finished.
A Hadoop cluster size is a set of metrics that defines storage and compute capabilities to run Hadoop
workloads, namely :
• Number of nodes : number of Master nodes, number of Edge Nodes, number of Worker Nodes.
• Configuration of each type of node: number of cores per node, RAM, and disk volume.
Types of Hadoop clusters
1. Single Node Hadoop Cluster: As the name suggests, the cluster consists of only a single node, which means all of the Hadoop daemons run on the same machine.
2. Multiple Node Hadoop Cluster: As the name suggests, the cluster contains multiple nodes. In this kind of cluster setup, the Hadoop daemons run on different nodes within the same cluster.
Advantages of a Hadoop Cluster
• Hadoop clusters can boost the processing speed of many big data analytics jobs, given their
ability to break down large computational tasks into smaller tasks that can be run in a parallel,
distributed fashion.
• Hadoop clusters are easily scalable and can quickly add nodes to increase throughput, and
maintain processing speed, when faced with increasing data blocks.
• The use of low cost, high availability commodity hardware makes Hadoop clusters relatively
easy and inexpensive to set up and maintain.
• Hadoop clusters replicate a data set across the distributed file system, making them resilient to
data loss and cluster failure.
• Hadoop clusters make it possible to integrate and leverage data from multiple different source
systems and data formats.
• It is possible to deploy Hadoop using a single-node installation, for evaluation purposes.
Challenges:
Hadoop, an open-source platform, has transformed how businesses handle huge amounts of data.
Because of its ability to store and process massive volumes of data across distributed computer clusters,
it has become a popular choice for enterprises looking to use the potential of big data analytics.
Introduction
As data breaches and cyber threats have increased, guaranteeing the security of Hadoop clusters has
become a major problem. Hadoop security refers to the methods and practices implemented to safeguard
sensitive data and prevent unauthorized access or misuse inside a Hadoop system.
The need for Hadoop security derives from the increased vulnerability of data to hostile attacks. Hadoop
is used by businesses to store and handle vast volumes of data, including sensitive information like
customer records, financial data, and intellectual property. This valuable data becomes a tempting target
for fraudsters if sufficient security measures are not implemented. Furthermore, data privacy laws, such
as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA),
require businesses to protect their consumers' information. Failure to follow these regulations can result
in serious financial penalties and reputational harm.
3 A's of security
Organizations must prioritize the three core security principles known as the 3 A's (Authentication, Authorization, and Auditing) to manage security concerns in a Hadoop environment effectively.
1. Authentication:
The authentication process ensures that only authorized users can access the Hadoop cluster. It entails
authenticating users' identities using various mechanisms such as usernames and passwords, digital
certificates, or biometric authentication.
2. Authorization:
Authorization governs the actions an authenticated user can take within the Hadoop cluster. It entails
creating access restrictions and permissions depending on the roles and responsibilities of the users.
3. Auditing:
Auditing is essential for monitoring and tracking user activity in the Hadoop cluster. Organizations can
investigate suspicious or unauthorized activity by keeping detailed audit logs.
Hadoop is a powerful and effective method for managing and analyzing data. Data security, on the other
hand, is critical in any big data ecosystem. Hadoop recognizes this critical element and offers several
measures to assure data security throughout its distributed infrastructure.
Authentication and authorization are two of the key ways in which Hadoop ensures security. Hadoop
has strong authentication procedures to verify user identities and prevent unauthorized data access. It
supports various authentication protocols, including Kerberos, LDAP, and SSL, to ensure safe access
to Hadoop clusters. Furthermore, Hadoop uses role-based access control (RBAC) to design and enforce
access permissions, allowing administrators to give or restrict capabilities based on user roles and
responsibilities.
Hadoop also includes auditing and logging features to track and monitor user activity. It records
important events like file access, user authentication, and administrative tasks, allowing
administrators to detect suspicious or unauthorized behavior. These records can be used for forensic analysis, compliance reporting, and troubleshooting.
Different Tools for Hadoop Security
Various effective methods are available to strengthen Hadoop security and maintain data
confidentiality, integrity, and availability.
Apache Ranger
Apache Ranger is a Hadoop security framework that allows administrators to design fine-grained access
control policies. Ranger's centralized policy management enables organizations to restrict user access,
monitor activity, and uniformly implement security policies across the Hadoop ecosystem.
Apache Knox
Apache Knox serves as a security gateway for Hadoop systems. It acts as a single entry point for users
and authenticates them before allowing access to the cluster's services. Knox enhances Hadoop's overall
security posture by providing perimeter security, encryption, and integration with external
authentication systems.
Apache Sentry
Apache Sentry is a Hadoop-based role-based access control (RBAC) solution. It lets administrators
create and enforce granular access privileges, ensuring that users only have access to the data required
for their specific activities. Sentry's permission procedures improve data security and reduce the danger
of unauthorized access.
Cloudera Navigator
Cloudera Navigator provides full Hadoop data governance and security features. It includes data
discovery, metadata management, and lineage tracing, allowing organizations to monitor and audit data
access, enforce compliance requirements, and quickly detect any questionable activity.
UNIT – 4
Link analysis is an analysis technique that focuses on relationships and connections in a dataset. Link
analysis gives you the ability to calculate centrality measures—namely degree, betweenness, closeness,
and eigenvector—and see the connections on a link chart or link map.
Examples
A crime analyst is investigating a criminal network. Data from cell phone records can be used to
determine the relationship and hierarchy between members of the network.
A credit card company is developing a new system to detect credit card theft. The system uses the known
patterns of transactions for each client, such as the city, stores, and types of transactions, to identify
anomalies and alert the client of a potential theft.
The following terms describe the basic elements of link analysis:

Network
Description: A set of interconnected nodes and links.
Examples: An online social network that uses a network of profiles and relationships to connect users. Airline networks that use a network of airports and flights to transport travelers from their origin to their destination.

Node
Description: A point or vertex that represents an object, such as a person, place, crime type, or tweet. The node may also include associated properties.
Examples: The profiles in a social network; associated properties may include the user's name, hometown, or employer. The airports in an airline network; associated properties may include the airport name.

Link
Description: The relationships or connections between nodes. The link may also include associated properties.
Examples: The relationship between profiles in the network, such as friend, follower, or connection; associated properties may include the length of the relationship. The flights between airports in an airline network; associated properties may include the number of flights between airports.
Centrality
Degree centrality
Degree centrality is based on the number of direct connections a node has. Degree centrality should be
used when you want to determine which nodes have the most direct influence.
degCentrality(x)=deg(x)/(NodesTotal-1)
Betweenness centrality
Betweenness centrality is based on the extent a node is part of the shortest path between other nodes.
Betweenness centrality should be used when you want to determine which nodes are used to connect
other nodes to each other.
btwCentrality(x) = Σ_{a,b ∈ Nodes} ( paths_{a,b}(x) / paths_{a,b} )
where paths_{a,b}(x) is the number of shortest paths between nodes a and b that pass through node x, and paths_{a,b} is the total number of shortest paths between nodes a and b.
Closeness centrality
Closeness centrality is based on the average of the shortest network path distance between nodes.
Closeness centrality should be used when you want to determine which nodes are most closely
associated to the other nodes in the network
closeCentrality(x) = (nodes(x,y)/(NodesTotal-1)) * (nodes(x,y)/dist(x,y)Total)
where nodes(x,y) is the number of nodes connected to node x and dist(x,y)Total is the sum of the shortest-path distances from node x to those connected nodes.
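A small Python sketch that applies the degree centrality formula above to a toy adjacency list; the five-node undirected graph is an assumption for the example.

# Hypothetical undirected graph as an adjacency list.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B", "E"},
    "E": {"D"},
}

n = len(graph)
deg_centrality = {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}
print(deg_centrality)   # B has the highest score: 3 / 4 = 0.75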
Page Rank:
PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine results.
PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of
measuring the importance of website pages. According to Google:
PageRank works by counting the number and quality of links to a page to determine a rough estimate
of how important the website is. The underlying assumption is that more important websites are likely
to receive more links from other websites.
Algorithm
The PageRank algorithm outputs a probability distribution used to represent the likelihood that a
person randomly clicking on links will arrive at any particular page. PageRank can be calculated for
collections of documents of any size.
PageRank effectively works in big data environments by leveraging distributed computing frameworks
to handle the scalability, fault tolerance, and computational complexity associated with large-scale web
graph processing. This enables applications such as search engines to compute and update web page
rankings efficiently across massive datasets.
Graph Representation: Websites and web pages are represented as nodes in a graph, and hyperlinks
between them are represented as edges. This graph can grow to billions of nodes and edges, making it
a massive dataset.
Scalability: PageRank computations can be distributed across multiple machines in a cluster using
parallel processing frameworks like MapReduce, Apache Spark, or other distributed computing
paradigms. This allows handling large-scale graphs efficiently.
Distributed Computation:
• Mapping Phase: Each node in the graph computes its contributions to other nodes it links to
(outgoing links). This can be parallelized across the cluster, where each node (page) computes
its contribution independently.
• Shuffling and Sorting: Intermediate results (contribution values) are shuffled and sorted based
on the destination nodes they contribute to. This prepares the data for the next phase.
• Reducing Phase: Each node aggregates the contributions it receives from incoming links
(nodes). The PageRank score is then updated using a formula that includes damping factors and
the accumulated contributions. This process iterates until convergence.
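A minimal single-machine sketch of the iterative PageRank update (power iteration with an assumed damping factor of 0.85 and a made-up four-page link graph). In a big data setting each iteration would be expressed as the map, shuffle, and reduce steps above, but the per-iteration arithmetic is the same.

# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
d, n = 0.85, len(pages)

rank = {p: 1.0 / n for p in pages}                          # start from a uniform distribution
for _ in range(50):                                         # iterate until (approximate) convergence
    contrib = {p: 0.0 for p in pages}
    for page, outgoing in links.items():
        for target in outgoing:                             # "map": spread rank along outgoing links
            contrib[target] += rank[page] / len(outgoing)
    rank = {p: (1 - d) / n + d * contrib[p] for p in pages} # "reduce": damped accumulation

print(rank)   # C collects the most inbound rank in this toy graph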
Fault Tolerance: Distributed computing frameworks like MapReduce provide fault tolerance
mechanisms. If a node or task fails during computation, it can be automatically restarted or reassigned
to another node in the cluster, ensuring robustness in large-scale computations.
Iterative Nature: PageRank computation is iterative, meaning it refines the PageRank values
through multiple iterations until they stabilize (convergence). Each iteration involves updating
PageRank scores based on the contributions from neighboring nodes, ensuring the ranking reflects the
overall structure of the web graph.
Performance Optimization: Techniques such as graph partitioning, efficient data shuffling, and
minimizing communication overhead between nodes are employed to optimize the performance of
PageRank computation in distributed environments.
Integration with Big Data Ecosystem: PageRank algorithms can be integrated with other
components of the big data ecosystem, such as distributed file systems
Components:
1. Graph Representation
2. Data Storage
3. MapReduce or Similar Framework
4. Iteration and Convergence
5. Integration with Ecosystem
Advantages:
Disadvantages
1. Vulnerability to manipulation
2. Emphasis on old pages
3. Lack of user context
4. Limited in dealing with spam and low-quality content
Efficient computation of PageRank in the context of big data involves several strategies to handle the
large volumes of data and complex computations efficiently. Here are some key approaches:
1. Graph Partitioning: Split the web graph into smaller partitions that can be processed
independently. This allows parallel computation of PageRank on each partition, reducing the
overall computation time. Techniques like the METIS algorithm can be used for effective
partitioning.
2. Iterative Computation: Use iterative algorithms such as the Power Method or the more
advanced algorithms like Gauss-Seidel or Jacobi methods. These algorithms are designed to
converge to the PageRank vector efficiently over multiple iterations.
3. Distributed Computing: Utilize distributed computing frameworks such as Apache Hadoop
or Apache Spark. These frameworks enable parallel processing across a cluster of computers,
distributing the computation of PageRank across multiple nodes to handle big data scale.
4. Data Compression: Apply techniques like sparse matrix representations and compression
algorithms to reduce the memory footprint and computational overhead, especially when
dealing with large, sparse matrices representing the web graph.
5. Optimized Data Structures: Use efficient data structures such as hash tables or inverted
indices to store and access the graph structure and PageRank values quickly during
computation.
6. Incremental Updates: Implement incremental algorithms that update PageRank values only
for nodes that have changed, rather than recalculating from scratch. This is particularly useful
in dynamic graphs where updates are frequent.
7. Preprocessing: Preprocess the graph data to remove redundancy, identify dangling nodes
(nodes with no outgoing links), and handle other structural optimizations that can simplify the
PageRank computation.
8. Algorithmic Improvements: Explore advanced algorithms or modifications to the original
PageRank algorithm that can improve convergence speed or reduce computational complexity,
while still maintaining accurate results.
9. Scalable Infrastructure: Ensure the computing infrastructure (hardware and software) is
scalable and can handle the volume and velocity of data required for computing PageRank in
real-time or near real-time.
10. Caching and Memory Management: Utilize caching mechanisms to store intermediate results
and optimize memory management to handle large data sets efficiently.
Spam links are also commonly referred to as "toxic backlinks," because any link that is posted is supposed to be relevant to the content it appears in. Spam links indirectly push users to visit the site in question even though the promoted link or site does not match the content at all.
1. Hidden Link
Hidden links are the first type of spam link. Hidden links are a technique where links embedded in content cannot be seen by users. These links are typically hidden behind images, given the same color as the background, or buried in the site's code so that they are completely invisible to users.
2. Nofollow Link
Some nofollow links on a site can be detected as spam links. Nofollow links are a tactic claimed to raise the profile of these backlinks and trick Google's spam detectors.
3. Spam Posting
Post spam is an act in which someone consistently shares a link in public forums, comments, or other
inappropriate places. This technique is widely used by link spammers because it is considered the easiest
way to do link spamming.
4.Link Farms:Link farms are a type of link spamming that involves site owner cooperation. Site owners
who engage in link farming will continuously link to each other for the sole purpose of building
backlinks.
5.Directory Spam:Directories can be a double-edged sword when looking to improve your SERP rank.
When dealing with local SEO, registering your business across different authoritative directories can
cause serious improvements in your search rank.
What is a Recommendation System?
A recommendation system is an artificial intelligence or AI algorithm, usually associated with machine
learning, that uses Big Data to suggest or recommend additional products to consumers. These can be
based on various criteria, including past purchases, search history, demographic information, and other
factors. Recommender systems are highly useful as they help users discover products and services they
might otherwise have not found on their own.
Recommender systems are trained to understand the preferences, previous decisions, and characteristics
of people and products using data gathered about their interactions. These include impressions, clicks,
likes, and purchases.
1. Collaborative Filtering
• Description: Recommends items based on the preferences of similar users (or similar items), inferred from past user-item interactions such as ratings, clicks, and purchases.
2. Content-Based Filtering
• Description: Recommends items similar to those a user has liked, based on item attributes or
content.
• Approach: Extract features from item descriptions (e.g., text, metadata) and recommend items
with similar features.
• Advantages: Handles new items well; can recommend items based on user preferences.
• Challenges: Limited diversity in recommendations; requires quality metadata.
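A minimal content-based sketch: items are described by attribute vectors, the user profile is the average of the liked items' vectors, and cosine similarity ranks the remaining items. The catalogue, attributes, and liked items are assumptions for the example.

import numpy as np

# Hypothetical catalogue: item -> attribute vector [action, comedy, sci-fi, romance]
items = {
    "Movie1": np.array([1, 0, 1, 0]),
    "Movie2": np.array([1, 0, 0, 0]),
    "Movie3": np.array([0, 1, 0, 1]),
    "Movie4": np.array([0, 0, 1, 0]),
}
liked = ["Movie1", "Movie2"]

profile = np.mean([items[i] for i in liked], axis=0)          # user profile from liked items

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = {i: cosine(profile, v) for i, v in items.items() if i not in liked}
print(max(scores, key=scores.get))   # Movie4 (sci-fi) is closest to this profile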
Advantages:
1. Enhanced Comparison
2. Improved Methodology
3. Efficient Data Sharing
4. Sales Analysis
5. Identifying Event Relations
Disadvantages:
a. Can be time-consuming
b. Can be misleading
c. Can be difficult to interpret
d. May not be suitable for all types of data
• Scalability: Techniques should scale with increasing data volumes and complexity.
• Interactivity: Tools should provide responsive and intuitive interfaces for exploring and
analyzing data.
• Integration: Techniques should integrate seamlessly with existing data pipelines and analytics
workflows.
• Security: Ensure data privacy and compliance with regulations when interacting with sensitive
data.
UNIT – 5
Pandas Introduction
Pandas is a powerful and open-source Python library. The Pandas library is used for data manipulation and analysis. Pandas consists of data structures and functions to perform efficient operations on data.
import pandas as pd
import numpy as np
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print("Pandas Series:\n", ser)
Pandas DataFrame
Pandas DataFrame is a two-dimensional data structure with labeled axes (rows and columns).
import pandas as pd
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
df = pd.DataFrame(lst)   # calling the DataFrame constructor on the list
print(df)