Big Data Notes

UNIT – 2

What is Streaming Data?

Streaming data refers to data that is continuously generated, usually in high volumes and at high
velocity. A streaming data source would typically consist of continuous timestamped logs that record
events as they happen – such as a user clicking on a link in a web page, or a sensor reporting the current
temperature.

Data streaming refers to the practice of sending, receiving, and processing information as a continuous stream rather than in discrete batches. A typical streaming data pipeline involves the following stages:

1. Data Production

2. Data Ingestion

3. Data Processing

4. Streaming Data Analytics

5. Data Reporting

6. Data Visualization & Decision Making

What is Streaming Data Architecture?


A streaming data architecture is a framework of software components built to ingest, process, and
analyze data streams – typically in real time or near-real time. Rather than writing and reading data in
batches, a streaming data architecture consumes data immediately as it is generated, persists it to
storage, and may include various additional components per use case – such as tools for real-time
processing, data manipulation, and analytics.

Streaming architectures must account for the unique characteristics of data streams, which tend to
generate massive amounts of data (terabytes to petabytes) that is at best semi-structured and requires
significant pre-processing and ETL to become useful.

COMPONENTS

Here are some key concepts and components related to streaming data in big data:

1. Data Sources: Streaming data originates from diverse sources including IoT devices, website
clickstreams, social media feeds, server logs, financial transactions, etc. These sources
continuously generate data that needs to be processed and analyzed in real-time.
2. Streaming Platforms: These are the frameworks or platforms that facilitate the ingestion,
processing, and analysis of streaming data. Examples include Apache Kafka, Apache Flink,
Apache Storm, and Amazon Kinesis. These platforms provide capabilities such as fault
tolerance, scalability, and support for various data processing tasks.
3. Data Ingestion: This involves collecting data from various sources and feeding it into the
streaming platform for processing. Ingestion mechanisms must handle large volumes of data
with low latency to ensure that data is processed in near real-time.
4. Processing: Streaming data processing involves performing operations on data streams as they
are received. This includes filtering, aggregating, joining with other data streams, and applying
machine learning models for real-time predictions or anomaly detection.
5. Windowing and Time Handling: Many streaming systems use windowing techniques to
divide streams into finite segments for processing. This allows computations such as
aggregations to be performed over specific time intervals or event counts. Handling time
accurately is crucial for tasks like event ordering and time-based aggregations (a minimal
windowing sketch appears after this list).
6. Integration with Batch Processing: Often, streaming data systems need to integrate with batch
processing frameworks like Apache Hadoop or Spark for more complex analytics or to store
processed data in data lakes or warehouses.
7. Scalability and Fault Tolerance: Streaming systems must be scalable to handle increasing
data volumes and fault-tolerant to ensure continuous operation despite hardware failures or
network issues.
8. Data Storage: Depending on the use case, streaming data may be stored temporarily (e.g., in-
memory databases or caches) or persistently (e.g., in data warehouses or data lakes) after
processing.
9. Use Cases: Streaming data is utilized in various applications such as real-time analytics, fraud
detection, monitoring and alerting, recommendation systems, and IoT data processing.
10. Challenges: Managing the velocity and volume of streaming data, ensuring data quality and
consistency, maintaining low latency, and managing the complexity of distributed systems are
some of the challenges associated with streaming data in big data environments.
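
To make the windowing idea in item 5 concrete, here is a minimal sketch of a tumbling (fixed-length, non-overlapping) window aggregation in plain Python. The event format, the 60-second window length, and the assumption that events arrive roughly in time order are illustrative choices, not features of any particular streaming platform.

from collections import defaultdict

WINDOW_SECONDS = 60   # assumed window length

def tumbling_window_counts(events):
    # Count events per key in fixed 60-second windows.
    # `events` is an iterable of (timestamp_seconds, key) pairs,
    # assumed to arrive roughly in time order.
    counts = defaultdict(int)
    current_window = None
    for ts, key in events:
        window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        if current_window is None:
            current_window = window_start
        if window_start != current_window:
            # Window closed: emit its aggregate, then start a new one.
            yield current_window, dict(counts)
            counts = defaultdict(int)
            current_window = window_start
        counts[key] += 1
    if counts:
        yield current_window, dict(counts)

# Example: click events as (timestamp, page) pairs.
stream = [(0, "home"), (10, "cart"), (30, "home"), (70, "home"), (95, "cart")]
for window, agg in tumbling_window_counts(stream):
    print(window, agg)   # 0 {'home': 2, 'cart': 1}  then  60 {'home': 1, 'cart': 1}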

stream processing vs batch processing


In batch data processing, data is downloaded in batches before being processed, stored, and analyzed.
On the other hand, stream data ingests data continuously, allowing it to be processed simultaneously
and in real-time.
Stream Computing: Stream computing in the context of big data refers to the real-time processing
of data streams to extract meaningful insights or take immediate actions as the data is generated. It
involves continuously ingesting, processing, and analyzing data as it flows from various sources.

• Stream computing is a computing paradigm that reads data from collections of software or
hardware sensors in stream form and computes continuous data streams.
• Stream computing uses software programs that compute continuous data streams.
• Stream computing uses software algorithm that analyzes the data in real time.
• Stream computing is one effective way to support Big Data by providing extremely low-latency
velocities with massively parallel processing architectures.
• It is becoming the fastest and most efficient way to obtain useful knowledge from Big Data,
allowing organizations to react quickly when problems appear or to predict new trends in the
near future.
Application Background
Big data stream computing is able to analyze and process data in real time to gain immediate insight,
and it is typically applied to the analysis of vast amounts of data in real time and at high speed. Many
application scenarios require big data stream computing. For example, in the financial industry, big
data stream computing technologies can be used in risk management, marketing management, business
intelligence, and so on. On the Internet, they can be used in search engines, social networking, and so
on. In the Internet of Things, they can be used in intelligent transportation, environmental monitoring,
and so on.
System Architecture for Stream Computing
In big data stream computing environments, stream computing follows a straight-through computing
model. The input arrives as a real-time data stream, all continuous data streams are computed in real
time, and the results must also be updated in real time. The volume of data is so high that there is not
enough space to store it all, and not all data needs to be stored. Most data will be discarded, and only a
small portion will be permanently stored on hard disks.

Applications:
Financial Services
Telecommunications
Healthcare
IoT
Retail and E-commerce
Media and Entertainment
Transportation and Logistics
Energy and Utilities
Advantages:
Real-Time Insights
Low Latency
Scalability
Efficient Resource Utilization
Continuous Processing
Adaptive and Dynamic
Disadvantages:
Complexity
Data Consistency
Cost
Fault Tolerance

Streaming Algorithms — II — Counting Distinct Elements

Counting distinct elements is a problem that frequently arises in distributed systems. In general, the size
of the set under consideration (which we will henceforth call the universe) is enormous. For example, if
we build a system to identify denial-of-service attacks, the set could consist of all IPv4 and IPv6 addresses.
Another common use case is to count the number of unique visitors on popular websites like Twitter or
Facebook. An obvious approach, if the number of elements is not very large, would be to maintain a set:
when a new element arrives, we check whether the set already contains it and, if not, we add it. The size
of the set then gives the number of distinct elements. However, if the number of elements is vast or we
are maintaining counts for multiple streams, it would be infeasible to keep the set in memory. Storing the
data on disk would be an option if we are only interested in offline computation using batch processing
frameworks like MapReduce.
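
The exact, set-based approach described above can be sketched in a few lines of Python; it is only a baseline, practical when the distinct elements fit comfortably in memory.

def count_distinct_exact(stream):
    # Exact distinct count: keep every element seen so far in a set.
    seen = set()
    for element in stream:
        seen.add(element)   # no-op if the element was already present
    return len(seen)

print(count_distinct_exact(["10.0.0.1", "10.0.0.2", "10.0.0.1"]))   # 2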


Flajolet — Martin Algorithm

The first algorithm for counting distinct elements is the Flajolet-Martin algorithm, named after its
creators. The Flajolet-Martin algorithm is a single-pass algorithm. If there are m distinct elements in a
universe comprising n elements, the algorithm runs in O(n) time and O(log m) space.

First, we pick a hash function h that takes stream elements as input and outputs a bit string. The length
of the bit strings is large enough that the range of the hash function is much larger than the size of the
universe; we require at least log n bits if there are n elements in the universe.

r(a) denotes the number of trailing zeros in the binary representation of h(a) for an element a in the
stream.

Flajolet-Martin Pseudocode and Explanation

L = 64 (size of the bitset), B = bitset of size L
hash_func(x) = (a*x + b) mod 2^L

for each item x in stream:
    y = hash_func(x)
    r = get_rightmost_set_bit(y)    # i.e. the number of trailing zeros in y
    set_bit(B, r)

R = get_rightmost_unset_bit(B)
return 2^R
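
A minimal runnable version of this pseudocode in Python is sketched below. The hash constants are arbitrary illustrative values, and a practical deployment would combine several hash functions and average (or take the median of) their estimates to reduce variance.

L = 64                                   # size of the bitset
A = 0x9E3779B97F4A7C15                   # illustrative odd multiplier for the hash
B_ADD = 0x2545F4914F6CDD1D               # illustrative additive constant

def fm_hash(x):
    # Map an item to an L-bit integer with a simple (a*x + b) mod 2^L scheme.
    return (A * hash(x) + B_ADD) % (2 ** L)

def trailing_zeros(y):
    # Position of the rightmost set bit (number of trailing zeros); L if y == 0.
    if y == 0:
        return L
    r = 0
    while y & 1 == 0:
        y >>= 1
        r += 1
    return r

def flajolet_martin(stream):
    bitset = 0
    for item in stream:
        bitset |= 1 << trailing_zeros(fm_hash(item))   # set_bit(B, r)
    R = 0
    while bitset & (1 << R):                           # rightmost unset bit of the bitset
        R += 1
    return 2 ** R

stream = ["user%d" % (i % 1000) for i in range(100000)]   # about 1,000 distinct items
print(flajolet_martin(stream))   # rough estimate, typically within a small factor of 1,000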

Estimating Moments:
• Estimating moments is a generalization of the problem of counting distinct elements in a stream.
The problem, called computing "moments," involves the distribution of frequencies of the
different elements in the stream.
• Suppose a stream consists of elements chosen from a universal set. Assume the universal set is
ordered, so we can speak of the i-th element for any i.
• Let m_i be the number of occurrences of the i-th element for any i. Then the k-th order moment
of the stream is the sum over all i of (m_i)^k.

Estimating moments, such as mean, variance, skewness, and kurtosis, involves calculating statistical
properties that describe the distribution of data. Here’s a step-by-step outline of how these moments are
typically estimated:

1. Mean (First Moment)

• Definition: The mean of a dataset X = {x_1, x_2, ..., x_n} is calculated as:

μ = (1/n) * Σ_{i=1}^{n} x_i

where n is the number of data points.

2. Variance (Second Central Moment)

• Definition: The variance measures the spread of data points around the mean and is defined as:

σ^2 = (1/n) * Σ_{i=1}^{n} (x_i − μ)^2

3. Skewness (Third Central Moment)

• Definition: Skewness measures the asymmetry of the distribution around the mean:

Skewness = [ (1/n) Σ_{i=1}^{n} (x_i − μ)^3 ] / [ (1/n) Σ_{i=1}^{n} (x_i − μ)^2 ]^(3/2)

4. Kurtosis (Fourth Central Moment)

• Definition: Kurtosis measures the "tailedness," or the heaviness of the tails, of the distribution:

Kurtosis = [ (1/n) Σ_{i=1}^{n} (x_i − μ)^4 ] / [ (1/n) Σ_{i=1}^{n} (x_i − μ)^2 ]^2 − 3
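
These four moments can be estimated in a single pass over a stream by accumulating running power sums, as in the plain-Python sketch below; the input list stands in for a data stream and is purely illustrative.

def streaming_moments(stream):
    # One-pass estimates of mean, variance, skewness and kurtosis.
    # Accumulates n and the power sums S1..S4, then converts them into
    # central moments at the end.
    n = 0
    s1 = s2 = s3 = s4 = 0.0
    for x in stream:
        n += 1
        s1 += x
        s2 += x * x
        s3 += x ** 3
        s4 += x ** 4
    mean = s1 / n
    m2 = s2 / n - mean ** 2                                     # variance
    m3 = s3 / n - 3 * mean * s2 / n + 2 * mean ** 3             # third central moment
    m4 = (s4 / n - 4 * mean * s3 / n + 6 * mean ** 2 * s2 / n
          - 3 * mean ** 4)                                      # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3
    return mean, m2, skewness, kurtosis

print(streaming_moments([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))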

Advantages:

Computational Efficiency

Scalability

Real-Time Analysis

Memory Efficiency

Disadvantages:

Accuracy Trade-offs
Bias and Error

Complexity in Implementation

Assumption Sensitivity

Sampling Issues

Real-Time Analytics in Big Data:

Real-time analytics in big data is the process of analyzing data as soon as it is generated, allowing
for immediate insights and actions. This means that data is collected, processed, and analyzed
instantly, enabling users to make quick decisions based on the most current information available.
Real-time data analytics lets users see, examine, and understand data as it enters a system. Logic and
mathematics are applied to the data so that it can give users the insight needed for real-time decisions.
Real-time analytics permits businesses to gain awareness of and take action on data immediately or soon
after the data enters their system. Real-time analytics applications respond to queries within seconds;
they handle large amounts of data with high velocity and low response time.

The functioning of real-time analytics:
Real-time data analytics tools can either push or pull data. Streaming requires the capacity to handle
large amounts of fast-moving data. When streaming consumes too many resources and isn't practical,
data can be pulled at intervals that range from seconds to hours. The pull can be scheduled around
business needs, which requires budgeting compute resources so as not to disrupt operations. The
response times for real-time analytics can range from nearly immediate to a few seconds or minutes.
The components of real-time data analytics include the following:
• Aggregator
• Broker
• Analytics engine
• Stream processor

Here are some common applications and uses of real-time analytics processing (RTAP):

1. Financial Trading: RTAP is widely used in financial markets to analyze market data, detect
trading opportunities, and execute trades within milliseconds. It helps traders make informed
decisions based on up-to-the-second market trends and sentiment analysis.
2. Network Monitoring and Security: RTAP applications are crucial for monitoring network
traffic in real-time to detect anomalies, intrusions, or potential security threats. It enables
immediate responses to mitigate risks and maintain network integrity.
3. Social Media Analytics: Companies use RTAP to monitor social media platforms for
mentions, sentiment analysis of brand mentions, customer feedback, and trending topics. This
allows them to engage with customers promptly and manage their brand reputation effectively.
4. IoT (Internet of Things) Data Analysis: With the proliferation of IoT devices, RTAP is
essential for analyzing streams of sensor data in real-time. It enables predictive maintenance,
monitoring of equipment performance, and optimizing operational efficiency.
5. Online Gaming and Entertainment: Real-time analytics in gaming applications monitors
player behavior, analyzes gameplay data, and adjusts game mechanics dynamically. It enhances
player experience by personalizing gameplay and detecting cheating or fraud in real-time.
6. Healthcare Monitoring: RTAP is used in healthcare settings for real-time patient monitoring,
analyzing physiological data, and alerting healthcare providers to critical conditions. It enables
proactive intervention and improves patient outcomes.
7. Supply Chain and Logistics: RTAP applications track shipments, monitor inventory levels,
and optimize logistics operations in real-time. It helps businesses streamline their supply chain
processes, reduce costs, and improve delivery efficiency.
8. Customer Experience Management: RTAP analyzes customer interactions across multiple
channels (such as websites, mobile apps, and customer service calls) to provide personalized
recommendations, resolve issues in real-time, and optimize customer satisfaction.
9. Predictive Maintenance: RTAP applications analyze data from machinery and equipment
sensors to predict maintenance needs before failures occur. It minimizes downtime, reduces
maintenance costs, and improves asset reliability.
10. Environmental Monitoring: RTAP is used in environmental monitoring to analyze data from
sensors measuring air quality, water quality, and weather conditions in real-time. It helps
authorities and organizations respond quickly to environmental events or emergencies.

Advantages of Real-time analytics


Real-time analytics offers the following advantages over traditional analytics as follows.
• Create custom interactive analytics tools.
• Share information through transparent dashboards.
• Customize monitoring of behavior.
• Make immediate changes when needed.
• Apply machine learning.

What is data sampling?

Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative
subset of data points to identify patterns and trends in the larger data set being examined. It enables data
scientists, predictive modelers and other data analysts to work with a smaller, more manageable subset
of data, rather than trying to analyze the entire data population. With a representative sample, they can
build and run analytical models more quickly, while still producing accurate findings.

Data sampling is a widely used statistical approach that can be applied to a range of use cases, such as
analyzing market trends, web traffic or political polls. For example, researchers who use data sampling
don't need to speak with every individual in the U.S. to discover the most common method of
commuting to work. Instead, they can choose a representative subset of data -- such as 1,000 or 10,000
participants -- in the hopes that this number will be sufficient to produce accurate results.

Data sampling enables data scientists and researchers to extrapolate knowledge about the broader
population from a smaller subset of data. By using a representative data sample, they can make
predictions about the larger population with a certain level of confidence, without having to collect and
analyze data from each member of the population.
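
As a small illustration, the sketch below draws a simple random sample from a synthetic population and compares the sample mean with the population mean; the data is made up solely to show the mechanics.

import random
import statistics

random.seed(42)

# Synthetic "population": commute times (minutes) for 1,000,000 people.
population = [random.gauss(30, 10) for _ in range(1000000)]

# A simple random sample of 10,000 participants.
sample = random.sample(population, 10000)

print("population mean:", round(statistics.mean(population), 2))
print("sample mean:    ", round(statistics.mean(sample), 2))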
What Is Real-Time Sentiment Analysis?

Real-time Sentiment Analysis is a machine learning (ML) technique that automatically recognizes and
extracts the sentiment in a text whenever it occurs. It is most commonly used to analyze brand and
product mentions in live social comments and posts. An important thing to note is that real-time
sentiment analysis can be done only from social media platforms that share live feeds like Twitter does.

The real-time sentiment analysis process uses several ML tasks such as natural language processing,
text analysis, and semantic clustering to identify opinions expressed about brand experiences in live
feeds and extract business intelligence from them. Real-time sentiment analysis is an important
artificial intelligence-driven process used by organizations for live market research, brand experience,
and customer experience analysis.

A real-time sentiment analysis platform first needs to be trained on a data set based on your industry
and needs. Once this is done, the platform performs live sentiment analysis of real-time feeds
effortlessly.
Why is Sentiment Analysis Important?
Sentiment analysis captures the contextual meaning of words that indicates the social sentiment around
a brand, and it helps a business determine whether the product it is manufacturing will find demand in
the market.
1. Sentiment analysis is required as it stores and summarizes data in an efficient, cost-friendly way.
2. Sentiment analysis addresses real-time issues and can help you handle real-time scenarios.

Data Collection: Data is continuously collected from various sources, such as social media platforms
(Twitter, Facebook), review sites (Amazon, Yelp), news websites, blogs, and forums.

Data Preprocessing: Text data undergoes preprocessing steps like tokenization (breaking text into
words or phrases), lowercasing, removing stop words (common words like "and", "the"), and possibly
stemming or lemmatization (reducing words to their base form).

Feature Extraction: Features relevant to sentiment analysis are extracted. These could include word
frequencies, n-grams (sequences of words), part-of-speech tags, or even embeddings (numerical
representations of words or documents).

Sentiment Classification: Machine learning models or lexicon-based approaches are applied to


classify the sentiment of each piece of text:

• Machine Learning Models: Techniques like Support Vector Machines (SVM), Naive Bayes,
or more advanced neural networks (such as LSTM or Transformer models) are trained on
labeled data to predict sentiment.
• Lexicon-based Approaches: Using predefined sentiment lexicons (dictionaries) where words
are assigned sentiment scores, and then aggregating these scores to determine overall sentiment.
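
A toy lexicon-based scorer is sketched below to make the idea concrete. The lexicon entries and scores are made-up placeholders; real systems use much larger curated lexicons and more careful tokenization and negation handling.

# Hypothetical, hand-picked lexicon; real lexicons contain thousands of scored terms.
LEXICON = {"love": 2, "great": 2, "good": 1, "ok": 0,
           "bad": -1, "terrible": -2, "hate": -2}

def lexicon_sentiment(text):
    # Tokenize, look up each word's score, and aggregate into a label.
    tokens = text.lower().split()
    score = sum(LEXICON.get(tok.strip(".,!?"), 0) for tok in tokens)
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score

print(lexicon_sentiment("I love this product, the battery is great!"))   # positive
print(lexicon_sentiment("Terrible service, I hate the new update."))     # negative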

Real-Time Processing: Techniques like stream processing (using frameworks like Apache Kafka
or Apache Flink) are employed to handle the continuous flow of incoming data in real-time.

Scalability and Performance: Given the large volume of data, systems need to be scalable. This
often involves distributed computing frameworks (like Apache Spark) and efficient data storage and
retrieval mechanisms (such as NoSQL databases).
Visualization and Reporting: Results are often visualized in dashboards or reported in real-time,
providing insights into trends, spikes in sentiment, or issues that require immediate attention.

Feedback Loop: The system may incorporate a feedback loop where predictions are continuously
refined based on new data and user feedback

Types

Real-time sentiment analysis in big data can be categorized into several types based on the techniques
and approaches used. Here are some common types:

1. Lexicon-based Sentiment Analysis:


o Lexicon-based approaches rely on predefined sentiment lexicons or dictionaries that
assign sentiment scores to words or phrases. These scores are aggregated to determine
the overall sentiment of a piece of text.
2. Machine Learning-based Sentiment Analysis:
o Supervised Learning: Utilizes labeled data to train models such as Support Vector
Machines (SVM), Naive Bayes, or deep learning models like Convolutional Neural
Networks (CNNs) or Recurrent Neural Networks (RNNs) such as LSTMs.
o Unsupervised Learning: Techniques like clustering (e.g., K-means) or topic modeling
(e.g., Latent Dirichlet Allocation, LDA) to identify patterns and sentiments in data
without labeled examples.
3. Hybrid Approaches:
o Combining Lexicon-based and Machine Learning: Integrates lexicon-based
methods with machine learning models to improve sentiment classification accuracy.
o Rule-based and Machine Learning: Incorporates rule-based systems with machine
learning techniques for more nuanced sentiment analysis.
4. Aspect-based Sentiment Analysis:
o Analyzes sentiment at a more granular level, focusing on specific aspects or features
within text (e.g., aspects of a product in reviews), providing insights into sentiment
towards different aspects of a topic.
5. Deep Learning-based Sentiment Analysis:
o Uses deep learning architectures such as Transformer models (e.g., BERT, GPT) to
capture complex contextual information and improve sentiment classification
accuracy, especially in tasks requiring understanding of natural language nuances.
6. Real-time Stream Processing:
o Utilizes stream processing frameworks like Apache Kafka or Apache Flink to process
and analyze data as it arrives, ensuring timely insights and responses to dynamic
changes in sentiment.
7. Multilingual Sentiment Analysis:
o Handles sentiment analysis across multiple languages, utilizing techniques like
machine translation coupled with sentiment analysis to understand sentiments
expressed in diverse linguistic contexts.

Advantages:

1. Immediate Feedback
2. Enhanced Customer Engagement
3. Crisis Management
4. Competitive Advantage
5. Improved Decision Making

Disadvantages:
1. Accuracy Challenges
2. Scalability Issues
3. Privacy Concerns
4. Bias and Misinterpretation
5. Costs

Stock market prediction:

Stock market prediction in the context of big data analytics refers to the use of large and diverse datasets,
often derived from various sources such as historical stock prices, trading volumes, economic
indicators, news sentiment, and alternative data sources (like social media or satellite imagery), to
forecast future movements or trends in financial markets.

Key aspects of stock market prediction using big data analytics include:

1. Data Collection and Integration: Gathering and combining vast amounts of structured and
unstructured data from multiple sources to create a comprehensive dataset for analysis.
2. Data Preprocessing: Cleaning, filtering, and transforming raw data to ensure accuracy,
consistency, and relevance for predictive modeling.
3. Feature Extraction and Engineering: Identifying and creating informative features
(predictors) from the data that are likely to influence stock prices, such as technical indicators,
economic factors, sentiment scores, or derived metrics.
4. Model Development: Applying machine learning algorithms, statistical models, or advanced
analytics techniques to build predictive models. Common approaches include regression
analysis, time series forecasting, ensemble methods (like random forests or gradient boosting),
and deep learning models (such as recurrent neural networks or convolutional neural networks);
a deliberately simplified sketch follows this list.
5. Model Training and Validation: Training predictive models using historical data, optimizing
model parameters, and validating model performance using metrics like accuracy, precision,
recall, and profit-loss ratios.
6. Real-Time Processing: Implementing real-time or near-real-time analytics to continuously
update predictions as new data becomes available, enabling timely decision-making in volatile
market conditions.
7. Risk Management and Evaluation: Integrating risk management strategies to assess and
mitigate uncertainties associated with stock market predictions, including evaluating model
robustness and sensitivity to market changes.
8. Deployment and Monitoring: Deploying predictive models into operational environments,
monitoring model performance over time, and adapting models as market conditions evolve or
new data sources emerge.
9. Ethical and Regulatory Considerations: Adhering to ethical standards in data usage, ensuring
data privacy and security, and complying with regulatory requirements governing financial
markets and data management.
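
As a deliberately simplified illustration of the model development step above (see item 4), the sketch below fits an ordinary least-squares model that predicts the next day's return from the two previous days' returns. The price series is synthetic and the model is far too simple for real trading decisions.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily closing prices and their returns.
prices = 100 * np.cumprod(1 + rng.normal(0.0005, 0.01, size=500))
returns = np.diff(prices) / prices[:-1]

# Features: returns at t-1 and t-2; target: return at t.
X = np.column_stack([returns[1:-1], returns[:-2]])
y = returns[2:]

# Ordinary least squares via numpy's lstsq (with an intercept column).
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

next_day = coef[0] + coef[1] * returns[-1] + coef[2] * returns[-2]
print("predicted next-day return:", next_day)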

Applications

Stock market prediction using big data analytics finds various applications across different areas of
finance and investment. Here are some key applications:

1. Algorithmic Trading and High-Frequency Trading (HFT):


o Big data analytics is used to develop algorithmic trading strategies that automatically
execute trades based on predictive models. These strategies can analyze large volumes
of market data in real-time to identify patterns and exploit short-term opportunities.
2. Portfolio Management:
o Big data analytics helps portfolio managers optimize asset allocation and risk
management strategies. Predictive models can assist in selecting stocks or assets
expected to outperform based on historical data, market trends, and economic
indicators.
3. Risk Management:
o Predictive analytics aids in assessing and managing financial risks associated with
stock investments. Models can predict potential losses or market downturns, allowing
investors to adjust their portfolios or hedge against adverse market movements.
4. Market Sentiment Analysis:
o Big data analytics is used to analyze social media sentiment, news sentiment, and other
textual data to gauge investor sentiment towards specific stocks or the market as a
whole. This information can influence trading decisions and market strategies.
5. Quantitative Research:
o Researchers use big data analytics to conduct quantitative studies and backtesting of
trading strategies. Historical market data combined with predictive models allows
researchers to validate the effectiveness of trading strategies under different market
conditions.
6. Market Surveillance and Compliance:
o Financial institutions and regulatory bodies use big data analytics for market
surveillance and compliance monitoring. Predictive models can detect suspicious
trading activities, market manipulation, and compliance breaches in real-time.
7. Event-Driven Trading:
o Predictive analytics identifies trading opportunities based on anticipated market
reactions to specific events such as earnings announcements, economic reports, or
geopolitical developments. This helps traders capitalize on short-term market
movements.
8. Financial Forecasting and Valuation:
o Big data analytics assists in forecasting future stock prices, earnings estimates, and
company valuations. Predictive models incorporate financial metrics, market data, and
economic indicators to provide insights into future performance.
9. Personalized Investment Advice:
o Wealth management firms and robo-advisors leverage big data analytics to provide
personalized investment advice to individual investors. Predictive models recommend
investment strategies and portfolio allocations based on investor profiles, goals, and
risk tolerance.
10. Market Efficiency and Arbitrage Opportunities:
o Big data analytics helps identify inefficiencies in stock prices and arbitrage
opportunities across different markets or asset classes. Predictive models can detect
pricing anomalies or mispricings that traders can exploit for profit.

Advantages:

1. Data-Driven Insights
2. Improved Decision-Making
3. Real-Time Analysis
4. Risk Management
5. Cost Efficiency
6. Personalized Strategies
Disadvantages:

1. Complexity
2. Data Quality Issues
3. Market Uncertainty
4. Regulatory and Ethical Concerns
5. Lack of Human Judgment
UNIT – 3

HDFS

HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications.
This open source framework works by rapidly transferring data between nodes. It's often used by
companies who need to handle and store big data. HDFS is a key component of many Hadoop systems,
as it provides a means for managing big data, as well as supporting big data analytics.

What is HDFS?

HDFS stands for Hadoop Distributed File System. HDFS operates as a distributed file system designed
to run on commodity hardware.

HDFS is fault-tolerant and designed to be deployed on low-cost, commodity hardware. HDFS provides
high throughput data access to application data and is suitable for applications that have large data sets
and enables streaming access to file system data in Apache Hadoop

The purpose of HDFS is to achieve the following goals:

• Manage large datasets - Organizing and storing datasets can be a hard task to handle. HDFS
is used to manage the applications that have to deal with huge datasets. To do this, HDFS should
have hundreds of nodes per cluster.
• Detecting faults - HDFS should have technology in place to scan and detect faults quickly
and effectively, as it includes a large amount of commodity hardware. Failure of components is
a common issue.
• Hardware efficiency - When large datasets are involved, HDFS can reduce network traffic and
increase processing speed.

There are five core elements of big data organized by HDFS services:

• Velocity - How fast data is generated, collated and analyzed.


• Volume - The amount of data generated.
• Variety - The type of data, this can be structured, unstructured, etc.
• Veracity - The quality and accuracy of the data.
• Value - How you can use this data to bring an insight into your business processes.

Nodes: Master and slave nodes typically form the HDFS cluster.

1. NameNode (Master Node):
• Manages all the slave nodes and assigns work to them.
• It executes filesystem namespace operations like opening, closing, and renaming
files and directories.
• It should be deployed on reliable, high-specification hardware, not on
commodity hardware.
2. DataNode (Slave Node):
• Actual worker nodes, which do the actual work like reading, writing, and
processing.
• They also perform creation, deletion, and replication upon instruction from
the master.
• They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in the background.
• NameNodes:
o Run on the master node.
o Store metadata (data about data) like file paths, the number of blocks,
block IDs, etc.
o Require a high amount of RAM.
o Store metadata in RAM for fast retrieval, i.e., to reduce seek time,
though a persistent copy of it is kept on disk.
• DataNodes:
o Run on slave nodes.
o Require a large amount of disk space, as the data is actually stored here.

HDFS components

It's important to know that there are three main components of Hadoop. Hadoop HDFS, Hadoop
MapReduce, and Hadoop YARN. Let's take a look at what these components bring to Hadoop:

• Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
• Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop. This software
framework is used to write applications to process vast amounts of data.
• Hadoop YARN - Hadoop YARN is a resource management component of Hadoop. It
processes and runs data for batch, stream, interactive, and graph processing - all of which are
stored in HDFS.
Introduction to Hadoop Streaming

Hadoop Streaming uses UNIX standard streams as the interface between Hadoop and your program, so
you can write MapReduce programs in any language that can write to standard output and read from
standard input. Hadoop offers several mechanisms to support non-Java development.

The primary mechanisms are Hadoop Pipes which gives a native C++ interface to Hadoop and Hadoop
Streaming which permits any program that uses standard input and output to be used for map tasks and
reduce tasks.

Features of Hadoop Streaming

Some of the key features associated with Hadoop Streaming are as follows :

• Hadoop Streaming is a part of the Hadoop Distribution System.


• It facilitates ease of writing Map Reduce programs and codes.
• Hadoop Streaming supports almost all types of programming languages such as Python,
C++, Ruby, Perl etc.
• The entire Hadoop Streaming framework runs on Java. However, the codes might be
written in different languages as mentioned in the above point.
• The Hadoop Streaming process uses Unix Streams that act as an interface between Hadoop
and Map Reduce programs.
• Hadoop Streaming uses various Streaming Command Options and the two mandatory
ones are – -input directoryname or filename and -output directoryname

Hadoop Streaming architecture

A Hadoop Streaming architecture has eight key parts:

• Input Reader/Format
• Key Value
• Mapper Stream
• Key-Value Pairs
• Reduce Stream
• Output Format
• Map External
• Reduce External

The involvement of these components will be discussed in detail when we explain the working of the
Hadoop streaming.

How Does Hadoop Streaming Work?

• Input is read from standard input and the output is emitted to standard output by the Mapper
and the Reducer. The utility creates a Map/Reduce job, submits the job to an appropriate
cluster, and monitors the progress of the job until completion.
• When a script is specified for mappers, every mapper task launches the script as a separate
process when the mapper is initialized. Mapper task inputs are converted into lines and fed
to the process's standard input; line-oriented outputs are collected from the process's
standard output, and every line is converted into a key/value pair, which is collected as the
output of the mapper.
• When a script is specified for reducers, each reducer task launches the script as a separate
process when the reducer is initialized. As the reducer task runs, its input key/value pairs
are converted into lines and fed to the standard input (STDIN) of the process.
• Each line of the line-oriented output collected from the standard output (STDOUT) of the
process is converted into a key/value pair, which is then collected as the output of the
reducer.
Mapper Phase Code

#!/usr/bin/python
import sys

# Read lines from standard input, split them into words,
# and emit one tab-separated (word, 1) pair per word.
for myline in sys.stdin:
    myline = myline.strip()
    words = myline.split()
    for myword in words:
        print('%s\t%s' % (myword, 1))

Make sure this file has execution permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).
Reducer Phase Code

#!/usr/bin/python
import sys

current_word = None
current_count = 0
word = None

# Input arrives sorted by key, as tab-separated "word<TAB>count" lines.
for myline in sys.stdin:
    myline = myline.strip()
    word, count = myline.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue   # skip lines whose count is not a number
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Key changed: emit the total for the previous word.
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the last word's total.
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

Advantages:

1. Flexibility in Programming Languages

2. Reuse of Existing Code

3. Integration with Ecosystem Tools

4. Easier Prototyping and Rapid Development

5. Scalability

Disadvantages:

1. Performance Overhead

2. Resource Management

3. Debugging and Error Handling

4. Limited Support for Complex Data Types

5. Security Concerns

The Design of HDFS :

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running
on clusters of commodity hardware.

Very large files: “Very large” in this context means files that are hundreds of megabytes, gigabytes, or
terabytes in size. There are Hadoop clusters running today that store petabytes of data.

Streaming data access : HDFS is built around the idea that the most efficient data processing pattern
is a write-once, readmany-times pattern. A dataset is typically generated or copied from source, then
various analyses are performed on that dataset over time.

Commodity hardware : Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s
designed to run on clusters of commodity hardware (commonly available hardware from multiple
vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS
is designed to carry on working without a noticeable interruption to the user in the face of such failure.

Java Interface for Hadoop HDFS Filesystems – Examples and Concepts

Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation. The Java
abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in Hadoop,
and there are several concrete implementations. Hadoop is written in Java, so most Hadoop filesystem
interactions are mediated through the Java API. The filesystem shell, for example, is a Java application
that uses the Java FileSystem class to provide filesystem operations. By exposing its filesystem interface
as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS. The HTTP REST
API exposed by the WebHDFS protocol makes it easier for other languages to interact with HDFS.
Note that the HTTP interface is slower than the native Java client, so it should be avoided for very large
data transfers if possible.

There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons serve HTTP
requests to clients; and via a proxy, which accesses HDFS on the client’s behalf using the usual
DistributedFileSystem API.

The Java Interface

In this section, we dig into the Hadoop FileSystem class: the API for interacting with one of Hadoop’s
filesystems.

Reading Data Using the FileSystem API

A file in a Hadoop filesystem is represented by a Hadoop Path object. FileSystem is a general filesystem
API, so the first step is to retrieve an instance for the filesystem we want to use—HDFS, in this case.
There are several static factory methods for getting a FileSystem instance

public static FileSystem get(Configuration conf) throws IOException


public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
public static LocalFileSystem getLocal(Configuration conf) throws IOException
FSDataInputStream:
The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io
class. This class is a specialization of java.io.DataInputStream with support for random access, so you
can read from any part of the stream:
package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream implements Seekable,
PositionedReadable {

The Seekable interface permits seeking to a position in the file and provides a query method for the
current offset from the start of the file (getPos()). Calling seek() with a position that is greater than the
length of the file will result in an IOException.
FSDataOutputStream

The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream,


has a method for querying the current position in the file:

package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
public long getPos() throws IOException {
}
}

However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because
HDFS allows only sequential writes to an open file or appends to an already written file. In other words,
there is no support for writing to anywhere other than the end of the file, so there is no value in being
able to seek while writing.

FileSystem provides a method to create a directory also

public boolean mkdirs(Path f) throws IOException

This method creates all of the necessary parent directories if they don’t already exist and returns true
if it succeeds.

File patterns: It is a common requirement to process sets of files in a single operation. For example, a
MapReduce job for log processing might analyze a month’s worth of files contained in a number of
directories. Hadoop provides two FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter)
throws IOException

The globStatus() methods return an array of FileStatus objects whose paths match the supplied pattern,

sorted by path.

What is MapReduce?

MapReduce is a Java-based, distributed execution framework within the Apache Hadoop ecosystem. It
takes away the complexity of distributed programming by exposing two processing steps that
developers implement: 1) Map and 2) Reduce. In the Map step, data is split between parallel
processing tasks, and transformation logic can be applied to each chunk of data. Once completed, the
Reduce phase takes over to aggregate the data from the Map set. In general, MapReduce
uses the Hadoop Distributed File System (HDFS) for both input and output.
How MapReduce Works?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

• The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
• The Reduce task takes the output from the Map as an input and combines those data tuples
(key-value pairs) into a smaller set of tuples.

The reduce task is always performed after the map job.

Let us now take a close look at each of the phases and try to understand their significance.
• Input Phase − Here we have a Record Reader that translates each record in an input file and
sends the parsed data to the mapper in the form of key-value pairs.
• Map − Map is a user-defined function, which takes a series of key-value pairs and processes
each one of them to generate zero or more key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate
keys.
• Combiner − A combiner is a type of local Reducer that groups similar data from the map phase
into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-
defined code to aggregate the values in a small scope of one mapper. It is not a part of the main
MapReduce algorithm; it is optional.
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the
grouped key-value pairs onto the local machine, where the Reducer is running. The individual
key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys
together so that their values can be iterated easily in the Reducer task.
• Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer
function on each one of them. Here, the data can be aggregated, filtered, and combined in a
number of ways, and it requires a wide range of processing. Once the execution is over, it gives
zero or more key-value pairs to the final step.
• Output Phase − In the output phase, we have an output formatter that translates the final key-
value pairs from the Reducer function and writes them onto a file using a record writer.
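
To see these phases end to end, the following sketch simulates a word-count job in plain Python: a map step emitting (word, 1) pairs, a shuffle-and-sort step grouping pairs by key, and a reduce step summing each group. It is only an in-memory illustration of the data flow, not how Hadoop actually schedules or executes the job.

from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Combine all counts for a single key into one total.
    return (word, sum(counts))

lines = ["big data is big", "data is data"]

# Map: apply the mapper to every record.
mapped = [pair for line in lines for pair in map_phase(line)]

# Shuffle and sort: bring all values for the same key together.
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# Reduce: one call per distinct key.
results = [reduce_phase(word, (c for _, c in pairs)) for word, pairs in grouped]
print(results)   # [('big', 2), ('data', 3), ('is', 2)]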

What is MapReduce used for?

Legacy applications and Hadoop-native tools like Sqoop and Pig leverage MapReduce today. There is
very limited new MapReduce application development, and no significant contributions are being made
to it as an open source technology.
Advantages of MapReduce

1. Scalability
2. Flexibility
3. Security and authentication
4. Faster processing of data
5. Very simple programming model
6. Availability and resilient nature

Simple tips on how to improve MapReduce performance

1. Enabling uber mode


2. Use native library
3. Increase the block size
4. Monitor time taken by map tasks
5. Identify if data compression is splittable or not
6. Set number of reduced tasks
7. Analyze the partition of data
8. Shuffle phase performance movements
9. Optimize MapReduce code

alternatives to MapReduce

1. Apache Spark
2. Apache Storm
3. Ceph
4. Hydra
5. Google BigQuery

Map reduce feature


1. Scalability
2. Fault Tolerance
3. Simplicity
4. Data Locality
5. Flexibility
6. Ecosystem
7. Parallelization
What Is a Hadoop Cluster?

Apache Hadoop is an open source, Java-based, software framework and parallel data processing engine.
It enables big data analytics processing tasks to be broken down into smaller tasks that can be performed
in parallel by using an algorithm (like the MapReduce algorithm), and distributing them across a
Hadoop cluster. A Hadoop cluster is a collection of computers, known as nodes, that are networked
together to perform these kinds of parallel computations on big data sets. Unlike other computer
clusters, Hadoop clusters are designed specifically to store and analyze mass amounts of structured and
unstructured data in a distributed computing environment.

Hadoop Cluster Architecture


Hadoop clusters are composed of a network of master and worker nodes that orchestrate and execute
the various jobs across the Hadoop distributed file system. The master nodes typically utilize higher
quality hardware and include a NameNode, Secondary NameNode, and JobTracker, with each running
on a separate machine.

The final part of the system is the client nodes, which are responsible for loading the data and fetching
the results.

• Master nodes are responsible for storing data in HDFS and overseeing key operations, such as
running parallel computations on the data using MapReduce.
• The worker nodes comprise most of the virtual machines in a Hadoop cluster, and perform the
job of storing the data and running computations. Each worker node runs the DataNode and
TaskTracker services, which are used to receive the instructions from the master nodes.
• Client nodes are in charge of loading the data into the cluster. Client nodes first submit
MapReduce jobs describing how data needs to be processed and then fetch the results once the
processing is finished.

Cluster size in Hadoop?

A Hadoop cluster size is a set of metrics that defines storage and compute capabilities to run Hadoop
workloads, namely :

• Number of nodes : number of Master nodes, number of Edge Nodes, number of Worker Nodes.
• Configuration of each type node: number of cores per node, RAM and Disk Volume.
Types of Hadoop clusters

1. Single Node Hadoop Cluster


2. Multiple Node Hadoop Cluster

1. Single Node Hadoop Cluster: In a single node Hadoop cluster, as the name suggests, the
cluster has only a single node, which means all of our Hadoop daemons run on that one node.
2. Multiple Node Hadoop Cluster: In a multiple node Hadoop cluster, as the name suggests,
the cluster contains multiple nodes. In this kind of cluster setup, our Hadoop daemons are
distributed across different nodes in the same cluster.

Advantages of a Hadoop Cluster

• Hadoop clusters can boost the processing speed of many big data analytics jobs, given their
ability to break down large computational tasks into smaller tasks that can be run in a parallel,
distributed fashion.
• Hadoop clusters are easily scalable and can quickly add nodes to increase throughput, and
maintain processing speed, when faced with increasing data blocks.
• The use of low cost, high availability commodity hardware makes Hadoop clusters relatively
easy and inexpensive to set up and maintain.
• Hadoop clusters replicate a data set across the distributed file system, making them resilient to
data loss and cluster failure.
• Hadoop clusters make it possible to integrate and leverage data from multiple different source
systems and data formats.
• It is possible to deploy Hadoop using a single-node installation, for evaluation purposes.

Challenges :

1. Issue with small files


2. High processing overhead
3. Only batch processing is supported
4. Iterative Processing

What is Hadoop Security?

Hadoop, an open-source platform, has transformed how businesses handle huge amounts of data.
Because of its ability to store and process massive volumes of data across distributed computer clusters,
it has become a popular choice for enterprises looking to use the potential of big data analytics.
Introduction

As data breaches and cyber threats have increased, guaranteeing the security of Hadoop clusters has
become a major problem. Hadoop security refers to the methods and practices implemented to safeguard
sensitive data and prevent unauthorized access or misuse inside a Hadoop system.

Need for Hadoop Security

The need for Hadoop security derives from the increased vulnerability of data to hostile attacks. Hadoop
is used by businesses to store and handle vast volumes of data, including sensitive information like
customer records, financial data, and intellectual property. This valuable data becomes a tempting target
for fraudsters if sufficient security measures are not implemented. Furthermore, data privacy laws, such
as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA),
require businesses to protect their consumers' information. Failure to follow these regulations can result
in serious financial penalties and reputational harm.

3 A's of security

Organizations must prioritize the three core security principles known as the 3 A's: Authentication,
Authorization, and Auditing to manage security concerns in a Hadoop environment effectively.

1. Authentication:

The authentication process ensures that only authorized users can access the Hadoop cluster. It entails
authenticating users' identities using various mechanisms such as usernames and passwords, digital
certificates, or biometric authentication.

2. Authorization:
Authorization governs the actions an authenticated user can take within the Hadoop cluster. It entails
creating access restrictions and permissions depending on the roles and responsibilities of the users.

3. Auditing:
Auditing is essential for monitoring and tracking user activity in the Hadoop cluster. Organizations can
investigate suspicious or unauthorized activity by keeping detailed audit logs.

How Hadoop Ensures Security?

Hadoop is a powerful and effective method for managing and analyzing data. Data security, on the other
hand, is critical in any big data ecosystem. Hadoop recognizes this critical element and offers several
measures to assure data security throughout its distributed infrastructure.

Authentication and authorization are two of the key ways in which Hadoop ensures security. Hadoop
has strong authentication procedures to verify user identities and prevent unauthorized data access. It
supports various authentication protocols, including Kerberos, LDAP, and SSL, to ensure safe access
to Hadoop clusters. Furthermore, Hadoop uses role-based access control (RBAC) to design and enforce
access permissions, allowing administrators to give or restrict capabilities based on user roles and
responsibilities.

Hadoop also includes auditing and logging features to track and monitor user activity. It records
important events like file access, user authentication, and administrative tasks, allowing
administrators to detect suspicious or unauthorized behavior. These records support forensic
analysis, compliance reporting, and troubleshooting.
Different Tools for Hadoop Security

Various effective methods are available to strengthen Hadoop security and maintain data
confidentiality, integrity, and availability.

Apache Ranger

Apache Ranger is a Hadoop security framework that allows administrators to design fine-grained access
control policies. Ranger's centralized policy management enables organizations to restrict user access,
monitor activity, and uniformly implement security policies across the Hadoop ecosystem.

Apache Knox

Apache Knox serves as a security gateway for Hadoop systems. It acts as a single entry point for users
and authenticates them before allowing access to the cluster's services. Knox enhances Hadoop's overall
security posture by providing perimeter security, encryption, and integration with external
authentication systems.

Apache Sentry

Apache Sentry is a Hadoop-based role-based access control (RBAC) solution. It lets administrators
create and enforce granular access privileges, ensuring that users only have access to the data required
for their specific activities. Sentry's permission procedures improve data security and reduce the danger
of unauthorized access.

Cloudera Navigator

Cloudera Navigator provides full Hadoop data governance and security features. It includes data
discovery, metadata management, and lineage tracing, allowing organizations to monitor and audit data
access, enforce compliance requirements, and quickly detect any questionable activity.

UNIT – 4

Link analysis is an analysis technique that focuses on relationships and connections in a dataset. Link
analysis gives you the ability to calculate centrality measures—namely degree, betweenness, closeness,
and eigenvector—and see the connections on a link chart or link map.

About link analysis


Link analysis uses a network of interconnected links and nodes to identify and analyze relationships
that are not easily seen in raw data. Common types of networks include the following:

• Social networks that show who talks to whom


• Semantic networks that illustrate topics that are related to each other
• Conflict networks indicating alliances of connections between players
• Airline networks indicating which airports have connecting flights

Examples
A crime analyst is investigating a criminal network. Data from cell phone records can be used to
determine the relationship and hierarchy between members of the network.

A credit card company is developing a new system to detect credit card theft. The system uses the known
patterns of transactions for each client, such as the city, stores, and types of transactions, to identify
anomalies and alert the client of a potential theft.

How link analysis works


The following overview describes the terminology used in link analysis:

Term: Network
Description: A set of interconnected nodes and links.
Examples: An online social network that uses a network of profiles and relationships to connect users;
an airline network that uses a network of airports and flights to transport travelers from their origin to
their destination.

Term: Node
Description: A point or vertex that represents an object, such as a person, place, crime type, or tweet.
The node may also include associated properties.
Examples: The profiles in a social network, where associated properties may include the user's name,
home town, and employer; the airports in an airline network, where associated properties may include
the airport name.

Term: Link
Description: The relationships or connections between nodes. The link may also include associated
properties.
Examples: The relationships between profiles in the network, such as friend, follower, or connection,
where associated properties may include the length of the relationship; the flights between airports in
an airline network, where associated properties may include the number of flights between airports.

Centrality

Centrality is a measure of importance for nodes in a network.

Degree centrality
Degree centrality is based on the number of direct connections a node has. Degree centrality should be
used when you want to determine which nodes have the most direct influence.

Degree centrality of node x is calculated using the following equation:

degCentrality(x)=deg(x)/(NodesTotal-1)
Betweenness centrality
Betweenness centrality is based on the extent a node is part of the shortest path between other nodes.
Betweenness centrality should be used when you want to determine which nodes are used to connect
other nodes to each other.

Betweenness centrality of node x is calculated using the following equation:

btwCentrality(x)=Σa,bϵNodes(pathsa,b(x)/pathsa,b)
where:

• pathsa,b is the number of shortest paths between nodes a and b
• pathsa,b(x) is the number of those shortest paths that pass through node x

Closeness centrality
Closeness centrality is based on the average of the shortest network path distance between nodes.
Closeness centrality should be used when you want to determine which nodes are most closely
associated with the other nodes in the network.

Closeness centrality of node x is calculated using the following equation:

closeCentrality(x)=(nodes(x,y)/(NodesTotal-1))*(nodes(x,y)/dist(x,y)Total)
where:

• nodes(x,y) is the number of nodes in the network connected to node x
• dist(x,y)Total is the sum of the shortest-path distances from node x to those connected nodes
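
As a quick illustration of these three measures, the following sketch computes them for a small toy network. It assumes the third-party networkx library is available; the node names and edges are invented for the example.

import networkx as nx

# Hypothetical network: edges represent relationships between people
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

# Degree centrality: fraction of other nodes each node is directly connected to
print("Degree:", nx.degree_centrality(G))

# Betweenness centrality: how often a node lies on shortest paths between other nodes
print("Betweenness:", nx.betweenness_centrality(G))

# Closeness centrality: how near a node is, on average, to all other nodes
print("Closeness:", nx.closeness_centrality(G))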

Page Rank:

PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine results.
PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of
measuring the importance of website pages. According to Google:
PageRank works by counting the number and quality of links to a page to determine a rough estimate
of how important the website is. The underlying assumption is that more important websites are likely
to receive more links from other websites.

Algorithm
The PageRank algorithm outputs a probability distribution used to represent the likelihood that a
person randomly clicking on links will arrive at any particular page. PageRank can be calculated for
collections of documents of any size.
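
To make the computation concrete, here is a minimal power-iteration sketch in Python using numpy. The four-page graph and the 0.85 damping factor are illustrative assumptions, not the configuration of any real search engine.

import numpy as np

# Hypothetical web graph: each page maps to the pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = sorted(links)
n = len(pages)
damping = 0.85  # assumed damping factor

# Column-stochastic transition matrix: M[i, j] = probability of moving from page j to page i
M = np.zeros((n, n))
for j, src in enumerate(pages):
    for dst in links[src]:
        M[pages.index(dst), j] = 1.0 / len(links[src])

# Power iteration: start from a uniform distribution and repeat until the ranks stabilize
rank = np.full(n, 1.0 / n)
for _ in range(100):
    new_rank = (1 - damping) / n + damping * (M @ rank)
    delta = np.abs(new_rank - rank).sum()
    rank = new_rank
    if delta < 1e-8:  # convergence reached
        break

print(dict(zip(pages, np.round(rank, 3))))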

PageRank effectively works in big data environments by leveraging distributed computing frameworks
to handle the scalability, fault tolerance, and computational complexity associated with large-scale web
graph processing. This enables applications such as search engines to compute and update web page
rankings efficiently across massive datasets.

Graph Representation: Websites and web pages are represented as nodes in a graph, and hyperlinks
between them are represented as edges. This graph can grow to billions of nodes and edges, making it
a massive dataset.

Scalability: PageRank computations can be distributed across multiple machines in a cluster using
parallel processing frameworks like MapReduce, Apache Spark, or other distributed computing
paradigms. This allows handling large-scale graphs efficiently.

Distributed Computation:
• Mapping Phase: Each node in the graph computes its contributions to other nodes it links to
(outgoing links). This can be parallelized across the cluster, where each node (page) computes
its contribution independently.
• Shuffling and Sorting: Intermediate results (contribution values) are shuffled and sorted based
on the destination nodes they contribute to. This prepares the data for the next phase.
• Reducing Phase: Each node aggregates the contributions it receives from incoming links
(nodes). The PageRank score is then updated using a formula that includes damping factors and
the accumulated contributions. This process iterates until convergence.
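
The sketch below imitates one such iteration in plain Python: a map step emits each page's contributions to its out-links, a shuffle step groups the contributions by destination, and a reduce step folds them into updated PageRank scores. The toy graph, the damping factor, and the function name are assumptions made for illustration; a real job would run these phases across a cluster.

from collections import defaultdict

damping = 0.85
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}  # hypothetical graph
ranks = {page: 1.0 / len(links) for page in links}             # uniform starting ranks

def pagerank_iteration(links, ranks):
    # Map phase: every page emits (destination, contribution) pairs for its out-links
    emitted = []
    for page, outgoing in links.items():
        for dst in outgoing:
            emitted.append((dst, ranks[page] / len(outgoing)))

    # Shuffle phase: group the contributions by destination page
    grouped = defaultdict(list)
    for dst, contribution in emitted:
        grouped[dst].append(contribution)

    # Reduce phase: sum the contributions and apply the damping factor
    n = len(links)
    return {page: (1 - damping) / n + damping * sum(grouped[page])
            for page in links}

for _ in range(20):  # run a few iterations toward convergence
    ranks = pagerank_iteration(links, ranks)
print({page: round(score, 3) for page, score in ranks.items()})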

Fault Tolerance: Distributed computing frameworks like MapReduce provide fault tolerance
mechanisms. If a node or task fails during computation, it can be automatically restarted or reassigned
to another node in the cluster, ensuring robustness in large-scale computations.

Iterative Nature: PageRank computation is iterative, meaning it refines the PageRank values
through multiple iterations until they stabilize (convergence). Each iteration involves updating
PageRank scores based on the contributions from neighboring nodes, ensuring the ranking reflects the
overall structure of the web graph.

Performance Optimization: Techniques such as graph partitioning, efficient data shuffling, and
minimizing communication overhead between nodes are employed to optimize the performance of
PageRank computation in distributed environments.

Integration with Big Data Ecosystem: PageRank algorithms can be integrated with other
components of the big data ecosystem, such as distributed file systems (for example, HDFS) that store
the underlying web graph.

Components:

1. Graph Representation
2. Data Storage
3. MapReduce or Similar Framework
4. Iteration and Convergence
5. Integration with Ecosystem

Advantages:

1. Objective and unbiased


2. Quality-focused
3. Resilience to manipulation
4. Query-independent

Disadvantages

1. Vulnerability to manipulation
2. Emphasis on old pages
3. Lack of user context
4. Limited in dealing with spam and low-quality content

Efficient computation of PageRank in the context of big data involves several strategies to handle the
large volumes of data and complex computations efficiently. Here are some key approaches:

1. Graph Partitioning: Split the web graph into smaller partitions that can be processed
independently. This allows parallel computation of PageRank on each partition, reducing the
overall computation time. Techniques like the METIS algorithm can be used for effective
partitioning.
2. Iterative Computation: Use iterative algorithms such as the Power Method or the more
advanced algorithms like Gauss-Seidel or Jacobi methods. These algorithms are designed to
converge to the PageRank vector efficiently over multiple iterations.
3. Distributed Computing: Utilize distributed computing frameworks such as Apache Hadoop
or Apache Spark. These frameworks enable parallel processing across a cluster of computers,
distributing the computation of PageRank across multiple nodes to handle big data scale.
4. Data Compression: Apply techniques like sparse matrix representations and compression
algorithms to reduce the memory footprint and computational overhead, especially when
dealing with large, sparse matrices representing the web graph.
5. Optimized Data Structures: Use efficient data structures such as hash tables or inverted
indices to store and access the graph structure and PageRank values quickly during
computation.
6. Incremental Updates: Implement incremental algorithms that update PageRank values only
for nodes that have changed, rather than recalculating from scratch. This is particularly useful
in dynamic graphs where updates are frequent.
7. Preprocessing: Preprocess the graph data to remove redundancy, identify dangling nodes
(nodes with no outgoing links), and handle other structural optimizations that can simplify the
PageRank computation.
8. Algorithmic Improvements: Explore advanced algorithms or modifications to the original
PageRank algorithm that can improve convergence speed or reduce computational complexity,
while still maintaining accurate results.
9. Scalable Infrastructure: Ensure the computing infrastructure (hardware and software) is
scalable and can handle the volume and velocity of data required for computing PageRank in
real-time or near real-time.
10. Caching and Memory Management: Utilize caching mechanisms to store intermediate results
and optimize memory management to handle large data sets efficiently.
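
Several of these strategies can be sketched briefly; the example below illustrates point 4 by storing the link matrix in scipy's compressed sparse row format, so that one PageRank update step only touches the nonzero entries. The tiny matrix and the choice of scipy are assumptions made for illustration.

import numpy as np
from scipy.sparse import csr_matrix

# Column-stochastic link matrix of a hypothetical 4-page graph, stored sparsely
# so that memory grows with the number of links rather than with pages squared.
dense = np.array([
    [0.0, 0.0, 1.0, 0.0],  # links into page A
    [0.5, 0.0, 0.0, 0.0],  # links into page B
    [0.5, 1.0, 0.0, 1.0],  # links into page C
    [0.0, 0.0, 0.0, 0.0],  # links into page D (no inbound links)
])
M = csr_matrix(dense)

damping = 0.85
n = dense.shape[1]
rank = np.full(n, 1.0 / n)

# One PageRank update: a sparse matrix-vector product plus the teleport term
rank = (1 - damping) / n + damping * (M @ rank)
print(np.round(rank, 3))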

What is Link Spam?


Link spam is the technique of using backlinks or anchor text that is irrelevant to the content and usually
has the goal of increasing site traffic.

Spam links are also commonly referred to as "toxic backlinks." Since every link that is posted should be
relevant to the surrounding content, irrelevant links effectively push users toward the promoted site
even though the link or the site does not match the content at all.

Types of Link Spam


The following are the types of link spam or toxic backlinks that you must know:

1. Hidden Link

Hidden links are the first type of spam link. Hidden links are a technique in which links embedded in
content cannot be seen by users. These links are typically hidden behind images, given the same color
as the page background, or buried in the site's code so that they are completely invisible to users.
2. Nofollow Link

Some nofollow links on a site can be detected as spam links. Using nofollow links is a tactic claimed
to boost a site's backlink profile and trick Google's spam detectors.

3. Spam Posting
Spam posting is the practice of repeatedly sharing a link in public forums, comment sections, or other
inappropriate places. This technique is widely used by link spammers because it is considered the easiest
way to do link spamming.

4. Link Farms: Link farms are a type of link spamming that involves site owner cooperation. Site owners
who engage in link farming will continuously link to each other for the sole purpose of building
backlinks.

5. Directory Spam: Directories can be a double-edged sword when looking to improve your SERP rank.
When dealing with local SEO, registering your business across different authoritative directories can
cause serious improvements in your search rank. Directory spam, by contrast, means submitting a site to
large numbers of low-quality or irrelevant directories purely to accumulate backlinks.

Recommendation System?
A recommendation system is an artificial intelligence or AI algorithm, usually associated with machine
learning, that uses Big Data to suggest or recommend additional products to consumers. These can be
based on various criteria, including past purchases, search history, demographic information, and other
factors. Recommender systems are highly useful as they help users discover products and services they
might otherwise have not found on their own.
Recommender systems are trained to understand the preferences, previous decisions, and characteristics
of people and products using data gathered about their interactions. These include impressions, clicks,
likes, and purchases.

Types of Recommendation Systems


While there are a vast number of recommender algorithms and techniques, most fall into broad
categories such as collaborative filtering, content-based filtering, hybrid approaches, and contextual filtering.

1. Collaborative Filtering

• User-Based Collaborative Filtering:


o Description: Recommends items to a user based on items liked by users with similar
preferences.
o Approach: Build a user-item matrix and calculate similarity between users based on
their interactions (e.g., ratings, purchases).
o Advantages: Doesn't require item metadata; captures user preferences implicitly.
o Challenges: Cold start problem for new users; scalability with increasing users and
items.
• Item-Based Collaborative Filtering:
o Description: Recommends items similar to those a user has liked or interacted with in
the past.
o Approach: Build an item-item matrix and calculate similarity between items based on
user interactions.
o Advantages: Effective for recommending niche items; handles item updates well.
o Challenges: Cold start problem for new items; computation-intensive for large
datasets.
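
As a concrete sketch of the item-based approach described above, the code below builds a small user-item rating matrix, computes cosine similarity between item columns, and predicts scores for a user's unrated items. The ratings are invented, and a production system would work on much larger, sparse data.

import numpy as np

# Hypothetical user-item rating matrix (rows = users, columns = items, 0 = unrated)
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

# Item-item similarity matrix based on how users rated each pair of items
n_items = ratings.shape[1]
item_sim = np.array([[cosine_sim(ratings[:, i], ratings[:, j])
                      for j in range(n_items)] for i in range(n_items)])

# Predict user 0's score for each unrated item as a similarity-weighted average
user = ratings[0]
rated = np.where(user > 0)[0]
for item in np.where(user == 0)[0]:
    weights = item_sim[item, rated]
    prediction = weights @ user[rated] / weights.sum() if weights.sum() else 0.0
    print(f"Predicted rating of item {item} for user 0: {prediction:.2f}")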

2. Content-Based Filtering

• Description: Recommends items similar to those a user has liked, based on item attributes or
content.
• Approach: Extract features from item descriptions (e.g., text, metadata) and recommend items
with similar features.
• Advantages: Handles new items well; can recommend items based on user preferences.
• Challenges: Limited diversity in recommendations; requires quality metadata.

3. Hybrid Recommender Systems

• Description: Combine collaborative filtering and content-based filtering to provide more accurate
and diverse recommendations.
• Approach: Use a weighted approach to combine recommendations from both methods or use
one to enhance the weaknesses of the other.
• Advantages: Improves recommendation quality; mitigates cold start problems.
• Challenges: Complex to implement and tune; requires careful integration of different
algorithms.

4. Matrix Factorization Techniques

• Description: Decompose the user-item interaction matrix into lower-dimensional matrices to
capture latent factors.
• Approach: Techniques like Singular Value Decomposition (SVD), Alternating Least Squares
(ALS), or Non-negative Matrix Factorization (NMF).
• Advantages: Handles sparsity in user-item interactions; effective for large-scale datasets.
• Challenges: Scalability with increasing data size; tuning hyperparameters.
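
The sketch below shows the idea with a plain truncated SVD on a small dense rating matrix; real systems usually run ALS or stochastic gradient descent on sparse data, so the numbers and the choice of two latent factors are purely illustrative.

import numpy as np

# Hypothetical user-item rating matrix (0 means "not rated")
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

# Truncated SVD: keep k latent factors for users and items
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # low-rank reconstruction

# The reconstructed matrix provides predicted scores for the unrated (zero) cells
print(np.round(R_hat, 2))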

5. Contextual Recommender Systems

• Description: Incorporate contextual information such as time, location, or device to provide more
relevant recommendations.
• Approach: Use contextual features to personalize recommendations based on situational
factors.
• Advantages: Improves relevance of recommendations; adapts to changing user preferences.
• Challenges: Requires robust data collection and processing of contextual data; complexity in
modeling interactions.

Considerations for Big Data:

• Scalability: Algorithms should be scalable to handle large volumes of data efficiently.


• Real-Time Recommendations: Ability to update recommendations in real-time as new data
arrives.
• Data Privacy: Ensure compliance with data privacy regulations when handling large datasets.
• Evaluation Metrics: Use appropriate metrics (e.g., precision, recall, RMSE) to evaluate
recommendation quality in a big data context.
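
For the evaluation point, here is a minimal sketch of two common metrics, RMSE and precision@k; the ratings and item names are made up for the example.

import numpy as np

y_true = np.array([4.0, 3.0, 5.0, 2.0])  # actual ratings (hypothetical)
y_pred = np.array([3.5, 3.0, 4.0, 2.5])  # model predictions (hypothetical)

# Root Mean Squared Error: penalizes large rating errors more heavily
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print("RMSE:", round(rmse, 3))

# Precision@k: fraction of the top-k recommended items that are actually relevant
recommended = ["item3", "item7", "item1"]   # top-3 recommendations
relevant = {"item1", "item3", "item9"}      # items the user actually liked
precision_at_k = sum(item in relevant for item in recommended) / len(recommended)
print("Precision@3:", round(precision_at_k, 3))
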
Benefits of Recommendation Systems
Recommender systems are a critical component driving personalized user experiences, deeper
engagement with customers, and powerful decision support tools in retail, entertainment, healthcare,
finance, and other industries. On some of the largest commercial platforms, recommendations account
for as much as 30% of the revenue. A 1% improvement in the quality of recommendations can translate
into billions of dollars in revenue.
Companies implement recommender systems for a variety of reasons, including:

• Improving retention. By continuously catering to the preferences of users and customers,
businesses are more likely to retain them as loyal subscribers or shoppers. When a customer
senses that they’re truly understood by a brand and not just having information randomly
thrown at them, they’re far more likely to remain loyal and continue shopping at your site.
• Increasing sales. Various research studies show increases in upselling revenue from 10-50%
resulting from accurate ‘you might also like’ product recommendations. Sales can be increased
with recommendation system strategies as simple as adding matching product
recommendations to a purchase confirmation; collecting information from abandoned
electronic shopping carts; sharing information on ‘what customers are buying now’; and sharing
other buyers’ purchases and comments.
• Helping to form customer habits and trends. Consistently serving up accurate and relevant
content can trigger cues that build strong habits and influence usage patterns in customers.
• Speeding up the pace of work. Analysts and researchers can save as much as 80% of their time
when served tailored suggestions for resources and other materials necessary for further
research.
• Boosting cart value. Companies with tens of thousands of items for sale would be challenged
to hard code product suggestions for such an inventory. By using various means of filtering,
these ecommerce titans can find just the right time to suggest new products customers are likely
to buy, either on their site or through email or other means.

Neural Network Models for Recommendation


1. Feedforward Neural Networks
2. Convolutional Neural Networks
3. Recurrent Neural Networks
What is Data Visualization?

Data visualization is the graphical representation of information and data. By using visual elements like
charts, graphs, and maps, data visualization tools provide an accessible way to see and understand
trends, outliers, and patterns in data.



Data visualization translates complex data sets into visual formats that are easier for the human brain
to comprehend. This can include a variety of visual tools such as:
• Charts: Bar charts, line charts, pie charts, etc.
• Graphs: Scatter plots, histograms, etc.
• Maps: Geographic maps, heat maps, etc.
• Dashboards: Interactive platforms that combine multiple visualizations.
The primary goal of data visualization is to make data more accessible and easier to interpret,
allowing users to identify patterns, trends, and outliers quickly.
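
As a simple illustration of the chart types listed above, the sketch below draws a bar chart and a line chart with matplotlib; the monthly sales figures are invented for the example.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]  # hypothetical monthly sales

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(months, sales)               # bar chart: compare categories
ax1.set_title("Monthly sales (bar)")

ax2.plot(months, sales, marker="o")  # line chart: show the trend over time
ax2.set_title("Monthly sales (trend)")

plt.tight_layout()
plt.show()
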
Types of Data for Visualization
• Numerical Data
• Categorical Data
Importance:
1. Data Visualization Discovers the Trends in Data
2. Data Visualization Provides a Perspective on the Data
3. Data Visualization Puts the Data into the Correct Context
4. Data Visualization Saves Time
5. Data Visualization Tells a Data Story

Tools for Visualization of Data


The following are the 10 best Data Visualization Tools
1. Tableau
2. Looker
3. Zoho Analytics
4. Sisense
5. IBM Cognos Analytics
6. Qlik Sense
7. Domo
8. Microsoft Power BI
9. Klipfolio
10. SAP Analytics Cloud

Advantages:

1. Enhanced Comparison
2. Improved Methodology
3. Efficient Data Sharing
4. Sales Analysis
5. Identifying Event Relations

Disadvantages:
a. Can be time-consuming
b. Can be misleading
c. Can be difficult to interpret
d. May not be suitable for all types of data

Big Data Interaction Techniques:

• Scalability: Techniques should scale with increasing data volumes and complexity.
• Interactivity: Tools should provide responsive and intuitive interfaces for exploring and
analyzing data.
• Integration: Techniques should integrate seamlessly with existing data pipelines and analytics
workflows.
• Security: Ensure data privacy and compliance with regulations when interacting with sensitive
data.

UNIT – 5

Pandas Introduction

Pandas is a powerful and open-source Python library. The Pandas library is used for data manipulation
and analysis. Pandas consists of data structures and functions to perform efficient operations on data.

What is Pandas Library in Python?


Pandas is a powerful and versatile library that simplifies the tasks of data manipulation in Python.
Pandas is well-suited for working with tabular data, such as spreadsheets or SQL tables.
The Pandas library is an essential tool for data analysts, scientists, and engineers working with
structured data in Python.

What is Python Pandas used for?


• Data set cleaning, merging, and joining.
• Easy handling of missing data (represented as NaN) in floating point as well as non-
floating point data.
• Columns can be inserted and deleted from DataFrame and higher-dimensional objects.
• Powerful group by functionality for performing split-apply-combine operations on data
sets.
• Data Visualization.
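
A short sketch of two of these capabilities, missing-data handling and group by, is shown below; the small department table is made up for the example.

import pandas as pd
import numpy as np

# Hypothetical table with a missing salary value (NaN)
df = pd.DataFrame({
    "dept": ["IT", "IT", "HR", "HR"],
    "salary": [50000, np.nan, 42000, 45000],
})

# Handle missing data: fill the NaN with the column mean
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Split-apply-combine: average salary per department
print(df.groupby("dept")["salary"].mean())
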
Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, Python objects, etc.). The axis labels are collectively called indexes.
The Pandas Series is nothing but a column in an Excel sheet. Labels need not be unique but must be
of a hashable type.

import pandas as pd
import numpy as np

# Creating empty series


ser = pd.Series()
print("Pandas Series: ", ser)

# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])

ser = pd.Series(data)
print("Pandas Series:\n", ser)

Pandas DataFrame
Pandas DataFrame is a two-dimensional data structure with labeled axes (rows and columns).
import pandas as pd

# Calling DataFrame constructor


df = pd.DataFrame()
print(df)

# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']

# Calling DataFrame constructor on list


df = pd.DataFrame(lst)
print(df)
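
In addition to a list, a DataFrame is commonly built from a dictionary of columns, which makes the labeled axes explicit; the sketch below uses made-up data and also shows inserting and deleting a column.

import pandas as pd

# DataFrame from a dictionary: keys become the column labels
data = {
    "Name": ["Tom", "Nick", "Krish"],
    "Age": [20, 21, 19],
}
df = pd.DataFrame(data)

# Insert and delete columns on the labeled axes
df["City"] = ["Delhi", "Pune", "Chennai"]  # insert a new column
del df["Age"]                              # delete an existing column
print(df)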
