Big Data Unit 2:
Mining Data Streams
Data Stream: A data stream is a real-time, continuous, ordered (implicitly by arrival
time or explicitly by timestamp) sequence of items. It is not feasible to control the order in
which items arrive, nor is it feasible to store a stream locally in its entirety. Streams carry
enormous volumes of data, and items arrive at a high rate.
Types of Data Streams :
Data stream: A data stream is a (possibly unbounded) sequence of tuples. Each tuple
comprises a set of attributes, similar to a row in a database table.
Transactional data stream: It is a log of interactions between entities
• Credit card – purchases by consumers from merchants
• Telecommunications – phone calls by callers to the dialed parties
• Web – accesses by clients of information at servers
Measurement data streams
• Sensor Networks – physical or natural phenomena, road traffic
• IP Network – traffic at router interfaces
• Earth climate – temperature, humidity level at weather stations
Examples of Stream Sources-
• Sensor Data: Sensor data is used, for example, in navigation systems. Imagine a
temperature sensor floating in the ocean, sending a reading of the surface temperature
back to the base station each hour. The data generated by this sensor is a stream
of real numbers. With many such sensors reporting at higher rates, the volume can reach
3.5 terabytes per day, and we need to think about what can be kept in working storage
and what can only be archived.
• Image Data: Satellites often send down to Earth streams containing many
terabytes of images per day. Surveillance cameras generate images with lower
resolution than satellites, but there can be many of them, each producing a
stream of images at intervals of about 1 second.
• Internet and Web Traffic: A switching node in the middle of the Internet receives
streams of IP packets from many inputs and routes them to its outputs. Websites
receive streams of various types. For example, Google receives a hundred
million search queries per day.
Characteristics of Data Streams :
• Large volumes of continuous data, possibly infinite.
• Continuously changing data that requires a fast, real-time response.
• The data stream model captures today's data processing needs well.
• Random access is expensive, so single-scan (one-pass) algorithms are needed.
• Only a summary of the data seen so far is stored.
• Most stream data are low-level or multidimensional in nature and
need multilevel and multidimensional processing.
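The "single scan" and "store only a summary" characteristics can be illustrated with a running summary that is updated once per item and never revisits old data. This is a minimal sketch using Welford's one-pass method for mean and variance; the class name `RunningSummary` and the sample readings are assumptions made for illustration, not part of the notes.

```python
class RunningSummary:
    """Keeps count, mean and variance of a stream in constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Welford's update: each item is seen exactly once.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the items seen so far.
        return self._m2 / self.count if self.count else 0.0


summary = RunningSummary()
for reading in [20.0, 22.0, 24.0]:  # hypothetical sensor readings
    summary.update(reading)
print(summary.count, summary.mean, summary.variance)
```

The summary occupies three numbers regardless of how long the stream runs, which is the point of the one-pass model.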
Applications of Data Streams :
• Fraud detection
• Real-time trading of goods
• Consumer and enterprise applications
• Monitoring and reporting on internal IT systems
Advantages of Data Streams :
• Streaming data helps in improving sales
• Helps in detecting errors
• Helps in minimizing costs
• Provides the details needed to react swiftly to risk
Disadvantages of Data Streams :
• Lack of security of data in the cloud
• Dependence on the cloud provider
• Off-premises storage of data introduces the potential for disconnection
Stream Data Model and Architecture: A streaming data architecture is a dedicated network
of software components capable of processing large amounts of stream data from many
sources. Unlike conventional data architecture solutions, which focus on batch reading and
writing, a streaming data architecture takes data as it is generated in its raw form, stores it,
and may incorporate different components for real-time data processing and manipulation.
An effective streaming architecture must be designed for the particular characteristics of data
streams, which tend to generate large amounts of structured and semi-structured data that
require filtering and pre-processing to be useful.
Due to its complexity, stream processing cannot be handled by a single ETL (extract,
transform, and load) tool or database. That’s why organizations need to adopt solutions
consisting of multiple building blocks that can be combined with data pipelines within the
organization’s data architecture.
Although stream processing was initially considered a niche technology, it is hard to
find a modern business that does not have an eCommerce site, an online advertising strategy,
an app, or products enabled by IoT.
Each of these digital assets generates real-time event data streams, thus fueling the
need to implement a streaming data architecture capable of handling powerful, complex, and
real-time analytics.
Stream Computing: The word stream in stream computing means pulling in
streams of data, processing the data, and streaming it back out as a single flow. Stream
computing uses software algorithms that analyze the data in real time as it streams in, to
increase speed and accuracy in data handling and analysis.
• Stream computing is a computing paradigm that reads data from collections of
software or hardware sensors in stream form and computes continuous data
streams.
• Stream computing uses software programs that compute continuous data
streams.
• Stream computing uses software algorithms that analyze the data in real time.
• Stream computing is one effective way to support Big Data by providing extremely
low latency with massively parallel processing architectures.
• It is becoming the fastest and most efficient way to obtain useful knowledge from
Big Data.
Examples: In June 2007, IBM announced its stream computing system, called System
S. This system runs on 800 microprocessors, and the System S software enables
software applications to split up tasks and then reassemble the data into an answer.
ATI Technologies also announced a stream computing technology that enables
graphics processors (GPUs) to work in conjunction with
high-performance, low-latency CPUs to solve complex computational problems. ATI’s
stream computing technology is derived from a class of applications that run on the
GPU instead of a CPU.
Sampling Data in a Stream: Stream sampling is the process of collecting a representative
sample of the elements of a data stream. The sample is usually much smaller than the entire
stream, but can be designed to retain many important characteristics of the stream, and can
be used to estimate many important aggregates on the stream. Unlike sampling from a
stored data set, stream sampling must be performed online, as the data arrives. Any
element that is not stored within the sample is lost forever and cannot be retrieved.
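One well-known way to sample online under these constraints is reservoir sampling, which keeps a fixed-size uniform sample without knowing the stream length in advance. The notes do not name a specific method, so the following is offered only as an illustrative sketch; the function name `reservoir_sample` and its parameters are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randrange(n)        # uniform integer in [0, n)
            if j < k:
                reservoir[j] = item     # keep the new item with probability k/n
    return reservoir


sample = reservoir_sample(range(1_000_000), k=5, rng=random.Random(42))
print(sample)
```

Each element is examined once and then either stored or discarded, matching the point that anything not kept in the sample is lost.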
Filtering Streams: Stream filtering is one of the most useful and practical approaches to
efficient stream evaluation, whether it is done implicitly by the system to guarantee the
stability of the stream processing under overload conditions, or explicitly by the evaluating
procedure. In this section we will review some of the filtering techniques commonly used in
data stream processing.
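As one concrete illustration of explicit stream filtering, a Bloom filter tests whether an arriving item belongs to an allowed set using a small bit array. The notes do not name this technique, so treat it as an assumed example; the class `BloomFilter`, its sizes, and the e-mail addresses are hypothetical. A Bloom filter never rejects a member of the set, but it may occasionally let a non-member through.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive several positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True for every added item; rarely True for items never added.
        return all(self.bits[pos] for pos in self._positions(item))


allowed = BloomFilter()
for address in ["alice@example.com", "bob@example.com"]:
    allowed.add(address)

stream = ["alice@example.com", "spam@example.net", "bob@example.com"]
passed = [item for item in stream if allowed.might_contain(item)]
print(passed)
```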
Filtering Techniques in Data Mining draw on three disciplines: Machine Learning
techniques, Statistical Models, and Deep Learning algorithms. Using these
methods, Data Mining professionals work out how to process huge amounts of data
and draw conclusions from them.
• Tracking Patterns: Tracking patterns is one of the most basic Filtering Techniques
in Data Mining. It helps recognize aberrations in data or the ebb and flow of a
variable. Pattern tracking can show, for instance, that a product is ordered more by a
particular demographic. A brand can use this to better stock the original product for this
demographic or create similar products. For example, you can identify trends in sales
data and capitalize on those insights.
• Classification: Classification Filtering Techniques in Data Mining are used to
categorize or classify related data after identifying the main characteristics of data
types. You can classify data by various criteria such as type of data sources mined,
database involved, the kind of knowledge discovered, and more.
• Clustering: Clustering Filtering Techniques in Data Mining identify similar data and
divide information into groups of connected objects (clusters) based on their
characteristics. The data is then modelled by its clusters. Clustering helps in
scientific data exploration, text mining,
spatial database applications, information retrieval, CRM, medical diagnostics, and
much more. You can recognize the differences and similarities in the data with this
method. Clustering is similar to classification, but it groups chunks of data
by their similarities rather than by predefined categories.
• Visualization: Data Visualizations are another element of Data Mining; these
Filtering Techniques in Data Mining provide information about data based on
sensory perceptions. Today’s Data Visualizations are dynamic and helpful in
Streaming Data in real-time, characterized by various colours to reveal different
trends and patterns. Dashboards are powerful tools to uncover data mining
insights. Organizations can base dashboards on multiple metrics and use
visualizations to highlight patterns in data instead of numerical models.
• Association: Association is a Filtering Technique in Data Mining, related to tracking
patterns and statistics. It signifies that certain data events are associated with
other data-driven events. It’s like the co-occurrence notion in Machine Learning,
where the presence of one data-driven event indicates the likelihood of another.
The notion of association also indicates a relationship between two data
events.
• Regression: Regression Filtering Techniques in Data Mining are used
as a form of planning and modelling, estimating the likely value of a certain
variable when other variables are known. Their primary focus is to uncover the
relationship between variables in a given dataset. For example, you could use regression to
project prices based on consumer demand, availability, and competition.
• Prediction: The Prediction Filtering Techniques in Data Mining are about finding
patterns in historical and current data to extend them into future predictions,
providing insights into what might happen next. For example, reviewing
consumers’ past purchases and credit histories to predict whether they’ll be a
credit risk in the future.
• Neural Networks: Primarily used for deep learning algorithms, the neural network
filtering techniques in the Data Mining process mimic the human brain’s
interconnectivity. They have various layers of nodes where each node is made up
of weights, inputs, a bias, and an output. Neural networks can
be a powerful tool in Data Mining but should be used with caution, as these
models are incredibly complex.
• Decision Tree: A decision tree is a predictive
model that uses Regression or Classification methods to classify potential
outcomes. It uses a tree-like structure to represent the possible outcomes.
Decision trees enable companies to understand how
their data inputs affect the output.
• K-Nearest Neighbor (KNN): KNN is a nonparametric Filtering Technique in Data Mining
that classifies data points based on their association with and proximity to other available
data. The algorithm assumes that similar data points can be
found near each other. It calculates the distance between data
points and assigns a category based on the most frequent type, or a value based on the average.
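The KNN description above (compute distances, then take the most frequent category among the nearest points) can be sketched in a few lines. This is an illustrative sketch only; the function name `knn_classify`, the Euclidean distance choice, and the training points are assumptions.

```python
from collections import Counter
import math


def knn_classify(train, query, k=3):
    """train: list of (point, label) pairs; points and query are equal-length tuples."""
    # Sort the training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Assign the most frequent label among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]


train = [((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
         ((8.0, 8.0), "high"), ((9.0, 7.5), "high"), ((8.5, 9.0), "high")]
print(knn_classify(train, (1.2, 1.4)))  # prints "low"
print(knn_classify(train, (8.7, 8.1)))  # prints "high"
```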
Counting Distinct Elements in a Stream
Naive Approach for finding the count of distinct numbers
• For every index i from 0 to N - K (N is the length of the array and K the window
size), traverse the array from i to i + K - 1 using another loop. This is the window
• For each index in the window, scan from i up to that index and check whether the
element is already present
• If the element is not present in that prefix of the window, i.e., no duplicate element is
present from i to index-1, then increase the count, else ignore it
• Print the count
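The naive steps above can be written directly as three nested loops. This is a sketch of that description only; the function name `count_distinct_naive` and the sample array are assumptions.

```python
def count_distinct_naive(arr, k):
    """Return the number of distinct elements in every window of size k."""
    counts = []
    for i in range(len(arr) - k + 1):      # window starts: 0 .. N-K
        count = 0
        for j in range(i, i + k):          # positions inside the window
            # Look for arr[j] in the part of the window before position j.
            seen_before = False
            for p in range(i, j):
                if arr[p] == arr[j]:
                    seen_before = True
                    break
            if not seen_before:
                count += 1
        counts.append(count)
    return counts


print(count_distinct_naive([1, 2, 1, 3, 4, 2, 3], 4))  # prints [3, 4, 4, 3]
```

The cost grows with N × K² comparisons, which is why this approach is called naive; keeping a hash set per window gives the same counts with far fewer comparisons.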
Estimating Moments
• Estimating moments is a generalization of the problem of counting distinct
elements in a stream. The problem, called computing "moments," involves the
distribution of frequencies of different elements in the stream.
• Suppose a stream consists of elements chosen from a universal set. Assume the
universal set is ordered so we can speak of the ith element for any i.
• Let m_i be the number of occurrences of the ith element for any i. Then the kth-order
moment of the stream is the sum over all i of (m_i)^k
Example :-
• The 0th moment is the sum of 1 for each m_i that is greater than 0, i.e., the 0th moment
is a count of the number of distinct elements in the stream.
• The 1st moment is the sum of the m_i’s, which must be the length of the stream.
Thus, first moments are especially easy to compute, i.e., just count the length of
the stream seen so far.
• The second moment is the sum of the squares of the m_i’s. It is sometimes called
the surprise number, since it measures how uneven the distribution of elements in
the stream is.
• To see the distinction, suppose we have a stream of length 100, in which eleven
different elements appear. The most even distribution of these eleven elements
would have one appearing 10 times and the other ten appearing 9 times each.
• In this case, the surprise number is 10^2 + 10 × 9^2 = 910. At the other extreme, one
of the eleven elements could appear 90 times and the other ten appear 1 time
each. Then, the surprise number would be 90^2 + 10 × 1^2 = 8110.
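The two surprise numbers above can be checked with a short script. This sketch computes moments exactly by keeping a count for every element, which a real stream algorithm would avoid by estimating instead; the function name `moment` and the letter-labelled streams are assumptions for illustration.

```python
from collections import Counter


def moment(stream, k):
    """k-th order moment: sum over distinct elements of (occurrence count) ** k."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


# Eleven distinct elements, stream length 100 in both cases.
even = ["a"] * 10 + [ch for ch in "bcdefghijk" for _ in range(9)]
skewed = ["a"] * 90 + list("bcdefghijk")

print(moment(even, 0), moment(even, 1), moment(even, 2))        # prints 11 100 910
print(moment(skewed, 0), moment(skewed, 1), moment(skewed, 2))  # prints 11 100 8110
```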
Decaying Window:
• This algorithm allows you to identify the most popular elements (trending, in other
words) in an incoming data stream.
• The decaying window algorithm not only tracks the most recurring elements in an
incoming data stream, but also discounts any random spikes or spam requests that
might have boosted an element’s frequency.
Algorithm: In a decaying window, you assign a score or weight to every element of the
incoming data stream. You then calculate the aggregate sum for each
distinct element by adding all the weights assigned to that element. The element with
the highest total score is listed as trending or the most popular.
o Assign each element with a weight/score.
o Calculate aggregate sum for each distinct element by adding all the weights
assigned to that element.
Advantages of Decaying Window Algorithm:
• Sudden spikes or spam data are handled.
• Newer elements are given more weight by this mechanism, to achieve the right
trending output.
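The notes do not say how the weights are chosen. A common choice, assumed here, is exponential decay: on every arrival all existing scores are multiplied by (1 - c) for a small constant c, and the arriving element's score is increased by 1. The function name `trending`, the value of `c`, and the drop `threshold` are all assumptions in this sketch.

```python
def trending(stream, c=0.1, threshold=0.5):
    """Exponentially decaying scores: older arrivals count less than recent ones."""
    scores = {}
    for item in stream:
        # Decay every existing score, then drop the ones that became negligible.
        for key in list(scores):
            scores[key] *= (1 - c)
            if scores[key] < threshold:
                del scores[key]
        # The arriving element gets full weight 1.
        scores[item] = scores.get(item, 0.0) + 1.0
    return scores


stream = ["x"] * 5 + ["y"] * 3 + ["z"] + ["y"] * 3
scores = trending(stream)
print(max(scores, key=scores.get))  # prints "y"
```

In this run "x" appears five times early on and "y" six times more recently, so "y" ends with the higher score; a single isolated "z" never gets close, which is how one-off spikes are discounted.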
Real time Analytics: Real-time analytics lets businesses gain insight and take
action on data immediately or soon after the data enters their system. Real-time
analytics applications answer queries within seconds. They handle large amounts of data with high
velocity and low response time. For example, real-time big data analytics uses data in
financial databases to inform trading decisions. Analytics can be on-demand or
continuous. On-demand analytics delivers results when the user requests them. Continuous
analytics updates users as events happen and can be programmed to respond automatically to
certain events. For example, real-time web analytics might alert an administrator if
page load performance goes outside the preset boundary.
Advantages:
• Create your own interactive analytics tools.
• Transparent dashboards allow users to share information.
• Monitor behaviour in a way that is customized.
• Perform immediate adjustments if necessary.
• Make use of machine learning.
Real time Analytics Platform (RTAP)
• A real-time analytics platform enables organizations to make the most out of real-
time data by helping them to extract the valuable information and trends from it.
• Such platforms help in measuring data from the business point of view in real
time, further making the best use of data.
• An ideal real-time analytics platform would help in analyzing the data, correlating
it and predicting the outcomes on a real-time basis.
• The real-time analytics platform helps organizations in tracking things in real time,
thus helping them in the decision-making process.
• The platforms connect the data sources for better analytics and visualization.
• Real time analytics is the analysis of data as soon as that data becomes available.
In other words, users get insights or can draw conclusions immediately after the data
enters their system.
Applications:
• Real time credit scoring, helping financial institutions to decide immediately
whether to extend credit.
• Customer relationship management (CRM), maximizing satisfaction and
business results during each interaction with the customer.
• Fraud detection at points of sale.
• Targeting individual customers in retail outlets with promotions and
incentives, while the customers are in the store and next to the
merchandise.
Case Studies:
• Big data at Netflix: Netflix implements data analytics models to discover customer
behavior and viewing patterns. Using this information, it recommends movies
and TV shows to its customers. That is, it analyzes each customer’s choices and
preferences and suggests shows and movies accordingly. According to Netflix,
around 75% of viewer activity is based on personalized recommendations. Netflix
collects enough data to create a detailed profile of each of its
subscribers. These profiles help the company know its customers better
and grow the business.
• Big data at Google: Google uses Big data to optimize and refine its core search and
ad-serving algorithms. And Google continually develops new products and services
that have Big data algorithms. Google generally uses Big data from its Web index
to initially match the queries with potentially useful results. It uses machine-
learning algorithms to assess the reliability of data and then ranks the sites
accordingly. Google optimized its search engine to collect the data from us as we
browse the Web and show suggestions according to our preferences and interests.
• Big data at LinkedIn: LinkedIn is mainly for professional networking. It generally
uses Big data to develop product offerings such as people you may know, who
have viewed your profile, jobs you may be interested in, and more. LinkedIn uses
complex algorithms, analyzes the profiles, and suggests opportunities according to
qualification and interests. As the network grows moment by moment, LinkedIn’s
rich trove of information also grows more detailed and comprehensive.
• Big Data at Uber: Uber is the first choice for many people around the world when they
think of moving people and making deliveries. It uses the personal data of the user
to closely monitor which features of the service are used most, to analyze usage
patterns, and to determine where the services should be more focused. Uber
focuses on the supply of and demand for its services, which causes the prices of the
services to change. Therefore one of Uber’s biggest uses of data is surge
pricing. For instance, if you are running late for an appointment and you book a
cab in a crowded place, you must be ready to pay twice the usual amount. For
example, on New Year’s Eve, the price for driving one mile can go from 200 to
1000. In the short term, surge pricing affects the rate of demand, while long-term
use could be the key to retaining or losing customers. Machine learning algorithms
are used to determine where the demand is strong.
Real Time Sentiment Analysis: Real-time Sentiment Analysis is a machine learning
technique that automatically recognizes and extracts the sentiment in a text as soon as it
appears. It is most commonly used to analyse brand and product mentions in live social
comments and posts. An important thing to note is that real-time sentiment analysis can be
done only on social media platforms that share live feeds, as Twitter does.
The real-time sentiment analysis process uses several ML tasks such as natural
language processing, text analysis, semantic clustering, etc., to identify opinions expressed
about brand experiences in live feeds and extract business intelligence from them.
Need for real-time sentiment analysis
• Live social feeds from video platforms like Instagram or Facebook
• Real-time sentiment analysis of text feeds from platforms such as Twitter is
helpful in detecting threats such as cyberbullying.
• Live monitoring of Influencer live streams.
• Live video streams of interviews, news broadcasts, seminars, panel
discussions, speaker events, and lectures.
• Live audio streams such as in virtual meetings on Zoom or Skype, or at
product support call centers for customer feedback analysis.
• Live monitoring of product review platforms for brand mentions.
Way to do it
• Step 1 - Data collection
• Step 2 - Data processing
• Step 3 - Data analysis
• Step 4 - Data visualization
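The four steps can be sketched as a tiny pipeline. Production systems use trained ML models for the analysis step; the word lists below are a hypothetical stand-in so that the example stays self-contained, and the names `POSITIVE`, `NEGATIVE`, `score`, and `analyze` are assumptions.

```python
from collections import Counter

# Hypothetical, tiny sentiment lexicon for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "slow"}


def score(text):
    """Steps 2 and 3: tokenize the text and label it by counting lexicon hits."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"


def analyze(feed):
    """Consume posts one at a time and keep a running tally (input to step 4)."""
    tally = Counter()
    for post in feed:  # step 1: posts arrive from the live feed
        tally[score(post)] += 1
    return tally


feed = ["I love this phone, great camera!",
        "Terrible battery and slow updates.",
        "It arrived on Tuesday."]
print(analyze(feed))
```

The running tally is what a dashboard would plot in the visualization step.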
Stock Market Prediction: The stock market is full of uncertainty and is affected by
many factors. Hence stock market prediction is one of the important tasks in
finance and business. There are two types of analysis possible for prediction, technical and
fundamental. Here both technical and fundamental analysis are considered.
Technical analysis is done using historical data of stock prices by applying machine learning,
and fundamental analysis is done using social media data by applying sentiment analysis.
Social media data has a higher impact today than ever, and it can aid in predicting the trend of the
stock market. The method involves collecting news and social media data and extracting
the sentiments expressed by individuals. Then the correlation between the sentiments and the
stock values is analyzed. The learned model can then be used to make future predictions
about stock values. This method is reported to predict the sentiment, and the stock
performance is closely correlated with recent news and social media data.
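The correlation step mentioned above can be illustrated with the Pearson correlation coefficient between a daily sentiment score and the daily stock return. The notes do not specify which correlation measure is used, so Pearson is an assumption here, and the numbers below are invented for illustration, not real market data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, invented for illustration (not real market data).
daily_sentiment = [0.2, -0.1, 0.4, 0.0, -0.3]
daily_return = [0.5, -0.2, 0.9, 0.1, -0.6]
print(round(pearson(daily_sentiment, daily_return), 3))
```

A value near +1 means that sentiment and returns moved together on these days; a value near 0 means there is no linear relationship.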