Introduction to Big Data:
Big Data is a collection of data that is huge in volume and keeps growing exponentially with time. Data
measured in petabytes (10^15 bytes) and beyond is commonly called Big Data. It is often stated that almost
90% of today’s data has been generated in the past few years.
Big Data is bringing about changes in our lives because it allows diverse and heterogeneous
data to be fully integrated and analysed to help us make decisions.
Big Data is the term for a collection of data sets so large and complex that they become difficult
to process using on-hand database system tools or traditional data processing applications.
Examples:
90% of the world’s data has been created in the last two years.
Walmart handles more than 1 million customer transactions every hour.
Facebook stores, accesses and analyses more than 30 petabytes of user-generated data.
More than 230 million tweets are created every day.
Why is Big Data so important?
Big Data is very important for organizations ranging from medium-sized to large because it enables
them to gather, store, manage and manipulate extremely large volumes of data, arriving at extremely
high velocity and in an extremely wide variety:
At the right speed.
At the right time.
To get the required business value.
Difference between Traditional Data and Big Data:
Traditional data                                        | Big Data
--------------------------------------------------------+-------------------------------------------------------------
The data is “structured”.                               | The data is “unstructured or semi-structured”.
The size of the data is very small.                     | The size is more than the traditional data size.
The data is centralized.                                | The data is distributed.
It is easy to work with or manipulate.                  | It is difficult to handle the data.
Normal system configuration is sufficient to process.   | High system configuration is required to process the data.
Traditional database tools are enough.                  | Special kinds of tools are required.
Normal functions are enough to manipulate the data.     | Requires special kinds of functions to manipulate the data.
e.g. SQL Server, Oracle (designed for structured data). | e.g. Hadoop, MapReduce.
5 V’s of Big Data:
1. Volume:
Refers to the amount of data generated and stored.
Big Data deals with huge datasets ranging from terabytes to petabytes and beyond.
Example: Social media platforms like Facebook generate hundreds of terabytes of data
daily from posts, images and videos.
2. Velocity:
Represents the speed at which data is generated, collected and processed.
Many applications require real-time or near real-time data processing for quick
decision-making.
Example: Stock market trading systems process millions of transactions per second.
3. Variety:
Denotes the different types of data formats (structured, semi-structured,
unstructured).
Traditional databases handle structured data, while big data includes text, images,
videos, logs, sensor data etc.
Example: Healthcare data comes from patient records, MRI scans, wearable devices
and doctor’s notes.
4. Veracity:
Refers to the quality, accuracy and reliability of data.
Since data comes from multiple sources, it may be incomplete, inconsistent or
biased, requiring cleansing and validation.
Example: Fake news on social media can lead to misinformation, affecting public
perception.
5. Value:
Represents the usefulness and insights derived from data.
Collecting data is meaningless unless it provides business benefits, improves
efficiency or drives innovation.
Example: E-commerce platforms like Amazon use big data to recommend products
based on user behaviour, increasing sales.
Benefits of Big Data:
1. Better decision making:
Rather than making decisions arbitrarily, companies now consult big data analytics before
reaching any decision. Big Data Analytics has boosted the decision-making process to a great
extent.
2. Big Data in greater innovations:
Big Data Analytics is used by various firms to create new products and services for their
customers. Through big data, companies analyse customers’ opinions about their products and
how those products are perceived.
3. Big Data in Educational Sector:
Big Data benefits the educational sector by helping to manage data related to students. Analysing
students’ capabilities based on this data can help teachers nurture their future in a
better way.
4. Big Data in product price optimization:
Through big data, companies analyse the prices that have yielded the maximum profits to
them under various historic market conditions. Through big data solutions, they set their
product’s price according to the customer’s willingness to pay under different circumstances.
5. Big Data in Recommendation Engine:
Online searching has been made easy with the help of recommendation engines that use Big
Data Analytics. Companies analyse every customer’s data and then make recommendations
accordingly. These recommendations are largely based on what the customer did during their
last visit to the platform and on their real-time activity.
6. Big Data in Healthcare Industry:
Big data enhances overall operational efficiency of healthcare companies. Big Data Analytics
would allow them to find a better cure for a disease by recognizing unknown connections
and hidden patterns.
7. Fraud Detection:
Customer information can be analysed to predict general trends and spot fraudulent
behaviour.
8. Agriculture:
Big Data provides granular data on rainfall patterns and water cycles, and enables farmers to
make smart decisions such as which crops to plant for better profitability and when to
harvest.
Types of Big Data Analytics:
Big Data Analytics is the use of advanced analytical techniques against very large, diverse
datasets that include structured, semi-structured and unstructured data from different sources
and of different sizes.
1. Descriptive Analysis:
As the name suggests, it describes. It explains what is happening based on incoming
data.
e.g. The details filled in a form are descriptive data.
2. Predictive Analysis:
As the name suggests, it predicts. It forecasts what might happen in the future based
on data trends and patterns.
3. Prescriptive Analysis:
Determines the best course of action based on data insights. It goes beyond prediction by
recommending actions to achieve desired outcomes.
e.g. Google’s self-driving cars analyse sensor data, traffic patterns and road conditions to
make real-time driving decisions. If an obstacle is detected, the system prescribes actions
like slowing down, changing lanes or stopping to ensure safety.
4. Diagnostic Analysis:
Examines the data in detail to diagnose why something happened.
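A minimal sketch of how the four types of analysis differ, using a made-up list of daily sales figures. The data, the function names and the stock threshold are illustrative assumptions and do not come from any particular analytics tool.

```python
from statistics import mean

# Hypothetical daily sales figures (units sold), made up for illustration.
daily_sales = [100, 110, 120, 130, 140]


def describe(values):
    """Descriptive: summarise what has happened."""
    return {"count": len(values), "mean": mean(values), "max": max(values)}


def diagnose(values):
    """Diagnostic: inspect day-to-day changes to see where a shift occurred."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def predict_next(values):
    """Predictive: extend a least-squares straight line by one step."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    slope = numerator / denominator
    return y_mean + slope * (n - x_mean)


def prescribe(forecast, stock_on_hand):
    """Prescriptive: recommend an action based on the forecast."""
    return "reorder stock" if forecast > stock_on_hand else "no action needed"
```

Descriptive and diagnostic analysis look backwards at existing data, predictive analysis extrapolates from it, and prescriptive analysis turns the forecast into a recommended action.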
Big Data Architecture:
It is designed to handle the ingestion, processing and analysis of data that is too large or
complex for traditional database systems.
Ingestion: Used for data capture. It collects different types of data from different sources or
platforms and identifies whether the data is structured or unstructured and where it comes from.
Internal data: Data from the organization’s built-in systems.
External data: Data from pen drives or other external sources.
Streaming data: Live data that is consumed as it arrives rather than read from storage.
Data Storage: It is used to store data, whereas real-time message ingestion is used to capture and
store real-time data.
Batch Processing: Stored data is handed to the batch processing layer, where it is divided into
batches. The processed data is passed to the analytical data store for analysis before being
forwarded for further processing or insights.
e.g. When a 50 MB video recording from a camera is uploaded as a WhatsApp status, it is
automatically compressed to 5-6 MB due to processing in the analytical data store. This
happens because an algorithm or compression technique is applied which reduces the file
size while maintaining acceptable quality.
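The idea of dividing stored data into batches can be sketched as follows. This is a minimal illustration, assuming the stored data fits in a list; `split_into_batches` and `process_batch` are hypothetical names, and the "processing" step is just a sum.

```python
def split_into_batches(records, batch_size):
    """Yield consecutive batches of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def process_batch(batch):
    """Placeholder processing step: here, simply the batch total."""
    return sum(batch)


# Hypothetical stored data: the numbers 1 to 10.
stored_records = list(range(1, 11))
batch_totals = [process_batch(b) for b in split_into_batches(stored_records, 4)]
```

Each batch is processed independently, so in a real system the batches could be handed to different workers.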
Machine Learning: It processes both batch and streaming data. It analyses data in batches at
scheduled intervals and also processes streaming data for instant insights.
e.g. During video streaming, if internet speed drops or data runs out, the system automatically
lowers video quality to ensure smooth playback. Similarly, if a photo upload fails but still
shows as "processing", data analytics and reporting tools help track details like device,
location and upload time.
Orchestration: Automates workflows (eliminating the need for manual intervention), ensures
that tasks run in the correct sequence, and handles their coordination and management.
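Running tasks in the correct sequence can be sketched with a dependency graph. The example below is only an illustration: the task names are hypothetical, and it uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+) rather than any real orchestration tool.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "ingest": set(),
    "store": {"ingest"},
    "batch_process": {"store"},
    "analyse": {"batch_process"},
    "report": {"analyse"},
}


def run_workflow(dependencies, actions):
    """Run every task once, only after all of its dependencies have run."""
    completed = []
    for task in TopologicalSorter(dependencies).static_order():
        actions[task]()  # no manual intervention: the sorter decides the order
        completed.append(task)
    return completed


# Each hypothetical action just records that it ran.
log = []
actions = {name: (lambda name=name: log.append(name)) for name in workflow}
order = run_workflow(workflow, actions)
```

If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is one way an orchestrator can reject an invalid workflow.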
Big Data Components:
1. Data Capture: It refers to the process of collecting data from a variety of sources. This
includes everything from social media to sensor readings.
2. Data Storage: It is the process of storing the data in a way that makes it accessible for
future analysis.
3. Data Processing: This is where algorithms are used to analyse the data and extract
insights.
4. Data Visualization: It is a process of representing the data in a way that is easy for
humans to understand.
e.g. Flow charts, use-case diagrams, graphs and charts are all forms of data visualization.
Challenges of Big Data:
1. Quick Data Growth: The amount of data being stored in the data centers and databases of
companies is increasing rapidly. As these datasets grow exponentially with time, they become
extremely difficult to handle.
2. Storage: Such large amounts of data are difficult for organizations to store and manage
without appropriate tools and technology.
3. Syncing across data sources: When an organization imports data from different data
sources, data from one source might not be up to date compared with data from another
source.
4. Security: Securing these huge datasets is one of the daunting challenges of Big Data.
Some big data stores can be attractive targets for hackers or advanced persistent threats.
5. Unreliable Data: Big data cannot be completely accurate and may contain some
redundant or incomplete data.
6. Miscellaneous Challenges: More challenges exist, such as generating insights in a timely
manner or recruiting and retaining big data professionals.
Data Stream Management System (DSMS):
It is a specialized system designed to process and manage continuous data streams in real-
time. Unlike traditional database management systems (DBMS) that store and process static
data, a DSMS continuously ingests, analyses, and queries dynamic data streams.
Key features:
Handles continuous data streams in real-time with low latency.
Queries run continuously on incoming data instead of one-time execution.
Ensures reliable processing even in cases of failures.
Can process large-scale data streams efficiently.
Works with various data sources like IoT devices, social media feeds, and transaction
logs.
Components:
1. Data Stream: A continuous flow of data coming from sources like sensors, social media,
or transactions. It never stops and keeps updating in real-time.
2. Stream processor: The brain of the DSMS. It processes incoming data, applies filters,
aggregates information, and runs computations in real-time.
3. Standing queries: Queries that run continuously on streaming data, updating results as
new data arrives. Example: A query that always shows the average temperature from
sensors.
4. Ad hoc queries: One-time queries that analyse the current data stream. Example: A user
asks, "What was the peak website traffic in the last hour?"
5. Archival storage: A place where old data is stored permanently for historical analysis and
backup. Example: A database keeping records of all financial transactions.
6. Limited working storage: A small temporary memory space used to process real-time
data, as storing everything is impossible. Example: Only keeping the last 10 minutes of
sensor readings to detect trends.
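The components above can be sketched together in a few lines. This is a minimal illustration, not a real DSMS: the class name `StandingAverage`, its methods and the 10-minute window are assumptions. The standing query is the average that updates on every reading, the ad hoc query is `peak()`, and the deque plays the role of limited working storage.

```python
from collections import deque


class StandingAverage:
    """Standing query: average of the readings from the last window_seconds."""

    def __init__(self, window_seconds):
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self.readings = deque()  # limited working storage: (timestamp, value)
        self.total = 0.0

    def on_reading(self, timestamp, value):
        """Ingest one reading and return the updated average."""
        self.readings.append((timestamp, value))
        self.total += value
        # Evict readings that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.readings[0][0] <= cutoff:
            _, old_value = self.readings.popleft()
            self.total -= old_value
        return self.total / len(self.readings)

    def peak(self):
        """Ad hoc query: highest value currently held in working storage."""
        return max(value for _, value in self.readings)
```

For example, with `query = StandingAverage(600)`, each call to `query.on_reading(timestamp, temperature)` returns the average over the last 10 minutes, and anything older is discarded instead of being kept in memory. Archival storage would be a separate, permanent store and is not modelled here.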
Drivers of Big Data:
Big data is driven by several key factors that make it grow and become more important.
More Data Sources: Every day, people and machines create huge amounts of data
through social media, online shopping, smart devices, and sensors. The more sources we
have, the bigger the data gets.
Faster Internet & Technology: With better internet speeds and advanced technologies
like cloud computing, data can be collected, stored, and processed quickly.
Cheaper Storage: Storing large amounts of data used to be expensive, but now it's much
cheaper, allowing companies to keep and analyse more information.
Artificial Intelligence (AI) & Machine Learning: AI systems learn from big data,
improving their accuracy and making predictions, which in turn drives the need for even
more data.
The Internet of Things (IoT): Smart devices like fitness trackers, home assistants, and
self-driving cars are constantly generating data, adding to the big data explosion.
Data Stream Models:
A data stream model is a way to handle and process continuous, fast-flowing data in real
time. Unlike traditional databases, where data is stored and then analysed, data stream
models analyse data as it arrives.
Types of Data Stream Models:
1. Time-Based Model:
Data is processed based on time intervals (e.g., every 10 seconds).
Example: Stock market price updates every second.
2. Count-Based Model:
Processes data after receiving a fixed number of items.
Example: Analysing customer reviews after every 100 entries.
3. Sliding Window Model:
Keeps a limited amount of recent data for analysis.
Example: Monitoring website visitors in the last 10 minutes.
4. Tumbling Window Model:
Divides data into fixed chunks and processes each batch separately.
Example: Analysing sales every hour without overlap.
5. Sketch-Based Model:
Uses approximations to handle large data streams efficiently.
Example: Estimating the number of unique visitors to a website.
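The difference between the window-style models can be sketched as below. The function names are hypothetical, events are assumed to be `(timestamp, value)` pairs that arrive in timestamp order, and time units are arbitrary.

```python
from collections import deque


def tumbling_windows(events, size):
    """Group events into non-overlapping windows of `size` time units."""
    windows = {}
    for timestamp, value in events:
        window_start = (timestamp // size) * size
        windows.setdefault(window_start, []).append(value)
    return windows


def sliding_window_sums(events, size):
    """After each event, return the sum of values from the last `size` time units."""
    recent = deque()
    running = 0
    sums = []
    for timestamp, value in events:
        recent.append((timestamp, value))
        running += value
        # Drop events that are too old to belong to the current window.
        while recent and recent[0][0] <= timestamp - size:
            running -= recent.popleft()[1]
        sums.append(running)
    return sums


def count_based_batches(values, count):
    """Emit a batch each time `count` items have arrived; leftovers stay pending."""
    batch, batches = [], []
    for value in values:
        batch.append(value)
        if len(batch) == count:
            batches.append(batch)
            batch = []
    return batches
```

A tumbling window assigns every event to exactly one chunk, while a sliding window recomputes its result over the most recent events each time a new one arrives.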
Streaming Methods:
Streaming methods are techniques used to process and analyse continuous data streams. They
differ mainly in how long data is collected before it is processed, from large scheduled
batches down to individual events handled as soon as they arrive.
Types of Streaming Methods:
1. Batch Processing:
o Data is collected over a period and then processed together.
o Example: A company analyses daily sales reports at midnight.
2. Real-Time (Event-Driven) Processing:
o Data is processed as soon as it arrives.
o Example: Fraud detection in banking transactions instantly flags suspicious activity.
3. Micro-Batch Processing:
o A mix of batch and real-time processing where small chunks of data are processed
frequently.
o Example: Social media analytics updating every few minutes.
4. Window-Based Processing:
o Processes data within a specific time or count-based window.
o Example: Monitoring website traffic in the last 10 minutes.
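The contrast between batch, real-time and micro-batch processing can be sketched with a toy fraud rule. Everything here is an assumption for illustration: the 10,000 limit, the function names, and the use of plain amounts as transactions.

```python
def is_suspicious(amount, limit=10_000):
    """Hypothetical fraud rule: flag any transaction above `limit`."""
    return amount > limit


def batch_process(transactions):
    """Batch: wait for the full collection, then process it in one pass."""
    return [[t for t in transactions if is_suspicious(t)]]


def event_driven_process(transactions):
    """Real-time: produce a result for each transaction as soon as it arrives."""
    for t in transactions:
        yield [t] if is_suspicious(t) else []


def micro_batch_process(transactions, batch_size):
    """Micro-batch: buffer a few transactions, then process the small chunk."""
    buffer = []
    for t in transactions:
        buffer.append(t)
        if len(buffer) == batch_size:
            yield [x for x in buffer if is_suspicious(x)]
            buffer = []
    if buffer:  # flush whatever is left when the stream ends
        yield [x for x in buffer if is_suspicious(x)]
```

All three find the same suspicious transactions; what changes is how many results are emitted and how soon after arrival each one becomes available.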
Data Synopsis:
Data synopsis is a technique used to create a small, summarized version of large data sets. It
helps in quickly analysing and processing data without storing or handling the full dataset.
This is especially useful in real-time data streams, where data is too large to store entirely.
Why is Data Synopsis Important?
Saves storage space by keeping only key information.
Speeds up data analysis without needing full data.
Helps in real-time decision-making for large-scale systems.
Types of Data Synopsis:
1. Sampling:
o Takes a small portion of data to represent the whole dataset.
o Example: Checking 100 customer reviews instead of 1 million.
2. Sketching:
o Uses mathematical techniques to estimate data properties.
o Example: Estimating the number of unique visitors on a website without storing all
IP addresses.
3. Histogram:
o Divides data into ranges and counts how many values fall into each range.
o Example: Tracking the number of customers in different age groups.
4. Wavelet Transform:
o Compresses data while keeping important patterns.
o Example: Identifying trends in stock market prices over time.
5. Sliding Windows:
o Keeps only the most recent data for analysis.
o Example: Monitoring temperature readings from sensors in the last 10 minutes.
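Two of these synopses can be sketched briefly. For sketching, the example uses a k-minimum-values (KMV) estimator, which is one of several possible techniques and is chosen here only because it is short; the function names and the default `k` are assumptions. The histogram uses half-open ranges defined by a list of edges.

```python
import hashlib
import heapq


def _unit_hash(item):
    """Map an item to a pseudo-random float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(stream, k=256):
    """KMV sketch: estimate the distinct count while storing at most k hashes."""
    if k < 2:
        raise ValueError("k must be at least 2")
    smallest = []    # max-heap (via negation) of the k smallest hashes so far
    members = set()  # the same hashes, for fast duplicate checks
    for item in stream:
        h = _unit_hash(item)
        if h in members:
            continue
        if len(smallest) < k:
            heapq.heappush(smallest, -h)
            members.add(h)
        elif h < -smallest[0]:
            evicted = -heapq.heappushpop(smallest, -h)
            members.discard(evicted)
            members.add(h)
    if len(smallest) < k:
        return len(smallest)  # fewer than k distinct hashes: the count is exact
    return int((k - 1) / -smallest[0])  # estimate from the k-th smallest hash


def histogram(values, edges):
    """Count how many values fall in each range [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts
```

The sketch keeps only `k` numbers no matter how many visitors pass through, at the cost of an approximate answer; a larger `k` gives a tighter estimate.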
Summarization Techniques:
When dealing with large amounts of data, it’s not always possible to store or analyse
everything. These techniques help by reducing data while keeping the most important
information.
1. Sampling means selecting a small part of the data that represents the whole dataset.
Instead of analysing every piece of data, we work with a smaller, manageable sample.
🔹 Why Use Sampling?
Saves time and storage.
Speeds up data processing.
Works well when full data analysis isn’t necessary.
🔹 Example:
Imagine a company receives 1 million customer reviews. Instead of analysing all, they
randomly pick 10,000 reviews to understand customer sentiment.
2. Filtering removes irrelevant or unnecessary data, keeping only what is important.
🔹 Why Use Filtering?
Reduces noise and irrelevant information.
Helps focus on useful data.
Improves accuracy of analysis.
🔹 Example:
A weather monitoring system collects temperature, humidity, and wind speed data. If a
researcher is only interested in temperature, they filter out the other data.
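Both techniques can be sketched in a few lines. For sampling, the example uses reservoir sampling, which is one way to draw a uniform random sample when the data arrives as a stream of unknown length; the text above only requires that the sample be random. The function names, the `seed` parameter and the record layout are assumptions for illustration.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                sample[j] = item  # replace an existing item
    return sample


def keep_fields(records, fields):
    """Filtering: keep only the named fields of each record."""
    return [{f: r[f] for f in fields} for r in records]


# Hypothetical weather readings.
readings = [
    {"temperature": 21.5, "humidity": 40, "wind_speed": 12},
    {"temperature": 19.0, "humidity": 55, "wind_speed": 7},
]
temperatures_only = keep_fields(readings, ["temperature"])
```

Sampling reduces the number of records while keeping every field, whereas filtering keeps every record but drops the fields (or rows) that are not of interest.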