
DSBDA

Unit 6: Data Visualization and Hadoop

1.​ Introduction to Data Visualization

Introduction to Data Visualization – Explained in Detail

✅ What is Data Visualization?


Data Visualization is the graphical representation of information and data using visual elements
like charts, graphs, maps, and dashboards. It helps communicate complex data in a clear,
concise, and visual format that is easy to understand and analyze.

✅ Why is Data Visualization Important?


1.​ Simplifies Complex Data: Large datasets are easier to comprehend when presented
visually.​

2.​ Enhances Data Interpretation: Patterns, trends, and outliers can be spotted quickly.​

3.​ Facilitates Better Decision-Making: Helps in drawing insights and making informed
decisions.​

4.​ Saves Time: Quick overview of data without reading through raw numbers or tables.​

5.​ Engages Viewers: Visuals are more engaging than textual or numeric information.​

✅ Types of Data Visualization Techniques


Type             | Description                                       | Example Tools
Bar Chart        | Compares categories using rectangular bars.       | Excel, Power BI
Line Chart       | Shows trends over time.                           | Tableau, Python (Matplotlib)
Pie Chart        | Shows proportions within a whole.                 | Google Charts
Histogram        | Shows data distribution.                          | R, Python
Scatter Plot     | Shows relationship between two variables.         | Excel, Python
Heatmap          | Represents data density or intensity with color.  | Seaborn
Tree Map         | Shows part-to-whole relationships hierarchically. | Tableau
Geographical Map | Visualizes location-based data.                   | QGIS, Mapbox

✅ Key Components of Good Data Visualization


1.​ Title: Clear and descriptive.​

2.​ Axis Labels: Defined and accurate.​

3.​ Legends: To differentiate categories or data series.​

4.​ Colors and Fonts: Should enhance readability and not confuse.​

5.​ Simplicity: Avoid clutter and keep the design clean.​
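
The short Matplotlib sketch below puts these components together on a simple two-series chart; the numbers are invented purely for illustration:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
online = [120, 135, 150, 160]   # hypothetical online sales
retail = [100, 110, 105, 115]   # hypothetical retail sales

plt.plot(months, online, marker='o', label="Online")  # legend entries
plt.plot(months, retail, marker='s', label="Retail")
plt.title("Monthly Sales by Channel")  # clear, descriptive title
plt.xlabel("Month")                    # labeled axes
plt.ylabel("Sales (units)")
plt.legend()                           # legend differentiates the series
plt.show()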

✅ Tools for Data Visualization


●​ Microsoft Excel/Google Sheets: Basic visualizations.​

●​ Tableau: Interactive and advanced dashboards.​

●​ Power BI: Business analytics and visualization.​

●​ Python Libraries: Matplotlib, Seaborn, Plotly.​

●​ R Libraries: ggplot2, lattice.​


✅ Applications of Data Visualization
●​ Business Intelligence: Monitor KPIs, sales trends.​

●​ Healthcare: Track patient statistics and disease trends.​

●​ Finance: Visualize stock data, budget analysis.​

●​ Education: Display academic results or performance metrics.​

●​ Social Media: Analyze engagement or user behavior data.​

✅ Conclusion
Data Visualization bridges the gap between raw data and human understanding. It transforms
data into visuals that help in storytelling, pattern recognition, and strategic planning. In an era of
big data, mastering visualization is a key skill for analysts, scientists, and decision-makers.

2. Challenges to Big Data Visualization

Challenges to Big Data Visualization

Big data brings massive volumes, variety, and velocity of data—making visualization more
complex. Below are the key challenges in visualizing big data:

✅ 1. Scalability
●​ Problem: Traditional visualization tools can't handle billions of data points efficiently.​

●​ Impact: Sluggish performance or system crashes when dealing with large datasets.​

✅ 2. Data Variety
●​ Problem: Big data includes structured, unstructured (text, images), and semi-structured
data.​

●​ Impact: Difficult to design a unified visualization approach for such diverse formats.​

✅ 3. Real-Time Data Processing


●​ Problem: Streaming data (e.g., from sensors or social media) needs instant visual
updates.​

●​ Impact: Requires high-speed processing and responsive UI for real-time dashboards.​

✅ 4. Noise and Outliers


●​ Problem: Big data often contains irrelevant or erroneous data.​

●​ Impact: Can distort visualizations and lead to incorrect interpretations.​

✅ 5. Overplotting and Clutter


●​ Problem: Too many data points lead to visual crowding (e.g., in scatter plots).​

●​ Impact: Reduces readability and hides important insights.​
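
A common mitigation, sketched below on synthetic data, is to shrink the markers and lower their opacity (or switch to a hexbin density view):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)                  # synthetic data: 100k points
y = x + rng.normal(scale=0.5, size=100_000)

plt.scatter(x, y, s=2, alpha=0.05)            # low alpha reveals density
# Alternative: plt.hexbin(x, y, gridsize=50, cmap='Blues')
plt.title("Overplotting mitigated with transparency")
plt.show()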

✅ 6. Interpretability
●​ Problem: Complex visualizations (e.g., 3D, multi-dimensional plots) may confuse
non-experts.​

●​ Impact: Limits the accessibility of insights to business users and decision-makers.​

✅ 7. Storage and Performance


●​ Problem: Storing and querying huge datasets for visualization is resource-intensive.​

●​ Impact: High computation costs and infrastructure demands.​

✅ 8. Data Privacy and Security


●​ Problem: Sensitive data (e.g., health, finance) must be protected during visualization.​

●​ Impact: Requires anonymization and secure access controls.​

✅ 9. Tool Limitations
●​ Problem: Many tools are not designed for distributed or cloud-based data.​

●​ Impact: Need for advanced platforms (e.g., Apache Superset, D3.js with back-end
support).​

✅ 10. User-Centric Design


●​ Problem: Different users have different levels of technical expertise.​

●​ Impact: Requires adaptive dashboards and customizable views for better UX.​

✅ Conclusion
Visualizing big data demands powerful tools, optimized data handling, and intuitive designs
to manage its scale and complexity. Overcoming these challenges is essential for unlocking the
true value hidden in massive datasets.
3. Types of Data Visualization

Types of Data Visualization – Explained in Detail

Data visualization comes in many forms, each suited for different types of data and analysis
goals. Below is a detailed explanation of the main types of data visualizations, along with
their purposes, examples, and best use cases.

✅ 1. Bar Chart
Description: Displays data using rectangular bars with lengths proportional to the values they
represent.

●​ Used For: Comparing different categories or groups.​

●​ Orientation: Can be vertical or horizontal.​

●​ Example: Sales revenue by product category.​

Best For:

●​ Categorical data​

●​ Side-by-side comparisons​

✅ 2. Column Chart
Description: Similar to bar charts but with vertical bars.

●​ Used For: Showing changes over time or comparison across groups.​

●​ Example: Monthly website traffic.​

✅ 3. Line Chart
Description: Connects data points with lines, often used to show trends over time.
●​ Used For: Time series data or trends.​

●​ Example: Temperature change over the year.​

Best For:

●​ Continuous data​

●​ Visualizing trends and patterns​

✅ 4. Pie Chart
Description: Circular chart divided into slices to illustrate numerical proportion.

●​ Used For: Showing percentages or parts of a whole.​

●​ Example: Market share of companies.​

Caution: Not ideal for comparing many categories or small differences.

✅ 5. Histogram
Description: Similar to bar chart but used for distribution of numerical data.

●​ Used For: Showing the frequency of data within certain intervals (bins).​

●​ Example: Distribution of student grades.​

Best For:

●​ Understanding data distribution​

●​ Identifying skewness or outliers​

✅ 6. Scatter Plot
Description: Shows the relationship between two numerical variables using dots.

●​ Used For: Finding correlations or patterns between variables.​

●​ Example: Height vs. weight of individuals.​

Best For:

●​ Spotting trends​

●​ Identifying clusters and outliers​

✅ 7. Area Chart
Description: Like a line chart, but the area under the line is filled in with color.

●​ Used For: Showing volume over time.​

●​ Example: Cumulative sales over months.​

Best For:

●​ Comparing quantities over time​

✅ 8. Heatmap
Description: Uses color intensity to show the magnitude of values in a matrix.

●​ Used For: Comparing values across two variables using color.​

●​ Example: Student attendance vs. performance.​

Best For:

●​ Spotting patterns or anomalies in large datasets​
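
A minimal Seaborn sketch, with a random matrix standing in for real data:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.rand(6, 8)   # random matrix as placeholder data
sns.heatmap(data, annot=True, fmt=".1f", cmap="YlOrRd")
plt.title("Heatmap Example")
plt.show()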


✅ 9. Tree Map
Description: Hierarchical chart using nested rectangles.

●​ Used For: Showing proportions within a hierarchy.​

●​ Example: Company’s departmental expenses.​

✅ 10. Bubble Chart


Description: Like a scatter plot, but with bubble size indicating a third variable.

●​ Used For: Multi-dimensional data.​

●​ Example: Country GDP vs. life expectancy (size = population).​
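
Matplotlib has no dedicated bubble-chart function; a scatter plot with the s parameter mapped to the third variable does the job. The values below are invented for illustration:

import matplotlib.pyplot as plt

gdp = [1.5, 3.0, 4.2, 2.1]        # hypothetical GDP values
life_exp = [70, 78, 82, 75]       # hypothetical life expectancy
population = [50, 150, 300, 90]   # third variable -> bubble size

plt.scatter(gdp, life_exp, s=[p * 3 for p in population], alpha=0.5)
plt.title("Bubble Chart Example")
plt.xlabel("GDP")
plt.ylabel("Life Expectancy")
plt.show()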

✅ 11. Box Plot (Box-and-Whisker Plot)


Description: Summarizes the distribution of a dataset using median, quartiles, and outliers.

●​ Used For: Detecting variability and outliers.​

●​ Example: Exam score distribution among students.​

✅ 12. Geographical Map (Geo Map)


Description: Represents data spatially on a map.

●​ Used For: Location-based data visualization.​

●​ Example: Population density by region.​


✅ 13. Radar Chart (Spider Chart)
Description: Plots multivariate data on a circular graph.

●​ Used For: Comparing multiple variables per item.​

●​ Example: Employee skills analysis.​

✅ 14. Gantt Chart


Description: A type of bar chart showing project tasks over time.

●​ Used For: Project planning and tracking.​

●​ Example: Software development timeline.​

✅ Summary Table
Chart Type       | Best For
Bar/Column Chart | Category comparisons
Line Chart       | Trends over time
Pie Chart        | Proportions in a whole
Histogram        | Frequency distribution
Scatter Plot     | Correlations between two variables
Heatmap          | Density/pattern in two variables
Tree Map         | Hierarchical part-to-whole relationships
Geo Map          | Spatial/location-based data
Bubble Chart     | 3-variable comparison
Box Plot         | Distribution, outliers
Radar Chart      | Multi-attribute comparison
Gantt Chart      | Project scheduling

4. Data Visualization Techniques

Data Visualization Techniques – Explained in Detail

Data visualization techniques are methods used to transform raw data into meaningful and
interpretable visual formats. These techniques help in analyzing, understanding, and
communicating information effectively.

Below is a detailed breakdown of the major data visualization techniques:


✅ 1. Chart-Based Techniques
These are the most commonly used techniques for structured and quantitative data.

a) Bar Charts

●​ Represent categorical data using rectangular bars.​

●​ Used for comparing groups or categories.​

●​ Example: Sales by region, population by age group.​

b) Line Charts

●​ Display data points connected by lines.​

●​ Ideal for time-series or trend analysis.​

●​ Example: Stock prices over a year.​

c) Pie Charts

●​ Show percentage or proportional data as slices of a circle.​

●​ Best used when comparing parts of a whole.​

●​ Limitation: Not effective with more than 4–5 categories.​

d) Histogram

●​ Displays frequency distribution of continuous variables.​

●​ Groups data into intervals or “bins.”​

●​ Example: Distribution of test scores.​

e) Area Charts

●​ Similar to line charts but with filled areas.​


●​ Used for visualizing cumulative trends.​

●​ Example: Website visits over time.​

✅ 2. Distribution-Based Techniques
Used to understand the spread and shape of the data.

a) Box Plots (Box-and-Whisker Plots)

●​ Visualize data distribution using median, quartiles, and outliers.​

●​ Effective in comparing distributions across categories.​

b) Violin Plots

●​ Combine box plot and density plot.​

●​ Show the full distribution of the data.​
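
A quick Seaborn sketch using its built-in tips sample dataset:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")   # built-in sample dataset
sns.violinplot(x="day", y="total_bill", data=tips)
plt.title("Violin Plot Example")
plt.show()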

✅ 3. Correlation and Relationship Techniques


These show the relationship between variables.

a) Scatter Plots

●​ Display the relationship between two continuous variables.​

●​ Useful for detecting patterns, trends, and outliers.​

●​ Example: Height vs. weight.​

b) Bubble Charts
●​ An extension of scatter plots with a third dimension (size of bubble).​

●​ Example: GDP vs. life expectancy (bubble size = population).​

c) Heatmaps

●​ Use color intensity to show values in a matrix.​

●​ Excellent for visualizing correlation matrices or attendance records.​

✅ 4. Hierarchical and Part-to-Whole Techniques


These techniques show relationships within datasets with sub-categories or hierarchies.

a) Tree Maps

●​ Visualize hierarchical data using nested rectangles.​

●​ Size and color represent different variables.​

b) Sunburst Charts

●​ Circular hierarchical charts that are visually similar to tree maps.​

c) Donut Charts

●​ A variation of pie charts with a central hole.​

●​ Easier to read than traditional pie charts in some cases.​

✅ 5. Geospatial Techniques
Used when data includes geographic or location-based components.
a) Choropleth Maps

●​ Use different shades or colors to show variable values across geographic areas.​

●​ Example: Unemployment rate by state.​

b) Dot Density Maps

●​ Each dot represents a fixed quantity.​

●​ Good for visualizing population or incident distribution.​

c) Symbol Maps

●​ Use scaled symbols (e.g., circles) on a map to represent data.​

✅ 6. Multivariate and Dimensional Techniques


Used for data with multiple variables.

a) Radar Charts (Spider Charts)

●​ Display multivariate data in a circular layout.​

●​ Useful for performance comparison.​

b) Parallel Coordinates Plot

●​ Each variable has its own axis; lines represent observations.​

●​ Good for analyzing complex relationships.​
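
pandas ships a parallel_coordinates helper; the sketch below runs it on Seaborn's built-in iris sample dataset, with species as the class column:

from pandas.plotting import parallel_coordinates
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")   # built-in sample dataset
parallel_coordinates(iris, "species", colormap="viridis")
plt.title("Parallel Coordinates Example")
plt.show()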

c) 3D Surface Charts

●​ Show three variables at once in a 3D space.​


●​ Can be difficult to interpret without interaction.​

✅ 7. Interactive Techniques
With modern tools like Tableau, Power BI, D3.js, and Plotly, visualizations can be interactive.

●​ Hover effects: Show tooltips with more data.​

●​ Zoom and pan: Navigate large datasets easily.​

●​ Drill-downs: Click to explore subcategories.​

●​ Filters: Let users explore data dynamically.​
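
As a minimal sketch, Plotly Express provides hover tooltips plus zoom and pan out of the box (here using its bundled gapminder sample data):

import plotly.express as px

df = px.data.gapminder().query("year == 2007")   # built-in sample dataset
fig = px.scatter(df, x="gdpPercap", y="lifeExp",
                 color="continent", hover_name="country")
fig.show()   # opens an interactive figure with tooltips, zoom, and pan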

✅ Conclusion
Choosing the right visualization technique depends on:

●​ The type of data (categorical, numerical, temporal, geospatial, etc.)​

●​ The purpose (comparison, distribution, trend analysis, etc.)​

●​ The audience (technical or non-technical)​

Effective use of visualization techniques enhances data comprehension and helps in storytelling with data.
5. Visualizing Big Data

Visualizing Big Data – A Detailed Explanation

Visualizing Big Data refers to the process of representing large, complex datasets in a visual
format (charts, graphs, maps, etc.) to uncover patterns, trends, and insights that are not easily
observed in raw data form.

✅ What is Big Data?


Big Data is typically characterized by the 4 Vs:

1.​ Volume – Extremely large datasets (e.g., terabytes or petabytes).​

2.​ Velocity – High speed at which data is generated and processed.​

3.​ Variety – Different forms of data (structured, semi-structured, unstructured).​

4.​ Veracity – Uncertainty or inconsistency in data quality.​

✅ Why Visualize Big Data?


●​ To make sense of complex datasets.​

●​ To identify hidden patterns, trends, and anomalies.​

●​ To enhance decision-making by delivering insights in real time.​

●​ To communicate findings clearly to both technical and non-technical users.​

✅ Challenges in Visualizing Big Data


(Recap of earlier points)
Challenge               | Description
Scalability             | Difficulty handling billions of records
Real-time visualization | Need for instant rendering and live updates
Data variety            | Structured vs. unstructured formats
Overplotting            | Cluttered visuals due to too many data points
Interpretation          | Complex charts might be hard for general users to understand
Performance             | Large datasets demand high storage and computational power

✅ Techniques and Tools for Big Data Visualization


1. Sampling and Aggregation

●​ Instead of visualizing all data, take representative samples or aggregate data.​

●​ Example: Showing average daily sales instead of each transaction.​
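
A small pandas sketch of both ideas on a synthetic transactions table:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "day": rng.integers(1, 31, size=1_000_000),    # synthetic transactions
    "sales": rng.normal(100, 20, size=1_000_000),
})

sample = df.sample(frac=0.01)                      # 1% representative sample
daily_avg = df.groupby("day")["sales"].mean()      # aggregate before plotting
print(sample.shape, daily_avg.head())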

2. Data Filtering and Zooming

●​ Use filters to focus on subsets of data.​

●​ Enable zoom-in/out functionality on charts/maps.​

3. Real-time Dashboards

●​ Used for streaming data like IoT sensor data or live social media feeds.​

●​ Tools: Apache Kafka (backend) + Grafana / Kibana / Power BI​

4. Multi-level Visualization (Drill-Down)

●​ Show summarized data initially.​


●​ Allow users to click and explore deeper levels.​

5. Clustering and Pattern Recognition

●​ Group similar data points (e.g., K-Means) and visualize them as clusters.​

●​ Useful in visualizing customer segments, fraud detection, etc.​
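
A minimal scikit-learn sketch on synthetic points:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in (0, 3, 6)])                # three synthetic blobs

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=10)
plt.title("K-Means Clusters")
plt.show()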

✅ Popular Tools for Big Data Visualization


Tool             | Description
Tableau          | Drag-and-drop interface, handles big data with connectors
Power BI         | Microsoft tool with real-time data analysis
Apache Superset  | Open-source, scalable tool for large data visualization
D3.js            | JavaScript library for creating custom, interactive visuals
Kibana           | Works with Elasticsearch to visualize log and event data
Grafana          | Real-time visualization of time-series data
Python Libraries | Plotly, Bokeh, Altair – great for customized dashboards

✅ Best Practices for Big Data Visualization


1.​ Simplify your visuals – Avoid clutter, use summaries.​

2.​ Use appropriate chart types – Line charts for time-series, heatmaps for dense data.​

3.​ Enable interactivity – Filters, drill-downs, zooming.​

4.​ Use color carefully – Use color to encode meaning, not to distract.​

5.​ Ensure scalability – Choose tools and architectures that scale with data size.​
✅ Use Cases of Big Data Visualization
●​ Social Media Analysis: Tracking hashtags, sentiment, engagement in real time.​

●​ Healthcare: Monitoring patient vitals, population health trends.​

●​ E-commerce: Visualizing product performance, customer behavior.​

●​ Smart Cities: Traffic patterns, energy usage, public transport tracking.​

●​ Finance: Real-time fraud detection, portfolio performance.​

✅ Conclusion
Visualizing big data transforms massive, complex information into actionable insights. With the
right combination of techniques, tools, and best practices, organizations can gain a competitive
edge, make informed decisions, and tell compelling data stories.

6. Tools Used in Data Visualization

Tools Used in Data Visualization – Explained in Detail

Data visualization tools help in transforming raw data into graphical or pictorial formats that
make it easier to understand and interpret. These tools range from simple chart creators to
advanced platforms capable of handling real-time big data.

Here’s a detailed overview of the most popular data visualization tools, categorized by
use-case and features:

✅ 1. Tableau
●​ Type: Commercial​

●​ Key Features:​

○​ Drag-and-drop interface​

○​ Supports real-time data​

○​ Connects to multiple data sources (SQL, Excel, Cloud)​

○​ Dashboard creation and storytelling features​

●​ Use Cases: Business intelligence, executive dashboards, real-time analytics​

●​ Pros: User-friendly, powerful visualization​

●​ Cons: Expensive license for full version​

✅ 2. Microsoft Power BI
●​ Type: Freemium (free & paid plans)​

●​ Key Features:​

○​ Integration with Microsoft products (Excel, Azure)​

○​ AI-driven data insights​

○​ Real-time dashboard capabilities​

○​ Custom visual plugins​

●​ Use Cases: Business analytics, financial reports​

●​ Pros: Affordable, excellent for Windows users​

●​ Cons: Limited customization compared to Tableau​


✅ 3. Google Data Studio (now Looker Studio)
●​ Type: Free (Cloud-based)​

●​ Key Features:​

○​ Real-time collaboration​

○​ Connects to Google services (Analytics, Sheets, BigQuery)​

○​ Custom charts and filters​

●​ Use Cases: Website analytics, marketing reports​

●​ Pros: Easy for beginners, free​

●​ Cons: Limited third-party data source integration​

✅ 4. D3.js (Data-Driven Documents)


●​ Type: Open-source JavaScript library​

●​ Key Features:​

○​ Highly customizable and interactive visualizations​

○​ Full control over design using HTML, CSS, and SVG​

●​ Use Cases: Custom web-based dashboards, academic research​

●​ Pros: Infinite design flexibility​

●​ Cons: Requires programming skills​

✅ 5. Qlik Sense
●​ Type: Commercial​

●​ Key Features:​

○​ Associative data model (data linking across sets)​

○​ AI and machine learning features​

○​ Self-service dashboards​

●​ Use Cases: Business intelligence, retail analytics​

●​ Pros: Fast and dynamic filtering​

●​ Cons: Steep learning curve​

✅ 6. Apache Superset
●​ Type: Open-source​

●​ Key Features:​

○​ Lightweight and scalable​

○​ Works well with big data tools like Hive, Presto​

○​ SQL editor and rich visual options​

●​ Use Cases: Data engineering, big data analysis​

●​ Pros: Free and powerful for developers​

●​ Cons: Requires setup and backend knowledge​

✅ 7. Plotly
●​ Type: Open-source & commercial (with Dash)​

●​ Key Features:​

○​ Interactive visualizations in Python, R, JavaScript​

○​ Can be embedded in web applications​

●​ Use Cases: Scientific data visualization, machine learning dashboards​

●​ Pros: High-quality visuals, good integration with Jupyter notebooks​

●​ Cons: Limited offline capabilities without a license​

✅ 8. Excel
●​ Type: Commercial​

●​ Key Features:​

○​ Basic charts (bar, pie, line, etc.)​

○​ Pivot tables and conditional formatting​

●​ Use Cases: Simple reports, small datasets​

●​ Pros: Familiar interface​

●​ Cons: Not ideal for large or real-time data​

✅ 9. Kibana
●​ Type: Open-source​

●​ Key Features:​
○​ Built for visualizing data stored in Elasticsearch​

○​ Real-time log and event tracking​

●​ Use Cases: IT infrastructure monitoring, log analysis​

●​ Pros: Excellent for time-series data​

●​ Cons: Tightly coupled with Elasticsearch​

✅ 10. Grafana
●​ Type: Open-source & commercial​

●​ Key Features:​

○​ Best for time-series and real-time data​

○​ Supports multiple data sources (Prometheus, InfluxDB, MySQL)​

●​ Use Cases: DevOps dashboards, IoT, system performance​

●​ Pros: Live monitoring and alerting features​

●​ Cons: Limited support for static reporting​

✅ Summary Table
Tool               | Type        | Best For                       | Programming Required
Tableau            | Commercial  | Business dashboards            | No
Power BI           | Freemium    | Enterprise reporting           | No
Google Data Studio | Free        | Google Analytics, reports      | No
D3.js              | Open-source | Custom web visualizations      | Yes (JavaScript)
Qlik Sense         | Commercial  | Dynamic, self-service BI       | No
Apache Superset    | Open-source | Big data, SQL-based dashboards | Yes (optional)
Plotly             | Both        | Python/R-based analytics       | Yes
Excel              | Commercial  | Basic charting                 | No
Kibana             | Open-source | Log data visualization         | No
Grafana            | Both        | Real-time monitoring           | No

✅ Conclusion
The right tool depends on:

●​ Your technical expertise​


●​ The type and volume of data​

●​ Your budget​

●​ Your visualization goals (exploration, monitoring, reporting)​

7. Hadoop Ecosystem

✅ Hadoop Ecosystem – Explained in Detail


The Hadoop Ecosystem is a collection of open-source tools and frameworks that work together
to store, process, and analyze Big Data efficiently using distributed computing. The
ecosystem revolves around the Apache Hadoop framework, which allows for scalable,
fault-tolerant, and parallel data processing across clusters of computers.

🧩 Core Components of Hadoop Ecosystem


1. HDFS (Hadoop Distributed File System)

●​ A distributed file system that stores data across multiple machines.​

●​ Stores data in blocks (default: 128 MB) and replicates each block (default: 3 copies)
for fault tolerance.​

●​ Master node: NameNode​

●​ Worker nodes: DataNodes​

✅ Use: Storage of large datasets across multiple nodes.
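
The block arithmetic is easy to check in Python; the 1 GB file below is a made-up example:

import math

file_size_mb = 1024   # hypothetical 1 GB file
block_size_mb = 128   # HDFS default block size
replication = 3       # HDFS default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
print(blocks, "blocks,", blocks * replication, "block replicas stored")
# -> 8 blocks, 24 block replicas stored across the cluster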

2. MapReduce

●​ A programming model for processing large data sets in parallel.​


●​ Splits tasks into two phases:​

○​ Map phase: Processes input data and generates key-value pairs.​

○​ Reduce phase: Aggregates and summarizes the mapped data.​

✅ Use: Batch processing of big data.
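
The word-count sketch below simulates both phases (plus the implicit shuffle) in plain Python; it illustrates the model only, not Hadoop's actual Java API:

from collections import defaultdict

documents = ["big data big insight", "data data everywhere"]  # toy input

# Map phase: emit (key, value) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort: group values by key (done by the framework in Hadoop)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate values per key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 3, 'insight': 1, 'everywhere': 1}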

3. YARN (Yet Another Resource Negotiator)

●​ Manages and schedules resources in a Hadoop cluster.​

●​ Components:​

○​ ResourceManager: Global resource scheduler.​

○​ NodeManager: Manages execution on each node.​

✅ Use: Efficient resource management and job scheduling.

🧰 Hadoop Ecosystem Tools


The Hadoop core is extended by several tools that provide additional functionality for data
ingestion, processing, storage, querying, and management.

🔹 Data Ingestion Tools


1. Apache Sqoop

●​ Transfers data between Hadoop and relational databases (MySQL, Oracle).​

●​ Supports import/export operations.​

2. Apache Flume
●​ Collects and transfers large volumes of log data from various sources (like web servers)
to HDFS.​

🔹 Data Storage and Access


1. HBase

●​ A NoSQL database that runs on top of HDFS.​

●​ Stores structured and semi-structured data in a column-oriented format.​

●​ Provides real-time read/write access.​

🔹 Data Processing and Querying


1. Apache Hive

●​ A data warehouse system for Hadoop.​

●​ Uses HiveQL, a SQL-like query language.​

●​ Translates queries into MapReduce jobs.​

2. Apache Pig

●​ Uses Pig Latin, a scripting language.​

●​ Simplifies writing complex MapReduce tasks.​

3. Apache Spark

●​ An in-memory data processing engine.​

●​ Much faster than MapReduce for iterative tasks.​

●​ Supports batch, streaming, machine learning, and graph processing.​


🔹 Data Orchestration and Workflow
1. Apache Oozie

●​ A workflow scheduler system to manage Hadoop jobs.​

●​ Supports job dependencies and triggers.​

🔹 Data Search and Indexing


1. Apache Solr / Elasticsearch

●​ Open-source search platforms used for indexing large volumes of text data in Hadoop.​

●​ Supports full-text search and analytics.​

🔹 Cluster Management and Coordination


1. Apache Zookeeper

●​ A centralized service for maintaining configuration and coordination.​

●​ Used for synchronization between Hadoop services.​

🔹 Data Serialization
1. Apache Avro

●​ A data serialization system.​

●​ Helps in storing and exchanging data efficiently between programs.​



✅ Conclusion
The Hadoop Ecosystem provides a powerful platform for storing, managing, and analyzing
big data efficiently. It enables organizations to:

●​ Store huge volumes of data (structured or unstructured).​

●​ Perform batch or real-time analytics.​

●​ Scale horizontally with low-cost hardware.​

8. MapReduce

✅ MapReduce (5 Marks)
🔷 Definition:
MapReduce is a programming model and processing engine used in the Hadoop ecosystem
to process large volumes of data in a distributed and parallel manner across a cluster.

🔷 Working Mechanism:
●​ Map Phase:​

○​ Input data is split into key-value pairs.​

○​ A user-defined Mapper function processes each pair.​

●​ Shuffle and Sort Phase:​

○​ System sorts and groups intermediate data by key.​

●​ Reduce Phase:​
○​ A user-defined Reducer function combines values with the same key to produce
the final output.​

🔷 Example:
If processing log data to count IP visits:

●​ Map: Emit (IP, 1) for each request.​

●​ Reduce: Sum all values by IP → (IP, total visits)​
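
Via the Hadoop Streaming utility, which lets any executable act as mapper or reducer, this example can be sketched as two small Python scripts (assuming the client IP is the first whitespace-separated field of each log line):

# mapper.py -- emit "ip<TAB>1" for every log line
import sys
for line in sys.stdin:
    fields = line.split()
    if fields:
        print(f"{fields[0]}\t1")

# reducer.py -- streaming input arrives sorted by key, so sums run per IP
import sys
current_ip, count = None, 0
for line in sys.stdin:
    ip, value = line.rstrip("\n").split("\t")
    if ip != current_ip:
        if current_ip is not None:
            print(f"{current_ip}\t{count}")
        current_ip, count = ip, 0
    count += int(value)
if current_ip is not None:
    print(f"{current_ip}\t{count}")

# Typical run (jar path and HDFS paths vary by installation):
# hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
#   -mapper "python3 mapper.py" -reducer "python3 reducer.py" \
#   -input /logs/access.log -output /logs/ip_counts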

🔷 Advantages:
●​ Handles large-scale batch processing​

●​ Fault-tolerant and scalable​

●​ Runs on commodity hardware​

🔷 Limitations:
●​ Low-level API; requires Java programming.​

●​ Not efficient for iterative tasks (like ML).

9. Apache Pig (5 Marks)

🔷 Definition:
Apache Pig is a high-level data flow language and execution framework built on top of
Hadoop. It uses a scripting language called Pig Latin to simplify data analysis tasks.

🔷 Features:
●​ Abstracts complex MapReduce operations into simpler scripts.​

●​ Supports data loading, transformation, and storage.​

●​ Works with both structured and semi-structured data (e.g., JSON, XML).​
🔷 Example Pig Latin Script:

logs = LOAD 'access.log' AS (ip:chararray, url:chararray);

grouped = GROUP logs BY ip;

counted = FOREACH grouped GENERATE group, COUNT(logs);

DUMP counted;

🔷 Advantages:
●​ Less coding than raw MapReduce​

●​ Flexible for ETL tasks​

●​ Supports UDFs (User-Defined Functions) in Java or Python​

🔷 Limitations:
●​ Not real-time (batch only)​

●​ Lacks native schema enforcement like SQL

10. Apache Hive (5 Marks)

🔷 Definition:
Apache Hive is a data warehousing tool that provides an SQL-like interface (HiveQL) to query
and analyze large datasets stored in Hadoop (HDFS).

🔷 Features:
●​ Converts HiveQL queries into MapReduce jobs.​
●​ Supports tables, partitions, and buckets for data organization.​

●​ Compatible with tools like Tableau, Power BI, and Spark.​

🔷 Example Hive Query:



SELECT product, SUM(sales)

FROM transactions

GROUP BY product;

🔷 Advantages:
●​ Familiar SQL interface for analysts​

●​ Handles huge datasets​

●​ Supports batch queries and ETL​

🔷 Limitations:
●​ Not suitable for real-time processing​

●​ Slower than in-memory engines like Spark​


11. Data Visualization using Python: Line plot, Scatter plot, Histogram, Density plot, Box plot.

Here's a detailed 5-mark explanation of each key data visualization type in Python using
libraries like Matplotlib and Seaborn:

✅ 1. Line Plot (5 Marks)


🔷 Definition:
A line plot displays information as a series of data points called markers connected by straight
line segments. It is commonly used to visualize trends over time.

🔷 Use Cases:
●​ Time-series data​

●​ Stock market prices​

●​ Sensor readings over time​

🔷 Python Example (Matplotlib):


import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]

y = [10, 12, 14, 13, 15]

plt.plot(x, y, marker='o')

plt.title("Line Plot Example")

plt.xlabel("Time")

plt.ylabel("Value")

plt.show()
🔷 Advantages:
●​ Excellent for showing trends and changes​

●​ Simple and clear to interpret​

🔷 Limitations:
●​ Not suitable for categorical or unordered data​

✅ 2. Scatter Plot (5 Marks)


🔷 Definition:
A scatter plot displays individual data points as dots in a two-dimensional space, often used
to show the relationship between two continuous variables.

🔷 Use Cases:
●​ Correlation between variables​

●​ Outlier detection​

●​ Regression analysis​

🔷 Python Example:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]

y = [5, 7, 4, 6, 5]

plt.scatter(x, y, color='red')

plt.title("Scatter Plot Example")


plt.xlabel("X-axis")

plt.ylabel("Y-axis")

plt.show()

🔷 Advantages:
●​ Reveals patterns, clusters, or correlations​

●​ Identifies outliers easily​

🔷 Limitations:
●​ Overplotting in large datasets​

✅ 3. Histogram (5 Marks)
🔷 Definition:
A histogram is used to represent the frequency distribution of a numerical variable by
dividing the data into bins (intervals) and plotting the number of values in each bin.

🔷 Use Cases:
●​ Distribution of exam scores​

●​ Frequency of ages​

●​ Data skewness detection​

🔷 Python Example:
import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5]
plt.hist(data, bins=5, color='skyblue', edgecolor='black')

plt.title("Histogram Example")

plt.xlabel("Value")

plt.ylabel("Frequency")

plt.show()

🔷 Advantages:
●​ Shows distribution and range​

●​ Easy to interpret​

🔷 Limitations:
●​ Loses exact individual values​

✅ 4. Density Plot (5 Marks)


🔷 Definition:
A density plot (KDE plot) is a smoothed version of a histogram. It shows the probability
density function of a continuous variable using kernel density estimation.

🔷 Use Cases:
●​ Identifying distribution shape​

●​ Comparing multiple distributions​

●​ Probability estimation​

🔷 Python Example (Seaborn):


import seaborn as sns

import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5]

sns.kdeplot(data, fill=True)  # 'shade=True' in older Seaborn versions

plt.title("Density Plot Example")

plt.xlabel("Value")

plt.ylabel("Density")

plt.show()

🔷 Advantages:
●​ Smooth and continuous view of data distribution​

●​ Better for comparing multiple datasets​

🔷 Limitations:
●​ Sensitive to bandwidth choice​

✅ 5. Box Plot (5 Marks)


🔷 Definition:
A box plot displays the five-number summary of a dataset: minimum, Q1, median, Q3, and
maximum. It helps identify outliers and data spread.

🔷 Use Cases:
●​ Comparing distributions between groups​

●​ Detecting outliers​

●​ Understanding skewness​

🔷 Python Example (Seaborn):


import seaborn as sns

import matplotlib.pyplot as plt

data = [5, 7, 8, 9, 10, 12, 13, 14, 15, 22, 25]

sns.boxplot(data=data)

plt.title("Box Plot Example")

plt.show()

🔷 Advantages:
●​ Visualizes spread and central tendency​

●​ Highlights outliers clearly​

🔷 Limitations:
●​ Less informative for small datasets​
