SAGAR INSTITUTE OF RESEARCH AND TECHNOLOGY
Bhopal, Madhya Pradesh (462021)
DEPARTMENT OF AIML
Lab Manual
of
Data and Visual Analytics AL603 (B)
INDEX
(Columns: S.No. | Name of Experiment | Plan Date | Date of Completion | Remark; the date and remark columns are to be filled in during the lab.)

1. Write a program to implement Descriptive Statistics with Measures of Central Tendency and Dispersion in Python.
2. Write a program to implement Statistical Inference and Sampling Distribution in Python.
3. Write a program to implement Statistical Hypothesis Testing and Analysis in Python.
4. Write a program to implement Regression Modeling and Bayesian Inference in Python.
5. Write a program to implement Data Wrangling and Cleaning in Python.
6. Write a program to implement Data Visualization in Data Analysis in Python.
7. Write a program to implement Data Ecosystem Overview, File Formats, and Sources of Data in Python.
8. Write a program to implement Data Pipelines, ETL, and Big Data Processing with Spark in Python.
9. Write a program to implement Basic Data Visualizations using Matplotlib, Seaborn, and Pandas in Python.
10. Write a program to implement Interactive Visualizations using Plotly in Python.
1) Write a program to implement Descriptive Statistics with Measures of
Central Tendency and Dispersion in Python.
import numpy as np
import pandas as pd
from scipy import stats
# Sample data representing test scores of students (ratio scale data)
data = [56, 77, 89, 92, 65, 74, 83, 90, 78, 85, 60, 69]
# Convert the data into a pandas Series for better management and indexing
data_series = pd.Series(data)
# Descriptive statistics
mean = np.mean(data_series)
median = np.median(data_series)
mode = stats.mode(data_series, keepdims=True).mode[0]  # keepdims=True for SciPy >= 1.9; every value occurs once here, so the smallest is reported
variance = np.var(data_series, ddof=0)  # population variance; use ddof=1 for the sample variance
std_deviation = np.std(data_series, ddof=0)  # population standard deviation
range_data = np.max(data_series) - np.min(data_series)
# Output the results (rounded to four decimal places for readability)
print("Descriptive Statistics for Test Scores:")
print(f"Mean: {mean:.4f}")
print(f"Median: {median:.4f}")
print(f"Mode: {mode}")
print(f"Variance: {variance:.4f}")
print(f"Standard Deviation: {std_deviation:.4f}")
print(f"Range: {range_data}")
Sample Output:
Descriptive Statistics for Test Scores:
Mean: 76.5000
Median: 77.5000
Mode: 56
Variance: 131.9167
Standard Deviation: 11.4855
Range: 36
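As a cross-check, pandas can compute most of these statistics in one call; a minimal sketch using the standard describe() and mode() methods (note that describe() reports the sample standard deviation, ddof=1, so it differs slightly from the np.std default used above):

# Cross-check with pandas built-ins
print(data_series.describe())                      # count, mean, std (ddof=1), min, quartiles, max
print(f"Mode(s): {data_series.mode().tolist()}")   # pandas returns every modal value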
2) Write a program to implement Statistical Inference and Sampling
Distribution in Python.
In this program, we will demonstrate sampling distributions, resampling, and statistical
inference using bootstrapping to estimate the confidence interval for the mean of a
population based on a sample.
import numpy as np
import matplotlib.pyplot as plt
# Sample data (ratio-level data)
data = [56, 77, 89, 92, 65, 74, 83, 90, 78, 85, 60, 69]
# Function for bootstrapping to estimate confidence interval
def bootstrap(data, n_iterations, sample_size):
    means = []
    for _ in range(n_iterations):
        sample = np.random.choice(data, size=sample_size, replace=True)
        means.append(np.mean(sample))
    return np.array(means)
# Parameters
np.random.seed(42)       # fix the seed so the bootstrap results are reproducible
n_iterations = 1000      # Number of bootstrap iterations
sample_size = len(data)  # Size of each sample
# Perform bootstrapping
bootstrap_means = bootstrap(data, n_iterations, sample_size)
# Calculate the 95% confidence interval
conf_interval = np.percentile(bootstrap_means, [2.5, 97.5])
# Output the results
print(f"Bootstrap 95% Confidence Interval for the Mean: {conf_interval}")
print(f"Bootstrap Mean Estimate: {np.mean(bootstrap_means)}")
# Plotting the distribution of the bootstrap samples
plt.hist(bootstrap_means, bins=30, edgecolor='black')
plt.axvline(conf_interval[0], color='red', linestyle='dashed', linewidth=2, label=f'2.5% CI: {conf_interval[0]:.2f}')
plt.axvline(conf_interval[1], color='red', linestyle='dashed', linewidth=2, label=f'97.5% CI: {conf_interval[1]:.2f}')
plt.axvline(np.mean(bootstrap_means), color='blue', linestyle='solid', linewidth=2, label=f'Mean: {np.mean(bootstrap_means):.2f}')
plt.title('Bootstrap Distribution of Sample Means')
plt.xlabel('Sample Means')
plt.ylabel('Frequency')
plt.legend()
plt.show()
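For comparison, a classical t-based confidence interval can be computed directly; a minimal sketch using SciPy's standard stats.t.interval and stats.sem functions on the same data:

from scipy import stats

# Classical 95% CI for the mean, assuming approximate normality of the sample mean
t_ci = stats.t.interval(0.95, df=len(data) - 1, loc=np.mean(data), scale=stats.sem(data))
print(f"t-based 95% Confidence Interval for the Mean: {t_ci}")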
3) Write a program to implement Statistical Hypothesis Testing and
Analysis in Python.
This program performs hypothesis testing, including the Chi-Square test, t-Test, and ANOVA
(Analysis of Variance), along with Correlation Analysis.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
# Data for testing
group1 = np.array([23, 45, 67, 89, 56, 45, 78, 23, 67, 89]) # Example data for group 1
group2 = np.array([45, 67, 56, 34, 23, 90, 56, 78, 54, 68]) # Example data for group 2
observed = np.array([10, 20, 30, 40, 50])  # Observed frequencies for the Chi-square test
expected = np.array([12, 18, 28, 38, 52])  # Expected frequencies for the Chi-square test
# Chi-Square Test
# Recent SciPy versions require the observed and expected totals to agree,
# so rescale the expected frequencies to the observed total first.
expected = expected * observed.sum() / expected.sum()
chi2_stat, p_val_chi2 = stats.chisquare(observed, expected)
print(f"Chi-Square Test: chi2_stat = {chi2_stat}, p-value = {p_val_chi2}")
# Independent t-Test
t_stat, p_val_ttest = stats.ttest_ind(group1, group2)
print(f"\nt-Test: t_stat = {t_stat}, p-value = {p_val_ttest}")
# ANOVA (Analysis of Variance) - Testing for significant differences between multiple groups
group3 = np.array([100, 110, 120, 130, 140])
group4 = np.array([150, 160, 170, 180, 190])
f_stat, p_val_anova = stats.f_oneway(group1, group2, group3, group4)
print(f"\nANOVA: F-stat = {f_stat}, p-value = {p_val_anova}")
# Correlation Analysis - Pearson's correlation
correlation, p_val_corr = stats.pearsonr(group1, group2)
print(f"\nCorrelation Analysis: Pearson correlation = {correlation}, p-value = {p_val_corr}")
# Plotting the correlation using a scatter plot
plt.scatter(group1, group2)
plt.title("Correlation between Group 1 and Group 2")
plt.xlabel("Group 1")
plt.ylabel("Group 2")
plt.show()
Sample Output:
The script prints the chi-square statistic and p-value, the t statistic and p-value, the ANOVA F statistic and p-value, and the Pearson correlation coefficient and p-value, then displays the scatter plot of Group 1 against Group 2. With the data above, the two group means are close (58.2 vs. 57.1), so the t-test p-value is large, while groups 3 and 4 lie far from groups 1 and 2, so the ANOVA F statistic is large and its p-value is very small.
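To turn these p-values into accept/reject decisions, compare each against a significance level; a short continuation of the script above using the conventional alpha = 0.05:

alpha = 0.05  # conventional significance level
for name, p in [("Chi-Square", p_val_chi2), ("t-Test", p_val_ttest),
                ("ANOVA", p_val_anova), ("Correlation", p_val_corr)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.4f} -> {decision}")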
4) Write a program to implement Regression Modeling and Bayesian
Inference in Python.
This program demonstrates Regression Modeling, including Linear Regression and Bayesian
Inference using a Bayesian Network approach to predict outcomes based on data.
import numpy as np
import statsmodels.api as sm
import pymc3 as pm
import matplotlib.pyplot as plt
# Generate synthetic data for linear regression (example: predicting house price from size)
np.random.seed(0)
size = np.random.normal(1500, 500, 100)  # House sizes in square feet
price = size * 300 + np.random.normal(50000, 10000, 100)  # Price in dollars (size * 300 + noise)
# Linear Regression (Ordinary Least Squares)
X = sm.add_constant(size) # Adding constant for intercept
model = sm.OLS(price, X).fit() # Fit the model
print(model.summary())
# Visualization of regression line
plt.scatter(size, price, color='blue', label='Data Points')
plt.plot(size, model.predict(X), color='red', label='Regression Line')
plt.title('Linear Regression: House Price vs. Size')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price (in dollars)')
plt.legend()
plt.show()
# Bayesian Inference using PyMC3 (Bayesian Linear Regression)
with pm.Model() as model_bayesian:
    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)
    # Linear model
    mu = alpha + beta * size
    # Likelihood
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=price)
    # Sampling from the posterior distribution
    trace = pm.sample(2000, return_inferencedata=False)
# Plot posterior distributions
pm.plot_posterior(trace, var_names=['alpha', 'beta', 'sigma'])
plt.show()
Sample Output:
The OLS summary table is printed (coefficients for the intercept and size, their standard errors, t statistics, and the model R-squared), followed by the regression scatter plot and, after sampling completes, the posterior distribution plots for alpha, beta, and sigma.
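PyMC3 is no longer maintained; on current installations the same Bayesian regression is usually written with its successor PyMC (v5) and ArviZ for plotting. A minimal sketch under that assumption, reusing size and price from the script above:

import pymc as pm   # PyMC v5, the successor to PyMC3
import arviz as az
import matplotlib.pyplot as plt

with pm.Model():
    alpha = pm.Normal('alpha', mu=0, sigma=10)   # prior for the intercept
    beta = pm.Normal('beta', mu=0, sigma=10)     # prior for the slope
    sigma = pm.HalfNormal('sigma', sigma=10)     # prior for the noise scale
    pm.Normal('Y_obs', mu=alpha + beta * size, sigma=sigma, observed=price)
    idata = pm.sample(2000)   # returns an ArviZ InferenceData object by default

az.plot_posterior(idata, var_names=['alpha', 'beta', 'sigma'])
plt.show()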
5) Write a program to implement Data Wrangling and Cleaning in Python.
This program demonstrates how to gather, assess, and clean data before performing
visualizations. We will work with a sample dataset (e.g., from a CSV file) to perform basic
data wrangling tasks.
import pandas as pd
import numpy as np
# In practice you would load a dataset from a CSV file;
# for this example we use a small manually created dataset for simplicity.
data = {
'Age': [23, 45, 36, np.nan, 50, 29, 35, np.nan, 60, 44],
'Income': [50000, 70000, 80000, 60000, 120000, np.nan, 55000, 80000, 100000, 85000],
'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago', 'Los Angeles', np.nan,
'Chicago', 'New York', 'Los Angeles']
}
df = pd.DataFrame(data)
# Displaying the initial dataset
print("Initial Dataset:")
print(df)
# Assessing the data - Checking for missing values
print("\nMissing Values:")
print(df.isnull().sum())
# Cleaning the data - Handling missing values
# For 'Age' and 'Income', fill missing values with the mean of the respective column.
# (Assigning back avoids the chained-assignment pitfalls of fillna(..., inplace=True).)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Income'] = df['Income'].fillna(df['Income'].mean())
# For 'City', fill missing values with the mode (most common value)
df['City'] = df['City'].fillna(df['City'].mode()[0])
# Displaying the cleaned dataset
print("\nCleaned Dataset:")
print(df)
Sample Output:
Initial Dataset:
    Age    Income         City
0  23.0   50000.0     New York
1  45.0   70000.0  Los Angeles
2  36.0   80000.0      Chicago
3   NaN   60000.0     New York
4  50.0  120000.0      Chicago
5  29.0       NaN  Los Angeles
6  35.0   55000.0          NaN
7   NaN   80000.0      Chicago
8  60.0  100000.0     New York
9  44.0   85000.0  Los Angeles

Missing Values:
Age       2
Income    1
City      1
dtype: int64

Cleaned Dataset:
     Age         Income         City
0  23.00   50000.000000     New York
1  45.00   70000.000000  Los Angeles
2  36.00   80000.000000      Chicago
3  40.25   60000.000000     New York
4  50.00  120000.000000      Chicago
5  29.00   77777.777778  Los Angeles
6  35.00   55000.000000      Chicago
7  40.25   80000.000000      Chicago
8  60.00  100000.000000     New York
9  44.00   85000.000000  Los Angeles
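Imputing with the mean or mode is only one strategy; a minimal sketch of two common alternatives, using pandas' standard dropna and interpolate methods on a fresh copy of the raw data:

raw = pd.DataFrame(data)                # rebuild the uncleaned frame from the dict above
dropped = raw.dropna()                  # option 1: discard any row with a missing value
print(f"Rows remaining after dropna: {len(dropped)}")
raw['Age'] = raw['Age'].interpolate()   # option 2: linearly interpolate numeric gaps
print(raw['Age'])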
6) Write a program to implement Data Visualization in Data Analysis in
Python.
This program focuses on several types of visualizations: univariate, bivariate, and
multivariate exploration. We will use the cleaned data from Program 5 and apply different
visualization techniques to explore the relationships in the dataset.
import matplotlib.pyplot as plt
import seaborn as sns
# Data from Program 5
# Assuming df is the cleaned DataFrame from Program 5
# Univariate Exploration: Distribution of 'Age'
plt.figure(figsize=(8, 6))
sns.histplot(df['Age'], kde=True, color='blue', bins=8)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
# Bivariate Exploration: Scatter plot of 'Income' vs 'Age'
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Age', y='Income', data=df, hue='City', palette='viridis', s=100)
plt.title('Income vs Age by City')
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()
# Multivariate Exploration: Pairplot (pairwise relationships)
g = sns.pairplot(df, hue='City', diag_kind='kde')
g.figure.suptitle('Pairplot of Age, Income, and City', y=1.02)  # plt.title would only label the last axes
plt.show()
# Explanatory Visualization: Boxplot of 'Income' across 'City'
plt.figure(figsize=(8, 6))
sns.boxplot(x='City', y='Income', data=df)
plt.title('Income Distribution by City')
plt.xlabel('City')
plt.ylabel('Income')
plt.show()
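Another common multivariate view is a correlation heatmap of the numeric columns; a minimal sketch using seaborn's standard heatmap function on the same df:

plt.figure(figsize=(6, 4))
sns.heatmap(df[['Age', 'Income']].corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Numeric Columns')
plt.show()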
7) Write a program to implement Data Ecosystem Overview, File Formats,
and Sources of Data in Python.
This program demonstrates the handling of different data types and file formats (CSV, JSON,
and NoSQL with MongoDB) to understand various sources of data and data repositories.
import pandas as pd
import json
import pymongo
from bson.json_util import dumps
# 1. Overview of Data Types (Structured, Semi-structured, Unstructured)
# Structured Data: Tabular format (e.g., CSV, SQL Databases)
structured_data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [24, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(structured_data)
print("Structured Data (DataFrame):")
print(df)
# Semi-structured Data: JSON format
semi_structured_data = [
{"Name": "Alice", "Age": 24, "City": "New York"},
{"Name": "Bob", "Age": 30, "City": "Los Angeles"},
{"Name": "Charlie", "Age": 35, "City": "Chicago"}
]
json_data = json.dumps(semi_structured_data, indent=4)
print("\nSemi-structured Data (JSON):")
print(json_data)
# Unstructured Data: Text data (example)
unstructured_data = "Alice is 24 years old. Bob is 30 years old. Charlie is 35 years old."
print("\nUnstructured Data (Text):")
print(unstructured_data)
# 2. Sources of Data: Using a CSV file (substitute any CSV file available to you)
csv_data = pd.read_csv('sample_data.csv')  # Assumes 'sample_data.csv' exists in your directory
print("\nCSV Data Loaded:")
print(csv_data.head())
# 3. MongoDB NoSQL Database Example
client = pymongo.MongoClient("mongodb://localhost:27017/")  # MongoDB connection (locally hosted)
db = client['test_database']
collection = db['people']
# Inserting documents into MongoDB
# insert_many expects a list of documents (dicts), so we insert the row-oriented
# records rather than the column-oriented structured_data dict.
collection.delete_many({})  # clear the collection so reruns do not accumulate duplicates
collection.insert_many(df.to_dict('records'))
print("\nMongoDB Data Insertion Complete.")
# Retrieve and print documents (excluding MongoDB's auto-generated _id field)
documents = collection.find({}, {'_id': 0})
print("\nData Retrieved from MongoDB:")
print(dumps(documents, indent=4))
Sample Output:
Structured Data (DataFrame):
Name Age City
0 Alice 24 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
Semi-structured Data (JSON):
[
{
"Name": "Alice",
"Age": 24,
"City": "New York"
},
{
"Name": "Bob",
"Age": 30,
"City": "Los Angeles"
},
{
"Name": "Charlie",
"Age": 35,
"City": "Chicago"
}
]
Unstructured Data (Text):
Alice is 24 years old. Bob is 30 years old. Charlie is 35 years old.
CSV Data Loaded:
Name Age City
0 Alice 24 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
MongoDB Data Insertion Complete.
Data Retrieved from MongoDB:
[
{
"Name": "Alice",
"Age": 24,
"City": "New York"
},
{
"Name": "Bob",
"Age": 30,
"City": "Los Angeles"
},
{
"Name": "Charlie",
"Age": 35,
"City": "Chicago"
}
]
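Beyond CSV and JSON, columnar formats such as Parquet are common in data repositories; a minimal sketch using pandas' standard to_parquet/read_parquet methods (assumes the pyarrow or fastparquet engine is installed):

# Write the structured data to Parquet and read it back
df.to_parquet('sample_data.parquet')   # requires pyarrow or fastparquet
parquet_data = pd.read_parquet('sample_data.parquet')
print(parquet_data.head())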
8) Write a program to implement Data Pipelines, ETL, and Big Data
Processing with Spark in Python.
This program demonstrates ETL (Extract, Transform, Load) and basic big data processing
using Apache Spark (via PySpark). It also touches on concepts like Hadoop Distributed File
System (HDFS) and processing large datasets in Spark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# 1. Setting up a Spark session
spark = SparkSession.builder \
.appName("ETL and Big Data Processing") \
.getOrCreate()
# 2. Extracting data: Loading a large dataset (assuming a CSV file with large data)
# Example: Loading the data from HDFS or the local file system (adjust the path accordingly)
# data = spark.read.csv("hdfs://path_to_large_dataset.csv", header=True, inferSchema=True)
# For demonstration purposes, we use a smaller sample dataframe (similar to large dataset)
data = spark.createDataFrame([
("Alice", 24, "New York"),
("Bob", 30, "Los Angeles"),
("Charlie", 35, "Chicago"),
("David", 40, "Miami"),
("Eve", 22, "San Francisco")
], ["Name", "Age", "City"])
# 3. Transforming data: Keeping only people who are older than 30
filtered_data = data.filter(col("Age") > 30)
# 4. Loading data: Show the result after transformation (ETL process)
print("Transformed Data:")
filtered_data.show()
# 5. Big Data Processing - Using Spark for large dataset processing (e.g., group by city and count)
city_count = data.groupBy("City").count()
print("\nCity Count Aggregation (Big Data Processing Example):")
city_count.show()
# Stop the Spark session
spark.stop()
Sample Output:
Transformed Data:
+-------+---+-------+
|   Name|Age|   City|
+-------+---+-------+
|Charlie| 35|Chicago|
|  David| 40|  Miami|
+-------+---+-------+

City Count Aggregation (Big Data Processing Example):
(row order may vary)
+-------------+-----+
|         City|count|
+-------------+-----+
|     New York|    1|
|  Los Angeles|    1|
|      Chicago|    1|
|        Miami|    1|
|San Francisco|    1|
+-------------+-----+
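In a full ETL pipeline the Load step would persist the transformed data rather than just display it. A minimal sketch using Spark's standard DataFrame writer (the output path is illustrative, and the write must happen before spark.stop()):

# Load: persist the transformed data as Parquet (path is a placeholder)
filtered_data.write.mode("overwrite").parquet("output/filtered_people.parquet")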
9) Write a program to implement Basic Data Visualizations using
Matplotlib, Seaborn, and Pandas in Python.
This program demonstrates how to use Matplotlib for basic plots, Seaborn for advanced
statistical visualizations, and Pandas for data handling.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# 1. Data Preparation using Pandas
# Creating a sample DataFrame with random data
data = {
'Age': np.random.randint(20, 60, 100), # Random ages between 20 and 60
'Income': np.random.randint(30000, 120000, 100), # Random income values
'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Miami'], 100)
}
df = pd.DataFrame(data)
# 2. Basic Plotting with Matplotlib
plt.figure(figsize=(8, 6))
plt.hist(df['Age'], bins=10, color='skyblue', edgecolor='black')
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
# 3. Seaborn: Boxplot to compare Income by City
plt.figure(figsize=(8, 6))
sns.boxplot(x='City', y='Income', data=df, palette='Set2')
plt.title('Income Distribution by City')
plt.xlabel('City')
plt.ylabel('Income')
plt.show()
# 4. Seaborn: Scatterplot to explore relationship between Age and Income
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Age', y='Income', data=df, hue='City', palette='Set1', s=100)
plt.title('Scatterplot of Age vs Income')
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()
Sample Output:
1. Histogram of Age:
o A histogram showing the distribution of Age with bins, allowing you to
observe the frequency of different age ranges.
2. Boxplot of Income by City:
o A boxplot for Income across different cities, showing the central tendency,
spread, and potential outliers.
3. Scatterplot of Age vs Income:
o A scatter plot showing how Income correlates with Age, with different colors
representing different cities.
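Pandas also ships its own plotting interface (a thin wrapper over Matplotlib); a minimal sketch of the same explorations using the standard DataFrame.plot accessor on the df built above:

# Pandas-native plotting (wraps Matplotlib)
df['Age'].plot(kind='hist', bins=10, edgecolor='black', title='Histogram of Age (pandas)')
plt.show()
df.plot.scatter(x='Age', y='Income', title='Age vs Income (pandas)')
plt.show()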
10) Write a program to implement Interactive Visualizations using Plotly
in Python.
This program demonstrates how to create interactive visualizations using Plotly. We will use
Plotly Express to visualize a dataset, making it more dynamic and user-friendly.
import plotly.express as px
import pandas as pd
import numpy as np
# 1. Data Preparation using Pandas
# Creating a sample DataFrame with random data
data = {
'Age': np.random.randint(20, 60, 100), # Random ages between 20 and 60
'Income': np.random.randint(30000, 120000, 100), # Random income values
'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Miami'], 100)
}
df = pd.DataFrame(data)
# 2. Plotly: Scatter plot of Age vs Income with color by City
fig = px.scatter(df, x='Age', y='Income', color='City', title="Age vs Income by City",
                 labels={'Age': 'Age (years)', 'Income': 'Income ($)'})
fig.show()
# 3. Plotly: Box plot of Income by City
fig = px.box(df, x='City', y='Income', title="Income Distribution by City",
             labels={'Income': 'Income ($)', 'City': 'City'})
fig.show()
# 4. Plotly: Histogram of Age
fig = px.histogram(df, x='Age', title="Distribution of Age", labels={'Age': 'Age (years)'})
fig.show()
Sample Output:
1. Interactive Scatter Plot (Age vs Income by City): a dynamic scatter plot where hovering over a point shows its Age, Income, and City; you can zoom in and out to explore specific data points.
2. Interactive Box Plot (Income by City): an interactive box plot of the Income distribution across cities; hovering over a box shows statistics such as the median, quartiles, and outliers.
3. Interactive Histogram (Age Distribution): an interactive histogram of Age showing the frequency of different age groups; you can zoom in on specific bins for more detail.
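Plotly figures can also be exported as self-contained interactive HTML files for sharing; a minimal sketch using the standard write_html method (the filename is illustrative):

# Save the last figure as a standalone interactive HTML page (opens in any browser)
fig.write_html('age_distribution.html')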
Key Concepts Covered in These Programs:
1. Matplotlib:
   - Used for creating basic plots such as histograms and line plots.
   - Offers customization options for appearance (e.g., titles, labels, colors).
2. Seaborn:
   - Built on top of Matplotlib, Seaborn provides higher-level abstractions and improved default visualizations.
   - Well suited to statistical plots such as boxplots and scatterplots with hue based on categories.
3. Plotly:
   - Enables the creation of interactive visualizations; excellent for scatter plots, box plots, histograms, and more.
   - Offers interactivity features such as zooming, panning, and hover text.
4. Pandas:
   - Used to manipulate and prepare data for visualization.
   - Simplifies handling and transforming data for plotting with libraries like Matplotlib, Seaborn, and Plotly.