
SAGAR INSTITUTE OF RESEARCH AND TECHNOLOGY

Bhopal, Madhya Pradesh (462021)

DEPARTMENT OF AIML

Lab Manual

of

Data and Visual Analytics AL603 (B)


INDEX

S.No | Name of Experiment | Plan Date | Date of Completion | Remark
(the date and remark columns are left blank to be filled in during the lab)

1.  Write a program to implement Descriptive Statistics with Measures of Central Tendency and Dispersion in Python.
2.  Write a program to implement Statistical Inference and Sampling Distribution in Python.
3.  Write a program to implement Statistical Hypothesis Testing and Analysis in Python.
4.  Write a program to implement Regression Modeling and Bayesian Inference in Python.
5.  Write a program to implement Data Wrangling and Cleaning in Python.
6.  Write a program to implement Data Visualization in Data Analysis in Python.
7.  Write a program to implement Data Ecosystem Overview, File Formats, and Sources of Data in Python.
8.  Write a program to implement Data Pipelines, ETL, and Big Data Processing with Spark in Python.
9.  Write a program to implement Basic Data Visualizations using Matplotlib, Seaborn, and Pandas in Python.
10. Write a program to implement Interactive Visualizations using Plotly in Python.
1) Write a program to implement Descriptive Statistics with Measures of
Central Tendency and Dispersion in Python.

import numpy as np
import pandas as pd
from scipy import stats

# Sample data representing test scores of students (ratio scale data)
data = [56, 77, 89, 92, 65, 74, 83, 90, 78, 85, 60, 69]

# Convert the data into a pandas Series for better management and indexing
data_series = pd.Series(data)

# Descriptive statistics
mean = np.mean(data_series)
median = np.median(data_series)
# keepdims=False returns a scalar mode (required in SciPy >= 1.11);
# with no repeated value, SciPy reports the smallest value as the mode.
mode = stats.mode(data_series, keepdims=False).mode
variance = np.var(data_series)         # population variance (ddof=0)
std_deviation = np.std(data_series)    # population standard deviation
range_data = np.max(data_series) - np.min(data_series)

# Output the results
print("Descriptive Statistics for Test Scores:")
print(f"Mean: {mean:.4f}")
print(f"Median: {median:.4f}")
print(f"Mode: {mode}")
print(f"Variance: {variance:.4f}")
print(f"Standard Deviation: {std_deviation:.4f}")
print(f"Range: {range_data}")

Sample Output:
Descriptive Statistics for Test Scores:
Mean: 76.5000
Median: 77.5000
Mode: 56
Variance: 131.9167
Standard Deviation: 11.4855
Range: 36
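As a cross-check, pandas can produce most of these statistics in one call. A minimal sketch using the same data_series; note that pandas uses the sample formulas (ddof=1), so its std differs slightly from the population values above:

# Quick cross-check with pandas' built-in summary (count, mean, std, quartiles)
print(data_series.describe())
print(f"Sample variance (ddof=1): {data_series.var():.4f}")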
2) Write a program to implement Statistical Inference and Sampling
Distribution in Python.

In this program, we will demonstrate sampling distributions, resampling, and statistical inference, using bootstrapping to estimate a confidence interval for the population mean based on a sample.

import numpy as np
import matplotlib.pyplot as plt

# Sample data (ratio-level data)
data = [56, 77, 89, 92, 65, 74, 83, 90, 78, 85, 60, 69]

# Function for bootstrapping to estimate a confidence interval
def bootstrap(data, n_iterations, sample_size):
    means = []
    for _ in range(n_iterations):
        sample = np.random.choice(data, size=sample_size, replace=True)
        means.append(np.mean(sample))
    return np.array(means)

# Parameters
np.random.seed(42)       # fix the seed so the bootstrap results are reproducible
n_iterations = 1000      # Number of bootstrap iterations
sample_size = len(data)  # Size of each resample

# Perform bootstrapping
bootstrap_means = bootstrap(data, n_iterations, sample_size)

# Calculate the 95% confidence interval
conf_interval = np.percentile(bootstrap_means, [2.5, 97.5])

# Output the results
print(f"Bootstrap 95% Confidence Interval for the Mean: {conf_interval}")
print(f"Bootstrap Mean Estimate: {np.mean(bootstrap_means):.4f}")

# Plot the distribution of the bootstrap sample means
plt.hist(bootstrap_means, bins=30, edgecolor='black')
plt.axvline(conf_interval[0], color='red', linestyle='dashed', linewidth=2,
            label=f'2.5% CI: {conf_interval[0]:.2f}')
plt.axvline(conf_interval[1], color='red', linestyle='dashed', linewidth=2,
            label=f'97.5% CI: {conf_interval[1]:.2f}')
plt.axvline(np.mean(bootstrap_means), color='blue', linestyle='solid', linewidth=2,
            label=f'Mean: {np.mean(bootstrap_means):.2f}')
plt.title('Bootstrap Distribution of Sample Means')
plt.xlabel('Sample Means')
plt.ylabel('Frequency')
plt.legend()
plt.show()
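For comparison, the classical t-based confidence interval for the mean can be computed analytically. A minimal sketch with scipy.stats, using the same data list as above; it should roughly agree with the bootstrap interval:

import numpy as np
from scipy import stats

data = [56, 77, 89, 92, 65, 74, 83, 90, 78, 85, 60, 69]

# Classical 95% CI: mean +/- t * (s / sqrt(n)), using the t distribution
# with n - 1 degrees of freedom.
n = len(data)
mean = np.mean(data)
sem = stats.sem(data)  # standard error of the mean (uses ddof=1)
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"t-based 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")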

3) Write a program to implement Statistical Hypothesis Testing and Analysis in Python.

This program performs hypothesis testing, including the Chi-Square test, t-Test, and ANOVA
(Analysis of Variance), along with Correlation Analysis.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Data for testing
group1 = np.array([23, 45, 67, 89, 56, 45, 78, 23, 67, 89])  # Example data for group 1
group2 = np.array([45, 67, 56, 34, 23, 90, 56, 78, 54, 68])  # Example data for group 2
observed = np.array([10, 20, 30, 40, 50])  # Observed frequencies for Chi-square test
expected = np.array([12, 18, 28, 38, 52])  # Expected frequencies for Chi-square test

# Chi-Square Test (goodness of fit).
# scipy.stats.chisquare requires the observed and expected totals to match,
# so rescale the expected frequencies to the observed total first.
expected = expected * observed.sum() / expected.sum()
chi2_stat, p_val_chi2 = stats.chisquare(observed, expected)
print(f"Chi-Square Test: chi2_stat = {chi2_stat}, p-value = {p_val_chi2}")

# Independent t-Test
t_stat, p_val_ttest = stats.ttest_ind(group1, group2)
print(f"\nt-Test: t_stat = {t_stat}, p-value = {p_val_ttest}")

# ANOVA (Analysis of Variance) - testing whether there are significant
# differences between multiple groups
group3 = np.array([100, 110, 120, 130, 140])
group4 = np.array([150, 160, 170, 180, 190])
f_stat, p_val_anova = stats.f_oneway(group1, group2, group3, group4)
print(f"\nANOVA: F-stat = {f_stat}, p-value = {p_val_anova}")

# Correlation Analysis - Pearson's correlation
correlation, p_val_corr = stats.pearsonr(group1, group2)
print(f"\nCorrelation Analysis: Pearson correlation = {correlation}, p-value = {p_val_corr}")

# Plot the relationship between the two groups as a scatter plot
plt.scatter(group1, group2)
plt.title("Correlation between Group 1 and Group 2")
plt.xlabel("Group 1")
plt.ylabel("Group 2")
plt.show()
Sample Output (values rounded):

Chi-Square Test: chi2_stat ≈ 0.842, p-value ≈ 0.933

t-Test: t_stat ≈ 0.111, p-value ≈ 0.913

ANOVA: F-stat ≈ 45.6, p-value < 0.001

Correlation Analysis: Pearson correlation ≈ -0.284, p-value ≈ 0.43
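The goodness-of-fit test above compares one observed series against expected counts; for two categorical variables, the usual tool is a chi-square test of independence on a contingency table. A minimal sketch with hypothetical counts:

import numpy as np
from scipy import stats

# Hypothetical 2x3 contingency table: rows = gender, columns = preference
table = np.array([[20, 15, 25],
                  [30, 25, 10]])
chi2, p, dof, expected_counts = stats.chi2_contingency(table)
print(f"Chi-square independence test: chi2 = {chi2:.3f}, "
      f"p-value = {p:.3f}, dof = {dof}")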
4) Write a program to implement Regression Modeling and Bayesian Inference in Python.

This program demonstrates Regression Modeling, including Linear Regression and Bayesian
Inference using a Bayesian Network approach to predict outcomes based on data.

import numpy as np
import statsmodels.api as sm
import pymc3 as pm  # legacy package; on newer installs use `import pymc as pm`
import matplotlib.pyplot as plt

# Generate synthetic data for linear regression
# (example: predicting house price based on size)
np.random.seed(0)
size = np.random.normal(1500, 500, 100)                   # House sizes in square feet
price = size * 300 + np.random.normal(50000, 10000, 100)  # Price = 300 * size + noise

# Linear Regression (Ordinary Least Squares)
X = sm.add_constant(size)       # Add a constant column for the intercept
model = sm.OLS(price, X).fit()  # Fit the model
print(model.summary())

# Visualization of the regression line
plt.scatter(size, price, color='blue', label='Data Points')
plt.plot(size, model.predict(X), color='red', label='Regression Line')
plt.title('Linear Regression: House Price vs. Size')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price (in dollars)')
plt.legend()
plt.show()

# Bayesian Inference using PyMC3 (Bayesian Linear Regression)
with pm.Model() as model_bayesian:
    # Priors for the unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)

    # Linear model
    mu = alpha + beta * size

    # Likelihood of the observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=price)

    # Sample from the posterior distribution
    trace = pm.sample(2000, return_inferencedata=False)

# Plot posterior distributions
pm.plot_posterior(trace, var_names=['alpha', 'beta', 'sigma'])
plt.show()

Sample Output: the OLS regression summary table printed by model.summary() (coefficients, standard errors, R-squared, p-values), followed by the regression-line plot and the posterior distribution plots for alpha, beta, and sigma.
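If PyMC3 is not installed (it is a heavy dependency, since superseded by PyMC), scikit-learn's BayesianRidge gives a lightweight Bayesian linear regression on the same synthetic data. A minimal sketch, not the manual's method:

import numpy as np
from sklearn.linear_model import BayesianRidge

np.random.seed(0)
size = np.random.normal(1500, 500, 100)
price = size * 300 + np.random.normal(50000, 10000, 100)

# BayesianRidge places Gaussian priors on the coefficients and learns the
# noise precision from the data (no MCMC sampling needed).
model = BayesianRidge()
model.fit(size.reshape(-1, 1), price)
print(f"Posterior mean slope: {model.coef_[0]:.2f}")
print(f"Posterior mean intercept: {model.intercept_:.2f}")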
5) Write a program to implement Data Wrangling and Cleaning in Python.

This program demonstrates how to gather, assess, and clean data before performing
visualizations. We will work with a sample dataset (e.g., from a CSV file) to perform basic
data wrangling tasks.

import pandas as pd
import numpy as np

# Load a sample dataset (ordinarily this would come from a CSV file);
# for this example, we use a small manually-created dataset for simplicity.
data = {
    'Age': [23, 45, 36, np.nan, 50, 29, 35, np.nan, 60, 44],
    'Income': [50000, 70000, 80000, 60000, 120000, np.nan, 55000, 80000, 100000, 85000],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago',
             'Los Angeles', np.nan, 'Chicago', 'New York', 'Los Angeles']
}

df = pd.DataFrame(data)

# Display the initial dataset
print("Initial Dataset:")
print(df)

# Assess the data - check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Clean the data - handle missing values.
# For 'Age' and 'Income', fill missing values with the column mean.
# (Assignment is preferred over inplace fillna on a column, which recent
# pandas versions treat as deprecated chained assignment.)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Income'] = df['Income'].fillna(df['Income'].mean())

# For 'City', fill missing values with the mode (most common value)
df['City'] = df['City'].fillna(df['City'].mode()[0])

# Display the cleaned dataset
print("\nCleaned Dataset:")
print(df)

Sample Output:

Initial Dataset:
    Age    Income         City
0  23.0   50000.0     New York
1  45.0   70000.0  Los Angeles
2  36.0   80000.0      Chicago
3   NaN   60000.0     New York
4  50.0  120000.0      Chicago
5  29.0       NaN  Los Angeles
6  35.0   55000.0          NaN
7   NaN   80000.0      Chicago
8  60.0  100000.0     New York
9  44.0   85000.0  Los Angeles

Missing Values:
Age       2
Income    1
City      1
dtype: int64

Cleaned Dataset:
     Age         Income         City
0  23.00   50000.000000     New York
1  45.00   70000.000000  Los Angeles
2  36.00   80000.000000      Chicago
3  40.25   60000.000000     New York
4  50.00  120000.000000      Chicago
5  29.00   77777.777778  Los Angeles
6  35.00   55000.000000      Chicago
7  40.25   80000.000000      Chicago
8  60.00  100000.000000     New York
9  44.00   85000.000000  Los Angeles
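Beyond missing values, wrangling typically also covers duplicates and outliers. A minimal sketch on the same df, using the common 1.5 x IQR rule:

# Drop exact duplicate rows (none in this small example, but a standard step)
df = df.drop_duplicates()

# Flag Income outliers with the 1.5 * IQR rule
q1, q3 = df['Income'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Income'] < q1 - 1.5 * iqr) | (df['Income'] > q3 + 1.5 * iqr)]
print("Income outliers by the IQR rule:")
print(outliers)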
6) Write a program to implement Data Visualization in Data Analysis in Python.

This program focuses on various types of visualizations: univariate, bivariate, and multivariate exploration. We use the cleaned DataFrame df from Program 5 and apply different visualization techniques to explore the relationships in the dataset.

import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to be the cleaned DataFrame from Program 5

# Univariate exploration: distribution of 'Age'
plt.figure(figsize=(8, 6))
sns.histplot(df['Age'], kde=True, color='blue', bins=8)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Bivariate exploration: scatter plot of 'Income' vs 'Age'
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Age', y='Income', data=df, hue='City', palette='viridis', s=100)
plt.title('Income vs Age by City')
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()

# Multivariate exploration: pairplot (pairwise relationships).
# pairplot creates its own figure, so the title is set via suptitle.
g = sns.pairplot(df, hue='City', diag_kind='kde')
g.fig.suptitle('Pairplot of Age, Income, and City', y=1.02)
plt.show()

# Explanatory visualization: boxplot of 'Income' across 'City'
plt.figure(figsize=(8, 6))
sns.boxplot(x='City', y='Income', data=df)
plt.title('Income Distribution by City')
plt.xlabel('City')
plt.ylabel('Income')
plt.show()
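A correlation heatmap is another common multivariate view. A minimal sketch over the numeric columns of df (numeric_only requires a recent pandas):

# Heatmap of pairwise correlations between the numeric columns (Age, Income)
plt.figure(figsize=(6, 4))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Numeric Columns')
plt.show()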
7) Write a program to implement Data Ecosystem Overview, File Formats,
and Sources of Data in Python.

This program demonstrates the handling of different data types and file formats (CSV, JSON,
and NoSQL with MongoDB) to understand various sources of data and data repositories.

import pandas as pd
import json
import pymongo
from bson.json_util import dumps

# 1. Overview of Data Types (Structured, Semi-structured, Unstructured)

# Structured Data: tabular format (e.g., CSV, SQL databases)
structured_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(structured_data)
print("Structured Data (DataFrame):")
print(df)

# Semi-structured Data: JSON format
semi_structured_data = [
    {"Name": "Alice", "Age": 24, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "Los Angeles"},
    {"Name": "Charlie", "Age": 35, "City": "Chicago"}
]
json_data = json.dumps(semi_structured_data, indent=4)
print("\nSemi-structured Data (JSON):")
print(json_data)

# Unstructured Data: free text (example)
unstructured_data = "Alice is 24 years old. Bob is 30 years old. Charlie is 35 years old."
print("\nUnstructured Data (Text):")
print(unstructured_data)

# 2. Sources of Data: reading a CSV file
# (assumes 'sample_data.csv' exists in your working directory)
csv_data = pd.read_csv('sample_data.csv')
print("\nCSV Data Loaded:")
print(csv_data.head())

# 3. MongoDB NoSQL Database Example (locally hosted MongoDB)
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client['test_database']
collection = db['people']

# insert_many expects a list of documents (dicts), so insert the
# record-oriented form of the data, not the column-oriented dict above.
collection.insert_many(df.to_dict('records'))
print("\nMongoDB Data Insertion Complete.")

# Retrieve and print the documents (materialize the cursor as a list
# so it can be serialized by bson.json_util.dumps)
documents = list(collection.find())
print("\nData Retrieved from MongoDB:")
print(dumps(documents, indent=4))

Sample Output:
Structured Data (DataFrame):
Name Age City
0 Alice 24 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago

Semi-structured Data (JSON):
[
{
"Name": "Alice",
"Age": 24,
"City": "New York"
},
{
"Name": "Bob",
"Age": 30,
"City": "Los Angeles"
},
{
"Name": "Charlie",
"Age": 35,
"City": "Chicago"
}
]

Unstructured Data (Text):
Alice is 24 years old. Bob is 30 years old. Charlie is 35 years old.

CSV Data Loaded:
Name Age City
0 Alice 24 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago

MongoDB Data Insertion Complete.

Data Retrieved from MongoDB (each document also carries an auto-generated "_id"):
[
    {
        "_id": {
            "$oid": "..."
        },
        "Name": "Alice",
        "Age": 24,
        "City": "New York"
    },
    {
        "_id": {
            "$oid": "..."
        },
        "Name": "Bob",
        "Age": 30,
        "City": "Los Angeles"
    },
    {
        "_id": {
            "$oid": "..."
        },
        "Name": "Charlie",
        "Age": 35,
        "City": "Chicago"
    }
]
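As an extension on file formats, the same DataFrame can be written to and read back from other common formats. A minimal sketch (the file names are illustrative placeholders):

# Write and read back a few common formats supported by pandas
df.to_csv('people.csv', index=False)                    # comma-separated values
df.to_json('people.json', orient='records', indent=4)   # JSON records

print(pd.read_csv('people.csv').head())
print(pd.read_json('people.json').head())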
8) Write a program to implement Data Pipelines, ETL, and Big Data
Processing with Spark in Python.

This program demonstrates ETL (Extract, Transform, Load) and basic big data processing
using Apache Spark (via PySpark). It also touches on concepts like Hadoop Distributed File
System (HDFS) and processing large datasets in Spark.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# 1. Set up a Spark session
spark = SparkSession.builder \
    .appName("ETL and Big Data Processing") \
    .getOrCreate()

# 2. Extract: load a large dataset (e.g., a CSV file from HDFS or the
#    local file system - adjust the path accordingly)
# data = spark.read.csv("hdfs://path_to_large_dataset.csv", header=True, inferSchema=True)

# For demonstration purposes, we use a small sample DataFrame instead
data = spark.createDataFrame([
    ("Alice", 24, "New York"),
    ("Bob", 30, "Los Angeles"),
    ("Charlie", 35, "Chicago"),
    ("David", 40, "Miami"),
    ("Eve", 22, "San Francisco")
], ["Name", "Age", "City"])

# 3. Transform: keep only people older than 30
filtered_data = data.filter(col("Age") > 30)

# 4. Load: show the result after transformation (end of the ETL process)
print("Transformed Data:")
filtered_data.show()

# 5. Big data processing with Spark (e.g., group by city and count)
city_count = data.groupBy("City").count()

print("\nCity Count Aggregation (Big Data Processing Example):")
city_count.show()

# Stop the Spark session
spark.stop()

Sample Output:

Transformed Data:
+-------+---+-------+
|   Name|Age|   City|
+-------+---+-------+
|Charlie| 35|Chicago|
|  David| 40|  Miami|
+-------+---+-------+

City Count Aggregation (Big Data Processing Example):
+-------------+-----+
|         City|count|
+-------------+-----+
|     New York|    1|
|  Los Angeles|    1|
|      Chicago|    1|
|        Miami|    1|
|San Francisco|    1|
+-------------+-----+
(row order may vary)
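Columnar formats such as Parquet are the usual sink in Spark pipelines, and Spark SQL is the usual query layer. A minimal sketch (the output path is a placeholder, and these calls would go before spark.stop()):

# Write the transformed data to Parquet, the de facto columnar format for
# Spark; in a real pipeline the path would point at HDFS or object storage.
filtered_data.write.mode("overwrite").parquet("output/filtered_people.parquet")

# Spark SQL view of the same data
data.createOrReplaceTempView("people")
spark.sql("SELECT City, AVG(Age) AS avg_age FROM people GROUP BY City").show()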
9) Write a program to implement Basic Data Visualizations using
Matplotlib, Seaborn, and Pandas in Python.

This program demonstrates how to use Matplotlib for basic plots, Seaborn for advanced
statistical visualizations, and Pandas for data handling.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# 1. Data Preparation using Pandas
# Creating a sample DataFrame with random data
data = {
    'Age': np.random.randint(20, 60, 100),            # Random ages between 20 and 60
    'Income': np.random.randint(30000, 120000, 100),  # Random income values
    'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Miami'], 100)
}

df = pd.DataFrame(data)

# 2. Basic Plotting with Matplotlib
plt.figure(figsize=(8, 6))
plt.hist(df['Age'], bins=10, color='skyblue', edgecolor='black')
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# 3. Seaborn: Boxplot to compare Income by City
plt.figure(figsize=(8, 6))
sns.boxplot(x='City', y='Income', data=df, palette='Set2')
plt.title('Income Distribution by City')
plt.xlabel('City')
plt.ylabel('Income')
plt.show()

# 4. Seaborn: Scatterplot to explore relationship between Age and Income
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Age', y='Income', data=df, hue='City', palette='Set1', s=100)
plt.title('Scatterplot of Age vs Income')
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()

Sample Output:
1. Histogram of Age:
o A histogram showing the distribution of Age with bins, allowing you to
observe the frequency of different age ranges.
2. Boxplot of Income by City:
o A boxplot for Income across different cities, showing the central tendency,
spread, and potential outliers.
3. Scatterplot of Age vs Income:
o A scatter plot showing how Income correlates with Age, with different colors
representing different cities.
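Pandas itself also has built-in plotting (a thin wrapper over Matplotlib), which this experiment's title calls for. A minimal sketch on the same df, reusing the plt already imported above:

# Pandas' .plot API: a histogram and a bar chart of mean income per city
df['Age'].plot(kind='hist', bins=10, title='Age (pandas .plot)', edgecolor='black')
plt.xlabel('Age')
plt.show()

df.groupby('City')['Income'].mean().plot(kind='bar', title='Mean Income by City')
plt.ylabel('Mean Income')
plt.show()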

10) Write a program to implement Interactive Visualizations using Plotly
in Python.

This program demonstrates how to create interactive visualizations using Plotly. We will use
Plotly Express to visualize a dataset, making it more dynamic and user-friendly.

import plotly.express as px
import pandas as pd
import numpy as np

# 1. Data Preparation using Pandas
# Creating a sample DataFrame with random data
data = {
    'Age': np.random.randint(20, 60, 100),            # Random ages between 20 and 60
    'Income': np.random.randint(30000, 120000, 100),  # Random income values
    'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Miami'], 100)
}

df = pd.DataFrame(data)

# 2. Plotly: scatter plot of Age vs Income, colored by City
fig = px.scatter(df, x='Age', y='Income', color='City',
                 title="Age vs Income by City",
                 labels={'Age': 'Age (years)', 'Income': 'Income ($)'})
fig.show()

# 3. Plotly: box plot of Income by City
fig = px.box(df, x='City', y='Income', title="Income Distribution by City",
             labels={'Income': 'Income ($)', 'City': 'City'})
fig.show()

# 4. Plotly: histogram of Age
fig = px.histogram(df, x='Age', title="Distribution of Age",
                   labels={'Age': 'Age (years)'})
fig.show()
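Plotly figures can also be exported as standalone interactive HTML for sharing. A one-line sketch (the file name is a placeholder):

# Save the last figure as a self-contained interactive HTML file that can be
# opened in any browser, preserving zoom and hover interactivity.
fig.write_html("age_histogram.html")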

Sample Output:
1. Interactive Scatter Plot (Age vs Income by City):
o A dynamic scatter plot where you can hover over points to see Age, Income,
and City. You can zoom in or out to explore specific data points.
2. Interactive Box Plot (Income by City):
o An interactive box plot showing the distribution of Income across cities.
Hovering over the box shows detailed statistics like median, quartiles, and
outliers.
3. Interactive Histogram (Age Distribution):
o An interactive histogram for Age showing the frequency of different age
groups. You can zoom in or click on specific bins for more detailed analysis.
Key Concepts Covered in These Programs:
1. Matplotlib:
o Used for creating basic plots like histograms and line plots.
o Offers customization options for appearance (e.g., titles, labels, colors).
2. Seaborn:
o Built on top of Matplotlib, Seaborn provides higher-level abstractions and
improved visualizations.
o Great for statistical plots like boxplots and scatterplots with hue based on
categories.
3. Plotly:
o Plotly enables the creation of interactive visualizations. It's excellent for
scatter plots, box plots, histograms, and more.
o Offers interactivity features like zooming, panning, and hover text.
4. Pandas:
o Pandas is used to manipulate and prepare the data for visualization.
o It simplifies handling and transforming data for plotting with libraries like
Matplotlib, Seaborn, and Plotly.
