Data Science
INDEX

Sr No   Title                                        Date   Sign
1       Introduction to Excel
2       Data Frames and Basic Data Pre-processing
3       Feature Scaling and Dummification
4       Hypothesis Testing
5       ANOVA (Analysis of Variance)
6       Regression and Its Types
7       Logistic Regression and Decision Tree
8       K-Means Clustering
9       Principal Component Analysis (PCA)
10      Data Visualization and Storytelling
PRACTICAL 1
Introduction to Excel
A. Perform conditional formatting on a dataset using various criteria.
Steps:
Step 1: Select the data range, then go to Home > Conditional Formatting > Highlight Cells Rules > Greater Than.
Step 2: Enter the threshold value for the rule, for example 2000, and click OK.
Step 3: To add data bars, go to Conditional Formatting > Data Bars > Solid Fill.
B. Create a pivot table to analyse and summarize data.
Steps:
Step 1: Select the entire table and go to Insert > PivotChart > PivotChart.
Step 2: Select "New Worksheet" in the Create PivotChart window and click OK.
Step 3: Drag the required attributes into the field boxes below (Filters, Legend, Axis, Values).
C. Use the VLOOKUP function to retrieve information from a different worksheet or table.
Steps:
Step 1: Click on an empty cell and type a formula of the following form:
=VLOOKUP(B3, B3:D10, 2, FALSE)
Here B3 is the lookup value, B3:D10 is the table range, 2 is the number of the column in that range to return, and FALSE requests an exact match.
D. Perform what-if analysis using Goal Seek to determine input values for a desired output.
Steps:
Step 1: On the Data tab, go to What-If Analysis > Goal Seek.
Step 2: Fill in "Set cell", "To value", and "By changing cell" in the Goal Seek window and click OK.
PRACTICAL 2
Data Frames and Basic Data Pre-processing
A. Read data from CSV and JSON files into a data frame.
B. Perform basic data pre-processing tasks such as handling missing values and outliers.
Code:
import pandas as pd
# Reading CSV file into DataFrame
df = pd.read_csv("samp.csv")
print("Our dataset:")
print(df)
# Reading JSON file into DataFrame
data = pd.read_json("sample.json")
print(data)
# Displaying the first 10 rows of the DataFrame
print(df.head(10))
# Filling missing values with 0
print("Dataset after filling NA values with 0:")
df2 = df.fillna(value=0)
print(df2)
# Dropping rows with any missing values
print("Dataset after dropping NA values:")
df.dropna(inplace=True)
print(df)
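Part B also asks for outlier handling, which the code above does not cover. A minimal sketch using the interquartile range (IQR) rule, assuming the numeric column "age" that is used later in this practical:
import pandas as pd

df = pd.read_csv("samp.csv")

# Compute the interquartile range (IQR) of the "age" column
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1

# Keep only rows whose "age" lies within 1.5 * IQR of the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_no_outliers = df[(df["age"] >= lower) & (df["age"] <= upper)]
print("Dataset after removing outliers in 'age':")
print(df_no_outliers)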
C. Manipulate and transform data using functions like filtering, sorting, and grouping.
Code:
import pandas as pd
# Reading CSV file into DataFrame
df = pd.read_csv("samp.csv")
# Filtering data based on a condition (e.g., age greater than 25)
filtered_df = df[df["age"] > 25]
# Sorting data based on a column (e.g., sorting by age in descending order)
sorted_df = df.sort_values(by="age", ascending=False)
# Grouping data based on a column and applying an aggregation function (e.g., finding the average age per city)
grouped_df = df.groupby("city").agg({"age": "mean"})
# Displaying the filtered DataFrame
print("Filtered DataFrame:")
print(filtered_df)
# Displaying the sorted DataFrame
print("\nSorted DataFrame:")
print(sorted_df)
# Displaying the grouped DataFrame
print("\nGrouped DataFrame:")
print(grouped_df)
PRACTICAL 3
Feature Scaling and Dummification
A. Apply feature-scaling techniques like standardization and normalization to numerical
features.
Code:
# Standardization and normalization
import pandas as pd
import numpy as np
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
print("printing few data")
df = pd.read_csv("D:\TYCS\Data Science\SampleFile.csv")
print(df.head())
print("Max values")
max_vals = np.max(np.abs(df))
print(max_vals)
print((df - max_vals) / max_vals)
print("Normalization")
scaler = Normalizer()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
print("Standardization")
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
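Note that sklearn's Normalizer rescales each row to unit norm; column-wise normalization into [0, 1] is usually done with MinMaxScaler instead. A short sketch, continuing from the script above:
from sklearn.preprocessing import MinMaxScaler

# Min-max scaling maps each column into the [0, 1] range
print("Min-max normalization")
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())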
B. Perform feature Dummification to convert categorical variables into numerical
representations.
Code:
import pandas as pd
data = pd.read_csv("data32.csv")
categorical_features = data.select_dtypes(include="object")
dummies = pd.get_dummies(categorical_features)
data = pd.concat([data, dummies], axis=1)
data.drop(categorical_features.columns, axis=1, inplace=True)
data.to_csv("Output.csv")
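To see what dummification produces, here is a self-contained sketch on a small hypothetical frame (the column names are illustrative, not taken from data32.csv):
import pandas as pd

# A tiny illustrative frame with one categorical column
demo = pd.DataFrame({"city": ["Pune", "Mumbai", "Pune"], "age": [25, 30, 22]})
# Each distinct city becomes its own 0/1 indicator column
print(pd.get_dummies(demo, columns=["city"]))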
Practical 4
Hypothesis Testing
Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
# t-test
import numpy as np
import scipy.stats as stats
np.random.seed(42)
scoreA = np.random.normal(loc=70,scale=10,size=30)
scoreB = np.random.normal(loc=75,scale=10,size=30)
t_stat,pvalue = stats.ttest_ind(scoreA,scoreB)
print(f"T-Statistics: {t_stat}\nP-Value: {pvalue}")
alpha = 0.05
if pvalue < alpha:
    print("Reject the null hypothesis. There is a significant difference in exam scores.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in exam scores.")
Output:
# Chi-square test
import numpy as np
import scipy.stats as stats
observed_data = np.array([[25, 15], [20, 40]])
chi2, pvalue, dof, expected = stats.chi2_contingency(observed_data)
print(f'Chi-Square Statistic: {chi2}\nP-Value: {pvalue}\nDegrees of Freedom: {dof}\nExpected frequency:\n{expected}')
alpha = 0.05
if pvalue < alpha:
    print("Reject the null hypothesis. There is a significant association between gender and job satisfaction.")
else:
    print("Fail to reject the null hypothesis. Gender and job satisfaction are independent.")
Output:
Practical 5
ANOVA (Analysis of Variance)
Perform one-way ANOVA to compare means across multiple groups.
from scipy.stats import f_oneway
# Define sample data for each group
group1 = [15, 20, 25, 30, 35]
group2 = [10, 18, 22, 28, 32]
group3 = [12, 16, 20, 24, 28]
f_statistic, p_value = f_oneway(group1, group2, group3)
print("One-way ANOVA results:")
print("F-statistic:", f_statistic)
print("P-value:", p_value)
alpha = 0.05
if p_value < alpha:
    print(
        "Reject null hypothesis: There are significant differences between the means of the groups."
    )
else:
    print(
        "Fail to reject null hypothesis: There are no significant differences between the means of the groups."
    )
Output:
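If the ANOVA rejects the null hypothesis, a post-hoc test shows which pairs of groups differ. A sketch using Tukey's HSD, assuming the statsmodels package is installed:
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Same three samples as above
group1 = [15, 20, 25, 30, 35]
group2 = [10, 18, 22, 28, 32]
group3 = [12, 16, 20, 24, 28]

# Label each observation with its group and compare all pairs of means
scores = np.array(group1 + group2 + group3)
labels = ["group1"] * 5 + ["group2"] * 5 + ["group3"] * 5
print(pairwise_tukeyhsd(scores, labels, alpha=0.05))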
Practical 6
Regression and its Types.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Independent variable (predictor)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
# Dependent variable (response)
y = np.array([[7], [9], [11], [13], [15], [17], [19], [21], [23], [25]])
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Simple Linear Regression
model = LinearRegression()
model.fit(X_train, y_train) # Fitting the model
# Coefficients
print("Intercept:", model.intercept_[0])
print("Coefficient:", model.coef_[0][0])
# Predictions
y_pred = model.predict(X_test)
# Model Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
# Plotting the regression line
plt.scatter(X_test, y_test, color="blue")
plt.plot(X_test, y_pred, color="red")
plt.title("Simple Linear Regression")
plt.xlabel("Independent Variable (X)")
plt.ylabel("Dependent Variable (y)")
plt.show()
Output:
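The practical's title mentions types of regression; as one further illustration, polynomial regression can be fitted with the same tools. A minimal sketch on synthetic quadratic data:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data with a quadratic relationship
X = np.arange(1, 11).reshape(-1, 1)
y = 2 * X.ravel() ** 2 + 3

# Expand X with polynomial terms (degree 2), then fit a linear model on them
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)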
Practical 7
Logistic Regression and Decision Tree
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

# Load the built-in Iris dataset (used here as sample data)
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic regression: a linear model for classification
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)
print("Logistic Regression accuracy:", accuracy_score(y_test, y_pred_lr))

# Decision tree: a rule-based classifier
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
y_pred_dt = tree.predict(X_test)
print("Decision Tree accuracy:", accuracy_score(y_test, y_pred_dt))

# Visualize the fitted decision tree
plt.figure(figsize=(10, 6))
plot_tree(
    tree,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
)
plt.title("Decision Tree on the Iris Dataset")
plt.show()
Output:
Practical 8
K-Means clustering
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Load data
data = pd.read_csv("wholesale.csv")
# Display the first few rows of the dataset
print(data.head())
# Define categorical and continuous features
categorical_features = ["Channel", "Region"]
continuous_features = [
"Fresh",
"Milk",
"Grocery",
"Frozen",
"Detergents_Paper",
"Delicassen",
]
# Descriptive statistics for continuous features
print(data[continuous_features].describe())
# Convert categorical features into dummy variables
for col in categorical_features:
    dummies = pd.get_dummies(data[col], prefix=col)
    data = pd.concat([data, dummies], axis=1)
    data.drop(col, axis=1, inplace=True)
# Display the first few rows of the updated dataset
print(data.head())
# Normalize the data
mms = MinMaxScaler()
data_transformed = mms.fit_transform(data)
# Calculate the sum of squared distances for different values of k
sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)
# Plot the elbow method graph
plt.plot(K, sum_of_squared_distances, "bx-")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Sum of Squared Distances")
plt.title("Elbow Method for Optimal k")
plt.show()
Output:
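The elbow plot only suggests a value of k; the clusters still have to be fitted. A short sketch continuing from the script above, assuming the elbow is read off at k = 5 (the actual elbow depends on the dataset):
# Fit K-Means at the chosen k (k = 5 is assumed here for illustration)
km = KMeans(n_clusters=5, random_state=0)
labels = km.fit_predict(data_transformed)

# Attach the cluster label to each customer and inspect cluster sizes
data["cluster"] = labels
print(data["cluster"].value_counts())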
Practical 9
Principal Component Analysis (PCA)
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
# Perform PCA
pca = PCA(n_components=2) # Specify the number of components (dimensions)
X_r = pca.fit_transform(X)
# Create a DataFrame for visualization
df = pd.DataFrame(data=X_r, columns=['PC1', 'PC2'])
df['target'] = y
# Plot the data
plt.figure(figsize=(8, 6))
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(df.loc[df['target'] == i, 'PC1'], df.loc[df['target'] == i, 'PC2'],
                color=color, alpha=.8, lw=lw, label=target_name)
plt.title('PCA of IRIS dataset')
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
Output:
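A useful supplementary check, continuing from the script above, is how much of the total variance the two components retain:
# Proportion of total variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())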
Practical 10
Data Visualization and Storytelling
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
# Assume 'data.csv' contains your dataset
df = pd.read_csv("data.csv")
# Perform data analysis
# Example: Calculate summary statistics
summary_stats = df.describe()
# Create meaningful visualizations
# Example: Plot a histogram of a numerical variable
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x="numerical_variable", bins=20, kde=True)
plt.title("Histogram of Numerical Variable")
plt.xlabel("Numerical Variable")
plt.ylabel("Frequency")
plt.show()
# Example: Plot a bar chart of a categorical variable
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x="categorical_variable", palette="viridis")
plt.title("Bar Chart of Categorical Variable")
plt.xlabel("Categories")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()
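# Example: Plot a scatterplot of two numerical variables, colored by category
# (added to match the scatterplot insight below; "numerical_variable_2" is a
# placeholder column name, like the other placeholders in this script)
plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=df,
    x="numerical_variable",
    y="numerical_variable_2",
    hue="categorical_variable",
)
plt.title("Scatterplot of Numerical Variables by Category")
plt.xlabel("Numerical Variable 1")
plt.ylabel("Numerical Variable 2")
plt.show()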
# Present findings and insights in a clear and concise manner
# Example: Use Markdown to format text for presentation
print("# Data Analysis and Visualization Report\n")
print("## Summary Statistics:\n")
print(summary_stats)
print("\n## Insights:\n")
print(
    "- The histogram shows that the distribution of the numerical variable is approximately normal."
)
print(
    "- The bar chart indicates that category A is the most frequent in the categorical variable."
)
print(
    "- The scatterplot suggests a positive correlation between numerical variables 1 and 2, with different categories showing distinct patterns.\n"
)
Output: