Let's address each of your questions with explanations and Python code where
applicable.
1. How to calculate important numbers based on data sets, how to use various Python
modules, and how to make functions that are able to predict the outcome based on
what we have learned (Small Dataset)
For a small dataset, you can calculate various descriptive statistics and
potentially build simple predictive models. Here's how:
Calculating Important Numbers & Using Python Modules:
Pandas: For data manipulation and analysis.
NumPy: For numerical computations.
Statistics: For basic statistical functions.
import pandas as pd
import numpy as np
import statistics
# Sample small dataset (replace with your actual data)
data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': [2, 4, 1, 5, 3],
        'Target': [0, 1, 0, 1, 0]}
df = pd.DataFrame(data)
# Descriptive Statistics
print("Descriptive Statistics:")
print(df.describe())
# Specific calculations
print("\nMean of Feature1:", df['Feature1'].mean())
print("Median of Feature2:", df['Feature2'].median())
print("Standard Deviation of Target:", df['Target'].std())
# Using NumPy
print("\nNumPy Array of Feature1:", np.array(df['Feature1']))
# Using statistics module
print("\nVariance of Feature2 (statistics):", statistics.variance(df['Feature2']))
Making Predictive Functions (Simple Example - based on a rule):
def simple_predictor(feature1_value, feature2_value):
    """A very simple predictor based on arbitrary rules."""
    if feature1_value > 3 and feature2_value < 4:
        return 1
    else:
        return 0

# Example prediction
new_feature1 = 4
new_feature2 = 2
prediction = simple_predictor(new_feature1, new_feature2)
print(f"\nPrediction for Feature1={new_feature1}, Feature2={new_feature2}: {prediction}")
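For a predictor that actually learns from the data instead of hand-written rules, a scikit-learn model can be fitted to the same small DataFrame. The following is a minimal sketch assuming scikit-learn is installed; LogisticRegression is an illustrative choice, and with only five rows the fit is for demonstration rather than accuracy.

from sklearn.linear_model import LogisticRegression

# Illustrative sketch: fit a simple model on the small df above instead of
# hand-written rules (with only 5 rows this is a demonstration, not a
# reliable model).
X = df[['Feature1', 'Feature2']].values
y = df['Target'].values
model = LogisticRegression()
model.fit(X, y)

def learned_predictor(feature1_value, feature2_value):
    """Predict the target using the fitted model."""
    return int(model.predict([[feature1_value, feature2_value]])[0])

print("Learned prediction for Feature1=4, Feature2=2:", learned_predictor(4, 2))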
Program 2: How can we get Big Data Sets? Learn: Data Distribution, Normal Data
Distribution, Random Data Distribution, Scatter Plot.
Getting Big Data Sets:
Public Datasets:
Kaggle Datasets: A vast collection of datasets for various machine learning tasks.
UCI Machine Learning Repository: Classic datasets for machine learning research.
Google Dataset Search: A search engine for publicly available datasets.
Government Open Data Portals: Many governments provide open data (e.g., data.gov in
the US, data.gov.in in India).
Academic Research Datasets: Researchers often make their data public.
APIs: Many companies and services provide APIs to access their data (e.g., Twitter
API, financial APIs).
Web Scraping: If data is publicly available on websites, you can use libraries like
Beautiful Soup and Scrapy (be mindful of website terms of service).
Data Generation: For specific purposes, you can generate synthetic big data using
libraries like NumPy or specialized tools (see the sketch after this list).
Cloud Storage: Platforms like AWS S3, Google Cloud Storage, and Azure Blob Storage
often host large datasets.
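As a concrete illustration of the Data Generation option above, here is a minimal sketch that builds a synthetic dataset with one million rows using NumPy's random generators; the column names, sizes, and distributions are illustrative choices, not part of any particular dataset.

import numpy as np
import pandas as pd

# Minimal sketch: generate a synthetic "big" dataset with NumPy's random
# generators (column names and distributions are illustrative choices).
rng = np.random.default_rng(seed=0)
n_rows = 1_000_000

big_df = pd.DataFrame({
    'age': rng.integers(18, 80, size=n_rows),
    'income': rng.normal(loc=50_000, scale=15_000, size=n_rows),
    'purchased': rng.choice([0, 1], size=n_rows, p=[0.7, 0.3]),
})

print(big_df.shape)
print(big_df.head())
print(big_df.describe())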
Learning about Data Distribution:
Data Distribution: Describes how the values of a variable are spread out across its
range. Understanding the distribution is crucial for choosing appropriate
statistical methods and machine learning models.
Normal Data Distribution (Gaussian Distribution):
A bell-shaped, symmetrical distribution where most of the data points cluster
around the mean.
Characterized by its mean (μ) and standard deviation (σ).
Many natural phenomena tend to follow a normal distribution (e.g., height, weight).
In Python, you can visualize it using matplotlib.pyplot.hist() and generate normal
data using numpy.random.normal().
import numpy as np
import matplotlib.pyplot as plt
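# A minimal sketch continuing the imports above: generate normally distributed
# data with numpy.random.normal() and visualize it with a histogram; the mean,
# standard deviation, and sample sizes are illustrative values.
normal_data = np.random.normal(loc=5.0, scale=1.0, size=100000)
plt.hist(normal_data, bins=100)
plt.title("Normal (Gaussian) Data Distribution")
plt.show()

# Random (uniform) data distribution for comparison
uniform_data = np.random.uniform(low=0.0, high=10.0, size=100000)
plt.hist(uniform_data, bins=100)
plt.title("Random (Uniform) Data Distribution")
plt.show()

# Scatter plot of two related variables (y loosely depends on x)
x = np.random.normal(5.0, 1.0, 1000)
y = x + np.random.normal(0.0, 0.5, 1000)
plt.scatter(x, y)
plt.title("Scatter Plot")
plt.xlabel("x")
plt.ylabel("y")
plt.show()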
Program 3: Build an Artificial Neural Network by implementing the Backpropagation
algorithm and test the same using appropriate data sets.
Implementing backpropagation from scratch is a significant task. Here's a
simplified conceptual outline and a basic Python implementation for a two-layer
network:
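Below is a minimal from-scratch sketch of such a two-layer network, trained with backpropagation on the XOR problem as a small test case; the layer sizes, learning rate, and number of epochs are illustrative choices, and the same structure can be tested on other small datasets.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(a):
    # Derivative of the sigmoid, expressed in terms of its output a = sigmoid(x)
    return a * (1.0 - a)

# Training data: XOR (an illustrative small dataset)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

np.random.seed(42)
input_size, hidden_size, output_size = 2, 4, 1
learning_rate = 0.5  # illustrative value; may need tuning

# Weight and bias initialisation
W1 = np.random.randn(input_size, hidden_size)
b1 = np.zeros((1, hidden_size))
W2 = np.random.randn(hidden_size, output_size)
b2 = np.zeros((1, output_size))

for epoch in range(10000):
    # Forward pass
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: gradients of the mean squared error
    error = output - y
    d_output = error * sigmoid_derivative(output)
    d_hidden = (d_output @ W2.T) * sigmoid_derivative(hidden)

    # Gradient descent updates
    W2 -= learning_rate * hidden.T @ d_output
    b2 -= learning_rate * d_output.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ d_hidden
    b1 -= learning_rate * d_hidden.sum(axis=0, keepdims=True)

    if epoch % 2000 == 0:
        print(f"Epoch {epoch}, MSE: {np.mean(error ** 2):.4f}")

print("\nPredictions after training:")
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 3))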
Program 4: The probability that it is Friday and that a student is absent is 3%.
Since there are 5 school days in a week, the probability that it is Friday is 20%.
What is the probability that a student is absent given that today is Friday?
Apply Bayes' rule in Python to get the result.
Let:
P(A∩F) be the probability that a student is absent AND it is Friday = 3% = 0.03
P(F) be the probability that it is Friday = 20% = 0.20
P(A∣F) be the probability that a student is absent GIVEN that it is Friday (what we
want to find)
Bayes' Rule states:
P(A∣F) = P(F∣A) · P(A) / P(F)
However, we are directly given P(A∩F), which is equal to P(F∣A) · P(A). So, we can use
a simplified form:
P(A∣F) = P(A∩F) / P(F)
Now, let's implement this in Python:
prob_absent_and_friday = 0.03
prob_friday = 0.20
prob_absent_given_friday = prob_absent_and_friday / prob_friday
print(f"The probability that a student is absent given that today is Friday is:
{prob_absent_given_friday:.2f}")
Output:
The probability that a student is absent given that today is Friday is: 0.15
So, the probability that a student is absent given that today is Friday is 15%.
Program 5: Write a program to implement the k-Nearest Neighbour algorithm to
classify the Iris data set.
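A compact sketch using scikit-learn's built-in Iris dataset and KNeighborsClassifier; the value of k, the train/test split, and the example sample are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset (150 samples, 4 features, 3 classes)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Train a k-NN classifier with k = 3 (an illustrative choice)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Classify a new sample: sepal length, sepal width, petal length, petal width (cm)
new_sample = [[5.0, 3.5, 1.4, 0.2]]
print("Predicted class:", iris.target_names[knn.predict(new_sample)][0])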
Program 9: Write a program to demonstrate the working of the decision tree based
ID3 algorithm. Use an appropriate data set for building the decision tree and
apply this knowledge to classify a new sample.
import pandas as pd
import numpy as np
from collections import Counter
import math
class DecisionTreeID3:
    def __init__(self, min_samples_split=2, max_depth=None):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.root = None

    def _entropy(self, s):
        """Calculates the entropy of a dataset."""
        class_counts = Counter(s)
        entropy = 0
        for count in class_counts.values():
            probability = count / len(s)
            entropy -= probability * math.log2(probability)
        return entropy

    def _information_gain(self, dataset, feature, target):
        """Calculates the information gain of a feature."""
        total_entropy = self._entropy(dataset[target])
        weighted_entropy = 0
        for value in dataset[feature].unique():
            subset = dataset[dataset[feature] == value][target]
            probability = len(subset) / len(dataset)
            weighted_entropy += probability * self._entropy(subset)
        return total_entropy - weighted_entropy

    def _split_dataset(self, dataset, feature, value):
        """Splits the dataset based on a feature and its value."""
        left_subset = dataset[dataset[feature] == value]
        return left_subset

    def _choose_best_feature(self, dataset, features, target):
        """Chooses the best feature to split on based on information gain."""
        best_gain = -1
        best_feature = None
        for feature in features:
            gain = self._information_gain(dataset, feature, target)
            if gain > best_gain:
                best_gain = gain
                best_feature = feature
        return best_feature

    def _build_tree(self, dataset, features, target, depth=0):
        """Recursively builds the decision tree."""
        if len(np.unique(dataset[target])) == 1:
            return dataset[target].iloc[0]  # Pure node, return the class
        if len(dataset) < self.min_samples_split or not features or (self.max_depth is not None and depth >= self.max_depth):
            return Counter(dataset[target]).most_common(1)[0][0]  # Return majority class
        best_feature = self._choose_best_feature(dataset, features, target)
        if best_feature is None:  # No information gain
            return Counter(dataset[target]).most_common(1)[0][0]
        tree = {best_feature: {}}
        remaining_features = [f for f in features if f != best_feature]
        for value in dataset[best_feature].unique():
            subset = self._split_dataset(dataset, best_feature, value)
            tree[best_feature][value] = self._build_tree(subset, remaining_features, target, depth + 1)
        return tree

    def fit(self, dataset, target_column):
        """Fits the decision tree model to the training data."""
        self.target_column = target_column
        self.features = [col for col in dataset.columns if col != target_column]
        self.root = self._build_tree(dataset, self.features, self.target_column)

    def predict(self, sample):
        """Predicts the class label for a new sample."""
        def traverse_tree(tree, sample):
            feature = list(tree.keys())[0]
            if feature not in sample:
                # Handle missing feature in the sample (can return majority or raise an error)
                return None  # Or implement a more sophisticated handling
            value = sample[feature]
            if value not in tree[feature]:
                # Handle unseen value (can return majority of the branch or None)
                return None  # Or implement a more sophisticated handling
            subtree = tree[feature][value]
            if isinstance(subtree, dict):
                return traverse_tree(subtree, sample)
            else:
                return subtree
        return traverse_tree(self.root, sample)
# --- Example Usage with a Simple Play Tennis Dataset ---
# Create a sample dataset
data = {
'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy',
'Overcast', 'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy'],
'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild',
'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
'Windy': [False, True, False, False, False, True, True, False, False, False,
True, True, False, True],
'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes',
'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
df = pd.DataFrame(data)
# Initialize and train the ID3 decision tree
id3_tree = DecisionTreeID3(min_samples_split=2, max_depth=None)
id3_tree.fit(df, 'PlayTennis')
# Print the learned decision tree (can be nested and hard to read for complex trees)
print("Learned Decision Tree:")
import pprint
pprint.pprint(id3_tree.root)
# Classify a new sample
new_sample = {'Outlook': 'Sunny', 'Temperature': 'Cool', 'Humidity': 'High',
'Windy': False}
prediction = id3_tree.predict(new_sample)
print(f"\nPrediction for {new_sample}: {prediction}")
new_sample_2 = {'Outlook': 'Rainy', 'Temperature': 'Mild', 'Humidity': 'Normal',
'Windy': True}
prediction_2 = id3_tree.predict(new_sample_2)
print(f"Prediction for {new_sample_2}: {prediction_2}")