Sec Assignment

The document discusses the differences and similarities between Python and Excel for data analysis, highlighting Python's advantages for large datasets and automation. It covers the installation of Anaconda and Jupyter Notebook, the importance of data types, conditional statements, and the role of Pandas in data manipulation. Additionally, it explains machine learning concepts, data transformation techniques, and provides insights on handling duplicates and missing values in datasets.

SECTION A-

Q1. Explain the differences and similarities between Python and Excel. Provide
real-life scenarios where Python is preferred over Excel and justify why.
ANS. Python and Excel are widely used for data analysis, but they serve different purposes:
FEATURE         | PYTHON                             | EXCEL
AUTOMATION      | Supports scripting (Pandas, NumPy) | Requires VBA or Power Query
SCALABILITY     | Handles large datasets efficiently | Slows down with large data
VISUALIZATION   | Uses Matplotlib, Seaborn           | Built-in charts
DATA PROCESSING | More flexible with complex logic   | Limited for large-scale operations

Excel is preferred for budget planning, small-scale reports, and quick data visualization.
Python is preferred for large-scale data analysis, predictive modeling, and automation.

Q2. Describe the process of installing Anaconda and launching Jupyter Notebook. Explain how to create, save, and run Python code using Jupyter Notebook.
 Go to the Anaconda website and download the installer for your operating system
(Windows, macOS, or Linux).
 Choose the Python 3.x version.
 Run the downloaded installer and follow the installation instructions.
 During installation, select the option to Add Anaconda to your system PATH for
easier command-line access.
 Once installed, open Anaconda Navigator.
 In Anaconda Navigator, click Launch under Jupyter Notebook to start it in your browser.

Create a New Notebook:

 In the Jupyter Notebook interface, click New (top-right corner), then select Python 3
from the dropdown list.
 This opens a new notebook where you can start writing your Python code.

Run Python Code:

 To run the code in a cell, press Shift + Enter or click the Run button in the toolbar.
 The output will appear below the code cell (a minimal example cell is sketched below).

Save the Notebook:

 Press Ctrl + S or click the save (disk) icon in the toolbar; the notebook is saved as an .ipynb file. You can rename it by clicking the title at the top of the page.
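For example, this is the kind of cell you might create, save, and run (the variable names are purely illustrative):

# Type this into a notebook cell and press Shift + Enter to run it
message = "Hello from Jupyter Notebook"
total = sum([1, 2, 3, 4, 5])

print(message)
print("Sum of the list:", total)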
Q3. Illustrate the differences between various Python data types (int, float, str,
list, tuple, dict, etc.) with examples. Discuss the importance of understanding
data types while handling large datasets.
ANS. Python has several fundamental data types:
int: 10
float: 3.14
str: "hello"
list: [1, 2, 3]
tuple: (1, 2, 3)
dict: {"name": "Alice", "age": 25}

Importance of Data Types in Large Datasets:


 Helps optimize memory usage.
 Ensures correct operations (e.g., string vs. numeric calculations).
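As a brief illustration of the memory point, a hypothetical Pandas column of small integers can be stored in a smaller dtype:

import numpy as np
import pandas as pd

# Hypothetical numeric column with one million small integers
df = pd.DataFrame({"age": np.random.randint(0, 100, size=1_000_000)})

print(df["age"].dtype)                     # typically int64 by default
print(df.memory_usage(deep=True)["age"])   # bytes used by the column

# Downcasting to a smaller integer type cuts memory use substantially
df["age"] = df["age"].astype("int8")
print(df.memory_usage(deep=True)["age"])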

Q4. Define and explain conditional statements in Python with multiple use-case examples. Include examples of nested if-else and practical scenarios where nested conditions are necessary.
ANS. if statement: Executes code only if a condition is true.
elif statement: Checks additional conditions if the previous ones are false.
else statement: Executes code when all previous conditions are false.

Nested if-else statement: this occurs when you place an if-else statement inside another. It is useful when a decision depends on more than one condition.

Practical Use-Case:
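A short sketch of a hypothetical loan-eligibility check, where the decision depends on more than one condition:

# Hypothetical loan-eligibility check using a nested if-else
age = 30
income = 45000

if age >= 18:
    if income >= 30000:
        print("Loan approved")
    else:
        print("Loan denied: income too low")
else:
    print("Loan denied: applicant must be an adult")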
SECTION B-

Q6. Discuss the role of Pandas in data manipulation. Explain the key differences
between DataFrame and Series with appropriate code examples.
ANS. Role of pandas in data manipulation:
 Data Cleaning
 Data Transformation
 Exploratory Data Analysis (EDA)
 Integration
DataFrame vs Series:

Series                                                 | DataFrame
One-dimensional                                        | Two-dimensional
Used for single-column data or simple labeled arrays.  | Used for handling datasets with multiple rows and columns.
Supports element-wise operations.                      | Allows complex operations such as joining, grouping, and merging data across multiple columns.
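A short illustration with made-up values:

import pandas as pd

# A Series: one-dimensional labeled array
s = pd.Series([10, 20, 30], name="sales")
print(s * 2)            # element-wise operation

# A DataFrame: two-dimensional table of rows and columns
df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "sales": [10, 20, 30],
})
print(df.groupby("product")["sales"].sum())   # column-level operation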
Q7. Write a Python script that reads data from an Excel file, performs data
cleaning (removes duplicates and fills missing values), and outputs a
summary of the data.
ANS.
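A minimal sketch, assuming a hypothetical file named sales_data.xlsx (reading .xlsx files requires the openpyxl package):

import pandas as pd

# Read the Excel file (file name is an assumption - adjust to your dataset)
df = pd.read_excel("sales_data.xlsx")

# Data cleaning
df = df.drop_duplicates()                    # remove duplicate rows
df = df.fillna(df.mean(numeric_only=True))   # fill missing numeric values with column means

# Summary of the cleaned data
df.info()
print(df.describe())

# Optionally write the cleaned data back out
df.to_excel("sales_data_clean.xlsx", index=False)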

Q8. Demonstrate how to create various types of visualizations (line plots, bar charts, scatter plots, and histograms) using Matplotlib. Customize the visualizations with labels, legends, and colors.
ANS.
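A minimal sketch with made-up data, showing the four plot types with titles, axis labels, legends, and colors:

import matplotlib.pyplot as plt
import numpy as np

# Made-up sample data
x = np.arange(1, 11)
y = x ** 2
categories = ["A", "B", "C", "D"]
counts = [12, 7, 15, 9]
data = np.random.randn(500)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Line plot
axes[0, 0].plot(x, y, color="steelblue", marker="o", label="y = x^2")
axes[0, 0].set_title("Line Plot")
axes[0, 0].set_xlabel("x")
axes[0, 0].set_ylabel("y")
axes[0, 0].legend()

# Bar chart
axes[0, 1].bar(categories, counts, color="orange", label="counts")
axes[0, 1].set_title("Bar Chart")
axes[0, 1].legend()

# Scatter plot
axes[1, 0].scatter(x, y + np.random.randn(10) * 5, color="green", label="points")
axes[1, 0].set_title("Scatter Plot")
axes[1, 0].legend()

# Histogram
axes[1, 1].hist(data, bins=20, color="purple", label="distribution")
axes[1, 1].set_title("Histogram")
axes[1, 1].legend()

plt.tight_layout()
plt.show()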
SECTION C-
Q11. Explain the concept of Machine Learning (ML) and differentiate between
supervised and unsupervised learning. Provide real-world examples of each type.

Ans. Machine Learning (ML) is a subset of artificial intelligence (AI) that enables systems to
learn from data and make predictions or decisions without being explicitly programmed.

Supervised vs. Unsupervised Learning

Feature    | Supervised Learning                           | Unsupervised Learning
Definition | Learns from labeled data (input-output pairs) | Learns from unlabeled data by finding patterns
Goal       | Predict outcomes (classification/regression)  | Discover hidden structures or clusters
Data Type  | Labeled                                       | Unlabeled
Examples   | Email spam detection, predicting house prices | Customer segmentation, anomaly detection

Real-World Examples

 Supervised Learning: Identifying spam emails (Spam vs. Not Spam)


 Unsupervised Learning: Grouping customers based on shopping behavior for
targeted marketing
Q12. Describe the key steps involved in building a Machine Learning model using
Scikit-learn. Explain how to split data into training and testing sets and evaluate
the model.
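Ans. The usual steps are: load the data, split it into training and testing sets, train the model, and evaluate it on the held-out test set. A minimal sketch, using the built-in Iris dataset as an assumed example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load the data (Iris is used here purely as an example)
X, y = load_iris(return_X_y=True)

# 2. Split into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Choose and train a model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# 4. Make predictions and evaluate on the unseen test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))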

Q13. Demonstrate the implementation of a Linear Regression model using Scikit-learn to predict house prices based on a given dataset. Include evaluation of model performance using appropriate metrics.
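Ans. A minimal sketch using Scikit-learn's built-in California housing data as a stand-in for the house-price dataset (which is not specified here); the dataset is downloaded on first use:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# California housing data is used here as an example house-price dataset
X, y = fetch_california_housing(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Evaluate with standard regression metrics
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))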

Q14. Discuss the differences between Linear and Logistic Regression. Highlight
the mathematical equations involved and explain when to use each method.
Ans.
Feature  | Linear Regression                        | Logistic Regression
Type     | Regression (predicts continuous values)  | Classification (predicts probability of classes)
Equation | y = β0 + β1x + ε                         | P(Y=1) = 1 / (1 + e^(−(β0 + β1x)))
Output   | Continuous values (e.g., house prices)   | Probability (0 to 1)
Use Case | Predicting numeric values                | Predicting categorical outcomes (e.g., spam detection)

When to Use:

 Linear Regression → Predicting salaries, stock prices.


 Logistic Regression → Classifying emails as spam or not spam.
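A tiny contrast of the two models in Scikit-learn (the values are purely illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: predict a continuous value (e.g., salary from years of experience)
years = np.array([[1], [2], [3], [4], [5]])
salary = np.array([30000, 35000, 41000, 45000, 52000])
lin = LinearRegression().fit(years, salary)
print(lin.predict([[6]]))          # a continuous prediction

# Logistic regression: predict a class probability (e.g., spam vs. not spam)
word_count = np.array([[1], [2], [8], [9], [10]])
is_spam = np.array([0, 0, 1, 1, 1])
log = LogisticRegression().fit(word_count, is_spam)
print(log.predict_proba([[7]]))    # probabilities for the two classes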

SECTION D-
Q17. Discuss different data transformation techniques such as normalization,
standardization, and encoding. Write Python code that performs each of these
techniques on a sample dataset.

Ans.  Normalization (Min-Max Scaling)

 Scales data between 0 and 1.


 Formula: X′ = (X − Xmin) / (Xmax − Xmin)

 Standardization (Z-Score Scaling)

 Centers data around mean = 0 and std = 1.


 Formula: X′=(X−μ)/σ

 Encoding (Categorical Data Transformation)

 One-Hot Encoding → Converts categories into binary columns.


 Label Encoding → Assigns numeric values to categories.
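A short sketch performing all three transformations on a small, made-up dataset, using Scikit-learn's preprocessing utilities and Pandas:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

# Hypothetical sample dataset
df = pd.DataFrame({
    "age": [18, 25, 32, 47, 60],
    "salary": [20000, 35000, 50000, 80000, 120000],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"],
})

# Normalization (Min-Max scaling): values rescaled to the range [0, 1]
norm = MinMaxScaler().fit_transform(df[["age", "salary"]])
df["age_norm"], df["salary_norm"] = norm[:, 0], norm[:, 1]

# Standardization (Z-score scaling): mean 0, standard deviation 1
std = StandardScaler().fit_transform(df[["age", "salary"]])
df["age_std"], df["salary_std"] = std[:, 0], std[:, 1]

# Label Encoding: map categories to integer codes
df["city_label"] = LabelEncoder().fit_transform(df["city"])

# One-Hot Encoding: one binary column per category
df = pd.get_dummies(df, columns=["city"], prefix="city")

print(df)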
Q18. Explain overfitting and underfitting in machine learning models. Provide
practical examples where overfitting occurs and discuss strategies to mitigate it
using regularization.
Ans.

Concept    | Overfitting                                                                     | Underfitting
Definition | Model learns noise and performs well on training data but poorly on test data. | Model is too simple and fails to capture patterns in data.
Cause      | Too complex a model, too many features, small dataset.                         | Model is too simple, not enough features, high bias.
Effect     | High variance, poor generalization.                                            | High bias, poor accuracy.

Example of Overfitting

 A deep neural network trained on a small dataset memorizes data but fails on new
examples.
 A decision tree that grows too deep and perfectly classifies training data but
performs poorly on unseen data.

Mitigating Overfitting Using Regularization

Regularization helps control model complexity:

1. Lasso Regression (L1): Shrinks some coefficients to zero, selecting important


features.
2. Ridge Regression (L2): Shrinks all coefficients to small values, reducing model
complexity.
3. Dropout (for Neural Networks): Randomly drops neurons to prevent memorization.
4. Cross-Validation: Splitting data into multiple sets to evaluate performance.
5. More Data: Increases generalization and reduces noise impact.
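As a brief illustration of points 1-2 and 4, an L2-regularized model can be scored with cross-validation on synthetic data (generated here with make_regression, since no dataset is specified):

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

# Synthetic data with many features but few informative ones,
# a setting where an unregularized model tends to overfit
X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=10.0, random_state=42)

# Ridge (L2) plus 5-fold cross-validation gives a more honest picture
# of generalization than a single training score
ridge = Ridge(alpha=1.0)
scores = cross_val_score(ridge, X, y, cv=5, scoring="r2")
print("Ridge cross-validated R^2:", scores.mean())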
Q19. Implement Ridge and Lasso Regression on a dataset and compare the
impact of L1 and L2 regularization techniques. Explain which model performs
better and why.
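Ans. A minimal implementation sketch, using a synthetic dataset from make_regression since no specific dataset is given; which model performs better depends on the data, as discussed below.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import r2_score

# Synthetic regression data in which only some features matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=1.0).fit(X_train, y_train)

print("Ridge R^2:", r2_score(y_test, ridge.predict(X_test)))
print("Lasso R^2:", r2_score(y_test, lasso.predict(X_test)))

# Lasso drives some coefficients exactly to zero (feature selection),
# while Ridge only shrinks them towards zero
print("Zero coefficients (Ridge):", np.sum(ridge.coef_ == 0))
print("Zero coefficients (Lasso):", np.sum(lasso.coef_ == 0))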

Comparison & Explanation

 Ridge Regression (L2 Regularization): Shrinks coefficients but does not eliminate
them completely. It performs better when all features contribute to predictions.
 Lasso Regression (L1 Regularization): Shrinks some coefficients to zero, effectively
performing feature selection. It performs better when some features are irrelevant.

Which is better?

 If feature selection is needed → Lasso is better.


 If all features contribute → Ridge is better.

SECTION E-
Q21. Write a Python script to identify and remove duplicate rows from a large
dataset using Pandas. Provide a detailed explanation of how Pandas handles
duplicates.
Ans.

 df.duplicated() → Returns a Boolean Series indicating duplicate rows.

 df.drop_duplicates() → Removes duplicate rows, keeping the first occurrence by default.

 Options:

 keep='first' (default) → Keeps the first occurrence.


 keep='last' → Keeps the last occurrence.
 keep=False → Removes all duplicates.
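A minimal script putting these options together, assuming a hypothetical file named large_dataset.csv:

import pandas as pd

# Hypothetical large dataset stored as a CSV file
df = pd.read_csv("large_dataset.csv")

print("Rows before:", len(df))
print("Duplicate rows found:", df.duplicated().sum())

# Remove duplicates, keeping the first occurrence of each row
df_clean = df.drop_duplicates(keep="first")

print("Rows after:", len(df_clean))

# Save the deduplicated data (optional)
df_clean.to_csv("large_dataset_deduplicated.csv", index=False)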

Q22. Explain the importance of handling missing values in a dataset. Write


Python code that demonstrates different methods to handle missing values
(drop, fill, or interpolate).
 dropna() → Removes rows with missing values.
 fillna(df.mean()) → Replaces missing values with the column's mean.
 interpolate() → Estimates missing values based on neighboring data.
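A short sketch of all three methods on a small made-up dataset:

import numpy as np
import pandas as pd

# Small made-up dataset with missing values
df = pd.DataFrame({
    "temperature": [21.0, np.nan, 23.5, np.nan, 25.0],
    "humidity": [40, 42, np.nan, 45, 47],
})

# 1. Drop rows that contain any missing value
dropped = df.dropna()

# 2. Fill missing values with each column's mean
filled = df.fillna(df.mean())

# 3. Interpolate missing values from neighbouring rows
interpolated = df.interpolate()

print(dropped, filled, interpolated, sep="\n\n")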

Q23. Illustrate the concept of indexing and slicing in Pandas DataFrame. Provide
examples of selecting specific rows, columns, and sub-sections of the
DataFrame.
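Ans. A short sketch with a made-up DataFrame showing column selection, label-based (.loc) and position-based (.iloc) slicing, and boolean indexing:

import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol", "Dave"],
     "age": [25, 32, 29, 41],
     "city": ["Delhi", "Mumbai", "Pune", "Delhi"]},
    index=["a", "b", "c", "d"],
)

# Select a single column (returns a Series) and multiple columns (returns a DataFrame)
print(df["name"])
print(df[["name", "age"]])

# Label-based selection with .loc: rows "a" to "c", selected columns
print(df.loc["a":"c", ["name", "city"]])

# Position-based selection with .iloc: first two rows, first two columns
print(df.iloc[0:2, 0:2])

# Boolean indexing: rows matching a condition
print(df[df["age"] > 28])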
SECTION F-

Q26. Demonstrate how to create and customize interactive plots using Plotly. Explain how it differs from Matplotlib and Seaborn for data visualization.
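Ans. A minimal sketch using Plotly Express with its built-in Iris sample data (assumes the plotly package is installed):

import plotly.express as px

# Built-in sample dataset shipped with Plotly, used here as an example
df = px.data.iris()

fig = px.scatter(
    df, x="sepal_width", y="sepal_length",
    color="species",                 # colour points by category
    title="Iris sepal dimensions",
    labels={"sepal_width": "Sepal width (cm)", "sepal_length": "Sepal length (cm)"},
)

# Customize markers and legend
fig.update_traces(marker=dict(size=9))
fig.update_layout(legend_title_text="Species")

fig.show()   # opens an interactive figure with zoom, pan and hover tooltips

Unlike Matplotlib and Seaborn, which produce static images by default, Plotly figures are interactive in the notebook or browser (hover tooltips, zooming, panning, toggling legend entries) and can be exported as standalone HTML files.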

Q27. Write a Python script to perform time series analysis and visualization on a
dataset. Explain how to identify trends, seasonality, and anomalies.
Ans. Time series analysis involves analyzing data points ordered by time to uncover
underlying patterns such as trends, seasonality, and anomalies. The process typically
involves:

1. Identifying Trends: Long-term movements in the data.


2. Identifying Seasonality: Regular, repeating patterns.
3. Detecting Anomalies: Outliers or unusual events in the time series.

Steps for Time Series Analysis:

 Visualize the data.


 Decompose the time series to identify trend, seasonality, and residual components.
 Detect anomalies using statistical methods.
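A minimal sketch of these steps on a synthetic daily series (the data is made up; statsmodels is used for the decomposition):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic daily series: upward trend + weekly seasonality + noise
dates = pd.date_range("2023-01-01", periods=365, freq="D")
values = (np.linspace(50, 80, 365)                       # trend
          + 10 * np.sin(2 * np.pi * np.arange(365) / 7)  # weekly seasonality
          + np.random.normal(0, 2, 365))                 # noise
ts = pd.Series(values, index=dates)

# Decompose into trend, seasonal and residual components
result = seasonal_decompose(ts, model="additive", period=7)
result.plot()
plt.show()

# Flag anomalies: residuals more than 3 standard deviations from their mean
resid = result.resid.dropna()
anomalies = resid[(resid - resid.mean()).abs() > 3 * resid.std()]
print(anomalies)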
Q28. Demonstrate how to use a Box Plot to identify outliers in a dataset. Provide
Python code to create a Box Plot and explain how to interpret it.
A Box Plot is a powerful visualization for identifying outliers and understanding the
distribution of the data. It displays the minimum, first quartile (Q1), median (Q2), third
quartile (Q3), and maximum of the data. Outliers are typically points that fall outside of the
"whiskers" of the box plot, which are calculated as 1.5 times the interquartile range (IQR)
from the first and third quartiles.
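A minimal sketch with a small, made-up sample of exam scores that contains two deliberate outliers:

import matplotlib.pyplot as plt

# Hypothetical sample of exam scores, including two extreme values
scores = [88, 92, 95, 100, 97, 101, 99, 103, 96, 94,
          105, 98, 102, 91, 60, 145]   # 60 and 145 are deliberate outliers

plt.boxplot(scores)
plt.title("Box Plot of Exam Scores")
plt.ylabel("Score")
plt.show()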

How to Interpret a Box Plot

 Box: Represents the interquartile range (IQR) from Q1 to Q3, containing the middle
50% of the data.
 Whiskers: Extend to 1.5 times the IQR from Q1 and Q3. Data points outside the
whiskers are potential outliers.
 Median: The line inside the box represents the median (Q2) of the data.
 Outliers: Data points that fall beyond the whiskers (for example, the extreme low and high values in the sample above) and should be examined as potential anomalies.
