2nd - Semester - Data Science

vtu syllabus

Uploaded by

drmanu.ar-csds

Dayananda Sagar Academy of Technology & Management

(Autonomous Institute under VTU)

Semester : 3rd Semester


Course Title : Data Science for Engineers
Course Code : 23CSDS33
Course Type (Theory/Practical/Project/Integrated) : Integrated
Category : IPCC
Stream : CSE-DS
CIE : 50
Teaching hours/week (L:T:P:S) : 3-0-2-0
SEE : 50
Total Hours : 40 hours Theory + 20 hours Practical
SEE Duration : 3 hours
Credits : 4

Course Learning Objectives: Students will be able to:


Sl. No   Course Objectives
1   To provide a foundation in data science terminology, fundamentals, processes, and the tools
available for data science and data analytics; to define big data and its key characteristics
(volume, variety, velocity); to understand the challenges of processing big data with
traditional methods; and to gain proficiency in Apache Spark for distributed data processing.

2   To describe data for the data science process and become familiar with the process and its
steps; to study the use of various data sources; to develop ETL pipelines for data preparation
using Spark on Databricks; and to apply statistical concepts to summarize and analyze data,
understand hypothesis testing, and perform statistical inference.

3   To describe relationships within data and demonstrate data visualization tools; to learn data
extraction from various data sources; to create informative data visualizations using Python
libraries; and to identify relationships and patterns within datasets through EDA techniques.

4   To analyze the applicability of data science in real-time applications and work with various
data analytics charts; and to grasp the fundamental principles of supervised and unsupervised
machine learning algorithms.

5   To use the Python libraries for data wrangling and understand the various calculations and best
practices; and to present and interpret data using the Python visualization libraries used for
data science.
Teaching-Learning Process
Pedagogical Initiatives:
Some sample strategies to accelerate the attainment of various course outcomes are listed below:
• Adopt different teaching methods to attain the course outcomes.
• Include videos to demonstrate various concepts in Data Science.
• Encourage collaborative (Group) Learning to encourage team building.
• Ask at least three HOTS (Higher-order Thinking Skills) module-wise questions to promote critical
thinking.
• Adopt Problem-Based Learning (PBL), which fosters students’ analytical skills and develops
thinking skills such as evaluating, generalizing, and analyzing information rather than simply recalling
it.
• Show different ways to solve a problem and encourage the students to come up with creative and
optimal solutions.
• Discuss various case studies to map with real-world scenarios and improve the understanding.
• Devise innovative pedagogy to improve the Teaching-Learning Process (TLP).
• The lecture method (L) need not be only a traditional lecture method; alternative effective teaching
methods could be adopted to attain the outcomes.
• Use Video/Animation to explain the functioning of various concepts.
• Encourage collaborative (Group) Learning in the class.
• Introduce topics in manifold representations.
• Show the different ways to solve the same problem with different circuits/logic and encourage the
students to come up with their own creative ways to solve them.
• Discuss how every concept can be applied to the real world; when that is possible, it
helps improve the students' understanding.

Scheme of Teaching and Examinations for BE Programme -2024-25


Outcome Based Education and Choice Based Credit System (CBCS)
(Effective from the Academic Year 2024-25)
COURSE CURRICULUM
Module No.   Topics   Hours
1   INTRODUCTION   8 Hours

What is Data Science? Big Data and Data Science hype – and getting past the hype, Why
now? – Datafication, Current landscape of perspectives, Skill sets needed. Statistical
Inference: Populations and samples, Statistical modelling, probability distributions, fitting a
model.

Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining
research goals – Retrieving data – Data preparation - Exploratory Data analysis – build the
model– presenting findings and building applications - Data Mining - Data Warehousing – Basic
Statistical descriptions of Data.
PREPARING AND GATHERING DATA AND KNOWLEDGE:
Philosophies of data science - Data science in a big data world - Benefits and uses of data
science and big data - Facets of data: Structured data, Unstructured data, Natural Language,
Machine-generated data, Audio, Image and video streaming data - The Big Data Ecosystem:
Distributed file system, Distributed Programming framework, Data Integration framework,
Machine learning Framework, NoSQL Databases, Scheduling tools, Benchmarking Tools,
System Deployment, Service programming and Security. Big Data Fundamentals: Definition
and characteristics of big data (volume, variety, velocity), Impact of big data on different
industries, Challenges of processing big data with traditional methods. Apache Spark: A
Distributed Processing Engine: Introduction to Spark and its distributed nature, Components
of the Spark ecosystem (Spark Core, Spark SQL, Spark Streaming), Benefits of using Spark for
data processing. Building ETL Pipelines with Spark on Databricks: Introduction to the ETL
process (Extract, Transform, Load), Setting up a Databricks workspace (free tier available),
Connecting to data sources and data ingestion techniques in Spark, Data transformation and
manipulation using Spark Data Frames/Datasets

Pedagogy: Group activity: Summarize testable predictions for real-time data. Data can be taken
from Kaggle datasets and other open-source GitHub and data repositories.

2   DESCRIBING DATA   8 Hours
Types of Data - Types of Variables - Describing Data with Tables and Graphs - Describing Data
with Averages - Describing Variability - Normal Distributions and Standard (z) Scores.
THE DATA SCIENCE PROCESS - Overview of the data science process - defining research goals
and creating a project charter, retrieving data, cleansing, integrating and transforming data,
exploratory data analysis, building the models, presenting findings and building applications on top of
them.
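
As an illustration of the "Standard (z) Scores" topic above, the following minimal sketch computes z = (x - mean) / standard deviation using only the Python standard library. The function name `z_scores` and the sample values are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption made for the example.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the standard (z) score of each value: (x - mean) / std. deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


# Hypothetical sample, chosen only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(sample)
```

A z-score of 0 marks a value equal to the mean; positive and negative scores count standard deviations above and below it.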

Deep Dive into ETL with Spark: Data Ingestion and Cleaning: Techniques for
handling various data formats (text files, CSV, JSON), Addressing common data
quality issues (missing values, inconsistencies). Data Transformation with Spark
Functions : Working with Spark Data Frames/Datasets and applying transformation
functions -Filtering, aggregating, and manipulating data for analysis, Joining datasets
for comprehensive analysis. Data Quality Checks and Missing Value Handling (2
Hours): Implementing data quality checks to identify errors and inconsistencies,
Techniques for handling missing values (imputation, deletion), Ensuring data integrity
for reliable analysis, Introduction to Apache Spark SQL: Declarative data querying
with Spark SQL using SQL-like syntax, Integrating Spark SQL with Spark
DataFrames/Datasets, Performing complex queries on large datasets efficiently

Decision Trees
What Is a Decision Tree?, Entropy, The Entropy of a Partition, Creating a Decision Tree, Putting
It All Together, Random Forests, Neural Networks, Perceptrons, Feed-Forward Neural
Networks, Backpropagation, Example: Fizz Buzz, Deep Learning, The Tensor, The Layer
Abstraction, The Linear Layer, Neural Networks as a Sequence of Layers, Loss and
Optimization, Example: XOR Revisited, Other Activation Functions, Example: Fizz Buzz
Revisited, Softmaxes and Cross-Entropy, Dropout, Example: MNIST, Saving and Loading
Models, Clustering, The Idea, The Model, Example: Meetups, Choosing k, Example: Clustering
Colors, Bottom-Up Hierarchical Clustering.
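
The "Entropy" and "Entropy of a Partition" topics listed above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the textbook's code: the function names are hypothetical, and base-2 logarithms are assumed so that entropy is measured in bits.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def partition_entropy(subsets):
    """Weighted average entropy of a partition, given as a list of label lists."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets)


# Hypothetical labels: an even two-class split has the maximum entropy of 1 bit.
mixed = ["yes", "yes", "no", "no"]
```

A decision tree prefers the split whose partition entropy is lowest, because its subsets are closest to being single-class.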

Pedagogy: Blended learning: data collection from Kaggle and other repositories, and performing
case studies.
3   Feature Generation and Feature Selection   8 Hours
Extracting Meaning from Data: Motivating application: user (customer) retention. Feature
Generation (brainstorming, role of domain expertise, and place for imagination), Feature
Selection algorithms. Filters; Wrappers; Decision Trees; Random Forests. Recommendation
Systems: Building a User-Facing Data Product, Algorithmic ingredients of a Recommendation
Engine, Dimensionality Reduction, Singular Value Decomposition, Principal Component
Analysis, Exercise: build your own recommendation system.

Overview of Data Science Workflow – CRISP-DM. Descriptive Statistics:
Introduction to statistics - Summarizing data using central tendency (mean, median,
mode) and dispersion (variance, standard deviation) measures, Exploring data
distribution and skewness. Probability Concepts and Random Variables:
Understanding basic probability concepts and random variables, Different probability
distributions (normal, binomial, Poisson) and their applications in data analysis.
Hypothesis Testing: Formulating null and alternative hypotheses, Calculating
p-values and interpreting statistical significance, Making data-driven inferences based on
hypothesis testing. Correlation and Regression Analysis: Measuring the strength and
direction of relationships between variables, Understanding linear regression analysis
and its applications in data modeling
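
To make the "Correlation and Regression Analysis" topic above concrete, here is a minimal sketch of the Pearson correlation coefficient and the least-squares line for one predictor, written with the standard library only. The function names and the data are hypothetical; the data are constructed to lie exactly on y = 2x + 1.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation: strength and direction of a linear relationship."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the simple linear regression of y on x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
r = pearson_r(xs, ys)
slope, intercept = least_squares_line(xs, ys)
```

Both functions divide by the spread of the data, so they fail if all x values (or all y values, for the correlation) are identical.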

PYTHON LIBRARIES FOR DATA WRANGLING


Basics of Numpy arrays –aggregations –computations on arrays –comparisons, masks, Boolean
logic – fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and
selection – operating on data – missing data – Hierarchical indexing – combining datasets –
aggregation and grouping – pivot tables
Application for machine learning in data science- Tools used in machine learning- Modeling
Process – Training model – Validating model – Predicting new observations –Types of machine
learning Algorithm: Supervised learning algorithms, Unsupervised learning algorithms.
Pedagogy: Poster presentation: allows students to represent the concepts visually in order to
understand the topics easily.
4   Data Visualization and Data Exploration   8 Hours
Introduction: Data Visualization, Importance of Data Visualization, Data Wrangling, Tools and
Libraries for Visualization; Comparison Plots: Line Chart, Bar Chart and Radar Chart; Relation
Plots: Scatter Plot, Bubble Plot, Correlogram and Heatmap; Composition Plots: Pie Chart,
Stacked Bar Chart, Stacked Area Chart, Venn Diagram; Distribution Plots: Histogram, Density
Plot, Box Plot, Violin Plot; Geo Plots: Dot Map, Choropleth Map, Connection Map; What Makes a
Good Visualization?
Visualization: Introduction to data visualization – Data visualization options – Filters –
MapReduce – Dashboard development tools.

Model Development: Simple and Multiple Regression – Model Evaluation using Visualization –
Residual Plot – Distribution Plot – Polynomial Regression and Pipelines – Measures for
In-sample Evaluation – Prediction and Decision Making.

Exploratory Data Analysis and the Data Science Process


Basic tools (plots, graphs and summary statistics) of EDA, Philosophy of EDA, The Data
Science Process, Case Study: Real Direct (online real estate firm). Three Basic Machine
Learning Algorithms: Linear Regression, k-Nearest Neighbours (k-NN), k-means. Introduction
to Exploratory Data Analysis (EDA): Importance of EDA in data science projects,
Different stages of the EDA process, Techniques for gaining insights from data.
Univariate and Multivariate Data Analysis: Analyzing data distribution (histograms,
boxplots) for individual variables. Visualizing relationships between pairs of variables
(scatter plots). Exploring relationships among multiple variables using techniques like
correlation matrices.

Pedagogy: Demonstration: exhibits the implementation process.
5   A Deep Dive into Matplotlib and Case studies   8 Hours
Introduction, Overview of Plots in Matplotlib, Pyplot Basics: Creating Figures, Closing Figures,
Format Strings, Plotting, Plotting Using pandas DataFrames, Displaying Figures, Saving
Figures; Basic Text and Legend Functions: Labels, Titles, Text, Annotations, Legends; Basic
Plots: Bar Chart, Pie Chart, Stacked Bar Chart, Stacked Area Chart, Histogram, Box Plot,
Scatter Plot, Bubble Plot; Layouts: Subplots, Tight Layout, Radar Charts, GridSpec; Images:
Basic Image Operations, Writing Mathematical Expressions.
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots –
Histograms – legends – colors – subplots – text and annotation – customization –
three-dimensional plotting – Geographic Data with Basemap – Visualization with Seaborn.

Data Visualization with Matplotlib and Seaborn: Creating various data
visualizations (bar charts, line charts, heatmaps) using Matplotlib. Leveraging Seaborn,
a high-level library built on top of Matplotlib, for advanced and aesthetically pleasing
visualizations. Customizing visualizations to effectively communicate insights. Feature
Engineering for Machine Learning: Identifying and selecting relevant features for
machine learning models. Techniques for creating new features from existing data
(feature scaling, encoding categorical variables). Understanding the impact of feature
engineering on model performance. Machine Learning Algorithms for data science:
Supervised Learning Algorithms: Introduction to supervised learning and its goal of
predicting a target variable. Common supervised learning algorithms: Linear
Regression: Predicting continuous target variables using a linear relationship,
K-Nearest Neighbors (KNN): Predicting a target variable based on the k nearest
neighbors, Decision Trees: Making classification decisions based on a tree-like
structure, Understanding the concept of model evaluation metrics (accuracy, precision,
recall). Implementing basic supervised learning algorithms using Python libraries
(Scikit-learn). Unsupervised Learning Algorithms: Introduction to unsupervised
learning and its goal of uncovering hidden patterns, Common unsupervised learning
algorithms: K-Means Clustering: Grouping data points into clusters based on their
similarity, Principal Component Analysis (PCA): Reducing data dimensionality while
preserving key information, Understanding the applications of unsupervised learning in
data exploration and dimensionality reduction, Implementing basic unsupervised
learning algorithms using Python libraries (Scikit-learn)
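
The model evaluation metrics named above (accuracy, precision, recall) can be computed by hand for a binary problem, which may help before switching to a library. The sketch below is illustrative only: the function name is hypothetical, the labels are made up, and returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def classification_metrics(y_true, y_pred, positive):
    """Return (accuracy, precision, recall), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical true labels and predictions.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
accuracy, precision, recall = classification_metrics(y_true, y_pred, positive=1)
```

Precision asks how many predicted positives were right; recall asks how many actual positives were found.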

CASE STUDIES Distributing data storage and processing with frameworks - Case study: e.g.,
Assessing risk when lending money.
The Data Ecosystem
The different types of data structures, file formats, sources of data, and the languages data
professionals use in their day-to-day tasks. Various types of data repositories such as
Databases, Data Warehouses, Data Marts, Data Lakes, and Data Pipelines. The Extract,
Transform, and Load (ETL) process, which is used to extract, transform, and load data into data
repositories. A basic understanding of Big Data and Big Data processing tools such as Hadoop,
Hadoop Distributed File System (HDFS), Hive, and Spark.

Pedagogy: Case studies: map different domains to real-time applications.

List of Programs:

Sl. No. Experiments/Programs COs


1 A. Big Data Processing with Apache Spark (COs: CO3,4,5)
Objective: Understand the fundamentals of big data processing using Apache Spark.

Tasks:
i. Set up a Databricks workspace: Create a free Databricks account. Set up a new workspace
and cluster.
ii. Data Ingestion: Load a large dataset (e.g., a CSV file containing transaction data) into
Databricks.
• Basic Data Exploration: Use Spark DataFrames to explore the dataset and perform basic
operations like filtering, grouping, and aggregating data.
• ETL Pipeline: Build an ETL pipeline to clean and transform the data. Save the transformed
data back to a storage system (e.g., DBFS).
iii. Installation of the Python/R/Go language and the Visual Studio Code editor can be
demonstrated along with Kaggle data set usage.
iv. Write programs in Python/R and execute them in Visual Studio Code, PyCharm Community
Edition, or any other suitable environment.
v. A study was conducted to understand the effect of the number of hours students spent
studying on their performance in the final exams. Write code to plot a line chart with the
number of hours spent studying on the x-axis and the score in the final exam on the y-axis.
Use a red ‘*’ as the point character, label the axes and give the plot a title.
Number of hrs spent studying (x):        10   9    2    15   10   16   11   16
Score in the final exam (0 – 100) (y):   95   80   10   50   45   98   38   93
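
One possible approach to the line chart described in task v is sketched below. It assumes Matplotlib is installed; the function name `plot_study_hours`, the output file name, the axis labels and the title are all hypothetical choices. Sorting the points by x and selecting the non-interactive "Agg" backend are also assumptions of this sketch, not requirements of the task.

```python
# Data copied from the table in task v.
hours = [10, 9, 2, 15, 10, 16, 11, 16]
scores = [95, 80, 10, 50, 45, 98, 38, 93]

# Sort the (x, y) pairs by x so the connecting line runs left to right.
points = sorted(zip(hours, scores))


def plot_study_hours(out_path="study_hours.png"):
    """Draw the line chart and save it to out_path (a hypothetical file name)."""
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, so no display is needed
    import matplotlib.pyplot as plt

    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    fig, ax = plt.subplots()
    ax.plot(xs, ys, "r*-")  # format string: red, '*' markers, solid line
    ax.set_xlabel("Number of hours spent studying")
    ax.set_ylabel("Score in the final exam (0 - 100)")
    ax.set_title("Study hours vs. final exam score")
    fig.savefig(out_path)
    plt.close(fig)
```

Calling `plot_study_hours()` writes the figure to disk; in a notebook, `plt.show()` could be used instead of saving.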

d. For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a
histogram to check the frequency distribution of the variable ‘mpg’ (Miles per gallon).

2 a. Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle
(https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains
information about books. Write a program to demonstrate the following.
• Import the data into a DataFrame.
• Find and drop the columns which are irrelevant for the book information.
• Change the index of the DataFrame.
• Tidy up fields in the data, such as the date of publication, with the help of a simple
regular expression.
• Combine str methods with NumPy to clean columns.

b. Statistical Analysis with Python: Objective: Apply statistical concepts
to summarize and analyze data.

Tasks:

1. Descriptive Statistics:
o Calculate measures of central tendency (mean, median, mode)
and dispersion (variance, standard deviation) for a dataset.
2. Probability Distributions:
o Analyze a dataset to identify its underlying probability
distribution (e.g., normal, binomial).
o Visualize the distribution using histograms and probability
plots.
3. Hypothesis Testing:
o Formulate null and alternative hypotheses for a given problem.
o Perform hypothesis testing (e.g., t-test, chi-square test) and
interpret the results.
3 a. Train a regularized logistic regression classifier on the iris dataset
(https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris
dataset) using sklearn. Train the model with the following hyperparameter C = 1e4
and report the best classification accuracy.
b. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and
the associated hyperparameters. Train the model with the following set of
hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier,
no feature normalization. Also try C = 0.01, 1, 10. For the
above set of hyperparameters, find the best classification accuracy along with the total
number of support vectors on the test data.

4 A. Consider the following dataset. Write a program to demonstrate the working of the
decision tree based ID3 algorithm.
Price   Maintenance   Capacity   Airbag   Profitable
Low     Low           2          No       Yes
Low     Med           4          Yes      Yes
Low     Low           4          No       Yes
Low     Med           4          No       No
Low     High          4          No       No
Med     Med           4          No       No
Med     Med           4          Yes      Yes
Med     High          2          Yes      No
Med     High          5          No       Yes
High    Med           4          Yes      Yes
High    Med           2          Yes      Yes
High    High          2          Yes      No
High    High          5          Yes      Yes
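
The core step of ID3 is choosing the attribute with the highest information gain. The sketch below computes that gain for each attribute of the table above and picks a candidate root; it is a starting point under stated assumptions, not a full ID3 implementation (no recursion or tree building). The names `ROWS`, `information_gain`, `gains` and `root` are hypothetical, and the capitalization of "high" and "yes" has been normalized so that identical values compare equal.

```python
import math
from collections import Counter, defaultdict

COLUMNS = ["Price", "Maintenance", "Capacity", "Airbag", "Profitable"]
# Rows copied from the table above, with capitalization normalized.
ROWS = [
    ("Low", "Low", "2", "No", "Yes"),
    ("Low", "Med", "4", "Yes", "Yes"),
    ("Low", "Low", "4", "No", "Yes"),
    ("Low", "Med", "4", "No", "No"),
    ("Low", "High", "4", "No", "No"),
    ("Med", "Med", "4", "No", "No"),
    ("Med", "Med", "4", "Yes", "Yes"),
    ("Med", "High", "2", "Yes", "No"),
    ("Med", "High", "5", "No", "Yes"),
    ("High", "Med", "4", "Yes", "Yes"),
    ("High", "Med", "2", "Yes", "Yes"),
    ("High", "High", "2", "Yes", "No"),
    ("High", "High", "5", "Yes", "Yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr_index, target_index=-1):
    """Entropy of the target minus the weighted entropy after splitting on one attribute."""
    base = entropy([r[target_index] for r in rows])
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr_index]].append(r[target_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Gain of each candidate attribute; ID3 splits first on the largest.
gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
root = max(gains, key=gains.get)
```

A complete ID3 program would apply the same selection recursively to each subset until a subset is pure or no attributes remain.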

B. Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the
dataset correspond to the co-ordinates of each data point. The third column
corresponds to the actual cluster label. Compute the Rand index for the following methods:
o K-means clustering
o Single-link hierarchical clustering
o Complete-link hierarchical clustering
o Also visualize the dataset and identify which algorithm is able to recover
the true clusters.
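
The Rand index used in the task above is the fraction of point pairs on which two labelings agree, that is, pairs placed together in both or apart in both. A minimal standard-library sketch follows; the function name and the toy labels are hypothetical. For real work, scikit-learn's `sklearn.metrics.rand_score` computes the same quantity.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same points")
    if len(labels_true) < 2:
        raise ValueError("at least two points are needed to form a pair")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred  # both together or both apart
        total += 1
    return agree / total


# Hypothetical labelings: the cluster names differ but the grouping is identical.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

The index ignores the cluster names themselves, so a clustering that swaps label 0 and label 1 still scores 1.0. The pairwise loop is quadratic in the number of points.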

5 A. Import any CSV file to Pandas Data Frame and perform the following:
(a) Visualize the first and last 10 records
(b) Get the shape, index and column details
(c) Select/Delete the records (rows)/columns based on conditions.
(d) Perform ranking and sorting operations.
(e) Do required statistical operations on the given columns.
(f) Find the count and uniqueness of the given categorical values.
(g) Rename single/multiple columns
2. Import any CSV file to a Pandas DataFrame and perform the following:
(a) Handle missing data by detecting and dropping/filling missing values.
(b) Transform data using the apply() and map() methods.
(c) Detect and filter outliers.
(d) Perform vectorized string operations on a Pandas Series.
(e) Visualize data using Line Plots, Bar Plots, Histograms, Density Plots and Scatter
Plots.

B. Exploratory Data Analysis (EDA) with Python: Objective:
Conduct exploratory data analysis using Python libraries.

Tasks:

a. Univariate Analysis:
a. Analyze the distribution of individual variables using
histograms and boxplots.
b. Multivariate Analysis:
a. Explore relationships between pairs of variables using scatter
plots and correlation matrices.
c. Data Visualization:
a. Create various visualizations (bar charts, line charts, heatmaps)
using Matplotlib and Seaborn.
b. Customize the visualizations to effectively communicate
insights.
d. Feature Engineering:
a. Perform feature scaling and encoding of categorical variables.
b. Create new features from existing data to enhance model
performance.

6 A. Reading data from text files, Excel and the web and exploring various
commands for doing descriptive analytics on the Iris data set.
B. Use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the following:
1. Univariate analysis: Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis.
2. Bivariate analysis: Linear and logistic regression modeling
3. Multiple Regression analysis
4. Also compare the results of the above analysis for the two data sets.
C. Apply and explore various plotting functions on UCI data sets.
1. Normal curves
2. Density and contour plots
3. Correlation and scatter plots
4. Histograms
5. Three-dimensional plotting
D. Visualizing Geographic Data with Basemap
7 A. Supervised Learning with Scikit-learn: Objective: Implement and
evaluate supervised learning algorithms.

Tasks:
1. Data Preparation:
o Split a dataset into training and testing sets.
2. Linear Regression:
o Implement a linear regression model to predict a continuous
target variable.
o Evaluate the model's performance using metrics like mean
squared error (MSE).
3. Decision Trees:
o Build a decision tree classifier to predict a categorical target
variable.
o Assess the model's accuracy, precision, and recall.
4. K-Nearest Neighbors (KNN):
o Implement a KNN model for classification.
o Tune the hyperparameters (e.g., the number of neighbors) to
optimize performance.

b. Demonstrate a decision tree classification model and evaluate the performance of the
classifier on the Iris dataset.

c. Unsupervised Learning with Scikit-learn: Objective: Implement and
explore unsupervised learning algorithms.

Tasks:

i. K-Means Clustering:
a. Apply K-means clustering to a dataset to group similar data
points.
b. Visualize the clusters and interpret the results.
ii. Principal Component Analysis (PCA):
a. Perform PCA on a high-dimensional dataset to reduce its
dimensionality.
b. Analyze the principal components and their contribution to
variance.
iii. Data Exploration:
a. Use unsupervised learning techniques to uncover hidden
patterns and insights within the data.

Suggested Learning Resources For Lab:


● Virtual Labs (CSE): http://cse01-iiith.vlabs.ac.in/

1. Using Python : https://www.python.org


2. R Programming : https://www.r-project.org/
3. Python for Natural Language Processing : https://www.nltk.org/book/
4. Data set: https://bit.ly/2Lm75Ly
5. Data set: https://archive.ics.uci.edu/ml/datasets.html
6. Data set : www.kaggle.com/ruiromanini/mtcars
7. Pycharm : https://www.jetbrains.com/pycharm/
8. https://nptel.ac.in/courses/106/106/106106179/
9. https://nptel.ac.in/courses/106/106/106106212/
10. http://nlp-iiith.vlabs.ac.in/List%20of%20experiments.html
11. Spark Documentation: https://docs.databricks.com/en/index.html
12. Scikit-learn Documentation: https://scikit-learn.org/0.19/documentation.html
13. https://www.databricks.com/
14. https://www.simplilearn.com/tutorials/data-science-tutorial/what-is-data-science
15. https://www.youtube.com/watch?v=N6BghzuFLIg
16. https://www.coursera.org/lecture/what-is-datascience/fundamentals-of-data-science-tPgFU
17. https://www.youtube.com/watch?v=ua-CiDNNj30
18. https://nptel.ac.in/courses/106/105/106105077/
19. https://www.oreilly.com/library/view/doing-data-science/9781449363871/toc01.html
20. http://book.visualisingdata.com/
21. https://matplotlib.org/
22. https://docs.python.org/3/tutorial/
23. https://www.tableau.com/
24. https://skillsforall.com/course/introduction-data-science?courseLang=en-US&utm_campaign=writ&utm_content=intro-to-data-science-get-started-button&utm_source=cisco.com&utm_medium=referral
25. https://www.simplilearn.com/data-science-free-course-for-beginners-skillup
26. https://www.coursera.org/learn/what-is-datascience
27. https://www.coursera.org/learn/datasciencemathskills
28. https://www.coursera.org/specializations/introduction-data-science
29. https://www.coursera.org/learn/foundations-of-data-science
30. https://www.coursera.org/learn/data-science-ethics
31. https://www.coursera.org/learn/foundations-of-data-science
32. http://www.data8.org/
33. https://www.microsoft.com/en-us/research/publication/foundations-of-data-science-2/
34. https://www.codecademy.com/learn/paths/data-science-foundations
35. https://github.com/glouppe/dats0001-foundations-of-data-science
36. https://www.cambridge.org/core/books/abs/data-science-in-context/data-science/D767E4FD5E42834A1D92799541199663

Open ended Programs


Requirements for Open-ended Programs
It is required that the student be conversant with the R programming language or the Python
programming language and use them in implementing Data Science and Algorithms.

Data Sets
IRIS Data Set
Iris is a particularly famous toy dataset (i.e. a dataset with a small number of rows and columns,
mostly used for initial small-scale tests and proofs of concept). This specific dataset contains
information about the Iris, a genus that includes 260-300 species of plants. The Iris dataset
contains measurements for 150 Iris flowers, each belonging to one of three species: Virginica,
Versicolor and Setosa (50 flowers for each of the three species). Each of the 150 flowers
contained in the Iris dataset is represented by 5 values:
□ Sepal length, in cm
□ Sepal width, in cm
□ Petal length, in cm
□ Petal width, in cm
□ Iris species, one of: Iris-setosa, Iris-versicolor, Iris-virginica
Each row of the dataset represents a distinct flower (as such, the dataset has 150 rows). Each
row then contains 5 values (4 measurements and a species label). The dataset is described in
more detail on the UCI Machine Learning Repository website. The dataset can either be
downloaded directly from there (iris.data file), or from a terminal, using the wget tool. The
following command downloads the dataset from the original URL and stores it in a file named
iris.csv.
$ wget "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" -O iris.csv
MNIST Data Set
The MNIST dataset is another particularly famous dataset, provided here as a CSV file. It
contains several thousand handwritten digits (0 to 9). Each handwritten digit is contained in a
28 × 28 8-bit grayscale image. This means that each digit has 784 (28 × 28) pixels, and each
pixel has a value that ranges from 0 (black) to 255 (white). The dataset can be downloaded
from the following URL:
https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv
Each row of the MNIST dataset represents a digit. For the sake of simplicity, this dataset contains
only a small fraction (10,000 digits out of 70,000) of the real MNIST dataset, which is known as the
MNIST test set.
For each digit, 785 values are available.
1 Load the Iris dataset as a list of lists (each of the 150 lists should have 5 elements).
Compute and print the mean and the standard deviation for each of the 4 measurement
columns (i.e. sepal length and width, petal length and width). Compute and print the
mean and the standard deviation for each of the 4 measurement columns, separately
for each of the three Iris species (Versicolor, Virginica and Setosa). Which
measurement would you consider “best”, if you were to guess the Iris species based
only on those four values? (COs: CO3,4,5)
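
A possible starting point for open-ended program 1 is sketched below, using only the standard library and the list-of-lists representation the task asks for. The function names are hypothetical, `load_rows` assumes the iris.csv file produced by the wget command given earlier, and the use of the sample standard deviation (`statistics.stdev`) is an assumption; the task does not say which variant is intended.

```python
import csv
from collections import defaultdict
from statistics import mean, stdev


def load_rows(path):
    """Read rows as [float, float, float, float, label]; blank lines are skipped."""
    with open(path, newline="") as f:
        return [[float(v) for v in r[:4]] + [r[4]] for r in csv.reader(f) if r]


def column_stats(rows, n_features=4):
    """Return [(mean, stdev), ...] for the first n_features columns."""
    columns = list(zip(*rows))[:n_features]
    return [(mean(col), stdev(col)) for col in columns]


def stats_by_species(rows, label_index=4):
    """Group rows by species label and compute column_stats for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_index]].append(row)
    return {species: column_stats(group) for species, group in groups.items()}


# Tiny hypothetical rows (not real Iris measurements), used only to show the shapes.
sample_rows = [
    [5.0, 3.0, 1.0, 0.2, "A"],
    [7.0, 3.0, 3.0, 0.4, "A"],
    [6.0, 2.0, 5.0, 1.0, "B"],
    [8.0, 4.0, 7.0, 2.0, "B"],
]
overall = column_stats(sample_rows)
per_species = stats_by_species(sample_rows)
```

With the real file, `stats_by_species(load_rows("iris.csv"))` would give the per-species figures. One way to reason about the "best" measurement is to look for a column whose species means are far apart relative to their standard deviations.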
2 Load the MNIST dataset. Create a function that, given a position 1 ≤ k ≤ 10,000,
prints the k-th digit of the dataset (i.e. the k-th row of the CSV file) as a grid of 28 × 28
characters. More specifically, you should map each range of pixel values to the
following characters:
[0, 64) → " "
[64, 128) → "."
[128, 192) → "*"
[192, 256) → "#"
Compute the Euclidean distance between each pair of the 784-dimensional vectors of
the digits at the following positions: 26th, 30th, 32nd, 35th. Based on the distances
computed in the previous step and knowing that the digits listed are 7, 0, 1, 1, can you
assign the correct label to each of the digits? (COs: CO3,4,5)
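
The two building blocks of open-ended program 2, the pixel-to-character mapping and the Euclidean distance, can be sketched as follows. The function names are hypothetical. Each function expects only the 784 pixel values of a digit, so the caller must first drop the extra 785th value from the CSV row; which column holds that extra value (presumably the label) is not stated in the text above and should be checked against the file.

```python
import math


def pixel_to_char(value):
    """Map a pixel value in [0, 256) to one of the four characters from the task."""
    if value < 64:
        return " "
    if value < 128:
        return "."
    if value < 192:
        return "*"
    return "#"


def render_digit(pixels, width=28):
    """Return the digit as text, one line per image row of `width` pixels."""
    lines = [
        "".join(pixel_to_char(p) for p in pixels[start:start + width])
        for start in range(0, len(pixels), width)
    ]
    return "\n".join(lines)


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Tiny hypothetical 2 x 2 "image", used only to show the mapping.
tiny = render_digit([0, 64, 128, 192], width=2)
```

Printing `render_digit(pixels)` for a real row shows the digit as ASCII art. Two images of the same digit would be expected to have a smaller distance than images of different digits, which is the idea behind the labeling question.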
3 Split the Iris dataset into two datasets: IrisTest_TrainData.csv and
IrisTest_TestData.csv. Read them as two separate data frames named Train_Data
and Test_Data respectively. (COs: CO3,4,5)
Answer the following questions:
• How many missing values are there in Train_Data?
• What is the proportion of Setosa types in the Test_Data?
• What is the accuracy score of the K-Nearest Neighbor model (model_1) with 2/3
neighbors using Train_Data and Test_Data?
• Identify the list of indices of misclassified samples from 'model_1'.
Build a logistic regression model (model_2), keeping the modelling steps constant. Find
the accuracy of model_2.
4 Demonstrate any clustering model and evaluate its performance on the Iris dataset. (COs: CO3,4,5)
Text Books

Sl. No. Title of the Book/Name of the author/Name of the publisher/Edition and Year

1 Introducing Data Science, Davy Cielen, Arno D. B. Meysman and Mohamed Ali, Manning Publications,
2016
2 Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
(Units II and III)
Reference Books

1 Joel Grus, “Data Science from Scratch”, 2nd Edition, O’Reilly Publications/Shroff Publishers and Distributors
Pvt. Ltd., 2019. ISBN-13: 978-9352138326

2 Data Visualization Workshop, Tim Großmann and Mario Döbler, Packt Publishing, ISBN
9781800568112

3
4 Data Science for Business by Foster Provost and Tom Fawcett
(https://www.amazon.com/Data-Science-Business-Data-Analytic-Thinking-ebook/dp/B00E6EQ3X4)
5 Python for Data Analysis by Wes McKinney
(https://www.oreilly.com/library/view/python-for-data/9781491957653/)
6 Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron
7 Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016. (Units IV and V)
8 Cathy O’Neil and Rachel Schutt, “ Doing Data Science, Straight Talk From The
Frontline”, O’Reilly, 2014.

9 Jiawei Han, Micheline Kamber and Jian Pei, “ Data Mining: Concepts and
Techniques”, Third Edition. ISBN 0123814790, 2011.

10 Mohammed J. Zaki and Wagner Meira Jr., “Data Mining and Analysis:
Fundamental Concepts and Algorithms”, Cambridge University Press, 2014.

11 Jojo Moolayil, “Smarter Decisions : The Intersection of IoT and Data Science”, PACKT,
2016

12 Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman,
Cambridge University Press, 2nd edition, 2014
13 Think Like a Data Scientist, Brian Godsey, Manning Publications, 2017.

14 A handbook for data driven design by Andy krik


15 Foundations of Data Science∗ Avrim Blum, John Hopcroft, and Ravindran Kannan
Thursday 4th January, 2018
https://www.cs.cornell.edu/jeh/book.pdf

Course Outcomes: At the end of the course, the student will be able to:
CO  Course Outcomes  RBT Level  RBT Level Indicator
CO1 Describe data science terminology and understand the basics of data science. R, U Level 1
CO2 Apply the data science process to real-time scenarios and explain how data is collected, managed and stored for data science. R, U Level 2
CO3 Analyze data visualization tools, and build and prepare data for use with a variety of statistical methods and models. Ap Level 3
CO4 Apply data storage and processing with frameworks and analyze data using various visualization techniques. Ap, An Level 4
CO5 Apply visualization libraries in Python to interpret and explore data, use the Python libraries for data wrangling, and choose contemporary models, such as machine learning and AI techniques, to solve practical problems. Ap, An Level 4

Program Outcome of this Course

Sl. No. Description POs
1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and computer science and business systems to the solution of complex engineering and societal problems. PO1
2 Problem analysis: Identify, formulate, review research literature, and analyze complex engineering and business problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences. PO2
3 Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and environmental considerations. PO3
4 Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions. PO4
5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modeling to complex engineering activities with an understanding of the limitations. PO5
6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering and business practices. PO6
7 Environment and sustainability: Understand the impact of the professional engineering solutions in business, societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable development. PO7
8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering and business practices. PO8
9 Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings. PO9
10 Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as, being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions. PO10
11 Project management and finance: Demonstrate knowledge and understanding of the engineering, business and management principles and apply these to one’s own work, as a member and leader in a team, to manage projects and in multidisciplinary environments. PO11
12 Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change. PO12

Mapping of Course Outcomes to Program Outcomes:

CO/PO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
x x x x
CO1
x
CO2
x x x
CO3
x
CO4

CO5

CIE- Continuous Internal Evaluation (50 Marks)

Bloom’s Category  Theory: Continuous Assessment Tests (IAT)  Practical: Continuous Comprehensive Assessment (CCA) Practical Test
                  IAT-1 (50 Marks)  IAT-2 (50 Marks)  CCA-1 (50 Marks)  CCA-2 (50 Marks)
Remember
Understand
Apply
Analyse
Evaluate
Create

CIE Course Assessment Plan

Marks Distribution
CO’s  Test-1  Test-2  Total Marks  Weightage
Module-1  Module-2  Module 2 to 2.5  Module-2.5 to 3  Module-4  Module-5
CO1
CO2
CO3
CO4
CO5
Total

SEE- Semester End Examination (50 Marks)

Bloom’s Category  SEE Marks (90% Theory + 10% Practical Questions)
Remember
Understand
Apply
Analyse
Evaluate
Create

SEE Course Plan

Marks Distribution
CO’s  Total Marks  Weightage
Module-1  Module-2  Module 2 to 2.5  Module-2.5 to 3  Module-4  Module-5
CO1
CO2
CO3
CO4
CO5
Total