Course Code Course name L T P C
Foundation of Data Science 3 0 1
Total Units to be Covered: 5 Total Contact Hours:
Prerequisite: Python Programming    Syllabus version: 1.0
Course Objectives
1. To explore the different concepts of statistics.
2. To acquire a basic understanding of machine learning (ML) models.
3. To comprehend the software requirements for implementing statistical and ML
models.
Course Outcomes
CO1. Understand the fundamentals of Data Science.
CO2. Acquire the concepts and tools of data integration and data processing.
CO3. Explore software for data integration and data preprocessing.
CO4. Apply statistical and ML methods for predictive modelling.
CO5. Develop skills for effective data visualization.
CO-PO Mapping
Course     Program Outcomes
Outcomes   PO1   PO2   PO3   PO4   PO5   PO6   PO7   PO8   PO9   PO10  PO11  PO12  PSO1  PSO2  PSO3
CO 1       2     3     2     2     1     -     -     -     2     -     3     -     1     2     -
CO 2       2     3     2     2     1     -     -     -     2     -     3     -     1     2     -
CO 3       2     3     2     3     1     -     -     -     2     -     3     -     1     2     -
CO 4       2     3     2     3     1     -     -     -     2     -     3     -     1     2     -
Average    2     3     2     2.5   1     -     -     -     2     -     3     -     1     2     -
1 – Weakly Mapped (Low)
2 – Moderately Mapped (Medium)
3 – Strongly Mapped (High)
“-” means there is no correlation
Syllabus
Unit I: Introduction to Data Science
7 Lecture Hours
Evolution of Data Science, Data Science Roles, Stages in a Data Science Project,
Applications of Data Science in Various Fields, Data Security Issues, Mathematical
Foundations for Data Science, Exploratory Data Analysis, Data Munging or Data
Wrangling, Theory of Causation, The Difference Between Business Intelligence (BI),
Data Analytics and Data Science
Unit II: Data Collection and Data Pre-Processing
7 Lecture Hours
Data Collection Strategies, Data Pre-Processing Overview, Data Cleaning, Data
Integration and Transformation, Data Reduction, Data Discretization, Binary
Encoding, One-Hot Encoding, Standardization, Normalization, Databases, SQL
Tables, Functions, Pandas, Data Types and Formats (Structured, Unstructured,
Semi-Structured), Data Collection Methods (APIs, Web Scraping, Databases)
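As an illustration of three Unit II topics (Normalization, Standardization, and One-Hot Encoding), the following is a minimal sketch using only the Python standard library. The function names and the sample columns `ages` and `cities` are hypothetical and chosen for this example only; in practice the course lists Pandas, which offers equivalent operations.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Convert values to z-scores using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot_encode(labels):
    """Return the sorted categories and one 0/1 vector per input label."""
    categories = sorted(set(labels))
    vectors = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, vectors


# Hypothetical sample data, for illustration only.
ages = [20, 30, 40, 50]
cities = ["Pune", "Delhi", "Pune"]

print(min_max_normalize(ages))
print(standardize(ages))
print(one_hot_encode(cities))
```

Normalization bounds the values to a fixed range, while standardization centres them at zero with unit spread; which one suits a given model is a design choice discussed in the unit.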
Unit III: Exploratory Data Analytics & Descriptive Statistics
11 Lecture Hours
Introduction to Exploratory Data Analytics and Descriptive Statistics (Mean, Standard
Deviation), Skewness and Kurtosis (Box Plots, Pivot Table, Heat Map, Correlation
Statistics), Basic Probability Concepts, Conditional Probability and Bayes' Theorem,
Probability Distributions (Binomial, Poisson, Normal), Inferential Statistics (Sampling
Methods, Central Limit Theorem, Confidence Intervals), Hypothesis Testing (Null and
Alternative Hypotheses, Type I and Type II Errors, t-tests, Chi-Square Tests, ANOVA),
Regression Analysis (Simple Linear Regression, Multiple Linear Regression,
Assumptions of Regression Analysis, Model Evaluation Metrics (R², Adjusted R²,
RMSE))
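The descriptive statistics, Simple Linear Regression, and the R² and RMSE metrics listed in Unit III can be sketched as follows. This is a minimal standard-library example, not a prescribed course implementation; the helper names and the `hours` and `scores` data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(y, y_pred):
    """Share of the variance in y explained by the predictions."""
    y_bar = mean(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def rmse(y, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(yi - pi) ** 2 for yi, pi in zip(y, y_pred)]
    return sqrt(mean(squared_errors))


# Hypothetical sample data, for illustration only.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

slope, intercept = fit_line(hours, scores)
predicted = [slope * h + intercept for h in hours]

print("mean:", mean(scores), "sample std dev:", round(stdev(scores), 3))
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("R^2:", round(r_squared(scores, predicted), 3))
print("RMSE:", round(rmse(scores, predicted), 3))
```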
Unit IV: Model Development (Classification & Clustering Methods)
13 Lecture Hours
Simple and Multiple Regression, Supervised vs. Unsupervised Learning, Key
Algorithms (Linear Regression, Decision Trees, K-Means), Classification Algorithms
(K-Nearest Neighbors, Support Vector Machines, etc.), Clustering Techniques
(K-Means, Hierarchical Clustering, DBSCAN, etc.), Dimensionality Reduction (Principal
Component Analysis), Anomaly Detection, Feature Selection and Extraction, Handling
Categorical and Numerical Data, Model Selection and Hyperparameter Tuning, Model
Evaluation (Confusion Matrix, ROC Curve, AUC, Cross-Validation, Metrics), Model
Evaluation using Visualization, Residual Plot, Distribution Plot, Polynomial
Regression and Pipelines, Measures for In-sample Evaluation, Prediction and
Decision Making
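For the Model Evaluation topics in Unit IV, a binary Confusion Matrix and the metrics derived from it can be sketched as below. This is an illustrative standard-library example under the assumption of a two-class problem with the positive class labelled 1; the function names and label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(counts):
    """Derive accuracy, precision, and recall from confusion-matrix counts."""
    total = sum(counts.values())
    return {
        "accuracy": safe_ratio(counts["tp"] + counts["tn"], total),
        "precision": safe_ratio(counts["tp"], counts["tp"] + counts["fp"]),
        "recall": safe_ratio(counts["tp"], counts["tp"] + counts["fn"]),
    }


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(counts)
print(counts)
print(metrics)
```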
Unit V: Big Data and Cloud Computing
7 Lecture Hours
Introduction to Big Data Technologies (Hadoop, Spark), Definition and Characteristics
of Big Data (Volume, Variety, Velocity, Veracity), Big Data vs. Traditional Data,
Overview of Big Data Technologies and Ecosystem, Big Data Storage and Processing
Frameworks, Distributed Systems and Parallel Computing, Overview of Hadoop
Ecosystem (HDFS, YARN, MapReduce), Introduction to Apache Spark, Use Cases
and Applications of Big Data, Data Storage and Management (NoSQL), Relational vs.
NoSQL Databases, Types of NoSQL Databases (Key-Value, Document,
Column-Family, Graph), CAP Theorem and BASE Properties, NoSQL Use Cases and
Advantages.
Cloud Platforms for Data Science (AWS, Google Cloud, Azure), Definition and History
of Cloud Computing, Benefits and Challenges of Cloud Computing, Key Concepts
(Scalability, Elasticity, Agility), Cloud Service Models (IaaS, PaaS, SaaS), Overview of
Amazon Web Services (AWS), Overview of Microsoft Azure, Overview of Google
Cloud Platform (GCP), Comparison of Cloud Providers
Total Lecture Hours: 45
Textbooks
1. Peter Bruce, Andrew Bruce, Peter Gedeck, "Practical Statistics for Data Scientists:
50+ Essential Concepts Using R and Python", 2nd Edition, June 2020, O'Reilly
2. Balamurugan Balusamy, Nandhini Abirami R, et al., "Big Data: Concepts,
Technology, and Architecture", June 2021, Wiley
3. Derrick Rountree, Ileana Castrillo, "The Basics of Cloud Computing:
Understanding the Fundamentals of Cloud Computing in Theory and Practice",
November 2013, Syngress
Reference Books
1. Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras, and
TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems", Third
Edition, 2022, O'Reilly
2. Funmi Obembe, Ofer Engel, "A Hands-on Introduction to Big Data Analytics",
February 2024, SAGE Publications Ltd.
Modes of Evaluation: Quiz / Assignment / Presentation / Extempore / Written
Examination
Examination Scheme
Components      IA    Mid Sem   End Sem   Total
Weightage (%)   50    20        30        100
Detailed breakup of Internal Assessment
Internal Assessment Component   Weightage in calculation of Internal Assessment (100 marks)
Quiz 1                          15%
Quiz 2                          15%
Class Test 1                    15%
Class Test 2                    15%
Assignment 1/Project            20%
Assignment 2/Project            20%