Introduction to Fundamentals of Data Science
Fundamentals of Data Science
Introduction
Course Outcomes
• Understand the key steps and pipeline of Data Science and its application in solving real-world problems.
• Recognize the importance of measuring similarity and dissimilarity between features in data for various analysis
tasks.
• Appreciate the significance of pre-processing techniques in preparing data for analysis in real-time scenarios.
• Identify the characteristics and practical applications of different regression models used in real-world scenarios.
• Evaluate classification models using appropriate metrics, including the confusion matrix, to assess model
performance and make informed decisions.
• Understand the principles of ensemble modeling and clustering, and apply appropriate ensemble techniques to
improve the accuracy and reliability of machine learning models.
Unit I
Introduction: Relation among AI, ML and Data Science, Importance of Data Science; Data Science Process;
Data Exploration: Objectives of Data Exploration, Forms of Data (Structured, Semi-Structured, Unstructured),
Datasets (data objects and types of attributes/fields), Characteristics of Datasets and corresponding Statistical
Measures; Data Visualization: Univariate Visualization, Multivariate Visualization.
Categorization of Data Science Algorithms. Overview of different kinds of datasets (e.g., text, image) and different
file formats (e.g., CSV, JSON).
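As an illustration of the Unit I topics on data formats and statistical measures, the following is a minimal sketch using only the Python standard library. The sample records, the constant names `CSV_TEXT` and `JSON_TEXT`, and the helper names are made up for illustration; they are not part of the course material.

```python
import csv
import io
import json
import statistics

# Hypothetical sample data: the same three records in CSV and in JSON form.
CSV_TEXT = "name,age,city\nAsha,23,Pune\nRavi,31,Delhi\nMeena,27,Pune\n"
JSON_TEXT = (
    '[{"name": "Asha", "age": 23, "city": "Pune"},'
    ' {"name": "Ravi", "age": 31, "city": "Delhi"},'
    ' {"name": "Meena", "age": 27, "city": "Pune"}]'
)


def load_csv_records(text):
    """Parse CSV text into a list of dicts (all values arrive as strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json_records(text):
    """Parse JSON text into a list of dicts (numbers keep their type)."""
    return json.loads(text)


def summarize(values):
    """Basic statistical measures for one numerical attribute."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# CSV values are strings, so the numerical attribute must be converted.
csv_ages = [int(row["age"]) for row in load_csv_records(CSV_TEXT)]
json_ages = [row["age"] for row in load_json_records(JSON_TEXT)]
age_summary = summarize(csv_ages)
```

The point of the sketch is that the same dataset can arrive in different formats, and that attribute types (here, `age` as a string in CSV but a number in JSON) have to be identified before any statistical measure is computed.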
Unit II
Data Similarity/Dissimilarity: Understanding data similarity and dissimilarity, Measures for comparing different
types of data (nominal, ordinal, binary, numerical).
Data Preprocessing: Data Preprocessing Pipeline, Preprocessing techniques for cleaning and integrating data, Data
reduction techniques for handling large datasets.
Cosine Similarity, Distance-based similarity (Euclidean distance, Jaccard similarity).
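The three measures named in Unit II can be sketched in a few lines of plain Python. This is a minimal illustration, assuming numerical vectors of equal length for Euclidean distance and cosine similarity, and sets (or binary attributes expressed as sets) for Jaccard similarity; the function names are chosen here for illustration.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numerical vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; depends on direction, not length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)
```

For example, `euclidean_distance([0, 0], [3, 4])` gives 5.0, and `jaccard_similarity({1, 2, 3}, {2, 3, 4})` gives 0.5 because two of the four distinct items are shared. Note that Euclidean distance is a dissimilarity (larger means less alike), while the other two are similarities.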
Unit III
Regression: Introduction to linear regression for forecasting numerical quantities, Logistic regression for
classification problems, Regularization techniques for improving model performance;
Classification: Classification Principles, Classification Model Evaluation Metrics (Confusion Matrix), Classification
using Decision Trees, Distance-based Classifier (k-NN), Bayesian classifier.
Regression vs classification.
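To make the confusion matrix topic in Unit III concrete, here is a minimal sketch for the binary case, written in plain Python. The function names and the example labels are assumptions for illustration; in practice a library such as scikit-learn would typically be used.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Derive common metrics from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels, made up for this example.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
metrics = classification_metrics(tp, fp, fn, tn)
```

With these labels the four cells are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so accuracy, precision, recall and F1 all come out to 0.75.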
Unit IV
Ensemble Learning: Conditions for Ensemble Modeling, Overview of ensemble techniques (Voting, Bagging,
Boosting and Random Forest);
Clustering: Clustering Principles, Clustering for description/preprocessing/classification, Types of Clustering,
Clustering Evaluation Parameters, Clustering Algorithms (k-Means) and Evaluation metrics for assessing the quality
of clustering results;
Applications/Purpose of Clustering.
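The "Conditions for Ensemble Modeling" item in Unit IV can be illustrated with a small calculation. The sketch below assumes an idealized setting that is not stated in the syllabus: a binary problem, an odd number of base classifiers, each with the same accuracy `p`, and errors that are fully independent. Under those assumptions, the accuracy of a majority vote follows a binomial sum.

```python
import math


def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent classifiers,
    each correct with probability p, are correct (n assumed odd)."""
    if n % 2 == 0:
        raise ValueError("n should be odd so that a majority always exists")
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )
```

For instance, `majority_vote_accuracy(0.7, 3)` is about 0.784, better than any single model, whereas `majority_vote_accuracy(0.4, 3)` is about 0.352, worse than a single model. This is one way to see the two usual conditions: base models should be better than random guessing, and their errors should be as uncorrelated as possible. Real classifiers are rarely fully independent, so actual gains tend to be smaller.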
Practical Components
1. Perform data exploration techniques on any dataset to understand its characteristics, identify attribute types, and calculate
relevant statistical measures for numerical attributes.
2. Choose a dataset with multiple attributes, select relevant variables, and employ appropriate visualization techniques to
explore their distributions and summary statistics (you can use the Python libraries matplotlib or seaborn for visualization).
3. Take a dataset with missing values or inconsistencies and demonstrate the steps involved in cleaning and integrating the data.
Apply techniques such as data imputation, outlier detection, and data standardization to preprocess the dataset.
4. Select a large dataset and apply data reduction techniques such as feature selection and dimensionality reduction (e.g., PCA,
t-SNE) to handle its size while preserving important information and patterns in the data.
5. Select a dataset with numerical quantities and perform linear regression to forecast a specific target variable. Evaluate the
performance of the regression model using appropriate evaluation metrics such as MSE or RMSE. Apply any regularization
techniques such as L1 or L2 regularization to improve the model's performance. Compare the results with and without
regularization and discuss the impact on model accuracy.
6. Choose a dataset suitable for classification and apply the KNN algorithm to build a classification model. Utilize
appropriate evaluation metrics and construct a confusion matrix to assess the model's performance.
7. Choose a dataset suitable for classification or regression and explore any ensemble learning techniques such as voting,
bagging, or boosting. Discuss the conditions under which ensemble modeling is beneficial compared to individual models.
8. Select a dataset and apply the k-means clustering algorithm to perform clustering for classification purposes. Use
evaluation metrics such as silhouette coefficient, cohesion, and separation to assess the quality of the clustering results.
Experiment with different values of k and analyze the impact on the clustering outcome. Discuss the strengths and
limitations of the k-means algorithm.
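As a starting point for practical 8, the following is a minimal k-means sketch in plain Python (3.8 or later, for `math.dist`). It assumes the initial centroids are supplied by the caller rather than chosen at random, which keeps the result reproducible. It reports only cohesion, measured as the within-cluster sum of squared errors; the silhouette coefficient and separation are not computed here. The function names and sample points are chosen for illustration.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def cohesion_sse(points, centroids, labels):
    """Within-cluster sum of squared distances (lower means tighter clusters)."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


# Hypothetical 2-D points forming two visually separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
sse = cohesion_sse(points, centroids, labels)
```

Rerunning `kmeans` with different numbers of initial centroids and comparing `cohesion_sse` is one simple way to explore the effect of k. The SSE always decreases as k grows, so it should be read alongside a measure such as the silhouette coefficient rather than minimized on its own.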