Data Science Fundamentals
University Lecture Notes - CS 4820
Professor: Dr. Sarah Johnson
Semester: Fall 2025
Course: Introduction to Data Science
Lecture 1: Introduction to Data Science
Learning Objectives
By the end of this lecture, students will be able to:
Define data science and its core components
Understand the data science lifecycle
Identify different types of data and their characteristics
Recognize applications across various industries
What is Data Science?
Data Science is an interdisciplinary field that combines:
Statistics & Mathematics: Foundation for analysis
Computer Science: Programming and algorithms
Domain Expertise: Understanding business context
Communication: Presenting insights effectively
The Data Science Process
1. Problem Definition
Understanding business objectives
Translating business questions into analytical problems
Defining success metrics
2. Data Collection
Identifying relevant data sources
Web scraping and API integration
Database queries and data extraction
Survey design and data gathering
3. Data Exploration & Cleaning
Exploratory Data Analysis (EDA)
Handling missing values
Outlier detection and treatment
Data quality assessment
4. Feature Engineering
Creating new variables from existing data
Dimensionality reduction techniques
Variable transformation and scaling
Feature selection methods
5. Modeling
Algorithm selection
Model training and validation
Hyperparameter tuning
Performance evaluation
6. Deployment & Monitoring
Model deployment strategies
Monitoring model performance
Model maintenance and updates
A/B testing frameworks
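Steps 3 and 4 of the process can be sketched with the standard library alone. This is a minimal illustration, not a prescribed pipeline: the readings are hypothetical, the function names (impute_median, iqr_outliers, standardize) are made up for this example, and the 1.5 x IQR rule is only one common convention for flagging outliers.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical sensor readings; None marks a missing value
raw = [12.0, None, 15.0, 14.0, 13.0, 95.0, None, 16.0]

filled = impute_median(raw)        # step 3: handle missing values
outliers = iqr_outliers(filled)    # step 3: detect outliers
scaled = standardize([v for v in filled if v not in outliers])  # step 4: scaling
```

The median is used for imputation because, unlike the mean, it is barely affected by the extreme reading that the outlier check later flags.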
Lecture 2: Python for Data Science
Essential Libraries
NumPy: Numerical Computing
import numpy as np
# Creating arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])
# Mathematical operations
mean_value = np.mean(arr)  # arithmetic mean
std_dev = np.std(arr)      # population standard deviation (ddof=0 by default)
Pandas: Data Manipulation
import pandas as pd
# Reading data
df = pd.read_csv('data.csv')
# Basic operations
df.head()
df.info()
df.describe()
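# Data cleaning (each call returns a new DataFrame; df itself is unchanged)
df_dropped = df.dropna()  # drop rows that contain missing values
df_filled = df.ffill()    # forward-fill gaps; replaces the deprecated fillna(method='ffill')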
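Because data.csv is not included with these notes, the cleaning calls can be tried on a small in-memory DataFrame. The column names and values below are hypothetical. The sketch shows that dropping and forward-filling are alternative strategies, and that both return a new DataFrame and leave the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for data.csv
df = pd.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "temperature": [21.0, np.nan, 23.5, np.nan, 22.0],
})

missing_per_column = df.isna().sum()  # count missing values per column
dropped = df.dropna()                 # strategy A: keep only complete rows
filled = df.ffill()                   # strategy B: carry the last valid value forward
```

Forward-filling assumes the rows are in a meaningful order (for example, sorted by time). For unordered data, imputing with a column statistic such as the median is usually a safer choice.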
Matplotlib & Seaborn: Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
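# Basic plotting (x and y are placeholder data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.figure(figsize=(10, 6))
plt.plot(x, y)
plt.title('Sample Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
# Statistical plots (assumes df has 'category' and 'value' columns)
sns.boxplot(data=df, x='category', y='value')
plt.show()
correlation_matrix = df.corr(numeric_only=True)  # pairwise correlations of numeric columns
sns.heatmap(correlation_matrix, annot=True)
plt.show()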
Data Types and Structures
Structured Data:
Tabular format (rows and columns)
Relational databases
CSV files, Excel spreadsheets
Example: Customer transaction records
Semi-Structured Data:
JSON, XML formats
Log files with consistent patterns
Web data with tags
Example: Social media posts with metadata
Unstructured Data:
Text documents, images, videos
Audio files, emails
Social media content
Example: Customer reviews, medical images
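In practice, semi-structured data is often converted to structured form before analysis. The sketch below flattens one hypothetical JSON record (a social media post with metadata) into a single row with dotted column names. The flatten helper and the record itself are illustrative assumptions, not part of any library.

```python
import json

# Hypothetical social media post with nested metadata (semi-structured)
raw_post = '''
{
  "id": 101,
  "text": "Loving the new release!",
  "user": {"name": "ana", "followers": 250},
  "tags": ["release", "feedback"]
}
'''

def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            # Join list items into one cell so the row stays rectangular
            flat[full_key] = ";".join(str(v) for v in value)
        else:
            flat[full_key] = value
    return flat

row = flatten(json.loads(raw_post))
print(row)
```

Each flattened record has the same shape as a row in a table, so a list of such rows can be written to CSV or loaded into a DataFrame.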
Lecture 3: Statistical Foundations
Descriptive Statistics
Measures of Central Tendency:
Mean: Average value
Formula: μ = Σx / n
Sensitive to outliers
Median: Middle value when sorted
More robust to outliers
Better for skewed distributions
Mode: Most frequently occurring value
Useful for categorical data
Can have multiple modes
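A quick way to see why the median is more robust than the mean is to add a single extreme value and compare. The numbers below are hypothetical salaries in thousands; only the standard library is used.

```python
import statistics

salaries = [42, 45, 47, 50, 52]   # hypothetical values, in thousands
with_outlier = salaries + [400]   # add one extreme value

mean_before = statistics.mean(salaries)         # 47.2
median_before = statistics.median(salaries)     # 47
mean_after = statistics.mean(with_outlier)      # 106
median_after = statistics.median(with_outlier)  # 48.5

# multimode returns every value tied for most frequent
modes = statistics.multimode(["red", "blue", "red", "green", "blue"])
```

One outlier more than doubles the mean while the median moves by only 1.5. The last line shows a categorical sample with two modes ("red" and "blue").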
Measures of Dispersion:
Variance: Average squared deviation from mean
Formula (population variance): σ² = Σ(x - μ)² / n
Standard Deviation: Square root of variance
Same units as original data
Interpretable measure of spread
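Dividing by n, as in the formula above, gives the population variance. When the data are a sample from a larger population, the usual estimate divides by n - 1 instead. The following sketch computes both by hand on a small made-up dataset and compares them with Python's statistics module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

mu = sum(data) / len(data)
squared_devs = [(x - mu) ** 2 for x in data]

pop_var = sum(squared_devs) / len(data)           # divide by n
sample_var = sum(squared_devs) / (len(data) - 1)  # divide by n - 1
pop_std = math.sqrt(pop_var)                      # same units as the data

# The standard library provides both versions directly
assert pop_var == statistics.pvariance(data)
assert math.isclose(sample_var, statistics.variance(data))
```

For this dataset the population variance is 4 and the population standard deviation is 2. Note that np.std and np.var default to the population form, while pandas' std and var default to the sample form.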
Distribution Shapes:
Normal Distribution: Bell-shaped, symmetric
Skewed Distribution: Asymmetric tail
Bimodal Distribution: Two peaks
Uniform Distribution: Equal probability across range
Inferential Statistics
Hypothesis Testing:
1. Null Hypothesis (H₀): No effect or difference
2. Alternative Hypothesis (H₁): There is an effect
3. Significance Level (α): Typically 0.05
4. P-value: Probability of observing results at least as extreme as the data, assuming H₀ is true
5. Decision Rule: Reject H₀ if p-value < α
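The five steps above can be traced in a short sketch. For simplicity it uses a one-sample, two-sided z-test, which assumes the population standard deviation is known; this test is chosen here only because it needs nothing beyond the standard library. The function name and the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: population mean equals mu0 (sigma assumed known)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))  # standardized test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # probability of a result this extreme under H0
    return z, p_value, p_value < alpha                # decision rule: reject H0 if p < alpha

# Hypothetical study: H0 says the mean score is 100, sigma is 15
z, p_value, reject = one_sample_z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)

# Same observed mean with a smaller sample
z_small, p_small, reject_small = one_sample_z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=25)
```

With n = 100 the statistic is z = 2.0 and the p-value is about 0.046, so H₀ is rejected at α = 0.05. With n = 25 the same observed mean gives z = 1.0 and a p-value of about 0.32, so H₀ is not rejected: the same difference is weaker evidence when the sample is smaller.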
Common Statistical Tests:
T-test: Compare means between two groups
Chi-square test: Test independence between categorical variables
ANOVA: Compare means across three or more groups
Correlation analysis: Measure the strength and direction of a linear relationship
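As an illustration of the last item, the Pearson correlation coefficient can be computed directly from its definition. The helper name pearson_r and both datasets are made up for this sketch. The second dataset shows that a perfect but non-linear relationship can still produce a correlation of zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Roughly linear: hypothetical study hours vs. exam scores
r_linear = pearson_r([1, 2, 3, 4, 5], [52, 55, 61, 64, 70])

# Perfectly dependent but non-linear: y = x squared on a symmetric range
r_curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```

Here r_linear is close to 1, while r_curved is 0 even though y is completely determined by x. A low correlation therefore rules out a linear relationship only, not a relationship in general.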
Lecture 4: Machine Learning Basics
Types of Machine Learning
Supervised Learning:
Uses labeled training data
Goal: Predict target variable for new data
Examples: Classification, Regression
Unsupervised Learning:
No target variable provided
Goal: Discover hidden patterns
Examples: Clustering, Dimensionality Reduction
Reinforcement Learning:
Agent learns through interaction
Goal: Maximize cumulative reward
Examples: Game playing, Robotics
Model Evaluation
Classification Metrics:
Accuracy: (TP + TN) / (TP + TN + FP + FN)
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
F1-Score: 2 * (Precision * Recall) / (Precision + Recall)
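The four formulas above translate directly into code. The sketch below takes the confusion-matrix counts as input; the function name and the counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than a universal rule.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: report 0.0 when a denominator would be zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical imbalanced problem: 10 positives among 100 cases
metrics = classification_metrics(tp=5, tn=90, fp=0, fn=5)
```

In this example accuracy is 0.95 even though the model misses half of the positive cases (recall 0.5, F1 about 0.67), which is why accuracy alone can be misleading on imbalanced data.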
Regression Metrics:
Mean Absolute Error (MAE): Σ|yᵢ - ŷᵢ| / n
Mean Squared Error (MSE): Σ(yᵢ - ŷᵢ)² / n
R-squared (R²): Proportion of variance in the target explained by the model
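The regression metrics can be computed by hand in a few lines. This sketch assumes the common definition R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of the true values. The function name and the data are hypothetical.

```python
def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, and R-squared for paired true and predicted values."""
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)            # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true) # total variation
    r2 = 1 - ss_res / ss_tot
    return mae, mse, r2

# Hypothetical true values and model predictions
mae, mse, r2 = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 10.0])
```

For these values MAE is 0.5, MSE is 0.375, and R² is 0.925. MSE penalizes the single error of 1.0 more heavily than MAE does because errors are squared before averaging.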
Cross-Validation:
K-fold cross-validation
Leave-one-out cross-validation
Stratified cross-validation
Time series cross-validation
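To make the idea concrete, here is a minimal sketch of how k-fold cross-validation partitions row indices. The generator k_fold_indices is a hypothetical helper written for illustration; libraries such as scikit-learn provide tested implementations. It does not shuffle, so rows are assigned to folds in their original order.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every row is used for validation exactly once and for training in the remaining folds. Setting k equal to n_samples gives leave-one-out cross-validation. Stratified and time series variants change how rows are assigned to folds (by class proportion, or strictly by time order) but follow the same train/validate loop.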
Lecture 5: Data Visualization
Principles of Effective Visualization
Choose Appropriate Chart Types:
Bar charts: Comparing categories
Line charts: Showing trends over time
Scatter plots: Exploring relationships
Histograms: Showing distributions
Heatmaps: Displaying correlation matrices
Design Guidelines:
1. Clarity: Clear titles, labels, and legends
2. Simplicity: Avoid chart junk and unnecessary elements
3. Consistency: Use consistent colors and styles
4. Accessibility: Consider color-blind friendly palettes
Advanced Visualization Techniques
Interactive Visualizations:
Plotly for interactive web-based charts
Bokeh for large dataset visualization
D3.js for custom interactive graphics
Dashboard Creation:
Tableau for business intelligence
Power BI for Microsoft ecosystem
Streamlit for Python-based dashboards
Dash for web applications
Assignments and Projects
Assignment 1: Exploratory Data Analysis
Due: Week 3
Objective: Perform comprehensive EDA on provided dataset
Deliverables:
Data quality report
Statistical summary
Visualization portfolio
Insights and recommendations
Assignment 2: Predictive Modeling
Due: Week 6
Objective: Build and evaluate machine learning models
Requirements:
Data preprocessing pipeline
Model comparison and selection
Performance evaluation
Model interpretation
Final Project: End-to-End Data Science Project
Due: Week 12
Scope: Complete data science project from problem definition to deployment
Components:
Problem statement and objectives
Data collection and preprocessing
Exploratory data analysis
Model development and evaluation
Business recommendations
Presentation to class
Study Resources
Recommended Textbooks
1. "Python for Data Analysis" by Wes McKinney
2. "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman
3. "Pattern Recognition and Machine Learning" by Christopher Bishop
Online Resources
Kaggle Learn courses and competitions
Coursera Data Science Specialization
edX MITx Introduction to Computational Thinking and Data Science
GitHub repositories with sample projects
Practice Datasets
Iris flower classification
Boston housing prices (removed from scikit-learn in version 1.2; the California housing dataset is a common substitute)
Titanic passenger survival
Netflix movie recommendations
COVID-19 tracking data
Office Hours and Contact Information
Office Hours: Tuesdays and Thursdays, 2:00-4:00 PM
Location: Computer Science Building, Room 314
Email: sarah.johnson@university.edu
Course Website: www.university.edu/cs4820