
Week 1 Data Science

The document outlines a comprehensive six-week curriculum on Data Science, covering fundamentals, data management, preprocessing, statistical methods, visualization, and advanced tools. It emphasizes the importance of data science across various sectors such as business, healthcare, and technology, highlighting its role in decision-making, automation, and competitive advantage. Additionally, it provides real-world examples and terminologies related to data science, illustrating its applications and significance in today's data-driven world.

Week 1 – Fundamentals of Data Science

Day 1: Introduction to Data Science – What, Why, and Importance


Day 2: Data Science Terminologies (AI, ML, BI, Big Data, etc.)
Day 3: Applications of Data Science (e.g., Healthcare, Marketing, Gaming)
Day 4: Data Types – Structured, Unstructured, Semi-structured, Metadata
Day 5: Data Science Tools – Overview (Python, R, SAS, Jupyter, etc.)

Week 2 – Data Collection and Management


Day 6: Data Management Plan – Collection, Use, Reuse
Day 7: Primary vs. Secondary Data – Methods and Examples
Day 8: APIs and Data Sources in Real World
Day 9: Storage Management & Mass Storage Devices
Day 10: Storage Resource Management (SRM) Concepts

Week 3 – Data Preprocessing and Wrangling


Day 11: Data Cleaning Techniques – Handling Missing Values
Day 12: Data Transformation & Wrangling
Day 13: Encoding Methods (Label, One-hot, Target, Frequency)
Day 14: Introduction to Python for Data Handling
Day 15: Python Lists, Strings, Sets, and Basic Operations

Week 4 – Statistical Methods and Data Analysis


Day 16: Data Analysis vs Data Analytics
Day 17: Types of Data – Nominal, Ordinal, Interval, Ratio
Day 18: Numerical vs Categorical Data – Discrete and Continuous
Day 19: Big Data Analytics – Descriptive, Diagnostic, Predictive, Prescriptive
Day 20: Python Loops and Modules for Data Manipulation

Week 5 – Data Visualization and Communication


Day 21: Introduction to Data Visualization – Importance & Tools
Day 22: Visualization Techniques – Charts, Plots, Maps, Diagrams
Day 23: Retinal & Preattentive Attributes in Visualization
Day 24: Matplotlib & Seaborn in Python
Day 25: Advanced Tools – Tableau, D3.js, Power BI, etc.

Week 6 – Advanced Tools and Case Studies


Day 26: Machine Learning Basics – Supervised vs Unsupervised
Day 27: Algorithms – Classification, Regression, Clustering
Day 28: Python Libraries – NumPy, Pandas, Scikit-learn
Day 29: Big Data Visualization Challenges & Solutions
Day 30: Case Study

Introduction to Data Science: What, Why, and Importance

What is Data Science?


Data Science is a multidisciplinary field that uses various techniques, algorithms, and processes
to extract insights and knowledge from structured and unstructured data. It combines expertise
from statistics, computer science, machine learning, and domain knowledge to analyze large
volumes of data, identify patterns, make predictions, and guide decision-making.

1. Structured Data
Definition:
Data that is organized in a predefined format, usually in tables (rows and columns). It is easy to
search, filter, and analyze using tools like Excel, SQL, etc.
Example:
A table of student information:
| Student ID | Name | Age | Grade |
|------------|-------|-----|-------|
| 101 | Asha | 20 | A |
| 102 | Rohan | 21 | B |
This is structured data because it is clearly organized.

2. Unstructured Data
Definition:
Data that has no specific format or structure. It's harder to process and analyze directly using
traditional tools.
Example:
A WhatsApp message:
"Hey Usha! Congrats on your research in lung cancer detection 🎉. Let's meet this weekend!"
This is unstructured — it's just text, and not in a tabular format.

Summary:
| Type | Format | Example |
|--------------|---------------------------|---------------------------|
| Structured | Tables (rows/columns) | Student database |
| Unstructured | Free text, images, videos | WhatsApp messages, images |
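
To make the contrast concrete, here is a minimal Python sketch, assuming pandas is installed. The column names (student_id, name, age, grade) are illustrative choices, not a fixed schema.

```python
import pandas as pd

# Structured: the student table from the example, with named columns.
students = pd.DataFrame(
    {
        "student_id": [101, 102],
        "name": ["Asha", "Rohan"],
        "age": [20, 21],
        "grade": ["A", "B"],
    }
)

# Structured data can be filtered directly by column.
older_than_20 = students[students["age"] > 20]
print(older_than_20["name"].tolist())  # ['Rohan']

# Unstructured: free text has no columns, so it must be processed first.
message = "Hey Usha! Congrats on your research in lung cancer detection. Let's meet this weekend!"
words = message.lower().split()
print(len(words))  # 14
```

The table can be queried right away, while the message first has to be broken into words (or processed with NLP tools) before any analysis is possible.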

Key components of Data Science include:


• Data Collection: Gathering raw data from various sources.
• Data Cleaning: Processing and preparing data to remove inconsistencies or errors.
• Data Exploration: Analyzing data to understand its structure and characteristics.
• Modeling and Analysis: Applying statistical or machine learning models to extract insights.
• Data Visualization: Presenting data in a clear and understandable format.

Why Data Science?


The explosion of data in recent years, coupled with the availability of powerful computational
tools, has made data science a vital field for driving business growth, innovation, and research.

Here's why it's crucial:


• Decision-Making: Data science provides data-driven insights that help businesses and
organizations make informed decisions. Instead of relying on intuition or guesswork, data
science allows decision-makers to base their choices on factual evidence.
• Automation: Machine learning models, a key part of data science, can automate
repetitive tasks, improve efficiency, and reduce human error.
• Competitive Advantage: Companies that leverage data science to analyze trends and
customer behavior gain a competitive edge over those that do not.
• Predictive Capabilities: Data science enables the development of predictive models,
helping businesses forecast future trends and make proactive decisions.
• Personalization: From personalized recommendations on Netflix to targeted ads on
social media, data science plays a huge role in tailoring experiences and products to
individual users.

Importance of Data Science

1. In Business:
o Customer Insights: Businesses use data science to understand customer
preferences, behaviors, and trends, leading to better-targeted marketing and
improved customer service.
o Operational Efficiency: Data science helps optimize supply chains, streamline
operations, and reduce costs.
o Risk Management: Companies use data science for fraud detection, financial
analysis, and to assess various risks.
2. In Healthcare:
o Medical Research: Data science is used in drug discovery, genomics, and clinical
trials to uncover new insights in medicine.
o Healthcare Predictions: It helps in predicting disease outbreaks, patient
conditions, and improving treatment outcomes.
3. In Government:
o Public Policy: Governments use data science for decision-making in areas like
education, healthcare, and public safety.
o Social Good: Data science is used in managing resources, tackling poverty,
disaster response, and addressing climate change.

4. In Technology:
o AI and Machine Learning: Data science is the backbone of artificial
intelligence, enabling the creation of smarter machines and applications.
o Big Data: With the growth of big data, the role of data science becomes crucial in
analyzing massive datasets to uncover insights.

Conclusion

Data Science is more than just a technical field—it's a critical driver of innovation and efficiency
across multiple industries. By enabling smarter decision-making, improving predictions, and
fostering automation, data science has become indispensable in the modern world. Whether
you're in business, healthcare, technology, or any other field, the ability to harness and interpret
data is a key factor for success.

Real-World Example: Predictive Analytics in E-commerce (Amazon)

Scenario:
Amazon uses data science to recommend products to its users based on their browsing history,
purchase history, cart items, and search queries.
What (How Data Science Works Here):
1. Data Collection: Amazon collects vast amounts of data from its users—what they search
for, what they buy, how much time they spend on a product page, and even what they add
to their wish lists.
2. Data Cleaning: This raw data often contains missing or irrelevant entries. Data scientists
clean the data to remove inconsistencies.
3. Exploratory Data Analysis (EDA): Patterns like frequent purchases, trending items, or
seasonal behaviours are identified.
4. Modelling and Prediction: Amazon uses machine learning algorithms (like
collaborative filtering) to predict what a customer might want to buy next.
5. Data Visualization: Dashboards are used internally to help teams visualize what
products are trending, which marketing campaigns are working, etc.
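
The source does not describe Amazon's production system, so the following is only a toy sketch of the collaborative filtering idea from step 4, using NumPy. The purchase matrix, the product names, and the recommend helper are all invented for illustration.

```python
import numpy as np

# Hypothetical purchase matrix: rows are users, columns are products.
# 1 means the user bought the product, 0 means they did not.
products = ["phone", "case", "charger", "novel"]
purchases = np.array(
    [
        [1, 1, 1, 0],  # user 0
        [1, 1, 0, 0],  # user 1
        [0, 0, 0, 1],  # user 2
        [1, 0, 1, 0],  # user 3
    ],
    dtype=float,
)

# Cosine similarity between product columns (item-based filtering).
norms = np.linalg.norm(purchases, axis=0)
similarity = (purchases.T @ purchases) / np.outer(norms, norms)
np.fill_diagonal(similarity, 0.0)  # a product should not recommend itself


def recommend(user_index, top_n=1):
    """Score products by similarity to what the user already bought."""
    owned = purchases[user_index]
    scores = similarity @ owned
    scores[owned > 0] = -np.inf  # hide products the user already has
    best = np.argsort(scores)[::-1][:top_n]
    return [products[i] for i in best]


print(recommend(1))  # ['charger']
```

User 1 bought a phone and a case; other phone buyers also bought chargers, so the charger scores highest. Real systems work on far larger, sparser matrices and combine many more signals.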

Why (Purpose of Using Data Science):


• To increase sales through personalized recommendations.
• To enhance customer experience by showing them relevant products instead of random
ones.
• To reduce cart abandonment by reminding users of items they viewed or left in their
cart.
• To gain insights into customer behaviour and market trends.

Importance (Impact Created):


• Boosts Revenue: A significant portion of Amazon's revenue comes from its
recommendation engine.
• Customer Retention: Personalized experiences encourage users to return and buy more.
• Operational Efficiency: Amazon knows what to stock more, where to stock it, and how
to price it based on data-driven insights.
• Competitive Advantage: Their data science-driven strategy is one reason Amazon
dominates the e-commerce market.

Conclusion:
This example of Amazon demonstrates how data science turns raw data into valuable
insights, leading to smarter business decisions, increased efficiency, and better user satisfaction.
It clearly showcases the what, why, and importance of data science in real life.

Data Science Terminologies (AI, ML, BI, Big Data, etc.)

1. Artificial Intelligence (AI)

Definition: The simulation of human intelligence in machines that can perform tasks like
reasoning, learning, and decision-making.

Example:
• Google Maps using AI to suggest faster routes by analyzing traffic in real time.
• ChatGPT answering your questions in conversational, human-like language.

2. Machine Learning (ML)

Definition: A subset of AI where machines learn from data without being explicitly
programmed.
Example:
• Netflix recommending movies based on your watch history.
• A model predicting if a credit card transaction is fraudulent.

3. Deep Learning
Definition: A type of ML that uses neural networks with many layers to model complex patterns.
Example:
• Facial recognition in smartphones.
• Self-driving cars detecting pedestrians and signs.

4. Big Data
Definition: Extremely large datasets that are too complex for traditional tools to process. It is
characterized by the 5Vs – Volume, Velocity, Variety, Veracity, and Value.
Example:
• Facebook handling billions of posts, messages, and images every day.
• Amazon analyzing massive customer transaction data for personalized deals.

5. Business Intelligence (BI)


Definition: Tools and techniques to transform raw data into meaningful insights to support
business decisions.
Example:
• A sales dashboard showing product performance across regions.
• Power BI or Tableau reports for monthly revenue analysis.

6. Data Mining
Definition: The process of discovering patterns and relationships in large datasets.
Example:
• Analyzing supermarket purchases to see that people often buy bread and butter together.
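
The bread-and-butter pattern can be found by counting which items appear together. This is a minimal sketch using only the Python standard library; the baskets are made up, and support and confidence are the usual market-basket measures, introduced here as an illustration rather than taken from the source.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets (invented for illustration).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

most_common_pair, count = pair_counts.most_common(1)[0]

# Support: share of all baskets that contain the pair.
support = count / len(baskets)

# Confidence: of the baskets with bread, the share that also contain butter.
bread_baskets = sum(1 for basket in baskets if "bread" in basket)
confidence = pair_counts[("bread", "butter")] / bread_baskets

print(most_common_pair, support, confidence)  # ('bread', 'butter') 0.6 0.75
```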

7. Data Analytics
Definition: The science of analyzing raw data to draw conclusions. It includes descriptive,
diagnostic, predictive, and prescriptive analytics.
Example:
• Analyzing app usage to understand which features users love.
• Predicting next month’s sales using past data.
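
As a rough illustration of the sales example, the sketch below fits a straight-line trend to six months of invented sales figures with NumPy and extends it one month ahead. A linear trend is an assumption made for simplicity; real forecasts usually need to account for seasonality and other factors.

```python
import numpy as np

# Hypothetical monthly sales for months 1 to 6 (made-up numbers).
months = np.arange(1, 7)
sales = np.array([100, 110, 120, 130, 140, 150], dtype=float)

# Fit a straight line: sales ≈ slope * month + intercept.
slope, intercept = np.polyfit(months, sales, 1)

# Extend the line to month 7.
forecast = slope * 7 + intercept
print(round(forecast, 1))  # 160.0
```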

8. Data Engineering
Definition: Preparing and building data pipelines so that data is clean, usable, and accessible for
analysis.
Example:
• Building the backend pipeline that collects data from a fitness app, processes it, and
stores it in a data warehouse.

9. Data Visualization
Definition: Representing data through charts, graphs, and visuals to make insights easy to
understand.
Example:
• A line graph showing COVID-19 cases over time.
• A pie chart showing user distribution by country.

10. Neural Networks


Definition: Algorithms modelled after the human brain, used in deep learning to recognize
patterns.
Example:
• Voice assistants like Siri understanding your speech.
• Image tagging on Instagram or Google Photos.

11. Natural Language Processing (NLP)

Definition: A branch of AI that enables machines to understand and interpret human language.
Example:
• Chatbots answering customer questions.
• Gmail auto-completing your sentences.

12. Computer Vision


Definition: A field of AI that helps machines understand visual data (images, videos).
Example:
• Google Lens identifying objects via your phone camera.
• Security cameras detecting unusual activity.

13. Internet of Things (IoT)

Definition: Interconnected physical devices that collect and exchange data via the internet.
Example:
• Smart watches tracking your heart rate and syncing with your phone.
• Smart refrigerators alerting you when groceries are low.

14. Cloud Computing


Definition: Delivering data storage, processing, and analytics over the internet.
Example:
• Storing files in Google Drive instead of on your own computer.
• Companies using AWS or Azure for data analysis.

Data Science Terminologies with Real-World Examples

1. Data
Definition: Raw facts and figures without context.
Example: A list of temperatures recorded every hour in Chennai for a week.

2. Dataset
Definition: A collection of related data, usually organized in a table.
Example: An Excel sheet containing columns like Date, City, Temperature, Humidity, etc.

3. Data Cleaning
Definition: The process of correcting or removing inaccurate records from a dataset.
Example: Removing rows with missing values or fixing typos in city names like "Chenai" to
"Chennai".

4. Feature
Definition: An individual measurable property or characteristic of a data point.
Example: In a house price dataset, features could be the number of bedrooms, size in square
feet, and location.

5. Label
Definition: The target variable the model is trying to predict.
Example: In a house price prediction model, the price of the house is the label.

6. Model
Definition: A mathematical representation trained on data to make predictions or decisions.
Example: A machine learning model trained to detect spam emails based on content and sender
information.

7. Training Data
Definition: The data used to train a machine learning model.
Example: 80% of an email dataset used to teach the model which emails are spam.

8. Test Data
Definition: The data used to evaluate the model's performance.
Example: The remaining 20% of the email dataset is used to test if the model correctly classifies
emails as spam or not.
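
An 80/20 split like the one described can be produced with scikit-learn's train_test_split. This is a small sketch; the email names and labels are placeholders invented for the example.

```python
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 emails with labels (1 = spam, 0 = not spam).
emails = [f"email_{i}" for i in range(10)]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

# Hold out 20% for testing; random_state makes the shuffle repeatable.
train_emails, test_emails, train_labels, test_labels = train_test_split(
    emails, labels, test_size=0.2, random_state=0
)

print(len(train_emails), len(test_emails))  # 8 2
```

The model only ever sees the training portion while learning, so the test portion gives an honest estimate of how it handles new emails.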

9. Overfitting
Definition: When a model learns the training data too well, including its noise and errors, and
therefore performs poorly on new data.
Example: A stock prediction model that works perfectly on past data but fails on new, unseen data.

10. Underfitting
Definition: When a model is too simple to learn the underlying patterns in the data.
Example: A linear model trying to predict a non-linear relationship between hours studied and
exam scores.
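
Overfitting and underfitting can be seen by fitting polynomials of different degrees to the same data. The sketch below uses NumPy with synthetic data: a sine curve plus noise stands in for a non-linear relationship, and the degrees 1, 3, and 9 are arbitrary choices for illustration.

```python
import numpy as np

# Synthetic, made-up data: a curved relationship plus a little noise.
rng = np.random.default_rng(0)
hours = np.linspace(0, 1, 30)  # hours studied, rescaled to the range 0..1
scores = np.sin(2 * np.pi * hours) + rng.normal(scale=0.2, size=hours.size)

# Alternate points between the training and test sets.
x_train, y_train = hours[::2], scores[::2]
x_test, y_test = hours[1::2], scores[1::2]


def errors(degree):
    """Fit a polynomial on the training set; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse


results = {degree: errors(degree) for degree in (1, 3, 9)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The straight line (degree 1) cannot follow the curve, so its error stays high on both sets: that is underfitting. Training error always falls as the degree rises, so a low training error alone does not show a good model; when the test error is clearly worse than the training error, the model is overfitting.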

11. Supervised Learning


Definition: A type of ML where the model is trained on labeled data.
Example: Predicting house prices (label) based on features like area, location, and age.

12. Unsupervised Learning


Definition: A type of ML where the model is trained on unlabeled data to find hidden patterns.
Example: Grouping customers into segments based on buying behavior without predefined
categories.

13. Clustering
Definition: Grouping similar data points together.
Example: Grouping YouTube viewers into clusters based on watch history.
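
A minimal clustering sketch with scikit-learn's KMeans follows. The viewers and their weekly watch hours are invented, and choosing two clusters is an assumption made for this toy data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical viewers: weekly hours of [gaming videos, cooking videos].
watch_hours = np.array(
    [
        [9.0, 1.0],
        [8.0, 0.5],
        [10.0, 1.5],
        [1.0, 7.0],
        [0.5, 9.0],
        [1.5, 8.0],
    ]
)

# Ask for two groups; no labels are provided, only the watch hours.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = model.fit_predict(watch_hours)
print(cluster_ids)
```

The first three viewers end up in one cluster and the last three in the other. The cluster numbers themselves (0 or 1) are arbitrary; only the grouping matters.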

14. Classification
Definition: Predicting categories or labels.
Example: Predicting whether a tumor is benign or malignant.

15. Regression
Definition: Predicting continuous values.
Example: Predicting the salary of an employee based on years of experience.

16. Accuracy
Definition: The percentage of correct predictions made by the model.
Example: If the model correctly classifies 90 out of 100 emails, the accuracy is 90%.

17. Confusion Matrix


Definition: A table used to evaluate the performance of a classification model.
Example: Helps understand how many spam emails were correctly or incorrectly classified.

18. Precision
Definition: Out of all predicted positive cases, how many are actually positive.
Example: If 100 emails are marked spam and only 80 are truly spam, precision is 80%.

19. Recall
Definition: Out of all actual positive cases, how many were correctly predicted.
Example: If there are 100 spam emails and 90 are detected, recall is 90%.

20. F1 Score
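Definition: The harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall).
Example: With a precision of 80% and a recall of 90%, F1 = 2 × 0.8 × 0.9 / (0.8 + 0.9) ≈ 84.7%. It is a
balanced measure when precision and recall are equally important.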
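
Accuracy, precision, recall, and F1 score can all be computed from four counts. The sketch below uses plain Python; the counts describe a hypothetical spam filter and are made up for illustration.

```python
# Hypothetical spam-filter results on 200 emails (counts are made up).
true_positives = 80   # spam correctly flagged as spam
false_positives = 20  # legitimate emails wrongly flagged as spam
false_negatives = 10  # spam the filter missed
true_negatives = 90   # legitimate emails correctly let through

total = true_positives + false_positives + false_negatives + true_negatives

accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}")    # accuracy=0.850
print(f"precision={precision:.3f}")  # precision=0.800
print(f"recall={recall:.3f}")        # recall=0.889
print(f"f1={f1:.3f}")                # f1=0.842
```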

Applications of Data Science

• Education
Data science is used to analyze student performance and learning patterns and to predict
drop-outs.
Example: EdTech platforms like BYJU’S use learning analytics to personalize lessons for
each student.

• Manufacturing
Predictive maintenance and process optimization are powered by data science.
Example: GE (General Electric) uses data from sensors on machines to predict failures
before they happen.

• Sports
Analyzing player performance, game strategies, and injury risks.
Example: In the NBA, teams use player tracking data to improve performance and make
recruitment decisions.

• Energy Sector
Forecasting demand, optimizing energy usage, and detecting faults.
Example: Smart grids use data science to predict peak demand times and manage load
distribution efficiently.

• Retail
Inventory optimization, customer behavior analysis, and sales forecasting.
Example: Walmart uses predictive analytics to manage stock levels and plan promotions.

• Weather Forecasting
Data science models are used to predict weather patterns, storms, and climate changes.
Example: The India Meteorological Department (IMD) uses data models to issue
cyclone warnings.

• Banking
Customer credit scoring, default prediction, and loan approval automation.
Example: HDFC Bank uses data science to evaluate customer eligibility for pre-approved
loans.

• Cybersecurity
Intrusion detection, anomaly detection, and phishing attack prevention.
Example: Google uses machine learning to detect and block phishing emails in Gmail.

• Telecommunications
Churn prediction, network optimization, and user behavior analysis.
Example: Jio uses customer usage data to offer targeted data packs and reduce user
churn.

• Aviation
Flight delay prediction, route optimization, and fuel efficiency.
Example: Delta Air Lines uses predictive analytics to reduce flight delays by
analyzing weather, air traffic, and historical data.

• Insurance
Risk assessment, fraud detection, and claims prediction.
Example: Progressive Insurance uses telematics data (from vehicle sensors)
to offer personalized car insurance premiums based on driving behavior.

• Real Estate
Price prediction, market trend analysis, and property recommendation.
Example: Zillow uses data science to estimate property values and suggest
homes to buyers based on preferences and location trends.

• Environmental Science
Air quality monitoring, deforestation tracking, and wildlife conservation.
Example: NASA uses satellite data and machine learning to detect illegal
deforestation and monitor climate change indicators globally.

• Automotive Industry
Autonomous driving, vehicle safety, and predictive maintenance.
Example: Tesla collects data from its fleet to train self-driving models and
release over-the-air updates to improve driving performance.

• Government and Public Policy
Crime prediction, resource allocation, and policy analysis.
Example: The Chicago Police Department uses predictive analytics to
identify high-crime areas and optimize patrolling strategies.

• Human Resources (HR)
Resume screening, employee attrition prediction, and performance analysis.
Example: LinkedIn uses machine learning to match job seekers with relevant
job postings and recommend candidates to recruiters.

• Tourism and Hospitality
Demand forecasting, customer personalization, and review sentiment analysis.
Example: Airbnb uses data science to suggest accommodations based on user
behavior, price trends, and location preferences.

• Space Exploration
Mission planning, spacecraft health monitoring, and anomaly detection.
Example: NASA's Mars Rover missions use machine learning to analyze
Martian terrain and autonomously select safe navigation paths.

Data Types – Structured, Unstructured, Semi-structured, Metadata

In data science, understanding different data types is crucial for selecting appropriate
storage, processing, and analysis techniques. Here's a breakdown of the four major data
types:

1. Structured Data

Definition:
Structured data refers to data that is organized in a fixed format, typically rows and
columns (like in relational databases). It's easily searchable and manageable with SQL or
spreadsheet tools.

Characteristics:
• Clearly defined fields
• Stored in tables (rows & columns)
• Easy to input, query, and analyze

Examples:
• An Excel spreadsheet with columns: Customer_ID, Name, Age, Purchase_Amount
• A table in a SQL database such as MySQL or PostgreSQL


| Customer_ID | Name | Age | Purchase_Amount |
|-------------|-------- |-----|-----------------|
| 1001 | Alice | 28 | $250 |
| 1002 | Bob | 35 | $320 |
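
Because the fields are fixed, a table like this can be queried with SQL. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL or PostgreSQL; the table name customers and the lowercase column names are assumptions for the example.

```python
import sqlite3

# In-memory SQLite database, used here as a stand-in for a larger SQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER, purchase_amount REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1001, "Alice", 28, 250.0), (1002, "Bob", 35, 320.0)],
)

# Query by column: customers older than 30.
rows = conn.execute(
    "SELECT name, purchase_amount FROM customers WHERE age > 30"
).fetchall()
print(rows)  # [('Bob', 320.0)]

conn.close()
```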

2. Unstructured Data

Definition:
Unstructured data doesn't follow a predefined format or model. It's typically text-heavy,
image, audio, or video-based, and harder to analyze directly.

Characteristics:
• No fixed structure
• Needs preprocessing (e.g., NLP for text)
• Requires advanced tools to extract meaning

Examples:
• Text files: Emails, social media posts
• Media files: Photos, videos, audio recordings
• PDF documents
• A chat message: "Hey Usha! I loved your latest blog on machine learning. Super helpful!"

3. Semi-structured Data

Definition:
Semi-structured data doesn't reside in a traditional table format, but still contains tags
or markers to separate elements, making it easier to process than unstructured data.

Characteristics:
• Has structure, but not as rigid as relational databases
• Often found in formats like JSON, XML, YAML

Examples:
• JSON data:
{
  "id": 101,
  "name": "Usha",
  "skills": ["Python", "Machine Learning", "Marketo"]
}
• XML documents
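
The JSON record above can be read with Python's built-in json module. In this short sketch, the second record (id 102, with no skills field) is made up to show that semi-structured records do not all need the same fields.

```python
import json

# The JSON record from the example, as raw text.
raw = '{"id": 101, "name": "Usha", "skills": ["Python", "Machine Learning", "Marketo"]}'

# Keys act as markers, so fields can be looked up by name after parsing.
record = json.loads(raw)
print(record["name"])         # Usha
print(len(record["skills"]))  # 3

# Hypothetical second record with a missing field: still valid JSON.
other = json.loads('{"id": 102, "name": "Rohan"}')
print(other.get("skills", []))  # []
```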

4. Metadata

Definition:
Metadata is "data about data." It describes or provides information about other data,
helping users understand, find, or manage the actual data.

Characteristics:
• Descriptive
• Enhances data usability
• Doesn't include the content, only information about the content

Examples:
• For an image:
o File type: JPEG
o Size: 1.2 MB
o Resolution: 1920x1080
o Date created: May 30, 2025
• For a document:
o Title: "Lung Cancer Detection Thesis"
o Author: Usha
o Word count: 15,000
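
File metadata of this kind can be read without opening the content. The sketch below uses only the Python standard library; it writes a small temporary file (the name notes.txt and its text are placeholders) and then reads the file's metadata.

```python
import os
import tempfile
from datetime import datetime, timezone

with tempfile.TemporaryDirectory() as folder:
    # Create a small placeholder file so there is something to inspect.
    path = os.path.join(folder, "notes.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Lung cancer detection notes")

    # os.stat returns information about the file, not its content.
    info = os.stat(path)
    metadata = {
        "file_name": os.path.basename(path),
        "extension": os.path.splitext(path)[1],
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }

print(metadata["file_name"], metadata["extension"], metadata["size_bytes"])
# notes.txt .txt 27
```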

Summary Table

| Type | Structure | Examples |
|-----------------|------------------|------------------------------------------------|
| Structured | Fixed (tables) | SQL database, Excel sheet |
| Unstructured | No fixed format | Videos, images, text files, social media posts |
| Semi-structured | Loose structure | JSON, XML, HTML |
| Metadata | Descriptive info | Image resolution, file type, document author |

Quiz: Data Types in Data Science


1. What type of data is organized in rows and columns, making it easy to query using
SQL?
a) Unstructured Data
b) Structured Data
c) Semi-structured Data
d) Metadata
2. Which data type includes videos, images, and text files that do not have a fixed
format?
a) Structured Data
b) Semi-structured Data
c) Unstructured Data
d) Metadata
3. JSON and XML are examples of which type of data?
a) Structured Data
b) Unstructured Data
c) Semi-structured Data
d) Metadata
4. What is Metadata?
a) Data with no format
b) Data about data
c) Data stored in tables
d) Data in JSON format
5. Which of the following is an example of metadata for an image file?
a) The image resolution and file size
b) The pixels of the image itself
c) A paragraph of text in the image
d) The colors used in the image
6. True or False: Semi-structured data is more flexible than structured data but still
contains some organizational markers.

Answers:


1. b) Structured Data
2. c) Unstructured Data
3. c) Semi-structured Data
4. b) Data about data
5. a) The image resolution and file size
6. True

Data Science Tools – Overview (Python, R, SAS, Jupyter, etc.)

🐍 Python
• General-purpose programming language widely used in Data Science.
• Libraries:
o Pandas – data manipulation (df.describe())
o NumPy – numerical computations (np.array())
o Matplotlib / Seaborn – data visualization (sns.heatmap())
o Scikit-learn – machine learning (model.fit())
• Example: Predicting house prices using linear regression.
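
The house price example can be sketched in a few lines with pandas and scikit-learn. The data is invented and follows an exact straight line (price = 2 × area + 10), which keeps the result easy to check; real housing data is noisier and has many more features.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical houses: area in square metres and price (arbitrary units).
houses = pd.DataFrame(
    {"area_sqm": [50, 80, 100, 120], "price": [110, 170, 210, 250]}
)

# Pandas: quick summary statistics for each column.
print(houses.describe())

# Scikit-learn: fit a linear regression of price on area.
model = LinearRegression()
model.fit(houses[["area_sqm"]], houses["price"])

# Predict the price of a new 90 square metre house.
new_house = pd.DataFrame({"area_sqm": [90]})
predicted_price = model.predict(new_house)[0]
print(round(predicted_price, 1))  # 190.0
```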

📊 R
• Statistical computing language popular in academia and research.
• Libraries:
o ggplot2 – powerful data visualization
o dplyr – data wrangling (filter(), mutate())
o caret – machine learning modeling
• Example: Analyzing survey data and visualizing trends.

📈 SAS (Statistical Analysis System)

• Commercial software suite for advanced analytics.
• Features:
o Strong in statistical analysis, data mining, and business intelligence
o Uses PROC syntax for statistical procedures (PROC REG)
• Example: Risk analysis in banking.

📓 Jupyter Notebook
• Web-based interactive environment for writing and running code.
• Supports Python, R, etc.
• Great for:
o Data exploration and visualization
o Documentation with Markdown
• Example: Step-by-step EDA (Exploratory Data Analysis) with plots and comments.

Other Tools
• Excel – Data manipulation, pivot tables, simple charts
• Tableau / Power BI – Drag-and-drop data visualization tools
• Apache Spark – Big Data processing, often used with PySpark
• Google Colab – Cloud-based Jupyter notebook with free GPU access
• KNIME / RapidMiner – GUI-based Data Science platforms

Data Science Tools in Healthcare – Comparison Table

| Tool | Role in Healthcare | Key Features | Example Use Cases |
|------|--------------------|--------------|-------------------|
| Python | Data analysis, ML modeling | Pandas, Scikit-learn, TensorFlow, NLP libraries | Lung cancer detection, EHR data modeling |
| R | Biostatistics, clinical trial analysis | ggplot2, survival analysis packages | Drug efficacy studies, patient survival models |
| SAS | Regulatory-compliant analytics | PROC procedures, widely used in FDA submissions | Clinical trials, adverse event monitoring |
| Jupyter Notebook | Interactive research notebooks | Step-by-step analysis, visualizations | Exploratory health data analysis (EDA) |
| Excel | Quick stats, data tracking | Pivot tables, formulas | Patient logs, initial clinical data review |
| Tableau | Health dashboarding, visual storytelling | Drag-and-drop, easy data linkage | Hospital performance dashboards, COVID trends |
| Power BI | Real-time hospital analytics | Integration with EHRs and hospital databases | Resource usage, patient flow tracking |
| Apache Spark | Handling massive patient or genomic data | Distributed processing | Genomics data processing, population health |
| Google Colab | Collaborative research modeling | Cloud notebooks with GPU | Deep learning for medical imaging |
| KNIME | No-code pipeline for data prep & ML | Visual workflows, integrations | Disease prediction pipelines |
| RapidMiner | Automated modeling & analysis | Auto ML, no-code modeling | Predicting readmission risk |

| Tool | Type | Language Support | Key Features | Common Use Cases |
|------|------|------------------|--------------|------------------|
| Python | Programming Language | Python | Versatile, rich libraries (Pandas, Scikit-learn) | Machine learning, automation, web scraping |
| R | Programming Language | R | Strong in statistics and visualization | Statistical analysis, data visualization |
| SAS | Commercial Software | SAS proprietary | GUI + code, powerful analytics | Risk modeling, clinical data analysis |
| Jupyter Notebook | IDE / Notebook | Python, R, Julia | Interactive coding, visual output | Prototyping, reporting, teaching |
| Excel | Spreadsheet Tool | N/A | Easy data manipulation, formulas, pivot tables | Quick analysis, business reports |
| Tableau | BI Tool | Drag-and-drop GUI | Interactive dashboards, strong visuals | Business intelligence, storytelling |
| Power BI | BI Tool | DAX, Power Query | Microsoft ecosystem integration | Corporate dashboards, KPIs |
| Apache Spark | Big Data Framework | Scala, Python (PySpark) | Handles large datasets across clusters | Big Data analytics, real-time processing |
| Google Colab | Cloud Notebook | Python | Free GPU/TPU, no setup | Deep learning, shared projects |
| KNIME | GUI Workflow Tool | No-code, Python/R nodes | Visual pipelines for ML & ETL | Data prep, model deployment |
| RapidMiner | GUI Workflow Tool | Drag-and-drop + code | Automated machine learning | Predictive analytics, model testing |

Case Study: Lung Cancer Prediction using Logistic Regression

Goal:
Predict if a patient has lung cancer (Yes or No) based on features like age, smoking habit,
and shortness of breath.

# Step 1: Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Step 2: Create a sample dataset
data = {
    'Age': [45, 54, 65, 34, 72, 60, 40, 38, 50, 67],
    'Smoking': [1, 1, 1, 0, 1, 1, 0, 0, 1, 1],  # 1 = Yes, 0 = No
    'Shortness_of_Breath': [1, 1, 1, 0, 1, 1, 0, 0, 1, 1],  # 1 = Yes, 0 = No
    'Has_Lung_Cancer': [1, 1, 1, 0, 1, 1, 0, 0, 1, 1]  # 1 = Yes, 0 = No
}

df = pd.DataFrame(data)

# Step 3: Define features and target
X = df[['Age', 'Smoking', 'Shortness_of_Breath']]
y = df['Has_Lung_Cancer']

# Step 4: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 5: Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 6: Predict on test data
y_pred = model.predict(X_test)

# Step 7: Evaluate model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Output:
Accuracy: 1.0

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         2

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3

The input data contains 10 patients, but the output (accuracy and classification report) covers
only 3 test cases. This is expected:
• test_size=0.3 means 30% of the data is held out for testing.
• 30% of 10 patients = 3 patients for testing, leaving 7 for training.
• The other 7 patients were used to train the logistic regression model, so they are not
part of y_test or y_pred, which are used for evaluation.
With only 3 test cases, a perfect score says little about how the model would perform on real patients.

| Total Patients | Used for Training | Used for Testing |
|----------------|-------------------|------------------|
| 10 | 7 | 3 |

• Precision: Of all patients the model predicted to be in a class (cancer or no cancer), the
share that truly belong to that class.
• Recall: Of all patients who actually belong to a class, the share the model correctly
found.
• F1-score: A balance between precision and recall (useful when the classes are imbalanced).
• Support: The actual number of patients in each class in the test data.
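
A perfect score hides the differences between these metrics, so the sketch below uses scikit-learn on a small set of hypothetical predictions that contain mistakes. The y_true and y_pred lists are made up and are not the output of the case study model.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical results for 8 patients (1 = cancer, 0 = no cancer).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 4

# Metrics for the positive class (cancer).
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)

# Support: how many patients actually have cancer in this set.
support_positive = sum(y_true)

print(round(precision, 2), round(recall, 2), round(f1, 2), support_positive)
# 0.8 0.8 0.8 5
```

One missed cancer case (a false negative) and one false alarm (a false positive) each pull a different metric below 1.0, which is why the classification report lists precision and recall separately.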
