Introduction to Datasets

Dr. Faisal Anwer

Department of Computer Science
Aligarh Muslim University, Aligarh-202002
Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display.
Recap of Previous Lecture
• Introduction to Google Colab

• Features of Google Colab

• How to read datasets in Google Colab?

– Importing Kaggle Dataset into Colab
– Downloading and uploading Dataset to Colab
– Read directly through Google Drive

• How to read a GitHub project in Google Colab?

Contents
• Popular Dataset Repositories

• Discussion of the following datasets:

– Crop Recommendation Dataset
– Pima Indians Diabetes Dataset
– California Housing Prices Dataset
– MNIST Dataset
Working with Dataset
• A few places where you can get data

• Popular open data repositories:

—UC Irvine Machine Learning Repository
(https://archive.ics.uci.edu/)
—Kaggle datasets (https://www.kaggle.com/datasets)
— Specific Datasets

• Other pages listing many popular open data repositories:

—Wikipedia’s list of Machine Learning datasets
(https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research)
Crop Recommendation Dataset

– A dataset that allows users to build a predictive model to
recommend the most suitable crops to grow on a particular
farm, based on various parameters.

– This dataset was built by augmenting datasets of rainfall,
climate and fertilizer data available for India.

– Link to download dataset:

https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset
Features of Crop Recommendation Dataset

• The dataset consists of seven independent variables and one
target (dependent) variable, label.

• Total records: 2200

Features of Crop Recommendation Dataset

• Independent features:
▪ N - ratio of nitrogen content in the soil
▪ P - ratio of phosphorus content in the soil
▪ K - ratio of potassium content in the soil
▪ temperature - temperature in degrees Celsius
▪ humidity - relative humidity in %
▪ ph - pH value of the soil
▪ rainfall - rainfall in mm
• Dependent feature:
▪ label - the suitable crop
Few Records of Crop Recommendation Dataset

N    P   K   temperature  humidity  ph        rainfall  label
90   42  43  20.87974     82.00274  6.502985  202.9355  rice
85   58  41  21.77046     80.31964  7.038096  226.6555  rice
60   55  44  23.00446     82.32076  7.840207  263.9642  rice
43   79  79  19.40752     18.98031  7.806748  80.25065  chickpea
44   74  85  20.18649     19.6372   7.150681  78.2604   chickpea
83   45  21  18.83344     58.75082  5.716223  79.75329  maize
100  48  16  25.71896     67.22191  5.549902  74.51491  maize
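
The records above can be loaded into a table and split into independent features and the dependent feature. The following is a minimal sketch, assuming pandas is available (as it is in Google Colab); the names `SAMPLE_CSV`, `crop_df`, `X` and `y` are illustrative and do not come from the lecture. It uses only the seven sample records shown here, not the full 2200-record file.

```python
import io

import pandas as pd

# The seven sample records from the lecture, typed in as CSV text.
# The full dataset has 2200 records; this is only an illustration.
SAMPLE_CSV = """N,P,K,temperature,humidity,ph,rainfall,label
90,42,43,20.87974,82.00274,6.502985,202.9355,rice
85,58,41,21.77046,80.31964,7.038096,226.6555,rice
60,55,44,23.00446,82.32076,7.840207,263.9642,rice
43,79,79,19.40752,18.98031,7.806748,80.25065,chickpea
44,74,85,20.18649,19.6372,7.150681,78.2604,chickpea
83,45,21,18.83344,58.75082,5.716223,79.75329,maize
100,48,16,25.71896,67.22191,5.549902,74.51491,maize
"""

crop_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Seven independent features (X) and one dependent feature (y).
X = crop_df.drop(columns=["label"])
y = crop_df["label"]

print(X.shape)           # rows and columns of the feature table
print(y.value_counts())  # number of sample records per crop
```

With the downloaded file, the same split would apply after replacing the inline text with a call to `pd.read_csv` on the file's path in Colab.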

Correlation Between Features
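
A correlation table between features can be computed as sketched below. This is an assumption about how such a table might be produced with pandas, not necessarily the method used for the lecture's figure. It uses three of the sample records per feature column, so the resulting values are illustrative only; the variable names are hypothetical.

```python
import io

import pandas as pd

# A few sample records from the lecture (illustration only).
SAMPLE_CSV = """N,P,K,temperature,humidity,ph,rainfall,label
90,42,43,20.87974,82.00274,6.502985,202.9355,rice
85,58,41,21.77046,80.31964,7.038096,226.6555,rice
60,55,44,23.00446,82.32076,7.840207,263.9642,rice
43,79,79,19.40752,18.98031,7.806748,80.25065,chickpea
44,74,85,20.18649,19.6372,7.150681,78.2604,chickpea
83,45,21,18.83344,58.75082,5.716223,79.75329,maize
100,48,16,25.71896,67.22191,5.549902,74.51491,maize
"""

crop_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Keep only numeric columns; the text label cannot be correlated directly.
numeric_df = crop_df.select_dtypes(include="number")

# Pearson correlation between every pair of numeric features.
corr = numeric_df.corr()
print(corr.round(2))
```

Each entry lies between -1 and 1, the diagonal is 1, and the table is symmetric.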
Pima Indians Diabetes Dataset
• Dataset originally from the National Institute of Diabetes
and Digestive and Kidney Diseases.

• The objective of the dataset is to diagnostically predict
whether or not a patient has diabetes.

• Number of downloads: 463K

• Link to download dataset:

https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
Features of Pima Indians Diabetes Dataset

• The dataset consists of several medical predictor
(independent) variables and one target (dependent) variable,
Outcome.

• All patients in the dataset are females at least 21 years old.

• Total records: 768

Features of Pima Indians Diabetes Dataset

• Independent variables include:

– Pregnancies: Number of times pregnant
– Glucose: Plasma glucose concentration
– BloodPressure: Diastolic blood pressure (mm Hg)
– SkinThickness: Triceps skin fold thickness (mm)
– Insulin: 2-hour serum insulin (mu U/ml)
– BMI: Body mass index (weight in kg / (height in m)^2)
– DiabetesPedigreeFunction: Diabetes pedigree function
– Age: Age in years
• Dependent variable:
– Outcome: Class variable (0 or 1); 268 of the 768 records are 1,
the others are 0
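
The class counts stated above imply an imbalanced target. A short worked calculation, using only the numbers given in this lecture (the variable names are illustrative):

```python
# Counts stated in the lecture: 768 records, 268 with Outcome = 1.
total_records = 768
positive = 268
negative = total_records - positive  # records with Outcome = 0

positive_share = positive / total_records
negative_share = negative / total_records

print(f"Outcome = 1: {positive} ({positive_share:.1%})")
print(f"Outcome = 0: {negative} ({negative_share:.1%})")
```

So 500 records have Outcome = 0, roughly 65% of the data. A model that always predicts 0 would already be right about that often, which is a useful baseline to keep in mind when judging accuracy on this dataset.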
Few Records of Pima Indians Diabetes Dataset

Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI   DiabetesPedigreeFunction  Age  Outcome
6            148      72             35             0        33.6  0.627                     50   1
1            85       66             29             0        26.6  0.351                     31   0
8            183      64             0              0        23.3  0.672                     32   1
1            89       66             23             94       28.1  0.167                     21   0
0            137      40             35             168      43.1  2.288                     33   1
5            116      74             0              0        25.6  0.201                     30   0

Correlation Between Features
California Housing Prices Dataset
• The dataset contains median house prices for California
districts derived from the 1990 census.

• It serves as an excellent introduction to implementing
machine learning algorithms.

• The objective of the dataset is to predict median house value.

• Number of downloads: 136K

• Link to download dataset:

https://www.kaggle.com/datasets/camnugent/california-housing-prices
Features of California Housing Prices Dataset

• The dataset consists of nine independent variables and one
target (dependent) variable, medianHouseValue (median
house value).

• Total records: 20640

Features of California Housing Prices Dataset
• Independent variables include:
– longitude: A measure of how far west a house is; California
longitudes are negative, so a more negative value is farther west
– latitude: A measure of how far north a house is; a higher
value is farther north
– housingMedianAge: Median age of a house within a block;
a lower number is a newer building
– totalRooms: Total number of rooms within a block
– totalBedrooms: Total number of bedrooms within a block
Features of California Housing Prices Dataset
• Independent variables include (continued):
– population: Total number of people residing within a block
– households: Total number of households, a group of people
residing within a home unit, for a block
– medianIncome: Median income for households within a
block of houses (measured in tens of thousands of US
Dollars)
– oceanProximity: Location of the house with respect to the
ocean/sea
• Dependent variable:
– medianHouseValue: Median house value for households
within a block (measured in US Dollars)
Few Records of California Housing Prices Dataset

longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
-122.23    37.88     41                  880          129             322         126         8.3252         452600              NEAR BAY
-122.22    37.86     21                  7099         1106            2401        1138        8.3014         358500              NEAR BAY
-122.24    37.85     52                  1467         190             496         177         7.2574         352100              NEAR BAY
-118.18    34.63     19                  3562         606             1677        578         4.1573         228100              INLAND
-118.17    34.61     7                   2465         336             978         332         7.1381         292200              INLAND
-118.16    34.6      2                   11008        1549            4098        1367        6.4865         204400              INLAND
-117.11    32.58     21                  2894         685             2109        712         2.2755         125000              NEAR OCEAN
-117.1     32.58     27                  2616         591             1889        577         2.3824         127600              NEAR OCEAN
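
The units of these columns differ: the median income is stored in tens of thousands of US dollars, while the median house value is stored in US dollars. The sketch below, assuming pandas, loads four of the sample records, separates the features from the target, and converts the income to dollars. The names `housing_df`, `X`, `y` and `income_usd` are hypothetical helpers, not part of the dataset.

```python
import io

import pandas as pd

# Four of the sample records from the lecture (illustration only).
SAMPLE_CSV = """longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41,880,129,322,126,8.3252,452600,NEAR BAY
-118.18,34.63,19,3562,606,1677,578,4.1573,228100,INLAND
-118.16,34.6,2,11008,1549,4098,1367,6.4865,204400,INLAND
-117.11,32.58,21,2894,685,2109,712,2.2755,125000,NEAR OCEAN
"""

housing_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Nine independent variables (X) and the target (y).
X = housing_df.drop(columns=["median_house_value"])
y = housing_df["median_house_value"]

# median_income is in tens of thousands of US dollars; convert to dollars.
income_usd = housing_df["median_income"] * 10_000

print(X.shape)
print(income_usd.round(0).tolist())
print(housing_df["ocean_proximity"].value_counts())
```

For the first record, a stored income of 8.3252 corresponds to about 83,252 US dollars. Note that `ocean_proximity` is a text column, unlike the other eight features.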

Correlation Between Features
MNIST Dataset
• The MNIST dataset is a set of 70,000 small images of digits
handwritten by high school students and employees of the
US Census Bureau.
• Each image is labeled with the digit it represents.
• This set has been studied so much that it is often called the
“Hello World” of Machine Learning.
• The objective of the dataset is to predict which digit
(0 to 9) an image represents.
• Number of downloads: 127K
• Link to download dataset:
https://www.kaggle.com/datasets/oddrationale/mnist-in-csv
Features of MNIST Dataset

• Each image has 784 features because each image is
28×28 pixels.
• Each feature simply represents one pixel’s intensity, from 0
(white) to 255 (black).
• We consider the CSV format of the dataset.
• The dataset contains two files: one for the training set and one
for the test set.

• Total records: 70000 (Training: 60000 + Testing: 10000)
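
The relationship between the 784 features and the 28×28 image can be sketched with NumPy as follows. The pixel values here are made up purely to show the reshaping; they are not a real MNIST record, and the variable names are illustrative.

```python
import numpy as np

IMAGE_SIDE = 28
N_PIXELS = IMAGE_SIDE * IMAGE_SIDE  # 28 * 28 = 784 features per image

# A made-up flat row standing in for the 784 pixel columns of one record.
# Values are kept in the 0..255 intensity range.
flat_pixels = np.arange(N_PIXELS) % 256

# One record is a flat vector; reshape it to recover the 2-D image grid.
image = flat_pixels.reshape(IMAGE_SIDE, IMAGE_SIDE)

# Flattening the grid again gives back the original 784-value row.
restored = image.reshape(-1)

print(image.shape)     # (28, 28)
print(restored.shape)  # (784,)
```

The reshape assumes the pixels are stored row by row, which is NumPy's default ordering.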

Features of MNIST Dataset
• Independent variables include:
– 1x1
– 1x2
– 1x3
– 1x4
– 1x5
– 1x6
– …..
– …..
– 28x28
• Dependent variable:
– label: digit (from 0 to 9)
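
The pixel column names listed above follow a regular pattern and can be generated rather than typed out. The sketch below assumes the first number is the row and the second is the column of the pixel; that reading of the names is an assumption, and `pixel_position` is a hypothetical helper.

```python
IMAGE_SIDE = 28

# Generate the names 1x1, 1x2, ..., 1x28, 2x1, ..., 28x28.
pixel_columns = [
    f"{row}x{col}"
    for row in range(1, IMAGE_SIDE + 1)
    for col in range(1, IMAGE_SIDE + 1)
]


def pixel_position(name: str) -> tuple[int, int]:
    """Return the 1-based (row, column) pair encoded in a name like '3x17'."""
    row, col = name.split("x")
    return int(row), int(col)


print(len(pixel_columns))        # 784
print(pixel_columns[:3])         # ['1x1', '1x2', '1x3']
print(pixel_position("28x28"))   # (28, 28)
```

Together with the label column, this gives 785 columns per record in the CSV files.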
Records of MNIST Dataset

CATEGORY            0      1      2      3      4      5      6      7      8      9      TOTAL
#Training Samples   5,923  6,742  5,958  6,131  5,842  5,421  5,918  6,265  5,851  5,949  60,000
#Testing Samples    980    1,135  1,032  1,010  982    892    958    1,028  974    1,009  10,000
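
The per-digit counts in this table can be checked against the stated totals with a few lines of Python. The counts are copied from the table; the variable names are illustrative.

```python
# Per-digit sample counts for digits 0..9, copied from the table.
train_counts = [5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949]
test_counts = [980, 1135, 1032, 1010, 982, 892, 958, 1028, 974, 1009]

train_total = sum(train_counts)
test_total = sum(test_counts)
print(train_total, test_total)  # 60000 10000

# Share of each digit in the training set.
for digit, n_train in enumerate(train_counts):
    print(f"digit {digit}: {n_train} ({n_train / train_total:.1%})")

# Most and least frequent digits in the training set.
most_common = train_counts.index(max(train_counts))
least_common = train_counts.index(min(train_counts))
print(most_common, least_common)  # 1 5
```

The classes are close to balanced: each digit accounts for roughly 9% to 11% of the training samples, with digit 1 the most frequent and digit 5 the least.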
Sample Images of MNIST Dataset
Summary

• We looked at popular dataset repositories

• We discussed the following datasets:

– Crop Recommendation Dataset
– Pima Indians Diabetes Dataset
– California Housing Prices Dataset
– MNIST Dataset

