Introduction to Datasets

Dr. Faisal Anwer


Department of Computer Science
Aligarh Muslim University, Aligarh-202002
Copyright © Dept of Computer Science, AMU, Aligarh. Permission required for reproduction or display.
Recap of Previous Lecture
• Introduction to Google Colab

• Features of Google Colab

• How to read datasets in Google Colab?

– Importing a Kaggle dataset into Colab
– Downloading and uploading a dataset to Colab
– Reading directly from Google Drive

• How to read a GitHub project in Google Colab?


Contents
• Popular Dataset Repositories

• Discussion of the following datasets:


– Crop Recommendation Dataset
– Pima Indians Diabetes Dataset
– California Housing Prices Dataset
– MNIST Dataset
Working with Dataset
• A few places where you can get data

• Popular open data repositories:


— UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/)
— Kaggle datasets (https://www.kaggle.com/datasets)
— Specific Datasets

• Other pages listing many popular open data repositories:


— Wikipedia’s list of Machine Learning datasets
(https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research)
Crop Recommendation Dataset

– A dataset that allows users to build a predictive model to recommend the most suitable crops to grow on a particular farm based on various parameters.

– This dataset was built by augmenting datasets of rainfall, climate, and fertilizer data available for India.

– Link to download dataset:
https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset
Features of Crop Recommendation Dataset

• The dataset consists of seven independent variables and one target (dependent) variable, label.

• Total records: 2200


Features of Crop Recommendation Dataset

• Independent Features:
▪ N - ratio of Nitrogen content in soil
▪ P - ratio of Phosphorus content in soil
▪ K - ratio of Potassium content in soil
▪ temperature - temperature in degrees Celsius
▪ humidity - relative humidity in %
▪ ph - pH value of the soil
▪ rainfall - rainfall in mm
• Dependent Feature:
▪ label - most suitable crop
Few Records of Crop Recommendation Dataset

N     P    K    temperature  humidity  ph        rainfall  label
90    42   43   20.87974     82.00274  6.502985  202.9355  rice
85    58   41   21.77046     80.31964  7.038096  226.6555  rice
60    55   44   23.00446     82.32076  7.840207  263.9642  rice
43    79   79   19.40752     18.98031  7.806748  80.25065  chickpea
44    74   85   20.18649     19.6372   7.150681  78.2604   chickpea
83    45   21   18.83344     58.75082  5.716223  79.75329  maize
100   48   16   25.71896     67.22191  5.549902  74.51491  maize
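As a minimal sketch, the sample records above can be loaded into a pandas DataFrame and split into the seven independent features and the label target. The names SAMPLE_CSV, crop_df, X, and y are illustrative and not part of the lecture; in practice the full CSV downloaded from Kaggle would be read instead of this inline sample.

```python
import io

import pandas as pd

# Seven sample rows copied from the slide; the full dataset has 2200 records.
SAMPLE_CSV = """N,P,K,temperature,humidity,ph,rainfall,label
90,42,43,20.87974,82.00274,6.502985,202.9355,rice
85,58,41,21.77046,80.31964,7.038096,226.6555,rice
60,55,44,23.00446,82.32076,7.840207,263.9642,rice
43,79,79,19.40752,18.98031,7.806748,80.25065,chickpea
44,74,85,20.18649,19.6372,7.150681,78.2604,chickpea
83,45,21,18.83344,58.75082,5.716223,79.75329,maize
100,48,16,25.71896,67.22191,5.549902,74.51491,maize
"""

crop_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Independent features: every column except the target.
X = crop_df.drop(columns=["label"])
# Dependent feature: the crop name.
y = crop_df["label"]

print(X.shape)
print(y.unique())
```

Keeping the target in its own variable makes it harder to feed the label into a model as an input by accident.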


Correlation Between Features
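The correlation figure from this slide is not reproduced in the text. As a rough sketch, pairwise Pearson correlations between the numeric features can be computed with pandas DataFrame.corr. The rows below are only the seven sample records of the Crop Recommendation Dataset, so the resulting numbers are illustrative and will differ from those on the full 2200-record dataset; the variable names are assumptions.

```python
import pandas as pd

# Seven sample records only; not the full dataset.
crop_sample = pd.DataFrame({
    "N": [90, 85, 60, 43, 44, 83, 100],
    "P": [42, 58, 55, 79, 74, 45, 48],
    "K": [43, 41, 44, 79, 85, 21, 16],
    "temperature": [20.87974, 21.77046, 23.00446, 19.40752, 20.18649, 18.83344, 25.71896],
    "humidity": [82.00274, 80.31964, 82.32076, 18.98031, 19.6372, 58.75082, 67.22191],
    "ph": [6.502985, 7.038096, 7.840207, 7.806748, 7.150681, 5.716223, 5.549902],
    "rainfall": [202.9355, 226.6555, 263.9642, 80.25065, 78.2604, 79.75329, 74.51491],
    "label": ["rice", "rice", "rice", "chickpea", "chickpea", "maize", "maize"],
})

# Correlation is defined for numeric columns, so the text label is excluded.
corr = crop_sample.select_dtypes(include="number").corr()

print(corr.round(2))
```

Each entry lies between -1 and 1, and the diagonal is 1 because every feature is perfectly correlated with itself.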
Pima Indians Diabetes Dataset
• The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.

• The objective of the dataset is to diagnostically predict whether or not a patient has diabetes.

• Number of downloads: 463K

• Link to download dataset:
https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
Features of Pima Indians Diabetes Dataset

• The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome.

• All patients here are females at least 21 years old

• Total records: 768


Features of Pima Indians Diabetes Dataset

• Independent variables include:
– Pregnancies: Number of times pregnant
– Glucose: Glucose concentration
– BloodPressure: Diastolic blood pressure
– SkinThickness: Triceps skin fold thickness
– Insulin: 2-Hour serum insulin
– BMI: Body mass index
– DiabetesPedigreeFunction: Diabetes pedigree function
– Age
• Dependent variable:
– Outcome: Class variable (0 or 1); 268 of 768 are 1, the others are 0
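The class counts above imply an imbalanced target. A short worked calculation, using only the two numbers stated on the slide (the variable names are illustrative):

```python
total_records = 768
positive = 268  # Outcome == 1

# The remaining records have Outcome == 0.
negative = total_records - positive

# Fraction of records in the positive class.
positive_share = positive / total_records

print(f"negative: {negative}, positive share: {positive_share:.1%}")
# prints "negative: 500, positive share: 34.9%"
```

Roughly one record in three is positive, so a model that always predicts 0 would already be right on about 65% of the records; accuracy alone can therefore be misleading on this dataset.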
Few Records of Pima Indians Diabetes Dataset

Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI   DiabetesPedigreeFunction  Age  Outcome
6            148      72             35             0        33.6  0.627                     50   1
1            85       66             29             0        26.6  0.351                     31   0
8            183      64             0              0        23.3  0.672                     32   1
1            89       66             23             94       28.1  0.167                     21   0
0            137      40             35             168      43.1  2.288                     33   1
5            116      74             0              0        25.6  0.201                     30   0


Correlation Between Features
California Housing Prices Dataset
• The dataset contains median house prices for California
districts derived from the 1990 census.

• It serves as an excellent introduction to implementing machine learning algorithms.

• The objective of the dataset is to predict median house value.

• Number of downloads: 136K

• Link to download dataset:
https://www.kaggle.com/datasets/camnugent/california-housing-prices
Features of California Housing Prices Dataset

• The dataset consists of nine independent variables and one target (dependent) variable, medianHouseValue (Median House Value).

• Total records: 20640


Features of California Housing Prices Dataset
• Independent variables include:
– longitude: A measure of how far west a house is; California longitudes are negative, so a more negative value is farther west
– latitude: A measure of how far north a house is; a higher
value is farther north
– housingMedianAge: Median age of a house within a block;
a lower number is a newer building
– totalRooms: Total number of rooms within a block
– totalBedrooms: Total number of bedrooms within a block
Features of California Housing Prices Dataset
• Independent variables include (continued):
– population: Total number of people residing within a block
– households: Total number of households, a group of people
residing within a home unit, for a block
– medianIncome: Median income for households within a
block of houses (measured in tens of thousands of US
Dollars)
– oceanProximity: Location of the house relative to the ocean/sea
• Dependent variable:
– medianHouseValue: Median house value for households
within a block (measured in US Dollars)
Few Records of California Housing Prices Dataset

longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
-122.23    37.88     41                  880          129             322         126         8.3252         452600              NEAR BAY
-122.22    37.86     21                  7099         1106            2401        1138        8.3014         358500              NEAR BAY
-122.24    37.85     52                  1467         190             496         177         7.2574         352100              NEAR BAY
-118.18    34.63     19                  3562         606             1677        578         4.1573         228100              INLAND
-118.17    34.61     7                   2465         336             978         332         7.1381         292200              INLAND
-118.16    34.6      2                   11008        1549            4098        1367        6.4865         204400              INLAND
-117.11    32.58     21                  2894         685             2109        712         2.2755         125000              NEAR OCEAN
-117.1     32.58     27                  2616         591             1889        577         2.3824         127600              NEAR OCEAN
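Two details of these records are easy to miss: median_income is stored in tens of thousands of US dollars, while median_house_value is in plain dollars, and ocean_proximity is a text category rather than a number. The sketch below illustrates both on three rows taken from the table above; the column median_income_usd and the variable names are introduced here for illustration only.

```python
import pandas as pd

# Three rows from the sample table, restricted to the columns discussed here.
housing_sample = pd.DataFrame({
    "median_income": [8.3252, 4.1573, 2.2755],
    "median_house_value": [452600, 228100, 125000],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR OCEAN"],
})

# median_income is in tens of thousands of dollars; convert to dollars.
housing_sample["median_income_usd"] = housing_sample["median_income"] * 10_000

# ocean_proximity is categorical: list its distinct values.
categories = sorted(housing_sample["ocean_proximity"].unique())

print(housing_sample)
print(categories)
```

For the first row, a median_income of 8.3252 corresponds to about 83,252 US dollars.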


Correlation Between Features
MNIST Dataset
• The MNIST dataset is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau.
• Each image is labeled with the digit it represents.
• This set has been studied so much that it is often called the “Hello World” of Machine Learning.
• The objective of the dataset is to predict which digit (0 to 9) an image shows.
• Number of downloads: 127K
• Link to download dataset:
https://www.kaggle.com/datasets/oddrationale/mnist-in-csv
Features of MNIST Dataset

• Each image has 784 features. This is because each image is 28×28 pixels.
• Each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black).
• We consider the CSV format of the dataset.
• The dataset contains two files: one for the training set and one for the testing set.

• Total records: 70000 (Training: 60000 + Testing: 10000)
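To make the 784 = 28×28 relationship concrete, the sketch below reshapes one flat row of pixel values into a 28×28 image with NumPy. The row is synthetic (all zeros except one pixel), and row-major ordering of the pixels is an assumption about how the CSV was flattened.

```python
import numpy as np

IMAGE_SIDE = 28
NUM_FEATURES = IMAGE_SIDE * IMAGE_SIDE  # 784

# Synthetic stand-in for one CSV row of pixel intensities (0 to 255).
flat_image = np.zeros(NUM_FEATURES, dtype=np.uint8)
flat_image[0] = 255  # first feature, assumed to be the top-left pixel

# Reshape the 784 features back into a 28x28 grid (row-major order assumed).
image = flat_image.reshape(IMAGE_SIDE, IMAGE_SIDE)

print(image.shape)
print(image[0, 0])
```

Flattening the grid again with image.reshape(-1) returns the original 784-element row, so no information is lost in either direction.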


Features of MNIST Dataset
• Independent variables include:
– 1x1
– 1x2
– 1x3
– 1x4
– 1x5
– 1x6
– …..
– …..
– 28x28
• Dependent variable:
– label: digit (from 0 to 9)
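The pixel column names follow a row x column pattern from 1x1 to 28x28. As a small sketch, the full list of names can be generated and each name mapped to its position in the flat 784-feature vector. The helper flat_index is hypothetical, and both the row-major ordering and the placement of label as the first column are assumptions about the CSV layout.

```python
IMAGE_SIDE = 28

# Generate "1x1", "1x2", ..., "28x28" in row-major order.
pixel_columns = [
    f"{row}x{col}"
    for row in range(1, IMAGE_SIDE + 1)
    for col in range(1, IMAGE_SIDE + 1)
]

# Assumed CSV header: the target first, then the 784 pixel columns.
columns = ["label"] + pixel_columns


def flat_index(name: str) -> int:
    """Map a column name such as '2x1' to its 0-based flat position."""
    row, col = (int(part) for part in name.split("x"))
    return (row - 1) * IMAGE_SIDE + (col - 1)


print(len(pixel_columns))
print(flat_index("1x1"), flat_index("2x1"), flat_index("28x28"))
```

Under these assumptions, 1x1 maps to position 0, 2x1 to position 28 (the start of the second row), and 28x28 to position 783.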
Records of MNIST Dataset

CATEGORY            0      1      2      3      4      5      6      7      8      9      TOTAL
#Training Samples   5,923  6,742  5,958  6,131  5,842  5,421  5,918  6,265  5,851  5,949  60,000
#Testing Samples    980    1,135  1,032  1,010  982    892    958    1,028  974    1,009  10,000
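The per-digit counts in the table can be checked against the stated totals with a few lines of Python. The list and variable names are illustrative; the numbers are copied from the table.

```python
# Per-digit sample counts for digits 0 through 9, copied from the table.
train_counts = [5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949]
test_counts = [980, 1135, 1032, 1010, 982, 892, 958, 1028, 974, 1009]

train_total = sum(train_counts)
test_total = sum(test_counts)

# Digits with the most and fewest training samples.
most_common_digit = max(range(10), key=lambda digit: train_counts[digit])
least_common_digit = min(range(10), key=lambda digit: train_counts[digit])

print(train_total, test_total, train_total + test_total)
# prints "60000 10000 70000"
print(most_common_digit, least_common_digit)
# prints "1 5"
```

The classes are close to balanced but not identical: digit 1 has the most training samples (6,742) and digit 5 the fewest (5,421).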
Sample Images of MNIST Dataset
Summary

• We looked at popular dataset repositories

• We discussed the following datasets:


– Crop Recommendation Dataset
– Pima Indians Diabetes Dataset
– California Housing Prices Dataset
– MNIST Dataset
