Pre-Installed (Toy) Sklearn Datasets
1. Iris
This dataset includes measurements of the sepal length, sepal width,
petal length and petal width of 150 iris flowers, which belong to 3
different species: setosa, versicolor and virginica. The iris dataset has
150 rows and 5 columns: four measurements plus the species of each
flower. In sklearn, the four measurements are returned as a 150 x 4
array (data) and the species as a separate array of labels (target).
The variables include:
Sepal.Length - The length of the sepal in centimetres.
Sepal.Width - The width of the sepal in centimetres.
Petal.Length - The length of the petal in centimetres.
Petal.Width - The width of the petal in centimetres.
Species - The species of the iris flower, with three possible
values: setosa, versicolor and virginica.
You can load the iris dataset directly from sklearn using
the load_iris function from the sklearn.datasets module.
# To install sklearn, run this in a terminal (not in Python):
#   pip install scikit-learn
# To import sklearn
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
# Print the dataset description
print(iris.DESCR)
Code for loading the Iris dataset using sklearn. Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html on 27/3/2023.
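As a quick check of the description above, the following sketch inspects the object that load_iris returns. The variable names (iris_shape, iris_features, iris_species) are illustrative only.

```python
from sklearn.datasets import load_iris

# load_iris() returns a Bunch: a dict-like object whose keys are also attributes
iris = load_iris()

iris_shape = iris.data.shape                               # (n_samples, n_features)
iris_features = list(iris.feature_names)                   # names of the four measurements
iris_species = [str(name) for name in iris.target_names]   # class labels

print(iris_shape)
print(iris_features)
print(iris_species)
```

The species are stored as integer codes in iris.target; iris.target_names maps each code back to its name.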
2. Diabetes
This sklearn dataset contains information on 442 patients with
diabetes, including demographic and clinical measurements:
Age
Sex
Body mass index (BMI)
Average blood pressure
Six blood serum measurements (e.g. total cholesterol, low-
density lipoprotein (LDL) cholesterol, high-density lipoprotein
(HDL) cholesterol).
A quantitative measure of diabetes disease progression one year after baseline (the target variable).
The Diabetes dataset can be loaded using
the load_diabetes() function from the sklearn.datasets module.
from sklearn.datasets import load_diabetes
# Load the diabetes dataset
diabetes = load_diabetes()
# Print some information about the dataset
print(diabetes.DESCR)
Code for loading the Diabetes dataset using sklearn. Retrieved
from https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset on 28/3/2023.
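If you only need the arrays, a minimal sketch like the one below uses the return_X_y option to skip the Bunch object. The variable names are illustrative.

```python
from sklearn.datasets import load_diabetes

# return_X_y=True returns (features, target) directly instead of a Bunch
X_diabetes, y_diabetes = load_diabetes(return_X_y=True)

# The Bunch still carries the column names for the ten features
diabetes_feature_names = list(load_diabetes().feature_names)

print(X_diabetes.shape, y_diabetes.shape)
print(diabetes_feature_names)
```

The target y_diabetes is a continuous value, so this dataset suits regression rather than classification.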
3. Digits
This sklearn dataset is a collection of hand-written digits from 0 to 9,
stored as grayscale images. It contains a total of 1797 samples, and
each sample is a 2D array of shape (8,8). There are 64 variables (or
features) in the digits sklearn dataset, corresponding to the 64 pixels
in each digit image.
The Digits dataset can be loaded using
the load_digits() function from the sklearn.datasets module.
from sklearn.datasets import load_digits
# Load the digits dataset
digits = load_digits()
# Print the features and target data
print(digits.data)
print(digits.target)
Code for loading the Digits dataset using sklearn. Retrieved
from https://scikit-learn.org/stable/datasets/toy_dataset.html#optical-recognition-of-handwritten-digits-dataset on 29/3/2023.
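The relationship between the 64 features and the 8x8 images can be sketched as follows: each row of digits.data is a flattened copy of the matching entry in digits.images. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of .data holds the 64 pixel values of one image
first_flat = digits.data[0]

# Reshaping the row to 8x8 recovers the image stored in .images
first_image = first_flat.reshape(8, 8)
same_as_images = np.array_equal(first_image, digits.images[0])

first_label = int(digits.target[0])
print(digits.data.shape, digits.images.shape, same_as_images, first_label)
```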
4. Linnerud
The Linnerud dataset contains exercise and physiological
measurements of 20 middle-aged men at a fitness club.
The dataset includes the following variables:
Three physical exercise variables - chin-ups, sit-ups, and jumps.
Three physiological measurement variables - weight, waist, and
pulse.
To load the Linnerud dataset in Python using sklearn:
from sklearn.datasets import load_linnerud
linnerud = load_linnerud()
Code for loading the Linnerud dataset using sklearn. Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud on 27/3/2023.
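The loader above only creates the object. A short sketch like this one shows which variables are treated as features and which as targets. The names exercise_names and physio_names are illustrative.

```python
from sklearn.datasets import load_linnerud

linnerud = load_linnerud()

# The exercise variables are the features (data) ...
exercise_names = list(linnerud.feature_names)
# ... and the physiological variables are the targets
physio_names = list(linnerud.target_names)

print(exercise_names, linnerud.data.shape)
print(physio_names, linnerud.target.shape)
```

Because the target has three columns, this dataset is a natural fit for multi-output regression.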
5. Wine
This sklearn dataset contains the results of chemical analyses of wines
grown in a specific area of Italy, to classify the wines into their correct
varieties.
Some of the variables in the dataset:
Alcohol
Malic acid
Ash
Alkalinity of ash
Magnesium
Total phenols
Flavanoids
The Wine dataset can be loaded using the load_wine() function
from the sklearn.datasets module.
from sklearn.datasets import load_wine
# Load the Wine dataset
wine_data = load_wine()
# Access the features and targets of the dataset
X = wine_data.data # Features
y = wine_data.target # Targets
# Access the feature names and target names of the dataset
feature_names = wine_data.feature_names
target_names = wine_data.target_names
Code for loading the Wine dataset using sklearn. Retrieved
from https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-recognition-dataset on 28/3/2023.
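To see how many samples and varieties the Wine dataset holds, a minimal sketch is shown below. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

wine_shape = wine.data.shape                           # (n_samples, n_features)
wine_class_counts = np.bincount(wine.target).tolist()  # samples per wine variety

print(wine_shape)
print(dict(zip(wine.target_names, wine_class_counts)))
```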
6. Breast Cancer Wisconsin Dataset
This sklearn dataset consists of information about breast cancer
tumours and was initially created by Dr. William H. Wolberg. The
dataset was created to assist researchers and machine learning
practitioners in classifying tumours as either malignant (cancerous) or
benign (non-cancerous).
Some of the variables in the original dataset (sklearn's copy drops the
ID number and returns the diagnosis as the target):
ID number
Diagnosis (M = malignant, B = benign).
Radius (the mean of distances from the centre to points on the
perimeter).
Texture (the standard deviation of gray-scale values).
Perimeter
Area
Smoothness (the local variation in radius lengths).
Compactness (the perimeter^2 / area - 1.0).
Concavity (the severity of concave portions of the contour).
Concave points (the number of concave portions of the contour).
Symmetry
Fractal dimension ("coastline approximation" - 1).
You can load the Breast Cancer Wisconsin dataset directly
from sklearn using the load_breast_cancer function from the
sklearn.datasets module.
from sklearn.datasets import load_breast_cancer
# Load the Breast Cancer Wisconsin dataset
cancer = load_breast_cancer()
# Print the dataset description
print(cancer.DESCR)
Code for loading the Breast Cancer Wisconsin dataset using sklearn.
Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html on 28/3/2023.
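The sketch below checks how the diagnosis is encoded in sklearn's copy of the dataset and how many tumours fall into each class. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

cancer_shape = cancer.data.shape                             # (n_samples, n_features)
cancer_classes = [str(name) for name in cancer.target_names] # label for code 0, then code 1
cancer_class_counts = np.bincount(cancer.target).tolist()    # samples per class

print(cancer_shape)
print(dict(zip(cancer_classes, cancer_class_counts)))
```

Note that code 0 means malignant and code 1 means benign, which is easy to get backwards when computing metrics such as recall.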
Real World Sklearn Datasets
Real world sklearn datasets are based on real-world problems,
commonly used to practice and experiment with machine learning
algorithms and techniques using the sklearn library in Python.
7. Boston Housing
The Boston Housing dataset consists of information on housing in the
area of Boston, Massachusetts. It has 506 rows and 14 columns
of data.
Some of the variables in the dataset include:
CRIM - Per capita crime rate by town.
ZN - The proportion of residential land zoned for lots over 25,000
sq.ft.
INDUS - The proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (= 1 if tract bounds river; 0
otherwise).
NOX - The nitric oxide concentration (parts per 10 million).
RM - The average number of rooms per dwelling.
AGE - The proportion of owner-occupied units built prior to 1940.
DIS - The weighted distances to five Boston employment centres.
RAD - The Index of accessibility to radial highways.
TAX - The full-value property-tax rate per $10,000.
PTRATIO - The pupil-teacher ratio by town.
B - 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by
town.
LSTAT - The percentage of the population of lower status.
MEDV - The median value of owner-occupied homes in $1000's.
In scikit-learn versions before 1.2, you can load the Boston Housing
dataset using the load_boston function from the sklearn.datasets
module. The function was deprecated in version 1.0 and removed in
version 1.2, so the import below fails on current releases.
from sklearn.datasets import load_boston
# Load the Boston Housing dataset (requires scikit-learn < 1.2)
boston = load_boston()
# Print the dataset description
print(boston.DESCR)
Code for loading the Boston Housing dataset using sklearn. Retrieved
from https://scikit-learn.org/0.15/modules/generated/sklearn.datasets.load_boston.html on 29/3/2023.
8. Olivetti Faces
The Olivetti Faces dataset is a collection of grayscale images of human
faces taken between April 1992 and April 1994 at AT&T Laboratories. It
contains 400 images of 40 individuals, with each individual having 10
images shot at different angles and under different lighting conditions.
You can load the Olivetti Faces dataset in sklearn by using
the fetch_olivetti_faces function from the datasets module.
from sklearn.datasets import fetch_olivetti_faces
# Load the dataset
faces = fetch_olivetti_faces()
# Get the data and target labels
X = faces.data
y = faces.target
Code for loading the Olivetti Faces dataset using sklearn. Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_olivetti_faces.html on 29/3/2023.
9. California Housing
This sklearn dataset contains information on median house values, as
well as attributes for census block groups in California. It includes 20,640
instances and 8 features.
Some of the variables in the dataset:
MedInc - The median income in block.
HouseAge - The median age of houses in block.
AveRooms - The average number of rooms per household.
AveBedrms - The average number of bedrooms per household.
Population - The block population.
AveOccup - The average household occupancy.
Latitude - The latitude of the block in decimal degrees.
Longitude - The longitude of the block in decimal degrees.
You can load the California Housing dataset using
the fetch_california_housing function from sklearn.
from sklearn.datasets import fetch_california_housing
# Load the dataset
california_housing = fetch_california_housing()
# Get the features and target variable
X = california_housing.data
y = california_housing.target
Code for loading the California Housing dataset using sklearn.
Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html on 29/3/2023.
10. MNIST
The MNIST dataset is popular and widely used in the fields of machine
learning and computer vision. It consists of 70,000 grayscale images of
handwritten digits 0–9, with 60,000 images for training and 10,000 for
testing. Each image is 28x28 pixels in size and has a corresponding
label denoting which digit it represents.
You can load the MNIST dataset from sklearn using the
following code:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
Note: MNIST is a separate and much larger dataset than the 8x8 Digits
dataset above; it is not a subset of it.
Code for loading the MNIST dataset using sklearn. Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml on 30/3/2023.
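fetch_openml returns all 70,000 images as a single table, so the 60,000/10,000 split has to be made by hand. The sketch below assumes the rows are ordered with the training images first, a convention worth verifying for the copy you download. The helper split_train_test is hypothetical, and it is demonstrated on a small synthetic array so that it runs without downloading anything.

```python
import numpy as np

def split_train_test(X, y, n_train):
    """Split arrays into the first n_train rows and the remaining rows."""
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Small synthetic stand-in for the MNIST arrays (70 rows instead of 70,000)
X_demo = np.arange(70 * 4).reshape(70, 4)
y_demo = np.arange(70) % 10

X_train, X_test, y_train, y_test = split_train_test(X_demo, y_demo, n_train=60)
print(X_train.shape, X_test.shape)
```

With the real data you would call split_train_test(X, y, n_train=60000) on the arrays taken from the fetch_openml result.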
11. Fashion-MNIST
The Fashion MNIST dataset was created by Zalando Research as a
replacement for the original MNIST dataset. The Fashion MNIST dataset
consists of 70,000 grayscale images (a training set of 60,000 and a test
set of 10,000) of clothing items.
The images are 28x28 pixels in size and represent 10 different classes
of clothing items, including T-shirts/tops, trousers, pullovers, dresses,
coats, sandals, shirts, sneakers, bags, and ankle boots. It is similar to
the original MNIST dataset, but with more challenging classification
tasks due to the greater complexity and variety of the clothing items.
You can load this sklearn dataset using the fetch_openml
function.
from sklearn.datasets import fetch_openml
fmnist = fetch_openml(name='Fashion-MNIST')
Code for loading the Fashion MNIST dataset using sklearn. Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml on 30/3/2023.
Generated Sklearn Datasets
Generated sklearn datasets are synthetic datasets, generated using
the sklearn library in Python. They are used for testing, benchmarking
and developing machine learning algorithms/models.
12. make_classification
This function generates a random n-class classification dataset with a
specified number of samples, features, and informative features.
Here's an example code to generate this sklearn dataset with
100 samples, 5 features, and 3 classes:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=5,
n_informative=3, n_classes=3, random_state=42)
This code generates a dataset with 100 samples and 5 features, with 3
classes and 3 informative features. The remaining features will be
redundant or noise.
Code for generating a dataset with make_classification using sklearn.
Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification on
30/3/2023.
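A short sketch of what the call above produces: the shapes of X and y and how the samples are spread over the three classes. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification

X_clf, y_clf = make_classification(
    n_samples=100,
    n_features=5,
    n_informative=3,   # features that actually carry class information
    n_classes=3,
    random_state=42,   # fixed seed so the dataset is reproducible
)

# Number of samples per class label (0, 1, 2)
clf_class_counts = np.bincount(y_clf).tolist()
print(X_clf.shape, y_clf.shape, clf_class_counts)
```

With the default n_redundant=2, the two non-informative features here are linear combinations of the informative ones rather than pure noise.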
13. make_regression
This function generates a random regression dataset with a specified
number of samples, features, and noise.
Here's an example code to generate this sklearn dataset with
100 samples, 5 features, and noise level of 0.1:
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=5,
noise=0.1, random_state=42)
This code generates a dataset with 100 samples and 5 features, with a
noise level of 0.1. The target variable y will be a continuous variable.
Code for generating a dataset with make_regression using sklearn. Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html#sklearn.datasets.make_regression on
30/3/2023.
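One reason to use a generated regression dataset is that the true coefficients are known. The sketch below asks make_regression for them with coef=True and checks that an ordinary linear regression recovers them. This is an illustration, and the variable names are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# coef=True additionally returns the coefficients of the underlying linear model
X_reg, y_reg, true_coef = make_regression(
    n_samples=100, n_features=5, noise=0.1, coef=True, random_state=42
)

# Fit a plain linear model and compare its coefficients with the true ones
model = LinearRegression().fit(X_reg, y_reg)
max_coef_error = float(np.max(np.abs(model.coef_ - true_coef)))
print(max_coef_error)
```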
14. make_blobs
This function generates a random dataset with a specified number of
samples and clusters.
Here's an example code to generate this sklearn dataset with
100 samples and 3 clusters:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=3,
random_state=42)
This code generates a dataset with 100 samples and 2 features (x and
y coordinates), with 3 clusters centred at random locations. Each
cluster is an isotropic Gaussian blob with the default standard
deviation of 1.0 (the cluster_std parameter).
Code for generating a dataset with make_blobs using sklearn. Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs on 30/3/2023.
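The spread of each cluster is controlled by cluster_std. As a sketch, the call below tightens the blobs and also asks for the generated centres with return_centers=True (available in recent scikit-learn releases), then compares each centre with the mean of its points. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs

X_blobs, y_blobs, blob_centers = make_blobs(
    n_samples=100,
    centers=3,
    cluster_std=0.5,       # smaller value gives tighter clusters
    return_centers=True,   # also return the coordinates of the cluster centres
    random_state=42,
)

# The mean of each cluster's points should sit close to its generated centre
cluster_means = np.array([X_blobs[y_blobs == k].mean(axis=0) for k in range(3)])
max_center_offset = float(np.max(np.abs(cluster_means - blob_centers)))
print(blob_centers)
print(max_center_offset)
```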
15. make_moons and make_circles
These functions generate datasets with non-linear boundaries that are
useful for testing non-linear classification algorithms.
Here's an example code for loading the make_moons dataset:
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
This code generates a dataset with 1000 samples and 2 features (x
and y coordinates) with a non-linear boundary between the two
classes, and with Gaussian noise of standard deviation 0.2 added to
the data.
Code for generating a dataset with make_moons using sklearn. Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html#sklearn.datasets.make_moons on 30/3/2023.
Here's an example code to generate and load the make_circles
dataset:
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000, noise=0.05,
random_state=42)
Code for generating a dataset with make_circles using sklearn. Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html#sklearn.datasets.make_circles on 30/3/2023.
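To see the geometry behind make_circles, the sketch below generates the data without noise and measures the radius of each class. The factor parameter sets the size of the inner circle relative to the outer one. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles

# No noise, so every point lies exactly on one of the two circles
X_circ, y_circ = make_circles(n_samples=200, factor=0.5, random_state=42)

# Distance of each point from the origin
radii = np.linalg.norm(X_circ, axis=1)
outer_radius = float(radii[y_circ == 0].mean())  # label 0 is the outer circle
inner_radius = float(radii[y_circ == 1].mean())  # label 1 is the inner circle

print(outer_radius, inner_radius)
```

No straight line can separate the two classes, which is why this dataset is a common test for kernel methods and other non-linear classifiers.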
16. make_sparse_coded_signal
This function generates a sparse coded signal dataset that is useful for
testing compressive sensing algorithms.
Here's example code for generating this sklearn dataset:
from sklearn.datasets import make_sparse_coded_signal
# Returns the encoded signal, the dictionary of atoms, and the sparse code
data, dictionary, code = make_sparse_coded_signal(n_samples=100,
n_components=10, n_features=50, n_nonzero_coefs=3,
random_state=42)
This code generates a sparse coded signal dataset with 100 samples,
50 features, and 10 atoms, where each sample combines 3 of the atoms.
Code for generating a dataset with make_sparse_coded_signal using sklearn.
Retrieved
from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_sparse_coded_signal.html#sklearn-datasets-make-sparse-coded-signal on 30/3/2023.
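The three returned arrays are related: the signal is the product of the sparse code and the dictionary. The sketch below verifies this. Because the orientation of the returned arrays has changed between scikit-learn releases, the sketch checks the shape before multiplying. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_sparse_coded_signal

data, dictionary, code = make_sparse_coded_signal(
    n_samples=100, n_components=10, n_features=50,
    n_nonzero_coefs=3, random_state=42,
)

# Each sample uses exactly n_nonzero_coefs atoms, so 100 * 3 entries are non-zero
nonzero_total = int(np.count_nonzero(code))

# Rebuild the signal from code and dictionary, handling both array orientations
if data.shape == (100, 50):
    reconstructed = code @ dictionary   # samples stored as rows
else:
    reconstructed = dictionary @ code   # samples stored as columns

print(nonzero_total, np.allclose(reconstructed, data))
```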
Common Use Cases for Sklearn Datasets
Pre-Installed (Toy) Sklearn Datasets
Iris - This sklearn dataset is commonly used for classification tasks and
is used as a benchmark dataset for testing classification algorithms.
Diabetes - This dataset contains medical information about patients
with diabetes and is used for regression tasks in healthcare
analytics.
Digits - This sklearn dataset contains images of handwritten digits and
is commonly used for image classification and pattern recognition
tasks.
Linnerud - This dataset contains exercise and physiological data of
20 middle-aged men and is commonly used for multivariate regression analysis.
Wine - This sklearn dataset contains chemical analysis of wines and is
commonly used for classification and clustering tasks.
Breast Cancer Wisconsin - This dataset contains medical information
about breast cancer patients and is commonly used for classification
tasks in healthcare analytics.
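As an illustration of the benchmark use case, the sketch below trains a classifier on Iris and reports its accuracy on held-out data. The choice of k-nearest neighbours, the 75/25 split and the random seed are assumptions made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Hold out 25% of the flowers for testing, keeping class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.25, stratify=y_iris, random_state=0
)

# Fit a simple k-nearest-neighbours classifier and score it on unseen data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
iris_accuracy = knn.score(X_test, y_test)

print(f"Test accuracy: {iris_accuracy:.2f}")
```

The same pattern works for Wine, Digits and Breast Cancer Wisconsin by swapping the loader function.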
Real World Sklearn Datasets
Boston Housing - This sklearn dataset contains information about
housing in Boston and is commonly used for regression tasks.
Olivetti Faces - This dataset contains grayscale images of faces and is
commonly used for image classification and facial recognition tasks.
California Housing - This sklearn dataset contains information about
housing in California and is commonly used for regression tasks.
MNIST - This dataset contains images of handwritten digits and is
commonly used for image classification and pattern recognition tasks.
Fashion-MNIST - This sklearn dataset contains images of clothing items
and is commonly used for image classification and pattern recognition
tasks.
Generated Sklearn Datasets
make_classification - This dataset is a randomly generated dataset for
binary and multiclass classification tasks.
make_regression - This dataset is a randomly generated dataset for
regression tasks.
make_blobs - This sklearn dataset is a randomly generated dataset for
clustering tasks.
make_moons and make_circles - These datasets are randomly
generated datasets for classification tasks and are commonly used for
testing nonlinear classifiers.
make_sparse_coded_signal - This dataset is a randomly generated
dataset for sparse coding tasks in signal processing.
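For the clustering use case, a generated dataset has the advantage that the true cluster labels are known. The sketch below runs k-means on make_blobs data and compares the result with the true labels using the adjusted Rand index. The choice of KMeans and its settings is an assumption made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Generate three blobs together with the true cluster membership of each point
X_blobs, true_labels = make_blobs(n_samples=100, centers=3, random_state=42)

# Cluster the points without using the true labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
predicted_labels = kmeans.fit_predict(X_blobs)

# Adjusted Rand index: 1.0 means the clustering matches the true grouping
ari = adjusted_rand_score(true_labels, predicted_labels)
print(f"Adjusted Rand index: {ari:.2f}")
```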
Final Thoughts
Sklearn datasets provide a convenient way for developers and
researchers to test and evaluate machine learning models without
having to manually collect and preprocess data.
They are also available for anyone to download and use freely.
The lead image of this article was generated via HackerNoon's AI
Stable Diffusion model using the prompt 'iris dataset'.
About Author
SHIVENDRA SINGH RAJPUT