A Practical Guide to Introduce Exploratory Data Analysis with Python
Nov 2024
This document has been prepared within the framework of
the Aporta Initiative (datos.gob.es), developed by the
Ministry for Digital Transformation and Public Administration
through the Public Business Entity Red.es, in collaboration
with the Data General Directorate.
INTRODUCTION
This document constitutes an adaptation to the Python programming language of A Practical Guide to
Exploratory Data Analysis with R (Introduction) published by the Aporta Initiative in 2021. The dataset
and the basic structure of the referenced analysis are maintained with the intention of facilitating the
comparison between both programming languages, allowing users to identify syntactic and
implementation differences, and providing a valuable resource for those working in environments where
both languages are used. Additionally, this approach allows focusing on the particularities of each
language without the differences in data or the structure of the analysis adding unnecessary complexity
to the learning process. As an added value, this new version incorporates sections dedicated to emerging
trends in the field, such as automated exploratory analysis and data profiling tools, thus responding to the
current needs of the sector in terms of efficiency and scalability.
Exploratory Data Analysis (EDA) represents a critical step prior to any statistical analysis. Its relevance
derives both from the need to thoroughly understand the data before analyzing it and from verifying the
fulfillment of statistical requirements that will ensure the validity of subsequent analyses.
The statistical techniques that make up EDA allow for unraveling the intrinsic nature of the data,
characterizing its main attributes, and discovering the interrelationships between variables. This
systematic process lays the foundation for a deep understanding of the dataset and underpins the
robustness of subsequent analyses. The initial exploration reveals crucial aspects such as possible data
entry errors, patterns of missing values, significant correlations between variables, and informational
redundancies that could affect the quality of the analysis.
Paradoxically, despite its fundamental role in ensuring consistent and accurate results, the exploratory
phase is often minimized in data reuse processes. This guide addresses this issue by presenting both
traditional methodologies and modern approaches to EDA, including automated tools that facilitate the
systematic exploration of large datasets.
The debate on the precise delineation of the processes that make up exploratory analysis remains current
in the scientific community. While some experts consider data cleaning as an independent preliminary
phase, the intricate interrelation between exploration and cleaning, along with their dependence on the
specific context of the data, suggests the convenience of an integrated approach. This introductory guide
details a series of tasks that constitute the minimum set to be addressed to ensure an acceptable starting
point for effective data reuse.
This guide is aimed both at those who are new to data analysis and those who seek to systematize the
processes of EDA. Using the Python1 programming language, key concepts are illustrated, and concrete
examples are provided to facilitate the understanding and application of the techniques learned.
1 To achieve the maximum understanding of the scope of this guide, it is recommended to have basic competencies in the Python
language (Aporta Initiative resources and exercises in Python), which is chosen to illustrate, through examples, the different stages
involved in EDA. If not, we still encourage you to continue reading this guide, as it includes an interesting bibliography that, in addition to
helping you understand EDA, will allow you to learn and make the most of this powerful programming language.
1. METHODOLOGY
This guide takes a practical approach based on experiential learning, allowing users to become familiar
with analysis techniques using real public data and open-source technological tools at no cost. All
developed materials, including data and source code, are made publicly available to facilitate both the
replication of the analysis and its adaptation to other study contexts.
As a tool for the practical case, the Python programming language and the Jupyter Notebook
development environment in Google Colab have been used. You can follow the guide by running the cells
of the notebook published in the official repository, ensuring the reproducibility of the results.
The selection of Python as the programming language is due to its prominent position in the field of data
analysis, where it stands out for combining an intuitive syntax with advanced analytical capabilities.
Although the guide includes the necessary code snippets to perform each task, the emphasis is placed on
the conceptual understanding of the processes and the explanation of the key functionalities that Python
offers for exploratory analysis. This approach prioritizes expository and didactic clarity over code
optimization, thus facilitating understanding for different levels of technical experience.
Maintaining consistency with the previous guide, the same dataset has been chosen, specifically, the
Castilla and León Air Quality Records, available in the datos.gob.es portal. As already explained in the
version implemented in R, the suitability of this dataset lies both in its social relevance and its technical
characteristics, making it an ideal case study to illustrate different exploratory analysis techniques.
The exploratory analysis is structured in the following stages:
1. Perform a descriptive analysis of the variables to obtain a representative idea of the dataset.
2. Adjust the data types of the variables so that each one correctly reflects the nature of the
information it contains.
3. Detect and handle missing data to properly process numerical variables. Missing data are values
not recorded in some observations, and it is essential to manage them correctly to avoid biases
and issues in the analysis.
4. Identify and treat outliers to prevent them from distorting future statistical analyses.
5. Conduct a numerical and graphical examination of the relationships between the analyzed
variables to determine the degree of correlation between them, allowing the prediction of one
variable’s behavior based on the others.
The following chart (Figure 1) schematically represents the set of stages of exploratory data
analysis described in the contents of this guide.
Let’s take a detailed look at each of the proposed stages for conducting exploratory data analysis. Each
chapter includes the section ‘Experiment,’ which, through the practical application of various functions in
Python, will help you understand the concepts explained.
Once the dataset on air quality records in Castilla and León has been obtained from the open data catalog
(it can be downloaded directly by accessing this link) we will proceed to perform an initial
characterization of the dataset to understand its structure and content. To do this, we will combine two
complementary approaches: on one hand, the application of descriptive statistics techniques that will
provide us with a quantitative view of the variables and their characteristics; on the other hand, the
generation of visualizations that will help us intuitively understand the distribution patterns present in the
data. This initial recognition phase lays the foundation for any deeper and more targeted analysis.
2.1.1. EXPERIMENT
For our analysis, we will use the aforementioned dataset, which we will assign to the
object calidad_aire in our code. This dataset will serve as the basis for implementing and demonstrating
all the exploratory analysis techniques presented in the guide.
The initial exploration will be carried out using various Python functions specifically designed to provide a
global perspective of the structure and content of the data. These same functions will be used repeatedly
throughout the analysis to monitor how different transformations and processes affect the characteristics
of the dataset.
For the initial acquisition of the dataset, automation has been prioritized using the read_csv() function,
which retrieves the information directly from the open data catalog. While this method may slightly
increase the initial execution time, it ensures the reproducibility of the process and simplifies the
workflow. Alternatively, if greater control over the process is preferred or faster execution is needed,
there is the option to download the file from the provided link beforehand and load it from a local
directory.
# Load the required libraries
import pandas as pd
import os
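As a minimal, hedged sketch of this loading step, the dataset could be read directly into the calidad_aire object as follows; the URL and the field separator below are placeholders and should be replaced with the actual values indicated in the open data catalog:

# Load the dataset directly from the open data catalog
# (the URL and the separator are placeholders; use the download link provided in the catalog)
url = "https://example.org/calidad_aire.csv"
calidad_aire = pd.read_csv(url, sep=';')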
With df being the name of the object where the DataFrame is stored:
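The following calls are a typical, minimal way to obtain that global view (a sketch only; in our case df corresponds to the calidad_aire object):

df.info()        # data types and non-null counts per column
df.head()        # first rows of the table
df.describe()    # basic descriptive statistics of the numerical variables
df.shape         # number of rows and columns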
Graphical data visualization is a fundamental pillar in the exploratory analysis process. Visual
representations not only complement numerical analyses but also reveal crucial aspects of the data that
might be imperceptible through purely quantitative methods: behavior patterns, temporal trends,
relationships between variables, and potential anomalies emerge clearly through appropriate
visualizations. The available arsenal of visual tools is broad and versatile, including histograms for
distribution analysis, line charts for temporal evolution, bar charts for categorical comparisons, and pie
charts for compositional analyses, among others. Selecting the most suitable type of visualization for each
analysis is crucial and can be further explored by consulting this Data Visualization Guide.
Particularly, the histogram stands out for its ability to represent the distribution of numerical variables.
By grouping data into intervals or ‘bins,’ these charts reveal crucial patterns in the underlying structure of
the data. The shape of a histogram provides essential information about fundamental statistical
characteristics: it can show whether the distribution is symmetrical or has asymmetries (positive or
negative skewness), whether it is unimodal or multimodal, whether it approximately follows a normal
distribution or fits better to other probabilistic models, among others.
The detailed interpretation of a histogram allows for the identification of various critical aspects for
subsequent analysis:
• The central tendency of the data, observable through the location of the peak or peaks of the
distribution.
• The dispersion or variability, reflected in the width and spread of the bars.
• The presence of long or short tails, which can indicate the frequency of extreme values.
• Possible discontinuities or gaps in the distribution, which could indicate issues in data collection
or important underlying phenomena.
• The existence of outliers, visible as isolated bars at the extremes of the distribution.
Understanding the shape of the distribution is crucial for the selection of subsequent statistical methods,
as many techniques assume normality or other specific distributional characteristics. For example, a
strongly skewed distribution might require data transformation or the use of non-parametric statistical
methods, while a bimodal distribution could suggest the presence of distinct subpopulations in the data
that warrant separate analysis.
Next, histograms are generated for all numerical variables present in the dataset, after importing the
data visualization library matplotlib. Through more advanced programming techniques, such as iterating
over the dataset columns and automating the generation of subplots, we can optimize the visualization
process and obtain all distributions on a single canvas. This programmatic approach is not only more
efficient than manually creating individual charts, but it also facilitates the detection of common patterns
or significant divergences among the different numerical variables in the dataset.
import matplotlib.pyplot as plt
import numpy as np

# Draw a histogram for each numerical variable on a single canvas
# (pandas generates the grid of subplots automatically)
calidad_aire.select_dtypes(include=[np.number]).hist(bins=50, figsize=(12, 8))
plt.tight_layout()
plt.show()
In this set of histograms, it is particularly interesting to analyze the distribution of NO2 (µg/m³). Its
histogram reveals a clearly asymmetric distribution with a pronounced positive skew, where most
measurements are concentrated at low values (below 50 µg/m³), but it has a long tail to the right
extending up to approximately 250 µg/m³. This distribution is typical of atmospheric pollutants in urban
environments, where periods of relatively low baseline concentrations are combined with occasional
episodes of high pollution, possibly associated with peak traffic hours or specific weather conditions. The
shape of this distribution suggests that logarithmic transformations might be necessary for subsequent
statistical analyses that assume normality, and it also indicates the importance of paying special attention
to those extreme values which, although less frequent, could represent critical pollution episodes from a
public health perspective.
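As a hedged illustration of this last point, a logarithmic transformation of the NO2 variable could be sketched as follows; the column name is an assumption based on the naming pattern of the other pollutants:

# Hypothetical sketch: reduce the positive skew of NO2 with a log transformation
# np.log1p computes log(1 + x), which handles zero values safely
no2_log = np.log1p(calidad_aire['NO2 (ug/m3)'])
no2_log.hist(bins=50)
plt.title('log(1 + NO2) distribution')
plt.show()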
After the initial load, it is essential to verify the correct encoding of the data type for each variable. An
inappropriate data type can compromise subsequent analyses or generate erroneous results. It is
necessary to confirm that numerical variables are indeed stored as numbers (whether integers or
decimals), while qualitative or categorical variables should be encoded as character strings and contain a
finite and well-defined set of categories. This early verification allows for the identification and correction
of possible inconsistencies in data typology, such as dates stored as text or categories incorrectly encoded
as numerical values.
The usual types of variables that our data table can contain include numerical values (integers or decimals), categorical variables, dates, and character strings.
Correctly typing the variables in a dataset is not just a matter of format, but it is essential to ensure the
integrity and effectiveness of subsequent analyses. Proper classification of data types adheres to
fundamental principles of data analysis.
• Temporal variables (dates) require special treatment to capture the sequential nature and cyclical
properties of time.
• Categorical variables must preserve the hierarchical or nominal structure of their categories.
• Numerical data must maintain their mathematical and statistical properties.
2.2.1. EXPERIMENT
With the df.info() function, we can perform an initial inspection of the data types assigned to each
variable. In our dataset, we find various variables where the automatically assigned data type does not
correspond to the intrinsic nature of the information they contain, as is the case with the variables Fecha,
Provincia and Estación. These three variables have been generically encoded as type object, which
significantly limits their analytical utility:
• The variable Fecha requires a conversion to date type (datetime) to allow temporal operations such
as interval calculations, period aggregations, or seasonality analysis. The transformation
using pd.to_datetime() not only changes the format but also enables the entire ecosystem of
temporal analysis in Python.
The procedure to follow is to readjust the types of these variables in order to subsequently perform the
necessary operations, analyses, and graphical representations.
# Adjust the type of the variable Fecha
calidad_aire['Fecha'] = pd.to_datetime(calidad_aire['Fecha'], errors='coerce')
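Similarly, a hedged sketch of the conversion of the other two variables to a categorical type (assuming Provincia and Estación are the relevant columns) could be:

# Adjust the type of the categorical variables Provincia and Estación
calidad_aire['Provincia'] = calidad_aire['Provincia'].astype('category')
calidad_aire['Estación'] = calidad_aire['Estación'].astype('category')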
Proper management of missing data is crucial to ensure the quality and reliability of statistical analysis.
Missing data can distort analysis results, affect the accuracy of predictive models, and alter graphical
visualizations, leading to incorrect or misleading interpretations. For example, if not handled properly,
missing data can bias regression results or decrease a model’s predictive capability.
2.3.1.1. EXPERIMENT
In Python, we can use the pandas and numpy libraries to work with missing data. Below, we show some
useful functions to detect missing values:
# Section 1
# Section 2
• In the first section, isna() to generate a boolean DataFrame indicating the presence of missing
values, any() to check if there is at least one NaN in the DataFrame, and sum() to count the total
number of NaNs.
• Additionally, the percentage of missing values is calculated using mean(). Finally, the number and
percentage of missing values for each column are obtained (a combined sketch of both sections is shown below).
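A minimal sketch combining these calls (the exact code of the original notebook may differ) is the following:

# Section 1: detect whether there are missing values and count them
print(calidad_aire.isna().any())          # True/False per column
print(calidad_aire.isna().sum().sum())    # total number of NaNs in the DataFrame

# Section 2: number and percentage of missing values per column
na_por_columna = calidad_aire.isna().sum()
pct_por_columna = calidad_aire.isna().mean() * 100
print(pd.concat([na_por_columna, pct_por_columna], axis=1, keys=['NaN', '%']))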
A thorough analysis of the missing values in our dataset reveals a situation that requires special attention:
the DataFrame "calidad_aire" has 1,163,037 missing values, which represents a significant 20% of the
total observations. This percentage is the same as that obtained in the analysis of the guide in R, with
116,281 missing values. This fact confirms a non-random pattern in data loss that could be related to the
availability or capacity of the measurement systems.
The distribution of these missing values is not uniform among the variables, with a notable absence of
data in the parameters of CO (mg/m³) and PM25 (µg/m³), with 77% and 88% missing values respectively.
The magnitude and nature of these missing data pose a significant methodological challenge that must be
carefully addressed to ensure the validity and usefulness of subsequent analysis. Managing these missing
values will require a strategy that balances preserving data integrity with the need to maintain a
sufficiently complete dataset for effective analysis and reuse.
The selection of the appropriate treatment technique depends on the type of data, the amount and
pattern of missing data, and the context of the analysis. Although mean imputation is a common
technique, it is not always the most suitable. It is essential to evaluate how each method can affect the
results and the quality of the final analysis.
Additionally, it is important to carefully document any decisions made in the treatment of missing data.
A rigorous EDA design includes traceability of these processes to evaluate their impact and make
adjustments if inconsistencies or weaknesses are detected in later stages of the analysis.
2.3.2.1. EXPERIMENT
As an example of applying the listed options, the first treatment to be performed on the missing data is
the elimination of the two variables that present a percentage higher than 50%, as such a high number
of NaNs can produce errors or distort subsequent analyses by not using the rows that present NaNs (in
this case, more than 50% of the observations would not be used). Before this, a copy of the original dataset
is saved, which will be used in point 3.
# Save a copy of the original dataset
calidad_aire_original = calidad_aire.copy()
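A hedged sketch of the removal of the two variables with more than 50% missing values follows; the exact column names are assumptions that should match those shown by the initial inspection of the dataset:

# Remove the two variables with more than 50% missing values
# (the column names below are assumed from the dataset description)
calidad_aire = calidad_aire.drop(columns=['CO (mg/m3)', 'PM25 (ug/m3)'])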
Continuing with the example, we will replace the missing values in the DataFrame for the remaining
variables with the mean of each column, so as not to lose significant information and to ensure that
subsequent analyses are not altered.
# Select the numerical variables
columnas_numericas = calidad_aire.select_dtypes(include=[np.number]).columns
# Calculate the mean of each numerical variable, ignoring the NaNs
cols_mean = calidad_aire[columnas_numericas].mean()
# Replace the missing values of each numerical column with its mean
calidad_aire[columnas_numericas] = calidad_aire[columnas_numericas].fillna(cols_mean)
It is important to note that the treatment performed, although valid as an illustrative example, represents
a simplified approach to the problem of missing values. In a more thorough analysis of air quality data,
additional aspects should be considered such as:
• The temporal nature of the measurements: Air pollutants usually exhibit daily and seasonal
patterns, so simple mean imputation might not capture these cyclical variations. A more
sophisticated approach could consider imputation based on moving averages or values from
equivalent time periods.
• Spatial correlation: Since the data come from different measurement stations, it would be
relevant to consider the geographical proximity between stations for the imputation of missing
values, as nearby stations tend to record similar levels of pollution.
• The pattern of missingness: Before removing variables with a high percentage of missing values,
it would be advisable to analyze whether they follow any systematic pattern (e.g., specific sensor
failures or maintenance periods) that could provide relevant information about the data quality.
These considerations underscore the importance of adapting the techniques for handling missing values
to the specific context of the problem and the nature of the data analyzed.
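By way of illustration of the first two considerations above, a more context-aware imputation could be sketched as follows; this is only an illustrative example, and the NO2 column name is an assumption based on the naming pattern of the other pollutants:

# Hypothetical sketch: impute NO2 per measurement station using time-ordered linear interpolation
calidad_aire_interp = calidad_aire_original.sort_values(['Estación', 'Fecha'])
calidad_aire_interp['NO2 (ug/m3)'] = (
    calidad_aire_interp
    .groupby('Estación')['NO2 (ug/m3)']
    .transform(lambda s: s.interpolate(method='linear'))
)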
An atypical value or outlier represents an observation that exhibits a significant deviation from the
general pattern of behavior of the rest of the data. These extreme observations can arise for various
reasons: from errors in measurement or data recording to real but extraordinary phenomena that deserve
special attention. Their importance in exploratory analysis is crucial, as they can exert a disproportionate
influence on descriptive statistics, distort relationships between variables, and compromise the validity of
subsequent statistical models. For example, in air quality data, an outlier could represent both a
measurement error and a real episode of severe pollution, making its identification and contextualized
analysis especially relevant before making decisions about its treatment. Proper management of these
values requires a balance between preserving potentially valuable information and maintaining the
robustness of the statistical analysis.
The most common approach to handling outliers is to reduce their potential influence on analyses. Below
are some strategies that can be considered:
• Robust statistical methods: There are robust statistical techniques designed to minimize the
impact of outliers on results. These methods adjust the analysis to be less sensitive to outliers,
thus preserving the integrity of the results.
• Removal of outliers: Removing outliers can be appropriate in some cases, but it should be done
carefully. Before discarding an outlier, it is essential to verify whether the value is the result of a
measurement error or a problem in the dataset construction.
• Substitution of outliers: replacing outliers with the mean or median, for example. Although this
practice may seem like a simple solution, it can alter the distribution and variance of the data,
introducing bias into the analysis.
If it is decided to remove or replace outliers, it is prudent to repeat the analyses with both the original
values and the modified data. This allows for observing the real impact of the outliers on the results. If
the difference is minimal, it may be reasonable to proceed with the removal or replacement. However, if
the impact is considerable, any decision should be adequately justified.
Regardless of the approach taken, as with the treatment of missing data, it is also crucial to document all
decisions made during the process of handling outliers. This ensures that other analysts can understand
the transformations performed on the dataset and allows for proper traceability throughout the
Exploratory Data Analysis.
The following shows how outliers can be detected and removed, assuming it can be justified that the
values are measurement errors or issues arising from data ingestion. The goal is to prevent these values
from distorting future statistical analyses.
To demonstrate the process, we must distinguish between two types of treatment based on the type of
variable: numerical (continuous or discrete) and categorical.
To demonstrate the process of outlier detection in a continuous variable, we will use the numerical
variable ‘O3 (µg/m³)’ as an example. The process is exactly the same for the rest of the numerical
variables in the table.
First, we generate a histogram to understand the frequency distribution of the variable under study:
plt.hist(calidad_aire['O3 (ug/m3)'], bins=100, range=(0, 150), color='blue', edgecolor='black')
plt.title('Distribución de O3 (ug/m3)')
plt.xlabel('O3 (ug/m3)')
plt.ylabel('Frecuencia')
plt.xlim(0,150)
plt.tight_layout()
plt.show()
The analysis of the O3 histogram reveals a positively skewed distribution characteristic of atmospheric
pollutant measurements, where most observations are concentrated in the range of 0 to 100 µg/m³.
Observations exceeding this threshold show a markedly lower frequency, suggesting the presence of
potential outliers. However, in the context of air quality, these high values could represent actual episodes
of high ozone concentration, typically associated with specific meteorological conditions such as high solar
radiation and elevated temperature, especially during the summer months. To more appropriately detect
the presence of outliers, we will use the most suitable representation for this task: a box plot.
Box plots provide a visual representation that describes the dispersion and symmetry of the data by
observing the quartiles (division of the distribution into four parts delimited by the values 0.25, 0.50, and
0.75). These plots consist of three components:
1. Interquartile range (IQR): It represents 50% of the data, ranging from the 25th percentile of the
distribution (Q1) to the 75th percentile (Q3). Inside the box, we find a line indicating the 50th
percentile of the distribution (Q2), the median. The box provides an idea of the distribution’s
dispersion based on the separation between Q1 and Q3, as well as whether the distribution is
symmetric around the median or skewed to one side.
2. Whiskers: they extend from both ends of the box up to the most extreme observations that still
fall within the limits Q1 – 1.5 IQR (lower whisker) and Q3 + 1.5 IQR (upper whisker), excluding
outliers.
3. Outliers: This representation identifies as outliers those observations that have values lower or
higher than the limits of the plot (lower limit: Q1 – 1.5 IQR and upper limit: Q3 + 1.5 IQR).
To obtain the necessary statistics for the plot representation, we will use the boxplot() function from
seaborn, another popular data visualization library:
import seaborn as sns

# Statistics needed to reproduce the box-and-whisker plot
Q1 = calidad_aire['O3 (ug/m3)'].quantile(0.25)
Q3 = calidad_aire['O3 (ug/m3)'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Box plot of the variable O3 (ug/m3)
sns.boxplot(x=calidad_aire['O3 (ug/m3)'])
plt.show()
A detailed analysis of the O3 variable reveals interesting patterns in its distribution. The descriptive
statistics show a median of 52.62 µg/m³, with an interquartile range (IQR) from 31.50 µg/m³ (Q1) to 75.50
µg/m³ (Q3). The box plot illustrates a significant presence of outliers above the upper limit, specifically
identifying 91,163 observations that exceed this threshold out of a total of 446,014 measurements
(approximately 20% of the data).
It is important to note that this analysis represents an illustrative example of basic outlier identification
using simple statistical methods. In a comprehensive air quality study, the treatment of these values
would require additional domain-specific considerations and a deeper analysis of the causes of these
deviations. However, for the educational purposes of this guide, this approach serves to demonstrate the
basic concepts of outlier analysis.
Outlier detection in categorical variables requires a different approach than that used for numerical
variables, focusing on identifying unusual or inconsistent categories within the problem domain. For this
analysis, visualization using bar charts or frequency diagrams is particularly useful, as it allows for
identifying both the distribution of expected categories and the possible presence of anomalous
categories that could represent coding errors, data collection issues, or special cases that require
attention.
In our case, we will use the variable ‘Provincia’ as an illustrative example of this process. This choice is
particularly suitable because the valid categories are clearly defined (the provinces of Castilla y León),
which facilitates the identification of possible anomalies such as spelling errors, variations in writing
format, or the presence of provinces that do not belong to the autonomous community. This same
analysis procedure can and should be applied to any categorical variable in the dataset, adapting the
validation criteria according to the specific nature of each variable.
In Python, we can use the matplotlib or seaborn library to create bar charts and visualize the distribution
of categories.
# Number of categories in the variable Provincia
categoria_counts = calidad_aire['Provincia'].value_counts()
categoria_counts.plot(kind='bar')
plt.show()
Based on the exploratory analysis, we can deduce that the category Madrid is an outlier within the
variable ‘Provincia’, as it does not belong to Castilla y León. In the next section, we will proceed to remove
this category from our data.
Once the outliers have been identified, we proceed to remove them. One way to remove outliers from a
numerical variable is to generate a new table.
# Generate a new table that does not contain the values identified as outliers
calidad_aire_NoOut = calidad_aire[(calidad_aire['O3 (ug/m3)'] >= lower_bound) &
                                  (calidad_aire['O3 (ug/m3)'] <= upper_bound)]
# Box plots of O3 (ug/m3) before and after removing the outliers
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x=calidad_aire['O3 (ug/m3)'], ax=axes[0])
sns.boxplot(x=calidad_aire_NoOut['O3 (ug/m3)'], ax=axes[1])
plt.tight_layout()
plt.show()
Figure 14 - Box plot of the variable ‘O3 (µg/m³)’ before and after outlier removal
The comparison of the box plots before and after outlier removal reveals significant changes in the
distribution of the O3 (µg/m³) variable. The post-removal plot shows a more compact and symmetric
distribution, with a considerably narrower range oscillating approximately between 30 and 70 µg/m³. This
change in scale allows for a better appreciation of the central structure of the data, where both the
median and the quartiles are more clearly visualized.
In Python, we can use the drop() method to remove rows containing the outlier category
and astype('category').cat.remove_unused_categories() to ensure the category is removed from the
categorical data type.
# Remove the rows that belong to the category 'Madrid'
calidad_aire_SM = calidad_aire[calidad_aire['Provincia'] != 'Madrid'].copy()
# Remove 'Madrid' from the set of categories of the categorical variable Provincia
calidad_aire_SM['Provincia'] = calidad_aire_SM['Provincia'].astype('category').cat.remove_unused_categories()
Correlation (r) measures the linear relationship between two or more variables, reflecting both the
strength and direction of their relationship. In simple terms, correlation tells us if two variables change
together and how they do so:
• Positive Correlation: if one variable increases and the other also increases, they are said to be
positively correlated. A value of (r) close to +1 indicates a strong positive relationship.
• Negative Correlation: if one variable increases while the other decreases, they are negatively
correlated. A value of (r) close to -1 indicates a strong negative relationship.
• No Correlation: a value of (r) close to 0 suggests that there is no clear linear relationship between
the variables.
It is important to note that Pearson correlation, being the most common, is not the only measure of
correlation available. There are alternatives such as:
• Spearman Correlation, more robust against outliers and useful for non-linear monotonic
relationships.
• Kendall Correlation, especially useful for small samples and when the data do not follow a normal
distribution.
• Rank-based correlation measures, which can capture non-linear relationships between variables (a short sketch of these alternatives follows this list).
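As a minimal, hedged sketch, pandas allows switching between these correlation measures through the method argument of corr():

# Correlation matrices with different methods (pandas supports 'pearson', 'spearman' and 'kendall')
numericas = calidad_aire.select_dtypes(include=[np.number])
corr_spearman = numericas.corr(method='spearman')
corr_kendall = numericas.corr(method='kendall')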
Within EDA, correlation analysis serves several purposes:
• Identification of redundancy: it can help identify redundant variables in a dataset. If two variables
are highly correlated, one of them could be eliminated without losing significant information,
simplifying data analysis and processing.
• Simplicity in analysis: in the context of EDA, understanding the correlations between variables
can guide the selection of variables for predictive models and other statistical analyses.
• Relationship with PCA: correlation analysis is related to techniques such as Principal Component
Analysis (PCA). PCA uses the correlation matrix to transform the original variables into a new set
of variables, called principal components, which capture the greatest variability in the data.
In the specific context of environmental data and air quality, correlation analysis becomes especially
relevant by allowing a holistic understanding of the interactions between pollutants and their
environment. This technique is fundamental for identifying relationships between different pollutants
that share common emission sources, while also facilitating the understanding of how meteorological
conditions influence their concentrations. Additionally, this analysis allows for the detection of temporal
and spatial patterns in the distribution of pollutants, providing a solid basis for interpreting atmospheric
dynamics and effectively managing air quality.
It is crucial to remember that correlation does not imply causation. Although two variables may be
correlated, it does not necessarily mean that one causes changes in the other. Correlation simply indicates
an association and does not establish a direct causal relationship.
Correlation analysis is a fundamental tool in multivariate data exploration, but it has limitations that
must be considered, such as coefficients significantly distorted by the presence of outliers or the potential
omission of non-linear relationships and complex patterns. To address these limitations, it is essential to
complement the analysis with additional tools such as scatterplot visualizations to detect non-linear
patterns, stratified analysis by subgroups to identify possible heterogeneities in relationships, or the
implementation of temporal or spatial cross-validation techniques to confirm the stability of identified
correlations. This comprehensive and multidimensional approach allows for a more robust and reliable
understanding of the relationships between variables in a professional and complete context.
2.5.1. EXPERIMENT
In this section, we will calculate the correlation matrix for the numerical variables and represent it
graphically.
num_variables = calidad_aire.select_dtypes(include=[np.number])
The visualization of correlations between numerical variables has been implemented using a heatmap, an
effective tool that allows representing the correlation matrix with an intuitive color code, where more
intense shades indicate stronger correlations.
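A minimal sketch of this computation and visualization (assuming seaborn has already been imported as sns) could be:

# Correlation matrix of the numerical variables
matriz_correlacion = num_variables.corr()

# Heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(matriz_correlacion, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation matrix')
plt.show()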
The analysis of the correlation matrix reveals several interesting patterns in our air quality dataset:
The most significant correlations are observed between NO (µg/m³) and NO₂ (µg/m³), with a coefficient of
r=0.73, suggesting a strong positive association. This relationship was expected since both pollutants
belong to the nitrogen oxides (NOx) family and share common emission sources, mainly related to
combustion processes.
On the other hand, moderate positive correlations are identified between PM10 (µg/m³) and several
pollutants: with NO (r=0.4) and NO2 (r=0.44). These associations could indicate a common contribution
from emission sources, possibly related to urban traffic or industrial processes.
The rest of the variables show lower correlation coefficients. This is reflected in the graph, where the
squares representing these correlations are closer to white in color. For example, we could infer that the
variables O₃ (µg/m³) and PM10 (µg/m³) are relatively independent, as their correlation coefficient is low.
These observed correlations provide a basis for understanding the interrelationships between pollutants,
although as mentioned earlier, they should be interpreted with caution and in the specific context of
urban air quality.
3. AUTOMATED EXPLORATORY DATA ANALYSIS
Automated exploratory analysis and data profiling tools offer several advantages:
• Efficiency and speed: automated EDA allows processing large volumes of data in a short time,
generating detailed reports with statistical metrics, visualizations, and correlation analysis
automatically.
• Comprehensive overview: automated EDA tools provide a panoramic view of the data, including
statistical summaries, distributions, relationships between variables, and possible anomalies,
facilitating the identification of key patterns and trends.
• Early problem detection: automated EDA can help identify issues in the data, such as outliers,
missing data, or biases, at an early stage of the analysis, allowing for informed decisions about
data preprocessing and cleaning.
However, it is essential that these tools are used with the following considerations in mind:
• Interpretation of results: although automated EDA provides a quick overview, it is essential that
the data scientist interprets the results with judgment and knowledge of the problem’s context.
• Personalization: automated EDA tools offer customization options to tailor reports and
visualizations to the specific needs of the analysis.
• Limitations: automated EDA may not be suitable for all types of data or analysis. In some cases,
a more in-depth and personalized exploratory analysis may be necessary.
3.1. EXPERIMENT
Next, we will use YData Profiling, a library that allows generating interactive EDA reports with
visualizations, descriptive statistics, correlation analysis, and outlier detection. The report generated with
YData Profiling includes:
• Executive summary with general information about the dataset, such as the number of rows and
columns, data types, unique values, missing values, etc.
• Univariate analysis of each column, with visualizations and descriptive statistics such as mean,
median, standard deviation, minimum and maximum values, distribution, etc.
• Bivariate analysis, with correlation matrices, scatter plots, and analysis of the relationship
between variables.
• Detection of outliers and anomalies in the data.
• Recommendations and suggestions for improving the dataset.
The following code block shows how to generate an automated EDA report with YData Profiling on the
calidad_aire dataset. This report is generated in the working directory of the chosen execution
environment and can be downloaded from the GitHub repository.
!pip install setuptools  # installation of packages and dependencies
!pip install --upgrade ydata-profiling
!pip install ipywidgets
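After installing the dependencies, the report itself can be generated with a few lines (a sketch; the report title and output file name are arbitrary choices):

from ydata_profiling import ProfileReport

# Generate the interactive EDA report and export it to an HTML file in the working directory
profile = ProfileReport(calidad_aire, title="Air quality EDA report")
profile.to_file("calidad_aire_report.html")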
In the generated interactive report, we will be able to observe some previously extracted conclusions,
such as the number of null values, the high correlation between certain variables, or the distribution of
each variable.
4. CONCLUSIONS
Exploratory Data Analysis (EDA) constitutes a fundamental pillar in modern data science, providing a
robust methodological framework for the initial understanding of complex datasets. As demonstrated
throughout this guide, this process goes beyond simple preliminary data inspection, forming a critical
phase that determines the quality and reliability of any subsequent analysis, whether it be rigorous
scientific research or the development of advanced interactive visualizations.
The systematic application of EDA techniques, from verifying data integrity to analyzing multivariate
correlations, allows for building a solid foundation for data-driven decision-making.
This guide has adopted a practical approach, using real air quality data that, although available in the
national catalog datos.gob.es, are downloaded directly from their original source: the open data portal of
Castilla y León. This dataset has allowed illustrating both the application of techniques and the specific
challenges and considerations that arise when working with environmental data. The choice of Python as
an analysis tool is due to its growing relevance in the data science ecosystem, providing an accessible yet
powerful foundation for implementing these methods.
It is important to emphasize that the techniques presented constitute a basic starting point in data
analysis. In more advanced applications, these methods can and should be complemented with more
sophisticated techniques, such as advanced multivariate analysis, automated anomaly detection, or the
use of machine learning techniques for exploring complex patterns.
We hope this guide serves as a practical resource for those starting in data analysis, providing a solid
methodological foundation that can be applied and adapted to various datasets and specific contexts. See
you soon!
5. NEXT STEPS
If you want to delve deeper into the fascinating world of exploratory data analysis, we suggest the
following resources:
• Some freely available books that detail the process of exploratory data analysis and include test
datasets and examples with code (R or Python) to illustrate the process:
o Python for Data Analysis: a fundamental book that extensively covers data analysis in
Python, including EDA.
o Exploratory Data Analysis with R: a classic book focused on EDA using R.
o R for Data Science: an extensive resource on data science and exploratory analysis in R.
• In addition to books, the best way to learn data science is by practicing. Below, we provide links
to tutorials and online courses with a significant amount of practical programming:
o Comprehensive Data Exploration with Python: a tutorial on Kaggle that guides you
through a complete exploratory data analysis before training a machine learning model.
o Exploratory Data Analysis with Python and Pandas: a step-by-step tutorial on how to
perform EDA using pandas.
o Exploratory Data Analysis with Seaborn: complete guide for visualization and EDA with
seaborn.
• Lastly, here are some very useful additional resources that graphically compile the most relevant
information:
o Data Science Cheat Sheet: DataCamp cheat sheets that summarize key data science
concepts.
o Seaborn Cheat Sheet: Examples and cheat sheets for the Seaborn library, useful in EDA.
o Pandas Cheat Sheet: A practical summary of the most common operations with Pandas in
Python.