Matplotlib
Matplotlib is an open-source library for plotting numerical data, which is why
it is widely used in data analysis. It produces high-quality figures such as
pie charts, scatter plots, box plots, and line graphs, among other things.
NumPy
NumPy is one of the most widely used open-source Python packages,
focusing on mathematical and scientific computation. It has built-in
mathematical functions for convenient computation and supports large
matrices and multidimensional data. It can be used for various things,
including linear algebra, and serves as an N-dimensional container for
generic data. Its core object, the NumPy array (ndarray), is an
N-dimensional array; in the two-dimensional case it has rows and columns.
Along with this, NumPy can be used as a random number generator.
In Python, NumPy is recommended over lists because it uses less memory,
is faster, and is more convenient.
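The comparison with lists can be made concrete with a small sketch. The size of 1000 elements and the variable names are illustrative assumptions, and the byte count for the list is only a rough estimate, since CPython shares some small integer objects.

```python
import sys
import numpy as np

n = 1000  # illustrative size, not taken from the text
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# Element-wise arithmetic: a comprehension for the list, one expression for the array
doubled_list = [2 * v for v in py_list]
doubled_array = 2 * np_array

# Rough memory estimate: the list stores references to separate int objects,
# while the array stores raw 8-byte integers in one contiguous buffer
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
array_bytes = np_array.nbytes

print("list (approx.):", list_bytes, "bytes")
print("array:", array_bytes, "bytes")
```

Both containers hold the same values after doubling; the array version needs no explicit loop and stores its data more compactly.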
Images, sound waves, and other raw binary streams can be represented as
multidimensional arrays of real values in NumPy, for example for
visualization. Because machine learning libraries commonly build on NumPy
arrays, developers need to be familiar with NumPy to use them.
Pandas
Pandas is an open-source library licensed under the Berkeley Software
Distribution (BSD) license. It is a well-known library that is widely used
in data science, mostly for the analysis, manipulation, and cleaning of
data, among other things. Pandas allows us to perform simple data
modelling and analysis without having to switch to another language
such as R.
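As a minimal sketch of the cleaning and analysis steps mentioned above, the example below builds a tiny table, counts and drops the missing values, and aggregates by group. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up example data; one temperature reading is missing
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi", "Pune"],
    "temp_c": [31.0, np.nan, 29.5, 35.0, 30.5],
})

# Cleaning: count the missing readings, then drop those rows
missing_count = int(df["temp_c"].isna().sum())
clean = df.dropna(subset=["temp_c"])

# Analysis: mean temperature per city
mean_by_city = clean.groupby("city")["temp_c"].mean()
print(missing_count)
print(mean_by_city)
```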
SciPy
SciPy is an open-source Python library designed especially
for scientific computing, information processing, and high-level
computing. It includes a large number of user-friendly methods and
functions for quick and convenient computation. SciPy can be
used for mathematical computations alongside NumPy.
cluster, fftpack, constants, integrate, io, linalg, interpolate, ndimage, odr,
optimize, signal, spatial, special, sparse, and stats are just a few of the
subpackages available in SciPy.
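Two of the subpackages listed above can be tried in a few lines. This is only a sketch: the integrand and the quadratic being minimized are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import integrate, optimize

# integrate: numerically integrate sin(x) from 0 to pi (the exact value is 2)
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

# optimize: find the minimum of a simple quadratic, which lies at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)

print(area, result.x)
```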
Scikit-learn
Scikit-learn is also an open-source machine learning library based on
Python. It supports both supervised and unsupervised learning. It provides
implementations of popular algorithms and is built on the SciPy, NumPy,
and Matplotlib packages. One well-known application of scikit-learn is
Spotify music recommendations.
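The difference between supervised and unsupervised learning can be sketched on the iris dataset that ships with scikit-learn. The choice of dataset, of a k-nearest-neighbours classifier, and of k-means clustering are assumptions made for illustration, not something the text prescribes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: learn from labelled examples, then score on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Unsupervised: group the same measurements without using the labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_labels = kmeans.labels_

print("test accuracy:", accuracy)
print("number of clusters found:", len(set(cluster_labels.tolist())))
```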
Seaborn
This package is used to visualize statistical relationships in data. The library
is largely based on Matplotlib and supports the creation of statistical
graphics through:
A dataset-oriented API for comparing variables.
Easy creation of complex visualisations, including multi-plot grids.
Univariate and bivariate visualisations for comparing subsets of data.
A variety of colour palettes for revealing patterns.
Automatic estimation and plotting of linear regression models.
TensorFlow
TensorFlow is a high-performance open-source library for numerical
computation. It is also used to build deep learning and other machine
learning algorithms. It was developed by researchers on the Google Brain
team within Google's AI organisation and is now widely used for complex
mathematical computations by mathematics, physics, and machine learning
researchers.
Keras
Keras is a Python-based open-source neural network library that makes it
possible to experiment with deep neural networks quickly. As deep learning
becomes more common, Keras emerges as a viable option because,
according to its creators, it is an API (Application Programming Interface)
designed for humans, not machines. Keras is widely adopted in both the
research community and industry.
Before using Keras, the user should first install the TensorFlow
backend engine.
Statsmodels
Statsmodels is a Python library that helps with statistical model analysis
and estimation. The library is used to run statistical tests and other
statistical tasks, and it reports detailed results for each fitted model.
The Python programming language is user-friendly and widely used in many
real-world applications. Because it is a high-level, dynamically typed
language, errors are comparatively quick to debug, and its use is expanding
rapidly. Python is used in well-known applications such as YouTube and
Dropbox. Thanks to the availability of Python libraries, users can also
perform many tasks without having to write the underlying code
themselves.
Data Visualization Techniques
Data visualization is a graphical representation of information and data. By
using visual elements like charts, graphs, and maps, data visualization tools provide
an accessible way to see and understand trends, outliers, and patterns in data. This
study on data visualization techniques will help you understand detailed techniques
and benefits.
In the world of Big Data, data visualization tools and technologies are essential to
analyse massive amounts of information and make data-driven decisions.
Advantages of data visualization
The uses of data visualization are as follows:
It is a powerful way to explore data with presentable results.
Its primary use is in the pre-processing portion of the data mining process.
It supports the data cleaning process by finding incorrect and missing values.
It helps with variable derivation and selection, that is, deciding which variables to
include in the analysis and which to discard.
It also plays a role in combining categories as part of the data reduction process.
Disadvantages
While there are many advantages, some of the disadvantages may seem less obvious.
For example, when viewing a visualization with many different data points, it’s easy to
make an inaccurate assumption. Sometimes the visualization is simply designed
poorly, so that it is biased or confusing.
Some other disadvantages include:
Biased or inaccurate information.
Correlation doesn’t always mean causation.
Core messages can get lost in translation.
Data visualization for One-dimensional (1-D)
import numpy as np
from matplotlib import pyplot as plt
plt.rcParams["figure.figsize"] = [7.00, 3.50]
plt.rcParams["figure.autolayout"] = True
y_value = 1
x = np.arange(10)
y = np.zeros_like(x) + y_value
plt.plot(x, y, ls='dotted', c='red', lw=5)
plt.show()
Data visualization for 2-D
import numpy as np
import matplotlib.pyplot as plt
image = np.random.rand(30, 30)
plt.imshow(image, cmap=plt.cm.hot)
plt.colorbar()
plt.show()
Data visualization for 3-D
We can easily plot 3-D figures in matplotlib. Now, we discuss some important
and commonly used 3-D plots.
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
# setting a custom style to use
style.use('ggplot')
# create a new figure for plotting
fig = plt.figure()
# create a new subplot on our figure
# and set projection as 3d
ax1 = fig.add_subplot(111, projection='3d')
# defining x, y, z co-ordinates
x = np.random.randint(0, 10, size = 20)
y = np.random.randint(0, 10, size = 20)
z = np.random.randint(0, 10, size = 20)
# plotting the points on subplot
ax1.scatter(x, y, z, c='m', marker='o')
# setting labels for the axes
ax1.set_xlabel('x-axis')
ax1.set_ylabel('y-axis')
ax1.set_zlabel('z-axis')
# function to show the plot
plt.show()
General Types of Visualizations:
Chart: Information presented in a tabular, graphical form with data displayed along
two axes. It can take the form of a graph, diagram, or map.
Table: A set of figures displayed in rows and columns.
Graph: A diagram of points, lines, segments, curves, or areas that represents certain
variables in comparison to each other, usually along two axes at a right angle.
Geospatial: A visualization that shows data in map form using different shapes and
colors to show the relationship between pieces of data and specific locations.
Infographic: A combination of visuals and words that represent data. Usually uses
charts or diagrams.
Dashboards: A collection of visualizations and data displayed in one place to help
with analysing and presenting data.
Data Visualization Techniques
Box plots
Histograms
Heat maps
Charts
Tree maps
Kernel density estimate (KDE) plots
Box Plots
A box plot is a standardized way of displaying the
distribution of data based on a five-number summary (“minimum”, first quartile (Q1),
median, third quartile (Q3), and “maximum”). It can show you the outliers and
their values. It can also show whether your data is symmetrical, how tightly it
is grouped, and whether and how it is skewed.
A box plot is a graph that gives you a good indication of how the values in the data are
spread out. Although box plots may seem primitive in comparison to
a histogram or density plot, they have the advantage of taking up less space, which is
useful when comparing distributions between many groups or datasets. For some
distributions/datasets, you will find that you need more information than the measures
of central tendency (median, mean, and mode). You need to have information on the
variability or dispersion of the data.
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
np.random.seed(10)
data = np.random.normal(100, 20, 200)
fig = plt.figure(figsize =(10, 7))
# Creating plot
plt.boxplot(data)
# show plot
plt.show()
Five Number Summary of Box Plot
Minimum: Q1 - 1.5*IQR
First quartile (Q1/25th percentile): the middle number between the smallest number (not the
“minimum”) and the median of the dataset.
Median (Q2/50th percentile): the middle value of the dataset.
Third quartile (Q3/75th percentile): the middle value between the median and the highest value (not
the “maximum”) of the dataset.
Maximum: Q3 + 1.5*IQR
Interquartile range (IQR): the range from the 25th to the 75th percentile.
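The quantities in this summary can be computed directly with NumPy. The sketch below uses a small made-up dataset with one obvious outlier and NumPy's default (linear) percentile interpolation; other quartile conventions give slightly different numbers. By default, Matplotlib's boxplot draws each whisker to the most extreme data point that still lies inside the 1.5*IQR fences.

```python
import numpy as np

# Made-up data with one obvious outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# The "minimum" and "maximum" of the box plot are fences, not the data extremes
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn individually as outliers
outliers = data[(data < lower_fence) | (data > upper_fence)]

# Whiskers stop at the most extreme points that are still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3, iqr)          # 3.25 5.5 7.75 4.5
print(lower_fence, upper_fence)     # -3.5 14.5
print(outliers)                     # [100.]
```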
Histograms
A histogram is a graphical display of data using bars of different heights. In a
histogram, each bar groups numbers into ranges. Taller bars show that more data falls
in that range. A histogram displays the shape and spread of continuous sample data.
It is a plot that lets you discover, and show, the underlying frequency distribution
(shape) of a set of continuous data. This allows the inspection of the data for its
underlying distribution (e.g., normal distribution), outliers, skewness, etc. A
histogram represents the distribution of numerical data for a single variable. To
build one, the entire range of values is divided into a series of intervals, called bins
or buckets, and the number of values that fall into each interval is counted.
Bins are consecutive, non-overlapping intervals of a variable. As adjacent bins
leave no gaps, the rectangles of a histogram touch each other to indicate that the original
variable is continuous.
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(256)
x = 10*np.random.rand(200,1)
y = (0.2 + 0.8*x) * np.sin(2*np.pi*x) + np.random.randn(200,1)
plt.hist(y, bins=20, color='purple')
plt.show()
Histograms are based on area, not height of bars
In a histogram, the height of the bar does not necessarily indicate how many
occurrences of scores there were within each bin. It is the product of height multiplied
by the width of the bin that indicates the frequency of occurrences within that bin.
Bar height is often mistaken for frequency because many histograms have
equally wide bins, and under these circumstances the height of each bar does
reflect the frequency.
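The area rule is easiest to see with bins of unequal width. In the sketch below the values and bin edges are made up; `np.histogram` with `density=True` returns bar heights, and multiplying each height by its bin width recovers the relative frequency of that bin.

```python
import numpy as np

# Made-up values and deliberately unequal bin edges
values = np.array([1, 2, 2, 3, 3, 3, 4, 8, 9, 15], dtype=float)
edges = np.array([0, 2, 4, 8, 16], dtype=float)

counts, _ = np.histogram(values, bins=edges)
heights, _ = np.histogram(values, bins=edges, density=True)
widths = np.diff(edges)

# Area of each bar = height * width = share of observations in that bin
areas = heights * widths

print(counts)   # [1 5 1 3]
print(heights)  # the first and third bins both hold one value but differ in height
print(areas)    # sums to 1
```

The first and third bins each contain a single observation, yet the third bar is half as tall because its bin is twice as wide; only the areas match.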
Heat Maps
A heat map is a data visualization technique that uses colour the way a bar graph
uses height and width.
If you’re looking at a web page and you want to know which areas get the most
attention, a heat map shows you in a visual way that’s easy to assimilate and make
decisions from. It is a graphical representation of data where the individual values
contained in a matrix are represented as colours. Heat maps are useful for two
purposes: visualizing correlation tables and visualizing missing values in the data. In both
cases, the information is conveyed in a two-dimensional table.
Note that heat maps are useful when examining a large number of values, but they
are not a replacement for more precise graphical displays, such as bar charts,
because colour differences cannot be perceived accurately.
# importing the modules
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
# generating 2-D 10x10 matrix of random numbers
# from 1 to 100
data = np.random.randint(low = 1,
high = 100,
size = (10, 10))
print("The data to be plotted:\n")
print(data)
# plotting the heatmap
hm = sn.heatmap(data = data)
# displaying the plotted heatmap
plt.show()
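The two uses named earlier, correlation tables and missing values, both reduce to building a two-dimensional table first. The sketch below prepares both tables with NumPy only; the three synthetic columns and the positions of the gaps are assumptions made for illustration. Either table could then be passed to a heat map function such as the `sn.heatmap` call used above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three synthetic columns: b depends strongly on a, c is independent noise
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)
table = np.column_stack([a, b, c])

# Correlation table: one row and one column per variable
corr = np.corrcoef(table, rowvar=False)

# Missing-value table: True wherever a value is absent
table_with_gaps = table.copy()
table_with_gaps[[3, 10], 1] = np.nan
missing_mask = np.isnan(table_with_gaps)

print(np.round(corr, 2))
print("missing cells:", int(missing_mask.sum()))
```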
List of Charts to Visualize Data
Bar Graph: It has rectangular bars in which the lengths are proportional to the
values which are represented.
import numpy as np
import matplotlib.pyplot as plt
# creating the dataset
data = {'C':20, 'C++':15, 'Java':30,
'Python':35}
courses = list(data.keys())
values = list(data.values())
fig = plt.figure(figsize = (10, 5))
# creating the bar plot
plt.bar(courses, values, color ='maroon',
width = 0.4)
plt.xlabel("Courses offered")
plt.ylabel("No. of students enrolled")
plt.title("Students enrolled in different courses")
plt.show()
Area Chart: It combines the line chart and bar chart to show how the numeric
values of one or more groups change over the progression of a second variable, typically time.
import plotly.express as px
df = px.data.iris()
fig = px.area(df, x="sepal_width", y="sepal_length",
color="species",
hover_data=['petal_width'],)
fig.show()
Line Graph: The data points are connected by straight line segments,
creating a representation of the changing trend.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 1, 201)
y = np.sin((2*np.pi*x)**2)
plt.plot(x, y, 'purple')
plt.show()
Pie Chart: It is a chart where various components of a data set are presented in
the form of a pie which represents their proportion in the entire data set.
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
plt.pie(y)
plt.show()
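The proportions that the pie chart encodes can be checked numerically. This short sketch reuses the four values from the example above and converts them to fractions of the whole and to wedge angles.

```python
import numpy as np

y = np.array([35, 25, 25, 15])

# Each wedge covers the same fraction of the circle as its share of the total
fractions = y / y.sum()
angles_deg = 360 * fractions

print(fractions)   # [0.35 0.25 0.25 0.15]
print(angles_deg)  # [126.  90.  90.  54.]
```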
Scatter Charts
Another common visualization technique is a scatter plot that is a two-dimensional plot
representing the joint variation of two data items. Each marker (symbols such as dots,
squares and plus signs) represents an observation. The marker position indicates the
value for each observation. When you assign more than two measures, a scatter plot
matrix is produced: a series of scatter plots displaying every possible pairing of the
measures that are assigned to the visualization. Scatter plots are used for examining
the relationship, or correlation, between X and Y variables.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(256)
x = 10*np.random.rand(200,1)
y = (0.2 + 0.8*x) * np.sin(2*np.pi*x) + np.random.randn(200,1)
plt.scatter(x, y, color='purple')
plt.show()
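The relationship a scatter plot shows can be summarized with the Pearson correlation coefficient, and the panels of a scatter plot matrix correspond to the pairings of the measures. The sketch below uses three made-up measures, and it counts each unordered pair once, which is an assumption about how the pairings are laid out.

```python
from itertools import combinations

import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

# Made-up measures: "double" rises with "base", "reverse" falls with it
measures = {
    "base": [1, 2, 3, 4, 5],
    "double": [2, 4, 6, 8, 10],
    "reverse": [5, 4, 3, 2, 1],
}

# One scatter plot per unordered pair of measures
pairs = list(combinations(measures, 2))
correlations = {
    (a, b): pearson_r(measures[a], measures[b]) for a, b in pairs
}

for pair, r in correlations.items():
    print(pair, round(r, 3))
```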
Tree Map
A treemap is a visualization that displays hierarchically organized data as a set of
nested rectangles, parent elements being tiled with their child elements. The sizes and
colours of rectangles are proportional to the values of the data points they represent.
A leaf node rectangle has an area proportional to the specified dimension of the data.
Depending on the choice, the leaf node is coloured, sized or both according to chosen
attributes. Treemaps make efficient use of space, so they can display thousands of
items on the screen simultaneously.
# In a notebook, install the dependency first: !pip install squarify -qqq
import squarify
import matplotlib.pyplot as plt
labels=['nepal', 'america', 'india']
sizes=[2, 3, 4]
colors=['red', 'blue', 'red']
squarify.plot(sizes=sizes,
label=labels,
color =colors,
alpha=.7,
bar_kwargs=dict(linewidth=1, edgecolor="#222222"))
plt.show()
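The rule that each leaf rectangle has an area proportional to its value can be sketched without any plotting library. The function below is a hypothetical helper that uses the simplest "slice" layout, cutting one rectangle into vertical strips. It is not the squarified layout that squarify computes, which additionally tries to keep the rectangles close to square.

```python
def slice_layout(sizes, width, height):
    """Split a width x height rectangle into vertical strips, one per value.

    Returns a list of (x, y, strip_width, strip_height) tuples.
    """
    total = sum(sizes)
    rects = []
    x = 0.0
    for size in sizes:
        strip_width = width * size / total
        rects.append((x, 0.0, strip_width, height))
        x += strip_width
    return rects

# Same sizes as in the example above, laid out in a 90 x 10 rectangle
sizes = [2, 3, 4]
rects = slice_layout(sizes, width=90.0, height=10.0)
areas = [w * h for (_, _, w, h) in rects]

print(rects)
print(areas)  # proportional to 2 : 3 : 4
```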
Kernel density estimate (KDE) plot
A kernel density estimate (KDE) plot is a method for visualizing
the distribution of observations in a dataset, analogous to a
histogram. KDE represents the data using a continuous
probability density curve in one or more dimensions.
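The idea behind a one-dimensional KDE can be sketched directly in NumPy: place a small Gaussian bump on every observation and average the bumps. The function name, the evaluation grid, and the bandwidth of 0.3 are assumptions chosen for illustration; in practice the bandwidth is usually tuned rather than fixed by hand.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance between every grid point and every sample
    u = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    # Average of the kernels, rescaled so the curve integrates to one
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

# Synthetic two-component sample: a log-normal part and a normal part
rng = np.random.RandomState(17)
samples = np.concatenate([
    rng.lognormal(0, 0.3, 1000),
    rng.normal(3, 1, 1000),
])

grid = np.linspace(-3, 9, 600)
density = gaussian_kde_1d(samples, grid, bandwidth=0.3)

# A probability density should enclose an area of about one
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth follows the data more closely and gives a bumpier curve, while a larger one smooths the two components together.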
# importing the libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV
def generate_data(seed=17):
    # Fix the seed to reproduce the results
    rand = np.random.RandomState(seed)
    x = []
    # first component: log-normal samples
    dat = rand.lognormal(0, 0.3, 1000)
    x = np.concatenate((x, dat))
    # second component: normal samples centred at 3
    dat = rand.normal(3, 1, 1000)
    x = np.concatenate((x, dat))
    return x
x_train = generate_data()[:, np.newaxis]
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
plt.subplot(121)
plt.scatter(np.arange(len(x_train)), x_train, c='red')
plt.xlabel('Sample no.')
plt.ylabel('Value')
plt.title('Scatter plot')
plt.subplot(122)
plt.hist(x_train, bins=50)