Machine Learning With Python
Machine Learning
One of the fields of study gaining the most popularity within computer science is machine learning. Many of the services we use in our daily lives, like Google, Gmail, Netflix, Spotify, or Amazon, rely on the tools that Machine Learning provides to deliver an increasingly personalized service and thus gain competitive advantages over their rivals.
But what exactly is Machine Learning? Machine Learning is the design and study of computer tools that use past experience to make future decisions; it is the study of programs that can learn from data. The fundamental objective of Machine Learning is to generalize, that is, to induce an unknown rule from examples where that rule is applied. The most typical example of Machine Learning in action is the filtering of junk email, or spam. By observing thousands of emails that have previously been marked as junk, spam filters learn to classify new messages.
Machine Learning combines concepts and techniques from different areas of knowledge, such as mathematics, statistics, and computer science; for this reason, there are many ways to learn the discipline. The main types of learning are:
Supervised learning
Unsupervised learning
Reinforcement learning
The first technique, the simple train/test split, divides our dataset into one or more training subsets and a separate evaluation set. That is, we do not give all of our data to the algorithm during training; instead, we hold back part of the data to evaluate the effectiveness of the model. What we seek with this is to prevent the same data we use for training from being the data we use for evaluation. In this way we can analyze more precisely how the model behaves as we give it more training data, and detect the critical point at which the model stops generalizing and begins to overfit the training data.
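As a quick illustration, here is a minimal sketch of this kind of split using scikit-learn's train_test_split helper; the iris dataset simply stands in for our own data:

```python
# A minimal sketch of a train/test split with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back 30% of the data for evaluation; the model never sees it
# during training, so a score on X_test estimates generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```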
Cross-validation is a more sophisticated procedure than the previous one. Instead of just obtaining a single estimate of generalization effectiveness, the idea is to conduct a statistical analysis and obtain other measures of estimated performance, such as the mean and variance, and thus understand how performance is expected to vary across different datasets. This variation is fundamental for assessing our confidence in the performance estimate. Cross-validation also makes better use of a limited dataset; unlike a simple split of the data into a training set and an evaluation set, cross-validation calculates its estimates over the entire dataset by performing multiple splits and systematically exchanging training data and evaluation data.
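A minimal sketch of this idea with scikit-learn's cross_val_score; the logistic regression model and the iris data are placeholders for any model and dataset:

```python
# 5-fold cross-validation: every sample is used for both training and
# evaluation across the five systematic splits.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# The mean estimates performance; the spread across folds tells us how
# much that estimate is expected to vary with different datasets.
print(scores.mean(), scores.std())
```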
Collect the data. We can collect data from many sources: for example, by extracting it from a website, obtaining it through an API, or reading it from a database. We can also use devices that collect the data for us, or use data that is publicly available. The number of options we have for collecting data is endless! This step seems obvious, but it is one of the steps that brings the most complications and takes the most time.
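A small sketch of two common collection paths; the CSV URL points at the public seaborn-data copy of the iris dataset, while the database file and table names below are hypothetical placeholders:

```python
import sqlite3

import pandas as pd

# From a publicly available dataset (CSV over HTTP):
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)
print(df.head())

# Or from a local database via a SQL query (hypothetical file and table):
# conn = sqlite3.connect("mydata.db")
# df = pd.read_sql_query("SELECT * FROM measurements", conn)
```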
Preprocess the data. Once we have the data, we need to make sure it is in the correct format to feed our learning algorithm. Performing several preprocessing tasks before we can use the data is practically inevitable. Even so, this step is usually much simpler than the previous one.
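For example, a minimal preprocessing sketch with pandas and scikit-learn; the column names here are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A tiny example table with a missing value (hypothetical columns).
df = pd.DataFrame({"age": [25, None, 40],
                   "income": [30000.0, 52000.0, 61000.0]})

# Fill missing values and scale the features so the learning algorithm
# receives data in the format it expects.
df["age"] = df["age"].fillna(df["age"].mean())
X = StandardScaler().fit_transform(df[["age", "income"]])
print(X)
```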
Explore the data. Once we have the data in the correct format, we can perform a preliminary analysis to correct cases of missing values, or try to find at first glance some pattern that facilitates the construction of the model. At this stage, statistical measures and 2- and 3-dimensional plots are very useful for getting a visual idea of how our data behave. Here we can detect outliers that we should discard, or find the features that have the most influence when making a prediction.
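A sketch of such a preliminary exploration with pandas and matplotlib, again using the iris dataset as a stand-in for our own data:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame

print(df.describe())    # summary statistics per feature
print(df.isna().sum())  # count of missing values per column

# A 2-D scatter plot can reveal patterns, influential features, and outliers.
df.plot.scatter(x="sepal length (cm)", y="petal length (cm)",
                c="target", colormap="viridis")
plt.show()
```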
Use the model. In this last stage, we put our model to work on the real problem. Here we can also measure its performance, which may force us to revisit all the previous steps.
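One common way to do this is to persist the fitted model and reload it later to predict on new samples; a minimal sketch with joblib (the file name is arbitrary):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Fit a model on the full dataset and save it to disk.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "model.joblib")

# Later (e.g., in production), reload it and predict on a new sample.
model = joblib.load("model.joblib")
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))
```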
As I always like to mention, one of the great advantages that Python offers over other programming languages is how large and prolific its developer community is; a community that has contributed a great variety of first-class libraries that extend the functionality of the language. For Machine Learning, the main libraries we can use are:
Scikit-Learn
Scikit-learn is the main library available for working with Machine Learning. It includes implementations of a large number of learning algorithms. We can use it for classification, feature extraction, regression, clustering, dimensionality reduction, model selection, and preprocessing. It has an API that is consistent across all models, and it integrates very well with the rest of the scientific packages that Python offers. The library also facilitates evaluation, diagnostics, and cross-validation tasks, since it provides several helper functions for carrying them out very simply.
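The consistent API is easy to see in practice: very different algorithms all expose the same fit / predict / score methods. A small sketch, with the iris data as a placeholder:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=0
)

# Two unrelated algorithms, one identical interface.
for model in (DecisionTreeClassifier(), SVC()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```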
Statsmodels
Statsmodels is another great library, focused on statistical models, which is used mainly for predictive and exploratory analysis. Like Scikit-learn, it also integrates very well with the other scientific packages in Python. If we want to fit linear models, conduct statistical analyses, or perhaps do a bit of predictive modeling, then Statsmodels is the ideal library. The statistical tests it offers are quite broad and cover validation tasks for most cases.
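A minimal sketch of fitting an ordinary least squares linear model with Statsmodels, using synthetic data generated just for the example:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

X = sm.add_constant(x)       # add the intercept term
results = sm.OLS(y, X).fit()
print(results.summary())     # coefficients, p-values, confidence intervals
```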
PyMC
PyMC is a library for probabilistic programming and Bayesian statistical modeling, built around Markov chain Monte Carlo (MCMC) sampling methods.
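A minimal Bayesian sketch, assuming PyMC v4 or later (where the package is imported as pymc): estimate the mean of some observed data by sampling from the posterior.

```python
import numpy as np
import pymc as pm

# Synthetic observations with a true mean of 3.0.
data = np.random.default_rng(0).normal(loc=3.0, size=50)

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)          # prior over the mean
    pm.Normal("obs", mu=mu, sigma=1.0, observed=data)  # likelihood
    idata = pm.sample(1000)                            # MCMC posterior samples

print(idata.posterior["mu"].mean())  # posterior estimate of the mean
```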
NLTK
NLTK is the leading library for natural language processing (NLP). It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet, along with a suite of text processing libraries for classification, tokenization, tagging, parsing, and semantic reasoning.
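A short sketch of tokenization, part-of-speech tagging, and a WordNet lookup; note that the exact resource names to download can vary with the NLTK version:

```python
import nltk

# One-time downloads of the required data (names may differ by version).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")
from nltk.corpus import wordnet

tokens = nltk.word_tokenize("Machine learning lets programs learn from data.")
print(nltk.pos_tag(tokens))                      # part-of-speech tagging
print(wordnet.synsets("learn")[0].definition())  # lexical resource lookup
```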
Obviously, here I am only listing a few of the many libraries that exist in Python for working on Machine Learning problems; I invite you to do your own research on the topic.