Francisco Villarreal-Valderrama
Dec 15, 2021
Introduction to data-science tools
Data science is an interdisciplinary approach to extracting
knowledge from large volumes of noisy, structured, and unstructured
data. It encompasses preparing data for analysis and processing,
performing advanced data analysis, and presenting the results to
reveal patterns.
The process of data mining and analysis applies
mathematics, statistics, computer science, information science,
and domain knowledge to build stories that clearly convey the
meaning of the results to decision-makers and stakeholders at
every level of technical knowledge and understanding. This is the
role of the data scientist: someone who writes programming code
and combines it with statistical knowledge to explain how the
obtained results can be used to solve business problems.
As a scientific field, data science unifies scientific methods,
processes, algorithms, and systems into a set of tools based on
statistics, data analysis, and informatics. Data science is
closely related to data mining, machine learning, and big data. The
most common tools include:
Linear algorithms
Linear regression
It creates numerical predictions using the best linear fit of a
data set. The resulting model is easy to understand and shows the
biggest drivers of the results. Nonetheless, it can be too simple to
capture more complex relationships among the variables.
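For illustration, a minimal sketch of fitting a linear regression with scikit-learn follows; the data set and coefficients below are made up only to show the workflow.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: one feature with a roughly linear relationship to the target
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, size=100)

model = LinearRegression()
model.fit(X, y)

# The fitted slope and intercept show the biggest drivers of the prediction
print(model.coef_, model.intercept_)
print(model.predict([[4.0]]))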
Logistic regression
This is an adaptation of linear regression to classification problems.
Similarly, it is easy to understand but not powerful enough to handle
complex relationships between the variables.
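A minimal sketch with scikit-learn, again on made-up data, shows how the same fitting interface adapts to a yes/no classification problem.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary labels: class 1 when the two features add up to a positive value
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression()
clf.fit(X, y)

# Predicted class and class probabilities for a new sample
print(clf.predict([[0.5, 0.5]]))
print(clf.predict_proba([[0.5, 0.5]]))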
Principal Component Analysis
It is a data-compression tool based on the correlation among the
data variables. Its applications include anomaly detection and
prediction. It’s often combined with other tools to yield better
results.
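As a rough sketch, scikit-learn's PCA can compress a set of correlated features into a few principal components; the correlated data below is fabricated for the example.

import numpy as np
from sklearn.decomposition import PCA

# Build 5 features that are linear mixtures of 2 underlying signals
rng = np.random.default_rng(2)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

# Compress the 5 correlated features down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the original variance captured by each component
print(pca.explained_variance_ratio_)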
Tree-based
Decision tree
This algorithm consists of a series of yes/no rules based on the
data features, forming a decision tree that covers all the possible
outcomes of the process. It’s an easy-to-understand algorithm but
can become very large when handling complex data sets.
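A short sketch using scikit-learn's built-in iris data set shows how the learned yes/no rules can be printed and inspected; the depth limit here is an arbitrary choice for readability.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree on the classic iris data set
data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)

# Print the learned yes/no rules, which keeps the model easy to understand
print(export_text(tree, feature_names=list(data.feature_names)))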
Random forest
It combines many decision trees, each with rules learned from the
data itself, into a single, more powerful predictor with better
overall performance. It tends to give high-quality results at the
cost of large models that are hard to interpret.
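The sketch below, using a standard scikit-learn data set, combines two hundred trees into one ensemble; the number of trees is an illustrative value, not a tuned setting.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out part of the data to measure overall performance
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree learns its own rules from a random sample of the data, then they vote
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))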
Gradient boosting
It uses a sequence of simpler decision trees, each one increasingly
focused on the examples the previous trees got wrong. It is a
high-performance tool but gives very case-specific results. That is,
a small change in the feature set can create radical changes in the
model.
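A minimal sketch with scikit-learn's GradientBoostingClassifier illustrates the idea on the same kind of standard data set; the hyperparameters are illustrative, not tuned.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each shallow tree is fit to correct the errors left by the previous trees
booster = GradientBoostingClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
booster.fit(X_train, y_train)
print(booster.score(X_test, y_test))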
Neural networks
General neural network models
It consists of interconnected neurons that pass messages to each
other, with layers of neurons stacked on top of one another. These
models can handle extremely complex tasks but are very slow to
train and often have a complex architecture. Neural network models
stand out in image recognition and classification problems.
Nonetheless, their use as predictors is limited, since it is very
hard to understand how they arrive at their outcomes.
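As a rough sketch, scikit-learn's MLPClassifier stacks layers of neurons for a small image-classification task; the layer sizes and iteration limit are arbitrary example values.

from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Small image-recognition task: 8x8 pixel images of handwritten digits
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two stacked hidden layers of neurons; training is slower than the models above
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print(net.score(X_test, y_test))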