Unit 5
CPU vs GPU vs TPU
The CPU is a general-purpose processor that handles the logic, calculations, and
input/output of the computer. The GPU is an additional processor that enhances graphics
rendering and runs highly parallel, compute-intensive tasks. TPUs are powerful
custom-built processors designed to run projects built on a specific framework, i.e.
TensorFlow.
CPU: Central Processing Unit. Manages all the functions of a computer.
GPU: Graphics Processing Unit. Enhances the graphical performance of the computer.
TPU: Tensor Processing Unit. Custom-built ASIC (application-specific integrated
circuit) that accelerates TensorFlow projects.
CPU vs GPU vs TPU
CPU                             | GPU                        | TPU
Several cores                   | Thousands of cores         | Matrix-based workload
Low latency                     | High data throughput       | High latency
Serial processing               | Massive parallel computing | High data throughput
Limited simultaneous operations | Limited multitasking       | Suited for large batch sizes
Large memory capacity           | Low memory                 | Complex neural network models
REFER DEEPSPHERE FOR FOLLOWING TOPIC
COMPUTER VISION HANDS ON LAB WORK - BUILD, TEST AND DEPLOY ML MODELS
(CONSUMER 1)
DATA ENGINEERING
Data engineering refers to the building of systems that enable the collection and usage
of data. This data is usually used for subsequent analysis and data science, which
often involves machine learning.
What Is Data Engineering?
Data engineering is the process of designing and building systems that let people collect
and analyze raw data from multiple sources and formats. These systems empower
people to find practical applications of the data, which businesses can use to thrive.
Data Engineering Process:
Data engineering is a skill that is in increasing demand. Data engineers are the people
who design the system that unifies data and can help you navigate it. Data engineers
perform many different tasks including:
Acquisition: Finding all the different data sets around the business
Cleansing: Finding and cleaning any errors in the data
Conversion: Giving all the data a common format
Disambiguation: Interpreting data that could be interpreted in multiple ways
Deduplication: Removing duplicate copies of data
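The cleansing, conversion, and deduplication tasks listed above can be sketched in plain Python. This is only a minimal illustration: the records, field names, and the two date formats are invented for the example, and a real pipeline would normally rely on a dedicated library or ETL tool.

```python
from datetime import datetime

# Hypothetical raw records gathered from two systems (acquisition).
raw_records = [
    {"name": " Alice ", "signup": "2023-01-05"},
    {"name": "Bob", "signup": "05/02/2023"},    # different format: day/month/year
    {"name": "alice", "signup": "2023-01-05"},  # duplicate of the first record
    {"name": "", "signup": "2023-03-01"},       # missing name: an error to cleanse
]

def parse_date(text):
    """Conversion: try each known format and return an ISO date string."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def clean(records):
    """Cleanse, convert, and deduplicate a list of record dictionaries."""
    seen = set()
    result = []
    for record in records:
        name = record["name"].strip().title()  # cleansing: trim, normalize case
        if not name:
            continue                           # cleansing: drop rows with no name
        row = (name, parse_date(record["signup"]))
        if row in seen:
            continue                           # deduplication
        seen.add(row)
        result.append({"name": row[0], "signup": row[1]})
    return result

cleaned = clean(raw_records)
print(cleaned)
```

Here two records survive: the duplicate "alice" row and the row without a name are dropped, and both dates end up in the same ISO format.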
Data Engineering Tools
Data engineers use many different tools to work with data. They use a specialized skill
set to create end-to-end data pipelines that move data from source systems to target
destinations.
Data engineers work with a variety of tools and technologies, including:
ETL Tools: ETL (extract, transform, load) tools move data between systems. They
access data, then apply rules to “transform” the data through steps that make it more
suitable for analysis.
SQL: Structured Query Language (SQL) is the standard language for querying relational
databases.
Python: Python is a general-purpose programming language. Data engineers may choose to
use Python for ETL tasks.
Cloud Data Storage: Including Amazon S3, Azure Data Lake Storage (ADLS), Google
Cloud Storage, etc.
Query Engines: Engines run queries against data to return answers. Data engineers
may work with engines like Dremio Sonar, Spark, Flink, and others.
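To make the ETL, SQL, and Python items above concrete, here is a small sketch that uses only the Python standard library (csv and sqlite3). The CSV content, table name, and column names are assumptions made up for the example; an in-memory SQLite database stands in for a real target system.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or an API response.
SOURCE_CSV = """order_id,amount,currency
1,10.50,usd
2,7.25,USD
3,,usd
"""

def extract(text):
    """Extract: read rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount, convert types, normalize currency."""
    return [
        (int(r["order_id"]), float(r["amount"]), r["currency"].upper())
        for r in rows
        if r["amount"]
    ]

def load(rows, conn):
    """Load: write the transformed rows into a relational table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)

# SQL queries against the loaded data.
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE currency = 'USD'"
).fetchone()[0]
print(row_count, total)
```

Keeping extract, transform, and load as separate functions mirrors the three ETL stages, so each stage can be replaced or tested on its own.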
DATA PIPELINE
A data pipeline is a method in which raw data is ingested from various data
sources and then moved to a data store, like a data lake or data warehouse, for analysis.
Before data flows into a data repository, it usually undergoes some data processing.
Data Pipeline Architecture
A data pipeline architecture provides a complete blueprint of the processes and
technologies used to replicate data from a source to a destination system, including data
extraction, transformation, and loading. A common data pipeline architecture
includes data integration tools, data governance and quality tools, and data visualization
tools. A data pipeline architecture aims to enable efficient and reliable movement of
data from source systems to target systems while ensuring that the data is accurate,
complete, and consistent.
Types of Data Pipelines
Batch: Batch processing of data is leveraged when businesses want to move high
volumes of data at regular intervals. Batch processing jobs will typically run on a
fixed schedule (for example, every 24 hours), or in some cases, once the volume
of data reaches a specific threshold.
Real-time: Real-time pipelines are optimized to process the necessary data in
real time, i.e., as soon as it is generated at the source. Real-time processing is
useful when processing data from a streaming source, such as data from financial
markets or telemetry from connected devices.
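The difference between the two pipeline types can be sketched with a Python generator standing in for the data source. The readings, the alert rule, and the batch size below are assumptions chosen for illustration; a production pipeline would trigger batches from a scheduler or a streaming framework.

```python
def stream_source():
    """Hypothetical source that yields sensor readings one at a time."""
    for value in [3, 8, 5, 12, 7]:
        yield value

def process_real_time(source, handle):
    """Real-time: handle each record as soon as it arrives."""
    for record in source:
        handle(record)

def process_batch(source, batch_size, handle_batch):
    """Batch: buffer records and process them once a volume threshold is reached."""
    buffer = []
    for record in source:
        buffer.append(record)
        if len(buffer) >= batch_size:
            handle_batch(buffer)
            buffer = []
    if buffer:
        handle_batch(buffer)  # flush the final, partial batch

def alert_if_high(reading):
    """Record an alert immediately when a reading exceeds 10."""
    if reading > 10:
        alerts.append(reading)

alerts = []
process_real_time(stream_source(), alert_if_high)

batch_totals = []
process_batch(stream_source(), 2, lambda batch: batch_totals.append(sum(batch)))

print(alerts)
print(batch_totals)
```

The real-time path reacts to the reading 12 the moment it appears, while the batch path only produces results after every two readings (plus one final flush).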
MODEL SELECTION
Model selection is the process of selecting the best model from all the available
models for a particular business problem on the basis of different criteria such as
robustness and model complexity.
Types of Machine Learning Models
Machine learning models come in many versions, just like there are plenty of different
machine learning classifications. Of course, not everyone agrees on the exact number or
breakdown of machine learning models, but we’re presenting two of the most common
summaries.
For starters, some people split machine learning models into three types:
Supervised Learning
Data sets include their desired outputs or labels so that a function can calculate an
error for any given prediction. The supervision part comes into play when a prediction
is created, and an error is produced to change the function and learn the mapping.
Supervised learning’s goal is to create a function that effectively generalizes over data it
has never seen.
Unsupervised Learning
There are cases where a data set doesn’t have the desired output, so there’s no means of
supervising the function. Instead, the process tries to segment the data set into “classes”
so that each class has a segment of the data set with common features. Unsupervised
learning aims to build a mapping function that classifies data based on features found
within the data.
Reinforcement Learning
With reinforcement learning, the algorithm tries to learn actions for a given set of states
that lead to a goal state. Thus, errors aren’t flagged after each example but rather on
receiving a reinforcement signal, like reaching the goal state. This process closely
resembles human learning, where feedback isn’t provided for every action, only when
the situation calls for a reward.
Alternatively, we can break down machine learning models into five types. This
approach gives a more specific and in-depth look at machine learning characteristics.
Classification Models
Classification predicts the class or type of an object according to a finite number of
options. The classification output variable is always a category. For example, is this
email spam or not?
Regression Models
Regression is a problem type where output variables can assume continuous values. For
example, predicting the per-barrel price of oil on the commodity market is a standard
regression task. Common regression models include:
o Decision Trees
o Random Forests
o Linear Regression
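As an illustration of the last item, linear regression with a single input variable can be fitted with the ordinary least squares formulas. The data points below are made up so that they follow y = 2x + 1 exactly; real data would be noisy.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that follows y = 2x + 1 exactly.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # continuous-valued output for an unseen input
print(slope, intercept, prediction)
```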
Clustering
This model involves gathering similar objects into groups. This process identifies
similar objects automatically, without human intervention. Supervised machine learning
models, which must be trained on labeled or manually curated data, work best with
homogeneous data, and clustering provides a smarter way to obtain it.
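One common clustering algorithm is k-means. The sketch below runs it on one-dimensional data to keep the code short; the points, the starting centers, and the fixed iteration count are assumptions for the example, not a recommendation for real data.

```python
def k_means_1d(points, centers, iterations=10):
    """Tiny k-means on one-dimensional data, starting from the given centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each point in the group of its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            sum(g) / len(g) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, groups = k_means_1d(points, centers=[0.0, 5.0])
print(centers)
print(groups)
```

No labels are supplied anywhere: the two groups emerge only from the distances between the points.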
Dimensionality Reduction
Sometimes, the number of possible variables in real-world data sets is too high, which
leads to problems. Not all of those variables contribute significantly to the goal.
Thus, we turn to dimensionality reduction, which preserves most of the variance in the
data with a smaller number of variables.
Deep Learning
This machine learning type involves neural networks. Neural networks are networks of
mathematical equations. The network takes input variables, runs them through the
equations, and produces output variables. The most significant deep learning models
are:
o Autoencoders
o Boltzmann Machines
o Convolutional Neural Networks
o Multi-layer Perceptrons
o Recurrent Neural Networks
Techniques for Model Selection:
Random train/test split: This is a resampling method. The model is evaluated on how
well it generalizes and predicts on an unseen set of data. The data points are
sampled without replacement: the data is split into a train set and a test set,
and no point appears in both. The model that performs best on the test set is
selected as the best model.
Cross validation: This is a very popular resampling method for model selection.
Candidate models are trained and evaluated on multiple resampled train and test
sets that are exclusive of each other. The data points are sampled without
replacement. The model performance across these iterations is averaged to
estimate the overall model performance. Examples include K-fold cross validation
and leave-one-out cross validation.
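Both techniques can be sketched with the standard library alone. The helper names, the 25% test fraction, the fixed seed, and the toy "predict the training mean" model are all assumptions for illustration; libraries such as scikit-learn provide production versions of the same ideas.

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and split it; each point lands in exactly one set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validate(data, k, fit, score):
    """Average the score of a model over k train/test folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        model = fit([data[i] for i in train_idx])
        scores.append(score(model, [data[i] for i in test_idx]))
    return sum(scores) / len(scores)

def fit_mean(train):
    """Toy model: always predict the mean of the training values."""
    return sum(train) / len(train)

def mean_absolute_error(model, test):
    """Score: average absolute difference between prediction and test values."""
    return sum(abs(x - model) for x in test) / len(test)

data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
train, test = train_test_split(data)
cv_error = cross_validate(data, k=3, fit=fit_mean, score=mean_absolute_error)
print(len(train), len(test), cv_error)
```

Setting k equal to the number of data points turns the same function into leave-one-out cross validation.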
Model Engineering
Model Training - The process of applying the machine learning algorithm on
training data to train an ML model. It also includes feature engineering and the
hyperparameter tuning for the model training activity.
Model Engineering for Machine Learning
Model engineering is the pre-processing step of machine learning, which is used to
transform raw data into features that can be used for creating a predictive model using
machine learning or statistical modelling. Feature engineering in machine learning aims
to improve the performance of models.
These processes are described below:
o Data Preparation: The first step is data preparation. In this step, raw data
acquired from different sources is converted into a suitable format so that it
can be used in the ML model. Data preparation may include data cleaning,
delivery, augmentation, fusion, ingestion, or loading.
o Exploratory Analysis: Exploratory analysis, or exploratory data analysis (EDA),
is an important step of feature engineering, mainly used by data scientists.
This step involves analyzing and investigating the data set and summarizing the
main characteristics of the data. Different data visualization techniques are used
to better understand the manipulation of data sources, to find the most
appropriate statistical technique for data analysis, and to select the best features
for the data.
o Benchmark: Benchmarking is the process of setting a standard baseline for
accuracy and comparing all the variables against this baseline. The benchmarking
process is used to improve the predictability of the model and reduce the error
rate.
Model Engineering Techniques
1. Imputation
Feature engineering deals with inappropriate data, missing values, human interruption,
general errors, insufficient data sources, etc. Missing values within the dataset
strongly affect the performance of the algorithm, and the "imputation" technique is
used to deal with them. Imputation handles such irregularities by filling in missing
entries with substitute values, for example the mean or median of a numeric column.
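A minimal sketch of imputation, assuming missing entries are represented by None and that replacing them with the column mean is acceptable (median or mode are common alternatives):

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [20, None, 30, 40, None]  # made-up column with two missing entries
imputed = impute_mean(ages)
print(imputed)
```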
2. Handling Outliers
Outliers are deviating values or data points that lie so far away from the other
data points that they badly affect the performance of the model. Outliers can be
handled with this feature engineering technique, which first identifies the outliers
and then removes them.
Standard deviation can be used to identify outliers. For example, each value lies at
some distance from the mean, and a value whose distance exceeds a chosen threshold
(for instance, a multiple of the standard deviation) can be considered an outlier.
The z-score can also be used to detect outliers.
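The z-score approach can be sketched as follows. The z-score of a value is its distance from the mean divided by the standard deviation. The threshold of 2.0 and the sample data are assumptions for the example; a threshold of 3 is also common on larger data sets.

```python
import statistics

def remove_outliers(values, threshold=2.0):
    """Drop values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return list(values)          # all values identical: nothing to remove
    return [v for v in values if abs((v - mean) / std) <= threshold]

data = [10, 12, 11, 13, 12, 11, 10, 95]  # 95 is far from the other values
filtered = remove_outliers(data)
print(filtered)
```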
3. Log transform
Logarithm transformation, or log transform, is one of the commonly used mathematical
techniques in machine learning. Log transform helps in handling skewed data, and it
makes the distribution closer to normal after transformation. It also reduces the
effect of outliers on the data: because magnitude differences are normalized, the
model becomes much more robust.
Note: Log transformation is only applicable to positive values; otherwise, it will
give an error. To avoid this for data that contains zeros, we can add 1 to the data
before transformation, i.e. compute log(x + 1), which ensures the value passed to the
logarithm is positive for any non-negative x.
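A short sketch of the log(x + 1) variant described in the note, using math.log1p from the standard library. The income values are made up to show a right-skewed column.

```python
import math

def log_transform(values):
    """Apply log(1 + x) to each value; requires x >= 0."""
    if any(v < 0 for v in values):
        raise ValueError("this log(1 + x) transform expects non-negative values")
    return [math.log1p(v) for v in values]

incomes = [0, 9, 99, 999]  # made-up, right-skewed values
transformed = log_transform(incomes)
print([round(t, 3) for t in transformed])
```

The raw values span three orders of magnitude, while the transformed values are roughly evenly spaced, which is the "normalization of magnitude differences" mentioned above.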
4. Binning
In machine learning, overfitting is one of the main issues that degrade the performance
of the model; it occurs due to a large number of parameters and noisy data.
However, one of the popular techniques of feature engineering, "binning", can be used
to normalize the noisy data. This process involves segmenting the values of a feature
into bins.
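Binning can be sketched as mapping each numeric value to the label of the interval it falls into. The age boundaries and labels below are hypothetical and chosen only to illustrate the idea.

```python
def bin_value(value, edges, labels):
    """Return the label of the first bin whose upper edge is above the value."""
    for upper, label in zip(edges, labels):
        if value < upper:
            return label
    return labels[-1]  # values at or above the last edge fall into the final bin

# Hypothetical age bins: below 18 -> child, 18 to 64 -> adult, 65 and above -> senior.
edges = [18, 65]
labels = ["child", "adult", "senior"]

ages = [5, 17, 18, 42, 65, 80]
binned = [bin_value(a, edges, labels) for a in ages]
print(binned)
```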
5. Feature Split
As the name suggests, feature split is the process of splitting a feature into two or
more parts and using those parts to make new features. This technique helps the
algorithms to better understand and learn the patterns in the dataset.
The feature splitting process enables the new features to be clustered and binned,
which results in extracting useful information and improving the performance of the
data models.
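As a small illustration of feature split, a single timestamp string can be broken into several numeric features. The 'YYYY-MM-DD HH:MM' format and the feature names are assumptions made for this example.

```python
def split_timestamp(timestamp):
    """Split a 'YYYY-MM-DD HH:MM' string into separate numeric features."""
    date_part, time_part = timestamp.split(" ")
    year, month, day = (int(x) for x in date_part.split("-"))
    hour, minute = (int(x) for x in time_part.split(":"))
    return {"year": year, "month": month, "day": day, "hour": hour, "minute": minute}

features = split_timestamp("2023-07-14 18:30")
print(features)
```

Once split, a feature such as "hour" can be binned or clustered on its own, which is not possible while it is buried inside the original string.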
MODEL OUTCOME
A model is the output of the training process and is considered a mathematical
representation of a real-world process. Data scientists train a model over a set of
data, giving it the required algorithm to reason over and learn from the data.
MODEL PIPELINE
Computer Vision systems are composed of two main components: 1) a sensing device
and 2) an interpreting device.
Applications of computer vision vary, but a typical vision system uses a similar
sequence of distinct steps to process and analyze image data. These are referred to as a
vision pipeline. Many vision applications start off by acquiring images and data, then
processing that data, performing some analysis and recognition steps, and finally making
a prediction based on the extracted information.
Figure 2
Here’s how the image flows through the classification pipeline:
1. First, a computer receives visual input from an imaging device like a camera. This is
typically captured as an image or a sequence of images forming a video.
2. Each image is sent through some pre-processing steps whose purpose is to
standardize each image. Common pre-processing steps include resizing an image,
blurring, rotating, changing its shape, or transforming the image from one color
space to another, for example from color to grayscale. Only by standardizing each
image, for example by making them all the same size, can you then compare them and
further analyze them in the same way.
3. Next, we extract features. Features are what help us define certain objects, and
they’re usually information about object shape or color. For example, some features
distinguish the shape of a motorcycle’s wheel, headlights, mudguards, and so on.
The output of this process is a feature vector, which is a list of unique shapes that
identify the object.
4. Finally, these features are fed into a classification model. This step looks at the
feature vector from the previous step and predicts the class of the image. Pretend
that you’re the classifier model and let’s go through the classification process: you
look at the features in the feature vector one by one and try to determine what’s in
the image.
a. First, you see a feature of a wheel. Could this be a car, a motorcycle, or a dog?
Clearly it isn’t a dog, because dogs don’t have wheels (at least normal dogs, not
robots!). So this could be an image of a car or a motorcycle.
b. Then you move on to the next feature, “the headlights.” There is a higher
probability that this is a motorcycle than a regular car.
c. The next feature is “rear mudguard.” Again, there’s a higher probability it’s a
motorcycle.
d. The object has only two wheels. Hmm, this is closer to a motorcycle.
e. And you keep going through all the features, like the body shape, pedal, etc., until
you have a better guess of the object in the image.
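The feature-by-feature reasoning in step 4 can be mimicked with a toy scoring function. This is only a hand-written illustration: the feature names and the table of which features belong to which class are invented here, whereas a real classifier learns such associations from training images.

```python
# Hypothetical knowledge base: which features each class typically shows.
CLASS_FEATURES = {
    "car": {"wheel", "headlights", "windshield", "four wheels"},
    "motorcycle": {"wheel", "headlights", "rear mudguard", "two wheels", "pedal"},
    "dog": {"tail", "fur", "four legs"},
}

def classify(feature_vector):
    """Score each class by how many observed features it matches; return the best."""
    scores = {
        label: sum(1 for f in feature_vector if f in known)
        for label, known in CLASS_FEATURES.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Feature vector as it might come out of the feature extraction step.
feature_vector = ["wheel", "headlights", "rear mudguard", "two wheels"]
label, scores = classify(feature_vector)
print(label, scores)
```

Each additional feature raises the score of the classes it is consistent with, so the guess sharpens as more of the feature vector is examined.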
MODEL OPTIMIZATION
Machine learning optimization is the process of adjusting a model’s parameters and
hyperparameters in order to minimize the cost function by using one of the
optimization techniques. It is important to minimize the cost function because it
describes the discrepancy between the true value of the quantity being estimated and
what the model has predicted.
Optimization Techniques
Exhaustive search
Exhaustive search, or brute-force search, is the process of looking for the optimal
hyperparameters by checking whether each candidate is a good match. You do the
same thing when you forget the code for your bike’s lock and try out all the possible
combinations. In machine learning, we do the same thing, but the number of options is
usually quite large.
The exhaustive search method is simple. For example, if you are working with a
k-means algorithm, you will manually search for the right number of clusters. However,
if there are hundreds or thousands of options to consider, the search becomes
unbearably slow. This makes brute-force search inefficient in the majority of
real-life cases.
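A sketch of exhaustive search over a single hyperparameter. Instead of the number of clusters in k-means, the example tunes k in a tiny k-nearest-neighbors classifier, because its quality is easy to measure as accuracy on a validation set. The data, the candidate values, and the helper names are all assumptions for illustration.

```python
def knn_predict(train, x, k):
    """Predict the majority label (0 or 1) among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, validation, k):
    """Fraction of validation points that are predicted correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in validation)
    return correct / len(validation)

# Made-up 1-D data as (feature, label); (2.6, 1) is a deliberately noisy label.
train = [(1.0, 0), (2.0, 0), (2.6, 1), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
validation = [(2.5, 0), (3.2, 0), (6.5, 1), (5.0, 1)]

candidates = [1, 3, 5]  # every value of the hyperparameter k we are willing to try
results = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(results, key=results.get)
print(results, best_k)
```

Every candidate is evaluated once, so the cost grows with the number of candidates; with several hyperparameters the candidate combinations multiply, which is why this approach stops scaling.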
Gradient descent
Gradient descent is the most common algorithm for model optimization, used to
minimize the error. To perform gradient descent, you iterate over the training
dataset while re-adjusting the model.
Your goal is to minimize the cost function because it means you get the smallest
possible error and improve the accuracy of the model.
Note: In gradient descent, the learning rate sets how far each step moves. If you
choose a learning rate that is too large, the algorithm will jump around without
getting closer to the right answer. If it’s too small, the computation will need so
many steps that it starts to resemble exhaustive search, which is, of course,
inefficient.
So you have to choose the learning rate very carefully. If done right, gradient descent
becomes a computation-efficient and rather quick method to optimize models.
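The effect of the learning rate can be seen on a toy cost function. The sketch below minimizes cost(x) = (x - 3)^2, whose gradient is 2(x - 3); the function, the starting point, and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(gradient, start, learning_rate, steps):
    """Repeatedly move against the gradient, scaled by the learning rate."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x

def cost_gradient(x):
    """Gradient of the toy cost function cost(x) = (x - 3)^2, minimized at x = 3."""
    return 2 * (x - 3)

good = gradient_descent(cost_gradient, start=0.0, learning_rate=0.1, steps=100)
too_large = gradient_descent(cost_gradient, start=0.0, learning_rate=1.1, steps=100)
too_small = gradient_descent(cost_gradient, start=0.0, learning_rate=0.0001, steps=100)
print(good, too_large, too_small)
```

With a learning rate of 0.1 the result lands very close to 3. With 1.1 every step overshoots the minimum and the value diverges, and with 0.0001 the value has barely moved from the starting point after the same 100 steps.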
User Interface:
One significant challenge for data scientists, data analysts, and machine learning
engineers is to showcase and demo their models to non-technical personnel. That often
demands additional skills, including frontend development, backend development, and
sometimes even DevOps. Even if you are skilled in these areas, it takes a tremendous
amount of time to get the job done.
Gradio:
Gradio is an open-source Python library that allows you to build a user interface for
machine learning models and deploy it in a few lines of code. If you have worked with
Dash or Streamlit in Python before, it’s similar; however, it integrates directly with
notebooks and doesn’t require a separate Python script.
Flask – (Creating first simple application): Flask is a web application framework
written in Python. Flask is based on the Werkzeug WSGI toolkit and Jinja2 template
engine.
Django - Django is a back-end, server-side web framework. Django is free, open source,
and written in Python. Django makes it easier to build web pages using Python.