Student Guide - Module 2 Machine Learning
Module 2
Table of Contents
Learning Objectives
Machine Learning – The foundation of Artificial Intelligence
    Machine Learning
    The Need for Machine Learning
Understanding Data and Datasets
    Data and Its Utility
    Use of Data in Machine Learning
    Different Types of Datasets
    Sentiment Analysis
Design Thinking, Problem Identification, & Working with Data
    Design Thinking
    Problem Identification
Development and Understanding of the BOT Framework
    What is a BOT?
    What can a BOT do?
    The BOT Framework
    Data Labeling
Machine Learning and CHATBOTs
    Robots and Humans
    BOTS – Redefining Workplaces
    The Science behind the Generation of a BOT
    Building blocks - Program for a BOT
    Demonstration
Introduction to Supervised Machine Learning
    Major machine learning methods
    Supervised learning
    Semi-Supervised learning
    Reinforcement learning
    Supervised Machine Learning
    Classroom Activity
Introduction to Unsupervised Machine Learning
    Unsupervised learning
    Clustering
Assessments Questions
Questions to consider
Some Practical Assignments/Lab Work
Practical Assignments
Further Reading
Reference Links
Glossary
Disclaimer:
The Imagine Cup Junior guides and lesson materials are created by Microsoft and our partners
and intended to be for guidance only to support with the Imagine Cup Junior Challenge. For
the latest on Microsoft AI please visit https://www.microsoft.com/en-us/ai
Learning Objectives
Through this module, students will get an overview of machine learning and understand how it
provides the foundation of AI. Students should be able to understand the basics of machine
learning and use the concepts as applied to their daily life.
At the conclusion of the module, students should be able to:
Understand the basics of machine learning.
Comprehend the basics of using datasets and working with data.
Understand the machine learning approach to problem-solving, and devise solutions to
problems.
Comprehend the latest design-related perspectives, ideas, concepts, and solutions.
Understand the importance of data and ways to protect it.
Execute projects using design thinking principles.
Understand and analyze data related problems.
Comprehend the basics of creating a BOT and the related working framework.
Understand the various challenges of creating a BOT.
Appreciate the similarities and differences between a machine-driven BOT and a human.
Understand the concept of cluster algorithms and apply the principle of ‘clustering’ on data.
Machine Learning – The foundation of Artificial Intelligence
Although the term is often used interchangeably with artificial intelligence, machine learning has a
different meaning. It is the ‘learning’ that the machine derives from its experience in processing
data. The primary objective of machine learning is to ensure that the machine learns from the
data. In other words, machine learning is an application of artificial intelligence (AI) that
provides computer systems with the ability to automatically learn and improve from experience
without being explicitly programmed. Machine learning as a science focuses on the
development of computer programs that can access data and use it to learn for themselves.
This is sometimes known as heuristic programming.
Machine learning can also be defined as the study of computer-based algorithms that
improve automatically through experience. Machines can be built to read and understand
human language, to comprehend their surroundings, and to make predictions that are as accurate
as possible. They can also assess those predictions in real time and adapt to their environment.
For example, when a user searches for a topic, the search engine suggests the most frequently
searched related topics. The search engine looks at past clicks from people around the world in
order to understand which pages are more relevant for those searches than others. It then serves
those results as a list, with the most relevant at the top. A human could not perform such an
exercise in the space of a few seconds.
The machine learns how to handle search requests and generates a set of instructions to create
the expected outcome. Hence, machine learning can also be understood as a set of
procedures that deal with huge amounts of data smartly (using algorithms or a set of
logical rules) to derive results.
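The search example above can be sketched in a few lines of Python. This is only an illustration: the click log, the page names, and the `rank_pages` helper are made up, and a real search engine uses far more signals than click counts.

```python
from collections import Counter

# Hypothetical click log: (search query, page that the user clicked).
click_log = [
    ("machine learning", "intro-to-ml"),
    ("machine learning", "ml-course"),
    ("machine learning", "intro-to-ml"),
    ("machine learning", "ml-news"),
    ("machine learning", "intro-to-ml"),
    ("machine learning", "ml-course"),
    ("chatbots", "what-is-a-bot"),
]


def rank_pages(query, log):
    """Order the pages clicked for a query from most clicked to least clicked."""
    clicks = Counter(page for logged_query, page in log if logged_query == query)
    return [page for page, _count in clicks.most_common()]


ranking = rank_pages("machine learning", click_log)
print(ranking)
```

Every new click added to the log can change the ranking, which is the sense in which the system improves from experience rather than from rules written by hand.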
Machine Learning
By now you are familiar with artificial intelligence (AI). Machine learning is a specific
subset of AI that trains a machine how to learn. It is an application of artificial
intelligence that provides computer systems with the ability to automatically learn and
improve based on their experience without being explicitly programmed.
The process of machine learning begins with the machine analyzing observations (data), such as
examples, direct experiences, or instructions, and looking for patterns in that data. Based on this
analysis and the cumulative data it has been given, the machine learns to make better decisions in
the future. The primary goal is to allow the machine to learn automatically, without human
intervention or assistance, and to adjust its actions accordingly.
The Need for Machine Learning
The reason behind machine learning is to automate mundane tasks to the extent that the
machine can learn, think and make smart decisions on its own. It also aims to minimize human
interference, and thus bias, in various scenarios. Machine learning is needed to complete
tasks that are too complex for humans to program computers for directly. Some tasks are so
complex that it is impractical, if not impossible, for humans to cater for all the nuances and
write code for every single instance separately. Instead, a large amount of data is provided to a
machine learning algorithm, and the algorithm explores the data and
constructs a model that will achieve the desired outcome.
Machine learning is also useful for finding relationships between things, especially in
exceptionally large datasets which are too big for humans to process efficiently. Its uses here
are in object recognition, marketing analytics, analyzing scientific data in labs, and numerous
other applications that involve large amounts of data needing to be analyzed.
Fig 2.3:
The key difference between traditional programming and machine learning is that in traditional
programming, the data and the program are run on the computer to produce an output.
However, in machine learning, the data and the output are fed into a computer to create a
program. This program can be then used in the same way as one created by traditional
programming.
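The difference can be sketched in Python. This is a toy illustration only: the temperature readings are made up, and `learn_threshold` is a hypothetical helper rather than a real training algorithm.

```python
# Traditional programming: a person writes the rule (the program) by hand.
def is_hot_rule(temperature_c):
    return temperature_c >= 30  # threshold chosen by the programmer


# Machine learning (toy version): the rule is worked out from data and outputs.
def learn_threshold(temperatures, labels):
    """Return the midpoint between the warmest 'not hot' and the coolest 'hot' example."""
    hot = [t for t, is_hot in zip(temperatures, labels) if is_hot]
    not_hot = [t for t, is_hot in zip(temperatures, labels) if not is_hot]
    return (max(not_hot) + min(hot)) / 2


# Hypothetical examples: the data (readings) and the known outputs (hot or not).
temperatures = [18, 22, 25, 31, 34, 38]
labels = [False, False, False, True, True, True]

threshold = learn_threshold(temperatures, labels)


# The learned "program" is then used just like the hand-written one.
def is_hot_learned(temperature_c):
    return temperature_c >= threshold


print(threshold, is_hot_rule(29), is_hot_learned(29))
```

Note that the two rules can disagree: the hand-written rule uses the programmer's threshold, while the learned rule uses whatever threshold the examples suggest.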
A few examples of Machine Learning in our day-to-day life are:
Cortana
Refined search engine results (as represented in fig 2.4)
Understanding Data and Datasets
Data and Its Utility
Fig 2.5: This is Datum
Data
Data can be defined as the collection of facts, numbers or other information that are used
either for reference, or analysis. The singular form of Data is Datum.
In Figure 2.5, the task is to compute the overall percentage of each student’s marks. When every
subject is marked out of 100, the percentage obtained by a student is the sum of all the marks
obtained in the different subjects divided by the number of subjects. Therefore, to arrive at the
result it is important to take into account the marks obtained by each student in each subject.
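The calculation can be written as a short Python sketch. The subjects and marks below are made up, and the `percentage` function assumes every subject has the same maximum mark (100 by default).

```python
# Hypothetical marks for one student; each subject is assumed to be out of 100.
marks = {"English": 78, "Mathematics": 92, "Science": 85, "History": 65}


def percentage(marks_by_subject, max_marks_per_subject=100):
    """Total marks obtained as a percentage of the total marks available."""
    total_obtained = sum(marks_by_subject.values())
    total_available = max_marks_per_subject * len(marks_by_subject)
    return 100 * total_obtained / total_available


result = percentage(marks)
print(result)
```

Each individual mark is a datum; the percentage can only be computed once all of them are available.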
Dataset
A dataset is defined as a collection, or group, of data where every column denotes a particular
variable and each row relates to a specific member of that dataset.
This is a Dataset
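A small dataset of this kind can be sketched in Python as rows and columns. The names and marks are invented for illustration, and `column` is a hypothetical helper, not part of any library.

```python
# A tiny, made-up dataset: each column is one variable, each row is one student.
columns = ["name", "english", "mathematics", "science"]
rows = [
    ["Asha", 78, 92, 85],
    ["Ben", 64, 70, 58],
    ["Chen", 88, 81, 90],
]


def column(dataset_rows, column_names, name):
    """Return all the values of one variable (one column) across every row."""
    index = column_names.index(name)
    return [row[index] for row in dataset_rows]


maths_marks = column(rows, columns, "mathematics")
average_maths = sum(maths_marks) / len(maths_marks)
print(maths_marks, average_maths)
```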
Use of Data in Machine Learning
Data can be in the form of text, images, numbers, and even sound or video. The datasets are
analyzed to build up experience, which in turn is used to create a machine learning
program.
Image Processing
Datasets for image processing can be used for object captioning, detection and segmentation
(Maj, 2019).
A dataset of a variety of facial expressions can be used to understand an expression and caption
the image accordingly, for example as a happy or sad face.
Sentiment Analysis
While one task may be object recognition, another is recognizing human sentiment.
‘Sentiment analysis’, when put into context, can categorize human emotions by type and by
intensity. The algorithms used to analyze human sentiment are advanced and designed to
generate accurate and useful results. One example of its use is analyzing the sentiment of a
customer.
Fig 2.11: Steps of Sentiment Analysis of an Ecommerce Platform
How is this achieved? Two ‘polarity’ nodes are created: one for positive sentiment, the other
for negative sentiment. This helps to identify the sentiment a person is expressing.
The words associated with a given polarity node are then re-submitted to the algorithm for
more accurate sentiment analysis.
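The idea of two polarity nodes can be sketched in Python with two word lists. This is a deliberately simple illustration: the word lists and the `sentiment` function are made up, and real sentiment analysis systems are far more sophisticated than counting words.

```python
import string

# Hypothetical word lists standing in for the two polarity nodes.
POSITIVE_WORDS = {"good", "great", "love", "fast", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "slow", "broken"}


def sentiment(text):
    """Label text 'positive', 'negative', or 'neutral' by counting polarity words."""
    words = [word.strip(string.punctuation) for word in text.lower().split()]
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    score = positive_hits - negative_hits
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("Great phone, I love it!"))
print(sentiment("Delivery was slow and the box was broken."))
```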
Fig 2.14: Example of Sentiment Analysis of a Yammer discussion
In Fig 2.14, the AI software takes words from a micro blogging website and associates them
with a particular sentiment. This enables it to identify the overall tone of the conversation.
Video Processing
In this example, the AI software takes screenshots at regular intervals from a live video stream
and analyzes them to count the number of people who are about to get onto the bus. By
recognizing other objects, it can also calculate other parameters, such as the frequency of buses
arriving at the bus stop, or automatically identify rush hours across the day. This data
can then be used to manage public transport more efficiently.
Speech Recognition
Speech recognition can be defined as a technology which enables the recognition of the
spoken word and subsequent translation into text. The machine learns ways of identifying and
analyzing various human voices, both live and recorded, and processes them accordingly, such
as the conversion into text for dictation purposes in word-processors such as Microsoft Word.
It is also used to understand the user in voice-activated systems in cars, and it plays an
important role in assisting people with disabilities.
Internet of Things (IoT)
The Internet of Things (IoT) refers to the countless networked devices we use to
make our lives easier. These devices rely on the Internet to gather and share data from
responsible sources in order to provide ‘smart’ services. With these huge datasets and the
massive number of data sources becoming a reality, machine learning has become an integral
part of our daily lives.
Machine learning can be applied in almost all scenarios where the outcome is known. It can
however also be applied where the datasets are unknown and in situations where there are
repeated forms of the same sort of data which can be used to reinforce the machine learning.
For example, machine learning can help in understanding and analyzing the patterns of waves
and oceanic currents in order to predict future sea temperatures, monsoon patterns, and even
the potential for a cyclone or other natural disaster in a specific geographical location.
Capturing IoT and Sensor Data
The Internet of Things (IoT) is more of a concept than an actual thing. The concept is to allow
us to interpret data from networked sensors or devices in the most meaningful ways possible.
The aim is to measure, analyze, visualize, predict, and react to the data accumulated from these
sensors. Forms of IoT that most people are familiar with include smart thermostats, smart
switches, and other internet-connected devices and appliances in the home. These are generally considered
part of Consumer IoT. Then there's Industrial IoT, or IIoT. This includes things like the use of IoT
devices in smart buildings, industrial automation, and monitoring of industrial processes.
Processing IoT Data
Processing the data from connected IoT sensors takes time and involves several sub-procedures,
such as:
Standardizing or transforming the data into a uniform format to ensure it is compatible with
your application.
Creating and storing a backup of the newly transformed data.
Removing any repetitive, outdated, or unwanted data to help improve accuracy.
Integrating additional structured (or unstructured) data from other sources to help enrich
the dataset.
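The four sub-procedures above can be sketched in Python. Everything here is an assumption made for illustration: the sensor readings, the field names, the `sensor_locations` lookup, and the helper functions do not come from any particular IoT platform.

```python
import copy

# Hypothetical raw readings from two sensors that report in different units.
raw_readings = [
    {"sensor": "A", "time": 1, "temp_f": 68.0},
    {"sensor": "B", "time": 1, "temp_c": 21.5},
    {"sensor": "A", "time": 1, "temp_f": 68.0},  # repeated reading
    {"sensor": "A", "time": 2, "temp_f": 71.6},
]
# Hypothetical extra data from another source, used to enrich the readings.
sensor_locations = {"A": "kitchen", "B": "garage"}


def standardize(reading):
    """Convert a reading to one uniform format, with the temperature in Celsius."""
    if "temp_c" in reading:
        temp_c = reading["temp_c"]
    else:
        temp_c = (reading["temp_f"] - 32) * 5 / 9
    return {"sensor": reading["sensor"], "time": reading["time"], "temp_c": round(temp_c, 1)}


def remove_duplicates(readings):
    """Keep only the first reading for each (sensor, time) pair."""
    seen, unique = set(), []
    for reading in readings:
        key = (reading["sensor"], reading["time"])
        if key not in seen:
            seen.add(key)
            unique.append(reading)
    return unique


standardized = [standardize(r) for r in raw_readings]  # 1. uniform format
backup = copy.deepcopy(standardized)                    # 2. backup of transformed data
cleaned = remove_duplicates(standardized)               # 3. remove repeated data
enriched = [dict(r, location=sensor_locations[r["sensor"]]) for r in cleaned]  # 4. integrate

print(enriched)
```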
IoT Data Analytics
When we apply data analysis tools or procedures to different types of IoT data, the process is
called IoT analytics. This process is performed on huge datasets to improve the efficiency of
procedures, applications, business processes, and production. Several types of data analytics
can be used on IoT data:
Prescriptive analytics
Prescriptive analytics is used to analyze what steps to take in a specific situation. It’s often
described as being a combination of descriptive and predictive analysis. When used in
commercial applications, prescriptive analytics helps decipher large amounts of information to
obtain more precise conclusions.
Spatial analytics
This is used to analyze location-based data. Spatial analytics deciphers various geographic
patterns, determining any type of spatial relationship between various physical objects. Parking
applications, smart cars, and crop management are all examples of applications that benefit
from spatial analytics.
Streaming analytics
Streaming analytics, sometimes referred to as event stream processing, is the analysis of
massive datasets while the data is in motion. These real-time data streams can be analyzed to
detect emergency or urgent situations, facilitating an immediate response. The types of IoT
applications that benefit from streaming analytics include those used in traffic analysis, air
traffic control, and police CCTV monitoring.
Time series analytics
Time series analytics is based on time-based data, which is analyzed to show any anomalies,
patterns, or trends. Two systems that greatly benefit from time series analytics are health and
weather-monitoring systems.
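As a rough illustration of time series analytics, the Python sketch below flags readings that differ sharply from the recent average. The readings, the window size, the tolerance, and the `find_anomalies` function are all assumptions chosen for this example; real monitoring systems use more robust methods.

```python
def find_anomalies(values, window=3, tolerance=5.0):
    """Flag readings that differ from the average of the previous `window`
    readings by more than `tolerance`. Returns a list of (index, value) pairs."""
    anomalies = []
    for i in range(window, len(values)):
        recent_average = sum(values[i - window:i]) / window
        if abs(values[i] - recent_average) > tolerance:
            anomalies.append((i, values[i]))
    return anomalies


# Hypothetical hourly temperature readings with one sudden spike.
temperatures = [20.0, 20.5, 21.0, 21.2, 30.0, 21.5, 21.4]
anomalies = find_anomalies(temperatures)
print(anomalies)
```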
We are surrounded by IoT data in our homes, our cars, and our schools. The amount of data
that IoT technology produces is massive. By collecting, processing, and analyzing this data, we
can gain valuable insights to help us make better decisions about the future.
The following links give access to free datasets of IoT and sensor-based data for you to
download.
https://data.world/datasets/iot
https://hub.packtpub.com/25-datasets-deep-learning-iot/
https://www.kaggle.com/uciml/biomechanical-features-of-orthopedic-patients
https://www.datasciencecentral.com/profiles/blogs/great-sensor-datasets-to-prepare-your-next-career-move-in-iot-int
Design Thinking, Problem Identification, & Working with Data
Design Thinking
Design thinking is a non-linear, iterative process designed to help understand a problem, the
users it affects, the assumptions made, and any available solutions, keeping in mind all the
parameters. Design teams are responsible for implementing solutions to problems within a
specific time frame. Design thinking is therefore a problem-solving methodology aimed at
devising solutions.
Over the years, ‘design thinking’ has gained prominence for its effectiveness in solving
problems and finding as many alternative solutions as possible. Organizations such as Microsoft
use it successfully to design and create products. Other organizations, such as universities,
banks and other companies, also use it to help solve real-life problems in their sectors.
The various stages of design thinking are a part of an iterative process where the ultimate
objective is to acquire an in-depth understanding of the problem and suggest a single solution
or alternatives within specific boundaries.
Stage 1: Empathize—Research Your User’s Needs
This stage permits the team to gain an understanding of the problem, and to what extent it
affects the user. This step requires gaining an empathetic understanding of the user’s issues.
Empathy, or the ability to imagine oneself in the condition of another, is vital in any human-
centric design process as the process is not to be based on the team’s assumptions but in
terms of the user’s perspective.
Stage 2: Define—State Your User’s Needs and Problems
In this stage, information is accumulated from the previous stage for analysis. The team makes
its observations and a definition of the core problem is written based on these aspects. This is
known as the Problem Statement and the most important aspect of this is to understand that
the problem statement should be as human-centric as possible. We will teach you how to
create a Problem Statement in depth later in the course.
Stage 3: Ideate—Challenge Assumptions and Create Ideas
By the time the design team enters this third stage, they completely understand the user and the
matters of concern to the user, and have defined the problem as exactly as they can. Now is the
time for the team to come up with innovative ideas by thinking creatively and ‘outside the box’.
Stage 4: Prototype—Start to Create Solutions
This next stage is an experimental phase with the sole objective of identifying the best possible
solution that can be provided to solve the problem. The aim is to find solutions that are
possible, inexpensive, and achievable.
Stage 5: Test—Try Your Solutions Out
At this final stage, the team tests its solutions to check their feasibility and recommends the
best possible one. Because the process is iterative, the results obtained may be used to redefine
one or more of the solutions identified, by returning to previous stages in the process in
order to make further changes or refinements, or to rule out particular alternative solutions.
Problem Identification
Creating the Problem Statement
A problem statement should be designed to address the Five Ws (Who, What, When, Where
and Why). A simple and well-defined problem statement is often used by every member of a
project team to understand the problem and work together toward developing a solution.
The reason for the existence of a problem statement is the identification and explanation of the
problem itself and as such includes a description of the environment where the problem exists,
and the impact that it has on other elements such as users, finances, resource allocation, or
additional activities. A problem statement also explains the anticipated environment the
solution is to run in. This definition of the problem helps create a holistic overview and to
define the problem in an elaborate but clear manner. Furthermore, the project goals to be
accomplished, and the purpose for initiating a project can also be written clearly without
doubt, ambiguity, or uncertainty of any kind.
Another useful purpose of creating a problem statement is that it can be used as a
communications mechanism to others. It helps the project team identify any support staff and
other kinds of expertise that may be needed to complete the project. Before the start of the
project, the people involved need to understand the problem, and goals, not only from a team
perspective but also from an individual contributor’s perspective. This step makes it clear what
the goal of the project is as well as the role that each team member will play in the execution
of the project.
Defining the boundary of the problem
Every problem has limits beyond which the solution devised to solve the
problem is not applicable. The extent of these limits is known as the boundary of the
problem. In reality, there is no physical boundary; it is more an understanding shared
between the team members.
To define the boundary of a problem it is important to focus on the real issues that make up
the problem. This can be achieved only when there is a thorough clarification of these issues. It
is understood by all team members that the boundary is a clear demarcation between the
factors that will greatly impact the problem and lesser-affecting factors. Lesser-affecting factors
are not considered to be within the boundaries of the problem definition and are thus not
considered when creating the solution.
It is important to understand that the problem boundary will differ for each person involved in
devising solutions for a problem, or set of problems. This is based on their
understanding of the problem. Their understanding will also be affected by their concerns and
any underlying human biases. This may cause slight setbacks in the project. However, as the
design team works cohesively, any biases are more often than not dealt with early in the
process.
How will data be accessed, managed and analyzed?
Machine learning is an especially important component of artificial intelligence, and it relies
heavily on data. What a machine learning system can or cannot achieve depends on the kind of
data it is given as input. Hence, it is vitally important that the input data is acquired from
sources that are reliable and that it has not been tampered with or altered in any way.
Every algorithm needs to be fed a particular kind of data, depending on the outcome the
machine is expected to produce. Training any AI-based algorithm often requires thousands or
even millions of data points. The data required may often be unavailable, as access to
it may conflict with privacy rules or government regulations on sharing data. In other words,
gaining access to data is a complicated process. Hence, best-practice policies need to be
in place to ensure that all AI-related systems follow the rules of privacy and security
uniformly.
What are the privacy and security aspects around the data collection?
There have been incidents in the past that have raised questions about the integrity of data
being used by machine learning systems. The organization handling the data must
ensure that the security of that data is maintained and that it is kept out of reach of people with
malicious intentions.
The amount of data collected and stored each day is enormous. Data organizations gather data
from innumerable sources such as live data feeds, blogs, and social media.
This is an extensive task, and therefore we need strong, robust data
management systems and stringent laws to protect the data from misuse of any kind.
Access to data
Data Access Statements are statements that are created to document the datasets required to
support a specific purpose and the necessary conditions under which they can be found and
used.
Research data archive and repository organizations provide users with a permanent identifier
for the data they house, known as a Digital Object Identifier (DOI) or accession number.
The following are things that should be provided in a Data Access Statement:
Openly available data
The name(s) of the repositories or archives that hold and manage the dataset(s)
The persistent identifier for your dataset
Data subject to access restrictions
Justification for the data to be subject to access restrictions (for example, ethical, legal, or
commercial sensitivity)
Information on arrangements for accessing the data, including the persistent identifier to the
dataset, or a statement that the data are not accessible
If you have used secondary or third-party data, the data source should be credited
If you have used secondary or third-party data, you can provide information on how the data
were accessed
Data Access Statements can also be combined with formal data citations, particularly when a
publication is supported by multiple datasets in different locations.
Fig 2.16: A user communicating with a chat BOT of an online shopping site
The BOT Framework
Data Labeling
One of the most important requisites of supervised learning is the labeling of data. With
artificial intelligence having more of an impact on our daily routines, there is a constant need
to upgrade the machines in order to continue to provide results with ever enhanced accuracy.
To accomplish this the data input into the algorithm must be precisely labeled.
Look at the image in Figure 2.19 closely. What do you see? In it, the unlabeled data gives a
learning machine no information about what is in the image.
As such, the machine cannot learn much from it, and therefore its outcomes are inaccurate.
From a machine’s point of view, accurately labeled data is of the utmost priority in
order for it to understand what an image shows, what is written in a piece of text, and even
what a sound recording contains.
In the subsequent image (Figure 2.20), the data has been labeled. The machine can now easily
identify what is in the image and can find similar patterns in other images
when fed similar data. Data labeling is a process that involves drawing electronic bounding
boxes on image files and tagging them with keywords that are both related and relevant to the
item within the box. It can also involve many other processes, such as marking a human
face with points to analyze facial features for use in person identification search engines such
as those used by the police. Another important aspect is the categorization of texts, audio files
and videos based on their content. In our example, one tag would be ‘car’, as the traffic image
shows many kinds of vehicles including cars, mini trucks, open vans, two-wheelers, buses
etc.
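A labeled image of this kind can be represented in Python as a tag paired with a box. The file name, the pixel coordinates, and the helper functions below are all made up for illustration; labeling tools each have their own formats.

```python
# One labeled image: each label pairs a tag with a box around the item.
# Boxes are (left, top, right, bottom) in pixels; all values are invented.
labeled_image = {
    "file": "traffic.jpg",
    "labels": [
        {"tag": "car", "box": (40, 120, 200, 220)},
        {"tag": "bus", "box": (220, 60, 480, 240)},
        {"tag": "car", "box": (500, 130, 640, 215)},
    ],
}


def count_tags(image):
    """Count how many labeled items carry each tag."""
    counts = {}
    for label in image["labels"]:
        counts[label["tag"]] = counts.get(label["tag"], 0) + 1
    return counts


def box_area(box):
    """Area of a box in square pixels."""
    left, top, right, bottom = box
    return (right - left) * (bottom - top)


tag_counts = count_tags(labeled_image)
print(tag_counts)
```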
Machine Learning and CHATBOTs
Robots and Humans
With advancements in the field of science and robotics, we have slowly entered an era where
robots can be found doing many tasks both at work and in our personal lives. They can now be
found doing daily household chores such as vacuuming, driving vehicles, disarming bombs,
controlling artificial (prosthetic) limbs, supporting surgical procedures, manufacturing products,
entertaining, teaching, and a lot more.
Why do you think robots are being created to perform certain kinds of work when
humans have traditionally been doing it for years? The main reasons are speed, accuracy, and
reduced risk to human life. A robot can work faster and more efficiently than a
human; this is why assembly lines use robots. The tasks to be performed there are routine
and do not have unexpected variations. Robots are also being used on farms to help farmers
remove weeds or unwanted plants from the field. Another use of robotics is to
minimize human error when performing tasks.
A BOT is a computer program that performs automatic repetitive tasks. It also acts as the
primary tool for automating interactions with website content on a large scale. BOTs on the
Internet is not a new concept and has been around for many years. BOT software is easy to
implement and can serve a variety of purposes.
The Science behind the Generation of a BOT
Demonstration
Fig 2.23: LUIS in action
Click on the link to watch the video demonstration: https://youtu.be/9tdkIQ-nkdo
Introduction to Supervised Machine Learning
Major machine learning methods
There are many widely adopted machine learning methods such as supervised and
unsupervised learning.
Supervised learning
Supervised learning algorithms are trained using labeled examples, that is, inputs for which
the desired output is known. The learning algorithm receives a collection of inputs together
with the corresponding correct outputs, compares its own output with the correct output to
identify errors, and modifies its model accordingly. Using strategies such as classification,
regression and prediction, supervised learning uses patterns to predict the values of the label
on unlabeled data. Supervised learning is usually employed in applications where historical
data can be used to predict likely future events.
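The compare-and-correct loop described above can be sketched with a perceptron, one of the
simplest supervised learners. This is only an illustrative sketch: the choice of a perceptron,
the tiny dataset (output 1 only when both inputs are 1) and the function names are assumptions
made for this example, not a method prescribed by this guide.

```python
def predict(weights, bias, inputs):
    """The model's actual output for one input: 1 or 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(examples, epochs=20, learning_rate=1):
    """Adjust the weights whenever the prediction differs from the correct output."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, correct_output in examples:
            # error is 0 when the model is right, +1 or -1 when it is wrong
            error = correct_output - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Labeled examples: (input, correct output). Hypothetical data for illustration.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(examples)
predictions = [predict(weights, bias, inputs) for inputs, _ in examples]
print(predictions)
```

Each pass over the data changes the model only where it made a mistake, which is the
"identify errors and modify the model" step in miniature.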
There is another form of machine learning known as unsupervised machine learning. In this
form of machine learning, there are no labeled examples and the outcome is unknown.
Semi-Supervised learning
This type of learning is used for the same types of applications as supervised learning, but uses
a mix of labeled and unlabeled data. Semi-supervised learning is useful when the cost of
labeling is too high to allow for a fully labeled training process. An early example is a
system that distinguishes an individual's face on a web camera from other faces.
Fig 2.25: Pictorial representation of Semi-supervised Learning
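One simple way to mix labeled and unlabeled data is often called pseudo-labeling or
self-training: a model built from the few labeled points guesses labels for the unlabeled
ones, and those guesses are then added to the training set. The sketch below assumes
one-dimensional measurements and two made-up classes "A" and "B"; the data and function names
are hypothetical.

```python
def centroids(labeled):
    """Average value of the labeled points in each class."""
    groups = {}
    for value, label in labeled:
        groups.setdefault(label, []).append(value)
    return {label: sum(values) / len(values) for label, values in groups.items()}


def pseudo_label(labeled, unlabeled):
    """Give each unlabeled point the label of the nearest class average."""
    centers = centroids(labeled)
    guessed = []
    for value in unlabeled:
        nearest = min(centers, key=lambda label: abs(value - centers[label]))
        guessed.append((value, nearest))
    return labeled + guessed


# Only two points were labeled by hand; the rest are unlabeled (made-up data).
labeled = [(1.0, "A"), (9.0, "B")]
unlabeled = [1.5, 2.0, 8.0, 8.5]

combined = pseudo_label(labeled, unlabeled)
print(combined)
print(centroids(combined))
```

The guessed labels can be wrong, so in practice they are usually accepted only when the model
is reasonably confident.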
Reinforcement learning
This form of learning is used in robotics, gaming and navigation applications. In reinforcement
learning the algorithm discovers through trial and error which actions yield the greatest
rewards. This type of learning has three primary components: the agent (the learner or
decision-maker), the environment (everything the agent interacts with) and actions (what the
agent can do).
The objective is for the agent to choose actions that maximize the expected reward over a
given amount of time. The agent will reach the goal much faster by following a good strategy.
The goal in reinforcement learning is to learn the best strategy.
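The agent, environment and actions can be seen working together in a small sketch. It uses
Q-learning, one common reinforcement learning algorithm, on a made-up corridor of five
positions where the reward is given only on reaching the last position. The corridor, the
reward of 1.0 and the parameter values are all assumptions chosen for illustration.

```python
import random

N_STATES = 5                   # positions 0..4 on a corridor; position 4 is the goal
ACTIONS = ["left", "right"]    # what the agent can do


def step(state, action):
    """The environment: move one position and report (next_state, reward)."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """The agent: learn by trial and error how valuable each action is in each state."""
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _step in range(100):  # cap on steps per episode
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)          # explore: try something random
            else:
                best = max(q[state].values())         # exploit: use what was learned
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward = step(state, action)
            target = reward + gamma * max(q[next_state].values())
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if state == N_STATES - 1:
                break
    return q


q_table = train()
# The learned strategy: the best-valued action in every non-goal position.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The list printed at the end is the learned strategy: for each position, the action the agent
now believes leads to the greatest reward.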
Supervised Machine Learning
Supervised machine learning is said to take place when the learning algorithm is provided with
labeled data. This helps the machine understand the data and generalize a model based on it.
Supervised learning works on ‘training data’, without which the algorithm cannot give correct
results. The purpose of the training data is to guide the learning process.
The training data consists of a set of training examples. In supervised learning, each example
is a pair consisting of an input object and the desired output value.
A supervised learning algorithm analyzes the training data and produces a function that infers
a result for a given input. This function can then be applied to new examples. In the optimal
scenario, the algorithm will correctly determine the class labels for instances it has never
encountered before.
Classroom Activity
Suppose you had a basket filled with blocks of different shapes (circle, square, triangle and
rectangle).
Your task is to arrange them into groups.
To understand the task, first assign names to these shapes.
We have four types of block called circle, square, triangle and rectangle, but they could be
called anything.
You learn from previous information about the physical characteristics of the blocks, so
arrange some of the blocks of the same type from the basket together. In data mining
terminology, this earlier work is known as training on the data.
Now take a new block from the basket, note its size and shape, and put it in the right group
based on what you have learned from your analysis of the previous blocks.
This is Supervised Learning. The dataset you will have used to classify the blocks will be as
follows:
4. With 4 straight sides (opposite sides are equal) and 4 corners Rectangle
Table 1: Dataset for Classroom Activity
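The block-sorting activity can also be sketched in code. Each training example pairs an input
(a description of a block) with the desired output (its name), and a new block is given the
name of the most similar known block. The way the blocks are encoded as numbers here (sides,
corners, and whether all sides are equal) is an assumption made for this sketch, and the
nearest-neighbor rule is just one simple way to classify.

```python
import math

# Training data: (input, desired output) pairs.
# Input = (number of straight sides, number of corners, 1 if all sides equal else 0).
# This numeric encoding is hypothetical and used only for illustration.
training_data = [
    ((0, 0, 0), "circle"),
    ((3, 3, 0), "triangle"),
    ((4, 4, 1), "square"),
    ((4, 4, 0), "rectangle"),
]


def classify(block):
    """Return the label of the training example closest to the new block."""
    _nearest_input, nearest_label = min(
        training_data, key=lambda example: math.dist(block, example[0])
    )
    return nearest_label


# A block that matches a known example exactly.
print(classify((4, 4, 0)))

# A block never seen in training: three sides, three corners, all sides equal.
print(classify((3, 3, 1)))
```

The second call shows the point of supervised learning: the program places a block it has
never seen before by comparing it with the labeled examples.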
Unsupervised learning
This kind of learning is used with data that has no historical labels, so the algorithm is not
told the ‘right answer’ and must work out for itself what the data shows. (When a dataset
contains a few labeled data points alongside many unlabeled ones, the approach is
semi-supervised learning, described earlier.) The goal is to explore the data and recognize
patterns contained within it.
Unsupervised learning works well on transactional data. For example, it can identify customers
with similar attributes who can then be treated similarly in marketing campaigns. Or it can
find the most prominent attributes that distinguish a particular group of customers. Popular
techniques which use this form of learning include self-organizing maps, nearest-neighbor
mapping and k-means clustering.
Clustering
With no labels to guide it, unsupervised machine learning often depends on something called
‘clustering’. Clustering algorithms run over the dataset and identify clusters, provided such
groupings actually exist in the data.
Clustering can be defined as the process of dividing the data points into groups such that
all the data points in a group share similar properties. The main goal of clustering is to
create groups of data points with similar traits and assign a label to each group.
A few types of clustering are as follows:
K–Means Clustering – Involves clustering the data points into K clusters. Choosing a suitable
value of K can be a complex process.
Hierarchical Clustering – Involves clustering the data points into parent and child clusters.
Probabilistic Clustering – Involves assigning the data points to clusters according to the
probability that they belong to each cluster.
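As an illustration of the first type, here is a minimal K-Means sketch in plain Python. The
six two-dimensional points are made up, K is fixed at 2, and the starting centroids are simply
the first K points; real implementations usually choose the starting centroids more carefully.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly re-assigning and re-averaging."""
    centroids = list(points[:k])  # naive starting guess (an assumption of this sketch)
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        # Step 2: move each centroid to the average of its assigned points.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    labels = [nearest(point, centroids) for point in points]
    return centroids, labels


# Hypothetical data: three points near (1, 1) and three near (8.5, 9).
points = [(1, 1), (8, 8), (1.5, 2), (9, 9.5), (1, 0.5), (8.5, 9)]
centroids, labels = k_means(points, k=2)
print(labels)
```

Note that no labels were supplied: the algorithm discovers the two groups purely from how
close the points are to one another.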
Clustering can be used in a variety of applications. Some of them can be:
Recommendation engines
Market segmentation
Social network analysis
Search result grouping
Medical imaging
Image segmentation
Anomaly detection
Assessment Questions
1. Define data and list examples of data that you think your school would gather from others
or prepare itself, and explain how it could be used.
8. Use any example from your daily life to explain what you have understood about
unsupervised and supervised learning.
14. Fill in the Blanks
True or False
Unlabeled data is less expensive and takes less effort to acquire.
In reinforcement learning, the algorithm discovers through trial and error which actions yield
the greatest rewards.
Questions to consider
Name the major differences between alarms and reminders.
https://medium.com/truemd/whats-the-difference-between-an-alarm-and-a-reminder-a73c11dc1a73
What probable challenges will Cortana face when playing a song that has a remake version too?
https://www.howtogeek.com/402579/i-used-a-cortana-smart-speaker-all-weekend.-heres-why-it-failed/
Probe the various aspects of the debate of ‘human vs. BOT interaction’.
https://www.intercom.com/blog/bots-versus-humans/
https://www.retaildive.com/news/70-of-consumers-still-want-human-interaction-versus-bots/543324/
https://cyfuture.com/blog/the-great-bot-battle-ai-chatbots-vs-human-powered-live-chat
Investigate the various data labeling approaches and find out the pros and cons with suitable
examples.
https://www.kdnuggets.com/2018/05/data-labeling-machine-learning.html
Elaborate on the steps for data protection. Name the steps to annotate data.
https://www.richardsandrichards.com/6-steps-to-complete-data-protection-for-your-small-business/
https://resources.infosecinstitute.com/how-to-implement-a-data-privacy-strategy-10-steps/#gref
https://medium.com/thelaunchpad/spinning-up-an-annotation-team-c74c6765531b
Design steps for ‘Tic Tac Toe’ so that the computer's chances of winning are maximized.
https://www.wikihow.com/Win-at-Tic-Tac-Toe
Calculate the best moves to solve the ‘Tower of Hanoi’ problem.
https://en.wikipedia.org/wiki/Tower_of_Hanoi
Write down the different appliances that generate datasets in a smart home:
o Security camera related data
o Thermostat related data
o Electricity consumption data
Some Practical Assignments/Lab Work
Assignment 1
Use the Bing search engine to prepare a report on the following.
A. Types of machine learning.
B. Heuristic search in AI.
C. Knowledge representation technique.
D. AI based games for competitive entertainment such as Chess.
Assignment 2
Outline the challenges that can be encountered in problem identification.
https://www.toolshero.com/problem-solving/problem-definition-process/
Assignment 3
Evaluate the importance and role of an identifier in the dataset.
https://www.ngdc.noaa.gov/wiki/index.php/Data_Set_Identifiers_and_other_Unique_IDs
https://www.dataone.org/best-practices/provide-identifier-dataset-used
Assignment 4
Create datasets for the following:
Image processing – Create a dataset of at least 100 images of natural scenes
Sentiment Analysis – Choose a product and search for it on more than one e-commerce website.
Write down all the reviews (not fewer than 100) written for the product in a document under
each website's name.
Video processing – Shoot or download birthday party videos (not fewer than 50) and collect
them in a folder.
IoT – Download any three IoT datasets online. Suggested – weather data, traffic data,
agriculture data and smartphone data.
Assignment 5
Suggested questions for different scenarios:
Club membership enquiry
What is your age (check for eligibility)?
What games do you play?
What are your play timings?
Duration for membership – quarterly, bi-monthly, half yearly or yearly?
Career guidance or higher education
Percentage scored in class 10th exams?
Present options for subjects?
What is the user preference – (present streams – science, arts, commerce, commerce with
mathematics etc.)
Design a possible BOT interaction, with questions and answers, for the club membership
enquiry scenario.
Design a possible BOT interaction, with questions and answers, for the career guidance or
higher education in AI scenario.
Assignment 6
Suggested keywords
Staff communication mails – staff, teacher, educator, subject, class teacher, discipline,
permission, class, etc.
Parent complaint mails – ward, mother, father, guardian, class, student, complaint, unaware
etc.
Educational bodies mails – authority, board, school, inspection, requirements etc.
Vendor mails – vendor, issue, payment, permission, principal, office, dated etc.
Co-Curricular notification mails – notification, circular, school, district, state, level,
competitions, class, participation, participate etc.
It is to be noted that there could be certain keywords that would be common to more than one
communication type. The machine is expected to focus on both similar and dissimilar keywords
and labels to identify and segregate.
Consider the scenario of a school where the principal needs help from a machine-based
application with mail segmentation. Students are to consider segregation of the mails in the
following categories
Staff communication mails
Parent complaint mails
Educational bodies mails
Vendor mails
Co-Curricular notification mails
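As a starting point for thinking about this assignment, the sketch below scores a mail against
each category by counting how many of the suggested keywords it contains. It is a simple
keyword-matching baseline, not a trained machine learning model, and the sample sentences are
made up. The two-word keyword ‘class teacher’ is left out because its individual words are
already in the list.

```python
import re

# Keyword lists taken from the suggested keywords for this assignment.
KEYWORDS = {
    "Staff communication mails": {
        "staff", "teacher", "educator", "subject", "discipline", "permission", "class",
    },
    "Parent complaint mails": {
        "ward", "mother", "father", "guardian", "class", "student", "complaint", "unaware",
    },
    "Educational bodies mails": {
        "authority", "board", "school", "inspection", "requirements",
    },
    "Vendor mails": {
        "vendor", "issue", "payment", "permission", "principal", "office", "dated",
    },
    "Co-Curricular notification mails": {
        "notification", "circular", "school", "district", "state", "level",
        "competitions", "class", "participation", "participate",
    },
}


def categorize(mail_text):
    """Return the category whose keywords appear most often in the mail."""
    words = set(re.findall(r"[a-z]+", mail_text.lower()))
    scores = {category: len(words & keywords) for category, keywords in KEYWORDS.items()}
    return max(scores, key=scores.get)


# Made-up sample mails for illustration.
print(categorize("Payment issue raised by the vendor, dated last week."))
print(categorize("As the guardian of my ward, I have a complaint about the class timetable."))
```

Because words such as ‘class’ and ‘school’ appear in several lists, a mail containing only
shared keywords can produce a tie; this simple version then picks whichever category comes
first, which is one reason a learned model is expected to do better.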
Assignment 7
Prepare a detailed report on how machines develop intelligence and learn from reinforcement
methodology in a game of chess.
https://www.infoworld.com/article/3400876/reinforcement-learning-explained.html
Assignment 8
Design a student’s assistance program for students with low performance. How can AI assist in
identifying the weak students?
Assignment 9
Refer to the website below to understand the IRIS dataset and answer the following questions.
https://archive.ics.uci.edu/ml/datasets/iris
A. What are the features/attributes of the dataset?
B. What are the targets/classes of the dataset?
C. How many rows are there in the dataset?
D. Are there any missing values in the dataset?
E. Is the data univariate or multivariate?
F. If we follow a 60:20:20 pattern for train, validate and test, how many rows will there be in
each of the datasets?
Assignment 10
Perform classification of students to understand who would be interested in joining the sports
club of the school.
Lab Session -1: Create a dataset or use any existing free dataset
Lab Session -2: Study the dataset of the students
Lab Session -3: Dataset should include:
Student ID/Admission number
Interest in the sports (name of the sport in which the student is interested to participate)
Achievement in the sports
Academic scores
Distance from school to home
Height and shoe size
Practical Assignments
Assignment 1
Imagine a situation at home where your family is expecting guests. You have lights at various
locations both indoors and outdoors. There are lights at the doorway, near the gate, along the
pathway, in the garden and also in the interior rooms of the house. Each family member has a
different opinion about which lights to switch on. Write down the problem statement and
alternative solutions.
Assignment 2
Imagine a hypothetical situation where you are exploring suitable career choices with the
help of a CHATBOT that assists you in the process. Prepare a set of question/answer trails.
Assignment 3
Create a bank of images (more than 50). It should contain images of people with emotions
(various ages, color, expressions etc.). Using Microsoft’s online ‘Face and Emotion Recognition’
application, run the images to predict the emotion and analyze visual content.
https://aidemos.microsoft.com/face-recognition
Further Reading
https://www.forbes.com/sites/willemsundbladeurope/2018/10/18/data-is-the-foundation-for-artificial-intelligence-and-machine-learning/#3eccba5251b4
https://towardsdatascience.com/role-of-data-science-in-artificial-intelligence-950efedd2579
http://www.dbta.com/BigDataQuarterly/Articles/The-Importance-of-Data-for-Applications-and-AI-129316.aspx
https://www.technative.io/data-quality-vs-data-quantity-whats-more-important-for-ai/
https://pjreddie.com/darknet/yolo/
https://www.houseofbots.com/news-detail/3581-4-understand-the-machine-learning-from-scratch-for-beginners
https://www.minigranth.com/artificial-intelligence/problem-solving-in-artificial-intelligence/
Problem Solving in Artificial Intelligence by Prof Philippe Codognet. Link -
http://webia.lip6.fr/~codognet/PSAI/1-introduction.pdf
Introduction to Artificial Intelligence: Problem Solving and Search by Bernhard Beckert, 2004. Link -
https://formal.iti.kit.edu/~beckert/teaching/Einfuehrung-KI-WS0304/04ProblemSolving.pdf
Learning problem solving (artificial intelligence, machine) by Bruce Walter Porter, University of
California, Irvine, 1984.
Learning problem solving strategies using refinement and macro generation by HA Güvenir, GW Ernst,
Elsevier Science Publishers B.V. (North-Holland), 1990. Link -
http://repository.bilkent.edu.tr/bitstream/handle/11693/26215/bilkent-research-paper.pdf?sequence=1&isAllowed=y
Microsoft Power BI Dashboards Step by Step 1st Edition by Errin O'Connor
The 5 Clustering Algorithms Data Scientists Need to Know -
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
Clustering Introduction & different methods of clustering -
https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/
What is Data Labeling? - https://www.youtube.com/watch?v=_BasmAAub7w
Why Smart Labeling is the Future of Data Annotation - https://www.youtube.com/watch?v=V33Ut36eUsY
Four Mistakes You Make When Labeling Data -
https://towardsdatascience.com/four-mistakes-you-make-when-labeling-data-7e431c4438a2
Practical Machine learning problems -
https://machinelearningmastery.com/practical-machine-learning-problems/
https://www.messengerpeople.com/chatbots-what-is-a-whatsapp-bot-actually/
https://www.geeksforgeeks.org/what-is-reinforcement-learning/
https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/
Glossary
Ancient - Belonging to the very distant past and no longer in existence.
Logic - A system or set of principles underlying the arrangements of elements in a computer or
electronic device so as to perform a specified task.
Algorithms - Processes or sets of rules to be followed in calculations or other problem-solving
operations, especially by a computer.
Perceptions - The ways in which something is regarded, understood, or interpreted.
Intervention - The action or process of intervening.
Complex - A group or system of different things that are linked in a close or complicated way;
a network.
Segmentation - Division into separate parts or sections.
Sentiment - Feelings of tenderness, happiness, sadness, or nostalgia.
Emotion - A strong feeling deriving from one's circumstances, mood, or relationships with
others.
Polarity - The state of having two opposite or contradictory tendencies, opinions, or aspects.
Parameter - A numerical or other measurable factor forming one of a set that defines a system
or sets the conditions of its operation.
Non-linear - Not arranged in a straight line.
Crux - The decisive or most important point at issue.
Application - A program or piece of software designed to fulfil a particular purpose.
Data mining - The practice of examining large pre-existing databases in order to generate
new information.
Stakeholder - A person with an interest or concern in something.
Narrative - A spoken or written account of connected events.
Untagged - (Of a piece of text or data) not identified or categorized by a tag.