DA Scripts Session 1

DA001

Module overview

Hello and welcome. This module covers a family of techniques that are variously called data
mining, data analytics, predictive analytics or machine learning. I'm Dr. Sandy Brownlee and I
will be taking you through the course with my colleagues, Dr. Jason Adair and Dr. Kevin
Swingler.

[Cut to Jason for personal intro]

[Cut to Kevin for personal intro]

[Back to Sandy]

This is a truly fascinating subject, seeing extensive uptake in industry, and touching on several
hot topics at the cutting edge of research. It also includes my own research interests in
optimisation and machine learning, and I’m excited to share some snippets of my research as
we progress through the course. I have worked in industry and academia for around 17 years,
with my earliest roles being all about data cleaning and statistics. I spent a year working with
the Scottish historical fisheries database writing scripts to automatically process manually
recorded data dating back 50 years. That system was not only able to tell you how many fish
are in the sea, but what gender they are, how big they are, and how old they are! More
recently, I’ve worked with international airports and airlines to predict the movements of
aircraft, with the goal of automating their route planning so they run more efficiently. I’ve
worked with buildings engineers to model how different home improvements can be deployed
within a given budget to keep our homes comfortable while making them better for the
environment. These days I’m particularly interested in how we can get machines to explain
their decisions so they can be trusted: we’ll come on to that near the end of this course.

In this module we’ll focus on machine learning for structured data (that is, numbers), but we
will also touch on computer vision and natural language processing. We’ll look at the
underpinning theory, how we might manage a data mining project in practice, and how we can
build the best models for a given task. We’ll also look at trust and ethical issues around machine
learning, including some key concepts in visualisation.

We will be using two tools throughout the course. We will begin by using a visual tool called
Orange: this uses a simple point-and-click interface and is extremely helpful to “play around”
with data as you try to understand it. For learning, it is also very good at visualising the overall
process without getting caught up in low-level details. Having established the concepts in
Orange, we will then repeat the processes by programming in Python on Jupyter Notebooks,
using a very popular toolkit called scikit-learn or S K learn. Don’t worry if you have little or no
experience of programming; we will provide plenty of examples and opportunities to seek help.
This course will involve a lot of exercises to give you practice. The assessment is split 50:50
between a hands-on project just like you might be asked to complete in industry, and an exam
focused on the more theoretical parts of the course. We will also regularly share our progress
and discuss concepts among the class.

By the end of this module you will be able to:


● Understand what can be achieved with data analytics
● Apply data analytics to business problems
● Describe different techniques that can be used to mine data
● Manage a data driven project
● Apply graphics and visualisation in data mining and data representation
● Consider broad issues of trust and ethics around algorithmic decision making

We hope you enjoy the course.

END

DA002
Welcome to the Module

Hello and welcome to this module on Data Analytics. I'm Dr. Sandy Brownlee and I will be taking
you through this course with my colleague, Dr. Jason Adair.

[Cut to Jason for personal intro]

[Back to Sandy]

This is a truly fascinating subject, seeing extensive uptake in industry, and touching on several
hot topics at the cutting edge of research. It also includes my own research interests in
optimisation and machine learning, and I’m excited to share some snippets of my research as
we progress through the course. I have worked in industry and academia for around 17 years,
with my earliest roles being all about data cleaning and statistics. I spent a year working with
the Scottish historical fisheries database writing scripts to automatically process manually
recorded data dating back 50 years. That system was not only able to tell you how many fish
are in the sea, but what gender they are, how big they are, and how old they are! More
recently, I’ve worked with international airports and airlines to predict the movements of
aircraft, with the goal of automating their route planning so they run more efficiently. I’ve
worked with buildings engineers to model how different home improvements can be deployed
within a given budget to keep our homes comfortable while making them better for the
environment. These days I’m particularly interested in how we can get machines to explain
their decisions so they can be trusted: we’ll come on to that near the end of this course.

In this module we’ll be looking at a family of techniques that are variously called data mining,
data analytics, predictive analytics or machine learning. You’ll find these terms are often used
interchangeably, so we’ll be doing the same! We’ll focus on structured data (that is, numbers),
but we will also touch on computer vision and natural language processing. We’ll look at the
underpinning theory, how we might manage a data mining project in practice, and how we can
build the best models for a given task. As I mentioned, we’ll also look at trust and ethical issues
around machine learning, including some key concepts in visualisation.

We will be using two tools throughout the course. We will begin by using a visual tool called
Orange: this uses a simple point-and-click interface and is extremely helpful to “play around”
with data as you try to understand it. For learning, it is also very good at visualising the overall
process without getting caught up in low-level details. Having established the concepts in
Orange, we will then repeat the processes by programming in Python on Jupyter Notebooks,
using a very popular toolkit called scikit-learn or S K learn. Don’t worry if you have little or no
experience of programming; we will provide plenty of examples and opportunities to seek help.
This course will involve a lot of exercises to give you practice. The assessment is split 50:50
between a hands-on project just like you might be asked to complete in industry, and an exam
focused on the more theoretical parts of the course. We will also regularly share our progress
and discuss concepts among the class.

Ahead of all that though, we’ll take a look at what data mining is, and what it can and can’t do.
Let’s begin.

END

DA003
Introduction to the data analytics process

In this first session we will be introducing the data analytics process. We’ll begin by exploring
what you already know about data mining, data analytics or machine learning, and what
relevant experience you might have. Please engage with the discussion so we can get to know a
bit more about each other. This will make it much easier to help each other out later on.

We will explore different applications of data mining, gain an understanding of what it can and
can’t do, and what makes projects using these techniques rather different to other software
projects. We will then move on to some important fundamental concepts: types of variables,
distributions, and data cleaning. We will introduce the CRISP-DM framework, an industry-
standard process for managing a data mining project, which we will make use of throughout the
course. There will be some administrative items: we will set up the syndicate groups for the
course and ensure your computer is set up with the necessary software for the practical
exercises. We’ll also get our hands on a real data set to give a first taste of data mining in
practice.
END

DA004
Description of some data analysis projects and applications

Let’s start by thinking about some real-world applications of machine learning. I’ll give some
examples and a little background for each. After this, I’d like you to think about applications you
have seen, or might have worked on yourself in the past, and share these with your syndicate
group. In each case the approach is driven by learning something about data that would have
been tricky, or impossible, to write computer code for by hand. The knowledge learned is
embedded in what’s called a model.

The first example is a part of my research that I mentioned in the introduction. The project was
developing automated systems to guide aircraft safely around the taxiways at a busy airport.
We had some clever routing algorithms that would weigh up the options and find a reliably
quick route for each aircraft, given the movements of all the other aircraft. However, this
depended on getting accurate predictions of the time it would take to cover a particular
taxiway for a given aircraft. There are a huge number of factors to consider that could be inputs
to a model. Obviously the distance and how straight or curved the route is are important, but
we might want to consider things like the type of aircraft, the weather, the number of other
aircraft moving at the same time, and whether the aircraft is arriving or departing. In speaking
with the airport ground staff we were also told other factors that might be important: for
example, some airlines might instruct their pilots to taxi more quickly than others as a matter of
policy. It was not clear what the relationship was between all these factors, so it was not
possible to simply write a program to make the predictions. Thus, we have a clear application
for machine learning.

We recorded the movements of around ten thousand aircraft (a little over two weeks of traffic
at the time). We then fitted models to this data and tried predicting the movements of aircraft
several weeks later. The models were able to predict most of the movements to within one
minute of their real taxi time (an error of about 10%) which it turned out was good enough to
build our automated routing system. We were also able to analyse the models to learn more
about the airport’s operations. As expected, distance and route straightness were the most
important factors. The average speed of recently departed aircraft, and whether the present
aircraft was a departure or an arrival, were also important. Otherwise, most of the factors we might
think of by intuition were unimportant. This was very useful to know as the airport embarked
on development of a new layout and supporting policy for their taxiway system.
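
To make this concrete, here is a minimal sketch in Python of the kind of regression model involved. The feature names and data are invented purely for illustration, not the real airport dataset, and the random forest is just one reasonable model choice.

```python
# A sketch of fitting a taxi-time regression model on invented data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 1000
data = pd.DataFrame({
    "distance_m": rng.uniform(500, 4000, n),          # length of the taxi route
    "total_turn_deg": rng.uniform(0, 540, n),         # how curved the route is
    "is_departure": rng.integers(0, 2, n),            # departure (1) or arrival (0)
    "other_moving_aircraft": rng.integers(0, 15, n),  # traffic at the same time
})
# Invent a "true" taxi time in minutes with some noise, just so the sketch runs
data["taxi_minutes"] = (data["distance_m"] / 600
                        + data["total_turn_deg"] / 360
                        + rng.normal(0, 0.5, n))

X = data.drop(columns="taxi_minutes")
y = data["taxi_minutes"]
# Train on the first 800 movements, predict the remaining 200
model = RandomForestRegressor(random_state=0).fit(X.iloc[:800], y.iloc[:800])
predictions = model.predict(X.iloc[800:])
print("Mean absolute error (minutes):", mean_absolute_error(y.iloc[800:], predictions))
```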

Another project involved a colleague of mine. Their task was building a system that could take,
as inputs, several measurements being taken for the driver of a car: changes in their seating
position and their movements operating the steering wheel, gears and so on. The output of this
system was simply one of two options: whether or not the driver was tired and needed to pull
over for rest. Machine learning is applicable here because, again, there is no clear formula to
relate these movements to tiredness. A model was trained on the movements of hundreds of
test subjects and was confirmed to produce surprisingly accurate assessments of driver
tiredness.

Many other applications exist. Machine learning is frequently applied to analysing the habits of
shoppers to predict what they might like to buy next, allowing targeted advertising to be
generated. Models have been used to estimate the likelihood that a person will default on a
loan. Predictive maintenance applies machine learning to estimate when a piece of equipment
is likely to fail, so maintenance can be targeted when it is needed before a critical problem
arises. Researchers in my department have used models to estimate when a cow is most fertile
more accurately than experienced farmers. Machine learning is also used for labelling emails as
spam, identifying malicious network activity or computer viruses, and for labelling the objects in
images and videos so their content can be tagged and easily found, or so audio descriptions can
be generated for blind people.

As you can see, there are a vast number of highly diverse applications. Can you think of any
others? What are the inputs and outputs to the model? What is the goal? Is it prediction,
labelling, or something else? In each case, why could more conventional software not be used
rather than machine learning? What might a “good” model look like? What might go wrong?
Once you have a couple of examples, discuss them with your syndicate group in the next
exercise.

END

DA005
Making predictions

You've probably got the idea from the applications we discussed that data mining is about
learning patterns and making predictions. In practice, this involves building a model that can
take data from events we know about - things we have already seen - and using that model to
help us guess about things we don't know about. That might be predicting things in the future
or it might be classifying things in the present.

We’ll come on to what different kinds of model look like later in the course, but for now you
can think of a model as some way of transforming the data to produce an output. Most models
are represented as a set of “parameters” (numbers) that are interpreted by software that
implements a particular data mining technique. A very simple model estimate the time it takes
to boil a kettle. The model could take the volume of water in litres and multiply it by 2 to get a
time estimate in minutes. Here, the 2 is the parameter; the volume of water and the time are
what are called variables. The parameter 2 might be “learned” from previous data: perhaps we
previously boiled one litre five times and took the average of the times. If we used a different kettle and took
more measurements, we would probably have a different value for our parameter.
Measurements with a more powerful kettle would have a smaller number; measurements with
one of those travel kettles you find in hotel bedrooms would have a much bigger number.
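
Here is a minimal sketch of the kettle example in Python, with the parameter being "learned" from previous measurements (the numbers are made up):

```python
# "Learn" the single parameter of the kettle model from past measurements.
previous_boils = [1.9, 2.1, 2.0, 2.2, 1.8]            # minutes to boil one litre, five times
parameter = sum(previous_boils) / len(previous_boils)  # learned parameter, roughly 2

def predict_boil_time(volume_litres):
    """Predict boiling time in minutes for this particular kettle."""
    return parameter * volume_litres

print(predict_boil_time(1.5))  # about 3 minutes for 1.5 litres
```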

Hopefully you can see that the functionality of the model is determined by data and not pre-
programmed rules. The data we’re talking about is usually a snapshot or a sample of the real
world in some way. Data mining assumes that whatever produced the data in the first place will
continue to produce data in a similar way in the future. So in this way we're trying to use what
we've already seen in order to make predictions about things we haven't seen before. A model
learned from the measurements for one kettle might not work for predicting times on a
different kettle.

Of course there are problems with this approach. It might be that the data we’re provided with
has errors. It might have been measured or recorded incorrectly, or have parts missing. It might not
measure the right things, or there might simply not be enough data to do anything
useful with, and collecting more data might be expensive or even impossible. We'll talk about
how we deal with these issues in a later part of the course.

Assuming the data is good enough, what kind of things might we be mining from it? Well, we
might be looking at relationships between variables. For example, that age affects car insurance
risk; temperature affects whisky distilling quality; or prices affect sales. There might be more
complex relationships than that, with three, four, five or more factors all working together to
produce some kind of outcome.

Learning patterns from vast amounts of data allows computers to perform tasks that we
couldn't necessarily program them to do. A colleague of mine was developing a model to
predict whether or not insurance claims were fraudulent. Talking to a claims assessor he said
that he knew that a claim was fraudulent when the hairs on the back of his neck stood up. It is
impossible to program rules like this: a computer has no concept of intuition, nor does it have
hairs or even a neck! All it can do is infer patterns from examples it’s seen in the past. So, data
mining is about learning rules rather than taking them from a human.

Broadly speaking, there are three kinds of tasks that we might do in this setting.

Classification tasks are about assigning a class or label to objects. That might be things like
saying the picture contains a cat, dog or mouse, or perhaps more usefully, this email is spam or
not, or this insurance claim is likely to be fraudulent or genuine.

Regression tasks are about predicting a number. Examples that I've looked at have included
estimating the energy usage of a house after it has been refurbished or estimating the time in
minutes that an aircraft will take to taxi from one part of an airport to another. The kettle
boiling model I described earlier was also an example of a regression model.

Clustering tasks are about grouping elements of data together. A classic example is finding
groups of similar customers that can be targeted for marketing.

Machine learning for Classification and Regression is usually referred to as supervised learning,
because the data you are trying to learn from is marked with “correct” values that you can use
to test the quality of the model. These correct values are a bit like a teacher or “supervisor”.
Clustering is usually referred to as unsupervised learning, because we usually don’t have an
existing set of clusters to compare against.
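
As a minimal sketch, the three task types look like this in scikit-learn, using its built-in toy datasets (the particular models chosen here are only illustrative):

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Classification: assign a label (iris species) to each data point
X_c, y_c = load_iris(return_X_y=True)
clf = DecisionTreeClassifier().fit(X_c, y_c)
print("Predicted class:", clf.predict(X_c[:1]))

# Regression: predict a number (a disease progression score)
X_r, y_r = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_r, y_r)
print("Predicted value:", reg.predict(X_r[:1]))

# Clustering: group similar data points, with no "correct" answer to supervise
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X_c)
print("Cluster of first point:", clusters[0])
```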

It's also worth saying that it's crucially important throughout the process to keep your mind on
what the task is ultimately about: what the real-world application is. So, just as you wouldn't
dig a gold mine if you knew nothing about the geography of the Earth, you wouldn't go data
mining without having some understanding of the geography and meaning of the data. In
essence, it's really important to know your problem. I find understanding a new application
area to be one of the most fun parts of the whole process.

We will now have a go at getting your machine set up with Orange and Python, so you are
ready to play with some real-world datasets in the next screen. Please also take a look at the
text box with some key definitions of terms we’ll be using throughout the course.

END

DA006
Orange walkthrough (screencast)
Screencast of opening Orange, and adding / editing widgets. Show configuration of widgets.
Show that we can click on edges to change data flows. Show how to get help.

END

DA007
Variable Types

We’ve learned that machine learning is about constructing models where data goes in and data
(in the form of predictions) comes out. The basic building blocks of data are “variables”.

Variables are something we can measure that varies, hence the name! Sometimes you will also
hear variables referred to as “attributes” or “features”. Once you have decided on a task,
identifying the variables is the first thing to do. Some examples of variables you might see are:
age, height, temperature, weight, size or price.

For a particular example of whatever is being measured (called a “data point”), each variable
takes one value. You might think of the variables as columns in a table, and the data points as
rows. So, for example, data representing a person might have the variables age, height and
occupation. For a particular person, these variables might have the values age “37 years”,
height “175cm”, and occupation “Lecturer”.

There are two broad types of variable: numerical and categorical. Each of these can be divided
into two sub-types.

Numerical variables can be either continuous or discrete. The values of continuous variables
can be anything within a range: there are an infinite number of acceptable values. So a good
example is height: although in everyday life we tend to round to the nearest centimetre or inch,
the true value of an adult person’s height can be anything within a range of about 120-200cm.
A person might have a height of exactly 175cm, but it could equally be 174.1253cm. On the
other hand, a discrete variable has a fixed set of acceptable values within a range. So, “number
of children” is a good example. 2 or 3 children are acceptable, but 3.2 children would not be!

Categorical variables can be either nominal or ordinal. Their values are labels rather than
numbers, and take their values from a fixed set. Nominal variables refer to things like qualities
where there is no particular order. So, for example, the colour of a fruit might be one of red,
green, yellow or purple. The concept that yellow is greater than purple makes no sense here.
Similarly, the make of a car might be a Ford, Volkswagen, or Nissan, and again there is no clear
ordering of these values. Binary choices like yes and no are also nominal. In contrast, ordinal
variables do have an ordering. For example, the size of a t-shirt might be small, medium or
large. In my work modelling building refurbishments, a window might be single, double or triple
glazed. The ordering here is based on how well the window keeps heat in the house.
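
Here is a minimal sketch of these variable types in a small pandas table (the values are invented); note how telling pandas that t-shirt size is ordered makes comparisons between its values meaningful:

```python
import pandas as pd

people = pd.DataFrame({
    "height_cm": [175.0, 162.4, 181.3],      # numerical, continuous
    "num_children": [0, 2, 3],               # numerical, discrete
    "car_make": ["Ford", "Nissan", "Ford"],  # categorical, nominal (no order)
    "tshirt_size": ["M", "S", "L"],          # categorical, ordinal (ordered)
})
# Declare the ordinal variable's ordering explicitly
people["tshirt_size"] = pd.Categorical(
    people["tshirt_size"], categories=["S", "M", "L"], ordered=True)
print(people.dtypes)
print(people["tshirt_size"] > "S")  # ordering makes this comparison valid
```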

Now we’ve understood the different types of variables, let’s take a look at an example data set
using Orange and try to identify the different variables present.

END

DA008
CRISP-DM Model Description

We’ve already discussed how data mining projects are about learning patterns from data so we
can make predictions. A data driven project is very different to a standard software engineering
project (or, indeed, many other projects involving a design and implementation). A standard
software project would begin with a specification that sets out precisely what the software will
do. The system will then be implemented and tested against that specification before it is
signed off and put into use.

One difference with a data driven project is that we are, in general, not developing a full
software product as such, just a piece of functionality that is embedded into a larger system.
The other crucial difference is that the results depend, in part, on the data itself, so it can be
very hard to know ahead of time what a successful application will look like. This means that
specifying the goals at the outset of the project is very difficult; indeed, selling the idea itself
can be a lot more challenging.

Typically this means that at the start of the project we can only establish in general terms what
we would like to achieve. For example, being able to predict what items a customer might want
to buy given their shopping history. Such projects then jump back and forth as we learn more
about the data and what might be possible, repeatedly seeking feedback from the client.

The CRISP-DM standard provides us with a framework to follow when running a data mining
project. Other standards exist, but CRISP is one of the most well established and remains one of
the most popular. It consists of six stages, and as you can see, is not linear. It is possible to step
back and forth between stages: the most important movements are shown by arrows on the
diagram. We’ll now briefly describe each of the stages.

Business Understanding. At this stage we try to understand the task. What’s the general aim?
Is it a regression or classification problem? Is there a minimum performance that would be
acceptable? What is the cost of making errors in the predictions? Do false positives or false
negatives matter more? For example, in a medical application we’d want to avoid false
negatives, so we don’t give someone the all clear when they are in fact ill. On the other hand, a
spam filter would want to avoid false positives, so that a person doesn’t miss an email that was
actually important to them. The medical application should have a very low error rate, but a
spam filter can be more relaxed.
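
As a minimal sketch, false positives and false negatives can be counted with a confusion matrix in scikit-learn (the labels below are made up for a spam filter, with 1 meaning spam):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 0, 1, 0, 0]  # what the emails really were
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]  # what the model predicted
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("False positives (real email marked as spam):", fp)
print("False negatives (spam that got through):", fn)
```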

Data Understanding. At this stage we ask what and how much data is available, and how easy is
it to collect more? What variables are present, what types do they have and how do they relate
to the problem? What is the target variable? We might plot some basic statistics to get a feel
for what our data set looks like. Already at this stage, we might need to go back if it doesn’t
look like the data is suitable for the problem, or if it is poorly documented and more business
insight is needed.

Data Preparation. This stage is all about cleaning the data and getting it into a format suitable
for modelling. We will deal with missing values and fix obvious errors, like a person’s age being
200. We will recode obvious inconsistencies: for example, perhaps in some places M and F were
used instead of male and female, so we settle on one standard. We also deal with minority
values, unbalanced data and skewed distributions here: more on those later. We decide which
variables might be of use for modelling by looking at the distributions and levels of noise in
each. At this point we also split the data into training and test sets, so that we can measure the
quality of the model on data it hasn’t seen before.
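
A minimal sketch of the train/test split in scikit-learn, using a built-in toy dataset (holding back 30% is just one common choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # keep 30% aside as unseen test data
print(len(X_train), "training rows,", len(X_test), "test rows")
```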

Modelling. This stage is a little like scientific research. We choose a number of techniques
suitable for the task: for example, a neural network and a decision tree for classification. We
train these models on the training data. We tune them and compare them until we are happy
that we have found a model that performs well on unseen data.

Evaluation. At this stage we apply our chosen model to the unseen test data, and measure its
performance. If it is deemed to be good enough we can move to the final stage.
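
Here is a minimal sketch of the Modelling and Evaluation stages together in scikit-learn, again on a toy dataset; the two candidate models and their settings are only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Modelling: train two candidate techniques on the training data
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
net = MLPClassifier(max_iter=2000, random_state=0).fit(X_train, y_train)

# Evaluation: apply each model to data it has never seen
for name, model in [("decision tree", tree), ("neural network", net)]:
    print(name, "test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```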

Deployment. The chosen model is implemented as part of a working system. This might be part
of a website, or a smartphone app, or deployed in a data centre. The nature of the data being
processed might change over time, so the model is kept under continual review, possibly
restarting the process if necessary.

These steps provide a helpful process to follow but data mining solutions can be hard to sell.
They often replace intelligent workers with ‘intelligent’ computers. It is impossible to say
whether they will work until you have seen the data, and you can’t demonstrate it working at a
sales pitch (not on their data, anyway). Selling the project falls back to describing the power of
machine learning approaches to find patterns in the data that are difficult or even impossible
for humans to see, but always with the condition that the patterns are there to be found in the
first place.

We’ll now have a look at an example of a real world project, and see how CRISP-DM might be
applied to it.

END

DA009
Discrete vs Continuous Distributions

We’ve talked about how variables are the basic building blocks of the data we are modelling. A
particular example of whatever we are measuring has one value for each variable. When we are
trying to build a model, we will have a lot of examples: what we refer to as our training data. In
reality, this is a set of observations of the real world.

Many real world systems are stochastic, or uncertain, but that uncertainty can be characterised
using probabilities. Many statistical tests and measures are based on the idea that an observed
variable will take different values with different probabilities. Statistical machine learning,
which is the basis of the techniques in this course, is the same. Distributions are a way to
represent the probabilities of the range of values a variable can take.

You might have already heard of one kind of distribution: the normal or Gaussian distribution.
This can be represented as a bell curve, where values in the middle of the bell shape are more
likely than those at either side. Many things you might measure take this distribution. For
example, shoe sizes. We know that there is not an even spread of foot sizes in the population.
In fact, if we stopped 300 people on the street and measured their feet we could plot it like this
(show bar chart). If you were stocking a shoe shop, knowing this distribution would be very
helpful. You would know that you should get more shoes in sizes 7 to 10 than sizes outside that
range.

With the uniform distribution, all values are equally likely. A classic example is a fair die: as you
roll it again and again, each side should come up roughly the same number of times. There are
many other distributions, but we won’t cover those in this course.

The bar plot is a visualisation of what’s known as a probability mass function. This describes the
distribution of a discrete variable by telling us a probability for each value the variable can take.
Shoe sizes are discrete because there is only a fixed number of them. Don’t be confused by the
fact that there are half sizes! Shoe size is discrete because it’s not possible to have any value,
like 6.2.

In contrast, the length of a person’s foot is continuous, because if we measure a foot it really
can take any length: it’s very unlikely to be exactly 26.0cm. The distribution of a continuous
variable can be described by a probability density function. This doesn’t directly tell you a
probability for a given value. It can’t, because there are an infinite number of possible values,
so in theory each one has a probability of zero! But it does give a relative measure: if the
density for one value is greater than that of another, the first value is more likely to occur.
The normal distribution is an example of a probability density function: it’s a smooth curve. We
can approximate this curve with a histogram, which groups the possible values into bins. The
idea of grouping the continuous values into bins treats the variable as a discrete one, for which
we can actually get probabilities.

If we made a few measurements of people’s feet, we’d get a histogram like this (show
histogram). You can see that this roughly follows the normal distribution. We have to be careful
to choose the bin sizes correctly: too few look like this (show histogram 2) and too many look
like this (show histogram 3). We’ll revisit this idea when we talk about bias and variance in the
next session.
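
Here is a minimal sketch of this bin-size effect in Python, using simulated foot lengths drawn from a normal distribution (the numbers are invented):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
foot_lengths_cm = rng.normal(loc=26.0, scale=1.5, size=300)  # simulated sample

# Too few bins hide the shape; too many make it noisy
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [3, 15, 100]):
    ax.hist(foot_lengths_cm, bins=bins)
    ax.set_title(f"{bins} bins")
plt.show()
```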

It’s not crucial that you understand the mathematics underlying all this. What is important is
that we are treating continuous and discrete variables slightly differently, and that we are
trying to approximate the probabilities that the variables take different values. You can think of
probability mass functions in terms of bar plots; and probability density functions as smooth
curves that can be approximated by histograms; and you’ll be about right.

At this point, it’s important to note the difference between the full population and a sample. In
reality, across the whole population of people in a city who might ever visit our shoe shop,
there is one distribution. If we just took foot measurements of the people who visited the shop
in any one day and plotted those (what we call a sample), the bar plot would be roughly the
same shape as for the whole population. Perhaps a few more people with small feet happened
to visit us, so it wouldn’t be an exact match. Another day, the distribution would look slightly
different another way; perhaps a few more people with larger feet. In any application of
machine learning, we’re trying to build a model of the population, but we usually only have a
sample to learn from. We’ll come back to this point later in the course as well.

So far we’ve talked about individual variables: the distribution of one variable is known as a
marginal distribution. Typically in machine learning we are building a model that approximates
the distribution of many variables together: this is known as a joint distribution, but the basic
principles are the same.

We can also plot the distributions of variables in our raw data to help with data cleaning. We’ll
look at that in the next screen. Before that, though, let’s have a look at a few examples of
distributions in Orange.

END

DA010
Data cleansing

Real world data is usually full of errors and mistakes. One of my first jobs was an entire year of
writing code to clean up manually entered data in the Scottish historical fisheries database. A
simple thing like a missed decimal point would mean a fish that was 100m long instead of
10cm! This will obviously cause problems if we try to model the data. As such, understanding
the distributions of variables in our data is key to both the Data Understanding and Data
Preparation stages of CRISP-DM. We’ll now discuss a few examples of the things you might be
looking for in your data.

Outliers are a small number of values that are much larger or much smaller than all the others
for a given variable. These could be data entry errors or could just be very rare examples.
Outliers can disrupt the data mining process and give misleading results. You should either
remove them, if possible correct them, or, if they are important, collect more data to reflect
this aspect of the world you are modelling.

Minority Values are those that only appear very infrequently in the data. They are not
necessarily very big or small, just rare. You should ask whether they appear often enough to
contribute to the model. If they are important, it is probably worth collecting more data in which they
are represented; otherwise we will want to remove them from the data as they will just make it
harder to construct the model without adding much to it. You should also check whether they
are the result of data entry errors: for example, occasionally someone entered “M” instead of
“Male”. Where something like this is the case, they can be fixed easily.

Large Majority Values are where most of the data is concentrated in a small space, where the
same value appears many times. Although it looks like there is a lot of data, perhaps it is largely
about the same thing and in practice not much use. In this example we have customer data for
a UK company that also trades overseas: most of the customers are in the UK. Usually we either
ignore variables with a large majority value, or we try to collect more data.

Flat and Wide Variables are so-called because their distribution is flat and wide. All the values
are minority values, with only one or two of each possible value. These variables are of little use
in data mining because the goal is to find general patterns from specific data, but no such
patterns can exist if each data point is completely different. A classic example is a unique
database ID. Usually this will essentially be a random number. These variables should normally
be excluded from a model.

Data Balance should also be considered at this stage. This is where our target variable has a large
majority value, and so is heavily skewed towards one outcome. Imagine I want to predict
whether or not a prospective customer will respond to a mailing campaign. I collect the data,
and the split of people who responded vs those who didn’t respond looks like this: as you might
expect, most people didn’t respond. We then train a model on this data set, which learns and
reports a success rate of 98%. That sounds good, but when I put a new set of prospects through
to see who to mail, what happens? ... the system predicts ‘No’ for every single prospective
customer. With a response rate of 2% on the campaign, the system is right 98% of the time if
it always says ‘No’, so it never chooses anybody to target in the campaign. This clearly is not
much use to anyone! There are two solutions to this problem. One. We can choose a random
subset of our overrepresented values so that the data becomes evenly split between “yes” and
“no”. Two. We can add more “yes” examples. This can either be done by collecting more data
until we match the number of “no” examples, or we can use a technique called SMOTE to
generate more “yes” examples that look similar to the ones we already have. You can read
more about SMOTE on the link provided along with this video.
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
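
Here is a minimal sketch of both options in Python on a made-up mailing dataset; the SMOTE part assumes the separate imbalanced-learn package described in the linked article:

```python
import pandas as pd

# Invented, heavily unbalanced target: only 2% of prospects responded
data = pd.DataFrame({"income": range(1000)})
data["responded"] = ["yes"] * 20 + ["no"] * 980

# Option 1: randomly undersample the "no" rows to match the number of "yes" rows
yes_rows = data[data["responded"] == "yes"]
no_rows = data[data["responded"] == "no"].sample(len(yes_rows), random_state=0)
balanced = pd.concat([yes_rows, no_rows])
print(balanced["responded"].value_counts())

# Option 2: generate synthetic "yes" examples with SMOTE
# (assumes the imbalanced-learn package: pip install imbalanced-learn)
from imblearn.over_sampling import SMOTE
X, y = data[["income"]], data["responded"]
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(pd.Series(y_res).value_counts())
```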

These are a few of the most common things you’ll need to do to prepare the data for
modelling. As mentioned earlier you’ll also need to deal with missing values. You might also
need to normalise values to a common range, and introduce what are called “dummy variables”.
We’ll cover these in a later practical exercise.

For now, let’s have some practice at data cleaning in Orange.

END

DA011
Jupyter + python walkthrough

Screencast of opening Jupyter. Show that we can run all, or click to run one or control+enter to
run one. Show that output appears inline. Show editing and a snippet with a print statement.
Identify variables. Show that they keep their values (and the numbers next to the cells
identify the order things were run in). Show how to reset the kernel. Show how to get help.

END

DA012
Session review

In this session we’ve introduced you to the idea of data analytics and machine learning, which is
largely about making predictions or classifications from data. We discussed several real-world
applications. We saw that machine learning is best suited to situations where we do not have
existing rules that we could use to design a software system. Instead, we need to learn patterns
in the data, which are embedded into what’s called a model. We introduced the CRISP-DM
process for data analytics and thought about how it might be applied to example problems. We
will have more practice with CRISP in a later session once we have looked in more detail at the
techniques used in each of its stages. You’ll also follow CRISP as a major part of your
assignment.

We talked about the basic building blocks of data, variables, and identified the different types
of variables: numerical variables which can be discrete or continuous, and categorical variables
which can be nominal or ordinal. We saw that distributions capture the values that variables
can take, and how this can be used in identifying and cleaning the errors that frequently appear
in real data. We also practised loading and cleaning data in both Orange and Python.

END
