
Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Computer Science


2022 | LIU-IDA/LITH-EX-A--2022/048--SE

Predicting television advertisement reach with machine learning models
Åskådarprediktion av TV-reklam med hjälp av maskininlärningsmodeller

Alexander Olsson, Joar Måhlén

Supervisor: Alireza Mohammadinodooshan


Examiner: Niklas Carlsson

Linköpings universitet
SE–581 83 Linköping
+46 13 28 10 00, www.liu.se
Upphovsrätt
This document is made available on the Internet - or its future replacement - for a period of 25 years
from the date of publication, provided that no exceptional circumstances arise.
Access to the document implies permission for anyone to read, download, and print single copies
for personal use, and to use it unchanged for non-commercial research and for teaching. Subsequent
transfers of copyright cannot revoke this permission. All other use of the document requires the
copyright owner's consent. To guarantee authenticity, security, and accessibility, there are solutions
of a technical and administrative nature.
The author's moral rights include the right to be named as the author, to the extent required by
good practice, when the document is used as described above, as well as protection against the document
being altered or presented in a form or context that is offensive to the author's literary or artistic
reputation or integrity.
For additional information about Linköping University Electronic Press, see the publisher's website
http://www.ep.liu.se/.

Copyright
The publishers will keep this document online on the Internet - or its possible replacement - for a
period of 25 years starting from the date of publication barring exceptional circumstances.
The online availability of the document implies permanent permission for anyone to read, to down-
load, or to print out single copies for his/her own use and to use it unchanged for non-commercial
research and educational purpose. Subsequent transfers of copyright cannot revoke this permission.
All other uses of the document are conditional upon the consent of the copyright owner. The publisher
has taken technical and administrative measures to assure authenticity, security and accessibility.
According to intellectual property law the author has the right to be mentioned when his/her work
is accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures
for publication and for assurance of document integrity, please refer to its www home page:
http://www.ep.liu.se/.

© Alexander Olsson, Joar Måhlén


Abstract

Despite the entry of many media services, television remains the most used media ser-
vice and accounts for the largest advertising spending globally. One of the main metrics for
measuring the success of a television advertising campaign is reach, the percentage
of the intended target audience that has seen the television advertisement. To help plan
television advertisements, the industry aims to find new methods for predicting television
advertisement reach more accurately. Therefore, it is of interest to explore the possibility of
utilizing machine learning regression models. This report examines how well four machine
learning regression models are suited for predicting reach based on historical campaign
data. The results indicate that the best-performing model is an XGBoost model with a
mean absolute percentage error just below 5%. The report also describes which features
impact reach the most and if data augmentation can improve the performance of the ma-
chine learning models.
Acknowledgments

Thanks to the employees at GMP Systems for welcoming us and for taking the time to share
your knowledge regarding the media industry. An extra thank you goes to our supervisor
Tomas Hiselius for guiding us throughout this project and for introducing us to
different domain experts.
We would also like to thank our examiner Niklas Carlsson and supervisor Alireza
Mohammadinodooshan at Linköping University for assisting with this project
through several reviews and answering all our questions.

Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vii

List of Tables viii

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.6 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background 4
2.1 Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Method 13
3.1 Frameworks and hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Hyper-parameter optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.7 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4 Results 26
4.1 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3 Feature importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5 Discussion 39
5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 The work in a wider context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

6 Conclusion 45

6.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Bibliography 47

A Additional details 51
A.1 Target feature plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.2 Data augmentation model hyper-parameters . . . . . . . . . . . . . . . . . . . . 54

B Extended experiments 56
B.1 Start year removed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
B.2 Feature selection using augmented data . . . . . . . . . . . . . . . . . . . . . . . 57
B.3 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

List of Figures

2.1 Single layer perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9


2.2 Neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.1 Histograms over reach distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


3.2 Histograms over GRP distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Histograms over CPP30 distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Histograms over prime time distribution . . . . . . . . . . . . . . . . . . . . . . . . 16
3.5 Histograms over position in break distribution . . . . . . . . . . . . . . . . . . . . . 17
3.6 Histogram over the spot length relation distribution . . . . . . . . . . . . . . . . . . 17
3.7 Histograms over the start dates distribution . . . . . . . . . . . . . . . . . . . . . . . 18
3.8 Histograms over the periods distribution . . . . . . . . . . . . . . . . . . . . . . . . 18
3.9 Histogram over the age distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.10 Density histogram over the distribution of the number of channels . . . . . . . . . 19
3.11 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.1 Loss graph feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


4.2 Baseline predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Ridge predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4 Pre-pruned decision tree predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.5 Post-pruned decision tree predictions . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.6 XGBoost predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.7 Neural network predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.8 Feature importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.9 Feature importance without feature transformation . . . . . . . . . . . . . . . . . . 34
4.10 Histograms over CPP30 distribution for original and augmented datasets . . . . . 37
4.11 Histograms over periods distribution for original and augmented datasets . . . . . 38

A.1 Target vs GRP scatter plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


A.2 Target vs CPP30 scatter plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
A.3 Target vs prime time and position in break scatter plots . . . . . . . . . . . . . . . . 52
A.4 Target vs spot length relation scatter plot . . . . . . . . . . . . . . . . . . . . . . . . 52
A.5 Target vs start dates scatter plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
A.6 Target vs periods scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
A.7 Target vs target ages scatter plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
A.8 Target vs number of channels scatter plot . . . . . . . . . . . . . . . . . . . . . . . . 54

B.1 Model results cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58


B.2 Feature importance cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
B.3 Zoomed feature importances cross-validation . . . . . . . . . . . . . . . . . . . . . . 59

List of Tables

2.1 People meter company per country . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3.1 Missing values handling for data augmentation . . . . . . . . . . . . . . . . . . . . 25

4.1 Models mean absolute percentage error . . . . . . . . . . . . . . . . . . . . . . . . . 27


4.2 Ridge regression hyper-parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Decision tree attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 XGBoost hyper-parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Neural network architecture and hyper-parameters . . . . . . . . . . . . . . . . . . 32
4.6 Feature selection deterioration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.7 Feature ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.8 Feature tiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.9 Models mean absolute percentage error with augmented data . . . . . . . . . . . . 37

A.1 Ridge regression hyper-parameters with augmented data . . . . . . . . . . . . . . . 54


A.2 Decision tree attributes with augmented data . . . . . . . . . . . . . . . . . . . . . . 54
A.3 XGBoost hyper-parameters with augmented data . . . . . . . . . . . . . . . . . . . 54
A.4 Neural network architecture and hyper-parameters with augmented data . . . . . 55

B.1 Models mean absolute percentage error . . . . . . . . . . . . . . . . . . . . . . . . . 56


B.2 Model results with augmented feature selection . . . . . . . . . . . . . . . . . . . . 57

1 Introduction

Marketing is one of the biggest industries in the world. Companies annually invest billions
of dollars in marketing, and some of the world's most profitable companies base their income
on it. Marketing can have both negative and positive impacts on society as a whole: advertising
can help consumers obtain the information needed to make purchases that fulfill their needs,
but it can also lead to overconsumption and unnecessary purchases. This report focuses on
marketing from the advertisers' perspective.
It is of interest for advertisers to be able to get an estimate of how many possible con-
sumers their advertisements will reach. That estimate will enable the advertisers to evaluate
the decision to purchase that media. According to Danaher et al., television advertisements
account for the majority of advertising spending across all media [16]. There-
fore, this project will attempt to predict the percentage of the targeted individuals that were
reached by a television advertisement campaign using machine learning models.
In this chapter, the motivation, aim, and research questions for the project are presented
with a brief explanation of the delimitations.

1.1 Motivation
There exists previous studies that investigate predicting television viewers using machine
learning [16, 29, 36, 45]. However, these papers attempt to predict the number of impressions
for a given television spot, meaning the number of viewers where a viewer can be counted
several times. This paper will investigate television advertisement campaigns. Since televi-
sion advertisements are typically shown multiple times during a campaign, only predicting
impressions is not enough. An individual may have seen the advertisement multiple times.
Instead, this paper will predict a campaign’s reach, defined as the number of individuals
that have seen the campaign, where each individual is counted only once [46].
According to Danaher et al. television advertisements are priced based on the predicted
number of impressions for the advertisement [16]. It is of interest for advertisers to be able
to, based on the predicted impressions, predict how many of the intended individuals their
campaign will reach. Advertisers can use the predicted reach to better plan how much ad-
vertisement the campaign requires to fulfill marketing goals.


To train this paper’s machine learning models, historical data is required. This thesis project
is done in collaboration with GMP Systems, which provides the necessary dataset.

GMP Systems
GMP Systems has developed a platform called GMP365 for managing activities related to media
investments. Today, advertisers buy services from media agencies and media auditors for their
marketing campaigns. The agencies help the advertiser select which media to buy, while the
auditors help analyze the result of the media purchase. With GMP365, media agencies provide
the platform with the marketing data, giving GMP Systems access to historical marketing data.

1.2 Aim
This project aims to create and evaluate different machine learning regression models that can
be used to predict television advertisement reach. The models will be trained on historical
television advertisement data. The models will be evaluated based on the accuracy of their
television advertisement reach predictions.

1.3 Research questions


1. Which machine learning model1 can best predict television advertisement reach to help
with the planning of future television advertisement campaigns?
As mentioned in Section 1.1 the related research is mostly focused on predicting tele-
vision impressions. Because of the similarities between impressions and reach, this
report will utilize insights from those studies to choose models and evaluation metrics.
To enable a model to be used for the planning of television advertisement campaigns,
the selection of features will correspond to the attributes agencies estimate during the
planning phase.

2. How well can a machine learning model perform2 by utilizing historic campaign data
to predict an unseen campaign’s reach?
For the model to be useful when planning a television advertisement campaign the
performance of the model must be evaluated. To our knowledge, there are no previous
studies on predicting the reach of whole advertisement campaigns. Because of this, the
models will be evaluated against a baseline instead of the results from other research.

3. What features are of most importance for predicting television advertisement reach?
As previously stated, the selection of features is based on attributes used when planning
a television advertisement campaign. Therefore, it is of interest to understand how
strongly each feature contributes to reach. With that information, it would be possible
to choose which attributes to prioritize.

4. Can the performance of a machine learning model be improved by increasing the size
of the dataset through data augmentation, by adding mid-campaign results as full cam-
paigns?
Given that the way television watching is changing with today’s multi-media climate
[17], the availability of relevant historical television data is limited. Therefore, it is of
interest to explore the possibility of augmenting the data to further improve the machine
learning models.
1 The examined models are presented in Section 4.2.
2 Definitions of evaluation metrics can be found in Section 3.6.


1.4 Contributions
There exist previous studies that utilize machine learning models to predict either the ratings
or the reach of individual television spots, both for television programs and for advertisements.
This thesis will instead attempt to predict the reach of television marketing campaigns, which
typically consist of many advertisement spots running over a longer period. Accurate estimation
of a whole campaign’s result is therefore needed, and providing it is this thesis’ main contribution.
This investigation will not only yield insights into how well modern machine learning
techniques can predict the metric television advertisement reach but also give an indication
of what features are the most impactful for predicting reach. Given that the features will be
selected based on metrics and indices that are set during the planning phase of a campaign
the contribution is valuable in a real-world context.
Lastly, given the way television is being watched is constantly changing, the size of useful
historical data is at risk of always being too small [17]. This thesis will therefore investigate a
method for increasing the size of datasets to mitigate this issue.

1.5 Delimitations
The dataset used for this project is limited to data provided by GMP Systems. Given
that television habits and supply may differ between countries, this report limits itself
to training models on historical data from countries in northern Europe (Denmark,
Norway, and Sweden). The target audiences of the television marketing campaigns examined
in this project are categorized only by gender (male, female, or both) and age group. This means
that narrower audiences, such as men between 20 and 40 who own a certain sports package, are
excluded from the dataset.

1.6 Thesis outline


This thesis is divided into six separate chapters.
Chapter one is the Introduction, where the motivation, aim, and research questions are
presented. The chapter also gives a brief description of the thesis’ contributions and delimi-
tations.
Chapter two Background elaborates on how the media industry works, presents a sum-
mary of the related works, and describes the theory behind relevant machine learning models
and methods.
Chapter three presents the Method used for answering the research questions. The chap-
ter includes an investigation of the dataset’s feature distributions and motivations for the
method and models used for this project.
The Results chapter presents all results from the different experiments. A discussion of
individual results is included.
In the Discussion chapter, high-level discussions of the combined results are presented.
Assumptions and limitations of the project are highlighted and their effects on the project are
discussed. The method of the study is critiqued and possible alternatives to the methods are
presented.
The final chapter Conclusion, presents a summary of the conclusions that were made
throughout the report. Suggestions for future work are also included.

2 Background

In this chapter, the relevant background for the project is discussed. Both practices within the
advertisement industry, as well as technical aspects, are covered. In Section 2.2 the related
works for this project are summarized.

2.1 Industry
According to Katz, it is important to have clear objectives with marketing campaigns before
any media purchase [28]. A media planner needs to determine how to reach the target audi-
ence for the product or service through advertising in media. A clear definition of the target
audience is therefore essential. However, the target audience for the media advertisement
is often more general than the target audience for the product or service. Target audiences in
media are traditionally very general and most often based on demographics such as gender,
age, and geographic location.
Furthermore, Katz explains that two important factors of the media objective are the target
audience’s reach and frequency. Similar to reach, frequency is defined as the number of times
the target audience has been exposed to the advertisement [28]. An example of a media
purchase goal could be to reach 20% of the target audience men age 30-50 at least three times.
To achieve the media objectives a media plan is needed. Abe explains the process of
buying television advertisement typically involves three parties, advertisers, media agencies,
and television networks [2]. Advertisers contact the media agencies with objectives they want
to achieve through their marketing. Based on these objectives of the advertisers the media
agency constructs a media plan that specifies a budget, the duration of the advertisement, the
scheduling distribution, and gross rating points (GRP). GRP is a way of counting impressions
for a particular target audience, one GRP equals one percent of the whole target audience.
Given that an advertisement can be viewed multiple times by a person GRP can exceed 100
[28]. When the media plan is approved by the advertiser, the plan is sent to the television net-
works. The networks suggest a scheduling and actual cost for the advertisements according
to the media plan’s objectives. The cost of an advertisement is based on many factors, accord-
ing to Abe one of the most significant factors is the predicted impressions the advertisement
will have [2].
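The text does not write the GRP calculation out explicitly; under the definition above (one GRP corresponds to impressions equal to one percent of the target audience), it amounts to the following, shown here with illustrative numbers:

\mathrm{GRP} = 100 \cdot \frac{\text{impressions}}{\text{target audience size}}

For example, a campaign generating 2,500,000 impressions within a target audience of 1,000,000 individuals yields 100 · 2,500,000 / 1,000,000 = 250 GRP, which illustrates how GRP can exceed 100 when individuals are exposed several times.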
After the campaign, it is possible to see the result of the advertisement schedule [28].
Television statistics are typically measured with the use of people meters [2]. A people meter
is an audience measuring tool that is installed together with a television set. The people
meters account for demographic properties, each viewer needs to log in using a unique code
when watching television [39]. The people meters are only used by a small portion of the
total audience; statistical analysis is then used to get a representation of the whole audience.
In the Nordic countries, multiple businesses collect television statistics, see Table 2.1.

Country   Company   URL
Denmark   Kantar    https://www.tv.kantargallup.dk
Norway    Kantar    https://kantar.no/medier/tv/
Sweden    MMS       https://www.mms.se

Table 2.1: Companies responsible for collecting television statistics in the Nordic countries.

2.2 Related work


As mentioned in Section 1.1, this project investigates the possibility of predicting future television
campaign reach. Given that there is limited previous work on predicting reach, the related
work section covers papers that have predicted television impressions.
Sereday and Cui studied how machine learning can be used to predict television impres-
sions for TV networks [45]. Their models relied both on historical impressions data and pro-
gram characteristics as input. The input included but was not limited to impressions, reach,
genre, air time and, marketing spend. The authors tested many regression models, includ-
ing linear regression, penalized regression, multiple adaptive regression splines, random de-
cision forests, support vector machines, neural networks, and gradient boosting machines.
The model that received the best result was a gradient boosting machine using the XGBoost
library. However, the authors concluded that a combined model might be best suited for this
particular task.
Khryashchev et al. studied the properties of the 100 largest TV networks in the US and
evaluated the performance of forecasting models to predict hourly impressions [29]. The
authors examine the performance of four models including seasonal averaging that is used
as a baseline, Facebook’s prophet, Fourier extrapolation, and XGBoost. They also produced
a model that combined the predictions of the four mentioned models. The result from the
baseline model was on par with Prophet, XGBoost, and Fourier extrapolation. The combined model
was the only model that was significantly better than the baseline. When comparing the
Symmetric Mean Absolute Percentage Error the combined model was 11% better than the
baseline.
Meyer and Hyndman predict television impressions based on historical data from three
television channels in New Zealand in July 2003 [36]. The article compared three aggrega-
tion approaches combined with three machine learning models (regression, decision trees,
and neural networks). The aggregations that were used were to predict the impression for
the whole population at once, to train one model per population segment ("Middle-Aged",
"Kids", "Older", and "PayTV"), and to train a model that receives demographic, periodic and
genre-based information as input called individual. Meyer and Hyndman concluded that the
most accurate model and aggregation was to train a neural network model for each segment.
Danaher et al. compare Bayesian model averaging and a linear regression model to fore-
cast TV impressions [16]. The article concluded that seasonality parameters were crucial for
predicting impressions and that the genre of programs had an impact. Both models were
trained with and without adding a program-specific random element. Danaher concluded
that a linear regression model with random effects performed the best with the Bayesian
model averaging as a close second.
Nikolopoulo et al. predict the impact of sporting events, defined as how many more viewers
there are compared to the normal number of viewers at that time and channel [41]. This prediction
uses self-labeled input indicating the event’s importance, competition, and timing. Linear
regression, nearest neighbor, and neural network models were implemented and compared.
Nikolopoulo et al. concluded that a k-nearest-neighbor method and a simple neural network
performed the best.
Bína et al. investigate how different scheduling factors impact the reach of a single televi-
sion advertisement spot [6]. The study is conducted on all television commercial spots in the
Czech Republic from 2017, resulting in a dataset consisting of 5.608 million advertising spots.
To capture the features’ importance, the authors create an ANOVA table using a linear regres-
sion model. From the table, they could conclude that the TV channel was the most important
feature followed by the program type before the break, and the time of the day. However, the
linear model only captured 56% of the variability and the results should therefore be consid-
ered with care. To capture more of the variability they implemented a neural network that
showed promising results, though it lacked interpretability of the feature importance.

2.3 Machine learning


In this section, the relevant machine learning theory for this project is presented.

Feature selection
Feature selection is the process of selecting a subset of features from all input features to
reduce computational cost and improve predictor performance [12]. Using feature selection
can help improve the understanding of the data, reduce training time, and improve prediction
results [38]. Molina et al. present different feature selection methods with the main goal of
improving predictor performance [38]. One of the introduced methods is forward variable
selection. According to Guyon and Elisseeff, forward variable selection is considered universal
and resistant to overfitting [24].
In forward variable selection, features are added iteratively [38]. For every iteration, every
unselected feature is tried one at a time and the performance of the predictor is evaluated. The
feature that improves the predictor the most is selected. If no feature improves the predictor
the feature selection is finished.
Guyon and Elisseeff suggest using a linear predictor with forward variable selection as a
filter of features before training a more complex model [24]. That way the dimensionality of
the input to the complex model is reduced. Naïve Bayes, least-square linear prediction, and
support vector machines are mentioned as popular choices for the linear predictor.
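As an illustration of the procedure described above, a minimal forward variable selection loop with a linear (ridge) filter predictor could look as follows. This is a sketch, not the thesis’ implementation; the feature matrix X is assumed to be a pandas DataFrame, and the choice of estimator, cross-validation, and scoring are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, candidate_features):
    """Iteratively add the feature that improves the linear predictor the most."""
    selected, best_score = [], -np.inf
    improved = True
    while improved:
        improved = False
        scores = {}
        for feature in candidate_features:
            if feature in selected:
                continue
            # Evaluate the predictor with the candidate feature added.
            trial = selected + [feature]
            scores[feature] = cross_val_score(
                Ridge(), X[trial], y, cv=5,
                scoring="neg_mean_absolute_error").mean()
        if scores:
            best_feature = max(scores, key=scores.get)
            if scores[best_feature] > best_score:
                selected.append(best_feature)
                best_score = scores[best_feature]
                improved = True  # Keep iterating while some feature still helps.
    return selected
```

The selected subset can then be passed on to a more complex model, in line with the suggestion by Guyon and Elisseeff.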

Hyper-parameter optimization
Normally, when training machine learning models, a set of parameters is adjusted by the
algorithm to optimize a training criterion, often in terms of a loss. However, the
model itself often also has parameters that can be tuned to enable the model to achieve greater
performance. These parameters are called hyper-parameters [5]. Two of the most common
methods for optimizing the hyper-parameters are grid search and random search.
In grid search, a set of possible values is defined for every hyper-parameter. Then every
combination of hyper-parameter values is trained and evaluated [5]. The combination that
yields the best performance is chosen. Given that every combination is evaluated, grid search
suffers from the curse of dimensionality. However, Bergstra and Bengio explain that grid
search is simple to implement and parallelizable and that it is suited for problems with few,
one to two, hyper-parameters [5]. According to Bengio, it is important to consider exploring
hyper-parameters beyond the defined border of the value set if the best value for any hyper-
parameter is found near the border [3]. Better values for the hyper-parameters may be found
with further exploration.

6
2.3. Machine learning

In random search, all combinations of possible hyper-parameter values are not evaluated.
Instead, hyper-parameter values are randomly sampled iteratively from a selected interval
[3]. Bengio claims that continuous-valued hyper-parameters should be sampled uniformly in
the log-domain and discrete-valued hyper-parameters from a multinomial distribution [3].
However, with knowledge regarding likely good hyper-parameters, the multinomial distri-
bution can be defined accordingly. For instance, this could be beneficial for hyper-parameter
values that only are sensible together with other distinct hyper-parameter values.
Bergstra and Bengio compared the efficiency of grid search and random search when tun-
ing the hyper-parameters of a neural network [5]. They concluded that random search was in
all cases at least as good as the grid search and the optimization was far less time-consuming.
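Both strategies are available in scikit-learn; the sketch below is purely illustrative, with an XGBoost regressor and hypothetical search spaces. Following the advice above, the continuous learning rate is sampled log-uniformly in the random search.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBRegressor

# Grid search: every combination of the listed values is trained and evaluated.
grid_search = GridSearchCV(
    XGBRegressor(),
    param_grid={"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.3]},
    scoring="neg_mean_absolute_percentage_error", cv=5)

# Random search: values are sampled from distributions for a fixed number of
# iterations; the continuous hyper-parameter is drawn uniformly in the log-domain.
random_search = RandomizedSearchCV(
    XGBRegressor(),
    param_distributions={"max_depth": randint(2, 10),
                         "learning_rate": loguniform(1e-3, 0.5)},
    n_iter=50, scoring="neg_mean_absolute_percentage_error", cv=5, random_state=0)

# Usage: grid_search.fit(X_train, y_train); random_search.fit(X_train, y_train)
# The tuned values are then available in grid_search.best_params_ and random_search.best_params_.
```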

Detecting outliers
Outlier detection is about analyzing samples in the dataset to check how well they conform
with the overall pattern in the dataset. Similar to noise, outliers complicate data analysis.
Therefore, it is common to remove outliers using outlier detection methods before data anal-
ysis [51].
There are many methods for outlier detection. Breunig et al. introduced a density-based
method detecting outliers based on a local outlier factor (LOF) [8]. The local outlier factor is a
measurement of how much a data point’s local density deviates from its k nearest neighbors’
local density. Breunig et al. define the local density for a data point as the inverse of the
average distance to the k nearest neighbors [8].
For example, consider a data point that has a small distance to its k nearest neighbors, while
those neighbors have a large distance to their respective k nearest neighbors. Then the local
density of the data point is high while the local density of its k nearest neighbors is low.
As a result, the data point will have a low local outlier factor, which indicates that the
data point is an inlier.
The value for k is set by first establishing a lower and upper bound for potential values
for k. Breunig et al. recommend never setting a lower bound under 10, the value for the
upper bound should be task-specific and determined by the maximum number of points in
a cluster that could be outliers [8]. The local outlier factor for a point is the maximum local
outlier factor encountered when evaluating for all possible k’s in the range.
According to Chandola et al., the positive aspect of using a nearest-neighbor-based tech-
nique for outlier detection is that the method is unsupervised and does not need to make any
assumptions regarding the datasets distribution [11]. However, nearest-neighbor techniques
have high computational complexity. The computations for the local outlier factor have a
complexity of O(N^2).
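A minimal sketch of this kind of outlier detection with scikit-learn’s LocalOutlierFactor is shown below. Sweeping k over a range and keeping each point’s maximum factor follows the recommendation above; the bounds and the cut-off threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def max_lof_scores(X, k_min=10, k_max=50):
    """Return each data point's maximum local outlier factor over a range of k."""
    max_lof = np.full(len(X), -np.inf)
    for k in range(k_min, k_max + 1):
        lof = LocalOutlierFactor(n_neighbors=k).fit(X)
        # negative_outlier_factor_ stores -LOF, so negate it to obtain the factor.
        max_lof = np.maximum(max_lof, -lof.negative_outlier_factor_)
    return max_lof

# Usage: keep only the points whose factor stays below a chosen threshold.
# scores = max_lof_scores(X); X_clean = X[scores < 1.5]
```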

Data augmentation
Data augmentation is a technique where artificial data points are added to the training set.
This can then increase both the quality and the size of a dataset used for deep learning. This
technique has been proven successful in many domains where data from real-world appli-
cations are limited [58]. Though deep learning models perform very well on certain tasks,
the performance is reliant on the existence of large training datasets to avoid overfitting [58].
This is what causes the need for data augmentation.
Data augmentation increases the dataset size through manufactured data points covering
more of the input space while retaining true target values [58]. This process prevents machine
learning models from overfitting by enabling a small dataset to achieve the characteristics of
a larger dataset [49].
According to Shorten and Khoshgoftaar, there are two main categories of data augmen-
tation: data warping and oversampling [49]. Data warping is a method where the input features
are transformed while the target label is kept the same. This can, for example, be done through
cropping, jittering, and flipping [58]. Oversampling augmentation, however, is when artifi-
cial data points are created and added to the training data set; there are several different
methods to create the artificial data points, such as feature space augmentation, learning
methods, and statistical models [49, 58].
Cui et al. introduced a data augmentation technique called window slicing [15]. Window
slicing is an oversampling augmentation method for time series, where the time series
is sliced to create more data. For instance, if a data point covers the times {t_0, ..., t_n}, a new
data point is created for the times {t_i, ..., t_j} where 1 ≤ i ≤ j ≤ n. This way, several data
points can be created from each original time series data point [15].
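A small NumPy sketch of window slicing is given below; the fixed window length and the reuse of the original target for every slice are assumptions made for illustration, not details from the cited paper.

```python
import numpy as np

def window_slices(series, target, window):
    """Create several (slice, target) pairs from a single time series data point."""
    series = np.asarray(series)
    slices = []
    for i in range(len(series) - window + 1):
        # Each contiguous window t_i ... t_{i+window-1} becomes a new data point.
        slices.append((series[i:i + window], target))
    return slices

# Example: a series with 6 observations and window length 4 yields 3 new data points.
# window_slices([1, 2, 3, 4, 5, 6], target=0.42, window=4)
```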
According to Fawaz et al., the window slicing method for time series classification was
inspired by image cropping used for data augmentation in computer vision [20]. The idea
was that it can be guaranteed that the cropped image holds the same information as the
original. However, the authors claim that this assumption does not necessarily hold for time
series data, given that it is not possible to guarantee that no information has been lost when
a time region is cropped [20].

Models
Below, the theory for the machine learning models is presented.

Linear model
A linear model is a machine learning model that estimates the target value by mapping the
input features with coefficients to create the linear function

f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j , \quad (2.1)

where β are the trainable coefficients and X are the features [25].


According to Hastie et al. linear models are trained by minimizing a cost function to best
fit a set of data points [25]. Least squares is the most popular linear estimation method where
the cost function is

\mathrm{cost}(\beta) = \sum_{i=1}^{N} \big( y_i - f(X_i) \big)^2 = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 , \quad (2.2)

where y_i is the target value for data point i.


Linear models can suffer from large variance, yielding low accuracy [25]. However,
linear models can be improved by utilizing regularization. Regularization penal-
izes the model’s complexity, resulting in a higher bias and possibly lower variance, which leads
to a more general model [25].
Ridge regression is a regularization method that aims to keep the coefficient values low
[25]. This is done by adding the size of the coefficients to the cost function, yielding the new
regularized cost function

\mathrm{cost}(\beta, \lambda) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 , \quad (2.3)

where λ is the penalty factor that controls how much the model is regularized. λ ≥ 0; a larger
value of λ results in more regularization, while λ equal to zero yields no regularization, in which
case the cost function is equal to Equation 2.2.

8
2.3. Machine learning

Lasso regression is a regularization method that, similar to ridge regression, penalizes the
size of the model’s coefficients. However, the penalty term differs from the one used in ridge
regression. Instead of the sum of squares, lasso regression penalizes the cost function with the
sum of the coefficients’ absolute values [25], yielding the cost function

\mathrm{cost}(\beta, \lambda) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| . \quad (2.4)

Both of the regularization methods shrink the size of the coefficients. However, with lasso
regularization, individual coefficients can be shrunk to zero, excluding the corresponding fea-
ture from the prediction and thus performing a form of feature selection [25]. Ridge regression
does not have this property, all features are always utilized to some degree.
Ogutu et al. explain that ridge regression is ideal if there are many features, all with
non-zero, normally distributed coefficients [42]. The authors state that ridge regression
is also suited for datasets where the features are highly correlated. According to Benoit, it is
common practice to perform a log transformation on features that have a skewed distribution
[4]. The logarithmic transformation aims to convert the skewed distribution into a distribution
that is more similar to a Gaussian distribution. Lasso regression, however, is not robust to correlated
features. When features are correlated, lasso regression will arbitrarily retain one of the
correlated features, and the other feature’s corresponding coefficient will be set to zero.
Generally, lasso regression is less computationally expensive than ridge regression, al-
though there are certain algorithms with the same computational cost as ridge regression
[25].
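As an illustration of the two regularizers, the scikit-learn estimators below correspond to the cost functions in Equations 2.3 and 2.4 (alpha plays the role of λ). The log transformation of skewed features and the specific alpha values are illustrative assumptions, not the settings chosen later in this thesis.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Log-transform skewed features so they look more Gaussian, scale them,
# and fit a regularized linear model; alpha corresponds to the penalty factor.
ridge_model = make_pipeline(FunctionTransformer(np.log1p), StandardScaler(), Ridge(alpha=1.0))
lasso_model = make_pipeline(FunctionTransformer(np.log1p), StandardScaler(), Lasso(alpha=0.01))

# After fitting, lasso can shrink individual coefficients exactly to zero,
# effectively removing the corresponding features:
# lasso_model.fit(X_train, y_train); lasso_model[-1].coef_
```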

Neural network
Neural networks are non-linear machine learning models whose design is inspired by neu-
roscience (the study of the nervous system) [23]. The idea of the model is to have several
units, called artificial neurons, that are connected so that some neurons’ outputs become the
inputs of other units, creating a network of neurons called a neural network.
Artificial neurons are often defined as single-layer perceptrons. Lippmann describes the
single-layer perceptron as follows: a weighted sum of the inputs minus a threshold θ is passed
through a function called the activation function ϕ [34]. See Figure 2.1 for clarification.

Figure 2.1: A single layer perceptron with input size m. Weighted input and threshold is
summed before being used as input to the activation function.

The network of neurons consists of single-layer perceptrons connected in layers. Goodfellow et al.
explain that the first layer of units represents the input layer, where each perceptron’s input
is the input of the model. Following the input layer are hidden layers, whose input is the
previous layer’s output and whose output becomes the input of the next
layer. The last layer produces the output of the model and is therefore called the output layer
[23]. See Figure 2.2 for clarification.


Figure 2.2: A neural network with input size m and output size n. The network contains
several hidden layers, all layers are fully connected with each other.

Lippmann explains that during training the neural network will optimize the weights be-
tween the neurons and the threshold for each neuron [34]. The optimization is achieved by
first computing a gradient through a backpropagation algorithm. Then traditionally stochas-
tic gradient descent utilizes the gradient to update the weights and thresholds. The mag-
nitude of the update is based on the learning rate. However, as the training data and the
number of parameters increase stochastic gradient descent often tends to be too slow, for this
reason, faster algorithms are instead used as replacement [23].
According to Goodfellow et al., the learning rate for neural network models is one of the
hardest hyper-parameters to optimize [23]. A model’s performance is highly sensitive to
changes in the learning rate. One way of handling this issue is by utilizing
an adaptive learning rate where each parameter has its own learning rate that decays during
training [23]. AdaDelta, RMSProp, and Adam are three popular algorithms with adaptive
learning rates.
To speed up the learning of stochastic gradient descent momentum can be applied to the
method. The idea of momentum is to update the parameters not only based on the currently
calculated gradient, but also factor in the direction and magnitude of previous gradients [23].
A variant of the momentum algorithm is Nesterov momentum. The difference between regu-
lar momentum and Nesterov momentum is that the gradient is calculated after the previous
gradient has been applied. Some of the optimization algorithms that utilize momentum are
stochastic gradient descent with momentum, RMSProp with momentum, and Adam. Good-
fellow et al. explain that there is no consensus about which is the best optimization algorithm
[23].
Given their potentially sophisticated architectures, neural networks can be very expres-
sive models. This makes neural networks prone to overfitting, and regularization
techniques are often required for them to fit unseen data better [55].
Similar to ridge and lasso regression, neural networks often add a regularization term
to the cost function to reduce the magnitude of the weights during backpropagation [23].
This regularization technique is called weight decay and the two most common penalty terms
are L2 and L1 . When using L2 penalty the sum of the squared weights is multiplied with a
manually set scaling parameter, this term is added to the cost function. Similarly, using L1 ,
the absolute value of the weights is summed instead of the squared weights. Goodfellow et
al. mention an important difference between L1 and L2 , L1 can set weights equal to zero, L2
can not. As a result, the weight matrix after L1 regularization may become sparse [23]. This
sparsity property of L1 can be used as a mechanism for feature selection.
Srivastava et al. introduced the neural network regularization method dropout [55]. With
dropout, hidden neurons are temporarily removed ("dropped") at random during the training
of a neural network. Different neurons are "dropped" for every mini-batch. The
idea behind dropout is to train many less complex neural networks that share weights and
then during testing average over all of them. Srivastava et al. claim that dropout prevents
overfitting and that the method works well combined with momentum, adaptive learning
rate, and weight decay [55].
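A minimal Keras sketch combining the ideas above (L2 weight decay, dropout, and the Adam optimizer with its adaptive learning rate) is shown below. The layer sizes, dropout rate, regularization strength, and loss are illustrative assumptions and not the architecture reported later in this thesis.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_network(n_features):
    """Fully connected regression network with L2 weight decay and dropout."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dropout(0.2),  # randomly drop hidden neurons during training
        layers.Dense(32, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dense(1),      # single output: the predicted reach
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mae")
    return model

# Usage: model = build_network(n_features=11)
# model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2)
```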


Decision trees for regression


Decision trees that handle regression problems are referred to as regression trees and are a type
of Classification and regression tree (CART) [56]. A regression tree is a supervised method for
learning an unknown regression function. The tree structure is a series of logical comparisons,
comparing one of the data point’s feature variables at each step. The prediction for a new
observation is the numerical value of a leaf node, obtained by following the path of the tree
structure down to said leaf node. The numerical value of that leaf node is the average of
the target variables of the training data points in that leaf node, making the decision
tree a non-linear model.
The tree is created with all data points starting from the root node with new branches
created through splits. To find the best binary split all possible conditions are explored. This is
done by sorting all occurring values for each feature to then incrementally try every condition
for the corresponding feature. The binary split that results in the lowest error is chosen, given
that the error is lower than before the split [56]. The splitting is performed until the tree’s max
depth is reached or the node size has reached its minimum. One way to regularize the tree is
to set the max depth and/or node size to a smaller value forcing the tree to become smaller.
This is called pre-pruning.
Another method to avoid overfitting is to use a technique called post-pruning [56]. Post-
pruning is performed by first creating an overly sized tree that is overfitted. The goal is
then to find the best sub-tree that minimizes the validation error. Initially, the overly sized
tree is divided into smaller sub-trees together with an error estimation for each tree. To select
which sub-tree will be used, cost complexity pruning can be applied [18]. The idea of cost com-
plexity pruning is to prune the branches that least increase the error rate per pruned leaf. The
post-pruning proceeds until only the root node remains, saving each sub-tree in the process.
Through cross-validation, the best sub-tree is chosen.
According to Torgo, regression trees have several benefits. Regression trees include
automatic variable selection, and thus no variable selection is required beforehand [56]. How-
ever, due to the binary comparisons, a regression tree can have poor accuracy in some
domains. Also, decision trees have properties enabling the model to quantify the importance
of each feature [60]. The feature importance score is the measurement of the total error reduc-
tion by that feature.
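The sketch below illustrates both regularization strategies with scikit-learn: pre-pruning via maximum depth and minimum node size, and post-pruning via cost complexity pruning selected on a validation set. The parameter values and the existence of separate training and validation splits are assumptions for illustration (the thesis itself describes choosing the sub-tree via cross-validation).

```python
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.tree import DecisionTreeRegressor

def fit_pruned_trees(X_train, y_train, X_val, y_val):
    # Pre-pruning: restrict depth and minimum leaf size while the tree is grown.
    pre_pruned = DecisionTreeRegressor(max_depth=6, min_samples_leaf=10)
    pre_pruned.fit(X_train, y_train)

    # Post-pruning: grow an overly large tree, then pick the cost complexity
    # pruning strength (ccp_alpha) that minimizes the validation error.
    path = DecisionTreeRegressor().cost_complexity_pruning_path(X_train, y_train)
    best_alpha, best_error = 0.0, float("inf")
    for alpha in path.ccp_alphas:
        tree = DecisionTreeRegressor(ccp_alpha=alpha).fit(X_train, y_train)
        error = mean_absolute_percentage_error(y_val, tree.predict(X_val))
        if error < best_error:
            best_alpha, best_error = alpha, error
    post_pruned = DecisionTreeRegressor(ccp_alpha=best_alpha).fit(X_train, y_train)
    return pre_pruned, post_pruned
```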

Extreme gradient boosting


Extreme gradient boosting, commonly known as XGBoost, is derived from gra-
dient tree boosting machines and was initially created by Tianqi Chen [13]. Gradient tree
boosting is a supervised ensemble method that combines a set of weak CART tree models to
accurately predict a target variable. Gradient boosting can be used for both regression and
classification. For regression, the weak individual models consist of binary regression trees.
Every leaf node of the regression trees represents a continuous score. The final prediction
outcome is calculated by summing the output of each regression tree [13]. This means that the
parameters the model optimizes are regression trees rather than numerical values. Therefore,
traditional optimization techniques cannot be applied directly.
Instead, for optimization, an additive strategy is applied. The model starts with zero trees.
Then it greedily adds the regression tree that most reduces the model’s loss function [13]. A
predefined number of regression trees are sequentially added. During the construction of a
new binary regression tree, a leaf splits if the loss function is reduced by a given threshold,
provided that the max depth has not been reached.1
One advantage of gradient tree boosting is that it mitigates some of the disadvantages
with single tree models while keeping many of the advantages [21]. The predictions from
a single tree model are based on rough piecewise estimates that may result in inaccuracy.
1 https://xgboost.readthedocs.io


Given that gradient tree boosting combines multiple tree models, the estimates from the
individual trees can offset each other, resulting in a final estimate that is more fine-grained.
Similar to decision trees, gradient tree boosting includes the possibility to extract the fea-
ture importances [60]. XGBoost has three options for extracting the feature importance, using
weight, gain, or cover. Weight is defined as the total number of times the feature is used for
splitting a node. Gain is the average error reduction when a node is split by the feature. Cover
is the average number of data points that were affected by the splits from the feature. Xia et
al. explain that gain is the best-suited method for comparing feature importance with CART
trees because it is the average of the CART tree’s feature importance [60].
The model is regularized through shrinkage and feature subsampling [13]. Shrinkage is
used to reduce the influence of upcoming trees and thus leaving space for further improve-
ment, similar to a learning rate. Feature subsampling is a technique also used in random
forest, where for each tree a random subset of the features is used as input [7].
Chen and Guestrin explain that what distinguishes XGBoost from ordinary gradient tree
boosting is that the loss function is also regularized [13]. Similar to Equation 2.3 the loss
function is regularized with the sum of the squared weights to keep the model less complex.
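A minimal XGBoost sketch with the regularization mechanisms mentioned above (shrinkage via the learning rate, feature subsampling, and an L2 penalty on the leaf weights), plus gain-based feature importance. All hyper-parameter values here are illustrative assumptions, not those selected later in this thesis.

```python
from xgboost import XGBRegressor

def fit_xgboost(X_train, y_train):
    model = XGBRegressor(
        n_estimators=300,      # number of regression trees added sequentially
        max_depth=5,           # maximum depth of each tree
        learning_rate=0.1,     # shrinkage, dampening the influence of later trees
        subsample=0.8,         # row subsampling for each tree
        colsample_bytree=0.8,  # feature subsampling for each tree
        reg_lambda=1.0,        # L2 regularization of the leaf weights
    )
    model.fit(X_train, y_train)
    # Gain: the average error reduction obtained when splitting on each feature.
    importance = model.get_booster().get_score(importance_type="gain")
    return model, importance
```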

Evaluation
Naser and Alavi mention 29 evaluation metrics commonly used in machine learn-
ing regression models, including Mean squared error, Mean absolute error, and Mean absolute
percentage error [40]. The authors explain that absolute and squared errors are utilized
to ensure that there are no cancellations between positive and negative errors. The different
metrics punish errors differently. Shcherbakov et al. mention that the mean squared error is
heavily influenced by outliers and that it is highly dependent on the fraction of the dataset
used, meaning that different dataset fractions can result in a substantially different mean
squared error [48].
The equations for the mentioned metrics can be seen below, A being the actual value and
P being the predicted value. E represents the error, calculated according to E = A - P.

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} E_i^2    (Mean squared error)

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |E_i|    (Mean absolute error)

\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{|E_i|}{|A_i|}    (Mean absolute percentage error)
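For reference, a direct NumPy implementation of the three metrics, assuming arrays of actual values A and predictions P:

```python
import numpy as np

def regression_metrics(A, P):
    """Compute MSE, MAE, and MAPE from actual values A and predictions P."""
    A, P = np.asarray(A, dtype=float), np.asarray(P, dtype=float)
    E = A - P                                    # error, E = A - P
    mse = np.mean(E ** 2)                        # mean squared error
    mae = np.mean(np.abs(E))                     # mean absolute error
    mape = 100 * np.mean(np.abs(E) / np.abs(A))  # mean absolute percentage error
    return mse, mae, mape

# Example: regression_metrics([0.5, 0.4], [0.45, 0.42])
# returns approximately (0.00145, 0.035, 7.5)
```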

3 Method

In this chapter, the methods for answering the research questions are presented. This in-
cludes the frameworks, hardware, and dataset as well as motivations for which models will
be implemented.

3.1 Frameworks and hardware


The machine learning models in this project will be implemented using the following li-
braries: Keras 2.8.0 (https://keras.io), TensorFlow 2.8.0 (https://www.tensorflow.org),
scikit-learn 1.0.2 (https://scikit-learn.org), and XGBoost 1.5.2 (https://xgboost.readthedocs.io).
Keras is an open-source deep-learning API working as an interface for the machine learn-
ing platform Tensorflow [14, 1]. It presents a Python interface for artificial neural networks.
Keras helps users quickly build machine learning models with a minimum amount of code.
It also has the benefit of being able to run on both CPUs and GPUs.
Similar to TensorFlow, scikit-learn is also an open-source machine learning library [44, 9].
It provides functionality for numerous regression, classification, and pre-processing algo-
rithms. It differs from Keras by providing a library for creating traditional machine learning
models rather than deep neural networks.
XGBoost is a library used for creating gradient boosted tree models, designed for effi-
ciency and flexibility [13]. Like TensorFlow, XGBoost has support for GPUs and distributed
systems.
All computations will be run on a Macbook Pro 2021 with an Apple M1 Pro (32GB RAM)
processor.

3.2 Dataset
The available dataset is rather small, consisting of 2237 data points representing television
campaigns from Denmark, Norway, and Sweden, where the target audience is defined by
only gender and age group. Each data point contains numerous parameters from television
advertisement campaigns; the goal is to create a model that can predict the reach of a target
audience for a television campaign.
• Reach: The percentage of the target audience that has seen the ad at least once.
The parameters that will be used as possible features are parameters that are set during
the planning phase. The possible features are the following (a sketch of how they might be
assembled as model input follows the list):
• GRP: GRP is a measurement of how many people have been exposed to a media adver-
tisement campaign, meaning it is a way of counting impressions. One GRP represents
that the number of impressions is equal to one percent of the whole target audience.
If an individual has been exposed to the campaign multiple times, they will be counted
accordingly.
• CPP30: CPP30 is the cost the advertiser paid in SEK for each GRP30. GRP30 is a nor-
malized value of GRP as if the commercial was 30 seconds long. GRP30 is used instead
of GRP to enable comparisons of television commercials that are of different lengths.
• Prime time: The percentage of the GRP30 that was reached during prime time. The
time interval of prime time is defined differently for weekdays and weekends as well
as different for different countries. Prime time represents the time that people watch
the most television.
• Position in break: The percentage of the GRP30 that was reached during premium
position in breaks. Usually, the premium position is defined as the two first or last
spots during a commercial break.
• Spot length relation: The length relation is an indicator of the run-time length for a
commercial. A higher Spot length relation value indicates a shorter commercial and vice
versa. If the indicator is equal to one, the commercial is 30 seconds long. However, the
relation is not linear and therefore, does not have a unit and should not be interpreted
as seconds.
• Start year: The start year of the campaign.
• Start month: The start month of the campaign.
• Periods: One period is equal to the campaign running for one week. However, a week
that spans two months is divided into two periods.
• Maximum age: The maximum age of the target audience for the television advertise-
ment campaign. If a target audience does not have a set maximum age, it is manually
set to 99.
• Minimum age: The minimum age of the target audience for the television advertise-
ment campaign.
• Number of channels: The number of channels the television advertisement campaign
has aired on.
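As a sketch of how these campaign parameters might be arranged as model input, the snippet below uses hypothetical column names; the actual column names and file format of the GMP Systems data are not given in the text.

```python
import pandas as pd

# Hypothetical column names for the planning-phase features and the target.
FEATURES = ["grp", "cpp30", "prime_time", "position_in_break", "spot_length_relation",
            "start_year", "start_month", "periods", "max_age", "min_age", "n_channels"]
TARGET = "reach"

def load_campaigns(path):
    """Load the campaign data and split it into a feature matrix and a target vector."""
    df = pd.read_csv(path)
    # Target audiences without a set maximum age are capped at 99, as described above.
    df["max_age"] = df["max_age"].fillna(99)
    return df[FEATURES], df[TARGET]

# Usage (hypothetical file name): X, y = load_campaigns("campaigns.csv")
```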

Variable distributions
In this section, the distribution of the features and the target variable is examined. The distri-
bution of each variable is plotted as a histogram before any pre-processing and determined by
visual inspection. Given that no pre-processing has been applied, incomplete and extreme out-
lying values exist. To make it easier to identify the correct distribution of each variable, plots
with incomplete values and extreme outliers removed are also displayed. The determined
distribution then decides which normalization technique will be used on the feature, see
Section 3.2.


Reach
Figure 3.1 shows histograms of the distribution of reach in the dataset. The visual inspection
of Figure 3.1b indicated the distribution of reach to be a skewed normal distribution. In
Figure 3.1a it is clear that there exist several missing values indicated by the large bar at reach
equal to zero.

(a) Reach density histogram  (b) Non-zero reach density histogram
Figure 3.1: Reach histograms; the non-zero histogram also plots a line of the approximated
skewed normal distribution.

GRP
Figure 3.2 shows histograms of the distribution of GRP in the dataset. The unfiltered distri-
bution in Figure 3.2a is highly skewed. As mentioned in Section 2.3, features with a skewed
distribution can be logarithmically transformed to appear more Gaussian. Therefore, the GRP
values are logarithmically transformed. When examining the distribution of the logarithmic
GRP in Figure 3.2b, the distribution resembles a logistic distribution.

(a) GRP density histogram  (b) Non-zero and logarithmic GRP density histogram
Figure 3.2: GRP histograms, the non-zero logarithmic histogram also plots a line of the ap-
proximated logistic distribution.

CPP30
Figure 3.3 shows histograms of the distribution of the variable CPP30 in the dataset. As with
GRP, the unfiltered distribution in Figure 3.3a is skewed. For the same reason, the CPP30
values are logarithmically transformed. The resulting distribution, seen in Figure 3.3b, also
resembles the logistic distribution.

(a) CPP30 density histogram  (b) Non-zero and logarithmic CPP30 density histogram
Figure 3.3: CPP30 histograms, the non-zero logarithmic histogram also plots a line of the
approximated logistic distribution.

Prime time
The histograms of the distribution for the variable prime time can be seen in Figure 3.4. The
prime time variable is assumed to be distributed normally when examining Figure 3.4b.

(a) Prime time density histogram  (b) Non-zero Prime time density histogram
Figure 3.4: Prime time histograms, the non-zero histogram also plots a line of the approxi-
mated normal distribution.

Position in break
The histograms of the distribution for the variable position in break can be seen in Figure 3.5.
Position in break is assumed to have a bimodal normal distribution. However, the approxi-
mated line in Figure 3.5b does not fit the height of the second peak.


(a) Position in break density histogram  (b) Non-zero Position in break density histogram

Figure 3.5: Position in break histograms, the non-zero histogram with removed outliers also
plots a line of the approximated bimodal normal distribution.

Spot length relation


The histogram of the distribution for the variable spot length relation can be seen in Figure
3.6. The distribution of the spot length relation is undetermined.

Figure 3.6: Density histogram over the spot length relation variable.

Start date
The histograms of the distribution for the campaign start year and month can be seen in
Figure 3.7. Given that the values for both year and month are in small ranges, no filtering
was needed to see the distribution. In Figure 3.7a we see a distribution that is solely based on
the historical data available to GMP Systems. In Figure 3.7b the distribution is expected to be
uniform.


(a) Start year density histogram  (b) Start month density histogram
Figure 3.7: Histograms over the start dates.

Periods
Figure 3.8 shows the histograms of the distribution of the campaign periods. Given that the
distribution of periods is discrete, it is assumed to be a Poisson distribution.

(a) Periods density histogram  (b) Non-zero Periods density histogram
Figure 3.8: Periods histograms; the non-zero histogram with removed outliers also plots a
line of the approximated Poisson distribution.

Age
Figure 3.9 shows the histograms of the distributions of the maximum and minimum age
variables. Both variables are discrete and follow a similar distribution that is assumed to be
Poisson. No filtering was needed for either distribution.


(a) Maximum age density histogram  (b) Minimum age density histogram
Figure 3.9: Histograms over the maximum and minimum ages; both plots also include a line
of the approximated Poisson distribution.

Number of channels
The distribution of the number of channels for the television marketing campaigns is dis-
played in Figure 3.10.

Figure 3.10: Histogram over the distribution of the number of channels.

Pre-processing
As noticed in the distribution figures, every variable has missing values and contains extreme
outliers. To ensure the performance of the machine learning models, a series of pre-processing
steps are required:

• Remove all incomplete data points


The distribution figures indicate that there are data points where one or more features
are zero. After discussion with GMP Systems, it was concluded that such data points
are incomplete and thus need to be removed.

• Outlier removal
The dataset contains extreme outliers, for example noticeable in Figure 3.8a. To handle
this, the method presented in LOF: Identifying Density-Based Local Outliers will be uti-
lized [8]. As mentioned in Section 2.3, a k-nearest-neighbor outlier removal method,
like local outlier factor, does not make distribution assumptions. Given that the vari-
ables have different distributions this method is deemed suitable. The complexity cost
of the local outlier factor should not be an issue given that the dataset size is small. The
lower bound will be set to ten as recommended by Breunig et al. [8]. Because of the size
of the dataset, the upper bound will be set to a small value to avoid removing inliers. The
upper bound will therefore be set to twenty.

• Logarithm transform
As mentioned, Figures 3.2a and 3.3a indicate that GRP and CPP30 are highly skewed. To
make their distributions appear more Gaussian, the logarithm of both features will be used
as input features in addition to the original feature values [4].

• Hold-out method
To enable comparisons between the different models used in this project the hold-out
method will be applied. 80% of the dataset will be used to train the models, while the
remaining 20% will be used to evaluate the performance of each model.

• Normalization
To improve the results of some of the machine learning models, normalization of the
features needs to be applied [10, 27]. For the features with a bell-curve distribution (log-
arithm of GRP, logarithm of CPP30, prime time, periods, minimum age, and maximum
age), the z-score, defined by the equation

x' = (x - µ) / σ,

where µ is the distribution's mean and σ is its standard deviation, will be used. Ac-
cording to Jain et al., the z-score is especially useful when the distribution of the variables
is Gaussian [26]. The features picked for z-score normalization have distributions that are
similar to or equal to a Gaussian distribution.
For the remaining features, min-max normalization, defined by the equation

x' = (x - min(x)) / (max(x) - min(x)),

is applied. All data will be normalized based on values (e.g., minimum and standard
deviation) computed from the training data points only. A minimal code sketch of these
pre-processing steps is given at the end of this subsection.

Before pre-processing, the dataset contained 2237 data points. The pre-processing filtered
out 14% of the dataset: 11% was removed due to missing values and the remaining 3% was
removed by the outlier filtering, resulting in a final dataset size of 1929 points. In Appendix
A.1, plots of how each feature relates to the target value are presented.
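To make the above concrete, the following Python sketch illustrates how these pre-processing steps could be chained with scikit-learn. It is only an illustration under assumptions that are not part of the actual implementation: the column names ("reach", "grp", "cpp30") are placeholders for the real schema, and the local outlier factor step is a simplified interpretation where a point is flagged if any neighbourhood size between the bounds 10 and 20 marks it as an outlier.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def preprocess(df: pd.DataFrame, zscore_cols, minmax_cols):
    # 1. Remove incomplete data points (zero values are treated as missing).
    df = df[(df != 0).all(axis=1)].copy()

    # 2. Add logarithmic transforms of the skewed features.
    df["log_grp"] = np.log(df["grp"])
    df["log_cpp30"] = np.log(df["cpp30"])

    # 3. Outlier removal with the local outlier factor, flagging points that
    #    are marked as outliers for any neighbourhood size in [10, 20].
    X = df.drop(columns=["reach"])
    outliers = np.zeros(len(df), dtype=bool)
    for k in range(10, 21):
        outliers |= LocalOutlierFactor(n_neighbors=k).fit_predict(X) == -1
    df = df[~outliers]

    # 4. Hold-out split: 80% of the data for training, 20% for testing.
    train, test = train_test_split(df, test_size=0.2, random_state=0)
    train, test = train.copy(), test.copy()

    # 5. Normalize, fitting the scalers on the training data only.
    z_scaler, mm_scaler = StandardScaler(), MinMaxScaler()
    train[zscore_cols] = z_scaler.fit_transform(train[zscore_cols])
    test[zscore_cols] = z_scaler.transform(test[zscore_cols])
    train[minmax_cols] = mm_scaler.fit_transform(train[minmax_cols])
    test[minmax_cols] = mm_scaler.transform(test[minmax_cols])
    return train, test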

3.3 Hyper-parameter optimization


When optimizing the hyper-parameters of every model, a random search method will be
applied as presented in Section 2.3. The motivation for this choice is that Bergstra and Ben-
gio concluded that grid search was only suitable for up to two parameters and that random
search yielded equal or better results than grid search with far less computing [5].
To evaluate each hyper-parameter setting during the random search, cross-validation will be
applied. The cross-validation will use five folds, and every model will try 100 iterations of
hyper-parameters. The cross-validation will be implemented using the scikit-learn library.
When an optimal hyper-parameter value is found it will be evaluated against the prede-
fined hyper-parameter bounds. If the optimal value is near the upper or lower bound, values
beyond the defined border will also be explored as recommended by Bengio [3].
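As an illustration, such a search could be set up with scikit-learn's RandomizedSearchCV. The estimator and the sampled penalty-factor range below are only an example (the Ridge penalty factor from Section 3.4), not the configuration actually used for every model, and a reasonably recent scikit-learn version is assumed for the MAPE scorer.

from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": uniform(0.01, 1.99)},  # sampled in (0.01, 2)
    n_iter=100,  # 100 sampled hyper-parameter settings
    cv=5,        # five-fold cross-validation
    scoring="neg_mean_absolute_percentage_error",
    random_state=0,
)
# search.fit(X_train, y_train) yields the best value in search.best_params_;
# if that value lies near a bound, the bound is widened and the search repeated.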


3.4 Feature selection


Given that the size of the dataset is small it is important to choose a feature selection method
that is resistant to overfitting. For this reason, forward variable selection is especially useful
given that it is robust against overfitting as mentioned in Section 2.3.
The forward variable selection will use a least-square model as its predictor, according to
the suggestions by Guyon and Elisseeff [24]. The least-square model will be a Ridge regres-
sion model. Lasso regression was not picked because it can remove the impact of individual
features; given that the goal is to evaluate whether a specific feature set is useful, it makes no
sense to use a model that can nullify features by itself. For every iteration, the Ridge regression
model's alpha value will be optimized through the hyper-parameter optimization technique
explained above. The penalty factor bounds will be set to (0.01, 2). The performance of the
currently selected features is evaluated using the hold-out method on a validation set.
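A minimal sketch of this forward variable selection procedure is shown below. The helper is hypothetical: the per-iteration alpha optimization is abbreviated to a small grid within the bounds (0.01, 2) instead of a full random search, and stopping when the validation loss no longer decreases is one plausible interpretation of the stopping criterion.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def forward_selection(X_train, y_train, X_val, y_val, candidates):
    selected, losses = [], []
    remaining = list(candidates)
    while remaining:
        trial = {}
        for feature in remaining:
            cols = selected + [feature]
            # Optimize the penalty factor within the bounds (0.01, 2).
            trial[feature] = min(
                mean_squared_error(
                    y_val,
                    Ridge(alpha=a).fit(X_train[cols], y_train).predict(X_val[cols]),
                )
                for a in np.linspace(0.01, 2, 20)
            )
        best = min(trial, key=trial.get)
        if losses and trial[best] >= losses[-1]:
            break  # adding another feature no longer lowers the validation loss
        selected.append(best)
        losses.append(trial[best])
        remaining.remove(best)
    return selected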

3.5 Models
In this section, the models that will be implemented for the project are explained. For
every model, a list of the hyper-parameters to be optimized is included. The hyper-parameter
bounds were initially larger than those listed; the final bounds were reached through trial and
error by moving a bound whenever the resulting parameter value lay near its border. Every
model will be evaluated both with and without feature selection and data augmentation.

Baseline - Ridge regression


Goodfellow et al. mention in their book Deep learning that depending on the problem com-
plexity it is often suitable to use a linear statistical model as a default baseline model [23]. The
authors also mention that it is good practice to include regularization in the baseline model,
provided that the training dataset does not contain tens of millions of data points.
The baseline model for this project will be a Ridge regression model. The motivation for
this choice is that Ridge regression is a simple linear model suitable for highly correlated
features, see Section 2.3. Ridge regression also includes regularization which as mentioned is
good practice for a baseline model. To keep the baseline as simple as possible, the logarithmic
feature transform, and feature selection will be excluded.
The first model that will be evaluated against the baseline is also a ridge regression model.
However, this model will also use the logarithmically transformed features and will be eval-
uated with and without feature selection. Henceforth, this model will be referred to as the
ridge regression model and the baseline model will be called baseline.
The hyper-parameter optimization for both the baseline and the ridge regression model
will be:

• Penalty factor λ: (0.01, 2)

All features and the target variable will be normalized before training both ridge regression
models.
The baseline is not expected to yield the best result of all models. Instead, it will be used to
compare the other models' performances. The Ridge regression models will be implemented
using the scikit-learn library.

Decision tree regressor


Another regression model that will be examined is a decision tree regressor model. One of
the reasons for this choice is that the decision tree does not make any distributional assump-
tions [54]. The model is deemed suitable because, given the distribution figures, it can be
concluded that not all feature distributions are known. Also, according to Galathiya et al.,
decision trees are suitable when working with small datasets [22].
Another reason that makes the decision tree suitable is that the model is self-explanatory
in its decision making [54], unlike, for example, a neural network, which is considered a black
box where the relation between input and output is unknown. This explanatory property is
useful to get insights into how to achieve a high television advertisement reach according to
the model, which can give GMP Systems a deeper domain knowledge. Moreover, this also
enables the model to extract the features’ importance which will be useful when evaluating
the features.
Two different optimization techniques will be evaluated on the decision tree regressor,
post-pruning, and hyper-parameter tuning commonly referred to as pre-pruning. Both tech-
niques will be performed without feature normalization. The pre-pruning will follow the
methodology for hyper-parameter optimization stated above. The hyper-parameter that will
be optimized is:

• Max depth of the tree: (7, 15)

All other possible hyper-parameters will be set to their default values.


Cost-complexity pruning will be used in accordance with Section 2.3: an overly large tree
is fitted and then pruned by repeatedly removing the branch that increases the error the least,
until only the root node is left. Cross-validation is then applied to find the sub-tree that
yields the best result.
However, Song and Lu explain that decision trees are at risk of both underfitting and
overfitting given a small dataset [54]. Due to the size of this project’s dataset, this is something
that needs to be considered. To avoid overfitting cross-validation will be used both in pre-
and post-pruning. The decision tree regressor model will be implemented using the scikit-
learn library.
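A sketch of the post-pruning step is given below, assuming pre-processed training arrays X_train and y_train. It follows the general cost-complexity pruning recipe (compute the pruning path of an overly large tree, then pick the ccp_alpha with the best cross-validated score) rather than the exact implementation used in this project.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def post_pruned_tree(X_train, y_train):
    # Fit an overly large tree implicitly and obtain its pruning path.
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(
        X_train, y_train
    )
    best_alpha, best_score = 0.0, -np.inf
    for alpha in np.unique(path.ccp_alphas):
        tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
        score = cross_val_score(tree, X_train, y_train, cv=5).mean()
        if score > best_score:
            best_alpha, best_score = alpha, score
    # Refit the sub-tree that gave the best cross-validated score.
    return DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(
        X_train, y_train
    )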

XGBoost
According to Chen and Guestrin XGBoost has been shown to give state-of-the-art results on
many tasks and it is heavily used in machine learning competitions [13]. In Sereday and Cui’s
study XGBoost yielded the best results among the studied models [45]. Also, as mentioned
in Section 2.3, the XGBoost model combines the advantages with single tree models and the
structure of the model mitigates some of the disadvantages. For instance, the XGBoost model
also has the property to extract the features’ importance. For these reasons, XGBoost is con-
sidered to be a promising model to implement.
The XGBoost model has several hyper-parameters that can be optimized. This project
will optimize:

• Shrinkage η: (0.01, 0.3)

• Feature sub-sampling ratio: (0.25, 1)

• Max depth: (3, 9)

• Number of trees: (50, 200)

• L2 regularization λ: (1e-5, 0.5)

All other possible hyper-parameters will be set to their default values. The model will be im-
plemented in Python using the XGBoost library. When training and evaluating the XGBoost
model, normalization of features will not be performed.
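For illustration, the model could be instantiated as below. The parameter values are arbitrary examples inside the stated search bounds, not tuned results, and the mapping between the thesis terminology and the XGBoost parameter names (e.g. shrinkage as learning_rate) is spelled out in the comments.

from xgboost import XGBRegressor

model = XGBRegressor(
    learning_rate=0.1,     # shrinkage eta, searched in (0.01, 0.3)
    colsample_bytree=0.8,  # feature sub-sampling ratio, searched in (0.25, 1)
    max_depth=5,           # searched in (3, 9)
    n_estimators=100,      # number of trees, searched in (50, 200)
    reg_lambda=0.1,        # L2 regularization lambda, searched in (1e-5, 0.5)
)
# model.fit(X_train, y_train) and model.predict(X_test) are then used as for
# any scikit-learn style regressor; no feature normalization is applied.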


Neural network
In both Nikolopoulo et al.’s and Meyer and Hyndman’s studies a neural network yielded
the best results [41, 36]. This motivates the decision to further examine the performance of
neural networks in a similar domain. However, it is important to notice that the two articles
had a significantly larger dataset size than this project. The neural networks in both articles
were shallow neural networks, meaning with one or two hidden layers [32]. This and the fact
that a large dataset is needed to fully utilize the potential of deep learning leads to a shallow
neural network being the most reasonable choice [43].
The activation function for the hidden neurons will be the rectified linear unit (ReLU)

ϕ( x ) = max(0, x ),

as recommended by Sharma et al.[47]. For the output layer the recommendation by Merkel
et al. to use the linear activation function

ϕ(x) = a · x + c,

when using a traditional neural network for regression will be followed [35].
To avoid overfitting two regularization techniques will be applied. While training the
neural network, the validation error will be monitored. The model will stop training when
the validation error starts to increase, a method called Early stopping. Secondly, weight decay
will be utilized. To avoid the neural network nullifying input features L1 regularization will
not be used, only L2 regularization will be applied. It is considered important to keep all
input features to enable evaluation of feature importance, see Section 3.6 for details. The
batch size will be set to 32 according to the recommendations by Goodfellow et al. and the
model will be trained for a maximum of 100 epochs [23].
The model will use normalized data as input and utilize the Adam optimizer to speed
up the learning process. The neural network architecture will be determined through hyper-
parameter optimization. The number of hidden layers and the width will be considered
hyper-parameters. The bounds of the hyper-parameters for the neural network are:
• Number of hidden layers: (1, 2)

• Width per hidden layer: (3, 64)

• Learning rate η: (0.0001, 0.01)

• L2 regularization λ: (0.0001, 0.01)


All other possible hyper-parameters will be set to their default values. The model will be
implemented in Python using the Keras library.
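A sketch of such a shallow network in Keras is shown below. The width, learning rate, and L2 penalty are example values inside the stated search bounds, and the validation split used for early stopping is an assumption made for the sake of a runnable example.

from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_network(n_features, width=32, learning_rate=0.001, l2=0.001):
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(width, activation="relu",
                     kernel_regularizer=regularizers.l2(l2)),
        layers.Dense(width, activation="relu",
                     kernel_regularizer=regularizers.l2(l2)),
        layers.Dense(1, activation="linear"),  # linear output for regression
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="mean_squared_error")
    return model

# Early stopping monitors the validation error during training:
# model.fit(X_train, y_train, validation_split=0.2, batch_size=32, epochs=100,
#           callbacks=[keras.callbacks.EarlyStopping(restore_best_weights=True)])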

3.6 Evaluation
The performance metric used to evaluate the result of the different models will be the mean
absolute percentage error (MAPE). During a discussion with representatives from GMP Sys-
tems, they explained that MAPE best captures the magnitude of the errors. They explained
that for instance, it is much worse to predict a 10% reach when the actual result was 20%,
compared to predicting 70% when the true value was 80%. The comparison between the
models will be performed using the hold-out method.
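For reference, MAPE can be written as the following small helper (scikit-learn's mean_absolute_percentage_error computes the same quantity, but as a fraction rather than a percentage):

import numpy as np

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # in percent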
The procedure for evaluating the importance of each feature will be performed through
two operations:
• Insignificant features
Each model will be evaluated using both all features and the subset features from fea-
ture selection. If the performance of the models is similar or better using the subset
features, it can be concluded that the excluded features are insignificant. If that is not
the case, it can be concluded that forward variable selection with Ridge regression is
not suitable for this particular dataset.

• Feature ranking
The subset of features from feature selection will be ordered based on importance. Sim-
ilarly, XGBoost and decision tree regressor will rank the features based on importance.
As described by Xin et al., XGBoost should use the gain method to be comparable with
the decision trees [60]. The importances from all models will be normalized to enable
comparisons; a short sketch of how they could be extracted is given after this list. To
determine which features are most important for predicting television advertisement
reach, the feature rankings will be compared and analyzed in relation to the models' and
the feature selection's performance.
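As an illustration (not the exact implementation), the importances could be extracted and put on a common scale as follows; the fitted models and the feature-name list are assumed to exist.

import numpy as np

def normalized_importance(feature_names, model=None, gain_scores=None):
    # Decision trees in scikit-learn expose feature_importances_ directly;
    # for XGBoost, gain-based scores come from get_booster().get_score().
    if gain_scores is None:
        scores = dict(zip(feature_names, model.feature_importances_))
    else:
        scores = {f: gain_scores.get(f, 0.0) for f in feature_names}
    total = sum(scores.values())
    return {f: s / total for f, s in scores.items()}

# tree_importance = normalized_importance(features, model=tree_model)
# xgb_gain = xgb_model.get_booster().get_score(importance_type="gain")
# xgb_importance = normalized_importance(features, gain_scores=xgb_gain)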

Given the uncertainty with window slicing augmentation for time-series data points men-
tioned by Fawaz et al., the models’ performances with augmented training data need to be
carefully evaluated [20]. Therefore, every model will be trained with both the original and
augmented training data to enable a comparison of which training data yields the lowest
MAPE. As mentioned previously, every model will be evaluated with and without feature
selection; the same procedure will be applied with the augmented training dataset. To enable
comparison between the two training datasets, the features selected from the original dataset
will be used.

3.7 Data augmentation


As mentioned in Section 2.3 both Shorten and Khoshgoftaar and Wen et al. claimed that data
augmentation is useful to prevent overfitting on a small dataset [49, 58]. Therefore, it is of
interest to investigate how data augmentation is suited for this project given the size of the
dataset.
A television advertisement campaign commonly runs over several periods. Given that a
period is a time span, a campaign can be seen as a series of time spans. This results in a cam-
paign being comparable with a time series where a period is one time step. This resemblance
suggests window slicing could be an appropriate data augmentation method.
The artificial data points created through data augmentation should include unexplored
input space while maintaining true target values [58]. The window slicing method presented
by Cui et al. slices both the start- and endpoints of the time series [15]. However, this ap-
proach is not possible for this project. Reach is a value that increases during the course of
a campaign; for this reason, it is not possible to extract the reach for a specific period. For
example, if the reach is 20% after the first period and 30% after the second, it is impossible to
determine the second period's reach without considering the first, because the people who saw
the ad during the first period could have seen it during the second period as well. Therefore,
every augmented data point will start in the first period of the original campaign. The idea is
to simulate as if the campaign ends after every period.
To evaluate the performance of the window slicing, all models will be tested both with the
original training data and a training dataset containing augmented data points. The result of
all models will then be compared to evaluate if window slicing is suitable for predicting
television campaign reach.


Figure 3.11: Illustration of how a five-period campaign is augmented into five new data
points.

Figure 3.11 shows an example of how the window slicing method will be applied to each
television advertisement campaign. As visible in the figure, every campaign will be sliced for
every period, always starting from the first period to keep the characteristics of the original
campaign.
To enable the data augmentation each campaign needs to be examined period by period.
However, after a first examination of the campaign data, it was evident that some campaigns
had missing feature values for individual periods. Missing feature values are handled ac-
cording to Table 3.1.

Handling          Feature
Discard           Reach, GRP, CPP30
Replace           Prime time, Position in break
No augmentation   Spot length relation, Start year, Start month, Maximum age,
                  Minimum age, Number of channels
Table 3.1: How missing values are handled for different types of features. Discard refers to
ignoring to augment that data point and not adding it to the training set, replace means to
replace the missing value with the value from the full campaign, and no augmentation means
that the value is not augmented, it remains the same as original campaign.

Reach is the target value and needs to be correct, while GRP and CPP30 are values that
increase as the campaign proceeds. It is impossible to augment a campaign correctly when
either value is missing; such data points are therefore discarded. However, prime time
and position in break are both percentages describing how GRP30 is distributed over the
campaign. Therefore, instead of discarding the data point, it is deemed reasonable to replace a
missing value with the corresponding percentage for the original, full campaign. The features
that are not augmented are the same throughout the whole campaign.
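A hypothetical sketch of this slicing is shown below. The data layout is an assumption made for illustration: each campaign is represented as a dictionary with the static features and a list of per-period records holding the cumulative reach, GRP, and CPP30 up to that period, which is not necessarily how the real data is stored.

def slice_campaign(campaign):
    """campaign["static"] holds the features that are not augmented;
    campaign["periods"] is a list of per-period records with the cumulative
    reach, GRP, and CPP30 (and optionally prime time and position in break)
    up to and including that period."""
    augmented = []
    for end in range(1, len(campaign["periods"]) + 1):
        record = campaign["periods"][end - 1]
        # Discard: without reach, GRP, and CPP30 the point cannot be augmented.
        if any(record.get(key) in (None, 0) for key in ("reach", "grp", "cpp30")):
            continue
        point = dict(campaign["static"])  # features kept from the full campaign
        point.update(
            periods=end,
            reach=record["reach"],
            grp=record["grp"],
            cpp30=record["cpp30"],
            # Replace: fall back to the full-campaign percentages when missing.
            prime_time=record.get("prime_time",
                                  campaign["static"]["prime_time"]),
            position_in_break=record.get("position_in_break",
                                         campaign["static"]["position_in_break"]),
        )
        augmented.append(point)
    return augmented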

4 Results

In this chapter, the results from the feature selection, models, and data augmentation are
presented and individually discussed. The results using the original dataset are presented first,
followed by the results using the augmented data.

4.1 Feature selection


The forward variable selection resulted in two features (spot length relation and periods)
being discarded from the feature set. The remaining features were:

1. The logarithm of GRP
2. Maximum age
3. Start year
4. The logarithm of CPP30
5. Minimum age
6. Prime time
7. CPP30
8. GRP
9. Number of channels
10. Position in break
11. Start month

The order of the feature list is the order in which the features were selected by the forward
variable selection. An explanation of each feature can be found in Section 3.2. However, the
order of the features does not indicate how much the square loss L(x, y) = Σ_i (y_i - f(x_i))²
differs between the different feature sets.
Figure 4.1 shows how the square loss changes when additional features are appended. From
the figure, it is visible that after the feature selection is completed (orange dotted line) the loss
is no longer decreasing. However, the loss already appears relatively constant for the last few
features added before the selection is completed. This may be an indication that the two
features number of channels and position in break should also be discarded.


Figure 4.1: How the loss changes during the feature selection, where the feature that yields
the lowest loss is added. Orange dotted line indicates the last feature to be added during the
feature selection.

The feature selection resulted in the logarithms of GRP and CPP30 both being selected
before their original counterparts. This was expected, given that ridge regression is expected
to perform better when the features have Gaussian-like distributions [42], which is what
motivated the logarithmic transformation originally, see Section 3.2. However, it is surprising
that the feature selection also selected the original features later in the process.

4.2 Models
Table 4.1 shows the mean absolute percentage error for each of the models using the original
dataset. Both the results from the full and selected feature set are included. However, given
that the baseline used neither logarithmic features nor feature selection it does not have a
result for the feature selection.

Model Without feature selection With feature selection


Baseline 10.422 % -
Ridge regression 7.670 % 7.629 %
Decision tree regressor
Pre-pruning 7.914 % 7.864 %
Post-pruning 8.342 % 8.202 %
XGBoost 4.956 % 5.190 %
Neural network 6.449 % 6.414 %
Table 4.1: The mean absolute percentage errors for all models with and without feature selec-
tion. The baseline did not use feature selection.

From Table 4.1 it is evident that all additional models outperform the baseline model by
at least two percentage points. The two best models were XGBoost and neural network,
with XGBoost being the best model outperforming the neural network by more than one
percentage point. The best overall model was XGBoost without any feature selection. In
Appendix B.3, the result of the models using nested cross-validation are presented, see Figure
B.1. XGBoost with all features was the best model for the cross-validation as well.


Baseline
The idea with the baseline model is to create a basic model whose results serve as a bench-
mark for the other, more complex models' results. The baseline model for this project was a
ridge regression model that did not include any feature transformations. As expected, the
baseline model performed the worst of all models, see Table 4.1.

Figure 4.2: The baseline’s reach predictions versus the actual reach. The orange dotted line
expresses a perfect fit for comparability.

Figure 4.2 shows how the baseline model's predicted reach compares with the actual reach.
It is noticeable that the model tends to overestimate the actual reach of campaigns when reach
is in both the lower and higher ranges. The model performs better for reach in the middle
range, where less extreme errors occur, but the variance remains significant.

Ridge regression
The ridge regression model outperformed the baseline which did not use the logarithmic
transform of GRP and CPP30 by almost three percentage points, see Table 4.1. As mentioned
ridge regression benefits from Gaussian-like distributions of the input features and therefore
an improvement over the baseline was expected [42].

(a) All features  (b) Selected features
Figure 4.3: The ridge regression model’s reach predictions versus the actual reach. The orange
dotted line expresses a perfect fit for comparability.

No recognizable differences can be found when examining the ridge regression model's
predicted versus actual plots, see Figure 4.3. Compared with the baseline's result in Figure
4.2, the ridge regression model did not overestimate the reach the way the baseline model did.
The model follows the "perfect fit line" but with a high variance across all ranges.


Table 4.2 shows the hyper-parameters used by the ridge regression models. The hyper-
parameters when used both with and without feature selection are included.

Hyper-parameter All features Selected features


Penalty factor λ 1.866 1.866
Table 4.2: Hyper-parameters used for the ridge regression models.

From Table 4.2, it is evident that the penalty factor λ was the same with and without
feature selection. Table 4.1 also shows that the two models’ performances were similar with
the feature selection having a slight advantage.

Decision tree regressor


Both the pre- and post-pruned decision trees outperformed the baseline, see Table 4.1. The
table also shows that the pre-pruned tree yielded a lower MAPE than the post-pruned tree.
There are disadvantages with the cost-complexity pruning algorithm used to post-prune
the tree that could explain its weaker performance. According to Esposito et al., cost-
complexity pruning can be at a disadvantage because it can only choose its final tree among
the created set of sub-trees [18]. The sub-trees are created by pruning branches from an overly
sized tree. This process can result in the optimal tree not being a part of the set of pruned sub-
trees.

(a) All features  (b) Selected features
Figure 4.4: The pre-pruned decision tree model’s reach predictions versus the actual reach.
The orange dotted line expresses a perfect fit for comparability.


(a) All features  (b) Selected features
Figure 4.5: The post-pruned decision tree model’s reach predictions versus the actual reach.
The orange dotted line expresses a perfect fit for comparability.

The predicted versus actual graphs for the decision tree models show how the decision
trees make their predictions, see Figures 4.4 and 4.5. It is noticeable that the predictions are
ordered in vertical lines. This is because several data points end up in the same leaf node and
are therefore given the same predicted reach.
The graphs for both trees using all features have a data point in the upper left corner, in-
dicating a heavy underestimation, see Figures 4.4a and 4.5a. Interestingly, this underestimation
does not occur when the trees use feature selection, see Figures 4.4b and 4.5b.
Compared with the ridge regression model, see Figure 4.3, the tree models also follow the
"perfect fit line", but with higher variance than the ridge regression models, see Figures 4.4
and 4.5. Both trees performed slightly better with the selected features; however, apart from
the mentioned underestimation, it is not visually detectable that the selected features should
perform better than all features.

Tree Attribute All features Selected features


Pre-pruning Max depth 9 9
Node count 609 665
Post-pruning Max depth 13 10
Node count 185 133
Table 4.3: Tree characteristics for pre- and post-pruned tree using both with and without
feature selection.

From Table 4.3 it is evident that the slight performance increase of the pre-pruned tree
comes at the cost of a much larger tree: it consists of more than three times as many nodes as
the post-pruned tree. Generally, with a larger tree, it is more likely that the tree has been
partly fitted to noise in the training dataset [33]. For this reason, a smaller tree is typically
preferred, as it generalizes better.

XGBoost
The XGBoost models had the best performance of all the models, both with and without
feature selection, see Table 4.1. This result was expected since XGBoost is a state-of-the-art
model often used in machine learning competitions, and related works have had great success
using it. Since XGBoost is an ensemble model based on multiple regressor trees, it combines
the benefits of the regressor trees with a more sophisticated model, which further explains
why it outperformed the single decision trees.


(a) All features  (b) Selected features
Figure 4.6: The XGBoost model’s reach predictions versus the actual reach. The orange dotted
line expresses a perfect fit for comparability.

The predicted versus actual graphs for the XGBoost models show the best fits of all
the models, see Figure 4.6. With both all features and the selected features, the models follow
the "perfect fit line" well with low variance. The model using all features has less variance
than the one using the selected features, with four clear outliers, see Figure 4.6a. The model
using the selected features does not have as many outliers; however, this is most likely because
its variance is higher, so the points corresponding to the outliers of the all-feature model no
longer stand out from the other data points, see Figure 4.6b.

Hyper-parameter All features Selected features


Shrinkage η 0.2233 0.0553
Feature sub-sampling ratio 0.8317 0.7524
Max depth 4 6
Number of trees 184 160
L2 regularization λ 0.0023 0.0021
Table 4.4: Hyper-parameters used for the XGBoost models.

Looking at Table 4.4, it is visible that most of the hyper-parameters were similar. Shrink-
age η is the one parameter that stood out. Chen and Guestrin explain that shrinkage is the
deciding parameter for the degree of impact of newly added trees during training [13]. A high
shrinkage value leads to newly added trees having a larger impact. Table 4.4 shows that the
more complex model, without feature selection, has a higher shrinkage. Meaning that each
tree has a larger impact on the prediction. This, and the fact that the more complex model has
a larger set of trees indicates that the complex model takes more details into consideration.

Neural network
Table 4.1 shows the performance of the neural network models. The neural network models
were the second best model only outperformed by XGBoost.


(a) All features  (b) Selected features
Figure 4.7: The neural network model’s reach predictions versus the actual reach. The orange
dotted line expresses a perfect fit for comparability.

The predictions versus actual graphs show that both the neural network models follow
the "perfect fit line" well, see Figure 4.7. However, compared to the XGBoost models there is
a higher variance with more extreme outliers, see Figure 4.6. There are only small differences
between the graphs for the two neural network models, which could be explained by the fact
that the performance of the two models is similar.

Hyper-parameter All features Selected features


Number of hidden layers 2 2
Width per hidden layer 54 53
Learning rate η 0.0017 0.0020
L2 regularization 0.0050 0.0012
Table 4.5: Architecture and hyper-parameters used for the neural network models.

As shown in Table 4.5, both models are similar in terms of architecture and hyper-
parameters. Table 4.1 shows that the neural network models with and without feature selection
perform similarly. It can therefore be concluded that the features spot length relation and
periods do not contribute to the performance of the neural network models.
According to Simard et al., the performance of neural networks is typically restricted by
the dataset size and quality [50]. For this reason, the performance of the neural networks was
surprisingly good given the size of the dataset.

4.3 Feature importance


In this section, the result of the feature importance is presented, including the features
deemed insignificant and the ranking of the impact of each feature.

Insignificant features
Table 4.6 shows the difference in MAPE between the models using feature selection and the
ones that did not. The results differed depending on whether feature selection was applied:
the decision trees and the neural network improved with feature selection, ridge regression
showed only a marginal difference, whereas the error score increased using feature selection
for XGBoost.


Model Feature selection difference


Ridge regression 0.041 % Better
Decision tree regressor
Pre-pruning 0.050 % Better
Post-pruning 0.140 % Better
XGBoost 0.234 % Worse
Neural network 0.035 % Better
Table 4.6: The amount of mean absolute percentage error points each model differed when
using feature selection.

Table 4.6 shows that the decision tree models consistently performed better using feature
selection. This was surprising given that decision trees include an automatic feature selection
[56]. Therefore the decision trees with all features should have been able to exclude the fea-
tures that do not improve the result by themselves. XGBoost was the only model that did not
improve with feature selection. XGBoost, like decision trees, includes an automatic feature
selection, therefore, it was expected that the model would not improve with feature selection.
However, it was surprising that the XGBoost model did not follow the same pattern as the
decision trees given that the XGBoost architecture consists of several decision trees.
Given that XGBoost worsened when the unselected features were removed, those features
cannot be deemed insignificant at this point. Even though both the neural network and
the decision trees improved without the unselected features, the fact that XGBoost yielded
the best performance and had the largest difference between the full feature set and the
selected subset outweighs the results of the other models.

Feature ranking
In this section, the features are ranked based on the order they are selected by the feature
selection together with the feature weights given by the decision trees and XGBoost models.


Figure 4.8: The importance of each feature from the pre- and post-pruned decision tree, and
the XGBoost model. The importance have been normalized to enable comparisons.

Figure 4.8 shows that GRP and the logarithm of GRP were the most important features. As
mentioned one of the benefits of decision trees and XGBoost is that the models do not make
any distributional assumptions and no feature transformation should be required for these

33
4.3. Feature importance

models [54]. Therefore, an implementation of the models without feature transformations


was attempted but yielded a worse prediction accuracy overall.
Given that the aim is to identify the most important feature, and not to rank feature trans-
formations, it is more reasonable to sum the importance of the logarithmic and original fea-
tures. Figure 4.9 shows the corresponding graph.

(a) Feature importance with summed transformations  (b) Zoomed feature importances

Figure 4.9: The feature importance from pre- and post-pruned decision tree, and the XGBoost
model. The importance of logarithmic features have been added to the importance of the
original feature. The importance have been normalized to enable comparisons.

From Figure 4.9a it is evident that the feature importance was similar for all three mod-
els. GRP was the most important feature by a large margin for every model. However, the
zoomed-in graph in Figure 4.9b indicates some deviations between the feature importance of
the models. The feature importance for the pre- and post-pruned tree models was still similar,
but XGBoost deviated.
For the two tree models, excluding GRP, the most important features were minimum and
maximum age. The pre-pruned tree slightly preferred maximum age while the post-pruned
tree preferred minimum age. Thereafter, the pre- and post-pruned trees valued the number of
channels as the next most important feature, followed by position in break, periods, and start
year, which were similarly important for the two tree models. Prime time and CPP30 did not
prove to be as important, and start month had the least impact on the tree models; it was
nearly insignificant for the post-pruned tree model.
The XGBoost model also valued minimum and maximum age second most after GRP.
However, after that, there were significant differences compared to the decision trees. Periods
and start year were favored after the minimum and maximum age, followed by number of
channels. After number of channels, a set of equally important features consisting of CPP30,
start month, and position in break followed. The least important features were prime time
and spot length relation. A similar comparison of the features' importance with the models
evaluated using nested cross-validation can be found in Appendix B.3.
Table 4.7 shows how each model ranks the features by their importance. The feature selec-
tion ranking is based on the order in which the features were selected.


Feature   Forward variable selection   Decision tree pre-pruning   Decision tree post-pruning   XGBoost
GRP 1 1 1 1
CPP30 4 9 9 7
Prime time 6 10 10 10
Position in break 8 5 5 8
Spot length relation - 8 8 11
Start year 3 7 6 5
Start month 9 11 11 9
Periods - 6 7 4
Maximum age 2 2 3 2
Minimum age 5 3 2 3
Number of channels 7 4 4 6
Table 4.7: Rankings of the most important features for each model with the most important
feature is set to 1. The rankings for GRP and CPP30 is based on the sum of the importance of
the logarithmic and original value. Given that the feature selection algorithm did not provide
a numerical importance and thus can not be summed, the ranking of GRP and CPP30 is set
to the position of the first version that appeared in the list of features. Features which are
deemed insignificant by the forward variable selection are excluded.

Given that some of the features have similar importance, the features will not be ranked
individually. Instead, the features are grouped into tiers (5-0) based on their importance,
where a higher tier indicates a more important feature. Table 4.8 shows the tiers with the
corresponding features.

Tier 5: GRP
Tier 4: Minimum age, Maximum age
Tier 3: Number of channels, Periods, Start year
Tier 2: Position in break, CPP30
Tier 1: Prime time, Start month
Tier 0: Spot length relation

Table 4.8: Features grouped in tiers in accordance of their importance. Higher tiered features
are considered more important.

After comparison of the feature importance in Figure 4.9 and the feature rankings in Table
4.7, it is obvious that GRP was by far the most important feature for predicting reach. In
Figure 4.9a GRP was more than five times more important than any other feature and in
Table 4.7 it was ranked as the most important feature for every model. Therefore, GRP is in
its own tier with the highest rank, tier 5.
Table 4.7 shows maximum age and minimum age being ranked either second or third for all
models except forward variable selection, which ranks minimum age fifth. From Figure
4.9b, it is clear that the models value maximum and minimum age significantly more than
the remaining features. Therefore, maximum age and minimum age are placed in their own
tier, tier 4.


The next tier, tier 3, consists of number of channels, periods, and start year. Figure 4.9b
indicates that the importance of periods and start year is a bit higher for XGBoost compared
to number of channels. However, given that the number of channels is ranked fourth by both
trees it is still deemed in tier 3. Forward variable selection did not select periods and four out
of five models improved the accuracy without it. However, given that XGBoost ranks it high
and yields the best performance it is placed in tier 3.
Position in break was ranked fifth for both trees, see Table 4.7. However, it was ranked
eighth by forward variable selection and XGBoost. Therefore, it is placed in a lower tier
than tier 3. Given that Figure 4.9b indicates that XGBoost ranks CPP30 and position in break
similarly, and a bit higher than the remaining features, these two features constitute tier 2.
Prime time and start month are placed in tier 1. They are not deemed irrelevant but seen
as the least impactful features for reach. The importance from XGBoost is similar for both
features, see Figure 4.9b. One note is that the decision trees indicate that start month had
almost no impact on reach, but it was considerably more important for XGBoost and is
therefore not placed in a tier by itself.
The last tier, tier 0, contains spot length relation. It is placed in its own tier because the
feature is deemed irrelevant for predicting reach. The motivation for this is that the feature
was removed by feature selection, four out of five models increased the accuracy without it,
and the model that did not was XGBoost which ranked the feature last.
The results suggest that minimum age and maximum age are both important features,
see Table 4.8. Looking at the MMS annual report from 2021, television reach differs significantly
between age groups [37]. This, and the fact that a television campaign usually aims to reach
a particular age group, see Section 2.1, indicates that the target age group should have an
impact on reach. Therefore, it is reasonable that maximum and minimum age were ranked in
the second-highest tier.
The MMS annual report also shows how television viewing time differs between years and
months [37]. According to their measurements, television habits vary over the months; for in-
stance, television is watched less during the summer than in the winter. When comparing over
years, however, the viewing time is similar for at least the last three years (2019 to 2021). It is
therefore surprising that start year was ranked in a higher tier than start month, see Table
4.8. Given that start year should not have any significant impact according to the MMS
annual report, further experiments were conducted, see Appendix B. To measure the impact
of the start year feature, it was manually removed before training and evaluating each of the
models. Without start year, every model except the post-pruned decision tree with feature
selection performed worse. This indicates that start year indeed has an impact on predicting
reach for this particular dataset.

4.4 Data augmentation


After the data augmentation, the training dataset included 4387 data points, a 184% in-
crease from the original training dataset. This increase was as expected: the idea was to create
a new data point for every period of every campaign, and given the distribution of periods in
Figure 3.8, most campaigns were between one and four periods long. Therefore, a training
dataset about three times as large seems reasonable.
Table 4.9 presents the mean absolute percentage error for every model with and without
feature selection. The hyper-parameters for the models are presented in Appendix A.2.


Model Without feature selection With feature selection


Baseline 17.791 % -
Ridge regression 38.514 % 14.322 %
Decision tree regressor
Pre-pruning 30.510 % 11.853 %
Post-pruning 25.110 % 12.367 %
XGBoost 25.696 % 9.538 %
Neural network 24.323 % 12.298 %
Table 4.9: Mean absolute percentage error for all models with and without feature selection
using augmented training data. The baseline did not use feature selection.

Comparing the original results in Table 4.1 with the results using augmented training
data in Table 4.9, it is evident that the augmentation significantly worsened the performance
of every model. Therefore, it can be concluded that period slicing of campaigns is not suitable
for this domain. It is not possible to simulate that a campaign ends after each period while
still maintaining the characteristics needed for predicting the campaign’s reach.
One reason for the decreased accuracy using augmented data could be the fact that the
window slicing method does not consider any distributions. When the new data points are
created some of the feature values are set by how the campaign is proceeding. This results
in a risk that those feature values do not follow the distribution of the feature values for
completed campaigns. To demonstrate this possible risk a distribution plot of the feature
CPP30 for both the original training data set and the augmented one is shown in Figure 4.10.
This would lead to the models being trained on an augmented distribution which does not
follow the distribution the models are being evaluated on.

(a) Histogram CPP30 original  (b) Histogram CPP30 augmented
Figure 4.10: CPP30 distribution for original training dataset to the left and the augmented
training datasets to the right.

The issue regarding different distributions for feature values is most prominent on the
period feature, see Figure 4.11. Given that the augmented data points are created by simu-
lating a campaign ending at every period, the augmentation creates more short campaigns.
For instance, every campaign no matter the length, will create a data point with one period
and so on, resulting in a completely new distribution for the period feature as seen in Figure
4.11b.


(a) Histogram periods original  (b) Histogram periods augmented
Figure 4.11: Distribution of periods for original training dataset to the left and the augmented
training datasets to the right.

As noticeable in Table 4.9, all models improved drastically with feature selection. One of the
features that were removed by feature selection is periods, which has a completely different
distribution after augmentation. The fact that the models performed better without periods
suggests that the issue with the window slicing method is distributional. The improvement
with feature selection, and the fact that the features used were selected on the original training
data, raised the question of whether the models would improve using a feature selection per-
formed on the augmented training data. Therefore, an extended feature selection with the
augmented training dataset was conducted. However, the results from that experiment showed
that the models' performance worsened in comparison with the originally selected features.
See Appendix B.2 for the results of the augmented feature selection.

5 Discussion

In this chapter, the result and method of the project are discussed. Lastly, the project’s social
and ethical impacts are discussed, giving a view of the work in a wider context.

5.1 Results
In this section high-level results are discussed. Individual results are combined and analyzed
to gain further insights. Unexpected results and how the results relate to the background
chapter are also commented on.

Features
Table 4.6 indicates that four of the five models increased in accuracy using feature selection.
This could lead to the belief that the unselected features (spot length relation and periods) are
insignificant for predicting reach. However, the fact that the XGBoost model, which yielded
the best accuracy, worsened using feature selection suggests that at least one of the two
features is useful.
Table 4.7 shows that XGBoost ranked the unselected feature periods fourth of the eleven
features, and Figure 4.9b indicates that periods were valued significantly more than the lower-
ranked features. This leads to the conclusion that periods do impact television reach.
In Appendix B.3, the feature importance was evaluated using nested cross-validation.
When comparing Figure 4.9 with the Figures B.2, B.3 it is evident that the feature impor-
tance is similar when using the hold-out method and nested cross-validation. For instance,
GRP is the most important feature for every model by far and the rankings of the features for
the models are the same using both the hold-out method and nested cross-validation. Periods
were ranked the second most important feature for XGBoost when using cross-validation and
fourth using hold-out. However, Figure B.3 show that periods have a very large confidence
interval. This demonstrates the difficulty in ranking features using the hold-out method be-
cause feature importance can vary heavily depending on the training and test dataset.
The fact that the results using nested cross-validation were similar to the results using the
hold-out method strengthens the credibility of the results regarding feature importance. Since
similar importance is given despite what part of the dataset is held out during the training of
the models.


The feature selection was successful since it removed spot length relation which is deemed
more or less insignificant, see Table 4.8. However, the forward variable selection method also
removed periods which is concluded to have an impact on television reach. This could be
because the relationship between periods and reach is too complex for the linear model used
in feature selection to detect.

Models
One of the reasons for implementing both XGBoost and neural networks was that they showed promising results in the related works [36, 41, 45]. The results from this study follow the same pattern, in that these two models yielded the best results, see Table 4.1. Sereday and Cui also compared the performance of an XGBoost model and a neural network; in their research the XGBoost model likewise performed best [45].
It was surprising that the ridge regression model performed better than both tree models, see Table 4.1. As mentioned in Section 2.3, ridge regression is most suitable when the features follow a normal distribution, which the used dataset did not, see Section 3.2. The decision trees make no distributional assumptions and should be suitable when working with a small dataset such as the one in this project. These facts considered, the ridge regression model still outperformed the decision trees.
The models were also evaluated using nested cross-validation, see Appendix B.3. The ordering of the models' performances was the same as with the hold-out method. Figure B.1 shows bar plots of the models' performances. When comparing the results in the graph with the results using the hold-out method in Table 4.1, it is notable that all models perform worse using cross-validation. Some of the hold-out results even fall outside the corresponding confidence intervals. The fact that the XGBoost model performs best under both evaluation methods suggests that it is a suitable model for this task. However, a 4.956% MAPE should not be expected, since it is outside the confidence interval presented in Figure B.1.

Evaluation
No related works regarding predicting television campaign reach were found during this project. Therefore, there are no prior results to compare the performance of the models from this study with. As a consequence, it cannot be determined whether, or by how much, the accuracy can be improved further.
Through the course of the project, it has become apparent that different feature sets can both improve and worsen the performance of the models greatly. This leads to the question of whether there exist available features, not utilized in this project, that could improve the results. An extensive feature search was not within the scope of this project. According to Guyon and Elisseeff, using domain knowledge when creating the feature set is preferred [24]. Therefore, it would be interesting for future work to explore the possibility of altering the feature set of this project with the help of domain experts. For example, the MMS report indicates a slight difference in viewing habits between males and females [37]. Therefore, one interesting new feature to explore could be the target audience gender (male, female, or both).
Throughout the project, the XGBoost model has consistently outperformed all other models. In Table 4.1, XGBoost was twice as accurate as the baseline and outperformed the neural network, the second-best model, by almost 1.5 percentage points. This, together with the fact that the related works had great success using the XGBoost model, leads to the conclusion that XGBoost is a well-suited model for this problem [45]. Therefore, this project recommends using XGBoost for future work regarding predicting television campaign reach.


5.2 Method
In this section, the method of the project is discussed from a critical standpoint. How assumptions and limitations of the project could have affected the results is highlighted, and possible alternative methods are acknowledged. The method discussion is based on the concepts of replicability, reliability, and validity.

Dataset
As mentioned in Section 1.5, the dataset used for this project consisted of television campaigns from Denmark, Norway, and Sweden. The choice of countries was based on the assumption that the three countries have similar television habits and markets. To avoid this assumption, another approach would be to predict reach for individual countries. That approach was not chosen because of the dataset size: the dataset is already small with the three countries, and using only a third of it did not seem reasonable.
There is a trade-off between the need to make strong marketing and habit assumptions
when looking at multiple countries and the dataset size. For instance, it would be possible to
use the largest dataset with all countries. However, the assumption that the television market
is similar in all countries is incorrect. For this project, the aim was to achieve the best balance
between dataset size and the need for habitual assumptions.
One concern with small datasets is the risk of overfitting [59]. Since this project works with a small dataset, this risk needs to be accounted for. Throughout the project, the hold-out method is used to measure how well the models perform on unseen data. Otherwise, the results would only measure how well the models mimic the training data points [30].
Beyond overfitting, the size of the dataset also creates concerns regarding the reliability of the results in this project. If the project were reproduced with a different, larger dataset, the results would likely vary. One indication of this is the distribution of the start years of campaigns in Figure 3.7a, which is far from even. However, as mentioned, the MMS annual report shows that viewing habits have not changed within those years, so the assumption can be made that the number of television marketing campaigns should not have changed drastically [37]. The distribution of start years is specific to this particular dataset. From the extended experiments in Appendix B, it was concluded that start year has an impact on reach. However, another dataset would likely have a different start-year distribution and could therefore yield a conflicting importance for start year.
The project's delimitations, outlined in Section 1.5, cause validity concerns. The research questions this project is trying to answer concern how well machine learning models can predict television advertisement reach in general. However, given that the historical data used in this project only includes data points from three countries, with target audiences defined by gender and age group (more narrowly defined target audiences were removed), this has not been fully answered. What the results answer is how well machine learning models predict reach for the mentioned countries and for larger target audiences.
Bína et al. concluded in their article that the television channel was by far the most impactful feature for reach [6]. Although their linear model only captured half of the variability, it can still be determined that the television channel is important for reach. Since this study investigates campaigns from several countries, it is not feasible to add the channels used for a campaign as a feature. However, given Bína et al.'s result, the models would likely improve if a feature capturing television channels could be added [6]. For example, a new feature indicating what share of the GRP will be delivered through popular channels, similar to the prime time and position in break features, could be beneficial.


Pre-processing
Before training the machine learning models, a series of pre-processing steps was applied to the dataset. Two of the pre-processing steps trimmed the dataset by removing extreme outliers and incomplete data points, see Section 3.2. This raises the question of how well the models would perform on the data points that could be considered harder to predict, since these were removed from the dataset.
GMP Systems collects its data through media agencies that report campaign data using the GMP platform. This opens up the possibility of human errors and incorrect data points. Given that the local outlier factor is used to remove data points that do not conform to the overall pattern, such incorrect data points can be removed [51]. With this practice, human error should be accounted for and the results of the models remain valid.
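As a rough illustration of this outlier-removal step, the following sketch applies scikit-learn's local outlier factor to a numerical feature matrix; the variable names and the contamination level are assumptions, not the values used in the project.

from sklearn.neighbors import LocalOutlierFactor

# X is assumed to be a numerical feature matrix (e.g. a pandas DataFrame)
# and y the corresponding reach values.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)      # +1 for inliers, -1 for outliers
X_clean = X[labels == 1]
y_clean = y[labels == 1]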
Both decision trees and XGBoost models support missing values [54, 13]. However, given that ridge regression and neural networks do not, the incomplete data points were excluded to enable comparisons between the models. The models with native support could improve when including incomplete data points, and it would therefore be interesting to explore this possibility in future work. However, it is worth noting that missing values, like outliers, could be caused by human errors and should in that case be discarded.
The features were normalized using either z-score or min-max normalization. Which method was used depended on the distribution of that particular feature, see Section 3.2. However, the distributions were not identified analytically; instead, they were identified visually, meaning that every distribution that resembled a bell curve was normalized using z-score. Though this approach is convenient, a more analytical approach to identifying the distributions could indicate that other features benefit from a different normalization method, which in the end might impact the results [26].
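One such analytical alternative, sketched below under the assumption that the features are gathered in a pandas data frame, would be to let a normality test decide per feature whether z-score or min-max normalization is applied; this is an illustrative heuristic rather than the procedure used in this project.

import pandas as pd
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def normalize(df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    # Use z-score when a normality test does not reject normality,
    # otherwise fall back to min-max scaling.
    out = df.copy()
    for col in df.columns:
        _, p_value = stats.normaltest(df[col])
        scaler = StandardScaler() if p_value > alpha else MinMaxScaler()
        out[col] = scaler.fit_transform(df[[col]]).ravel()
    return out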

Feature selection
Forward variable selection was the method used for feature selection due to its resistance to overfitting given the small dataset, see Section 3.4. However, given that periods was one of the most impactful features for XGBoost and that it was not selected by the feature selection, see Figures 4.1 and 4.9b, there is reason to believe that the ridge regression model used in the forward variable selection was too simple to identify complex relations.
This leads to the question of whether better results could be achieved with a more advanced feature selection method. A more complex method might select a different set of features that yields higher accuracy than the current XGBoost model using all features. However, a more advanced method could also be more prone to overfitting, removing the main motivation for using forward variable selection.
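For reference, the following sketch shows how greedy forward selection scored with a ridge regression model can be expressed with scikit-learn's SequentialFeatureSelector; the parameter values and the variable names X_train and y_train are placeholders rather than the settings used in this project.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge

selector = SequentialFeatureSelector(
    Ridge(alpha=1.0),
    direction="forward",            # add one feature at a time
    n_features_to_select="auto",    # stop when the score gain falls below tol
    tol=1e-4,
    cv=5,
    scoring="neg_mean_absolute_percentage_error",
)
selector.fit(X_train, y_train)
selected_features = X_train.columns[selector.get_support()]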
Worth noting is that the XGBoost model includes an internal feature selection, which could indicate that external feature selection will not result in better accuracy than using all features for XGBoost. However, the decision tree also has an internal feature selection but still achieved greater accuracy with the simple feature selection method. Therefore, it is still worth investigating the possibility of improving the feature selection in future work.

Feature ranking
An apparent issue with feature ranking is how to weigh the rankings from the different models. When dividing the features into tiers, both the ranking and the importance were weighed against how well the model performed. However, this was not done in a standardized way; instead, it was based on reasoning. This raises the question of how well the tiers capture the importance of the different features. Whether it would be more appropriate to base the tiers on the best-performing model is left unanswered.


Data augmentation
The performance of all models worsened using window slicing, introduced by Cui et al., as a data augmentation method [15]. However, in the authors' report, the result of their data augmentation method was not compared to other methods or to using no augmentation at all. This leaves open the question of whether window slicing actually improved their results. Although the chosen data augmentation method was unsuccessful for this project, it is still of interest to explore the possibility of augmenting data given the limited dataset.
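As an illustration of the general idea behind window slicing, the sketch below extracts overlapping sub-sequences from a series so that each slice can be used as an additional training example; it is a simplified view of the technique described by Cui et al. [15], not the exact augmentation pipeline used in this project.

import numpy as np

def window_slices(series, window, step=1):
    # Yield contiguous sub-sequences of length `window` from `series`.
    for start in range(0, len(series) - window + 1, step):
        yield series[start:start + window]

# Example: a ten-period series sliced into six-period windows.
campaign = np.arange(10)
augmented = list(window_slices(campaign, window=6, step=2))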

Models
When the models were optimized, the optimization was done with some limitations: not every possible set of hyper-parameters was tuned for all of the models. This was because many parameters were not deemed to have a considerable impact. However, this assumption could be incorrect, in which case there is a risk that a global minimum has not been reached.
The bounds for the hyper-parameters during the random search were moved if the selected value was close to a bound, to ensure that no better value existed, as recommended by Bengio [3]. Even though this practice is recommended, it could lead to the best hyper-parameter value found being a local minimum. The reason is that when the bound is only moved slightly, the search does not check whether a better value exists after a larger move. There is a chance that this occurred in this project, since the hyper-parameter bounds were set through trial and error.
During hyper-parameter tuning, the random search was run for 100 iterations. There is a trade-off between the number of iterations and training time: the more iterations, the longer the training will take, but more iterations also increase the chance of finding better hyper-parameter values. Due to hardware limitations, the search ran for 100 iterations. It might therefore be possible to achieve greater accuracy by running the search for more iterations. However, as mentioned, more iterations only give a chance of increased accuracy, not a guarantee. The number of iterations may also create a reliability concern, given that if the study were replicated, other hyper-parameter values may be found which yield different results.
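To make the tuning setup concrete, the following sketch runs a 100-iteration random search over an XGBoost regressor with five-fold cross-validation; the search space bounds and the variable names X_train and y_train are illustrative assumptions, not the bounds used in the project.

from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

param_distributions = {
    "learning_rate": loguniform(1e-3, 0.3),     # shrinkage eta
    "max_depth": randint(2, 8),
    "n_estimators": randint(50, 400),
    "colsample_bytree": uniform(0.3, 0.7),      # feature sub-sampling ratio in [0.3, 1.0]
    "reg_lambda": loguniform(1e-5, 10),         # L2 regularization
}
search = RandomizedSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_distributions,
    n_iter=100,
    cv=5,
    scoring="neg_mean_absolute_percentage_error",
    random_state=0,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_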

Model selection
As mentioned in Section 3.6, the hold-out method was applied to enable comparisons between the model performances. That way, all models were compared on data that had not been seen before, to prevent choosing an overfitted model. The models' hyper-parameters were optimized through random search using five-fold cross-validation. According to Kohavi, the hold-out method makes wasteful use of the data, since much of the data is not used to train the model [31]. Varma and Simon claim that the hold-out method is not suitable when working with small datasets [57]. They present an alternative for model selection and evaluation through nested cross-validation. Nested cross-validation reduces the bias substantially and gives an error estimation that resembles that of an independent testing set [57]. Nested cross-validation works by using cross-validation for the models' hyper-parameter optimization inside another cross-validation where the final models are evaluated.
This leads to every data point being used both as a training point and as a test point. Every model will then have several results, making it possible to calculate a model's average error and standard deviation, and thus to extract each model's error interval. The same procedure could be used to obtain intervals for the feature importances. This practice would not necessarily yield improved results, but the obtained results would be more robust.
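A minimal sketch of such a nested cross-validation setup is given below, with an illustrative search space and the hypothetical variables X and y holding the features and the reach values; fold counts and parameter ranges are examples only.

from scipy.stats import loguniform, randint
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score
from xgboost import XGBRegressor

param_distributions = {               # illustrative search space
    "learning_rate": loguniform(1e-3, 0.3),
    "max_depth": randint(2, 8),
    "n_estimators": randint(50, 400),
}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# The inner loop tunes hyper-parameters; the outer loop estimates the error.
tuned_model = RandomizedSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_distributions,
    n_iter=100,
    cv=inner_cv,
    scoring="neg_mean_absolute_percentage_error",
    random_state=0,
)
outer_scores = cross_val_score(
    tuned_model, X, y, cv=outer_cv,
    scoring="neg_mean_absolute_percentage_error",
)
mape_per_fold = -outer_scores         # one MAPE estimate per outer fold
print(mape_per_fold.mean(), mape_per_fold.std(ddof=1))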
The decision to use the hold-out method instead of nested cross-validation has resulted in some validity concerns for the report. Therefore, in the final stage of the thesis time frame, an evaluation of the models' performance and the feature importance using nested cross-validation was conducted; the results can be found in Appendix B.3.

Since this evaluation was performed at a later stage of the project, there was no time to perform an equally extensive evaluation of the results as when the hold-out method was applied.

Source criticism
The literature used for this project mainly consisted of peer-reviewed articles from scientific journals and conferences. Beyond scientific articles, books were used to explain general machine learning practices and the media marketing industry. The literature was collected using the ACM Digital Library, Google Scholar, and the IEEE Xplore digital library. During the search for articles, the number of citations and the reputation of the journals were key factors in determining the reliability of an article. However, publicly available literature regarding the media industry is scarce, and therefore the number of citations and the reputation of the journals were not of the same quality as for the literature explaining machine learning concepts. To get recent reports on TV viewership, an annual report from MMS, a leading media measurement bureau, was considered appropriate. With these practices, the literature selected for this project is deemed to be reliable and of good quality for this type of project.

5.3 The work in a wider context


This project attempts to improve predictions of reach for television advertisement campaigns. With better predictions, it is possible to plan television marketing campaigns that are better at reaching the target audience, resulting in more effective television marketing. If television marketing campaigns become better at reaching their target audience, they become more similar to targeted advertising, which is otherwise more common online [19].
According to Smith and Cooper-Martin, there are both positive and negative aspects to targeted advertising [53]. They mention that targeted advertising is good at serving customer needs; however, it is criticized for targeting vulnerable customers with harmful products. The authors mention an example in which a tobacco company designed a campaign targeting African-American people. The campaign became heavily criticized and was canceled. As television advertisement campaigns become better at reaching their target audiences, vulnerable target groups become more at risk of being targeted with harmful products or services.
As mentioned, this project recommends that domain experts expand the feature set for the models as future work. This might create ethical concerns depending on what features are added. If features that specify the target group are added, the model will be able to tell how important those features are for predicting reach. This opens up the possibility of finding target groups whose reach is easier or harder to predict, which could potentially lead to certain target groups being targeted more and others less. Advertising has long been criticized for possibly leading to unnecessary consumption [52]. This could potentially yield overconsumption for the more easily predicted target groups and less consumption for the target groups whose reach is harder to predict.

6 Conclusion

The purpose of this project was to evaluate how well machine learning models can predict television advertisement reach based on historical television marketing campaigns. This is of interest to advertisers, who are constantly looking for possibilities to improve their advertising. Better reach predictions for marketing campaigns will enable advertisers to better evaluate media purchases.
To concretize the purpose of the project, a set of research questions was produced. The research questions concerned which model is most suitable, how well a model can perform, and which features are most important for reach predictions, see Section 1.3. The project evaluated four models to identify the most suitable model and how well they performed. To further benefit advertisers, this project also answers which features have the largest impact on television advertisement reach. This knowledge gives advertisers insights into which features to emphasize during the planning of a television advertisement campaign.
The best-performing model proved to be an XGBoost model with a mean absolute percentage error of 4.956%. Given that no previous research predicting television advertisement reach for campaigns could be found, this result could not be compared with prior work. Since the XGBoost result was outside of the confidence interval from cross-validation, it should not serve as a comparison point for future studies on reach predictions. However, what can be concluded is that the XGBoost model is well suited for the stated problem and is recommended for future research. As a point of comparison, Figure B.1 indicates an average MAPE of around 6% for the XGBoost model.
Regarding the most important features, different models valued the importance of the features differently. For this reason, the features were ranked in tiers through analysis and discussion, see Table 4.8. However, GRP was consistently by far the most important feature for every model and can therefore be concluded to have a large impact on television reach. Following GRP, most models ranked the minimum and maximum age as the second or third most important features. Thereafter, the results differ. What can be concluded is that the feature spot length relation has a negligible impact on reach. Given that the results differ heavily in the middle tiers, it is of interest to continue to investigate feature importance for television reach, both by improving this method and by adding new possible features.
Given that all models worsened using data augmentation, see Table 4.9, it can be concluded that the performance of the machine learning models does not improve when the training dataset size is increased by adding mid-campaign results as full campaigns.


However, it cannot be determined that data augmentation always decreases the performance of machine learning models for this problem. There could be another method that is suitable, preferably one that does not disrupt the feature distribution.

6.1 Future work


In this section, the future work for the study is presented. It includes suggestions on interest-
ing concepts for expanding the study as well as improvements for the conducted study.

Domain experts in feature selection


During the project, it was concluded that slight changes in the feature set had a major impact on the prediction results. Adding more suitable features improved the accuracy of the models greatly. Guyon and Elisseeff mention that the best method for selecting features is through domain knowledge [24]. Interesting future research would therefore consist of domain experts selecting the feature set for the models.

Incomplete data points


As mentioned, both decision trees and the XGBoost model support missing values. It would therefore be of interest to explore how incomplete data points affect the performance of these models. Using the incomplete data points would extend the dataset, which could improve the results. This would also be beneficial for the industry when using the models in a campaign planning phase, given that not every feature needs to be set to receive a prediction. A sketch of such an experiment is given below.
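The sketch illustrates the idea with a toy feature matrix containing NaN entries; the numbers are made up and only serve to show that XGBoost can be fitted without a separate imputation step.

import numpy as np
from xgboost import XGBRegressor

# Toy feature matrix with missing entries (np.nan); XGBoost handles the
# missing values natively, so no imputation is required before fitting.
X_incomplete = np.array([
    [120.0, 25.0, np.nan],
    [300.0, np.nan, 4.0],
    [80.0, 35.0, 2.0],
])
y = np.array([0.31, 0.55, 0.22])

model = XGBRegressor(n_estimators=50, max_depth=3)
model.fit(X_incomplete, y)
print(model.predict(X_incomplete[:1]))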

Feature selection method


The feature selection method used in this project failed to find the relation between the fea-
ture periods and television advertisement reach, while the XGBoost model ranked periods
as the fourth most important feature. This is an indication that forward variable selection
together with a ridge regression model is not the best feature selection method for the stated
problem. Future work could investigate the use of more advanced feature selection methods
to improve the prediction result further by finding a more optimal feature set. Worth noting
is that the dataset used in this project was small, and a more advanced method might be more
prone to overfitting than forward variable selection.

Altering the study


In this study, outliers were detected through the local outlier factor, and every detected outlier was removed. The motivation for this was that the dataset contained extreme outliers and that human error could have created incorrect data points. However, this was not proved in the study, only assumed. Therefore, if the study were to be replicated, the detected outliers should be further analyzed to determine whether they are invalid data points.
The hold-out method was the main method used to evaluate the results of this project. The motivation for this was to compare the models on unseen data to prevent overfitting. However, Kohavi claims that the hold-out method is wasteful, and Varma and Simon explain that it is not suitable for small datasets [31, 57]. Instead, they suggest that the models should be evaluated using nested cross-validation. That way, it would be possible to extract result intervals and get more robust results. Nested cross-validation was only performed at the final stage of the project timeline, see Appendix B.3, and the hold-out method therefore remained the main evaluation method. If the study were conducted again, the evaluation of the results and the following conclusions would be based solely on nested cross-validation rather than the hold-out method.

Bibliography

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S Corrado, A. Davis,
J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia,
R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D.
Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V.
Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke,
Y. Yu, and X. Zheng. “TensorFlow: Large-Scale Machine Learning on Heterogeneous
Systems”. In: arXiv preprint arXiv:1603.04467 (2015).
[2] M. Abe. “A Household-Level Television Advertising Exposure Model”. In: Journal of
Marketing Research 34.3 (Aug. 1997), pp. 394–405.
[3] Y. Bengio. Practical Recommendations for Gradient-Based Training of Deep Architectures.
2nd ed. Berlin, Heidelberg, Germany: Springer Berlin Heidelberg, 2012, pp. 437–478.
[4] K. Benoit. “Linear regression models with logarithmic transformations”. In: London
School of Economics, London 22.1 (2011), pp. 23–36.
[5] J. Bergstra and Y. Bengio. “Random search for hyper-parameter optimization.” In: Jour-
nal of machine learning research 13.2 (2012), pp. 281–305.
[6] V. Bína, D. Gunina, and T. Kincl. “TV Advertising Reach: Model for Effective Schedul-
ing”. In: Advances in Advertising Research X: Multiple Touchpoints in Brand Communication.
Wiesbaden: Springer Fachmedien Wiesbaden, 2019, pp. 215–228.
[7] L. Breiman. “Random Forests”. In: Machine Learning 45.1 (2001), pp. 5–32.
[8] M.M Breunig, H. Kriegel, R.T Ng, and J. Sander. “LOF: Identifying Density-Based Local
Outliers”. In: Proceedings of the ACM SIGMOD International Conference on Management of
Data. May 2000, pp. 93–104.
[9] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P.
Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and
G. Varoquaux. “API design for machine learning software: experiences from the scikit-
learn project”. In: Proceedings of the ECML PKDD Workshop: Languages for Data Mining
and Machine Learning. Sept. 2013, pp. 108–122.
[10] J.W. Bulcock and W.F. Lee. “Normalization Ridge Regression in Practice”. In: Socio-
logical Methods & Research 11.3 (1983), pp. 259–303.
[11] V. Chandola, A. Banerjee, and V. Kumar. “Anomaly Detection: A Survey”. In: ACM
Computing Surveys 41.3 (2009), pp. 1–58.


[12] G. Chandrashekar and F. Sahin. “A survey on feature selection methods”. In: Computers
& Electrical Engineering 40.1 (Jan. 2014), pp. 16–28.
[13] T. Chen and C. Guestrin. “XGBoost: A Scalable Tree Boosting System”. In: Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Aug. 2016,
pp. 785–794.
[14] F. Chollet et al. Keras. https://keras.io. 2015.
[15] Z. Cui, W. Chen, and Y. Chen. “Multi-scale convolutional neural networks for time
series classification”. In: arXiv preprint arXiv:1603.06995 (2016).
[16] P.J Danaher, T.S Dagger, and M.S Smith. “Forecasting television ratings”. In: Interna-
tional Journal of Forecasting 27.4 (Oct. 2011), pp. 1215–1240.
[17] G. Doyle. “Digitization and Changing Windowing Strategies in the Television Indus-
try: Negotiating New Windows on the World”. In: Television & New Media 17.7 (2016),
pp. 629–645.
[18] F. Esposito, D. Malerba, G. Semeraro, and J. Kay. “A comparative analysis of methods
for pruning decision trees”. In: IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 19.5 (1997), pp. 476–491.
[19] A. Farahat and M.C. Bailey. “How Effective is Targeted Advertising?” In: Proceedings of
the 21st International Conference on World Wide Web. Apr. 2012, pp. 111–120.
[20] H.I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P.A. Muller. “Data augmenta-
tion using synthetic data for time series classification with deep residual networks”. In:
arXiv preprint arXiv:1808.02455 (2018).
[21] J.H Friedman. “Greedy Function Approximation: A Gradient Boosting Machine”. In:
The Annals of Statistics 29.5 (2001), pp. 1189–1232.
[22] A.S Galathiya, A.P Ganatra, and C.K Bhensdadia. “Improved decision tree induction
algorithm with feature selection, cross validation, model complexity and reduced error
pruning”. In: International Journal of Computer Science and Information Technologies 3.2
(2012), pp. 3427–3431.
[23] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016, pp. 164–223,
276, 290–306, 416–422.
[24] I. Guyon and A. Elisseeff. “An Introduction of Variable and Feature Selection”. In: J.
Machine Learning Research Special Issue on Variable and Feature Selection 3 (2003), pp. 1157–
1182.
[25] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. 12th ed.
New York, NY, USA: Springer New York Inc., 2001, pp. 44–56, 61–78.
[26] A. Jain, K. Nandakumar, and A. Ross. “Score normalization in multimodal biometric
systems”. In: Pattern Recognition 38.12 (2005), pp. 2270–2285.
[27] T. Jayalakshmi and A. Santhakumaran. “Statistical normalization and back propagation
for classification”. In: International Journal of Computer Theory and Engineering 3.1 (2011),
pp. 1793–8201.
[28] H. Katz. The Media Handbook A Complete Guide to Advertising Media Selection, Planning,
Research, and Buying. 2nd ed. London, UK: Awarence Erlbaum Associates, 2003, pp. 34–
35, 42–43, 104–108, 123, 168.
[29] D. Khryashchev, A. Papiu, J. Xuan, O. Dinica, K. Hubert, and H. Vo. “Who Watches
What: Forecasting Viewership for the Top 100 TV Networks”. In: Proceedings of the In-
ternational Conference on Computational Data and Social Networks. Nov. 2019, pp. 163–174.
[30] W. Kim, K.S Kim, J.E Lee, D.Y Noh, S.W Kim, Y.S Jung, M.Y Park, and R.W Park. “De-
velopment of novel breast cancer recurrence prediction model using support vector
machine”. In: Journal of breast cancer 15.2 (2012), pp. 230–238.


[31] R. Kohavi. “A study of cross-validation and bootstrap for accuracy estimation and
model selection”. In: Proceedings of the International Joint Conference on Artificial Intelli-
gence. Aug. 1995, pp. 1137–1145.
[32] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. “Exploring strategies for train-
ing deep neural networks.” In: Journal of machine learning research 10.1 (2009), pp. 1–40.
[33] R.J Lewis. “An introduction to classification and regression tree (CART) analysis”. In:
Proceedings of the Annual meeting of the society for academic emergency medicine in San Fran-
cisco, California. 2000.
[34] R. Lippmann. “An introduction to computing with neural nets”. In: IEEE ASSP Maga-
zine 4.2 (1987), pp. 4–22.
[35] G.D Merkel, R.J Povinelli, and R.H Brown. “Short-term load forecasting of natural gas
with deep neural network regression”. In: Energies 11.8 (2018), pp. 1–12.
[36] D. Meyer and R.J Hyndman. “The accuracy of television network rating forecasts: The
effects of data aggregation and alternative models”. In: Model Assisted Statistics and Ap-
plications 1.3 (Nov. 2006), pp. 147–155.
[37] MMS. MMS Årsrapport 2021. https://mms.se/rapporter-lista.php/?t=ty&y=Årsrapporter. 2022.
[38] L.C Molina, L. Belanche, and A. Nebot. “Feature selection algorithms: a survey and
experimental evaluation”. In: Proceedings of the IEEE International Conference on Data
Mining. Dec. 2002, pp. 306–313.
[39] P.M Napoli. “Audience Measurement and Media Policy: Audience Economics, the Di-
versity Principle, and the Local People Meter”. In: Communication Law and Policy 10.4
(2005), pp. 349–382.
[40] M. Z. Naser and Amir H. Alavi. “Error Metrics and Performance Fitness Indicators for
Artificial Intelligence and Machine Learning in Engineering and Sciences”. In: Architec-
ture, Structures and Construction 1 (2021).
[41] K. Nikolopoulos, P. Goodwin, A. Patelis, and V. Assimakopoulos. “Forecasting with
cue information: A comparison of multiple regression with alternative forecasting ap-
proaches”. In: European Journal of Operational Research 180.1 (July 2007), pp. 354–368.
[42] J. Ogutu, T. Schulz-Streeck, and H. Piepho. “Genomic selection using regularized linear
regression models: ridge regression, lasso, elastic net and their extensions”. In: Proceed-
ings of the European workshop on QTL mapping and marker assisted selection. May 2012,
S10.
[43] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z.B Celik, and A. Swami. “The Limi-
tations of Deep Learning in Adversarial Settings”. In: Proceedings of the IEEE European
Symposium on Security and Privacy (EuroS P). Mar. 2016, pp. 372–387.
[44] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-
del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M.
Brucher, M. Perrot, and E. Duchesnay. “Scikit-learn: Machine Learning in Python”. In:
Journal of Machine Learning Research 12.1 (2011), pp. 2825–2830.
[45] S. Sereday and J. Cui. “Using Machine Learning to Predict Future TV Ratings”. In:
Nielsen Journal of Measurement 1.3 (Feb. 2017), pp. 3–12.
[46] R. Sethuraman, G.J Tellis, and R.A Briesch. “How Well Does Advertising Work? Gener-
alizations from Meta-Analysis of Brand Advertising Elasticities”. In: Journal of Market-
ing Research 48.3 (Oct. 2011), pp. 457–471.
[47] S. Sharma, S. Sharma, and A. Athaiya. “Activation functions in neural networks”. In:
International Journal of Engineering Applied Sciences and Technology 4.12 (2020), pp. 310–
316.


[48] M. Shcherbakov, A. Brebels, N.L Shcherbakova, A. Tyukov, T.A Janovsky, and V.A
Kamaev. “A survey of forecast error measures”. In: World applied sciences journal 24.24
(2013), pp. 171–176.
[49] C. Shorten and T.M Khoshgoftaar. “A survey on image data augmentation for deep
learning”. In: Journal of big data 6.1 (2019), pp. 1–48.
[50] P.Y Simard, D. Steinkraus, and J.C Platt. “Best practices for convolutional neural net-
works applied to visual document analysis.” In: Proceedings of the International Confer-
ence on Document Analysis and Recognition. Aug. 2003, pp. 958–963.
[51] K. Singh and S. Upadhyaya. “Outlier detection: applications and techniques”. In: Inter-
national Journal of Computer Science Issues (IJCSI) 9.1 (2012), p. 307.
[52] R. Singh and S. Vij. “Socio-economic and ethical implications of advertising-A percep-
tual study”. In: Proceedings of the International Marketing Conference on Marketing & Soci-
ety. Apr. 2007, pp. 46–59.
[53] N.C Smith and E Cooper-Martin. “Ethics and Target Marketing: The Role of Product
Harm and Consumer Vulnerability”. In: Journal of Marketing 61.3 (1997), pp. 1–20.
[54] Y.Y Song and Y. Lu. “Decision tree methods: applications for classification and predic-
tion”. In: Shanghai archives of psychiatry 27.2 (2015), p. 130.
[55] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. “Dropout:
a simple way to prevent neural networks from overfitting”. In: Journal of machine learn-
ing research 15.1 (2014), pp. 1929–1958.
[56] L. Torgo. Encyclopedia of Machine Learning and Data Mining. 2nd ed. Boston, MA, USA:
Springer US, 2017, pp. 1080–1083.
[57] S. Varma and R. Simon. “Bias in error estimation when using cross-validation for model
selection”. In: BMC bioinformatics 7.1 (2006), pp. 1–8.
[58] Q. Wen, L. Sun, F. Yang, X. Song, J. Gao, X. Wang, and H. Xu. “Time Series Data Augmen-
tation for Deep Learning: A Survey”. In: Proceedings of the International Joint Conference
on Artificial Intelligence. Aug. 2021.
[59] X. Ying. “An overview of overfitting and its solutions”. In: Journal of Physics: Conference
Series 1168.2 (2019), pp. 22–28.
[60] Y. Xia, C. Liu, Y.Y Li, and N. Liu. “A boosted decision tree approach using Bayesian
hyper-parameter optimization for credit scoring”. In: Expert Systems with Applications
78 (2017), pp. 225–241.

A Additional details

In this appendix, additional plots and tables for the project are presented.

A.1 Target feature plots


This section contains scatter plots of all features against the target variable reach.

Figure A.1: How the feature GRP relates to the target value reach. (a) Target vs GRP scatter plot. (b) Target vs logarithm of GRP scatter plot.

Figure A.2: How the feature CPP30 relates to the target value reach. (a) Target vs CPP30 scatter plot. (b) Target vs logarithm of CPP30 scatter plot.

Figure A.3: How the features prime time and position in break relate to the target value reach. (a) Target vs prime time scatter plot. (b) Target vs position in break scatter plot.

Figure A.4: How the feature spot length relation relates to the target value reach.

Figure A.5: How the features start year and start month relate to the target value reach. (a) Target vs start year scatter plot. (b) Target vs start month scatter plot.

Figure A.6: How the feature periods relates to the target value reach.

Figure A.7: How the features minimum and maximum age relate to the target value reach. (a) Target vs maximum age scatter plot. (b) Target vs minimum age scatter plot.

Figure A.8: How the feature number of channels relates to the target value reach.

A.2 Data augmentation model hyper-parameters


In this section, the hyper-parameters for all models trained on augmented data are presented.

Hyper-parameter Without feature selection With feature selection


Penalty factor λ 1.866 1.451
Table A.1: Hyper-parameters used for the ridge regression models trained on augmented
data.

Tree Attribute Without feature selection With feature selection


Pre-pruning Max depth 7 7
Node count 211 227
Post-pruning Max depth 7 7
Node count 43 51
Table A.2: Tree characteristics for pre- and post-pruned tree trained on augmented data using
both with and without feature selection.

Hyper-parameter Without feature selection With feature selection


Shrinkage η 0.0540 0.0402
Feature sub-sampling ratio 0.3588 0.6561
Max depth 4 4
Number of trees 163 171
L2 regularization λ 0.0090 5.43e-05
Table A.3: Hyper-parameters used for the XGBoost models trained on augmented data.


Hyper-parameter Without feature selection With feature selection


Number of hidden layers 1 1
Width per hidden layer 28 13
Learning rate η 0.0004 0.0007
L2 regularization 0.0046 0.0070
Table A.4: Architecture and hyper-parameters used for the neural network models trained on
augmented data.

B Extended experiments

In this appendix, the results from the extended experiments are presented.

B.1 Start year removed


Table B.1 presents the mean absolute percentage error for every model, with and without feature selection, where the feature start year has been removed. The motivation for this experiment is that the MMS annual report indicated that television viewing was similar for the years 2019-2021 and that, by that reasoning, the start year of a campaign should be insignificant [37].

Model Without feature selection With feature selection


Baseline 10.727 % -
Ridge regression 7.833 % 7.833 %
Decision tree regressor
Pre-pruning 8.167 % 8.214 %
Post-pruning 8.525 % 8.056 %
XGBoost 5.504 % 5.960 %
Neural network 6.553 % 6.877 %
Table B.1: The mean absolute percentage error for all models with and without feature selec-
tion with the start year feature manually removed. The baseline did not use feature selection.

When comparing Table B.1 with the results in the report, it can be concluded that all models, except the post-pruned decision tree with feature selection, performed worse when the feature start year was removed.


B.2 Feature selection using augmented data


In this section, the features selected by the forward variable selection method using augmented data are presented. The ordering of the features is determined by the order in which they were selected by the forward variable selection algorithm. The feature selection resulted in three features being discarded (logarithm of GRP, logarithm of CPP30, and prime time). The remaining features were:

1. Periods

2. GRP

3. Position in break

4. Maximum age

5. Spot length relation

6. Minimum age

7. The logarithm of CPP30

8. Start year

9. Start month

10. Number of channels

Table B.2 shows the results for every model using the augmented feature selection.

Model MAPE
Ridge regression 14.045 %
Decision tree regressor
Pre-pruning 13.763 %
Post-pruning 14.106 %
XGBoost 11.221 %
Neural network 15.839 %
Table B.2: Results for all models using feature selection, where the feature selection was performed on augmented data.

Comparing the results in Table B.2 with the results for the models trained on the augmented training set with the originally selected features in Table 4.9, it is evident that the originally selected features performed better. However, the ridge model performed better with the augmented feature selection. This is expected, given that ridge regression is the model used in the feature selection. This result further supports the reasoning that the feature periods should be removed from the augmented dataset, because it has a different distribution than in the original dataset.

B.3 Cross-validation
In this section, the result of the models’ performances and the feature importance are pre-
sented using five-fold nested cross-validation, as suggested by Varma and Simon [57]. The
result from the cross-validation is then compared to the results utilizing the hold-out method,
found in Section 4.2.

Model results
Figure B.1 presents the average MAPE over the folds for every model, together with a graphical representation of the uncertainty.


Figure B.1: The models' performance using five-fold nested cross-validation, with and without feature selection. The height of each bar is determined by the average MAPE of the five-fold cross-validation. The error bar represents the 95% confidence interval.

From Figure B.1 it is clear that the baseline model is outperformed by every other model; even at the lower bound of its error bar, the baseline is outperformed by every other model's upper bound. The XGBoost model using all features was the best-performing model with 5.874% MAPE. It was also the only model that performed better when using the whole feature set; all other models improved or had the same results using the selected features.
The ridge regression model performed similarly with and without the selected feature set. When comparing the models' error bars, it is visible that the XGBoost and ridge regression models had the smallest uncertainty, while the baseline and the decision trees had the largest.
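For reference, error bars of the kind shown in Figure B.1 can be computed from the per-fold MAPE values as sketched below, using a normal approximation for the 95% confidence interval; the fold values are placeholders, not the project's results.

import numpy as np

# Placeholder per-fold MAPE values from the outer cross-validation loop.
fold_mape = np.array([0.058, 0.061, 0.055, 0.063, 0.057])
mean_mape = fold_mape.mean()
half_width = 1.96 * fold_mape.std(ddof=1) / np.sqrt(len(fold_mape))
print(f"MAPE = {mean_mape:.3f} +/- {half_width:.3f}")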

Feature importance
Figure B.2 shows the feature importances for three of the models. The feature importance in the figure is the importance averaged over the folds of the cross-validation.

Figure B.2: XGBoost's, post-pruned and pre-pruned decision tree's average feature importances using five-fold nested cross-validation. The error bar represents the 95% confidence interval.

Figure B.2 clearly shows that GRP was the most important feature for the three models. Both decision tree models valued GRP significantly more than the XGBoost model did.

Given that GRP was much more important than the other features, a zoomed-in version of the graph is needed to compare the importance of the remaining features.

Figure B.3: A zoomed-in version of the five-fold cross-validation feature importance bar plot.

When inspecting the zoomed-in graph in Figure B.3, it is visible that both decision trees valued each feature similarly. The importance of the remaining features was higher for XGBoost than for both decision trees. This is reasonable, given that the importance of GRP for XGBoost was smaller than for the decision trees, see Figure B.2.
Apart from GRP, no other feature stands out drastically, though some features were more important than others. For the decision tree models, the maximum age, number of channels, and minimum age were considered more important than the rest, with minimum age slightly lower than the other two. The XGBoost model valued maximum and minimum age, periods, and start year notably more than the other features, with CPP30 and number of channels valued just below.
Overall, the error bars were wider for the XGBoost model than for the decision tree models. This could be because the XGBoost model is an ensemble consisting of multiple decision trees, which results in a more complex model. A more advanced model will learn more complex relationships between the data points and can therefore become more sensitive to the training dataset, resulting in higher uncertainty.
