On The Evaluation of Generative Models in Music
https://doi.org/10.1007/s00521-018-3849-7
ORIGINAL ARTICLE
Received: 18 June 2018 / Accepted: 26 October 2018 / Published online: 3 November 2018
© Springer-Verlag London Ltd., part of Springer Nature 2018
Abstract
The modeling of artificial, human-level creativity is becoming increasingly achievable. In recent years, neural networks have been successfully applied to tasks such as image and music generation, demonstrating their potential for realizing computational creativity. The fuzzy definition of creativity, combined with the varying goals of the evaluated generative systems, makes subjective evaluation appear to be the only viable methodology. We review the evaluation of generative music systems and discuss its inherent challenges. Although subjective evaluation should remain the ultimate choice for assessing creative results, researchers unfamiliar with rigorous subjective experiment design and without the resources to execute a large-scale experiment face challenges regarding the reliability, validity, and replicability of their results. In numerous studies, this leads to the reporting of insignificant and possibly irrelevant results and to a lack of comparability with similar and previous generative systems. Therefore, we propose a set of simple, musically informed objective metrics that enable an objective and reproducible way of evaluating and comparing the output of music generative systems. We demonstrate the usefulness of the proposed metrics with several experiments on real-world data.
Thus, a method for objective evaluation of generative systems is desirable.

The image generation community has benefited from the introduction of the idea of the inception score by Salimans et al. [47]. It uses a pattern recognition model to assess the generated sample. The general concept of the inception score is based on the assumption that a well-trained image classifier roughly has a human-like classification ability [47]. This idea has been adapted by multiple researchers to allow for an objective measure of various generative systems [26, 29, 39]. The idea of the inception score is convincing and the first results look promising; ultimately, however, the assumed correlation to human judgment still needs further scientific examination [19, 64].
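For reference, the inception score as defined in [47] rates a generator by how confidently the pre-trained classifier labels each generated sample and how diverse the predicted labels are across samples:

\[
\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\Big),
\]

where $p_g$ denotes the distribution of generated samples, $p(y \mid x)$ the classifier's class posterior for a sample $x$, and $p(y)$ the marginal class distribution over all generated samples.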
The evaluation of generative music systems faces even harder challenges than that of image generation systems [9]. The sequential yet highly structured form, the ever-changing interaction between composition and performance, and the abstract nature of meaning and emotion in music [36, 61] make a semantic description of music exceedingly hard. The automatic analysis and categorization of music, although it has made great progress, is not close to human-level performance [35]. This makes assessing music very difficult [3, 22, 41, 59] and partly explains why music assessment has not been automated by computational models so far.

Despite these high-level challenges, we will show below that state-of-the-art generative music systems struggle with creating musical content that follows basic technical rules and expectations. We argue that these technicalities have to be solved before addressing questions of aesthetics of creative works with high-level structural and harmonic properties.

Therefore, we propose a formative evaluation strategy for systems generating symbolic music. The proposed method does not aim at assessing musical pieces in the context of human-level creativity, nor does it attempt to model the aesthetic perception of music. It rather applies the concept of multicriteria evaluation [54] in order to provide metrics that assess basic technical properties of the generated music and help researchers identify issues and specific characteristics of both model and dataset. The usefulness of the presented method is demonstrated through a series of experiments, including dataset analysis, comparison of state-of-the-art music generation models, and assessment of generative music systems.

2 Related work

As mentioned above, research on automatic music generation systems has suffered from the difficulty of designing evaluation methodologies [42]. The two challenges of measuring the success of a generative system are addressing the summative and the formative assessment of the system behavior. Subjective approaches to measuring the success of generative systems by means of listening experiments can often be categorized as summative assessment, while objective evaluation strategies mostly fall into the category of formative assessment. Confusing these two challenges leads to unclear evaluation strategies. Although subjective evaluation is generally preferable for evaluating generative modeling, it might require significant resources. Objective methods, on the other hand, can be easily executed yet often lack musical relevance as they are often not based on musical rule systems or heuristics.

2.1 Subjective evaluation in music generation

Most assessments of generated symbolic music are based on inputs from human listeners. These evaluations either follow the concept of a musical Turing test [3] or use query metrics based on the modeled compositional theory [2].

The Turing test [55] follows an intuitive concept that evaluates whether a machine is able to exhibit behavior indistinguishable from humans. One strategy to adapt the Turing test to generative music systems is asking the subjects to identify the pieces they consider to be composed by a human as opposed to a computer [34]. This strategy has been used in several studies as listed in Table 1 [1, 21, 24, 25, 32, 49]. Over the past decades, shortcomings of the Turing test have been pointed out in various areas [2, 17, 44]. Many of these problems also apply to musical Turing tests. One of the fundamental issues, however, is that many studies confound the two questions of whether a piece is aesthetically pleasing and whether it is composed by a human.

The design of a listening experiment is complex due to the many variables, ranging from the selection and rendition of audio examples, the listening environment, and the selection of subjects, to the phrasing of the questions. Without proper guidance (compare, e.g., [6]), we find that many contemporary studies struggle with presenting significant scientific evidence. Table 1 lists some of the variables for several major subjective evaluation studies in the context of music generation. It is worth noting that all of these evaluations are performed with a different problem configuration, i.e., different evaluation criteria are used, and both the questionnaires and the listening examples are proprietary (if not arbitrary) and hard to compare. Without addressing these issues properly, the reported results can only be understood as preliminary evaluation results and fail to represent a scientific benchmark.
First, the majority of them ignore factors associated with the subjects themselves (e.g., their level of expertise), which influences further analysis and the reliability of the experimental result [6]. Second, most of the studies rely—probably due to limited resources—on a relatively small sample size [11, 43, 60], which raises questions about the range of the confidence interval and the study's statistical significance (which are often not reported) [33]. Note that the common lack of reported statistical measures of confidence and significance in itself could be seen as an indicator of insufficient scientific rigor. Finally, some of the studies rely on the preference of one model over another [11, 60]. The drawback of such a test paradigm is the absence of a standard comparison or absolute reference. While it can be used to measure relative differences or improvements, it cannot provide any absolute measurement of quality.

Last but not least, these tests carry the risk of overestimating the subject's comprehension, as Ariza concludes after comparing several subjective evaluation methods (e.g., Musical Turing Tests, Musical Directive Toy Tests, and Musical Output Toy Tests) [2].

2.2 Objective evaluation in music generation

Given the advantages over subjective evaluation with respect to reproducibility and required resources, several recent studies have assessed their models objectively. We categorize the objective evaluation methods used by recent studies on data-driven music generation into the following categories: (1) probabilistic measures without musical domain knowledge, (2) task-/model-specific metrics, and (3) metrics using general musical domain knowledge.

2.2.1 Probabilistic measures

Evaluation metrics based on probabilistic measures such as likelihood and density estimation have been successfully used in tasks such as image generation [54] and are increasingly used in music-related tasks as well [14, 52]. For example, Huang et al. [24] propose a frame-wise evaluation computing the negative log-likelihood between the model output and the ground truth across frames. Similarly, Johnson considers the note combinations over time steps of the training data as the ground truth and reports the summation of the generated sequence's log-likelihood across notes and time steps [27]. Since the recurrent model used in his study is trained with the goal of maximizing the log-likelihood of each training sequence, the measure is argued to be a meaningful quantitative measure of performance. The probabilistic measures used provide objective information, yet Theis et al. observe that "a good performance with respect to one criterion does not necessarily imply a good performance with respect to another criterion" and provide examples of bad samples with very high likelihoods [54].
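As a minimal illustration of such a probabilistic measure, the sketch below computes a frame-wise negative log-likelihood between predicted note probabilities and a binary ground-truth piano roll; the array layout and function name are illustrative assumptions, not the exact setup used in [24] or [27].

```python
import numpy as np

def framewise_nll(pred_probs: np.ndarray, ground_truth: np.ndarray, eps: float = 1e-12) -> float:
    """Average negative log-likelihood per frame.

    pred_probs:   (num_frames, num_pitches) predicted probability of each note being active.
    ground_truth: (num_frames, num_pitches) binary piano roll of the reference sequence.
    """
    p = np.clip(pred_probs, eps, 1.0 - eps)  # avoid log(0)
    # Bernoulli log-likelihood of the observed piano roll under the model's predictions
    ll = ground_truth * np.log(p) + (1.0 - ground_truth) * np.log(1.0 - p)
    return float(-ll.sum(axis=1).mean())     # NLL summed over pitches, averaged over frames
```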
2.2.2 Model-specific metrics

As the approaches and models vary greatly between different generative systems, some of the evaluation metrics are correspondingly designed for a specific model or task. Bretan et al. proposed a metric for successfully predicting a music unit from a pool of units in a generative system by evaluating the rank of the target unit [8]. Mogren designed metrics informed by statistical measurements of polyphony, scale consistency, repetitions, and tone span to monitor the model's characteristics during its training [37]. Common to these evaluation approaches is the use of domain-specific, custom-designed metrics as opposed to standard metrics. Obviously, the authors recognized that standard metrics (e.g., the edit distance between melodies) are musically meaningless and implemented metrics inspired by domain knowledge instead. The variability and diversity of the proposed metrics, however, lead to comparability issues. The design of nonstandard metrics also poses additional dangers, such as evaluating only one aspect of the output, or evaluating with a metric that is part of the system design.

2.2.3 Metrics based on domain knowledge

To address the multi-criteria nature of generative systems and their evaluation [9], various human-interpretable metrics have been proposed. More specifically, these metrics integrate musical domain knowledge and enable detailed evaluation with respect to specific music characteristics.
Chuan et al. utilize metrics modeling the tonal tension and interval frequencies to compare how different feature representations can influence a model's performance [12]. Sturm et al. [52] provide a statistical analysis of the musical events (occurrence of specific meters and modes, pitch class distributions, etc.), followed by a discussion with examples on the different application scenarios. Similarly, Dong et al. apply statistical analysis including tonal distance, rhythmic patterns, and pitch classes to evaluate a multi-track music generator [14]. The advantages of metrics taking into account domain knowledge lie not only in their interpretability, but also in their generalizability and validity—at least as long as the designed model aims to generate music under the established rules.

3 Method

Following the approach of using domain knowledge for designing human-interpretable evaluation metrics for generative music systems, we present a formative evaluation strategy based on a comprehensive set of simple yet musically meaningful features that can be easily applied to a wide variety of different symbolic music generation models.

The two targets of the proposed evaluation strategy are to provide (1) absolute metrics in order to give insights into properties and characteristics of a generated or collected set of data and (2) relative metrics in order to compare two sets of data, e.g., training and generated. The overall method is illustrated in Fig. 1 and described below.

In a first step, we gather two collections of samples as our input datasets. For the application of objective evaluation, one dataset contains generated samples, the other contains samples from the training (target) dataset. This approach can also be used for applications such as dataset analysis or the comparison of characteristics of two generative systems. We then extract a set of custom-designed features that are rooted in musical domain knowledge yet easy to understand and interpret. These features encompass both pitch-based and rhythm-based features. After extracting these features for both datasets, we are able to compute both an absolute measurement (Fig. 1, top) and a relative measurement. The absolute measurement can provide useful insights to a system developer about the training dataset properties and the generative system's characteristics.
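To illustrate the kind of pitch- and rhythm-based features involved, the sketch below computes a few of them (pitch count, pitch range, pitch class histogram, pitch class transition matrix, and average inter-onset interval) from a plain list of note events; the (pitch, onset, duration) note representation and the function names are illustrative assumptions, not the exact definitions used in the released toolbox.

```python
import numpy as np

# A note is assumed to be a (midi_pitch, onset_time, duration) triple.

def pitch_count(notes):
    """Number of distinct pitches used in the sample."""
    return len({int(p) for p, _, _ in notes})

def pitch_range(notes):
    """Distance in semitones between the highest and the lowest pitch."""
    pitches = [int(p) for p, _, _ in notes]
    return max(pitches) - min(pitches)

def pitch_class_histogram(notes):
    """Normalized 12-bin, octave-independent histogram of pitch classes."""
    hist = np.zeros(12)
    for p, _, _ in notes:
        hist[int(p) % 12] += 1
    return hist / max(hist.sum(), 1)

def pitch_class_transition_matrix(notes):
    """Normalized 12x12 matrix of transitions between successive pitch classes."""
    ordered = sorted(notes, key=lambda n: n[1])  # sort by onset time
    ptm = np.zeros((12, 12))
    for (p1, _, _), (p2, _, _) in zip(ordered, ordered[1:]):
        ptm[int(p1) % 12, int(p2) % 12] += 1
    return ptm / max(ptm.sum(), 1)

def avg_inter_onset_interval(notes):
    """Average time between consecutive note onsets."""
    onsets = sorted(o for _, o, _ in notes)
    return float(np.mean(np.diff(onsets))) if len(onsets) > 1 else 0.0
```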
The relative measurement (Fig. 1, bottom), on the other hand, allows the comparison of two distributions along various dimensions. It is computed by first applying pairwise exhaustive cross-validation to compute the distance of each sample to either the same dataset (intra-dataset) or the other dataset (inter-dataset). The results are distance histograms per feature. Next, the probability density function (PDF) of each feature's distance histogram is estimated by kernel density estimation [50].

Finally, we compute two metrics for the objective evaluation of generative systems from the training dataset's intra-set distance PDF (target distribution) and the inter-set distance PDF between the training and generated datasets: (1) the area of overlap and (2) the Kullback–Leibler divergence (KLD). The steps are introduced in detail in the following sections.
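A minimal sketch of the relative measurement for a single scalar feature is given below: it computes intra-set and inter-set distances, estimates both distance PDFs with a Gaussian kernel density estimate, and derives the overlap area and the KLD on a common grid. The function names and the absolute difference used as the distance are illustrative assumptions, not the toolbox implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde, entropy

def pairwise_distances(a, b=None):
    """Exhaustive pairwise distances: intra-set if b is None, inter-set otherwise."""
    if b is None:
        return np.array([abs(x - y) for i, x in enumerate(a) for j, y in enumerate(a) if i != j])
    return np.array([abs(x - y) for x in a for y in b])

def relative_measurement(train_feat, gen_feat, grid_size=1000):
    """Overlap area and KLD between the intra-set (training) and inter-set distance PDFs."""
    intra = pairwise_distances(train_feat)            # distances within the training set
    inter = pairwise_distances(train_feat, gen_feat)  # distances between training and generated sets

    pdf_intra = gaussian_kde(intra)                   # kernel density estimates of both
    pdf_inter = gaussian_kde(inter)                   # distance distributions

    # Evaluate both PDFs on a common grid covering all observed distances
    grid = np.linspace(min(intra.min(), inter.min()), max(intra.max(), inter.max()), grid_size)
    p = pdf_intra(grid) + 1e-12
    q = pdf_inter(grid) + 1e-12

    overlap_area = np.trapz(np.minimum(p, q), grid)   # area shared by the two PDFs
    kld = entropy(p, q)                               # KL divergence of intra w.r.t. inter
    return overlap_area, kld
```

Under this reading, an overlap area close to 1 and a small KLD indicate that the distances between generated and training samples are distributed similarly to the distances within the training set itself.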
Fig. 1 Overview of the proposed evaluation method: features are extracted from both the training dataset and the generated dataset. The absolute measurement (top) computes the mean and STD of each feature per dataset; the relative measurement (bottom) applies pairwise cross-validation to obtain intra-set and inter-set distances for each feature, estimates their PDFs by kernel density estimation, and compares each intra-set PDF with the inter-set PDF using KLD and overlap area
evaluation method. We discuss how parameters can influence the result of a generative system by comparing the generated samples with the training dataset.

4.1 Experiment 1: Dataset evaluation

Musical style is defined by a set of musical characteristics. Due to the complexity of musical content, observing the style and properties of a music dataset can be a major challenge. This experiment aims to demonstrate how the proposed approach allows characterizing data from two different music genres and provides insights into genre-specific properties.

4.1.1 Input datasets

The two chosen genres are folk and jazz music. The folk music dataset consists of Irish tunes collected from Henrik Norbeck's ABC Tunes website [23]. The jazz music dataset comprises jazz lead sheets from both the Wikifonia database [51] and publicly available jazz solo transcriptions collected by Mason et al. [8].

The folk and jazz music datasets contain 2351 and 392 entries, respectively. A pilot experiment determining the necessary number of samples was carried out. The experiment was then executed with 100 randomly selected songs from each dataset. Of these songs, only the first 8 bars are considered.

4.1.2 Analysis and discussion

Table 2 lists the results for both the intra-set distances and the absolute measurements for features with one dimension. We can make the following observations. First, the higher mean of the intra-set self-distance for nearly all features in the jazz genre as compared to folk indicates that samples in the jazz genre generally have a higher diversity, a result that matches expectation as folk is often based on simple patterns [45] while jazz generally allows more freedom in its musical composition [7]. Second, we observe considerable differences for the absolute measures of features such as note count and average inter-onset interval.

Figure 3a illustrates the average pitch class transition matrices (PCTM). The folk dataset is more restricted in the usage of certain pitches (i.e., D♯, F, G♯, B♭) and shows a comparably sparse matrix compared to jazz, where both pitches and pitch transitions tend to have more variety.

We can also observe that the folk music dataset shows a larger mean for features such as the note length histogram (NLH) and the note length transition matrix (NLTM). However, by illustrating the average NLTMs in Fig. 3b, we notice that the folk dataset again shows a sparse matrix as compared to the jazz dataset. This implies that the jazz dataset has a higher variety of note length transitions within a song while having a lower diversity of note length transitions across the dataset.

In data-driven approaches to music generation, the output of the generative system should directly relate to the characteristics of the training dataset. The presented absolute measures allow for a musically intuitive way of highlighting various dimensions of such characteristics. This can help with the critical step of designing a generalizable dataset, possibly from various sources, for training a generative system.
Fig. 3 Example of absolute measurement: (a) average pitch class transition matrix (PCTM) and (b) average note length transition matrix (NLTM) of the jazz and folk music datasets (see Sect. 4.1)
4.2 Experiment 2: System comparison

The second experiment compares MidiNet [60], a generative adversarial network (GAN) for symbolic-domain music generation, with the melody lookback recurrent neural network (Lookback RNN) of the Magenta project [58]. As discussed in the previous Sect. 3.4, the proposed objective evaluation can assist in studying different model structures and behaviors when the training datasets for both models are available. In some cases, however, the training datasets are inaccessible, as is the case for Magenta. Given this issue, we consider this scenario for the proposed method to compare the characteristics of different models. We again exploit the intra-set distances and the absolute measurement utilized in the previous experiment. Furthermore, we attempt to relate reported subjective evaluation results to the identified characteristics.

4.2.1 Input datasets

We implement and train the so-called MidiNet "Model 2" [60], below referred to as MidiNet 2, using 526 MIDI tabs of 8 bars each parsed from TheoryTab (https://www.hooktheory.com/theorytab). The MidiNet model and the publicly accessible pre-trained model of Magenta's Lookback RNN generate 100 samples each. Each sample contains a melody with 8 bars. The first bar is provided by the user while the remaining 7 bars are generated by the models.

4.2.2 Analysis and discussion

The results of Exp. 2 are shown in Table 3. It can be observed that the two model outputs are distinctly different in several dimensions such as pitch count, pitch interval, and pitch range; this is shown by the fact that the mean values of the inter-set distances are larger than the mean values of both intra-set distances. Furthermore, the absolute measurements NC and PR indicate that MidiNet 2 tends to use more notes and has a higher average pitch range than Magenta's Lookback RNN.

The fact that the outputs of these two systems have been used previously in a subjective study [60, Sect. 5] allows us to compare the subjective results with these objective results. The listening test resulted in a comparable rating for the questions How real and How pleasing the model outputs are; for the question How interesting, however, MidiNet acquired a slightly higher rating. This interestingness result might be related to the characteristics of higher pitch range, pitch count, and note count that we find in the absolute measures.

Magenta's RNN, on the other hand, shows a higher mean among the intra-set distances in these features; this somewhat contradicts the result of the subjective test. Therefore, we investigate this issue further by looking into the STD value, as a higher STD might hint at a lower reliability of the mean value. No clear conclusions can be drawn as the limited sample size in the listening test does not allow for more detailed analysis.

Finally, Fig. 4 showcases another visualization of data characteristics. The PDF of the intra-set distances among features (PCH, PCTM, NLH, and NLTM) is shown in a violin plot, an intuitive visualization of PDFs. The plot echoes the previous argument, where a significantly higher skewness indicates a less diversified intra-set behavior and a higher STD indicates a lower reliability of the similarity measure.
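For a visualization of the kind shown in Fig. 4, the intra-set distance distributions per feature can be drawn as violins, for example with matplotlib; the feature names and distance arrays below are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_intra_set_distances(distances_per_feature):
    """Violin plot of the intra-set distance distribution of each feature."""
    names = list(distances_per_feature)
    data = [np.asarray(distances_per_feature[n]) for n in names]

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.violinplot(data, showmeans=True)        # one violin per feature
    ax.set_xticks(range(1, len(names) + 1))
    ax.set_xticklabels(names)
    ax.set_ylabel("intra-set distance")
    plt.tight_layout()
    plt.show()

# Example call with placeholder data:
# plot_intra_set_distances({"PCH": np.random.rand(200), "PCTM": np.random.rand(200),
#                           "NLH": np.random.rand(200), "NLTM": np.random.rand(200)})
```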
4.3 Experiment 3: Performance evaluation

The final experiment demonstrates the use case of evaluating a generative system. We compare two parametrizations of MidiNet, "Model 1" and "Model 2" [60]. Both models have identical architecture and share the same training data. The difference between the models is that one model does not use feature matching regularizers (MidiNet 1) while the other model does (MidiNet 2).
Table 3 Results of Exp. 2 per feature: intra-set distances and absolute measurements for Magenta's Lookback RNN and for MidiNet 2, and inter-set distances between the two sets of generated samples (values are mean, STD; "–" indicates no value reported)

Feature   Magenta intra-set   Magenta absolute   MidiNet 2 intra-set   MidiNet 2 absolute   Inter-set
PC        2.897, 2.400        7.820, 2.647       2.214, 1.708          11.300, 1.967        4.097, 2.490
PC/bar    4.766, 1.594        –                  4.866, 1.324          –                    4.885, 1.446
NC        10.228, 9.534       27.310, 9.837      6.086, 4.596          30.740, 5.366        8.940, 7.576
NC/bar    6.870, 2.903        –                  7.511, 1.855          –                    7.359, 2.220
PCH       0.490, 0.156        –                  0.385, 0.127          –                    0.440, 0.142
PCH/bar   2.575, 0.371        –                  2.591, 0.283          –                    2.584, 0.326
PCTM      0.441, 0.099        –                  0.300, 0.049          –                    0.386, 0.079
PR        4.796, 3.975        12.650, 4.383      3.013, 2.631          19.600, 2.814        7.681, 4.052
PI        1.209, 1.274        2.940, 1.236       1.105, 0.812          5.559, 0.965         2.773, 1.275
IOI       0.257, 0.241        0.653, 0.248       0.108, 0.095          0.531, 0.101         0.205, 0.212
NLH       0.538, 0.223        –                  0.237, 0.085          –                    0.420, 0.180
NLTM      0.491, 0.187        –                  0.271, 0.059          –                    0.399, 0.152
Results of Exp. 3 per feature: intra-set distances of the training data, MidiNet 1, and MidiNet 2 (mean, STD), and the KLD and overlap area (OA) of each model's inter-set distance PDF with respect to the training data's intra-set distance PDF

Feature   Training intra-set   MidiNet 1 intra-set   MidiNet 1 KLD, OA   MidiNet 2 intra-set   MidiNet 2 KLD, OA
PC        2.527, 2.760         2.367, 1.805          1.185, 0.158        2.214, 1.708          0.130, 0.541
PC/bar    5.446, 2.130         4.551, 1.264          0.812, 0.042        4.866, 1.324          0.572, 0.909
NC        12.360, 9.954        5.085, 3.752          1.058, 0.081        6.086, 4.596          0.009, 0.693
NC/bar    7.804, 2.977         5.210, 1.520          0.442, 0.090        7.511, 1.855          0.181, 0.916
PCH       0.506, 0.169         0.301, 0.082          0.012, 0.563        0.385, 0.127          0.016, 0.814
PCH/bar   2.497, 0.414         1.546, 0.221          0.018, 0.319        2.591, 0.283          0.025, 0.899
PCTM      0.439, 0.107         0.263, 0.036          0.349, 0.277        0.300, 0.049          0.172, 0.483
PR        4.726, 3.935         1.803, 1.443          0.502, 0.164        3.013, 2.631          0.453, 0.400
PI        1.062, 1.508         0.958, 0.767          1.096, 0.198        1.105, 0.812          0.217, 0.443
IOI       0.377, 0.403         0.024, 0.018          0.141, 0.067        0.108, 0.095          0.089, 0.631
NLH       0.506, 0.194         0.174, 0.089          0.331, 0.187        0.237, 0.085          0.024, 0.507
NLTM      0.502, 0.165         0.208, 0.099          0.599, 0.201        0.271, 0.059          0.112, 0.455
5 Conclusion
dataset characteristics. This analysis allows the researcher to draw conclusions about the system's ability to model a certain musical feature of the training dataset, as well as to estimate the variability and the stability of different model designs.

We have released the evaluation framework as an open-source toolbox which implements the demonstrated evaluation and analysis methods along with visualization tools. Our future work includes extending the current toolbox with additional dimensions (e.g., dynamics) and expanding it toward polyphonic music. The toolbox is available in an online repository (https://github.com/RichardYang40148/mgeval).

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.
References

1. Agarwala N, Inoue Y, Sly A (2017) Music composition using recurrent neural networks. Stanford University, Technical Report in CS224
2. Ariza C (2009) The interrogator as critic: the Turing test and the evaluation of generative music systems. Comput Music J 33(2):48–70
3. Asmus EP (1999) Music assessment concepts: a discussion of assessment concepts and models for student assessment introduces this special focus issue. Music Educ J 86(2):19–24
4. Babbitt M (1960) Twelve-tone invariants as compositional determinants. Music Q 46(2):246–259
5. Balaban M, Ebcioğlu K, Laske O (eds) (1992) Understanding music with AI: perspectives on music cognition. MIT Press, Cambridge
6. Bech S, Zacharov N (2007) Perceptual audio evaluation—theory, method and application. Wiley, London
7. Boot P, Volk A, de Haas WB (2016) Evaluating the role of repeated patterns in folk song classification and compression. J New Music Res 45(3):223–238
8. Bretan M, Weinberg G, Heck L (2017) A unit selection methodology for music generation using deep neural networks. In: International conference on computational creativity (ICCC). Atlanta, Georgia, USA
9. Briot JP, Hadjeres G, Pachet F (2019) Deep learning techniques for music generation—a survey. Springer, London
10. Chordia P, Rae A (2007) Raag recognition using pitch-class and pitch-class dyad distributions. In: International society of music information retrieval (ISMIR), pp 431–436. Vienna, Austria
11. Chu H, Urtasun R, Fidler S (2016) Song from PI: a musically plausible network for pop music generation. In: International conference on learning representations (ICLR). San Juan, Puerto Rico
12. Chuan CH, Herremans D (2018) Modeling temporal tonal relations in polyphonic music through deep networks with a novel image-based representation. In: Association for the advancement of artificial intelligence (AAAI). New Orleans, Louisiana, USA
13. Colton S, Pease A, Ritchie G (2001) The effect of input knowledge on creativity. In: Technical reports of the Navy Center for Applied Research in Artificial Intelligence. Washington, DC, USA
14. Dong HW, Hsiao WY, Yang LC, Yang YH (2018) MuseGAN: multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In: Association for the advancement of artificial intelligence (AAAI). New Orleans, Louisiana, USA
15. Gatys LA, Ecker AS, Bethge M (2016) A neural algorithm of artistic style. In: The annual meeting of the vision sciences society. St. Pete Beach, Florida, USA
16. Geisser S (1993) Predictive inference, vol 55. CRC Press, Boca Raton
17. Geman D, Geman S, Hallonquist N, Younes L (2015) Visual Turing test for computer vision systems. Proc Natl Acad Sci 112(12):3618–3623
18. Gero JS, Kannengiesser U (2004) The situated function–behaviour–structure framework. Des Stud 25(4):373–391
19. Gurumurthy S, Sarvadevabhatla RK, Radhakrishnan VB (2017) DeLiGAN: generative adversarial networks for diverse and limited data. In: IEEE conference on computer vision and pattern recognition (CVPR). Honolulu, Hawaii, USA
20. Guyot WM (1978) Summative and formative evaluation. J Bus Educ 54(3):127–129. https://doi.org/10.1080/00219444.1978.10534702
21. Hadjeres G, Pachet F (2016) DeepBach: a steerable model for Bach chorales generation. In: International conference on machine learning (ICML). New York City, NY, USA
22. Hale CL, Green SK (2009) Six key principles for music assessment. Music Educ J 95(4):27–31. https://doi.org/10.1177/0027432109334772
23. Henrik Norbeck's ABC tunes. Last accessed Mar 2018. http://www.norbeck.nu/abc/
24. Huang CZA, Cooijmans T, Roberts A, Courville A, Eck D (2017) Counterpoint by convolution. In: International society of music information retrieval (ISMIR). Suzhou, China
25. Huang KC, Jung Q, Lu J (2017) Algorithmic music composition using recurrent neural networking. Stanford University, Technical Report in CS221
26. Huang X, Li Y, Poursaeed O, Hopcroft J, Belongie S (2016) Stacked generative adversarial networks. In: IEEE conference on computer vision and pattern recognition (CVPR). Las Vegas, Nevada, USA
27. Johnson DD (2017) Generating polyphonic music using tied parallel networks. In: International conference on evolutionary and biologically inspired music and art, pp 128–143. Amsterdam, The Netherlands
28. Jordanous A (2012) A standardised procedure for evaluating creative systems: computational creativity evaluation based on what it is to be creative. Cognit Comput 4(3):246–279
29. Karras T, Aila T, Laine S, Lehtinen J (2017) Progressive growing of GANs for improved quality, stability, and variation. In: International conference on learning representations (ICLR). Toulon, France
30. Krumhansl C, Toiviainen P et al (2000) Dynamics of tonality induction: a new method and a new model. In: International conference on music perception and cognition (ICMPC). Keele, UK
31. Lee K (2006) Automatic chord recognition from audio using enhanced pitch class profile. In: International computer music conference (ICMC). New Orleans, Louisiana, USA
32. Liang F, Gotham M, Johnson M, Shotton J (2017) Automatic stylistic composition of Bach chorales with deep LSTM. In: International society of music information retrieval (ISMIR). Suzhou, China
33. Likert R (1932) A technique for the measurement of attitudes. Arch Psychol 22(140):5–55
34. Marsden A (2013) Music, intelligence and artificiality. In: Readings in music and artificial intelligence, pp 25–38. Routledge
35. Meredith D (2016) Computational music analysis. Springer, Berlin
36. Meyer LB (2008) Emotion and meaning in music. University of Chicago Press, Chicago
37. Mogren O (2016) C-RNN-GAN: continuous recurrent neural networks with adversarial training. In: Advances in neural information processing systems, constructive machine learning workshop (NIPS CML). Barcelona, Spain
38. Moog RA (1986) MIDI: musical instrument digital interface. J Audio Eng Soc 34(5):394–404
39. Mroueh Y, Sercu T (2017) Fisher GAN. In: Advances in neural information processing systems (NIPS). Long Beach, CA, USA
40. O'Brien C, Lerch A (2015) Genre-specific key profiles. In: International computer music conference (ICMC). Denton, Texas, USA
41. Pati KA, Gururani S, Lerch A (2018) Assessment of student music performances using deep neural networks. Appl Sci 8(4):507. https://doi.org/10.3390/app8040507
42. Pearce M, Meredith D, Wiggins G (2002) Motivations and methodologies for automation of the compositional process. Music Sci 6(2):119–147
43. Pearce MT, Wiggins GA (2007) Evaluating cognitive models of musical composition. In: International joint workshop on computational creativity, pp 73–80. London, UK
44. Pease A, Colton S (2011) On impact and evaluation in computational creativity: a discussion of the Turing test and an alternative proposal. In: Proceedings of the AISB symposium on AI and philosophy, p 39. York, United Kingdom
45. Pease T, Mattingly R (2003) Jazz composition: theory and practice. Berklee Press, Boston
46. Ritchie G (2007) Some empirical criteria for attributing creativity to a computer program. Minds Mach 17(1):67–99
47. Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved techniques for training GANs. In: Advances in neural information processing systems (NIPS). Barcelona, Spain
48. Scott DW (2015) Multivariate density estimation: theory, practice, and visualization. Wiley, Hoboken
49. Shin A, Crestel L, Kato H, Saito K, Ohnishi K, Yamaguchi M, Nakawaki M, Ushiku Y, Harada T (2017) Melody generation for pop music via word representation of musical properties. arXiv preprint arXiv:1710.11549
50. Silverman BW (1986) Density estimation for statistics and data analysis, vol 26. CRC Press, Boca Raton
51. Simon I, Morris D, Basu S (2008) MySong: automatic accompaniment generation for vocal melodies. In: Proceedings of the SIGCHI conference on human factors in computing systems, pp 725–734. Florence, Italy
52. Sturm BL, Ben-Tal O (2017) Taking the models back to music practice: evaluating generative transcription models built using deep learning. J Creat Music Syst. https://doi.org/10.5920/JCMS.2017.09
53. Temperley D, Marvin EW (2008) Pitch-class distribution and the identification of key. Music Percept Interdiscip J 25(3):193–212
54. Theis L, van den Oord A, Bethge M (2016) A note on the evaluation of generative models. In: International conference on learning representations (ICLR). San Juan, Puerto Rico. arXiv:1511.01844
55. Turing AM (1950) Computing machinery and intelligence. Mind 59(236):433–460
56. Turlach BA et al (1993) Bandwidth selection in kernel density estimation: a review. Université catholique de Louvain, Louvain-la-Neuve
57. Verbeurgt K, Dinolfo M, Fayer M (2004) Extracting patterns in music for composition via Markov chains. In: International conference on industrial, engineering and other applications of applied intelligent systems, pp 1123–1132. Springer, Ottawa, ON, Canada
58. Waite E, Eck D, Roberts A, Abolafia D (2016) Project Magenta: generating long-term structure in songs and stories. https://magenta.tensorflow.org/blog/2016/07/15/lookback-rnn-attention-rnn/
59. Wu CW, Gururani S, Laguna C, Pati A, Vidwans A, Lerch A (2016) Towards the objective assessment of music performances. In: International conference on music perception and cognition (ICMPC). Hyderabad, AP, India
60. Yang LC, Chou SY, Yang YH (2017) MidiNet: a convolutional generative adversarial network for symbolic-domain music generation. In: International society of music information retrieval (ISMIR). Suzhou, China
61. Zbikowski LM (2002) Conceptualizing music: cognitive structure, theory, and analysis. Oxford University Press, Oxford
62. Zhang W, Wang J (2016) Design theory and methodology for enterprise systems. Enterp Inf Syst 10(3):245–248. https://doi.org/10.1080/17517575.2015.1080860
63. Zhang WJ, Yang G, Lin Y, Ji C, Gupta MM (2018) On definition of deep learning. In: World automation congress (WAC). Stevenson, Washington, USA
64. Zhou Z, Cai H, Rong S, Song Y, Ren K, Zhang W, Wang J, Yu Y (2018) Activation maximization generative adversarial nets. In: International conference on learning representations (ICLR). Vancouver, Canada