On The Evaluation of Generative Models in Music
https://doi.org/10.1007/s00521-018-3849-7
ORIGINAL ARTICLE
Received: 18 June 2018 / Accepted: 26 October 2018 / Published online: 3 November 2018
© Springer-Verlag London Ltd., part of Springer Nature 2018
Abstract
The modeling of artificial, human-level creativity is becoming increasingly achievable. In recent years, neural networks have been successfully applied to tasks such as image and music generation, demonstrating their potential for realizing computational creativity. The fuzzy definition of creativity, combined with the varying goals of the evaluated generative systems, makes subjective evaluation appear to be the only viable methodology. We review the evaluation of generative music systems and discuss its inherent challenges. Although subjective evaluation should remain the ultimate choice for assessing creative results, researchers unfamiliar with rigorous subjective experiment design and without the resources to execute a large-scale experiment face challenges regarding the reliability, validity, and replicability of their results. In numerous studies, this leads to the reporting of insignificant and possibly irrelevant results and to a lack of comparability with similar and previous generative systems. Therefore, we propose a set of simple, musically informed objective metrics that enable an objective and reproducible way of evaluating and comparing the output of music generative systems. We demonstrate the usefulness of the proposed metrics with several experiments on real-world data.
Thus, a method for objective evaluation of generative systems is desirable.

The image generation community has benefited from the introduction of the idea of the inception score by Salimans et al. [47]. It uses a pattern recognition model to assess the generated sample. The general concept of the inception score is based on the assumption that a well-trained image classifier roughly has a human-like classification ability [47]. This idea has been adapted by multiple researchers to allow for an objective measure of various generative systems [26, 29, 39]. The idea of the inception score is convincing and the first results look promising; ultimately, however, the assumed correlation to human judgment still needs further scientific examination [19, 64].
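For reference, the inception score as defined in [47] rates a generator by how confidently the pre-trained classifier labels each generated sample and how diverse the predicted labels are across samples:

\[
\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\Big),
\]

where $p_g$ denotes the distribution of generated samples, $p(y \mid x)$ the classifier's class posterior for a sample $x$, and $p(y)$ the marginal class distribution over all generated samples.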
The evaluation of generative music systems faces even harder challenges than that of image generation systems [9]. The sequential yet highly structured form, the ever-changing interaction between composition and performance, and the abstract nature of meaning and emotion in music [36, 61] make a semantic description of music exceedingly hard. The automatic analysis and categorization of music, although it has made great progress, is not close to human-level performance [35]. This makes assessing music very difficult [3, 22, 41, 59] and partly explains why music assessment has not been automated by computational models so far.

Despite these high-level challenges, we will show below that state-of-the-art generative music systems struggle with creating musical content that follows basic technical rules and expectations. We argue that these technicalities have to be solved before addressing questions of aesthetics of creative works with high-level structural and harmonic properties.

Therefore, we propose a formative evaluation strategy for systems generating symbolic music. The proposed method does not aim at assessing musical pieces in the context of human-level creativity, nor does it attempt to model the aesthetic perception of music. It rather applies the concept of multicriteria evaluation [54] in order to provide metrics that assess basic technical properties of the generated music and help researchers identify issues and specific characteristics of both model and dataset. The usefulness of the presented method is demonstrated through a series of experiments, including dataset analysis, comparison of state-of-the-art music generation models, and assessment of generative music systems.

2 Related work

As mentioned above, research on automatic music generation systems has suffered from the difficulty of designing evaluation methodologies [42]. The two challenges of measuring the success of a generative system are addressing the summative and the formative assessment of the system behavior. Subjective approaches to measuring the success of generative systems by means of listening experiments can often be categorized as summative assessment, while objective evaluation strategies mostly fall into the category of formative assessment. Confusing these two challenges leads to unclear evaluation strategies. Although subjective evaluation is generally preferable for evaluating generative modeling, it might require significant resources. Objective methods, on the other hand, can be easily executed yet often lack musical relevance as they are often not based on musical rule systems or heuristics.

2.1 Subjective evaluation in music generation

Most assessments of generated symbolic music are based on inputs from human listeners. These evaluations either follow the concept of a musical Turing test [3] or use query metrics based on the modeled compositional theory [2].

The Turing test [55] follows an intuitive concept that evaluates whether a machine is able to exhibit behavior indistinguishable from humans. One strategy to adapt the Turing test to generative music systems is asking the subjects to identify the pieces they consider to be composed by a human as opposed to a computer [34]. This strategy has been used in several studies as listed in Table 1 [1, 21, 24, 25, 32, 49]. Over the past decades, shortcomings of the Turing test have been pointed out in various areas [2, 17, 44]. Many of these problems also apply to musical Turing tests. One of the fundamental issues, however, is that many studies confound the two questions of whether a piece is aesthetically pleasing and whether it is composed by a human.

The design of a listening experiment is complex due to the many variables, ranging from the selection and rendition of audio examples, the listening environment, and the selection of subjects, to the phrasing of the questions. Without proper guidance (compare, e.g., [6]), we find that many contemporary studies struggle with presenting significant scientific evidence. Table 1 lists some of the variables for several major subjective evaluation studies in the context of music generation. It is worth noting that all of these evaluations are performed with a different problem configuration, i.e., different evaluation criteria are used, and both the questionnaires and the listening examples are proprietary (if not arbitrary) and hard to compare. Without addressing these issues properly, the reported results can only be understood as preliminary evaluation results and fail to represent a scientific benchmark.
First, the majority of them ignore factors associated with the subjects themselves (e.g., their level of expertise), which influences further analysis and the reliability of the experimental result [6]. Second, most of the studies rely—probably due to limited resources—on a relatively small sample size [11, 43, 60], which raises questions about the range of the confidence interval and the study's statistical significance (which are often not reported) [33]. Note that the common lack of reported statistical measures of confidence and significance in itself could be seen as an indicator of insufficient scientific rigor. Finally, some of the studies rely on the preference of one model over another [11, 60]. The drawback of such a test paradigm is the absence of a standard comparison or absolute reference. While it can be used to measure relative differences or improvements, it cannot provide any absolute measurement of quality.

Last but not least, these tests carry the risk of overestimating the subject's comprehension, as Ariza concludes after comparing several subjective evaluation methods (e.g., Musical Turing Tests, Musical Directive Toy Tests, and Musical Output Toy Tests) [2].

2.2 Objective evaluation in music generation

Given the advantages over subjective evaluation with respect to reproducibility and required resources, several recent studies have assessed their models objectively. We categorize the objective evaluation methods used by recent studies on data-driven music generation into the following categories: (1) probabilistic measures without musical domain knowledge, (2) task-/model-specific metrics, and (3) metrics using general musical domain knowledge.

2.2.1 Probabilistic measures

Evaluation metrics based on probabilistic measures such as likelihood and density estimation have been successfully used in tasks such as image generation [54] and are increasingly used in music-related tasks as well [14, 52]. For example, Huang et al. [24] propose a frame-wise evaluation computing the negative log-likelihood between the model output and the ground truth across frames. Similarly, Johnson considers the note combinations over time steps of the training data as the ground truth and reports the summation of the generated sequence's log-likelihood across notes and time steps [27]. Since the recurrent model used in his study is trained with the goal of maximizing the log-likelihood of each training sequence, the measure is argued to be a meaningful quantitative measure of performance. The probabilistic measures used provide objective information, yet Theis et al. observe that "a good performance with respect to one criterion does not necessarily imply a good performance with respect to another criterion" and provide examples of bad samples with very high likelihoods [54].
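As a minimal illustration of such a probabilistic measure, the sketch below computes a frame-wise negative log-likelihood between predicted note probabilities and a binary ground-truth piano roll; the array layout and function name are illustrative assumptions, not the exact setup used in [24] or [27].

```python
import numpy as np

def framewise_nll(pred_probs: np.ndarray, ground_truth: np.ndarray, eps: float = 1e-12) -> float:
    """Average negative log-likelihood per frame.

    pred_probs:   (num_frames, num_pitches) predicted probability of each note being active.
    ground_truth: (num_frames, num_pitches) binary piano roll of the reference sequence.
    """
    p = np.clip(pred_probs, eps, 1.0 - eps)  # avoid log(0)
    # Bernoulli log-likelihood of the observed piano roll under the model's predictions
    ll = ground_truth * np.log(p) + (1.0 - ground_truth) * np.log(1.0 - p)
    return float(-ll.sum(axis=1).mean())     # NLL summed over pitches, averaged over frames
```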
2.2.2 Model-specific metrics

As the approaches and models vary greatly between different generative systems, some of the evaluation metrics are correspondingly designed for a specific model or task. Bretan et al. proposed a metric for successfully predicting a music unit from a pool of units in a generative system by evaluating the rank of the target unit [8]. Mogren designed metrics informed by statistical measurements of polyphony, scale consistency, repetitions, and tone span to monitor the model's characteristics during its training [37]. Common to these evaluation approaches is the use of domain-specific, custom-designed metrics as opposed to standard metrics. Obviously, the authors recognized that standard metrics (e.g., the edit distance between melodies) are musically meaningless and implemented metrics inspired by domain knowledge instead. The variability and diversity of the proposed metrics, however, lead to comparability issues. The design of nonstandard metrics also poses additional dangers, such as evaluating only one aspect of the output, or evaluating with a metric that is part of the system design.

2.2.3 Metrics based on domain knowledge

To address the multi-criteria nature of generative systems and their evaluation [9], various human-interpretable metrics have been proposed. More specifically, these metrics integrate musical domain knowledge and enable detailed evaluation with respect to specific music characteristics.
Chuan et al. utilize metrics modeling the tonal tension and interval frequencies to compare how different feature representations can influence a model's performance [12]. Sturm et al. [52] provide a statistical analysis of the musical events (occurrence of specific meters and modes, pitch class distributions, etc.), followed by a discussion with examples on the different application scenarios. Similarly, Dong et al. apply statistical analysis including tonal distance, rhythmic patterns, and pitch classes to evaluate a multi-track music generator [14]. The advantages of metrics taking into account domain knowledge lie not only in their interpretability, but also in their generalizability and validity—at least as long as the designed model aims to generate music under the established rules.

3 Method

Following the approach of using domain knowledge for designing human-interpretable evaluation metrics for generative music systems, we present a formative evaluation strategy based on a comprehensive set of simple yet musically meaningful features that can be easily applied to a wide variety of different symbolic music generation models.

The two targets of the proposed evaluation strategy are to provide (1) absolute metrics in order to give insights into properties and characteristics of a generated or collected set of data and (2) relative metrics in order to compare two sets of data, e.g., training and generated. The overall method is illustrated in Fig. 1 and described below.

In a first step, we gather two collections of samples as our input datasets. For the application of objective evaluation, one dataset contains generated samples, the other contains samples from the training (target) dataset. This approach can also be used for applications such as dataset analysis or the comparison of characteristics of two generative systems. We then extract a set of custom-designed features that are rooted in musical domain knowledge yet easy to understand and interpret. These features encompass both pitch-based and rhythm-based features. After extracting these features for both datasets, we are able to compute both an absolute measurement (Fig. 1, top) and a relative measurement. The absolute measurement can provide useful insights to a system developer about the training dataset properties and the generative system's characteristics.
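To illustrate the kind of pitch- and rhythm-based features involved, the sketch below computes a few of them (pitch count, pitch range, pitch class histogram, pitch class transition matrix, and average inter-onset interval) from a plain list of note events; the (pitch, onset, duration) note representation and the function names are illustrative assumptions, not the exact definitions used in the released toolbox.

```python
import numpy as np

# A note is assumed to be a (midi_pitch, onset_time, duration) triple.

def pitch_count(notes):
    """Number of distinct pitches used in the sample."""
    return len({int(p) for p, _, _ in notes})

def pitch_range(notes):
    """Distance in semitones between the highest and the lowest pitch."""
    pitches = [int(p) for p, _, _ in notes]
    return max(pitches) - min(pitches)

def pitch_class_histogram(notes):
    """Normalized 12-bin, octave-independent histogram of pitch classes."""
    hist = np.zeros(12)
    for p, _, _ in notes:
        hist[int(p) % 12] += 1
    return hist / max(hist.sum(), 1)

def pitch_class_transition_matrix(notes):
    """Normalized 12x12 matrix of transitions between successive pitch classes."""
    ordered = sorted(notes, key=lambda n: n[1])  # sort by onset time
    ptm = np.zeros((12, 12))
    for (p1, _, _), (p2, _, _) in zip(ordered, ordered[1:]):
        ptm[int(p1) % 12, int(p2) % 12] += 1
    return ptm / max(ptm.sum(), 1)

def avg_inter_onset_interval(notes):
    """Average time between consecutive note onsets."""
    onsets = sorted(o for _, o, _ in notes)
    return float(np.mean(np.diff(onsets))) if len(onsets) > 1 else 0.0
```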
The relative measurement (Fig. 1, bottom), on the other hand, allows the comparison of two distributions along various dimensions. It is computed by first applying pairwise exhaustive cross-validation to compute the distance of each sample to either the same dataset (intra-dataset) or the other dataset (inter-dataset). The results are distance histograms per feature. Next, the probability density function (PDF) of each feature's distance histogram is estimated by kernel density estimation [50].

Finally, we compute two metrics for the objective evaluation of generative systems from the training dataset's intra-set distance PDF (target distribution) and the inter-set distance PDF between the training and generated datasets: (1) the area of overlap and (2) the Kullback–Leibler divergence (KLD). The steps are introduced in detail in the following sections.
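A minimal sketch of the relative measurement for a single scalar feature is given below: it computes intra-set and inter-set distances, estimates both distance PDFs with a Gaussian kernel density estimate, and derives the overlap area and the KLD on a common grid. The function names and the absolute difference used as the distance are illustrative assumptions, not the toolbox implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde, entropy

def pairwise_distances(a, b=None):
    """Exhaustive pairwise distances: intra-set if b is None, inter-set otherwise."""
    if b is None:
        return np.array([abs(x - y) for i, x in enumerate(a) for j, y in enumerate(a) if i != j])
    return np.array([abs(x - y) for x in a for y in b])

def relative_measurement(train_feat, gen_feat, grid_size=1000):
    """Overlap area and KLD between the intra-set (training) and inter-set distance PDFs."""
    intra = pairwise_distances(train_feat)            # distances within the training set
    inter = pairwise_distances(train_feat, gen_feat)  # distances between training and generated sets

    pdf_intra = gaussian_kde(intra)                   # kernel density estimates of both
    pdf_inter = gaussian_kde(inter)                   # distance distributions

    # Evaluate both PDFs on a common grid covering all observed distances
    grid = np.linspace(min(intra.min(), inter.min()), max(intra.max(), inter.max()), grid_size)
    p = pdf_intra(grid) + 1e-12
    q = pdf_inter(grid) + 1e-12

    overlap_area = np.trapz(np.minimum(p, q), grid)   # area shared by the two PDFs
    kld = entropy(p, q)                               # KL divergence of intra w.r.t. inter
    return overlap_area, kld
```

Under this reading, an overlap area close to 1 and a small KLD indicate that the distances between generated and training samples are distributed similarly to the distances within the training set itself.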
Fig. 1 Overview of the proposed evaluation method: features are extracted from both the training dataset and the generated dataset. The absolute measurement (top) computes the mean and STD of each feature per dataset; the relative measurement (bottom) applies pairwise cross-validation to obtain intra-set and inter-set distances for each feature, estimates their PDFs by kernel density estimation, and compares each intra-set PDF with the inter-set PDF using KLD and overlap area
evaluation method. We discuss how parameters can influence the result of a generative system by comparing the generated samples with the training dataset.

4.1 Experiment 1: Dataset evaluation

Musical style is defined by a set of musical characteristics. Due to the complexity of musical content, observing the style and properties of a music dataset can be a major challenge. This experiment aims to demonstrate how the proposed approach allows characterizing data from two different music genres and provides insights into genre-specific properties.

4.1.1 Input datasets

The two chosen genres are folk and jazz music. The folk music dataset consists of Irish tunes collected from Henrik Norbeck's ABC Tunes website [23]. The jazz music dataset comprises jazz lead sheets from both the Wikifonia database [51] and publicly available jazz solo transcriptions collected by Mason et al. [8].

The folk and jazz music datasets contain 2351 and 392 entries, respectively. A pilot experiment determining the necessary number of samples was carried out. The experiment was then executed with 100 randomly selected songs from each dataset. Of these songs, only the first 8 bars are considered.

4.1.2 Analysis and discussion

Table 2 lists the results for both the intra-set distances and the absolute measurements for features with one dimension. We can make the following observations. First, the higher mean of the intra-set self-distance for nearly all features in the jazz genre as compared to folk indicates that samples in the jazz genre generally have a higher diversity, a result that matches expectation as folk is often based on simple patterns [45] while jazz generally allows more freedom in its musical composition [7]. Second, we observe considerable differences for the absolute measures of features such as note count and average inter-onset interval.

Figure 3a illustrates the average pitch class transition matrices (PCTM). The folk dataset is more restricted in the usage of certain pitches (i.e., D♯, F, G♯, B♭) and shows a comparably sparse matrix compared to jazz, where both pitches and pitch transitions tend to have more variety.

We can also observe that the folk music dataset shows a larger mean for features such as the note length histogram (NLH) and the note length transition matrix (NLTM). However, by illustrating the average NLTMs in Fig. 3b, we notice that the folk dataset again shows a sparse matrix as compared to the jazz dataset. This implies that the jazz dataset has a higher variety of note length transitions within a song while having a lower diversity of note length transitions across the dataset.

In data-driven approaches to music generation, the output of the generative system should directly relate to the characteristics of the training dataset. The presented absolute measures allow for a musically intuitive way of highlighting various dimensions of such characteristics. This can help with the critical step of designing a generalizable dataset, possibly from various sources, for training a generative system.
Fig. 3 Example of absolute measurement: (a) average pitch class transition matrix (PCTM) and (b) average note length transition matrix (NLTM) of the jazz and folk music datasets (see Sect. 4.1)
4.2 Experiment 2: System comparison

The second experiment compares MidiNet [60], a generative adversarial network (GAN) for symbolic-domain music generation, with the melody lookback recurrent neural network (Lookback RNN) of the Magenta project [58]. As discussed in the previous Sect. 3.4, the proposed objective evaluation can assist in studying different model structures and behaviors when the training datasets for both models are available. In some cases, however, the training datasets are inaccessible, as is the case for Magenta. Given this issue, we consider this scenario for the proposed method to compare the characteristics of different models. We again exploit the intra-set distances and the absolute measurement utilized in the previous experiment. Furthermore, we attempt to relate reported subjective evaluation results to the identified characteristics.

4.2.1 Input datasets

We implement and train the so-called MidiNet "Model 2" [60], below referred to as MidiNet 2, using 526 MIDI tabs of 8 bars each parsed from TheoryTab (https://www.hooktheory.com/theorytab). The MidiNet model and the publicly accessible pre-trained model of Magenta's Lookback RNN generate 100 samples each. Each sample contains a melody with 8 bars. The first bar is provided by the user while the remaining 7 bars are generated by the models.

4.2.2 Analysis and discussion

The results of Exp. 2 are shown in Table 3. It can be observed that the two model outputs are distinctly different in several dimensions such as pitch count, pitch interval, and pitch range; this is shown by the fact that the mean values of the inter-set distances are larger than the mean values of both intra-set distances. Furthermore, the absolute measurements NC and PR indicate that MidiNet 2 tends to use more notes and has a higher average pitch range than Magenta's Lookback RNN.

The fact that the outputs of these two systems have been used previously in a subjective study [60, Sect. 5] allows us to compare the subjective results with these objective results. The listening test resulted in a comparable rating for the questions How real and How pleasing the model outputs are; for the question How interesting, however, MidiNet acquired a slightly higher rating. This interestingness result might be related to the characteristics of higher pitch range, pitch count, and note count that we find in the absolute measures.

Magenta's RNN, on the other hand, shows a higher mean among the intra-set distances in these features; this somewhat contradicts the result of the subjective test. Therefore, we investigate this issue further by looking into the STD value, as a higher STD might hint at a lower reliability of the mean value. No clear conclusions can be drawn as the limited sample size in the listening test does not allow for more detailed analysis.

Finally, Fig. 4 showcases another visualization of data characteristics. The PDF of the intra-set distances among features (PCH, PCTM, NLH, and NLTM) is shown in a violin plot, an intuitive visualization of PDFs. The plot echoes the previous argument, where a significantly higher skewness indicates a less diversified intra-set behavior and a higher STD indicates a lower reliability of the similarity measure.
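For a visualization of the kind shown in Fig. 4, the intra-set distance distributions per feature can be drawn as violins, for example with matplotlib; the feature names and distance arrays below are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_intra_set_distances(distances_per_feature):
    """Violin plot of the intra-set distance distribution of each feature."""
    names = list(distances_per_feature)
    data = [np.asarray(distances_per_feature[n]) for n in names]

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.violinplot(data, showmeans=True)        # one violin per feature
    ax.set_xticks(range(1, len(names) + 1))
    ax.set_xticklabels(names)
    ax.set_ylabel("intra-set distance")
    plt.tight_layout()
    plt.show()

# Example call with placeholder data:
# plot_intra_set_distances({"PCH": np.random.rand(200), "PCTM": np.random.rand(200),
#                           "NLH": np.random.rand(200), "NLTM": np.random.rand(200)})
```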
4.3 Experiment 3: Performance evaluation

The final experiment demonstrates the use case of evaluating a generative system. We compare two parametrizations of MidiNet, "Model 1" and "Model 2" [60]. Both models have identical architecture and share the same training data. The difference between the models is that one model does not use feature matching regularizers (MidiNet 1) while the other model does (MidiNet 2).
Table 3 Results of Exp. 2 per feature: intra-set distances and absolute measurements for Magenta's Lookback RNN and for MidiNet 2, and inter-set distances between the two sets of generated samples (values are mean, STD; "–" indicates no value reported)

Feature   Magenta intra-set   Magenta absolute   MidiNet 2 intra-set   MidiNet 2 absolute   Inter-set
PC        2.897, 2.400        7.820, 2.647       2.214, 1.708          11.300, 1.967        4.097, 2.490
PC/bar    4.766, 1.594        –                  4.866, 1.324          –                    4.885, 1.446
NC        10.228, 9.534       27.310, 9.837      6.086, 4.596          30.740, 5.366        8.940, 7.576
NC/bar    6.870, 2.903        –                  7.511, 1.855          –                    7.359, 2.220
PCH       0.490, 0.156        –                  0.385, 0.127          –                    0.440, 0.142
PCH/bar   2.575, 0.371        –                  2.591, 0.283          –                    2.584, 0.326
PCTM      0.441, 0.099        –                  0.300, 0.049          –                    0.386, 0.079
PR        4.796, 3.975        12.650, 4.383      3.013, 2.631          19.600, 2.814        7.681, 4.052
PI        1.209, 1.274        2.940, 1.236       1.105, 0.812          5.559, 0.965         2.773, 1.275
IOI       0.257, 0.241        0.653, 0.248       0.108, 0.095          0.531, 0.101         0.205, 0.212
NLH       0.538, 0.223        –                  0.237, 0.085          –                    0.420, 0.180
NLTM      0.491, 0.187        –                  0.271, 0.059          –                    0.399, 0.152
Results of Exp. 3 per feature: intra-set distances of the training data, MidiNet 1, and MidiNet 2 (mean, STD), and the KLD and overlap area (OA) of each model's inter-set distance PDF with respect to the training data's intra-set distance PDF

Feature   Training intra-set   MidiNet 1 intra-set   MidiNet 1 KLD, OA   MidiNet 2 intra-set   MidiNet 2 KLD, OA
PC        2.527, 2.760         2.367, 1.805          1.185, 0.158        2.214, 1.708          0.130, 0.541
PC/bar    5.446, 2.130         4.551, 1.264          0.812, 0.042        4.866, 1.324          0.572, 0.909
NC        12.360, 9.954        5.085, 3.752          1.058, 0.081        6.086, 4.596          0.009, 0.693
NC/bar    7.804, 2.977         5.210, 1.520          0.442, 0.090        7.511, 1.855          0.181, 0.916
PCH       0.506, 0.169         0.301, 0.082          0.012, 0.563        0.385, 0.127          0.016, 0.814
PCH/bar   2.497, 0.414         1.546, 0.221          0.018, 0.319        2.591, 0.283          0.025, 0.899
PCTM      0.439, 0.107         0.263, 0.036          0.349, 0.277        0.300, 0.049          0.172, 0.483
PR        4.726, 3.935         1.803, 1.443          0.502, 0.164        3.013, 2.631          0.453, 0.400
PI        1.062, 1.508         0.958, 0.767          1.096, 0.198        1.105, 0.812          0.217, 0.443
IOI       0.377, 0.403         0.024, 0.018          0.141, 0.067        0.108, 0.095          0.089, 0.631
NLH       0.506, 0.194         0.174, 0.089          0.331, 0.187        0.237, 0.085          0.024, 0.507
NLTM      0.502, 0.165         0.208, 0.099          0.599, 0.201        0.271, 0.059          0.112, 0.455
5 Conclusion
dataset characteristics. This analysis allows the researcher to draw conclusions about the system's ability to model a certain musical feature of the training dataset, as well as to estimate the variability and the stability of different model designs.

We have released the evaluation framework as an open-source toolbox which implements the demonstrated evaluation and analysis methods along with visualization tools. Our future work includes extending the current toolbox with additional dimensions (e.g., dynamics) and expanding it toward polyphonic music. The toolbox is available in an online repository (https://github.com/RichardYang40148/mgeval).

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.
References

1. Agarwala N, Inoue Y, Sly A (2017) Music composition using recurrent neural networks. Stanford University, Technical Report in CS224
2. Ariza C (2009) The interrogator as critic: the Turing test and the evaluation of generative music systems. Comput Music J 33(2):48–70
3. Asmus EP (1999) Music assessment concepts: a discussion of assessment concepts and models for student assessment introduces this special focus issue. Music Educ J 86(2):19–24
4. Babbitt M (1960) Twelve-tone invariants as compositional determinants. Music Q 46(2):246–259
5. Balaban M, Ebcioğlu K, Laske O (eds) (1992) Understanding music with AI: perspectives on music cognition. MIT Press, Cambridge
6. Bech S, Zacharov N (2007) Perceptual audio evaluation—theory, method and application. Wiley, London
7. Boot P, Volk A, de Haas WB (2016) Evaluating the role of repeated patterns in folk song classification and compression. J New Music Res 45(3):223–238
8. Bretan M, Weinberg G, Heck L (2017) A unit selection methodology for music generation using deep neural networks. In: International conference on computational creativity (ICCC). Atlanta, Georgia, USA
9. Briot JP, Hadjeres G, Pachet F (2019) Deep learning techniques for music generation—a survey. Springer, London
10. Chordia P, Rae A (2007) Raag recognition using pitch-class and pitch-class dyad distributions. In: International society of music information retrieval (ISMIR), pp 431–436. Vienna, Austria
11. Chu H, Urtasun R, Fidler S (2016) Song from PI: a musically plausible network for pop music generation. In: International conference on learning representations (ICLR). San Juan, Puerto Rico
12. Chuan CH, Herremans D (2018) Modeling temporal tonal relations in polyphonic music through deep networks with a novel image-based representation. In: Association for the advancement of artificial intelligence (AAAI). New Orleans, Louisiana, USA
13. Colton S, Pease A, Ritchie G (2001) The effect of input knowledge on creativity. In: Technical reports of the Navy Center for Applied Research in Artificial Intelligence. Washington, DC, USA
14. Dong HW, Hsiao WY, Yang LC, Yang YH (2018) MuseGAN: multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In: Association for the advancement of artificial intelligence (AAAI). New Orleans, Louisiana, USA
15. Gatys LA, Ecker AS, Bethge M (2016) A neural algorithm of artistic style. In: The annual meeting of the vision sciences society. St. Pete Beach, Florida, USA
16. Geisser S (1993) Predictive inference, vol 55. CRC Press, Boca Raton
17. Geman D, Geman S, Hallonquist N, Younes L (2015) Visual Turing test for computer vision systems. Proc Natl Acad Sci 112(12):3618–3623
18. Gero JS, Kannengiesser U (2004) The situated function–behaviour–structure framework. Des Stud 25(4):373–391
19. Gurumurthy S, Sarvadevabhatla RK, Radhakrishnan VB (2017) DeLiGAN: generative adversarial networks for diverse and limited data. In: IEEE conference on computer vision and pattern recognition (CVPR). Honolulu, Hawaii, USA
20. Guyot WM (1978) Summative and formative evaluation. J Bus Educ 54(3):127–129. https://doi.org/10.1080/00219444.1978.10534702
21. Hadjeres G, Pachet F (2016) DeepBach: a steerable model for Bach chorales generation. In: International conference on machine learning (ICML). New York City, NY, USA
22. Hale CL, Green SK (2009) Six key principles for music assessment. Music Educ J 95(4):27–31. https://doi.org/10.1177/0027432109334772
23. Henrik Norbeck's ABC tunes. Last accessed Mar 2018. http://www.norbeck.nu/abc/
24. Huang CZA, Cooijmans T, Roberts A, Courville A, Eck D (2017) Counterpoint by convolution. In: International society of music information retrieval (ISMIR). Suzhou, China
25. Huang KC, Jung Q, Lu J (2017) Algorithmic music composition using recurrent neural networking. Stanford University, Technical Report in CS221
26. Huang X, Li Y, Poursaeed O, Hopcroft J, Belongie S (2016) Stacked generative adversarial networks. In: IEEE conference on computer vision and pattern recognition (CVPR). Las Vegas, Nevada, USA
27. Johnson DD (2017) Generating polyphonic music using tied parallel networks. In: International conference on evolutionary and biologically inspired music and art, pp 128–143. Amsterdam, The Netherlands
28. Jordanous A (2012) A standardised procedure for evaluating creative systems: computational creativity evaluation based on what it is to be creative. Cognit Comput 4(3):246–279
29. Karras T, Aila T, Laine S, Lehtinen J (2017) Progressive growing of GANs for improved quality, stability, and variation. In: International conference on learning representations (ICLR). Toulon, France
30. Krumhansl C, Toiviainen P et al (2000) Dynamics of tonality induction: a new method and a new model. In: International conference on music perception and cognition (ICMPC). Keele, UK
31. Lee K (2006) Automatic chord recognition from audio using enhanced pitch class profile. In: International computer music conference (ICMC). New Orleans, Louisiana, USA
32. Liang F, Gotham M, Johnson M, Shotton J (2017) Automatic stylistic composition of Bach chorales with deep LSTM. In: International society of music information retrieval (ISMIR). Suzhou, China
33. Likert R (1932) A technique for the measurement of attitudes. Arch Psychol 22(140):5–55
34. Marsden A (2013) Music, intelligence and artificiality. In: Readings in music and artificial intelligence, pp 25–38. Routledge
35. Meredith D (2016) Computational music analysis. Springer, Berlin
36. Meyer LB (2008) Emotion and meaning in music. University of Chicago Press, Chicago
37. Mogren O (2016) C-RNN-GAN: continuous recurrent neural networks with adversarial training. In: Advances in neural information processing systems, constructive machine learning workshop (NIPS CML). Barcelona, Spain
38. Moog RA (1986) MIDI: musical instrument digital interface. J Audio Eng Soc 34(5):394–404
39. Mroueh Y, Sercu T (2017) Fisher GAN. In: Advances in neural information processing systems (NIPS). Long Beach, CA, USA
40. O'Brien C, Lerch A (2015) Genre-specific key profiles. In: International computer music conference (ICMC). Denton, Texas, USA
41. Pati KA, Gururani S, Lerch A (2018) Assessment of student music performances using deep neural networks. Appl Sci 8(4):507. https://doi.org/10.3390/app8040507
42. Pearce M, Meredith D, Wiggins G (2002) Motivations and methodologies for automation of the compositional process. Music Sci 6(2):119–147
43. Pearce MT, Wiggins GA (2007) Evaluating cognitive models of musical composition. In: International joint workshop on computational creativity, pp 73–80. London, UK
44. Pease A, Colton S (2011) On impact and evaluation in computational creativity: a discussion of the Turing test and an alternative proposal. In: Proceedings of the AISB symposium on AI and philosophy, p 39. York, United Kingdom
45. Pease T, Mattingly R (2003) Jazz composition: theory and practice. Berklee Press, Boston
46. Ritchie G (2007) Some empirical criteria for attributing creativity to a computer program. Minds Mach 17(1):67–99
47. Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved techniques for training GANs. In: Advances in neural information processing systems (NIPS). Barcelona, Spain
48. Scott DW (2015) Multivariate density estimation: theory, practice, and visualization. Wiley, Hoboken
49. Shin A, Crestel L, Kato H, Saito K, Ohnishi K, Yamaguchi M, Nakawaki M, Ushiku Y, Harada T (2017) Melody generation for pop music via word representation of musical properties. arXiv preprint arXiv:1710.11549
50. Silverman BW (1986) Density estimation for statistics and data analysis, vol 26. CRC Press, Boca Raton
51. Simon I, Morris D, Basu S (2008) MySong: automatic accompaniment generation for vocal melodies. In: Proceedings of the SIGCHI conference on human factors in computing systems, pp 725–734. Florence, Italy
52. Sturm BL, Ben-Tal O (2017) Taking the models back to music practice: evaluating generative transcription models built using deep learning. J Creat Music Syst. https://doi.org/10.5920/JCMS.2017.09
53. Temperley D, Marvin EW (2008) Pitch-class distribution and the identification of key. Music Percept Interdiscip J 25(3):193–212
54. Theis L, van den Oord A, Bethge M (2016) A note on the evaluation of generative models. In: International conference on learning representations (ICLR). San Juan, Puerto Rico. arXiv:1511.01844
55. Turing AM (1950) Computing machinery and intelligence. Mind 59(236):433–460
56. Turlach BA et al (1993) Bandwidth selection in kernel density estimation: a review. Université catholique de Louvain, Louvain-la-Neuve
57. Verbeurgt K, Dinolfo M, Fayer M (2004) Extracting patterns in music for composition via Markov chains. In: International conference on industrial, engineering and other applications of applied intelligent systems, pp 1123–1132. Springer, Ottawa, ON, Canada
58. Waite E, Eck D, Roberts A, Abolafia D (2016) Project Magenta: generating long-term structure in songs and stories. https://magenta.tensorflow.org/blog/2016/07/15/lookback-rnn-attention-rnn/
59. Wu CW, Gururani S, Laguna C, Pati A, Vidwans A, Lerch A (2016) Towards the objective assessment of music performances. In: International conference on music perception and cognition (ICMPC). Hyderabad, AP, India
60. Yang LC, Chou SY, Yang YH (2017) MidiNet: a convolutional generative adversarial network for symbolic-domain music generation. In: International society of music information retrieval (ISMIR). Suzhou, China
61. Zbikowski LM (2002) Conceptualizing music: cognitive structure, theory, and analysis. Oxford University Press, Oxford
62. Zhang W, Wang J (2016) Design theory and methodology for enterprise systems. Enterp Inf Syst 10(3):245–248. https://doi.org/10.1080/17517575.2015.1080860
63. Zhang WJ, Yang G, Lin Y, Ji C, Gupta MM (2018) On definition of deep learning. In: World automation congress (WAC). Stevenson, Washington, USA
64. Zhou Z, Cai H, Rong S, Song Y, Ren K, Zhang W, Wang J, Yu Y (2018) Activation maximization generative adversarial nets. In: International conference on learning representations (ICLR). Vancouver, Canada