
2009:073 CIV

MASTER'S THESIS

An Audio-to-MIDI Application in Java

Gustaf Forsberg

Luleå University of Technology

MSc Programmes in Engineering


Computer Science and Engineering
Department of Computer Science and Electrical Engineering
Division of Information and Communication Technology

2009:073 CIV - ISSN: 1402-1617 - ISRN: LTU-EX--09/073--SE


Abstract

Audio and MIDI data are fundamentally different, yet intertwined in the world of
computer-based music composition and production. While a musical performance
may be represented in both forms, MIDI data can always be edited and modified
without compromising sound quality, and musical notation can be produced from it
rather straightforwardly. Thus, having a performance stored as MIDI data can
sometimes be preferable to having it stored as audio data. However, in the absence of
a MIDI-enabled instrument, the MIDI data would need to be generated from the
audio data, putting some rather severe restrictions on the possibilities.

This thesis presents the foundation of an audio-to-MIDI application developed in Java, following an introductory discussion on pitch detection, MIDI, and the general problem of audio-to-MIDI translation. The audio-to-MIDI performance of the application is generally good for music with fairly simple sounds, but more work is needed for it to properly handle the more complex sounds expected in the typical usage scenario.

Author’s notes

For almost as long as I can remember, music has been a central part of my life. I grew
up with the music of Johann Sebastian Bach, and although I did not realize it at the
time, its sublime beauty is often mirrored in the patterns and behavior of nature.
During the years I studied composition, I became increasingly aware of the
mathematics of music; during the years I have been studying computer science, I have
become increasingly aware of ‘the music of mathematics’.

The subject of this thesis arose from a wish to apply software engineering skills in a
musical context, and also – importantly – to learn something new. I had never
previously done any sound programming, which gave the practical aspect a certain
appeal. Since I have not specialized in signal analysis, I needed to read up quite a bit
on the theory as well. This proved to be a tremendously interesting experience, most
often leading to contemplation way beyond the purely mathematical details.

In closing, I would like to thank my supervisor, Dr. Kåre Synnes, for advice and
assistance throughout the work on the thesis.

Gustaf Forsberg
April 2009

Table of Contents

1 INTRODUCTION
1.1 Background..................................................................................................................1
1.2 Thesis overview...........................................................................................................1
1.2.1 Purpose .............................................................................................................2
1.2.2 Delimitations....................................................................................................2
1.2.3 General structure.............................................................................................2

2 TECHNICAL BACKGROUND
2.1 Pitch detection ............................................................................................................3
2.1.1 General issues...................................................................................................3
2.1.2 Time-domain methods ...................................................................................5
2.1.3 Frequency-domain methods ..........................................................................5
2.1.4 Some notes on the DFT and the FFT .........................................................8
2.2 MIDI...........................................................................................................................10
2.2.1 Messages .........................................................................................................10
2.2.2 Standard MIDI files and General MIDI....................................................11
2.3 Audio-to-MIDI .........................................................................................................12
2.3.1 General considerations .................................................................................12
2.3.2 Hardware solutions .......................................................................................14
2.3.3 Software solutions .........................................................................................14

3 DESIGN AND IMPLEMENTATION


3.1 Overview....................................................................................................................15
3.1.1 General design ...............................................................................................15
3.2 Graphical user interface...........................................................................................17
3.2.1 Design notes...................................................................................................18
3.2.2 Implementation notes...................................................................................18
3.3 MIDI functionality ...................................................................................................19
3.3.1 Design notes...................................................................................................19
3.3.2 Implementation notes...................................................................................20
3.4 Audio functionality...................................................................................................21
3.4.1 Design notes...................................................................................................22
3.4.2 Implementation notes...................................................................................22

3.5 Pitch detection functionality ...................................................................................24
3.5.1 Design notes...................................................................................................24
3.5.2 Implementation notes...................................................................................25
3.6 Audio-to-MIDI functionality..................................................................................26
3.6.1 Design notes...................................................................................................26
3.6.2 Implementation notes...................................................................................26

4 TESTS AND ANALYSIS


4.1 Test approach............................................................................................................27
4.2 Structured tests..........................................................................................................27
4.2.1 Monophony....................................................................................................28
4.2.2 Two-voice polyphony...................................................................................30
4.2.3 Three-voice polyphony.................................................................................32
4.3 Further tests...............................................................................................................34
4.3.1 Electric guitar: a jazz lick..............................................................................34
4.3.2 Acoustic guitar: a chord progression..........................................................36
4.4 Test result summary .................................................................................................37

5 DISCUSSION AND FUTURE WORK


5.1 Results.........................................................................................................................39
5.1.1 General application quality ..........................................................................39
5.1.2 Feature set.......................................................................................................39
5.1.3 Audio-to-MIDI functionality ......................................................................39
5.2 Feature improvements and additions ....................................................................40
5.2.1 General pitch detection improvements .....................................................40
5.2.2 Audio and MIDI editing ..............................................................................41
5.2.3 Audio and file formats..................................................................................41
5.2.4 A ‘project’ format ..........................................................................................41
5.2.5 Volume controls ............................................................................................42
5.2.6 Fast forward/rewind/pause.........................................................................42
5.2.7 GUI..................................................................................................................42
5.2.8 Instrument tuner............................................................................................42
5.3 Concluding remarks .................................................................................................43

APPENDICES
Appendix A: References .................................................................................................45
Appendix B: List of figures ............................................................................................47

1 Introduction

1.1 Background

In computer-based music creation, one is often working with two fundamentally different formats: audio and MIDI. While audio data represents the actual sound (i.e. the waveform), MIDI simply provides a protocol used to communicate performance-related information. Sound production is left to a MIDI instrument, which may be either hardware or software.

In some respects, when working with a piece of music, MIDI has a number of advantages over working directly with audio data. Tasks such as adjusting tempo or phrasing, tweaking velocities, or removing unwanted notes are trivial when working with MIDI, whereas in the audio case re-recording would likely be preferred.
Furthermore, due to the nature of MIDI, the step to musical notation is fairly short; in
some cases the conversion is a one-step affair, although some manual editing is usually
required to produce a good-looking score. As a concluding example, it could also be
mentioned that MIDI provides a very space-efficient way of storing a performance.

There are, then, several situations where MIDI data may be preferable to audio data.
Generally, this does not present much of a problem to a keyboard player – most
keyboards today have MIDI functionality, and indeed, MIDI was designed with
keyboard instruments in mind. However, the situation is quite different if an
instrument lacks MIDI functionality, or if only an audio recording of a performance is
available. In such cases, it would be practical to be able to translate audio data into
MIDI data.

1.2 Thesis overview

Audio-to-MIDI translation is the main subject of this project, both from a theoretical
and from a practical perspective. An outline of the thesis is presented below, in terms
of purpose, delimitations, and information organization.


1.2.1 Purpose

The goal of this thesis is the design and implementation of a general-purpose audio-
to-MIDI application. The application is not intended to let users make MIDI files
from their CD or mp3 collection; rather, it should be thought of as a musician’s tool,
to be used for example as a quick means to transcription of improvisations.

The application aims to provide a ‘working environment’ and not just limit itself to
pure audio-to-MIDI functionality. Hence, it will support audio and MIDI file handling
and playback, audio recording, and other related features. The application should be
quick and easy to use, so a clear and intuitive GUI is desired.

1.2.2 Delimitations

Although signal analysis is central in the subject of audio-to-MIDI translation, the area
of the thesis project is in fact software engineering. The main implication of this is
that the formal focus of the thesis lies on the application, rather than on the often
intricate mathematical details. Nevertheless, a significant amount of time had to be
dedicated to theory studies since the author did not have any previous experience of
signal analysis.

Regarding pitch detection and audio-to-MIDI, even seemingly simple musical passages can present significant difficulties and may require quite elaborate solutions.
However, since pitch detection and audio-to-MIDI are only part of what the
application does, the degree of sophistication of such functionality had to be balanced
with respect to the other desired features. Thus, work had to be delimited by
excluding certain features from the application and limiting functionality of others; a
more detailed discussion on conceived features and functionality is presented in
chapter 5. The application should currently be considered a platform or prototype
which will be further refined, since the envisioned final version lies well beyond the
scope of this thesis.

1.2.3 General structure

After this brief introductory chapter, we will turn our attention to the prerequisites of
an audio-to-MIDI application, discussing topics such as pitch detection and MIDI.
Following that, chapters three and four concern themselves with the design and
implementation of the application, along with a series of tests to determine its general
performance. The thesis concludes with a discussion on both the current state of the
application and future work.
2 Technical background

2.1 Pitch detection

The problem of algorithmically identifying which notes are sounding at a given moment can range from fairly trivial to hard, or even impossible. A gentle melody played by a solo violin, for example, need not be particularly difficult. A furious
violin solo accompanied by an equally furious orchestra, on the other hand, would be
quite another matter. While pitch detection may be considered solved in the
monophonic case, non-monophonic methods are still an interesting research area,
perhaps particularly so in conjunction with timbre identification and separation. This,
however, is far beyond the scope (and purpose) of this thesis; here, we restrict our
concerns to identifying one or more sounding notes without attempting to identify the
instruments.

2.1.1 General issues

As we recall, the simplest pitched sound is the sine tone, with its pitch being
determined solely by the frequency of the single sinusoid. When dealing with sine
tones, it is trivial to determine even several simultaneous pitches, since each peak in
the frequency spectrum corresponds to a separate note.

Typically, however, a pitched sound will have several periodic components (referred
to as partials), differing in frequency, amplitude, and phase. In the typical pitched
instrument, the frequencies of the partials align in a harmonic series. This means that
the frequencies are whole-number multiples of some common fundamental frequency,
and partials with this property are called harmonics. The term overtone is often used to
refer to any partial – harmonic or inharmonic – other than the fundamental. To
varying degrees, the presence of overtones makes pitch detection more complicated.

We may assume the pitch of a note to be determined by its fundamental frequency, although it is important to point out that pitch really is a psychoacoustic concept – it
is something we perceive. There are several interesting examples of this; the Shepard
scale, for instance, seems to remain within a fixed pitch interval (e.g. an octave) no
matter how far we continue to ascend or descend in pitch. It is an auditory illusion,
created by means of Shepard tones (basically tones which are constructed from sine tones with differing amplitudes in octaves) [1]. Another example of the psychological
(and neurological) aspect of pitch is that in the case of harmonic partials, we tend to
‘hear’ the fundamental even if it is not present; this is known as periodicity pitch, missing
fundamental, or subjective fundamental. It may seem like a somewhat artificial example, but
the effect is used in practice, for instance in the production of deep organ tones [2].

Musically, the second and fourth harmonics lie one and two octaves above the
fundamental, respectively, and the third harmonic lies a fifth above the second
harmonic. Together, these intervals produce a very clean sound; indeed, much of the
‘color’ of the sound lies in the configuration of the higher harmonics. Overtones are
typically not perceived as separate notes, but in some sounds they are. Even so, we
hardly think of them as notes being played, but rather consider them components of
the sound. In other words, they do not necessarily alter the perceived pitch.

Figure 1. Magnitude spectra of the note g with fundamental frequency approximately 196 Hz, played on an electric guitar with a clean tone (a) and a tenor crumhorn (b). Compared to the guitar, the crumhorn is notably rich in overtones; the frequency range of the plots has been limited for readability, but partials of the crumhorn continue way up to about 15 kHz.

Unless we have a case with a missing fundamental, we might be able to do monophonic pitch detection by finding the lowest frequency present, although this
requires a clean signal. For non-monophonic pitch detection, we need to single out
the fundamentals from the overtones. In Figure 1 (a) above, the magnitude at the
fundamental frequency is the largest, and indeed, sometimes it is possible to perform
pitch detection by simply finding peaks above a certain threshold in the magnitude
spectrum. However, when several notes are sounding simultaneously, some partials
may have common frequencies, and the results of wave superposition could easily
thwart our attempts to identify fundamentals by magnitude. Also, as we can see in
Figure 1 (b), it is by no means a given that the magnitude of the fundamental is the
largest. In fact, not even the guitar tone of Figure 1 (a) can be assumed to always have
largest magnitude at the fundamental frequency; as the string vibrates, the relative
amplitudes of the partials vary.

Apart from issues that arise from a musical context, such as tone complexity and
polyphony, there are several other factors which can complicate proper pitch
detection. The noise level of the signal is one such factor. Some pitch detection
methods are more sensitive to noise than others, and often a compromise must be
reached between noise sensitivity, accuracy, and computational cost. Naturally, if there
is a real-time requirement, keeping the computational cost down becomes more
important. Sounds or sound phenomena originating from the recording environment
(such as echoes or reverb) also complicate analysis.

We may distinguish two basic approaches to the pitch detection problem: the time-domain approach and the frequency-domain approach. In the following sections, a
few examples of each approach are discussed.

2.1.2 Time-domain methods

A straightforward approach to the pitch detection problem is the zero-crossing method: if we have for example a sine tone, we can obtain its frequency by simply determining
the zero-crossing rate (or peak rate) of the signal. This method is computationally
inexpensive, but generally sensitive to noise and not very well suited for more
complex signals. It may however be somewhat improved by means of adaptive
filtering [3].
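As an illustration of the idea (a minimal sketch, not code from the thesis; the method and parameter names are made up), the zero-crossing rate of a mono signal normalized to [-1.0, 1.0] can be converted to a frequency estimate as follows:

    // Sketch: estimate the frequency of a (near-)sinusoidal signal from its zero-crossing rate.
    public static double zeroCrossingFrequency(double[] samples, double sampleRate) {
        int crossings = 0;
        for (int i = 1; i < samples.length; i++) {
            // Count sign changes between consecutive samples.
            if ((samples[i - 1] < 0) != (samples[i] < 0)) {
                crossings++;
            }
        }
        // One period of a sinusoid contains two zero crossings.
        double seconds = samples.length / sampleRate;
        return crossings / (2.0 * seconds);
    }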

Auto-correlation is another, quite popular, way to tackle the pitch detection problem in
the time domain. The main idea is to compare a segment of the signal with a shifted
version of itself; the correlation should be greatest when the shift corresponds to the
fundamental period of the signal. A problem with this approach is that the accuracy
tends to decrease at higher frequencies, due to periods becoming shorter and
approximation errors becoming greater. This method also suffers somewhat from
false detections – typically it has problems with periodic signals where the period is
that of a missing fundamental [4] – and it may not, in its basic form, be well suited for
polyphonic music [5].
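To make the procedure concrete, here is a minimal sketch (not the thesis implementation; the names and the search range are illustrative) that picks the lag with the strongest correlation within a plausible pitch range:

    // Sketch: basic autocorrelation pitch estimation on a mono signal segment.
    public static double autocorrelationPitch(double[] x, double sampleRate,
                                              double minFreq, double maxFreq) {
        int minLag = (int) (sampleRate / maxFreq);   // shortest period considered
        int maxLag = (int) (sampleRate / minFreq);   // longest period considered
        int bestLag = minLag;
        double bestCorr = Double.NEGATIVE_INFINITY;
        for (int lag = minLag; lag <= maxLag && lag < x.length; lag++) {
            double corr = 0.0;
            for (int i = 0; i + lag < x.length; i++) {
                corr += x[i] * x[i + lag];           // compare the segment with a shifted copy of itself
            }
            if (corr > bestCorr) {
                bestCorr = corr;
                bestLag = lag;
            }
        }
        return sampleRate / bestLag;                 // period in samples -> frequency in Hz
    }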

Related to auto-correlation is the average magnitude difference function (AMDF). While auto-correlation computes the product of the original signal segment and the shifted
version, AMDF computes the difference. It is possible to combine the auto-
correlation function and AMDF to form the weighted auto-correlation function, which is
better at handling noisy signals [6].

2.1.3 Frequency-domain methods

There are several methods available to transform from the time domain to the
frequency domain. The best known and most frequently used method is probably the Fourier transform, allowing for decomposition of a function into oscillatory functions. In particular, the Fourier series lets us express any periodic function as a sum of
sinusoids, with varying amplitudes and frequencies. These frequencies are related in
that they are integer multiples of the fundamental frequency of the periodic function;
in other words, they are harmonics. In the digital realm, we must work with discrete
transforms. The discrete Fourier transform (DFT) is, unsurprisingly, the discrete analogue
of the continuous Fourier transform. The DFT is practically always implemented as a
fast Fourier transform (FFT), which produces the same result but in significantly less
time.

Popularity aside, there are some issues with the Fourier approach. For example, better
localization in frequency means worse localization in time, and vice versa. Moreover,
since frequencies of musical notes are distributed logarithmically and the frequency
bins of the FFT are distributed linearly, resolutions at low frequencies tend to be too
low, and resolutions at high frequencies tend to be unnecessarily high.

Many of these issues are absent in wavelet transforms. Simply put, the wavelet transform divides the initial function into different frequency components and allows each component to be examined at an appropriate scale; this remedies the resolution
issues of the FFT. Also, contrary to the FFT, wavelets are localized in both time and
frequency, and generally do not have a problem handling discontinuities. While not
(yet) widely adopted in audio processing, wavelets are often used in image processing.

There are several other transforms which may be used in audio processing, for
example the constant Q transform and the discrete Hartley transform. Nevertheless,
the FFT is ubiquitous, and hereafter we take the word ‘transform’ to imply the FFT.

After transformation into the frequency domain, there are several ways to estimate
pitch. In very simple cases, for example monophonic music with non-complex
sounds, frequency peak detection directly following the transform might be sufficient, as mentioned in section 2.1.1. Most often, however, a more sophisticated
method is called for.

To obtain the harmonic product spectrum (HPS) [7], we begin by downsampling the
spectrum a number of times, each time producing a more ‘compressed’ version.
Specifically, the nth downsampled spectrum is 1/(n + 1) of the size of the original spectrum. The point is to exploit the fact that the partials belong to a harmonic series, so that
the first harmonic (i.e. the fundamental) in the original spectrum aligns with the
second harmonic in the first downsampled spectrum, which in turn aligns with the
third harmonic in the second downsampled spectrum, and so on. Thus, the number of
spectra considered equals the number of harmonics considered, and the HPS is finally
produced by multiplying the spectra together, with the idea of amplifying the
fundamental frequencies.
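As a rough illustration of the procedure (a sketch, not the thesis code; it assumes magnitudes[] already holds the magnitude spectrum of a frame):

    // Sketch: harmonic product spectrum over a precomputed magnitude spectrum.
    // 'harmonics' is the number of spectra (i.e. harmonics) taken into account.
    public static int hpsPeakBin(double[] magnitudes, int harmonics) {
        int length = magnitudes.length / harmonics;   // bins present in every downsampled spectrum
        double[] hps = new double[length];
        for (int bin = 0; bin < length; bin++) {
            double product = 1.0;
            for (int h = 1; h <= harmonics; h++) {
                product *= magnitudes[bin * h];       // downsampling by h aligns the h-th harmonic with 'bin'
            }
            hps[bin] = product;
        }
        int best = 1;                                 // skip the DC bin
        for (int bin = 2; bin < length; bin++) {
            if (hps[bin] > hps[best]) best = bin;
        }
        return best;                                  // bin index of the estimated fundamental
    }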
Figure 2. Downsampling to make harmonics align. The first harmonic in the original spectrum (top) coincides with the second, third, and fourth harmonics of the downsampled spectra, respectively.

While the HPS method is quite insensitive to noise and generally works well, there
may be problems with notes being detected in the wrong octave (usually one octave
too high). Some extra peak analysis may help [6], but this may be difficult in
polyphonic cases.

Another method of pitch-tracking is cepstral analysis. A cepstrum, first described in a 1963 paper [8], is basically the spectrum of a spectrum, obtained by taking the
transform of the logarithm of the magnitude spectrum. Hence, cepstral analysis is not
really carried out in the frequency domain, but actually in the quefrency domain
(although the method is still generally considered to belong to the frequency-domain
approach). Quefrency can be said to be a measure of time in a different sense, and
peaks (rahmonics) at certain quefrencies occur due to periodicity of partials. Cepstral
analysis is quite popular in speech analysis since the logarithm operation increases
robustness for formants (acoustic resonances, for example of the human vocal tract),
but it also leads to a raised noise level [9].

As a last example of frequency-domain methods, a combination of the HPS and cepstral methods may be a promising alternative [10]. In the aptly named cepstrum-
biased harmonic product spectrum (CBHPS), we see both the noise robustness of the HPS
and the robustness to pitch errors of the cepstrum [6]. Since the HPS exists in the
frequency domain and the cepstrum in the quefrency domain, combining them first
requires the cepstrum to be converted to frequency-domain indexing. Multiplying
together the HPS and the frequency-indexed cepstrum produces the CBHPS.

2.1.4 Some notes on the DFT and the FFT

As we shall see in chapter 3, the implementation relies on the FFT for transformation
into the frequency domain. While the DFT and the FFT are standard textbook
material and thus shall not be covered in depth here, a brief review of some relevant
aspects may be appropriate.

For a discrete sequence x of length n, the DFT is defined by

$$X_k = \sum_{m=0}^{n-1} x_m \,\omega_n^{km}, \qquad k = 0, \ldots, n-1,$$

where

$$\omega_n = e^{2\pi i / n}$$

is a primitive nth root of unity. The DFT is, by definition, a complex transform; it takes complex-valued input and produces complex-valued output. For a real-valued input, the second half of the transform will be a complex conjugate mirror of the first; that is,

$$X_k = X_{n-k}^{*}.$$

Hence, in the case of real-valued input, we need only consider the first half of the transform.

Each $X_k$ corresponds to a particular frequency (or 'frequency bin'). The distance between two frequency bins (i.e. the spectral resolution) is obtained by dividing the sample rate by the input (window) size. If the input signal contains frequencies
which are not integer multiples of the spectral resolution, spectral leakage of some
degree will occur, as seen in Figure 3 (a). This is basically a side effect of the
discontinuities that may arise when considering a finite-length segment of (what is
assumed by the transform to be) an infinite signal. Spectral leakage from one sinusoid
may very well obscure another sinusoid in the signal; in particular, several notes close
to each other could produce an almost unusable spectrum from a pitch detection
perspective.
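As a concrete example (with illustrative numbers, not figures from the thesis): at a 44.1 kHz sample rate and a 4096-sample window, the bin spacing is 44100 / 4096 ≈ 10.8 Hz. In the low register this is coarse – the semitone step from A1 (55.0 Hz) to B♭1 (≈58.3 Hz) is only about 3.3 Hz – whereas around 4 kHz a semitone spans more than 200 Hz, which also illustrates the linear-versus-logarithmic resolution mismatch mentioned in section 2.1.3.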
Figure 3. A signal with a frequency that is not an integer multiple of the spectral resolution of the transform will produce spectral leakage as seen in (a); energy has 'spilled over' into the other frequency bins. In (b), a Hamming window was applied before transforming, with a noticeable reduction in spectral leakage effects as a result.

With no pre-processing, the segment is a view of the signal through a rectangular window. To counter spectral leakage, we may use a non-rectangular window in order to make the segment begin and end less abruptly. In Figure 3 (b), we see the results of applying the Hamming window function

$$w(m) = 0.54 - 0.46 \cos\!\left(\frac{2\pi m}{n-1}\right)$$

to the signal segment before taking the transform. Like many window functions, the
Hamming window is bell-shaped, but other shapes (such as triangular windows) are
occasionally used.
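For illustration, applying such a window in Java amounts to an element-wise multiplication of the signal segment by the window values (a sketch, not the thesis code; the application described in chapter 3 lets the user choose among several window functions):

    // Sketch: apply a Hamming window to a signal segment in place.
    public static void applyHammingWindow(double[] frame) {
        int n = frame.length;
        for (int m = 0; m < n; m++) {
            double w = 0.54 - 0.46 * Math.cos(2.0 * Math.PI * m / (n - 1));
            frame[m] *= w;   // taper the segment so it begins and ends less abruptly
        }
    }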

From the definition of the DFT, we can see that it has an asymptotic complexity of O(n²). The FFT is a divide-and-conquer method which, by utilizing properties of the complex roots of unity, improves the complexity to O(n lg n). While the divide-and-conquer approach may hint at recursion and dependence upon the factorization of n,
there are several FFT algorithms of varying kinds. Most commonly seen are iterative
implementations of the Cooley-Tukey radix-2 algorithm, which requires n to be a
power of two.

Although all FFT algorithms run in O(n lg n) time, naturally we wish to minimize the
time taken in practice. Our audio signal will be strictly real, and if we have n real
samples it might seem like we have to ‘add’ an imaginary signal consisting solely of
zeros before transforming, since it is a complex transform. This is fortunately not the
case. By taking every even-indexed sample to be real-valued and every odd-indexed
sample to be imaginary-valued, we produce a complex input of size n/2, on which the
transform is applied. From the result, the transform of the initial real sequence is
obtained through a final unwrapping step (which, like the construction of the complex
input from the real input, runs in linear time). Hence, a real signal of length n requires only an n/2-size transform, which is a significant improvement. For the mathematical details, see for example [11].

2.2 MIDI

MIDI (Musical Instrument Digital Interface) was created in the early 1980s in an
effort to standardize the way digital musical instruments communicated. Previously,
various manufacturers had developed their own digital interfaces, and some were
beginning to worry that the use (and hence sales) of synthesizers would be inhibited
by the lack of compatibility. The first MIDI instrument, the Sequential Circuits
Prophet-600, appeared in January 1983, soon to be followed by Roland’s JX-3P. At
this time, the MIDI specification was very simple, defining only the most basic
instructions. Since then, however, it has grown significantly.

The MIDI specification can be said to consist of three main parts: the message
specification, the transport specification, and the file specification. Of these three,
probably the most important part, and the part of our primary concern, is the message
specification, or protocol.

2.2.1 Messages

A MIDI message consists of a status (or command) byte, followed by a number of data bytes. Status bytes are identified by the MSB being set. For commands in the
0x80-0xEF range (also known as channel commands), the three bits following the MSB
specify the command in question and the remaining four bits specify which MIDI
channel the command affects. Thus, there are 7 channel commands and 16 channels.
Among the channel commands, we find instructions for playing and manipulating
notes and similar. The commands in the 0xF0-0xFF range are system commands, which
are not aimed at a particular channel; rather, they are concerned with for example
starting or stopping playback.

Naturally, the content of any data bytes present depends on the command with which
they are associated. A program change command, for instance, is followed by one data
byte, containing the number of the instrument sound (or patch) to be used. When two
data bytes are used, they usually contain one separate piece of information each; for
example, a note on command uses two data bytes, specifying note number and velocity,
respectively. Since the MSB is used to signify whether it is a status byte or a data byte,
this gives 128 possible note numbers (in comparison, a standard piano has 88 keys)
and 128 different velocities (including the zero velocity). Here, 128 different values are
quite sufficient, but in some cases a greater range is desired. An example is the pitch
bend command, where one data byte holds the least significant bits and the other the most significant; the 2¹⁴ (= 16,384) different values allow for very smooth pitch transitions. Channel messages always have one or two data bytes, while system messages may
have zero data bytes. Thus, a MIDI message is at most three bytes in size.

The small message size is important for the timing. The MIDI protocol is a serial
communications protocol, with a specified bandwidth of a mere 31.25 kBaud
(approximately 3.8 kByte/s). There is no true simultaneity; a chord, for example, is in
practice a really fast arpeggio. With a maximum message size of three bytes, well over
a thousand messages can be sent per second even in the worst case, disregarding
practical limitations.
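For illustration, a note-on message can be constructed and sent with the standard javax.sound.midi classes roughly as follows (a sketch, not code from the thesis; the note number and velocity are arbitrary example values):

    import javax.sound.midi.MidiSystem;
    import javax.sound.midi.Receiver;
    import javax.sound.midi.ShortMessage;

    public class NoteOnExample {
        public static void main(String[] args) throws Exception {
            Receiver receiver = MidiSystem.getReceiver();   // default receiver, typically the default synthesizer
            ShortMessage noteOn = new ShortMessage();
            // Command 0x90 (note on) on channel 0, note number 60 (middle C), velocity 96.
            noteOn.setMessage(ShortMessage.NOTE_ON, 0, 60, 96);
            receiver.send(noteOn, -1);                      // timestamp -1 means 'send immediately'
            Thread.sleep(1000);
            ShortMessage noteOff = new ShortMessage();
            noteOff.setMessage(ShortMessage.NOTE_OFF, 0, 60, 0);
            receiver.send(noteOff, -1);
            receiver.close();
        }
    }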

2.2.2 Standard MIDI files and General MIDI

A standard MIDI file (SMF) is little more than a list of performance-related information. There are two frequently used types: type 0, which uses a single track, and type 1, where individual parts have individual tracks. There is also a type 2, which can contain multiple songs, but this is not commonly used.

Storing a performance in a file necessitates an extra piece of information: timestamps. The 'classic' method is tempo-based timestamping, where the timing resolution is given in pulses per quarter note (PPQ). Timestamping may also be time-based, according to SMPTE (Society of Motion Picture and Television Engineers) specifications. Here, there are a certain number of frames per second (ranging from 24 to 30), and a certain number of ticks per frame. Obviously, in order for the translation to stay true to the original performance, the resolution needs to be high enough to avoid inconsistencies.
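To illustrate tempo-based timestamping with the standard Java classes (a sketch, not the thesis code; the resolution and note values are examples), a quarter note at a resolution of 480 PPQ spans 480 ticks:

    import java.io.File;
    import javax.sound.midi.MidiEvent;
    import javax.sound.midi.MidiSystem;
    import javax.sound.midi.Sequence;
    import javax.sound.midi.ShortMessage;
    import javax.sound.midi.Track;

    public class SmfExample {
        public static void main(String[] args) throws Exception {
            Sequence sequence = new Sequence(Sequence.PPQ, 480);     // 480 pulses per quarter note
            Track track = sequence.createTrack();

            ShortMessage on = new ShortMessage();
            on.setMessage(ShortMessage.NOTE_ON, 0, 60, 96);          // middle C
            track.add(new MidiEvent(on, 0));                         // note on at tick 0

            ShortMessage off = new ShortMessage();
            off.setMessage(ShortMessage.NOTE_OFF, 0, 60, 0);
            track.add(new MidiEvent(off, 480));                      // note off one quarter note later

            MidiSystem.write(sequence, 0, new File("example.mid"));  // write as a type 0 SMF
        }
    }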

Since no actual sound data is stored, the space required is very small, especially
compared to an audio file. By accessing the file, any device or application capable of
MIDI playback can replay the performance. How it actually sounds depends on the
MIDI instrument utilized for playback, and this brings up the issue of uniform
playback. While some MIDI instruments have very high-quality sounds, and while the
program change command lets us tell the MIDI instrument to use a certain patch, it is
not specified what sound we will actually get – it may differ from instrument to
instrument. This means, for example, that a piece which plays back with an organ
sound on one MIDI instrument might play back with drum sounds on another.

To counter this problem, General MIDI (GM) was created. While not a part of MIDI
per se, GM defines specific features for MIDI instruments. For instance, with a GM
instrument, we know that MIDI channel 10 is reserved for percussion sounds, and we
also know that a particular note number played on this channel will always produce a
particular percussion instrument sound. For other channels, we know which program
number corresponds to which instrument (for example, the acoustic grand piano is always found at program number 1, the violin is always number 41, and so on). In
addition to organizing instrument layout, GM also makes specifications regarding
polyphony, velocity, and multitimbrality. Thus, adhering to the GM standard increases
the chances of correct playback on foreign systems.

First published in 1991, GM has since been superseded by GM2 (in 1999). GM2 is
fully compatible with the original GM, while considerably extended. There also exists
a slimmed-down version (General MIDI ‘Lite’, GML) aimed at mobile applications,
and some instrument manufacturers have introduced their own extensions and
variants, for example the XG standard from Yamaha.

2.3 Audio-to-MIDI

Now, having acquainted ourselves somewhat with both pitch detection and MIDI, we
can reflect a bit upon the requirements and possibilities of audio-to-MIDI
functionality. As we shall see, we encounter limitations of both pitch detection and
MIDI.

2.3.1 General considerations

Perhaps the most obvious issue is that it is not enough to have a well-functioning
pitch detection algorithm; in order to produce a correct translation, we must also be
able to tell when a note was played, and when it was released. There are a number of
ways to tackle this problem. For example, we may consider spectral peaks to indicate
note onsets whenever a certain threshold is exceeded. Determining that threshold may
be a bit tricky in practice; for instance, since the dynamics of an instrument may vary
so that notes in a certain register are naturally louder than notes in another register,
the threshold value may need to vary throughout the frequency range. It may also be
desired to consider changes in spectral magnitude (i.e. spectral flux) in addition to the
values themselves. Other approaches to note onset detection may work with the
amplitude of the signal, or changes in phase [12].

Of course, some cases are easier to handle than others. Sounds such as sine tones may
be easy to deal with from a pitch detection perspective, but the unchanging nature of
the notes can make proper detection of repeated notes difficult. With other sounds, it
may be very easy to determine when a note begins or ends, but there may be
inharmonic transients that complicate pitch detection.

Erratic note detections can arise from very slight fluctuations in frequency or
amplitude. These fluctuations may be temporary and very short, and in such cases the
resulting notes have very short durations (i.e. the note onset is almost immediately followed by the note offset). Thus, it may at times be possible to use note duration as
a ‘mistake criterion’ in order to clean up the audio-to-MIDI translation.

Apart from note onset and offset detection, obviously we must also handle the pitch
information itself. In the 12-tone equal temperament, the fundamental frequency $f_k$ of the kth semitone above a note with fundamental frequency $f_0$ is given by

$$f_k = f_0 \cdot 2^{k/12}, \qquad k = 1, 2, \ldots$$

This makes each note have a fundamental frequency which is approximately 5.9%
higher than that of the preceding semitone. The most basic approach for audio-to-
MIDI translation is probably to disregard devices such as vibrato and glissando and
simply match frequencies to their closest note in the equal temperament. This is of
course ideal for music that is itself limited in that respect (such as piano music), but
less well suited if we wish the translation to mimic the original performance in detail.
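Under this basic approach, mapping a detected fundamental frequency to the nearest equal-tempered note is a matter of rounding on a logarithmic scale. A minimal sketch (not the thesis code), assuming the conventional reference A4 = 440 Hz = MIDI note 69:

    // Sketch: map a frequency to the nearest MIDI note number (A4 = 440 Hz = note 69 assumed).
    public static int frequencyToMidiNote(double frequencyHz) {
        double semitonesFromA4 = 12.0 * (Math.log(frequencyHz / 440.0) / Math.log(2.0));
        return (int) Math.round(69 + semitonesFromA4);
    }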

To handle for example glissando correctly, the audio-to-MIDI method must ‘know’
when a new note should be played and when to simply apply the expression to an
already sounding note. This implies a certain degree of sophistication in note onset
and note offset detection. We must also keep in mind that the relevant MIDI
commands are channel commands; for example, pitch bend will affect all currently
sounding notes on the specified channel. Thus, if we want to be able to handle
polyphonic cases such as when one voice is doing a glissando while another is not,
each voice needs a separate channel. Obviously, this requires that the number of
voices does not exceed the number of channels.

The quick and dirty way to assign different voices to different channels would be to
simply correlate channel number with pitch order. For example, we might always let
the highest note be handled by channel 1, the next highest by channel 2, and so on.
There are, however, numerous problems with this approach. For one thing, if two
voices cross each other, their channel numbers will no longer correspond to their
pitch order. This is likely to necessitate manual corrections if for instance musical
notation is to be produced. Moreover, if the crossing of voices is the result of a
glissando, this approach will simply not work. Ideally, we would like to be able to
track each voice and thus make sure that each note gets assigned to the correct
channel, but this would most often require elaborate pattern matching and timbre
identification.

As a final example of audio-to-MIDI considerations, we may take the dynamics of music; in a way, this is related to note onset detection, discussed earlier. In the
simplest case, there are no dynamics to speak of; all notes are equal in volume. On the
other hand, if some notes are soft and others loud, we must make sure that soft notes
are not disregarded. If we, in addition, wish the dynamics to be reflected in the MIDI
translation, each note must be assigned an appropriate MIDI velocity.

Several products aiming to bridge the gap between audio and MIDI exist, of varying
character. For example, they may be aimed at hobbyists or professionals, they may be
general-purpose or specialized for a particular instrument, and so on. We conclude
this chapter with some general remarks about hardware- and software-based solutions.

2.3.2 Hardware solutions

There are two main approaches to hardware solutions to the audio-to-MIDI problem: integrated and non-integrated. An example of the former is the MIDI guitar with an on-board DSP, allowing the MIDI cable to be connected directly to the
instrument. However, musicians tend to be very picky about their instruments, and
most would be unhappy having to use an instrument they did not like in order to have
MIDI functionality. In such cases, the non-integrated approach may be more
appealing; since the processing is done externally, the instrument generally needs no
modification apart from possibly mounting a special pickup. For example, stringed
instruments may be fitted with a pickup that sends a separate signal for each string,
greatly simplifying multi-pitch detection. As a note, hybrid solutions exist as well,
where the pickup is integrated but the DSP is not.

Hardware-based solutions are particularly appealing in cases like that of stringed instruments, where the ability to process the signal from each string separately could
strongly affect the quality of the result. Also, hardware solutions may be preferred if
platform independence is desired, or in live settings.

2.3.3 Software solutions

Most audio-to-MIDI software comes as stand-alone applications, but there are also plug-ins,
intended for use within a host application. Plug-ins often have direct hardware
counterparts, and are typically fairly light-weight and dedicated to a particular real-time
task.

In software solutions, the GUI possibilities pave the way for numerous additional
features, such as extensive editing functionality and production of musical notation.
Unless audio-to-MIDI needs to be performed in a real-time situation – for example
having audio triggering MIDI events during a live performance – the direct result of
the translation is often an intermediary step requiring editing. Of course, the editing
itself does not really depend on whether the translation was performed by hardware or
software, but it can be convenient to be able to perform all tasks using a single tool or
platform.
3 Design and implementation

3.1 Overview

The application was developed in JDK6 on Windows XP, using an Intel quad-core
machine with 2 GB of RAM. No reasonably modern system should have any
problems running it; the only ‘real’ requirements are enough RAM to hold the audio
data (a non-issue these days) and a decent sound card.

In general terms, the main features of the application are the following:

• Opening/saving audio and MIDI files
• Playback of audio/MIDI files
• Audio recording with or without metronome
• Configurable audio-to-MIDI translation

There are various configuration options available, for example allowing the user to
control the sample rate and bit depth used during recording, and which window
function to use for pre-processing during pitch detection. Also, the user has a number
of options for controlling audio-to-MIDI behavior (such as MIDI resolution and
pitch detection thresholds).

In discussion, it is sometimes difficult to clearly separate design from implementation. In this chapter, the design discussions are typically concerned with overall structure
and component interaction, while discussions on implementation may comment on
for example language specifics. The design sections will contain UML diagrams to give
an overview of either a component as a whole, or the specific capabilities of the
component.

3.1.1 General design

The application design is based on the Model-View-Controller (MVC) pattern, where the
controller is notified of relevant user interaction with the view and reacts accordingly
through direct access to both the view and the model. While there are several variants
of this pattern, the main point is to separate user interface from business logic.


Figure 4. MVC as employed in the application. Direct and indirect access is indicated by solid and dashed arrows, respectively.

Often in MVC, the model has no direct access to the view; instead, the view observes
the model, fetching data of interest when notified of a relevant change. Another
variant has the controller managing all the information flow, with no connection
whatsoever between the view and the model. Except for a particular real-time case,
this is the variant used in the application.

In implementation terms, simplifying slightly, the view corresponds to the GUI class,
the controller to the Controller class, and the model is split into the MidiCentral and
AudioCentral classes. Instantiation (and initial configuration) of these classes is the
duty of the AudioToMidiApp class.

Figure 5. An overview of the application, the AudioToMidiApp launcher class excluded. The AudioCentral has access to the GUI, but otherwise all interaction between the components goes through the Controller. (Classes shown: Controller; MidiCentral; AudioCentral; PitchDetector; IterativeFFT; the WindowFunction interface and the GaussWindow, HammingWindow, BlackmanWindow, and RectangularWindow classes; and the GUI with its MenuSystem, PlottingPanel, ControlPanel, and AudioToMidiPanel components.)

The Controller class implements several listener interfaces, allowing it to be aware of and handle user interaction with the GUI as well as special MIDI and audio events.
Also, usage-oriented checking (such as querying the user whether a file should be
saved before exiting) is handled here.

3.2 Graphical user interface

Ideally, using the application should require as little interaction as possible. Following
the ‘make the common case fast’ guideline, all the basic tools needed for recording,
playback, and audio-to-MIDI translation are accessible directly from the control
panels on the main screen. Additional functionality is provided through menus.

Figure 6. The application during audio playback. The buttons controlling audio recording and audio-to-MIDI translation are greyed out.

While the record button is audio specific, the stop and play buttons control both
audio and MIDI. Also, buttons are disabled at times when their functionality is not
available. For example, as seen in Figure 6, the record button is disabled during
playback, as is the audio-to-MIDI button. Both are re-enabled when playback ends.
However, as mentioned in chapter 5, such ‘user proofing’ is not yet consistently
implemented.

3.2.1 Design notes

Several parts make up the graphical user interface. Apart from the main window, the
important elements are the two plots, the two lower panels from which, for instance, playback and audio-to-MIDI translation are controlled, and the menu bar.

Figure 7. The GUI class provides a number of methods used for interaction with GUI components. (Public methods shown include showConfirmDialog(), setPlaying(), setRecording(), setStopped(), getAudioFileChooser(), getMidiFileChooser(), plotAudio(), plotSpectrum(), and clearPlots(); the GUI owns two PlottingPanel instances, a ControlPanel, an AudioToMidiPanel, and a MenuSystem.)

The GUI class and its components are fully unaware of the rest of the application
save for the Controller, which is registered as a listener to various GUI components.
Changes to GUI appearance and functionality are handled through direct method
calls; the GUI class acts as an interface to other GUI elements, most importantly the
two plots. Although there are efficiency reasons to let the model feed plot data
directly to the view in this manner, an observer pattern may be a cleaner approach
regarding smaller updates, and is subject to future evaluation.

3.2.2 Implementation notes

Basic Swing/AWT components are used throughout. The main application window is
provided by the GUI class, which extends javax.swing.JFrame. The GUI class also
handles instantiation of the other GUI components, in particular the PlottingPanel
objects, the ControlPanel, the AudioToMidiPanel, and the MenuSystem. The
latter is a subclass of javax.swing.JMenuBar, while the panels are subclasses of
javax.swing.JPanel.

Through the plotAudio() and plotSpectrum() methods in the GUI class, the plots
are continuously fed with data during playback. These methods are called by an inner
class of AudioCentral (see section 3.4), and adjust the supplied data for the plots.
This generally means scaling with regard to plot height and plot width, and in the case of the audio signal data passed to plotAudio() this pre-processing also includes
root mean square (RMS) calculations.

Since it is not assumed that the data supplied to the plotting methods describe the
complete signal (or the spectrum taken over the complete signal), but rather a small
segment, the scaling procedure assumes that the given data has already been
normalized to values within [-1.0, 1.0].

The setPlaying(), setRecording(), and setStopped() methods are called by the Controller when playback or recording starts or stops. Through these methods, visual (and sometimes functional) changes such as icon changes or disabling of buttons are controlled. These methods, along with the abovementioned plotting methods, highlight that the view is unaware of the model.

The GUI also owns the file chooser dialogs used when opening and saving files.
However, instead of offering methods to interact with these, the GUI class provides
methods to obtain them, so as to facilitate direct interaction. This results in somewhat less
cluttered code.

Currently, the GUI is all hand-written (i.e. not constructed using a GUI builder tool).
Although a clear and intuitive GUI is important, it was somewhat down-prioritized at
this stage in favor of pitch detection and other key features. The aim was mostly to
provide a sufficiently good GUI within the scope of the thesis. Thus, there is room
for much polishing, both with regards to design details and implementation details
(see chapter 5).

3.3 MIDI functionality

The application supports opening of MIDI files and saving the MIDI data produced
by an audio-to-MIDI translation as a type 0 MIDI file. When MIDI data is present, it
may be played back, and the playback may be muted/unmuted at any time. The
playback tempo can be controlled through a spinner in the GUI.

3.3.1 Design notes

Since the MIDI functionality is so basic, dividing it over several classes would rather
lead to fragmentation than to improved structure.

Figure 8. Public elements of the MidiCentral. (Shown are a constructor taking a MetaEventListener, and methods including isPlaying(), openFile(), saveFile(), setSequence(), startPlayback(), stopPlayback(), rewindToBeginning(), setTempo(), setUseMetronome(), startMetronome(), stopMetronome(), and setMuted().)

The MIDI functionality is provided by the single MidiCentral class. There is,
however, an inner (private) class for the metronome functionality, as described in the
following section.

3.3.2 Implementation notes

Java’s MIDI capabilities are accessed through the javax.sound.midi package. Working with MIDI sequences and playback is fairly straightforward, although a couple of things could be mentioned.

A MIDI sequence is represented by a Sequence object, which has a number of Track objects containing the MIDI events. After adding a sequence to a Sequencer, it may
be played back by calling the latter’s start() method; local sound production is
handled by a Synthesizer (which may obtain sound data from a Soundbank). In the
normal case, the sequencer’s Transmitter sends MIDI messages to the synthesizer’s
Receiver. Transmitters and receivers may be obtained and connected explicitly; unless
this is done, defaults are used.

The sequencer does not close itself when the end of the sequence is reached, thus
keeping hold of acquired system resources. However, at the end of playback, a
particular MetaEvent is dispatched, which we may use to trigger the closing of the
Sequencer. In our case, this MetaEvent is caught by the Controller, which is
registered as a MetaEventListener in the MidiCentral.
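A minimal sketch of this setup (not the thesis code) – playing a sequence on the default Sequencer and closing it when the end-of-track meta event (type 47) arrives – might look as follows:

    import javax.sound.midi.MetaEventListener;
    import javax.sound.midi.MetaMessage;
    import javax.sound.midi.MidiSystem;
    import javax.sound.midi.Sequence;
    import javax.sound.midi.Sequencer;

    public class PlaybackSketch {
        public static void play(Sequence sequence) throws Exception {
            final Sequencer sequencer = MidiSystem.getSequencer();   // default sequencer, connected to the default synthesizer
            sequencer.open();
            sequencer.setSequence(sequence);
            sequencer.addMetaEventListener(new MetaEventListener() {
                public void meta(MetaMessage message) {
                    if (message.getType() == 47) {                   // end-of-track meta event
                        sequencer.close();                           // release acquired system resources
                    }
                }
            });
            sequencer.start();
        }
    }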

The MidiCentral is also responsible for providing metronome functionality. If ‘Use metronome’ in the GUI is checked, there will be quarter-note MIDI clicks during audio recording. What happens at each metronome click is detailed in the inner class
MetronomeTask, which implements the Runnable interface. This task is run by
means of a ScheduledExecutorService. Note that, depending on system audio
settings (e.g. “What U Hear” source selection), the metronome click may come to be
recorded. There is currently no ‘stand-alone’ metronome; it is only available during
recording, and hence started through the ‘record’ button in the GUI.
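A simplified sketch of such a metronome (not the thesis implementation; the percussion note and velocity are illustrative) could schedule a click at quarter-note intervals derived from the tempo:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import javax.sound.midi.Receiver;
    import javax.sound.midi.ShortMessage;

    public class MetronomeSketch {
        public static ScheduledExecutorService start(final Receiver receiver, int bpm) {
            long quarterNoteMs = 60000L / bpm;                 // milliseconds per quarter note
            Runnable clickTask = new Runnable() {
                public void run() {
                    try {
                        ShortMessage click = new ShortMessage();
                        // Channel index 9 is the GM percussion channel; note 76 is a high wood block.
                        click.setMessage(ShortMessage.NOTE_ON, 9, 76, 100);
                        receiver.send(click, -1);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            };
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(clickTask, 0, quarterNoteMs, TimeUnit.MILLISECONDS);
            return scheduler;                                  // call shutdown() on this to stop the clicks
        }
    }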

The Sequencer class has a setTrackMute() method, which is the way muting of
MIDI playback is currently implemented. While it has a certain appeal in its simplicity,
there are some issues with this approach. For one thing, it could be considered to be a
kind of ‘fake’ mute, since we in effect mute notes instead of sounds. Moreover,
according to the API documentation, it is actually not guaranteed that a Sequencer
supports this functionality. Muting MIDI playback by muting the synthesizer is
perhaps the proper way, and this leads us to some issues of MIDI volume control in
Java.

The easiest way to set up a MIDI playback system is to use Java’s own default
synthesizer. However, some versions of the JRE (e.g. the Windows version) do not
ship with a soundbank, thus requiring the user to download it separately. It can
therefore not be assumed that a soundbank is present. Java Sound has a fallback
mechanism so that if it cannot obtain a soundbank for the synthesizer, it tries to
utilize a hardware MIDI port instead. However, this is generally not desired since it
results in various inconsistencies.

If no soundbank was found, attempting to change the MIDI volume through the
default synthesizer will obviously not work; we must obtain the Receiver from the
MidiSystem instead of from the Synthesizer if we want control. This was tried
during implementation, but the results were considered unsatisfactory. Since we did not
wish to require the user to download a soundbank or configure the sound system
manually, the volume control functionality was skipped in this version of the application.
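
For reference, the attempted approach corresponds roughly to sending controller 7 (channel volume) messages to a receiver obtained from the MidiSystem, as sketched below; this is not part of the current implementation, and the class and method names are assumptions:

    import javax.sound.midi.MidiSystem;
    import javax.sound.midi.Receiver;
    import javax.sound.midi.ShortMessage;

    public class VolumeSketch {
        // Sends MIDI controller 7 (channel volume) on all 16 channels; volume is 0-127.
        static void setVolume(int volume) throws Exception {
            Receiver receiver = MidiSystem.getReceiver(); // system default receiver
            for (int channel = 0; channel < 16; channel++) {
                receiver.send(new ShortMessage(ShortMessage.CONTROL_CHANGE, channel, 7, volume), -1);
            }
        }
    }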

3.4 Audio functionality

The application supports recording and playback of audio in 8- or 16-bit PCM format
at various sample rates. Stereo is currently not supported. Any audio file within
specification may be opened and played back, and present audio data may be saved.
Playback may be muted/unmuted at any time. Also, plot data is continuously fed to
the GUI during playback.

3.4.1 Design notes

The audio functionality is a bit more complex than the MIDI functionality, being
directly involved in pitch detection and audio-to-MIDI translation in addition to
handling playback, recording, and opening/saving files.

AudioCentral  (has one PitchDetector)

...

<<constructor>> AudioCentral(in controller:Controller, in gui:GUI)

getSupportedRecordingFormats():AudioFormat[*]
setRecordingSampleRate(in recordingSampleRate:float):void
setRecordingBitDepth(in recordingBitDepth:int):void
setTempo(in tempo:int):void
setPPQ(in ppq:int):void
isPlaying():boolean
newFile():void
openFile(in file:File):void
saveFile(in file:File):void
startPlayback():boolean
startRecording():void
stopPlaybackAndRecording():void
rewindToBeginning():void
setMuted(in b:boolean):void
setWindowFunction(in name:String):void
setLowThreshold(in d:double):void
setHighThreshold(in d:double):void
createMidiFromAudio():Sequence
...

Figure 9. The AudioCentral class. Not shown are two inner classes used for
playback and recording, respectively.

The AudioCentral provides the Controller with a frontend to audio-related
functionality. While the specifics of pitch detection are left to the PitchDetector (see
section 3.5), the audio-to-MIDI translation procedure is implemented in the
AudioCentral. Audio-to-MIDI specifics are discussed in section 3.6; the following
section deals with more general implementation details.

3.4.2 Implementation notes

The javax.sound.sampled package provides core audio functionality. Audio devices
are represented as Mixer objects, and a device may have one or several ports (such as
for instance microphone input or line-level output; these are represented by Port
objects). While detailed device and port selection is possible, the application currently
uses the defaults.

An object implementing the Line interface may be viewed as an audio transport path
to or from the system. Mixers and ports are both lines, although when speaking of
lines we usually refer to lines going into or out from the mixer. To capture audio, we
acquire a TargetDataLine from which the signal is read. For playback, we may use
either a SourceDataLine or a Clip. While the former is continuously fed with audio
data during playback by writing to its buffer, the latter lets all the data be loaded from
the beginning. This results in lower playback latency, and also makes it possible to
jump between different positions in the audio (which may be desired for fast
forward/rewind functions). Also, looping of the audio data is directly supported by
the Clip class. Hence, unless the audio data requires too much memory to be loaded
at once, or is not known in its entirety at the start of playback, a Clip is generally to be
preferred over a SourceDataLine. Clip is the playback line of choice in the
implementation.

Supposedly, there have been issues with out-of-memory errors when attempting to
load clips greater than 5 MB. However, no such issues have been encountered during
development. As an example, 10 MB of audio was recorded, played back, and re-
loaded from file with no problems whatsoever.

An audio clip is played back by calling Clip’s start() method. When playback is
complete, a LineEvent is dispatched. In the implementation, this is noticed by the
Controller instance, which is registered as a LineListener with the Clip.
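
Put together, loading a file into a Clip and reacting to the end of playback might look roughly as follows. In the application the Controller is the listener; the self-contained sketch below (assumed names) simply closes the clip itself, and exception handling is reduced to a throws clause:

    import java.io.File;
    import javax.sound.sampled.AudioInputStream;
    import javax.sound.sampled.AudioSystem;
    import javax.sound.sampled.Clip;
    import javax.sound.sampled.LineEvent;

    public class ClipPlaybackSketch {
        static void playFile(File file) throws Exception {
            AudioInputStream stream = AudioSystem.getAudioInputStream(file);
            final Clip clip = AudioSystem.getClip();
            clip.addLineListener(event -> {
                if (event.getType() == LineEvent.Type.STOP) {
                    clip.close();         // release the line when playback stops
                }
            });
            clip.open(stream);            // all audio data is loaded up front
            clip.setFramePosition(0);     // arbitrary positioning is possible with a Clip
            clip.start();
        }
    }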

Playback and recording are handled by two inner classes of AudioCentral:
PlaybackTask and RecordingTask. Both implement the Runnable interface and
are executed by means of an ExecutorService. In the case of playback, a
ScheduledExecutorService is used, which sees to it that the GUI plots are fed with
data regularly.
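
The actual RecordingTask is not reproduced here, but the general shape of a recording loop that reads from a TargetDataLine and is submitted to an executor is roughly the following sketch (class and method names, buffer size, and format are assumptions):

    import java.io.ByteArrayOutputStream;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioSystem;
    import javax.sound.sampled.DataLine;
    import javax.sound.sampled.TargetDataLine;

    public class RecordingSketch {
        static void record(final ByteArrayOutputStream captured) throws Exception {
            AudioFormat format = new AudioFormat(44100f, 16, 1, true, false); // 16-bit mono PCM
            final TargetDataLine line = (TargetDataLine) AudioSystem.getLine(
                    new DataLine.Info(TargetDataLine.class, format));
            line.open(format);
            line.start();
            Runnable task = () -> {
                byte[] buffer = new byte[4096];
                while (line.isOpen()) {
                    int read = line.read(buffer, 0, buffer.length); // blocks until data is available
                    captured.write(buffer, 0, read);
                }
            };
            ExecutorService executor = Executors.newSingleThreadExecutor();
            executor.submit(task); // recording ends when the line is stopped and closed elsewhere
        }
    }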

Again, as was mentioned in section 3.3.2, if the metronome is used during audio
recording, it may come to be recorded, depending on system settings.

The sample rate and bit depth used for audio recording may be specified via the
‘Settings’ menu in the GUI. Currently, four pre-determined sample rates are listed, all
assumed (by means of a rather unsophisticated test performed at application launch)
to be supported by the system. Note, however, that playback and audio-to-MIDI
translation are supported for any available sample rate.
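
The support test amounts to little more than asking the AudioSystem whether a recording line with a given format can be obtained; the actual test in the application may differ, but the idea (assuming the relevant javax.sound.sampled classes are imported) is:

    // True if the system can, in principle, provide a recording line in this format.
    boolean supported = AudioSystem.isLineSupported(
            new DataLine.Info(TargetDataLine.class, new AudioFormat(22050f, 16, 1, true, false)));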

In section 3.3.2, the omission of MIDI volume control in the current implementation
was discussed. Controlling audio volume does not present similar difficulties, but for
reasons of consistency an audio volume control was omitted as well.

3.5 Pitch detection functionality

The application supports pitch detection of (definite-pitched) sounds from about the
note F# (at approximately 92.5 Hz) and upward, depending on sample rate. From the
audio-to-MIDI panel in the GUI, the thresholds used to filter sounding pitches can be
adjusted.

3.5.1 Design notes

As previously seen, the PitchDetector is a component of the AudioCentral.

IterativeFFT

...

<<constructor>> IterativeFFT()
<<constructor>> IterativeFFT(in numSamples:int)
setNumSamples(in numSamples:int):void
getMagnitudes(in input:double[*]):double[*]
...

PitchDetector  (has one IterativeFFT and one WindowFunction)

...

<<constructor>> PitchDetector()
<<constructor>> PitchDetector(in sampleRate:float, in windowSize:int)
setWindowSize(in windowSize:int):void
setSampleRate(in sampleRate:float):void
setWindowFunction(in name:String):void
setLowThreshold(in d:double):void
setHighThreshold(in d:double):void
getMagnitudeSpectrum(in audioData:double[*]):double[*]
prepareForTranslation():void
getPitches(in audioData:double[*]):double[*]
...

<<interface>> WindowFunction

apply(in data:double[*]):void

Implementations of WindowFunction: BlackmanWindow, GaussWindow, HammingWindow, RectangularWindow

Figure 10. The PitchDetector class and its closest friends.

Generally, analysis logic resides within the PitchDetector class, while processing is
performed by the IterativeFFT and WindowFunction instances.

The design has basically everything even remotely related to signal processing go via
the PitchDetector. This includes producing the data used to plot the frequency
spectrum in the GUI, although no actual pitch detection is performed in that case.

3.5.2 Implementation notes

The pitch detection utilizes an iterative radix-2 FFT algorithm, meaning only sample
windows whose size is a power of two are supported. With this
restriction in mind, the window size is set depending on the sample rate in an attempt
to strike a balance between frequency resolution and time resolution. For example, for
audio sampled at 44100 Hz, a window size of 8192 samples will be used. This
corresponds to roughly 0.186 seconds and gives a frequency resolution of
approximately 5.4 Hz, which is sufficient to correctly identify the note F# at about
92.5 Hz. On the other hand, with a sample rate of 8000 Hz, a 1024-sample window
will be used, which corresponds to 0.128 seconds and a frequency resolution of
7.8125 Hz. Here, we can only reliably detect notes down to d at about 146.8 Hz.
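
The relationship used when choosing window sizes is simply that the width of a frequency bin is the sample rate divided by the window size, while the window length in seconds is the window size divided by the sample rate:

    // Assuming sampleRate (in Hz) and windowSize (in samples) are defined elsewhere.
    double binWidthHz    = sampleRate / (double) windowSize;  // e.g. 44100 / 8192 gives about 5.4 Hz per bin
    double windowSeconds = windowSize / (double) sampleRate;  // e.g. 8192 / 44100 gives about 0.186 s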

The transform is implemented in the class IterativeFFT. The ‘n for the price of n/2’
procedure mentioned in section 2.1.4 is employed, along with miscellaneous smaller
tweaks such as using bit-shift operators for multiplications and divisions by a power of
two. Also, n is required to have been specified before using the transform. This allows
pre-computation of constants specific to a transform of a given size; such values are
often referred to as twiddle factors.
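
As an illustration of what such pre-computation involves, the twiddle factors for a transform of size n can be tabulated once as in the sketch below; the actual layout in IterativeFFT may differ:

    // W_n^k = e^(-2*pi*i*k/n) for k = 0 .. n/2 - 1, stored as separate real and imaginary parts.
    double[] twiddleRe = new double[n / 2];
    double[] twiddleIm = new double[n / 2];
    for (int k = 0; k < n / 2; k++) {
        double angle = -2.0 * Math.PI * k / n;
        twiddleRe[k] = Math.cos(angle);
        twiddleIm[k] = Math.sin(angle);
    }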

It may be noted that IterativeFFT does not currently have a method that returns the
actual transform, i.e. a sequence of complex numbers; there is only the
getMagnitudes() method, which returns (the first half of) the magnitude spectrum.

Also mentioned in section 2.1.4 is the phenomenon of spectral leakage. Via the
‘Settings’ menu, the user may choose from several window functions. These are
implemented as ‘function objects’; they implement the WindowFunction interface
and provide a single apply() method.
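
As an example of this ‘function object’ style, a Hamming window could be written as follows; the actual classes in the application may differ in detail, and the standard 0.54/0.46 coefficients are used here:

    public class HammingWindow implements WindowFunction {
        // Multiplies each sample by the Hamming window w(i) = 0.54 - 0.46 * cos(2*pi*i / (N - 1)).
        @Override
        public void apply(double[] data) {
            int n = data.length;
            for (int i = 0; i < n; i++) {
                data[i] *= 0.54 - 0.46 * Math.cos(2.0 * Math.PI * i / (n - 1));
            }
        }
    }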

The pitch detection algorithm utilizes the harmonic product spectrum, as described in
section 2.1.3. When the magnitude of some frequency in the HPS exceeds a certain
high threshold, that frequency is added to a list of currently sounding frequencies.
Similarly, when the magnitude falls below a certain low threshold, the frequency is
removed from the list.
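
In outline, constructing the HPS from a magnitude spectrum amounts to downsampling and multiplying, as sketched below; numHarmonics is an illustrative value, and a bin corresponds to the frequency bin * sampleRate / windowSize:

    // Bare-bones harmonic product spectrum: multiply the spectrum with downsampled
    // copies of itself so that energy at f, 2f, 3f, ... reinforces the bin at f.
    double[] hps = magnitudes.clone();
    int numHarmonics = 4; // illustrative; the value used in the application may differ
    for (int h = 2; h <= numHarmonics; h++) {
        for (int bin = 0; bin < magnitudes.length / h; bin++) {
            hps[bin] *= magnitudes[bin * h];
        }
    }
    // hps[bin] is then compared against the high and low thresholds to decide when
    // the corresponding frequency starts and stops sounding.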

Currently, these thresholds are calculated from the number of harmonics considered
when constructing the HPS, the size of the sample window, and the values specified
via the threshold spinners in the GUI. In other words, the thresholds do not take the
power of the signal into account. This causes the pitch detection to behave somewhat
differently depending on sample rate and overall volume.

3.6 Audio-to-MIDI functionality

The application supports, or attempts to support, audio-to-MIDI translation of audio
of varying complexity. The timing resolution can be controlled via the ‘Tempo’ and
‘PPQ’ spinners in the GUI.

3.6.1 Design notes

Since the PitchDetector keeps track of currently sounding pitches, it is closely
intertwined with the audio-to-MIDI functionality. Indeed, it was designed with
audio-to-MIDI in mind. There is no separate audio-to-MIDI object; all methods of
the translation mechanism are defined within the AudioCentral, utilizing the
functionality of the PitchDetector.

3.6.2 Implementation notes

Audio-to-MIDI translation works by processing a number of overlapping sample
windows. The degree of the overlap is determined by the length of a MIDI tick, which
in turn is determined by the tempo and PPQ settings. For each window, pitch
detection is followed by translating the detected pitches to MIDI note numbers.
Appropriate note on and note off messages are then generated with proper timestamping,
and added to a Track (belonging to a Sequence; see section 3.3.2). When the
translation is complete, the Controller sees to it that the MidiCentral obtains the
produced MIDI sequence.
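
Stated in code, the tick length and the event generation look roughly like the fragment below (tempo, ppq, sampleRate, noteNumber, startTick, and endTick are assumed to be defined elsewhere, the relevant javax.sound.midi classes imported, and exception handling omitted):

    // With PPQ timing, one MIDI tick lasts (60 / tempo) / ppq seconds;
    // the corresponding number of samples determines the window overlap.
    double secondsPerTick = 60.0 / (tempo * ppq);
    int samplesPerTick = (int) Math.round(secondsPerTick * sampleRate);

    // One detected note becomes a note-on/note-off pair on a Track.
    Sequence seq = new Sequence(Sequence.PPQ, ppq);
    Track track = seq.createTrack();
    track.add(new MidiEvent(new ShortMessage(ShortMessage.NOTE_ON, 0, noteNumber, 96), startTick));
    track.add(new MidiEvent(new ShortMessage(ShortMessage.NOTE_OFF, 0, noteNumber, 0), endTick));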

While the PitchDetector keeps track of currently sounding frequencies, the
getMidiFromAudio() method of the AudioCentral keeps track of currently
sounding MIDI notes during the translation process. The currently sounding notes
will typically change less often than the currently sounding frequencies, since several
frequencies may map to the same note. This is particularly the case with higher
pitches.
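
The mapping from frequency to note number is the standard equal-temperament formula with A4 (440 Hz) as MIDI note 69; the variable name below is illustrative:

    // Nearest equal-tempered MIDI note number for a detected frequency in Hz.
    int midiNote = (int) Math.round(69.0 + 12.0 * (Math.log(frequency / 440.0) / Math.log(2.0)));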
4 Tests and analysis

4.1 Test approach

Within the scope of the thesis, the ambition regarding audio-to-MIDI functionality
has been to achieve good results at least in the single-voice case with not too complex
sounds. In this chapter, we shall consider this and other cases by examining pitch
detection functionality and timing accuracy. Regarding pitch detection, it will be
examined both with respect to tone complexity and with respect to the number of
simultaneous notes.

Two different tests, or test series, have been performed. One has a more ‘clinical’
approach, while the other is perhaps closer to real-world use cases.

4.2 Structured tests

The tests in this section were created to be somewhat comparable with each
other, despite being of different character. Recordings of three musical passages were
constructed by first creating MIDI versions of the passages via manual editing, and
then exporting them as 16-bit audio files with a sample rate of 44100 Hz. Two
recordings of each passage were done; one using a square wave sound, and one using
a piano sound. In the piano sound, a slight reverb was present. Within the application,
each recording was then opened, translated to MIDI, and saved. A resolution of 96
PPQ was used, and the tempo was set to match that of the recording. The data in the
resulting MIDI files was then imported into the MIDI editor and compared with the
original MIDI used to produce the audio files. In this editor (specifically, the key
editor in Steinberg’s Cubase SX3), the ‘y axis’ shows the notes and the ‘x axis’ their
durations; it can be thought of as a ‘musical’ version of a frequency/time plot.

Due to the number of configurations possible, the testing was restricted to achieving
‘good enough’ or typical translations. In some cases, this has meant accepting a few
non-detected or falsely detected notes. While better results may at times be possible to
achieve through more careful parameter selection, general improvements in the pitch
detection procedure are of more interest; the errors in the translations highlight the
weaknesses of the current mechanism.


4.2.1 Monophony

In monophonic music, there is a single voice with no accompanying harmony. Here,
we examine the application’s audio-to-MIDI capabilities when no more than one note
is sounding at a time.

Figure 11. The monophonic test case. It spans almost four octaves and has a variety of
note values.

The single-voice line illustrated above was specifically created to be demanding. The
fundamental frequencies of the notes range from approximately 277 to 2093 Hz, and
played at a tempo of 100 BPM, the line contains notes as short as 1/10 of a second.
Viewed in the MIDI editor, the original MIDI appears as below.

Figure 12. The accelerating, ascending single-line test case as written in the
MIDI editor. The ’x axis’ shows bars with quarter-note subdivisions. MIDI note
C3 (having MIDI note number 60) corresponds to c´ at approximately 261.6 Hz.

This test case should allow us to gain a fairly good grasp of the application’s ability to
handle rather short notes, and also the behavior of the pitch detection mechanism at
different frequencies. The lower frequencies are particularly interesting with the piano
sound, since lower notes often have more pronounced overtones.

We proceed with examining the square-wave recording of this test case. The non-
complex character of square waves makes a good translation easy to obtain. In fact,
the translation is 100% correct with regards to pitch, and the timing differences
compared to the original are insignificant.

Figure 13. Audio-to-MIDI translation of the monophonic square-wave case. The
differences compared to the original are negligible.

Clearly, a piano sound is more difficult to handle than a square wave sound, for
several reasons. For one thing, square waves are truly periodic, while a piano tone
changes notably with time. Also, as the hammer strikes the strings, the mechanism
introduces a certain ‘thump’ sound, which may be more or less prominent depending
on register.

It should also be pointed out that with square waves, all notes are created equal. With
a piano, all notes are different, since each note is produced from one or more
individual strings sounding together. Piano sample libraries vary in how accurately this
is reflected; some high-quality libraries do indeed sample each key individually, while
other libraries may have sampled some lesser number of keys and then utilized pitch-
shifting to fill the gaps. Similarly, the number of velocity levels at which each key is
sampled may differ.

When doing audio-to-MIDI translation of music with a piano sound, the threshold
values are of greater importance since the notes are not constant in volume while they
sound. Due to transients, resonance, and noise, for example, using an ‘aggressive’
window function (such as a Gauss or Blackman window) may be required. As an
example, consider the following translation.

Figure 14. A not-quite-perfect translation. The first two bars contain incorrectly
detected pitches (marked as black), and in the third bar a note is missing.

In Figure 14, we see that the two lowest notes (c# and e at about 138.6 and 164.8 Hz,
respectively) are both detected along with their second and third harmonics, and there
is also a false detection of the note c below c# at the very beginning. Notes from f to
f#´ (approximately within the span 174.6 – 370.0 Hz) are paired with their respective
second harmonics. Thereafter, the single notes are correctly detected with the
exception of a missing h´´ near the end of the third bar, most likely due to reverb.
Rhythmic accuracy is decent, although the low threshold probably could have been
raised slightly to trigger earlier note off events.

4.2.2 Two-voice polyphony

In polyphonic music, the voices are rhythmically independent of each other; in a
sense, it could be considered to consist of several simultaneous melodies.

Figure 15. Our two-voice test case: the beginning of Bach’s two-
part invention no. 13, BWV 784.

In this section, we examine the application’s audio-to-MIDI capabilities when faced
with two-voice polyphony; specifically, the first two bars of J.S. Bach’s two-part
invention no. 13. It is played back at 104 BPM, making a sixteenth note approximately
0.144 seconds in duration. The lowest note is A at 110 Hz and the highest note is e´´
at approximately 659 Hz. Of particular interest in this case are the occasions when the
interval between the voices is an octave. In the editor, the original MIDI of this
passage appears as below. It might be mentioned that while the musical notation in
Figure 15 contains information regarding phrasing and suchlike, the MIDI edit used is
absolutely devoid of interpretative nuances.

Figure 16. The original edit of the two-voice test case as appearing in the
editor. For clarity, the ’x-axis’ is subdivided down to sixteenth-note
durations.

The square-wave recording of this case was translated to MIDI without any problems.

Figure 17. A translation of the two-voice test played with a square wave
sound.

Again we see that the nature of square waves makes them easy to handle by the
application. Some insignificant rhythmic inaccuracies aside, all notes have been
properly detected; the octaves did not pose a problem here. Possibly the low threshold
could have been raised to further lessen the occasional slight overlapping of notes.

Moving on, the following translation of the piano recording was obtained.

Figure 18. A translation of the two-voice passage played back with a piano
sound. Incorrect notes are marked as black.

Looking at Figure 18, a strong third harmonic of the first note (A at 110 Hz) may
explain some of the oddities in the beginning of the result. Throughout, we see
numerous notes being incorrectly paired with notes corresponding to their second
harmonic (i.e. the octave). The rhythmic accuracy of note onsets is generally good,
with slight deviations in either direction.

4.2.3 Three-voice polyphony

For a three-voice test case, we take a segment from a Bach fugue. Here, a frequency
range of roughly 175-831 Hz is covered. Played at 80 BPM, the shortest note duration
is 3/16 of a second.


Figure 19. Bars 7-8 of the c-minor fugue in the first volume of
Bach’s Das Wohltemperirte Clavier, BWV 847.

As in the previous case, no accentuation or phrasing specifics are present in the MIDI
edit of the passage. In the editor, this passage appears as in Figure 20.

Figure 20. Original test 3 MIDI as appearing in the editor.

Again, proper detection of octave intervals between voices is of interest. The square
wave recording did not present much difficulty, as seen in Figure 21 below.

Figure 21. A translation of the square-wave recording of the three-voice test case.

The highest note on the second beat is missing. As we can see, it was determined that
no new note was played, but instead the preceding note of the same pitch was held for
another sixteenth-note length. In some cases, such errors should be possible avoid by
tweaking the threshold values, but it is problematic with a square wave sound since
the character and volume remains constant throughout the duration of the note. Thus,
the application can only rely on the miniscule silence between the repeated notes to
tell them apart, something that would require a significant (and most likely, practically
impossible) timing resolution. However, other than the missing note, the translation is
correct, with only very slight rhythmic deviations compared to the original.

Translating the piano-sound version, we once again see the familiar octave errors. In
some cases, the octave errors result in missing notes, since the algorithm does not
issue new note on commands if it finds the note to be already sounding. It typically
looks worse than it sounds since octaves are harmonically consistent with the correct
notes, but errors are errors nevertheless.

Figure 22. A translation of a piano sound recording of test case 3. Incorrect note
detections are marked as black.

The repeated note which was missing in the square-wave case was properly detected
here, owing to the difference in character between the beginning and end of a piano
tone. We also see variations in the note durations, although the rhythms are generally
correct.

4.3 Further tests

Since the audio of the tests in the preceding section originated from MIDI
instruments, the signal was very clean. In this section, we shall perform two further
tests which may be closer to most practical circumstances. All audio was recorded
using the application, and the metronome functionality was used to provide a
rhythmic point of reference. Since there is no source MIDI to compare the results
with, as was the case in section 4.2, the accuracy of the translations will be evaluated
largely by ear, although the MIDI editor will be used for visualization.

4.3.1 Electric guitar: a jazz lick

The following line was written to incorporate common electric guitar playing
techniques, such as sweep picking, hammer-ons, and pull-offs. It was recorded using a
solid-body electric guitar with a clean tone, save for a very slight reverb effect. The
tone was dialed in to have a typical jazz guitar character, with rolled-off highs.


Figure 23. A jazz line. It was played with a loose swing feel at a tempo
of 160 BPM. By convention it is notated one octave above sounding
pitch. Appropriately, the recording was made late at night.

This test should present a number of difficulties. For example, the tonal characteristics
of each note will vary depending on which string it was played on, whether it was
picked or not, and if it was, whether it was an upstroke or a downstroke. Also, the
strings used were less than new; the fresher the strings, the cleaner the tone.

In the figure below, we see the audio-to-MIDI translation of the recording.

Figure 24. MIDI translation of the jazz lick. The ‘x axis’ is subdivided into eighth-note
triplets. While there are some false detections and rhythmic deviations, it is essentially
correct.

We see that there are some octave errors (even featuring false detections two octaves
above the fundamental, corresponding to the fourth harmonic). There is also an
obvious misdetection at the very end. We also see that the durations of the
erroneously detected notes are short compared to those of the simultaneously sounding
correct notes. The rhythms are fairly accurate compared with the recording, and
perhaps worthy of special mention is that the initial sweep-picked arpeggio was
translated correctly. Judging from the figure, those notes are roughly 1/16 of a second
apart.

4.3.2 Acoustic guitar: a chord progression

Here, a chord progression played on an acoustic (nylon-string) guitar was recorded
using a Sennheiser E840S microphone, directed straight at the sound hole of the
guitar from a distance of about 15 centimeters. The recording took place in a room
with two running computers, obviously contributing to the noise level in the
recording.


Figure 25. A chord progression cliché. Again, it is written an octave above sounding
pitch, which means the G in the last measure is almost as low as the pitch detection
implementation can handle.

Again, several factors will affect the possibility of getting a good MIDI translation, for
example how forcefully each string was picked (finger-picking this time) and where it
was picked; transients, particularly those of the lower notes, might cause some
problems with correct note onset detection.

Figure 26. A translation of the chord progression. The ‘x axis’ is subdivided into
quarter notes.

As we may guess, the translation in Figure 26 does not sound fantastic, although the
chords are discernible. There are a number of notes present that should not be; some
due to octave errors, some probably due to transients or string noise. Also, there are
some missing notes – the highest note is absent in the third, fifth, and sixth chord of
the progression.

4.4 Test result summary

We have tested the application’s audio-to-MIDI functionality with music and sounds
of varying complexity and character, using recordings with differing noise levels.
Unsurprisingly, clean signals with fairly simple sounds do not appear to be much of a
problem. With the square-wave recordings, the number of voices did not seem to be
very significant with regards to the accuracy of the result; except for a single missing
note in the three-voice polyphony test, due to specific difficulties with repeated notes,
these recordings were all translated correctly.

The more complex sounds of the piano were more difficult for the application to deal
with, and we began to see octave errors and some missing notes. As mentioned in
section 2.1.3, octave errors are a common problem with the HPS method of pitch
detection. While the single-note line played on an electric guitar was handled decently,
although with some errors, there were some obvious problems with the acoustic guitar
chord progression. In addition to properties of the guitar and the way the chord
progression was performed, factors such as noise level and the placement and
frequency response of the microphone come into play.
5 Discussion and future work

5.1 Results

Within the scope of the thesis, the functionality of the application is to be considered
satisfactory. Below, we shall discuss this in terms of general quality and features, with
special emphasis on the audio-to-MIDI functionality.

5.1.1 General application quality

Since the audio-to-MIDI functionality is the key feature, other features have not been
subject to rigorous testing. Due to time constraints, little ‘user-proofing’ was done
during the development of the application. Although there are exceptions, the
application does not currently do much to handle user mistakes and unintended usage
scenarios.

Development was prototype-oriented in the conventional sense: ‘make it work when
used as intended on a known test system’. Naturally, the prototypical character of the
implementation will be gradually erased, and both design and general performance will
be improved where appropriate.

5.1.2 Feature set

With the exception of volume controls, all originally planned features and some
additional ones (such as the metronome) were implemented. While at times somewhat
unpolished, each feature serves its purpose sufficiently well to allow the user to obtain
results without having to resort to workarounds. Conceived improvements and
additions to the feature set are discussed in more detail later in this chapter.

5.1.3 Audio-to-MIDI functionality

The ambition, within the scope of the thesis, was audio-to-MIDI functionality good
enough to properly handle at least monophonic music with non-complex sounds. This
has been achieved and surpassed; as long as the sounds are of simple character, the
application has no apparent problems regardless of whether the music is monophonic
or polyphonic.

More complex sounds, however, may pose some problems. Apart from the typical
octave errors of the HPS method, we may also see missing notes and even notes
which are flat out wrong – notes that are not harmonics of any of the actual notes in
the source, and at first glance appear to have come out of nowhere. Although these
translations are ‘essentially correct’ – that is, the original music is clearly discernible
and not obscured by the errors – a higher degree of accuracy is obviously desired.

There are several factors which can lead to errors. Transients can often produce false
positives, and may also interfere with note onset detection. Oscillations in the
amplitudes of the partials may be a key factor in the occurrence of octave errors (and
errors including other harmonics than the first), and both note onset and note offset
detection can be affected. There is also the problem of ‘non-uniformity’ often present
in instruments. For example, a note played on one particular string of a guitar has a
different character than the same note played on another string; even though they are
the same note, their spectra can differ quite notably. This may be of particular
significance in polyphonic cases.

In some cases, errors can be avoided by tweaking the threshold values and using an
appropriate window function. However, this is not really a solution to the problem,
which is more about the core pitch detection algorithm than superficial parameter
twiddling. Thus, even though the thesis ambition has been realized, further refinement
is needed before the application can be said to have ‘release quality’. Some further
thoughts on this are mentioned in section 5.2.1.

5.2 Feature improvements and additions

Existing features can, of course, be improved upon in several areas, and there are also
a number of new features which would be of benefit to the general usability of the
application. Some ideas and suggestions are mentioned below, in no particular order.

5.2.1 General pitch detection improvements

Being a central part of the application, pitch detection functionality can never be good
enough. Several possible improvements come to mind. A ‘minor’ improvement would
be a better formula used for threshold calculation. However, the underlying pitch
detection algorithm itself should be improved – currently, only a harmonic product
spectrum is used, and while this does a decent job it has a number of shortcomings.

Pitch detection algorithms can be, and often are, especially well suited to a particular
type of music or sound; naturally, the requirements of the algorithm differ depending
on whether we intend to use it for sine tones or a saxophone, a single melody or a
chord progression, and so on. The general functionality of the application could be
enhanced by letting the user choose from several pitch detection methods.

Some window functions – for example the Gauss and Blackman windows – have
parameters which can be configured to obtain a certain ‘flavor’ of the window. In the
application, these parameters are treated as constants, but further user control of the
details of the window functions may be a nice touch. User-defined window functions
are also conceivable, but that would be a rather substantial change.

5.2.2 Audio and MIDI editing

Some editing functionality would be convenient. For example, the user may wish to
adjust or remove incorrect MIDI notes, or trim off silence from the beginning and
end of the audio data. User-controlled audio filtering may also be desired.

5.2.3 Audio and file formats

While the application supports audio of various sample rates and bit depths, it is still
restricted to PCM-encoded mono files. In addition to PCM, Java Sound has native
support for A-law and µ-law encoding, but not for other formats such as mp3 or Ogg
Vorbis. Having the application support several formats would potentially spare the
user from having to convert files in another application before working with them.

As for stereo support, it is mostly a question of how to implement the pitch detection
functionality. For example, pitch detection could be performed on each channel
separately, or the audio could be converted to mono before pitch detection. Which is
more appropriate might depend on the audio in question, so it is possible that the user
should be able to choose the method manually.

5.2.4 A ‘project’ format

Occasionally, it might be desirable to have an audio file and its corresponding MIDI
file connected in some way, together with any relevant settings (such as volume
settings). This would, for example, allow ‘simultaneous’ loading of
audio and its MIDI translation. Such a feature could be quickly implemented by letting
a text file represent each project, linking an audio file with a MIDI file and storing
various settings.

5.2.5 Volume controls

As noted in section 3.3.2, proper control of MIDI playback volume is not completely
straightforward to implement while keeping user convenience in mind, and therefore this
feature (together with audio volume control for consistency reasons, as mentioned in
section 3.4.2) was left out of this version. It should, however, be present in a future
version. The easiest way to handle it would be to require the user to have a soundbank
installed, but whether this is the most appealing solution is another matter.

5.2.6 Fast forward/rewind/pause

Currently, the only ‘tape deck’ commands available are play, stop, and record. Being
able to pause playback, or fast forward/rewind to a specific position, may in some
cases be desirable. Such a feature would benefit from a ‘song position’ GUI element,
perhaps even a slider with which the user can interact directly.

5.2.7 GUI

As mentioned in section 3.2.2, the GUI was hand-written with the aim of providing
sufficient functionality. This means that several things could be improved, not least
from an aesthetic point of view.

From a functional perspective, the most important issue is probably the plotting
functionality. The purpose of the plots in the current version of the application is
mostly to provide some visual feedback. They may be improved in several ways, most
importantly by adding properly graded and labeled axes. Another idea is to allow the
user to interact with the plots using the mouse, for instance setting the frequency
ranges and thresholds for pitch detection.

5.2.8 Instrument tuner

An instrument tuner requires higher frequency resolution than the pitch detection
needed for ‘conventional’ audio-to-MIDI translation. While the latter may only need
to be able to assign a given pitch to the closest note in the equally tempered 12-tone
scale, an instrument tuner must be able to tell clearly how far from ideal pitch the
signal is. There is also a real-time requirement. Obviously, a properly tuned instrument
is desired for correct audio-to-MIDI translation, and the addition of tuning
functionality would broaden the usability of the application.
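
The extra requirement can be made concrete: instead of merely rounding to the nearest note, a tuner must report the deviation in cents (hundredths of a semitone), along the lines of the fragment below, where frequency is the detected pitch in Hz:

    // Deviation from the nearest equal-tempered pitch, in cents (100 cents = one semitone).
    double exactNote   = 69.0 + 12.0 * (Math.log(frequency / 440.0) / Math.log(2.0));
    long   nearestNote = Math.round(exactNote);
    double centsOff    = (exactNote - nearestNote) * 100.0;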

5.3 Concluding remarks

Pitch detection is an interesting subject with many and diverse applications, for
example speech analysis and music information retrieval services such as ‘query by
humming’. Since MIDI data has a ‘musical character’, is easy to edit, and is widely
used in music applications, audio-to-MIDI is a rather natural approach to the problem
of automatic music transcription.

This thesis documents the first phase of the development of an audio-to-MIDI
application. Currently, the application does rather well in controlled circumstances,
but to make it ready for typical usage situations, more work is required. While it may
be improved in several aspects, the most important is to achieve a robust pitch
detection system. Achieving this will require deeper theory studies and extensive
testing.

Java was chosen as the programming language for several reasons. For one thing, it
has a rich and well-documented API. Second, it will (hopefully) mean that the
application can run on different platforms (although this may require some extra work
due to platform specifics – ‘write once, run everywhere’ is by no means a given). The
choice of Java was also motivated by the author’s personal curiosity regarding the
sound programming possibilities of the language.

While there are, perhaps, some quirks in Java’s sound API, there do not seem to be
any particular problems with implementing this kind of application in Java. While
other languages such as C++ may perform better, Java’s performance seems decent
enough at least for the time being, and keeps the option of platform independence
open. Thus, there are at this stage no plans to switch to another language, and further
development will build upon the current code base.
Appendices

Appendix A: References

[1] Shepard, Roger N., ‘Circularity in Judgements of Relative Pitch’. In Journal of the
Acoustical Society of America, vol. 36, issue 12, December 1964, pp. 2346-2353.

[2] Berg, Richard E., and Stork, David G. The Physics of Sound, 2nd ed. Prentice-Hall,
Englewood Cliffs, New Jersey, 1995, p. 156.

[3] Tadokoro, Y., Matsumoto, W., and Yamaguchi, M. ‘Pitch detection of musical
sounds using adaptive comb filters controlled by time delay’. In Proceedings of the
International Conference on Multimedia and Expo (ICME), August 2002, Lausanne,
Switzerland, Vol. 1, pp. 109-112.

[4] Sound and Vision Engineering Department, University of Gdansk, 2000, ‘Pitch
Detection Methods’, most recently viewed on April 2, 2009,
http://sound.eti.pg.gda.pl/student/eim/synteza/leszczyna/index_ang.htm.

[5] Barry, John M., Polyphonic Music Transcription Using Independent Component Analysis,
Master’s Thesis, Churchill College, University of Cambridge, April 2003, p. 5.

[6] de la Cuadra, P., Master, A., and Sapp, C. ’Efficient Pitch Detection Techniques
for Interactive Music’. In Proceedings of the International Computer Music Conference
(ICMC), 2001, Havana, Cuba, pp. 403-406.

[7] Noll, M. ‘Pitch Determination of human speech by the harmonic product
spectrum, the harmonic sum spectrum, and a maximum likelihood estimate’. In
Proceedings of the Symposium on Computer Processing in Communications, vol. XIX,
1970, Brooklyn, New York, pp. 779-797.

[8] Bogert, B. P., Healy, M. J. R., and Tukey, J. W. ’The Quefrency Alanysis of
Time Series for Echoes: Cepstrum, Pseudo Autocovariance, Cross-Cepstrum
and Saphe Cracking’. In Proceedings of the Symposium on Time Series Analysis,
Chapter 15, 1963, New York, pp. 209-243.


[9] Savard, A. ‘An Overview of Pitch Detection Algorithms’, lecture slides, Schulich
School of Music, McGill University, Canada, February 2006, most recently
viewed on April 2, 2009,
http://www.music.mcgill.ca/~savard/Presentation_Pitch_Tracking.ppt

[10] Master, Aaron S., Speech Spectrum Modelling from Multiple Sources, Master’s Thesis,
Churchill College, University of Cambridge, August 2000, p. 32.

[11] Engineering Productivity Tools Ltd., ‘FFT of Pure Real Sequences’, 1999, most
recently viewed on April 2, 2009,
http://www.engineeringproductivitytools.com/stuff/T0001/PT10.HTM.

[12] Dixon, S. ‘Onset Detection Revisited’. In Proceedings of the 9th International
Conference on Digital Audio Effects, September 2006, Montreal, Canada, pp. 133-137.

Appendix B: List of figures

1 Magnitude spectra of electric guitar and crumhorn, p. 4.


2 Spectral Downsampling for the harmonic spectrum method, p. 7.
3 Spectral leakage, p. 9.
4 Model-View-Controller pattern, p. 16.
5 Application overview, p. 16.
6 Application screenshot, p. 17.
7 GUI related classes, p. 18.
8 MIDI related class overview, p. 20.
9 Audio related class overview, p. 22.
10 Pitch detection related classes, p. 24.
11 Musical notation of monophonic test case test, p. 28.
12 MIDI editor view of monophonic test case, p. 28.
13 Audio-to-MIDI result of monophonic square wave recording, p. 29.
14 Audio-to-MIDI result of monophonic piano recording, p. 30.
15 Musical notation of two-voice polyphonic test case, p. 30.
16 MIDI editor view of two-voice polyphonic test case, p. 31.
17 Audio-to-MIDI result of two-voice polyphonic square wave recording, p. 31.
18 Audio-to-MIDI result of two-voice polyphonic piano recording, p. 32.
19 Musical notation of three-voice polyphonic test case, p. 32.
20 MIDI editor view of three-voice polyphonic test case, p. 33.
21 Audio-to-MIDI result of three-voice polyphonic square wave recording, p. 33.
22 Audio-to-MIDI result of three-voice polyphonic piano recording, p. 34.
23 Musical notation of a jazz lick, p. 35.
24 Audio-to-MIDI result of the jazz lick, p. 35.
25 Musical notation of a chord progression, p. 36.
26 Audio-to-MIDI result of the chord progression, p. 36.
