An Audio To MIDI Application in Java
MASTER'S THESIS
Gustaf Forsberg
Audio and MIDI data are fundamentally different, yet intertwined in the world of
computer-based music composition and production. While a musical performance
may be represented in both forms, MIDI data can always be edited and modified
without compromising sound quality, and musical notation can be produced from it
rather straightforwardly. Thus, having a performance stored as MIDI data can
sometimes be preferable to having it stored as audio data. However, in the absence of
a MIDI-enabled instrument, the MIDI data would need to be generated from the
audio data, putting some rather severe restrictions on the possibilities.
Author’s notes
For almost as long as I can remember, music has been a central part of my life. I grew
up with the music of Johann Sebastian Bach, and although I did not realize it at the
time, its sublime beauty is often mirrored in the patterns and behavior of nature.
During the years I studied composition, I became increasingly aware of the
mathematics of music; during the years I have been studying computer science, I have
become increasingly aware of ‘the music of mathematics’.
The subject of this thesis arose from a wish to apply software engineering skills in a
musical context, and also – importantly – to learn something new. I had never
previously done any sound programming, which gave the practical aspect a certain
appeal. Since I have not specialized in signal analysis, I needed to read up quite a bit
on the theory as well. This proved to be a tremendously interesting experience, most
often leading to contemplation way beyond the purely mathematical details.
In closing, I would like to thank my supervisor, Dr. Kåre Synnes, for advice and
assistance throughout the work on the thesis.
Gustaf Forsberg
April 2009
Table of Contents
1 INTRODUCTION
1.1 Background
1.2 Thesis overview
1.2.1 Purpose
1.2.2 Delimitations
1.2.3 General structure
2 TECHNICAL BACKGROUND
2.1 Pitch detection
2.1.1 General issues
2.1.2 Time-domain methods
2.1.3 Frequency-domain methods
2.1.4 Some notes on the DFT and the FFT
2.2 MIDI
2.2.1 Messages
2.2.2 Standard MIDI files and General MIDI
2.3 Audio-to-MIDI
2.3.1 General considerations
2.3.2 Hardware solutions
2.3.3 Software solutions
3.5 Pitch detection functionality
3.5.1 Design notes
3.5.2 Implementation notes
3.6 Audio-to-MIDI functionality
3.6.1 Design notes
3.6.2 Implementation notes
APPENDICES
Appendix A: References
Appendix B: List of figures
1 Introduction
1.1 Background
In several respects, when working with a piece of music, MIDI has a number of advantages over working directly with audio data. Tasks such as adjusting the tempo or phrasing, tweaking velocities, or removing unwanted notes are trivial when working with MIDI, whereas in the audio case re-recording would likely be preferred.
Furthermore, due to the nature of MIDI, the step to musical notation is fairly short; in
some cases the conversion is a one-step affair, although some manual editing is usually
required to produce a good-looking score. As a concluding example, it could also be
mentioned that MIDI provides a very space-efficient way of storing a performance.
There are, then, several situations where MIDI data may be preferable to audio data.
Generally, this does not present much of a problem to a keyboard player – most
keyboards today have MIDI functionality, and indeed, MIDI was designed with
keyboard instruments in mind. However, the situation is quite another if an
instrument lacks MIDI functionality, or if only an audio recording of a performance is
available. In such cases, it would be practical to be able to translate audio data into
MIDI data.
Audio-to-MIDI translation is the main subject of this project, both from a theoretical
and from a practical perspective. An outline of the thesis is presented below, in terms
of purpose, delimitations, and information organization.
1.2 Thesis overview
1.2.1 Purpose
The goal of this thesis is the design and implementation of a general-purpose audio-
to-MIDI application. The application is not intended to let users make MIDI files
from their CD or mp3 collection; rather, it should be thought of as a musician’s tool,
to be used, for example, as a quick means of transcribing improvisations.
The application aims to provide a ‘working environment’ and not just limit itself to
pure audio-to-MIDI functionality. Hence, it will support audio and MIDI file handling
and playback, audio recording, and other related features. The application should be
quick and easy to use, so a clear and intuitive GUI is desired.
1.2.2 Delimitations
Although signal analysis is central in the subject of audio-to-MIDI translation, the area
of the thesis project is in fact software engineering. The main implication of this is
that the formal focus of the thesis lies on the application, rather than on the often
intricate mathematical details. Nevertheless, a significant amount of time had to be
dedicated to theory studies since the author did not have any previous experience of
signal analysis.
1.2.3 General structure
After this brief introductory chapter, we will turn our attention to the prerequisites of
an audio-to-MIDI application, discussing topics such as pitch detection and MIDI.
Following that, chapters three and four concern themselves with the design and
implementation of the application, along with a series of tests to determine its general
performance. The thesis concludes with a discussion on both the current state of the
application and future work.
2 Technical background
As we recall, the simplest pitched sound is the sine tone, with its pitch being
determined solely by the frequency of the single sinusoid. When dealing with sine
tones, it is trivial to determine even several simultaneous pitches, since each peak in
the frequency spectrum corresponds to a separate note.
Typically, however, a pitched sound will have several periodic components (referred
to as partials), differing in frequency, amplitude, and phase. In the typical pitched
instrument, the frequencies of the partials align in a harmonic series. This means that
the frequencies are whole-number multiples of some common fundamental frequency,
and partials with this property are called harmonics. The term overtone is often used to
refer to any partial – harmonic or inharmonic – other than the fundamental. To
varying degrees, the presence of overtones makes pitch detection more complicated.
tones with differing amplitudes in octaves) [1]. Another example of the psychological
(and neurological) aspect of pitch is that in the case of harmonic partials, we tend to
‘hear’ the fundamental even if it is not present; this is known as periodicity pitch, missing
fundamental, or subjective fundamental. It may seem like a somewhat artificial example, but
the effect is used in practice, for instance in the production of deep organ tones [2].
Musically, the second and fourth harmonics lie one and two octaves above the
fundamental, respectively, and the third harmonic lies a fifth above the second
harmonic. Together, these intervals produce a very clean sound; indeed, much of the
‘color’ of the sound lies in the configuration of the higher harmonics. Overtones are
typically not perceived as separate notes, but in some sounds they are. Even so, we
hardly think of them as notes being played, but rather consider them components of
the sound. In other words, they do not necessarily alter the perceived pitch.
Figure 1. Magnitude spectra of the note g with fundamental frequency approximately 196 Hz, played on an electric guitar with a clean tone (a) and a tenor crumhorn (b). Both panels plot magnitude (%) against frequency (Hz) over the range 0-4000 Hz. Compared to the guitar, the crumhorn is notably rich in overtones; the frequency range of the plots has been limited for readability, but partials of the crumhorn continue up to about 15 kHz.
Apart from issues that arise from a musical context, such as tone complexity and
polyphony, there are several other factors which can complicate proper pitch
detection. The noise level of the signal is one such factor. Some pitch detection
methods are more sensitive to noise than others, and often a compromise must be
reached between noise sensitivity, accuracy, and computational cost. Naturally, if there
is a real-time requirement, keeping the computational cost down becomes more
important. Sounds or sound phenomena originating from the recording environment
(such as echoes or reverb) also complicate analysis.
We may distinguish two basic approaches to the pitch detection problem: the time-domain approach and the frequency-domain approach. In the following sections, a few examples of each approach are discussed.
Auto-correlation is another, quite popular, way to tackle the pitch detection problem in
the time domain. The main idea is to compare a segment of the signal with a shifted
version of itself; the correlation should be greatest when the shift corresponds to the
fundamental period of the signal. A problem with this approach is that the accuracy
tends to decrease at higher frequencies, due to periods becoming shorter and
approximation errors becoming greater. This method also suffers somewhat from
false detections – typically it has problems with periodic signals where the period is
that of a missing fundamental [4] – and it may not, in its basic form, be well suited for
polyphonic music [5].
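
To make the idea concrete, a minimal Java sketch of auto-correlation pitch estimation might look as follows (the 50-2000 Hz search range is an arbitrary choice for the example, not a value taken from the application):

    /** Rough fundamental frequency estimate of a mono signal via auto-correlation. */
    static double estimatePitch(double[] x, double sampleRate) {
        int minLag = (int) (sampleRate / 2000.0);   // ignore candidates above ~2 kHz
        int maxLag = (int) (sampleRate / 50.0);     // ...and below ~50 Hz
        int bestLag = 0;
        double best = Double.NEGATIVE_INFINITY;
        for (int lag = minLag; lag <= maxLag && lag < x.length; lag++) {
            double sum = 0.0;
            for (int i = 0; i + lag < x.length; i++) {
                sum += x[i] * x[i + lag];           // correlate the signal with a shifted copy of itself
            }
            if (sum > best) {
                best = sum;
                bestLag = lag;
            }
        }
        return bestLag > 0 ? sampleRate / bestLag : 0.0;
    }

In practice the correlation is usually normalized and the peak position interpolated, which mitigates the accuracy problems at higher frequencies mentioned above.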
There are several methods available to transform from the time domain to the frequency domain. The best known and most frequently used is probably the discrete Fourier transform (DFT), in practice computed with the fast Fourier transform (FFT).
Popularity aside, there are some issues with the Fourier approach. For example, better
localization in frequency means worse localization in time, and vice versa. Moreover,
since frequencies of musical notes are distributed logarithmically and the frequency
bins of the FFT are distributed linearly, resolutions at low frequencies tend to be too
low, and resolutions at high frequencies tend to be unnecessarily high.
Many of these issues are absent in wavelet transforms. Simply put, a wavelet is a function
which divides the initial function into different frequency components and allows for
examination of these components in an appropriate scale; this remedies the resolution
issues of the FFT. Also, contrary to the FFT, wavelets are localized in both time and
frequency, and generally do not have a problem handling discontinuities. While not
(yet) widely adopted in audio processing, wavelets are often used in image processing.
There are several other transforms which may be used in audio processing, for
example the constant Q transform and the discrete Hartley transform. Nevertheless,
the FFT is ubiquitous, and hereafter we take the word ‘transform’ to imply the FFT.
After transformation into the frequency domain, there are several ways to estimate
pitch. In very simple cases, for example monophonic music with non-complex
sounds, it might be sufficient with frequency peak detection directly following the
transform, as mentioned in section 2.1.1. Most often, however, a more sophisticated
method is called for.
To obtain the harmonic product spectrum (HPS) [7], we begin by downsampling the
spectrum a number of times, each time producing a more ‘compressed’ version.
Specifically, the nth downsampled spectrum is 1/(n + 1) the size of the original
spectrum. The point is to utilize that the partials belong to a harmonic series, so that
the first harmonic (i.e. the fundamental) in the original spectrum aligns with the
second harmonic in the first downsampled spectrum, which in turn aligns with the
third harmonic in the second downsampled spectrum, and so on. Thus, the number of
spectra considered equals the number of harmonics considered, and the HPS is finally
produced by multiplying the spectra together, with the idea of amplifying the
fundamental frequencies.
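
As an illustration (a simplified sketch rather than the application's implementation), the HPS of a magnitude spectrum can be computed by multiplying each bin with the bins at integer multiples of its index, which is equivalent to multiplying the downsampled spectra together:

    /** Harmonic product spectrum of a magnitude spectrum, considering `harmonics` harmonics. */
    static double[] harmonicProductSpectrum(double[] magnitudes, int harmonics) {
        double[] hps = new double[magnitudes.length / harmonics];
        for (int k = 0; k < hps.length; k++) {
            double product = 1.0;
            for (int h = 1; h <= harmonics; h++) {
                product *= magnitudes[k * h];   // bin k of the (h-1)th downsampled spectrum
            }
            hps[k] = product;
        }
        return hps;
    }

For monophonic material, the fundamental is then taken as the bin with the largest product; its frequency is the bin index times the spectral resolution of the transform.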
While the HPS method is quite insensitive to noise and generally works well, there
may be problems with notes being detected in the wrong octave (usually one octave
too high). Some extra peak analysis may help [6], but this may be difficult in
polyphonic cases.
As we shall see in chapter 3, the implementation relies on the FFT for transformation
into the frequency domain. While the DFT and the FFT are standard textbook
material and thus shall not be covered in depth here, a brief review of some relevant
aspects may be appropriate.
The DFT of an $n$-point sequence $x_0, \ldots, x_{n-1}$ is defined as
$$X_k = \sum_{m=0}^{n-1} x_m \,\omega_n^{km}, \qquad k = 0, \ldots, n-1,$$
where
$$\omega_n = e^{-2\pi i / n}$$
is a primitive nth root of unity. The DFT is, by definition, a complex transform; it
takes complex-valued input and produces complex-valued output. For a real-valued
input, the second half of the transform will be a complex conjugate mirror of the first;
that is,
$$X_k = X_{n-k}^{*}.$$
Hence, in the case of real-valued input, we need only consider the first half of the
transform.
Figure 3. A signal with a frequency that is not an integer multiple of the spectral resolution of the transform will produce spectral leakage as seen in (a); energy has ‘spilled over’ into the other frequency bins. In (b), a Hamming window was applied before transforming, with a noticeable reduction in spectral leakage effects as a result. Both panels plot magnitude against frequency.
Spectral leakage can be reduced by applying a window function, for example the Hamming window
$$w(m) = 0.54 - 0.46 \cos\left(\frac{2\pi m}{n-1}\right)$$
to the signal segment before taking the transform. Like many window functions, the
Hamming window is bell-shaped, but other shapes (such as triangular windows) are
occasionally used.
From the definition of the DFT, we can see that it has an asymptotic complexity of O(n²). The FFT is a divide-and-conquer method which, by utilizing properties of the complex roots of unity, improves the complexity to O(n lg n). While the divide-and-conquer approach may hint at recursion and dependence upon the factorization of n, there are several FFT algorithms of varying kinds. Most commonly seen are iterative
implementations of the Cooley-Tukey radix-2 algorithm, which requires n to be a
power of two.
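
For illustration, a recursive formulation of the radix-2 idea is given below; it is a textbook sketch rather than the iterative variant used in the implementation (see chapter 3). The input length must be a power of two, and complex values are stored as separate real and imaginary arrays.

    /** Radix-2 Cooley-Tukey FFT (recursive sketch); re/im hold the complex input, overwritten with the result. */
    static void fft(double[] re, double[] im) {
        int n = re.length;
        if (n == 1) {
            return;
        }
        int half = n / 2;
        double[] evenRe = new double[half], evenIm = new double[half];
        double[] oddRe = new double[half], oddIm = new double[half];
        for (int m = 0; m < half; m++) {        // divide: split into even- and odd-indexed samples
            evenRe[m] = re[2 * m];      evenIm[m] = im[2 * m];
            oddRe[m] = re[2 * m + 1];   oddIm[m] = im[2 * m + 1];
        }
        fft(evenRe, evenIm);                    // conquer the two half-size transforms
        fft(oddRe, oddIm);
        for (int k = 0; k < half; k++) {        // combine using the twiddle factors w_n^k
            double angle = -2.0 * Math.PI * k / n;
            double wRe = Math.cos(angle), wIm = Math.sin(angle);
            double tRe = wRe * oddRe[k] - wIm * oddIm[k];
            double tIm = wRe * oddIm[k] + wIm * oddRe[k];
            re[k] = evenRe[k] + tRe;        im[k] = evenIm[k] + tIm;
            re[k + half] = evenRe[k] - tRe; im[k + half] = evenIm[k] - tIm;
        }
    }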
Although all FFT algorithms run in O(n lg n) time, naturally we wish to minimize the
time taken in practice. Our audio signal will be strictly real, and if we have n real
samples it might seem like we have to ‘add’ an imaginary signal consisting solely of
zeros before transforming, since it is a complex transform. This is fortunately not the
case. By taking every even-indexed sample to be real-valued and every odd-indexed
sample to be imaginary-valued, we produce a complex input of size n/2, on which the
transform is applied. From the result, the transform of the initial real sequence is
obtained through a final unwrapping step (which, like the construction of the complex
input from the real input, runs in linear time). Hence, a real signal of length n requires only a complex transform of size n/2, plus linear-time pre- and post-processing.
2.2 MIDI
MIDI (Musical Instrument Digital Interface) was created in the early 1980’s in an
effort to standardize the way digital musical instruments communicated. Previously,
various manufacturers had developed their own digital interfaces, and some were
beginning to worry that the use (and hence sales) of synthesizers would be inhibited
by the lack of compatibility. The first MIDI instrument, the Sequential Circuits
Prophet-600, appeared in January 1983, soon to be followed by Roland’s JX-3P. At
this time, the MIDI specification was very simple, defining only the most basic
instructions. Since then, however, it has grown significantly.
The MIDI specification can be said to consist of three main parts: the message specification, the transport specification, and the file specification. Of these three,
probably the most important part, and the part of our primary concern, is the message
specification, or protocol.
2.2.1 Messages
Naturally, the content of any data bytes present depends on the command with which they are associated. A program change command, for instance, is followed by one data
byte, containing the number of the instrument sound (or patch) to be used. When two
data bytes are used, they usually contain one separate piece of information each; for
example, a note on command uses two data bytes, specifying note number and velocity,
respectively. Since the MSB is used to signify whether it is a status byte or a data byte,
this gives 128 possible note numbers (in comparison, a standard piano has 88 keys)
and 128 different velocities (including the zero velocity). Here, 128 different values are
quite sufficient, but in some cases a greater range is desired. An example is the pitch
bend command, where one data byte holds the least significant bits and the other the
most significant; the 2¹⁴ different values allow for very smooth pitch transitions.
Channel messages always have one or two data bytes, while system messages may
have zero data bytes. Thus, a MIDI message is at most three bytes in size.
The small message size is important for the timing. The MIDI protocol is a serial
communications protocol, with a specified bandwidth of a mere 31.25 kBaud
(approximately 3.8 kByte/s). There is no true simultaneity; a chord, for example, is in
practice a really fast arpeggio. With a maximum message size of three bytes, well over
a thousand messages can be sent per second even in the worst case, disregarding
practical limitations.
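
As a small illustration, this is how a note on message could be constructed and sent with the Java Sound API (the same API the implementation in chapter 3 builds on); the channel, note number, and velocity below are arbitrary example values:

    import javax.sound.midi.*;

    static void sendExampleNoteOn() throws MidiUnavailableException, InvalidMidiDataException {
        Receiver receiver = MidiSystem.getReceiver();        // default receiver provided by the system
        ShortMessage noteOn = new ShortMessage();
        noteOn.setMessage(ShortMessage.NOTE_ON, 0, 60, 96);  // status byte (note on, channel 1) plus two data bytes
        receiver.send(noteOn, -1);                            // -1 = deliver immediately
    }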
Since no actual sound data is stored, the space required is very small, especially
compared to an audio file. By accessing the file, any device or application capable of
MIDI playback can replay the performance. How it actually sounds depends on the
MIDI instrument utilized for playback, and this brings up the issue of uniform
playback. While some MIDI instruments have very high-quality sounds, and while the
program change command lets us tell the MIDI instrument to use a certain patch, it is
not specified what sound we will actually get – it may differ from instrument to
instrument. This means, for example, that a piece which plays back with an organ
sound on one MIDI instrument might play back with drum sounds on another.
To counter this problem, General MIDI (GM) was created. While not a part of MIDI
per se, GM defines specific features for MIDI instruments. For instance, with a GM
instrument, we know that MIDI channel 10 is reserved for percussion sounds, and we
also know that a particular note number played on this channel will always produce a
particular percussion instrument sound. For other channels, we know which program
number corresponds to which instrument (for example, the acoustic grand piano is
always found at program number 1, the violin is always number 41, and so on). In
addition to organizing instrument layout, GM also makes specifications regarding
polyphony, velocity, and multitimbrality. Thus, adhering to the GM standard increases
the chances of correct playback on foreign systems.
First published in 1991, GM has since been superseded by GM2 (1999). GM2 is
fully compatible with the original GM, while considerably extended. There also exists
a slimmed-down version (General MIDI ‘Lite’, GML) aimed at mobile applications,
and some instrument manufacturers have introduced their own extensions and
variants, for example the XG standard from Yamaha.
2.3 Audio-to-MIDI
Now, having acquainted ourselves somewhat with both pitch detection and MIDI, we
can reflect a bit upon the requirements and possibilities of audio-to-MIDI
functionality. As we shall see, we encounter limitations of both pitch detection and
MIDI.
Perhaps the most obvious issue is that it is not enough to have a well-functioning
pitch detection algorithm; in order to produce a correct translation, we must also be
able to tell when a note was played, and when it was released. There are a number of
ways to tackle this problem. For example, we may consider spectral peaks to indicate
note onsets whenever a certain threshold is exceeded. Determining that threshold may
be a bit tricky in practice; for instance, since the dynamics of an instrument may vary
so that notes in a certain register are naturally louder than notes in another register,
the threshold value may need to vary throughout the frequency range. It may also be
desired to consider changes in spectral magnitude (i.e. spectral flux) in addition to the
values themselves. Other approaches to note onset detection may work with the
amplitude of the signal, or changes in phase [12].
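
As a sketch of the spectral-flux idea mentioned above (simplified, and not the application's actual onset detector), the positive changes between two successive magnitude spectra can be summed and compared against a threshold:

    /** Returns true if the positive spectral flux between two successive spectra exceeds a threshold. */
    static boolean isOnset(double[] previousMagnitudes, double[] magnitudes, double threshold) {
        double flux = 0.0;
        for (int k = 0; k < magnitudes.length; k++) {
            double rise = magnitudes[k] - previousMagnitudes[k];
            if (rise > 0.0) {
                flux += rise;       // only increases in energy suggest a new note onset
            }
        }
        return flux > threshold;
    }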
Of course, some cases are easier to handle than others. Sounds such as sine tones may
be easy to deal with from a pitch detection perspective, but the unchanging nature of
the notes can make proper detection of repeated notes difficult. With other sounds, it
may be very easy to determine when a note begins or ends, but there may be
inharmonic transients that complicate pitch detection.
Erratic note detections can arise from very slight fluctuations in frequency or
amplitude. These fluctuations may be temporary and very short, and in such cases the
resulting notes have very short durations (i.e. the note onset is almost immediately
followed by the note offset). Thus, it may at times be possible to use note duration as
a ‘mistake criterion’ in order to clean up the audio-to-MIDI translation.
Apart from note onset and offset detection, obviously we must also handle the pitch
information itself. In the 12-tone equal temperament, the fundamental frequency $f_k$ of the $k$th semitone above a note with fundamental frequency $f_0$ is given by
$$f_k = f_0 \cdot 2^{k/12}, \qquad k = 1, 2, \ldots$$
This makes each note have a fundamental frequency which is approximately 5.9%
higher than that of the preceding semitone. The most basic approach for audio-to-
MIDI translation is probably to disregard devices such as vibrato and glissando and
simply match frequencies to their closest note in the equal temperament. This is of
course ideal for music that is itself limited in that respect (such as piano music), but
less well suited if we wish the translation to mimic the original performance in detail.
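
The nearest-note matching can be expressed compactly; the sketch below uses the MIDI convention of note number 69 for a´ at 440 Hz (the reference pitch is an assumption of the example, not a setting taken from the application):

    /** Nearest equal-tempered MIDI note number for a given fundamental frequency. */
    static int nearestMidiNote(double frequency) {
        return (int) Math.round(69.0 + 12.0 * Math.log(frequency / 440.0) / Math.log(2.0));
    }
    // Example: nearestMidiNote(196.0) returns 55, the note g of Figure 1.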
To handle for example glissando correctly, the audio-to-MIDI method must ‘know’
when a new note should be played and when to simply apply the expression to an
already sounding note. This implies a certain degree of sophistication in note onset
and note offset detection. We must also keep in mind that the relevant MIDI
commands are channel commands; for example, pitch bend will affect all currently
sounding notes on the specified channel. Thus, if we want to be able to handle
polyphonic cases such as when one voice is doing a glissando while another is not,
each voice needs a separate channel. Obviously, this requires that the number of
voices does not exceed the number of channels.
The quick and dirty way to assign different voices to different channels would be to
simply correlate channel number with pitch order. For example, we might always let
the highest note be handled by channel 1, the next highest by channel 2, and so on.
There are, however, numerous problems with this approach. For one thing, if two
voices cross each other, their channel numbers will no longer correspond to their
pitch order. This is likely to necessitate manual corrections if for instance musical
notation is to be produced. Moreover, if the crossing of voices is the result of a
glissando, this approach will simply not work. Ideally, we would like to be able to
track each voice and thus make sure that each note gets assigned to the correct
channel, but this would most often require elaborate pattern matching and timbre
identification.
Several products aiming to bridge the gap between audio and MIDI exist, of varying
character. For example, they may be aimed at hobbyists or professionals, they may be
general-purpose or specialized for a particular instrument, and so on. We conclude
this chapter with some general remarks about hardware- and software-based solutions.
There are two main approaches to hardware solutions to the audio-to-MIDI problem: integrated and non-integrated. Examples of the former are MIDI guitars with on-board DSPs, allowing the MIDI cable to be connected directly to the
instrument. However, musicians tend to be very picky about their instruments, and
most would be unhappy having to use an instrument they did not like in order to have
MIDI functionality. In such cases, the non-integrated approach may be more
appealing; since the processing is done externally, the instrument generally needs no
modification apart from possibly mounting a special pickup. For example, stringed
instruments may be fitted with a pickup that sends a separate signal for each string,
greatly simplifying multi-pitch detection. As a note, hybrid solutions exist as well,
where the pickup is integrated but the DSP is not.
Most audio-to-MIDI programs are stand-alone applications, but there are also plug-ins,
intended for use within a host application. Plug-ins often have direct hardware
counterparts, and are typically fairly light-weight and dedicated to a particular real-time
task.
In software solutions, the GUI possibilities pave the way for numerous additional
features, such as extensive editing functionality and production of musical notation.
Unless audio-to-MIDI needs to be performed in a real-time situation – for example
having audio triggering MIDI events during a live performance – the direct result of
the translation is often an intermediary step requiring editing. Of course, the editing
itself does not really depend on whether the translation was performed by hardware or
software, but it can be convenient to be able to perform all tasks using a single tool or
platform.
3 Design and implementation
3.1 Overview
The application was developed in JDK6 on Windows XP, using an Intel quad-core
machine with 2 GB of RAM. No reasonably modern system should have any
problems running it; the only ‘real’ requirements are enough RAM to hold the audio
data (a non-issue these days) and a decent sound card.
In general terms, the main features of the application are audio recording and playback, MIDI file handling and playback, audio-to-MIDI translation, and real-time plotting of the waveform and frequency spectrum.
There are various configuration options available, for example allowing the user to
control the sample rate and bit depth used during recording, and which window
function to use for pre-processing during pitch detection. Also, the user has a number
of options for controlling audio-to-MIDI behavior (such as MIDI resolution and
pitch detection thresholds).
The application design is based on the Model-View-Controller (MVC) pattern, where the
controller is notified of relevant user interaction with the view and reacts accordingly
through direct access to both the view and the model. While there are several variants
of this pattern, the main point is to separate user interface from business logic.
[Figure: the Model-View-Controller pattern, with the Controller mediating between the Model and the View.]
Often in MVC, the model has no direct access to the view; instead, the view observes
the model, fetching data of interest when notified of a relevant change. Another
variant has the controller managing all the information flow, with no connection
whatsoever between the view and the model. Except for a particular real-time case,
this is the variant used in the application.
In implementation terms, simplifying slightly, the view corresponds to the GUI class,
the controller to the Controller class, and the model is split into the MidiCentral and
AudioCentral classes. Instantiation (and initial configuration) of these classes is the
duty of the AudioToMidiApp class.
[Figure: overview class diagram, showing the Controller, MenuSystem, ControlPanel, AudioToMidiPanel, PlottingPanel, PitchDetector, and IterativeFFT classes, as well as the WindowFunction interface with its GaussWindow, HammingWindow, BlackmanWindow, and RectangularWindow implementations.]
Ideally, using the application should require as little interaction as possible. Following
the ‘make the common case fast’ guideline, all the basic tools needed for recording,
playback, and audio-to-MIDI translation are accessible directly from the control
panels on the main screen. Additional functionality is provided through menus.
While the record button is audio specific, the stop and play buttons control both
audio and MIDI. Also, buttons are disabled at times when their functionality is not
available. For example, as seen in Figure 6, the record button is disabled during
playback, as is the audio-to-MIDI button. Both are re-enabled when playback ends.
However, as mentioned in chapter 5, such ‘user proofing’ is not yet consistently
implemented.
Several parts make up the graphical user interface. Apart from the main window, the
important elements are the two plots, the two lower panels from which, for instance, playback and audio-to-MIDI translation are controlled, and the menu bar.
Figure 7. The GUI class provides a number of methods used for interaction with GUI components.
The GUI class and its components are fully unaware of the rest of the application
save for the Controller, which is registered as a listener to various GUI components.
Changes to GUI appearance and functionality are handled through direct method
calls; the GUI class acts as an interface to other GUI elements, most importantly the
two plots. Although there are efficiency reasons to let the model feed plot data
directly to the view in this manner, an observer pattern may be a cleaner approach
regarding smaller updates, and is subject to future evaluation.
Basic Swing/AWT components are used throughout. The main application window is
provided by the GUI class, which extends javax.swing.JFrame. The GUI class also
handles instantiation of the other GUI components, in particular the PlottingPanel
objects, the ControlPanel, the AudioToMidiPanel, and the MenuSystem. The
latter is a subclass of javax.swing.JMenuBar, while the panels are subclasses of
javax.swing.JPanel.
Through the plotAudio() and plotSpectrum() methods in the GUI class, the plots
are continuously fed with data during playback. These methods are called by an inner
class of AudioCentral (see section 3.4), and adjust the supplied data for the plots.
This generally means scaling with regards to plot height and plot width, and in the
case of the audio signal data passed to plotAudio() this pre-processing also includes
root mean square (RMS) calculations.
Since it is not assumed that the data supplied to the plotting methods describe the
complete signal (or the spectrum taken over the complete signal), but rather a small
segment, the scaling procedure assumes that the given data has already been
normalized to values within [-1.0, 1.0].
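
A hypothetical helper in the spirit of this pre-processing (the method name and signature are illustrative, not the actual code behind plotAudio()) could reduce a block of normalized samples to one root mean square value per plot column:

    /** Reduce a buffer of samples in [-1.0, 1.0] to one RMS value per plot column. */
    static double[] toPlotColumns(double[] samples, int plotWidth) {
        double[] columns = new double[plotWidth];
        int samplesPerColumn = Math.max(1, samples.length / plotWidth);
        for (int col = 0; col < plotWidth; col++) {
            int start = col * samplesPerColumn;
            int end = Math.min(samples.length, start + samplesPerColumn);
            double sumOfSquares = 0.0;
            for (int i = start; i < end; i++) {
                sumOfSquares += samples[i] * samples[i];
            }
            int count = Math.max(1, end - start);
            columns[col] = Math.sqrt(sumOfSquares / count);   // RMS of this column's samples
        }
        return columns;
    }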
The GUI also owns the file chooser dialogs used when opening and saving files.
However, instead of offering methods to interact with these, the GUI class provides
methods to obtain them, so as to facilitate direct interaction. This results in somewhat less
cluttered code.
Currently, the GUI is all hand-written (i.e. not constructed using a GUI builder tool).
Although a clear and intuitive GUI is important, it was somewhat down-prioritized at
this stage in favor of pitch detection and other key features. The aim was mostly to
provide a sufficiently good GUI within the scope of the thesis. Thus, there is room
for much polishing, both with regards to design details and implementation details
(see chapter 5).
The application supports opening of MIDI files and saving the MIDI data produced
by an audio-to-MIDI translation as a type 0 MIDI file. When MIDI data is present, it
may be played back, and the playback may be muted/unmuted at any time. The
playback tempo can be controlled through a spinner in the GUI.
Since the MIDI functionality is so basic, dividing it over several classes would rather
lead to fragmentation than to improved structure.
[Figure: the MidiCentral class.]
The MIDI functionality is provided by the single MidiCentral class. There is,
however, an inner (private) class for the metronome functionality, as described in the
following section.
The sequencer does not close itself when the end of the sequence is reached, thus
keeping hold of acquired system resources. However, at the end of playback, a
particular meta event (an end-of-track MetaMessage) is dispatched, which we may use to trigger the closing of the Sequencer. In our case, this event is caught by the Controller, which is
registered as a MetaEventListener in the MidiCentral.
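
A sketch of that wiring (simplified; in the application the listener lives in the Controller class) might look as follows:

    import javax.sound.midi.*;

    static void closeWhenDone(final Sequencer sequencer) {
        sequencer.addMetaEventListener(new MetaEventListener() {
            public void meta(MetaMessage message) {
                if (message.getType() == 0x2F) {    // 0x2F is the end-of-track meta event
                    sequencer.close();              // release the system resources held by the sequencer
                }
            }
        });
    }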
A metronome is also provided, for use during audio recording. What happens at each metronome click is detailed in the inner class
MetronomeTask, which implements the Runnable interface. This task is run by
means of a ScheduledExecutorService. Note that, depending on system audio
settings (e.g. “What U Hear” source selection), the metronome click may come to be
recorded. There is currently no ‘stand-alone’ metronome; it is only available during
recording, and hence started through the ‘record’ button in the GUI.
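
A minimal sketch of such scheduling is given below; the tempo handling and the contents of the task are placeholders, not the application's code:

    import java.util.concurrent.*;

    static ScheduledExecutorService startMetronome(final Runnable metronomeTask, int beatsPerMinute) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        long periodMillis = 60000L / beatsPerMinute;              // time between clicks
        scheduler.scheduleAtFixedRate(metronomeTask, 0L, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;                                          // call shutdown() to stop the metronome
    }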
The Sequencer class has a setTrackMute() method, which is the way muting of
MIDI playback is currently implemented. While it has a certain appeal in its simplicity,
there are some issues with this approach. For one thing, it could be considered to be a
kind of ‘fake’ mute, since we in effect mute notes instead of sounds. Moreover,
according to the API documentation, it is actually not guaranteed that a Sequencer
supports this functionality. Muting MIDI playback by muting the synthesizer is
perhaps the proper way, and this leads us to some issues of MIDI volume control in
Java.
The easiest way to set up a MIDI playback system is to use Java’s own default
synthesizer. However, some versions of the JRE (e.g. the Windows version) do not
ship with a soundbank, thus requiring the user to download it separately. It can
therefore not be assumed that a soundbank is present. Java Sound has a fallback
mechanism so that if it can not obtain a soundbank for the synthesizer, it tries to
utilize a hardware MIDI port instead. However, this is generally not desired since it
results in various inconsistencies.
If no soundbank was found, attempting to change the MIDI volume through the
default synthesizer will obviously not work; we must obtain the Receiver from the
MidiSystem instead of from the Synthesizer if we want control. This was tried
during implementation, but the results were considered unsatisfactory. Not wishing to
require the user to download a soundbank or configure the sound system manually,
the volume control functionality was skipped in this version of the application.
The application supports recording and playback of audio in 8- or 16-bit PCM format
at various sample rates. Stereo is currently not supported. Any audio file within
specification may be opened and played back, and present audio data may be saved.
Playback may be muted/unmuted at any time. Also, plot data is continuously fed to
the GUI during playback.
The audio functionality is a bit more complex than the MIDI functionality, being
directly involved in pitch detection and audio-to-MIDI translation in addition to
handling playback, recording, and opening/saving files.
Figure 9. The AudioCentral class. Not shown are two inner classes used for playback and recording, respectively.
An object implementing the Line interface may be viewed as an audio transport path
to or from the system. Mixers and ports are both lines, although when speaking of
lines we usually refer to lines going into or out from the mixer. To capture audio, we
acquire a TargetDataLine from which the signal is read. For playback, we may use
either a SourceDataLine or a Clip. While the former is continuously fed with audio
data during playback by writing to its buffer, the latter lets all the data be loaded from
the beginning. This results in lower playback latency, and also makes it possible to
jump between different positions in the audio (which may be desired for fast
forward/rewind functions). Also, looping of the audio data is directly supported by
the Clip class. Hence, unless the audio data requires too much memory to be loaded
at once, or is not known in its entirety at the start of playback, a Clip is generally to be
preferred over a SourceDataLine. Clip is the playback line of choice in the
implementation.
Supposedly, there have been issues with out-of-memory errors when attempting to
load clips greater than 5 MB. However, no such issues have been encountered during
development. As an example, 10 MB of audio was recorded, played back, and re-
loaded from file with no problems whatsoever.
An audio clip is played back by calling Clip’s start() method. When playback is
complete, a LineEvent is dispatched. In the implementation, this is noticed by the
Controller instance, which is registered as a LineListener with the Clip.
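
A condensed sketch of this playback scheme (the file name and error handling are illustrative only):

    import java.io.File;
    import javax.sound.sampled.*;

    static void playOnce(File audioFile) throws Exception {
        AudioInputStream stream = AudioSystem.getAudioInputStream(audioFile);
        final Clip clip = AudioSystem.getClip();
        clip.open(stream);                              // all audio data is loaded up front
        clip.addLineListener(new LineListener() {       // the Controller plays this role in the application
            public void update(LineEvent event) {
                if (event.getType() == LineEvent.Type.STOP) {
                    clip.close();                       // playback finished (or was stopped)
                }
            }
        });
        clip.start();
    }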
Again, as was mentioned in section 3.3.2, if the metronome is used during audio
recording, it may come to be recorded, depending on system settings.
The sample rate and bit depth used for audio recording may be specified via the
‘Settings’ menu in the GUI. Currently, four pre-determined sample rates are listed, all
assumed (by means of a rather unsophisticated test performed at application launch)
to be supported by the system. Note, however, that playback and audio-to-MIDI translation are supported for any available sample rate.
In section 3.3.2, the omission of MIDI volume control in the current implementation
was discussed. Controlling audio volume does not present similar difficulties, but for
reasons of consistency an audio volume control was omitted as well.
The application supports pitch detection of (definite-pitched) sounds from about the
note F# (at approximately 92.5 Hz) and upward, depending on sample rate. From the
audio-to-MIDI panel in the GUI, the thresholds used to filter sounding pitches can be
adjusted.
[Figure: class diagram of the pitch detection components. PitchDetector offers constructors PitchDetector() and PitchDetector(sampleRate, windowSize) and the methods setWindowSize(), setSampleRate(), setWindowFunction(), setLowThreshold(), setHighThreshold(), getMagnitudeSpectrum(), prepareForTranslation(), and getPitches(). It holds one IterativeFFT (with setNumSamples() and getMagnitudes()) and one WindowFunction, an interface with a single apply() method, implemented by GaussWindow, HammingWindow, BlackmanWindow, and RectangularWindow.]
Generally, analysis logic resides within the PitchDetector class, while processing is
performed by the IterativeFFT and WindowFunction instances.
The design has basically everything even remotely related to signal processing go via
the PitchDetector. This includes producing the data used to plot the frequency
spectrum in the GUI, although no actual pitch detection is performed in that case.
The pitch detection utilizes an iterative radix-2 FFT algorithm, meaning only sample windows whose size is a power of two are supported. With this
restriction in mind, the window size is set depending on the sample rate in an attempt
to strike a balance between frequency resolution and time resolution. For example, for
audio sampled at 44100 Hz, a window size of 8192 samples will be used. This
corresponds to roughly 0.186 seconds and gives a frequency resolution of
approximately 5.4 Hz, which is sufficient to correctly identify the note F# at about
92.5 Hz. On the other hand, with a sample rate of 8000 Hz, a 1024-sample window
will be used, which corresponds to 0.128 seconds and a frequency resolution of
7.8125 Hz. Here, we can only reliably detect notes down to d at about 146.8 Hz.
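
The trade-off can be summarized in a small sketch (the figures in the comments are the ones quoted above; the rule of thumb in the last comment is an approximation, not a statement taken from the application):

    /** Frequency resolution (bin width in Hz) of an n-point transform. */
    static double binWidthHz(float sampleRate, int windowSize) {
        return sampleRate / windowSize;
    }

    /** Window length in seconds. */
    static double windowLengthSeconds(float sampleRate, int windowSize) {
        return windowSize / sampleRate;
    }
    // binWidthHz(44100, 8192) ~ 5.38 Hz,  windowLengthSeconds(44100, 8192) ~ 0.186 s
    // binWidthHz(8000, 1024)  = 7.8125 Hz, windowLengthSeconds(8000, 1024)  = 0.128 s
    // A note is resolvable from its neighbours roughly when the semitone spacing at its
    // frequency, about 0.059 * f, exceeds the bin width.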
The transform is implemented in the class IterativeFFT. The ‘n for the price of n/2’
procedure mentioned in section 2.1.4 is employed, along with miscellaneous smaller
tweaks such as using bit-shift operators for multiplications and divisions by a power of
two. Also, n is required to have been specified before using the transform. This allows
pre-computation of constants specific to a transform of a given size; such values are
often referred to as twiddle factors.
It may be noted that IterativeFFT does not currently have a method that returns the
actual transform, i.e. a sequence of complex numbers; there is only the
getMagnitudes() method, which returns (the first half of) the magnitude spectrum.
Also mentioned in section 2.1.4 is the phenomenon of spectral leakage. Via the
‘Settings’ menu, the user may choose from several window functions. These are
implemented as ‘function objects’; they implement the WindowFunction interface
and provide a single apply() method.
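
A sketch of this design, with the Hamming window of section 2.1.4 as the example implementation (the interface shape matches the description above; the exact code differs):

    /** Function-object style: a window function is applied to a sample buffer in place. */
    interface WindowFunction {
        void apply(double[] data);
    }

    class HammingWindow implements WindowFunction {
        public void apply(double[] data) {
            int n = data.length;
            for (int m = 0; m < n; m++) {
                data[m] *= 0.54 - 0.46 * Math.cos(2.0 * Math.PI * m / (n - 1));
            }
        }
    }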
The pitch detection algorithm utilizes the harmonic product spectrum, as described in
section 2.1.3. When the magnitude of some frequency in the HPS exceeds a certain
high threshold, that frequency is added to a list of currently sounding frequencies.
Similarly, when the magnitude falls below a certain low threshold, the frequency is
removed from the list.
Currently, these thresholds are calculated from the number of harmonics considered
when constructing the HPS, the size of the sample window, and the values specified
via the threshold spinners in the GUI. In other words, the thresholds do not take the
power of the signal into account. This causes the pitch detection to behave somewhat
differently depending on sample rate and overall volume.
The PitchDetector keeping track of currently sounding pitches makes it perhaps even
more intertwined with the audio-to-MIDI functionality. Indeed, it was designed with
audio-to-MIDI in mind. There is no separate audio-to-MIDI object; all methods of
the translation mechanism are defined within the AudioCentral, utilizing the
functionality of the PitchDetector.
4 Tests and analysis
Within the scope of the thesis, the ambition regarding audio-to-MIDI functionality
has been to achieve good results at least in the single-voice case with not too complex
sounds. In this chapter, we shall consider this and other cases by examining pitch
detection functionality and timing accuracy. Regarding pitch detection, it will be
examined both with respect to tone complexity and with respect to the number of
simultaneous notes.
Two different tests, or test series, have been performed. One has a more ‘clinical’
approach, while the other is perhaps closer to real-world use cases.
In this section, the different tests were created to be somewhat comparable with each
other, despite being of different character. Recordings of three musical passages were
constructed by first creating MIDI versions of the passages via manual editing, and
then exporting them as 16-bit audio files with a sample rate of 44100 Hz. Two
recordings of each passage were done; one using a square wave sound, and one using
a piano sound. In the piano sound, a slight reverb was present. Within the application,
each recording was then opened, translated to MIDI, and saved. A resolution of 96
PPQ was used, and the tempo was set to match that of the recording. The data in the
resulting MIDI files was then imported into the MIDI editor and compared with the
original MIDI used to produce the audio files. In this editor (specifically, the key
editor in Steinberg’s Cubase SX3), the ‘y axis’ shows the notes and the ‘x axis’ their
durations; it can be thought of as a ‘musical’ version of a frequency/time plot.
Due to the number of configurations possible, the testing was restricted to achieving
‘good enough’ or typical translations. In some cases, this has meant accepting a few
non-detected or falsely detected notes. While better results may at times be possible to
achieve through more careful parameter selection, general improvements in the pitch
detection procedure are of more interest; the errors in the translations highlight the
weaknesses of the current mechanism.
4.2.1 Monophony
Figure 11. The monophonic test case. It spans almost four octaves and has a variety of note values.
The single-voice line illustrated above was specifically created to be demanding. The
fundamental frequencies of the notes range from approximately 277 to 2093 Hz, and
played at a tempo of 100 BPM, the line contains notes as short as 1/10 of a second.
Viewed in the MIDI editor, the original MIDI appears as below.
Figure 12. The accelerating, ascending single-line test case as written in the
MIDI editor. The ’x axis’ shows bars with quarter-note subdivisions. MIDI note
C3 (having MIDI note number 60) corresponds to c´ at approximately 261.6 Hz.
This test case should allow us to gain a fairly good grasp of the application’s ability to
handle rather short notes, and also the behavior of the pitch detection mechanism at
different frequencies. The lower frequencies are particularly interesting with the piano
sound, since lower notes often have more pronounced overtones.
We proceed with examining the square-wave recording of this test case. The non-
complex character of square waves makes a good translation easy to obtain. In fact,
the translation is 100% correct with regards to pitch, and the timing differences
compared to the original are insignificant.
Clearly, a piano sound is more difficult to handle than a square wave sound, for
several reasons. For one thing, square waves are truly periodic, while a piano tone
changes notably with time. Also, as the hammer strikes the strings, the mechanism
introduces a certain ‘thump’ sound, which may be more or less prominent depending
on register.
It should also be pointed out that with square waves, all notes are created equal. With
a piano, all notes are different, since each note is produced from one or more
individual strings sounding together. Piano sample libraries vary in how accurately this
is reflected; some high-quality libraries do indeed sample each key individually, while
other libraries may have sampled some lesser number of keys and then utilized pitch-
shifting to fill the gaps. Similarly, the number of velocity levels at which each key is
sampled may differ.
When doing audio-to-MIDI translation of music with a piano sound, the threshold
values are of greater importance since the notes are not constant in volume while they
sound. Due to for example transients, resonance, and noise, using an ‘aggressive’
window function (such as a Gauss or Blackman window) may be required. As an
example, consider the following translation.
Figure 14. A not-quite-perfect translation. The first two bars contain incorrectly
detected pitches (marked as black), and in the third bar a note is missing.
In Figure 14, we see that the two lowest notes (c# and e at about 138.6 and 164.8 Hz,
respectively) are both detected along with their second and third harmonics, and there
is also a false detection of the note c below c# at the very beginning. Notes from f to
f#´ (approximately within the span 174.6 – 370.0 Hz) are paired with their respective
second harmonics. Thereafter, the single notes are correctly detected with the
exception of a missing h´´ near the end of the third bar, most likely due to reverb.
Rhythmic accuracy is decent, although the low threshold probably could have been
raised slightly to trigger earlier note off events.
Figure 15. Our two-voice test case: the beginning of Bach’s two-part invention no. 13, BWV 784.
Figure 16. The original edit of the two-voice test case as appearing in the
editor. For clarity, the ’x-axis’ is subdivided down to sixteenth-note
durations.
The square-wave recording of this case was translated to MIDI without any problems.
Figure 17. A translation of the two-voice test played with a square wave
sound.
Again we see that the nature of square waves makes them easy to handle by the
application. Some insignificant rhythmic inaccuracies aside, all notes have been
properly detected; the octaves did not pose a problem here. Possibly the low threshold
could have been raised to further lessen the occasional slight overlapping of notes.
Moving on, the following translation of the piano recording was obtained.
Figure 18. A translation of the two-voice passage played back with a piano
sound. Incorrect notes are marked as black.
Looking at Figure 18, a strong third harmonic of the first note (A at 110 Hz) may
explain some of the oddities in the beginning of the result. Throughout, we see
numerous notes being incorrectly paired with notes corresponding to their second
harmonic (i.e. the octave). The rhythmic accuracy of note onsets is generally good,
with slight deviations in either direction.
For a three-voice test case, we take a segment from a Bach fugue. Here, a frequency
range of roughly 175-831 Hz is covered. Played at 80 BPM, the shortest note duration
is 3/16 of a second.
Figure 19. Bars 7-8 of the c-minor fugue in the first volume of Bach’s Das Wohltemperirte Clavier, BWV 847.
As in the previous case, no accentuation or phrasing specifics are present in the MIDI
edit of the passage. In the editor, this passage appears as in Figure 20.
Again, proper detection of octave intervals between voices is of interest. The square
wave recording did not present much difficulty, as seen in Figure 21 below.
The highest note on the second beat is missing. As we can see, it was determined that
no new note was played, but instead the preceding note of the same pitch was held for
another sixteenth-note length. In some cases, such errors should be possible to avoid by tweaking the threshold values, but it is problematic with a square wave sound since the character and volume remain constant throughout the duration of the note. Thus, the application can only rely on the minuscule silence between the repeated notes to
tell them apart, something that would require a significant (and most likely, practically
impossible) timing resolution. However, other than the missing note, the translation is
correct, with only very slight rhythmic deviations compared to the original.
Translating the piano-sound version, we once again see the familiar octave errors. In
some cases, the octave errors result in missing notes, since the algorithm does not
issue new note on commands if it finds the note to be already sounding. It typically
looks worse than it sounds since octaves are harmonically consistent with the correct
notes, but errors are errors nevertheless.
The repeated note which was missing in the square-wave case was properly detected
here, owing to the difference in character between the beginning and end of a piano
tone. We also see variations in the note durations, although the rhythms are generally
correct.
Since the audio of the tests in the preceding section originated from MIDI
instruments, the signal was very clean. In this section, we shall perform two further
tests which may be closer to most practical circumstances. All audio was recorded
using the application, and the metronome functionality was used to provide a
rhythmic point of reference. Since there is no source MIDI to compare the results
with, as was the case in section 4.2, the accuracy of the translations will be evaluated
largely by ear, although the MIDI editor will be used for visualization.
The following line was written to incorporate common electric guitar playing
techniques, such as sweep picking, hammer-ons, and pull-offs. It was recorded using a
solid-body electric guitar with a clean tone, save for a very slight reverb effect. The
tone was dialed in to have a typical jazz guitar character, with rolled-off highs.
Figure 23. A jazz line. It was played with a loose swing feel at a tempo of 160 BPM. By convention it is notated one octave above sounding pitch. Appropriately, the recording was made late at night.
This test should present a number of difficulties. For example, the tonal characteristics
of each note will vary depending on which string it was played on, whether it was
picked or not, and if it was, whether it was an upstroke or a downstroke. Also, the
strings used were less than new; the fresher the strings, the cleaner the tone.
Figure 24. MIDI translation of the jazz lick. The ‘x axis’ is subdivided into eighth-note
triplets. While there are some false detections and rhythmic deviations, it is essentially
correct.
We see that there are some octave errors (even featuring false detections two octaves
above the fundamental, corresponding to the fourth harmonic). There is also an
obvious misdetection at the very end. We also see that the durations of the
erroneously detected notes are short compared to those of the simultaneously sounding
correct notes. The rhythms are fairly accurate compared with the recording, and
perhaps worthy of special mention is that the initial sweep-picked arpeggio was
translated correctly. Judging from the figure, those notes are roughly 1/16 of a second
apart.
[Figure: the chord progression test case (musical notation not reproduced).]
Again, several factors will affect the possibility of getting a good MIDI translation, for
example how forcefully each string was picked (finger-picking this time) and where it
was picked; transients, particularly those of the lower notes, might cause some
problems with correct note onset detection.
As we may guess, the translation in Figure 26 does not sound fantastic, although the
chords are discernible. There are a number of notes present that should not be; some
due to octave errors, some probably due to transients or string noise. Also, there are
some missing notes – the highest note is absent in the third, fifth, and sixth chord of
the progression.
We have tested the application’s audio-to-MIDI functionality with music and sounds
of varying complexity and character, using recordings with differing noise levels.
Unsurprisingly, clean signals with fairly simple sounds do not appear to be much of a
problem. With the square-wave recordings, the number of voices did not seem to be
very significant with regard to the accuracy of the result; except for a single missing
note in the three-voice polyphony test, due to specific difficulties with repeated notes,
these recordings were all translated correctly.
The more complex sounds of the piano were more difficult for the application to deal
with, and we began to see octave errors and some missing notes. As mentioned in
section 2.1.3, octave errors are a common problem with the HPS method of pitch
detection. While the single-note line played on the electric guitar was handled decently,
albeit with some errors, there were obvious problems with the acoustic guitar
chord progression. In addition to properties of the guitar and the way the chord
progression was performed, factors such as noise level and the placement and
frequency response of the microphone come into play.
5 Discussion and future work
5.1 Results
Within the scope of the thesis, the functionality of the application is to be considered
satisfactory. Below, we shall discuss this in terms of general quality and features, with
special emphasis on the audio-to-MIDI functionality.
Since the audio-to-MIDI functionality is the key feature, other features have not been
subject to rigorous testing. Due to time constraints, little ‘user-proofing’ was done
during the development of the application. Although there are exceptions, the
application does not currently do much to handle user mistakes and unintended usage
scenarios.
With the exception of volume controls, all originally planned features and some
additional ones (such as the metronome) were implemented. While at times somewhat
unpolished, each feature serves its purpose sufficiently well to allow the user to obtain
results without having to resort to workarounds. Envisioned improvements and
additions to the feature set are discussed in more detail later in this chapter.
The ambition, within the scope of the thesis, was audio-to-MIDI functionality good
enough to properly handle at least monophonic music with non-complex sounds. This
has been achieved and surpassed; as long as the sounds are of simple character, the
application copes well even with polyphonic material.
More complex sounds, however, may pose some problems. Apart from the typical
octave errors of the HPS method, we may also see missing notes and even notes
which are flat out wrong – notes that are not harmonics of any of the actual notes in
the source, and at first glance appear to have come out of nowhere. Although these
translations are ‘essentially correct’ – that is, the original music is clearly discernible
and not obscured by the errors – a higher degree of accuracy is obviously desired.
There are several factors which can lead to errors. Transients can often produce false
positives, and may also interfere with note onset detection. Oscillations in the
amplitudes of the partials may be a key factor in the occurrence of octave errors (and
errors involving harmonics other than the first), and both note onset and note offset
detection can be affected. There is also the problem of ‘non-uniformity’ often present
in instruments. For example, a note played on one particular string of a guitar has a
different character than the same note played on another string; even though they are
the same note, their spectra can differ quite notably. This may be of particular
significance in polyphonic cases.
In some cases, errors can be avoided by tweaking the threshold values and using an
appropriate window function. However, this is not really a solution to the problem,
which is more about the core pitch detection algorithm than superficial parameter
twiddling. Thus, even though the thesis ambition has been realized, further refinement
is needed before the application can be said to have ‘release quality’. Some further
thoughts on this are mentioned in section 5.2.1.
5.2 Feature improvements and additions
Existing features can, of course, be improved upon in several areas, and there are also
a number of new features which would be of benefit to the general usability of the
application. Some ideas and suggestions are mentioned below, in no particular order.
Being a central part of the application, pitch detection functionality can never be good
enough. Several possible improvements come to mind. A ‘minor’ improvement would
be a better formula for threshold calculation. However, the underlying pitch
detection algorithm itself should be improved – currently, only a harmonic product
spectrum is used, and while this does a decent job it has a number of shortcomings.
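To make the discussion concrete, the core of the HPS calculation is small. The following is a minimal sketch of the idea only; the class and array names and the number of harmonics are illustrative and not taken from the application’s source code.

    /**
     * Illustrative harmonic product spectrum sketch: the magnitude spectrum is
     * multiplied with downsampled copies of itself, so that a bin whose
     * harmonics are also strong stands out as the fundamental.
     */
    public final class HpsSketch {

        /** Returns the index of the strongest bin of the harmonic product spectrum. */
        public static int strongestBin(double[] magnitude, int harmonics) {
            int limit = magnitude.length / harmonics;
            double[] hps = new double[limit];
            for (int bin = 0; bin < limit; bin++) {
                double product = 1.0;
                for (int h = 1; h <= harmonics; h++) {
                    product *= magnitude[bin * h];  // downsampled copy number h
                }
                hps[bin] = product;
            }
            int best = 0;
            for (int bin = 1; bin < limit; bin++) {
                if (hps[bin] > hps[best]) {
                    best = bin;
                }
            }
            return best;
        }
    }

The detected frequency then corresponds roughly to the winning bin index multiplied by the sample rate and divided by the FFT size. One refinement sometimes suggested against octave errors is to check whether the bin at half the winning index is nearly as strong, and if so prefer it.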
Pitch detection algorithms can be, and often are, especially well suited to a particular
type of music or sound; naturally, the requirements of the algorithm differ depending
on whether we intend to use it for sine tones or a saxophone, a single melody or a
chord progression, and so on. The general functionality of the application could be
enhanced by letting the user choose from several pitch detection methods.
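One way to support this would be a small common interface that each detection method implements, so that the GUI could simply present the available implementations in a drop-down list. The type and method names below are hypothetical; this is a sketch of the structure, not planned code.

    /** Hypothetical abstraction over interchangeable pitch detection methods. */
    public interface PitchDetector {
        /** A short name suitable for a drop-down list in the GUI. */
        String getName();

        /** Returns the detected fundamental frequencies (in Hz) for one analysis frame. */
        double[] detect(double[] frame, float sampleRate);
    }

    /** The existing HPS method would simply become one implementation among several. */
    class HpsDetector implements PitchDetector {
        public String getName() { return "Harmonic product spectrum"; }
        public double[] detect(double[] frame, float sampleRate) {
            // ...the existing HPS code would go here...
            return new double[0];
        }
    }

This is essentially the strategy pattern; adding a new detection method would then not require changes to the rest of the application.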
Some window functions – for example the Gauss and Blackman windows – have
parameters which can be configured to obtain a certain ‘flavor’ of the window. In the
application, these parameters are treated as constants, but further user control of the
details of the window functions might be a nice touch. Allowing user-defined window
functions may even be conceivable, but that would be a rather substantial change.
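For reference, common textbook forms of both windows expose a single shape parameter each; the sketch below uses those standard formulas, not the constants currently hard-coded in the application.

    /** Common textbook parameterizations of two window functions. */
    public final class Windows {

        /** Gaussian window; sigma (typically at most 0.5) controls the width of the bell. */
        public static double[] gauss(int length, double sigma) {
            double[] w = new double[length];
            double center = (length - 1) / 2.0;
            for (int n = 0; n < length; n++) {
                double x = (n - center) / (sigma * center);
                w[n] = Math.exp(-0.5 * x * x);
            }
            return w;
        }

        /** Generalized Blackman window; alpha = 0.16 gives the conventional window. */
        public static double[] blackman(int length, double alpha) {
            double[] w = new double[length];
            for (int n = 0; n < length; n++) {
                double phase = 2.0 * Math.PI * n / (length - 1);
                w[n] = (1 - alpha) / 2.0 - 0.5 * Math.cos(phase)
                        + (alpha / 2.0) * Math.cos(2.0 * phase);
            }
            return w;
        }
    }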
Some editing functionality would be convenient. For example, the user may wish to
adjust or remove incorrect MIDI notes, or trim off silence from the beginning and
end of the audio data. User-controlled audio filtering may also be desired.
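Trimming silence, for example, is straightforward once the samples are available as an array. The sketch below is a minimal illustration; in practice the threshold would be user-configurable or derived from the noise floor rather than a fixed constant.

    import java.util.Arrays;

    /** Removes leading and trailing samples whose amplitude stays below a threshold. */
    public final class SilenceTrimmer {

        public static double[] trim(double[] samples, double threshold) {
            int start = 0;
            while (start < samples.length && Math.abs(samples[start]) < threshold) {
                start++;
            }
            int end = samples.length;
            while (end > start && Math.abs(samples[end - 1]) < threshold) {
                end--;
            }
            return Arrays.copyOfRange(samples, start, end);  // empty if all silence
        }
    }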
While the application supports audio of various sample rates and bit depths, it is still
restricted to PCM-encoded mono files. In addition to PCM, Java Sound has native
support for A-law and µ-law encoding, but not for other formats such as mp3 or Ogg
Vorbis. Having the application support more formats would spare the user from having
to convert files in another application before working with them.
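For the encodings Java Sound does know about, the conversion itself is a single call through AudioSystem; formats such as mp3 or Ogg Vorbis would additionally require a third-party service provider on the classpath before the same call could succeed. A sketch:

    import java.io.File;
    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioInputStream;
    import javax.sound.sampled.AudioSystem;

    public final class DecodeToPcm {

        /** Opens a file and asks Java Sound for a signed-PCM view of its contents. */
        public static AudioInputStream openAsPcm(File file) throws Exception {
            AudioInputStream in = AudioSystem.getAudioInputStream(file);
            // A-law and mu-law (and, with extra SPI jars installed, other formats)
            // can be converted to linear PCM like this.
            return AudioSystem.getAudioInputStream(AudioFormat.Encoding.PCM_SIGNED, in);
        }
    }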
As for stereo support, it is mostly a question of how to implement the pitch detection
functionality. For example, pitch detection could be performed on each channel
separately, or the audio could be converted to mono before pitch detection. Which is
more appropriate might depend on the audio in question, so it is possible that the user
should be able to choose the method manually.
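The mono-conversion alternative is trivial if the two channels are available as separate sample arrays; a sketch under that assumption:

    /** Averages left and right channels into a single mono signal. */
    public final class MonoMixer {

        public static double[] toMono(double[] left, double[] right) {
            int length = Math.min(left.length, right.length);
            double[] mono = new double[length];
            for (int i = 0; i < length; i++) {
                mono[i] = 0.5 * (left[i] + right[i]);  // simple average; avoids clipping
            }
            return mono;
        }
    }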
Occasionally, it might be desirable to have an audio file and its corresponding MIDI
file connected in some way, together with any relevant settings (such as volume
settings). This would, for example, allow ‘simultaneous’ loading of
audio and its MIDI translation. Such a feature could be quickly implemented by letting
a text file represent each project, linking an audio file with a MIDI file and storing
various settings.
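Java’s Properties class would be sufficient for such a file; the key names in the sketch below are hypothetical.

    import java.io.FileReader;
    import java.io.FileWriter;
    import java.util.Properties;

    /** Minimal 'project file' linking an audio file, a MIDI file, and a few settings. */
    public final class ProjectFile {

        public static void save(String path, String audioFile, String midiFile,
                                int metronomeBpm) throws Exception {
            Properties settings = new Properties();
            settings.setProperty("audio", audioFile);       // hypothetical key names
            settings.setProperty("midi", midiFile);
            settings.setProperty("metronome.bpm", Integer.toString(metronomeBpm));
            FileWriter out = new FileWriter(path);
            settings.store(out, "Audio-to-MIDI project");
            out.close();
        }

        public static Properties load(String path) throws Exception {
            Properties settings = new Properties();
            FileReader in = new FileReader(path);
            settings.load(in);
            in.close();
            return settings;
        }
    }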
As noted in section 3.3.2, proper control of MIDI playback volume is not completely
straightforward to implement while keeping user convenience in mind, and therefore this
feature (together with audio volume control for consistency reasons, as mentioned in
section 3.4.2) was left out of this version. It should, however, be present in a future
version. The easiest way to handle it would be to require the user to have a soundbank
installed, but whether this is the most appealing solution is another matter.
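Assuming a synthesizer (and thus a soundbank) is available, playback volume can be set by sending the standard MIDI channel volume controller (controller number 7) to each channel; a sketch:

    import javax.sound.midi.MidiChannel;
    import javax.sound.midi.MidiSystem;
    import javax.sound.midi.Synthesizer;

    public final class MidiVolume {

        /** Sets the MIDI channel volume controller (7, range 0-127) on all channels. */
        public static void setVolume(Synthesizer synth, int volume) {
            for (MidiChannel channel : synth.getChannels()) {
                if (channel != null) {
                    channel.controlChange(7, volume);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Synthesizer synth = MidiSystem.getSynthesizer();
            synth.open();
            setVolume(synth, 64);  // mid-range value
            synth.close();
        }
    }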
Currently, the only ‘tape deck’ commands available are play, stop, and record. Being
able to pause playback, or fast forward/rewind to a specific position, may in some
cases be desirable. Such a feature would benefit from a ‘song position’ GUI element,
perhaps even a slider with which the user can interact directly.
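Java Sound’s Sequencer already exposes the methods such a feature would need; the sketch below shows pause/resume and seeking, with a hypothetical mapping from a slider value to a playback position.

    import javax.sound.midi.Sequencer;

    /** Transport helpers built on the Sequencer's position methods. */
    public final class Transport {

        /** Pausing is simply stopping without rewinding; start() resumes from the same position. */
        public static void pause(Sequencer sequencer) {
            if (sequencer.isRunning()) {
                sequencer.stop();
            }
        }

        public static void resume(Sequencer sequencer) {
            sequencer.start();
        }

        /** Maps a hypothetical slider value in the range 0-1000 to a position in the sequence. */
        public static void seek(Sequencer sequencer, int sliderValue) {
            long length = sequencer.getMicrosecondLength();
            sequencer.setMicrosecondPosition(length * sliderValue / 1000);
        }
    }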
5.2.7 GUI
As mentioned in section 3.2.2, the GUI was hand-written with the aim of providing
sufficient functionality. This means that several things could be improved, not least
from an aesthetic point of view.
From a functional perspective, the most important issue is probably the plotting
functionality. The purpose of the plots in the current version of the application is
mostly to provide some visual feedback. They may be improved in several ways, most
importantly by adding properly graded and labeled axes. Another idea is to allow the
user to interact with the plots using the mouse, for instance setting the frequency
ranges and thresholds for pitch detection.
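As a sketch of the interaction idea – the panel, its fields, and the linear frequency axis are all hypothetical – a mouse click on a spectrum plot could be mapped back to a frequency as follows.

    import java.awt.event.MouseAdapter;
    import java.awt.event.MouseEvent;
    import javax.swing.JPanel;

    /** Hypothetical spectrum plot panel that lets a click select a frequency limit. */
    class SpectrumPanel extends JPanel {

        private double minFrequency = 0.0;
        private double maxFrequency = 4000.0;
        private double selectedFrequency;

        SpectrumPanel() {
            addMouseListener(new MouseAdapter() {
                @Override
                public void mouseClicked(MouseEvent e) {
                    // Map the horizontal pixel position linearly onto the frequency axis.
                    double fraction = e.getX() / (double) getWidth();
                    selectedFrequency = minFrequency
                            + fraction * (maxFrequency - minFrequency);
                    repaint();
                }
            });
        }

        double getSelectedFrequency() {
            return selectedFrequency;
        }
    }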
An instrument tuner requires higher frequency resolution than the pitch detection
needed for ‘conventional’ audio-to-MIDI translation. While the latter may only need
to be able to assign a given pitch to the closest note in the equally tempered 12-tone
scale, an instrument tuner must be able to tell clearly how far from ideal pitch the
signal is. There is also a real-time requirement. Obviously, a properly tuned instrument
is desired for correct audio-to-MIDI translation, and the addition of tuning
functionality would broaden the usability of the application.
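The deviation itself is easy to compute once a sufficiently accurate frequency estimate is available; the sketch below uses the standard equal-temperament relation, with A4 = 440 Hz assumed as the reference.

    /** Converts a detected frequency into the nearest note and its deviation in cents. */
    public final class Tuner {

        public static String describe(double frequency) {
            // MIDI note number on a continuous scale; 69 corresponds to A4 = 440 Hz.
            double exactNote = 69.0 + 12.0 * (Math.log(frequency / 440.0) / Math.log(2.0));
            long nearestNote = Math.round(exactNote);
            double cents = (exactNote - nearestNote) * 100.0;  // at most +/- 50 cents
            return "nearest MIDI note " + nearestNote
                    + String.format(", %+.1f cents", cents);
        }

        public static void main(String[] args) {
            System.out.println(describe(446.0));  // a slightly sharp A4
        }
    }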
5.3 Concluding remarks
Pitch detection is an interesting subject with many and diverse applications, for
example speech analysis and music information retrieval services such as ‘query by
humming’. Since MIDI data has a ‘musical character’, is easy to edit, and is widely
used in music applications, audio-to-MIDI is a rather natural approach to the problem
of automatic music transcription.
Java was chosen as the programming language for several reasons. First, it has a rich
and well-documented API. Second, it should (hopefully) allow the application to run
on different platforms (although this may require some extra work due to platform
specifics – ‘write once, run everywhere’ is by no means a given). The
choice of Java was also motivated by the author’s personal curiosity regarding the
sound programming possibilities of the language.
While there are, perhaps, some quirks in Java’s sound API, there do not seem to be
any particular problems with implementing this kind of application in Java. While
other languages such as C++ may perform better, Java’s performance seems decent
enough at least for the time being, and keeps the option of platform independence
open. Thus, there are at this stage no plans to switch to another language, and further
development will build upon the current code base.
Appendices
Appendix A: References
[1] Shepard, Roger N., ‘Circularity in Judgments of Relative Pitch’. In Journal of the
Acoustical Society of America, vol. 36, issue 12, December 1964, pp. 2346-2353.
[2] Berg, Richard E., and Stork, David G. The Physics of Sound, 2nd ed. Prentice-Hall,
Englewood Cliffs, New Jersey, 1995, p. 156.
[3] Tadokoro, Y., Matsumoto, W., and Yamaguchi, M. ‘Pitch detection of musical
sounds using adaptive comb filters controlled by time delay’. In Proceedings of the
International Conference on Multimedia and Expo (ICME), August 2002, Lausanne,
Switzerland, Vol. 1, pp. 109-112.
[4] Sound and Vision Engineering Department, University of Gdansk, 2000, ‘Pitch
Detection Methods’, most recently viewed on April 2, 2009,
http://sound.eti.pg.gda.pl/student/eim/synteza/leszczyna/index_ang.htm.
[5] Barry, John M., Polyphonic Music Transcription Using Independent Component Analysis,
Master’s Thesis, Churchill College, University of Cambridge, April 2003, p. 5.
[6] de la Cuadra, P., Master, A., and Sapp, C. ’Efficient Pitch Detection Techniques
for Interactive Music’. In Proceedings of the International Computer Music Conference
(ICMC), 2001, Havana, Cuba, pp. 403-406.
[8] Bogert, B. P., Healy, M. J. R., and Tukey, J. W. ’The Quefrency Alanysis of
Time Series for Echoes: Cepstrum, Pseudo Autocovariance, Cross-Cepstrum
and Saphe Cracking’. In Proceedings of the Symposium on Time Series Analysis,
Chapter 15, 1963, New York, pp. 209-243.
[9] Savard, A. ‘An Overview of Pitch Detection Algorithms’, lecture slides, Schulich
School of Music, McGill University, Canada, February 2006, most recently
viewed on April 2, 2009,
http://www.music.mcgill.ca/~savard/Presentation_Pitch_Tracking.ppt
[10] Master, Aaron S., Speech Spectrum Modelling from Multiple Sources, Master’s Thesis,
Churchill College, University of Cambridge, August 2000, p. 32.
[11] Engineering Productivity Tools Ltd., ‘FFT of Pure Real Sequences’, 1999, most
recently viewed on April 2, 2009,
http://www.engineeringproductivitytools.com/stuff/T0001/PT10.HTM.