
Audio Compression Notes (Data Compression)

These are notes on audio compression. Please view them; they are really useful when preparing for exams.

Uploaded by

infinityankit

Chapter 2

AUDIO COMPRESSION
Digital Audio, Lossy sound compression, μ-law and A-law Companding, DPCM and ADPCM
audio compression, MPEG audio standard, frequency domain coding, format of compressed
data.

1. Introduction:
Two important features of audio compression are (1) it can be lossy and (2) it requires fast
decoding. Text compression must be lossless, but images and audio can lose much data without a
noticeable degradation of quality. Thus, there are both lossless and lossy audio compression
algorithms. Often, audio is stored in compressed form and has to be decompressed in real time
when the user wants to listen to it. This is why most audio compression methods are asymmetric.
The encoder can be slow, but the decoder has to be fast. This is also why audio compression
methods are not dictionary based. A dictionary-based compression method may have many
advantages, but fast decoding is not one of them.
We can define sound as:
(a) An intuitive definition: Sound is the sensation detected by our ears and interpreted by our
brain in a certain way.
(b) A scientific definition: Sound is a physical disturbance in a medium. It propagates in the
medium as a pressure wave by the movement of atoms or molecules.
Like any other wave, sound has three important attributes: its speed, amplitude, and period.
The speed of sound depends mostly on the medium it passes through, and on the temperature. The
human ear is sensitive to a wide range of sound frequencies, normally from about 20 Hz to about
22,000 Hz, depending on a person's age and health. This is the range of audible frequencies. Some
animals, most notably dogs and bats, can hear higher frequencies (ultrasound). Loudness is
commonly measured in units of dB SPL (sound pressure level) instead of sound power. The
definition is
$$\text{Level} = 10\log_{10}\frac{P_1}{P_2} = 10\log_{10}\frac{A_1^2}{A_2^2} = 20\log_{10}\frac{A_1}{A_2}\ \text{dB SPL},$$
where P_1 and P_2 are the two power levels being compared and A_1 and A_2 are the corresponding
pressure amplitudes (power is proportional to the square of the amplitude).
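As a quick check of the decibel definition, here is a minimal Python sketch that evaluates both forms. The function names are illustrative assumptions and do not come from the notes.

```python
import math

def level_db_from_amplitude(a1, a2):
    """Level of amplitude a1 relative to reference amplitude a2, in dB."""
    return 20 * math.log10(a1 / a2)

def level_db_from_power(p1, p2):
    """Level of power p1 relative to reference power p2, in dB."""
    return 10 * math.log10(p1 / p2)

# Doubling the amplitude quadruples the power, so both forms agree.
print(level_db_from_amplitude(2.0, 1.0))
print(level_db_from_power(4.0, 1.0))
```

Both calls give about 6.02 dB, which is why doubling the amplitude is often described as a gain of roughly 6 dB.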

2. Digital Audio:
Sound can be digitized and broken up into numbers. Digitizing sound is done by measuring the
voltage at many points in time, translating each measurement into a number, and writing the
numbers on a file. This process is called sampling. The sound wave is sampled, and the samples
become the digitized sound. The device used for sampling is called an analog-to-digital converter
(ADC).
Since the audio samples are numbers, they are easy to edit. However, the main use of an
audio file is to play it back. This is done by converting the numeric samples back into voltages that
are continuously fed into a speaker. The device that does that is called a digital-to-analog
converter (DAC). Intuitively, it is clear that a high sampling rate would result in better sound
reproduction, but also in many more samples and therefore bigger files. Thus, the main problem in
audio sampling is how often to sample a given sound.

Figure 1: Sampling of a Sound Wave

Figure 1a shows the effect of a low sampling rate. The sound wave in the figure is sampled
four times, and all four samples happen to be identical. When these samples are used to play back
the sound, the result is silence. Figure 1b shows seven samples, and they seem to follow the
original wave fairly closely. Unfortunately, when they are used to reproduce the sound, they
produce the curve shown as a dashed line. There simply are not enough samples to reconstruct the
original sound wave. The solution to the sampling problem is to sample sound at a little over the
Nyquist rate, which is twice the maximum frequency contained in the sound.

The sampling rate plays a different role in determining the quality of digital sound reproduction. One classic
law in digital signal processing was published by Harry Nyquist. He determined that to accurately reproduce a
signal of frequency f, the sampling rate has to be greater than 2*f. This is commonly called the Nyquist Rate. It
is used in many practical situations. The range of human hearing, for instance, is between 16 Hz and 22,000 Hz.
When sound is digitized at high quality (such as music recorded on a CD), it is sampled at the rate of 44,100 Hz.
Anything lower than that results in distortions.

Thus, if a sound contains frequencies of up to 2 kHz, it should be sampled at a little more
than 4 kHz. Such a sampling rate guarantees true reproduction of the sound. This is illustrated in
Figure 1c, which shows 10 equally spaced samples taken over four periods. Notice that the samples
do not have to be taken from the maxima or minima of the wave; they can come from any point.
The range of human hearing typically starts at 16 to 20 Hz and ends at 20,000 to 22,000 Hz,
depending on the person and on age. When sound is digitized at high fidelity, it should therefore
be sampled at a little over the Nyquist rate of 2 × 22,000 = 44,000 Hz. This is why high-quality
digital sound is based on a 44,100 Hz sampling rate. Anything lower than this rate results in
distortions, while higher sampling rates do not produce any improvement in the reconstruction
(playback) of the sound. We can consider the sampling rate of 44,100 Hz a lowpass filter, since it
effectively removes all the frequencies above 22,000 Hz.
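The distortion caused by sampling too slowly can be seen numerically. The following Python sketch (function names and the chosen frequencies are assumptions made for illustration) samples a 1 kHz tone and a 3 kHz tone at 4 kHz. Since 3 kHz is above half the sampling rate, its samples come out identical to those of the 1 kHz tone, so the two sounds cannot be told apart after sampling.

```python
import math

def nyquist_rate(max_freq_hz):
    """Minimum sampling rate: twice the highest frequency in the sound."""
    return 2 * max_freq_hz

def sample_tone(freq_hz, rate_hz, count):
    """Return `count` samples of a unit-amplitude cosine tone."""
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]

low = sample_tone(1000, 4000, 16)    # below half the sampling rate
high = sample_tone(3000, 4000, 16)   # above half the sampling rate

# True when the two sample sequences are indistinguishable (aliasing).
aliased = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(low, high))
print(aliased, nyquist_rate(22000))
```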
The telephone system, originally designed for conversations, not for digital
communications, samples sound at only 8 kHz. Thus, any frequency higher than 4000 Hz gets
distorted when sent over the phone, which is why it is hard to distinguish, on the phone, between
the sounds of "f" and "s". The second problem in sound sampling is the sample size. Each sample
becomes a number, but how large should this number be? In practice, samples are normally either
8 or 16 bits. Assuming that the highest voltage in a sound wave is 1 volt, an 8-bit sample can
distinguish voltages as low as 1/256 ≈ 0.004 volt, or 4 millivolts (mV). A quiet sound, generating a
wave lower than 4 mV, would be sampled as zero and played back as silence. In contrast, with a
16-bit sample it is possible to distinguish sounds as low as 1/65,536 ≈ 15 microvolts (μV). We can
think of the sample size as a quantization of the original audio data.
Audio sampling is also called pulse code modulation (PCM). The term pulse modulation
refers to techniques for converting a continuous wave to a stream of binary numbers (audio
samples). Possible pulse modulation methods include pulse amplitude modulation (PAM), pulse
position modulation (PPM), pulse width modulation (PWM), and pulse number modulation (PNM).
In practice, however, PCM has proved the most effective form of converting sound waves to
numbers. When stereo sound is digitized, the PCM encoder multiplexes the left and right sound
samples. Thus, stereo sound sampled at 22,000 Hz with 16-bit samples generates 44,000 16-bit
samples per second, for a total of 704,000 bits/sec, or 88,000 bytes/sec.
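The arithmetic in this section is easy to reproduce. The Python sketch below uses hypothetical helper names and assumes a full-scale voltage of 1 volt, as in the text. It computes the smallest voltage step for a given sample size and the raw PCM bitrate.

```python
def quantization_step(full_scale_volts, bits):
    """Smallest voltage difference an n-bit sample can distinguish."""
    return full_scale_volts / (2 ** bits)

def pcm_bitrate(sample_rate_hz, bits_per_sample, channels):
    """Raw PCM bitrate in bits per second."""
    return sample_rate_hz * channels * bits_per_sample

print(quantization_step(1.0, 8))     # about 4 mV
print(quantization_step(1.0, 16))    # about 15 microvolts
print(pcm_bitrate(22000, 16, 2))     # stereo example from the text
```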

2.1 Digital Audio and Laplace Distribution:

A large audio file with a long, complex piece of music tends to have all the possible values of audio
samples. Consider the simple case of 8-bit audio samples, which have values in the interval [0, 255].
A large audio file, with millions of audio samples, will tend to have many audio samples
concentrated around the center of this interval (around 128), fewer large samples (close to the
maximum 255), and few small samples (although there may be many audio samples of 0, because
many types of sound tend to have periods of silence). The distribution of the samples may have a
maximum at its center and another spike at 0. Thus, the audio samples themselves do not normally
have a simple distribution.
However, when we examine the differences of adjacent samples, we observe a completely
different behavior. Consecutive audio samples tend to be correlated, which is why the differences of
consecutive samples tend to be small numbers. Experiments with many types of sound indicate that
the distribution of audio differences resembles the Laplace distribution. The differences of
consecutive correlated values tend to have a narrow, peaked distribution, resembling the Laplace
distribution. This is true for the differences of audio samples as well as for the differences of
consecutive pixels of an image. A compression algorithm may take advantage of this fact and
encode the differences with variable-size codes that have a Laplace distribution. A more
sophisticated version may compute differences between actual values (audio samples or pixels)
and their predicted values, and then encode the (Laplace distributed) differences. Two such
methods are image MLP and FLAC.
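The effect of differencing can be illustrated with a synthetic signal. The sketch below is an assumption-laden example rather than a real audio file: it builds a smooth 8-bit sine wave, replaces every sample after the first by its difference from the previous sample, and shows that the differences occupy a much narrower range than the samples themselves. The function names are hypothetical.

```python
import math

def to_differences(samples):
    """Keep the first sample, then store each sample minus its predecessor."""
    diffs = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        diffs.append(cur - prev)
    return diffs

def from_differences(diffs):
    """Rebuild the samples by accumulating the differences."""
    samples, total = [], 0
    for d in diffs:
        total += d
        samples.append(total)
    return samples

# Synthetic 8-bit "audio": a sine wave centered at 128 (illustrative only).
samples = [round(128 + 100 * math.sin(2 * math.pi * n / 100)) for n in range(400)]
diffs = to_differences(samples)

print(min(samples), max(samples))                           # wide range
print(min(diffs[1:]), max(diffs[1:]))                       # narrow range
```

Differencing by itself is lossless, since accumulating the differences restores the samples exactly. The gain comes from the small differences being cheaper to encode with variable-size codes.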

2.2. The Human Auditory System

The frequency range of the human ear is from about 20 Hz to about 20,000 Hz, but the ear's
sensitivity to sound is not uniform. It depends on the frequency. It should also be noted that the
range of the human voice is much more limited. It is only from about 500 Hz to about 2 kHz. The
existence of the hearing threshold suggests an approach to lossy audio compression. Just delete any
audio samples that are below the threshold. Since the threshold depends on the frequency, the
encoder needs to know the frequency spectrum of the sound being compressed at any time. If a
signal for frequency f is smaller than the hearing threshold at f, it (the signal) should be deleted. In
addition to this, two more properties of the human hearing system are used in audio compression.
They are frequency masking and temporal masking.

2.2.1 Spectral Masking or Frequency Masking:
Frequency masking (also known as auditory masking or spectral masking) occurs when a sound
that we can normally hear (because it is loud enough) is masked by another sound with a nearby
frequency. The thick arrow in Figure 2 represents a strong sound source at 8 kHz. This source
raises the normal threshold in its vicinity (the dashed curve), with the result that the nearby sound
represented by the arrow at "x", a sound that would normally be audible because it is above the
threshold, is now masked, and is inaudible. A good lossy audio compression method should identify
this case and delete the signals corresponding to sound "x", because it cannot be heard anyway.
This is one way to lossily compress sound.

Figure 2: Spectral or frequency masking

The frequency masking (the width of the dashed curve of Figure 2) depends on the
frequency. It varies from about 100 Hz for the lowest audible frequencies to more than 4 kHz for
the highest. The range of audible frequencies can therefore be partitioned into a number of critical
bands that indicate the declining sensitivity of the ear (rather, its declining resolving power) for
higher frequencies. We can think of the critical bands as a measure similar to frequency. However,
in contrast to frequency, which is absolute and has nothing to do with human hearing, the critical
bands are determined according to the sound perception of the ear. Thus, they constitute a
perceptually uniform measure of frequency. Table 1 lists 27 approximate critical bands.

Table 1: Twenty-Seven Approximate Critical Bands.

This also points the way to designing a practical lossy compression algorithm. The audio
signal should first be transformed into its frequency domain, and the resulting values (the
frequency spectrum) should be divided into subbands that resemble the critical bands as much as
possible. Once this is done, the signals in each subband should be quantized such that the
quantization noise (the difference between the original sound sample and its quantized value)
is inaudible.

2.2.2 Temporal Masking
Temporal masking may occur when a strong sound A of frequency f is preceded or followed in time
by a weaker sound B at a nearby (or the same) frequency. If the time interval between the sounds is
short, sound B may not be audible. Figure 3 illustrates an example of temporal masking. The
threshold of temporal masking due to a loud sound at time 0 goes down, first sharply, then slowly.
A weaker sound of 30 dB will not be audible if it occurs 10 ms before or after the loud sound, but
will be audible if the time interval between the sounds is 20 ms.


Figure 3: Threshold and Masking of Sound.

If the masked sound occurs prior to the masking tone, this is called premasking or
backward masking, and if the sound being masked occurs after the masking tone, this effect is
called postmasking or forward masking. The forward masking remains in effect for a much
longer time interval than the backward masking.

3. Lossy Sound Compression

It is possible to get better sound compression by developing lossy methods that take advantage of
our perception of sound, and discard data to which the human ear is not sensitive. We briefly
describe two approaches, silence compression and companding.
The principle of silence compression is to treat small samples as if they were silence (i.e., as
samples of 0). This generates run lengths of zero, so silence compression is actually a variant of
RLE, suitable for sound compression. This method uses the fact that some people have less sensitive
hearing than others, and will tolerate the loss of sound that is so quiet they may not hear it anyway.
Audio files containing long periods of low-volume sound will respond to silence compression better
than other files with high-volume sound. This method requires a user-controlled parameter that
specifies the largest sample that should be suppressed.
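Silence compression as described above can be sketched in a few lines of Python. The token format, the function names, and the assumption that samples are signed values centered at 0 are all illustrative choices, not something the notes specify. The `threshold` argument plays the role of the user-controlled parameter.

```python
def silence_compress(samples, threshold):
    """Replace runs of small samples by ("silence", run_length) tokens."""
    tokens, run = [], 0
    for s in samples:
        if abs(s) <= threshold:          # treated as silence (a sample of 0)
            run += 1
        else:
            if run:
                tokens.append(("silence", run))
                run = 0
            tokens.append(("sample", s))
    if run:
        tokens.append(("silence", run))
    return tokens

def silence_expand(tokens):
    """Rebuild the samples; suppressed samples come back as zeros (lossy)."""
    samples = []
    for kind, value in tokens:
        if kind == "silence":
            samples.extend([0] * value)
        else:
            samples.append(value)
    return samples

tokens = silence_compress([0, 1, -2, 50, 60, 1, 0, 0, -1, 40], threshold=2)
print(tokens)
print(silence_expand(tokens))
```

A larger threshold produces longer runs and better compression, at the cost of discarding more quiet sound.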
Companding (short for "compressing/expanding") uses the fact that the ear requires more
precise samples at low amplitudes (soft sounds), but is more forgiving at higher amplitudes. A
typical ADC used in sound cards for personal computers converts voltages to numbers linearly. If an
amplitude a is converted to the number n, then amplitude 2a will be converted to the number 2n. A
compression method using companding examines every sample in the sound file, and employs a
nonlinear formula to reduce the number of bits devoted to it. More sophisticated methods, such as
μ-law and A-law, are commonly used.

4. μ-Law and A-Law Companding

The μ-Law and A-Law companding standards employ logarithm-based functions to encode audio
samples for ISDN (integrated services digital network) digital telephony services, by means of
nonlinear quantization. The ISDN hardware samples the voice signal from the telephone at 8 kHz,
and generates 14-bit samples (13 for A-law). The method of μ-law companding is used in North
America and Japan, and A-law is used elsewhere.
Experiments indicate that the low amplitudes of speech signals contain more information
than the high amplitudes. This is why nonlinear quantization makes sense. Imagine an audio signal
sent on a telephone line and digitized to 14-bit samples. The louder the conversation, the higher the
amplitude, and the bigger the value of the sample. Since high amplitudes are less important, they
can be coarsely quantized. If the largest sample, which is 2^14 - 1 = 16,383, is quantized to 255 (the
largest 8-bit number), then the compression factor is 14/8 = 1.75. When decoded, a code of 255 will
become very different from the original 16,383. We say that because of the coarse quantization,
large samples end up with high quantization noise. Smaller samples should be finely quantized, so
they end up with low quantization noise. The μ-law encoder inputs 14-bit samples and outputs
8-bit codewords. The A-law encoder inputs 13-bit samples and also outputs 8-bit codewords. The
telephone signals are sampled at 8 kHz (8,000 times per second), so the μ-law encoder receives
8,000 × 14 = 112,000 bits/sec. At a compression factor of 1.75, the encoder outputs 64,000 bits/sec.
4.1 μ-Law Encoder:
The μ-law encoder receives a 14-bit signed input sample x. Thus, the input is in the range [-8192,
+8191]. The sample is normalized to the interval [-1, +1], and the encoder uses the logarithmic
expression
$$\operatorname{sgn}(x)\,\frac{\ln(1+\mu|x|)}{\ln(1+\mu)}, \quad\text{where}\quad \operatorname{sgn}(x)=\begin{cases}+1, & x>0,\\ 0, & x=0,\\ -1, & x<0\end{cases}$$
(and μ is a positive integer), to compute and output an 8-bit code in the same interval [-1, +1]. The
output is then scaled to the range [-256, +255]. Figure 4 shows this output as a function of the
input for the three μ values 25, 255, and 2555. It is clear that large values of μ cause coarser
quantization for larger amplitudes. Such values allocate more bits to the smaller, more important,
amplitudes. The G.711 standard recommends the use of μ = 255. The diagram shows only the
nonnegative values of the input (i.e., from 0 to 8191). The negative side of the diagram has the same
shape but with negative inputs and outputs.

Figure 4: The μ-Law for μ Values of 25, 255, and 2555.

The following simple examples illustrate the nonlinear nature of the μ-law. The two
(normalized) input samples 0.15 and 0.16 are transformed by μ-law to outputs 0.6618 and 0.6732.
The difference between the outputs is 0.0114. On the other hand, the two input samples 0.95 and
0.96 (bigger inputs but with the same difference) are transformed to 0.9908 and 0.9927. The
difference between these two outputs is 0.0019; much smaller. Bigger samples are decoded with
more noise, and smaller samples are decoded with less noise. However, the signal-to-noise ratio is
constant because both the μ-law and the SNR use logarithmic expressions.
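The numbers in these examples can be reproduced directly from the logarithmic expression. The following Python sketch uses the recommended value μ = 255 as the default. The function names are assumptions, and the inverse function is added here for completeness; it is obtained by solving the expression for |x|.

```python
import math

def mu_law(x, mu=255):
    """Continuous mu-law companding of a normalized sample x in [-1, +1]."""
    sign = (x > 0) - (x < 0)
    return sign * math.log(1 + mu * abs(x)) / math.log(1 + mu)

def mu_law_inverse(y, mu=255):
    """Expand a companded value y in [-1, +1] back to a normalized sample."""
    sign = (y > 0) - (y < 0)
    return sign * ((1 + mu) ** abs(y) - 1) / mu

for x in (0.15, 0.16, 0.95, 0.96):
    print(x, round(mu_law(x), 4))
```

The equal input spacing of 0.01 maps to an output spacing of about 0.0114 for the small inputs but only about 0.0019 for the large ones, which is the coarser quantization of large amplitudes described above.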

P | S2 S1 S0 | Q3 Q2 Q1 Q0

Figure 5: G.711 μ-Law Codeword.

Logarithms are slow to compute, so the μ-law encoder performs much simpler calculations
that produce an approximation. The output specified by the G.711 standard is an 8-bit codeword
whose format is shown in Figure 5. Bit P in Figure 5 is the sign bit of the output (same as the sign bit
of the 14-bit signed input sample). Bits S2, S1, and S0 are the segment code, and bits Q3 through Q0
are the quantization code. The encoder determines the segment code by (1) adding a bias of 33 to
the absolute value of the input sample, (2) determining the bit position of the most significant 1-bit
among bits 5 through 12 of the input, and (3) subtracting 5 from that position. The 4-bit
quantization code is set to the four bits following the bit position determined in step 2. The encoder
ignores the remaining bits of the input sample, and it inverts (1's complements) the codeword
before it is output.
Example of μ-Law Codeword:
(a) Encoding: We use the input sample -656 as an example. The sample is negative, so bit P
becomes 1. Adding 33 to the absolute value of the input yields 689 = 0001010110001 in binary
(Figure 6). The most significant 1-bit in positions 5 through 12 is found at position 9. The segment
code is thus 9 - 5 = 4. The quantization code is the four bits 0101 at positions 8 through 5, and the
remaining five bits 10001 are ignored.

Bit position:  12 11 10  9  8  7  6  5  4  3  2  1  0
Bit value:      0  0  0  1  0  1  0  1  1  0  0  0  1
                           Q3 Q2 Q1 Q0

Figure 6: Encoding Input Sample -656.

The 8-bit codeword (which is later inverted) becomes

P  S2 S1 S0 Q3 Q2 Q1 Q0
1  1  0  0  0  1  0  1

(b) Decoding: The μ-law decoder inputs an 8-bit codeword and inverts it. It then decodes it as
follows:
1. Multiply the quantization code by 2 and add 33 (the bias) to the result.
2. Multiply the result by 2 raised to the power of the segment code.
3. Decrement the result by the bias.
4. Use bit P to determine the sign of the result.
Applying these steps to our example produces
1. The quantization code is 0101 (binary) = 5, so 5 × 2 + 33 = 43.
2. The segment code is 100 (binary) = 4, so 43 × 2^4 = 688.
3. Decrement by the bias: 688 - 33 = 655.
4. Bit P is 1, so the final result is -655. Thus, the quantization error (the noise) is 1; very
small.
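The encoding and decoding steps above can be traced in code. The Python sketch below follows the steps as stated in these notes; it is not a reference implementation of G.711. The function names are hypothetical, and clipping the biased magnitude to 8191 is an added assumption so that the leading 1-bit never lies above bit 12.

```python
BIAS = 33            # bias added before locating the segment
MAX_BIASED = 8191    # assumed clipping limit so the top 1-bit is at most bit 12

def mu_law_encode(sample):
    """Encode a 14-bit signed sample into an inverted 8-bit codeword."""
    p = 1 if sample < 0 else 0
    magnitude = min(abs(sample) + BIAS, MAX_BIASED)
    position = magnitude.bit_length() - 1          # most significant 1-bit (5..12)
    segment = position - 5
    quant = (magnitude >> (position - 4)) & 0x0F   # four bits after the leading 1
    codeword = (p << 7) | (segment << 4) | quant
    return codeword ^ 0xFF                         # 1's complement before output

def mu_law_decode(code):
    """Decode an inverted 8-bit codeword back to a signed sample."""
    codeword = code ^ 0xFF                         # undo the inversion
    p = codeword >> 7
    segment = (codeword >> 4) & 0x07
    quant = codeword & 0x0F
    value = ((quant * 2 + BIAS) << segment) - BIAS
    return -value if p else value

code = mu_law_encode(-656)
print(format(code ^ 0xFF, "08b"))   # codeword before inversion
print(mu_law_decode(code))          # reconstructed sample
```

For the sample -656 this reproduces the codeword 11000101 and the decoded value -655 from the worked example.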
Figure 7: μ-Law Midtread Quantization.

Figure 7 illustrates the nature of the μ-law midtread quantization. Zero is one of the valid
output values, and the quantization steps are centered at the input value of 0. The steps are
organized in eight segments of 16 steps each. The steps within each segment have the same width,
but they double in width from one segment to the next. If we denote the segment number by i
(where i = 0, 1, ..., 7) and the step number within a segment by k (where k = 1, 2, ..., 16), then the
middle of the tread of each step in Figure 7 (i.e., the points labeled x_j) is given by
$$x(16i + k) = T(i) + k \times D(i),$$
where the constants T(i) and D(i) are the initial value and the step size for segment i, respectively.
They are given by

i      0    1    2    3    4    5     6     7
T(i)   1    35   103  239  511  1055  2143  4319
D(i)   2    4    8    16   32   64    128   256
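4.2 The A-law encoder:
The A-law encoder uses the similar expression
$$\begin{cases}\operatorname{sgn}(x)\,\dfrac{A|x|}{1+\ln A}, & 0 \le |x| < \dfrac{1}{A},\\[2ex] \operatorname{sgn}(x)\,\dfrac{1+\ln(A|x|)}{1+\ln A}, & \dfrac{1}{A} \le |x| \le 1.\end{cases}$$
The G.711 standard recommends the use of A = 87.6.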
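A minimal Python sketch of the continuous A-law expression follows, assuming the usual two-branch form (linear below 1/A, logarithmic above it) and the recommended A = 87.6. The function name is an illustrative assumption.

```python
import math

def a_law(x, a=87.6):
    """Continuous A-law companding of a normalized sample x in [-1, +1]."""
    sign = (x > 0) - (x < 0)
    ax = abs(x)
    if ax < 1 / a:
        y = a * ax / (1 + math.log(a))               # linear part near zero
    else:
        y = (1 + math.log(a * ax)) / (1 + math.log(a))  # logarithmic part
    return sign * y

for x in (0.001, 0.01, 0.1, 1.0):
    print(x, round(a_law(x), 4))
```

The two branches meet at |x| = 1/A, where both give 1/(1 + ln A), so the curve is continuous.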

Figure 8: A-Law Midriser Quantization.

The operation of the A-law encoder is similar, except that the quantization (Figure 8) is of
the midriser variety. The breakpoints x_j are given by the equation
$$x(16i + k) = T(i) + k \times D(i),$$
but the initial value T(i) and the step size D(i) for segment i are different from those used by the
μ-law encoder and are given by

i      0    1    2    3    4    5    6     7
T(i)   0    32   64   128  256  512  1024  2048
D(i)   2    2    4    8    16   32   64    128
The A-law encoder generates an 8-bit codeword with the same format as the μ-law encoder.
It sets the P bit to the sign of the input sample. It then determines the segment code in the following
steps:
1. Determine the bit position of the most significant 1-bit among the seven most significant bits of
the input.
2. If such a 1-bit is found, the segment code becomes that position minus 4. Otherwise, the segment
code becomes zero.
The 4-bit quantization code is set to the four bits following the bit position determined in
step 1, or to half the input value if the segment code is zero. The encoder ignores the remaining bits
of the input sample, and it inverts bit P and the even-numbered bits of the codeword before it is
output.
The A-law decoder decodes an 8-bit codeword into a 13-bit audio sample as follows:
1. It inverts bit P and the even-numbered bits of the codeword.
2. If the segment code is nonzero, the decoder multiplies the quantization code by 2 and increments
this by the bias (33). The result is then multiplied by 2 raised to the power of (segment code
minus 1). If the segment code is 0, the decoder outputs twice the quantization code, plus 1.
3. Bit P is then used to determine the sign of the output.
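The A-law steps can be sketched in Python in the same way. This follows the description in these notes rather than the text of the G.711 standard. Several details are assumptions made for the sketch: the function names, the clipping of the magnitude to 12 bits, and the inversion mask 0xD5, which interprets "bit P and the even-numbered bits" as bit 7 plus bits 6, 4, 2, and 0 with the least significant bit numbered 0.

```python
INVERT_MASK = 0xD5   # assumed: bit P plus bits 6, 4, 2, 0 (LSB is bit 0)

def a_law_encode(sample):
    """Encode a 13-bit signed sample into an 8-bit A-law codeword."""
    p = 1 if sample < 0 else 0
    magnitude = min(abs(sample), 4095)             # 12-bit magnitude (assumed clip)
    position = magnitude.bit_length() - 1          # -1 when the magnitude is 0
    if position >= 5:                              # a 1-bit among bits 11..5
        segment = position - 4
        quant = (magnitude >> (position - 4)) & 0x0F
    else:
        segment = 0
        quant = magnitude >> 1                     # half the input value
    return ((p << 7) | (segment << 4) | quant) ^ INVERT_MASK

def a_law_decode(code):
    """Decode an 8-bit A-law codeword into a signed sample."""
    codeword = code ^ INVERT_MASK                  # undo the inversion
    p = codeword >> 7
    segment = (codeword >> 4) & 0x07
    quant = codeword & 0x0F
    if segment:
        value = (quant * 2 + 33) << (segment - 1)
    else:
        value = quant * 2 + 1
    return -value if p else value

print(a_law_decode(a_law_encode(656)))
print(a_law_decode(a_law_encode(-10)))
```

Because the same mask is applied by the encoder and the decoder, the round trip does not depend on which bits the mask selects; only the transmitted byte values do.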

5. ADPCM Audio Compression:

Adjacent audio samples tend to be similar in much the same way that neighboring pixels in an
image tend to have similar colors. The simplest way to exploit this redundancy is to subtract
adjacent samples and code the differences, which tend to be small integers. Any audio compression
method based on this principle is called DPCM (differential pulse code modulation). Such
methods, however, are inefficient, because they do not adapt themselves to the varying magnitudes
of the audio stream. Better results are achieved by an adaptive version, and any such version is
called ADPCM.
ADPCM: Short for Adaptive Differential Pulse Code Modulation, a form of pulse code modulation (PCM) that
produces a digital signal with a lower bit rate than standard PCM. ADPCM produces a lower bit rate by
recording only the difference between samples and adjusting the coding scale dynamically to accommodate
large and small differences.

ADPCM employs linear prediction. It uses the previous sample (or several previous
samples) to predict the current sample. It then computes the difference between the current sample
and its prediction, and quantizes the difference. For each input sample X[n], the output C[n] of the
encoder is simply a certain number of quantization levels. The decoder multiplies this number by
the quantization step (and may add half the quantization step, for better precision) to obtain the
reconstructed audio sample. The method is efficient because the quantization step is updated all the
time, by both encoder and decoder, in response to the varying magnitudes of the input samples. It is
also possible to modify adaptively the prediction algorithm. Various ADPCM methods differ in the
way they predict the current audio sample and in the way they adapt to the input (by changing the
quantization step size and/or the prediction method).
In addition to the quantized values, an ADPCM encoder can provide the decoder with side
information. This information increases the size of the compressed stream, but this degradation is
acceptable to the users, because it makes the compressed audio data more useful. Typical
applications of side information are (1) help the decoder recover from errors and (2) signal an
entry point into the compressed stream. An original audio stream may be recorded in compressed
form on a medium such as a CD-ROM. If the user (listener) wants to listen to song 5, the decoder can
use the side information to quickly find the start of that song.

Figure 9: (a) ADPCM Encoder and (b) Decoder.


Figure 9 shows the general organization of the ADPCM encoder and decoder. The adaptive
quantizer receives the difference D[n] between the current input sample X[n] and the prediction
Xp[n-1]. The quantizer computes and outputs the quantized code C[n] of X[n]. The same code is
sent to the adaptive dequantizer (the same dequantizer used by the decoder), which produces the
next dequantized difference value Dq[n]. This value is added to the previous predictor output
Xp[n-1], and the sum Xp[n] is sent to the predictor to be used in the next step.
Better prediction would be obtained by feeding the actual input X[n] to the predictor.
However, the decoder wouldn't be able to mimic that, since it does not have X[n]. We see that the
basic ADPCM encoder is simple, and the decoder is even simpler. It inputs a code C[n], dequantizes
it to a difference Dq[n], which is added to the preceding predictor output Xp[n-1] to form the next
output Xp[n]. The next output is also fed into the predictor, to be used in the next step.
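The encoder and decoder loops described above can be sketched as follows. This is a toy illustration under stated assumptions and not any standardized ADPCM variant: the predictor is simply the previous reconstructed sample, the codes are 4-bit signed values, and the step-adaptation rule in `_adapt` (double the step after a large code, halve it after a small one) is hypothetical. All names are invented for the example.

```python
def _adapt(step, code):
    """Hypothetical step-size rule, applied identically by encoder and decoder."""
    if abs(code) >= 6:
        return min(step * 2, 2048)   # large code: differences are growing
    if abs(code) <= 1:
        return max(step // 2, 1)     # small code: differences are shrinking
    return step

def adpcm_encode(samples, step=16):
    codes, predicted = [], 0
    for x in samples:
        diff = x - predicted                      # D[n]
        code = max(-8, min(7, round(diff / step)))  # C[n], a 4-bit signed code
        codes.append(code)
        predicted += code * step                  # Xp[n], same value the decoder gets
        step = _adapt(step, code)
    return codes

def adpcm_decode(codes, step=16):
    samples, predicted = [], 0
    for code in codes:
        predicted += code * step                  # Xp[n] = Xp[n-1] + Dq[n]
        samples.append(predicted)
        step = _adapt(step, code)
    return samples

codes = adpcm_encode([100] * 20)
print(codes)
print(adpcm_decode(codes))
```

The encoder updates its prediction from the dequantized difference, never from the actual input, which is what lets the decoder stay in step with it without any extra information.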

6. Speech Compression:
Certain audio codecs are designed specifically to compress speech signals. Such signals are audio
and are sampled like any other audio data, but because of the nature of human speech, they have
properties that can be exploited for efficient compression.
6.1 Properties of Speech
We produce sound by forcing air from the lungs through the vocal cords into the vocal tract. The
vocal cords can open and close, and the opening between them is called the glottis. The movements
of the glottis and vocal tract give rise to different types of sound. The three main types are as
follows:
1. Voiced sounds. These are the sounds we make when we talk. The vocal cords vibrate, which
opens and closes the glottis, thereby sending pulses of air at varying pressures to the tract, where it
is shaped into sound waves. The frequencies of the human voice are much more restricted than the
range of hearing and are generally in the range of 500 Hz to about 2 kHz. This is equivalent to time
periods of 0.5 ms to 2 ms. Thus, voiced sounds have long-term periodicity.
2. Unvoiced sounds. These are sounds that are emitted and can be heard, but are not parts of
speech. Such a sound is the result of holding the glottis open and forcing air through a constriction
in the vocal tract. When an unvoiced sound is sampled, the samples show little correlation and are
random or close to random.
3. Plosive sounds. These result when the glottis closes, the lungs apply air pressure on it, and it
suddenly opens, letting the air escape suddenly. The result is a popping sound.

6.2 Speech codecs
There are three main types of speech codecs.
1. Waveform speech codecs: These produce good to excellent speech after compressing and
decompressing it, but generate bitrates of 10 to 64 kbps.
2. Source codecs (also called vocoders): Vocoders generally produce poor to fair speech but
can compress it to very low bitrates (down to 2 kbps).
3. Hybrid codecs: These codecs are combinations of the former two types and produce
speech that varies from fair to good, with bitrates between 2 and 16 kbps.

Figure 10 illustrates the speech quality versus bitrate of these three types.

Figure 10: Speech Quality versus Bitrate for Speech Codecs.

6.3 Waveform Codecs
A waveform codec does not attempt to predict how the original sound was generated. It only tries to
produce, after decompression, audio samples that are as close to the original ones as possible. Thus,
such codecs are not designed specifically for speech coding and can perform equally well on all
kinds of audio data. As Figure 10 illustrates, when such a codec is forced to compress sound to less
than 16 kbps, the quality of the reconstructed sound drops significantly.
The simplest waveform encoder is pulse code modulation (PCM). This encoder simply
quantizes each audio sample. Speech is typically sampled at only 8 kHz. If each sample is quantized
to 12 bits, the resulting bitrate is 8k × 12 = 96 kbps and the reproduced speech sounds almost
natural. Better results are obtained with a logarithmic quantizer, such as the μ-law and A-law
companding methods. They quantize audio samples to varying numbers of bits and may compress
speech to 8 bits per sample on average, thereby resulting in a bitrate of 64 kbps, with very good
quality of the reconstructed speech.
A differential PCM speech encoder uses the fact that the audio samples of voiced speech
are correlated. This type of encoder computes the difference between the current sample and its
predecessor and quantizes the difference. An adaptive version (ADPCM) may compress speech at
good quality down to a bitrate of 32 kbps.
Waveform coders may also operate in the frequency domain. The subband coding
algorithm (SBC) transforms the audio samples to the frequency domain, partitions the resulting
coefficients into several critical bands (or frequency subbands), and codes each subband separately
with ADPCM or a similar quantization method. The SBC decoder decodes the frequency coefficients,
recombines them, and performs the inverse transformation to (lossily) reconstruct audio samples.
The advantage of SBC is that the ear is sensitive to certain frequencies and less sensitive to others.
Subbands of frequencies to which the ear is less sensitive can therefore be coarsely quantized
without loss of sound quality. Coders of this type typically produce good reconstructed speech
quality at bitrates of 16 to 32 kbps. They are, however, more complex to implement than PCM codecs
and may also be slower.
The adaptive transform coding (ATC) speech compression algorithm transforms audio
samples to the frequency domain with the discrete cosine transform (DCT). The audio file is divided
into blocks of audio samples and the DCT is applied to each block, resulting in a number of
frequency coefficients. Each coefficient is quantized according to the frequency to which it
corresponds. Good-quality reconstructed speech can be achieved at bitrates as low as 16 kbps.
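The block transform at the heart of ATC can be sketched with a plain (slow) DCT. The Python code below is only an illustration: it uses the orthonormal DCT-II and its inverse, and the quantization rule, in which the step size grows linearly with the coefficient index, is a hypothetical stand-in for "quantized according to the frequency". All names are assumptions.

```python
import math

def _scale(k, n):
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

def dct(block):
    """Orthonormal DCT-II of one block of samples."""
    n = len(block)
    return [_scale(k, n) * sum(x * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
                               for t, x in enumerate(block))
            for k in range(n)]

def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
                for k, c in enumerate(coeffs))
            for t in range(n)]

def quantize(coeffs, base_step=1.0):
    """Hypothetical rule: coefficient k uses step base_step * (k + 1)."""
    return [round(c / (base_step * (k + 1))) for k, c in enumerate(coeffs)]

def dequantize(levels, base_step=1.0):
    return [q * base_step * (k + 1) for k, q in enumerate(levels)]

block = [10, 12, 15, 13, 9, 4, 2, 5]
coeffs = dct(block)
restored = idct(dequantize(quantize(coeffs)))
print([round(v, 1) for v in restored])
```

Without the quantization step the inverse transform returns the block exactly (up to rounding error); the loss comes entirely from quantizing the coefficients.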
6.4 Source Codecs
In general, a source encoder uses a mathematical model of the source of data. The model depends
on certain parameters, and the encoder uses the input data to compute those parameters. Once the
parameters are obtained, they are written (after being suitably encoded) on the compressed
stream. The decoder inputs the parameters and employs the mathematical model to reconstruct the
original data. If the original data is audio, the source coder is called a vocoder (from "vocal coder").
6.4.1 Linear Predictive Coder (LPC):
Figure 11 shows a simplified model of speech production. Part (a) illustrates the process in a
person, whereas part (b) shows the corresponding LPC mathematical model. In this model, the
output is the sequence of speech samples s(n) coming out of the LPC filter (which corresponds to
the vocal tract and lips). The input u(n) to the model (and to the filter) is either a train of pulses
(when the sound is voiced speech) or white noise (when the sound is unvoiced speech). The
quantities u(n) are also termed innovation. The model illustrates how samples s(n) of speech can be
generated by mixing innovations (a train of pulses and white noise). Thus, it represents
mathematically the relation between speech samples and innovations. The task of the speech
encoder is to input samples s(n) of actual speech, use the filter as a mathematical function to
determine an equivalent sequence of innovations u(n), and output the innovations in compressed
form.

Figure 11: Speech Production: (a) Real. (b) LPC Model

The correspondence between the model's parameters and the parts of real speech is as
follows:
1. Parameter V (voiced) corresponds to the vibrations of the vocal cords. UV expresses the unvoiced
sounds.
2. T is the period of the vocal cords' vibrations.
3. G (gain) corresponds to the loudness or the air volume sent from the lungs each second.
4. The innovations u(n) correspond to the air passing through the vocal tract.
5. The two operator symbols in Figure 11b denote amplification and combination, respectively.


The main equation of the LPC model describes the output of the LPC filter as
$$H(z) = \frac{1}{1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_{10} z^{-10}},$$
where z is the input to the filter [the value of one of the u(n)]. An equivalent equation describes the
relation between the innovations u(n) on the one hand and the 10 coefficients a_i and the speech
audio samples s(n) on the other hand. The relation is
$$u(n) = s(n) + \sum_{i=1}^{10} a_i\, s(n-i).$$
This relation implies that each number u(n) input to the LPC filter is the sum of the current audio
sample s(n) and a weighted sum of the 10 preceding samples. The LPC model can be written as the
13-tuple
$$A = (a_1, a_2, \ldots, a_{10}, G, V/UV, T),$$
where V/UV is a single bit specifying the source (voiced or unvoiced) of the input samples. The
model assumes that A stays stable for about 20 ms, then gets updated by the audio samples of the
next 20 ms. At a sampling rate of 8 kHz, there are 160 audio samples s(n) every 20 ms. The model
computes the 13 quantities in A from these 160 samples, writes A (as 13 numbers) on the
compressed stream, then repeats for the next 20 ms. The resulting compression factor is therefore
13 numbers for each set of 160 audio samples.
It's important to distinguish the operation of the encoder from the diagram of the LPC's
mathematical model depicted in Figure 11b. The figure shows how a sequence of innovations u(n)
generates speech samples s(n). The encoder, however, starts with the speech samples. It inputs a
20-ms sequence of speech samples s(n), computes an equivalent sequence of innovations,
compresses them to 13 numbers, and outputs the numbers after further encoding them. This
repeats every 20 ms.
LPC encoding (or analysis) starts with 160 sound samples and computes the 10 LPC parameters a_i by minimizing the energy of the innovation u(n). The energy is the function

$$E(a_1, a_2, \ldots, a_{10}) = \sum_{n=0}^{159} \Bigl[ s(n) + \sum_{i=1}^{10} a_i\, s(n-i) \Bigr]^2,$$

and its minimum is computed by differentiating it 10 times, once with respect to each of its 10 parameters, and setting each derivative to zero. The autocorrelation function of the samples s(n) is given by

$$R(k) = \sum_{n=0}^{159-k} s(n)\, s(n+k),$$

which is used to obtain the 10 LPC parameters a_i. The remaining three parameters, V/UV, G, and T, are determined from the 160 audio samples. If those samples exhibit periodicity, then T becomes that period and the 1-bit parameter V/UV is set to V. If the 160 samples do not feature any well-defined period, then T remains undefined and V/UV is set to UV. The value of G is determined by the largest sample.
LPC decoding (or synthesis) starts with a set of 13 LPC parameters and computes 160 audio samples as the output of the LPC filter by

$$s(n) = u(n) - \sum_{i=1}^{10} a_i\, s(n-i).$$

These samples are played at 8,000 samples per second and result in 20 ms of (voiced or unvoiced) reconstructed speech.
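The analysis and synthesis steps can be illustrated with a minimal sketch. It assumes the relation between u(n) and s(n) stated above (u(n) equals s(n) plus a weighted sum of the 10 preceding samples), treats samples before the frame as zero, and solves the normal equations of the autocorrelation method directly with NumPy instead of a fast recursion. The function names and the synthetic test frame are illustrative only; G, T, and V/UV are not computed here.

```python
import numpy as np

ORDER = 10    # number of LPC coefficients a_1..a_10
FRAME = 160   # 20 ms of speech at 8 kHz

def autocorr(s, max_lag):
    """R(k) = sum over n of s(n) s(n+k), for k = 0..max_lag."""
    n = len(s)
    return np.array([np.dot(s[:n - k], s[k:]) for k in range(max_lag + 1)])

def lpc_analyze(s, order=ORDER):
    """Coefficients a_1..a_order from the autocorrelation method."""
    r = autocorr(s, order)
    # Setting each derivative of the energy to zero gives
    # sum_i a_i R(|i-k|) = -R(k) for k = 1..order.
    matrix = np.array([[r[abs(i - k)] for i in range(order)] for k in range(order)])
    return np.linalg.solve(matrix, -r[1:])

def innovations(s, a):
    """u(n) = s(n) + sum_i a_i s(n-i), with s(n) = 0 before the frame."""
    s = np.asarray(s, dtype=float)
    u = s.copy()
    for i, ai in enumerate(a, start=1):
        u[i:] += ai * s[:-i]
    return u

def lpc_synthesize(u, a):
    """s(n) = u(n) - sum_i a_i s(n-i): the all-pole LPC filter."""
    s = np.zeros(len(u))
    for n in range(len(u)):
        past = sum(a[i - 1] * s[n - i] for i in range(1, len(a) + 1) if n - i >= 0)
        s[n] = u[n] - past
    return s

# Synthetic "voiced" frame: a sinusoid plus a little noise (illustrative data).
rng = np.random.default_rng(0)
t = np.arange(FRAME)
frame = np.sin(2 * np.pi * t / 40) + 0.05 * rng.standard_normal(FRAME)

a = lpc_analyze(frame)
u = innovations(frame, a)
rebuilt = lpc_synthesize(u, a)
```

In this sketch the innovations carry much less energy than the frame itself, and feeding them back through the synthesis filter recovers the frame. A real LPC vocoder does not transmit u(n); it replaces it with a pulse train or noise chosen by the V/UV bit.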
Advantages of LPC:
1. LPC provides a good model of the speech signal.
2. The way in which LPC is applied to the analysis of speech signals leads to a reasonable source/vocal-tract separation.
3. LPC is an analytically tractable model. The model is mathematically precise, simple, and straightforward to implement in either software or hardware.
6.5 Hybrid Codecs
This type of speech codec combines features from both waveform and source codecs. The most popular hybrid codecs are Analysis-by-Synthesis (AbS) time-domain algorithms. Like the LPC vocoder, these codecs model the vocal tract by a linear prediction filter, but use an excitation signal instead of the simple, two-state voiced/unvoiced model to supply the u(n) (innovation) input to the filter. Thus, an AbS encoder starts with a set of speech samples (a frame), encodes them similar to LPC, decodes them, and subtracts the decoded samples from the original ones. The differences are sent through an error minimization process that outputs improved encoded samples. These samples are again decoded and subtracted from the original samples, and new differences are computed. This is repeated until the differences satisfy a termination condition. The encoder then proceeds to the next set of speech samples (next frame).
6.5.1 Code Excited Linear Prediction (CELP):
One of the most important factors in generating natural-sounding speech is the excitation signal. As the human ear is especially sensitive to pitch errors, a great deal of effort has been devoted to the development of accurate pitch detection algorithms.
In CELP, instead of having a codebook of pulse patterns, we allow a variety of excitation signals. For each segment the encoder finds the excitation vector that generates synthesized speech that best matches the speech segment being encoded. This approach is closer in a strict sense to a waveform coding technique such as DPCM than to the analysis/synthesis schemes. The main components of the CELP coder include the LPC analysis, the excitation codebook, and the perceptual weighting filter.
Besides CELP, the multipulse LPC (MP-LPC) algorithm had another descendant that has become a standard. Instead of using excitation vectors in which the nonzero values are separated by an arbitrary number of zero values, this scheme forces the nonzero values to occur at regularly spaced intervals, while still allowing them to take on a number of different values. This scheme is called regular pulse excitation (RPE) coding. A variation of RPE, called regular pulse excitation with long-term prediction (RPE-LTP), was adopted as a standard for digital cellular telephony by the Groupe Spécial Mobile (GSM) subcommittee of the European Telecommunications Standards Institute at the rate of 13 kbps.
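The vocal tract filter used by the CELP coder is given by

$$y_n = \sum_{i=1}^{10} b_i\, y_{n-i} + \beta\, y_{n-P} + G\,\epsilon_n,$$

where P is the pitch period and the term \beta y_{n-P} is the contribution due to the pitch periodicity.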
1. The input speech is sampled at 8000 samples per second and divided into 30-millisecond frames containing 240 samples.
2. Each frame is divided into four subframes of length 7.5 milliseconds.
3. The coefficients b_i for the 10th-order short-term filter are obtained using the autocorrelation method.
4. The pitch period P is calculated once every subframe. In order to reduce the computational load, the pitch value is assumed to lie between 20 and 147 every odd subframe.
5. In every even subframe, the pitch value is assumed to lie within 32 samples of the pitch value in the previous frame.
6. The algorithm uses two codebooks, a stochastic codebook and an adaptive codebook. An excitation sequence is generated for each subframe by adding one scaled element from the stochastic codebook and one scaled element from the adaptive codebook.
7. The stochastic codebook contains 512 entries. These entries are generated using a Gaussian random number generator, the output of which is quantized to -1, 0, or 1. The codebook entries are adjusted so that each entry differs from the preceding entry in only two places.
8. The adaptive codebook consists of the excitation vectors from the previous frame. Each time a new excitation vector is obtained, it is added to the codebook. In this manner, the codebook adapts to local statistics.
9. The coder has been shown to provide excellent reproductions in both quiet and noisy environments at rates of 4.8 kbps and above.
10. The quality of the reproduction of this coder at 4.8 kbps has been shown to be equivalent to a delta modulator operating at 32 kbps. The price for this quality is much higher complexity and a much longer coding delay.
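The analysis-by-synthesis search over the stochastic codebook can be sketched as follows. This is a simplified illustration, not the standardized coder: it omits the adaptive codebook and the perceptual weighting filter, uses a toy one-tap short-term filter, and does not build the overlapping codebook described in item 7. The quantization threshold of 1.2 and all function names are assumptions made for the example.

```python
import numpy as np

SUBFRAME = 60         # 7.5 ms at 8000 samples per second
CODEBOOK_SIZE = 512

def make_codebook(rng, size=CODEBOOK_SIZE, length=SUBFRAME, threshold=1.2):
    """Gaussian samples quantized to -1, 0, or 1 (the threshold is an assumed value)."""
    g = rng.standard_normal((size, length))
    return np.where(g > threshold, 1.0, np.where(g < -threshold, -1.0, 0.0))

def synthesize(excitation, b):
    """Short-term filter y(n) = x(n) + sum_i b_i y(n-i), starting from zero state."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for i, bi in enumerate(b, start=1):
            if n - i >= 0:
                acc += bi * y[n - i]
        y[n] = acc
    return y

def search_codebook(target, codebook, b):
    """Return (index, gain) of the scaled entry whose synthesized output best matches target."""
    best_index, best_gain, best_err = None, 0.0, np.inf
    for idx, entry in enumerate(codebook):
        y = synthesize(entry, b)
        energy = np.dot(y, y)
        if energy == 0.0:
            continue                      # an all-zero entry cannot match anything
        gain = np.dot(target, y) / energy  # least-squares optimal gain for this entry
        err = np.sum((target - gain * y) ** 2)
        if err < best_err:
            best_index, best_gain, best_err = idx, gain, err
    return best_index, best_gain

rng = np.random.default_rng(1)
codebook = make_codebook(rng)
b = np.array([0.9])                                # toy filter: y(n) = x(n) + 0.9 y(n-1)
target = synthesize(2.5 * codebook[137], b)        # "speech" built from a known entry
index, gain = search_codebook(target, codebook, b)
```

Because the target here was synthesized from entry 137 with gain 2.5, the search should recover that pair. With real speech the best entry only approximates the subframe, and the encoder transmits the index and a quantized gain.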
CCITT G.728 CELP Speech Coding Standard:
By their nature, the speech coding schemes have some coding delay built into them. By coding delay, we mean the time between when a speech sample is encoded and when it is decoded if the encoder and decoder are connected back to back (i.e., there are no transmission delays). In the schemes we have studied, a segment of speech is first stored in a buffer. We do not start extracting the various parameters until a complete segment of speech is available to us. Once the segment is completely available, it is processed. If the processing is real time, this means another segment's worth of delay. Finally, once the parameters have been obtained, coded, and transmitted, the receiver has to wait until at least a significant part of the information is available before it can start decoding the first sample. Therefore, if a segment contains 20 milliseconds' worth of data, the coding delay would be somewhere between 40 and 60 milliseconds.
For applications in which such a delay is unacceptable, CCITT approved recommendation G.728, a CELP coder with a coder delay of 2 milliseconds operating at 16 kbps. As the input speech is sampled at 8000 samples per second, this rate corresponds to an average rate of 2 bits per sample. The G.728 recommendation uses a segment size of five samples. With five samples and a rate of 2 bits per sample, we only have 10 bits available to us. Using only 10 bits, it would be impossible to encode the parameters of the vocal tract filter as well as the excitation vector. Therefore, the algorithm obtains the vocal tract filter parameters in a backward adaptive manner; that is, the vocal tract filter coefficients to be used to synthesize the current segment are obtained by analyzing the previous decoded segments. The G.728 algorithm uses a 50th-order vocal tract filter. The order of the filter is large enough to model the pitch of most female speakers. Not being able to use pitch information for male speakers does not cause much degradation. The vocal tract filter is updated every fourth frame, which is once every 20 samples or 2.5 milliseconds. The autocorrelation method is used to obtain the vocal tract parameters.

FIGURE 12: Encoder and decoder for the CCITT G.728 16 kbps CELP speech codec

Ten bits would be able to index 1024 excitation sequences. However, to examine 1024 excitation sequences every 0.625 milliseconds is a rather large computational load. In order to reduce this load, the G.728 algorithm uses a product codebook where each excitation sequence is represented by a normalized sequence and a gain term. The final excitation sequence is a product of the normalized excitation sequence and the gain. Of the 10 bits, 3 bits are used to encode the gain using a predictive encoding scheme, while the remaining 7 bits form the index to a codebook containing 128 sequences.
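The bit budget in this paragraph follows from simple arithmetic, reproduced below as a small sketch. The constant names and the helper split_index are illustrative, and the bit order chosen for the 10-bit index (gain bits first) is an assumption, not taken from the recommendation.

```python
SAMPLE_RATE = 8000      # samples per second
BIT_RATE = 16000        # bits per second
SEGMENT = 5             # samples per excitation vector

bits_per_sample = BIT_RATE / SAMPLE_RATE             # 2 bits per sample
bits_per_segment = int(bits_per_sample * SEGMENT)    # 10 bits per 5-sample segment
segment_ms = 1000 * SEGMENT / SAMPLE_RATE            # duration of one segment in ms

GAIN_BITS, SHAPE_BITS = 3, 7
full_search = 2 ** bits_per_segment                  # sequences a 10-bit index can address
gains = 2 ** GAIN_BITS                               # gain levels addressed by 3 bits
shapes = 2 ** SHAPE_BITS                             # normalized sequences addressed by 7 bits

def split_index(index):
    """Split a 10-bit index into (gain index, shape index); the bit order is an assumption."""
    return index >> SHAPE_BITS, index & (shapes - 1)
```

The product structure means the encoder can evaluate the normalized sequences and the gain levels separately rather than synthesizing all 1024 combinations, which is where the computational saving comes from.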
Block diagrams of the encoder and decoder for the CCITT G.728 coder are shown in Figure 12. The low-delay CCITT G.728 CELP coder operating at 16 kbps provides reconstructed speech quality superior to the 32 kbps CCITT G.726 ADPCM algorithm. Various efforts are under way to reduce the bit rate for this algorithm without compromising too much on quality and delay.
6.5.3 Mixed Excitation Linear Prediction (MELP):
The mixed excitation linear prediction (MELP) coder was selected to be the new federal standard for speech coding at 2.4 kbps. It uses the same LPC filter to model the vocal tract. However, it uses a much more complex approach to the generation of the excitation signal. A block diagram of the decoder for the MELP system is shown in Figure 13. As evident from the figure, the excitation signal for the synthesis filter is no longer simply noise or a periodic pulse but a multiband mixed excitation. The mixed excitation contains both a filtered signal from a noise generator as well as a contribution that depends directly on the input signal.
The first step in constructing the excitation signal is pitch extraction. The MELP algorithm obtains the pitch period using a multistep approach. In the first step an integer pitch value P1 is obtained by
1. first filtering the input using a low-pass filter with a cutoff of 1 kHz
2. computing the normalized autocorrelation for lags between 40 and 160
The normalized autocorrelation r(τ) is defined as

$$r(\tau) = \frac{c_\tau(0,\tau)}{\sqrt{c_\tau(0,0)\, c_\tau(\tau,\tau)}}, \qquad c_\tau(m,n) = \sum_{k=-\lfloor \tau/2 \rfloor - 80}^{-\lfloor \tau/2 \rfloor + 79} y_{k+m}\, y_{k+n},$$

where y_k denotes the filtered input. The first estimate of the pitch P1 is obtained as the value of τ that maximizes the normalized autocorrelation function. The next stage uses two values of P1, one from the current frame and one from the previous frame, as candidates. The normalized autocorrelation values are obtained for lags from five samples less to five samples more than the candidate P1 values. The lags that provide the maximum normalized autocorrelation value for each candidate are used for fractional pitch refinement, which yields the refined candidate P2.
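The integer pitch search can be sketched in a few lines. This is a simplified illustration under stated assumptions: the low-pass filtering step is omitted, the analysis window simply starts at the beginning of the buffer rather than being centered, and the window length of 160 samples and the function names are choices made for the example.

```python
import numpy as np

def normalized_autocorr(y, lag, window=160):
    """Correlation between y[k] and y[k+lag] over one window, normalized by both energies."""
    x0 = y[:window]
    x1 = y[lag:lag + window]
    denom = np.sqrt(np.dot(x0, x0) * np.dot(x1, x1))
    return np.dot(x0, x1) / denom if denom > 0 else 0.0

def integer_pitch(y, min_lag=40, max_lag=160):
    """Lag in [min_lag, max_lag] that maximizes the normalized autocorrelation."""
    lags = range(min_lag, max_lag + 1)
    scores = [normalized_autocorr(y, lag) for lag in lags]
    return lags[int(np.argmax(scores))]

# Illustrative "voiced" signal with a period of 100 samples (80 Hz at 8 kHz).
n = np.arange(400)
voiced = np.sin(2 * np.pi * n / 100) + 0.5 * np.sin(2 * np.pi * n / 50)
pitch = integer_pitch(voiced)
```

For this test signal the search should return a lag of 100 samples. The normalization keeps the score between -1 and 1 regardless of signal level, so frames of different loudness can be compared with the same threshold.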


Figure 13: Block diagram of MELP decoder.

The final refinements of the pitch value are obtained using the linear prediction residuals. The residual sequence is generated by filtering the input speech signal with the filter obtained using the LPC analysis. For the purposes of pitch refinement the residual signal is filtered using a low-pass filter with a cutoff of 1 kHz. The normalized autocorrelation function is computed for this filtered residual signal for lags from five samples less to five samples more than the candidate P2 value, and a candidate value of P3 is obtained.
The input is also subjected to a multiband voicing analysis using five filters with passbands 0-500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz. The goal of the analysis is to obtain the voicing strengths Vbp_i for each band used in the shaping filters. If the value of Vbp_1 is small, this indicates a lack of low-frequency structure, which in turn indicates an unvoiced or transition input. Thus, if Vbp_1 < 0.5, the pulse component of the excitation signal is selected to be aperiodic, and this decision is communicated to the decoder by setting the aperiodic flag to 1. When Vbp_1 > 0.6, the values of the other voicing strengths are quantized to 1 if their value is greater than 0.6, and to 0 otherwise. In this way signal energy in the different bands is turned on or off depending on the voicing strength.
In order to generate the pulse input, the algorithm measures the magnitude of the discrete Fourier transform coefficients corresponding to the first 10 harmonics of the pitch. The magnitudes of the harmonics are quantized using a vector quantizer with a codebook size of 256. The codebook is searched using a weighted Euclidean distance that emphasizes lower frequencies over higher frequencies.
At the decoder, using the magnitudes of the harmonics and information about the periodicity of the pulse train, the algorithm generates one excitation signal. Another signal is generated using a random number generator. Both are shaped by the multiband shaping filter before being combined. This mixture signal is then processed through an adaptive spectral enhancement filter, which is based on the LPC coefficients, to form the final excitation signal. Note that in order to preserve continuity from frame to frame, the parameters used for generating the excitation signal are adjusted based on their corresponding values in neighboring frames.

6.6 MPEG Audio Coding
The formal name of MPEG-1 is the international standard for moving picture video compression, IS 11172. It consists of five parts, of which part 3 [ISO/IEC 93] is the definition of the audio compression algorithm. The document describing MPEG-1 has normative and informative sections. A normative section is part of the standard specification. It is intended for implementers, it is written in a precise language, and it should be strictly followed in implementing the standard on actual computer platforms. An informative section, on the other hand, illustrates concepts that are discussed elsewhere, explains the reasons that led to certain choices and decisions, and contains background material. An example of a normative section is the tables of various parameters and of the Huffman codes used in MPEG audio. An example of an informative section is the algorithm used by MPEG audio to implement a psychoacoustic model. MPEG does not require any particular algorithm, and an MPEG encoder can use any method to implement the model. This informative section simply describes various alternatives.
The MPEG-1 and MPEG-2 (or in short, MPEG-1/2) audio standard specifies three compression methods called layers and designated I, II, and III. All three layers are part of the MPEG-1 standard. A movie compressed by MPEG-1 uses only one layer, and the layer number is specified in the compressed stream. Any of the layers can be used to compress an audio file without any video. An interesting aspect of the design of the standard is that the layers form a hierarchy in the sense that a layer III decoder can also decode audio files compressed by layers I or II.
The result of having three layers was an increasing popularity of layer III. The encoder is extremely complex, but it produces excellent compression, and this, combined with the fact that the decoder is much simpler, produced in the late 1990s an explosion of what are popularly known as mp3 sound files. It is easy to legally and freely obtain a layer III decoder and much music that is already encoded in layer III. So far, this has been a big success of the audio part of the MPEG project.
The principle of MPEG audio compression is quantization. The values being quantized, however, are not the audio samples but numbers (called signals) taken from the frequency domain of the sound. The fact that the compression ratio (or equivalently, the bit rate) is known to the encoder means that the encoder knows at any time how many bits it can allocate to the quantized signals. Thus, the (adaptive) bit allocation algorithm is an important part of the encoder. This algorithm uses the known bit rate and the frequency spectrum of the most recent audio samples to determine the size of the quantized signals such that the quantization noise (the difference between an original signal and a quantized one) will be inaudible.

Figure 14: MPEG Audio: (a) Encoder and (b) Decoder

The psychoacoustic models use the frequency of the sound that is being compressed, but the input stream consists of audio samples, not sound frequencies. The frequencies have to be computed from the samples. This is why the first step in MPEG audio encoding is a discrete Fourier transform, where a set of 512 consecutive audio samples is transformed to the frequency domain. Since the number of frequencies can be huge, they are grouped into 32 equal-width frequency subbands (layer III uses different numbers but the same principle). For each subband, a number is obtained that indicates the intensity of the sound at that subband's frequency range. These numbers (called signals) are then quantized. The coarseness of the quantization in each subband is determined by the masking threshold in the subband and by the number of bits still available to the encoder. The masking threshold is computed for each subband using a psychoacoustic model.
MPEG uses two psychoacoustic models to implement frequency masking and temporal masking. Each model describes how a loud sound masks other sounds that happen to be close to it in frequency or in time. The model partitions the frequency range into 24 critical bands and specifies how masking effects apply within each band. The masking effects depend, of course, on the frequencies and amplitudes of the tones. When the sound is decompressed and played, the user (listener) may select any playback amplitude, which is why the psychoacoustic model has to be designed for the worst case. The masking effects also depend on the nature of the source of the sound being compressed. The source may be tone-like or noise-like. The two psychoacoustic models employed by MPEG are based on experimental work done by researchers over many years.
The decoder must be fast, since it may have to decode the entire movie (video and audio) in real time, so it must be simple. As a result it does not use any psychoacoustic model or bit allocation algorithm. The compressed stream must therefore contain all the information that the decoder needs for dequantizing the signals. This information (the size of the quantized signals) must be written by the encoder on the compressed stream, and it constitutes overhead that should be subtracted from the number of remaining available bits.
Figure 14 is a block diagram of the main components of the MPEG audio encoder and decoder. The ancillary data is user-definable and would normally consist of information related to specific applications. This data is optional.

6.7 Frequency Domain Coding
The first step in encoding the audio samples is to transform them from the time domain to the frequency domain. This is done by a bank of polyphase filters that transform the samples into 32 equal-width frequency subbands. The filters were designed to provide fast operation combined with good time and frequency resolutions. As a result, their design involved three compromises.
1. The first compromise is the equal widths of the 32 frequency bands. This simplifies the filters but is in contrast to the behavior of the human auditory system, whose sensitivity is frequency-dependent. When several critical bands are covered by a subband X, the bit allocation algorithm selects the critical band with the least noise masking and uses that critical band to compute the number of bits allocated to the quantized signals in subband X.
2. The second compromise involves the inverse filter bank, the one used by the decoder. The original time-to-frequency transformation involves loss of information (even before any quantization). The inverse filter bank therefore receives data that is slightly bad, and uses it to perform the inverse frequency-to-time transformation, resulting in more distortions. Therefore, the design of the two filter banks (for direct and inverse transformations) had to use compromises to minimize this loss of information.
3. The third compromise has to do with the individual filters. Adjacent filters should ideally pass different frequency ranges. In practice, they have considerable frequency overlap. Sound of a single, pure frequency can therefore penetrate through two filters and produce signals (that are later quantized) in two of the 32 subbands instead of in just one subband.
The polyphase filter bank uses (in addition to other intermediate data structures) a buffer X with room for 512 input samples. The buffer is a FIFO queue and always contains the most recent 512 samples input. Figure 15 shows the five main steps of the polyphase filtering algorithm.

6.8 MPEG Layer I Coding

Figure 15: Polyphase Filter Bank.

The Layer I coding scheme provides a 4:1 compression. In Layer I coding the time-frequency mapping is accomplished using a bank of 32 subband filters. The output of the subband filters is critically sampled. That is, the output of each filter is downsampled by 32. The samples are divided into groups of 12 samples each. Twelve samples from each of the 32 subband filters, or a total of 384 samples, make up one frame of the Layer I coder. Once the frequency components are obtained the algorithm examines each group of 12 samples to determine a scalefactor. The scalefactor is used to make sure that the coefficients make use of the entire range of the quantizer. The subband output is divided by the scalefactor before being linearly quantized. There are a total of 63 scalefactors specified in the MPEG standard. Specification of each scalefactor requires 6 bits.
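The scaling and linear quantization of one group of 12 samples can be sketched as follows. This is a simplified illustration: the scalefactor here is just the peak magnitude of the group, whereas the standard picks the nearest entry from its table of 63 scalefactors (not reproduced here). The quantizer is a midtread quantizer with 2^b - 1 levels, for b between 2 and 16 bits. Function names and sample values are illustrative.

```python
import numpy as np

def quantize_group(samples, bits):
    """Scale a group into [-1, 1] and quantize with a (2**bits - 1)-level midtread quantizer."""
    scalefactor = float(np.max(np.abs(samples))) or 1.0   # assumption: peak value, not a table entry
    half = (2 ** bits - 2) // 2                            # codes run from -half to +half, zero included
    codes = np.rint(samples / scalefactor * half).astype(int)
    return scalefactor, codes

def dequantize_group(scalefactor, codes, bits):
    """Map integer codes back to sample values."""
    half = (2 ** bits - 2) // 2
    return codes / half * scalefactor

group = np.array([0.02, -0.11, 0.30, 0.27, -0.05, 0.0,
                  0.14, -0.29, 0.08, 0.01, -0.17, 0.22])
sf, codes = quantize_group(group, bits=4)
restored = dequantize_group(sf, codes, bits=4)
```

Dividing by the scalefactor first means the largest sample of the group lands on the outermost quantizer level, so none of the 15 levels of the 4-bit quantizer goes unused because the group happens to be quiet.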


Figure 16: Frame structure for Layer 1.

To determine the number of bits to be used for quantization, the coder makes use of the psychoacoustic model. The inputs to the model include the Fast Fourier Transform (FFT) of the audio data as well as the signal itself. The model calculates the masking thresholds in each subband, which in turn determine the amount of quantization noise that can be tolerated and hence the quantization step size. As the quantizers all cover the same range, selection of the quantization step size is the same as selection of the number of bits to be used for quantizing the output of each subband. In Layer I the encoder has a choice of 14 different quantizers for each band (plus the option of assigning 0 bits). The quantizers are all midtread quantizers ranging from 3 levels to 65,535 levels. Each subband gets assigned a variable number of bits. However, the total number of bits available to represent all the subband samples is fixed. Therefore, the bit allocation can be an iterative process. The objective is to keep the noise-to-mask ratio more or less constant across the subbands.
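One way to picture the iterative allocation is a greedy loop that keeps giving one more bit to the subband whose noise-to-mask ratio is currently worst. The sketch below is an illustration under simplifying assumptions, not the procedure of the standard: each extra bit is assumed to lower the quantization noise by about 6.02 dB, bits are added one at a time (Layer I has no 1-bit quantizer), and the cost of scalefactors and side information is ignored. The signal-to-mask ratios are made-up values.

```python
def allocate_bits(smr_db, bit_budget, samples_per_band=12, max_bits=15):
    """Greedy allocation: give one more bit per sample to the subband with the worst noise-to-mask ratio."""
    bits = [0] * len(smr_db)
    remaining = bit_budget
    while remaining >= samples_per_band:
        # Noise-to-mask ratio in dB, assuming about 6.02 dB less noise per added bit.
        nmr = [smr - 6.02 * b if b < max_bits else float("-inf")
               for smr, b in zip(smr_db, bits)]
        worst = max(range(len(bits)), key=lambda i: nmr[i])
        if nmr[worst] == float("-inf"):
            break                          # every subband already has the maximum
        bits[worst] += 1
        remaining -= samples_per_band      # one more bit for each of the 12 samples
    return bits

smr = [30.0, 24.0, 12.0, 3.0]              # hypothetical signal-to-mask ratios in dB
allocation = allocate_bits(smr, bit_budget=12 * 10)
```

With these inputs the loop ends with the allocation [4, 4, 2, 0], which leaves the noise-to-mask ratios of the three coded subbands within a few dB of each other and spends nothing on the subband that is already almost masked.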
The outputs of the quantization and bit allocation steps are combined into a frame as shown in Figure 16. Because MPEG audio is a streaming format, each frame carries a header, rather than having a single header for the entire audio sequence.
1. The header is made up of 32 bits.
2. The first 12 bits comprise a sync pattern consisting of all 1s.
3. This is followed by a 1-bit version ID,
4. a 2-bit layer indicator, and
5. a 1-bit CRC protection flag. In the MPEG audio standard this bit is set to 0 if the frame is protected by a CRC and to 1 if there is no CRC protection.
6. If the layer and protection information is known, all 16 bits can be used for providing frame synchronization.
7. The next 4 bits make up the bit rate index, which specifies the bit rate in kbits/sec. There are 14 specified bit rates to choose from.
8. This is followed by 2 bits that indicate the sampling frequency. The sampling frequencies for MPEG-1 and MPEG-2 are different (one of the few differences between the audio coding standards for MPEG-1 and MPEG-2) and are shown in Table 2.
9. These bits are followed by a single padding bit. If the bit is 1, the frame needs an additional bit to adjust the bit rate to the sampling frequency. The next two bits indicate the mode. The possible modes are stereo, joint stereo, dual channel, and single channel. The stereo mode consists of two channels that are encoded separately but intended to be played together. The joint stereo mode consists of two channels that are encoded together.

Table 2: Allowable sampling frequencies in MPEG-1 and MPEG-2.

The left and right channels are combined to form a mid and a side signal as follows:

$$M = \frac{L + R}{2}, \qquad S = \frac{L - R}{2}.$$

The dual channel mode consists of two channels that are encoded separately and are not intended to be played together, such as a translation channel. The mode bits are followed by two mode extension bits that are used in the joint stereo mode. The next bit is a copyright bit (1 if the material is copyrighted, 0 if it is not). The next bit is set to 1 for original media and 0 for a copy. The final two bits indicate the type of de-emphasis to be used. If CRC protection is in use, the header is followed by a 16-bit CRC. This is followed by the bit allocations used by each subband and is in turn followed by the set of 6-bit scalefactors. The scalefactor data is followed by the 384 quantized samples.
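The header layout described above can be unpacked with a few shifts and masks. The sketch below is illustrative: the field names are chosen for the example, and it includes a 1-bit private field after the padding bit, which the description above does not mention but which is needed for the field widths to add up to 32 bits. The header word 0xFFFB9064 is just an example value, and the function returns raw field values without interpreting the bit rate or sampling frequency tables.

```python
def parse_header(word):
    """Split a 32-bit MPEG audio frame header into its fields, most significant bits first."""
    fields = [("sync", 12), ("version_id", 1), ("layer", 2), ("protection", 1),
              ("bitrate_index", 4), ("sampling_frequency", 2), ("padding", 1),
              ("private", 1),   # not described in the text; included so the widths sum to 32
              ("mode", 2), ("mode_extension", 2), ("copyright", 1),
              ("original", 1), ("emphasis", 2)]
    header, shift = {}, 32
    for name, width in fields:
        shift -= width
        header[name] = (word >> shift) & ((1 << width) - 1)
    return header

# Mode names in the order of their 2-bit codes 0..3.
MODES = ["stereo", "joint stereo", "dual channel", "single channel"]

h = parse_header(0xFFFB9064)    # example header word
mode_name = MODES[h["mode"]]
```

For the example word the sync field is all 1s, the bit rate index is 9, and the mode field selects joint stereo. A decoder would next look up the bit rate and sampling frequency in the tables of the standard to find the frame length.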

6.9 Layer II Coding
The Layer II coder provides a higher compression rate by making some relatively minor modifications to the Layer I coding scheme. These modifications include how the samples are grouped together, the representation of the scalefactors, and the quantization strategy.
Where the Layer I coder puts 12 samples from each subband into a frame, the Layer II coder groups three sets of 12 samples from each subband into a frame. The total number of samples per frame increases from 384 samples to 1152 samples. This reduces the amount of overhead per sample. In Layer I coding a separate scalefactor is selected for each block of 12 samples. In Layer II coding the encoder tries to share a scalefactor among two or all three groups of samples from each subband filter. The only time separate scalefactors are used for each group of 12 samples is when not doing so would result in a significant increase in distortion. The particular choice used in a frame is signaled through the scalefactor selection information field in the bitstream.
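As a rough illustration of the reduced overhead, consider only the 32-bit frame header (the CRC, bit allocation, and scalefactor fields add further overhead that is not counted here). Spread over the samples of one frame, the header costs

$$\frac{32}{384} \approx 0.083 \ \text{bits per sample in Layer I}, \qquad \frac{32}{1152} \approx 0.028 \ \text{bits per sample in Layer II}.$$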
The major difference between the Layer I and Layer II coding schemes is in the quantization step. In the Layer I coding scheme the output of each subband is quantized using one of 14 possibilities; the same 14 possibilities for each of the subbands. In Layer II coding the quantizers used for each of the subbands can be selected from a different set of quantizers depending on the sampling rate and the bit rates. For some sampling rate and bit rate combinations, many of the higher subbands are assigned 0 bits. That is, the information from those subbands is simply discarded. Where the quantizer selected has 3, 5, or 9 levels, the Layer II coding scheme uses one more enhancement. Notice that in the case of 3 levels we have to use 2 bits per sample, which would have allowed us to represent 4 levels. The situation is even worse in the case of 5 levels, where we are forced to use 3 bits, wasting three codewords, and in the case of 9 levels where we have to use 4 bits, thus wasting 7 levels.
To avoid this situation, the Layer II coder groups 3 samples into a granule. If each sample can take on 3 levels, a granule can take on 27 levels. This can be accommodated using 5 bits. If each sample had been encoded separately we would have needed 6 bits. Similarly, if each sample can take on 9 values, a granule can take on 729 values. We can represent 729 values using 10 bits. If each sample in the granule had been encoded separately, we would have needed 12 bits. Using all these savings, the compression ratio in Layer II coding can be increased from 4:1 to between 6:1 and 8:1.
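The saving from grouping can be checked with a short sketch that packs three quantizer codes into a single integer. The packing order used here (first sample in the least significant position) and the function names are assumptions made for the illustration; the text above only fixes the number of bits.

```python
def bits_needed(levels, samples=1):
    """Bits required to index every combination of `samples` codes with `levels` levels each."""
    return (levels ** samples - 1).bit_length()

def pack_granule(codes, levels):
    """Combine three codes (each in 0..levels-1) into one base-`levels` integer."""
    c0, c1, c2 = codes
    return c0 + levels * c1 + levels * levels * c2

def unpack_granule(value, levels):
    """Recover the three codes from a packed granule value."""
    return value % levels, (value // levels) % levels, value // (levels * levels)

# (bits if coded separately, bits if coded as a granule) for each quantizer size
savings = {levels: (3 * bits_needed(levels), bits_needed(levels, samples=3))
           for levels in (3, 5, 9)}
```

The table comes out as 6 versus 5 bits for 3 levels, 9 versus 7 bits for 5 levels, and 12 versus 10 bits for 9 levels, matching the counts given in the text for the 3-level and 9-level cases.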
The frame structure for the Layer II coder can be seen in Figure 17. The only real difference between this frame structure and the frame structure of the Layer I coder is the scalefactor selection information field.

Figure 17: Frame structure for Layer 2.

6.10 Layer III Coding (mp3)
Layer III coding, which has become widely popular under the name mp3, is considerably more complex than the Layer I and Layer II coding schemes. One of the problems with the Layer I and Layer II coding schemes was that with the 32-band decomposition, the bandwidth of the subbands at lower frequencies is significantly larger than the critical bands. This makes it difficult to make an accurate judgment of the mask-to-signal ratio. If we get a high-amplitude tone within a subband and if the subband was narrow enough, we could assume that it masked other tones in the band. However, if the bandwidth of the subband is significantly higher than the critical bandwidth at that frequency, it becomes more difficult to determine whether other tones in the subband will be masked.
To satisfy the backward compatibility requirement, the spectral decomposition in the Layer III algorithm is performed in two stages. First the 32-band subband decomposition used in Layer I and Layer II is employed. The output of each subband is transformed using a modified discrete cosine transform (MDCT) with a 50% overlap. The Layer III algorithm specifies two sizes for the MDCT, 6 or 18. This means that the output of each subband can be decomposed into 18 frequency coefficients or 6 frequency coefficients.
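The effect of the 50% overlap can be seen in a minimal MDCT sketch: each block of 36 input samples produces 18 coefficients, and the overlapping halves of two consecutive inverse transforms add up to the original samples. The sine window and the 2/N scaling of the inverse transform are assumptions chosen so that this reconstruction works; the sketch is a direct matrix implementation for illustration, not the fast algorithm a real codec would use.

```python
import numpy as np

def sine_window(n_coeffs):
    """Sine window of length 2*n_coeffs; its squared halves sum to 1 where blocks overlap."""
    n = np.arange(2 * n_coeffs)
    return np.sin(np.pi * (n + 0.5) / (2 * n_coeffs))

def mdct(block, n_coeffs):
    """Transform 2*n_coeffs windowed input samples into n_coeffs frequency coefficients."""
    n = np.arange(2 * n_coeffs)
    k = np.arange(n_coeffs)
    basis = np.cos(np.pi / n_coeffs * np.outer(k + 0.5, n + 0.5 + n_coeffs / 2))
    return basis @ (sine_window(n_coeffs) * block)

def imdct(coeffs):
    """Inverse transform: n_coeffs coefficients back to 2*n_coeffs windowed, aliased samples."""
    n_coeffs = len(coeffs)
    n = np.arange(2 * n_coeffs)
    k = np.arange(n_coeffs)
    basis = np.cos(np.pi / n_coeffs * np.outer(n + 0.5 + n_coeffs / 2, k + 0.5))
    return sine_window(n_coeffs) * (basis @ coeffs) * 2 / n_coeffs

N = 18                                   # long-block size; 6 would be the short-block size
rng = np.random.default_rng(2)
x = rng.standard_normal(3 * N)           # stand-in for the output of one subband filter
first = imdct(mdct(x[0:2 * N], N))       # block covering samples 0..35
second = imdct(mdct(x[N:3 * N], N))      # next block, shifted by 18 samples
overlap = first[N:] + second[:N]         # overlap-add of the shared 18 samples
```

Each inverse block on its own contains time-domain aliasing; it is only the sum of the two overlapping halves that reproduces samples 18 to 35 of the input, which is why consecutive blocks must share half their samples.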
The reason for having two sizes for the MDCT is that when we transform a sequence into the frequency domain, we lose time resolution even as we gain frequency resolution. The larger the block size the more we lose in terms of time resolution. The problem with this is that any quantization noise introduced into the frequency coefficients will get spread over the entire block size of the transform. Backward temporal masking occurs for only a short duration prior to the masking sound (approximately 20 msec). Therefore, when a block is longer than this interval, quantization noise spread ahead of a sudden loud sound is not masked and appears as a pre-echo.

For the long windows we end up with 18 frequencies per subband, resulting in a total of 576 frequencies. For the short windows we get 6 coefficients per subband for a total of 192 frequencies. The standard allows for a mixed block mode in which the two lowest subbands use long windows while the remaining subbands use short windows. Notice that while the number of frequencies may change depending on whether we are using long or short windows, the number of samples in a frame stays at 1152. That is 36 samples, or 3 groups of 12, from each of the 32 subband filters.
The coding and quantization of the output of the MDCT is conducted in an iterative fashion using two nested loops. There is an outer loop called the distortion control loop whose purpose is to ensure that the introduced quantization noise lies below the audibility threshold. The scalefactors are used to control the level of quantization noise. In Layer III scalefactors are assigned to groups or bands of coefficients in which the bands are approximately the size of critical bands. There are 21 scalefactor bands for long blocks and 12 scalefactor bands for short blocks.

Figure 19: Frames in Layer III
The inner loop is called the rate control loop. The goal of this loop is to make sure that a target bit rate is not exceeded. This is done by iterating between different quantizers and Huffman codes. The quantizers used in mp3 are companded nonuniform quantizers. The scaled MDCT coefficients are first quantized and organized into regions. Coefficients at the higher end of the frequency scale are likely to be quantized to zero. These consecutive zero outputs are treated as a single region and the run length is Huffman encoded. Below this region of zero coefficients, the encoder identifies the set of coefficients that are quantized to 0 or ±1. These coefficients are grouped into groups of four. This set of quadruplets is the second region of coefficients. Each quadruplet is encoded using a single Huffman codeword.

The remaining coefficients are divided into two or three subregions. Each subregion is assigned a Huffman code based on its statistical characteristics. If the result of using this variable-length coding exceeds the bit budget, the quantizer is adjusted to increase the quantization step size. The process is repeated until the target rate is satisfied. Control then passes back to the outer, distortion control loop, where the psychoacoustic model is used to check whether the quantization noise in any band exceeds the allowed distortion. If it does, the scalefactor is adjusted to reduce the quantization noise. Once all scalefactors have been adjusted, control returns to the rate control loop. The iterations terminate either when the distortion and rate conditions are satisfied or when the scalefactors cannot be adjusted any further.
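The inner (rate control) loop can be sketched as follows. This shows only the rate loop, under several assumptions that are not taken from the text: the companded quantizer uses a 3/4 power law, the step size grows by a factor of 2^(1/4) per iteration, and the Huffman coder is replaced by a crude bit-count estimate. The outer distortion control loop and the scalefactors are left out, and the coefficient values are synthetic.

```python
import numpy as np

def quantize(coeffs, step):
    """Companded quantizer: compress magnitudes with a 3/4 power law, then round (assumed form)."""
    return np.sign(coeffs) * np.rint((np.abs(coeffs) / step) ** 0.75)

def estimate_bits(q):
    """Stand-in for Huffman coding: about log2 of the magnitude plus a sign bit per nonzero value."""
    mags = np.abs(q)
    return int(np.sum(np.ceil(np.log2(mags + 1)) + (mags > 0)))

def rate_control(coeffs, bit_budget, step=1.0, growth=2 ** 0.25):
    """Inner loop: enlarge the quantization step until the coded size fits the bit budget."""
    while True:
        q = quantize(coeffs, step)
        bits = estimate_bits(q)
        if bits <= bit_budget:
            return q, step, bits
        step *= growth

# Synthetic block of 576 coefficients whose magnitude falls off with frequency.
rng = np.random.default_rng(3)
coeffs = rng.standard_normal(576) * np.linspace(100.0, 0.1, 576)
q, step, bits = rate_control(coeffs, bit_budget=1500)
```

Because a larger step drives more of the high-frequency coefficients to zero, the estimated size falls as the loop proceeds, and the loop always terminates since an all-zero block costs no bits in this estimate.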
There will be frames in which the number of bits used by the Huffman coder is less than the amount allocated. These bits are saved in a conceptual bit reservoir. In practice what this means is that the start of a block of data does not necessarily coincide with the header of the frame. Consider the three frames shown in Figure 19. In this example, the main data for the first frame (which includes scalefactor information and the Huffman coded data) does not occupy the entire frame. Therefore, the main data for the second frame starts before the second frame actually begins. The same is true for the remaining data. The main data can begin in the previous frame. However, the main data for a particular frame cannot spill over into the following frame. All this complexity allows for a very efficient encoding of audio inputs. The typical mp3 audio file has a compression ratio of about 10:1. In spite of this high level of compression, most people cannot tell the difference between the original and the compressed representation.
