
International Journal of Recent Technology and Engineering (IJRTE)
ISSN: 2277-3878, Volume-8 Issue-3, September 2019

Speech Recognition of Isolated Words using a New Speech Database in Sylheti

Gautam Chakraborty, Navajit Saikia

Abstract: With the advancements in the field of artificial intelligence, speech recognition based applications have become increasingly popular in recent years. Researchers working in many areas including linguistics, engineering, and psychology have been trying to address various aspects relating to speech recognition in different natural languages around the globe. Although many interactive speech applications in "well-resourced" major languages are being developed, the use of these applications is still limited due to language barriers. Hence, researchers have also been concentrating on designing speech recognition systems for various under-resourced languages. Sylheti is one such under-resourced language, primarily spoken in the Sylhet division of Bangladesh and also spoken in the southern part of Assam, India. This paper has two contributions: i) it presents a new speech database of isolated words for the Sylheti language, and ii) it presents speech recognition systems for the Sylheti language to recognize isolated Sylheti words by applying two variants of neural network classifiers. The performances of these recognition systems are evaluated with the proposed database and the observations are presented.

Keywords: Automatic Speech Recognition, Mel Frequency Cepstral Coefficient, Sylheti, Under-resourced Language, Feed-forward Neural Network, Recurrent Neural Network.

I. INTRODUCTION

Speech is a primary mode of communication among humans. Each uttered word in a language contains linguistic contents (vowel and consonant speech segments) specific to the language. With the advances in machine or artificial intelligence, it has become more pertinent to use voice for man-machine interaction. Even with a small vocabulary of isolated words, speech recognition is used in mobile telephony, interactive television, support systems for differently abled people, robotics, etc. Automatic Speech Recognition (ASR) involves conversion of a given speech signal into a machine-readable format and then transforming it into desired outputs which can be used in applications for practical purposes [1]. As a pattern recognition problem, a speech recognition system compares a given test pattern with the training pattern of each speech class for classification. ASR systems, starting from isolated digit recognition to continuous speech recognition, have evolved significantly in various languages.

Speech recognition deals with isolated words, connected words or continuous speech depending on the requirements of applications, using speech databases varying from small vocabulary to large vocabulary [1],[2],[3],[17],[30],[38]. An ASR system may also be speaker dependent or speaker independent [2],[56]. Technologies like document preparation or retrieval, command and control, automated customer service, etc. use speaker-independent speech recognizers. Speaker-dependent systems are used in applications like interactive voice response systems, computer game control, etc. Factors that affect the performance of an ASR system include type of speech, speaker dependency, vocabulary size, age variation, etc. [1]. A generic ASR system ideally consists of three stages [5]:
i) Signal Pre-processing involves the extraction of voiced parts from a speech signal through a series of signal analyses. It derives the voiced parts in digital form by passing an input speech signal through analog-to-digital conversion and pre-emphasis filtering followed by windowing.
ii) Feature Extraction computes features from each voiced part in the pre-processed signal. Some popular features used in ASR systems are Linear Predictive Coding (LPC) coefficients [5],[6], Mel Frequency Cepstral Coefficients (MFCC) [4],[6],[7],[8],[10], short-time energy [6], i-vector [11], etc. Since the mel-frequency scale is approximately linear below 1 kHz and proportional to the logarithm of the linear frequency above it, it closely reflects human auditory perception; hence, MFCC features are mostly used in ASR systems.
iii) Classification is the process of mapping the feature vector of an input word into 1 out of N word classes of the considered vocabulary during testing. Some popularly used classifiers in ASR are Artificial Neural Networks (ANN) [5],[10],[12],[13], Hidden Markov Models (HMM) [14],[15], Dynamic Time Warping (DTW) [16],[17], Deep Neural Networks [9],[47],[51], etc. ANN is still being used by researchers to design ASR systems [5],[6],[19],[20],[21],[22],[23],[36],[40],[42] despite the developments in the field of deep neural networks (DNN) in recent times. ANN has been a popular choice because of the following characteristics [24]:
• Non-linearity: A neural network has the ability to learn and model the non-linear and complex relationships between inputs and outputs to perform tasks like pattern recognition and classification, multi-dimensional data processing, etc.

Revised Manuscript Received on September 15, 2019.
Gautam Chakraborty, Department of Electronics & Telecommunication Engineering, Assam Engineering College, Guwahati, India. Email: gauchak2012@gmail.com.
Navajit Saikia, Department of Electronics & Telecommunication Engineering, Assam Engineering College, Guwahati, India. Email: navajit.ete@aec.ac.in.

Published By: Blue Eyes Intelligence Engineering & Sciences Publication
Retrieval Number: C5874098319/2019©BEIESP
DOI: 10.35940/ijrte.C5874.098319

• Robustness: Because of the inherent parallel structure, a neural network can continue to work even if some element of the network fails.
• Adaptability: A neural network has self-learning capability, and hence does not require reprogramming in a dynamic environment.
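The generic three-stage ASR structure outlined above (signal pre-processing, feature extraction, classification) can be viewed as a simple composition of functions. The following is a minimal sketch in Python; the stage functions passed in are hypothetical placeholders, not the implementations used by the authors:

```python
# Minimal sketch of the generic three-stage ASR structure.
# The three stage callables are illustrative placeholders.

def recognize(signal, preprocess, extract_features, classify):
    """Run a speech sample through the three generic ASR stages."""
    voiced = preprocess(signal)           # stage i: signal pre-processing
    features = extract_features(voiced)   # stage ii: feature extraction
    word_class = classify(features)       # stage iii: map features to 1 of N classes
    return word_class
```

In a concrete system the callables would be, for example, an end-point detector, an MFCC extractor and a trained neural network's prediction function.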
Many speech interactive applications are already in use which facilitate major languages or "well-resourced languages" like English, Chinese, French, German, Russian, Hindi, etc. However, the language barrier seen in human interaction demands ASR systems in "under-resourced languages" [25],[29]. An under-resourced language is one which has some shortfalls such as lack of a writing system, lack of linguistic study, limited or unavailable electronic speech resources, etc. [25],[26]. A comprehensive list of 7,111 languages in the world is available in [27], which includes both well- and under-resourced languages. Researchers in recent years have reported speech recognition solutions for some of the under-resourced languages [28],[29],[58],[59].

The language Sylheti is an under-resourced language with limited linguistic study [31],[32],[33],[41], few printed and electronic literatures [34], limited linguistic expertise, etc. There is also no record of any speech and language technology on it. Sylheti belongs to the Indo-Aryan language family [32] and more than ten million people speak Sylheti globally. It is spoken largely in the Sylhet Division of Bangladesh and also spoken in the northern part of Tripura and the Barak Valley region of Assam, India. Sylheti is written in its unique Sylheti Nagari script and the alphabets are presented in [34]. Sylheti has a total of 32 alphabets comprising 5 vowels and 27 consonants. A phonetic study of the Sylheti language has been carried out only recently in [32], which introduces the phonemes present in Sylheti for the first time. Some characteristics of this language, such as its distinctive way of pronunciation, de-aspiration and deaffrication, etc., are also worth mentioning here [32].

From the perspective of this work, a brief introduction to ANN is presented here. ANN as a machine learning model replicates the human brain using a large collection of nonlinear information processing elements (called artificial neurons, nodes or units), organized in layers: one input layer, one or more hidden layers and one output layer [24]. There may be one or many hidden layers depending on the network architecture (single-layer or multi-layer). The net input to a neuron is equal to the weighted sum of the outputs of the neurons feeding it. For example, consider that the neuron Y receives inputs from three neurons X1, X2 and X3 through communication links having weights $w_1$, $w_2$ and $w_3$ respectively, as shown in Figure 1. If $x_1$, $x_2$ and $x_3$ are the respective outputs of X1, X2 and X3, the net input to Y is:

$y_{in} = w_1 x_1 + w_2 x_2 + w_3 x_3$    (1)

Figure 1. Model of an artificial neuron

The weight associated with a communication link measures the quantity of knowledge of the local network formed by the two neurons. The output $y$ of neuron Y depends on an activation function $f(\cdot)$ according to:

$y = f(y_{in})$    (2)

The activation function used in a neural network is generally non-linear and can be chosen from the sigmoid function, hyperbolic tangent function, etc. [5]. This output $y$ may be connected to one or more neurons in the next layer. During training, the weights associated with the communication links between neurons are adjusted to resolve the differences between the actual and predicted outcomes for subsequent forward passes in the network. It is to be noted that in an ANN classifier, the numbers of nodes in the input and output layers match the number of input features and output classes respectively.

Based on the interconnections of neurons, ANN may be broadly categorised into two types: feed-forward and feedback or recurrent [24] neural networks, which are shown in Figure 2 (with one input, one hidden and one output layer). While a feed-forward neural network (FFNN) propagates data from input to output through the hidden nodes, a recurrent neural network (RNN) uses feedback through an internal state memory as shown in Figure 2(b). Due to the use of feedback, an RNN has the ability to deal with time-varying dynamic inputs. It is to be noted that the stability of an ANN depends on the number of neurons in the hidden layer(s).

(a) Architecture of FFNN
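Equations (1) and (2) can be checked numerically. The following is a small sketch of a single artificial neuron, assuming a sigmoid activation (one of the choices mentioned above):

```python
import numpy as np

def neuron_output(x, w):
    """Single artificial neuron: net input (Eq. 1) followed by activation (Eq. 2)."""
    y_in = float(np.dot(w, x))           # Eq. (1): y_in = w1*x1 + w2*x2 + w3*x3
    return 1.0 / (1.0 + np.exp(-y_in))   # Eq. (2): y = f(y_in), sigmoid chosen here
```

For example, `neuron_output([x1, x2, x3], [w1, w2, w3])` returns 0.5 whenever the net input is zero, since the sigmoid maps 0 to 0.5.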


(b) Architecture of RNN
Figure 2. Types of ANN

The remaining part of the paper is organized as follows: Section 2 presents a review of literature on isolated word recognition in different languages. A new speech database for the Sylheti language is introduced in Section 3. Section 4 presents the proposed ASR systems for isolated words in Sylheti. Experimental results are presented in Section 5 followed by discussions. Section 6 concludes this work.

II. LITERATURE REVIEW

Speech recognition has been a subject of interest for researchers for decades, leading to the development of speech interfaces for desktop and handheld devices. The first ASR system, designed by researchers at Bell Laboratories in 1952 [35], could recognize ten digits (0-9) spoken by one speaker. In subsequent times, ASR systems for isolated words, connected words and continuous speech in various languages have been reported, employing different acoustic features and classification models [6],[18],[36],[37],[39],[44]. Although the hidden Markov model dominated ASR during the 1980s and 1990s, a significant technology transition has been witnessed towards ANN, due to certain advantages as discussed in Section 1, and also towards DNN in recent times. As the present work is confined to the recognition of isolated words in the Sylheti language, the literature review in the following concentrates on available ASR systems for isolated words in major and under-resourced languages.

A neural network based approach with MFCC, LPC and short-time energy features to recognize isolated English digits is presented by B. P. Das and R. Parekh [6]. The authors report an overall recognition accuracy of 85% with their own database. In [8], N. Seman et al. present a model to recognize isolated Malay words in live video and audio recorded speeches. The authors have used MFCC features and a multi-layer neural network to derive an average classification rate of 84.73%. A recognition system for Chinese digits is proposed in [49] using MFCC features and a neural network classifier, where an average recognition rate of 97.4% is reported. I. Kipyatkova and A. Karpov introduce an RNN based ASR system for the Russian language with a word error rate (WER) of 14% [50]. The speech recognition models proposed in [48] for Arabic digits and words use two variants of neural network: the multi-layer perceptron and Long Short-Term Memory (LSTM). For both models, the authors have reported recognition rates of around 95% in the case of digits as well as words. To recognize isolated Turkish digits, three ASR systems are proposed in [4] applying different types of neural networks. Taking MFCC features of each digit, these systems achieve recognition accuracies in the range from 98.12% to 100%. In [5], the authors present three ASR systems to recognise isolated digits in Assamese using LPC features. These systems use FFNN, RNN and Cooperative Heterogeneous ANN (CHANN). An ASR framework for recognizing isolated Bangla words is reported in [7] which employs MFCC features. The authors use a semantic time delay neural network in the work and report a recognition accuracy of 82.66%. J. T. Geiger et al. [39] have applied RNN in a hybrid NN-HMM system architecture, considering the medium-vocabulary 2nd CHiME speech corpus in their experiments. P. Sarma and A. Garg propose an ASR system in [40] to recognize Hindi words with a neural network classifier. MFCC and PLP features are used, and an average recognition accuracy of 79% is reported. In [13], Marathi isolated words are considered to build an ASR system using a neural network classifier, which also presents an overall classification rate. Another ASR model for Malayalam isolated words is presented in [42] which uses an ANN classifier and a combined feature set comprising MFCC, energy and zero crossing. The authors have reported a recognition accuracy of 96.4%. M. Oprea and D. Schiopu [12] propose an ASR system to recognize Romanian isolated words using a neural network classifier. S. Furui presents a speaker-independent ASR system which is used to recognise names of Japanese cities [44]. With a vocabulary of 100 city names, the system has a recognition rate of 97.6%. M. K. Luka et al. [10] use a neural network classifier to design an ASR system for the Hausa language using MFCC features. In [59], the authors introduce a neural network based ASR system for Gujarati isolated words using two acoustic features, MFCC and real cepstral coefficients, and report comparison results in terms of average classification rates.

In recent years, researchers have been employing DNN in speech recognition. The authors in [45] use a DNN classifier with MFCC features for English digit recognition. When tested with the database of English digits constructed by Texas Instruments, an average recognition accuracy of 86.06% is reported in this work. Considering German speech data, [43] presents an ASR system employing a convolutional neural network which derives a WER of 58.36% and a letter error rate of 22.78%. The authors in [46] use an LSTM RNN to construct an ASR system which is tested on two speech databases: the Augmented Multi-party Interaction (AMI) corpus and the Corpus of Spontaneous Japanese Interaction (CJI). Another ASR system proposed in [47] considers three African languages, Swahili, Amharic and Dinka, and employs a DNN classifier.

From the above discussion, it is observed that researchers are using ANN and its variants in recent times to design ASR systems in some major languages [6],[48],[49],[50] due to their attractive characteristics as discussed in Section 1. It is also noted that ANN is being popularly used for digit and isolated word recognition in under-resourced languages [4],[5],[7],[8],[10]. These ANN based ASR systems are reported to deliver good


recognition rates. As discussed in Section 1, Sylheti is an unexplored and under-resourced language from both linguistic and technological points of view. Speech recognition for Sylheti has not been considered yet, except for an ASR system reported for isolated digits pronounced in Sylheti from 0 to 9 [36] as part of our initiative. Further, there is no speech database available in the Sylheti language in electronic form for applications in speech or speaker recognition. Considering the above observations, we concentrate here on two aspects as follows:
• To construct a new speech database of small vocabulary for isolated Sylheti words. In doing so, the possible future use of the database in an ASR environment is also considered. The Sylheti words (except the digits) are chosen based on the phonetic studies made in [31],[32] such that the words are phonetically rich.
• To design ASR systems for isolated Sylheti words using FFNN and RNN types of neural networks.
The following section presents the proposed Sylheti speech database for isolated words.

III. CONSTRUCTION OF NEW SYLHETI SPEECH DATABASE

A speech database (or corpus) is a collection of utterances for a particular language and is an important resource for building a speech recognizer. The samples in such a database are used for training and testing of an ASR system. In constructing a speech database, a phonetic/linguistic level discussion of the language is found to be relevant. As an under-resourced language, a study of Sylheti is also carried out here based on its phonetic/linguistic characteristics, and accordingly a brief comparison of the phonemic structure of the Sylheti language with two major languages, English and Standard Colloquial Bangla (SCB), is presented below. Thereafter, the work on constructing a new database in Sylheti is discussed. Although the ASR systems to be proposed here do not involve phoneme recognition, the aim in the following is to construct a standard speech database in Sylheti considering phonetically rich words.

Each speech utterance is represented by some finite set of symbols known as phonemes, which describe the basic units of speech sound [52]. The phonemic status of a sound is not the same across languages. Moreover, the number of phonemes varies from one language to another. The phoneme inventory of Sylheti presented in [32] shows that Sylheti has some specific phonemes which are not present in SCB or in a major language like English. This phonetic study also presents a significant reduction and reconstruction compared to that of SCB. Also, the Sylheti language has the nature of obstruent weakening, which involves de-aspiration, spirantization and deaffrication. Altogether Sylheti has 22 phonemes as shown in Table 1, out of which 5 are vowels and 17 are consonants [32]. On the other hand, SCB has 37 phonemes, aggregated from 7 vowel and 30 consonant phonemes [52]. Five vowel phonemes /i/, /e/, /a/, /u/ and /ǝ/ in Sylheti are common to Bangla. The two other vowel phonemes of Bangla, /o/ and /æ/, are merged with the vowel phonemes /u/ and /e/ respectively in Sylheti due to restructuring in articulation [32]. Again, out of the 17 consonant phonemes in Sylheti, 13 phonemes are also present in SCB [52]: /b/, /t/, /ɡ/, /m/, /n/, / /, /s/, /h/, /r/, /l/, / /, /t/, /d/. The other 4 consonant phonemes /z/, /x/, /ɖ/ and /Φ/ are specific to the Sylheti language. The 17 consonant phonemes /p/, /ph/, /bh/, /th/, /d/, /dh/, /t/, /dh/, /c/, /ch/, /k/, /kh/, /ɡh/, /w/, /j/, /Ɉ/, /Ɉh/ available in SCB are not present in Sylheti.

Table 1: Sylheti Phonemes
Vowel phonemes: /i/, /e/, /a/, /u/, /ɔ/
Consonant phonemes: /b/, /t/, /ɡ/, /m/, /n/, / /, /s/, /h/, /r/, /l/, / /, /t/, /d/, /z/, /x/, /ɖ/, /Φ/

When Sylheti is compared with English, it is observed that the English language has a total of 12 vowel phonemes [52]. All the 5 vowel phonemes in Sylheti are also present in English. Therefore, the remaining 7 vowel phonemes in English (/ə/, /æ/, /I/, /ɒ/, /ɜ/, /ʌ/, /ʊ/) are specific to that language. On the other hand, 12 consonant phonemes /b/, /t/, /ɡ/, /m/, /n/, / /, /s/, /h/, /r/, /l/, / /, /z/ in Sylheti [32] are also present in English [52]. Hence, Sylheti has 5 language-specific consonant phonemes /t/, /d/, /x/, /ɖ/ and /Φ/ when compared with English. The English language has 12 specific consonant phonemes: /p/, /d/, /ɵ/, /k/, /f/, /v/, /Ʒ/, /t /, /dƷ/, /w/, /j/, /ð/. Therefore, there is enough scope to study the Sylheti language from the linguistic point of view, and [32],[41] may be consulted for details in this regard. This also entails that Sylheti can be studied from technological viewpoints. This phonetic study will facilitate a database in transcribed form which may be used in studying phoneme-based speech recognition and speaker recognition problems in Sylheti. The construction of a Sylheti speech database of small vocabulary consisting of isolated words is presented in the following.

In constructing this new Sylheti speech database of isolated words, the 30 most frequently used monosyllabic words are considered, which are phonetically rich. Out of these words, 10 are the utterances of the digits 0-9 in Sylheti and 20 are other Sylheti words, among which a few are taken from [32]. Table 2 lists the isolated words in Sylheti for the proposed database and shows the meaning in English of each word and the phonemes present (in bold letters).

Table 2: Isolated Sylheti words in the proposed database. The phonemes present in the words are shown in bold letters.

Sylheti word  Meaning       Sylheti word  Meaning
[suinjɔ]      Zero          [ex]          One
[dui]         Two           [tin]         Three
[sair]        Four          [Фas]         Five
[sɔy]         Six           [sat]         Seven
[aʈ]          Eight         [nɔy]         Nine
[ an]         Donate        [ an]         Paddy
[pua]         Boy           [puri]        Girl
[ ud]         Milk          [bari]        Home
[pul]         Flower        [bari]        Heavy
[bala]        Good          [bala]        Bracelet
[jamai]       Husband       [bɔu]         Wife
[ba ]         Boiled Rice   [ba ]         Arthritis
[ma a]        Head          [mux]         Face
[ɡa]          Body          [ɡa]          Wound
[ɡai]         Stroke        [ɡai]         Cow

The construction of a speech database primarily involves speech acquisition and labeling [53]. The acquisition of speech utterances may be from read-out speech or from spontaneous speech [53],[54].


Both of these approaches have their intrinsic advantages and limitations [53]. As this is the first attempt to construct a Sylheti speech database, the case of read-out speech is chosen here. Therefore, speakers are asked to read out the Sylheti words shown in Table 2 and the utterances are captured. In recording the speech utterances, the following hardware and software are used:
• Microphone: iBall Rocky unidirectional microphone (frequency range from 20 Hz to 20 kHz)
• Laptop: Intel Core i3 processor and 2 GB RAM, manufactured by ASUS
• Operating system: Windows 7
• Voice recording software: PRAAT (version praat5367_win64)
Further, the following parameters are set during recording:
• Sampling rate: 16 kHz
• Channel mode: Mono
• Environment: Noise-free closed room environment
• Encoding format: WAV with 16-bit PCM
• Distance of microphone from speaker's mouth: 10-12 cm

This speech database consists of data recorded from 10 native speakers, including 8 male speakers and 2 female speakers, who were willing to contribute to the construction of this database. These speakers do not have any history of speech disorders. As there is no specific rule about the male-female proportion in the construction of a speech database, the literature [51],[56],[58] has considered various proportions like 60%-40%, 70%-30%, 65%-35%, etc. The speakers in this work are chosen from Sylheti-speaking areas in the Karimganj district of the state of Assam and the Kailasahar and Kumarghat districts of the state of Tripura, India, where they have been living since their childhood. The ages of the participating speakers are in the range from 25 to 70 years. Six speakers are in the age group 25 to 45 years and have a graduate degree. The other 4 speakers are in the age group 46 to 70 years and are undergraduates. By choosing speakers in different age groups, the variations in speech characteristics with age are taken care of [55]. Apart from Sylheti, the speakers can also speak English and Hindi. All the speakers are asked to utter (read out) each of the 30 Sylheti words in Table 2 ten times. The samples are recorded and stored according to the naming scheme: speakernumber_age_gender_utteredword_utterancenumber.wav. Thereby, a total of 300 utterances are recorded for each speaker. This exercise derives a speech database containing 3000 speech samples of isolated Sylheti words. The duration of recording for this new Sylheti speech database is approximately 5 hours. In the labeling process [53], the recorded utterances are verified by carefully listening to the target words, and the recorded samples are examined for the presence of any irregular noise or quiet segments. For each recorded utterance, the voiced parts are manually extracted by selecting their beginning and end points, and the unwanted silent parts are removed. This is done using the PRAAT software. The labeling exercise is rechecked by another verifier to confirm that only the voiced parts are retained from the recorded utterances in the final database.

The following section presents two ASR systems for recognizing isolated Sylheti words, which are taken from the above-presented Sylheti speech database, and also analyzes the performance of the proposed systems.

IV. PROPOSED SPEECH RECOGNITION SYSTEM FOR SYLHETI LANGUAGE

It is observed in Section 2 that many ASR systems [4],[6],[7],[8],[42],[49] developed in recent times for "well-resourced" as well as "under-resourced" languages use MFCC features and neural network classifiers due to their distinct characteristics as presented in Section 1. In view of the above, an architecture for an ASR system which employs MFCC features and an ANN classifier, as shown in Figure 3, is proposed in this work. We consider two different types of ANN classifiers to derive two ASR systems for recognizing isolated words in Sylheti.

Figure 3. Architecture for ASR system employing MFCC features and ANN classifier

The functions of each block in Figure 3 are described in the following.

A. Signal Pre-processing
The signal pre-processing involves analog-to-digital (A/D) conversion, end point detection, pre-emphasis filtering and windowing. In A/D conversion, the input speech signal is sampled at 8 kHz and quantized with 16 bits/sample to derive a digital signal. The voiced part of the speech signal is extracted from the digital signal by locating the beginning and end points in the utterance (end point detection). One popular method to extract the voiced part is to compute the zero-crossing rate. Here, the rate at which the speech signal crosses the zero amplitude line by transition from a positive to a negative value or vice versa is measured; the voiced part exhibits a low zero-crossing rate. Another method to extract the voiced part is short-time signal energy. After extraction of the voiced part, a pre-emphasis filter is used to emphasize the high-frequency components in the voiced part. It helps either to separate the signal from noise or to restore a distorted signal. Here, a first-order high-pass finite impulse response (FIR) filter is applied to spectrally flatten the signal. The authors consider the following FIR filter for pre-emphasis [5]:

$x_p[n] = x_v[n] - 0.95\, x_v[n-1]$    (3)

where $x_v[n]$ is the input signal (voiced part) to the pre-emphasis filter and $x_p[n]$ is the output.

Due to its time-varying nature, the speech signal is divided into short segments (of duration ranging from 5 to 100 ms) called frames [5]. Frames are assumed to be stationary and speech analysis is carried out on the frames.
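The pre-emphasis filter of Eq. (3) and the zero-crossing-rate cue for locating voiced parts can be sketched as follows. This is a simplified illustration of the two operations, not the exact end-point detector used by the authors:

```python
import numpy as np

def pre_emphasis(x_v, alpha=0.95):
    """First-order high-pass FIR filter, Eq. (3): x_p[n] = x_v[n] - 0.95*x_v[n-1]."""
    x_v = np.asarray(x_v, dtype=float)
    x_p = np.copy(x_v)
    x_p[1:] -= alpha * x_v[:-1]   # first sample passes through unchanged
    return x_p

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign.
    Voiced speech typically shows a low zero-crossing rate."""
    frame = np.asarray(frame, dtype=float)
    signs = np.signbit(frame)
    return float(np.mean(signs[:-1] != signs[1:]))
```

A frame whose zero-crossing rate falls below a chosen threshold (and whose short-time energy is high) would be retained as voiced; the threshold itself is a design parameter not specified in the paper.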


In ASR systems, generally overlapping frames are considered with a frame duration in the range from 20ms to 40ms and with an overlap of 5ms to 15ms [40]. In this proposed system, a frame duration of 32ms with an overlap of 10ms is considered.

The objective of the windowing stage is to minimize the spectral discontinuities at the boundaries of the frame. The windowing operation can be expressed with the following equation [5]:

F_w(n) = w(n) F(n);  0 ≤ n ≤ N−1    (4)

where N is the number of samples in a frame, F(n) is a frame, w(n) is the window function and F_w(n) is the windowed version of F(n). Researchers usually apply the Hamming window in speech analysis [4],[5],[36]. The coefficients of the Hamming window are computed according to:

w(n) = 0.54 − 0.46 cos(2πn/(N−1));  0 ≤ n ≤ N−1    (5)

This work also uses the Hamming window.

B. MFCC feature extraction

A feature is a set of representative values extracted from an input speech sample that uniquely characterizes the sample. Here, the windowed version of each frame in a speech signal is considered independently to compute a feature set for the frame. The feature sets of all the frames are then concatenated to derive the features for the input speech signal. In the following, the computation of the feature set for the frame F(n) from its windowed version F_w(n) is presented.

As discussed in Section 1, ASR systems for different languages use MFCC coefficients as features due to their high resemblance with the human hearing system [4],[5],[6],[8]. The mel frequency scale is approximately linear up to about 1000Hz and well approximates the sensitivity of the human ear. Therefore, the proposed ASR systems for the Sylheti language also use a set of MFCC coefficients as the features for a frame. The block diagram for computing the MFCC coefficients at frame level is presented in Figure 4.

Figure 4. Computation of MFCC coefficients

The first block in MFCC computation finds the discrete Fourier transform (DFT) coefficients from the windowed version of an input speech frame, deriving the amplitude spectrum. The DFT coefficients are usually obtained by employing the fast Fourier transform (FFT). The mel filter bank converts the frequency scale to the mel scale, which is performed according to:

f_mel = 2595 log10(1 + f_linear/700)    (6)

where f_mel is the mel frequency corresponding to the linear frequency f_linear. Finally, the log is taken of the output and the discrete cosine transform (DCT) is applied to it to obtain the magnitudes of the resulting spectrum [4]. The methodology described above to extract MFCC features from the frames of a speech signal was proposed by Davis and Mermelstein in 1980.

As the first 12 to 13 MFCC coefficients contain the maximum information present in a speech frame [40], we here consider the first 13 MFCC coefficients of a frame as features to represent the frame. Let c_{n,i}, i = 1, 2, …, 13 represent the first 13 MFCC coefficients corresponding to the mel frequencies f_{mel,i}, i = 1, 2, …, 13 for the nth frame of an utterance. For a mel frequency f_{mel,i}, the mean of all the MFCC coefficients derived from the frames of an utterance is computed according to:

m_i = (Σ_{n=1}^{N} c_{n,i}) / N,  i = 1, 2, …, 13    (7)

where N is the number of frames in the utterance. The set of mean values m_i, i = 1, 2, …, 13 acts as the features for the utterance.

C. Neural network based classification

As mentioned in Section 2, the present work proposes to employ two variants of ANN as classifiers in the ASR systems. The role of the ANN classifier is to classify an input speech by measuring its similarity with a reference pattern derived through the training phase. The proposed ASR systems for isolated Sylheti words use FFNN and RNN separately for classification. Each of the neural networks is designed with the following parameters:

Input, output and hidden layers: Both the FFNN and RNN networks are structured with one input layer, one hidden layer and one output layer. The number of neurons in the input layer is taken as 13, as the feature set m_i, i = 1, 2, …, 13 is used to represent an utterance. The output layer contains 30 neurons corresponding to the 30 different words, representing 30 different classes. The selection of an appropriate number of neurons in the hidden layer is challenging in the design of a neural network. Using too few neurons in the hidden layer results in underfitting, whereas a large number of hidden neurons may cause overfitting [24],[57]. The number of hidden neurons may be chosen according to three rule-of-thumb approaches [57]:
 It is between the input and output layer sizes.
 It is smaller than double the input layer size.
 It is the sum of the output layer size and 2/3 of the input layer size.
However, these rules may not result in an optimum hidden layer. Therefore, a trial and error approach with backward or forward selection is generally adopted to find the optimum network architecture [57].
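The feature-extraction chain of Sections 4(A)–4(B) — 32ms Hamming-windowed frames with 10ms overlap, FFT, mel filter bank, log, DCT, and averaging of the first 13 coefficients over all frames (Eq. 7) — can be sketched as below. This is a minimal illustration, not the authors' implementation; the filter count (26) and the helper names are our own assumptions.

```python
import numpy as np

def hz_to_mel(f_linear):
    # Eq. (6): mel frequency for a linear frequency in Hz
    return 2595.0 * np.log10(1.0 + f_linear / 700.0)

def mel_to_hz(f_mel):
    # Inverse of Eq. (6)
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, frame_len, fs):
    # Triangular filters spaced uniformly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, frame_len // 2 + 1))
    for i in range(1, n_filters + 1):
        fb[i - 1, bins[i - 1]:bins[i]] = np.linspace(0, 1, bins[i] - bins[i - 1], endpoint=False)
        fb[i - 1, bins[i]:bins[i + 1]] = np.linspace(1, 0, bins[i + 1] - bins[i], endpoint=False)
    return fb

def dct2_ortho(x):
    # Orthonormal DCT-II of the log filter-bank energies
    N = len(x)
    n = np.arange(N)
    X = np.array([np.sum(x * np.cos(np.pi * (n + 0.5) * k / N)) for k in range(N)])
    X[0] *= np.sqrt(1.0 / N)
    X[1:] *= np.sqrt(2.0 / N)
    return X

def mfcc_means(signal, fs, n_coeffs=13, n_filters=26, frame_ms=32, overlap_ms=10):
    # 32 ms Hamming-windowed frames with 10 ms overlap, as in the paper
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * (frame_ms - overlap_ms) / 1000)
    window = np.hamming(frame_len)                          # Eq. (5)
    fb = mel_filterbank(n_filters, frame_len, fs)
    coeffs = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window    # Eq. (4): windowing
        spectrum = np.abs(np.fft.rfft(frame))               # DFT via FFT
        energies = np.log(fb @ (spectrum ** 2) + 1e-10)     # mel filter bank + log
        coeffs.append(dct2_ortho(energies)[:n_coeffs])      # first 13 MFCCs
    return np.mean(coeffs, axis=0)                          # Eq. (7): mean over frames
```

The returned 13-element vector corresponds to the per-utterance feature set m_i of Eq. (7).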


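The network shape described in Section 4(C) — 13 input neurons, one hidden layer (46 neurons, as settled empirically in Section V), and 30 output neurons with tansigmoid and logsigmoid activations — can be sketched as a forward pass. This is a hand-rolled illustration with randomly initialized weights, not the trained networks of the paper; training would use scaled conjugate gradient back-propagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the paper: 13 MFCC mean features -> 46 hidden -> 30 word classes
n_in, n_hidden, n_out = 13, 46, 30
W1 = rng.standard_normal((n_hidden, n_in)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_out, n_hidden)) * 0.1
b2 = np.zeros(n_out)

def tansig(x):
    # Hidden-layer activation: zero-centered output in (-1, 1)
    return np.tanh(x)

def logsig(x):
    # Output-layer activation: output in (0, 1), close to 1 for the detected word
    return 1.0 / (1.0 + np.exp(-x))

def forward(features):
    hidden = tansig(W1 @ features + b1)
    return logsig(W2 @ hidden + b2)           # one score per word class

scores = forward(rng.standard_normal(n_in))   # stand-in for the 13 mean MFCCs
predicted_word = int(np.argmax(scores))       # index of the recognised word (0..29)
```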
In the present work, the number of neurons in the hidden layer is decided empirically as discussed above. The observed performances for both the FFNN and RNN networks suggest 46 neurons in the hidden layer. A detailed description of these performances is presented in the next section.

Activation function: The non-linear activation (transfer) functions logsigmoid and tansigmoid are used in this study for the output and hidden layers, respectively. The basic reasons for using the sigmoid function are its smoothness, continuity and positive derivative. The logsigmoid function in the output layer produces the network outputs in the interval [0,1], i.e. the output of a class is close to 1 once the word is detected and 0 otherwise. The tansigmoid function, in turn, produces a zero-centered output between -1 and 1, which makes optimization easier.

Training algorithm: In both the ASR systems, the scaled conjugate gradient back-propagation method is used to train the networks due to its better learning speed [4],[5]. Many other authors have also used this training algorithm for the same advantage [6],[12]. As a supervised algorithm, this back-propagation method optimizes the weights of the neurons by using a loss/cost function [5] and produces faster convergence than other methods.

The following section presents the experimental setup and observations of the proposed ASR systems.

V. EXPERIMENTAL RESULT AND ANALYSIS

This work performs two sets of experimentations relating to the above-said two ASR systems for isolated Sylheti words. The first set deals with the FFNN based ASR system and the second set with the RNN based ASR system. The following parameters are considered during the experimentations:
1. Features: The set of 13 MFCC-based features m_i, i = 1, 2, …, 13 for each utterance, as presented in Section 4(B).
2. Classifiers: FFNN and RNN types, as presented in Section 4(C).
3. Activation functions: tansigmoid for the hidden layer and logsigmoid for the output layer.
4. Training and testing datasets: The database for the Sylheti language presented in Section 3 has a total of 3000 utterances of 30 words, where each word is uttered 10 times by each speaker. Out of the 3000 utterances, 1500 utterances comprising 50 utterances of each word are considered for training the networks. The other 1500 utterances are used for testing.
5. Convergence: Targeted mean-squared error (MSE) of 0.001 during training.
6. Performance measure: The performances of the ASR systems are studied in terms of the percentage recognition rate (%RR), which is computed according to:

%RR = (Number of correct word recognitions / Total number of word utterances used in testing) × 100%    (8)

As discussed in the previous section, the performances of the proposed ASR systems change when the number of neurons in the hidden layer is varied. In order to decide the optimum number of neurons in the hidden layer for the proposed ASR systems, we conducted training and testing of the networks by varying the number of neurons in the hidden layer according to the three rules-of-thumb mentioned in Section 4(C). Better performances are observed when the size of the hidden layer is set at 38 as per the third rule (i.e., the number of neurons is equal to the sum of the output layer size and 2/3 of the input layer size). However, to achieve superior performances, the trial and error approach is adopted in backward and forward directions by taking hidden layer neurons in the range 36 to 50. Figure 5 presents plots of the observed performances of the systems using the FFNN and RNN networks. It can be observed in the plots that the maximum performance of 84.5% is obtained for the proposed FFNN based ASR system when the hidden layer contains 46 neurons. Similarly, for the RNN based system, the hidden layer with 46 neurons derives the best performance of 86.6%. We, therefore, consider 46 neurons in the hidden layers of the proposed systems.

Figure 5. Observed performance plots with different numbers of neurons in the hidden layer (axes: %RR versus nodes in the hidden layer)

A neural network model stops its training when any one of two conditions is met: a) the maximum number of epochs is reached, or b) the performance has converged to the goal. In the presented work, the first condition is satisfied. A convergence plot is often generated in the training phase to show the closeness of the network outputs to the target values. It presents the MSE values between the corresponding network outputs and targeted values [5],[36]. Figure 6 presents convergence plots for both the proposed ASR systems in terms of MSE values. It is observed that the convergence of the RNN based ASR system is better than that of the FFNN based system. This is due to the inherent nature of feedback looping in RNN, which tries to adjust the errors of the outputs of the neurons during training.

To further examine the robustness and performances of the proposed systems with respect to variations in training and testing samples, different combinations from the available 3000 word utterances of the proposed database are considered for training and testing. The total 3000 utterances are divided into four non-overlapping groups G1, G2, G3, and G4.


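The group-wise evaluation over G1–G4 — training on two groups and testing on the remaining two, for all 4C2 = 6 splits, scored by the %RR measure of Eq. (8) — can be enumerated as below. The group labels are placeholders; no actual utterance data from the paper is involved.

```python
from itertools import combinations

GROUPS = ("G1", "G2", "G3", "G4")   # four non-overlapping sets of 750 utterances

def percent_rr(correct, total):
    # Eq. (8): percentage recognition rate
    return 100.0 * correct / total

# Every way of picking two groups for training; the remaining two are for testing
splits = [(train, tuple(g for g in GROUPS if g not in train))
          for train in combinations(GROUPS, 2)]

for train, test in splits:
    print("train:", train, "test:", test)

print(len(splits), "train/test dataset pairs")   # 4C2 = 6
```

For example, 1260 correct recognitions out of 1500 test utterances would give percent_rr(1260, 1500) = 84.0.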
Figure 6. Convergence plots for the proposed ASR systems

In each group, 750 utterances (25 utterances of each of the 30 words) are considered. Out of these four groups, two groups are considered for training and the other two groups are used for testing. Thereby, a total of 4C2 = 6 different training and testing datasets are used. The corresponding observed recognition rates for both the proposed systems are presented in Table 3.

Table 3. Performances of both the ASR systems

ASR system   Training dataset   Testing dataset   %RR    Average %RR
FFNN         G1,G2              G3,G4             83.9
             G1,G3              G2,G4             84.5
             G1,G4              G2,G3             85.0
             G2,G3              G1,G4             85.8
             G2,G4              G1,G3             84.6
             G3,G4              G1,G2             83.5   84.55
RNN          G1,G2              G3,G4             85.0
             G1,G3              G2,G4             88.3
             G1,G4              G2,G3             87.0
             G2,G3              G1,G4             86.6
             G2,G4              G1,G3             86.0
             G3,G4              G1,G2             85.4   86.38

It may be observed from the above experimentations that both the proposed systems perform more or less consistently when different training and testing Sylheti datasets are used. This implies good robustness of both the systems to variations in datasets. However, the RNN based ASR system derives better recognition accuracy (average %RR of 86.38) than the ASR system using FFNN (average %RR of 84.55). The better performance with the RNN classifier may be due to its inherent feedback characteristics, as discussed in Section 1. Due to the speech variability arising from age variation in the constructed Sylheti speech database (which affects the performance of any ASR system), a minor deterioration of the recognition results is also noticeable in the presented ASR systems. Thus, from the generated results it can be concluded that the observed performances of the ASR systems for Sylheti presented above are comparable to the performances of similar systems available for other languages [6],[7],[8],[40],[50] and hence are considered to be satisfactory.

VI. CONCLUSION

Speech recognition using neural networks has been an area of research interest for long, and many ASR systems have been proposed for different languages around the globe. This paper has considered the "under-resourced" Sylheti language. As no speech database for Sylheti in electronic form is available, a new speech database of isolated Sylheti words has been proposed which can be used by researchers working in the domains of speech processing in Sylheti. This paper has also presented two ASR systems to recognize isolated Sylheti words by applying two variants of neural network classifiers, FFNN and RNN. It has been observed that the overall performance of the ASR system using the RNN network (recognition rate: 86.38%) is better than that of the FFNN based ASR system (84.55%), which is due to the feedback of RNN. One of our future works will concentrate on updating this constructed Sylheti database to include connected words and also on designing an ASR system for recognizing connected words in Sylheti. Another future work will be to employ DNN in an ASR system for Sylheti. Also, the problem of speaker identification will be taken up for the Sylheti language.

REFERENCES
1. C. Kurian, "A Survey on Speech Recognition in Indian Languages", International Journal of Computer Science and Information Technologies, vol. 5, no. 5, 2014, pp. 6169-6175.
2. R. Matarneh, S. Maksymova, V. V. Lyashenko and N. V. Belova, "Speech Recognition Systems: A Comparative Review", IOSR Journal of Computer Engineering, vol. 19, no. 5, 2017, pp. 71-79.
3. S. K. Gaikwad, B. W. Gawali and P. Yannawar, "A Review on Speech Recognition Technique", International Journal of Computer Applications, vol. 10, no. 3, Nov. 2010, pp. 16-24.
4. G. Dede and M. H. Sazli, "Speech recognition with artificial neural networks", Elsevier Journal of Digital Signal Processing, vol. 20, no. 3, May 2010, pp. 763-768.
5. M. Sarma, K. Dutta and K. K. Sarma, "Assamese Numeral Corpus for speech recognition using Cooperative ANN architecture", International Journal of Computer, Electrical, Automation, Control and Information Engineering, vol. 3, no. 4, 2009.
6. B. P. Das and R. Parekh, "Recognition of Isolated Words using Features based on LPC, MFCC, ZCR and STE, with Neural Network Classifiers", International Journal of Modern Engineering Research (IJMER), vol. 2, no. 3, May-June 2012, pp. 854-858.
7. Y. A. Khan, S. M. Mostaq Hossain and M. M. Hoque, "Isolated Bangla word recognition and Speaker detection by Semantic Modular Time Delay Neural Network (MTDNN)", 18th International Conference on Computer and Information Technology, Dhaka, Bangladesh, 21-23 Dec. 2015.
8. N. Seman, Z. A. Bakar and N. A. Bakar, "Measuring the performance of Isolated Spoken Malay Speech Recognition using Multi-layer Neural Network", International Conference on Science and Social Research (CSSR 2010), Kuala Lumpur, Malaysia, December 2010.
9. A. Mohammed, G. E. Dahl and G. Hinton, "Acoustic Modeling using Deep Belief Networks", IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 1, January 2012, pp. 14-22.
10. M. K. Luka, I. A. Frank and G. Onwodi, "Neural Network Based Hausa Language Speech Recognition", International Journal of Advanced Research in Artificial Intelligence, vol. 1, no. 2, 2012, pp. 39-44.
11. A. Kanagasundaram, "Speaker Verification using I-vector Features", PhD thesis, Queensland University of Technology, 2014.
12. M. Oprea and D. Schiopu, "An Artificial Neural Network-Based Isolated Word Speech Recognition System for the Romanian Language", 16th International Conference on System Theory, Control and Computing (ICSTCC), 12-14 Oct. 2012, Sinaia, Romania.


13. K. R. Ghule and R. R. Deshmukh, "Automatic Speech Recognition of Marathi isolated words using Neural Network", International Journal of Computer Science and Information Technologies, vol. 6, no. 5, 2015, pp. 4296-4298.
14. M. K. Sarma, A. Gajurel, A. Pokhrel and B. Joshi, "HMM based isolated word Nepali speech recognition", Proceedings of the International Conference on Machine Learning and Cybernetics, Ningbo, China, 2017.
15. S. S. Bharali and S. K. Kalita, "A comparative study of different features for isolated spoken word recognition using HMM with reference to Assamese language", International Journal of Speech Technology, Springer, vol. 18, no. 4, 2015, pp. 673-684.
16. S. Xihao and Y. Miyanaga, "Dynamic time warping for speech recognition with training part to reduce the computation", International Symposium on Signals, Circuits and Systems (ISSCS 2013), 11-12 July 2013.
17. B. W. Gawali, S. Gaikwad, P. Yannawar and S. C. Mehrotra, "Marathi isolated word recognition system using MFCC and DTW features", ACEEE International Journal of Information Technology, vol. 01, no. 01, Mar 2011, pp. 21-24.
18. C. Madhu, A. George and L. Mary, "Automatic language identification for seven Indian languages using higher level features", IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES), Kollam, India, 2017, pp. 1-6.
19. P. Swietojanski, "Learning Representations for Speech Recognition using Artificial Neural Network", doctoral thesis, 2016.
20. M. Borsky, "Robust recognition of strongly distorted speech", doctoral thesis, 2016.
21. S. G. Surampudi and R. Pal, "Speech Signal processing using Neural Networks", IEEE International Advance Computing Conference (IACC 2015), Bangalore, India, 12-13 June 2015.
22. A. Zaatri, N. Azzizi and F. L. Rahmani, "Voice Recognition Technology using Neural Networks", Journal of New Technology and Materials, vol. 5, no. 1, 2015, pp. 26-30.
23. O. I. Abiodun, A. Jantan, A. E. Omolara, K. V. Dada, N. A. Mohamed and H. Arshad, "State-of-the-art in artificial neural network applications: A survey", Heliyon, an Elsevier Journal, vol. 4, no. 11, 2018.
24. L. Fausett, "Fundamentals of Neural Networks: Architecture, Algorithms and Applications", Prentice-Hall, Inc., New Jersey, 1994.
25. L. Besacier, E. Barnard, A. Karpov and T. Schultz, "Automatic Speech Recognition for Under-Resourced Languages: A Survey", Speech Communication, vol. 56, January 2014, pp. 85-100.
26. V. Berment, "Methods to computerise "little equipped" languages and group of languages", PhD thesis, J. Fourier University-Grenoble I, May 2004.
27. "Ethnologue Languages of the World", https://www.ethnologue.com/statistics/status.
28. H. B. Sailor, M. V. S. Krishna, D. Chhabra, A. T. Patil, M. R. Kamble and H. A. Patil, "DA-IICT/IIITV system for low resource speech recognition challenge 2018", Interspeech 2018, 2-6 September 2018, Hyderabad, pp. 3187-3191.
29. M. A. Hasegawa-Johnson, P. Jyothi, D. McCloy, M. Mirbagheri, G. M. di Liberto, A. Das, B. Ekin, C. Liu, V. Manohar, H. Tang, E. C. Lalor, N. F. Chen, P. Hager, T. Kekona, R. Sloan and A. K. C. Lee, "ASR for Under-Resourced Languages From Probabilistic Transcription", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, January 2017.
30. K. Kumar, R. K. Aggarwal and A. Jain, "A Hindi speech recognition system for connected words using HTK", International Journal of Computational Systems Engineering, vol. 1, no. 1, 2012, pp. 25-32.
31. A. Gope and S. Mahanta, "Lexical Tones in Sylheti", 4th International Symposium on Tonal Aspects of Languages, Nijmegen, Netherlands, May 13-16, 2014.
32. A. Gope and S. Mahanta, "An Acoustic Analysis of Sylheti Phonemes", Proceedings of the 18th International Congress of Phonetic Sciences, Glasgow, UK, 2015.
33. A. Gope and S. Mahanta, "Perception of Lexical Tones in Sylheti", Tonal Aspects of Languages 2016, 24-27 May 2016, New York.
34. D. M. Kane, "Puthi-Pora: 'Melodic Reading' and its use in the Islamisation of Bengal", Doctoral Dissertation, University of London, 2008.
35. K. H. Davis, R. Biddulph and S. Balashek, "Automatic Recognition of Spoken Digits", Journal of the Acoustic Soc. of America, vol. 24, no. 6, 1952, pp. 627-642.
36. G. Chakraborty, M. Sharma, N. Saikia and K. K. Sarma, "Recurrent Neural Network Based Approach To Recognise Isolated Digits In Sylheti Language Using MFCC Features", Proceedings of the International Conference on Telecommunication, Power Analysis and Computing Techniques (ICTPACT-2017), Chennai, India, 6-8 April 2017.
37. H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for Spoken word recognition", IEEE Trans. Acoustic, Speech Signal Processing, vol. 26, no. 1, Feb 1978, pp. 43-49.
38. X. Lei, A. W. Senior, A. Gruenstein and J. Sorensen, "Accurate and Compact Large vocabulary speech recognition on mobile devices", INTERSPEECH 2013, Lyon, France, 25-29 August 2013, pp. 662-665.
39. J. T. Geiger, Z. Zhang, F. Weninger, B. Schuller and G. Rigoli, "Robust speech recognition using long short term memory recurrent neural networks for hybrid acoustic modeling", Conference of the International Speech Communication Association (INTERSPEECH 2014), 14-18 September 2014, Singapore.
40. P. Sharma and A. Garg, "Feature Extraction and Recognition of Hindi Spoken Words using Neural Networks", International Journal of Computer Applications, vol. 142, no. 7, May 2016, pp. 12-17.
41. A. Goswami, "Simplification of CC sequence of Loan words in Sylheti Bangla", Language in India, vol. 13, no. 6, June 2013.
42. M. MoneyKumar, E. Sherly and W. M. Varghese, "Isolated Word Recognition system for Malayalam Using Machine Learning", Proc. of the 12th Intl. Conference on Natural Language Processing, Trivandrum, India, December 2015, pp. 158-165.
43. J. Kunze, L. Kirsch, I. Kurenkov, A. Krug, J. Johannsmeier and S. Stober, "Transfer Learning for Speech Recognition on a Budget", Proceedings of the 2nd Workshop on Representation Learning for NLP, Vancouver, Canada, August 3, 2017, pp. 168-177.
44. S. Furui, "Speaker-Independent Isolated Word Recognition Using Dynamic Features of Speech Spectrum", IEEE Transactions on Acoustic, Speech and Signal Processing, vol. 34, no. 1, 1986, pp. 52-59.
45. D. Dhanashri and S. B. Dhonde, "Isolated word speech recognition system using Deep Neural Networks", Proceedings of the International Conference on Data Engineering and Communication Technology, vol. 1, 2017, pp. 9-17.
46. T. Hori, C. Hori, S. Watanabe and J. R. Hershey, "Minimum word error Training of Long short-term memory Recurrent Neural Network Language models for Speech recognition", 41st IEEE International Conference on Acoustic, Speech and Signal Processing, Shanghai, China, 2016, pp. 5990-5994.
47. A. Das, P. Jyothi and M. H. Johnson, "Automatic Speech Recognition using Probabilistic transcriptions in Swahili, Amharic and Dinka", Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2016), San Francisco, USA, 8-12 September 2016, pp. 3524-3528.
48. N. Zerari, S. Abdelhamid, H. Bouzgou and C. Raymond, "Bidirectional deep architecture for Arabic speech recognition", Open Computer Science, De Gruyter, 2019, pp. 92-102.
49. C. Xu, X. Wang and S. Wang, "Research on Chinese Digit Speech Recognition Based on Multi-weighted Neural Network", IEEE Pacific-Asia Workshop on Computational Intelligence and Industrial Application, 2008, pp. 400-403.
50. I. Kipyatkova and A. Karpov, "Recurrent Neural Network-based Language modeling for an Automatic Russian Speech Recognition System", Proceedings of the Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT), 9-14 Nov. 2015, St. Petersburg, Russia.
51. D. T. Toledano, M. P. Fernandez-Gallego and A. Lozano-Diez, "Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT", PLoS ONE, vol. 13, no. 10, October 10, 2018.
52. B. Barman, "A contrastive analysis of English and Bangla phonemics", The Dhaka University Journal of Linguistics, vol. 2, no. 4, August 2009.
53. W. Hong and P. Jin'gui, "An undergraduate Mandarin Speech Database for Speaker Recognition Research", Oriental COCOSDA International Conference on Speech Database and Assessments, Urumqi, China, 10-12 August 2009.
54. C. Kurian, "A Review on Speech Corpus Development for Automatic Speech Recognition in Indian Languages", International Journal of Advanced Networking and Applications, vol. 6, no. 6, 2015, pp. 2556-2558.
55. B. Das, S. Mandal, P. Mitra and A. Basu, "Effect of aging on speech features and phoneme recognition: a study on Bengali voicing vowels", International Journal of Speech Technology, Springer, vol. 16, no. 1, March 2013, pp. 19-31.


56. M. Dua, R. K. Aggarwal and M. Biswas, "Performance evaluation of Hindi speech recognition system using optimized filterbanks", Engineering Science and Technology, an International Journal, vol. 21, no. 3, June 2018, pp. 389-398.
57. F. S. Panchal and M. Panchal, "Review on methods of selecting number
of Hidden nodes in Artificial Neural Network", International Journal of
Computer Science and Mobile computing, vol. 3, no. 11, Nov.2014, pp.
455 – 464.
58. B. Deka, J. Chakraborty, A. Dey, S. Nath, P. Sarmah, S. R. Nirmala, and
S. Vijaya, "Speech Corpora of Under Resourced Languages of
North-East India", Oriental COCOSDA 2018, 7-8 May 2018, Miyazaki,
Japan.
59. Desai V. A and Dr. V. K. Thakar, "Neural Network based Gujarati
Speech Recognition for Dataset Collected by in-ear microphone", 6th
International Conference On Advances In Computing &
Communications, ICAAC 2016, 6-8 September,2016, Cochin, India,
pp. 668-675.

AUTHORS PROFILE
Gautam Chakraborty, a research scholar in the Department of Electronics & Telecommunication Engineering, Assam Engineering College, Guwahati, India, under Gauhati University, has been working as an Assistant Professor at NERIM, Guwahati, Assam since 2010. His research interests include speech processing, cloud computing, etc. He has authored many research papers in national and international conference proceedings.

Dr. Navajit Saikia, currently Associate Professor in the Department of Electronics and Telecommunication Engineering, Assam Engineering College, Guwahati, India, has over 23 years of professional experience. His research interests include image processing, speech processing, reversible logic, information security, etc. He has co-authored several research papers in journals and conference proceedings. He is a reviewer of many international journals and has also served as reviewer/TPC member in many national/international conferences.
