Artificial Intelligence for Speech Recognition
by
Arya Singh (1847125)
Under the guidance of
Dr. Arul Kumar N
Computer Network Assignment_1A
Abstract
Speech recognition, or speech-to-text, involves capturing and digitizing sound waves, transforming them into basic linguistic units (phonemes), constructing words from phonemes, and contextually analyzing the words to ensure the correct spelling of words that sound the same.
Approach: This work studies the possibility of designing a software system using neural networks, one of the techniques of artificial intelligence, such that the system can distinguish the speech signals of different users. Fixed weights are first trained on these signal patterns, and the system then outputs a match for each pattern at high speed. The proposed neural network study is based on solving speech recognition tasks, detecting signals using angular modulation, and detecting modulation techniques. Artificial intelligence is the intelligence by which machines work efficiently. Speech recognition is the means by which a computer understands voice input and carries out a required task. It is commonly used for military, commercial and business purposes. Speech recognition processing is performed by software known as a speech recognition engine, which enables communication between humans and computers based on audio signals. It is the science and engineering of making intelligent machines, especially through computer programs, and the process of converting speech signals into words.
TABLE OF CONTENTS
1. Introduction & Motivation
2. Definition
3. History or Technical Background
4. Speech Recognition
5. Salient Features
6. Diagrammatic Representation of Speech Recognition
7. Factors Affecting the Performance of a Speech Recognition System
8. Applications
9. Advantages
10. Disadvantages
11. Conclusion
12. References
1. Introduction & Motivation
The term artificial intelligence was coined in 1956 by McCarthy, who described the ascription of mental qualities to machines. Literally defined as "making intelligent machines, especially intelligent computer programs", artificial intelligence is the intelligence of machines and the branch of computer science that seeks to create it. Intelligence is the computational part of the ability to achieve goals in the world; varying kinds and degrees of intelligence occur in people, animals and machines. AI is also known as the study of mental faculties through computational models. An intelligent agent takes the actions that maximize its chances of success.
Artificial intelligence involves two basic ideas: -
First, it involves studying the thought processes of human beings. Second, it deals with
representing those processes via machines (like computers, robots, etc.).
AI is the behavior of a machine which, if performed by a human being, would be called intelligent. It makes machines smarter and more useful, and is less expensive than natural intelligence. Natural language processing (NLP) refers to artificial-intelligence methods of communicating with a computer in a natural language such as English. The main objective of an NLP program is to understand input and initiate action.
2. Definition:
It is the science and engineering of making intelligent machines, especially intelligent
computer programs.
AI stands for Artificial Intelligence. "Intelligence" itself is hard to define, but AI can be described as the branch of computer science dealing with the simulation of machines exhibiting intelligent behavior: the theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.
Branches of AI:
Figure 1: Branches of AI
3. History or Technical Background
Figure 2: History of Speech Recognition
Technology companies recognize the interest in speech recognition technologies and are working toward making voice recognition a standard for most products. One goal of these companies may be to make voice assistants speak and reply with greater accuracy around context and content.
Research shows that the use of virtual assistants with speech recognition capabilities was forecast to keep increasing, from 60.5 million people in the United States in 2017 to 62.4 million in 2018; by 2019, 66.6 million Americans were projected to be using speech or voice recognition technology.
To build a robust speech recognition experience, the artificial intelligence behind it has to become better at handling challenges such as accents and background noise. Developments in natural language processing and neural network technology have improved speech and voice technology so much that it is now reportedly on par with humans. In 2017, for example, Microsoft recorded a word error rate of 5.1 percent for its voice technology, while Google reported that it had reduced its rate to 4.9 percent.
Research firm Research and Markets reported that the speech recognition market will be worth
$18 billion by 2023. As the voice recognition technology gets bigger and better, the research
estimates that it could be incorporated into everything from phones to refrigerators to cars. A
glimpse of that was seen at the annual CES 2017 show in Las Vegas where new devices with
voice were either launched or announced.
In an effort to show insights on how the leaders in voice recognition compare, we have created
a list highlighting each, as well as its features.
While all applications have very similar features and integration opportunities, we have
clustered them based on what our research points to as the primary focus areas of each. The
two focus areas we will note in this piece are:
Smart Speaker and Smart Home: Highlighting Amazon, Google and Microsoft
Mobile Device Applications: Highlighting Apple’s Siri and Facebook’s speech recognition
integrations.
4. Speech Recognition:
The user communicates with the application through the appropriate input device i.e. a
microphone. The Recognizer converts the analog signal into digital signal for the speech
processing. A stream of text is generated after the processing. This source-language text
becomes input to the Translation Engine, which converts it to the target language text.
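The pipeline just described can be sketched in Python. The function names and the toy translation dictionary below are illustrative stand-ins, not a real speech or translation API; each stage is stubbed with minimal logic so the data flow from microphone to target-language text is visible.

```python
# Sketch of the recognition-and-translation pipeline described above.
# digitize, recognize and translate are hypothetical placeholder stages.

def digitize(analog_samples):
    """Recognizer front end: quantize analog samples to signed 8-bit values."""
    return [max(-128, min(127, round(s * 127))) for s in analog_samples]

def recognize(digital_signal):
    """Stand-in for the recognizer: map the digital signal to source text."""
    return "hello world"  # a real engine would decode the signal here

def translate(source_text):
    """Stand-in for the translation engine: toy English-to-French lookup."""
    toy_dict = {"hello": "bonjour", "world": "monde"}
    return " ".join(toy_dict.get(w, w) for w in source_text.split())

mic_input = [0.1, -0.5, 0.9]           # pretend microphone samples
text = recognize(digitize(mic_input))  # source-language text stream
print(translate(text))                 # bonjour monde
```

In a real system each stub would be replaced by signal processing, acoustic decoding, and a full translation engine, but the interfaces between the stages follow the same shape.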
5. Salient Features
I. Input Modes
Through Speech Engine
Through soft copy
II. Interactive Graphical User Interface
III. Format Retention
IV. Fast and standard translation
V. Interactive Pre-processing tool
Spell checker
Phrase marker
Proper noun, date and other package-specific identifiers
Input format: .txt, .doc, .rtf
User friendly selection of multiple output
Online thesaurus for selection of contextually appropriate synonym
Online word addition, grammar creation and updating facility
Personal account creation and inbox management
Figure 3: Method for Speech Recognition
6. Diagrammatic Representation of Speech Recognition
Figure 4: Speech recognition process
The acoustic model represents the acoustic sounds of a language and can be trained to recognize the characteristics of a particular user's speech pattern and acoustic environment. The lexical model gives a list of a large number of words in a language along with how to pronounce each word. The language model gives the way in which different words of a language are combined. To recognize a word, the recognizer chooses its guess from a finite vocabulary; since a word is uniquely identified by its spelling, different models are used for this purpose.
Trigram model: In an n-gram model the probability of the next word depends on the history of the previous words that have been spoken, i.e. the probability of the next word w_i is conditioned on its history w_1 ... w_{i-1}. As the number of previously spoken words in the history increases, the complexity of the model increases, so to obtain a practical model the trigram model (n = 3) is used: only the two most recent words of the history are used to obtain the conditional probability of the next word, P(w_i | w_{i-2}, w_{i-1}). The term perplexity is used to measure performance; perplexity is defined as the size of the set of words from which the next word is effectively chosen, given the history of previously spoken words. For a trigram model the perplexity in different domains is:
Domain              Perplexity
Radiology           26
Emergency medicine  60
Journalism          105
General English     247
When two language models are given and need to be compared, one method is to use each model in the recognizer and select the one which gives the minimum recognizer error rate. Alternatively, the best language model can be determined by taking the log probability per word on a new text that was not used for building the language model, which yields the perplexity.
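The trigram estimate and the perplexity measure described above can be illustrated on a toy corpus. The tiny text below is invented for demonstration, and the probabilities are plain maximum-likelihood counts with no smoothing.

```python
from collections import Counter
from math import log2

# Toy trigram model: P(w3 | w1, w2) estimated by counting, and perplexity
# computed as 2 to the power of the average negative log2 probability.
corpus = "the scan is clear the scan is normal the scan is clear".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_trigram(w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2)."""
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

# Perplexity of the model evaluated on the same text (for illustration;
# in practice it should be measured on held-out text, as noted above).
logprob = sum(log2(p_trigram(*t)) for t in zip(corpus, corpus[1:], corpus[2:]))
ppl = 2 ** (-logprob / len(corpus[2:]))
print(round(ppl, 3))
```

A lower perplexity means the model is, on average, choosing the next word from a smaller effective set, which is why restricted domains such as radiology show much lower perplexity than general English in the table above.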
As the diagram of the major components shows, the digital speech signal is first transformed into a set of measurements, or features, at a fixed rate, typically every 10-20 msec. These features are then used to search for the most likely word sequence, subject to the constraints imposed by the lexical, language and acoustic models. Throughout this process, training data are used to determine the values of the model parameters.
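The fixed-rate feature extraction step can be sketched as follows. The 8 kHz sampling rate, the per-frame energy feature and the synthetic signal are illustrative assumptions; a real front end would compute richer acoustic features over each frame.

```python
# Sketch of the front-end framing step: cut the digitized signal into
# fixed-rate frames (here 20 ms at 8 kHz, i.e. 160 samples per frame)
# and compute one feature value per frame.

def frame_signal(samples, rate_hz=8000, frame_ms=20):
    """Split a sample list into non-overlapping fixed-length frames."""
    step = rate_hz * frame_ms // 1000  # samples per frame
    return [samples[i:i + step] for i in range(0, len(samples) - step + 1, step)]

def frame_energy(frame):
    """Average energy of a frame: a simple stand-in for real features."""
    return sum(s * s for s in frame) / len(frame)

signal = [0.01 * (i % 100) for i in range(800)]  # 100 ms of fake audio
frames = frame_signal(signal)
features = [frame_energy(f) for f in frames]
print(len(frames))  # 5 frames of 160 samples each
```

Real recognizers typically use overlapping frames and spectral features rather than raw energy, but the fixed 10-20 ms rate mentioned above is the same idea.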
Class model: Instead of using separate words, sets of words, i.e. classes, are used. There may be overlap, i.e. one word may belong to many classes. The classes are made depending upon morphological analysis of the words and segmentation information about the words.
Source-channel model: This model was pioneered by the IBM continuous speech recognition group. It uses a statistical model of the joint distribution p(W, O), where W is the sequence of spoken words and O is the corresponding sequence of observed acoustic information [7][1]. It estimates W, the identity of the spoken words, from the observed acoustics; to minimize the error rate, the recognizer chooses the sequence with the maximum posterior probability.
7. Factors Affecting the Performance of a Speech Recognition System
There are many external factors that affect the performance of a speech recognition system, such as noisy environmental conditions, placement of the IP phone, etc.
Parameter       Range
Speaking mode   Isolated words to continuous speech
Speaking style  Read speech or spontaneous speech
Transducer      IP phone to telephone
Enrollment      Speaker dependent to speaker independent
Vocabulary      Small (< 20 words) to large (> 20,000 words)
Perplexity      Small (< 10) to large (> 100)
Language model  Finite state to context dependent
Vocabulary is the most dominant feature affecting the performance of a speech recognition system, as the recognizer's error rate is no less than the percentage of spoken words that are not in the recognizer's vocabulary. Building the language-model vocabulary is therefore the most important factor; to determine the vocabulary, a corpus (collection) of text together with dictionaries is used. When recognition is restricted to a particular application, a more personalized vocabulary is more useful than a general one.
The table shows the static coverage of unseen text depending upon vocabulary size:
Vocabulary size  Static coverage
20,000           94.1%
64,000           98.7%
100,000          99.3%
200,000          99.4%
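Static coverage, as used in the table above, is simply the fraction of running words in unseen text that fall inside a fixed vocabulary. A toy version of that computation, with an invented vocabulary and sentence, is:

```python
# Static coverage: fraction of running words of unseen text that are in
# the recognizer's vocabulary. Vocabulary and sentence are toy examples.

def static_coverage(vocab, text_words):
    """Return the fraction of text_words found in vocab."""
    in_vocab = sum(1 for w in text_words if w in vocab)
    return in_vocab / len(text_words)

vocab = {"the", "scan", "shows", "a", "normal", "result"}
unseen = "the scan shows a small lesion".split()
print(round(static_coverage(vocab, unseen), 3))  # 0.667: 2 of 6 words are OOV
```

The out-of-vocabulary fraction (here 1 - 0.667) is a floor on the recognizer's word error rate, which is why the table's coverage figures matter so much.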
The next important parameter is the acoustic representation of the phonemes, i.e. the smallest sound units from which words are composed. These phonemes are highly dependent on the context in which they are used; compare, for example, the acoustic representation of the phoneme /t/ in 'true' and 'two'. Other factors that affect the performance of a speech recognition system are the quality and placement of the microphone, the speakers, their emotional and physical condition, speech rate, voice quality, and the size and shape of the vocal tract.
It is therefore very difficult to specify what speech sounds like; moreover, human speech rarely follows strict and formal grammar rules, and a word cannot be said in exactly the same way twice. Speech recognition is therefore never going to find a perfect match; the quality of recognition depends on how good the system is at refining its search, i.e. eliminating poor matches and selecting the most likely ones [7]. The accuracy of recognition depends on good language and acoustic models as well as on the algorithms for both search and sound processing: the better the models and algorithms, the fewer errors are made and the more quickly results are found.
When a general language model is used, it covers a comprehensive language domain, i.e. general day-to-day spoken English. But if recognition is to be used for a particular application, then instead of a general language model it is beneficial to use a model containing only the restricted words required for that application. This has several benefits: it increases accuracy, fewer errors are made, the search is quicker, and each search result is meaningful, because with the restricted vocabulary the recognizer will listen only for the speech required by that application.
Figure 5: Process the I/P audio stream
Using speech recognition for application input (using Windows Vista): the speech recognition system can be seen as consisting of a front end and a back end. The front end processes the input audio stream, isolating sound segments that are probably speech and converting them into a series of numeric values that characterize the vocal sounds in the signal. The back end is a search engine that searches through three databases: 1) the acoustic model, 2) the lexicon, and 3) the language model. Windows Vista speech technology has built-in dictation capability, with edit controls for deleting and inserting; misrecognized words can be corrected by redictating, by spelling them out (e.g. for 'New': N as in Nest, E as in Element, etc.), or by choosing alternatives.
Microsoft Speech Server 2004 R2 is also used; whereas MSS 2004 only supported English for automated speech recognition and text-to-speech generation, R2 adds French and Spanish recognition and generation. There are tremendous forces driving the development of this technology: in many countries touch-tone penetration is low, and voice is the only option for controlling automated services.
Another application is home voice control, which uses the latest voice recognition to give you control of your home. Using a Sound Blaster compatible sound card, your computer can add voice recognition capabilities to many popular home automation controllers: you only have to select the command phrases you want to use and associate them with the actions you want to perform. An action may be an infrared command, a macro, or even a relay closure or an X10 command. X10 is a communication language that allows compatible products to talk to each other using the existing electrical wiring in the home; installation simply requires plugging a transmitter in at one location, which sends control signals such as on, off and dim to a receiver at another location in the home. There are many methods to build a language grammar; the simplest is a semantic grammar.
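The semantic command grammar mentioned above can be sketched as a simple phrase-to-action table. The phrases and X10 codes below are illustrative examples, not a real controller API.

```python
# Minimal command grammar for the home-automation example: each allowed
# phrase maps to an action (an X10 command or a macro). Phrases and
# X10 house/unit codes are invented for illustration.

grammar = {
    "lights on":  ("X10", "A1 ON"),
    "lights off": ("X10", "A1 OFF"),
    "dim lights": ("X10", "A1 DIM"),
    "play music": ("macro", "start_stereo"),
}

def handle(utterance):
    """Return the action for a recognized phrase, or None if out of grammar."""
    return grammar.get(utterance.lower().strip())

print(handle("Lights ON"))    # ('X10', 'A1 ON')
print(handle("open window"))  # None: phrase is not in the restricted grammar
```

Restricting the recognizer to such a small grammar is exactly the accuracy benefit discussed earlier: any utterance outside the table is simply rejected rather than mismatched.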
8. Applications
One of the main benefits of a speech recognition system is that it lets the user do other work simultaneously: the user can concentrate on observation and manual operations, and still control the machinery by voice commands. Another major application of speech processing is in military operations; voice control of weapons is an example. With reliable speech recognition equipment, pilots can give commands and information to their computers simply by speaking into their microphones, without having to use their hands for this purpose. Another good example is a radiologist scanning hundreds of X-rays, ultrasonograms and CT scans while simultaneously dictating conclusions to a speech recognition system connected to a word processor; the radiologist can focus his attention on the images rather than on writing the text.
Voice recognition could also be used on computers for making airline and hotel reservations: a user simply needs to state his needs to make a reservation, cancel a reservation, or make enquiries about schedules.
Figure 6: Voice Recognition
Figure 7: Voice Processing
Figure 8: Types of neural networks and applications.
a) A study of artificial neural networks for the speech recognition task, including the influence of network size on the effectiveness of detecting phonemes in words. The research covers methods of speech-signal parameterization and the use of linear prediction analysis, together with a temporal way of training the neural network to recognize phonemes. The proposed training method requires as input only the transcriptions of the words in the training set and does not require any manual segmentation of words.
b) Development and research of methods for diagnosing and detecting modulated signals.
c) Software implementation and pilot testing, on real signals, of neural network processing methods.
9. Advantages
Menial computer tasks
Assists paralysed people
Comfortable human-machine interaction
Saves time for user
Simple handling of software
More efficient use of labour resources - hiring robots for some jobs
Faster information processing and decision-making
Reduction of errors
Objectivity - no influence of personal connections on the decision-making process
Increased personalization of the information received, e.g. in education and advertising
Exploration capacity (e.g. exploration beyond the globe)
10. Disadvantages
Lack of control over decisions taken by the robots
Lack of knowledge about the foundation on which the decision was made
New forms of control and power
Threats to the significance of humankind
Risks related to system hacking
Machines translating inappropriate patterns into their own behaviour (e.g.
aggression), since robots learn from every interaction
Robots as effective tools for killing
11. Conclusion
Speech recognition helps physically challenged people as an assistive aid: they can do their work without pushing any buttons and without the help of other people. It is not very time-consuming, is user friendly, and performs tasks effectively. ASR technology is also used in military weapons and in research; officers dealing with criminals use this technology to catch and trap them.
12. References
[1] Childers, D.G. (2004) The MATLAB Speech Processing and Synthesis Toolbox. Photocopy
Edition, Tsinghua University Press, Beijing, 45-51.
[2] Chien, J.T. (2005) Predictive Hidden Markov Model Selection for Speech Recognition.
IEEE Transaction on Speech and Audio Processing, 13.
[3] Luger, G. and Stubblefield, W. (2004) Artificial Intelligence: Structures and Strategies for
Complex Problem Solving. 5th Edition, The Benjamin/Cummings Publishing Company, Inc.
http://www.cs.unm.edu/~luger/ai-final/tocfull.htm
[4] Choudhary, A. and Kshirsagar, R. (2012) Process Speech Recognition System Using
Artificial Intelligence Technique. International Journal of Soft Computing and Engineering
(IJSCE), 2.
[5] Ovchinnikov, P.E. (2005) Multilayer Perceptron Training without Word Segmentation for
Phoneme Recognition. Optical Memory & Neural Networks (Information Optics), 14, 245-
248.
[6] Guo, X.Y., Liang, X. and Li, X. (2007) A Stock Pattern Recognition Algorithm Based on
Neural Networks. Third International Conference on Natural Computation, 2.
[7] Dai, W.J. and Wang, P. (2007) Application of Pattern Recognition and Artificial Neural
Network to Load Forecasting in Electric Power System. Third International Conference on
Natural Computation, 1.
[8] Shahrin, A.N., Omar, N., Jumari, K.F. and Khalid, M. (2007) Face Detecting Using
Artificial Neural Networks Approach. First Asia International Conference on Modelling &
Simulation.
[9] Lin, H., Hou, W.S., Zhen, X.L. and Peng, C.L. (2006) Recognition of ECG Patterns Using
Artificial Neural Network. Sixth International Conference on Intelligent Systems Design and
Applications, 2.
[10] Al Smadi, T.A. (2013) Design and Implementation of Double Base Integer Encoder of
Term Metrical to Direct Binary. Journal of Signal and Information Processing, 4, 370.
[11] Al Smadi, T. An Improved Real-Time Speech Signal in Case of Isolated Word
Recognition. International Journal of Engineering Research and Applications, 3, 1748-1754.
[12] McCarthy, J. (1979) Ascribing mental qualities to machines. In: Philosophical
perspectives in artificial intelligence, ed. M. Ringle. Atlantic Highlands, N.J.: Humanities
Press.
[13] Haugeland, J. (Ed.) (1985) Artificial Intelligence: The Very Idea. Cambridge, MA: MIT
Press.
[14] Kurzweil, R. (1990) The Age of Intelligent Machines. Cambridge, MA: MIT Press.
[15] Charniak, E. and McDermott, D. (1985) Introduction to Artificial Intelligence. USA:
Addison-Wesley.
[16] Nilsson, N.J. (1998) Artificial Intelligence: A New Synthesis. Morgan Kaufmann.