AUTOMATED DIGITAL HAND GESTURE AND
SPEECH RECOGNITION BASED PRESENTATIONS
Vijaya Prakash R1[0000-0003-2177-5350], Maduri Ram Charan Teja2, Suhaas Sanga3,
Renukuntla Dhanush4, Kothapally Prem5, Gurrapu Aditya Krishna6
1,2,3,4,5,6 School of Computer Science & Artificial Intelligence, SR University, Warangal, India
1r.vijayaprakash@sru.edu.in, 2ramcharantejamaduri@gmail.com, 3suhaassanga1604@gmail.com,
4dhanushrenukuntla7@gmail.com, 5K.premgupta25@gmail.com, 6Adityakrishnagurrapu06@gmail.com
Abstract. This paper introduces an advanced system that seamlessly combines
computer vision and speech recognition technologies for hands-free slideshow
presentation management. By leveraging OpenCV, the system enables real-time
hand tracking and gesture recognition, allowing users to navigate through slides
with simple gestures, such as swiping left or right. Additionally, voice com-
mands facilitate tasks like jumping to specific slides, deleting annotations, and
other presentation controls, enhancing both accessibility and convenience. The
system’s real-time annotation feature lets users draw directly on slides using
hand gestures, adding a highly interactive element to presentations. The user-
friendly graphical interface, built with Tkinter, simplifies folder and file selec-
tion, ensuring ease of use. Moreover, the shutil library further streamlines file
management, providing smooth operations during the presentation. This inno-
vative approach offers a dynamic, hands-free, and interactive solution for man-
aging presentations, catering to professionals, educators, and individuals seek-
ing enhanced presentation capabilities in diverse settings.
Keywords: Recognition, Machine Learning, Gesture Detection, Gesture Classi-
fication
1. Introduction
This research paper describes a novel system combining computer vision, speech
recognition, and graphical user interface (GUI) technologies for interactive slideshow
presentations. During presentations, the system enhances the hands-free experience by
using OpenCV and the CVzone library to allow users to explore slides and interact with
annotations through hand gestures. The application makes use of OpenCV, a powerful
library for computer vision applications, together with the CVzone library, which
streamlines the development of gesture detection and hand tracking; CVzone's
HandDetector class is used to identify and follow hand landmarks. With the use of preset
finger-position patterns, certain gestures are translated into slide navigation instructions.
Swiping to the left or right, for example, moves to the previous or next slide,
respectively. These movements are identified by analyzing the relative locations of the
detected hand landmarks. Voice recognition is implemented with the speech_recognition
library, which makes use of the Google Web Speech API to convert speech to text
accurately. Through a microphone, the recognizer records audio input, interprets it to
determine commands, and then carries out the appropriate actions. By enabling users to
manipulate the slideshow vocally, commands like "next slide" and "go to slide 3"
improve usability and accessibility. In addition, a user-friendly interface for choosing
presentation folders is provided via the integration of Tkinter, a standard Python
framework for developing GUI applications. This integration simplifies organizing and
accessing presentation contents, allowing users to concentrate on delivering their
presentations.
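A minimal sketch of the gesture-navigation loop described above, assuming the cvzone HandTrackingModule API on top of OpenCV; the specific finger-up patterns and the slide-index handling shown here are illustrative placeholders rather than the system's exact mappings.

```python
import cv2
from cvzone.HandTrackingModule import HandDetector

detector = HandDetector(maxHands=1, detectionCon=0.8)   # CVzone wrapper around MediaPipe hands
cap = cv2.VideoCapture(0)                               # webcam feed
slide_index = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hands, frame = detector.findHands(frame)            # detect and draw hand landmarks
    if hands:
        fingers = detector.fingersUp(hands[0])           # [thumb, index, middle, ring, pinky]
        if fingers == [1, 0, 0, 0, 0]:                   # illustrative: thumb up -> previous slide
            slide_index = max(slide_index - 1, 0)
        elif fingers == [0, 0, 0, 0, 1]:                 # illustrative: little finger up -> next slide
            slide_index += 1
    cv2.imshow("Camera", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```

In the full system, the slide index selects which slide image to display, and additional finger patterns trigger the pointer, drawing, and erase actions described in the Results section.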
2. Literature Survey
In their study, Devivara Prasad et al. [6] explore the significance of gesture
recognition in Human-Computer Interaction (HCI), emphasizing its practical applica-
tions for individuals with hearing impairments and stroke patients. They used image
feature extraction tools and AI-based classifiers for 2D and 3D gesture recognition.
Their proposed system harnesses machine learning and real-time image processing
with MediaPipe and OpenCV to enable efficient and intuitive presentation control
using hand gestures, addressing the challenges of accuracy and robustness. The re-
search focuses on enhancing the user experience, particularly in scenarios where tra-
ditional input devices are impractical, highlighting the potential of gesture recognition
in HCI [13][15].
Reethika et al. [7] present a study on Human-Computer Interaction (HCI) with a
focus on hand gesture recognition as a natural interaction technique. It explores the
significance of real-time hand gesture recognition, particularly in scenarios where tra-
ditional input devices are impractical. The methodology involves vision-based tech-
niques that utilize cameras to capture and process hand motions, offering the potential
to replace conventional input methods. The paper discusses the advantages and chal-
lenges of this approach, such as the computational intensity of image processing and
privacy concerns regarding camera usage. Additionally, it highlights the benefits of
gesture recognition for applications ranging from controlling computer mouse actions
to creating a virtual HCI device [16].
Hajeera Khanum [8] outlines a methodology that harnesses OpenCV and Google's
MediaPipe framework [17][18] to create a presentation control system that interprets
hand gestures. Using a webcam, the system captures and translates hand movements
into actions such as slide control, drawing on slides, and erasing content, eliminating
the need for traditional input devices. While the paper does not explicitly enumerate
the challenges encountered during system development, common obstacles in this
field may include achieving precise gesture recognition, adapting to varying lighting
conditions, and ensuring the system's reliability in real-world usage scenarios. This
work contributes to the advancement of human-computer interaction, offering a mod-
ern and intuitive approach to controlling presentations through hand gestures [19].
Salonee et al. [9] introduce a system that utilizes artificial intelligence-based hand ges-
ture detection, employing OpenCV and MediaPipe. The system allows users to con-
trol presentation slides via intuitive hand gestures, eliminating the reliance on conven-
tional input devices like keyboards or mice. The gestures correspond to various
actions, including initiating presentations, pausing videos, transitioning between
slides, and adjusting volume. This innovative approach enhances the natural interac-
tion between presenters and computers during presentations, demonstrating its poten-
tial in educational and corporate settings. Notably, the paper does not explicitly detail
the challenges encountered during the system's development, but it makes a valuable
contribution to the realm of human-computer interaction by rendering digital presen-
tations more interactive and user-friendly. [20]
Bobo Zeng et al. [10] present a real-time interactive presentation system that uti-
lizes hand gestures for control. The system integrates a thermal camera for robust hu-
man body segmentation, overcoming issues with complex backgrounds and varying
illumination from projectors. They propose a fast and robust hand localization algo-
rithm and a dual-step calibration method for mapping interaction regions between the
thermal camera and projected content using a web camera. The system has high
recognition rates for hand gestures, enhancing the presentation experience. However,
the challenges they encountered during development, such as the need for precise cal-
ibration and handling hand localization, are not explicitly mentioned in the paper. [21]
Meera Paulson et al. [11] introduce a gesture recognition system for enhancing
presentations and enabling remote control of electronic devices through hand ges-
tures. It incorporates ATMEGA 328, Python, Arduino, Gesture Recognition, Zigbee,
and wireless transmission [22]. The paper emphasizes the significance of gesture
recognition in human-computer interaction, its applicability in various domains, and
its flexibility to cater to diverse user needs. The system offers features such as presen-
tation control, home automation, background change, and sign language interpreta-
tion. The authors demonstrated a cost-effective prototype with easy installation and
extensive wireless signal transmission capabilities. The paper discusses the results,
applications, methodology, and challenges, highlighting its potential to improve hu-
man-machine interaction across different fields.
Rina Damdoo et al. [12] present a vision-based adaptive hand gesture recognition
system employing Convolutional Neural Networks (CNN) for machine learning clas-
sification. The study addresses the challenges of recognizing dynamic hand gestures
in real-time and focuses on the impact of lighting conditions. The authors highlight
that the performance of the system significantly depends on lighting conditions, with
better results achieved under good lighting. They acknowledge that developing a ro-
bust system for real-time dynamic hand gesture recognition, particularly under vary-
ing lighting conditions, is a complex task. The paper offers insights into the potential
for further improvement and the use of filtering methods to mitigate the effects of
poor lighting, contributing to the field of dynamic hand gesture recognition.
Rutika Bhor et al. [13] present a real-time hand gesture recognition system for ef-
ficient human-computer interaction. It allows remote control of PowerPoint presenta-
tions through simple gestures, using Histograms of Oriented Gradients and K-Nearest
Neighbor classification with around 80% accuracy. The technology extends beyond
PowerPoint to potentially control various real-time applications. The paper addresses
challenges in creating a reliable gesture recognition system and optimizing
lighting conditions. It hints at broader applications, such as media control, without in-
termediary devices, making it relevant to the human-computer interaction field. Ref-
erences cover related topics like gesture recognition in diverse domains.
Outcome of Literature Review: The literature review highlights the significant ad-
vancements in gesture recognition and its application in human-computer interaction,
particularly for controlling presentations. Various methodologies, including AI-based
classifiers, Convolutional Neural Networks (CNN), and vision-based techniques,
have been explored to achieve accurate and intuitive gesture recognition. Despite the
progress, common challenges such as computational intensity, privacy concerns,
lighting conditions, and the need for precise calibration persist. This project aims to
address these challenges by integrating OpenCV, MediaPipe, and the Google Web
Speech API, offering a robust and user-friendly solution for interactive slideshow pre-
sentations. The insights from the literature underscore the potential of gesture recog-
nition to enhance user experience and accessibility in HCI applications.
3. Methodology
The project's primary objective is to make the presentation easy for the presenter to
deliver in a comfortable manner by controlling the complete presentation through hand
gestures.
Fig.1 Cyclic Process
The whole concept of this project is demonstrated in Fig. 1, which gives a complete
step-by-step process from uploading the files to the termination of the presentation.
3.1. Data Collection
In this project, the input data is provided by the user in the form of PPT slides in image
format: the user converts the PPT slides into images, and those images are stored in a
folder. The folder of images is the data for this project, as specified in Fig. 1.
3.2. Data Preprocessing
To rename and organize a set of PNG images, the initial step involves assigning se-
quential numbers to them in the desired order. This can be achieved through scripting
or batch operations using programming or command-line tools. Once renamed, the
images will have consecutive identifiers, making it easier to organize and retrieve
them in a logical order.
After successfully renaming the PNG images with sequence numbers, the next step
is to sort them based on these assigned numerical values. Sorting ensures that the
images are used in the correct order, following the numerical sequence. This process
is crucial when creating presentations (PPT) or when a specific order is required for
image usage, as it ensures that the images are in the desired sequence for easy access
and presentation purposes. Overall, these procedures simplify the task of organizing
and working with PNG images in a structured and orderly manner. After the folder of
files is uploaded, preprocessing immediately renames and sorts the images and stores
them back in the folder, as shown in Fig. 2.
Fig.2 Preprocessing of Data
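A minimal sketch of this renaming-and-sorting step, assuming the exported slides are PNG files in a single user-selected folder; the zero-padded naming scheme is illustrative, not the exact convention used in the system.

```python
import os

def prepare_slides(folder):
    """Rename exported slide images to zero-padded sequence numbers and return them in order."""
    images = sorted(f for f in os.listdir(folder) if f.lower().endswith(".png"))
    ordered = []
    for i, name in enumerate(images, start=1):
        new_name = f"{i:03d}.png"                                   # 001.png, 002.png, ...
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
        ordered.append(new_name)
    return ordered                                                   # already in presentation order
```

With zero-padded names, lexicographic order and numerical order coincide, so the slides can simply be loaded in sorted order during the presentation.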
Hand Detection: The method recognizes and localizes a hand's position within a
video frame. The hand detection is the key objective in this research, and we em-
ployed the Kanade-Lucas-Tomasi (KLT) algorithm to identify and locate all known
objects in a scene [14]. The algorithm starts by identifying feature points in the first
frame of a video or image sequence. These features could include corners, edges, or
any other distinguishing points in the image. The Harris corner detector [15] is com-
monly used for feature detection. It detects corners by analyzing intensity changes in
various directions. Once the features are identified in the first frame, the algorithm at-
tempts to track them in subsequent frames. It is assumed that the features move in
small steps between frames.
A small window is considered around each feature point in the initial frame, and the
algorithm searches the next frame for the best window match. Feature-point optical flow
is estimated using the Lucas-Kanade method [10], where the motion is assumed to be
constant in a local neighborhood around the feature point. The optical flow equation is
solved for each pixel of the window around the feature point; it relates the motion
parameters (w) to the spatial intensity gradients (Ix and Iy). The KLT algorithm then
analyzes the eigenvalues of the spatial gradient matrix specified in Equation 1 to
determine how reliably a feature point can be tracked. The matrix is determined by the
spatial gradients of intensity in the window around the feature point, and a feature point
is reliable for tracking if its matrix eigenvalues are above a threshold. Fig. 3 describes
the tracking of the hand with the help of these matrix eigenvalues.
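For reference, the spatial gradient matrix used by KLT over a tracking window W, together with the eigenvalue criterion referred to above, takes the standard form

G = \sum_{(x,y)\in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}, \qquad \min(\lambda_1, \lambda_2) > \tau

where I_x and I_y are the spatial intensity gradients and \tau is the tracking threshold; a feature point is retained only when both eigenvalues of G exceed \tau.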
Fig.3.Matrix Calculation
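A sketch of this detect-then-track loop using OpenCV's built-in corner detector and pyramidal Lucas-Kanade tracker; the corner count, window size, and other parameters here are illustrative, not the system's exact settings.

```python
import cv2

cap = cv2.VideoCapture(0)
ok, first = cap.read()
prev_gray = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY)
# Detect corner features in the first frame (min-eigenvalue / Harris criterion)
prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100, qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok or prev_pts is None or len(prev_pts) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: estimate where each feature point moved in the new frame
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None,
                                                   winSize=(15, 15), maxLevel=2)
    good = next_pts[status.flatten() == 1]
    for x, y in good.reshape(-1, 2):
        cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)   # draw tracked points
    cv2.imshow("KLT tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
    prev_gray, prev_pts = gray, good.reshape(-1, 1, 2)

cap.release()
cv2.destroyAllWindows()
```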
Finger Tracking: After detecting the hand, the algorithm records the location of indi-
vidual fingers. It may entail estimating hand landmarks to pinpoint crucial spots on
the fingers, particularly the fingertips.
Finger State Classification: The algorithm defines each finger's state as "up" (1) or
"down" (0) based on its location and movement. To establish these classifications, it
most likely evaluates the angles and placements of the fingers compared to a refer-
ence hand form.
Finger State Combination: The algorithm creates a combination of finger states for
the entire hand. For instance, if all fingers are labeled "up," it may indicate "5". If all
the fingers are marked "down," it may indicate "0."
Fig.4. Hand Track Mechanism
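A sketch of how such finger-state combinations could be mapped to presentation actions; the specific patterns and action names below are illustrative, not the exact mapping used in the system.

```python
# Each key is a (thumb, index, middle, ring, pinky) state tuple; 1 = up, 0 = down.
GESTURE_ACTIONS = {
    (1, 0, 0, 0, 0): "previous_slide",
    (0, 0, 0, 0, 1): "next_slide",
    (0, 1, 0, 0, 0): "pointer",
    (0, 1, 1, 0, 0): "draw",
    (0, 1, 1, 1, 0): "erase",
    (1, 1, 1, 1, 1): "exit_presentation",
}

def classify_gesture(fingers):
    """Return the action for a finger-state list such as [0, 1, 0, 0, 0], or None if unknown."""
    return GESTURE_ACTIONS.get(tuple(fingers))
```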
Speech Command Detection:
Speech recognition is the process of converting spoken words into text. In the context
of the project outlined above, speech recognition enables users to interact with the
presentation software using voice commands.
Initialization: You start by initializing a speech recognition module or library in your
programming environment. This typically involves importing the necessary libraries
or modules.
Audio Input: The speech recognition system captures audio input, usually from a
microphone connected to the device. This audio input is then processed to extract
relevant features that represent speech signals.
Feature Extraction: Signal processing techniques are applied to the audio input to
extract relevant features that characterize speech. These features might include fre-
quency content, time-domain characteristics, or other acoustic properties.
Model Training (Optional): In some cases, the speech recognition system might
use pre-trained models that have been trained on large datasets of speech samples.
Alternatively, you might need to train your own models using machine learning
techniques, depending on the complexity of the speech recognition task.
Pattern Matching: Once the features are extracted, the speech recognition system
compares them against patterns or templates stored in its database or model. These
patterns represent known speech sounds or words.
Recognition: Based on the comparison between the extracted features and the stored
patterns, the system identifies the closest match or matches. This process involves
determining which words or phrases best match the input audio signal.
Output: The recognized words or phrases are then output as text or some other form
of representation, depending on the requirements of the application. This output can
be further processed or used to trigger actions or responses within the system.
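These steps can be sketched with the speech_recognition library and the Google Web Speech API named above; the command strings handled here are illustrative examples rather than the system's full command set.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

def listen_for_command():
    """Capture one utterance from the microphone and return it as lowercase text."""
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)        # calibrate for background noise
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio).lower()  # Google Web Speech API
    except sr.UnknownValueError:                            # speech was unintelligible
        return ""
    except sr.RequestError:                                 # API unreachable
        return ""

command = listen_for_command()
if command == "next slide":
    print("-> advance one slide")
elif command.startswith("go to slide"):
    print("-> jump to slide", command.rsplit(" ", 1)[-1])
```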
The KLT algorithm was chosen for hand detection and tracking due to its efficiency
in identifying and tracking feature points in real-time. Its robustness in varying light-
ing conditions and its ability to handle complex movements make it suitable for this
application.
For speech recognition, the Google Web Speech API was selected because of its high
accuracy and reliability in converting spoken words into text. Its ability to handle
various accents and dialects further enhances its applicability in diverse user settings.
Software Tools Used: OpenCV, MediaPipe, Tkinter, Google Web Speech API, and
Python.
By integrating these tools and methodologies, the project aims to provide a ro-
bust, user-friendly system for interactive slideshow presentations, enhancing the
overall user experience.
4. Results
The hand tracking mechanism, finger state classification, and combination allow each
finger to be identified and assigned to a specific task. Figure 4 depicts this classifica-
tion for the purpose of presentation. The first gesture is used to move the slide to the
previous slide, the second gesture is used for the next slide, the third one is used for
the pointer to point the object on the slide, the fourth gesture is used to delete the ob-
ject drawn with the help of the fifth gesture, and the final gesture is used to exit the
presentation.
Fig. 5. Hand Gestures to control the Presentation.
Fig. 6. Speech Mode Enable Gesture
By using the gesture shown in Fig. 6, the presenter can switch to speech command
mode, in which operations on the presentation, such as moving to the previous slide,
moving to the next slide, opening a required slide, deleting the writing on a slide, and
terminating the presentation, are performed through speech.
There were several experiments that we carried out in order to assess the effectiveness
of the system. The first experiment was designed to determine how accurate the detection
and classification of hand gestures turned out to be. We discovered that the system was
able to accurately detect and categorize hand gestures in most situations; as shown in
Fig. 7, hand tracking and gesture recognition with the KLT algorithm achieved an
accuracy rate of approximately 95%. In a second experiment, we examined the system's
capability of controlling a presentation with hand gestures and found that it controlled
the slides smoothly and carried out a variety of actions, such as moving forward or
going back to the previous slide.
Fig.7. Accuracy Graph of Hand Detection.
Fig.8. Accuracy Graph of Speech Recognition.
In the current model, we simply define the gesture array using the built-in Hand
Tracking Module and speech recognition, saving the time otherwise spent on training
and collecting hand gestures. Converting the PowerPoint slides to images and uploading them takes very
little time. The accuracy of the built-in model ranges from 95 to 97%. The previous
model required more time for hand tracking and speech recognition because there was
no built-in model for detecting hand gestures, and the accuracy was less than 95%.
For this project, an HD camera and an ANC microphone are mandatory; the range of
normal built-in cameras in existing laptops is 5 meters. To obtain long-range gesture
recognition, external long-range cameras are needed. Once the termination gesture is
used, the files are deleted; if the user wants to use the files again, they must upload
them again.
Comparison with Previous Models
The current model, which utilizes built-in modules for hand tracking and speech
recognition, shows improved performance over previous models. The built-in model
achieves higher accuracy (95-97%) compared to the earlier model, which had an ac-
curacy rate of less than 95%. Additionally, the current model reduces the time needed
for hand tracking and speech recognition due to the use of pre-trained models.
5. Conclusion
This research presents an improvement in the delivery of presentations: a sophisticated
interactive presentation control system that skilfully combines speech recognition,
gesture recognition, and computer vision technologies. By
providing control using natural hand movements and vocal instructions, the system
addresses the shortcomings of conventional presentation tools and provides an ad-
vanced, hands-free interface for organizing presentations. With the Kanade-Lucas-
Tomasi (KLT) algorithm obtaining around 95% accuracy in hand gesture identifica-
tion and voice recognition models reaching an accuracy range of 95-97%, the major
findings show the system's excellent technological integration. The system's robust-
ness and practical application were demonstrated by empirical assessments, which
verified that it can reliably identify and classify hand gestures and recognize verbal
instructions under a range of settings.
However, problems like preserving accuracy and dependability across a variety of set-
tings—affected by variables like illumination, background noise, and speech accent
variations—call for constant improvement of identification algorithms and strong er-
ror-handling systems. According to the research, incorporating cutting-edge technol-
ogy into interactive presenting tools has the potential to significantly improve presen-
ters' flexibility and control. Future studies should concentrate on improving the accu-
racy of gesture and speech detection, maximizing system responsiveness, and improv-
ing user interface design. Realizing the system's full potential and getting over current
obstacles need ongoing improvement. In summary, this study makes a significant
contribution to interactive presentation systems by demonstrating the effective inte-
gration of gesture and speech recognition technologies, with the potential for a more
seamless and intuitive experience, ultimately empowering presenters to deliver dy-
namic and engaging presentations.
References
1. D.O. Lawrence, and M.J. Ashleigh, Impact Of Human-Computer Interaction (HCI) on
Users in Higher Educational System: Southampton University As A Case Study, Vol.6, No
3, pp. 1-12, September (2019)
2. Sebastian Raschka, Joshua Patterson, and Corey Nolet, Machine Learning in Python: Main
Developments and Technology Trends in Data Science, Machine Learning, and Artificial
Intelligence, (2020)
3. Xuesong Zhai, Xiaoyan Chu, Ching Sing Chai, Morris Siu Yung Jong, Andreja Istenic,
Michael Spector, Jia-Bao Liu, Jing Yuan, Yan Li, A Review of Artificial Intelligence (AI)
in Education from 2010 to 2020, (2021)
4. D. Jadhav, Prof. L.M.R.J. Lobo, Hand Gesture Recognition System to Control Slide Show
Navigation IJAIEM, Vol. 3, No. 4 (2014)
5. Ren, Zhou, et al. Robust part-based hand gesture recognition using Kinect sensor, IEEE
Transactions on multimedia 15.5, pp.1110-1120, (2013), Page 8-11.
6. Devivara Prasad G, Mr. Srinivasulu M. "Hand Gesture Presentation by Using Machine
Learning." September 2022, IJIRT, Volume 9, Issue 4.
7. G. Reethika, P.Anuhya, M. Bhargavi. "Slide Presentation by Hand Gesture Recognition
Using Machine Learning", IRJET, Volume 10, Issue: 01, Jan 2023
8. Hajeera Khanum, Dr. Pramod H B. "Smart Presentation Control by Hand Gestures Using
Computer Vision and Google’s Mediapipe", IRJET, Volume: 09 Issue: 07, July 2022.
9. Salonee Powar, Shweta Kadam, Sonali Malage, Priyanka Shingane. "Automated Digital
Presentation Control using Hand Gesture Technique.", ITM Web of Conferences 44,
03031 (2022).
10. G. L. P, A. P, G. Vinayan, G. G, P. M and A. S. H, "Lucas Kanade based Optical Flow for
Vehicle Motion Tracking and Velocity Estimation," 2023 International Conference on
Control, Communication and Computing (ICCC), Thiruvananthapuram, India, 2023, pp. 1-
6, doi: 10.1109/ICCC57789.2023.10165227.
11. Meera Paulson, Nathasha P R, Silpa Davis, Soumya Varma, "Smart Presentation Using
Gesture Recognition", 2017, Volume 2, Issue 3.
12. Rina Damdoo, Kanak Kalyani, Jignyasa Sanghavi, "Adaptive Hand Gesture Recognition
System Using Machine Learning Approach", Biosc Biotech Res Comm. Special Issue Vol
13 No 14 (2020) pp.106-110.
13. Bhor Rutika, Chaskar Shweta, Date Shraddha, Auti M. A, “PowerPoint Presentation Con-
trol Using Hand Gestures Recognition”, International Journal of Research Publication and
Reviews, Vol 4, no 5, pp 5865-5869, May 2023
14. D. Mikhaylov, A. Samoylov, P. Minin and A. Egorov, "Face Detection and Tracking from
Image and Statistics Gathering," 2014 Tenth International Conference on Signal-Image
Technology and Internet-Based Systems, Marrakech, Morocco, 2014, pp. 37-42, doi:
10.1109/SITIS.2014.85.
15. Zhao, J.; Su, L.; Wang, X.; Li, J.; Yang, F.; Jiang, N.; Hu, Q. DTFS-eHarris: A High Ac-
curacy Asynchronous Corner Detector for Event Cameras in Complex Scenes. Appl. Sci.
2023, 13, 5761. https://doi.org/10.3390/app13095761
16. M. F. Wahid, R. Tafreshi, M. Al-Sowaidi and R. Langari, "An efficient approach to recog-
nize hand gestures using machine-learning algorithms," 2018 IEEE 4th Middle East Con-
ference on Biomedical Engineering (MECBME), Tunis, Tunisia, 2018, pp. 171-176, doi:
10.1109/MECBME.2018.8402428.
17. Ajay Talele, Aseem Patil, Bhushan Barse, "Detection of Real-Time Objects Using Tensor-
Flow and OpenCV”, Asian Journal of Convergence in Technology, Vol 5, (2019)
18. Ahmed Kadem Hamed AlSaedi, Abbas H. Hassin Al Asadi, “A New Hand Gestures
Recognition System”, Indonesian Journal of Electrical Engineering and Computer Science,
Vol 18, (2020)
19. Sebastian Raschka, Joshua Patterson, and Corey Nolet, “Machine Learning in Python:
Main Developments and Technology”, Trends in Data Science, Machine Learning, and
Artificial Intelligence, (2020)
20. I. Dhall, S. Vashisth, G. Aggarwal, Automated Hand Gesture Recognition using a Deep
Convolutional Neural Network, 10th International Conference on Cloud Computing, Data
Science & Engineering, (2020)
21. Meera Paulson, Natasha, Shilpa Davis on the Smart presentation using gesture recognition
and OpenCV, Asian Journal of Convergence in Technology, Vol 5, (2019)