
This article has been accepted for publication in IEEE Access but has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2021.3135024, IEEE Access.

A Google Glass Based Real-Time Scene Analysis for the Visually Impaired

HAFEEZ ALI A1, SANJEEV U RAO2, SWAROOP RANGANATH3, ASHWIN T S4 (MEMBER, IEEE), AND RAM MOHANA REDDY GUDDETI5 (SENIOR MEMBER, IEEE)
1-5 National Institute of Technology Karnataka, Surathkal, Mangalore 575025
E-mail: hafeez.ali5@gmail.com1, sanjeevurao@gmail.com2, swaroopr97@gmail.com3, ashwindixit9@gmail.com4, profgrmreddy@nitk.edu.in5
Corresponding author: Ashwin T S (e-mail: ashwindixit9@gmail.com).

ABSTRACT Blind and Visually Impaired People (BVIP) are likely to experience difficulties with tasks that
involve scene recognition. Wearable technology has played a significant role in researching and evaluating
systems developed for and with the BVIP community. This paper presents a system based on Google Glass
designed to assist BVIP with scene recognition tasks, thereby using it as a visual assistant. The camera
embedded in the smart glasses is used to capture the image of the surroundings, which is analyzed using
the Custom Vision Application Programming Interface (Vision API) from Azure Cognitive Services by
Microsoft. The output of the Vision API is converted to speech, which is heard by the BVIP user wearing
the Google Glass. A dataset of 5000 newly annotated images is created to improve the performance of the
scene description task in Indian scenarios. The Vision API is trained and tested on this dataset, increasing
the mean Average Precision (mAP) score from 63% to 84%, with an IoU > 0.5. The overall response time of
the proposed application was measured to be less than 1 second, thereby providing accurate results in real-
time. A Likert scale analysis was performed with the help of BVIP teachers and students at the "Roman Catherine Lobo School for the Visually Impaired" in Mangalore, Karnataka, India. Their responses indicate that the application helps the BVIP better recognize their surrounding environment in
real-time, demonstrating the device's potential as an effective assistant for the BVIP.

INDEX TERMS Google Glass, Human Computer Interaction, Azure Cognitive Services, Microsoft Vision
API, Ubiquitous computing, Visual assistant

I. INTRODUCTION

ACCORDING to the World Health Organization, it is estimated that there are at least 2.2 billion people globally who have vision impairment or blindness1. Out of these, around 45 million are blind and in need of vocational and social support. This population faces many difficulties in perceiving and understanding their surroundings since more than 80% of the information entering the brain is visual [1]. Studies have shown that vision accounts for two-thirds of the activity in the brain when a person's eyes are open [2]. The loss of sight represents a public health, social and economic issue in developing countries, where 9 out of 10 of the world's blind live. It is estimated that more than 60% of the world's blind reside in India, sub-Saharan Africa, and China. In terms of regional differences, the prevalence of vision impairment in low and middle-income regions is estimated to be four times higher than in high-income regions [3]. The loss of sight causes much suffering to the affected individuals and their families. Despite many efforts, population growth and aging are expected to increase the risk of more people acquiring vision impairment.

A visually impaired person deals with orientation and navigation issues daily. These issues can be alleviated with the help of particular types of equipment that can provide additional support to the individuals. With the improvements in computer vision and human-computer interaction techniques, it is possible to assist Blind and Visually Impaired People (BVIP) with scene recognition tasks. With the motivation of helping the BVIP community, this paper presents an application implemented on Google Glass2 that acts as a visual assistant to the BVIP.

1 https://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment
2 https://www.google.com/glass/start/


Many efforts by several researchers have been made to design systems that aid the BVIP. Bradley et al. [4] experimented with whether a group of sighted individuals and visually impaired individuals experience a difference in physical and mental demands when given directions to specific landmarks. Battaglia et al. [5] developed an integrated, modular, and expandable open-source package called Blind Assistant to show that it is possible to produce effective and affordable aids for the BVIP. Meza-de-Luna et al. [6] designed a social-aware assistant using a pair of smart glasses and a haptic belt to enhance the face-to-face conversations of the BVIP by providing them with vibrational cues from the belt. Chang et al. [7] [8] [9] proposed a wearable smart-glasses-based drug pill recognition system using deep learning for the BVIP to improve their medication safety. The system consists of a pair of wearable smart glasses, an artificial intelligence (AI)-based drug pill recognition box, and a mobile phone app. The smart glasses are used to capture images of the drugs to be consumed, and the AI-based drug recognition box is used to identify the drugs in the image. The mobile app is used to track drug consumption and also to provide timely reminders to the user. Zientara et al. [10] proposed a shopping assistant system for the BVIP called the 'Third Eye' that aids in navigation and identification of various products inside a shop. Similarly, Pintado et al. [11] designed a wearable object detection device in eyewear that helps to recognize items from the produce section of a grocery store.

In addition to shopping assistants, researchers have also developed Electronic Travel Aids (ETA) and obstacle detection systems to assist navigation. Quinones et al. [12] performed a needs-finding study on the navigation of familiar and unfamiliar routes taken daily by the BVIP. They concluded that a device that can act as an assistant is needed for better navigation. El-taher et al. [13] conducted a comprehensive review of research directly in, or relevant to, outdoor assistive navigation for the BVIP. They also provided an overview of commercial and non-commercial navigation applications targeted at assisting the BVIP. Lee et al. [14] implemented a guidance system that uses map-matching algorithms and ultrasonic sensors to guide users to their chosen destination. Tapu et al. [15] implemented an autonomous navigation system for the BVIP based on computer vision algorithms. Similarly, Vyavahare et al. [16] used a combination of ultrasonic sensors and computer vision techniques to build a wearable assistant that can perform obstacle detection and image description. Laubhan et al. [17] and Trent et al. [18] designed a wearable Electronic Travel Aid for the blind, which uses an array of ultrasonic sensors to survey the scene. Bai et al. [19] proposed a depth image and multi-sensor-based algorithm to solve the problem of transparent and small obstacle avoidance. Their system uses three primary audible cues to guide completely blind users to move safely and efficiently. Nguyen et al. [20] developed a way-finding system on a mobile robot that helps the BVIP user in an indoor setting. Avila et al. [21] developed a smartphone application that helps in localization within an indoor setting. In this system, 20 Bluetooth beacons were placed inside an indoor environment. When a BVIP user holding the smartphone moves through the building, the user receives auditory information about the nearest point of interest. A very similar system was developed by Bie et al. [22] for an outdoor setting. Finally, Guerreiro et al. [23] developed a smartphone-based virtual-navigation application that helps the BVIP gain route knowledge and familiarize themselves with their surroundings before visiting a particular location. Lupu et al. [24] presented an experimental framework to assess the brain cortex activation and affective reactions of the BVIP to stimuli provided by a sensory substitution device used for navigation in real-world scenarios. The test covered five different types of experimental scenarios and focused on evaluating working memory load, visual cortex activation, and emotional experience when visually impaired people perceive audio, haptic, and multimodal stimuli. Chang et al. [25] proposed a wearable assistive system comprising a pair of smart glasses, a waist-mounted intelligent device, and an intelligent cane to help BVIP users safely use zebra crossings, employing artificial intelligence (AI) based edge computing techniques.

Other researchers have focused on the design of assistive systems that help in scene description and analysis. Ye et al. [26] analyzed how different devices can help the BVIP in their daily lives and concluded that smartphones play a significant role in their daily activities. Pēgeot et al. [27] proposed a scene text tracking system for finding and tracking text regions in video frames captured by a wearable camera. Gonzāles-Delgado et al. [28] proposed a smart gloves system that helps in meeting some of the daily needs of the BVIP, such as face recognition, automatic mail reading, and automatic detection of objects, among other functions. Memo et al. [29] developed a head-mounted gesture recognition system that uses a depth camera and an SVM classifier to identify the different gestures during a human conversation. Barney et al. [30] developed a sensory glass system that detects obstacles and informs the user through 3D sound. The glasses were fitted with five ultrasonic sensors placed on the left, upper-left, front, right, and upper-right parts. Shishir et al. [31] designed an Android app that can capture images and analyze them for image and text recognition. B. Jiang et al. [32] designed a wearable assistance system based on binocular sensors for the BVIP. The binocular vision sensors were used to capture images at a fixed frequency, and the informative images were chosen based on stereo image quality assessment (SIQA) and then sent to the cloud for further computations. Bogdan et al. [33] proposed a system composed of a pair of smart glasses with an integrated microphone and camera, a smartphone connected with the smart glasses through a host application, and a server that serves as a computational unit. Their system was capable of detecting obstacles in the nearest surroundings, estimating the size of an object, face recognition, automatic text recognition, and question answering for a particular input image.

Pei et al. [34] proposed a visual image aid for vocalizing the information of objects near the user.

Some researchers have designed their own smart glasses to develop applications that assist visually impaired people. Chang et al. [35] and Chen et al. [36] proposed an assistive system comprising wearable smart glasses, an intelligent walking stick, a mobile device app, and a cloud-based information management platform, used to achieve aerial obstacle avoidance and fall detection for the BVIP. The intelligent walking stick provides feedback with the help of vibrations to warn the user of obstacles. Furthermore, when the user experiences a fall event, an urgent notification is immediately sent to family members or caregivers. In the realm of wearable intelligent glasses, Chang et al. [37] and Chen et al. [38] have also proposed a drowsiness-fatigue-detection system to increase road safety. The system consists of wearable smart glasses, an in-vehicle infotainment telematics platform, an onboard diagnostics-II-based automotive diagnostic bridge, a rear light alert mechanism in an active vehicle, and a cloud-based management platform. The system is used to detect drowsiness and fatigue in a driver in real-time. When detected, the active vehicle's rear light alert mechanism is automatically flickered to alert following vehicles, and warning messages are played to alert the driver.

Although many systems have been proposed and developed to assist the visually impaired, their practical usability is very limited due to the application's wearability and portability. In this era of high-end consumer electronics, where multiple sensors are embedded in light, highly portable smart glasses such as the Google Glass, it is possible to design an application that addresses the usability concerns faced by previous applications while also providing real-time responses to complex problems such as scene recognition and object detection. Therefore, in this paper, a Google Glass based real-time visual assistant is proposed for the BVIP.

The rest of the paper is organized as follows. Section II describes related work done by other researchers on Google Glass to solve real-world social problems. In Section III, the proposed application is presented, along with an explanation of the different design choices; here, the merits of the proposed application are explained in detail, and the various steps involved in using the application are also provided. In Section IV, the results of the proposed work and the feedback obtained from the BVIP users are presented. Finally, the conclusion is given in Section V.

II. RELATED WORK

Google Glass is a brand of smart glasses with a prism projector for display, a bone conduction transducer, a microphone, accelerometer, gyroscope, magnetometer, ambient light sensor, proximity sensor, a touchpad, and a camera. It can connect to other devices using a Bluetooth connection, a micro USB, or a Wi-Fi connection. Application development for the device can be done using the Android development platform and toolkit available for mobile devices running Android OS.

Since its release, researchers have used the device to design systems that solve many real-life problems. Jiang et al. [39] proposed a Google Glass application that is used for food nutrition information retrieval and visualization. On similar grounds, Li et al. [40] developed a Google Glass application that can be used to assess the uniqueness and aesthetics of a food dish by analyzing its image for visual appeal, color combinations, and appearance. A few researchers have used the device in the medical field to treat children with Autism Spectrum Disorder (ASD). For instance, Washington et al. [41] [42] developed a Google Glass-based system for automatic facial expression recognition, delivering real-time social cues to children with ASD, thus improving their social behavior.

Lv et al. [43] developed a touch-less interactive augmented reality game using Google Glass. Wang et al. [44] presented a navigation strategy for NAO humanoid robots via hand gestures based on global and local live videos displayed on Google Glass. Similarly, Wen et al. [45] developed a Google Glass-based system to achieve hands-free remote control of humanoid robots. Xu et al. [46] used the device to facilitate intelligent substation inspection by using virtual video and real-time data demonstration. Widmer et al. [47] developed a medical information search system on Google Glass by connecting it to a content-based medical image retrieval system. The device takes a photo and sends it, along with keywords associated with the image, to a medical image retrieval system to retrieve similar cases, thus helping the user make an informed decision.

Devices such as Microsoft Kinect and Google Glass have also been used to help visually impaired people. For instance, Lausegger et al. [48] developed a Google Glass application to help people with color vision deficiency or color blindness. Anam et al. [49] developed a dyadic conversation aid using Google Glass for the visually impaired. Hwang et al. [50] implemented an augmented vision system on Glass, which overlays edge information over the wearer's real-world view to provide contrast-improved central vision to the user. They used a combination of positive and negative Laplacian filters for edge enhancement. Neto et al. [51] proposed a wearable face recognition system to aid the visually impaired in real-time. Their system uses a Kinect sensor to acquire an RGB-D image and run an efficient face recognition algorithm. Similarly, Takizawa et al. [52] proposed the Kinect cane, an assistive system for the visually impaired based on the concept of object recognition.

Kim et al. [53] performed a systematic review of the applications of smart glasses in various applied sciences, such as healthcare, social science, education, service, industry, and computer science. Their study shows a remarkable increase in the number of published papers on the application of smart glasses since the release of Google Glass. Further, they claimed that the research has been steadily increasing as of 2021. With this, it can be concluded that Google Glass has been extensively used for designing applications to solve

problems in various fields. Inspired by its potential, this paper


presents a Google Glass-based application to solve some of
the problems faced by the BVIP community by developing
a scene descriptor using the Custom Vision API provided
by Azure Cognitive Services3. The merits and features of
Google Glass that led to its use in the proposed application
and the system design are further explained in the next
section.

III. SYSTEM DESIGN


Google Glass is relatively lightweight at 36g, which is quite comfortable to wear and use for extended periods. It comes with a head strap to firmly secure the device while it is in use. The device has a prism projector for a display to allow the user to view a visual output. It is a single LCoS (Liquid Crystal on Silicon) display with a resolution of 640x360. In addition, a camera is mounted on top of the right frame of the Glass. This 5MP photo, 720p video camera allows the user to capture images and store them in its local storage. A second key hardware feature is the integrated bone conduction speaker, which transmits sound directly into the user's ear canal without interference from outside noise. It is beneficial for a device meant to be used both indoors and outdoors. It also has a microphone to capture audio and voice input from the user, which is one of the main user-device interaction mechanisms. A secondary way to interact is the touchpad present on the side of the device.

The Glass comes with a lightweight dual-core Cortex A9 (2 x 1 GHz) processor by Texas Instruments, a built-in PowerVR SGX540 GPU, 2GB RAM, and an internal storage capacity of 16GB. The device can perform moderate computations using these processing and storage capabilities. It also comes with a 570mAh battery. In terms of connectivity, the Glass has a micro USB port that can connect to a suitable development environment for building and deploying applications on the device. In addition, it can connect to Bluetooth and is Wi-Fi 802.11g compatible. The device also has an accelerometer, gyroscope, magnetometer, ambient light sensor, and proximity sensor. A sample image of Google Glass is shown in Fig. 1.

FIGURE 1. Google Glass

In their review of the applications of smart glasses in applied sciences, Kim et al. [53] found that the most popular commercial smart glass is Google Glass, followed by Microsoft's HoloLens. Their review shows that the Android-based Google Glass is used in various domains of applied sciences, accounting for more than half of all the applications reviewed as part of their research. Furthermore, it is highlighted that since the device runs an Android OS, it is effortless for developers to design and build applications on it. Moreover, Google Glass weighs only 36g, which is much lighter than other smart glasses in the market. For instance, Microsoft HoloLens 2 weighs 566g, Epson Moverio BT-350 weighs 151g, and Vuzix Blade M100 weighs 372g [53]. Similarly, in their paper on implementing an edge enhancement application on Google Glass, Hwang et al. [50] concluded that the device provides a valuable platform for implementing various applications that can aid patients with various low vision conditions. It is explained that since the device is reasonably priced, cosmetically appealing, highly flexible, and designed in a socially desirable format, it has vast potential for further innovation. El-taher et al. [13], in their systematic review of navigation systems for the visually impaired, highlighted the importance of portability, wearability, latency, feedback interface, and user-friendliness of the application. Google Glass excels in all these critical design considerations. The device is lightweight, weighing only 36g, making it highly portable. Its cosmetic appeal, flexibility, and socially desirable format make it a highly wearable device. The feedback interface can be defined as the means used by the application to convey information to the BVIP. Google Glass provides an excellent feedback interface due to the presence of a bone conduction speaker that renders audio signals to the user without obstructing any external sound, making it a safe choice for use in both indoor and outdoor environments. The device also has a microphone, and any application developed on the device can be controlled entirely using audio-based commands, giving the user excellent flexibility and comfort. The audio-based interface also helps keep the user experience as unrestricted as possible while using the device. Hence, due to its superior usability and the features mentioned above, Google Glass was used in designing the visual assistant.

Finally, one of the critical aspects of a visual assistant device is low latency and the ability to run in real-time. In order to achieve this, the Custom Vision API from Azure Cognitive Services was used to run state-of-the-art deep learning models for scene description and object detection. It provides superior response time with excellent precision and accuracy. In order to further improve its precision on Indian scenarios, a newly annotated image dataset consisting of 5000 images was created, and the Vision API was trained on this dataset.

3 https://azure.microsoft.com/en-in/services/cognitive-services/

Finally, the Vision API's precision and accuracy were compared against other state-of-the-art models run on a cloud-based intelligent server. Based on the performance of the Vision API and the superior usability of Google Glass, the proposed application was designed using them.

A. PROPOSAL

Some of the significant issues that restrict the usability of most wearable assistance systems were identified during the literature survey. Firstly, the size and weight of the sensors used in the system directly impact the long-term wearability, portability, and hence the usability of the system without causing health hazards to the user. El-taher et al. [13], in their review of urban navigation systems for the visually impaired, emphasized the importance of portability (weight) and wearability of the device used for assisting the visually impaired person. Secondly, one of the most critical factors that must be considered while designing a system for the disabled is an intuitive human-computer interaction interface. The system must be designed such that it is easy to use with minimal user training. Finally, the response time from the source of computation must be close to real-time. Achieving real-time performance on a smart glass is very challenging since the algorithm's complexity directly impacts the device's response time unless the algorithm runs on a powerful machine, which is heavy and bulky and hence not portable. On the other hand, reducing the complexity of the algorithm leads to less accurate results. Therefore, it is essential to consider using cloud computing platforms with a fast response time for such systems. The following design choices are used to address the problems mentioned above, thereby improving the usability of the proposed visual assistant system.

1) Google Glass is selected as the core of the visual assistant system. The camera present in the device captures images of the surroundings, which are sent to a mobile app for further processing. Most of the previous applications that serve the purpose of visual assistants have bulky sensors and cameras attached, which are difficult to wear and are not portable. Given the superior portability, wearability, and flexibility of the device, the use of Google Glass significantly improves the usability of such systems.

2) The application is designed to have a very intuitive interaction interface. Users can interact with it using a voice command that triggers the camera to capture an image and send it to the mobile app and the Vision API for further processing. The result from the Vision API is sent back to the Google Glass device, where it is converted to sound using the bone conduction transducer. The completely hands-free, voice-activated approach leads to superior user-system interaction and helps keep the user as unrestricted as possible while using the application.

3) The Custom Vision API provided by Azure Cognitive Services is used for performing the necessary computation on the image captured by the device. With the help of a cloud-based API, complex vision algorithms can be run on the image with almost real-time responses since the algorithm runs on powerful machines in the cloud. The use of cloud-based APIs prevents the need to carry a bulky computer for processing, thereby boosting the system's portability. The API can categorize the image into 86 categories and can be trained on custom datasets. It can further assign tags to the image and generate captions describing the contents in human-readable sentences.

A comparison of the proposed approach with existing assistive systems for the BVIP is shown in Table 1. The usability and functionality provided by the various applications are also shown. There are no applications that use Google Glass for scene description tasks in real-time on Indian scenarios. Further, the proposed application provides better portability and wearability in scene description tasks while providing a real-time response and a completely hands-free interaction interface. The key contributions of the proposed work are:

• The development of an augmented reality application for real-time scene description using Google Glass as an edge device and the Azure Vision API for the BVIP.
• The creation of an annotated image dataset consisting of objects used by the BVIP in Indian scenarios and environments. The annotations correspond to the 86 class labels supported by the Vision API.
• Optimizing the performance of the Vision API by using the newly created annotated image dataset and the custom vision4 option provided by the Vision API.

Fig. 2 gives an overview of the proposed approach and the various components involved in it. The BVIP user wearing Google Glass captures an image of his/her surroundings using the camera present on the device with the help of the voice command "OK, Glass; Describe Scene." The captured image is compressed and sent via a Wi-Fi connection to the smartphone device of the user. Upon receiving the image, the smartphone app decompresses the image and invokes the Vision API to generate captions and identify the various objects in the image. The smartphone app then processes the API's response to extract the captions and the objects identified. This text response is sent back to Google Glass via the same Wi-Fi connection. Finally, Android's text-to-speech API is used to convert the text response into sound using the bone conduction transducer present in the device. In the following subsections, the proposed application development methodology and the user-system interaction design are described in detail.
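The capture-compress-send-receive loop described above can be illustrated with a short sketch. The actual client is Android (Java) Glassware running on the Glass itself; the Python below is only a stand-in to show the protocol, and the server address, port, JPEG quality, and the 4-byte length-prefixed framing are assumptions rather than details taken from the paper.

```python
# Illustrative sketch only: the real Glass app is Android/Java Glassware.
# Host, port, JPEG quality, and the length-prefixed framing are assumptions.
import io
import socket
import struct

from PIL import Image  # pip install pillow

SERVER_HOST = "192.168.1.5"   # hypothetical smartphone address on the same Wi-Fi
SERVER_PORT = 5000            # hypothetical port the socket server listens on


def compress_image(path: str, quality: int = 60) -> bytes:
    """Re-encode the captured photo as a smaller JPEG to cut transfer time."""
    buf = io.BytesIO()
    Image.open(path).convert("RGB").save(buf, format="JPEG", quality=quality)
    return buf.getvalue()


def describe_scene(image_path: str) -> str:
    """Send the compressed image to the smartphone and return its text reply."""
    payload = compress_image(image_path)
    with socket.create_connection((SERVER_HOST, SERVER_PORT)) as sock:
        # 4-byte big-endian length header followed by the JPEG bytes.
        sock.sendall(struct.pack(">I", len(payload)) + payload)
        # The server replies with the caption / object list as UTF-8 text.
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8")


if __name__ == "__main__":
    print(describe_scene("captured_scene.jpg"))
```

On the Glass itself, the text returned by this exchange is what the Android text-to-speech API renders through the bone conduction transducer.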
B. PROPOSED APPLICATION DEVELOPMENT METHODOLOGY

According to the official documentation by Google5, the three major design patterns for developing software on Google Glass, also called Glassware, are Ongoing Tasks, Periodic Notifications, and Immersions.

4 https://azure.microsoft.com/en-us/services/cognitive-services/custom-vision-service/overview
5 https://developers.google.com/glass/design/patterns


TABLE 1. Comparison of the proposed application with existing assistive applications for the BVIP

Mauro et al. 2015 [21]. Source of compute: smartphone. Sensors: Bluetooth beacons. Functionality: auditory information about the nearest point of interest is communicated when the user is close to a Bluetooth beacon placed at different points of interest; helps with navigation and spatial awareness in indoor settings. Usability: highly portable and wearable since the system comprises only a smartphone; auditory information is communicated using earplugs worn by the user.

Barney et al. 2017 [30]. Source of compute: Arduino. Sensors: ultrasound sensors. Functionality: 3D sound is generated to give the user a sense of the distance of the objects around him/her; useful for navigation in indoor settings. Usability: moderately portable and wearable, as the system comprises an ultrasound sensor on a smart glass, an Arduino for computing the distance of the surrounding objects, and a smartphone; earphones are used to render the generated sound.

Jiang et al. 2019 [32]. Source of compute: cloud. Sensors: two sets of CCD cameras and a semiconductor laser. Functionality: object detection using convolutional neural networks running on a cloud-based platform. Usability: moderate level of portability and wearability, as the system requires calibration for effective binocular image acquisition; moving the setup around might require re-calibration.

Bai et al. 2017 [19]. Source of compute: CPU and Microprogrammed Control Unit (MCU). Sensors: eyeglasses, depth camera, ultrasonic rangefinder, and AR glasses. Functionality: obstacle avoidance in an indoor environment with the help of depth and ultrasonic sensors. Usability: low portability and wearability since the user must carry the CPU and MCU everywhere; the user is provided with auditory cues to avoid obstacles.

Neto et al. 2016 [51]. Source of compute: laptop computer. Sensors: Microsoft Kinect, gyroscope, compass sensor, IR depth sensor, stereo headphones. Functionality: face detection and recognition using an efficient face recognition algorithm based on HOG, PCA, and K-NN; 3D audio is generated on face recognition as the user response; Microsoft Kinect is used to capture the RGB-D image of the person. Usability: low portability and wearability due to a laptop computer and a Microsoft Kinect, both of which are heavy and bulky; good response interface with the help of 3D sound in the direction of the person identified in the image.

Pintado et al. 2019 [11]. Source of compute: Raspberry Pi. Sensors: Raspberry Pi Camera Module V2. Functionality: shopping assistant with object recognition and price extraction using convolutional neural networks (CNN) running on a Raspberry Pi. Usability: moderate portability and wearability since the user must carry a Raspberry Pi used as the computing source for running the CNN; very high latency, which can significantly reduce the practical usage of the application.

Pēgeot et al. 2012 [27]. Source of compute: laptop computer. Sensors: head-mounted color camera. Functionality: scene text detection and tracking using Optical Character Recognition. Usability: low portability and wearability since the user must carry a laptop computer for running the OCR algorithm and wear a head-mounted color camera for capturing images; the identified text is output as an audio signal using a text-to-speech library.

Takizawa et al. 2019 [52]. Source of compute: laptop computer. Sensors: Microsoft Kinect and a tactile feedback device on a cane. Functionality: recognizes a pre-trained set of fixed 3D objects in the surroundings and provides the user with instructions to find the 3D object. Usability: low long-term portability and wearability since the user must carry a Microsoft Kinect and a laptop computer for processing; vibratory cues are provided to help the user find the 3D object.

Chang et al. 2021 [25]. Source of compute: intelligent waist-mounted device. Sensors: camera, time-of-flight laser-ranging module, 6-axis motion sensor, GPS module, LPWAN module. Functionality: zebra crossing safety for the visually impaired. Usability: moderate long-term portability and wearability since the user must carry a waist-mounted device everywhere; audio feedback is provided to the user with the help of Bluetooth earphones.

Chang et al. 2020 [35] and Chen et al. 2019 [36]. Source of compute: IR sensors, 6-axis gyroscope, and accelerometer in smart glasses and intelligent cane. Sensors: IR sensors, vibration motor, LPWAN module, 6-axis gyroscope and accelerometer. Functionality: aerial object detection using IR sensor data by calculating distance with the triangulation method; fall detection using the six-axis gyroscope and accelerometer in the smart glasses and intelligent cane; notification system in case of fall detection. Usability: highly portable and wearable as the computation is performed by sensors on the smart glasses and intelligent cane; high reliability due to the presence of a notification mechanism in case of fall detection; vibratory cues signal the presence of aerial obstacles in front of the user.

Chang et al. 2019 and 2020 [7] [8] [9]. Source of compute: AI-based intelligent drug pill recognition box with a pre-trained deep learning model. Sensors: camera on smart glasses, drug pill recognition box with Wi-Fi capabilities. Functionality: drug pill recognition for the visually impaired. Usability: moderately portable as the user must carry the intelligent drug pill recognition box; high wearability as the images are captured with the help of smart glasses; audio signals are generated to provide reminders to the user and to report correct or incorrect identification of drugs.

Proposed method. Source of compute: Azure Vision API used to generate captions and identify objects in the image. Sensors: Google Glass with a camera, a microphone, a bone conduction transducer, Wi-Fi capability, and more. Functionality: image captioning and object detection in real time. Usability: highly portable and wearable as the only devices that the user must carry are a smartphone and Google Glass; audio output is produced with the help of a bone conduction transducer, which prevents the obstruction of external sound; completely hands-free application with voice command capabilities.


FIGURE 2. System Overview

Ongoing tasks are long-running applications that remain active even when users switch focus to a different application within the device. A stopwatch app is an excellent example of an ongoing task: users can switch to a different application while running the stopwatch without stopping the timer. The Periodic Notifications design pattern is used to develop applications where the user is notified of any new information to be displayed; examples include a news app, an SMS reader, or an email reader. The Immersion design pattern is used whenever the application requires complete control of the user experience; these applications stop when the user switches focus to a different app. Any gaming application is an excellent example of an Immersion pattern. The proposed visual assistant requires complete control of the user experience, and hence the Immersion pattern is chosen to design the application.

The system design diagram is shown in Fig. 3. The system can be divided into three major sections: the app on the Google Glass device, the smartphone, and the Vision API. The BVIP user interacts directly with the app on Google Glass. On receiving a user voice command, the camera image handler built into the app uses the camera present on the smart glasses to capture an image of the user's surroundings. This image is compressed and then sent to the smartphone using a socket connection over the internet. The image is compressed to reduce the size of the data to be sent over the internet, thereby reducing the application's response time. Socket programming is a way of connecting two nodes: one node (server) listens on a particular port at an IP address, while the other node (client) reaches out to the server node on the same port to form a connection. In this system, the application on the Google Glass is the client, and the application on the smartphone forms the server side of the socket connection.

Upon receiving the image from Google Glass, the server-side application on the smartphone decompresses the image. The captions of the decompressed image are then generated by using the Vision API. The response from the API is received in JSON format by the Cognitive Services API interface built into the smartphone app. JSON (JavaScript Object Notation) is an open standard data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays. The smartphone app processes the JSON response to extract the captions and the objects identified in the image. The processed response is then sent back to the client-side application on Google Glass over the same socket connection. Finally, on receiving the text response from the smartphone, the app on the Google Glass device uses the text-to-speech API provided by Android to convert the text to audio signals, which are rendered as sound by the bone conduction speaker present on the device. The BVIP user hears this sound output.

The version of Glass used in developing the proposed system is the Glass Explorer Edition. It comes with a custom Glass OS and Software Development Kit developed by Google. Glass OS, or Google XE, is a version of Google's Android operating system designed for Google Glass. The operating system version on the Explorer Edition device was upgraded from XE 12 to XE 23 since Android Studio, the integrated development environment (IDE) used for developing the app, supports XE 23, and the SDK documentation available online is also for XE 18+. The OS version was upgraded by flashing the device, which was done by programming the bootloader of the Glass.

Kivy6, an open-source Python library, was used for developing the socket server application on the smartphone. It is a cross-platform library for the rapid development of applications that make use of innovative user interfaces. It can run on Windows, OS X, Android, iOS, and Raspberry Pi. Hence, the server side of the application can be started on any smartphone, laptop computer, or Raspberry Pi. However, to increase portability and ease of use, smartphones were chosen for the proposed system. The Azure Vision API used to identify the various objects and generate captions of the captured image provides excellent results in real-time. It can be used to categorize objects into 86 different categories.

6 https://kivy.org/#home
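A minimal sketch of the smartphone-side server logic is shown below, assuming the same hypothetical length-prefixed framing as the client sketch earlier. The paper's server runs inside a Kivy application and calls the authors' trained Custom Vision project; here the generic Computer Vision v3.2 "analyze" REST endpoint is used purely to illustrate how the JSON response is reduced to a caption-plus-objects text string, and the endpoint and key values are placeholders.

```python
# Sketch of the smartphone-side socket server. The real server is a Kivy app
# calling the authors' Custom Vision project; AZURE_ENDPOINT / AZURE_KEY and
# the framing are placeholders used only to illustrate the JSON handling.
import socket
import struct

import requests

AZURE_ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
AZURE_KEY = "<subscription-key>"
ANALYZE_URL = f"{AZURE_ENDPOINT}/vision/v3.2/analyze"


def analyze_image(jpeg_bytes: bytes) -> str:
    """Call the Azure analyze endpoint and flatten captions + objects to text."""
    resp = requests.post(
        ANALYZE_URL,
        params={"visualFeatures": "Description,Objects"},
        headers={
            "Ocp-Apim-Subscription-Key": AZURE_KEY,
            "Content-Type": "application/octet-stream",
        },
        data=jpeg_bytes,
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()
    parts = [c["text"] for c in result.get("description", {}).get("captions", [])]
    objects = [o["object"] for o in result.get("objects", [])]
    if objects:
        parts.append("Objects detected: " + ", ".join(objects))
    return ". ".join(parts) if parts else "No description available"


def serve(host: str = "0.0.0.0", port: int = 5000) -> None:
    """Accept one Glass client at a time: read an image, reply with text."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((host, port))
        srv.listen(1)
        while True:
            conn, _ = srv.accept()
            with conn:
                (length,) = struct.unpack(">I", conn.recv(4))
                data = b""
                while len(data) < length:
                    data += conn.recv(min(4096, length - len(data)))
                conn.sendall(analyze_image(data).encode("utf-8"))


if __name__ == "__main__":
    serve()
```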


The performance of the API was evaluated against Flickr8k


[54] and Microsoft COCO [55] datasets. Different standard
evaluation metrics, namely, BLEU, METEOR, ROUGE-L,
and CIDEr, were used to evaluate the API, and the evaluation
results are shown in Table 2. The response from the API is
returned in JSON format. We process the JSON and return
the description of the image and the various objects in the
image in text format back to the device. Fig. 4 shows a
sample image from the Microsoft COCO [55] dataset, and the
caption generated by the API is "A bedroom with a bookshelf
and a mirror."
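The paper does not state which implementation produced the scores in Table 2; as one possibility, the BLEU-1 to BLEU-4 columns could be computed with NLTK as sketched below (METEOR, ROUGE-L, and CIDEr require separate tooling such as the pycocoevalcap package). The reference and candidate captions here are made-up examples.

```python
# Hedged sketch of caption evaluation; the exact tooling used for Table 2 is
# not specified in the paper. Captions below are invented for illustration.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each entry: a list of tokenised reference captions and one API caption.
references = [
    [["a", "bedroom", "with", "a", "bed", "and", "a", "mirror"],
     ["a", "bedroom", "containing", "a", "bookshelf"]],
]
hypotheses = [["a", "bedroom", "with", "a", "bookshelf", "and", "a", "mirror"]]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))   # cumulative n-gram weights
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")
```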
From Table 2, it is observed that the performance metric
scores can be significantly improved. The Azure Vision API
is trained and tested on images obtained from non-Asian
countries. Hence, the API can be fine-tuned and customized
to Indian scenarios by using Azure Custom Vision API. This
feature enables training and testing the Vision API on local
image datasets, thereby making the API more robust to local
settings. Here, several images were added to the training
dataset for better performance. A new image dataset was
compiled centered around the daily routine of the BVIP. The
images were annotated with class labels already supported
by the Vision API. There are 86 different class labels, and
a minimum of three images for each category was collected.
The BVIP subjects were surveyed to understand their routine,
and it was found that they extensively used the following cat-
egories: keys, remote, medicine, mobile phone, prescription
glasses, and umbrella. A minimum of 50 images was col-
lected for each of the six categories and was used for training
the API. The standard annotation procedure is followed, and
it is discussed in the Results and Analysis section.

FIGURE 3. System Design

FIGURE 4. Sample Image

C. USER-SYSTEM INTERACTION

The system is designed to make the user-system interaction entirely audio-based, thus providing the best possible user experience to a BVIP. Google Glass has a sensor that detects when a user puts it on, and the device is configured to turn on automatically as soon as it is worn.

TABLE 2. Evaluation metrics of the Azure Vision API against Flickr8K and Microsoft COCO datasets

Dataset BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr


Flickr8K 0.383812 0.228540 0.145145 0.088268 0.168121 0.333856 0.594539
MS COCO 0.396322 0.256324 0.173089 0.117195 0.182853 0.357928 0.804114

The Home screen shown in Fig. 5 is displayed as soon as the user wears the device. Once the device is worn, the following steps are to be followed.

FIGURE 5. Home Screen

• Step 1: Say, "OK, Glass." The device recognizes the command and takes the BVIP user from the home screen to the menu screen, containing the list of vocal invocations for the various applications installed on the device. It also plays a beep sound so that the user can be assured that the command is recognized and executed. The menu or voice invocation screen is shown in Fig. 6.

FIGURE 6. Describe Scene: Invocation Screen

• Step 2: Say, "Describe Scene." This voice command, present on the invocation menu, starts the virtual assistant. Upon execution of the command, the main activity screen of the application is displayed on the device; this screen is shown in Fig. 7. Once the application starts and the main activity screen is visible, the camera intent is activated, and the camera captures the image of the surroundings in front of the user. The captured image is shown in Fig. 8.

FIGURE 7. Main Activity Screen

FIGURE 8. Image Captured using the Camera on Glass

After capturing the image, the device sends it to the socket server running on the user's smartphone. The description of the image and the various objects present in it are recognized using the Vision API. The generated response is sent back to the Google Glass in text format and is converted to speech by using the bone conduction transducer present in the device. The captions and objects detected are also displayed on the device, as shown in Fig. 9 and Fig. 10, respectively.

Figs. 5, 6, 7, 8, 9, and 10 are used to explain the flow of the application to the readers; the user can use the device without seeing the screen. To summarize, the user has to use the voice command "OK, Glass" followed by "Describe Scene" to launch the application. Thus, a wearable assistant for the visually impaired was developed by using only voice commands to interact with the application.

A detailed user-system interaction diagram is shown in Fig. 11. It displays the various steps in order of occurrence while using the app.

FIGURE 9. Caption Response

FIGURE 10. Objects Detected Response

Firstly, the socket server application is started on the smartphone. This server application waits for a client connection from the Google Glass device. The server-client connection is established on receiving a connection request from the smart glasses, and this connection remains intact for all interactions between the Google Glass and the smartphone. Next, the BVIP user interacts with the application with the voice command described earlier: "OK, Glass; Describe Scene." As shown in the interaction diagram, the voice command triggers a series of steps on the smart glasses and the smartphone, starting with capturing the image in front of the user and sending it to the smartphone app. Here, after processing the received image, captions are generated, and the objects in the image are identified by using the Vision API. The output of the API is sent back to the application on Google Glass via the smartphone app, and this output is eventually heard by the BVIP user wearing the device.

IV. RESULTS AND ANALYSIS

A. EXPERIMENTAL SETUP

The experimental results and tests were obtained on a Google Glass Explorer Edition, which comes with a dual-core Cortex A9 (2 x 1 GHz) processor by Texas Instruments, a built-in PowerVR SGX540 GPU, 2GB RAM, and an internal storage capacity of 16GB. It has a 570mAh battery and is Wi-Fi compatible. The smartphone application was built on a OnePlus 7 phone running Android 10, with a Snapdragon 855 processor, 6GB RAM, and 128GB ROM. Application development on both devices was done using Android Studio, an Integrated Development Environment (IDE) for building Android applications.

The accuracy and precision of the Azure Custom Vision API were measured against other state-of-the-art vision models on a Dell G7 laptop computer with an Intel Core i7-9750H CPU running at 2.60GHz and 16GB RAM, running Windows 10 with an NVIDIA GeForce RTX 2060 GPU and 8GB of video memory. For measuring the proposed system's latency and for testing the application with the BVIP, a 4G network provided by a local internet service provider was used. Table 3 shows the attributes used for training the Custom Vision API.

TABLE 3. Training setup for the Custom Vision API

Validating and testing batch size: 100 images
Epochs: 4000
Learning rate: 0.1
Classifier at bottleneck layer: Softmax
Number of classes: 86
Optimizer: Adaptive Momentum
Loss: Categorical Entropy
Weight initialization: pre-trained weights of the ImageNet v3 model trained over ImageNet, MS COCO, and Flickr; the model is fine-tuned by retraining the bottleneck layer
Testing set: 10% of the training dataset
Validation set: 10% of the training dataset
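Azure Custom Vision performs the training summarized in Table 3 as a managed service, so its internal architecture is not exposed. The Keras sketch below only mirrors the listed hyperparameters (86 classes, a softmax head on a frozen pre-trained backbone, an adaptive-momentum optimizer, categorical cross-entropy, learning rate 0.1, 4000 epochs); the choice of InceptionV3 as the backbone, the image size, batch size, and dataset paths are assumptions.

```python
# Sketch mirroring the Table 3 setup; not the actual Custom Vision internals.
# Backbone choice, paths, image size, and batch size are placeholders.
import tensorflow as tf

NUM_CLASSES = 86

base = tf.keras.applications.InceptionV3(   # stand-in for the pre-trained backbone
    weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                      # retrain only the new bottleneck head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),  # adaptive momentum, lr 0.1
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/train", image_size=(299, 299), batch_size=32,
    label_mode="categorical")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/val", image_size=(299, 299), batch_size=32,
    label_mode="categorical")

model.fit(train_ds, validation_data=val_ds, epochs=4000)  # epoch count per Table 3
```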
B. ANNOTATION

Five different annotators are used in this study, and at least two different annotators annotate each image by identifying the different objects present in it. Each annotation includes the bounding box and the class label of the identified object. The annotation process followed is the standard procedure described in [56]. The Custom Vision API has an interface that loads the annotated images and provides tools to place the bounding box and the class label. Once the annotation is done for each image, all the corresponding bounding box coordinates and the class labels present in that image are stored in the API's internal database. The annotators were informed about all 86 class labels supported by the API. They were also provided with the definition of a bounding box (the smallest rectangle with vertical and horizontal sides surrounding an object) and its documentation. All the annotators were previously familiar with the concepts of object localization and classification, and hence no further training was provided. Though the class labels were clear in most cases, drawing a tight bounding box around multiple overlapping objects was an issue, so the annotators' reliability is also measured.

FIGURE 11. User-system interaction diagram

Since the number of annotators is more than two, the quadratic-weighted Cohen's κ (κ_w) and the leave-one-labeler-out agreement were used, as shown in Equation 1, where p_ij are the observed probabilities, e_ij = p_i q_j are the expected probabilities, and w_ij are the weights (with w_ji = w_ij). The annotators reliably agree when discriminating the recognized class label, with Cohen's κ = 0.94.

\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, p_{ij}}{\sum_{i,j} w_{ij}\, e_{ij}}    (1)

Here, the standard error (se) is calculated using Equation 2,

se_w = \frac{1}{1 - p_{e(w)}} \sqrt{\frac{\sum_{i,j} p_{ij}\,[v_{ij} - u_{ij}(1 - \kappa_w)]^2 - [\kappa_w - p_{e(w)}(1 - \kappa_w)]^2}{n}}    (2)

where

v_{ij} = 1 - \frac{w_{ij}}{w_{max}}, \quad p_{e(w)} = \sum_i \sum_j v_{ij}\, p_i q_j, \quad u_{ij} = \sum_h q_h v_{ih} + \sum_h p_h v_{hj}

Accuracy Comparison: Cohen's κ is computed to compare the accuracy of the Vision API against human annotations [56], [57]. The results are shown in Table 4, where it can be observed that the average κ value varies from 0.91 for the API's classification of 80 class labels to 0.94 for the six other class labels (keys, remote, medicine, mobile phone, prescription glasses, and umbrella) of the considered Vision API. The classification accuracy reduces if there are multiple overlapping objects in the images. Also, we observe that the human annotation results vary in line with the API classifications, varying between 0.91 and 0.96 for the Vision API class labels. From Table 4, it is observed that the API classification of class labels performs equally well when compared to inter-human annotations.

TABLE 4. Cohen's κ for Train-Test Results of Annotated Images (Cohen's κ and std. dev.)

Environment                           API           Human Annotation
80 Class labels of Vision API         0.91 (0.96)   0.91 (0.94)
6 Other class labels of Vision API    0.94 (0.97)   0.96 (0.98)
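For reference, Equation 1 with quadratic weights w_ij = (i - j)^2 can be computed directly for a pair of annotators as sketched below; scikit-learn's cohen_kappa_score with weights="quadratic" gives the same value. The example labels are made up.

```python
import numpy as np


def quadratic_weighted_kappa(labels_a, labels_b, n_classes):
    """Quadratic-weighted Cohen's kappa (Equation 1) for two annotators."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)

    # Observed joint probabilities p_ij and marginals p_i, q_j.
    conf = np.zeros((n_classes, n_classes))
    for a, b in zip(labels_a, labels_b):
        conf[a, b] += 1
    p = conf / conf.sum()
    p_i = p.sum(axis=1)
    q_j = p.sum(axis=0)
    e = np.outer(p_i, q_j)                      # expected probabilities e_ij

    # Quadratic disagreement weights w_ij = (i - j)^2.
    idx = np.arange(n_classes)
    w = (idx[:, None] - idx[None, :]) ** 2

    return 1.0 - (w * p).sum() / (w * e).sum()


# Example: two annotators labelling six images over three classes.
a = [0, 1, 2, 2, 0, 1]
b = [0, 1, 2, 1, 0, 1]
print(quadratic_weighted_kappa(a, b, n_classes=3))
```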


C. DATA AUGMENTATION

The collected data for training contains 86 different class labels with a minimum of 3 images for each category; for a few class labels, such as keys, remote, medicine, mobile phone, prescription glasses, and umbrella, more than 50 images per class are collected and annotated. Though the total number of images considered is more than 540, the variants of images considered are fewer, as the images are taken from cameras at only three different angles, i.e., front view, side view, and top view. Only these three angles were considered since all other variants can be generated using data augmentation. So, data augmentation is used to increase the training data tenfold, thereby increasing its robustness [58], [59]. The data augmentation techniques used are given below, and the augmentation values used are given in Table 5. After data augmentation, the total number of annotated images is 5000.

• channel_shift_range: Random channel shifts of the image.
• zca_whitening: Applies ZCA whitening to the image.
• rotation_range: Random rotation of the image within a degree range.
• width_shift_range: Random horizontal shifts of the image by a fraction of the total width.
• height_shift_range: Random vertical shifts of the image by a fraction of the total height.
• shear_range: Shear intensity of the image, where the shear angle is in the counter-clockwise direction in radians.
• zoom_range: Random zoom of the image, where the lower value is 1 - zoom_range and the upper value is 1 + zoom_range.
• fill_mode: Any of constant, nearest, reflect, or wrap; points outside the boundaries of the input are filled according to the selected mode.
• horizontal_flip: Randomly flips the inputs horizontally.

Table 5 shows the details of the different data augmentations performed on the dataset; a usage sketch with these parameters follows.
TABLE 5. Types of data augmentation used Area of Overlap
IoU = (3)
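The parameter names in Table 5 match those of the Keras ImageDataGenerator, so the tenfold augmentation can be sketched as below; the image path and the choice of ten variants per source image are illustrative assumptions rather than the authors' exact pipeline.

```python
# A minimal sketch, assuming a local image file: the Keras ImageDataGenerator
# configured with the values from Table 5, producing ten augmented variants per
# source image (mirroring the tenfold increase described above).
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator, img_to_array, load_img

datagen = ImageDataGenerator(
    channel_shift_range=20,
    zca_whitening=True,          # ZCA statistics are computed by datagen.fit() below
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode="nearest",
)

image = img_to_array(load_img("sample.jpg"))   # "sample.jpg" is a hypothetical path
batch = np.expand_dims(image, axis=0)          # shape (1, H, W, 3)
datagen.fit(batch)                             # required before using zca_whitening

augmented_variants = []
for i, out in enumerate(datagen.flow(batch, batch_size=1)):
    augmented_variants.append(out[0])          # one randomly transformed copy
    if len(augmented_variants) == 10:          # tenfold augmentation of this image
        break
```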
D. PERFORMANCE EVALUATION
After training with the newly annotated image dataset, the Custom Vision API's performance was compared against other deep learning computer vision frameworks, and the results are presented in Table 6. The mean Average Precision (mAP) value was computed on the MS COCO dataset using the COCO primary challenge metric (https://cocodataset.org/detection-eval). It is calculated by averaging the AP over multiple IoU (Intersection over Union, as shown in Equation 3) thresholds ranging from 0.5 to 0.95 with a step of 0.05. The number of operations and parameters involved in the calculation is also given in Billion Mult-Adds and Million Parameters, respectively. The results show the API performing better than other state-of-the-art vision models such as SSD 300, Faster-RCNN 300, and Faster-RCNN 600. Similarly, the performance of the API on the ImageNet dataset [60] is given in Table 7. From the results, it can be concluded that the API has better performance than most state-of-the-art models for computer vision.

TABLE 6. COCO object detection results comparison using different frameworks and network architectures vs the Azure Custom Vision API. mAP is reported with the COCO primary challenge metric (AP at IoU=0.50:0.05:0.95).

Vision Framework          Model Used     mAP      Billion Mult-Adds   Million Parameters
Azure Custom Vision API   -              26.33%   116                 37.43
SSD 300                   deeplab-VGG    21.10%   34.9                33.1
SSD 300                   Inception V2   22.00%   3.8                 13.7
SSD 300                   MobileNet      19.30%   1.2                 6.8
Faster-RCNN 300           VGG            22.90%   64.3                138.5
Faster-RCNN 300           Inception V2   15.40%   118.2               13.3
Faster-RCNN 300           MobileNet      16.40%   25.2                6.1
Faster-RCNN 600           VGG            25.70%   149.6               138.5
Faster-RCNN 600           Inception V2   21.90%   129.6               13.3
Faster-RCNN 600           MobileNet      19.80%   30.5                6.1

TABLE 7. Comparison of Accuracy Results of different models vs the Azure Custom Vision API

Model                     Accuracy   Billion Mult-Adds   Million Parameters
Azure Custom Vision API   73.1%      116                 37.43
MobileNet-224             70.60%     569                 4.2
GoogleNet                 69.80%     1550                6.8
VGG 16                    71.50%     15300               138
Squeezenet                57.50%     1700                1.25
AlexNet                   57.20%     720                 60

IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}} \qquad (3)
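A minimal sketch of Equation 3 for axis-aligned bounding boxes is given below, together with the IoU thresholds over which the per-threshold AP values are averaged for the COCO primary challenge metric; the (x1, y1, x2, y2) box format and the sample boxes are assumptions for illustration.

```python
# A minimal sketch of Equation 3 for axis-aligned (x1, y1, x2, y2) boxes, plus the
# IoU thresholds averaged by the COCO primary challenge metric. Box values are made up.
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)          # area of overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                             # area of union
    return inter / union if union > 0 else 0.0

print(round(iou((10, 10, 60, 60), (30, 30, 80, 80)), 2))        # ~0.22

# AP is evaluated at each of these thresholds and then averaged (IoU=0.50:0.05:0.95).
coco_thresholds = np.arange(0.50, 1.00, 0.05)
print(np.round(coco_thresholds, 2))                             # 0.50, 0.55, ..., 0.95
```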
The mAP of the Vision API is calculated on the newly created dataset of 5000 annotated images before and after training the Custom Vision API on the new dataset. It is observed that the mAP value increases from 63% to 84% with IoU > 0.5 after training the Custom Vision API.

The application's latency was measured for two resolutions of the captured image, 224*224 and 512*512 pixels, and the results are shown in Table 8. The time measured can be classified into three different categories:
1) Smartphone Time: the time taken on the smartphone app
2) Glass Time: the time taken on the Google Glass device


3) Edge Time: the time the Vision API takes to generate the captions for the captured image, identify the objects present in it, and return the results over the Wi-Fi back to the smartphone.

For both the resolutions, the application has a response time of less than 1 second. All the latency values were measured on a 4G network.

TABLE 8. Latency Results (delay in milliseconds)

Frame Resolution    224*224    512*512
Smartphone Time     300        500
Glass Time          200        300
Edge (API) Time     100        150
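As a rough illustration of how the Edge Time component can be measured, the sketch below times a round trip to a Custom Vision prediction endpoint over HTTP; the endpoint URL, prediction key, and image path are placeholders that depend on the Azure project configuration, and the response fields follow the usual Custom Vision prediction schema, which is treated here as an assumption rather than the paper's exact client code.

```python
# A minimal sketch, not the paper's client code: time one round trip to a Custom
# Vision prediction endpoint for a captured frame. PREDICTION_URL, PREDICTION_KEY,
# and the image path are placeholders; the "predictions"/"tagName"/"probability"
# fields follow the usual Custom Vision prediction response (an assumption here).
import time
import requests

PREDICTION_URL = "https://<region>.api.cognitive.microsoft.com/customvision/v3.0/Prediction/<project-id>/detect/iterations/<published-name>/image"
PREDICTION_KEY = "<prediction-key>"

def edge_time_ms(image_path):
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    headers = {"Prediction-Key": PREDICTION_KEY, "Content-Type": "application/octet-stream"}
    start = time.perf_counter()
    response = requests.post(PREDICTION_URL, headers=headers, data=image_bytes, timeout=10)
    elapsed_ms = (time.perf_counter() - start) * 1000.0   # includes network transfer
    response.raise_for_status()
    for pred in response.json().get("predictions", []):
        print(pred["tagName"], round(pred["probability"], 3))
    return elapsed_ms

# print(edge_time_ms("frame_224.jpg"))   # hypothetical 224*224 capture
```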
Comparison with other computer vision and image recognition APIs: There are several APIs such as Watson, Clarifai, Imagga, and Parallel Dots. Though a few APIs have better mAP values than the Azure Vision API for the standard datasets [61], the customizable option provided by the Azure Vision API, of using images belonging to various other categories that are not present in the standard datasets, makes the Azure Vision API a better choice. Apart from that, this API also offers computer vision features such as blob detection and analysis, building tools, image processing, multiple image type support, reporting/analytics integration, and smart camera integration, and it supports integration with the Microsoft Azure Cloud network and various virtual and augmented reality tools such as Microsoft Kinect. All of these make this API a better choice than the rest.

E. LIKERT SCALE ANALYSIS
With the help of students (50) and teachers (5) at the Roman and Catherine Lobo School for the Visually Impaired at Mangalore, Karnataka, India, the application was tested, and its usefulness to the BVIP community was determined. The students who took part in the study belonged to the age group of 12 to 15 years and were in their high school years. They were under the supervision of their teachers during the study, who were 30 to 50 years old. After demonstrating how to use the device, the students were asked to use it in their school environment to identify and recognize different areas within the school boundary, such as their classroom, dorm, and playground. Objects like chairs, windows, doors, beds, and stairs were some of the different indoor objects identified using the device. Some of the students who used the device in the playground within the school premises were able to identify outdoor scenes, which included trees, swings, pet cats, and dogs. Using the description given by the device, the students could accurately identify their current location within the school. After performing the study, to determine the application's usefulness, a set of hypotheses was formulated, along with corresponding questionnaires for each of these hypotheses. The feedback and answers to the questions were recorded and presented in the form of charts. The following hypotheses were formulated for the study:

1) User training period is minimal: As already described in the user-system interaction section, there are two voice commands to use the application. The first voice command is "OK Glass," followed by the second command, "Describe Scene." The voice commands were found very intuitive by the users. The most significant advantage of the proposed system is that the user does not require any visual cues to use the application.
2) No extra effort is required to use this device daily: The device is fairly simple to use. The navigation through the device is entirely audio-based. Each of the two voice commands is followed by a beep sound, and the result is the audio-based description of the scene. On receiving the description, the user can recognize a different scene by starting over. Finally, the voice recognition software provided by Google was found to be very effective.
3) The application helps the user to understand the scene: Since the application generates captions of whatever scene the person is looking at, it was hypothesized that the application would help the user better understand their surroundings.
4) A null and alternate hypothesis was also formulated:
   • Null Hypothesis: A visually impaired person would not prefer to use the application.
   • Alternate Hypothesis: A visually impaired person would prefer to use this application every day.

Questionnaires were formulated to evaluate the above hypotheses. The questions are as follows:
Hypothesis 1: User training period is minimal.
1) Were you able to effectively use this application yourself after three or fewer trials/walkthroughs?
2) While trying this application, did you feel confused at any point?
3) After a prolonged period of not using the application, would you use it with the same efficiency you are using now? (Would you be able to remember how to use the device?)
Hypothesis 2: No extra effort is required to use this device daily.
1) Do you consider wearing this device irritating/troublesome?
2) How many times (out of five) is your voice recognized by the device?
3) Would you prefer to use this application instead of having a guide? If not: would you prefer to use this application when your guide is not available?
Hypothesis 3: The application helps the user to understand the scene.
1) Do you think the device identified all the objects in the scene?
2) Do the identified objects help in better understanding the scene?
3) Are the objects correctly identified?

Hypothesis 4:
Null Hypothesis: A visually impaired person would not prefer to use the application.
Alternate Hypothesis: A visually impaired person would prefer to use this application every day.
1) What do you use to walk in and around your neighborhood? Cane, Guide, Other.
2) Would you prefer a guide or would you rather walk alone?
3) Do you have access to the internet in your area?
4) How would you rate the response time of the application?
5) How likely are you to use this application every day?
6) How comfortable are you with the audio-based interface?
7) Are you able to hear the output from the device?
8) How well do you think the description given by the application matches the actual scene (as described by the guide)?
9) Would you prefer voice-based or touch-based navigation?

A Likert Scale Analysis on the usability of the application was performed using user feedback and responses to the above questions. The following pie charts depict the responses to some of the questions received from the users.
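The pie charts that follow can be reproduced from the raw questionnaire responses with a few lines of analysis code; the sketch below uses made-up response counts for one item (they are not the study's data) and is only meant to show how the percentages in the charts are obtained.

```python
# A minimal sketch, using made-up counts (not the study's data), of turning the
# answers to one questionnaire item into the percentages shown in the pie charts.
import matplotlib.pyplot as plt
import pandas as pd

# e.g., responses to Hypothesis 1, Question 2 ("Did you feel confused at any point?")
responses = pd.Series(["No"] * 40 + ["Yes, slightly"] * 10 + ["Yes, very"] * 5)

shares = responses.value_counts(normalize=True) * 100    # percentage per answer
print(shares.round(1))

plt.pie(shares.values, labels=shares.index, autopct="%.0f%%")
plt.title("Hypothesis 1, Question 2 (illustrative data)")
plt.savefig("h1q2_pie.png", dpi=150)
```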

FIGURE 12. Hypothesis 1 Question 2
FIGURE 13. Hypothesis 2 Question 2
FIGURE 14. Hypothesis 3 Question 2

Inference: Fig. 12 gives the percentage of people who felt confused while trying the device. As can be seen, the majority of the users did not feel confused. This question concerns Hypothesis 1: User training period is minimal. The less confused the user is in his or her first attempt at using the application, the smaller the training period.
Inference: Since the application is entirely audio-based, the users' voices must be correctly recognized. From Fig. 13, it can be seen that the users' voice commands are recognized correctly most of the time. This question concerns Hypothesis 2: No extra effort is required to use this device daily. The fewer times the user has to repeat the voice command, the lesser the effort put into using the application.
Inference: From Fig. 14, it can be concluded that the objects identified from the image helped the user better understand the scene. Thus, the object detection model complements the image captioning model. This question concerns Hypothesis 3: The application helps the user to understand the scene.

Statistical Analysis: To further understand the four different hypotheses, fifty-five subjects were compared with independent sample t-tests and χ² tests on various parameters such as age, gender, the severity of visual impairment, and intellectual level. The differences between groups on parametric data such as chronological age and age of onset of blindness were evaluated with an independent sample t-test, and the non-parametric data such as gender and severity of blindness were evaluated with the χ² test. School records included age, gender, the severity of visual impairment, and intellectual level. The severity of visual impairment was categorized as 'total blindness,' 'near blindness,' 'profound vision loss,' and 'severe vision loss.' Similarly, the intellectual level was classified into 'normal,' 'borderline,' or 'mental retardation.' Age included 'chronological age' and 'age of onset of visual impairment.' There was no statistically significant difference between the groups in terms of age (18.86 ± 3.05), age of onset (8.81 ± 20.85 months), and gender (χ² = 0.02, d.f. = 1, P = 0.95) w.r.t. the four hypotheses: training period, the effort required to

use the device, the application's impact on scene understanding, and the inclination to use this application every day. Also, there was no significant difference in the severity of blindness for the four hypothesis categories (χ² = 10.24, d.f. = 2, P = 0.15). However, we found a significant difference for the intellectual level (χ² = 36.11, d.f. = 3, P = 0.001), as the students with borderline and mental retardation found it difficult to understand and use the application. For this complete statistical analysis, the significance level was set at P < 0.05.
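The two tests named above map directly onto scipy.stats; the sketch below shows an independent-samples t-test on a parametric variable and a χ² test on a contingency table, with made-up numbers that merely illustrate the procedure, not the study's records.

```python
# A minimal sketch of the statistical comparisons described above, with made-up
# numbers: an independent-samples t-test for a parametric variable and a chi-squared
# test on a contingency table for a categorical one.
import numpy as np
from scipy import stats

# Parametric example: chronological age split by whether the hypothesis held for a subject.
age_group_a = np.array([18, 20, 17, 22, 19, 16, 21])     # illustrative values
age_group_b = np.array([19, 18, 23, 17, 20, 18])
t_stat, p_value = stats.ttest_ind(age_group_a, age_group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Categorical example: 2x2 table of gender vs. outcome (illustrative counts).
contingency = np.array([[14, 12],
                        [15, 14]])
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, d.f. = {dof}, p = {p:.3f}")
```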
F. SIGNIFICANCE
The proposed approach was compared with existing state-of-the-art solutions, and the details are shown in Table 1. Previously, devices like smartphones, Bluetooth beacons, and Raspberry Pi were used to develop diverse solutions. The proposed approach attempts to tackle the problem by using Google Glass. Keeping BVIP users in mind, the application was designed to be wholly audio-based, and the user does not require any visual cues to use the device. Another significant improvement is that the user does not need to carry any bulky hardware while using the proposed system. The hardware used here is Google Glass, which is very similar to regular reading glasses in size and shape, together with a smartphone, which makes the application highly portable and easy to use. Hence, the proposed system is highly wearable, portable, and provides accurate results in real-time.

A Likert Scale Analysis on the usability of the application was performed. Positive feedback and responses were received from the users, as shown in the charts in Figures 12, 13, and 14. It can be concluded from the responses that the application can be used effortlessly on a daily basis to understand the BVIP user's surroundings. It can be further concluded that the BVIP require minimal to no training to use the device, and they prefer to use the application as a visual assistant.

G. LIMITATIONS
While testing, certain limitations of the application were identified. Firstly, the proposed system is highly dependent on a strong internet connection and works if and only if there is an internet connection available in the area. The latency of the application was found to vary significantly due to fluctuations in the network speed. Secondly, the device is relatively expensive in developing countries and is not easily affordable. Finally, the battery on the Google Glass was able to run only for 4 hours per charge while using the application continuously. However, the short runtime problem can be overcome by adding an external power pack.

A few other improvements were identified while collecting feedback from the BVIP students and teachers. For instance, a few users commented on the ability of the device to understand regional accents, stating that the voice command was not recognized in certain instances. The statistics are shown in Fig. 13. Another feedback received on very similar grounds was that the audio output from the device can be personalized such that the voice output has a regional accent. The users explained that this would help make the application feel more personalized to the user, given that Indian accents are now available on various electronic gadgets. One BVIP user commented on the audio output being affected in boisterous environments, such as noisy traffic junctions or construction sites. However, this problem was mitigated by switching to Bluetooth earphones instead of the bone conduction transducer, in which case the audio output was not affected by external sounds. The BVIP users explained that the application on Google Glass provides more comfort and usability when compared with smartphone apps for the visually impaired.

V. CONCLUSION
The use of Google Glass to assist the BVIP community is demonstrated by developing an application that acts as a visual assistant. The system is designed to be highly portable, easy to wear, and works in real-time. The experimental results of the Azure Vision API show a mean Average Precision (mAP) value of 29.33% on the MS COCO dataset and an accuracy of 73.1% on the ImageNet dataset. A dataset of 5000 newly annotated images is created to improve the performance of scene description in Indian scenarios. The Custom Vision API is trained and tested on the newly created dataset, and it is observed that this increases the overall mAP from 63% to 84% with IoU > 0.5 for the created dataset. The overall response time of the proposed application was measured and is less than 1 second, thereby providing accurate results in real-time. The proposed application describes the scene and identifies the various objects present in front of the user. It was tested on the BVIP, their response and feedback were recorded, and a Likert scale analysis was performed. From the analysis, it can be concluded that the proposed system has an excellent potential to be used as an assistant for the BVIP.

The computer vision API from Azure Cognitive Services can add more functionalities to the proposed application. The capabilities of other APIs can be explored to add more functionalities such as text extraction and reading using the Read API, and face detection and recognition using the Face Service (https://azure.microsoft.com/en-us/services/cognitive-services/face/). The application can be enhanced by adding more features, such as lane detection, fall detection, pit detection, obstacle avoidance, and a shopping assistant, thereby creating a one-stop assistant for the BVIP. Google Glass has embedded sensors that can achieve these functionalities with little to no need for external sensors. Further, there exists a possibility of moving the application entirely to Google Glass by removing the dependency on the smartphone. Currently, the smartphone device is used to process the captured image before making the API calls to the Custom Vision API, which can be avoided by using the Android SDK for the Vision API (https://github.com/microsoft/Cognitive-Vision-Android) directly on Google Glass.


DECLARATION
The experimental procedure and the entire setup, including the Google Glass given to the participants, were approved by the Institutional Ethics Committee (IEC) of NITK Surathkal, Mangalore, India. The participants were also informed that they had the right to quit the experiment at any time. The collected data, i.e., video recordings, audio, and the written feedback of the subjects, was taken only after they gave written consent for the use of their collected data for the research experiment.

APPENDIX A RESULTS OF LIKERT SCALE ANALYSIS
The following pie charts are obtained by performing the Likert Scale Analysis on the usability of the application. In order to perform the analysis, we formulated four types of hypotheses and generated corresponding questionnaires to evaluate these hypotheses. We asked these questions to the BVIP students and teachers at the Roman and Catherine Lobo School for the Visually Impaired, Mangalore, Karnataka, India.

A. HYPOTHESIS 1: USER TRAINING PERIOD IS MINIMAL

FIGURE 15. Hypothesis 1 Question 1
FIGURE 16. Hypothesis 1 Question 3

In order to evaluate the first hypothesis, a description of the proposed application was given to the users, along with an explanation of how to use the device. After trying the application, the BVIP users were asked the questions shown in Figs. 15, 12 and 16. As can be seen from the responses in the graphs, most of the users found the application easy to use and were able to use the application effectively after a single walk-through.

B. HYPOTHESIS 2: NO EXTRA EFFORT IS REQUIRED TO USE THIS DEVICE ON A DAILY BASIS

FIGURE 17. Hypothesis 2 Question 1
FIGURE 18. Hypothesis 2 Question 3

The second set of questions was asked to determine if the users found the device to be usable on a daily basis. For this, the questions shown in Figs. 17, 13 and 18 were asked. Most of the users were comfortable using the device regularly, but a few of them found the device irritating, as overusing the application sometimes led to the device heating up. The voice recognition system provided by Google Glass was effective except for a few cases where the users had to repeat the commands a few times for the device to recognize the command.

C. HYPOTHESIS 3: THE APPLICATION HELPS THE USER TO UNDERSTAND THE SCENE

FIGURE 19. Hypothesis 3 Question 1
FIGURE 20. Hypothesis 3 Question 3

The third set of questions was focused on the actual use case of the application: Scene Description. The questions asked are as shown in Figs. 19, 14 and 20. As can be seen from the responses displayed in the charts, most of the BVIP users agreed that the objects identified helped them better understand the scene.

D. HYPOTHESIS 4
Null Hypothesis: A visually impaired person would not prefer to use the application.
Alternate Hypothesis: A visually impaired person would prefer to use this application every day.

The final set of questions was asked to determine if the visually impaired person would prefer to use the application. Various questions were asked to evaluate this hypothesis, as can be seen from Figs. 21 to 29. The questions were asked to determine the current lifestyle of the visually impaired individuals and whether the use of the application would help them in better scene analysis. From their responses, it can be concluded that the majority of the users found the application effective, portable, and easy to use.

FIGURE 21. Hypothesis 4 Question 1
FIGURE 22. Hypothesis 4 Question 2
FIGURE 23. Hypothesis 4 Question 3

FIGURE 24. Hypothesis 4 Question 4
FIGURE 25. Hypothesis 4 Question 5
FIGURE 26. Hypothesis 4 Question 6
FIGURE 27. Hypothesis 4 Question 7
FIGURE 28. Hypothesis 4 Question 8
FIGURE 29. Hypothesis 4 Question 9

REFERENCES
[1] Eric R. Jensen. Brain-based learning: The new paradigm of teaching. 2008.
[2] R. S. Fixot. American Journal of Ophthalmology. 1957.
[3] Van C. Lansingh. Vision 2020: The right to sight in 7 years? 2013.
[4] Nicholas Bradley and Mark Dunlop. An experimental investigation into wayfinding directions for visually impaired people. Personal and Ubiquitous Computing, 9:395–403, 11 2005.
[5] F. Battaglia and G. Iannizzotto. An open architecture to develop a handheld device for helping visually impaired people. IEEE Transactions on Consumer Electronics, 58(3):1086–1093, 2012.
[6] María Meza-de Luna, Juan Terven, Bogdan Raducanu, and Joaquín Salas. A social-aware assistant to support individuals with visual impairments during social interaction: A systematic requirements analysis. International Journal of Human-Computer Studies, 122, 08 2018.


[7] Wan-Jung Chang, Liang-Bi Chen, Chia-Hao Hsu, Jheng-Hao Chen, Tzu-Chin Yang, and Cheng-Pei Lin. Medglasses: A wearable smart-glasses-based drug pill recognition system using deep learning for visually impaired chronic patients. IEEE Access, 8:17013–17024, 2020.
[8] Wan-Jung Chang, Yue-Xun Yu, Jhen-Hao Chen, Zhi-Yao Zhang, Sung-Jie Ko, Tsung-Han Yang, Chia-Hao Hsu, Liang-Bi Chen, and Ming-Che Chen. A deep learning based wearable medicines recognition system for visually impaired people. In 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), pages 207–208, 2019.
[9] Wan-Jung Chang, Liang-Bi Chen, Chia-Hao Hsu, Cheng-Pei Lin, and Tzu-Chin Yang. A deep learning-based intelligent medicine recognition system for chronic patients. IEEE Access, 7:44441–44458, 2019.
[10] P. A. Zientara, S. Lee, G. H. Smith, R. Brenner, L. Itti, M. B. Rosson, J. M. Carroll, K. M. Irick, and V. Narayanan. Third eye: A shopping assistant for the visually impaired. Computer, 50(2):16–24, 2017.
[11] D. Pintado, V. Sanchez, E. Adarve, M. Mata, Z. Gogebakan, B. Cabuk, C. Chiu, J. Zhan, L. Gewali, and P. Oh. Deep learning based shopping assistant for the visually impaired. In 2019 IEEE International Conference on Consumer Electronics (ICCE), pages 1–6, 2019.
[12] Pablo-Alejandro Quinones, Tammy Greene, Rayoung Yang, and Mark Newman. Supporting visually impaired navigation: A needs-finding study. pages 1645–1650, 05 2011.
[13] Fatma El-zahraa El-taher, Ayman Taha, Jane Courtney, and Susan Mckeever. A systematic review of urban navigation systems for visually impaired people. Sensors, 21(9), 2021.
[14] J.-H. Lee, Dongho Kim, and B.-S. Shin. A wearable guidance system with interactive user interface for persons with visual impairment. Multimedia Tools and Applications, 75, 11 2014.
[15] Ruxandra Tapu, Bogdan Mocanu, and Titus Zaharia. A computer vision-based perception system for visually impaired. Multimedia Tools and Applications, 76, 05 2016.
[16] P. Vyavahare and S. Habeeb. Assistant for visually impaired using computer vision. In 2018 1st International Conference on Advanced Research in Engineering Sciences (ARES), pages 1–7, June 2018.
[17] Kevin Laubhan, Michael Trent, Blain Root, Ahmed Abdelgawad, and Kumar Yelamarthi. A wearable portable electronic travel aid for blind. pages 1999–2003, 03 2016.
[18] Michael Trent, Ahmed Abdelgawad, and Kumar Yelamarthi. A smart wearable navigation system for visually impaired. pages 333–341, 07 2017.
[19] J. Bai, S. Lian, Z. Liu, K. Wang, and D. Liu. Smart guiding glasses for visually impaired people in indoor environment. IEEE Transactions on Consumer Electronics, 63(3):258–266, 2017.
[20] Quoc-Hung Nguyen, Vu Hai, Thanh-Hai Tran, and Quang-Hoan Nguyen. Developing a way-finding system on mobile robot assisting visually impaired people in an indoor environment. Multimedia Tools and Applications, 76, 01 2016.
[21] Thomas Kubitza Mauro Avila. Assistive wearable technology for visually impaired. pages 940–943. ACM, 04 2014.
[22] Joey van der Bie, Britte Visser, Jordy Matsari, Mijnisha Singh, Timon Van Hasselt, Jan Koopman, and Ben J. A. Kröse. Guiding the visually impaired through the environment with beacons. pages 385–388. ACM, 09 2016.
[23] João Guerreiro, Daisuke Sato, Dragan Ahmetovic, Eshed Ohn-Bar, Kris Kitani, and Chieko Asakawa. Virtual navigation for blind people: Transferring route knowledge to the real-world. International Journal of Human-Computer Studies, page 102369, 10 2019.
[24] Robert-Gabriel Lupu, Oana Mitrut, Andrei Stan, Florina Ungureanu, Kyriaki Kalimeri, and Alin Moldoveanu. Cognitive and affective assessment of navigation and mobility tasks for the visually impaired via electroencephalography and behavioral signals. Sensors, 20(20), 2020.
[25] Wan-Jung Chang, Liang-Bi Chen, Cheng-You Sie, and Ching-Hsiang Yang. An artificial intelligence edge computing-based assistive system for visually impaired pedestrian safety at zebra crossings. IEEE Transactions on Consumer Electronics, 67(1):3–11, 2021.
[26] Hanlu Ye, Meethu Malu, Uran Oh, and Leah Findlater. Current and future mobile and wearable device use by people with visual impairments. pages 3123–3132. ACM, 08 2015.
[27] Faustin Pégeot and Hideaki Goto. Scene text detection and tracking for a camera-equipped wearable reading assistant for the blind. volume 7729, pages 454–463, 11 2012.
[28] Luis González, Luis Serpa, Kevin Calle, A. Guzhnay-Lucero, Vladimir Robles-Bykbaev, and M. Mena-Salcedo. A low-cost wearable support system for visually disabled people. pages 1–5, 11 2016.
[29] Alvise Memo and Pietro Zanuttigh. Head-mounted gesture controlled interface for human-computer interaction. Multimedia Tools and Applications, 77, 12 2016.
[30] Michael Barney, Gilmar Brito Jonathan Kilner, Aida Araújo, and Meuse Nogueira. Sensory glasses for the visually impaired. pages 1–2. ACM, 04 2017.
[31] Md Shishir, Shahariar Fahim, Fairuz Habib, and Tanjila Farah. Eye assistant: Using mobile application to help the visually impaired. pages 1–4, 05 2019.
[32] B. Jiang, J. Yang, Z. Lv, and H. Song. Wearable vision assistance system based on binocular sensors for visually impaired users. IEEE Internet of Things Journal, 6(2):1375–1383, April 2019.
[33] Oleksandr Bogdan, Oleg Yurchenko, Oleksandr Bailo, François Rameau, Donggeun Yoo, and Inso Kweon. Intelligent Assistant for People with Low Vision Abilities, pages 448–462. 02 2018.
[34] Soo-Chang Pei and Yu-Ying Wang. Census-based vision for auditory depth images and speech navigation of visually impaired users. IEEE Transactions on Consumer Electronics, 57:1883–1890, 11 2011.
[35] Wan-Jung Chang, Liang-Bi Chen, Ming-Che Chen, Jian-Ping Su, Cheng-You Sie, and Ching-Hsiang Yang. Design and implementation of an intelligent assistive system for visually impaired people for aerial obstacle avoidance and fall detection. IEEE Sensors Journal, 20(17):10199–10210, 2020.
[36] Liang-Bi Chen, Jian-Ping Su, Ming-Che Chen, Wan-Jung Chang, Ching-Hsiang Yang, and Cheng-You Sie. An implementation of an intelligent assistance system for visually impaired/blind people. In 2019 IEEE International Conference on Consumer Electronics (ICCE), pages 1–2, 2019.
[37] Wan-Jung Chang, Liang-Bi Chen, and Yu-Zung Chiou. Design and implementation of a drowsiness-fatigue-detection system based on wearable smart glasses to increase road safety. IEEE Transactions on Consumer Electronics, 64(4):461–469, 2018.
[38] Liang-Bi Chen, Wan-Jung Chang, Jian-Ping Su, Ji-Yi Ciou, Yi-Jhan Ciou, Cheng-Chin Kuo, and Katherine Shu-Min Li. A wearable-glasses-based drowsiness-fatigue-detection system for improving road safety. In 2016 IEEE 5th Global Conference on Consumer Electronics, pages 1–2, 2016.
[39] Haotian Jiang, James Starkman, Menghan Liu, and Ming-Chun Huang. Food nutrition visualization on google glass: Design tradeoff and field evaluation. IEEE Consumer Electronics Magazine, 7:21–31, 05 2018.
[40] Ying Li and Anshul Sheopuri. Applying image analysis to assess food aesthetics and uniqueness. In 2015 IEEE International Conference on Image Processing (ICIP), pages 311–314. IEEE, 2015.
[41] Peter Washington, Catalin Voss, Nick Haber, Serena Tanaka, Jena Daniels, Carl Feinstein, Terry Winograd, and Dennis Wall. A wearable social interaction aid for children with autism. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pages 2348–2354. ACM, 2016.
[42] Peter Washington, Dennis Wall, Catalin Voss, Aaron Kline, Nick Haber, Jena Daniels, Azar Fazel, Titas De, Carl Feinstein, and Terry Winograd. Superpowerglass: A wearable aid for the at-home therapy of children with autism. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1:1–22, 09 2017.
[43] Zhihan Lv, Alaa Halawani, Shengzhong Fen, Shafiq Réhman, and Haibo Li. Reprint: Touch-less interactive augmented reality game on vision based wearable device. Personal and Ubiquitous Computing, 04 2015.
[44] Zibo Wang, Xi Wen, Song Yu, Xiaoqian Mao, Wei Li, and Genshe Chen. Navigation of a humanoid robot via head gestures based on global and local live videos on google glass. pages 1–6, 05 2017.
[45] Xi Wen, Yu Song, Wei Li, Genshe Chen, and Bin Xian. Rotation vector sensor-based remote control of a humanoid robot through a google glass. In 2016 IEEE 14th International Workshop on Advanced Motion Control (AMC), pages 203–207. IEEE, 2016.
[46] C. F. Xu, Y. F. Gong, W. Su, J. Cao, and F. B. Tao. Virtual video and real-time data demonstration for smart substation inspection based on google glasses. pages 5.–5., 01 2015.
[47] Antoine Widmer, Roger Schaer, Dimitrios Markonis, and Henning Müller. Facilitating medical information search using google glass connected to a content-based medical image retrieval system. volume 2014, 08 2014.


[48] Georg Lausegger, Michael Spitzer, and Martin Ebner. Omnicolor – a smart glasses app to support colorblind people. International Journal of Interactive Mobile Technologies (iJIM), 11:161–177, 07 2017.
[49] Asm Anam, Shahinur Alam, and M. Yeasin. Expression: A dyadic conversation aid using google glass for people with visual impairments. UbiComp 2014 - Adjunct Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 211–214, 01 2014.
[50] Alex D. Hwang and Eli Peli. An augmented-reality edge enhancement application for google glass. Optometry and Vision Science, 91:1021–1030, 2014.
[51] L. B. Neto, F. Grijalva, V. R. M. L. Maike, L. C. Martini, D. Florencio, M. C. C. Baranauskas, A. Rocha, and S. Goldenstein. A kinect-based wearable face recognition system to aid visually impaired users. IEEE Transactions on Human-Machine Systems, 47(1):52–64, Feb 2017.
[52] Hotaka Takizawa, Shotaro Yamaguchi, Mayumi Aoyagi, Nobuo Ezaki, and Shinji Mizuno. Kinect cane: An assistive system for the visually impaired based on three-dimensional object recognition. volume 19, pages 740–745, 12 2012.
[53] Dawon Kim and Yosoon Choi. Applications of smart glasses in applied sciences: A systematic review. Applied Sciences, 11(11), 2021.
[54] Micah Hodosh, Peter Young, and Julia Hockenmaier. Flickr8k dataset.
[55] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.
[56] Jacob Whitehill, Zewelanji Serpell, Yi-Ching Lin, Aysha Foster, and Javier R. Movellan. The faces of engagement: Automatic recognition of student engagement from facial expressions. IEEE Transactions on Affective Computing, 5(1):86–98, 2014.
[57] TS Ashwin and Ram Mohana Reddy Guddeti. Affective database for e-learning and classroom environments using indian students' faces, hand gestures and body postures. Future Generation Computer Systems, 108:334–348, 2020.
[58] TS Ashwin and Ram Mohana Reddy Guddeti. Automatic detection of students' affective states in classroom environment using hybrid convolutional neural networks. Education and Information Technologies, 25(2):1387–1415, 2020.
[59] TS Ashwin and Ram Mohana Reddy Guddeti. Unobtrusive behavioral analysis of students in classroom environment using non-verbal cues. IEEE Access, 7:150693–150709, 2019.
[60] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[61] Sinan Chen, Sachio Saiki, and Masahide Nakamura. Toward flexible and efficient home context sensing: Capability evaluation and verification of image-based cognitive apis. Sensors, 20(5):1442, 2020.

HAFEEZ ALI A received his B.Tech degree in information technology from National Institute of Technology Karnataka, Surathkal, India in 2019. He is currently working as an Analyst at Goldman Sachs, where he is working on developing software applications for financial services using state-of-the-art technology. His research interests include distributed computing, data analytics, pattern recognition, computer vision, computer networks, and software engineering.

SANJEEV U RAO received his B.Tech degree in information technology from National Institute of Technology Karnataka, Surathkal, India in 2019. He is currently working as a software developer in CitiBank, Pune. His research interests include deep learning, big data, and the internet of things.

SWAROOP RANGANATH received his B.Tech degree in information technology from National Institute of Technology Karnataka, Surathkal, India in 2019. He is currently working at Wipro, Bangalore as a Data Engineer. His research interests include Deep Learning, Reinforcement Learning, AI applications in IoT systems, advanced analytics in business insights, and Explainable AI.

ASHWIN T S received his B.E. degree from Visveswaraya Technological University, Belgaum, India, in 2011, and his M.Tech. degree from Manipal University, Manipal, India, in 2013. He received his Ph.D. degree from National Institute of Technology Karnataka Surathkal, Mangalore, India. He is currently working as a postdoctoral fellow at IIT Bombay, India. He has more than 35 reputed and peer-reviewed international conference and journal publications, including 5 book chapters. His research interests include Affective Computing, Human-Computer Interaction, Educational Data Mining, Learning Analytics, and Computer Vision applications. He is a Member of AAAC, ComSoc, IEEE, and ACM.

RAM MOHANA REDDY GUDDETI received his B.Tech from S.V. University, Tirupati, Andhra Pradesh, India in 1987; M.Tech from Indian Institute of Technology Kharagpur, India in 1993; and Ph.D. from The University of Edinburgh, U.K. in 2005. Currently, he is the Senior Professor, Department of Information Technology, National Institute of Technology Karnataka Surathkal, Mangalore, India. His research interests include Affective Computing, Big Data and Cognitive Analytics, Bio-Inspired Cloud and Green Computing, Internet of Things and Smart Sensor Networks, Social Multimedia, and Social Network Analysis. He is a Senior Member of both IEEE and ACM, Life Fellow of IETE (India), Life Member of ISTE (India), and Life Member of Computer Society of India. He has more than 200 research publications in reputed / peer-reviewed International Journals, Conference Proceedings, and Book Chapters.
