Abstract – In this paper, we discuss the use of the Scale Invariant Feature Transform to match areas between stereo images. The three-dimensional locations of matched points are then computed. Each matched area is further matched against a database of known objects. After projecting the three-dimensional locations onto a horizontal plane, the spatial relationships between pairs of objects are described linguistically using a system of fuzzy rules. We are exploring the technique to facilitate human-like communication with a robot.

Index Terms – Stereo Vision, SIFT, Spatial Language, Fuzzy Logic, Human-Robot Interaction

I. INTRODUCTION

Research into three-dimensional representation of the world as a fusion of multiple sensors has been going on for decades. The research has been done using different modalities, including sonar, lasers, and optical sensors. With increased computational power as well as new matching algorithms, the speed and complexity of scene description have increased greatly. This paper describes the use of stereo vision to produce a linguistic description of a viewed scene. The motivation for this work is to simplify human-robot interaction. Because many early algorithms used simplified problem domains or took a very simplified view of the world around the robot, the robot had great limitations in its abilities. One example of a simplified world view is to use sonars and dead reckoning from wheel encoders to build maps of a scene. This has led to great successes in robotic navigation but comes up short when interacting with the real three-dimensional world. At best, the robot is only able to understand and describe the world in a two-dimensional plane.

In [1], Lowe used his previous work in image matching [2] to match areas between three cameras mounted on a mobile robot. The three-dimensional locations of these matched areas relative to the robot could then be determined. At each time step, image matching continued, and a new set of three-dimensional matched points would be created. With the use of wheel encoders, the orientation of the robot could be determined in relation to the robot at previous time steps. The locations of three-dimensional matched points could then be further resolved with a least-squares fit to the locations of previously defined points. Lowe’s SLAM system is very effective, but the maps quickly become cluttered with large numbers of generic points. With no grouping by class distinction, there appears to be no easy mode of communication between the robot and the user about objects in the scene. This paper focuses on the relative three-dimensional locations of objects in a scene as a common language between the robot and the user.

The system described in this paper is able to understand and linguistically describe the real three-dimensional world in a presented scene. This includes using descriptors such as “to the left”, “to the right”, “in front”, “behind” and “surround”. These descriptions are built using a fuzzy rule base that takes as input the values of object points found using stereo matching and projected onto a horizontal plane. This allows natural communication between a robot and a user.

The paper is organized as follows. We first review the SIFT algorithm and then discuss how it is used for stereo vision. The method for generating spatial linguistic descriptions is then described, followed by several examples illustrating the technique. We conclude with a summary and discussion of future work.

II. SIFT DESCRIPTION

The stereo vision matching relies upon the Scale Invariant Feature Transform (SIFT) algorithm proposed by Lowe [3]. SIFT finds unique locations (keypoints) in the scale space of a training image that can also be easily found in future test images. Scale space is built by incrementally blurring an original image with a Gaussian kernel at successively larger scales.
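As a concrete illustration of this scale-space construction (a minimal Python sketch, not the implementation used in this work; the image file name and parameter values are placeholders), one octave of the Gaussian pyramid and its difference-of-Gaussian stack, whose local extrema are the SIFT keypoint candidates, can be built as follows.

```python
# Minimal sketch of the Gaussian scale space used by SIFT [3].
# Illustration only; the file name and parameters are placeholders.
import cv2
import numpy as np

def gaussian_scale_space(gray, sigma0=1.6, scales_per_octave=3):
    """Return one octave of blurred images and their difference-of-Gaussian stack."""
    k = 2.0 ** (1.0 / scales_per_octave)              # multiplicative step between scales
    blurred = []
    for i in range(scales_per_octave + 3):            # extra scales so DoG spans a full octave
        sigma = sigma0 * (k ** i)
        blurred.append(cv2.GaussianBlur(gray, (0, 0), sigma))
    # SIFT keypoint candidates are local extrema of the difference-of-Gaussian images.
    dog = [cv2.subtract(blurred[i + 1], blurred[i]) for i in range(len(blurred) - 1)]
    return blurred, dog

gray = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
blurred, dog = gaussian_scale_space(gray.astype(np.float32))
```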
* This work is supported by NSF under grant EIA-0325641.
When a match is found, the keypoint is labeled by the name of the node in the database. If a keypoint is not matched, it is held as part of an unknown object. These unknown keypoints might reside close to known objects in the scene and could possibly be added to the database of known keypoints using clustering.
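The following is a hedged sketch of this labeling step using OpenCV’s SIFT detector and a brute-force matcher with Lowe’s ratio test. The database layout (a dictionary mapping an object name to stored SIFT descriptors) and the file names are assumptions made for illustration, not the structure used in the paper.

```python
# Sketch: label scene keypoints by matching them against a database of known objects.
# The database format and file names below are hypothetical.
import cv2
import numpy as np

def label_keypoints(scene_desc, database, ratio=0.8):
    """Return a label (object name or None) for every scene descriptor."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    labels = [None] * len(scene_desc)
    best_dist = [np.inf] * len(scene_desc)
    for name, obj_desc in database.items():           # obj_desc: float32 SIFT descriptors
        for pair in matcher.knnMatch(scene_desc, obj_desc, k=2):
            if len(pair) < 2:
                continue
            m, n = pair                               # two nearest neighbours for the ratio test
            if m.distance < ratio * n.distance and m.distance < best_dist[m.queryIdx]:
                best_dist[m.queryIdx] = m.distance
                labels[m.queryIdx] = name             # unmatched keypoints stay None ("unknown")
    return labels

sift = cv2.SIFT_create()
gray = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
keypoints, scene_desc = sift.detectAndCompute(gray, None)
database = {"mug": np.load("mug_desc.npy"),           # hypothetical stored descriptors
            "can": np.load("can_desc.npy")}
labels = label_keypoints(scene_desc, database)
```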
After matching, there is a list of keypoints in three-dimensional space with labels. Most of these keypoints will correspond to objects located in the scene. The scene can be represented with a three-dimensional modeling tool. For example, Fig. 5 shows an OpenGL scene with the keypoints of known objects plotted in white. Projections onto the horizontal plane are shown in cyan, and projections onto the vertical plane are shown in red.
Fig. 4 Matching of the left and right image from a single time step. The more vertical the matching line, the further the point is from the cameras. Only four false alarms occurred in the matching of these two images.

Fig. 5 (a) A real-world scene taken using two parallel cameras. The cameras are off to the lower left of this image. (b) The matched keypoints in white and their projections onto the horizontal (cyan) and vertical (red) planes.
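For reference, the sketch below shows the standard parallel-camera stereo geometry that turns a matched keypoint pair into a three-dimensional location; the larger the disparity between the left and right image positions, the closer the point (cf. the note in the Fig. 4 caption). The focal length, baseline, and principal point are placeholder values, not the calibration used in the paper.

```python
# Sketch of standard stereo triangulation for parallel cameras (placeholder calibration).
import numpy as np

def triangulate(xl, yl, xr, f_px=700.0, baseline_cm=10.0, cx=320.0, cy=240.0):
    """Return (X, Y, Z) in cm for a keypoint at (xl, yl) in the left image and xr in the right."""
    disparity = xl - xr                      # larger disparity -> point closer to the cameras
    Z = f_px * baseline_cm / disparity       # depth
    X = (xl - cx) * Z / f_px                 # lateral offset
    Y = (yl - cy) * Z / f_px                 # height
    return X, Y, Z

X, Y, Z = triangulate(352.0, 210.0, 331.0)
horizontal_projection = (X, Z)               # the (X, Z) values used by the description system
```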
IV. LINGUISTIC DESCRIPTION

For this preliminary work on the linguistic description of three-dimensional scenes, the descriptions are simple. When asked for the relation between two objects, the program responds that one object is to the left of, to the right of, in front of, or behind the other object. These descriptions all carry a degree of truth that is used to further resolve the location of an object in the scene; the associated linguistic hedges include “slightly”, “somewhat” and “extends to”.
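The actual rule base follows the fuzzy methods of [6] and [7]; purely as a schematic illustration of how a degree of truth could select one of these hedges, consider the sketch below (the break points are invented for illustration).

```python
# Schematic only: map a degree of truth for a primitive direction to a hedged phrase.
# The thresholds are illustrative assumptions, not the rule base used in the paper.
def hedge(direction, truth):
    """truth is a degree of truth in [0, 1] for the given direction."""
    if truth > 0.9:
        return f"is {direction}"
    if truth > 0.7:
        return f"is mostly {direction}"
    if truth > 0.4:
        return f"is somewhat {direction}"
    if truth > 0.1:
        return f"is slightly {direction}"
    return f"extends to {direction}"

print("The mug", hedge("to the right of the hat", 0.75))
# -> The mug is mostly to the right of the hat
```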
The inputs to the system are the X and Z locations of the keypoints, which are the projections onto the horizontal plane, as shown in Fig. 5(b). On this horizontal plane, the objects appear as clusters of keypoints. Using the class labels, the desired keypoints can quickly be found and input to the linguistic description system.
False alarms that reside far outside a proper cluster on the horizontal plane are first removed. The median X and Z values for each cluster of keypoints are found. Any keypoints lying outside a fixed radius of this location are thrown away. In future work, a clustering algorithm will be used to find more robust clusters of keypoints.
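A minimal sketch of this filtering step is shown below; the radius is a placeholder value rather than the threshold used in the paper.

```python
# Sketch: discard keypoints whose horizontal-plane projection lies outside a fixed
# radius of the cluster median (placeholder radius).
import numpy as np

def remove_false_alarms(xz, radius_cm=20.0):
    """xz: (N, 2) array of (X, Z) projections for one labeled object."""
    median = np.median(xz, axis=0)                    # median X and Z of the cluster
    dist = np.linalg.norm(xz - median, axis=1)
    return xz[dist <= radius_cm]                      # keep only points inside the radius

xz = np.array([[10.0, 80.0], [12.0, 82.0], [11.0, 79.0], [95.0, 150.0]])  # last point is a false alarm
clean = remove_false_alarms(xz)
```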
The remaining keypoints are run through the qhull program [4] to find a convex hull. The set of points representing the convex hull of the object is then saved to an external file.
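As an illustration, the same step can be reproduced with scipy.spatial.ConvexHull, which wraps the Qhull library cited in [4]; the file names below are placeholders.

```python
# Sketch: convex hull of an object's (X, Z) projections via Qhull (through SciPy).
import numpy as np
from scipy.spatial import ConvexHull

xz = np.load("mug_xz.npy")               # hypothetical cleaned (X, Z) projections
hull = ConvexHull(xz)
hull_points = xz[hull.vertices]          # hull vertices in counter-clockwise order
np.savetxt("mug_hull.txt", hull_points)  # saved to an external file, as described above
```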
The inputs to the linguistic description engine are the convex hulls of two objects A and B. The input can be thought of as “What is the relation of A to B?” The Histogram of Forces [5] is used to process the convex hulls of the two objects. The result is a histogram of 360 degrees divided into 180 bins. The domain corresponds to all angular directions around an origin. Each bin represents the strength of the relation between the two objects at that angle. The process of finding the value of a single bin is shown in Fig. 6. The lengths of the intersections between the lines at the given angle and the objects are first computed. The force is computed using both a constant force and a gravitational force, which takes into account the distance between the two objects. The sum of the forces for all lines at a given rotation is then added to the proper bin of the histogram. One example of a histogram is shown in Fig. 7.
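The sketch below is only a rough point-pair approximation of this idea, accumulating a constant and a gravitational (1/d²) force between sampled points of the two objects into angular bins; the actual Histogram of Forces of [5] integrates forces along parallel lines through the objects and is not reproduced here. The file names are placeholders.

```python
# Crude approximation of a force histogram between two point sets A and B.
# Not the algorithm of [5]; for illustration of the histogram structure only.
import numpy as np

def force_histogram(points_a, points_b, bins=180, w_constant=1.0, w_gravity=1.0):
    """Histogram over 360 degrees (in `bins` bins) of forces exerted by B on A."""
    hist = np.zeros(bins)
    for pa in points_a:
        diff = pa - points_b                           # vectors from B points to this A point
        dist = np.linalg.norm(diff, axis=1) + 1e-9
        angles = np.degrees(np.arctan2(diff[:, 1], diff[:, 0])) % 360.0
        force = w_constant + w_gravity / dist ** 2     # constant + gravitational contributions
        idx = (angles / (360.0 / bins)).astype(int) % bins
        np.add.at(hist, idx, force)                    # accumulate into the angular bins
    return hist

hist = force_histogram(np.load("hat_hull.npy"), np.load("can_hull.npy"))   # hypothetical hulls
main_direction_deg = np.argmax(hist) * (360.0 / len(hist))  # dominant direction of A relative to B
```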
Fig. 6 Convex hulls of two objects in a scene. The lines correspond to one angle, and the intersection of those lines with the objects represents the relation between the two objects at that angle. In the real implementation there are many more lines for each angle to give better resolution.

Fig. 7 An output example from the Histogram of Forces. This histogram describes the relation of the two objects in Fig. 6.

The distances between the objects and the cameras were in the range of 50 to 200 cm, similar to the range used for training. Of the 100 unique image pairs, 76 resulted in suitable convex hulls for linguistic description. The other 24 did not have enough matched keypoints to build convex hulls for all three objects. Four of these pairs are shown in Figs. 8-11. Each shows the original images, the 3D representation of the scene, the horizontal-plane projections, the convex hulls, and the description of the scene. The figures show that the descriptions accurately describe the scenes. More examples of linguistic descriptions of spatial relations can be found in [7] and [8].

Fig. 8 shows a scene with the hat between, but slightly behind, the mug and the can. The three-dimensional figure shows where the keypoints reside in three-space as well as the horizontal and vertical projections. The projections onto the horizontal plane, and especially the convex hulls, show that each object is separate from the others.
1. The hat is to the left of the can but extends to the rear relative to the can.
2. The mug is mostly to the right of the hat but somewhat to the rear.
3. The can is mostly to the right of the mug but somewhat to the front.

Fig. 9 A scene showing the hat rotated with its bill up. (a) Left camera. (b) Right camera. (c) A three-dimensional representation of matched points. (d) The keypoint projections onto the horizontal plane. (e) The convex hulls of the objects. (f) Three of the six possible linguistic descriptions between objects.
It is difficult to discern the mug from the can in the three-dimensional representation of the keypoints. The projections onto the horizontal plane add little toward differentiating one from the other. However, matching the keypoints in the scene to the database of known keypoints and then finding the individual convex hulls results in a very descriptive figure. The linguistic description says that the mug is inside the can. This makes sense, as the points project to roughly the same area on the horizontal plane. Future work will use the projection onto the vertical plane, which will differentiate the two objects by saying the mug is above the can.
1. The mug is surrounded by the can.
2. The hat is mostly in front of the mug but somewhat to the right.
3. The can is mostly behind the hat but somewhat to the left.

Fig. 11 A scene showing a mug above a can as well as a hat. The two- and three-dimensional plots of the keypoints do not show the separation between objects, but the convex hull image shows the objects’ locations. (a) Left camera. (b) Right camera. (c) A three-dimensional representation of matched points. (d) The keypoint projections onto the horizontal plane. (e) The convex hulls of the objects. (f) Three of the six possible linguistic descriptions between objects.

Matched keypoints are projected onto a horizontal plane, where their convex hulls can be found and their spatial relations linguistically described with great accuracy.

This paper has touched on only a small number of the possibilities of linguistic scene description by robots. It would seem possible to store and describe entire environments for better human-robot interaction. The next obvious step would be to take the descriptions into the third dimension. This can be accomplished by projecting keypoints onto the vertical plane and augmenting the grammar output of the linguistic description code.
A further extension is to build complete three-dimensional models of database objects out of keypoints. Then, using the change in location of matched keypoints between the scene keypoints and the database keypoints, an overall affine transformation can be approximated. The entire three-dimensional object of keypoints can then be projected into the scene. Because the cameras can only see one side of an object at a time, knowing how the entire object resides in three-space would greatly increase the accuracy of linguistic spatial descriptions.
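A minimal sketch of this approximation, assuming the matched database and scene keypoint locations are already in corresponding order (the file names are hypothetical), is a least-squares fit of a 3D affine transformation:

```python
# Sketch: least-squares 3D affine fit from matched keypoints, then project the full model.
import numpy as np

def fit_affine_3d(db_pts, scene_pts):
    """Return A (3x3) and t (3,) such that scene_pts is approximately db_pts @ A.T + t."""
    M = np.hstack([db_pts, np.ones((len(db_pts), 1))])   # homogeneous database coordinates
    params, *_ = np.linalg.lstsq(M, scene_pts, rcond=None)
    return params[:3].T, params[3]

matched_db = np.load("mug_matched_db_xyz.npy")        # hypothetical: matched model keypoints
matched_scene = np.load("mug_matched_scene_xyz.npy")  # their observed locations in the scene
A, t = fit_affine_3d(matched_db, matched_scene)
full_model = np.load("mug_model_xyz.npy")             # all keypoints of the stored 3D model
projected_model = full_model @ A.T + t                # entire object placed into the scene
```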
The use of working memory can also improve the performance of a real-world robot. Working memory can store only a small number of “chunks” of information about a current task. In this case, the chunks could be the objects in a scene. Within each chunk are the object’s location and possibly a desired goal for the object.

It might also be possible to map the three-dimensional locations of keypoints to unknown objects and cluster them on the fly. These keypoints could then be used to build a three-dimensional model of a new object. A robot would then be able to move through an environment, notice an unknown object, and create a new database entry for it. A robot with such adaptive capabilities will allow for more natural interaction with humans using very human-like spatial descriptions.
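One possible sketch of such on-the-fly clustering uses DBSCAN; the paper does not prescribe a particular algorithm, and the parameter values and file name below are illustrative.

```python
# Sketch: cluster unlabeled 3D keypoints into candidate new objects with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

unknown_xyz = np.load("unknown_keypoints_xyz.npy")    # hypothetical unlabeled keypoints
labels = DBSCAN(eps=5.0, min_samples=8).fit_predict(unknown_xyz)
for cluster_id in set(labels) - {-1}:                 # -1 marks DBSCAN noise points
    candidate = unknown_xyz[labels == cluster_id]     # could seed a new database entry
    print("candidate object", cluster_id, "with", len(candidate), "keypoints")
```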
REFERENCES

[1] S. Se, D. G. Lowe and J. Little, “Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks,” International Journal of Robotics Research, vol. 21, no. 8, pp. 735-758, 2002.
[2] D. G. Lowe, “Object recognition from local scale-invariant features,” Proceedings of the Seventh International Conference on Computer Vision (ICCV ’99), pp. 1150-1157, Kerkyra, Greece, Sep. 1999.
[3] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[4] C. B. Barber, D. P. Dobkin, and H. T. Huhdanpaa, “The Quickhull algorithm for convex hulls,” ACM Transactions on Mathematical Software, vol. 22, no. 4, pp. 469-483, Dec. 1996, http://www.qhull.org
[5] P. Matsakis and L. Wendling, “A new way to represent the relative position between areal objects,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 7, pp. 634-643, 1999.
[6] P. Matsakis, J. Keller, L. Wendling, J. Marjamaa, and O. Sjahputera, “Linguistic description of relative positions in images,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 31, no. 4, pp. 573-588, 2001.
[7] M. Skubic, P. Matsakis, G. Chronis and J. M. Keller, “Generating multi-level linguistic spatial descriptions from range sensor readings using the histogram of forces,” Autonomous Robots, vol. 14, no. 1, pp. 51-69, Jan. 2003.
[8] M. Skubic, et al., “Spatial language for human-robot dialogs,” IEEE Transactions on Systems, Man, and Cybernetics, Part C, Special Issue on Human-Robot Interaction, vol. 34, no. 2, pp. 154-167, May 2004.