
2005 IEEE International Workshop on Robots and Human Interactive Communication

Linguistic Spatial Relations of Three Dimensional Scenes Using SIFT Keypoints*
Robert H. Luke, Samuel N. Blisard, James M. Keller,
Marjorie Skubic
Electrical and Computer Engineering
University of Missouri
Columbia, MO, USA, 65201
{rhl3db,snbfg8}@mizzou.edu,{kellerj,skubicm}@missouri.edu

* This work is supported by NSF under grant EIA-0325641.

Abstract – In this paper, we discuss the use of the Scale Invariant Feature Transform to match areas between stereo images. The three-dimensional locations of the matched points are then computed. Each matched area is further matched to a database of known objects. After projecting the three-dimensional locations onto a horizontal plane, the spatial relationships between pairs of objects are described linguistically using a system of fuzzy rules. We are exploring the technique to facilitate human-like communication with a robot.

Index Terms – Stereo Vision, SIFT, Spatial Language, Fuzzy Logic, Human-Robot Interaction

I. INTRODUCTION

Research into three-dimensional representation of the world as a fusion of multiple sensors has been going on for decades. The research has been done using different modalities, including sonar, lasers, and optical sensors. With increased computational power as well as new matching algorithms, the speed and complexity of scene description have increased greatly.

This paper describes the use of stereo vision to produce a linguistic description of a viewed scene. The motivation for this work is to simplify human-robot interaction. Because many early algorithms used simplified problem domains or took a very simplified view of the world around the robot, the robot had great limitations in its abilities. One example of a simplified world view is using sonars and dead reckoning from wheel encoders to build maps of a scene. This has led to great successes in robotic navigation but comes up short when interacting with the real three-dimensional world. At best, the robot is only able to understand and describe the world in a two-dimensional plane.

In [1], Lowe used his previous work in image matching [2] to match areas between three cameras mounted on a mobile robot. The three-dimensional locations of these matched areas relative to the robot could then be determined. At each time step, image matching continued and a new set of three-dimensional matched points was created. With the use of wheel encoders, the orientation of the robot could be determined relative to its pose at previous time steps. The locations of three-dimensional matched points could then be further refined with a least-squares fit to the locations of previously defined points.

Lowe's SLAM system is very effective, but the maps quickly become cluttered with large numbers of generic points. With no grouping by class distinction, there is no easy mode of communication between the robot and the user about objects in the scene. This paper focuses on the relative three-dimensional locations of objects in a scene as a common language between the robot and the user.

The system described in this paper is able to understand and linguistically describe the real three-dimensional world in a presented scene. This includes using descriptors such as "to the left", "to the right", "in front", "behind", and "surround". These descriptions are built using a fuzzy rule base that takes the values of object points found using stereo matching, projected onto a horizontal plane. This allows natural communication between a robot and a user.

The paper is organized as follows. We first review the SIFT algorithm and then discuss how it is used for stereo vision. The method for generating spatial linguistic descriptions is then described, followed by several examples illustrating the technique. We conclude with a summary and discussion of future work.

II. SIFT DESCRIPTION

The stereo vision matching relies upon the Scale Invariant Feature Transform (SIFT) algorithm proposed by Lowe [3]. SIFT finds unique locations (keypoints) in the scale space of a training image that can also be easily found in future test images. Scale space is built by incrementally blurring an original image by a Gaussian



kernel with an increasing sigma. Each image in scale space is subtracted from its immediate neighbor, which results in a series of images that roughly approximates the derivative of scale space. Each pixel of each image in the derivative of scale space is checked against its 8 immediate neighbors at its own scale as well as the 9 neighbors at each of the scales above and below. If the pixel is greater than, or less than, all 26 of these neighbors, the location is an extremum and is therefore a keypoint location. Fig. 1(a) shows the schematic of extrema detection, while Fig. 1(b) displays the keypoints found on an image of a coffee mug.
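To make the extrema test concrete, the following is a minimal single-octave sketch in Python (NumPy and SciPy assumed; names and parameter values are illustrative, and Lowe's full implementation adds multiple octaves, subpixel refinement, and low-contrast/edge rejection):

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(image, sigma0=1.6, k=2 ** 0.5, levels=5):
    """Single-octave sketch: blur incrementally, difference neighboring
    blur levels, then flag pixels that are strict extrema among their
    26 neighbors in the difference-of-Gaussians stack."""
    blurred = [gaussian_filter(image.astype(float), sigma0 * k ** i)
               for i in range(levels)]
    dog = np.stack([blurred[i + 1] - blurred[i] for i in range(levels - 1)])
    keypoints = []
    for s in range(1, dog.shape[0] - 1):              # skip first/last scale
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                v = dog[s, y, x]
                # strict extremum: greater (or smaller) than all 26 neighbors
                if (v == cube.max() or v == cube.min()) and (cube == v).sum() == 1:
                    keypoints.append((s, y, x))
    return keypoints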
When a keypoint location is determined, it must
then be described. The description is based on a
biological model of complex neurons in the primary
visual cortex using the gradients of pixels around the
keypoints at the closest scale. Fig. 2 shows the process
of building a single keypoint description. After finding
a keypoint at an appropriate scale, a histogram is built
using the gradients around the keypoint. The length of
each gradient is added to the closest histogram
orientation bin. The bin with the largest value is
determined as the major direction of the keypoint. The
area around the keypoint is then rotated by the negative Fig. 2 The description of one keypoint. After a keypoint is found in
of the angle determined from the maximum bin index. the original image, the area around the keypoint at the scale it was
After being oriented to the zero-angle, the keypoint can detected is further examined. First the gradients are determined and a
major orientation is found. The sub image and its related gradient are
be described. Sixteen histograms are distributed around rotated by this major angle. The gradients are then added to the
the keypoint location in the rotated area. The gradient appropriate histogram bins, which gives the final keypoint
magnitudes are added to the bin of their nearest description.
histograms. This is depicted in lower left of Fig. 2.
Each of the 16 histograms has 8 bins which results in a
128 dimensional feature vector. After normalization,
the vector is stored in a database for future matching.
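The descriptor layout can be sketched as follows. This toy version assumes a 16x16 patch of gradient magnitudes and orientations already rotated to the dominant orientation, and omits the Gaussian weighting and trilinear interpolation used by the full algorithm:

import numpy as np

def describe_patch(mag, ang):
    """mag, ang: 16x16 arrays of gradient magnitude and orientation
    (radians), already rotated so the major orientation is zero.
    Returns the 16 histograms x 8 bins = 128-dimensional vector."""
    desc = np.zeros((4, 4, 8))
    for i in range(16):
        for j in range(16):
            cell = (i // 4, j // 4)                    # which of the 16 histograms
            b = int((ang[i, j] % (2 * np.pi)) / (2 * np.pi) * 8) % 8
            desc[cell[0], cell[1], b] += mag[i, j]     # add gradient magnitude
    v = desc.ravel()
    return v / (np.linalg.norm(v) + 1e-12)             # normalize, as in the text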
Matching is performed by finding the two closest keypoints in the database to the current keypoint using Euclidean distance. If the ratio of the smallest distance to the second smallest distance is less than a given threshold (currently 0.6), then it is concluded that the two keypoints match. Fig. 3 shows several matched points between a training image and a test image.
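This ratio test is straightforward to express in code; a minimal NumPy sketch using the 0.6 threshold from the text (function and argument names are illustrative):

import numpy as np

def match_keypoint(query, database, ratio=0.6):
    """query: (128,) descriptor; database: (N, 128) stored descriptors.
    Returns the index of the matching database descriptor, or None."""
    d = np.linalg.norm(database - query, axis=1)   # Euclidean distances
    first, second = np.argsort(d)[:2]
    # accept only if clearly better than the runner-up
    return int(first) if d[first] / d[second] < ratio else None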

Fig. 3 Recognition using SIFT. Keypoint feature vectors from the test image are matched to the database generated on the training image using a simple Euclidean distance test.

Fig. 1 Basic SIFT components. (a) Extrema detection in scale space; (b) keypoints found on a hat training image.

III. STEREO VISION

A single image contains several hundred keypoints, and comparing the databases of two stereo images results in a few hundred matched keypoints. Two matched images are shown in Fig. 4. The distance between the two cameras is known, as are the resolution and viewing angle of the cameras. Using these known values, the three-dimensional locations of the matched keypoints can be determined with reasonable accuracy. These matched keypoints are then further matched to a database of known objects. If a match is found, the keypoint is labeled with the name of its node in the database. If a keypoint is not matched, it is held as part of an unknown object. These unknown keypoints might reside close to known objects in the scene and could possibly be added to the database of known keypoints using clustering.

After matching, there is a list of labeled keypoints in three-dimensional space. Most of these keypoints will correspond to objects located in the scene. The scene can be represented with a three-dimensional modeling tool. For example, Fig. 5 shows an OpenGL scene with keypoints of known objects plotted in white; projections onto the horizontal plane are shown in cyan and projections onto the vertical plane are shown in red.

False alarms that reside far outside a proper cluster on the horizontal plane are first removed. The median X and Z values of each cluster of keypoints are found, and any keypoints lying outside a fixed radius of this location are thrown away. In future work, a clustering algorithm will be used to find more robust clusters of keypoints.

The remaining keypoints are run through the qhull program [4] to find a convex hull. The set of points representing the convex hull of the object is then saved to an external file.
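The paper does not spell out its triangulation, so the sketch below assumes standard rectified-stereo geometry (depth = focal length x baseline / disparity) and combines it with the median-radius filter and convex hull steps just described; scipy.spatial.ConvexHull wraps the same Qhull library as [4]. All names and the radius value are illustrative:

import numpy as np
from scipy.spatial import ConvexHull   # SciPy wraps the Qhull library

def triangulate(xl, xr, y, f, baseline):
    """Location of one matched keypoint pair under rectified stereo.
    xl, xr: horizontal pixel coordinates (left/right, relative to the
    image center); y: vertical coordinate; f: focal length in pixels;
    baseline: distance between the two cameras."""
    disparity = xl - xr
    Z = f * baseline / disparity      # depth
    X = xl * Z / f                    # lateral position
    Y = y * Z / f                     # height
    return np.array([X, Y, Z])

def object_hull(points_3d, radius=20.0):
    """Median-radius false-alarm removal, then the convex hull of the
    horizontal (X, Z) projection, as described in the text."""
    xz = points_3d[:, [0, 2]]                        # project onto horizontal plane
    center = np.median(xz, axis=0)                   # median X and Z of the cluster
    keep = np.linalg.norm(xz - center, axis=1) < radius
    hull = ConvexHull(xz[keep])
    return xz[keep][hull.vertices]                   # ordered hull vertices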

Fig. 4 Matching of the left and right images from a single time step. The more vertical the matching line, the further the point is from the cameras. Only four false alarms occurred in the matching of these two images.

Fig. 5 (a) A real-world scene taken using two parallel cameras. The cameras are off to the lower left of this image. (b) The matched keypoints in white and their projections onto the horizontal (cyan) and vertical (red) planes.

IV. LINGUISTIC DESCRIPTION
For this preliminary work on the linguistic description of three-dimensional scenes, the descriptions are simple. When asked the relation of two objects, the program responds that one object is to the left of, to the right of, in front of, or behind the other object. These descriptions all have a degree of truth used to further resolve the location of an object in the scene; the associated linguistic hedges include "slightly", "somewhat", and "extends to".

The inputs to the system are the X and Z locations of the keypoints, which are the projections onto the horizontal plane, as shown in Fig. 5(b). On this horizontal plane, the objects end up as clusters of keypoints. Using the class labels, the desired keypoints can quickly be found and input to the linguistic description system.

The inputs to the linguistic description engine are the convex hulls of two objects A and B. The input can be thought of as "What is the relation of A to B?" The Histogram of Forces [5] is used to process the convex hulls of the two objects. The result is a histogram over 360 degrees divided into 180 bins; the domain corresponds to all angular directions around an origin, and each bin represents the strength of the relation between the two objects at that angle. The process of finding the value of a single bin is shown in Fig. 6. The lengths of the intersections between the lines at the given angle and the objects are first computed. The force is computed using both a constant force and a gravitational force, which takes into account the distance between the two objects. The sum of the forces for all lines at a given rotation is then added to the proper bin of the histogram. One example of a histogram is shown in Fig. 7.

Fig. 6 Convex hulls of two objects in a scene. The lines correspond to one angle, and the intersections of those lines with the objects represent the relation between the two objects at that angle. In the real implementation there are many more lines for each angle, to get better resolution.

Fig. 7 An output example from the Histogram of Forces. This histogram describes the relation of the two objects in Fig. 6.
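As a rough stand-in for the real Histogram of Forces, which integrates forces along longitudinal lines as in Fig. 6 and [5], the following point-based approximation (closer to a histogram of angles) conveys the idea of 180 angular bins accumulating constant or gravitational forces; it is an illustration, not the method of [5]:

import numpy as np

def force_histogram(A, B, nbins=180, r=0.0):
    """A, B: (N, 2) and (M, 2) arrays of horizontal (X, Z) projections.
    r = 0 mimics the constant force; r = 2 weights each pair by 1/d^2,
    like the gravitational force that accounts for object distance."""
    hist = np.zeros(nbins)
    for a in A:
        for b in B:
            d = a - b
            dist = np.hypot(d[0], d[1])
            if dist == 0:
                continue
            theta = np.arctan2(d[1], d[0]) % (2 * np.pi)  # direction of A seen from B
            hist[int(theta / (2 * np.pi) * nbins) % nbins] += dist ** -r
    return hist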

This histogram is then used as the input to the fuzzy rule base, with rules such as:

If the histogram values around 0 degrees are high, then there is an object to the right.

If the histogram values around 90 degrees are high, then there is an object behind.

The fuzzy outputs of these rules are then summed and defuzzified to get one final direction as a mixture of the truths of these rules. There is also a secondary fuzzy rule base that handles the special case of one object surrounding another. The description has a primary direction as well as a possible secondary direction. The primary direction is the main direction ("left", "right", "in front", or "behind") with the greatest forces; the secondary direction is the direction with the second greatest force. A more thorough description can be found in [6].
can be found in [6].
V. RESULTS

One hundred pairs of images were taken of a scene containing a hat, a coffee mug, and a can. The objects were moved into several different orientations to test the accuracy of the description of the scene. The distances between the objects and the cameras were in the range of 50 to 200 cm, similar to the range used for training.

Of the 100 unique image pairs, 76 resulted in suitable convex hulls for linguistic description; the other 24 did not have enough matched keypoints to build convex hulls for all three objects. Four of these pairs are shown in Figs. 8-11. Each figure shows the original images, the 3D representation of the scene, the horizontal plane projections, the convex hulls, and the description of the scene. The figures show that the descriptions accurately describe the scene. More examples of linguistic descriptions of spatial relations can be found in [7] and [8].

Fig. 8 shows a scene with the hat between, but slightly behind, the mug and the can. The three-dimensional figure shows where the keypoints reside in three-space as well as the horizontal and vertical projections. The projections onto the horizontal plane, and especially the convex hulls, show that each object is separate from the others.

1. The hat is mostly to the left of the mug but somewhat behind.
2. The mug is mostly to the right of the can but somewhat to the front.
3. The can is front-left of the hat.

Fig. 8 A scene showing a hat between and slightly behind a mug and a can. (a) Left camera. (b) Right camera. (c) A three-dimensional representation of matched points. (d) The keypoint projections onto the horizontal plane. (e) The convex hulls of the objects. (f) Three of the six possible linguistic descriptions between objects.
1. The hat is to the left of the can but extends to the rear relative to the can.
2. The mug is mostly to the right of the hat but somewhat to the rear.
3. The can is mostly to the right of the mug but somewhat to the front.

Fig. 9 A scene showing the hat rotated with its bill up. (a) Left camera. (b) Right camera. (c) A three-dimensional representation of matched points. (d) The keypoint projections onto the horizontal plane. (e) The convex hulls of the objects. (f) Three of the six possible linguistic descriptions between objects.

In Fig. 9, the hat is turned on its side to show that SIFT can detect objects in many different orientations, as it has reasonable invariance to affine transformations. The three-dimensional figure is much more dense due to the many keypoints that relate to the trash can; they are more easily noticed in the projection onto the horizontal plane. It is interesting to note the odd shape of the convex hull of the hat. This is due to keypoints being found on the bill as well as on top of the hat, points which are several centimeters away from each other. The algorithm has no problem handling this case.

1. The hat is mostly to the right of the mug but somewhat to the front.
2. The mug is to the left of the can.
3. The can is mostly to the right of the hat but somewhat behind.

Fig. 10 A scene showing a hat partially obscuring the view of a can. With only a small portion of the left image showing the can, a large number of keypoints are still found. The separation of objects can be easily seen in all graphs. (a) Left camera. (b) Right camera. (c) A three-dimensional representation of matched points. (d) The keypoint projections onto the horizontal plane. (e) The convex hulls of the objects. (f) Three of the six possible linguistic descriptions between objects.

Fig. 10 is of interest due to the partial overlap of the hat and the can, which leaves only a small area of the can visible in the left image. Because SIFT uses such small local areas for matching, many keypoints are found even in a relatively small area of an image. Again, all three objects are described accurately with respect to one another.

Another unique case is shown in Fig. 11. In this scene, the mug is placed atop the can. It is difficult to discern the mug from the can in the three-dimensional representation of the keypoints, and the projections onto the horizontal plane add little to differentiating one from the other. But matching the keypoints in the scene to the database of known keypoints and then finding the individual convex hulls results in a very descriptive figure. The linguistic description says that the mug is inside the can. This makes sense, as the points project to roughly the same area on the horizontal plane. Future work will use the projection onto the vertical plane, which will differentiate the two objects by saying that the mug is above the can.

1. The mug is surrounded by the can.
2. The hat is mostly in front of the mug but somewhat to the right.
3. The can is mostly behind the hat but somewhat to the left.

Fig. 11 A scene showing a mug above a can as well as a hat. The two- and three-dimensional plots of the keypoints do not show the separation between objects, but the convex hull image shows the objects' locations. (a) Left camera. (b) Right camera. (c) A three-dimensional representation of matched points. (d) The keypoint projections onto the horizontal plane. (e) The convex hulls of the objects. (f) Three of the six possible linguistic descriptions between objects.

VI. SUMMARY AND FUTURE WORK
This paper has shown that a three-dimensional scene can be described linguistically by a robot. With a database of known keypoints with class labels, the Scale Invariant Feature Transform can be used to match areas of stereo images and determine their locations in three-dimensional space. The locations can then be projected onto a horizontal plane, where their convex hulls can be found and their spatial relations linguistically described with great accuracy.

This paper has touched on only a small number of the possibilities of linguistic scene description by robots. It would seem possible to store and describe entire environments for better human-robot interaction. The next obvious step would be to take the descriptions into the third dimension. This can be accomplished by projecting keypoints onto the vertical plane and augmenting the grammar output of the linguistic description code.

The way in which the SIFT algorithm is used can also be extended. First, the database of keypoints for this paper was used only for matching purposes. It is possible to create three-dimensional model databases out of keypoints. Then, using the change in location of matched keypoints between the scene keypoints and the database keypoints, an overall affine transformation can be approximated, and the entire three-dimensional object of keypoints can then be projected into the scene. Because the cameras can only see one side of an object at a time, knowing how the entire object resides in three-space would greatly increase the accuracy of linguistic spatial descriptions (a least-squares sketch of this step is given below).

The use of working memory can also improve the performance of a real-world robot. Working memory can store only a small number of "chunks" of information about a current task. In this case, the chunks could be the objects in a scene. Within each chunk are the object's location and possibly a desired goal for the object.

It might also be possible to map the three-dimensional locations of keypoints to unknown objects and cluster them on the fly. These keypoints could then be used to build a three-dimensional model of a new object. A robot would then be able to move through an environment, notice an unknown object, and create a new database entry for it. A robot with such adaptive capabilities will allow for more natural interaction with humans using very human-like spatial descriptions.
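The affine-approximation step mentioned above can be written as an ordinary least-squares fit; this is a standard formulation rather than the authors' exact procedure:

import numpy as np

def fit_affine(src, dst):
    """Least-squares 3D affine transform mapping database keypoint
    locations (src) to matched scene locations (dst); both are (N, 3)
    arrays. Returns A (3x3) and t (3,) with dst ~ src @ A.T + t."""
    M = np.hstack([src, np.ones((len(src), 1))])       # (N, 4) design matrix
    params, *_ = np.linalg.lstsq(M, dst, rcond=None)   # (4, 3) solution
    A, t = params[:3].T, params[3]
    return A, t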
REFERENCES
[1] S. Se, D. G. Lowe, and J. Little, "Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks," International Journal of Robotics Research, vol. 21, no. 8, pp. 735-758, 2002.
[2] D. G. Lowe, "Object recognition from local scale invariant features," Proceedings of the Seventh International Conference on Computer Vision (ICCV '99), Kerkyra, Greece, Sep. 1999, pp. 1150-1157.
[3] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[4] C. B. Barber, D. P. Dobkin, and H. T. Huhdanpaa, "The Quickhull algorithm for convex hulls," ACM Transactions on Mathematical Software, vol. 22, no. 4, pp. 469-483, Dec. 1996, http://www.qhull.org
[5] P. Matsakis and L. Wendling, "A new way to represent the relative position between areal objects," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 7, pp. 634-643, 1999.
[6] P. Matsakis, J. Keller, L. Wendling, J. Marjamaa, and O. Sjahputera, "Linguistic description of relative positions in images," IEEE Transactions on Systems, Man and Cybernetics, Part B, vol. 31, no. 4, pp. 573-588, 2001.
[7] M. Skubic, P. Matsakis, G. Chronis, and J. M. Keller, "Generating multi-level linguistic spatial descriptions from range sensor readings using the Histogram of Forces," Autonomous Robots, vol. 14, no. 1, pp. 51-69, Jan. 2003.
[8] M. Skubic, et al., "Spatial language for human-robot dialogs," IEEE Transactions on Systems, Man and Cybernetics, Part C, Special Issue on Human-Robot Interaction, vol. 34, no. 2, pp. 154-167, May 2004.

