Object Recognition with Deformable Models

Pedro F. Felzenszwalb
Department of Computer Science
University of Chicago

Joint work with: Dan Huttenlocher, Joshua Schwartz, David McAllester, Deva Ramanan.
Example Problems

•   Detecting rigid objects (the PASCAL challenge)

•   Detecting non-rigid objects

•   Medical image analysis (segmenting cells)
Deformable Models
•   Significant challenge:
    - Handling variation in appearance within object classes
    - Non-rigid objects, generic categories, etc.
•   Deformable models approach:
    - Consider each object as a deformed version of a template
    - Compact representation
    - Leads to interesting modeling and algorithmic problems
Overview
•   Part I: Pictorial Structures
    - Deformable part models
    - Highly efficient matching algorithms
•   Part II: Deformable Shapes
    - Triangulated polygons
    - Hierarchical models
•   Part III: The PASCAL Challenge
    - Recognizing 20 object categories in realistic scenes
    - Discriminatively trained, multiscale, deformable part models
Part I: Pictorial Structures

•   Introduced by Fischler and Elschlager in 1973

•   Part-based models:
    - Each part represents local visual properties
    - “Springs” capture spatial relationships
Matching a model to an image involves joint optimization of part locations: “stretch and fit”.
Local Evidence + Global Decision

•   Parts have a match quality at each image location

•   Local evidence is noisy
    - Parts are detected in the context of the whole model
[Figure: a part template, a test image, and the corresponding match-quality map]
Matching Problem

•   The model is represented by a graph G = (V, E)
    - V = {v1,...,vn} are the parts
    - (vi,vj) ∈ E indicates a connection between parts

•   mi(li) is a cost for placing part i at location li

•   dij(li,lj) is a deformation cost

•   The optimal configuration for the object is L = (l1,...,ln) minimizing

        E(L) = ∑i=1..n mi(li) + ∑(vi,vj)∈E dij(li,lj)
Matching Problem

        E(L) = ∑i=1..n mi(li) + ∑(vi,vj)∈E dij(li,lj)

•   Assume n parts and k possible locations for each part
    - There are k^n configurations L

•   If the graph is a tree we can use dynamic programming
    - O(nk^2) algorithm

•   If dij(li,lj) = g(li-lj) we can use min-convolutions
    - O(nk) algorithm
    - As fast as matching each part separately!
Dynamic Programming on Trees

        E(L) = ∑i=1..n mi(li) + ∑(vi,vj)∈E dij(li,lj)

[Figure: a two-part model with parts v1 and v2]

•   For each l1 find the best l2:
    - Best2(l1) = min over l2 of [m2(l2) + d12(l1,l2)]

•   “Delete” v2 and solve the problem with the smaller model

•   Keep removing leaves until a single part is left (see the sketch below)
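As a concrete illustration, here is a minimal Python sketch of the leaf-elimination dynamic program. The children map, the cost tables m[i][l], and the deformation function d(i, j, li, lj) are illustrative names and data structures, not the talk's implementation.

# Minimal sketch: dynamic programming over a tree-structured model.
# `children[i]` lists the children of part i, `m[i][l]` is the match cost of
# placing part i at location l, `d(i, j, li, lj)` is the deformation cost for
# edge (i, j), and `locations` is the common set of candidate locations.

def best_energy(children, m, d, locations, root=0):
    """Min over L of sum_i mi(li) + sum over edges (i, j) of dij(li, lj)."""
    def message(i):
        # Cost of the best placement of the subtree rooted at part i,
        # as a function of i's own location.
        child_msgs = [(c, message(c)) for c in children.get(i, [])]
        best_i = {}
        for li in locations:
            total = m[i][li]
            for c, msg in child_msgs:
                # "Delete" child c by folding in its best response to li.
                total += min(msg[lc] + d(i, c, li, lc) for lc in locations)
            best_i[li] = total
        return best_i
    return min(message(root).values())

With k candidate locations per part this is the O(nk^2) tree algorithm; the next slide shows how min-convolutions remove the inner factor of k for suitable deformation costs.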
Min-Convolution Speedup

        Best2(l1) = min over l2 of [m2(l2) + d12(l1,l2)]

•   Brute force: O(k^2), where k is the number of locations

•   Suppose d12(l1,l2) = g(l1-l2):
    - Best2(l1) = min over l2 of [m2(l2) + g(l1-l2)]

•   Min-convolution: O(k) if g is convex
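The sketch below shows the O(k) computation for one convex choice, g(x) = x^2, using the lower-envelope construction behind generalized distance transforms; the function and variable names are illustrative.

import math

def min_convolution_quadratic(m):
    """Best(l1) = min over l2 of [m[l2] + (l1 - l2)**2], for all l1, in O(k)."""
    k = len(m)
    best = [0.0] * k
    v = [0] * k                  # locations of parabolas in the lower envelope
    z = [0.0] * (k + 1)          # boundaries between adjacent parabolas
    r = 0                        # index of the rightmost parabola in the envelope
    z[0], z[1] = -math.inf, math.inf
    for q in range(1, k):
        # Intersection of the parabola rooted at q with the rightmost one.
        s = ((m[q] + q * q) - (m[v[r]] + v[r] * v[r])) / (2 * q - 2 * v[r])
        while s <= z[r]:
            r -= 1
            s = ((m[q] + q * q) - (m[v[r]] + v[r] * v[r])) / (2 * q - 2 * v[r])
        r += 1
        v[r] = q
        z[r], z[r + 1] = s, math.inf
    r = 0
    for q in range(k):
        while z[r + 1] < q:
            r += 1
        best[q] = (q - v[r]) ** 2 + m[v[r]]
    return best

Plugging this into the inner loop of the tree algorithm gives the O(nk) matching cost quoted above.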
Finding Motorbikes

Model with 6 parts:
    - 2 wheels
    - 2 headlights
    - front & back of seat
Human Pose Estimation
Human Tracking

Ramanan, Forsyth, Zisserman. Tracking People by Learning Their Appearance. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Jan 2007.
Part II: Deformable Shapes
•   Shape is a fundamental cue for recognizing objects

•   Many objects have no well-defined parts
    - We can capture their outlines using deformable models
Triangulated Polygons

•   Polygonal templates

•   Delaunay triangulation gives a natural decomposition of an object

•   Consider deforming each triangle “independently”

    The rabbit ear can be bent by changing the shape of a single triangle.
Structure of Triangulated Polygons

There are two graphs associated with a triangulated polygon.

If the polygon is simple (no holes):
    - The dual graph is a tree
    - The graphical structure of the triangulation is a 2-tree
Deformable Matching

•   Consider piecewise affine maps from the model to the image, taking triangles to triangles (see the sketch below)

•   Find the globally optimal deformation using dynamic programming over the 2-tree

[Figure: model template and a matching to MRI data]
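For a single triangle, the affine piece of such a map is determined by its three vertex correspondences; here is a small numpy sketch (illustrative, not the talk's code) that recovers it.

import numpy as np

def triangle_affine_map(src, dst):
    """Affine map (A, t) with A @ s + t = d for three vertex pairs.

    `src` and `dst` are 3x2 arrays holding the model and image triangle
    vertices. This only recovers the warp for a fixed correspondence; choosing
    where the image triangles go is what the dynamic program over the 2-tree
    solves.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    S = np.hstack([src, np.ones((3, 1))])   # rows are [sx, sy, 1]
    M = np.linalg.solve(S, dst)             # solves [s 1] @ M = d for M (3x2)
    A, t = M[:2].T, M[2]
    return A, t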
Hierarchical Shape Model
•   Shape-tree of curve from a to b:
    -   Select midpoint c, store relative location c | a,b.
    -   Left child is a shape-tree of sub-curve from a to c.
    -   Right child is a shape-tree of sub-curve from c to b.
[Figure: a curve from a to b passing through points f, e, g, c, h, d, i, and its shape-tree: root c | a,b, children e | a,c and d | c,b, and leaves f | a,e, g | e,c, h | c,d, i | d,b]
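A minimal sketch of this recursive construction, assuming the curve is given as a list of 2D sample points; the particular "c | a,b" encoding below (coordinates of c in a frame built from a and b, invariant to similarity transforms of the endpoints) is one reasonable choice, not necessarily the one used in the talk.

import numpy as np

def relative_location(a, b, c):
    """Coordinates of c in the frame defined by endpoints a and b."""
    u = b - a
    v = np.array([-u[1], u[0]])            # perpendicular to u
    return np.linalg.solve(np.stack([u, v], axis=1), c - a)

def build_shape_tree(curve, i, j):
    """Shape-tree of the sub-curve between sample indices i and j."""
    if j - i < 2:
        return None                        # no interior midpoint to store
    k = (i + j) // 2                       # midpoint sample
    return {
        "rel": relative_location(curve[i], curve[j], curve[k]),
        "left": build_shape_tree(curve, i, k),
        "right": build_shape_tree(curve, k, j),
    }

# Example: root = build_shape_tree(points, 0, len(points) - 1)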
Deformations

•   Independently perturb relative locations stored in a shape-tree
    -   Local and global properties are preserved
    -   Reconstructed curve is perceptually similar to original
Matching

[Figure: the model curve with its shape-tree, and a target curve with sample points p, q, r; subtrees v, u and their parent w are matched to intervals of the target curve]

Match(v, [p,q]) = w1
Match(u, [q,r]) = w2
Match(w, [p,r]) = w1 + w2 + dif((e | a,c), (q | p,r))

This is similar to parsing with the CKY algorithm (see the sketch below).
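A sketch of the corresponding interval dynamic program, memoized over (node, interval) pairs in CKY fashion; it assumes shape-tree nodes as produced by the construction sketch above, and dif and leaf_cost are placeholder cost functions, not the talk's exact ones.

def match_shape_tree(tree, target, dif, leaf_cost):
    """CKY-style matching of a shape-tree against a sampled target curve.

    `target` is a list of sample points; `dif(rel, p, q, r)` measures how far
    point r sits from the relative location `rel` stored at a node, given the
    matched endpoints p and q; `leaf_cost(p, q)` scores a leaf segment.
    """
    memo = {}

    def match(node, i, j):
        if node is None:
            return leaf_cost(target[i], target[j])
        key = (id(node), i, j)
        if key not in memo:
            best = float("inf")
            for k in range(i + 1, j):      # candidate image of the midpoint
                best = min(best,
                           match(node["left"], i, k)
                           + match(node["right"], k, j)
                           + dif(node["rel"], target[i], target[j], target[k]))
            memo[key] = best
        return memo[key]

    # Assume the endpoints of the model curve map to the endpoints of the target.
    return match(tree, 0, len(target) - 1)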
Recognizing Leaves

Nearest neighbor classification on a leaf dataset: 15 species, 75 examples per species (25 training, 50 test).

    Method            Accuracy (%)
    Shape-tree        96.28
    Inner distance    94.13
    Shape context     88.12
Part III: PASCAL Challenge
•   ~10,000 images, with ~25,000 target objects
    - Objects from 20 categories (person, car, bicycle, cow, table...)
    - Objects are annotated with labeled bounding boxes
Model Overview

[Figure: a detection, the root filter, the part filters, and the deformation models]

The model has a root filter plus deformable parts.
Histogram of Oriented Gradient (HOG) Features

•   The image is partitioned into 8x8 pixel blocks

•   In each block we compute a histogram of gradient orientations (see the sketch below)
    - Invariant to changes in lighting, small deformations, etc.

•   We compute features at different resolutions (a pyramid)
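A stripped-down numpy sketch of the per-block orientation histograms; block normalization and the exact binning used in the actual system are omitted, and the function name and parameters are illustrative.

import numpy as np

def block_orientation_histograms(gray, block=8, bins=9):
    """Per-block histograms of gradient orientations for a grayscale image."""
    gray = gray.astype(float)
    gy, gx = np.gradient(gray)                       # central differences
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)

    h, w = gray.shape
    bh, bw = h // block, w // block
    hist = np.zeros((bh, bw, bins))
    for y in range(bh * block):
        for x in range(bw * block):
            hist[y // block, x // block, idx[y, x]] += mag[y, x]
    return hist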
Filters

•   Filters are rectangular templates defining weights for features

•   The score is the dot product of the filter and a subwindow of the HOG pyramid (see the sketch below)

[Figure: a filter W and a subwindow H of the HOG pyramid; the score of W at this location is H ⋅ W]
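The sliding dot product can be sketched directly over one level of the feature pyramid (shapes and names are illustrative; a real implementation would use optimized convolution routines).

import numpy as np

def filter_score_map(features, filt):
    """Score of a filter at every location of one pyramid level.

    `features` has shape (H, W, d) and `filt` has shape (h, w, d); the score at
    (y, x) is the dot product of `filt` with the subwindow whose top-left
    corner is (y, x).
    """
    H, W, d = features.shape
    h, w, _ = filt.shape
    scores = np.zeros((H - h + 1, W - w + 1))
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            scores[y, x] = np.sum(features[y:y + h, x:x + w, :] * filt)
    return scores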
Object Hypothesis

[Figure: an object hypothesis shown in the image pyramid and the HOG feature pyramid]

The score is the sum of the filter scores plus the deformation scores.

The multiscale model captures features at two resolutions.
Training
•   Training data consists of images with labeled bounding boxes

•   Need to learn the model structure, filters and deformation costs




Connection With Linear Classifiers

•   The score of the model is a sum of filter scores plus deformation scores
    - A bounding box in the training data specifies that the score should be high for some placement in a range

•   The score can be written as a dot product w ⋅ Φ(x, z), where:
    - w is a model: the concatenation of the filters and deformation parameters
    - x is a detection window
    - z are the filter placements: Φ(x, z) is the concatenation of the features and part displacements
Latent SVMs

The detection score maximizes over the latent placements z:

        fw(x) = max over z of w ⋅ Φ(x, z)

This is linear in w if z is fixed. Training minimizes a regularized hinge loss (see the sketch below):

        (1/2)||w||^2 + C ∑i max(0, 1 - yi fw(xi))
        regularization        hinge loss
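A numpy sketch of the score and objective under these definitions; the list-of-placements representation of the latent space and the constant C are illustrative simplifications.

import numpy as np

def score(w, placements):
    """fw(x) = max over latent placements z of w . Phi(x, z).

    `placements` is a list of feature vectors Phi(x, z), one per candidate
    placement z of the filters in the detection window x.
    """
    return max(np.dot(w, phi) for phi in placements)

def lsvm_objective(w, examples, C=1.0):
    """(1/2)||w||^2 + C * sum_i max(0, 1 - yi * fw(xi)).

    `examples` is a list of (label, placements) pairs with labels in {+1, -1}.
    The objective is convex in w once the placements chosen for the positive
    examples are held fixed.
    """
    obj = 0.5 * np.dot(w, w)
    for y, placements in examples:
        obj += C * max(0.0, 1.0 - y * score(w, placements))
    return obj

Training alternates between picking the best placements z for the positive examples and solving the resulting convex problem in w.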
Learned Models

[Figure: visualizations of the learned models for bicycle, sofa, car, and bottle]
Example Results
More Results
Overall Results

•   9 systems competed in the 2007 challenge

•   Out of 20 classes we get:
    - First place in 10 classes
    - Second place in 6 classes
•   Some statistics:
    - It takes ~2 seconds to evaluate a model in one image
    - It takes ~3 hours to train a model
    - MUCH faster than most systems
Component Analysis

[Figure: precision-recall curves on the PASCAL 2006 person category: Root (0.18), Root+Latent (0.24), Parts+Latent (0.29), Root+Parts+Latent (0.34)]
Summary

•   Deformable models provide an elegant framework for object
    detection and recognition

    - Efficient algorithms for matching models to images
    - Applications: pose estimation, medical image analysis,
      object recognition, etc.

•   We can learn models from partially labeled data

    - Generalized standard ideas from machine learning
    - Leads to state-of-the-art results in PASCAL challenge
•   Future work: hierarchical models, grammars, 3D objects
