A Benchmark For Comparison of Cell Tracking Algorithms
A Benchmark For Comparison of Cell Tracking Algorithms
ABSTRACT                                                                                        Received on August 29, 2013; revised on January 15, 2014; accepted
Motivation: Automatic tracking of cells in multidimensional time-lapse                          on January 31, 2014
fluorescence microscopy is an important task in many biomedical ap-
plications. A novel framework for objective evaluation of cell tracking
algorithms has been established under the auspices of the IEEE                                  1     INTRODUCTION
International Symposium on Biomedical Imaging 2013 Cell Tracking
Challenge. In this article, we present the logistics, datasets, methods
                                                                                                Cell migration is an essential process in normal tissue development,
and results of the challenge and lay down the principles for future uses
                                                                                                tissue repair and disease (Friedl and Gilmour, 2009). The dynamics
of this benchmark.                                                                              of cell movement (e.g. speed, directionality) and migration type
Results: The main contributions of the challenge include the creation                           (i.e. the morphological changes that the cell undergoes during the
of a comprehensive video dataset repository and the definition of ob-                           movement) are closely related to the biomechanical properties
jective measures for comparison and ranking of the algorithms. With                             of the surrounding environment (Friedl and Alexander, 2011).
this benchmark, six algorithms covering a variety of segmentation and                           Therefore, accurate quantification of both is the key to understand-
tracking paradigms have been compared and ranked based on their                                 ing the complex mechanobiology of cell migration.
performance on both synthetic and real datasets. Given the diversity                               Traditionally, cell migration experiments have been performed
of the datasets, we do not declare a single winner of the challenge.                            in two dimensions (2D) using phase or differential interference
Instead, we present and discuss the results for each individual dataset                         contrast microscopy. Nowadays, it is increasingly acknowledged
separately.                                                                                     that proper evaluation of the cellular movement, as well as related
Availability and implementation: The challenge Web site (http://                                forces, requires looking at the cells in their three-dimensional (3D)
www.codesolorzano.com/celltrackingchallenge) provides access to                                 tissue environment (Legant et al., 2010). This can be done by
the training and competition datasets, along with the ground truth of                           taking advantage of the versatility of fluorescence labeling and
the training videos. It also provides access to Windows and Linux                               the optical sectioning capability of multidimensional fluores-
executable files of the evaluation software and most of the algorithms                          cence in vivo microscopy (Fernandez-Gonzalez et al., 2006).
that competed in the challenge.                                                                 Fluorescence microscopy has several advantages (e.g. multidimen-
Contact: codesolorzano@unav.es                                                                  sionality, specificity). However, tracking fluorescent cells poses
Supplementary information: Supplementary data are available at                                  specific challenges compared with more traditional phase contrast
Bioinformatics online.                                                                          enhancing techniques: non-homogenous staining, low signal-to-
                                                                                                noise ratio, uneven background illumination, photobleaching,
                                                                                                phototoxicity, etc. Moreover, an important challenge, specific to
*To whom correspondence should be addressed.                                                    the use of green fluorescent protein (GFP) transfection-based
staining, is the cell-to-cell intensity variability caused by differen-     The limitations of these studies, such as being monomodality,
tial transfection efficiency. Therefore, tracking of fluorescent cells    using 2D or static images, one or two cell types only, or compar-
requires specialized tools.                                               ing with none or few competing algorithms, highlight the need
   Several methods have been described for the segmentation of            for common standards to evaluate new and existing algorithms.
cells in static 3D fluorescence microscopy images (Indhumathi             Bearing this in mind, we organized the first Cell Tracking
et al., 2011; Lin et al., 2005; Long et al., 2007; Ortiz-de-              Challenge (http://www.codesolorzano.com/celltrackingchallenge)
Solorzano et al., 1999). These methods have been extended to              hosted by the 2013 IEEE International Symposium on
account for the temporal variable in multidimensional time-               Biomedical Imaging (ISBI 2013, http://www.biomedicalima-
lapse microscopy, combining accurate segmentation of the cells            ging.org/2013/). In this article, we present the methods used
with proper tracking of their movements and lineage events (e.g.          in the challenge, briefly describe the competing algorithms
apoptosis, mitosis, cell merging and overlapping). They can be            and report on the results of the comparison, which was based
classified into two broad categories: tracking by detection and           on common accuracy measures and datasets covering a wide
tracking by model evolution (Meijering et al., 2009; Rohr et al.,         variety of scenarios of live cell imaging in fluorescence
2010; Zimmer et al., 2006). In the former paradigm, cells are first       microscopy.
detected in all the frames of the video independently using gradi-
ent features (Al-Kofahi et al., 2006), intensity (Li et al., 2010) or
wavelet decomposition (Padfield et al., 2011). Subsequently, an           2     METHODS
optimization strategy, such as multiple-hypothesis tracking
(Chenouard et al., 2013), integer programming (Li et al., 2010),          2.1    Logistics
dynamic programming (Magnusson and Jaldén, 2012) or coupled              The challenge was organized by members of three research institutions:
minimum-cost flow tracking (Padfield et al., 2011), is used to de-        Center for Biomedical Image Analysis, Masaryk University, Brno, Czech
termine the most likely cell correspondence between frames. In the        Republic (CBIA-CZ); Center for Applied Medical Research, University
                                                                          of Navarra, Pamplona, Spain (CIMA-ES); and Erasmus University
latter paradigm, cells are segmented and tracked simultaneously,
                                                                          Medical Center, Rotterdam, The Netherlands (ERASMUS-NL). The
using the final result of each frame as the initial condition for the
                                                                          challenge, announced via various media including targeted emails,
analysis of the following frame. This is mostly done by evolving          mailing lists and the ISBI 2013 Web site, was opened for registration
the contours of the cells, represented either parametrically              through the challenge Web site. Four weeks after opening the challenge
(Dufour et al., 2011; Zimmer et al., 2002) or implicitly (Dufour          for registration, all registered participants were given individual access to
et al., 2005; Dzyubachyk et al., 2010; Li et al., 2008; Maška et al.,    the challenge FTP server, where they could download the training data-
2013), using a velocity term defined by the content of the ‘‘target’’     sets, along with the ground truth, and self-evaluation software. The regis-
frame (e.g. gradient features or intra- and inter-region heterogen-       tered participants worked on the training datasets during the 4 weeks
eity) and by the internal properties of the evolved contours (e.g.        before the competition datasets were released. The participants were
mean curvature, shape or topology). The main benefit of the first         then given six additional weeks to submit their results and the algorithms
paradigm is the mutual independence of detection and association          used to produce them. After the deadline, the consistency and the com-
                                                                          pliance of the submissions were verified by the organizers before the
steps, which allows straightforward tracking of new cells entering
                                                                          presentation of the preliminary results at ISBI 2013. After the ISBI
the field of view as well as forward-backward spatiotemporal data         2013 workshop, the organizing committee confirmed the accuracy of
association (Bise et al., 2011). On the contrary, the tracking by         the submitted results and compiled the final rankings presented in this
model evolution approaches is popular for easy accommodation              article.
of morphological and behavioral clues into the model to inher-
ently deal with the topologically flexible behavior of live cells.
Bridging both paradigms together to take advantage of their bene-
                                                                          2.2    Datasets
fits, Li et al. (2008) proposed a complex cell tracking system that       Forty-eight time-lapse sequences used in the challenge were evenly dis-
combines a fast level set framework with a local spatiotemporal           tributed between the training and competition phases. Each group of
                                                                          24 videos consisted of 12 real microscopy time-lapse sequences and 12
data association step.
                                                                          computer-simulated videos, 6 2D and 6 3D, with various cell densities
   The tracking methods described until this date have been               and noise levels. The acquisition setup for each dataset is listed in Table 1,
tested in one or few private datasets using different metrics and         and representative regions of each dataset are displayed in Figure 1.
have seldom been compared against other algorithms. A note-               Representative sample videos can also be found as Supplementary
worthy attempt toward a formalization of the evaluation of cell           Videos S1–S8. The complete raw data are available at the challenge
tracking algorithms was described by Kan et al. (2011). They              Web site. The datasets were named using the following convention: a
compared a novel cell tracking strategy to a publicly available           four-letter prefix (LNDR) identifies the labeling (L) method -cytoplasmic
probabilistic tracker using a customized tracking measurement             (C) or nuclear (N); the dimensionality (ND) -2D or 3D; and the reso-
and mostly publicly available data. Similarly, Rapoport et al.            lution (R) -low (L) or high (H). The suffix, separated by a hyphen from
(2011) partly addressed this issue by providing a method for              the prefix, describes the cell line.
the validation of the accuracy of cell tracking results and a data-
                                                                          2.2.1 Real videos     The real video repository consists of six datasets.
set composed of two manually annotated brightfield microscopy
                                                                             C2DL-MSC (Fig. 1A and Supplementary Video S1). GFP transfected
videos. Finally, two recent studies (Dima et al., 2011; Held et al.,      rat mesenchymal stems cells on a flat polyacrylamide substrate, acquired
2011) presented two rigorous comparisons of algorithms de-                using a Perkin Elmer UltraVIEW ERS spinning disk confocal microscope
veloped for the segmentation of fluorescently labeled cells from          (courtesy of Dr F. Prósper, CIMA-ES). The difficulty of the dataset is
static 2D images, using their own image repositories and adapted          high because of the low signal-to-noise ratio and the presence of filament-
accuracy measures.                                                        like protruding areas caused by cell stretching, which sometimes appear
1610
                                                                                                                     Benchmarking of cell tracking algorithms
Name Objective lens/Numerical aperture Frame size (grid points) Voxel size (mm) Time step (min) Number of frames Difficulty
C2DL-MSC             20   Plan-apochromat/0.75             992  832 (1200  782)            0.397  0.397         20 (30)           48                    High
C3DH-H157            63   Plan-apochromat/1.2 water        992  832  35 (80)               0.126  0.126  0.5   1 (2)             60                    Low
C3DL-MDA231          20   Plan/0.7                         512  512  30                    1.242  1.242  6.0   80                12                    Very High
N2DH-GOWT1           63   Plan-apochromat/1.4 oil          1024  1024                       0.240  0.240         5                 92                    Medium
N2DL-HeLa            10   Plan/0.4                         1100  700                        0.644  0.644         30                92                    High
N3DH-CHO             63   Plan-apochromat/1.4 oil          512  443  5                     0.202  0.202  1.0   9.5               92                    Medium
N2DH-SIM             40   Plan-apochromat/1.3 oil          505–755  535–775                 0.125  0.125         28.8 (57.6)       56–100                Medium
N3DH-SIM             40   Plan-apochromat/1.3 oil          520–755  520–730  49 (60)       0.125  0.125  0.2   28.8 (57.6)       56–100                Medium
Note: The numbers in parentheses indicate particular values for the second half of a given dataset.
Fig. 1. Representative regions from the video dataset repository. (A) C2DL-MSC; (B) C3DH-H157 (selected z-slice); (C) C3DL-MDA231 (selected
z-slice); (D) N2DH-GOWT1; (E) N2DL-HeLa; (F) N3DH-CHO (selected z-slice); (G) N2DH-SIM (also representative of a selected z-slice of
N3DH-SIM)
as discontinuous extensions of the cells. Further complicating the ana-                   the axial direction, and large time step). Moreover, there are a high
lysis of the scenes, these protrusions often come in contact with other                   number of colliding elongated cells as well as cells entering and leaving
cells.                                                                                    the field of view.
   C3DH-H157 (Fig. 1B and Supplementary Video S2). GFP transfected                            N2DH-GOWT1 (Fig. 1D and Supplementary Video S4). GFP trans-
H157 human squamous lung carcinoma cells embedded in a 3D matrigel                        fected GOWT1 mouse embryonic stem cells on a flat substrate, acquired
matrix, acquired using a Perkin Elmer UltraVIEW ERS spinning                              using a Leica TCS SP5 laser scanning confocal microscope (courtesy of
disk confocal microscope (courtesy of Dr A. Rouzaut, CIMA-ES).                            Dr E. Bártová, Academy of Sciences of the Czech Republic, Brno, Czech
The difficulty of the dataset is low because of low cell density and high                 Republic). The difficulty of the dataset is considered medium because of
resolution. However, the presence of cell blebbing and cells entering and                 heterogeneous staining, prominent nucleoli, mitoses, cells entering and
leaving the field of view impose a certain degree of complexity for seg-                  leaving the field of view and frequent cell collisions.
mentation and tracking.                                                                       N2DL-HeLa (Fig. 1E and Supplementary Video S5). Histone 2B
   C3DL-MDA231 (Fig. 1C and Supplementary Video S3). MDA231                               (H2B)-GFP expressing HeLa cells on a flat substrate, acquired using
human breast carcinoma cells infected with a pure Murine Stem Cell                        an Olympus IX81 inverted epifluorescence microscope. The videos were
Virus (pMSCV) vector including GFP. The cells were embedded in a                          obtained with permission from the Mitocheck consortium video reposi-
3D collagen matrix and acquired using an Olympus FluoView F1000                           tory (http://www.mitocheck.org). The difficulty of the dataset is classified
laser scanning confocal microscope (courtesy of Prof R. Kamm,                             as high because of the high cell density and low resolution. In particular,
Massachusetts Institute of Technology, Cambridge, MA, USA). The dif-                      the videos display frequent mitoses, both normal and abnormal, in add-
ficulty of the dataset is high because it was acquired under high-through-                ition to the presence of colliding, entering and leaving cells with low
put conditions (i.e. low signal-to-noise ratio, low resolution, especially in             fluorescence intensity.
                                                                                                                                                                1611
M.Maška et al.
   N3DH-CHO (Fig. 1F and Supplementary Video S6). Chinese hamster                 to causing segmentation problems, such as cells undergoing abnormal
ovarian cells overexpressing proliferating cell nuclear antigen tagged with       mitoses, dimly stained cells, oddly shaped cells and colliding pairs of
GFP, acquired using a Zeiss LSM 510 laser scanning confocal microscope            cells. They segmented at least 20 instances of each problematic event.
(courtesy of Dr J. Essers, ERASMUS-NL). The dataset is considered of
medium difficulty because of nuclei with heterogeneous staining, the pres-        2.3.3 Ground truth for tracking The task for the annotators was to
ence of prominent nucleoli, mitotic cells with unstained nuclear periods,         draw a quintessential ‘‘marker’’ (i.e. a set of grid points with the same
colliding cells and cells entering and leaving the field of view.                 unique label) inside each cell and in every frame where the cell consecu-
                                                                                  tively appears entirely or partly within the limits of the FoI. These mar-
2.2.2 Simulated videos The synthetic image data, along with the in-               kers do not need to accurately follow the boundaries of the cells. Markers
herent ground truth, were generated using a simulation toolkit based on           of a given label located in consecutive frames are called ‘‘tracks’’. Tracks
our previous work in CBIA-CZ (Svoboda and Ulman, 2012; Svoboda                    end when a cell entirely leaves the FoI, the video reaches the final frame or
et al., 2009). As this challenge was dedicated to cell tracking, special          the cell divides into two, or abnormally more than two, daughter cells.
attention was paid to the accuracy of cell movement during the cell               When this happens, new tracks are created, one for each daughter cell,
cycle and to the mitotic events. The simulated videos displayed fluores-          and the parental connection is stored in the TRA-GT file.
cently labeled nuclei of the HL60 cell line migrating on a flat 2D surface
(N2DH-SIM) and in a 3D matrix (N3DH-SIM) (Fig. 1G and
Supplementary Videos S7 and S8). They differ in the level of noise, cell          2.4    Evaluation
density of the initial population, the number of cells leaving and entering       2.4.1 Segmentation measure The main purpose of the segmentation
the field of view and the number of simulated mitotic events, yielding up         (SEG) accuracy measurement is to understand how well the segmented
to 70 cells in the field of view. Therefore, both datasets are considered of      cells match the actual cell regions. To quantify it, we used the Jaccard
medium difficulty.                                                                similarity index, defined as:
                                                                                                                           jR \ Sj
2.3    Ground truth generation for real datasets                                                               JðR, SÞ ¼
                                                                                                                           jR [ Sj
One expert from CIMA-ES annotated all the real datasets used in the
                                                                                  where R is a reference segmentation of a cell in SEG-GT-F and S is an
training phase. For the competition phase, all real videos were manually
                                                                                  automatic segmentation of the particular cell provided by a participant.
annotated by three experts from three sites (CBIA-CZ, CIMA-ES and
                                                                                  A reference cell, R, and a segmented one, S, are considered matching if
ERASMUS-NL). Each expert created ground truth for tracking (TRA-
                                                                                  the following condition holds:
GT) and ground truth for segmentation (SEG-GT) for each video. Each
pair of SEG-GT and TRA-GT was manually revised by its creator to                                               jR \ Sj40:5  jRj
correct for automatically detected inconsistencies of two types: a segmen-
tation mask either overlapping with multiple tracking markers or without             Note, for each reference cell, there can be one segmented object at
any complete tracking marker. Finally, to account for inter-subject vari-         most. If there is no significant overlap with any segmented object, the
ability, two final ground truths (SEG-GT-F and TRA-GT-F) were created             matching function is set to empty. The Jaccard index always falls in
by combining the three existing ground truths, using a majority-voting            the [0, 1] interval, where 1 means perfect match and 0 means no match.
scheme, as suggested, for instance, by Foggia et al. (2013). The way the          The final SEG measure for a particular video is calculated as the mean
majority voting was performed is described in detail in the Supplementary         of the Jaccard indices of all the reference objects in the video.
Note.
                                                                                  2.4.2 Tracking measure The goal of the tracking (TRA) measure-
2.3.1 Field of interest        To simplify dealing with incomplete objects,       ment is to evaluate the ability of the tracking algorithms to detect the
entering or leaving the image frame, only objects that had substantially          cells and follow them in time. Although TRA does not evaluate the
advanced into the image frame were analyzed. This is equivalent to defin-         segmentation accuracy, reliable cell detection is the key to this measure-
ing a virtual inner field of interest (FoI) and analyzing only those objects      ment. To the best of our knowledge, there is no standardized, commonly
that are at least partially inside the FoI. The distance in grid points (pixels   used cell tracking accuracy measure currently available. Two popular
or voxels) between the image frame border and the FoI border varied               approaches for measuring the performance of tracking algorithms are
between datasets depending on the size of the objects of interest (50 grid        based on either the ratio of completely reconstructed tracks to the total
points in C2DL-MSC, C3DH-H157, N2DH-GOWT1 and N3DH-CHO;                           number of ground-truth tracks (Li et al., 2008) or the ratio of correct
25 grid points in C3DL-MDA231 and N2DL-HeLa).                                     temporal relations within reconstructed tracks to the total number of
                                                                                  temporal relations within ground-truth tracks (Kan et al., 2011).
2.3.2 Ground truth for segmentation The task for annotators was to                Obviously, both approaches quantify, at different scales, how well the
mark grid points belonging to cells as accurately as possible. Therefore,         cell tracking algorithms are able to reconstruct a particular ground-truth
each cell was segmented as a set of grid points with the same unique label.       reference. However, they neither penalize for spurious tracks nor account
The length of the videos and the high number of cells per frame in some           for division events, which are often evaluated separately (Kan et al., 2011;
of the datasets prevented from having a complete manual annotation of             Li et al., 2008). Therefore, we developed a novel cell tracking accuracy
all the cells. Therefore, we first randomly permutated all the frames of          measure that penalizes for all possible errors in tracking results and com-
each video to unbiasedly select the cells that were used as ground truth. In      bines them with different weights, reflecting the manual effort needed to
the 3D videos, we also randomly selected at least one of its 2D z-slices,         correct a particular error, into a single number.
excluding empty slices. Then, the annotators were asked to segment all               Cell tracking results can be represented using an acyclic oriented
the cells within each frame in the given random order until at least 100          graph. The nodes of such a graph correspond to the detected cells,
cells were segmented and two frames were fully segmented. The segmen-             whereas its edges coincide with temporal relations between them. They
tation masks were drawn in the entire image frame and not just in the             are of two kinds: track links (the cell continues with the same label in the
FoI. Cells visible only outside the FoI were not segmented at all. After          consecutive frames) and parent links (the cell continues with a different
reaching the limit of 100 cells and two full frames, the annotators in-           label not necessarily in the consecutive frames). Non-dividing cells have
spected the remaining frames in the random order provided, and they               one successor at most, whereas those that undergo division have two or
were asked to identify and annotate cells that in their opinion were prone        even more successors in the case of abnormal division. The TRA
1612
                                                                                                          Benchmarking of cell tracking algorithms
measurement computes the difference between the acyclic oriented graph          subsequently, a final ranking was compiled based on the following
provided by a participant algorithm and the reference TRA-GT-F. To this         formula:
end, we automatically quantify how difficult it is to transform the com-                                                         1
puted graph into the reference one as the least number of operations                      Final rank ¼ rankSEG þ rankTRA þ          rankTIM
                                                                                                                                 N
needed to make the graphs identical. The operations allowed (split/
delete/add a node; delete/add or change the semantics of an edge) are           where N is the number of ranked methods for a particular dataset. The
penalized differently based on the effort that would be required if manu-       reason for using different weights for accuracy (SEG and TRA) and speed
ally performed. The correspondence between operations and weights (w)           (TIM) is to prefer more accurate, but possibly slower, methods to faster,
is as follows: delete a node (w ¼ 1, requires one mouse click); split a node    but less accurate, ones.
(w ¼ 5, requires drawing a divider); add a node (w ¼ 10, requires adding a         The best performing method is that with the lowest Final rank for a
whole mask); delete an edge (w ¼ 1, requires one mouse click); change an        particular dataset. Note that the methods having partial or empty sub-
edge semantics (w ¼ 1, requires one mouse click); add an edge (w ¼ 1.5, it      mitted results for a particular dataset were not ranked for that dataset.
is slightly more difficult than deleting an edge, as it requires to determine   Instead, their Final rank was established as NA (not applicable).
both nodes of the edge). The TRA measure is defined as the weighted sum
of graph operations, normalized by the number of markers (i.e. by the
number of nodes in the reference graph) to facilitate the comparison            3     RESULTS AND DISCUSSION
between videos (datasets) with different numbers of cells. The best
result, which requires no changes, has a TRA measure equal to zero.             3.1    Participants and algorithms
    To establish the optimal transformation of a participant graph into the     At the time the challenge was closed, six groups had uploaded
reference graph, we have implemented the following automatic proced-            consistent results to the challenge FTP server. The main
ure. First, correspondences between the nodes of both graphs are deter-         principles of the competing algorithms are briefly described in
mined using the same criterion that is used for finding matching                Table 2 and are fully described in Supplementary Methods.
segmentation masks. Then, the nodes are classified into four categories:
                                                                                Furthermore, executable versions of most of the competing al-
false negatives (ground-truth nodes without any match to the participant
nodes), false positives (participant nodes without any match to the
                                                                                gorithms, along with the instructions of use, are available
ground-truth nodes), true positives (ground-truth nodes that match to           through the challenge Web site.
some participant nodes) and non-split nodes (participant nodes that
match to multiple ground-truth nodes). Knowing the category of each             3.2    Submissions and rankings
node, the procedure directly computes how many edges need to be
removed from the participant graph. They are either connected to at             The percentage of submissions received for each dataset is listed
least one false-positive node or they connect two correctly detected            in Supplementary Table S1. This table also displays the mean
nodes, which are not linked in the ground-truth graph. Analogously, it          and standard deviation of the SEG, TRA and TIM measures
counts the number of missing edges in the participant graph. These are          obtained for each dataset, combining all the submissions
the ground-truth edges without counterpart in the participant graph.            received.
Finally, the number of edges between matching nodes, which differ in               Table 3 presents a summary of the rankings obtained for each
semantics, is counted. The optimal transformation making the participant        dataset, considering each measurement separately, and also com-
graph identical to the reference graph first involves separating all non-       bined, as described in Methods. The specific results for each
split nodes, adding false-negative nodes and removing all false-positive
                                                                                dataset, including the values of the three performance measures
nodes. Having the sets of nodes of both graphs unified, redundant edges
are removed, missing edges added and finally those with wrong semantics
                                                                                for each video are listed in Supplementary Table 1. Sample
corrected. The whole procedure is fully automatic, requires no optimiza-        results are presented as Supplementary Videos 9–16.
tion and is easy to implement.
                                                                                3.3    Discussion
2.4.3 Time consumption       Time consumption (TIM) was evaluated on
                                                                                In the next paragraphs, we will discuss the main contributions of
a common workstation (Intel Core i7-3770 3.40 GHz, 24 GB RAM) run-
ning the 64-bit Windows 7 or the Ubuntu 13.10 operating system. The             the challenge.
total execution time needed to analyze each video of a given dataset was           Datasets. We have created a public data repository composed
measured. The memory consumption was not considered for the perform-            of 24 annotated time-lapse sequences obtained from conven-
ance evaluation, but the participants were asked to ensure that their           tional and confocal fluorescence microscopes, along with 24 real-
algorithms would not require more than the given physical memory                istic computer simulations of moving nuclei. The cell types
limit on that PC configuration.                                                 selected are relevant in the context of cell migration, being cells
                                                                                with stem-like properties involved in embryonic and adult organ
2.4.4 Evaluation tools Two command-line executable programs, one                development and homeostasis, or cancer cell lines with metastatic
for segmentation and one for the tracking accuracy evaluation, were             properties. These videos cover a wide variety of cell types,
provided along with the training datasets to help the participants with         microscopy and experimental setups, cell density and motility,
the self-evaluation and refinement of their algorithms. These programs
                                                                                resolution, image quality and dimensionality. There are 2D
were also used by the organizers to evaluate SEG and TRA for the results
submitted by the participating teams for the competition datasets. Both
                                                                                sequences of nuclearly stained cells, commonly used in cell popu-
programs were written in Cþþ and are publicly available at the challenge        lation studies (e.g. N2DL-HeLa), and 3D sequences of cytoplas-
Web site.                                                                       mically stained cells, more appropriate for single-cell studies that
                                                                                demand a realistic rendering of the cellular environment (e.g.
2.4.5 Compilation of rankings First, for each method, the SEG,                  C3DL-MDA231). The characteristics of the videos in terms of
TRA and TIM measures were averaged over all the videos of a given               contrast, resolution and signal-to-noise ratio are also diverse,
dataset. For each dataset, all the methods were ranked (1 ¼ best) and,          covering conditions ranging from those that could be considered
                                                                                                                                                   1613
M.Maška et al.
COM-US           D        Mean filtering                Iterative histogram analysis           Multiple-hypothesis tracking   Identification of parent links
                                                                                                 of extracted cell
                                                                                                 baricenters
HEID-GE          D        Gaussian and median           Region adaptive threshold-             Local optimization using a     Detection of mitotic events
                           filtering                     ing followed by a water-                cost function within spa-     based on likelihood
                                                         shed transform for                      tially-limited search         measurements
                                                         splitting clusters                      regions
KTH-SE           D        Gaussian band-pass            Global thresholding fol-               State-space diagram opti-      Seeded k-means clustering;
                           filtering                     lowed by a watershed                    mization in a greedy           Merging segments without
                                                         transform for splitting                 fashion.                       tracks into adjacent segments
                                                         clusters                                                               with tracks
LEID-NL          M        Not used                      Region-based contour evolution using the multi-phase level set        Improved handling of mitotic
                                                         framework. Radon transform for splitting clusters;                     events (for cytoplasmic label-
                                                         Compensation for inter-frame cell motion                               ing) based on shape solidity
                                                                                                                                measurements
PRAG-            D        Gaussian filtering            Adaptive k-means thresh-               Nearest-neighbor tracking of   Not used
 CZ                                                       olding followed by a                  extracted centers of mass
                                                          watershed transform for
                                                          splitting clusters
UPM-ES           D        Median filtering;             Stochastic spatio-temporal             Iterative spatio-temporal      Not used
                           Grayscale spatial              morphological reconstruc-              association based on
                           area opening                   tion combined with hier-               three-dimensional con-
                                                          archical clustering                    nectivity for 2D data
‘‘high quality’’ (i.e. high numerical aperture lens, homogeneous                         is put on missing nodes in the weighted sum; therefore, the ability
and bright fluorescent staining) to conditions that could be clas-                       of the method to detect all the cells is important for achieving
sified as ‘‘high-throughput’’ (i.e. low magnification, low numer-                        low score. Because both parameters (i.e. segmentation and track-
ical aperture lens, heterogeneous and dim fluorescent staining).                         ing accuracy) are of similar importance, they were weighted
All real videos used in the competition were manually annotated                          equally in the final ranking function. Only when the algorithms
at three different sites, and a final ground truth per video was                         achieved the same rank in terms of accuracy, the faster one was
generated using a majority voting approach, to account for inter-                        preferred, which was guaranteed by adding time performance
subject variability. The two additional simulated datasets provide                       with a smaller adaptive weight.
an absolute ground truth for the comparison of the algorithms,                              Results: Participants and algorithms. Six algorithms were sub-
eliminating the possible bias introduced by the annotators of the                        mitted to the first Cell Tracking Challenge, covering a wide var-
real videos.                                                                             iety of methods, stemming from the two main tracking
   Measures and rankings. Key to the establishment of a credible                         paradigms: tracking by detection and tracking by model evolution.
benchmark is the use of common measures for algorithm evalu-                             Most of the existing state-of-the-art methods for filtering, en-
ation and comparison. We have described and used measures                                hancement, segmentation, particle analysis and track association
that account for two aspects of the cell tracking problem: seg-                          are represented. Four of the six participating groups (COM-US,
mentation and tracking accuracy. The segmentation accuracy                               HEID-GE, KTH-SE and PRAG-CZ) provided results for all the
measure was based on the Jaccard similarity index, which evalu-                          datasets. This is a remarkable fact that emphasizes the general-
ates how close the cell segmentations are to the ground truth.                           ization of the results.
Tracking accuracy was evaluated using a novel measure, based                                Results: Global analysis. Based on the numbers provided in
on matching acyclic oriented graphs. This method automatically                           Supplementary Table S1, both SEG and TRA accuracy measures
assesses the difficulty of transforming a computed graph into the                        were higher for nuclear labeling than for cytoplasmic labeling.
ground-truth reference. The difficulty is measured as the                                Furthermore, they both reflected the level of complexity pro-
weighted sum of the least number of operations needed to                                 vided in Table 1, along with the description of the datasets.
make the graphs identical. Therefore, the tracking accuracy                                 There were large differences in the segmentation accuracy, the
was measured by one comprehensive scalar measure, whereas                                lowest mean values being for C2DL-MSC and C3DL-MDA231
in most previous works it required evaluating multiple measures                          (both with cytoplasmic labeling), the highest mean value being
to characterize various cell tracking events (Kan et al., 2011; Li                       for N3DH-CHO. This could be explained by the fact that the
et al., 2008). The weights are not biologically motivated; there-                        algorithms seem to be developed and tuned for the segmentation
fore, the measure is application-independent. The highest weight                         of nuclei, as they often incorporate cluster separation routines
1614
                                                                                               Benchmarking of cell tracking algorithms
Table 3. Summary of top-3 rankings per dataset and measure, along with the combined (FINAL) rankings
FINAL
  #1     KTH-SE         PRAG-CZ         KTH-SE              KTH-SE             KTH-SE          HEID-GE         LEID-NL         LEID-NL
  #2     HEID-GE        KTH-SE          HEID-GE             PRAG-CZ            HEID-GE         KTH-SE          KTH-SE          KTH-SE
  #3     UPM-ES         HEID-GE         COM-US              HEID-GE            PRAG-CZ         LEID-NL         HEID-GE         HEID-GE
SEG
  #1     KTH-SE         HEID-GE         KTH-SE              PRAG-CZ            KTH-SE          HEID-GE         LEID-NL         LEID-NL
  #2     HEID-GE        KTH-SE          COM-US              KTH-SE             HEID-GE         LEID-NL         HEID-GE         HEID-GE
  #3     UPM-ES         PRAG-CZ         HEID-GE             HEID-GE            PRAG-CZ         COM-US          KTH-SE          KTH-SE
TRA
  #1     KTH-SE         PRAG-CZ         KTH-SE              KTH-SE             HEID-GE         KTH-SE          KTH-SE          LEID-NL
  #2     HEID-GE        KTH-SE          HEID-GE             PRAG-CZ            KTH-SE          PRAG-CZ         LEID-NL         KTH-SE
  #3     UPM-ES         HEID-GE         PRAG-CZ             HEID-GE            PRAG-CZ         HEID-GE         HEID-GE         HEID-GE
TIM
  #1     COM-US         PRAG-CZ         PRAG-CZ             COM-US             COM-US          COM-US          COM-US          PRAG-CZ
  #2     KTH-SE         COM-US          COM-US              KTH-SE             KTH-SE          PRAG-CZ         KTH-SE          COM-US
  #3     PRAG-CZ        KTH-SE          KTH-SE              PRAG-CZ            PRAG-CZ         KTH-SE          PRAG-CZ         KTH-SE
based on the circularity of segmented objects. Therefore, they are     all the datasets, with the exception of the segmentation accuracy
not appropriate for cellular shapes, which are seldom uniform,         for N3DH-CHO, where KTH-SE ranked fourth. However, one
present protrusions and frequently establish contacts or overlaps.     should note that in this specific case, the difference in SEG be-
    The TRA measure generally provided more uniform re-                tween the most accurate method, HEID-GE, and KTH-SE was
sults among datasets, with the exception of C2DL-MSC.                  small. The other two methods that often belonged to the top
Interestingly, the algorithms achieved significantly higher track-     three most accurate methods were PRAG-CZ and LEID-NL.
ing accuracy on the simulated datasets than on the real ones.          In terms of TIM measure, COM-US, PRAG-CZ and KTH-SE
This is likely because of the fact that the computer-simulated         were consistently the top three fastest methods for all the data-
nuclei are in general uniformly sized, and the simulated cell mo-      sets. It is remarkable that KTH-SE was, at worst, second fastest
tility does not cover all possible random events that occur in real    among the top three best performers in terms of SEG and TRA.
live-cell experiments.                                                    Results per dataset: We will now look at the results of each
    Finally, the TIM measure strongly depended on the size of          particular dataset in detail, trying to extract relevant conclusions
each video, being the lowest for C2DL-MSC and N2DH-SIM                 about the best performing methods (see Table 3 and
and the highest for N3DH-SIM and C3DH-H157. Another                    Supplementary Table S1):
important factor influencing time consumption of the competing            C2DL-MSC (Supplementary Video S9). The accuracy meas-
algorithms was the number of objects to be analyzed. Note that         ures were generally poor, especially because of problems with the
the standard deviations of the TIM measure indicate significant        segmentation of elongated protrusions, often incorrectly con-
differences in the speed of competing algorithms for all the           sidered as whole cells. KTH-SE achieved significantly better
datasets.                                                              accuracy than the other methods because of the optimized
    Results: Rankings. The FINAL ranking in Table 3 shows that         track-linking algorithm used, and an adaptive post-processing
KTH-SE performed best in four real datasets (C2DL-MSC,                 step, which merges segmented object portions into adjacent seg-
C3DL-MDA231, N2DH-GOWT1 and N2DL-HeLa). HEID-                          ments with tracks. Regardless of this additional post-processing
GE and PRAG-CZ performed best in one real dataset (N3DH-               step, the method was still fast, being the second fastest in terms
CHO and C3DH-H157, respectively). LEID-NL performed best               of TIM and 42 faster than the other two top three best per-
in the two simulated datasets (N2DH-SIM and N3DH-SIM).                 forming methods, HEID-GE and UPM-ES.
When we look at the number of appearances of each method                  C3DH-H157 (Supplementary Video S10). All the algorithms
among the top three best performing methods, both KTH-SE               that competed for this dataset achieved comparable segmenta-
and HEID-GE appeared in all eight datasets, LEID-NL and                tion accuracy, HEID-GE being the most accurate. Compared
PRAG-CZ appeared in three datasets and finally UPM-ES                  with the segmentation accuracy, the tracking accuracy was
and COM-US appeared in one dataset.                                    more spread out, PRAG-CZ being the most accurate. The de-
    It is also important to note that in the case of C3DH-H157,        cisive factor for establishing the final ranking of the top three
N2DH-GOWT1, N2DL-HeLa and N3DH-SIM, the decisive                       best performing methods was TIM. Globally, PRAG-CZ was
factor for establishing the final ranking was the speed of compet-     ranked first, having the lowest time consumption, namely,
ing algorithms because multiple methods were evenly ranked             because the preprocessing step, involving Gaussian filtering, is
based on the SEG and TRA accuracy measures only.                       applied only in 2D, slice-by-slice.
    Looking at each accuracy measure separately, HEID-GE and              C3DL-MDA231 (Supplementary Video S11). The calculated
KTH-SE ranked among the top three most accurate methods for            accuracy measures were generally poor, because of the
                                                                                                                                     1615
M.Maška et al.
high-throughput acquisition conditions, making it difficult to         post-processing phases implemented by KTH-SE, especially in
properly separate tightly packed clusters as well as accurately        low signal-to-noise ratio conditions.
segment elongated protrusions. Analogously to C2DL-MSC,
KTH-SE significantly outperformed the other methods in
terms of accuracy. This is due to the additional arcs included         4    CONCLUSION
in the state space diagram, which allow the delayed creation of        In this article, we have presented the implementation of a bench-
correct tracks originally blocked by incorrect preexisting tracks      mark for objective comparison of cell tracking algorithms, based
(Magnusson and Jaldén, 2012). Moreover, this is also a benefit of     on the use of a common diverse video dataset repository and
the track-free object merging used as an adaptive post-processing      ground truth, specific measures for both the evaluation of the
step.                                                                  segmentation and tracking accuracy, and unified criteria for
   N2DH-GOWT1 (Supplementary Video S12). Both accuracy                 comparing and ranking the algorithms. This is something re-
measures were high, especially SEG. A decisive factor for achiev-      cently highlighted by Carpenter et al. (2012) as a requirement
ing these good results was the use of a cluster separation routine     for the usability of biomedical imaging software. The logistics,
(e.g. watershed or the Radon transform), and a hole-filling ap-        datasets, methods and results of the challenge have been
proach to remove small background components within seg-               described herein. In the future, we expect this benchmark to
mented cells at places of prominent nucleoli. The top two best         serve as a reference for the development and evaluation of
performing methods, KTH-SE and PRAG-CZ, performed simi-                novel cell tracking algorithms. To this end, the training and
larly. Based on the SEG and TRA measures, they were assigned           competition datasets are available to the general public, along
the same rank of 3, with PRAG-CZ the best in segmenting and            with the ground truth for the training datasets and the self-evalu-
KTH-SE the best in tracking. Globally, KTH-SE was ranked               ation software. Moreover, executable versions of most of the
first because it was 5 faster than PRAG-CZ.                          competing algorithms will be available through the challenge
   N2DL-HeLa (Supplementary Video S13). The accuracy meas-             Web site. We expect the challenge to remain open for online
ures were high, especially TRA. Analogously to N2DH-GOWT1,             submissions, because there are open problems that need to be
the use of a cluster separation routine resulted in more accurate      addressed and new algorithms that can be developed to improve
results, although such routine sometimes led to over-segmenta-         the existing ones. Namely, accurately segmenting and tracking
tion, especially when multiple touching nuclei formed clusters of      cytoplasmically labeled cells is still something far from being
highly irregular shape. The top two best performing methods,           solved. Automated segmentation and tracking of other cell
KTH-SE and HEID-GE, produced results of comparable accur-              types, other modalities (e.g. brightfield time-lapse microscopy)
acy. Based on the SEG and TRA measures, they were assigned             and existing or new high-throughput modalities, such as selected
the same rank of 3, KTH-SE being the best in segmenting and            plane illumination microscopy, may require further algorithmic
HEID-GE being the best in tracking, mainly because of the spe-         developments, and therefore proper testing and validation that
cific mitosis detection phase implemented in their method.             could be achieved through this benchmark.
Globally, KTH-SE was ranked first as it was 42 faster than
                                                                       Funding: The Spanish Ministry of Economy (DPI2012-38090-
HEID-GE.
                                                                       C03-02 to C.O.d.S. and M.M.); the Czech Science Foundation
   N3DH-CHO (Supplementary Video S14). The accuracy meas-
                                                                       (302/12/G157 to M.K., P.M., P.M., D.S., G.M.H.); the
ures were the best among all the real datasets. This can be ex-
                                                                       European Social Fund and the Czech Ministry of Education
plained by the high magnification objective lens used and low cell
                                                                       (1.07/2.3.00/30.0009 to M.M.); the Swedish Research Council
density. Furthermore, these videos contain little of noise.
                                                                       (VR) (621-2011-5884 to K.M. and J.J.); the National Institutes
Analogously to N2DH-GOWT1, a crucial factor for achieving
                                                                       of Health (R01 HL096113 to H.B.) and California Institute
more accurate results was to involve a hole-filling approach to
                                                                       for Regenerative Medicine (RT1-01001-1, to H.B.); the Grant
deal with nucleoli. Globally, HEID-GE was ranked first, thanks
                                                                       Agency of the Czech Republic (P205/12/P392 to P.Křı́žek);
to its ability to deal with the presence of cell invaginations and a
                                                                       the project UNCE 204022 from the Charles University
morphology-based likelihood measure used to identify candi-
                                                                       (P.Křı́žek); the BMBF projects ENGINE (NGFNþ) and
dates for mitotic events.
                                                                       FANCI (SysTec) (N.H., K.R.); Spain’s Gov. projects (CDTI-
   N2DH-SIM and N3DH-SIM (Supplementary Videos S15 and
                                                                       AMIT, TEC2010-21619-C04-03, TEC2011-28972-C02-02) and
S16). The obtained SEG measures were similar to those for the
                                                                       European Development Funds (D.P-E, D.J-C and M.J.L-C.).
real datasets displaying stained nuclei. This indicates that our
simulator generates images of realistic static content in terms        Conflict of Interest: none declared.
of cell texture, noise level and image degradations. However, in
terms of tracking accuracy, as mentioned before, the algorithms
performed generally better than in the real datasets. Globally,        REFERENCES
LEID-NL was ranked first for both simulated datasets, fitting          Al-Kofahi,O. et al. (2006) Automated cell lineage construction: a rapid method to
with the idea that the model behind this method highly conforms            analyze clonal development established with murine neural progenitor cells. Cell
to computer-simulated nuclei of controlled cell motility.                  Cycle, 5, 327–335.
However, it needs further optimization to properly work in             Bise,R. et al. (2011) Reliable cell tracking by global data association. In: Proceedings
                                                                           of the IEEE International Symposium on Biomedical Imaging: From Nano to
real scenarios. From the analysis of the submitted results, we             Macro. pp. 1004–1010.
can finally stress the importance of the mitosis detection phase       Carpenter,A.E. et al. (2012) A call for bioimaging software usability. Nat. Methods,
implemented by HEID-GE, and the linking and adaptive track                 7, 666–670.
1616
                                                                                                                   Benchmarking of cell tracking algorithms
Chenouard,N. et al. (2013) Multiple-hypothesis tracking for cluttered                 Lin,G. et al. (2005) Hierarchical, model-based merging of multiple fragments
    biological image sequences. IEEE Trans. Pattern Anal. Mach. Intell., 35,             for improved three-dimensional segmentation of nuclei. Cytometry A, 63,
    2736–3750.                                                                           20–33.
Dima,A.A. et al. (2011) Comparison of segmentation algorithms for fluorescence        Long,F. et al. (2007) Automatic segmentation of nuclei in 3D microscopy images of
    microscopy images of cells. Cytometry A, 79, 545–559.                                C. Elegans. In: Proceedings of the 4th IEEE International Symposium on
Dufour,A. et al. (2005) Segmenting and tracking fluorescent cells in dynamic 3-D         Biomedical Imaging. pp. 536–539.
    microscopy with coupled active surfaces. IEEE Trans. Image Process., 14,          Magnusson,K.E.G. and Jaldén,J. (2012) A batch algorithm using iterative applica-
    1396–1410.                                                                           tion of the Viterbi algorithm to track cells and construct cell lineages. In:
Dufour,A. et al. (2011) 3-D active meshes: fast discrete deformable models for cell      Proceedings of the 9th IEEE International Symposium on Biomedical Imaging.
    tracking in 3-D time-lapse microscopy. IEEE Trans. Image Process., 20,               pp. 382–385.
    1925–1937.                                                                        Maška,M. et al. (2013) Segmentation and shape tracking of whole fluorescent cells
Dzyubachyk,O. et al. (2010) Advanced level-set-based cell tracking in time-lapse         based on the Chan-Vese model. IEEE Trans. Med. Imaging, 32, 995–1006.
    fluorescence microscopy. IEEE Trans. Med. Imaging, 29, 852–867.                   Meijering,E. et al. (2009) Tracking in cell and developmental biology. Semin. Cell
Fernandez-Gonzalez,R. et al. (2006) Quantitative in vivo microscopy: the return          Dev. Biol., 20, 894–902.
    from the ‘omics’. Curr. Opin. Biotechnol., 17, 501–510.                           Ortiz-de-Solorzano,C. et al. (1999) Segmentation of confocal microscope images of
Foggia,P. et al. (2013) Benchmarking HEp0 2 cells classification methods. IEEE           cell nuclei in thick tissue sections. J. Microsc., 193, 212–226.
    Trans. Med. Imaging, 32, 1878–1889.                                               Padfield,D. et al. (2011) Coupled minimum-cost flow cell tracking for high-through-
Friedl,P. and Alexander,D. (2011) Cancer invasion and the microenvironment: plas-        put quantitative analysis. Med. Image Anal., 15, 650–668.
    ticity and reciprocity. Cell, 147, 992–1009.                                      Rapoport,D.H. et al. (2011) A novel validation algorithm allows for automated cell
Friedl,P. and Gilmour,D. (2009) Collective cell migration in morphogenesis, regen-       tracking and the extraction of biologically meaningful parameters. PLoS One,
    eration and cancer. Nat. Rev. Mol. Cell Biol., 10, 445–457.                          11, e27315.
Held,C. et al. (2011) Comparison of parameter-adapted segmentation methods for        Rohr,K. et al. (2010) Tracking and quantitative analysis of dynamic movements of
    fluorescence micrographs. Cytometry A, 79, 933–945.                                  cells and particles. Cold Spring Harb. Protoc., 6, pdb.top80.
Indhumathi,C. et al. (2011) An automatic segmentation algorithm for 3D cell cluster   Svoboda,D. et al. (2009) Generation of digital phantoms of cell nuclei and simula-
    splitting using volumetric confocal images. J. Microsc., 243, 60–76.                 tion of image formation in 3D image cytometry. Cytometry A, 75, 494–509.
Kan,A. et al. (2011) Automated and semi-automated tracking: addressing portabil-      Svoboda,D. and Ulman,V. (2012) Generation of synthetic image datasets for time-
    ity challenges. J. Microsc., 244, 194–213.                                           lapse fluorescence microscopy. In: International Conference on Image Analysis
Legant,W.R. et al. (2010) Measurement of mechanical tractions exerted by cells in        and Recognition. pp. 473–482.
    three-dimensional matrices. Nat. Methods, 7, 969–971.                             Zimmer,C. et al. (2002) Segmentation and tracking of migrating cells in videomicro-
Li,K. et al. (2008) Cell population tracking and lineage construction with spatio-       scopy with parametric active contours: a tool for cell-based drug testing. IEEE
    temporal context. Med. Image Anal., 12, 546–566.                                     Trans. Med. Imaging., 21, 1212–1221.
Li,F. et al. (2010) Multiple nuclei tracking using integer programming for quanti-    Zimmer,C. et al. (2006) On the digital trail of mobile cells. IEEE Signal Process.
    tative cancer cell cycle analysis. IEEE Trans. Med. Imaging, 29, 96–105.             Mag., 23, 54–62.
1617