KEMBAR78
Mutation Testing | PDF | Coefficient Of Determination | Sampling (Statistics)
0% found this document useful (0 votes)
38 views11 pages

Mutation Testing

Uploaded by

Carlos Dorival
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
38 views11 pages

Mutation Testing

Uploaded by

Carlos Dorival
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Operator-Based and Random Mutant Selection:

Better Together
Lingming Zhang Milos Gligoric Darko Marinov Sarfraz Khurshid
University of Texas University of Illinois University of Illinois University of Texas
Austin, TX 78712 Urbana, IL 61801 Urbana, IL 61801 Austin, TX 78712
zhanglm@utexas.edu gliga@illinois.edu marinov@illinois.edu khurshid@utexas.edu

Abstract—Mutation testing is a powerful methodology for While mutation testing could be useful in many domains,
evaluating the quality of a test suite. However, the methodology it is extremely expensive. For example, a mutation testing
is also very costly, as the test suite may have to be executed for tool for C, Proteum [19], implements 108 mutation operators
each mutant. Selective mutation testing is a well-studied technique
to reduce this cost by selecting a subset of all mutants, which that generate 4,937 mutants for a small C program with only
would otherwise have to be considered in their entirety. Two 137 non-blank, non-comment lines of code [7]. Therefore,
common approaches are operator-based mutant selection, which generating and (especially) executing the large number of mu-
only generates mutants using a subset of mutation operators, tants against the test suite under evaluation is costly. Various
and random mutant selection, which selects a subset of mutants methodologies for reducing the cost of mutation testing have
generated using all mutation operators. While each of the two
approaches provides some reduction in the number of mutants been proposed. One widely used methodology is selective
to execute, applying either of the two to medium-sized, real- mutation testing [3]–[7], [9], [20], [21], which only generates
world programs can still generate a huge number of mutants, and executes a subset of mutants for mutation testing. Ideally,
which makes their execution too expensive. This paper presents the selected subset of mutants should be representative of the
eight random sampling strategies defined on top of operator- entire set of mutants.
based mutant selection, and empirically validates that operator-
based selection and random selection can be applied in tandem The most widely studied approach for selective mutation
to further reduce the cost of mutation testing. The experimental testing is operator-based mutant selection [3]–[5], [7], [9],
results show that even sampling only 5% of mutants generated [20], which only generates mutants using a subset of mutation
by operator-based selection can still provide precise mutation operators; the selected subset of mutation operators is required
testing results, while reducing the average mutation testing time to be effective, i.e., if a test suite kills all the non-equivalent
to 6.54% (i.e., on average less than 5 minutes for this study).
mutants generated by the selected set of operators (i.e., the
test suite is adequate for selected mutants), then the test suite
I. I NTRODUCTION should kill (almost) all the non-equivalent mutants generated
by all mutation operators. Further, selected operators should
Mutation testing [1]–[10] has been shown to be a powerful lead to high savings; the savings is usually calculated as the
methodology for evaluating the quality of test suites. In (first- ratio of non-selected mutants over all the mutants. Researchers
order) mutation testing, many variants of a program (known also evaluated a simple approach of random mutant selec-
as mutants) are generated through a set of rules (known as tion [7], [9], [22], and a recent study [7] reported that random
mutation operators) that seed one syntactic change at a time in selection is as effective as operator-based mutant selection
order to generate one mutant. Then, all the generated mutants when random selection selects the same number of mutants
are executed against the test suite under evaluation. A test from all mutants as the operator-based selection selects.
suite is said to kill a mutant if at least one test from the suite Although the existing approaches are effective, mutation
has a different result for the mutant and original program. If testing remains one of the most expensive methodologies
a mutant is semantically equivalent to the original program, in software testing. No previous study has explored how to
then it cannot be killed by any test. The more non-equivalent further reduce the number of mutants generated by operator-
mutants a test suite can kill, the higher quality the test suite is based mutant selection, and whether operator-based selection
predicted to be. Indeed, several studies [11], [12] have found and random selection can be combined. Also, previous work
mutants to be suitable alternatives for real faults which can has not explored how random mutant selection and operator-
be used for comparing testing techniques and test suites. The based selection relate for test suites that do not kill all non-
predicted quality of test suites also depends on the quality of equivalent mutants (i.e., non-adequate test suites). In addition,
mutants, consequently higher-order mutation testing has been all the studies for sequential mutants [3]–[5], [7], [20], [22]
proposed to generate a small number of mutants that each evaluated mutant selection on small C and Fortran programs—
have multiple syntactic changes [13], [14]. Mutation testing the largest program used for selective mutation testing was
has also been used to guide the test generation to kill mutants only 513 lines of code. Empirical studies on larger, real-world
automatically [14]–[18]. programs are lacking.

978-1-4799-0215-6/13 c 2013 IEEE 92 ASE 2013, Palo Alto, USA


Accepted for publication by IEEE. c 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/
republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
This paper makes the following contributions: In this study, we apply a set of sampling strategies on top
• Sampling mutation: We investigate a simple idea, which of the selected mutants. Let S be a sampling strategy; the
we call sampling mutation, to reduce the number of mu- set of mutants sampled by S from M is denoted MS . We
tants generated by operator-based mutant selection: sam- apply each strategy m times (with different random seeds) to
pling mutation randomly selects from the set of mutants generate a set of mutant samples: {MS1 , MS2 , ..., MSm }. The
generated by operator-based mutant selection (rather than set of mutants in MSj (1 ≤ j ≤ m) that are killed by test suite
from the set of mutants generated by all operators [7], Ti (1 ≤ i ≤ n) is denoted K(Ti , MSj ). Then, the sampling
[9], [22]); we call the process of obtaining the mutants mutation score achieved by Ti over MSj can be represented
sampling, the percentage of randomly selected mutants as:
|K(Ti , MSj )|
the sampling ratio, and the resulting set of mutants a M S(Ti , MSj ) = (2)
sample. |MSj ∩ N EM |
• Various sampling mutation strategies: We evaluate 8 Intuitively, if M S(Ti , MSj ) is close to M S(Ti , M ) for all
sampling strategies; our study is the first to consider 1 ≤ i ≤ n and 1 ≤ j ≤ m, we say that the sampling
random selection based on the program elements (rather strategy S applied on top of selected mutants is effective at
than on the mutation operators). Our empirical study predicting the result that would be obtained on all selected
shows that although all sampling strategies are effective mutants. (Section III precisely defines the predictive power.)
for mutation testing, sampling based on program elements
can provide the most effective results. B. Measurement
• Extensive study: We evaluate mutation sampling on 11 In the literature, there are two main approaches for evalu-
real-world Java projects of various sizes (from 2681 to ating the effectiveness of how a subset of mutants represents
36910 lines of code) to investigate the effectiveness, a larger set of mutants. (Traditionally, the sets are generated
predictive power, and savings of sampling mutation. Our by all operators, and the subsets are selected mutants; in
study evaluates effectiveness in case of adequate test our study, the sets are selected mutants, and the subsets are
suites, predictive power in case of non-adequate test sampled mutants.) First, researchers [3], [4], [7], [20] construct
suites, and savings in terms of time to generate and test suite Ti (1 ≤ i ≤ n) that can kill all non-equivalent
execute mutants. mutants from the subset (called adequate test suites), and
• Empirical evidence: The study shows that sampling calculate the mutation score of Ti on the original set of
mutation remains effective and has a high predictive mutants. Second, Namin et al. [5] also examined how, for test
power even while providing high savings. The study suites Ti (1 ≤ i ≤ n) that cannot kill all the non-equivalent
shows the cost-effectiveness of applying sampling mu- mutants from the subset (called non-adequate test suites), the
tation with various strategies and ratios. Surprisingly, mutation score of Ti on the subset of mutants compares with
for all our subjects, the experimental results show that the mutation score of Ti on the original set of mutants.
sampling only 5% of operator-based selected mutants In this study, we use both approaches to evaluate the
can still provide a precise mutation score, with almost sampling strategies applied on top of operator-based mutant
no loss in precision, while reducing the mutation time selection. For the first approach, since our sampling strategy
to 6.54% on average. Moreover, the study shows that may select different subsets of mutants at different runs, we
the sampling strategies are more beneficial for larger randomly construct n adequate test suites that can kill all
subjects; as more and more researchers are using mutants sampled non-equivalent mutants for each of the m sampling
to compare testing techniques, our sampling strategies can runs. We denote the ith (1 ≤ i ≤ n) test suite that kills all
help researchers to scale mutation to larger programs by sampled non-equivalent mutants in the jth (1 ≤ j ≤ m) run of
choosing a representative subset of mutants for efficient sampling (i.e., the selected mutants are MSj ) as Tij . Following
but effective evaluation. previous work [7], we use the following formula to measure
II. S TUDY A PPROACH the effectiveness of a sampling technique S:
A. Problem Definition Σm n
j=1 Σi=1 |M S(Tij , M )|
EF (S) = × 100% (3)
Given a program under test, P , and a test suite, T , we n∗m
denote the set of all selected mutants generated by operator- The only difference between our formula and the original
based mutant selection as M , and the set of non-equivalent formula [7] is that we also average over m sampling runs;
mutants in M as N EM . Following existing studies [3]–[5], the previous work did not average over different runs because
[7], we randomly construct n test suites of various sizes each run of their operator-based selection gives a fixed subset
{T1 , T2 , ..., Tn }; the set of mutants that can be killed by of mutants. Also note that in the evaluation, we present the
Ti (1 ≤ i ≤ n) is denoted K(Ti , M ). Then the (actual) standard deviation (SD) values across m sampling runs to
selected mutation score achieved by Ti over the selected show the stability of the sampling strategies.
mutants M is defined as: For the second approach, we randomly construct k non-
|K(Ti , M )| adequate test suites ({T1 , T2 , ..., Tk }) for each subject and
M S(Ti , M ) = (1) check how m runs of sampling influence the mutation scores
|N EM |

93
of the constructed test suites. We use the correlation between TABLE I: Subject programs used in the evaluation
sampling mutation score of Ti and selected mutation score of
#Mutants
Ti to measure the predictive power of a random sampling
Subject LOC #Tests All Killed
strategy S over the mutants generated by operator-based
TimeMoneyr207 [23] 2681 236 2304 1667
mutant selection:
JDependv2.9 [24] 2721 55 1173 798
P P (S) = Corr({hM S(Ti , MSj ), M S(Ti , M )i JTopasv2.0 [25] 2901 128 1921 1103
Barbecuer87 [26] 5391 154 36418 1002
|1 ≤ i ≤ k ∧ 1 ≤ j ≤ m}) (4) Mime4Jv0.50 [27] 6954 120 19111 4414
Jaxenr1346 [28] 13946 690 9880 4616
The correlation analysis is between the mutation scores on XStreamv1.41 [29] 18369 1200 18046 10022
the sampled mutant set MSj and the mutation scores on XmlSecurityv3.0 [30] 19796 83 9693 2560
the selected mutants M for all constructed test suites for all CommonsLangr1040879 [31] 23355 1691 19746 12970
sampling runs (1 ≤ j ≤ m). Section III presents, both visually JodaTimer1604 [32] 32892 3818 24174 16063
and statistically, the Corr functions we use. To illustrate, JMeterv1.0 [33] 36910 60 21896 2024
for all the m sampling runs of a strategy S, if we use the
mutation scores of the k tests suites on sampled mutants MSj
(1 ≤ j ≤ m), i.e., M S(Ti , MSj ) (1 ≤ i ≤ k) as the x- In this way, Sclass , Smeth , and Sstmt can be defined when
axis values, for each x value we will have a corresponding using the program element granularities of class, method,
y value to predict, which is M S(Ti , M ) (1 ≤ i ≤ k). For and statement, respectively.
a perfect strategy, the graph will be the straight line function • PElem-MOp-Based Strategies, which sample x% mu-
y = x, which means all the k test suites have exactly the same tants from each set of mutants generated by the same
mutation score on sampled mutants and all original mutants mutation operator inside the same program element. As-
before sampling. sume the sets of mutants generated for the set of program
elements in the project under test are Me1 , Me2 , ..., Mek ,
C. Combining Operator-Based and Random Mutant Selection
then M = ∪ki=1 Mei . Also assume the sets of mutants
Given selected mutants, M , we define eight random sam- generated by the set of selective mutation operator are
pling strategies that specify which mutants to select from M . Mop1 , Mop2 , ..., Moph , then M = ∪hj=1 Mopj . Finally,
• Baseline Strategy, which samples x% mutants from the the set of sampled mutants can be defined as:
selected set of mutants M . Formally, the set of mutants
MSpelem−mop = ∪ki=1 ∪hj=1 Sample(Mei ∩ Mopj , x%)
sampled by strategy Sbase can be defined as:
In this way, Sclass−mop , Smeth−mop , and Sstmt−mop can
MSbase = Sample(M, x%) be defined when using the program element granularities
where Sample(M, x%) denotes random sampling of x% of class, method, and statement, respectively.
mutants from M .1 Note that the first two strategies, Sbase and Smop , have been
• MOp-Based Strategy, which samples x% mutants from used by previous studies [7], [22] to evaluate random mutant
each set of mutants generated by the same mutation selection from all mutants. We believe that using all mutants
operator. Assume the sets of mutants generated by the set as the candidate set may be unnecessary. Therefore, we use
of selective mutation operators, say op1 , op2 , ..., opk , are these two strategies to evaluate random mutant sampling from
Mop1 , Mop2 , ..., Mopk , i.e., M = ∪ki=1 Mopi . Then, the operator-based selected mutants. In addition, our three Spelem
set of mutants sampled by strategy Smop can be formally strategies, which aim to sample mutants across all program
defined as: locations evenly, are the first to randomly sample mutants
at the program element dimension. Furthermore, our three
MSmop = ∪ki=1 Sample(Mopi , x%) Spelem−mop strategies are the first to sample mutants across
two dimensions: mutation operators and program elements.
• PElem-Based Strategies, which sample x% mutants
from each set of mutants generated inside the same pro- III. E MPIRICAL S TUDY
gram element (e.g., class, method, or statement). Assume
We performed an extensive empirical evaluation to demon-
the sets of mutants generated for the set of elements
strate the effectiveness, predictive power, and savings of the
in the project under test are Me1 , Me2 , ..., Mek , i.e.,
proposed sampling strategies.
M = ∪ki=1 Mei . Then, the set of sampled mutants can be
defined as: A. Subject Programs
MSpelem = ∪ki=1 Sample(Mei , x%) The evaluation includes a broad set of Java programs from
various sources. We chose programs of different sizes (from
1 If the sample size is a float f , we first sample ⌊f ⌋ mutants at random, 2681 to 36910 LOC) to explore the benefits of our sampling
and then with probability f − ⌊f ⌋ pick one more mutant at random. strategies for various cases.

94
Table I shows 11 subject programs used in the evalu- selected mutation score of those test suites. The higher the
ation: TimeMoney, a set of classes for manipulating time selected mutation score is, the more effective the selected
and money; JDepend, a tool for measuring the quality of mutants are for evaluating test suites (Equation 3). (The same
code design; JTopas, a library for parsing arbitrary text data; experimental procedure was used previously to measure the
Barbecue, a library for creating barcodes; Mime4J, a parser effectiveness of operator-base selection and random selec-
for e-mail message streams in MIME format; Jaxen, an tion [3]–[5], [7], [9], [20].)
implementation of XPath engine; XStream, a library for fast DV2: Predictive Power. For each sampling strategy S, we
serialization/deserialization to/from XML; XmlSecurity, an also construct test suites that do not kill all sampled non-
Apache project that implements security standards for XML; equivalent mutants, and use statistical analysis to measure
CommonsLang, an Apache project that extends standard Java the predictive power of the sampled mutants (equation 4). If
library; JodaTime, a replacement for standard Java date and the constructed test suites have similar values for sampling
time classes; and JMeter, an Apache project for performance mutation score and selected mutation score, then the sampled
testing. All the 11 subjects have been widely used in software mutants are a good predictor of the selected mutants. More
testing research [6], [10], [34]–[37]. precisely, we instantiate the Corr function to measure: R2
Table I includes some characteristics of the programs. coefficient of determination for linear regression, Kendall’s τ
Column “Subject” shows the name of each subject program, rank correlation coefficient, and Spearman’s ρ rank correlation
the version/revision number (as applicable) and the reference coefficient.
to the webpage with sources; “LOC” shows the number of DV3: Time Savings. For each triple (P , S, r) of subject
non-blank lines of code measured by JavaSourceMetric [38]; program P , sampling strategy S, and sampling ratio r, we
“#Tests” shows the number of available tests for the pro- compare the mutation testing time for the sampled mutants
gram (it is important to note that we have not created any and the mutation testing time for the selected mutants.
special test for the purpose of this study: all the tests for 3) Experimental Setup: Following previous studies on se-
11 subjects come from their code repositories, and to the lective mutation testing [5], [7], we deemed all mutants that
best of our knowledge, all these tests are manually written); cannot be killed by any test from the original test suite as
“#MutantsAll” and “#MutantsKilled” show the total number equivalent mutants in our study. We evaluate all the sampling
of mutants that Javalanche [6], [34] generated using operator- strategies with all sampling ratios on all subjects. Given a
based selection and the number of killed mutants, respectively. subject program and selected mutants for that program, we
We used Javalanche because it is a state-of-the-art mutation first run sampling 20 times for each of 8 sampling strategies
testing tool for Java programs. It generates mutants using the with each of the 19 sampling ratios. As a result, we get
operator-based mutant selection approach proposed by Offutt 20*8*19=3,040 samples of mutants for each subject program.
et al. [3], [20]. Specifically, Javalanche uses the following four Then, for each sample of mutants, we randomly construct
mutation operators: Negate Jump Condition, Omit Method 20 adequate test suites that each kill all the non-equivalent mu-
Call, Replace Arithmetic Operator, and Replace Numerical tants in sampled mutants, i.e., we construct 20*3,040=60,800
Constant. Note that the subjects used in our study are orders test suites for each subject. Next, we measure the selected
of magnitude larger than the subjects used in previous studies mutation score for each test suite. Each test suite is randomly
on selective mutation testing [3], [5], [7], [20], [22]. constructed by including one test at a time until all sampled
non-equivalent mutants are killed. We deviate from the previ-
B. Experimental Design ous work [3], [7], [20] that constructed test suites by including
We next describe our experimental setup and the data we multiple tests at a time (using increment of 50 or 200), as such
collected. decision can lead to large test suites and high selected mutation
1) Independent Variables: We used the following indepen- scores that do not correspond to practice. By including one test
dent variables in the study: at a time, we simulate a more realistic use of mutation testing
IV1: Different Random Sampling Strategies. We apply in practice, where a user could include one test at a time until
each of our eight sampling strategies on top of the mutants all the mutants are killed.
generated by operator-based mutant selection, to investigate Next, for each subject, we randomly construct 100 (non-
their effectiveness, predictive power, and savings. adequate) test suites of various sizes that do not necessary kill
IV2: Different Sampling Ratios. For each sampling strategy all the sampled mutants. We randomly construct each test suite
S, we use 19 sampling ratios r ∈ {5%, 10%, ..., 95%}. by uniformly selecting the size of the test suite to be between
IV3: Different Subject Sizes. For each strategy S with each 1 and the number of tests available for the subject. Note that
ratio r, we apply S on all the subjects with various sizes, and our experiments differ in this step from previous work [5],
investigate the differences. where 100 test suites were generated by taking two test suites
2) Dependent Variables: We used the following dependent for each size between 1 and 50. The reason to deviate from
variables to investigate the output of the experiments: previous work is that our programs greatly differ in size and
DV1: Effectiveness. For the mutants sampled by each strategy number of tests, which was not the case in previous studies.
S among all selected mutants, we construct test suites that For example, taking sizes between 1 and 50 does not seem
can kill all sampled non-equivalent mutants, and record the appropriate for both Barbecue and JodaTime (with 154 and

95
3818 tests, respectively). Therefore, we uniformly select the effectiveness of mutation testing: the more mutants sampled,
sizes of the test suites up to the total number of tests for the more precise and stable the results would be.
each subject program. Then we measure the sampling mutation Second, the studied strategies perform better on larger
score (i.e., the mutation score on the sampled mutants) and subjects than on the smaller subjects. For example, when
selected mutation score (i.e., the mutation score on the selected sampling 5% of mutants, Smeth achieves the average selected
mutants) achieved by each of the constructed test suites. We mutation scores ranging from 98.31% to 99.32% for the
further perform correlation analysis between the sampling first four subjects that have fewer than 6000 LOC, while it
mutation score and the selected mutation score for all test achieves the average selected mutation scores ranging from
suites on each strategy and ratio combination on each subject. 99.69% to 99.92% for all the other seven larger subjects. This
(Section III-C2 shows the details.) demonstrates that using small sampling ratios (e.g., r=5%) of
Finally, for each sample of mutants, we also trace the mutants is more beneficial for evaluating test suites for larger
time for generating and executing the mutants. Although it is subjects. Section III-D further investigates the effectiveness of
common in the literature to report the savings in terms of the sampling mutation for ratios even below 5%.
number of mutants not generated, this information is implicitly Third, all the strategies perform similarly, but Smeth and
given in our study through the sampling ratio (e.g., if a Smeth−mop tend to perform the best of all the strategies for
sampling ratio is 5%, we have 20x fewer mutants). Therefore, the majority of the subjects. Moreover, the additional use of
our study also reports the mutation execution time in order mutation operator information in Smeth−mop does not make it
to confirm that savings in terms of the number of mutants outperform Smeth . This demonstrates that sampling mutants
correspond to the savings in terms of mutation execution time across different program elements can be a better choice
for mutation sampling. We performed all experiments on a than sampling mutants globally (Sbase ) or across different
Dell desktop with Intel i7 8-Core 2.8GHz processor, 8G RAM, mutation operators (Smop ). Smeth performs better than Sclass
and Windows 7 Enterprise 64-bit version. and Sstmt likely because sampling at the class level is too
coarse (bringing it closer to Sbase ), while sampling at the
C. Results and Analysis statement level is too fine making it select no mutant from
We report the most interesting findings of our study in this some statements (because the number of mutants for each
section, while some additional results and detailed experimen- statement is relatively small).
tal data are publicly available online [39]. 2) Predictive Power for Non-Adequate Test Suites: While
1) Effectiveness for Adequate Test Suites: Table II shows the above results showed that adequate sampling mutation
the selected mutation scores achieved by randomly constructed score implies high selected mutation score, it is uncommon in
adequate test suites that achieve 100% sampled mutation score, practice to have adequate test suites. Thus, we further investi-
i.e., kill all the sampled non-equivalent mutants. According to gate the predictive power of the sampling strategies for non-
our experimental setup, for each triple of subject program, adequate test suites that do not kill all sampled non-equivalent
strategy, and sampling ratio, (P , S, r), we obtain 20 samples mutants. More precisely, we analyze whether the sampling
of mutants and construct 20 adequate test suites for each mutation score is a good predictor of the selected mutation
sample. Thus, for each (P , S, r), we show the average score across a range of test suites, which are almost all non-
selected mutation score and standard deviation achieved by adequate. Ideally, for all (non-adequate) test suites sampling
the 20*20=400 test suites. Specifically, column “Ra.” shows and selected mutation score would have the same value. In
sampling ratio, column “Subject” shows the subject name, and practice, if a test suite achieves selected mutation score M S,
columns 3-18 show the average values and standard deviations the same test suite may achieve sampling mutation score M S ′
achieved by 8 sampling strategies. The results for all the 19 such that M S < M S ′ , M S = M S ′ , or M S > M S ′ . We use
sampling ratios can be found on the project webpage [39]. three statistical measures to evaluate the predictive power of
Based on the obtained values, we make several observations sampling mutation score for all strategies.
as follows. Evaluating Single Test Suite. Originally, mutation testing
First, for all subjects and all sampling strategies, one can was proposed as a method for evaluating the quality of test
see that the sampled mutants are extremely effective, i.e., the suites by measuring mutation score; the higher mutation score
sampled mutants are representative of the selected mutants. For means higher quality. To evaluate a test suite using one of
example, even when sampling 5% of the selected mutants, the the sampling strategies, we have to ensure that the result
test suites that kill all the sampled mutants can kill almost all obtained on the sampled mutants predicts the result that would
selected mutants. To illustrate, when sampling 5% of selected be obtained on all the selected mutants. Following previous
mutants, the selected mutation score for Sbase strategy ranges work [5], we determine how well the independent variable
from 98.23% (on JDepend) to 99.91% (on JodaTime) with (sampling mutation score) predicts the dependent variable
the average value of 99.44%. As the sampling ratio increases, (selected mutation score) using a linear regression model.
all the strategies have higher selected mutation score and We measure the quality of fit of a model by calculating the
lower standard deviation for all subjects. This demonstrate adjusted coefficient of determination R2 , which is a statistical
that a user can use the sampling strategies to control the cost- measure of how well the regression line approximates the real

96
TABLE II: Selected mutation scores (%) achieved by the test suites that achieve 100% sampled mutation scores
Base MOp Class Meth Stmt Class-MOp Meth-MOp Stmt-MOp
Ra. Subjects MS. Dev. MS. Dev. MS. Dev. MS. Dev. MS. Dev. MS. Dev. MS. Dev. MS. Dev.
TimeMoney 99.14 1.12 99.30 0.98 99.15 1.20 99.32 0.90 99.35 0.94 99.10 1.23 99.26 1.00 99.09 1.17
JDepend 98.23 1.85 98.10 2.09 97.81 2.34 98.31 1.91 98.31 1.71 97.99 2.31 98.54 1.84 98.67 1.54
JTopas 99.37 1.09 99.24 1.23 99.32 1.18 99.25 1.18 99.30 1.22 99.19 1.35 99.25 1.31 99.16 1.30
Barbecue 98.63 1.66 98.64 1.82 98.92 1.56 98.99 1.26 99.03 1.37 98.64 1.76 99.04 1.41 98.69 1.73
Mime4J 99.77 0.34 99.78 0.34 99.77 0.32 99.81 0.29 99.76 0.32 99.82 0.26 99.80 0.28 99.78 0.39
5% Jaxen 99.72 0.33 99.69 0.36 99.67 0.37 99.69 0.33 99.67 0.42 99.70 0.36 99.64 0.39 99.70 0.36
XStream 99.85 0.21 99.86 0.17 99.87 0.18 99.90 0.14 99.88 0.15 99.87 0.18 99.88 0.15 99.85 0.19
XmlSecurity 99.62 0.61 99.53 0.74 99.68 0.56 99.69 0.50 99.64 0.70 99.73 0.46 99.73 0.45 99.68 0.56
CommonsLang 99.90 0.13 99.89 0.15 99.89 0.14 99.92 0.10 99.90 0.14 99.89 0.14 99.90 0.12 99.89 0.13
JodaTime 99.91 0.12 99.91 0.11 99.91 0.11 99.92 0.09 99.92 0.11 99.91 0.11 99.91 0.10 99.91 0.12
JMeter 99.71 0.53 99.73 0.52 99.76 0.43 99.76 0.37 99.72 0.48 99.72 0.51 99.77 0.44 99.67 0.59
Avg. 99.44 - 99.42 - 99.43 - 99.51 - 99.50 - 99.41 - 99.52 - 99.46 -
TimeMoney 99.61 0.61 99.66 0.47 99.71 0.43 99.75 0.37 99.67 0.48 99.67 0.48 99.69 0.45 99.72 0.43
JDepend 99.21 1.17 99.27 1.03 99.35 0.90 99.40 0.89 99.43 0.84 99.32 0.92 99.38 0.91 99.32 0.87
JTopas 99.67 0.61 99.62 0.65 99.74 0.44 99.66 0.54 99.65 0.65 99.64 0.61 99.76 0.46 99.75 0.43
Barbecue 99.45 0.79 99.36 0.95 99.42 0.82 99.60 0.64 99.56 0.62 99.54 0.72 99.47 0.88 99.49 0.74
Mime4J 99.91 0.17 99.90 0.16 99.93 0.13 99.91 0.15 99.91 0.15 99.92 0.13 99.92 0.13 99.92 0.14
10% Jaxen 99.85 0.19 99.85 0.20 99.87 0.17 99.87 0.17 99.87 0.17 99.85 0.19 99.86 0.18 99.87 0.16
XStream 99.95 0.08 99.94 0.08 99.94 0.08 99.96 0.06 99.96 0.07 99.95 0.09 99.95 0.07 99.95 0.09
XmlSecurity 99.88 0.23 99.86 0.27 99.88 0.22 99.86 0.23 99.88 0.20 99.86 0.26 99.86 0.25 99.88 0.25
CommonsLang 99.96 0.06 99.96 0.06 99.96 0.06 99.97 0.04 99.96 0.07 99.95 0.06 99.96 0.06 99.96 0.06
JodaTime 99.96 0.05 99.96 0.05 99.96 0.06 99.97 0.04 99.97 0.04 99.96 0.06 99.97 0.05 99.97 0.05
JMeter 99.86 0.27 99.87 0.25 99.90 0.21 99.92 0.17 99.88 0.23 99.88 0.27 99.89 0.24 99.88 0.25
Avg. 99.76 - 99.75 - 99.79 - 99.81 - 99.79 - 99.78 - 99.79 - 99.79 -
TimeMoney 99.81 0.30 99.81 0.30 99.82 0.29 99.88 0.19 99.86 0.23 99.84 0.27 99.86 0.24 99.81 0.30
JDepend 99.42 0.84 99.50 0.82 99.69 0.55 99.63 0.53 99.74 0.48 99.63 0.57 99.59 0.59 99.62 0.60
JTopas 99.82 0.32 99.82 0.35 99.80 0.32 99.84 0.28 99.87 0.24 99.85 0.26 99.83 0.32 99.79 0.37
Barbecue 99.68 0.50 99.73 0.40 99.70 0.47 99.74 0.36 99.70 0.46 99.72 0.47 99.74 0.41 99.69 0.52
Mime4J 99.95 0.10 99.95 0.10 99.96 0.08 99.96 0.09 99.96 0.08 99.94 0.10 99.97 0.08 99.95 0.10
15% Jaxen 99.90 0.15 99.91 0.12 99.92 0.11 99.92 0.10 99.92 0.12 99.92 0.10 99.92 0.12 99.92 0.11
XStream 99.97 0.05 99.97 0.05 99.97 0.06 99.98 0.04 99.98 0.04 99.97 0.05 99.97 0.04 99.97 0.06
XmlSecurity 99.91 0.19 99.94 0.12 99.93 0.16 99.92 0.15 99.94 0.11 99.92 0.14 99.95 0.11 99.93 0.15
CommonsLang 99.97 0.05 99.98 0.04 99.97 0.04 99.99 0.02 99.98 0.03 99.98 0.03 99.98 0.03 99.98 0.04
JodaTime 99.98 0.03 99.98 0.03 99.98 0.03 99.99 0.02 99.98 0.02 99.98 0.03 99.98 0.03 99.98 0.03
JMeter 99.93 0.16 99.93 0.17 99.93 0.18 99.96 0.12 99.94 0.15 99.94 0.17 99.96 0.14 99.94 0.15
Avg. 99.85 - 99.87 - 99.88 - 99.89 - 99.90 - 99.88 - 99.89 - 99.87 -
TimeMoney 99.89 0.19 99.88 0.21 99.88 0.20 99.91 0.15 99.91 0.16 99.88 0.20 99.90 0.21 99.89 0.18
JDepend 99.73 0.45 99.74 0.45 99.71 0.52 99.78 0.38 99.83 0.33 99.80 0.34 99.78 0.41 99.73 0.48
JTopas 99.89 0.20 99.89 0.20 99.86 0.26 99.90 0.20 99.89 0.20 99.89 0.22 99.85 0.31 99.87 0.27
Barbecue 99.80 0.38 99.72 0.44 99.81 0.31 99.84 0.28 99.82 0.31 99.80 0.33 99.81 0.33 99.78 0.39
Mime4J 99.97 0.07 99.97 0.07 99.98 0.05 99.97 0.05 99.98 0.04 99.97 0.06 99.97 0.06 99.97 0.07
20% Jaxen 99.95 0.07 99.94 0.09 99.94 0.09 99.95 0.08 99.94 0.08 99.95 0.07 99.94 0.08 99.94 0.09
XStream 99.98 0.03 99.98 0.04 99.98 0.03 99.99 0.02 99.98 0.03 99.98 0.03 99.99 0.03 99.98 0.03
XmlSecurity 99.95 0.09 99.94 0.13 99.95 0.10 99.96 0.08 99.97 0.07 99.94 0.15 99.95 0.10 99.95 0.10
CommonsLang 99.99 0.02 99.98 0.03 99.99 0.02 99.99 0.02 99.99 0.02 99.98 0.03 99.99 0.02 99.99 0.03
JodaTime 99.99 0.02 99.99 0.02 99.98 0.02 99.99 0.01 99.99 0.01 99.99 0.02 99.99 0.02 99.99 0.02
JMeter 99.95 0.15 99.95 0.13 99.95 0.14 99.98 0.09 99.96 0.12 99.97 0.10 99.98 0.07 99.94 0.16
Avg. 99.92 - 99.91 - 99.91 - 99.93 - 99.93 - 99.92 - 99.92 - 99.91 -

data points. The value of R2 is between 0 and 1, where a the x-axis shows the sampling mutation scores achieved by
higher value indicates a better goodness of fit. the test suites on various sampling runs, while the y-axis
We calculate R2 for each triple (P , S, r) consisting of a shows the selected mutation score for the same test suites.
subject program, strategy, and ratio. For each sample strategy There are 20*100=2,000 points on each plot. From the three
S, we sample mutants at each ratio r and measure sampling subfigures, we can see that a higher sampling ratio (r) leads to
mutation score for the same set of randomly constructed test more stable data points, which can also be seen by smoother
suites (Section III-B3). We repeat sampling 20 times to obtain splines. However, note that the sampling mutation scores on
sampling and selected mutation scores for a variety of samples. all sampling runs are close to their selected mutation scores
We then calculate how well the sampling mutation scores from even when r = 5%.
all the 20 sampling runs predict the selected mutation scores
by calculating R2 values2 . Note that we calculate R2 for all The left part of Table III shows R2 values for all strategies
20 sampling runs at once (which gives a more robust result with the sampling ratio of 5% on all subjects. (Due to the
than calculating R2 for individual runs and averaging the result space limit, the detailed results for the other ratios are not
over 20 sampling runs). To illustrate, Figure 1 shows scatter shown but can be found on the project webpage [39].) Column
plots of the sampling mutation score and the selected mutation “Subjects” lists the name of the subjects, and columns 2-9
score for CommonsLang. In each of the three subfigures, include R2 values for all 8 sampling strategies. The higher
the R2 value is, the better predictor the sampling strategy
2 We use R language for statistical computing. is. We find that the R2 results at the 5% ratio level are

97
100 CommonsLang CommonsLang CommonsLang

100

100
Selected Mutation Score (%)

Selected Mutation Score (%)

Selected Mutation Score (%)


80

80

80
60

60

60
40

40

40
20

20

20
0

0
0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100
Sampling Mutation Score (%) Sampling Mutation Score (%) Sampling Mutation Score (%)

(a) 5% ratio (b) 10% ratio (c) 15% ratio

Fig. 1: Sampling mutation score vs. Selected mutation score, with best fit line (black color) and smoothing spline line (red
color), for CommonsLang subject, Meth strategy, and three different sampling ratios

TABLE III: R2 and τ correlation values between mutation scores on sampled 5% mutants and on mutants before sampling
R2 correlation Kendall’s τ correlation
Subjects Base MOp Class Meth Stmt Class Meth Stmt Base MOp Class Meth Stmt Class Meth Stmt
-MOp -MOp -MOp -MOp -MOp -MOp
TimeMoney 0.971 0.972 0.975 0.977 0.976 0.975 0.975 0.972 0.874 0.872 0.881 0.888 0.882 0.877 0.880 0.872
JDepend 0.936 0.928 0.927 0.946 0.948 0.933 0.934 0.920 0.784 0.782 0.774 0.791 0.796 0.766 0.789 0.790
JTopas 0.932 0.923 0.909 0.945 0.944 0.943 0.922 0.922 0.845 0.834 0.826 0.856 0.851 0.850 0.836 0.831
Barbecue 0.956 0.951 0.958 0.970 0.964 0.961 0.962 0.946 0.844 0.833 0.849 0.867 0.861 0.854 0.857 0.832
Mime4J 0.991 0.992 0.993 0.994 0.992 0.992 0.993 0.991 0.936 0.937 0.938 0.943 0.936 0.938 0.939 0.933
Jaxen 0.985 0.987 0.983 0.990 0.984 0.986 0.986 0.982 0.894 0.896 0.886 0.910 0.898 0.902 0.901 0.890
XStream 0.993 0.994 0.996 0.996 0.995 0.996 0.996 0.995 0.942 0.946 0.954 0.956 0.949 0.953 0.954 0.948
XmlSecurity 0.982 0.984 0.986 0.986 0.983 0.987 0.984 0.982 0.888 0.892 0.904 0.900 0.894 0.906 0.899 0.896
CommonsLang 0.996 0.996 0.997 0.998 0.997 0.997 0.997 0.997 0.953 0.955 0.955 0.964 0.957 0.956 0.958 0.956
JodaTime 0.996 0.996 0.997 0.998 0.996 0.996 0.997 0.996 0.950 0.950 0.954 0.957 0.952 0.950 0.953 0.949
JMeter 0.982 0.982 0.988 0.989 0.985 0.985 0.988 0.980 0.915 0.917 0.931 0.935 0.925 0.924 0.931 0.913
Avg. 0.975 0.973 0.974 0.981 0.979 0.977 0.976 0.971 0.893 0.892 0.896 0.906 0.900 0.898 0.900 0.892

already extremely high, e.g., ranging from 0.945 (on JTopas) to sampling mutation is valuable and can be used for evaluation
0.998 (on CommonsLang) for the Smeth strategy. This further of test suites. We believe that the results of our study can
confirms our findings for adequate test suites—the sampling greatly impact the use of mutation testing in research practice;
ratio of 5% can be effective for mutation testing in practice. using sampling mutation testing makes it feasible to evaluate
In addition, similar to our findings for adequate test suites, the quality of test suites for large-scale programs.
the sampling strategies are less effective for smaller subjects, Comparing Testing Techniques and Test Suites. Mutation
e.g., the R2 for the Smeth strategy ranges from 0.945 to 0.977 testing has also been extensively used in studies that compare
for the first four subjects below 6,000 LOC, while it is over testing techniques [40]–[42]. Commonly, a testing technique
0.98 for the other seven larger subjects. Furthermore, although or a test suite that has a relatively higher mutation score than
all the strategies perform well, the Smeth strategy slightly another testing technique or test suite is claimed to be better
outperforms Sbase and Smop for all the 11 subjects, indicating (regardless of the absolute mutation score that it achieves). We
again that sampling across different program elements can be a thus want to evaluate whether sampling mutation can be used
better choice than sampling purely randomly from all mutants for comparison of testing techniques and test suites, i.e., if a
or sampling across different mutation operators. test suite T has a higher sampling mutation score than another
To show how the correlation varies when the sampling ratio test suite T ′ , does T have a higher selected mutation score than
changes, Figure 2a shows the R2 values for all 8 strategies T ′ ? Similar to a previous study [5], we calculate Kendall’s τ
when the sampling ratio increases from 5% to 95% for the and Spearman’s ρ rank correlation coefficients, which measure
subject CommonsLang. The plots for the other subjects look the strength of the agreement between two rankings. Both τ
similar and are available on the project webpage [39]. We and ρ can take values between -1 and 1, where 1 indicates
can draw the following conclusions. First, Smeth is slightly perfect agreement, and -1 indicates perfect disagreement.
better than the other strategies across all sampling ratios, To illustrate how τ is computed, consider all the pairs of
further demonstrating the benefits of sampling mutants across sampling and selected mutation scores; two pairs (M S1 , M S1′ )
program elements. Second, more importantly, all sampling and (M S2 , M S2′ ) are said to be concordant if (M S1 >
strategies predict the selected mutation score very well. Across M S2 ∧ M S1′ > M S2′ ) ∨ (M S1 < M S2 ∧ M S1′ < M S2′ )
all the programs, strategies, and ratios, the minimum R2 was and discordant if (M S1 < M S2 ∧ M S1′ > M S2′ ) ∨ (M S1 >
0.909 (for JTopas). Extremely high R2 gives evidence that M S2 ∧ M S1′ < M S2′ ); otherwise, the pair is neither concor-

98
CommonsLang CommonsLang

1.000

1.00
0.99
0.999

0.98
Kendall
R^2
0.998
base base
mop mop

0.97
class class
method method
0.997

stat stat

0.96
class−mop class−mop
method−mop method−mop
stat−mop stat−mop
20 40 60 80 20 40 60 80
Sampling Ratio (%) Sampling Ratio (%)

(a) R2 (b) τ

Fig. 2: Correlation values for CommonsLang subject, all strategies, and all rates

TABLE IV: Selective and sampling mutation testing time operator-based selection), and the sampling mutation testing
All Mutants 5% Sampled Mutants (mm:ss) time for the sampling ratio of 5% and our Smeth strategy.
Subject (mmm:ss) Min. Max. Avg. (Pct.)
Column “Subject” lists the subjects, column “All Mutants”
TimeMoney 7:13 0:54 0:57 0:55 (12.89%)
JDepend 3:02 0:31 0:33 0:32 (17.66%) shows the mutant generation and execution times for all the
JTopas 34:59 1:01 1:12 1:03 (3.02%) mutants generated by Javalanche, and columns 3-5 list the
Barbecue 7:35 2:42 2:56 2:46 (36.62%)
Mime4J 181:20 6:44 9:24 8:09 (4.50%)
minimum/maximum/average mutant generation and execution
Jaxen 48:21 2:49 4:03 3:15 (6.75%) times for the sampling mutation with the sampling ratio of
XStream 132:02 4:37 9:49 6:07 (4.64%) 5% across 20 sampling runs. In column 6 (“Pct.”), we also
XmlSecurity 53:04 3:17 4:03 3:46 (7.10%)
CommonsLang 74:35 5:12 6:51 6:01 (8.08%) show the ratio of the sampling mutation testing time over
JodaTime 196:28 12:28 19:03 14:57 (7.61%) the selected mutation testing time. Note that we include
JMeter 57:32 3:42 5:26 4:28 (7.77%)
Avg. 72:22 - - 4:43 (6.54%)
the mutant generation time of all selected mutants for both
selected mutation and sampling mutation, because our current
implementation requires Javalanche to generate all the mutants
dant nor discordant. Kendall’s τ is calculated as the ratio of before sampling. The results show that the sampling mutation
difference between the number of concordant and discordant testing time, with sampling ratio of 5%, is close to 5% of
pairs over total number of pairs. In this paper we use τb , which the selected mutation testing time. We further noticed that the
has a more complex computation because it takes ties into sampling mutation testing time on small subjects tends to be
consideration. longer than expected 5% of selected mutation time, because for
We calculate Kendall’s τb for each triple (P , S, r), following small subjects the tool setup time and the mutant generation
the same procedure as for R2 . Similar with the R2 measure, time (rather than the mutation execution time) can dominate
we show the τb measure for all the strategies on all subjects the total mutation testing time. However, for the seven larger
with the sampling ratio of 5% in the right part of Table III. We subjects, the tool setup time and the mutant generation time
also show Kendall’s τb values for CommonsLang subject, all take insignificant time compared to the total mutation testing
sampling strategies, and all sampling ratios in Figure 2b. The time, leading to sampling mutation time from 4.50% to 8.08%
plots for the other examples look similar and are available on of the selected mutation time. On average across all the 11
the project webpage [39]. Across all the subjects, all strategies, subjects, the sampling mutation testing time is less than 5
and all ratios, the minimal value for τb in our study was 0.766 minutes; in contrast, the original Javalanche time is much more
(for JDepend). and exceeds 70 minutes.
Considering Table III and Figure 2b, we can draw similar
conclusions as from the R2 correlation measures. First, all D. Below 5%
sampling strategies provide very similar result for Kendall’s Our experimental results show that it is possible to greatly
τ . In addition, Smeth slightly outperforms Sbase and Smop for reduce the number of mutants (e.g., sampling only 5% mu-
all 11 subjects. Second, all the values are very high, which tants) while still preserving the mutation score. However, it
indicates very strong agreement between rankings. The results was not clear whether we can use sampling ratio below 5%.
for Spearman’s ρ show even stronger agreement (details can Thus, we additionally collected the experimental results for
be found on the project webpage [39]). Based on our study, we sampling fewer than 5% mutants. Table V shows the results
believe that the comparison of test suites or testing techniques for the Sbase and Smeth strategies. The detailed results for
can be done using sampling mutation. all the 8 strategies can be found online [39]. In the table,
3) Savings Obtained by Mutation Sampling: Table IV Column 1 lists all the studied sampling ratios, columns 2-4 list
shows the selected mutation testing time for all the mu- the average selected mutation scores for adequate test suites
tants generated by Javalanche (recall that Javalanche uses as well as the average R2 and the τ correlation values for

99
TABLE V: Results of sampling below 5% of selected mutants a large amount of research has been dedicated to reducing
Ra. Base Meth the cost of mutation testing and exploring the application of
MS. R2 τ MS. R2 τ mutation testing. In this section, we first discuss the related
0.5% 92.24 0.806 0.716 92.02 0.803 0.722
1.0% 96.55 0.896 0.792 96.32 0.905 0.799 work in reducing the cost of mutation testing. Then we discuss
1.5% 97.27 0.926 0.819 97.71 0.939 0.837 the existing applications of mutation testing. More details
2.0% 98.21 0.944 0.843 98.48 0.952 0.858
2.5% 98.68 0.955 0.859 98.82 0.964 0.872
about existing work on mutation testing can be found in a
3.0% 98.94 0.964 0.874 99.00 0.973 0.885 recent survey by Jia and Harman [8].
3.5% 99.07 0.968 0.877 99.22 0.977 0.895
4.0% 99.22 0.974 0.888 99.34 0.978 0.899 A. Reducing The Cost of Mutation Testing
4.5% 99.38 0.974 0.893 99.46 0.982 0.907
There are mainly three ways to reduce the cost of mutation
testing – selecting a subset of all mutants (Section IV-A1), ex-
inadequate test suites by the Sbase strategy across all subjects. ecuting each mutant partially (Section IV-A2), and optimizing
Similarly, columns 5-7 list the corresponding results for the mutation generation and execution (Section IV-A3).
Smeth strategy. 1) Selective Mutation Testing: Selective mutation testing,
The results show that it is possible to have a fairly reliable which was first proposed by Mathur [44], aims to select
mutation score even when sampling fewer than 5% mutants. a representative subset of all mutants that can still achieve
However, by the “rule of 99%” [3], which would require similar results as all mutants. Since its first proposal, a large
the sampling mutation score to be 99% or higher, the ratio amount of research effort has been put on operator-based mu-
of 3-3.5% is on the borderline for our set of programs and tant selection, which only generates mutants based on a subset
tests and may not generalize to other programs and tests. In of mutation operators. Wong and Mathur [22] investigated
the future, we plan to evaluate whether advanced techniques selection of two mutation operators among all the 22 mutation
(e.g., search-based mutant selection [13], [43]) could achieve operators in Mothra [45], and found that mutants generated
even smaller sampling ratios. In addition, the results show that with the selected two mutation operators can achieve similar
Smeth outperforms Sbase in terms of all the three metrics with mutation testing results as all the 22 mutation operators. Offutt
sampling ratio of greater than 1%, further demonstrating the et al. [3], [20] then proposed five mutation operators, named
benefits of our proposed sampling based on program elements. sufficient mutation operators, through a set of experimental
E. Threats to Validity studies to ensure that the selected set of mutants achieves
Threats to construct validity. The main threat to construct almost the same results as the entire mutant set. Barbosa
validity for our study is the set of metrics used to evaluate the et al. [4] proposed six guidelines to determine a set of 10
mutant sampling strategies. To reduce this threat, we use two mutation operators. Namin et al. [5] used rigorous statistical
widely used metrics, the mutation score metric for adequate analysis to determine 28 mutation operators from all the 108
test suites [3], [4], [7], [20], [22] and the correlation analysis mutation operators of Proteum [19]. Recently, Gligoric et
for non-adequate test suites [5]. Our study still inherits a major al. [9] also investigated operator-based mutant selection for
threat to construct validity: as in those previous studies, we concurrent mutation operators.
considered all mutants not killed by the original test pool to be In contrast to operator-based mutant selection, random mu-
equivalent due to the lack of precise techniques for detecting tant selection was less widely researched. The idea of random
equivalent mutants. mutant selection was first proposed by Acree et al. [46]
Threats to internal validity. The main threat to internal and Budd [47]. Wong and Mathur [22] empirically studied
validity is the potential faults in the implementation of our randomly selecting x% of all mutants generated by Mothra.
sampling strategies or in our data analysis. To reduce this Since then, researchers mainly used random mutant selection
threat, the first two authors carefully reviewed all the code as a control technique when evaluating operator-based mutant
for mutant sampling and data analysis during the study. selection [4]. However, a recent study by Zhang et al. [7]
Threats to external validity. The main threat to external demonstrated that random mutant selection can be equally
validity is that our results from this study may not generalize effective with operator-based selection when selecting the
to other contexts, including programs, tests, and mutants. To same number of mutants. The study used larger subjects (7
reduce this threat, we select 11 real-world Java programs C programs from Siemens Suite [48] ranging from 137 to
with various sizes (from 2681 to 36910 lines of code) from 513 lines of code) and more mutation operators (108 mutation
various application domains. Note that our study includes more operators implemented by Proteum [19]) than previous studies.
programs than any previous study on selective mutation testing However, the studied subjects are still relatively small. In ad-
for sequential code [3]–[5], [7], [20], [22]. In addition, the 11 dition, they did not demonstrate that random mutant selection
programs used in our study are one to two orders of magnitude can further be applied to operator-based selected mutants. In
larger than programs used in similar previous studies. contrast, our study is the first to show that random mutant
selection can be applied together with operator-based selected
IV. R ELATED W ORK mutants, e.g., even sampling 5% of operator-based generated
Mutation testing was first proposed by Hamlet [1] and mutants can still achieve precise mutation score. In addition,
DeMillo et al. [2]. Since then, due to its cost and effectiveness, we used 11 real-world Java programs from 2681 to 36910 lines

100
of code, which are orders of magnitude larger than subject used search-based software testing (SBST) to generate tests
programs used in previous studies on selective mutation [3]– for mutant killing. Zhang et al. [16] and Papadakis et al. [17]
[5], [7], [22]. used dynamic symbolic execution (DSE) to generate tests for
2) Weakened Mutation Testing: Weakened mutation testing, mutant killing. Harman et al. [14] combined SBST and DSE
which was first proposed by Howden [49], aims to provide a techniques to generate tests that kill multiple mutants at a
more efficient way to determine whether a mutant is killed by a time. Our sampling strategies may make these test generation
test. More precisely, the traditional mutation testing considers techniques more efficient by only generating tests that kill a
a mutant as killed only when a test generates different final sample of mutants.
results for the mutant and the original program. On the Mutants generated by mutation testing can also be used
contrary, the first work of weakened mutation testing, weak to simulate real program faults to evaluate software testing
mutation [49], considers a mutant as weakly killed when a test techniques. The advantage compared with seeded or real faults
triggers a different program internal state when executing the is that mutation faults can be systematically generated, making
mutated statement on the mutant and on the original program. the generation replicable and sufficient for statistical analysis.
However, this approach is imprecise because some internal Andrews et al. [11], [12] empirically showed that the mutants,
states triggered by a test may not be propagated to the final which are generated by selective operators, simulate real faults
result. To better balance the cost and precision, Woodward and better than manually seeded faults, and is appropriated for
Halewood [50] proposed firm mutation, which is a spectrum evaluating testing techniques. Do et al. [61] also showed that it
of techniques between weak and strong mutation. Offutt and is suitable to use mutation faults to evaluate regression testing
Lee [51] experimentally investigated the relationships between techniques. Recently, Zhang et al. [62] showed that mutation
weak mutation and strong mutation. Our study is orthogonal faults can be used to simulate the impacts of real faults. Thus,
to this line of work; our study aims to reduce the number of more and more software testing techniques are evaluated using
mutants while weakened mutation testing aims to reduce the mutation faults [41], [58], [63]–[65]. Currently, testing tech-
execution cost for each mutant. niques are mainly evaluated using an arbitrary set of mutants
3) Optimized Mutation Testing: Optimized mutation testing due to the large number of mutants. Our study establishes rules
aims to explore efficient ways to generate, compile, and for evaluating testing techniques – if technique t outperforms
execute mutants. DeMillo et al. [52] extended a compiler to technique t′ on 5% sampled mutants, we can predict with high
compile all mutants at once to reduce the cost of generating confidence that t is better than t′ on all mutants.
and compiling a large number of mutants. Similarly, Untch
et al. [53] proposed the schema-based mutation approach, V. C ONCLUSIONS AND F UTURE W ORK
which generates one meta-mutant that encodes all the mutants This paper reports an empirical study to answer an important
and can be compiled by a traditional compiler. Researchers question for selective mutation testing: Can random mutant
have also investigated various ways to run mutants in par- sampling be applied on top of operator-based mutant selection
allel to speed up mutation testing [54], [55]. Recently, we to further reduce the cost of mutation testing? We evaluate
proposed approaches inspired by regression testing [56]–[58] that question for various sampling strategies, for adequate and
to optimize the test execution for each mutant [10], [37]. non-adequate test suites, and on 11 real-world Java programs.
More specifically, inspired by regression test selection, we Surprisingly, the empirical results show that sampling only
proposed ReMT [10] to incrementally collect mutation testing 5% of mutants generated by operator-based selection can
results based on old mutation results; inspired by regression still provide a highly precise mutation score. In addition, the
test prioritization and reduction, we proposed FaMT [37] to study shows that our newly proposed random mutant sampling
prioritize and reduce tests for each mutant to collect mutation strategies based on program elements can be more effective
testing results faster. Our sampling strategies, which further than strategies based on mutation operators. Furthermore, the
sample mutants over operator-based selected mutants, aim to study shows that mutant sampling is more beneficial for larger
reduce the cost of mutation testing at a different dimension. programs, indicating a promising future for applying mutant
sampling to larger projects.
B. Applications of Mutation Testing In the future, we plan to consider more sophisticated mutant
Mutation testing was initially used to evaluate the quality sampling based on dynamic test behavior (e.g., we may sample
of test suites. After achieving the mutation score for the test more mutants on some critical paths) or on combinations of
suite evaluation, the user can improve the quality of the test random mutant sampling and search-based mutant selection.
suite manually. Researchers have also proposed techniques
that automatically generate tests to kill mutants. DeMillo ACKNOWLEDGMENTS
and Offutt first proposed constraint-based testing (CBT) [59] We would like to thank Nathan Hirtz and Douglas Simpson
to generate tests (each killing one mutant) based on static for help with statistical analysis, and David Schuler for help
symbolic evaluation. Offutt et al. [15] further proposed the with Javalanche. This material is based upon work partially
dynamic domain reduction technique to further refine CBT. supported by the National Science Foundation under Grant
Recently, researchers proposed more solutions for this area Nos. CNS-0958231, CNS-0958199, CCF-0845628, and CCF-
due to the growing computing power. Fraser and Zeller [60] 0746856.

101
R EFERENCES [37] L. Zhang, D. Marinov, and S. Khurshid, “Faster mutation testing inspired
by test prioritization and reduction,” in Proc. of ISSTA, 2013, pp. 235–
[1] R. G. Hamlet, “Testing programs with the aid of a compiler,” TSE, vol. 3, 245.
1977. [38] “JavaSourceMetric home,” http://sourceforge.net/projects/jsourcemetric/.
[2] R. A. DeMillo, R. J. Lipton, and F. G. Sayward, “Hints on test data [39] “Our project,” https://webspace.utexas.edu/∼ lz3548/ase13support.html.
selection: Help for the practicing programmer,” Computer, vol. 11, 1978. [40] M. Staats, G. Gay, and M. P. Heimdahl, “Automated oracle creation
[3] A. J. Offutt, A. Lee, G. Rothermel, R. Untch, and C. Zapf, “An exper- support, or: How i learned to stop worrying about fault propagation and
imental determination of sufficient mutation operators,” ACM TOSEM, love mutation testing,” in Proc. of ICSE, 2012, pp. 870–880.
vol. 5, no. 2, pp. 99–118, 1996. [41] L. Briand, Y. Labiche, and Y. Wang, “Using simulation to empirically
investigate test coverage criteria based on statechart,” in Proc. of ICSE,
[4] E. F. Barbosa, J. C. Maldonado, and A. M. R. Vincenzi, “Toward the
2004, pp. 86–95.
determination of sufficient mutant operators for C,” Software Testing,
[42] R. Sharma, M. Gligoric, A. Arcuri, G. Fraser, and D. Marinov, “Testing
Verification and Reliability, vol. 11, no. 2, pp. 113–136, 2001.
container classes: Random or systematic?” in Proc. of FASE, 2011, pp.
[5] A. Siami Namin, J. H. Andrews, and D. J. Murdoch, “Sufficient mutation
262–277.
operators for measuring test effectiveness,” in Proc. of ICSE, 2008, pp.
[43] W. B. Langdon, M. Harman, and Y. Jia, “Multi objective higher order
351–360.
mutation testing with genetic programming,” in Proc. of TAIC PART,
[6] D. Schuler, V. Dallmeier, and A. Zeller, “Efficient mutation testing by 2009, pp. 21–29.
checking invariant violations,” in Proc. of ISSTA, 2009. [44] A. P. Mathur, “Performance, effectiveness, and reliability issues in
[7] L. Zhang, S.-S. Hou, J.-J. Hu, T. Xie, and H. Mei, “Is operator-based software testing,” in COMPSAC, 1991, pp. 604–605.
mutant selection superior to random mutant selection?” in Proc. of ICSE, [45] R. DeMillo and R. Martin, “The Mothra software testing environment
2010, pp. 435–444. users manual,” Software Engineering Research Center, Tech. Rep, 1987.
[8] Y. Jia and M. Harman, “An analysis and survey of the development of [46] A. T. Acree, T. A. Budd, R. A. DeMillo, R. J. Lipton, and F. G. Sayward,
mutation testing,” IEEE TSE, vol. 37, no. 5, pp. 649–678, 2011. “Mutation analysis.” DTIC Document, Tech. Rep., 1979.
[9] M. Gligoric, L. Zhang, C. Pereira, and G. Pokam, “Selective mutation [47] T. A. Budd, “Mutation analysis of program test data[ph. d. thesis],”
testing for concurrent code,” in Proc. of ISSTA, 2013, pp. 224–234. 1980.
[10] L. Zhang, D. Marinov, L. Zhang, and S. Khurshid, “Regression mutation [48] H. Do, S. Elbaum, and G. Rothermel, “Supporting controlled experi-
testing,” in Proc. of ISSTA, 2012, pp. 331–341. mentation with testing techniques: An infrastructure and its potential
[11] J. H. Andrews, L. C. Briand, and Y. Labiche, “Is mutation an appropriate impact,” Empirical Software Engineering, vol. 10, no. 4, pp. 405–435,
tool for testing experiments?” in Proc. of ICSE, 2005, pp. 402–411. 2005.
[12] J. H. Andrews, L. C. Briand, Y. Labiche, and A. S. Namin, “Using [49] W. E. Howden, “Weak mutation testing and completeness of test sets,”
mutation analysis for assessing and comparing testing coverage criteria,” IEEE TSE, no. 4, pp. 371–379, 1982.
IEEE Trans. Softw. Eng., vol. 32, no. 8, pp. 608–624, 2006. [50] M. Woodward and K. Halewood, “From weak to strong, dead or alive?
[13] Y. Jia and M. Harman, “Higher order mutation testing,” IST, vol. 51, an analysis of some mutation testing issues,” in Proc. of the Second
no. 10, pp. 1379–1393, 2009. Workshop on Software Testing, Verification, and Analysis, 1988, pp. 152–
[14] M. Harman, Y. Jia, and W. B. Langdon, “Strong higher order mutation- 158.
based test data generation,” in Proc. of FSE. ACM, 2011, pp. 212–222. [51] A. J. Offutt and S. D. Lee, “An empirical evaluation of weak mutation,”
[15] A. J. Offutt, Z. Jin, and J. Pan, “The dynamic domain reduction IEEE TSE, vol. 20, no. 5, pp. 337–344, 1994.
procedure for test data generation,” Software-Practice and Experience, [52] R. A. DeMillo, E. W. Krauser, and A. P. Mathur, “Compiler-integrated
vol. 29, no. 2, pp. 167–194, 1999. program mutation,” in COMPSAC, 1991, pp. 351–356.
[16] L. Zhang, T. Xie, L. Zhang, N. Tillmann, J. De Halleux, and H. Mei, [53] R. H. Untch, A. J. Offutt, and M. J. Harrold, “Mutation analysis using
“Test generation via dynamic symbolic execution for mutation testing,” mutant schemata,” in Proc. of ISSTA, 1993, pp. 139–148.
in Proc. of ICSM, 2010, pp. 1–10. [54] E. W. Krauser, A. P. Mathur, and V. J. Rego, “High performance software
[17] M. Papadakis, N. Malevris, and M. Kallia, “Towards automating the testing on SIMD machines,” IEEE TSE, vol. 17, no. 5, pp. 403–423,
generation of mutation tests,” in Proc. of AST, 2010, pp. 111–118. 1991.
[18] K. Pan, X. Wu, and T. Xie, “Automatic test generation for mutation [55] A. J. Offutt, R. P. Pargas, S. V. Fichter, and P. K. Khambekar, “Mutation
testing on database applications,” in Proc. of AST, 2013, p. to appear. testing of software using a MIMD computer,” in Proc. of International
[19] M. E. Delamaro, J. C. Maldonado, and A. M. R. Vincenzi, “Proteum/im Conference on Parallel Processing, 1992, pp. 257–266.
2.0: An integrated mutation testing environment,” in Mutation testing [56] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, “Prioritizing test
for the new century. Springer, 2001, pp. 91–101. cases for regression testing,” IEEE TSE, vol. 27, no. 10, pp. 929–948,
[20] A. J. Offutt, G. Rothermel, and C. Zapf, “An experimental evaluation 2001.
of selective mutation,” in Proc. of ICSE, 1993. [57] A. Orso, N. Shi, and M. J. Harrold, “Scaling regression testing to large
software systems,” in Proc. of FSE, vol. 29, no. 6, 2004, pp. 241–251.
[21] P. Ammann and J. Offutt, Introduction to Software Testing, 2008.
[58] D. Hao, L. Zhang, X. Wu, H. Mei, and G. Rothermel, “On-demand test
[22] W. E. Wong and A. P. Mathur, “Reducing the cost of mutation testing:
suite reduction,” in Proc. of ICSE, 2012, pp. 738–748.
An empirical study,” JSS, vol. 31, no. 3, pp. 185–196, 1995.
[59] R. DeMillo and A. J. Offutt, “Constraint-based automatic test data
[23] “Time and Money home,” http://timeandmoney.sourceforge.net/. generation,” IEEE TSE, vol. 17, no. 9, pp. 900–910, 1991.
[24] “JDepend home,” http://clarkware.com/software/JDepend.html. [60] G. Fraser and A. Zeller, “Mutation-driven generation of unit tests and
[25] “JTopas home,” http://jtopas.sourceforge.net/jtopas/index.html. oracles,” IEEE TSE, vol. 38, no. 2, pp. 278–292, 2012.
[26] “Barbecue home,” http://barbecue.sourceforge.net/. [61] H. Do and G. Rothermel, “On the use of mutation faults in empirical
[27] “Apache Mime4J home,” http://james.apache.org/mime4j/. assessments of test case prioritization techniques,” IEEE TSE, vol. 32,
[28] “Jaxen home,” http://jaxen.codehaus.org/. no. 9, pp. 733–752, 2006.
[29] “XStream home,” http://xstream.codehaus.org/. [62] L. Zhang, L. Zhang, and S. Khurshid, “Injecting mechanical faults to
[30] “Santuario home,” http://santuario.apache.org/. localize developer faults for evolving software,” in Proc. of OOPSLA,
[31] “Apache Commons home,” http://commons.apache.org/proper/commons-lang/. 2013, p. to appear.
[32] “Joda Time home,” http://joda-time.sourceforge.net/. [63] M. Staats, G. Gay, M. Whalen, and M. Heimdahl, “On the danger of
[33] “Apache JMeter home,” http://jmeter.apache.org/. coverage directed test case generation,” in Proc. of FASE, 2012, pp.
[34] D. Schuler and A. Zeller, “Javalanche: efficient mutation testing for 409–424.
Java,” in Proc. of FSE, 2009, pp. 297–298. [64] H. Mei, D. Hao, L. Zhang, L. Zhang, J. Zhou, and G. Rothermel, “A
[35] S. Zhang, C. Zhang, and M. D. Ernst, “Automated documentation static approach to prioritizing JUnit test cases,” IEEE TSE, vol. 38, no. 6,
inference to explain failed tests,” in Proc. of ASE. IEEE, 2011, pp. pp. 1258–1275, 2012.
63–72. [65] M. Gligoric, A. Groce, C. Zhang, R. Sharma, A. Alipour, and D. Mari-
[36] L. Zhang, D. Hao, L. Zhang, G. Rothermel, and H. Mei, “Bridging the nov, “Comparing non-adequate test suites using coverage criteria,” in
gap between the total and additional test-case prioritization strategies,” Proc. of ISSTA, 2013, pp. 302–313.
in Proc. ICSE, 2013, pp. 192–201.

102

You might also like