Metamorphic Runtime Checking of Applications Without Test Oracles

Jonathan Bell, Columbia University
Christian Murphy, University of Pennsylvania
Gail Kaiser, Columbia University

Abstract. For some applications, it is impossible or impractical to know
what the correct output should be for an arbitrary input, making testing
difficult. Many machine-learning applications for “big data”,
bioinformatics and cyberphysical systems fall in this scope: they do not
have a test oracle. Metamorphic Testing, a simple testing technique that
does not require a test oracle, has been shown to be effective for
testing such applications. We present Metamorphic Runtime Checking, a
novel approach that conducts metamorphic testing of both the entire
application and individual functions during a program’s execution. We
have applied Metamorphic Runtime Checking to 9 machine-learning
applications, finding it to be on average 170% more effective than
traditional metamorphic testing at only the full application level.

Introduction
During software testing, a “test oracle” [1] is required to indicate
whether the output is correct for the given input. Despite a recent
interest in the testing community in creating and evaluating test
oracles, still there are a variety of problem domains for which a
practical and complete test oracle does not exist.

Many emerging application domains fall into a category of software that
Weyuker describes as “Programs which were written in order to determine
the answer in the first place. There would be no need to write such
programs, if the correct answer were known [2].” Thus, in the general
case, it is not possible to know the correct output in advance for
arbitrary input. In other domains, such as optimization, determining
whether the output is correct is at least as difficult as it is to
derive the output in the first place, and creating an efficient,
practical oracle may not be feasible.

Although some faults in such programs - such as those that cause the
program to crash or produce results that are obviously wrong to someone
who knows the domain - are easily found, and partial oracles may exist
for a subset of the input domain, subtle errors in performing
calculations or in adhering to specifications can be much more difficult
to identify without a practical, general oracle.

Much recent research addressing the so-called “oracle problem” has
focused on the use of metamorphic testing [3]. In metamorphic testing
changes are made to existing test inputs in such a way (based on the
program’s “metamorphic properties”) that it is possible to predict what
the change to the output should be without a test oracle.

Metamorphic Testing
[...] such a manner that the change to O (if any) can be predicted. In
cases where the correctness of the original output O cannot be
determined, i.e., if there is no test oracle, program defects can still
be detected if the new output O' is not as expected when using the new
input.

For a simple example of metamorphic testing (where we do have a test
oracle), consider a function that calculates the standard deviation of a
set of numbers. Certain transformations of the set would be expected to
produce the same result: for instance, permuting the order of the
elements should not affect the calculation, nor should multiplying each
value by -1. Furthermore, other transformations should alter the output,
but in a predictable way: if each value in the set were multiplied by 2,
then the standard deviation should be twice that of the original set.

Through our own past investigations into metamorphic testing [4] [5]
[6], we have garnered three key insights. First, the metamorphic
properties of individual functions are often different than those of the
application as a whole. Thus, by checking for additional and different
relationships, we can reveal defects that would not be detected using
only the metamorphic properties of the full application. Second, the
metamorphic properties of individual functions can be checked in the
course of executing metamorphic tests on the full application. This
addresses the problem of generating test cases from which to derive new
inputs, since we can simply use those inputs with which the functions
happened to be invoked within the full application. Third, when
conducting tests of individual functions within the full running
application in this manner, checking the metamorphic properties of one
function can sometimes detect defects in other functions, which may not
have any known metamorphic properties, because the functions share
application state.

Approach
In order to realize these improvements, we present a solution based on
checking the metamorphic properties of the entire program and those of
individual functions (methods, procedures, subroutines, etc.) as the
full program runs. That is, the program under test is not treated only
as a black box, but rather metamorphic testing also occurs within the
program, at the function level, in the context of the running program.
This will allow for the execution of more tests and also makes it
possible to check for subtle faults inside the code that may not cause
violations of the full program’s metamorphic properties and lead to
apparently reasonable output (remember we cannot check whether that
output is correct, since there is no test oracle).

In our new approach, additional metamorphic tests are logically attached
to the individual functions for which metamorphic properties have been
specified. Upon a function’s execution when it happens to be invoked
within the full program, the corresponding function-level tests are
executed as well: the arguments are modified according to the function’s
metamorphic properties, the function is run again (in a sandbox, not
shown) in the same program state as the original, and the output of the
function with the original input is compared to that of the function
with the modified input. If the result is not as expected according to
the metamorphic property, then a fault has been exposed.
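
The function-level checking described here can be illustrated with a
small Python sketch built on the standard-deviation properties mentioned
earlier. It is not the authors' C or Java framework: the `metamorphic`
decorator, the `violations` log, and the seeded defect in
`buggy_std_dev` are all hypothetical names introduced for illustration,
and a deep copy of the arguments merely stands in for the sandbox.

```python
import copy
import math
import statistics
from functools import wraps

# Hypothetical log of (function name, property name) pairs that failed.
violations = []


def metamorphic(*properties):
    """Attach metamorphic properties to a function taking one list argument.

    Each property is a tuple (name, transform_input, predict_output).
    """
    def decorate(func):
        @wraps(func)
        def wrapper(values):
            result = func(values)  # the call the application actually made
            for name, transform_input, predict_output in properties:
                # The deep copy stands in for the sandbox: the follow-up
                # call must not disturb the caller's data.
                follow_up = func(transform_input(copy.deepcopy(values)))
                if not math.isclose(follow_up, predict_output(result),
                                    rel_tol=1e-9, abs_tol=1e-12):
                    violations.append((func.__name__, name))
            return result
        return wrapper
    return decorate


# Reversal is used as a deterministic permutation of the input.
STD_DEV_PROPERTIES = (
    ("permute", lambda xs: xs[::-1], lambda out: out),
    ("negate", lambda xs: [-x for x in xs], lambda out: out),
    ("double", lambda xs: [2 * x for x in xs], lambda out: 2 * out),
)


@metamorphic(*STD_DEV_PROPERTIES)
def std_dev(values):
    return statistics.pstdev(values)


@metamorphic(*STD_DEV_PROPERTIES)
def buggy_std_dev(values):
    # Seeded defect for illustration: the mean ignores the last element.
    mean = sum(values[:-1]) / (len(values) - 1)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
```

Calling `buggy_std_dev([1.0, 2.0, 3.0, 10.0])` records a violation of
the permutation property only; negating or doubling the input does not
expose this particular defect, which is why checking several properties
per function is useful.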
CrossTalk—March/April 2015 9
TEST AND DIAGNOSTICS
[...] property simply states that the quality of the solutions should be
increasing with subsequent generations. Even though the value of the
fitness is incorrect, it would still be increasing (unless the omitted
element had a very large effect on the result, which is unlikely), and
the property would not be violated.

Performance Overhead
Although Metamorphic Runtime Checking using function-level properties is
able to detect faults not found by metamorphic testing based on
application-level properties alone, this runtime checking of the
properties comes at a cost, particularly if the tests are run
frequently. In application-level metamorphic testing, the program needs
to be run one more time with the transformed input, and then each
metamorphic property is checked exactly once (at the end of the program
execution). In Metamorphic Runtime Checking, however, each property can
be checked numerous times, depending on the number of times each
function is called, and the overhead can grow to be much higher.

During the studies discussed above, we measured the performance overhead
of our C and Java implementations of the Metamorphic Runtime Checking
framework. Tests were conducted on a server with a quad-core 3GHz CPU
running Ubuntu 7.10 with 2GB RAM. On average, the performance overhead
for the Java applications was around 3.5ms per test; for C, it was only
0.4ms per test. This cost is mostly attributed to the time it takes to
create sandboxes (so the side-effects of function-level metamorphic
testing do not impact application-level testing).

This impact can be substantial from a percentage overhead point of view
if many tests are run in a short-lived program. For instance, for C4.5,
the overhead was on the order of 10x, even though in absolute terms it
was well under a second. However, for most programs we investigated in
our study, the overhead was typically less than a few minutes, which we
consider a small price to pay for being able to detect faults in
programs with no test oracle.

Future work could investigate techniques for improving the performance
of a Metamorphic Runtime Checking framework. Previously we considered an
approach whereby tests were only executed in application states that had
not previously been encountered, and showed that performance could be
improved even when the functions are invoked with new parameters up to
90% of the time [12]. It may be possible to reduce the overhead even
more, for instance by running tests probabilistically (our framework
already allows the tester to specify a probability for checking each
function-level metamorphic property, but we turned that off for the
studies presented here).

Limitations
We used Daikon to create the program invariants for runtime assertion
checking. Although in practice invariants are typically generated by
hand, and some researchers have questioned the usefulness of
Daikon-generated invariants compared to those generated by humans [13],
we chose to use the tool so that we could eliminate any human bias or
human error in creating the invariants.

Additionally, others have independently shown that metamorphic
properties are more effective at detecting defects than manually
identified invariants [14], though for programs on a smaller scale than
those in our experiment (a few hundred lines, as opposed to thousands as
in many of the programs we studied).

The ability of metamorphic testing to reveal failures is clearly
dependent on the selection of metamorphic properties. However, we have
shown that a basic set of metamorphic properties can be used without a
particularly strong understanding of the
implementation - the authors knew essentially nothing about the
target systems or their domains beyond textbook generality; the
use of domain-specific properties from the developers of these
systems might reveal even more failures [15].
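
Returning to the overhead discussion, the idea of checking a
function-level property only with some tester-chosen probability can be
sketched as follows. This is an illustrative Python sketch, not the
framework's actual interface: `ProbabilisticCheck` and
`expected_overhead_ms` are hypothetical names, and the linear cost model
(calls times probability times cost per check) is an assumption.

```python
import random


class ProbabilisticCheck:
    """Run a function-level check on only a fraction of invocations."""

    def __init__(self, check, probability, rng=None):
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be between 0 and 1")
        self.check = check              # callable that performs the test
        self.probability = probability  # chosen by the tester
        self.rng = rng if rng is not None else random.Random()
        self.calls = 0                  # times the function was invoked
        self.checks_run = 0             # times the check actually ran

    def __call__(self, *args):
        self.calls += 1
        # random() returns a value in [0, 1), so probability 1.0 always
        # runs the check and probability 0.0 never does.
        if self.rng.random() < self.probability:
            self.checks_run += 1
            return self.check(*args)
        return None  # check skipped for this invocation


def expected_overhead_ms(calls, probability, cost_per_check_ms):
    """Expected added time under the assumed linear cost model."""
    return calls * probability * cost_per_check_ms
```

Under this model, and taking the 3.5ms per-test Java figure reported
above, a function invoked 10,000 times and checked with probability 0.1
would add roughly 3.5 seconds, against roughly 35 seconds if every
invocation were checked.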
Conclusion
As shown in our empirical studies, Metamorphic Runtime
Checking has three distinct advantages over metamorphic test-
ing using application-level properties alone. First, we are able to
increase the scope of metamorphic testing, by identifying proper-
ties for individual functions in addition to those of the entire appli-
cation. Second, we increase the scale of metamorphic testing by
running more tests for a given input to the program. And third, we
can increase the sensitivity of metamorphic testing by checking
the properties of individual functions, making it possible to reveal
subtle faults that may otherwise go unnoticed.
Acknowledgements
We would like to thank T.Y. Chen, Lori Clarke, Lee Osterweil, Sal
Stolfo, and Junfeng Yang for their guidance and assistance. Sahar
Hasan, Lifeng Hu, Kuang Shen, and Ian Vo contributed to the
implementation of the Metamorphic Runtime Checking framework.
Bell and Kaiser are members of the Programming Systems
Laboratory, funded in part by NSF CCF-1302269, NSF CCF-
1161079, NSF CNS-0905246, and NIH U54 CA121852.