Metamorphic Runtime Checking of Applications Without Test Oracles

Jonathan Bell, Columbia University
Christian Murphy, University of Pennsylvania
Gail Kaiser, Columbia University

Abstract. For some applications, it is impossible or impractical to know what the correct output should be for an arbitrary input, making testing difficult. Many machine-learning applications for "big data", bioinformatics, and cyberphysical systems fall in this scope: they do not have a test oracle. Metamorphic Testing, a simple testing technique that does not require a test oracle, has been shown to be effective for testing such applications. We present Metamorphic Runtime Checking, a novel approach that conducts metamorphic testing of both the entire application and individual functions during a program's execution. We have applied Metamorphic Runtime Checking to 9 machine-learning applications, finding it to be on average 170% more effective than traditional metamorphic testing at only the full application level.

Introduction

During software testing, a "test oracle" [1] is required to indicate whether the output is correct for the given input. Despite recent interest in the testing community in creating and evaluating test oracles, there are still a variety of problem domains for which a practical and complete test oracle does not exist.

Many emerging application domains fall into a category of software that Weyuker describes as "Programs which were written in order to determine the answer in the first place. There would be no need to write such programs, if the correct answer were known [2]." Thus, in the general case, it is not possible to know the correct output in advance for arbitrary input. In other domains, such as optimization, determining whether the output is correct is at least as difficult as it is to derive the output in the first place, and creating an efficient, practical oracle may not be feasible.

Although some faults in such programs - such as those that cause the program to crash or produce results that are obviously wrong to someone who knows the domain - are easily found, and partial oracles may exist for a subset of the input domain, subtle errors in performing calculations or in adhering to specifications can be much more difficult to identify without a practical, general oracle.

Much recent research addressing the so-called "oracle problem" has focused on the use of metamorphic testing [3]. In metamorphic testing, changes are made to existing test inputs in such a way (based on the program's "metamorphic properties") that it is possible to predict what the change to the output should be without a test oracle. That is, if program input I produces output O, additional test inputs based on transformations of I are generated in such a manner that the change to O (if any) can be predicted. In cases where the correctness of the original output O cannot be determined, i.e., if there is no test oracle, program defects can still be detected if the new output O' is not as expected given the new input.

For a simple example of metamorphic testing (where we do have a test oracle), consider a function that calculates the standard deviation of a set of numbers. Certain transformations of the set would be expected to produce the same result: for instance, permuting the order of the elements should not affect the calculation, nor should multiplying each value by -1. Furthermore, other transformations should alter the output, but in a predictable way: if each value in the set were multiplied by 2, then the standard deviation should be twice that of the original set.
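
To make this concrete, here is a minimal, self-contained Java sketch (our illustration, not code from the article or from the framework it describes) that checks these three properties against a straightforward standard-deviation implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class StdDevMetamorphicTest {

    // Population standard deviation of a list of numbers.
    static double stdDev(List<Double> values) {
        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = values.stream()
                .mapToDouble(v -> (v - mean) * (v - mean))
                .average().orElse(0.0);
        return Math.sqrt(variance);
    }

    static void check(boolean holds, String property) {
        if (!holds) throw new AssertionError("metamorphic property violated: " + property);
    }

    public static void main(String[] args) {
        List<Double> input = Arrays.asList(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0);
        double original = stdDev(input);
        double eps = 1e-9;

        // Permuting the order of the elements should not affect the result.
        List<Double> shuffled = new ArrayList<>(input);
        Collections.shuffle(shuffled);
        check(Math.abs(stdDev(shuffled) - original) < eps, "permutation");

        // Multiplying each value by -1 should not affect the result.
        List<Double> negated = input.stream().map(v -> -v).collect(Collectors.toList());
        check(Math.abs(stdDev(negated) - original) < eps, "negation");

        // Multiplying each value by 2 should exactly double the result.
        List<Double> doubled = input.stream().map(v -> 2 * v).collect(Collectors.toList());
        check(Math.abs(stdDev(doubled) - 2 * original) < eps, "scaling by 2");
    }
}
```

Note that the comparisons use a small tolerance rather than exact equality: permuting the elements changes the floating-point summation order, which can perturb the low-order bits of the result.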

Through our own past investigations into metamorphic testing [4], [5], [6], we have garnered three key insights. First, the metamorphic properties of individual functions are often different than those of the application as a whole. Thus, by checking for additional and different relationships, we can reveal defects that would not be detected using only the metamorphic properties of the full application. Second, the metamorphic properties of individual functions can be checked in the course of executing metamorphic tests on the full application. This addresses the problem of generating test cases from which to derive new inputs, since we can simply use those inputs with which the functions happened to be invoked within the full application. Third, when conducting tests of individual functions within the full running application in this manner, checking the metamorphic properties of one function can sometimes detect defects in other functions, which may not have any known metamorphic properties, because the functions share application state.

Approach

In order to realize these improvements, we present a solution based on checking the metamorphic properties of the entire program and those of individual functions (methods, procedures, subroutines, etc.) as the full program runs. That is, the program under test is not treated only as a black box; rather, metamorphic testing also occurs within the program, at the function level, in the context of the running program. This allows for the execution of more tests and also makes it possible to check for subtle faults inside the code that may not cause violations of the full program's metamorphic properties and that lead to apparently reasonable output (remember, we cannot check whether that output is correct, since there is no test oracle).

In our new approach, additional metamorphic tests are logically attached to the individual functions for which metamorphic properties have been specified. When such a function happens to be invoked within the full program, the corresponding function-level tests are executed as well: the arguments are modified according to the function's metamorphic properties, the function is run again (in a sandbox) in the same program state as the original, and the output of the function with the original input is compared to that of the function with the modified input. If the result is not as expected according to the metamorphic property, then a fault has been exposed.
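
In outline, a function-level check behaves like a wrapper around the function under test. The following Java sketch is our simplification of that idea; the article's framework additionally runs the extra invocation in a sandbox so that its side effects cannot disturb the main execution, and that machinery is omitted here:

```java
import java.util.function.BiPredicate;
import java.util.function.Function;

// A function-level metamorphic property: how to derive a follow-up input
// from the actual argument, and what relation the two outputs must satisfy.
public class MetamorphicCheck<I, O> {
    private final Function<I, I> transform;   // t: produces the follow-up input t(x)
    private final BiPredicate<O, O> relation; // expected relation between f(x) and f(t(x))

    public MetamorphicCheck(Function<I, I> transform, BiPredicate<O, O> relation) {
        this.transform = transform;
        this.relation = relation;
    }

    // Invoke f as the program intended, but also run the attached metamorphic test.
    public O invokeAndCheck(Function<I, O> f, I x) {
        O original = f.apply(x);                  // the call the program wanted anyway
        O followUp = f.apply(transform.apply(x)); // extra call on the transformed input
        if (!relation.test(original, followUp)) {
            throw new AssertionError("metamorphic property violated for input " + x);
        }
        return original; // the program continues with the original result
    }

    // Example property for a standard-deviation function over a double[]:
    //   transform: xs -> Arrays.stream(xs).map(v -> 2 * v).toArray()
    //   relation:  (original, followUp) -> Math.abs(followUp - 2 * original) < 1e-9
}
```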


[Figure 1, not reproduced here: the diagram shows a program input being transformed for application-level testing, two resulting invocations of program P, and, inside each invocation, function-level testing in which the argument of a function f is transformed and the outputs f(x), f(t(x)), f(x'), f(t(x')) are checked against one another.]

Figure 1: Model of Metamorphic Runtime Checking of program P and one of its constituent functions, f. Metamorphic Runtime Checking combines program-level metamorphic testing with function-level metamorphic checking, performing such checking automatically.

As shown in Figure 1, the tester provides a program input to a Metamorphic Runtime Checking framework, which then transforms it according to the metamorphic property of the program P (for simplicity, the diagram shows only one metamorphic property, but a program may, of course, have many). The framework then invokes P with both the original input and the transformed input; as seen at the bottom of the diagram, when each program invocation is finished, the outputs can be checked according to the property.

While each invocation of P is running, metamorphic properties of individual functions can be checked as well. As shown on the left side of Figure 1, in the invocation of P with the original program input, before a function f is called, its input x can be transformed according to one of the function's metamorphic properties to give t(x). The function is called with each input, and then f(t(x)) is compared against the original value of f(x) to see if the property is violated.

Meanwhile, in the additional invocation of P (right side of the diagram), function-level metamorphic testing still occurs for function f, this time using input x', which results from the transformed program input to P. In this case, f(t(x')) and f(x') are compared.

In this example, if we used only the application-level property of P, we would run just one test. However, by also considering P's function f with one specified metamorphic property, we can now check two properties and run a total of three tests. This combined approach also allows us to reveal subtle faults at the function level that may not violate application-level properties. Our study shows that this sensitivity gain can increase the effectiveness of metamorphic testing by up to 1,350% (on average, 170%).

Evaluation

To evaluate the effectiveness of Metamorphic Runtime Checking at detecting faults in applications without test oracles, we compare it to runtime assertion checking using program invariants (a state-of-the-art technique). When used in applications without test oracles, assertions can detect some programming bugs by checking that function input and output values are within a specified range, that relationships between variables are maintained, and that a function's effects on the application state are as expected [7]. While satisfying the invariants does not ensure correctness, any violation of them at runtime indicates an error.

The experiments described in this section seek to answer the following research questions:
1. Is Metamorphic Runtime Checking more effective than using runtime assertion checking for detecting faults in applications without test oracles?
2. What contribution do application-level and function-level metamorphic properties make to the effectiveness of Metamorphic Runtime Checking?
3. Is Metamorphic Runtime Checking suitable for practical use?

In these experiments, we applied both runtime assertion checking and Metamorphic Runtime Checking to nine real-world applications that are representative of different domains that have no practical, general test oracles: supervised machine learning, unsupervised machine learning, data mining, discrete event simulation, and NP-hard optimization. The applications are described (along with the number of invariants, function-level and application-level properties) in Table 1.

To create the set of invariants that we could use for runtime assertion checking, we applied the Daikon invariant detector tool [8] to each application. To identify the application-level metamorphic properties for the experiment, we followed the guidelines set forth in [4], which categorizes the types of properties that applications in these domains tend to exhibit.
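
For illustration, runtime assertion checking turns such likely invariants into executable checks at function entry and exit. The following hand-written Java example shows the general shape (the function and its invariants are hypothetical, not actual Daikon output):

```java
public class InvariantChecks {

    // Likely invariants written as runtime assertions. A violation at runtime
    // indicates an error, though satisfying the checks does not ensure correctness.
    static double normalizedScore(double[] scores, int k) {
        // Inferred preconditions: scores is non-empty and k is a valid index.
        if (scores.length == 0 || k < 0 || k >= scores.length) {
            throw new AssertionError("precondition violated");
        }
        double result = scores[k] / sum(scores);
        // Inferred postcondition (from observed runs): 0.0 <= result <= 1.0.
        if (result < 0.0 || result > 1.0) {
            throw new AssertionError("postcondition violated: result out of range");
        }
        return result;
    }

    static double sum(double[] xs) {
        double total = 0;
        for (double x : xs) total += x;
        return total;
    }
}
```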

To identify function-level properties, we inspected the source code and hand-annotated the functions that we expected to exhibit the types of properties described in [4]. To ensure that the properties were not limited to only the ones that we could think of, some of the function-level metamorphic properties used in this experiment are based on those used in other, similar studies such as [9], [10], and [11].

Application | Domain | Language | LOC | Functions | Invariants | Application-level properties | Function-level properties
C4.5 | classification | C | 5,285 | 141 | 27,603 | 4 | 40
GAFFitter | optimization | C++ | 1,159 | 19 | 744 | 2 | 11
JSim | simulation | Java | 3,024 | 468 | 306 | 2 | 12
K-means | clustering | Java | 717 | 46 | 137 | 4 | 12
LDA | topic modeling | Java | 1,630 | 103 | 1,323 | 4 | 28
Lucene | information retrieval | Java | 661 | 57 | 456 | 4 | 26
MartiRank | ranking | C | 804 | 19 | 3,647 | 4 | 15
PAYL | anomaly detection | Java | 4,199 | 164 | 19,730 | 2 | 40
SVM | classification | Java | 1,213 | 49 | 2,182 | 4 | 4

Table 1: Listing of applications studied

Methodology

To determine the effectiveness of the testing techniques, we used mutation analysis to systematically insert faults into the source code of the applications described above, and then determined whether the mutants could be killed (i.e., whether the faults could be detected) using each approach.
Mutations that yielded a fatal runtime error, an infinite loop, or an output that was clearly wrong (for instance, not conforming to the expected output syntax or simply being blank) were discarded, since any reasonable approach would detect such faults.

We also did not consider "equivalent mutants" for which the inputs used in the experiment produced the same program output as the original, unmutated version, e.g., those mutants that were not on the execution path for any test case or that would not have been killed with an oracle for these inputs.

For each mutated version, we conducted runtime assertion checking with the invariants detected by Daikon. If any invariant was violated, the mutant was considered killed. We then performed Metamorphic Runtime Checking on the same mutated versions to determine whether any of the specified metamorphic properties were violated. The inputs used for mutation analysis were the same as those used for detecting invariants and verifying metamorphic properties.

[Figure 2, not reproduced here: a bar chart of the percentage of mutants killed (0% to 100%) in each application (C4.5, GAFFitter, JSim, K-means, LDA, Lucene, MartiRank, PAYL, SVM, and the average) by Runtime Assertion Checking versus Metamorphic Runtime Checking.]

Figure 2: Results of mutation analysis comparing Metamorphic Runtime Checking and runtime assertion checking. Metamorphic Runtime Checking was on average more effective.

Figure 2 summarizes the results of our experiment evaluating the efficacy of Metamorphic Runtime Checking. Overall, Metamorphic Runtime Checking was more effective, killing 1,602 (90.4%) of the mutants in the applications, compared to just 1,498 (84.5%) for assertion checking.

Broadly speaking, Metamorphic Runtime Checking was more effective at killing mutants that related to operations on arrays, sets, collections, etc. Further analysis could better characterize the types of faults each approach is most suitable for detecting; regardless, it follows that runtime assertion checking and Metamorphic Runtime Checking should be used together for best results. When used in combination in our experiments, they were able to kill 95% of the mutants (totaling across all applications): only 88 of the 1,772 survived.

To understand the factors that impacted the efficacy of Metamorphic Runtime Checking, we performed a deeper analysis of the contribution of the separate mechanisms. We first determined the number of mutants killed only by application-level properties, then the number killed only by function-level properties. Table 2 shows these results.

Application | Total mutants | Application-level properties only | Function-level properties only | Both types | Not killed | MRC % improvement
C4.5 | 856 | 133 | 37 | 653 | 33 | 4.71%
GAFFitter | 66 | 2 | 14 | 20 | 30 | 63.64%
K-means | 35 | 6 | 11 | 11 | 7 | 64.71%
JSim | 36 | 14 | 0 | 22 | 0 | 0.00%
LDA | 24 | 2 | 0 | 22 | 0 | 0.00%
Lucene | 15 | 5 | 3 | 6 | 1 | 27.27%
MartiRank | 413 | 298 | 22 | 70 | 23 | 5.98%
PAYL | 40 | 0 | 27 | 2 | 11 | 1350.00%
SVM | 287 | 69 | 23 | 130 | 65 | 11.56%
Average | 197 | 59 | 15 | 104 | 19 | 169.76%

Table 2: Number of mutants killed by different types of metamorphic properties

On average, we saw a 170% improvement in the number of mutants killed when combining application-level properties with function-level properties. The variance in improvement was very large, however, with a striking improvement of 1,350% in PAYL but smaller improvements in C4.5 and MartiRank. There was no improvement at all in the JSim and LDA applications, because application-level properties had already been able to kill all mutants.

We believe that this improvement is attributed primarily to our increase in: the number of properties identified (scope); the number of tests run (scale); and the likelihood that a fault would be detected (sensitivity).

The improvement in the scope of metamorphic testing was particularly clear in the anomaly-based intrusion detection system PAYL. We were only able to identify two application-level metamorphic properties, because it was not possible to create new program inputs based on modifying the values of the bytes inside the payloads: the application only allowed syntactically and semantically valid inputs that reflected what it considered to be "real" network traffic.

These two properties were only able to kill two of the 40 mutants. However, once we could use Metamorphic Runtime Checking to run metamorphic tests at the function level, we were able to identify many more properties that involved changing the byte arrays that were passed as function arguments, thus revealing 27 additional faults.

Likewise, we were able to increase the scale of metamorphic testing by running many more test cases. For instance, in MartiRank, even though we specified function-level properties for only a handful of functions, many of those functions are called numerous times per program execution, meaning that there are many opportunities for a property to be violated.

Another reason why function-level properties were able to kill mutants not killed by application-level properties is that we were able to improve the sensitivity, i.e., the ability to reveal more subtle faults, as seen in GAFFitter. In the function that calculates the "fitness" of a given candidate solution in the genetic algorithm, i.e., how close to the optimal solution (target) a candidate comes, one of the metamorphic properties is that permuting the elements in the candidate solution should not affect the result, since the function merely takes a sum of all the elements.

If, for instance, there is a mutation such that the last element is omitted from the calculation, then the metamorphic property will be violated, since the return value will be different after the second function call. However, at the application level, such a fault is unlikely to be detected, since the metamorphic property simply states that the quality of the solutions should be increasing with subsequent generations. Even though the value of the fitness is incorrect, it would still be increasing (unless the omitted element had a very large effect on the result, which is unlikely), and the property would not be violated.
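
To see why the function-level property catches this mutant, consider the following contrived Java sketch (GAFFitter itself is written in C++; the function and the values here are ours, purely for illustration):

```java
public class FitnessPermutationCheck {

    // Correct fitness: the sum of the elements of the candidate solution.
    static double fitness(double[] candidate) {
        double sum = 0;
        for (int i = 0; i < candidate.length; i++) {
            sum += candidate[i];
        }
        return sum;
    }

    // Mutant: the loop bound drops the last element (length - 1).
    static double fitnessMutant(double[] candidate) {
        double sum = 0;
        for (int i = 0; i < candidate.length - 1; i++) {
            sum += candidate[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] candidate = {3.0, 1.0, 4.0, 1.0, 5.0};
        double[] permuted  = {5.0, 1.0, 1.0, 4.0, 3.0}; // same elements, reordered

        // Property: permuting the candidate must not change the fitness.
        System.out.println(fitness(candidate) == fitness(permuted));             // true
        // Under the mutant the two calls disagree (9.0 vs. 11.0): fault exposed.
        System.out.println(fitnessMutant(candidate) == fitnessMutant(permuted)); // false
    }
}
```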

Performance Overhead

Although Metamorphic Runtime Checking using function-level properties is able to detect faults not found by metamorphic testing based on application-level properties alone, this runtime checking of the properties comes at a cost, particularly if the tests are run frequently. In application-level metamorphic testing, the program needs to be run one more time with the transformed input, and then each metamorphic property is checked exactly once (at the end of the program execution). In Metamorphic Runtime Checking, however, each property can be checked numerous times, depending on the number of times each function is called, and the overhead can grow to be much higher.

During the studies discussed above, we measured the performance overhead of our C and Java implementations of the Metamorphic Runtime Checking framework. Tests were conducted on a server with a quad-core 3GHz CPU running Ubuntu 7.10 with 2GB RAM. On average, the performance overhead for the Java applications was around 3.5ms per test; for C, it was only 0.4ms per test. This cost is mostly attributed to the time it takes to create sandboxes (so that the side effects of function-level metamorphic testing do not impact application-level testing).

This impact can be substantial from a percentage-overhead point of view if many tests are run in a short-lived program. For instance, for C4.5, the overhead was on the order of 10x, even though in absolute terms it was well under a second. However, for most programs we investigated in our study, the overhead was typically less than a few minutes, which we consider a small price to pay for being able to detect faults in programs with no test oracle.

Future work could investigate techniques for improving the performance of a Metamorphic Runtime Checking framework. Previously we considered an approach whereby tests were only executed in application states that had not previously been encountered, and showed that performance could be improved even when the functions are invoked with new parameters up to 90% of the time [12]. It may be possible to reduce the overhead even more, for instance by running tests probabilistically (our framework already allows the tester to specify a probability for checking each function-level metamorphic property, but we turned that off for the studies presented here).
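
A sketch of what such probabilistic gating might look like (our illustration; the framework's actual configuration interface is not shown in the article):

```java
import java.util.Random;

// Probabilistic test selection: run the (possibly expensive) metamorphic
// check on only a fraction of the calls, trading fault-detection
// opportunities for lower runtime overhead.
class ProbabilisticChecker {
    private static final Random RNG = new Random();

    static void maybeCheck(double probability, Runnable metamorphicTest) {
        if (RNG.nextDouble() < probability) {
            metamorphicTest.run();
        }
    }
}

// e.g., ProbabilisticChecker.maybeCheck(0.10, () -> runPermutationCheck());
// would exercise a (hypothetical) check on roughly one call in ten.
```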

Limitations

We used Daikon to create the program invariants for runtime assertion checking. Although in practice invariants are typically generated by hand, and some researchers have questioned the usefulness of Daikon-generated invariants compared to those generated by humans [13], we chose to use the tool so that we could eliminate any human bias or human error in creating the invariants.

Additionally, others have independently shown that metamorphic properties are more effective at detecting defects than manually identified invariants [14], though for programs on a smaller scale than those in our experiment (a few hundred lines, as opposed to thousands as in many of the programs we studied).

The ability of metamorphic testing to reveal failures is clearly dependent on the selection of metamorphic properties. However, we have shown that a basic set of metamorphic properties can be used without a particularly strong understanding of the implementation - the authors knew essentially nothing about the target systems or their domains beyond textbook generality; the use of domain-specific properties from the developers of these systems might reveal even more failures [15].

Conclusion

As shown in our empirical studies, Metamorphic Runtime Checking has three distinct advantages over metamorphic testing using application-level properties alone. First, we are able to increase the scope of metamorphic testing, by identifying properties for individual functions in addition to those of the entire application. Second, we increase the scale of metamorphic testing by running more tests for a given input to the program. And third, we can increase the sensitivity of metamorphic testing by checking the properties of individual functions, making it possible to reveal subtle faults that may otherwise go unnoticed.

Acknowledgements

We would like to thank T.Y. Chen, Lori Clarke, Lee Osterweil, Sal Stolfo, and Junfeng Yang for their guidance and assistance. Sahar Hasan, Lifeng Hu, Kuang Shen, and Ian Vo contributed to the implementation of the Metamorphic Runtime Checking framework. Bell and Kaiser are members of the Programming Systems Laboratory, funded in part by NSF CCF-1302269, NSF CCF-1161079, NSF CNS-0905246, and NIH U54 CA121852.


ABOUT THE AUTHORS

Jonathan Bell is a Ph.D. student in Software Engineering at Columbia University. His research interests include software testing, program analysis, and fault reproduction. He received an MPhil, MS, and BS in Computer Science from Columbia University.

Dept. of Computer Science
Columbia University
New York, NY 10027
Phone: 212-939-7184
E-mail: jbell@cs.columbia.edu

Christian Murphy is an Associate Professor of Practice and Director of the Master of Computer and Information Technology program at The University of Pennsylvania. His primary interests are software engineering, systems programming, and mobile/embedded computing. He received his Ph.D. in Computer Science from Columbia University.

Dept. of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104
Phone: 215-898-0382
E-mail: cdmurphy@cis.upenn.edu

Gail E. Kaiser is a Professor of Computer Science at Columbia University and a Senior Member of IEEE. Her research interests include software reliability and robustness, information management, social software engineering, and software development environments and tools. She has served as a founding associate editor of ACM TOSEM and as an editorial board member for IEEE Internet Computing. She received her Ph.D. and MS from CMU and her ScB from MIT.

Dept. of Computer Science
Columbia University
New York, NY 10027
Phone: 212-939-7184
E-mail: kaiser@cs.columbia.edu

REFERENCES

1. Pezzé, M. and M. Young, Software Testing and Analysis: Process, Principles and Techniques. 2007: Wiley.
2. Weyuker, E.J., On testing non-testable programs. Computer Journal, 1982. 25(4): p. 465-470.
3. Chen, T.Y., S.C. Cheung, and S.M. Yiu, Metamorphic testing: a new approach for generating next test cases. 1998, Dept. of Computer Science, Hong Kong Univ. of Science and Technology.
4. Murphy, C., et al., Properties of Machine Learning Applications for Use in Metamorphic Testing, in Proc. of the 20th International Conference on Software Engineering and Knowledge Engineering (SEKE). 2008. p. 867-872.
5. Murphy, C., et al., On Effective Testing of Health Care Simulation Software, in Proc. of the 3rd International Workshop on Software Engineering in Health Care. 2011.
6. Murphy, C., K. Shen, and G. Kaiser, Automated System Testing of Programs without Test Oracles, in Proc. of the 2009 ACM International Conference on Software Testing and Analysis (ISSTA). 2009. p. 189-199.
7. Nimmer, J.W. and M.D. Ernst, Automatic generation of program specifications, in Proc. of the 2002 International Symposium on Software Testing and Analysis (ISSTA). 2002. p. 232-242.
8. Ernst, M.D., et al., Dynamically discovering likely programming invariants to support program evolution, in Proc. of the 21st International Conference on Software Engineering (ICSE). 1999. p. 213-224.
9. Barus, A.C., et al., Testing of Heuristic Methods: A Case Study of Greedy Algorithm. Lecture Notes in Computer Science, 2011. 4890: p. 246-260.
10. Kanewala, U. and J.M. Bieman, Techniques for Testing Scientific Programs Without an Oracle, in Proc. of the 2013 International Workshop on Software Engineering for Computational Science and Engineering. 2013.
11. Cheatham, T.J., J.P. Yoo, and N.J. Wahl, Software testing: a machine learning experiment, in Proc. of the ACM 23rd Annual Conference on Computer Science. 1995. p. 135-141.
12. Murphy, C., et al., Automatic Detection of Previously-Unseen Application States for Deployment Environment Testing and Analysis, in Proc. of the 5th International Workshop on Automation of Software Test (AST). 2010.
13. Polikarpova, N., I. Ciupa, and B. Meyer, A comparative study of programmer-written and automatically inferred contracts, in Proc. of the 2009 International Symposium on Software Testing and Analysis (ISSTA). 2009. p. 93-104.
14. Hu, P., et al., An empirical comparison between direct and indirect test result checking approaches, in Proc. of the 3rd International Workshop on Software Quality Assurance. 2006. p. 6-13.
15. Xie, X., et al., Application of Metamorphic Testing to Supervised Classifiers, in Proc. of the 9th International Conference on Quality Software (QSIC). 2009. p. 135-144.
