Extreme mutation testing in practice: An industrial case study

Maik Betka
Institute of Software Engineering, University of Stuttgart, Stuttgart, Germany
maik.betka@iste.uni-stuttgart.de

Stefan Wagner
Institute of Software Engineering, University of Stuttgart, Stuttgart, Germany
stefan.wagner@iste.uni-stuttgart.de

arXiv:2103.08480v1 [cs.SE] 15 Mar 2021

Abstract: Mutation testing is used to evaluate the effectiveness of test suites. In recent years, a promising variation called extreme mutation testing emerged that is computationally less expensive. It identifies methods whose functionality can be removed entirely without the test suite noticing, even though the methods are covered. These methods are called pseudo-tested. In this paper, we compare the execution and analysis times for traditional and extreme mutation testing and discuss what they mean in practice. We look at how extreme mutation testing impacts current software development practices and discuss open challenges that need to be addressed to foster industry adoption. For that, we conducted an industrial case study consisting of running traditional and extreme mutation testing in a large software project from the semiconductor industry that is covered by a test suite of more than 11,000 unit tests. In addition, we did a qualitative analysis of 25 pseudo-tested methods and interviewed two experienced developers to see how they write unit tests and to gather opinions on how useful the findings of extreme mutation testing are. Our results include execution times, scores, numbers of executed tests and mutators, reasons why methods are pseudo-tested, and an interview summary. We conclude that the shorter execution and analysis times are clearly noticeable in practice and show that extreme mutation testing supplements writing unit tests in conjunction with code coverage tools. We propose that pseudo-tested code should be highlighted in code coverage reports and that extreme mutation testing should be performed when writing unit tests rather than in a decoupled session. Future research should investigate how to perform extreme mutation testing while writing unit tests such that the results are available fast enough but still meaningful.

I. INTRODUCTION

Mutation testing is a software testing technique to measure how effective a test suite is. Despite being around for several decades and thoroughly studied in research, it is still not widely adopted in industry [1], [2]. Extreme mutation testing is a variation of mutation testing and was introduced by Niedermayr, Juergens, and Wagner in 2016 [3]. This variation seems promising because it is computationally less expensive and easier to comprehend due to its higher abstraction level. We want to find out how large the time improvements of extreme mutation testing over traditional mutation testing are, what they mean in practice, and which factors currently prevent industry adoption. Therefore, we conducted an industrial case study in which we performed both mutation testing techniques in a large software project from the semiconductor industry and interviewed two experienced developers about their software testing practices and the findings of extreme mutation testing in particular.

This research was supported by Advantest as part of the Graduate School “Intelligent Methods for Test and Reliability” (GS-IMTR) at the University of Stuttgart.

II. FOUNDATIONS AND RELATED WORK

A. Mutation testing

For mutation testing, small controlled changes are introduced in the source code of the software under test to change its behavior. The types of changes are called mutators. The respective altered software versions are called mutants. Afterwards, the given test suite is re-run. If no test fails, the test suite is not able to detect the mutant, and the mutant is said to have survived. Otherwise, the mutant is said to be killed. Some software changes may produce semantically equivalent mutants, i.e., mutants that do not behave differently from the original program. These mutants are called equivalent mutants. The goal is to strengthen the test suite by enhancing existing tests or writing new ones that kill the surviving non-equivalent mutants. The ratio of killed mutants to all generated non-equivalent mutants is called the mutation score. Thus, the goal of mutation testing is typically expressed as optimizing the mutation score towards one [1], [2].

B. Extreme mutation testing

In contrast to traditional mutation testing, where mutations are usually performed on an instruction level, extreme mutation testing performs these mutations on a method level. The goal is to find pseudo-tested methods, which are methods whose whole functionality can be removed without any test failing [3]. Pseudo-tested methods are surprisingly common. In another study, 19 open-source projects were analyzed, and the median proportion of pseudo-tested methods to the total number of mutated methods was 10.1% [4].

Particular mutants have to survive for a method to be categorized as pseudo-tested. For methods with no return value (void methods), the mutator empties the body of the whole method. If the resulting mutant survives, the method is categorized as pseudo-tested. For methods with return values, often multiple mutants have to survive. For example, for a primitive data
type like boolean, two mutants have to survive. One mutator replaces the method body with a single statement that returns true, and another replaces it with a single statement that returns false. If both resulting mutants survive, the method is categorized as pseudo-tested. If only one survives, the method is categorized as partially-tested. Analogously, other mutators replace the whole method body with default return values for other data types, such as integers (e.g. 0 and 1) or strings (e.g. "" and "A"). The values are chosen such that, if the mutants survive, the method can be categorized as pseudo-tested. For complex return types derived from classes, such as objects, the return value can simply be set to null. The choice of default values depends on the mutation testing tool used [3], [4].

Compared to traditional mutation testing, extreme mutation testing has the advantage of generating far fewer mutants, as the number of methods is significantly lower than the number of instructions that can be mutated. This reduces both the runtime and the time needed to analyze the mutants. On the other hand, extreme mutation testing is not as fine-grained as traditional mutation testing in finding weaknesses in the test suite.

III. METHOD

A. Research questions

RQ1: Are the improved execution and analysis times of extreme mutation testing relevant in practice?
RQ2: How do extreme mutation testing results impact the established practice of writing unit tests with code coverage?
RQ3: Which factors prevent extreme mutation testing from industry adoption?

B. Mutation testing

In the first part of this case study, we ran traditional and extreme mutation testing for a software project that is used in the semiconductor testing industry. The software is tested by more than 11,000 unit tests that call about 2,000 methods (about 12,500 lines of code). The software project is written in Java and consists of multiple smaller projects of different age, authors, and coding styles. For traditional mutation testing, we used PIT (http://pitest.org/) [5], version 1.4.10, with its default mutation engine Gregor. For extreme mutation testing, we used the same tool but with the Descartes engine (https://github.com/STAMP-project/pitest-descartes) [6], version 1.2.6. In addition, we measured code coverage with the Java code coverage library JaCoCo (https://www.jacoco.org/jacoco/), version 0.8.6, and counted the total lines of code for all methods called by the test suite.

After running mutation tests with different mutators and measuring the code coverage, we additionally labeled the methods by their access modifiers (public, private, etc.) using a Python script. We then manually selected 25 pseudo-tested methods according to the collected data, namely coverage, number of lines, access modifier, and package, class, and method name, to obtain a high diversity of methods with different properties. Afterwards, we killed the extreme mutants that caused the methods to be pseudo-tested, where possible, by enhancing or writing new tests, and noted the reason why each method was pseudo-tested.

C. Interview with developers

The second part of this case study was a semi-structured interview. We interviewed two developers with more than seven years of development experience who are familiar with the software under test. Neither of them had used mutation testing beforehand, and only one had heard of it at university. We structured the interview into four phases. In the first phase, we asked several questions about the developers’ roles in the company and their experiences with unit testing, code coverage, and mutation testing. In the second phase, we explained traditional and extreme mutation testing and showed the output of the PIT tool. In the third phase, we opened the software project in an integrated development environment (IDE) and presented pseudo-tested methods to illustrate how to analyze and strengthen the test suite. For this session, four classes with eight pseudo-tested methods were pre-selected for discussion. Five methods had 100% line coverage, and the remaining three had at least 75% line coverage. During the enhancement of the test suite, the developers were asked several questions regarding the relevance of the findings and the stage in development at which the findings would most likely be fixed. We also asked whether the developers would have written more or better tests if pseudo-tested lines were shown as “uncovered” in code coverage reports. The last phase was an open discussion.

IV. RESULTS

A. Mutation testing results

Table I shows the results of mutation testing with PIT for 11,424 unit tests that normally finish in 12 seconds. For traditional mutation testing, we used two sets of mutators that PIT provides. The “default” set consists of seven mutators, whereas the “all” set consists of 66 mutators. All executions were performed on the same machine.

Table II shows the total number of methods and the number of pseudo-tested methods as well as their proportion. It also shows the lines that are covered, i.e., executed, by the test suite, and the total line counts. In the analyzed software project, 291 (14%) of all 2,041 methods are pseudo-tested. The 291 pseudo-tested methods consist of 1,129 lines of code, of which 835 lines have coverage.

We analyzed 25 methods in more depth and found the following three reasons why methods are pseudo-tested:
• Weak tests with no assertions (8)
• Incomplete tests (3)
• Side-effect methods (14)

We found eight cases of tests without assertions, which caused the covered methods to be pseudo-tested. For example, a class that handled network connections was tested by opening and closing connections, but the connections were never verified to work by, e.g., counting the number of open sessions before and after closing.
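The network-connection example can be illustrated with a minimal Python sketch. The class ConnectionPool, the two tests, and the helper survives_empty_body are hypothetical names introduced here for illustration; they are not taken from the studied project, and the helper only imitates the void-method mutator (emptying the method body) rather than reproducing how PIT or Descartes works.

```python
class ConnectionPool:
    """Hypothetical class under test (not from the studied project)."""

    def __init__(self):
        self.sessions = set()

    def open(self, name):
        self.sessions.add(name)

    def close(self, name):
        self.sessions.discard(name)

    def session_count(self):
        # Accessor that lets a test observe the effect of open() and close().
        return len(self.sessions)


def weak_test(pool):
    """Passes whenever no exception is raised: it contains no assertion."""
    pool.open("a")
    pool.close("a")


def strong_test(pool):
    """Verifies the number of open sessions before and after closing."""
    pool.open("a")
    assert pool.session_count() == 1
    pool.close("a")
    assert pool.session_count() == 0


def survives_empty_body(test, method_name):
    """Replace one method with an empty body and rerun the test.

    Returns True if the test still passes, i.e., the extreme mutant survives.
    """
    class Mutant(ConnectionPool):
        pass

    # Imitation of the void-method mutator: the method does nothing.
    setattr(Mutant, method_name, lambda self, *args: None)
    try:
        test(Mutant())
    except AssertionError:
        return False  # mutant killed
    return True  # mutant survived


print("weak test, close() emptied, survives:",
      survives_empty_body(weak_test, "close"))
print("strong test, close() emptied, survives:",
      survives_empty_body(strong_test, "close"))
```

With only the weak test, close() would be reported as pseudo-tested although it is fully covered. The strengthened test kills the mutant, but only because the session_count accessor exists, which mirrors the observation that some fixes require new methods in the class under test.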
TABLE I
MUTATION TESTING RESULTS

Type                   Score  Killed / Total    Survived  Mutators  Executed tests  Time
Extreme                85%    2,297 / 2,706     409       19        597,877         13 min
Traditional (default)  73%    5,813 / 7,989     2,176     7         4,228,169       37 min
Traditional (all)      70%    55,515 / 79,350   23,835    66        34,391,374      4 h 1 min
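The scores and survivor counts in Table I can be recomputed from the killed and total columns. The following Python sketch assumes that the reported score is simply killed divided by total generated mutants, without removing equivalent mutants; the function name mutation_score and the variable names are illustrative only.

```python
# Rows of Table I as (type, killed mutants, total mutants).
rows = [
    ("Extreme", 2297, 2706),
    ("Traditional (default)", 5813, 7989),
    ("Traditional (all)", 55515, 79350),
]


def mutation_score(killed, total):
    """Ratio of killed mutants to all generated mutants."""
    return killed / total


for name, killed, total in rows:
    survived = total - killed
    score = mutation_score(killed, total)
    print(f"{name}: score = {score:.0%}, survived = {survived}")
```

Rounded to whole percentages, the results match the Score column, and the differences match the Survived column.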

These assertion-free tests pass as long as no failure occurs, which means that no functionality is checked by verifying any output. Some of these tests could be fixed by simply capturing the output and adding assert statements. Other tests, like the example concerning the connections, required adding new methods to the class under test to obtain information such as the session count, in addition to adding assertions to the test.

We also found three incomplete tests. These are tests that are well designed with strong assertions but fail to check certain properties of a class. For example, a class that creates a table was tested for writing the correct data rows, but the table header was not tested, although the code producing it was called. This was easily fixed by adding another assertion for the table header.

The remaining 14 methods are called during testing but are not the subject of the tests and do not have an impact on the test result. Because of that, we named the last category side-effects. Typical side-effect methods are, e.g., custom methods that are responsible for logging or methods that store meta-data that is never used or verified elsewhere by the test suite.

TABLE II
METHODS, LINES, AND COVERAGE

Measure          Pseudo-tested  Total   Proportion
Methods          291            2,041   14%
Lines (covered)  835            11,189  7%
Lines (total)    1,129          12,572  9%

B. Interview results

1) Main reasons to write unit tests: In general, the developers write unit tests to verify that code works as expected, to ensure that future code changes do not break anything, and to write clean code by using, for example, the test-driven software development approach. Neither developer writes unit tests because of the company's existing 80% line-coverage policy.

2) Confidence in code coverage: The developers mainly use code coverage to spot regions in the code that have not been tested so far. They have some confidence that code coverage is an accurate measure to judge the quality of the test suite when using it for regression testing but are aware of the caveats. Although one of the main reasons to write unit tests is to verify that the code works as expected, the developers have very little confidence that code coverage can accurately depict that level of verification. Code coverage is not seen as an appropriate measure to judge how bug-free the code is.

3) Relevance of pseudo-tested methods: The developers see remediating pseudo-tested methods as relevant when they want to improve regression testing and to verify that the code works as expected. They do not see remediating pseudo-tested methods as relevant for writing clean code.

4) Pseudo-tested methods in the development process: When asked at which stage in development the knowledge about pseudo-tested methods would most likely result in improving the test suite, both developers answered: at the time of writing the unit tests. The best point in time would be while working with an IDE or when invoking the build on the command line, because the developers who write the unit tests have the expert knowledge and could quickly add new tests. The developers stated that the information would still be useful when a continuous integration pipeline produces a report, but whether they would fix the pseudo-tested method would strongly depend on the available time and priorities. They would probably not remediate pseudo-tested methods spotted during a code review when merging code changes or reported by a separate quality assurance team because they usually have other priorities at these stages.

5) Acting upon pseudo-tested methods: The developers would have written more or better tests if pseudo-tested methods were highlighted in code coverage reports. However, the developers stated one exception where they probably would not have enhanced the test suite. This was the case for a subset of unit tests that tested classes of a particular package but revealed pseudo-tested methods in classes of a completely different package that was not the target of the unit tests.

V. EVALUATION

A. Answer to RQ1 – Relevance of execution and analysis time

Traditional mutation testing with seven default mutators takes almost three times as long (37 min.) as extreme mutation testing (13 min.), despite having only about a third of the number of mutators (see table I). In practice, the execution time requirements strongly depend on the phase of the development process in which mutation testing is used. When analyzing mutation testing results in a separate session, none of the execution times would actually be troublesome because mutation testing would be performed in the background and analyzed later. However, when trying to perform mutation testing during the development of unit tests, as the interview results suggest (see section IV-B4), all of the execution times are too
long and thus impractical. This may change when mutation testing is not performed for the whole test suite. For that, extreme mutation testing is more promising due to its lower time requirements.

Table I shows that the coarse-grained approach of extreme mutation testing results in only about a fifth of the surviving mutants (409) compared to the traditional approach with default mutators (2,176). This number is further lowered to 291 when working on a method level to check which methods are pseudo-tested (see table II). These numbers are more relevant in practice. If we hypothetically assume that a pseudo-tested method can be remediated in only five minutes, then even remediating 291 pseudo-tested methods would take about 24 hours. Applying the same calculation to all 2,176 surviving mutants of the traditional approach would result in about seven and a half days (roughly 181 hours) of analysis. Thus, we conclude that the improved execution and analysis times are relevant and noticeable in practice in terms of execution time and workload for the developers.

B. Answer to RQ2 – Impact on established testing practices

Extreme mutation testing supplements writing unit tests in conjunction with code coverage. Unit tests are mainly written to verify that the code works as expected and to serve as regression tests for future code changes (see sections IV-B1, IV-B2). Pseudo-tested methods are seen as relevant for improving both (see section IV-B3). Line coverage alone does not accurately depict how effective a test suite is. We found 1,129 lines of code (see table II) where we could easily introduce new bugs that would remain unnoticed by the given test suite. In fact, table II shows that when counting the 835 covered pseudo-tested lines as “uncovered”, the total line coverage would have to be adjusted from 89% (11,189 of 12,572 lines) to 82% (10,354 of 12,572 lines).

C. Answer to RQ3 – Industry adoption

Despite the mentioned advantages in terms of execution and analysis time (see section V-A) and its positive impact on established software testing practices (see section V-B), extreme mutation testing is comparatively young with only a few available implementations [4] and thus is not widely known yet. We noticed that there are two areas to improve to foster industry adoption: tooling and usage.

For tooling, we found that the results of extreme mutation testing would be best represented in code coverage reports by highlighting the pseudo-tested code (see section IV-B5). This is currently not supported by the Descartes engine. The advantage of that approach is that developers are already familiar with coverage reports, and it is easy to implement.

For usage, further research should investigate how extreme mutation testing has to be improved such that it can be run while writing unit tests. This is the point in time at which issues are most likely fixed (see section IV-B4). Also, we observed that the mutation score is irrelevant when using extreme mutation testing in practice because some methods are not worth fixing (see section IV-B5). This was also found in another study [7]. It would probably be enough to only enhance weak tests with no assertions and incomplete tests (see section IV-A). For pseudo-tested side-effect methods, reflecting them in the coverage reports would be sufficient.

VI. THREATS TO VALIDITY

The presented numbers rely on the correctness of the mutation testing tool PIT. Some tests of the software project are designed to test parallel execution features. Other tests rely on static classes and their fields and depend on each other, which influences the test result when they are executed in parallel or split into independent chunks, which is what PIT does. Thus, for PIT, we set the number of execution threads to one and tried to remove parallel tests where possible. Nonetheless, we still found nine methods that belong to a class that included “parallel” as part of its name. However, we ran both the extreme mutation tests and the traditional mutation tests with default mutators three times each and always got the same results.

Further, we only interviewed two experienced developers, who mostly agreed in their opinions. This does not necessarily mean that the results generalize well or that other developers would rate the relevance of pseudo-tested methods or the use of extreme mutation testing the same way they did.

VII. CONCLUSIONS

We have shown that extreme mutation testing is supplemental to current software testing practices. It can help to write better unit tests and addresses some shortcomings of code coverage. Compared to the traditional approach, the faster execution time is clearly noticeable in practice and provides a good starting point for further research. For result presentation, we consider it best to integrate the findings into existing coverage reports because this is easy to implement and the format is already known to the developers. Future research should investigate how to apply extreme mutation testing while writing unit tests such that the results are available fast enough but still meaningful. Lastly, future research should also analyze the severity and number of issues that remain undiscovered by the extreme approach, which trades accuracy for speed gains.

REFERENCES

[1] Y. Jia and M. Harman, “An analysis and survey of the development of mutation testing,” IEEE Transactions on Software Engineering, vol. 37, no. 5, pp. 649–678, 2010.
[2] M. Papadakis, M. Kintis, J. Zhang, Y. Jia, Y. Le Traon, and M. Harman, “Mutation testing advances: an analysis and survey,” in Advances in Computers. Elsevier, 2019, vol. 112, pp. 275–378.
[3] R. Niedermayr, E. Juergens, and S. Wagner, “Will my tests tell me if I break this code?” in 2016 IEEE/ACM International Workshop on Continuous Software Evolution and Delivery (CSED). IEEE, 2016, pp. 23–29.
[4] R. Niedermayr, “Evaluation and improvement of automated software test suites,” Ph.D. dissertation, University of Stuttgart, 2019. [Online]. Available: http://dx.doi.org/10.18419/opus-10640
[5] H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ventresque, “PIT: a practical mutation testing tool for Java,” in Proceedings of the 25th International Symposium on Software Testing and Analysis, 2016, pp. 449–452.
[6] O. L. Vera-Pérez, M. Monperrus, and B. Baudry, “Descartes: a PITest engine to detect pseudo-tested methods: tool demonstration,” in 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2018, pp. 908–911.
[7] O. L. Vera-Pérez, B. Danglot, M. Monperrus, and B. Baudry, “A comprehensive study of pseudo-tested methods,” Empirical Software Engineering, vol. 24, no. 3, pp. 1195–1225, 2019.
