Assessing the Value of Coding Standards: An Empirical Study∗

Cathal Boogerd
Software Evolution Research Lab
Delft University of Technology
The Netherlands
c.j.boogerd@tudelft.nl

Leon Moonen
Simula Research Laboratory
Norway
Leon.Moonen@computer.org

Abstract

In spite of the widespread use of coding standards and tools enforcing their rules, there is little empirical evidence supporting the intuition that they prevent the introduction of faults in software. Not only can compliance with a set of rules having little impact on the number of faults be considered wasted effort, but it can actually result in an increase in faults, as any modification has a non-zero probability of introducing a fault or triggering a previously concealed one. Therefore, it is important to build a body of empirical knowledge, helping us understand which rules are worthwhile enforcing, and which ones should be ignored in the context of fault reduction. In this paper, we describe two approaches to quantify the relation between rule violations and actual faults, and present empirical data on this relation for the MISRA C 2004 standard on an industrial case study.

1. Introduction

Coding standards have long been used to guide developers to a common style or to prevent them from using constructions that are known to be potentially problematic. For instance, Sun publishes its internal Java coding conventions, arguing that they help new developers to more easily understand existing code and reduce maintenance costs [16]. Another example is the MISRA C standard [17], specifically targeted towards safe use of C in critical systems. The rules in these standards are typically based on expert opinion, and can be targeted towards multiple goals, such as reliability, portability or maintainability. In this paper, we focus on reliability, i.e., the prevention of faults by the adoption of coding rules. In this context, the MISRA standard is of particular interest, as it places a special emphasis on automatic checking of compliance. The rationale for automated checking is that it is of little use to set rules if they cannot be properly enforced.

Such a check typically takes the form of a static source code analyzer, of which well-known academic options (e.g., [10, 5, 6]; for a description and more tools see [4]) and commercial tools (e.g., QA-C,1 K7,2 and CodeSonar3) abound. These tools are equipped with a pre-defined set of rules, but can often be customized to check for other properties as well. As such, when defining a coding standard for a certain project, one can use the existing tools as well as write one's own. In a recent investigation of bug characteristics, Li et al. argued that such early automated checking has contributed to the sharp decline in memory errors present in software [15]. Thus, with rules based on long years of experience, and support for automated checking readily available, universal adoption of coding standards should be just a step away. But is it?

Developers often have less sympathy for these tools, since they typically produce an overload of non-conformance warnings (referred to as violations in this paper). Some of these violations are by-products of the underlying static analysis, which cannot always determine whether code violates a certain check or not. Kremenek et al. [14] observed that all tools suffer from such false positives, with rates ranging from 30-100%. In addition to the noise produced by the static analysis, there is the noise of the ruleset itself. For many of the rules in coding standards there is no substantial evidence of a link to actual faults. Moreover, coding standards necessarily limit themselves to generic rules, which may not always be applicable in a certain context. These problems constitute a barrier to adoption of both standard and conformance checking tool.

Apart from barring adoption, there is an even more alarming aspect to having noisy rules (i.e., rules leading to many violations unrelated to actual faults) in a standard. Adams [1] first noted that any fault correction in software has a non-zero probability of introducing a new fault, and if this probability exceeds the reduction achieved by fixing the violation, the net result is an increased probability of faults in the software. Enforcing conformance to noisy rules can thus increase the number of faults in the software. As Hatton remarks when discussing the MISRA 2004 standard: "the false positive rate is still unacceptably high with the accompanying danger that compliance may make matters worse not better" [8].

Clearly, it is of great interest to investigate whether there is empirical evidence for the relation between coding standard rule violations and actual faults. This relation is especially important when, rather than striving for full compliance in order to meet certain regulations, the body of rule violations is used as a means to assess software reliability (i.e., the number of dormant faults), as is increasingly becoming popular in industry. Therefore, the goal of this paper is:

    To empirically assess the relation between violations of coding standard rules and actual faults

where we define a violation to be a signal of non-conformance of a source code location to any of the rules in the standard, and a fault to be an issue present in the issue tracking system, which may be related to multiple locations in the source code. In the process of this investigation, we make the following contributions:

• We propose two methods to quantify the relation between violations and faults, and describe their requirements with respect to the richness of the input data set (i.e., the software history).

• We present empirical data obtained from applying both methods to an industrial embedded software project and the MISRA C 2004 coding standard. The data consists of correlations between rule violations and faults, as well as measured true positive rates for individual rules.

We describe the two approaches and the measurement setup in Section 2, and the particular project and standard under study in Section 3. The resulting empirical data is presented in Section 4. We evaluate the results and the two methods in Section 5. Finally, we compare with related work in Section 6 and summarize our findings in Section 7.

∗ This work has been carried out in the Software Evolution Research Lab at Delft University of Technology as part of the TRADER project under the responsibility of the Embedded Systems Institute. This project is partially supported by the Netherlands Ministry of Economic Affairs under the BSIK03021 program.
1 www.programmingresearch.com
2 www.klocwork.com
3 www.grammatech.com

978-1-4244-2614-0/08/$25.00 © 2008 IEEE 277 ICSM 2008

2. Methodology

To assess the relation between violations and faults, we take two approaches. The first is a more high-level approach, studying the co-evolution of violations and faults over time and testing for the presence of a correlation. The second approach looks at individual violations in more detail, tracking them over time and seeing how often they correctly signal actual problem locations, giving us a true positive rate for every rule. We will discuss both methods in more detail, after which we describe how to obtain the necessary measures.

2.1. Temporal coincidence

To quantify the relation between rule violations and actual faults we look at the software history present in a Software Configuration Management (SCM) system and the accompanying issue tracking system, further referred to as the Problem Report (PR) database. Using the version information, we can reconstruct a timeline with all open PRs as well as the rule violations. If violations truthfully point at potential fault areas, we expect more violations for versions with more faults present. Also, the reverse holds: if a known fault was removed in a certain version, it is probable that violations disappearing in that same version were related to the fault. This covariation can then be observed as a positive correlation between the two measures. Of course, this approach results in some noise, because violations disappearing during modifications not related to fixing faults are also included in the correlation.

2.2. Spatial coincidence

The noise present in the previous approach can be partially eliminated by requiring proximity in space as well as in time. If we use the data present in the SCM system and PR database to link PRs with their related changes in the version history, we can determine which changes were fault-fixing. Violations on lines that are changed or deleted in non-fix modifications can thus be ruled out as likely false positives. With such information, we can determine for every rule how many violations were accurately pointing out a problem area: the true positives.

However, the prime prerequisite for this approach is also its Achilles' heel. Establishing a link between faults and source code modifications requires either integrated support by the SCM system, or a disciplined development team annotating all fix changes with informative commit messages (e.g., referring to the identifier of the problem in the PR database). Such information may not always be present, which limits the applicability of the method. This is the reason why we have chosen to use both methods on one case: it allows us to use the results of the second to assess the impact of the noise present in the first. If the noise impact is small, we can safely use the first method on a wider array of cases, expanding our possibilities to gather empirical data.

2.3. Measurement approach

To gather the measurements for a large software history we need an automated measurement approach. This raises a number of challenges, and we will discuss for each measure how we have solved them. Summarizing from the previous sections, the measures needed in this investigation are the number of violations per version, the number of open (i.e., unsolved) PRs per version, and the true positive rate for all rules. The number of open PRs per version is probably the easiest, as we can simply use the status field of the PR to determine whether it was unsolved at the moment the version was released or built. Determining the number of violations per version requires a bit more work, and comprises

the following steps:

• Retrieving a full version from the SCM. Since we will run a static analyzer on the retrieved source code, this needs to be a proper, compilable set of source files, as such tools often mimic the build environment in order to check compliance.

• Extracting the configuration information necessary to run the compliance checker. This kind of information is usually present in Makefiles or project files (e.g., Visual Studio solution files).

• Running the compliance checker using the extracted configuration information, saving the rule violations and their locations (file name, line number) within the source archive.

To determine the true positive rates we use a process similar to the one used in the warning prioritization algorithm by Kim and Ernst [11]. However, whereas they track bug-fix lines throughout the history, we track the violations themselves. Tracking violations over different versions, although not too difficult, is not widely addressed; we are only aware of the work by Spacco et al. [20].

In our approach, we build a complete version graph for every file in the project using the predefined successor and predecessor relations present in the SCM. The version graph allows us to accurately model branches and merges of the files. For every edge in the graph, i.e., a file version and one of its successors, the diff is computed. These differences are expressed as an annotation graph, which represents a mapping of lines between file versions.

Violations are uniquely identified by triplets containing the file name, line number and rule identifier. Using the version graph and the annotation graph associated with every edge, violations can be tracked over the different versions. As long as a violation remains on a line which has a destination in the annotation graph (i.e., is present in the next file version), no action is taken. When a line on which a violation has been flagged is changed or removed, the PR database is consulted to see whether the modification was fix-related. If it was, the score for the rule is incremented; in both cases, the total incidence count for the rule is incremented. Violations introduced by the modification are considered to constitute a new potential fault, and are tracked similarly to the existing ones. In the last version present in the graph, the number of remaining violations is added to the incidence count (on a per-rule basis), as these can all be considered false positives.

3. Case Description

To the best of our knowledge, there is no previous complete empirical data from a longitudinal study on the relation between violations and faults. As a result, we do not even know whether assessing this relation is feasible. We therefore decided to use one typical industrial case as a pilot study. NXP (formerly Philips Semiconductors), our industrial partner in the TRADER4 project, was willing to provide us with tooling and access to its TV on Mobile project.

3.1. TV on Mobile

This project (TVoM for short) was selected because of its relatively high-quality version history, featuring the well-defined link between PRs and source code modifications we need for our spatial coincidence method. It represents a typical embedded software project, consisting of the driver software for a small SD-card-like device. When inserted into the memory slot of a phone or PDA, this device enables one to receive and play video streams broadcast using the Digital Video Broadcast (DVB) standard.

The complete source tree contains 148 KLoC of C code, 93 KLoC of C++, and approximately 23 KLoC of configuration items in Perl and shell script (all reported numbers are non-commented lines of code). The real area of interest is the C code of the actual driver, which totals approximately 91 KLoC. Although the policy of the project was to minimize compiler warnings, no coding standard or code inspection tool was used. This allows us to actually relate rule violations to fault-fixing changes; had the developers conformed to the standard we are trying to assess, they would probably have removed all these violations right away.

3.2. MISRA-C: 2004

The first MISRA standard was defined by a consortium of UK-based automotive companies (The Motor Industry Software Reliability Association) in 1998. Acknowledging the widespread use of C in safety-related systems, the intent was to promote the use of a safe subset of C, given the unsafe nature of some of its constructs [7]. The standard became quite popular, and was also widely adopted outside the automotive industry. In 2004 a revised version was published, attempting to prune unnecessary rules and to strengthen existing ones. However, Hatton [9] argued that even these modifications could not prevent the standard from having many false positives among reported violations, so "a programmer will not be able to see the wood for the trees". Currently, NXP is also introducing MISRA-C: 2004 as the standard of choice for new projects. Given the wide adoption of the MISRA standard, an assessment of its rules remains a relevant topic.

Copyright laws prevent us from quoting the MISRA rules themselves. However, a discussion on the content of these rules is beyond the scope of this paper.

4 www.esi.nl/trader

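To make the bookkeeping of Section 2.3 concrete, the following sketch shows how per-rule true positive rates could be tallied from tracked violations. It is a minimal illustration with hypothetical data structures (`transitions`, `fix_lines`), not the authors' actual tooling:

```python
# Illustrative sketch of the per-violation bookkeeping of Section 2.3;
# the data structures here are hypothetical, not the authors' Qmore setup.
from collections import defaultdict

def true_positive_rates(transitions, fix_lines, final_violations):
    """Compute a true positive rate per rule.

    transitions: list of (violations, line_map) pairs, one per version
      transition; `violations` is a set of (file, line, rule) triplets and
      `line_map` maps (file, line) to its line number in the next version
      (a missing key means the line was changed or deleted).
    fix_lines: set of (file, line) pairs changed or deleted by fix-changes.
    final_violations: triplets still present in the last version; these
      all count as false positives.
    """
    fix_hits = defaultdict(int)   # violations resolved by a fix-change
    incidence = defaultdict(int)  # total counted violations per rule
    for violations, line_map in transitions:
        for (path, line, rule) in violations:
            if (path, line) in line_map:
                continue  # line survives into the next version: keep tracking
            incidence[rule] += 1  # the violation's line was changed or deleted
            if (path, line) in fix_lines:
                fix_hits[rule] += 1  # removal coincided with a fault fix
    for (path, line, rule) in final_violations:
        incidence[rule] += 1  # remaining violations are false positives
    return {rule: fix_hits[rule] / incidence[rule] for rule in incidence}

# Hypothetical example: a violation of rule 14.2 disappears in a fix-change,
# while a violation of rule 10.1 survives to the end of the history.
transitions = [({("drv.c", 10, "10.1"), ("drv.c", 20, "14.2")},
                {("drv.c", 10): 10})]
rates = true_positive_rates(transitions,
                            fix_lines={("drv.c", 20)},
                            final_violations={("drv.c", 10, "10.1")})
# rates == {"14.2": 1.0, "10.1": 0.0}
```

A violation is only scored once its line is touched; whether that touch was a fix determines if it counts toward the rule's hits, mirroring the incidence counting described above.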
Figure 1. Number of faults and violations over time (dots, left axis: number of faults; triangles, right axis: number of violations; x-axis: revisions over time)

Figure 2. All MISRA violations versus open PRs (x-axis: violation density; y-axis: fault density)

3.3. Experimental details

The complete version history of the project runs from February 2006 until June 2007, but up to August 2006 the organization and structure of the project was volatile and the version history (as a result) erratic. We therefore selected the period from August 2006 onwards as input to our analysis. It contains 214 versions, related to both daily builds and extra builds in case the normal daily build failed. In addition, we selected PRs from the PR database that fulfilled the following conditions: (1) classified as 'problem' (thus excluding change requests); (2) involved with C code; and (3) had status 'solved' by the end date of the project.

NXP has developed its own front-end to QA-C 7, dubbed Qmore, that uses the former to detect MISRA rule violations. Configuration information needed to run the tool (e.g., preprocessor directives or include directories) is extracted from the configuration files (Visual Studio project files) driving the daily build, which also reside in the source tree.

The SCM system in use at NXP is CMSynergy, which features a built-in PR database next to the usual configuration management operations. The built-in solution allows linking PRs and tasks, which group conceptually related source code modifications. Using this mechanism, we are able to precisely extract the source lines involved in a fix.

4. Results

The results in this section comprise two different parts: first we look at some evolution data of the project, to rule out any abnormalities or confounding factors that might distort our measurements; in the second part we examine the results of the rule violation and fault relation analysis.

4.1. Project evolution data

In Figure 1 the number of open PRs for every version (dots; left y-axis) is plotted. Since we have not included the first half year of the project, the number of open PRs starts at 27, steadily increasing until it almost hits 40, then jumps down (in February 2007) until it reaches 0 by the end of the project history. This is typical behavior in a project like this, where at first the focus is on implementing all features of the software, and only the most critical PRs are solved immediately. Only when the majority of features has been implemented is a concerted effort made to solve all known problems.

The rule violation history, also in Figure 1 (triangles; right y-axis), does not quite follow this trend. At the point where the number of faults decreases sharply, the number of violations actually increases. Also, the gap in the number of violations is striking. After manual inspection we found that this is due to one very large header file (8 KLoC) being added at that point. In addition, the effect in the graph is emphasized because of the partial scale of the vertical axis.

Clearly, with the size of the software (LoC) having such a profound effect on the number of rule violations, an analysis of the relation between violations and faults should rule out this confounding factor. To this end, we will only use fault and violation densities, i.e., the numbers per version divided by the number of KLoC for that version. Figure 2 shows the relation between violation density and fault density for the complete set of MISRA rules. In this figure, every point represents one version of the software. If there were a positive correlation, we would see the points clustering around the diagonal to some extent. However, quite the contrary can be observed: the correlation for the MISRA standard as a

Figure 3. Relation between violations and faults for different subsets: (a) positive relation, (b) no relation, (c) negative relation (x-axes: violation density; y-axes: fault density)

whole looks negative rather than positive.

4.2. Analysis of individual rules

Now that we have seen the correlation for MISRA as a whole, we will look more closely at the influence of individual rules. To do this, we create graphs similar to Figure 2, but using only violations for a single rule. These are visually inspected to determine whether there exists a correlation between the two measures. If the distribution is acceptable, we classify the rule as either 'positive' or 'negative'; if there are too many outliers it is classified as 'none'. The resulting classification for all rules can be observed in columns 2-4 of Table 1. This table does not include all rules in the MISRA standard, as we could only include those for which we have observed violations (in total 72 out of 141 rules).

In addition, we use a least squares linear regression model to summarize the characteristics of the correlation. Columns 5-7 display the properties of this model: the linear coefficient, or slope of the fitted line; the percentage of variation explained by the model, or the square of the correlation coefficient R; and the statistical significance, expressed as the probability that the correlation found is simply coincidence. Significance is presented in star notation, with 'ns' for p > 0.05, * for p < 0.05, ** for p < 0.01, *** for p < 0.001 and **** for p < 0.0001. In general, we can observe that the linear models for rules in the classes 'positive' and 'negative' explain a large part of the variation: R² > 0.75.

From an automation point of view, it may be appealing to simply rely on the summary statistics, such as the correlation coefficient R. However, this does not guarantee the existence of a true relation, as the input distribution may, for instance, contain too many outliers, something which can easily be ascertained using the aforementioned scatter plots [2]. Due to space limitations, we have not included scatter plots for all individual rules, but have plotted aggregated results for each class in Figure 3. These figures also display the fitted line of the regression model.

Another way in which we verified that the data does not violate the assumptions of the regression was by inspecting residual plots. One example is displayed in Figure 4 (for the 'positive' set). The upper left graph contains the data and the fitted line; the upper right plots residuals versus fitted values, which should be spread with no apparent relation around the horizontal axis. Lower left is a histogram of the residuals, which should be a normal distribution, and lower right a normal plot of the residuals, where they should be laid out more or less on the diagonal. In this case, although slightly skewed, the distribution is acceptable.

The two remaining columns in Table 1 display the total number of violations and the true positive rate for each rule. The violations are the number of unique violations over the complete observed history. That is, if exactly the same violation was present in two versions, it was counted only once. The true positive rate has been computed by dividing the number of violations on lines changed or deleted in fix-changes by the total number of violations. The corresponding per-class data are displayed in Table 2 in a manner similar to Table 1.

If we see rule violations as a line-based predictor of faults, the true positive rate can be considered a measure of the prediction accuracy. Consequently, when comparing rules 1.1 and 2.3 this rate suggests the former to be more accurate. However, in a population of 62 violations (for 1.1) it is likely that there are some correct violations simply due to chance, which is unlikely if there are few violations. Therefore, true positive rates cannot be used in their own right to compare rules on accuracy (unless the number of violations is the same).

We can assess the significance of the true positive rate by comparing it to a random predictor. This predictor randomly selects lines out of a population of over 300K unique lines in the TVoM history, of which 17% was involved in a fix change. Since the population is so large, and the number of violations per rule relatively small (max. 0.5% of the

281
Table 1. Relation between violations and faults per rule
(Columns: MISRA rule; linear coefficient; variation explained, R²; significance; total violations; true positives, ratio; prediction performance.)

Rule     Coef.    R²    Sig.   Viol.  TP    Pred.
1.1      -1.07    0.86  ****     62   0.08  0.04
1.2       3.92    0.77  ****     26   0.23  0.86
2.3      53.13    0.86  ****      1   0.00  0.83
3.1      12.80    0.46  ****     11   0.00  0.13
3.4       0.98    0.17  ****     15   0.00  0.06
5.1     -19.43    0.85  ****      2   0.00  0.69
5.2      -9.10    0.85  ****     10   0.00  0.16
5.3      21.10    0.83  ****      2   0.50  0.97
5.6      -0.56    0.57  ****    278   0.08  0.00
6.1     -19.43    0.85  ****      2   0.00  0.69
6.2      -2.43    0.85  ****     16   0.00  0.05
6.3       0.06    0.82  ****   1675   0.10  0.00
8.1       1.69    0.79  ****     52   0.35  1.00
8.4      13.85    0.24  ****      1   0.00  0.83
8.5      18.41    0.42  ****      2   1.00  1.00
8.7      -7.97    0.82  ****     46   0.30  0.99
8.8       0.51    0.84  ****    115   0.10  0.03
8.10      0.83    0.80  ****     90   0.01  0.00
8.11     -4.27    0.08  ****      4   0.25  0.86
9.1       0.80    0.06  ***      15   0.27  0.90
9.2      -3.90    0.01  ns        2   1.00  1.00
9.3      -4.55    0.85  ****     12   0.00  0.11
10.1      0.36    0.79  ****    490   0.13  0.02
10.6     -0.13    0.85  ****    378   0.02  0.00
11.1      3.08    0.39  ****     38   0.26  0.95
11.3     -1.92    0.86  ****    108   0.06  0.00
11.4     -0.43    0.37  ****    151   0.06  0.00
11.5      0.59    0.11  ****     26   0.08  0.16
12.1      0.33    0.85  ****    151   0.08  0.00
12.4     -2.07    0.16  ****     19   0.11  0.35
12.5     -0.71    0.75  ****    110   0.05  0.00
12.6     -4.72    0.84  ****     11   0.00  0.13
12.7      1.40    0.87  ****     91   0.24  0.97
12.8      1.21    0.83  ****     68   0.09  0.04
12.10   -38.86    0.85  ****      1   0.00  0.83
12.13     5.25    0.79  ****     37   0.27  0.96
13.1     -0.39    0.85  ****    121   0.01  0.00
13.2     -0.09    0.85  ****    781   0.06  0.00
13.7     -2.51    0.84  ****     26   0.12  0.33
14.1     -0.11    0.02  *       191   0.04  0.00
14.2      0.27    0.89  ****    260   0.30  1.00
14.3     -0.11    0.85  ****    818   0.01  0.00
14.5    -15.28    0.84  ****      7   0.00  0.27
14.6     -0.74    0.82  ****     70   0.01  0.00
14.7     -0.35    0.79  ****    427   0.03  0.00
14.8     -0.17    0.85  ****    282   0.02  0.00
14.10     4.58    0.21  ****     53   0.09  0.09
15.2    -11.91    0.84  ****      3   0.00  0.57
15.3     -6.29    0.87  ****     15   0.00  0.06
15.4    -14.22    0.87  ****      7   0.00  0.27
16.1    -13.26    0.27  ****     15   0.00  0.06
16.4      1.04    0.03  *        64   0.12  0.22
16.5      2.50    0.82  ****     20   0.15  0.55
16.7     -0.69    0.84  ****    143   0.08  0.00
16.8     11.13    0.84  ****      2   0.00  0.69
16.9     53.13    0.86  ****      2   0.50  0.97
16.10    -0.10    0.85  ****   1845   0.06  0.00
17.4     -0.39    0.78  ****    205   0.09  0.00
17.6    -38.86    0.85  ****      1   1.00  1.00
18.4      3.10    0.79  ****     19   0.05  0.14
19.4     -3.05    0.32  ****     37   0.05  0.04
19.5     10.63    0.86  ****      5   0.00  0.39
19.6     10.63    0.86  ****      5   0.00  0.39
19.7     -5.03    0.81  ****     37   0.03  0.01
19.10    -1.83    0.09  ****     37   0.00  0.00
19.11     8.03    0.79  ****      5   0.00  0.39
20.2     -3.56    0.57  ****     22   0.00  0.02
20.4     -1.23    0.06  ***      39   0.00  0.00
20.9      5.92    0.57  ****     16   0.00  0.05
20.10    13.85    0.24  ****      1   0.00  0.83
20.12    -8.40    0.44  ****     16   0.19  0.72
21.1     -2.80    0.84  ****     25   0.08  0.18

population), we can model this as a Bernoulli process with p = 0.17. The number of successful attempts, or correctly predicted lines, has a binomial distribution; using the cumulative distribution function (CDF) we can indicate how likely a rule is to outperform the random predictor in terms of accuracy. This value is displayed in the last column of the table, and can be compared across rules. Note that none of the three classes in Table 2 has an average true positive rate higher than the expected value of the random predictor (0.17). As a result, the value of the CDF for all three classes is near-zero.

5. Evaluation

In this section, we will discuss the meaning of the results, as well as the applicability and limitations of the methods used

to obtain them.

Table 2. Aggregated statistics per class

Rule class      Linear coeff.  Variation explained (R²)  Sig.  Total violations  Average violations  True positives (ratio)
Positive (18)    0.03          0.87                      ****  3092              171.78              0.13
None (25)       -0.10          0.15                      ****  1077              43.08               0.07
Negative (29)   -0.02          0.85                      ****  5571              192.10              0.05

Figure 4. Residual plots for the positive class (upper left: data and fitted line, y = −0.02x + 1.85; upper right: residuals vs. fitted values; lower left: histogram of the residuals; lower right: normal plot of the residuals)

5.1. Meaning of the correlations

The first important qualification of our results is the fact that they are correlations, and therefore not proof of causality.
However, the positive and negative relations observed explain a large amount of the variation (R² > 0.75) and are statistically significant. What, then, do they mean?

A positive correlation is rather intuitive: we observe an increasing violation density during the first phase of the project, where most of the faults are introduced. An increase in violation density means that the areas the developers have been working on violate the rule relatively often. The decrease of violation density in the second phase of the project signals that many violations have been removed at a time when there was a focus on fixing faults. Possibly, some violations were introduced in unrelated areas during the first phase, but seeing them removed in the second phase strengthens the intuition that they were indeed related to a fault.

Similar to rules with positive correlations, rules with negative correlations experience a relatively high rate of injection, not just in the first phase, but also in the second. For a negative correlation to occur, the violation density for the rule must keep increasing, even when the fault density decreases. Since the changes are relatively small compared to the complete codebase (on average only adding 9 lines), injection of a single violation without removal somewhere else already results in an increased density. In other words, violations of these rules are seldom removed: apparently, the areas they reside in were not considered relevant to fixing faults.

Intuitively, the magnitude of the coefficient of the linear model should be seen as a quantification of the ratio of true positives. When looking closer, however, we see that the coefficient is large (either positive or negative) in cases of a rule with few violations, as a small number of violations is compared to a greater number of faults. This small number of violations makes it more difficult to distinguish a relation, which is why many of those rules have been classified as 'none' in Table 1. This actually explains all but rules 5.6, 11.4 and 14.1 in this category, for which we did observe a large number of violations. Re-examination of the correlation plots revealed distributions of points that suggest a correlation, except for a cluster of close outliers. This was the result of files being moved in and out of the main branch of development, these files containing a relatively high number of violations of the particular rule. Although we try to compensate for this by using densities, in some cases the influence of one file can still distort an otherwise strong correlation.

5.2. Correlations and true positive rates

Overall, the true positive rates computed by also requiring spatial coincidence partially match the partitioning of the rules into classes using only temporal coincidence. This can be seen in Table 2, where the 'positive' class shows a higher true positive rate than both others, and 'none', in turn, a higher rate than 'negative'. However, there are some individual rules whose inconsistency between class and measured true positive rate warrants further explanation. Two categories of interest can be distinguished:

Class positive, yet zero true positive rate  This category (e.g., rule 3.1) illustrates the extra precision gained by also requiring spatial coincidence. These rules have violations removed at the same time faults were removed from the software, and as a result, show a positive correlation. Spatial coincidence shows that, as these removals were not due to a fix-change, they are not likely to be related to the fault removal.

283
Class negative, yet non-zero true positive rate: This category encompasses 18 rules out of the negative class, which itself contains a total of 29. Although at first glance having both a negative correlation and violations removed during fixes may seem paradoxical, this can be understood by looking at the rate of removal as well as the rate of injection. Despite continuous removal (in fixes as well as non-fix-related changes), the violation density keeps increasing, as more violations are being injected by all those changes. Consequently, the measured true positive rates also depend on the point in time at which they are measured (e.g., the beginning, halfway point, or end of the project).

The true positive rates indicate that 15 out of 72 rules outperform a random predictor with respect to selecting fault-related lines. If we only select rules whose true positive rate is significantly different (α = 0.05) from the expected value (0.17), 12 rules remain (those where the performance value is ≥ 0.95). This means that the results of the two approaches agree on only 7 out of the 18 rules in class ‘positive’. Since we can consider the true positive rates a more accurate characterization, it seems unwise to use the correlations as the only means to assess the relation between violations and faults.

However, as we mentioned before, the change in violations over time remains relevant. For instance, in the TVoM project, the most critical faults were removed before the project was suspended, but during maintenance more faults might have been found, altering the measured true positive rate. In addition, the character of the software written may change over time (e.g., from more development in the hardware layer to the application layer), which could also affect the number and type of violations in the software.

Finally, another criterion worth mentioning for assessing the true positive rate is the rate at which faults are injected when violations are fixed. A non-zero injection probability means that the use of at least the 25 out of 72 rules with a zero true positive rate is invalidated. Although the injection rate is not known for this project, in Adams’ study this probability was 0.15 [1]. This value would even invalidate 57 out of 72 MISRA rules. This lends credence to the possibility that adherence to the complete MISRA standard would have had a negative impact on the number of faults in this project.

5.3. Threats to validity

Although some of these issues have been mentioned elsewhere in the text, we summarize and discuss them here for clarity’s sake.

Internal validity: There is one measure in our case study that may suffer from inaccuracies. The fault density for a given version has been measured using the number of open PRs for that version. Since a PR may only be submitted some time after the fault was actually introduced, this results in a ‘lagging’ measured fault density compared to the actual fault density. The number of violations is measured on the source that contains this ‘hidden’ fault, and as a result, the correlation between violation density and fault density may be slightly underestimated.

In addition, the correlations as used in this paper are sensitive to changes in the set of files between the different versions. These changes occur because, in case of a failed daily build, the culprit module (i.e., a set of files) is excluded from the build until repaired, to ensure a working version of the project. Those changes result in the different clusters of points in the scatter plots in Section 4. To see why, consider the following example. Suppose that for a certain rule r, most of the violations are contained in a small subset Sr of all the files in the project. Even if there were a perfect linear relationship between violations of r and the number of faults, this would not be visible in a scatter plot if files in Sr are alternately included in and excluded from subsequent versions of the project. Two clusters of versions would appear; in one, there is no relation, while the other suggests a linear relation. Clearly, this distorting effect only increases as more rules and versions are considered. In future investigations we intend to address this issue by looking at official internal releases instead of daily build versions. This will minimize changes in the composition of files caused by the aforementioned build problems. In addition, we intend to split the analysis between phases of the project that are distinctly different with respect to the composition of files.

The spatial coincidence method produces conservative estimates, for two reasons. First, underestimation might occur with bug fixes that introduce new code or modify spatially unrelated code. Assume that the static analysis signals that some statement, requiring a check on input data, could be reached without performing such a check. Adding the check to solve the problem would then only introduce new code, unrelated to the existing violation. Violations addressing multiple lines are unlikely to occur with generic coding standards, which typically contain rules at the level of a single expression or statement (especially ones based on the safer-subset paradigm, such as MISRA). Therefore, the impact of this problem remains small, as there are usually no violations to link the additional code to. Second, there is the way we handle violations remaining at the end of the project. Those violations might point to faults that simply have not been found yet. But since we do not know this, we take a conservative approach and assume they do not. The influence of such dormant faults is somewhat mitigated in the case of a long-running project, where most of the faults will have been found. Moreover, we can expect that in short-term projects at least the most severe faults have been solved, so the relation between violations and the most critical problem areas in the code can still be assessed.
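To make the spatial coincidence idea discussed above concrete, the following sketch marks a violation as a true positive when the line it flags is touched by a bug-fix change. This is an illustration only, not the tooling used in the study; the file name and line data are invented.

```python
# Hypothetical sketch of spatial coincidence: a violation counts as a
# true positive when the line it flags is modified by a bug-fix change.
# All data below is invented for illustration.

def true_positive_rate(violation_lines, fix_changed_lines):
    """Fraction of flagged lines that coincide with a fix change."""
    if not violation_lines:
        return 0.0
    hits = sum(1 for line in violation_lines if line in fix_changed_lines)
    return hits / len(violation_lines)

# Lines flagged for some rule, and lines touched by fix commits, per file.
violations = {"driver.c": [10, 42, 98]}
fix_changes = {"driver.c": {42, 43, 44}}

rate = true_positive_rate(violations["driver.c"], fix_changes["driver.c"])
print(f"true positive rate: {rate:.2f}")  # one of three violations hits a fixed line
```

Note that, as discussed above, such a rate is conservative: fixes that only add new code never coincide with an existing violation.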
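The significance criterion used in Section 5.2 (whether a rule's true positive rate differs from the 0.17 expected for a random predictor, at α = 0.05) can be approximated with a one-sided binomial test. This is a sketch of the idea under that assumption, not necessarily the paper's exact statistical procedure, and the counts are invented.

```python
# One-sided binomial test: does a rule's true positive rate significantly
# exceed the 0.17 expected for a random line predictor? (Sketch only;
# not necessarily the exact test used in the study. Counts are invented.)
from math import comb

def binom_p_value(hits, n, p):
    """P(X >= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(hits, n + 1))

def outperforms_random(hits, violations, baseline=0.17, alpha=0.05):
    """True if the observed hit rate is significantly above the baseline."""
    return binom_p_value(hits, violations, baseline) < alpha

# A rule with 40 violations, 14 of them on fault-related lines (TPR = 0.35):
print(outperforms_random(14, 40))   # well above the ~6.8 hits expected by chance
print(outperforms_random(7, 40))    # about what a random predictor yields
```

A rule with very few violations illustrates the coefficient caveat from Section 5.1 in this setting as well: with small n, even a high observed rate may not reach significance.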
External validity: Generalizing the measured correlations and true positive rates from the TVoM project to arbitrary other projects may not be easy. In fact, our investigation was partly inspired by the intuition that such relations would differ from project to project. Still, it may be possible to generalize results to other projects within NXP, since (1) all projects use a similar infrastructure and development process; (2) developers are circulated across projects, which lessens the impact of individual styles; and (3) all projects concern typical embedded C development, likely to suffer from some common issues. The only problem in this respect may be that the platform for which the embedded software is developed requires idiosyncratic implementation strategies, violating some coding rules in ways known to be harmless.

Although the approaches used in this paper are generic in nature, some specific (technical) challenges need to be faced when applying them to an arbitrary project. First of all, as mentioned before, linking known faults to the source modifications made to fix them requires a rich data set. Although many studies have lately exploited such a link successfully [15, 19, 13, 21, 12, 22, 18], we found that in our industrial setting, most project histories did not contain the necessary information. This limits the applicability of the spatial coincidence approach. Second, in order to accurately measure the number of rule violations and known faults, it is important to precisely define which source code entities are going to be part of the measurements. Source files for different builds may be interleaved in the source tree, so selection may not always be a trivial matter. For instance, some source files can be slightly altered when built for different platforms, which may influence the number of measured violations. In addition, some faults may be present in the program when built for one platform but not for others, so the selection may also influence the measured number of faults. Finally, the inspection tool used to detect violations can also influence results, as it might produce false positives, i.e., signal a violation when in fact the corresponding code complies with the rule. Unfortunately, some inaccuracy in (complex) static analysis is unavoidable, and its extent may differ from one implementation to the next.

6. Related Work

In recent years, many studies have appeared that take advantage of the data present in SCM systems and bug databases. These studies exhibit a wide array of applications, ranging from an examination of bug characteristics [15] and techniques for automatic identification of bug-introducing changes [19, 13], to bug-solving effort estimation [21], prediction of fault-prone locations in the source [12], and identification of project-specific bug patterns to be used in static bug detection tools [22, 18].

Of all these applications, closest to our work is the history-based warning prioritization approach by Kim and Ernst [11]. This approach seeks to improve the ranking mechanism for warnings produced by static bug detection tools. To that end, it observes which (classes of) warnings were removed during bug-fix changes, and gives the classes with the highest removal rate (i.e., the highest true positive rate) the highest priority. The approach was evaluated using different Java bug detection tools on a number of open-source Java projects. Although they do use the version history as input to the algorithm, the empirical data reported (true positive rates) covers only the warnings of a single version of every project. In addition, there is a potential issue with using true positive rates to compare rules on accuracy if the numbers of warnings for those rules differ (as discussed in Section 4.2). Another difference with our study is the application domain: we assess a coding standard for embedded C development on an industrial case.

While we report on data obtained from a longitudinal study of a single project, Basalaj [3] uses versions from 18 different projects at a single point in time. He computes two rankings of the projects, one based on warnings generated by QA C++, and one based on known fault data. For certain warnings, a positive rank correlation between the two can be observed. Unfortunately, the paper highlights 12 positively correlated warning types and ignores the negatively correlated ones (reportedly, nearly 900 rules were used). Apart from these two studies, we are not aware of any other work that reports on measured relations between coding rules and actual faults.

In contrast to the many studies that use software history, few exist that assess coding standards. The idea of a safer subset of a programming language, the precept on which the MISRA coding standard is based, was promoted by Hatton [7]. In [8] he assesses a number of coding standards, introducing a signal-to-noise ratio for coding standards based on the difference between measured violation rates and known average fault rates. The assessment of the MISRA standard was repeated in [9], where it was argued that the update was no real improvement over the original standard, and that “both versions of the MISRA C standard are too noisy to be of any real use”. The methods we introduce in this paper can be used to specialize a coding standard for a certain project, so as to make the best possible use of its rules. In addition, they can be used to build a body of empirical data with which to assess coding standards in a more general sense; the data presented here are a first step towards that goal.

7. Conclusions

The contributions of this paper are (1) a description and comparison of two approaches to quantify the relation between coding rule violations and faults; and (2) empirical data on this relation for the MISRA standard in the context
of an industrial case.

From the data obtained, we can make the following key observations. First, there are 12 out of 72 rules with observed violations that perform significantly better (α = 0.05) than a random predictor at locating fault-related lines. The true positive rates for these rules range from 23% to 100%. Second, we observed a negative correlation between MISRA rule violations and observed faults. In addition, 25 out of 72 rules had a zero true positive rate. Taken together with Adams’ observation that all modifications have a non-zero probability of introducing a fault [1], this makes it possible that adherence to the MISRA standard as a whole would have made the software less reliable. This observation is consistent with Hatton’s earlier assessment of the MISRA C 2004 standard [9].

These two observations emphasize that it is important to select accurate and applicable rules. Selecting the rules that are most likely to contribute to an increase in reliability maximizes the benefit of adherence while decreasing the necessary effort. Moreover, empirical evidence can give substance to the arguments of advocates of coding standards, making adoption of a standard in an organization easier. However, correlations and true positive rates as observed in this study may differ from one project to the next. To increase confidence in our results, and to investigate whether we can distinguish a consistent subset of MISRA rules positively correlated with actual faults, we intend to repeat this study for a number of projects. In addition, we intend to address some of the issues encountered when using correlations, as discussed in Section 5.

Acknowledgements. The authors wish to thank the people of NXP for their support in this investigation and the anonymous reviewers for their valuable feedback.

References

[1] E. N. Adams. Optimizing Preventive Service of Software Products. IBM J. of Research and Development, 28(1):2–14, 1984.
[2] F. J. Anscombe. Graphs in Statistical Analysis. The American Statistician, 27(1):17–21, 1973.
[3] W. Basalaj. Correlation between coding standards compliance and software quality. White paper, Programming Research Ltd., 2006.
[4] C. Boogerd and L. Moonen. Prioritizing Software Inspection Results using Static Profiling. In Proc. 6th IEEE Intnl. Workshop on Source Code Analysis and Manipulation (SCAM), pages 149–158. IEEE, 2006.
[5] D. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In Proc. 4th Symp. on Operating Systems Design and Implementation, pages 1–16, October 2000.
[6] C. Flanagan, K. R. M. Leino, M. Lillibridge, G. Nelson, J. B. Saxe, and R. Stata. Extended static checking for Java. In Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), pages 234–245. ACM, 2002.
[7] L. Hatton. Safer C: Developing Software in High-integrity and Safety-critical Systems. McGraw-Hill, New York, 1995.
[8] L. Hatton. Safer language subsets: an overview and a case history, MISRA C. Information & Software Technology, 46(7):465–472, 2004.
[9] L. Hatton. Language subsetting in an industrial context: A comparison of MISRA C 1998 and MISRA C 2004. Information & Software Technology, 49(5):475–482, 2007.
[10] S. C. Johnson. Lint, a C program checker. In Unix Programmer’s Manual, volume 2A, chapter 15, pages 292–303. Bell Laboratories, 1978.
[11] S. Kim and M. D. Ernst. Which warnings should I fix first? In Proc. 6th Joint Meeting of the European Software Engineering Conf. and the ACM SIGSOFT Intnl. Symp. on Foundations of Software Engineering, pages 45–54. ACM, 2007.
[12] S. Kim, T. Zimmermann, E. James Whitehead Jr., and A. Zeller. Predicting Faults from Cached History. In Proc. 29th Intnl. Conf. on Software Engineering (ICSE), pages 489–498. IEEE, 2007.
[13] S. Kim, T. Zimmermann, K. Pan, and E. James Whitehead Jr. Automatic Identification of Bug-Introducing Changes. In Proc. 21st IEEE/ACM Intnl. Conf. on Automated Software Engineering (ASE), pages 81–90. IEEE, 2006.
[14] T. Kremenek, K. Ashcraft, J. Yang, and D. R. Engler. Correlation Exploitation in Error Ranking. In Proc. 12th ACM SIGSOFT Intnl. Symp. on Foundations of Software Engineering (FSE), pages 83–93. ACM, 2004.
[15] Z. Li, L. Tan, X. Wang, S. Lu, Y. Zhou, and C. Zhai. Have things changed now?: an empirical study of bug characteristics in modern open source software. In Proc. 1st Workshop on Architectural and System Support for Improving Software Dependability (ASID), pages 25–33. ACM, 2006.
[16] Sun Microsystems. Code Conventions for the Java Programming Language, April 1999.
[17] Motor Industry Software Reliability Association (MISRA). Guidelines for the Use of the C Language in Critical Systems, October 2004.
[18] S. Kim, K. Pan, and E. James Whitehead Jr. Memories of bug fixes. In Proc. 14th ACM SIGSOFT Intnl. Symp. on Foundations of Software Engineering (FSE), pages 35–45. ACM, 2006.
[19] J. Sliwerski, T. Zimmermann, and A. Zeller. When do changes induce fixes? In Proc. Intnl. Workshop on Mining Software Repositories (MSR). ACM, 2005.
[20] J. Spacco, D. Hovemeyer, and W. Pugh. Tracking defect warnings across versions. In Proc. Intnl. Workshop on Mining Software Repositories (MSR), pages 133–136. ACM, 2006.
[21] C. Weiß, R. Premraj, T. Zimmermann, and A. Zeller. How Long Will It Take to Fix This Bug? In Proc. 4th Intnl. Workshop on Mining Software Repositories (MSR), page 1. IEEE, 2007.
[22] C. C. Williams and J. K. Hollingsworth. Automatic Mining of Source Code Repositories to Improve Bug Finding Techniques. IEEE Trans. Softw. Eng., 31(6):466–480, 2005.