3 - Assessing The Value of Coding Standards
To assess the relation between violations and faults, we take two approaches. The first is a more high-level approach, studying the co-evolution of violations and faults over time and testing for the presence of a correlation. The second approach looks at individual violations in more detail, tracking them over time and seeing how often they correctly signal actual problem locations, giving us a true positive rate for every rule. We will discuss both methods in more detail, after which we describe how to obtain the necessary measures.

2.1. Temporal coincidence

To quantify the relation between rule violations and actual faults, we look at the software history present in a Software Configuration Management (SCM) system.

2.3. Measurement approach

To gather the measurements for a large software history we need an automated measurement approach. This raises a number of challenges, and for each measure we will discuss how we addressed them. Summarizing from the previous sections, the measures needed in this investigation are the number of violations per version, the number of open (i.e., unsolved) PRs per version, and the true positive rate for all rules. The number of open PRs per version is probably the easiest, as we can simply use the status field of the PR to determine whether it was unsolved at the moment the version was released or built. Determining the number of violations per version requires a bit more work, and comprises
the following steps, sketched in code below:

• Retrieving a full version from the SCM. Since we will run a static analyzer on the retrieved source code, this needs to be a proper, compilable set of source files, as such tools often mimic the build environment in order to check compliance.

• Extracting the configuration information necessary to run the compliance checker. This kind of information is usually present in Makefiles or project files (e.g., Visual Studio solution files).

• Running the compliance checker using the extracted configuration information, saving the rule violations and their locations (file name, line number) within the source archive.
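As an illustration, the three steps could be strung together as follows. This is a minimal sketch, not the tooling used in the study: the "scm" and "checker" commands, their flags, and the report format are hypothetical placeholders.

    import subprocess

    def checkout_version(repo, version, workdir):
        # Step 1: retrieve a full, compilable snapshot from the SCM.
        # "scm" is a placeholder for the actual version-control client.
        subprocess.run(["scm", "checkout", "-r", version, repo, workdir], check=True)

    def extract_build_config(workdir):
        # Step 2: recover include paths and macro definitions from the build
        # files (Makefiles, project files); the flags below are illustrative.
        return ["-Iinclude", "-DPLATFORM=phone"]

    def run_checker(workdir, flags):
        # Step 3: run the compliance checker with the extracted configuration
        # and parse its report into (file, line, rule) violation triplets.
        report = subprocess.run(["checker", *flags, workdir],
                                capture_output=True, text=True, check=True)
        for entry in report.stdout.splitlines():
            path, line_no, rule = entry.split(":")[:3]
            yield path, int(line_no), rule

    def violations_for_version(repo, version, workdir):
        checkout_version(repo, version, workdir)
        return list(run_checker(workdir, extract_build_config(workdir)))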
To determine the true positive rates we use a process similar to the one used in the warning prioritization algorithm by Kim and Ernst [11]. However, whereas they track bug-fix lines throughout the history, we track the violations themselves. Tracking violations over different versions, although not too difficult, is not widely addressed; we are only aware of the work by Spacco et al. [20].

In our approach, we build a complete version graph for every file in the project using the predefined successor and predecessor relations present in the SCM. The version graph allows us to accurately model branches and merges of the files. For every edge in the graph, i.e., a file version and one of its successors, the diff is computed. These differences are expressed as an annotation graph, which represents a mapping of lines between file versions.
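For instance, one edge of such an annotation graph could be approximated from a plain textual diff. A sketch using Python's difflib (the study derived its differences through the SCM, so this is only an approximation of that step):

    import difflib

    def line_map(old_lines, new_lines):
        # Map 1-based line numbers of the old file version to their location
        # in the new version; changed or removed lines get no entry.
        mapping = {}
        matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines)
        for block in matcher.get_matching_blocks():
            for offset in range(block.size):
                mapping[block.a + offset + 1] = block.b + offset + 1
        return mapping

A violation on a line that has no entry in this mapping corresponds to the "changed or removed" case handled next.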
Violations are uniquely identified by triplets containing the file name, line number and rule identifier. Using the version graph and the annotation graph associated with every edge, violations can be tracked over the different versions. As long as a violation remains on a line which has a destination in the annotation graph (i.e., is present in the next file version), no action is taken. When a line on which a violation has been flagged is changed or removed, the PR database is consulted to see if the modification was fix-related. If it is, the score for the rule is incremented; in both cases, the total incidence count for the current rule is incremented. Violations introduced by the modification are considered to constitute a new potential fault, and are tracked similarly to the existing ones. In the last version present in the graph, the number of remaining violations is added to the incidence count (on a per-rule basis), as these can all be considered false positives.
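The bookkeeping itself could then look roughly as follows; a sketch under simplifying assumptions (a single linear history rather than a full version graph, and hypothetical callbacks for line survival and PR lookup):

    from collections import Counter

    def rule_true_positive_rates(steps, final_violations):
        # steps: iterable of (violations, survives, fix_related) per version
        # transition, where 'violations' are the (file, line, rule) triplets of
        # the older version, 'survives(file, line)' follows the annotation
        # graph, and 'fix_related(file, line)' consults the PR database.
        score = Counter()      # per rule: violations removed by fix-changes
        incidence = Counter()  # per rule: total incidence count
        for violations, survives, fix_related in steps:
            for path, line, rule in violations:
                if survives(path, line):
                    continue                 # still present: keep tracking it
                incidence[rule] += 1         # this violation ends here
                if fix_related(path, line):
                    score[rule] += 1         # removed in a fix-change
        for path, line, rule in final_violations:
            incidence[rule] += 1             # survivors count as false positives
        return {rule: score[rule] / incidence[rule] for rule in incidence}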
3. Case Description

To the best of our knowledge, there is no previous complete empirical data from a longitudinal study on the relation between violations and faults. As a result, we do not even know whether assessing this relation is feasible. We therefore decided to use one typical industrial case as a pilot study. NXP (formerly Philips Semiconductors), our industrial partner in the TRADER project (www.esi.nl/trader), was willing to provide us with tooling and access to its TV on Mobile project.

3.1. TV on Mobile

This project (TVoM for short) was selected because of its relatively high-quality version history, featuring the well-defined link between PRs and source code modifications we need for our spatial coincidence method. It represents a typical embedded software project, consisting of the driver software for a small SD-card-like device. When inserted into the memory slot of a phone or PDA, this device enables one to receive and play video streams broadcast using the Digital Video Broadcast (DVB) standard.

The complete source tree contains 148 KLoC of C code, 93 KLoC of C++, and approximately 23 KLoC of configuration items in Perl and shell script (all reported numbers are non-commented lines of code). The real area of interest is the C code of the actual driver, which totals approximately 91 KLoC. Although the policy of the project was to minimize compiler warnings, no coding standard or code inspection tool was used. This allows us to actually relate rule violations to fault-fixing changes; had the developers conformed to the standard we are trying to assess, they would probably have removed all these violations right away.

3.2. MISRA-C: 2004

The first MISRA standard was defined by a consortium of UK-based automotive companies (the Motor Industry Software Reliability Association) in 1998. Acknowledging the widespread use of C in safety-related systems, the intent was to promote the use of a safe subset of C, given the unsafe nature of some of its constructs [7]. The standard became quite popular, and was also widely adopted outside the automotive industry. In 2004 a revised version was published, attempting to prune unnecessary rules and to strengthen existing ones. However, Hatton [9] argued that even these modifications could not prevent the standard from having many false positives among reported violations, so "a programmer will not be able to see the wood for the trees". Currently, NXP is also introducing MISRA-C: 2004 as the standard of choice for new projects. Given the wide adoption of the MISRA standard, an assessment of its rules remains a relevant topic. Copyright laws prevent us from quoting the MISRA rules themselves; however, a discussion of the content of these rules is beyond the scope of this paper.
Figure 1. Number of faults and violations over time (dots: # faults, left axis; triangles: # violations, right axis).

Figure 2. All MISRA violations versus open PRs (y-axis: fault density).
Figure 3. Aggregated results per rule class, with fitted regression lines (y-axis: fault density).
In summary, the correlation for MISRA as a whole looks negative rather than positive.

4.2. Analysis of individual rules

Now that we have seen the correlation for MISRA as a whole, we will look more closely at the influence of individual rules. To do this, we create graphs similar to Figure 2, but using only the violations of a single rule. These are visually inspected to determine whether there exists a correlation between the two measures. If the distribution is acceptable, we classify the rule as either 'positive' or 'negative'; if there are too many outliers, it is classified as 'none'. The resulting classification for all rules can be observed in columns 2-4 of Table 1. This table does not include all rules in the MISRA standard, as we could only include those for which we have observed violations (in total 72 out of 141 rules).

In addition, we use a least squares linear regression model to summarize the characteristics of the correlation. Columns 5-7 display the properties of this model: the linear coefficient, or the slope of the fitted line; the percentage of variation explained by the model, or the square of the correlation coefficient R; and the statistical significance, expressed as the probability that the correlation found is simply coincidence. Significance is presented in star notation: 'ns' for p > 0.05, * for p < 0.05, ** for p < 0.01, *** for p < 0.001, and **** for p < 0.0001. In general, we can observe that the linear models for rules in the classes 'positive' and 'negative' explain a large part of the variation; R2 > 0.75.
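Such a summary can be computed with any standard statistics package; the sketch below uses SciPy (our illustration, not the study's tooling), with made-up densities standing in for one rule's per-version measurements:

    from scipy import stats

    # Hypothetical per-version violation and fault densities for one rule.
    violation_density = [0.020, 0.022, 0.025, 0.028, 0.031, 0.035]
    fault_density = [0.40, 0.55, 0.70, 0.85, 1.00, 1.20]

    fit = stats.linregress(violation_density, fault_density)
    r_squared = fit.rvalue ** 2  # variation explained by the model
    stars = ("ns" if fit.pvalue > 0.05 else
             "*" if fit.pvalue > 0.01 else
             "**" if fit.pvalue > 0.001 else
             "***" if fit.pvalue > 0.0001 else "****")
    print(f"coefficient {fit.slope:.2f}, R2 {r_squared:.2f}, significance {stars}")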
From an automation point of view, it may be appealing to simply rely on summary statistics such as the correlation coefficient R. However, this does not guarantee the existence of a true relation, as the input distribution may, for instance, contain too many outliers, something which can easily be ascertained using the aforementioned scatter plots [2]. Due to space limitations, we have not included scatter plots for all individual rules, but have plotted aggregated results for each class in Figure 3. These figures also display the fitted line of the regression model.

Another way in which we verified that the data does not violate the assumptions of the regression was by inspecting residual plots. One example is displayed in Figure 4 (for the 'positive' set). The upper left graph contains the data and the fitted line; the upper right plots residuals versus fitted values, which should be spread around the horizontal axis with no apparent relation. Lower left is a histogram of the residuals, which should follow a normal distribution, and lower right is a normal plot of the residuals, which should lie more or less on the diagonal. In this case, although slightly skewed, the distribution is acceptable.

The two remaining columns in Table 1 display the total number of violations and the true positive rate for each rule. The violations are the number of unique violations over the complete observed history; that is, if exactly the same violation was present in two versions, it was counted only once. The true positive rate has been computed by dividing the number of violations on lines changed or deleted in fix-changes by the total number of violations. The corresponding per-class data are displayed in Table 2 in a manner similar to Table 1.

If we see rule violations as a line-based predictor of faults, the true positive rate can be considered a measure of the prediction accuracy. Consequently, when comparing rules 1.1 and 2.3, this rate suggests the former to be more accurate. However, in a population of 62 violations (for 1.1) it is likely that there are some correct violations simply due to chance, which is unlikely if there are few violations. Therefore, true positive rates cannot be used in their own right to compare rules on accuracy (unless the number of violations is the same).
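Restating the true positive rate defined above as a formula (the notation is ours):

    \[
    TP_r = \frac{\#\{\text{violations of rule } r \text{ on lines changed or deleted in fix-changes}\}}{\#\{\text{violations of rule } r\}}
    \]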
Table 1. Per-rule statistics. Columns: MISRA rule, linear coefficient, variation explained (R2), significance, total violations, true positives (ratio), and prediction performance. (The original table additionally marks each rule as positive/none/negative in columns 2-4; those checkmarks are not recoverable here.)

    Rule    Coef.    R2    Sig.  Viol.  TP    Perf. |  Rule    Coef.    R2    Sig.  Viol.  TP    Perf.
    1.1     -1.07    0.86  ****  62     0.08  0.04  |  13.1    -0.39    0.85  ****  121    0.01  0.00
    1.2      3.92    0.77  ****  26     0.23  0.86  |  13.2    -0.09    0.85  ****  781    0.06  0.00
    2.3     53.13    0.86  ****  1      0.00  0.83  |  13.7    -2.51    0.84  ****  26     0.12  0.33
    3.1     12.80    0.46  ****  11     0.00  0.13  |  14.1    -0.11    0.02  *     191    0.04  0.00
    3.4      0.98    0.17  ****  15     0.00  0.06  |  14.2     0.27    0.89  ****  260    0.30  1.00
    5.1    -19.43    0.85  ****  2      0.00  0.69  |  14.3    -0.11    0.85  ****  818    0.01  0.00
    5.2     -9.10    0.85  ****  10     0.00  0.16  |  14.5   -15.28    0.84  ****  7      0.00  0.27
    5.3     21.10    0.83  ****  2      0.50  0.97  |  14.6    -0.74    0.82  ****  70     0.01  0.00
    5.6     -0.56    0.57  ****  278    0.08  0.00  |  14.7    -0.35    0.79  ****  427    0.03  0.00
    6.1    -19.43    0.85  ****  2      0.00  0.69  |  14.8    -0.17    0.85  ****  282    0.02  0.00
    6.2     -2.43    0.85  ****  16     0.00  0.05  |  14.10    4.58    0.21  ****  53     0.09  0.09
    6.3      0.06    0.82  ****  1675   0.10  0.00  |  15.2   -11.91    0.84  ****  3      0.00  0.57
    8.1      1.69    0.79  ****  52     0.35  1.00  |  15.3    -6.29    0.87  ****  15     0.00  0.06
    8.4     13.85    0.24  ****  1      0.00  0.83  |  15.4   -14.22    0.87  ****  7      0.00  0.27
    8.5     18.41    0.42  ****  2      1.00  1.00  |  16.1   -13.26    0.27  ****  15     0.00  0.06
    8.7     -7.97    0.82  ****  46     0.30  0.99  |  16.4     1.04    0.03  *     64     0.12  0.22
    8.8      0.51    0.84  ****  115    0.10  0.03  |  16.5     2.50    0.82  ****  20     0.15  0.55
    8.10     0.83    0.80  ****  90     0.01  0.00  |  16.7    -0.69    0.84  ****  143    0.08  0.00
    8.11    -4.27    0.08  ****  4      0.25  0.86  |  16.8    11.13    0.84  ****  2      0.00  0.69
    9.1      0.80    0.06  ***   15     0.27  0.90  |  16.9    53.13    0.86  ****  2      0.50  0.97
    9.2     -3.90    0.01  ns    2      1.00  1.00  |  16.10   -0.10    0.85  ****  1845   0.06  0.00
    9.3     -4.55    0.85  ****  12     0.00  0.11  |  17.4    -0.39    0.78  ****  205    0.09  0.00
    10.1     0.36    0.79  ****  490    0.13  0.02  |  17.6   -38.86    0.85  ****  1      1.00  1.00
    10.6    -0.13    0.85  ****  378    0.02  0.00  |  18.4     3.10    0.79  ****  19     0.05  0.14
    11.1     3.08    0.39  ****  38     0.26  0.95  |  19.4    -3.05    0.32  ****  37     0.05  0.04
    11.3    -1.92    0.86  ****  108    0.06  0.00  |  19.5    10.63    0.86  ****  5      0.00  0.39
    11.4    -0.43    0.37  ****  151    0.06  0.00  |  19.6    10.63    0.86  ****  5      0.00  0.39
    11.5     0.59    0.11  ****  26     0.08  0.16  |  19.7    -5.03    0.81  ****  37     0.03  0.01
    12.1     0.33    0.85  ****  151    0.08  0.00  |  19.10   -1.83    0.09  ****  37     0.00  0.00
    12.4    -2.07    0.16  ****  19     0.11  0.35  |  19.11    8.03    0.79  ****  5      0.00  0.39
    12.5    -0.71    0.75  ****  110    0.05  0.00  |  20.2    -3.56    0.57  ****  22     0.00  0.02
    12.6    -4.72    0.84  ****  11     0.00  0.13  |  20.4    -1.23    0.06  ***   39     0.00  0.00
    12.7     1.40    0.87  ****  91     0.24  0.97  |  20.9     5.92    0.57  ****  16     0.00  0.05
    12.8     1.21    0.83  ****  68     0.09  0.04  |  20.10   13.85    0.24  ****  1      0.00  0.83
    12.10  -38.86    0.85  ****  1      0.00  0.83  |  20.12   -8.40    0.44  ****  16     0.19  0.72
    12.13    5.25    0.79  ****  37     0.27  0.96  |  21.1    -2.80    0.84  ****  25     0.08  0.18
We can assess the significance of the true positive rate by comparing it to a random predictor. This predictor randomly selects lines out of a population of over 300K unique lines in the TVoM history, of which 17% were involved in a fix-change. Since the population is so large, and the number of violations per rule relatively small (max. 0.5% of the population), we can model this as a Bernoulli process with p = 0.17. The number of successful attempts, or correctly predicted lines, has a binomial distribution; using the cumulative distribution function (CDF) we can indicate how likely a rule is to outperform the random predictor in terms of accuracy. This value is displayed in the last column of the table, and can be compared across rules. Note that none of the three classes in Table 2 has an average true positive rate higher than the expected value of the random predictor (0.17). As a result, the value of the CDF for all three classes is near-zero.
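This computation can be sketched as follows, assuming SciPy; rule 8.1 from Table 1 (52 violations, true positive ratio 0.35) serves as the example, and evaluating the CDF at k-1 as the notion of "outperforming" is our assumption:

    from scipy.stats import binom

    P_FIX = 0.17  # fraction of the ~300K unique lines involved in a fix-change

    def performance(total_violations, correct_violations):
        # Probability that a random predictor drawing 'total_violations' lines
        # hits fewer fix-related lines than the rule did: binomial CDF at k-1.
        return binom.cdf(correct_violations - 1, total_violations, P_FIX)

    # Rule 8.1 in Table 1: 52 violations, true positive ratio 0.35.
    print(round(performance(52, round(0.35 * 52)), 2))  # close to 1.00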
5. Evaluation

In this section, we will discuss the meaning of the results, as well as the applicability and limitations of the methods used to obtain them.
Table 2. Aggregated statistics per class. Columns: rule class (number of rules), linear coefficient, variation explained (R2), significance, total violations, average violations, true positives (ratio).

    Rule class      Coef.   R2     Sig.   Total viol.  Avg. viol.  TP (ratio)
    Positive (18)    0.03   0.87   ****   3092         171.78      0.13
    None (25)       -0.10   0.15   ****   1077          43.08      0.07
    Negative (29)   -0.02   0.85   ****   5571         192.10      0.05

Figure 4. Residual plots for the positive class. Panels: the data with fitted line y = −0.02x + 1.85 (upper left); residuals vs. fitted (upper right); histogram of the residuals (lower left); normal plot of the residuals (lower right).

5.1. Meaning of the correlations
The first important qualification of our results is the fact that they are correlations, and therefore not proof of causality.
However, the positive and negative relations observed explain a large amount of the variation (R2 > 0.75) and are statistically significant. Then what do they mean?

A positive correlation is rather intuitive: we observe an increasing violation density during the first phase of the project, where most of the faults are introduced. An increase in violation density means that the areas developers have been working on violate the rule relatively often. The decrease of violation density in the second phase of the project signals that many violations were removed at a time when there was a focus on fixing faults. Possibly, some violations were introduced in unrelated areas during the first phase, but seeing them removed in the second phase strengthens the intuition that they were indeed related to a fault.

Similar to rules with positive correlations, rules with negative correlations experience a relatively high rate of injection, not just in the first phase, but also in the second phase. For a negative correlation to occur, the violation density for the rule must keep increasing, even when the fault density decreases. Since the changes are relatively small compared to the complete codebase (on average only adding 9 lines), injection of a single violation without removal somewhere else already results in an increased density. In other words, violations of these rules are seldom removed: apparently, the areas they reside in were not considered relevant to fixing faults.

Intuitively, the magnitude of the coefficient of the linear model should be seen as a quantification of the ratio of true positives. When looking closer, however, we see that the coefficient is large (either positive or negative) in the case of a rule with few violations, as a small number of violations is compared to a greater number of faults. This small number of violations makes it more difficult to distinguish a relation, which is why many of those rules have been classified as 'none' in Table 1. This actually explains all but rules 5.6, 11.4 and 14.1 in this category, for which we did observe a large number of violations. Re-examination of the correlation plots revealed distributions of points that suggest a correlation, except for a cluster of close outliers. This was the result of files being moved in and out of the main branch of development, these files containing a relatively high number of violations of the particular rule. Although we try to compensate for this by using densities, in some cases the influence of one file can still distort an otherwise strong correlation.

5.2. Correlations and true positive rates

Overall, the true positive rates computed by also requiring spatial coincidence partially match the partitioning of the rules into classes using only temporal coincidence. This can be seen in Table 2, where the 'positive' class shows a higher true positive rate than both others, and 'none', in turn, higher than 'negative'. However, there are some individual rules whose inconsistency between class and measured true positive rate warrants further explanation. Two categories of interest can be distinguished:

Class positive, yet zero true positive rate. This category (e.g., rule 3.1) illustrates the extra precision gained by also requiring spatial coincidence. These rules have violations removed at the same time faults were removed from the software, and as a result show a positive correlation. Spatial coincidence shows that, as these removals were not due to a fix-change, they are not likely to be related to the fault removal.
Class negative, yet non-zero true positive rate. This category encompasses 18 rules out of the negative class, which itself contains a total of 29. Although at first glance having both a negative correlation and violations removed during fixes may seem paradoxical, this can be understood by looking at the rate of removal as well as the rate of injection. Despite continuous removal (in fixes as well as non-fix-related changes), the violation density keeps increasing, as more violations are being injected by all those changes. Consequently, the measured true positive rates also depend on the point in time at which they are measured (e.g., beginning, halfway, or end of the project).

The true positive rates indicate that 15 out of 72 rules outperform a random predictor with respect to selecting fault-related lines. If we only select rules whose true positive rate is significantly different (α = 0.05) from the expected value (0.17), 12 rules remain (those where the performance value ≥ 0.95). This means that the results of the two approaches agree on only 7 out of the 18 rules in class 'positive'. Since we can consider the true positive rates a more accurate characterization, it seems unwise to use the correlations as the only means to assess the relation between violations and faults.

However, as we mentioned before, the change in violations over time remains relevant. For instance, in the TVoM project, the most critical faults were removed before the project was suspended, but during maintenance more faults might have been found, altering the measured true positive rate. In addition, the character of the software written may change over time (e.g., from more development in the hardware layer to the application layer), which could also affect the number and type of violations in the software.

Finally, another criterion worth mentioning for assessing the true positive rates is the rate at which faults are injected when violations are fixed. A non-zero injection probability already invalidates the use of at least the 25 out of 72 rules that have a zero true positive rate. Although the injection rate is not known for this project, in Adams' study this probability was 0.15 [1]. That value would even invalidate 57 out of 72 MISRA rules. This lends credence to the possibility that adherence to the complete MISRA standard would have had a negative impact on the number of faults in this project.

5.3. Threats to validity

Although some of these issues have been mentioned elsewhere in the text, we summarize and discuss them here for clarity's sake.

Internal validity. There is one measure in our case study that may suffer from inaccuracies. The fault density for a given version has been measured using the number of open PRs for that version. Since a PR may only be submitted some time after the fault was actually introduced, this results in a 'lagging' measured fault density compared to the actual fault density. The number of violations is measured on the source that contains this 'hidden' fault, and as a result, the correlation between violation density and fault density may be slightly underestimated.

In addition, the correlations as used in this paper are sensitive to changes in the set of files between the different versions. These changes occur because, in case of a failed daily build, the culprit module (i.e., a set of files) is excluded from the build until repaired, to ensure a working version of the project. Those changes result in the different clusters of points in the scatter plots in Section 4. To see why, consider the following example. Suppose that for a certain rule r, most of the violations are contained in a small subset Sr of all the files in the project. Even if there were a perfect linear relationship between violations of r and the number of faults, this would not be visible in a scatter plot if files in Sr are alternately included in and excluded from subsequent versions of the project. Two clusters of versions would appear; in one, there is no relation, while the other suggests a linear relation. Clearly, this distorting effect only increases as more rules and versions are considered. In future investigations we intend to address this issue by looking at official internal releases instead of daily build versions. This will minimize changes in the composition of files because of the aforementioned build problems. In addition, we intend to split the analysis between phases of the project that are distinctly different with respect to the composition of files.

The spatial coincidence method produces conservative estimates, for two reasons. First, underestimation might occur with bug fixes that introduce new code or modify spatially unrelated code. Assume that the static analysis signals that some statement, requiring a check on input data, could be reached without performing such a check. Adding the check to solve the problem would then only introduce new code, unrelated to the existing violation. Violations addressing multiple lines are unlikely to occur with generic coding standards, which typically contain rules at the level of a single expression or statement (especially ones based on the safer-subset paradigm, such as MISRA). Therefore, the impact of this problem remains small, as there are usually no violations to link the additional code to. The second reason is the way we handle violations remaining at the end of the project. Those violations might point to faults that simply have not been found yet. But since we do not know this, we take a conservative approach and assume they do not. The influence of such dormant faults is somewhat mitigated in the case of a long-running project, where most of the faults will have been found. Moreover, we can expect that in short-term projects at least the most severe faults have been solved, so the relation between violations and the most critical problem areas in the code can still be assessed.
External validity. Generalizing the measured correlations and true positive rates for the TVoM project to arbitrary other projects may not be easy. In fact, our investigation was partly inspired by the intuition that such relations would differ from project to project. Still, it may be possible to generalize results to other projects within NXP, since (1) all projects use a similar infrastructure and development process; (2) developers are circulated across projects, which lessens the impact of individual styles; and (3) all projects concern typical embedded C development, likely to suffer from some common issues. The only problem in this respect may be that the platform for which the embedded software is developed requires idiosyncratic implementation strategies, violating some coding rules in ways known to be harmless.

Although the approaches used in this paper are generic in nature, some specific (technical) challenges need to be faced when applying them to an arbitrary project. First of all, as mentioned before, linking known faults and the source modifications made to fix them requires a rich data set. Although many studies have lately exploited such a link successfully [15, 19, 13, 21, 12, 22, 18], we found that in our industrial setting most project histories did not contain the necessary information. This limits the applicability of the spatial coincidence approach. Second, in order to accurately measure the number of rule violations and known faults, it is important to precisely define which source code entities are going to be part of the measurements. Source files for different builds may be interleaved in the source tree, so selection may not always be a trivial matter. For instance, some source files can be slightly altered when built for different platforms, which may influence the number of measured violations. In addition, some faults may be present in the program when built for one platform but not for others, so the selection may also influence the measured number of faults. Finally, the inspection tool used to detect violations can also influence results, as it might produce false positives, i.e., signal a violation when in fact the corresponding code complies with the rule. Unfortunately, some inaccuracy in (complex) static analysis is unavoidable, and its extent may differ from one implementation to the next.

6. Related Work

In recent years, many studies have appeared that take advantage of the data present in SCM systems and bug databases. These studies exhibit a wide array of applications: examination of bug characteristics [15], automatic identification of bug-introducing changes [19, 13], bug-solving effort estimation [21], prediction of fault-prone locations in the source [12], and identification of project-specific bug patterns for use in static bug detection tools [22, 18].

Of all those applications, closest to our work is the history-based warning prioritization approach by Kim and Ernst [11]. This approach seeks to improve the ranking mechanism for warnings produced by static bug detection tools. To that end, it observes which (classes of) warnings were removed during bug-fix changes, and gives the classes with the highest removal rate (i.e., the highest true positive rate) the highest priority. The approach was evaluated using different Java bug detection tools on a number of open-source Java projects. Although they do use the version history as input to the algorithm, the empirical data reported (true positive rates) covers only the warnings of a single version of every project. In addition, there is a potential issue with using true positive rates to compare rules on accuracy when the numbers of warnings for those rules differ (as discussed in Section 4.2). Another difference with our study is the application domain: we assess a coding standard for embedded C development on an industrial case.

While we report on data obtained from a longitudinal study of a single project, Basalaj [3] uses versions from 18 different projects at a single point in time. He computes two rankings of the projects, one based on warnings generated by QA C++, and one based on known fault data. For certain warnings, a positive rank correlation between the two can be observed. Unfortunately, the paper highlights 12 positively correlated warning types, and ignores the negatively correlated ones (reportedly, nearly 900 rules were used). Apart from these two studies, we are not aware of any other work that reports on measured relations between coding rules and actual faults.

In contrast with the many studies using software histories, few exist that assess coding standards. The idea of a safer subset of a programming language, the precept on which the MISRA coding standard is based, was promoted by Hatton [7]. In [8] he assesses a number of coding standards, introducing the signal-to-noise ratio for coding standards, based on the difference between measured violation rates and known average fault rates. The assessment of the MISRA standard was repeated in [9], where it was argued that the update was no real improvement over the original standard, and that "both versions of the MISRA C standard are too noisy to be of any real use". The methods we introduce in this paper can be used to specialize a coding standard for a certain project, so as to make use of its rules in the best way possible. In addition, they can be used to build a body of empirical data to assess coding standards in a more general sense, and the data presented here are a first step towards that goal.

7. Conclusions

The contributions of this paper are (1) a description and comparison of two approaches to quantify the relation between coding rule violations and faults; and (2) empirical data on this relation for the MISRA standard in the context of an industrial case.
From the data obtained, we can make the following key observations. First, there are 12 out of 72 rules for which violations were observed that perform significantly better (α = 0.05) than a random predictor at locating fault-related lines. The true positive rates for these rules range from 23-100%. Second, we observed a negative correlation between MISRA rule violations and observed faults. In addition, 25 out of 72 rules had a zero true positive rate. Taken together with Adams' observation that all modifications have a non-zero probability of introducing a fault [1], this makes it possible that adherence to the MISRA standard as a whole would have made the software less reliable. This observation is consistent with Hatton's earlier assessment of the MISRA C 2004 standard [9].

These two observations emphasize the fact that it is important to select accurate and applicable rules. Selection of the rules that are most likely to contribute to an increase in reliability maximizes the benefit of adherence while decreasing the necessary effort. Moreover, empirical evidence can give substance to the arguments of advocates of coding standards, making adoption of a standard in an organization easier. However, correlations and true positive rates as observed in this study may differ from one project to the next. To increase confidence in our results, and to investigate whether we can distinguish a consistent subset of MISRA rules positively correlated with actual faults, we intend to repeat this study for a number of projects. In addition, we intend to address some of the issues encountered when using correlations, as discussed in Section 5.

Acknowledgements. The authors wish to thank the people of NXP for their support in this investigation and the anonymous reviewers for their valuable feedback.

References

[1] E. N. Adams. Optimizing preventive service of software products. IBM Journal of Research and Development, 28(1):2-14, 1984.

[2] F. J. Anscombe. Graphs in statistical analysis. The American Statistician, 27(1):17-21, 1973.

[3] W. Basalaj. Correlation between coding standards compliance and software quality. White paper, Programming Research Ltd., 2006.

[4] C. Boogerd and L. Moonen. Prioritizing software inspection results using static profiling. In Proc. 6th IEEE Intnl. Workshop on Source Code Analysis and Manipulation (SCAM), pages 149-158. IEEE, 2006.

[5] D. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In Proc. 4th Symp. on Operating Systems Design and Implementation (OSDI), pages 1-16, October 2000.

[6] C. Flanagan, K. R. M. Leino, M. Lillibridge, G. Nelson, J. B. Saxe, and R. Stata. Extended static checking for Java. In Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), pages 234-245. ACM, 2002.

[7] L. Hatton. Safer C: Developing Software in High-integrity and Safety-critical Systems. McGraw-Hill, New York, 1995.

[8] L. Hatton. Safer language subsets: an overview and a case history, MISRA C. Information & Software Technology, 46(7):465-472, 2004.

[9] L. Hatton. Language subsetting in an industrial context: A comparison of MISRA C 1998 and MISRA C 2004. Information & Software Technology, 49(5):475-482, 2007.

[10] S. C. Johnson. Lint, a C program checker. In Unix Programmer's Manual, volume 2A, chapter 15, pages 292-303. Bell Laboratories, 1978.

[11] S. Kim and M. D. Ernst. Which warnings should I fix first? In Proc. 6th Joint Meeting of the European Software Engineering Conf. and the ACM SIGSOFT Intnl. Symp. on Foundations of Software Engineering (ESEC/FSE), pages 45-54. ACM, 2007.

[12] S. Kim, T. Zimmermann, E. James Whitehead Jr., and A. Zeller. Predicting faults from cached history. In Proc. 29th Intnl. Conf. on Software Engineering (ICSE), pages 489-498. IEEE, 2007.

[13] S. Kim, T. Zimmermann, K. Pan, and E. James Whitehead Jr. Automatic identification of bug-introducing changes. In Proc. 21st IEEE/ACM Intnl. Conf. on Automated Software Engineering (ASE), pages 81-90. IEEE, 2006.

[14] T. Kremenek, K. Ashcraft, J. Yang, and D. R. Engler. Correlation exploitation in error ranking. In Proc. 12th ACM SIGSOFT Intnl. Symp. on Foundations of Software Engineering (FSE), pages 83-93. ACM, 2004.

[15] Z. Li, L. Tan, X. Wang, S. Lu, Y. Zhou, and C. Zhai. Have things changed now? An empirical study of bug characteristics in modern open source software. In Proc. 1st Workshop on Architectural and System Support for Improving Software Dependability (ASID), pages 25-33. ACM, 2006.

[16] Sun Microsystems. Code Conventions for the Java Programming Language, April 1999.

[17] Motor Industry Software Reliability Association (MISRA). Guidelines for the Use of the C Language in Critical Systems, October 2004.

[18] S. Kim, K. Pan, and E. James Whitehead Jr. Memories of bug fixes. In Proc. 14th ACM SIGSOFT Intnl. Symp. on Foundations of Software Engineering (FSE), pages 35-45. ACM, 2006.

[19] J. Sliwerski, T. Zimmermann, and A. Zeller. When do changes induce fixes? In Proc. Intnl. Workshop on Mining Software Repositories (MSR). ACM, 2005.

[20] J. Spacco, D. Hovemeyer, and W. Pugh. Tracking defect warnings across versions. In Proc. Intnl. Workshop on Mining Software Repositories (MSR), pages 133-136. ACM, 2006.

[21] C. Weiß, R. Premraj, T. Zimmermann, and A. Zeller. How long will it take to fix this bug? In Proc. 4th Intnl. Workshop on Mining Software Repositories (MSR), page 1. IEEE, 2007.

[22] C. C. Williams and J. K. Hollingsworth. Automatic mining of source code repositories to improve bug finding techniques. IEEE Trans. Softw. Eng., 31(6):466-480, 2005.