Software Reliability and Security
Yashwant K. Malaiya
Computer Science Department, Colorado State University, Fort Collins, Colorado, U.S.A.
A change in the requirements in the later phases can cause increased defect density.

2. Design: In this phase, the system is specified as an interconnection of units, such that each unit is well defined and can be developed and tested independently. The design is reviewed to recognize errors.

3. Coding: In this phase, the actual program for each unit is written, generally in a higher-level language such as Java or C++. Occasionally, assembly level implementation may be required for high performance or for implementing input/output operations. The code is inspected by analyzing the code (or specification) in a team meeting to identify errors.

4. Testing: This phase is a critical part of the quest for high reliability and can take 30-60% of the entire development time. It is often divided into the following separate phases.

a. Unit test: In this phase of testing, each unit is separately tested, and changes are done to remove the defects found. As each unit is relatively small and can be tested independently, it can be exercised much more thoroughly than a large program.

b. Integration testing: During integration, the units are gradually assembled and partially assembled subsystems are tested. Testing subsystems allows the interface among modules to be tested. By incrementally adding units to a subsystem, the unit responsible for a failure can be identified more easily.

c. System testing: The system as a whole is exercised during system testing. Debugging is continued until some exit criterion is satisfied. The objective of this phase is to find defects as fast as possible. In general, the input mix may not represent what would be encountered during actual operation.

d. Acceptance testing: The purpose of this test phase is to assess the system reliability and performance in the operational environment. This requires collecting (or estimating) information on how the actual users would use the system. This is also called alpha-testing. This is often followed by beta-testing, which involves use of the beta-version by the actual users.

5. Operational use and maintenance: Once the software developer has determined that an appropriate reliability criterion is satisfied, the software is released. Any bugs reported by the users are recorded but are not fixed until the next patch or bug-fix. In case a defect discovered represents a security vulnerability, a patch for it needs to be released as soon as possible. The time taken to develop a patch after a vulnerability discovery, and the delayed application of an available patch, contribute to the security risks. When significant additions or modifications are made to an existing version, regression testing is done on the new or "build" version to ensure that it still works and has not "regressed" to lower reliability. Support for an older version of a software product needs to be offered until newer versions have made a prior version relatively obsolete.

It should be noted that the exact definition of a test phase and its exit criterion may vary from organization to organization. When a project goes through incremental refinements (as in the extreme programming approach), there may be many cycles of requirements-design-code-test phases.

Table 1 shows the typical fraction of total defects introduced and found during a phase.[3,4] Most defects occur during the design and coding phases. The fraction of defects found during the system test is small, but that may be misleading. The system test phase can take a long time because the defects remaining are much harder to find. It has been observed that the testing phases can account for 30-60% of the entire development effort.

Table 1 Defects introduced and found during different phases

Phase | Defects introduced (%) | Defects found (%) | Defects remaining (%)
Requirements analysis | 10 | 5 | 5
Design | 35 | 15 | 25
Coding | 45 | 30 | 40
Unit test | 5 | 25 | 20
Integration test | 2 | 12 | 10
System test | 1 | 10 | 1

Two types of testing deserve special attention: integration testing and interoperability testing. Integration testing assumes that unit testing has already been done, and thus the focus is on testing for defects that are associated with interaction among the modules. Exercising a unit module requires use of a driver module, which will call the unit under test. If a unit module does not call other units, it is called a terminal module. If a unit module calls other modules, which are not yet ready to be used, surrogate modules called stubs simulate the interaction.

Integration testing can be bottom-up or top-down. In the bottom-up approach, integration starts with attaching the terminal modules to the modules that call them. This requires the use of drivers to drive the higher-level modules. The top-down integration starts with connecting the highest-level modules with the modules called by them. This requires the use of stubs until finally the terminal modules are integrated.
Integration testing should include passing interface data that represent normal cases (n), special cases (s), and illegal cases (i). For two interacting modules A and B, all combinations of {An, As, Ai} and {Bn, Bs, Bi} should be tested. Thus, if a value represents a normal case for A and a special case for B, the corresponding combination is (An, Bs).
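As a small illustration, the following sketch enumerates the nine interface-data combinations for two interacting modules. The labels follow the notation above; the code itself is purely illustrative and not part of any test tool.

```python
from itertools import product

# Interface-data cases for two interacting modules A and B:
# n = normal case, s = special case, i = illegal case.
cases_a = ["An", "As", "Ai"]
cases_b = ["Bn", "Bs", "Bi"]

# Every pairing of an A-case with a B-case should appear in the
# integration test plan, e.g., (An, Bs) pairs a normal value for A
# with a special value for B.
for combo in product(cases_a, cases_b):
    print(combo)
```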
In some cases, especially for distributed and web-based applications, components of the applications may be developed independently, perhaps using two different languages. Such components are also often updated independently. Interaction of such components needs to be tested to ensure interoperability. Interoperability requires the following.

Mutual compatibility of exchanged data units: The data structures exchanged must use the same format (identical fields and data types). If different formats are used, a suitable translator unit is required.

Mutual compatibility of control signals: In addition to the data units, often control information needs to be exchanged that provides context to the data unit. Exchange of some control signals may sometimes be implicit. Testing must ensure that the control protocol is properly specified (this sometimes requires a formal representation) and control signals are properly exchanged.

Interoperability testing may be considered a more general form of integration testing. Interoperability testing must be performed whenever a component is added or upgraded. Some field-specific interoperability standards have been developed. For example, Z39.50 is an international (ISO 23950) standard defining a protocol for computer-to-computer information retrieval.[5]

SOFTWARE RELIABILITY MEASURES

The classical reliability theory generally deals with hardware. In hardware systems, the reliability decays because of the possibility of permanent failures. However, this is not applicable for software. During testing, the software reliability grows because of debugging and becomes constant once defect removal is stopped. The following are the most common reliability measures used.

1. System availability: Following classical reliability terminology, we can define availability of a system as

A(t) = Pr{system is operational at time t}    (1)

2. Transaction reliability: Sometimes a single-transaction reliability measure, as defined below, is convenient to use.

R = Pr{a single transaction will not encounter a failure}    (2)

Both measures above assume normal operation, i.e., the input mix encountered obeys the operational profile (defined below).

3. Mean-time-to-failure (MTTF): The expected duration between two successive failures.

4. Failure intensity (λ): The expected number of failures per unit time. Note that MTTF is the inverse of failure intensity. Thus, if MTTF is 500 hr, the failure intensity is 1/500 = 0.002 hr⁻¹.

As testing attempts to achieve a high defect-finding rate, the failure intensity during testing, λ_t, is significantly higher than the failure intensity during operation, λ_op. The test acceleration factor A is given by the ratio λ_t/λ_op. Thus, if testing is 12 times more effective in discovering defects than normal use, the test acceleration factor is 12. This factor is controlled by the test selection strategy and the type of application.

5. Defect density: This is usually measured in terms of the number of defects per 1000 source lines of code (KSLOC). It cannot be measured directly, but can be estimated using the growth and static models presented below. The failure intensity is approximately proportional to the defect density. The acceptable defect density for critical or high-volume software can be less than 0.1 defects per KLOC, whereas for other applications 0.5 defects per KLOC is often currently considered acceptable. Sometimes weights are assigned to defects depending on the severity of the failures they can cause. To keep the analysis simple, we assume here that each defect has the same weight.

6. Test coverage measures: Tools are now available that can automatically evaluate how thoroughly a software has been exercised. The following are some of the common coverage measures.

a. Statement coverage: The fraction of all statements actually exercised during testing.

b. Branch coverage: The fraction of all branches that were executed by the tests.

c. P-use coverage: The fraction of all predicate use (p-use) pairs covered during testing. A p-use pair includes two points in the program: a point where the value of a variable is defined or modified, followed by a point where it is used for a branching decision, i.e., a predicate.
The first two are structural coverage measures, while the last is a data-flow coverage measure. As discussed below, test coverage is correlated with the number of defects that will be triggered during testing.[6] Hundred percent statement coverage can often be quite easy to achieve. Sometimes a predetermined branch coverage, say 85% or 90%, may be used as an acceptance criterion for testing; higher levels of branch coverage would require significantly more testing effort.
STATIC ESTIMATION OF DEFECT DENSITY

The defect density D at the beginning of a test phase can be modeled as the product of several factors:

D = C · Fph · Fpt · Fm · Fs    (3)

where the factors are the phase factor Fph, modeling dependence on the software test phase; the programming team factor Fpt, taking into account the capabilities and experience of programmers in the team; the maturity factor Fm, depending on the maturity of the software development process; and the structure factor Fs, depending on the structure of the software under development. The constant of proportionality C represents the defect density per KSLOC. The default value of each factor is 1. We propose the following preliminary submodels for each factor.

Phase Factor Fph

Table 2 presents a simple model using the actual data reported by Musa et al. and the error profile presented by Piwowarski et al. It takes the default value of 1 to represent the beginning of the system test phase.

Table 2 Phase factor Fph

At the beginning of the phase | Multiplier
Unit testing | 4
Subsystem testing | 2.5
System testing | 1 (default)
Operation | 0.35

The Programming Team Factor Fpt

Table 3 gives multipliers based on the average skill level of the programming team.

Table 3 The programming team factor Fpt

Team's average skill level | Multiplier
High | 0.4
Average | 1 (default)
Low | 2.5

The Process Maturity Factor Fm

This factor takes into account the rigor of the software development process in a specific organization. The SEI Capability Maturity Model level can be used to quantify it. Here, we assume level II as the default level, as a level I organization is not likely to be using software reliability engineering. Table 4 gives a model based on the numbers suggested by Jones and Keene and also reported in a Motorola study.

Table 4 The process maturity factor Fm

SEI CMM level | Multiplier
Level 1 | 1.5
Level 2 | 1 (default)
Level 3 | 0.4
Level 4 | 0.1
Level 5 | 0.05

The Software Structure Factor Fs

This factor takes into account the dependence of defect density on language type (the fractions of code in assembly and high-level languages) and program complexity. It can be reasonably assumed that assembly language code is harder to write, and thus will have a higher defect density. The influence of program complexity has been extensively debated in the literature. Many complexity measures are strongly correlated to software size.
As we are constructing a model for defect density, software size has already been taken into account. A simple model for Fs, depending on language use, is given below:

Fs = 1 + 0.4a

where a is the fraction of the code in assembly language. Here, we are assuming that assembly code has 40% more defects.

Distribution of module sizes for a project may have some impact on the defect density. Very large modules can be hard to comprehend. Very small modules can have a higher fraction of defects associated with the interaction of the modules. As module sizes tend to be unevenly distributed, there may be an overall effect of module size distribution.[9] Further research is needed to develop a model for this factor. Requirement volatility is another factor that needs to be considered. If requirement changes occur later during the software development process, they will have more impact on defect density. We can allow other factors to be taken into account by calibrating the overall model.

Calibrating and Using the Defect Density Model

The model given in Eq. (3) provides an initial estimate. It should be calibrated using past data from the same organization. Calibration requires application of the factors using available data in the organization and determining the appropriate values of the factor parameters. As we are using the beginning of the subsystem test phase as the default, Musa et al.'s data suggest that the constant of proportionality C can range from about 6 to 20 defects per KSLOC. For best accuracy, the past data used for calibration should come from projects similar to the one for which the projection needs to be made. Some of the indeterminacy inherent in such models can be taken into account by using a high and a low estimate and using both of them to make projections.

Example 1: For an organization, the value of C has been found to be between 12 and 16. A project is being developed by an average team and the SEI (Software Engineering Institute) maturity level is II. About 20% of the code is in assembly language. Other factors are assumed to be average. The software size is estimated to be 20,000 LOC. We want to estimate the total number of defects at the beginning of the integration test phase.

From the model given by Eq. (3), we estimate that the defect density at the beginning of the subsystem test phase can range between 12 × 2.5 × 1 × 1 × (1 + 0.4 × 0.2) = 32.4 per KSLOC and 16 × 2.5 × 1 × 1 × (1 + 0.4 × 0.2) = 43.2 per KSLOC. Thus, the total number of defects can range from about 648 to 864.
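The static model is easy to script. The sketch below is a minimal illustration (the function and dictionary names are ours) that evaluates Eq. (3) with the multipliers of Tables 2-4 and the Fs submodel, reproducing the numbers of Example 1:

```python
# Multipliers from Tables 2-4 and the Fs submodel (Fs = 1 + 0.4a).
F_PHASE = {"unit": 4.0, "subsystem": 2.5, "system": 1.0, "operation": 0.35}
F_TEAM = {"high": 0.4, "average": 1.0, "low": 2.5}
F_MATURITY = {1: 1.5, 2: 1.0, 3: 0.4, 4: 0.1, 5: 0.05}

def defect_density(c, phase, team, cmm_level, asm_fraction):
    """Defects per KSLOC at the start of a test phase, per Eq. (3)."""
    f_s = 1.0 + 0.4 * asm_fraction  # software structure factor Fs
    return c * F_PHASE[phase] * F_TEAM[team] * F_MATURITY[cmm_level] * f_s

# Example 1: C between 12 and 16, average team, CMM level 2,
# 20% assembly code, 20 KSLOC of source.
ksloc = 20
for c in (12, 16):
    d = defect_density(c, "subsystem", "average", 2, 0.20)
    print(f"C={c}: {d:.1f} defects/KSLOC, about {d * ksloc:.0f} total defects")
# Prints roughly 32.4 -> 648 and 43.2 -> 864 defects.
```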
SOFTWARE TEST METHODOLOGY

To test a program, a number of inputs are applied and the program response is observed. If the response is different from the expected one, the program has at least one defect. Testing can have one of two separate objectives. During debugging, the aim is to increase the reliability as fast as possible, by finding faults as quickly as possible. On the other hand, during certification, the objective is to assess the reliability, so the fault-finding rate should be representative of actual operation. The test generation approaches can be divided into the following classes.

1. Black-box (or functional) testing: Test generation is done by considering only the input/output description of the software; nothing about the implementation of the software is assumed to be known. This is the most common form of testing.

2. White-box (or structural) testing: The actual implementation is used to generate the tests.

In actual practice, a combination of the two approaches will often yield the best results. Black-box testing only requires a functional description of the program; however, some information about the actual implementation will allow testers to better select the points to probe in the input space. In the random-testing approach, the inputs are selected randomly. In the partition testing approach, the input space is divided into suitably defined partitions. The inputs are then chosen such that each partition is reasonably and thoroughly exercised. It is possible to combine the two approaches; partitions can be probed both deterministically for boundary cases and randomly for nonspecial cases.

Some faults are easily detected, i.e., they have high testability. Some faults have very low testability; they are triggered only under a rarely occurring input combination. At the beginning of testing, a large fraction of faults have high testability. However, they are easily detected and removed. In the later phases of testing, the remaining faults have low testability. Finding these faults can be challenging. The testers need to use careful and systematic approaches to achieve a very low defect density.

Thoroughness of testing can be measured using a test coverage measure, as discussed before in the section "Software Reliability Measures." Branch coverage is a stricter measure than statement coverage. Some organizations use branch coverage (say 85%) as the minimum criterion. For very high-reliability programs, a stricter measure (like p-use coverage) or a combination of measures (like those provided by the GCT coverage tool) should be used.

To be able to estimate operational reliability, testing must be done in accordance with the operational profile. A profile is the set of disjoint actions, or operations, that a program may perform, and their probabilities of occurrence. The probabilities that occur in actual operation specify the operational profile. Sometimes, when a program can be used in very different environments, the operational profile for each environment may be different. Obtaining an operational profile requires dividing the input space into sufficiently small leaf partitions, and then estimating the probabilities associated with each leaf partition. A subspace with high probability may need to be further divided into smaller subspaces.

Example 2: This example is based on the Fone-Follower system example by Musa.[10] A Fone-Follower system responds differently to a call depending on the type of call. Based on past experience, the following types are identified and their probabilities have been estimated as given below:

A. Voice call 0.74
B. FAX call 0.15
C. New number entry 0.10
D. Database audit 0.009
E. Add subscriber 0.0005
F. Delete subscriber 0.0005
G. Hardware failure recovery 0.000001
Total for all events: 1.0

Thus, the leaf partitions are {A1, A2, A3, A4, A5, B, C, D, E, F, G}. These and their probabilities form the operational profile. During acceptance testing, the tests would be chosen such that a FAX call occurs 15% of the time, a {voice call, no pager, answer} occurs 18% of the time, and so on.

Testing should be done according to the operational profile if the objective is to estimate the failure rate. For debugging, operational profile-based testing is more efficient if the testing time is limited. However, if high reliability is desired, testing needs to be more uniform. Defects in parts of the code that are infrequently executed can be hard to detect. To achieve very high reliability, special tests should be used to detect such defects.[11]
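A test driver can draw test cases directly from an operational profile. The sketch below is illustrative only: the event probabilities are those of Example 2, while the function name and the seed are our own choices.

```python
import random
from collections import Counter

# Operational profile from Example 2 (event type -> probability of occurrence).
PROFILE = {
    "voice call": 0.74,
    "FAX call": 0.15,
    "new number entry": 0.10,
    "database audit": 0.009,
    "add subscriber": 0.0005,
    "delete subscriber": 0.0005,
    "hardware failure recovery": 0.000001,
}

def draw_test_cases(n, seed=1):
    """Sample n transaction types according to the operational profile."""
    rng = random.Random(seed)
    return rng.choices(list(PROFILE), weights=list(PROFILE.values()), k=n)

# In an operational-profile-driven test run, about 74% of the drawn cases
# should be voice calls, 15% FAX calls, and so on.
print(Counter(draw_test_cases(10_000)))
```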
MODELING SOFTWARE RELIABILITY GROWTH

The fraction of cost needed for testing a software system to achieve a suitable reliability level can sometimes be as high as 60% of the overall cost. Testing must be carefully planned so that the software can be released by a target date. Even after a lengthy testing period, additional testing will always potentially detect more bugs. Software must be released, even if it is likely to have a few bugs, provided an appropriate reliability level has been achieved. Careful planning and decision making requires the use of a software reliability growth model (SRGM).

An SRGM assumes that reliability will grow with testing time t, which can be measured in terms of the CPU execution time used, or the number of man-hours or days. The time can also be measured in terms of the number of transactions encountered. The growth of reliability is generally specified in terms of either the failure intensity λ(t) or the total expected faults detected by time t, given by m(t). The relationship between the two is given by

λ(t) = d m(t)/dt    (4)

For the exponential SRGM, discussed below, the parameter b1 can be estimated even before testing begins as

b1 = K / (S Q (1/r))    (6)

where S is the total number of source instructions, Q is the number of object instructions per source instruction, and r is the object instruction execution rate of the computer being used. The term K is called the fault exposure ratio; its value has been found to be in the range 1 × 10⁻⁷ to 10 × 10⁻⁷ when t is measured in seconds of CPU execution time. The expected number of defects remaining at time t then declines exponentially,

N(t) = N(0) e^(−b1·t)    (7)

When N(0) is the initial total number of defects, the total expected faults detected by time t is then

m(t) = N(0) − N(t) = N(0)(1 − e^(−b1·t))    (8)

which is generally written in the form

m(t) = b0(1 − e^(−b1·t))    (9)

where b0, the total number of faults that would be eventually detected, is equal to N(0). This assumes that no new defects are generated during debugging. Using Eq. (4), we can obtain an expression for the failure intensity from Eq. (9):

λ(t) = b0·b1·e^(−b1·t)    (10)

The exponential model is easy to understand and apply. One significant advantage of this model is that both parameters b0 and b1 have a clear interpretation and can be estimated even before testing begins. The models proposed by Jelinski and Moranda (1971), Shooman (1971), Goel and Okumoto (1979), and Musa (1975-1980) can be considered to be reformulations of the exponential model. The hyperexponential model, considered by Ohba, Yamada, and Laprie, assumes that different sections of the software are separately governed by an exponential model, with different parameter values for different sections.

Many other SRGMs have been proposed and used. Several models have been compared for their predictive capability using data obtained from different projects. The exponential model fares well in comparison with other models; however, a couple of models can outperform the exponential model. We will here look at the logarithmic model, proposed by Musa and Okumoto, which has been found to have a better predictive capability compared with the exponential model.

Unlike the exponential model, the logarithmic model assumes that the fault exposure ratio K varies during testing.[12] The logarithmic model is also a finite-time model, assuming that after a finite time, there will be no more faults to be found. The model can be stated as

m(t) = b0 ln(1 + b1·t)    (11)

and, using Eq. (4), the corresponding failure intensity is

λ(t) = b0·b1 / (1 + b1·t)    (12)

Eqs. (11) and (12) are applicable as long as m(t) ≤ N(0). In practice, the condition will almost always be satisfied, as testing always terminates while a few bugs are still likely to be present.

The variation in K, as assumed by the logarithmic model, has been observed in actual practice. The value of K declines at higher defect densities, as defects get harder to find. However, at low defect densities, K starts rising. This may be explained by the fact that real testing tends to be directed rather than random, and this starts affecting the behavior at low defect densities. The two parameters of the logarithmic model, b0 and b1, do not have a simple interpretation. A possible interpretation is provided by Malaiya and Denton.[7] They have also given an approach for estimating the logarithmic model parameters b0L, b1L, once the exponential model parameters have been estimated.

The exponential model has been shown to have a negative bias; it tends to underestimate the number of defects that will be detected in a future interval. The logarithmic model also has a negative bias; however, it is much smaller. Among the major models, only the Littlewood-Verrall Bayesian model exhibits a positive bias. This model has also been found to have good predictive capabilities; however, because of computational complexity and a lack of interpretation of the parameter values, it is not popular.

An SRGM can be applied in two different types of situations. Applying it before testing requires static estimation of parameters. During testing, actual test data are used to estimate the parameters.

Before Testing Begins

A manager often has to come up with a preliminary plan for testing very early. For the exponential and the logarithmic models, it is possible to estimate the two parameter values based on the defect density model and Eq. (6). One can then estimate the testing time needed to achieve the target failure intensity, MTTF, or defect density.

Example 3: Let us assume that for a project, the initial defect density has been estimated, using the static model given in Eq. (3), and has been found to be 25 defects per KLOC. The software consists of 10,000 LOC. The code expansion ratio Q for C programs is about 2.5, hence the compiled program will be about 10,000 × 2.5 = 25,000 object instructions. The testing is done on a computer that executes 70 million object instructions per second.
Let us also assume that the fault exposure ratio K has an expected average value of 4 × 10⁻⁷. We wish to estimate the testing time needed to achieve a defect density of 2.5 defects per KLOC.

For the exponential model, we can estimate that

b0 = N(0) = 25 × 10 = 250 defects

and from Eq. (6)

b1 = K / (S Q (1/r)) = 4.0 × 10⁻⁷ / (10,000 × 2.5 × 1/(70 × 10⁶)) = 11.2 × 10⁻⁴ per sec

If t1 is the time needed to achieve a defect density of 2.5 per KLOC, then using Eq. (7),

N(t1)/N(0) = (2.5 × 10)/(25 × 10) = exp(−11.2 × 10⁻⁴ t1)

giving

t1 = −ln(0.1)/(11.2 × 10⁻⁴) = 2056 sec of CPU time

We can compute the failure intensity at time t1 to be

λ(t1) = 250 × 11.2 × 10⁻⁴ × e^(−11.2 × 10⁻⁴ t1) = 0.028 failures/sec

For this example, it should be noted that the value of K (and hence t1) may depend on the initial defect density and the testing strategy used. In many cases, the time t is specified in terms of the number of man-hours. We would then have to convert man-hours to CPU execution time by multiplying by an appropriate factor. This factor would have to be determined using recently collected data. An alternative way to estimate b1 is to note that Eq. (6) implies that, for the same environment, the product of b1 and the source code size is constant. Thus, if for a prior project with 5 KLOC of source code the final value for b1 was 2 × 10⁻³ sec⁻¹, then for a new 15 KLOC project b1 can be estimated as 2 × 10⁻³/3 = 0.66 × 10⁻³ sec⁻¹.
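The planning calculation of Example 3 can be scripted directly from Eqs. (6), (7), and (10). The sketch below is a minimal illustration with variable names of our choosing:

```python
import math

# Example 3 inputs.
D0 = 25.0        # initial defect density, defects per KLOC
ksloc = 10.0     # source size in KLOC
Q = 2.5          # object instructions per source instruction
r = 70e6         # object instructions executed per second
K = 4.0e-7       # fault exposure ratio
D_target = 2.5   # target defect density, defects per KLOC

b0 = D0 * ksloc                          # N(0): initial number of defects
b1 = K / (ksloc * 1000 * Q * (1.0 / r))  # Eq. (6), per second of CPU time

# Eq. (7): N(t)/N(0) = exp(-b1*t); solve for the time t1 at which the
# remaining density falls to the target value.
t1 = -math.log(D_target / D0) / b1
lam = b0 * b1 * math.exp(-b1 * t1)       # Eq. (10): failure intensity at t1

print(f"b0 = {b0:.0f} defects, b1 = {b1:.2e} /sec")
print(f"t1 = {t1:.0f} sec of CPU time, failure intensity = {lam:.3f} /sec")
# Roughly 250 defects, 1.12e-3 /sec, 2056 sec, and 0.028 failures/sec.
```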
During Testing

During testing, the defect finding rate can be recorded. By fitting an SRGM, the manager can estimate the additional testing time needed to achieve a desired reliability level. The major steps for using SRGMs are as follows:

1. Collect and preprocess data: The failure intensity data include a lot of short-term noise. To extract the long-term trend, the data often need to be smoothed. A common form of smoothing is to use grouped data. It involves dividing the test duration into a number of intervals and then computing the average failure intensity in each interval.

2. Select a model and determine parameters: The best way to select a model is to rely on past experience with other projects using the same process. The exponential and logarithmic models are often good choices. Early test data have a lot of noise, thus a model that fits early data well may not have the best predictive capability. The parameter values can be estimated using either least squares or maximum likelihood approaches. In the very early phases of testing, the parameter values can fluctuate enormously; they should not be used until they have stabilized.

3. Perform analysis to decide how much more testing is needed: Using the fitted model, we can project how much additional testing needs to be done to achieve a desired failure intensity or estimated defect density. It is possible to recalibrate a model that does not conform with the data to improve the accuracy of the projection. A model that describes the process well to start with can be improved very little by recalibration.

Example 4: This example is based on the T1 data reported by Musa.[1] For the first 12 hr of testing, the number of failures each hour is given in Table 5.

Table 5 Hourly failure data

Hour | Number of failures
1 | 27
2 | 16
3 | 11
4 | 10
5 | 11
6 | 7
7 | 2
8 | 5
9 | 3
10 | 1
11 | 4
12 | 7

Thus, we can assume that during the middle of the first hour (i.e., t = 30 × 60 = 1800 sec) the failure intensity is 0.0075 sec⁻¹. Fitting all the 12 data points to the exponential model [Eq. (10)], we obtain

b0 = 101.47

and

b1 = 5.22 × 10⁻⁵

Let us now assume that the target failure intensity is one failure per hour, i.e., 2.78 × 10⁻⁴ failures per second. An estimate of the stopping time tf is then given by

2.78 × 10⁻⁴ = 101.47 × 5.22 × 10⁻⁵ × e^(−5.22 × 10⁻⁵ tf)    (13)

yielding tf = 56,473 sec, i.e., 15.69 hr, as shown in Fig. 1.

Fig. 1 Using an SRGM.

Investigations with the parameter values of the exponential model suggest that early during testing, the estimated value of b0 tends to be lower than the final value, and the estimated value of b1 tends to be higher. Thus, the value of b0 tends to rise, and that of b1 tends to fall, with the product b0·b1 remaining relatively unchanged. In Eq. (13) above, we can guess that the true value of b1 should be smaller, and thus the true value of tf should be higher. Hence, the value 15.69 hr should be used as a lower estimate for the total test time needed.
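The fit in Example 4 can be approximated with a standard least-squares routine. The sketch below is illustrative only: scipy is an assumed dependency, and the estimates depend on the fitting criterion, so they will only be in the neighborhood of the values quoted above.

```python
import numpy as np
from scipy.optimize import curve_fit

# Table 5: failures observed in each of the first 12 hours of testing.
failures = np.array([27, 16, 11, 10, 11, 7, 2, 5, 3, 1, 4, 7], dtype=float)
t_mid = (np.arange(12) + 0.5) * 3600.0   # midpoint of each hour, in seconds
intensity = failures / 3600.0            # observed failures per second

def exp_intensity(t, b0, b1):
    """Eq. (10): failure intensity of the exponential model."""
    return b0 * b1 * np.exp(-b1 * t)

(b0, b1), _ = curve_fit(exp_intensity, t_mid, intensity, p0=[100.0, 5e-5])

# Eq. (13): stop when the intensity drops to one failure per hour.
target = 1.0 / 3600.0
t_f = np.log(b0 * b1 / target) / b1

print(f"b0 = {b0:.1f}, b1 = {b1:.2e} /sec, stop after about {t_f / 3600:.1f} hr")
# The estimates should land in the ballpark of Example 4 (b0 near 100,
# b1 near 5e-5 /sec, and a stopping time of roughly 16 hr).
```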
In some cases, it is useful to obtain interval estimates of the quantities of interest. The statistical methods to do that can be found in the literature.[12,13] Sometimes we wish to continue testing until a failure rate is achieved with a given confidence level, say 95%. Graphical and analytical methods for determining such stopping points are also available.[13]

The SRGMs assume that a uniform testing strategy is used throughout the testing period. In actual practice, the test strategy is changed from time to time. Each new strategy is initially very effective in detecting a different class of faults, causing a spike in failure intensity when a switch is made. A good smoothing approach will minimize the influence of these spikes during computation. A bigger problem arises when the software under test is not stable because of continuing additions to it. If the changes are significant, early data points should be dropped from the computations. If the additions are component by component, reliability data for each component can be separately collected and the methods presented in the next section can be used.

Test Coverage Based Approach

Several software tools are available that can evaluate coverage of statements, branches, p-uses, etc., during testing. Higher test coverage means the software has been more thoroughly exercised.

It has been established that software test coverage is related to the residual defect density, and hence reliability.[6] The total number of defects found, m, is linearly related to the test coverage measures at higher values of test coverage. For example, if we are using branch coverage CB, we will find that for low values of CB, m remains close to zero. However, at some value of CB (we term it a knee), m starts rising linearly, as given by Eq. (14):

m = a + b·CB,    CB > knee    (14)

The values of the parameters a and b will depend on the software size and the initial defect density. The advantage of using coverage measures is that variations in test effectiveness will not influence the relationship, as test coverage directly measures how thoroughly a program has been exercised. For high-reliability systems, a strict measure like p-use coverage should be used.

Fig. 2 below plots the actual data from a project. By the end of testing, 29 defects were found and 70% branch coverage was achieved. If testing were to continue until 100% branch coverage is achieved, then about 47 total defects would have been detected. Thus, according to the model, about 18 residual defects were present when testing was terminated. Note that only one defect was detected when branch coverage was about 25%, thus the knee of the curve is approximately at branch coverage of 0.25.

Fig. 2 Coverage-based modeling.
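The extrapolation described for Fig. 2 amounts to a straight line above the knee. The sketch below is a rough two-point illustration using only the numbers quoted in the text; it is not the actual model fitted for Fig. 2.

```python
# Numbers quoted for Fig. 2: about 1 defect found near the knee at 25%
# branch coverage, and 29 defects found at 70% branch coverage.
knee_cov, knee_defects = 0.25, 1.0
end_cov, end_defects = 0.70, 29.0

# Eq. (14): m = a + b*C_B for C_B above the knee; a two-point fit.
b = (end_defects - knee_defects) / (end_cov - knee_cov)
a = end_defects - b * end_cov

m_full = a + b * 1.0              # expected defects at 100% branch coverage
residual = m_full - end_defects   # defects likely remaining at release

print(f"slope b = {b:.1f}, projected total = {m_full:.0f}, residual = {residual:.0f}")
# Gives a projected total of roughly 47-48 defects, i.e., about 18 remaining.
```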
VULNERABILITIES IN INTERNET-RELATED SOFTWARE

Internet-related software systems, including operating systems, servers, and browsers, face escalating security challenges because internet connectivity is growing and the number of security violations is increasing. CERT and other databases keep track of the reported vulnerabilities. An increasing number of individuals and organizations depend on the Internet for financial and other critical services. This has made potential exploitation of vulnerabilities very attractive to criminals with suitable technical expertise.

Each such system passes through several phases: the release of the system, increasing popularity, peak, and stability, followed by decreasing popularity that ends with the system eventually becoming obsolete. There is a common pattern of three phases in the cumulative vulnerabilities plot for a specific version of software. We observe a slow rise when a product is first released. This becomes a steady stream of vulnerability reports as the product develops a market share and starts attracting the attention of both professional and criminal vulnerability finders. When an alternative product starts taking away the market share, the rate of vulnerability finding drops in the older product. Fig. 3 below shows a plot of vulnerabilities reported during January 1999 to August 2002 for Windows 98.

A model for the cumulative number of vulnerabilities y, against calendar time t, is given by Eq. (15):[14]

y = B / (B·C·e^(−A·B·t) + 1)    (15)

A, B, and C are empirical constants determined from the recorded data. The parameter B gives the total number of vulnerabilities that will be eventually found. The chi-square goodness-of-fit examination shows that the data for several common operating systems fit the model very well. The vulnerabilities in such a program are software defects that permit an unauthorized action. The density of vulnerabilities in a large software system is an important measure of risk. The known vulnerability density can be evaluated using the database of the reported vulnerabilities. Known vulnerability density VKD can be defined as the reported number of vulnerabilities in the system per unit size of the system. This is given by

VKD = VK / S    (16)

where S is the size of the software and VK is the reported number of vulnerabilities in the system. Table 6 presents the values based on data from several sources.[14] It gives the known defect density DKD, VKD, and the ratios of the two.

In Table 6 we see that the source code size is 16 and 18 MSLOC (million source lines of code) for NT and Win 98, respectively, approximately the same. The reported defect densities at release, 0.625 and 0.556, are also similar for the two OSs. The known vulnerabilities in Table 6 are as of July 2004. We notice that for Windows 98 the vulnerability density is 0.0032, whereas for Windows NT 4.0 it is 0.0101, significantly higher. The higher values for NT 4.0 may be due to two factors. First, as a server OS, it may contain more code that handles access mechanisms. Second, because attacking servers would generally be much more rewarding, it must have attracted a lot more testing effort, resulting in detection of more vulnerabilities.

The last column in Table 6 gives the ratios of known vulnerabilities to known defects. For the two OSs the values are 1.62% and 0.58%. It has been assumed by different researchers that the vulnerabilities can be 1% or 5% of the total defects. The values given in Table 6 suggest that 1% may be closer to reality.
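The quantities defined above are straightforward to compute. In the sketch below, the parameters chosen for Eq. (15) are arbitrary placeholders, and the vulnerability counts are back-calculated from the sizes and densities quoted for Windows NT 4.0 and Windows 98; the code is illustrative only.

```python
import math

def cumulative_vulnerabilities(t, A, B, C):
    """Eq. (15): y(t) = B / (B*C*exp(-A*B*t) + 1)."""
    return B / (B * C * math.exp(-A * B * t) + 1.0)

# Arbitrary placeholder parameters: B is the eventual total number of
# vulnerabilities; A and C control the rate and the initial offset.
A, B, C = 0.002, 60.0, 0.5
for t in (0, 12, 24, 36, 48):  # calendar time after release, e.g., months
    print(f"t={t:2d}: {cumulative_vulnerabilities(t, A, B, C):5.1f}")

# Eq. (16): known vulnerability density V_KD = V_K / S. The sizes are in
# KSLOC; the counts are back-calculated from the densities in the text.
systems = {"Windows NT 4.0": (16_000, 162), "Windows 98": (18_000, 58)}
for name, (ksloc, vulns) in systems.items():
    print(f"{name}: V_KD = {vulns / ksloc:.4f} per KSLOC")
# About 0.0101 for NT 4.0 and 0.0032 for Windows 98, matching the text.
```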
RELIABILITY OF MULTICOMPONENT SYSTEMS

A large software system consists of a number of modules. It is possible that the individual modules are developed and tested differently, resulting in different defect densities and failure rates. Here, we will present methods for obtaining the system failure rate and the reliability if we know the reliabilities of the individual modules. Let us assume that for a system one module is under execution at a time. Modules will differ in how often and how long they are executed. If fi is the fraction of the time module i is under execution, then the mean system failure rate is given by:[16]

λsys = Σi fi λi    (17)

where λi is the failure rate of module i. Here, each major interoperability interface should be regarded as a virtual module.

Let the mean duration of a single transaction be T. Let us assume that module i is called ei times during T, and each time it is executed for duration di; then

fi = ei di / T    (18)

Let us define the system reliability Rsys as the probability that no failures will occur during a single transaction. From reliability theory, it is given by

Rsys = e^(−λsys T) = Πi e^(−ei di λi)

As exp(−di λi) is Ri, the single-execution reliability of module i, we have

Rsys = Πi (Ri)^ei    (19)
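A small sketch of Eqs. (17)-(19); the module names and execution data are made up for illustration, and the transaction is assumed to be fully accounted for by the listed modules.

```python
import math

# Per-module data for one transaction: (calls e_i, duration d_i in seconds,
# failure rate lambda_i per second). The values are made up for illustration.
modules = {
    "parser": (4, 0.020, 2e-4),
    "core":   (1, 0.500, 5e-5),
    "logger": (6, 0.005, 1e-4),
}
T = sum(e * d for e, d, _ in modules.values())  # mean transaction duration

# Eqs. (18) and (17): execution fractions and mean system failure rate.
lam_sys = sum((e * d / T) * lam for e, d, lam in modules.values())

# Eq. (19): transaction reliability as a product of per-execution
# reliabilities R_i = exp(-d_i * lambda_i), each raised to the power e_i.
R_sys = math.prod(math.exp(-d * lam) ** e for e, d, lam in modules.values())

print(f"T = {T:.3f} s, system failure rate = {lam_sys:.2e} /s, R_sys = {R_sys:.6f}")
```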
Multiple-Version Systems

In some critical applications, like defense or avionics, multiple versions of the same program are sometimes used. Each version is implemented and tested independently to minimize the probability of a multiple number of them failing at the same time. The most common implementation uses triplication and voting on the result. The system is assumed to operate correctly as long as the results of at least two of the versions agree. This assumes the voting mechanism to be perfect. If the failures in the three versions are truly independent, the improvement in reliability can be dramatic; however, it has been shown that correlated failures must be taken into account.

In a three-version system, let q3 be the probability of all three versions failing for the same input. Also, let q2 be the probability that any two versions will fail together. As three different pairs are possible among the three versions, the probability Psys of the system failing is:

Psys = q3 + 3q2    (20)

In the ideal case, the failures are statistically independent. If the probability of a single version failing is p, the above equation can be written for the ideal case as:

Psys = p³ + 3(1 − p)p²    (21)

In practice, there is a significant correlation, requiring estimation of q3 and q2 for system reliability evaluation.

Example 5: This example is based on the data collected by Knight and Leveson, and the computations by Hatton.[15] In a three-version system, let the probability of a version failing for a transaction be 0.0004. Then, in the absence of any correlated failures, we can achieve a system failure probability of

Psys = (0.0004)³ + 3(1 − 0.0004)(0.0004)² = 4.8 × 10⁻⁷
Hatton points out that the state-of-the-art techniques have been found to reduce defect density only by a factor of 10. Hence, an improvement factor of about 50 may be unattainable except by using N-version redundancy.
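The comparison in Example 5 is a one-line computation. In the sketch below, the independent case reproduces Example 5, while the correlated-failure probabilities q2 and q3 are hypothetical values chosen only to show how quickly correlation dominates:

```python
def p_sys_independent(p):
    """Eq. (21): three-version system failure probability, independent faults."""
    return p**3 + 3 * (1 - p) * p**2

def p_sys_correlated(q2, q3):
    """Eq. (20): failure probability with pairwise (q2) and triple (q3) failures."""
    return q3 + 3 * q2

p = 0.0004
print(f"independent: {p_sys_independent(p):.2e}")  # about 4.8e-7, as in Example 5

# Hypothetical correlated-failure probabilities: even a small amount of
# correlation dominates the ideal result.
q2, q3 = 2e-6, 5e-7
print(f"correlated:  {p_sys_correlated(q2, q3):.2e}")   # 6.5e-6 with these values
print(f"improvement over a single version: {p / p_sys_correlated(q2, q3):.0f}x")
```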
TOOLS FOR SOFTWARE TESTING AND RELIABILITY

Software reliability has now emerged as an engineering discipline. It can require a significant amount of data collection and analysis. Tools are now becoming available that can automate several of the tasks. Here, the names of some representative tools are mentioned. Many of the tools may run on specific platforms only, and some are intended for specific applications only. Installing and learning a tool can require a significant amount of time, thus a tool should be selected after a careful comparison of the applicable tools available.

Automatic test generation: TestMaster (Teradyne), AETG (Bellcore), ARTG (CSU), etc.
GUI testing: QA Partner (Segue), WinRunner (Mercury Interactive), etc.
Memory testing: Purify (Rational), BoundsChecker (NuMega Tech.), etc.
Defect tracking: BugBase (Archimides), DVCS Tracker (Intersolv), DDTS (Qualtrack), etc.
Interoperability: Interoperability tool (AlphaWorks).
Test coverage evaluation: JCover (Man Machine Systems), GCT (Testing Foundation), PureCoverage (Rational), XSUDS (Bellcore), etc.
Reliability growth modeling: CASRE (NASA), SMERFS (NSWC), ROBUST (CSU), etc.
Defect density estimation: ROBUST (CSU).
Coverage-based reliability modeling: ROBUST (CSU).
Markov reliability evaluation: HARP (NASA), HiRel (NASA), PC Availability (Management Sciences), etc.
Fault-tree analysis: RBD (Relax), Galileo (UV), CARA (Sydvest), FaultTree+ (AnSim), etc.

REFERENCES

1. Musa, J.D. Software Reliability Engineering; McGraw-Hill, 1999.
2. Malaiya, Y.K.; Srimani, P. Software Reliability Models; IEEE Computer Society Press, 1990.
3. Carleton, A.D.; Park, R.E.; Florac, W.A. Practical Software Measurement. Technical Report CMU/SEI-97-HB-003; Software Engineering Institute.
4. Piwowarski, P.; Ohba, M.; Caruso, J. Coverage measurement experience during function test. Proceedings of the International Conference on Software Engineering, 1993; 287-301.
5. International Standard Z39.50 Maintenance Agency; http://www.loc.gov/z3950/agency (accessed Mar 21, 2005).
6. Malaiya, Y.K.; Li, N.; Bieman, J.; Karcich, R.; Skibbe, B. The relation between test coverage and reliability. Proceedings IEEE-CS International Symposium on Software Reliability Engineering, Nov 1994; 186-195.
7. Malaiya, Y.K.; Denton, J. What do the software reliability growth model parameters represent? Proceedings IEEE-CS International Symposium on Software Reliability Engineering (ISSRE), Nov 1997; 124-135.
8. Takahashi, M.; Kamayachi, Y. An empirical study of a model for program error prediction. Proceedings International Conference on Software Engineering, Aug 1995; 330-336.
9. Malaiya, Y.K.; Denton, J. Module size distribution and defect density. Proceedings IEEE International Symposium on Software Reliability Engineering, Oct 2000; 62-71.
10. Musa, J. More Reliable, Faster, Cheaper Testing through Software Reliability Engineering. Tutorial Notes, ISSRE '97, Nov 1997; 1-88.
11. Yin, H.; Lebne-Dengel, Z.; Malaiya, Y.K. Automatic test generation using checkpoint encoding and antirandom testing. International Symposium on Software Reliability Engineering, 1997; 84-95.
12. Li, N.; Malaiya, Y.K. Fault exposure ratio: estimation and applications. Proceedings IEEE-CS International Symposium on Software Reliability Engineering, Nov 1993; 372-381.
13. Lyu, M.R., Ed. Handbook of Software Reliability Engineering; McGraw-Hill, 1996; 71-117.
14. Alhazmi, O.H.; Malaiya, Y.K. Quantitative vulnerability assessment in systems software. Proceedings IEEE Reliability and Maintainability Symposium, Jan 2005; 615-620.
15. Hatton, L. N-version design versus one good design. IEEE Software, Nov/Dec 1997; 71-76.
16. Lakey, P.B.; Neufelder, A.M. System and Software Reliability Assurance Notebook; Rome Lab, FSC-RELI, 1997.