Bayesian Inference under Small Sample Sizes Using General
Noninformative Priors
Jingjing He 1 , Wei Wang 1 , Min Huang 1 , Shaohua Wang 2 and Xuefei Guan 3, *

1 School of Reliability and Systems Engineering, Beihang University, Beijing 100191, China;
hejingjing@buaa.edu.cn (J.H.); w.wei@buaa.edu.cn (W.W.); buaahuang@163.com (M.H.);
2 China Aviation Power Plant Research Institute, Zhuzhou 412002, China; wangshaohua@buaa.edu.cn
3 Graduate School of China Academy of Engineering Physics, Beijing 100193, China
* Correspondence: xfguan@gscaep.ac.cn

Abstract: This paper proposes a Bayesian inference method for problems with small sample sizes. A
general type of noninformative prior is proposed to formulate the Bayesian posterior. It is shown that
this type of prior can represent a broad range of priors such as classical noninformative priors and
asymptotically locally invariant priors and can be derived as the limiting states of normal-inverse-
Gamma conjugate priors, allowing for analytical evaluations of Bayesian posteriors and predictors.
The performance of different noninformative priors under small sample sizes is compared using
the likelihood combining both fitting and prediction performances. Laplace approximation is used
to evaluate the likelihood. A realistic fatigue reliability problem is used to illustrate the method.
Following that, an actual aeroengine disk lifing application with two test samples is presented, and
the results are compared with the existing method.

Keywords: Bayesian inference; noninformative prior; Jeffreys' prior; invariant




1. Introduction

Sample sizes are often quite small in many engineering fields due to time, economic, and physical constraints, for example, life testing data of high-reliability mechanical components, large and complex engineering systems, and so on. The influence of sample size on interpreting classical significance test results has been discussed in several studies [1–3]. The predictive models established based on scarce samples may highly depend on the chosen parameter estimation method [4,5]. In those cases, the probabilistic approach is usually preferred over the deterministic approach [6–11]. The Bayes rule provides a consistent and rational mathematical device to incorporate relevant information and prior knowledge for probabilistic inference.

To motivate the discussion, consider an observable random variable X with a conditional probability density function (PDF) (or simply density) of p(x|φ), where φ ∈ Φ is also a random variable. The inverse problem is to make inferences about φ given an observed value x of X. The Bayesian approach to the solution is to use some density p(φ) over Φ to represent the prior information of φ. In this way, the prior knowledge of φ can be encoded through the Bayes rule to obtain the posterior PDF of φ given x,

p(φ|x) ∝ p(φ) p(x|φ). (1)

The forward inference, e.g., the PDF of a certain variable or the probability of an event involving φ, can be made on the basis of the posterior PDF. Using the method of Markov chain Monte Carlo (MCMC), samples can directly be drawn from the posterior distribution without knowing the normalizing constant p(x) = ∫ p(φ) p(x|φ) dφ in Equation (1). The Bayesian method has been successfully demonstrated in all important disciplines [12–19].
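For illustration, the sketch below (an addition to this text, not part of the original paper; the zero-mean Gaussian data and sampler settings are assumed) draws samples of a scale parameter σ directly from the unnormalized product p(φ)p(x|φ) with a random-walk Metropolis sampler:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=5)  # a small sample of zero-mean observations

def log_unnorm_post(sigma):
    """ln[p(sigma) p(x|sigma)] with Jeffreys' prior p(sigma) proportional to 1/sigma."""
    if sigma <= 0:
        return -np.inf
    log_like = -x.size * np.log(sigma) - np.sum(x**2) / (2 * sigma**2)
    return -np.log(sigma) + log_like  # prior + likelihood, up to a constant

sigma, samples = 1.0, []
for _ in range(20000):  # random-walk Metropolis; no normalizing constant needed
    prop = sigma + 0.3 * rng.standard_normal()
    if np.log(rng.uniform()) < log_unnorm_post(prop) - log_unnorm_post(sigma):
        sigma = prop
    samples.append(sigma)
print(np.mean(samples[5000:]))  # posterior mean of sigma after burn-in
```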
The choice of p(φ) can have a great influence on the inference result. The proper choice of priors has been extensively discussed in the probability and statistics communities, and it
can never be overlooked as it is one of the fundamental pieces of Bayesian inference [20–26].
For one thing, the formal rule of constructing a prior regardless of the data and likelihood
is sought in the field of physics [9,27]. Jaynes and Bretthorst [28] argued that “problems of
inference are ill-posed until we recognize three essential things. (A) The prior probabilities
represent our prior information, and are to be determined, not by introspection but by logical
analysis of that information. (B) Since the final conclusions depend necessarily on both the
prior information and the data, it follows that, in formulating a problem, one must specify
the prior information to be used just as fully as one specifies the data. (C) Our goal is that
inferences are to be completely ‘objective’ in the sense that two persons with the same prior
information must assign the same prior probabilities.” (p. 373). For another, the choice of a
prior may highly depend on data and likelihood in practice. Gelman et al. [29] argued that
“a prior can in general only be interpreted in the context of the likelihood with which it will
be paired” (p. 12).
To ensure a consistent and objective inference, rules for constructing priors with minimal
subjective constraints are sought. Early work on the construction of such priors is based on the
“ignorance” over the parameter space using invariance techniques [27,30,31]. The fundamental
reasoning is that the priors should carry the same amount of information such that a change
of scale and/or a shift of location do not affect the inference results on those parameters. The
ignorant prior can systematically be derived using the concept of the transformation group.
Using different transformation groups, different priors can be obtained. For simplicity, such
priors are loosely referred to as noninformative priors. One of the most notable priors for
a scale parameter σ of a distribution is Jeffreys’ prior, i.e., p(σ ) ∝ 1/σ. Jeffreys’ prior can
be obtained using the tool of the transformation group under the condition that a change of
scale does not change the state of knowledge. A further extension of the noninformative
priors is “reference priors” [32,33]. The theoretical framework has been discussed in many
studies, including, but not limited to, [31,34]. Apart from the noninformative approach
to derive the priors, there are several less objective approaches to construct the priors.
Gelman [35] proposed the weakly informative priors based on the idea of conditional
conjugacy for hierarchical Bayesian models. The conditionally conjugate priors provide
some computational convenience as a Gibbs sampler can be used to draw samples from a
posterior distribution, and some parameters for inverse-Gamma distributions in Bayesian
applications are suggested. Simpson et al. [36] proposed a method to build priors. The
basic idea is to penalize the complexity induced by deviating from a simpler base model
using the Kullback–Leibler divergence as a metric [37].
For critical problems with small sample sizes, the choice of the prior in Bayesian
inference can have a great impact on the inference results, rendering unreliable decision-making. Despite a great deal of research, including the aforementioned, having been conducted, the choice of an optimal prior is still nontrivial to make from the practical point of view, and a systematic method to cope with such cases is rarely seen. Moreover, a
quantitative measure to evaluate the performance of a prior is equally important to identify
the optimal prior. This study develops a noninformative Bayesian approach to probabilistic
inference under small sample sizes. A general 1/σ^q-type of noninformative prior for the
location-scale family of distributions is proposed. This type of prior is further shown to
be the limiting state of the commonly used normal-inverse-Gamma conjugate prior, thus
allowing for an analytical evaluation of the posterior of the parameter. Given a linear model
with a Gaussian error variable, the analytical form of the posterior of the model prediction
can also be obtained. The informative or noninformative nature of the prior is centered
more on the concept of whether the prior encodes subjective information; therefore, it is
less relevant to the likelihood function (and data). The subjective information is the source
of inconsistency in the inference result as different practitioners can use different subjective
information in the form of assumptions, which in most cases may only be justified by
oneself. However, the prior has to be built with a certain state of knowledge. This state
of knowledge, or loosely called information, is related to neither experience nor assumptions,
but is a result of a logical reasoning process [27]. The purpose of the study is not to argue
whether these assumptions are justified or not, nor to question the suitable assumptions
made during the inference. This study provides an alternative to existing methods in
such cases to avoid introducing subjectiveness into the Bayesian inference. In particular,
when the number of samples is small, the influence of the prior can be as important as the
likelihood (and data). In such cases, subjective information encoded in the prior can lead
to biased results.
The remainder of the paper is organized as follows. First, the Bayesian model with noninformative priors is developed; in particular, a general 1/σ^q-type of noninformative prior for the location-scale family of distributions is proposed for small sample problems.
The noninformative priors are further shown as the limiting states of the normal-inverse-
Gamma (NIG) conjugate priors. The closed-form expressions of the Bayesian posterior
and predictors using the proposed noninformative priors are obtained under a Gaussian
likelihood. Next, a performance measure considering both the fitting performance and
predictive performance is proposed using the concept of Bayes factors. Different priors are
treated as models in a Bayesian hypothesis testing context for comparisons. Following that,
the method is illustrated using two examples.

2. Bayesian Linearized Models with Noninformative Priors


To motivate the discussion, consider a general linear or linearized model:

y i = x i θ + ei , (2)

where θ is a k-dimensional column vector and eᵢ are independent and identically distributed
random error variables. The distribution of ei determines the likelihood function or vice
versa. Without loss of generality, the distribution belongs to a location-scale family of
distributions. Furthermore, it is enough to use a single scale parameter to characterize the
distribution of e since any constant nonzero mean, no matter known or unknown, can be
grouped into θ. Denote the scale parameter of e as σ2 . A Bayesian model incorporates
both the prior information and the observation through Bayes’ rule. Using the matrix form
y = xθ + e, the Bayesian posterior of (θ, σ2 ) writes:

p(θ, σ2 |y) ∝ p(θ, σ2 ) p(y|θ, σ2 ). (3)

The common Gaussian error variable, i.e., eᵢ ∼ N(0, σ²), corresponds to the following likelihood function for n observations y = (y₁, y₂, ..., yₙ)ᵀ and n input vectors x = (x₁, x₂, ..., xₙ)ᵀ, where xᵢ, i = 1, ..., n, is a row vector of size k:

p(y|θ, σ²) = N(xθ, σ²I) ∝ σ⁻ⁿ exp[ −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − xᵢθ)² ]. (4)

It should be noted that the error variable does not necessarily follow a Gaussian PDF,
and other types of error distributions can be used. For example, the extreme-value PDF for
the error corresponds to a Weibull likelihood for the log-transformed model prediction.

2.1. A General Form of Noninformative Priors—∝ 1/σ^q


A general ∝ 1/σ^q (q ≥ 0) form of priors is considered here. The classical Jeffreys' prior and the related asymptotically locally invariant priors are introduced first for the purpose of completeness.
Consider the PDF of a random variable x characterized by a parameter vector φ;
Jeffreys’ noninformative prior distribution of φ is proportional to the square root of the
determinant of the Fisher information matrix, e.g.,
p(φ) ∝ [det I(φ)]^(1/2), (5)

where det(·) is the determinant operator and I(·) is the Fisher information matrix. The key
feature of it is invariance under the monotone transformation of φ. This feature is achieved
by using the change of variables theorem. Denote the reparameterized variable or vector
as ψ; it can be shown that:
q
∂φi
p(ψ) = p(φ) det = detI( ), (6)
∂ψj

For a Gaussian likelihood with unknown parameters µ and σ2 ,

p(x|µ, σ²) = (1/√(2πσ²)) exp[ −(x − µ)²/(2σ²) ]. (7)

Jeffreys’ prior for the joint parameter (µ, σ2 ) is:


p(µ, σ²) ∝ [det I(µ, σ²)]^(1/2), (8)

where I(µ, σ2 ) is:

I = E[ (∂ln p(x|·)/∂µ)²                    (∂ln p(x|·)/∂µ)(∂ln p(x|·)/∂(σ²))
       (∂ln p(x|·)/∂µ)(∂ln p(x|·)/∂(σ²))   (∂ln p(x|·)/∂(σ²))² ] (9)

and E(·) is the expectation operator. Using algebraic deduction and integration, I(µ, σ2 ) is
simplified to:

I = [ 1/σ²   0
      0      1/(2σ⁴) ]. (10)
As a result, Jeffreys’ prior for (µ, σ2 ) is:

p(µ, σ2 ) ∝ 1/σ3 . (11)

It is noted that the distribution of interest is (µ, σ²), not (µ, σ); therefore, the derivative is taken with respect to σ² as a whole instead of σ. In other words, the parameter space is on σ², not σ. For example, the Jeffreys' prior for σ², with a fixed value of µ, is:

p(σ²) ∝ 1/σ². (12)

The Jeffreys’ prior for (µ, σ ) and σ with a fixed µ are ∝ 1/σ2 and ∝ 1/σ, respectively.
It is shown in Appendix B that the Fisher information matrix is also the Hessian matrix
of the Kullback–Leibler (KL) distance of a deviated distribution with respect to the true
distribution evaluated at the true parameters. For exponential families of distributions,
the KL distance has analytical forms, allowing for the evaluation of the Fisher information
matrix without involving integrals in Equation (9).
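As a quick numerical check of Equations (8)–(11) (this sketch is an illustrative addition, not from the original text), the Fisher information matrix of (µ, σ²) can be estimated by Monte Carlo, and the square root of its determinant reproduces the 1/σ³ dependence of the Jeffreys' prior:

```python
import numpy as np

rng = np.random.default_rng(1)

def fisher_info(mu, var, m=200000):
    """Monte Carlo estimate of I(mu, sigma^2) in Equation (9)."""
    x = rng.normal(mu, np.sqrt(var), size=m)
    d_mu = (x - mu) / var                            # d ln p / d mu
    d_var = -0.5 / var + (x - mu)**2 / (2 * var**2)  # d ln p / d(sigma^2)
    g = np.stack([d_mu, d_var])
    return g @ g.T / m                               # E[grad grad^T]

for var in (1.0, 4.0, 9.0):
    I = fisher_info(0.0, var)
    # sqrt(det I) = 1/(sqrt(2) sigma^3), i.e., proportional to 1/sigma^3
    print(var, np.sqrt(np.linalg.det(I)), 1 / (np.sqrt(2) * var**1.5))
```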
The asymptotically locally invariant (ALI) prior is another type of prior that satisfies
the invariance under certain transformations. An ALI prior can be uniquely determined
using the following equation according to [31],

∂ ln p(φ)/∂φ = −E[f₁ f₂]/E[f₂], (13)

where f₁ = ∂ln p(x|·)/∂φ and f₂ = ∂²ln p(x|·)/∂φ². For a Gaussian distribution with a fixed mean and a random σ, the two terms are E[f₁ f₂] = −6/σ³ and E[f₂] = −2/σ². Solving

∂ ln p(σ)/∂σ = −3/σ (14)

yields the ALI prior for σ:

p(σ) ∝ 1/σ³. (15)

Similarly, the ALI prior for σ² is:

p(σ²) ∝ 1/σ⁴. (16)

When both µ and σ are random, the joint ALI prior for (µ, σ) is:

p(µ, σ) ∝ 1/σ⁵. (17)
It is noticed that the Jeffreys', ALI, and uniform priors are reproduced from ∝ 1/σ^q as q takes different integer values. In the following, the derivation of a 1/σ^q prior from NIG conjugates is shown.

2.2. Derivation of the 1/σ^q Priors as the Limiting States of NIG Conjugates under Gaussian Likelihood
In Bayesian models, for a given likelihood function, the posterior p(φ|x) and the prior p(φ) are called conjugate distributions if both are in the same family of distributions. The prior can be viewed as being reconditioned by encoding the evidence through the likelihood; the evidence or data merely change the distribution parameters of the prior and yield another distribution of the same type, but with a different set of parameters.
For a location-scale parameter vector (θ, σ2 ) used in linear or linearized Bayesian mod-
els with a Gaussian likelihood, the corresponding conjugate prior is the NIG distribution.
The closed-form posterior PDFs for the parameter and prediction are given in Appendix A.
Noninformative priors, including the Jeffreys', ALI, and reference priors, for (θ, σ²) are mostly in the form of 1/σ^q, q ∈ {1, 2, ...}. The uniform prior can be seen as a special case of 1/σ^q with q = 0. It is shown as follows that these noninformative priors can be obtained as certain limiting states of NIG conjugates of (θ, σ²). For example, the NIG distribution with
parameters (α, β, Σ, µ) given by Equation (A3) can reduce to the Jeffreys’ prior:

p(θ, σ²) → 1/σ² (18)

as:

α → −k/2
β → 0⁺
Σ⁻¹ → 0
|µ| < ∞. (19)
Furthermore, the 1/σ^q-type of priors can all be derived as reduced NIG distributions. By assigning the initial values of α, β, and Σ of the NIG distribution, different q values are obtained. Table 1 presents priors with different q values as reduced NIG distributions, the corresponding α, β, and Σ of the NIG distributions, and the α*, β*, and Σ* of the
resulting NIG posterior distributions. The posterior distribution of σ2 is an inverse-Gamma
distribution IG(α*, β*), and the PDF is given by:

p(σ²|y) = (β*^α* / Γ(α*)) (1/σ²)^(α*+1) exp(−β*/σ²). (20)
Table 1. 1/σ^q-type of prior as limiting state cases of the NIG distribution and its NIG posteriors. In every case, β → 0 and Σ⁻¹ → 0, so that β* = β + SSE/2 = SSE/2 and Σ* = (Σ⁻¹ + xᵀx)⁻¹ = (xᵀx)⁻¹.

Prior p(θ, σ²)   NIG(α, β, Σ, µ) → Prior   Posterior NIG(α*, β*, Σ*, µ*)
flat             α = −k/2 − 1              α* = α + n/2 = (n − k − 2)/2
1/σ              α = −k/2 − 1/2            α* = α + n/2 = (n − k − 1)/2
1/σ²             α = −k/2                  α* = α + n/2 = (n − k)/2
1/σ³             α = −k/2 + 1/2            α* = α + n/2 = (n − k + 1)/2
1/σ⁴             α = −k/2 + 1              α* = α + n/2 = (n − k + 2)/2
1/σ⁵             α = −k/2 + 3/2            α* = α + n/2 = (n − k + 3)/2

The posterior distribution of θ is obtained by integrating out σ2 from the joint posterior
distribution of Equation (A6) as,
p(θ|y) = ∫ p(θ, σ²|y) dσ²
       ∝ ∫ (1/σ²)^(α*+k/2+1) exp{ −(1/σ²) [ β* + (1/2)(θ − µ*)ᵀ Σ*⁻¹ (θ − µ*) ] } dσ² (21)
       ∝ [ 1 + (1/(2β*)) (θ − µ*)ᵀ Σ*⁻¹ (θ − µ*) ]^(−(2α*+k)/2),

which is a (2α*) degree-of-freedom multivariate t-distribution with a location vector of µ* and a shape matrix of (β*/α*) Σ*. In particular, when the NIG prior for (θ, σ²) is reduced to
the Jeffreys' prior 1/σ² by Equation (19), the resulting posteriors of σ² and θ are:

p(σ²|y) = IG( (n − k)/2, SSE/2 ), (22)

and:

p(θ|y) = MVT_(n−k)( (xᵀx)⁻¹xᵀy, SSE/(n − k) · (xᵀx)⁻¹ ), (23)
respectively. The term SSE in Equation (22) is the sum of squared errors, given by:
SSE = Σᵢ₌₁ⁿ (yᵢ − xᵢθ)². (24)

The prediction posterior in this case is:
p(ỹ|y) = MVT_(n−k)( x̃(xᵀx)⁻¹xᵀy, SSE/(n − k) · (I + x̃(xᵀx)⁻¹x̃ᵀ) ). (25)
The advantage of treating the 1/σ^q-type of noninformative prior as reduced NIG conjugates is that the Bayesian posteriors of the model prediction and parameters all have analytical forms, allowing for efficient evaluations without resorting to MCMC techniques, as done in regular Bayesian analysis. It is worth noting that the analytical results are obtained under the condition (or assumption) that the likelihood is Gaussian. This condition implies that the conjugate prior is an NIG. The values of the NIG parameters are obtained by equating the noninformative prior to the NIG prior. In other words, equating the two distributions for the purpose of mathematical convenience is the premise of resolving those parameter values.
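The closed-form results can be exercised directly. The following sketch (the synthetic data are assumed for demonstration) computes the posterior quantities of Table 1 and Equations (22)–(23) under the Jeffreys' prior 1/σ² (q = 2):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 2
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])  # design matrix x
y = X @ np.array([1.0, 2.0]) + 0.3 * rng.standard_normal(n)   # synthetic data

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # posterior location (x^T x)^-1 x^T y
SSE = float(np.sum((y - X @ theta_hat)**2))    # sum of squared errors, Equation (24)

alpha_star, beta_star = (n - k) / 2, SSE / 2   # p(sigma^2|y) = IG(alpha*, beta*)
shape = SSE / (n - k) * np.linalg.inv(X.T @ X) # shape matrix of the (n-k)-DOF MVT

print(theta_hat)                # location of p(theta|y), Equation (23)
print(alpha_star, beta_star)    # inverse-Gamma parameters, Equation (22)
print(np.sqrt(np.diag(shape)))  # marginal scales of the t posterior of theta
```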

3. Assessment of Noninformative Priors


To evaluate the performance of priors, the Bayesian model assessment method using
efficient asymptotic approximations is proposed. The idea is to recast the assessment as
a model comparison problem. The participating models here are the posteriors obtained
with different priors. The comparison can then be made using Bayes’ factors in a Bayesian
hypothesis testing context.

3.1. Fitting Performance


The Bayes factor, on the basis of observed data y, evaluating the plausibility of two
different models, M1 and M2 , can be expressed as:

B₁₂ = P(y|M₁) / P(y|M₂) = [∫ p(y|θ, σ², M₁) p(θ, σ²|M₁) dθ dσ²] / [∫ p(y|θ, σ², M₂) p(θ, σ²|M₂) dθ dσ²]. (26)

The comparison of two models can be made based on the ratio of the posterior
probabilities of the models:

P(M₁|y) / P(M₂|y) = [P(y|M₁) / P(y|M₂)] · [P(M₁) / P(M₂)] = B₁₂ · P(M₁)/P(M₂). (27)

The ratio, when the prior probabilities of the models are equal, is reduced to the Bayes
factor of Equation (26). The assessment of the Bayes factors involves two integrals over
the parameter space of (θ, σ2 ). The integrands for models M1 and M2 are p(y|θ, σ2 , M1 ) ·
p(θ, σ2 | M1 ) and p(y|θ, σ2 , M2 ) · p(θ, σ2 | M2 ), respectively.
For general multidimensional integration, asymptotic approximation or simulation-
based estimation are two commonly used methods. The Laplace approximation method
evaluates the integral as a multivariate normal distribution. Consider the above integrand
term in Equation (26), i.e., p(y|θ, σ2 , M1 ) · p(θ, σ2 | M1 ). Drop the model symbol for sim-
plicity of the derivation, and denote (θ, σ2 ) as φ. The natural logarithm of the integrand,
p(y|φ) p(φ), i.e., p(y, φ), can be expressed using Taylor expansion around its mode φ0 as:

ln p(y, φ) = ln p(y, φ⁰) + (φ − φ⁰)ᵀ ∇ln p(y, φ⁰) + (1/2)(φ − φ⁰)ᵀ [∇² ln p(y, φ⁰)] (φ − φ⁰) + O[(φ − φ⁰)³], (28)

where ∇ln p(y, φ⁰) is the gradient of ln p(y, φ) evaluated at φ⁰, ∇² ln p(y, φ⁰) is the Hessian matrix of ln p(y, φ) evaluated at φ⁰, and O[·] are higher-order terms. When the higher-order terms are negligible,

ln p(y, φ) ≈ ln p(y, φ⁰) + (φ − φ⁰)ᵀ ∇ln p(y, φ⁰) + (1/2)(φ − φ⁰)ᵀ [∇² ln p(y, φ⁰)] (φ − φ⁰), (29)

where the gradient term (φ − φ⁰)ᵀ ∇ln p(y, φ⁰) is denoted as (◦).
The term (◦) is zero at the mode of the distribution where the gradient is zero;
therefore, expanding ln p(y, φ) around φ0 eliminates the term (◦) and yields:

ln p(y, φ) ≈ ln p(y, φ⁰) + (1/2)(φ − φ⁰)ᵀ [∇² ln p(y, φ⁰)] (φ − φ⁰). (30)
Exponentiate the above equation to obtain:

p(y, φ) ≈ p(y, φ⁰) exp{ −(1/2)(φ − φ⁰)ᵀ [−∇² ln p(y, φ⁰)] (φ − φ⁰) }. (31)

Realizing the first term of the above equation is a constant and the last term is the
variable part of a multivariate normal distribution with a mean vector of φ0 and a covariance
matrix Σ⁰ = [−∇² ln p(y, φ⁰)]⁻¹, the integration of p(y, φ) writes,

P(y) = p(y, φ⁰) · ∫_Φ exp{ −(1/2)(φ − φ⁰)ᵀ (Σ⁰)⁻¹ (φ − φ⁰) } dφ
     = p(y, φ⁰) √[ (2π)^(k+1) |Σ⁰| ]. (32)

Notice that θ is a k-dimensional vector, so φ = (θ, σ²) is a (k + 1)-dimensional vector. Using the result of Equation (32), the integral associated with the jth model in Equation (26) is:

P(y|Mⱼ) ≈ p( y, (θ, σ²)⁰ | Mⱼ ) √[ (2π)^(k+1) |Λⱼ⁰| ], (33)

where (θ, σ2 )0 is the mode of the distribution and Λ0j is the covariance matrix, i.e., the
inverse of the negative Hessian matrix of ln p(y, θ, σ2 | M j ) evaluated at (θ, σ2 )0 . Proper
numerical derivative methods based on finite difference schemes can achieve reliable
results for the gradient and the Hessian in the following equation:

(θ, σ²)⁰ = arg max_(θ,σ²) ln p(y, θ, σ²|Mⱼ),
Λ⁰ = [ −∇² ln p(y, θ, σ²|Mⱼ) |_(θ,σ²)⁰ ]⁻¹. (34)

Experience has shown that for a distribution with a single mode that is approximately symmetric, the Laplace approximation method can achieve accurate results for engineering applications [38,39].
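A minimal sketch of the Laplace approximation of Equations (32)–(34) is given below (an illustrative addition; the Nelder–Mead mode search and the finite-difference step size are implementation choices, not prescribed by the method):

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_evidence(log_joint, phi0, eps=1e-4):
    """Approximate ln P(y) = ln of the integral of p(y, phi), Equations (32)-(33).

    log_joint(phi) returns ln p(y, phi); phi0 starts the mode search.
    """
    res = minimize(lambda p: -log_joint(p), phi0, method="Nelder-Mead")
    mode, d = res.x, len(phi0)
    H = np.zeros((d, d))  # central finite-difference Hessian of ln p(y, phi)
    for i in range(d):
        for j in range(d):
            ei, ej = eps * np.eye(d)[i], eps * np.eye(d)[j]
            H[i, j] = (log_joint(mode + ei + ej) - log_joint(mode + ei - ej)
                       - log_joint(mode - ei + ej) + log_joint(mode - ei - ej)
                       ) / (4 * eps**2)
    cov = np.linalg.inv(-H)  # covariance Lambda^0 in Equation (34)
    _, logdet = np.linalg.slogdet(cov)
    return log_joint(mode) + 0.5 * (d * np.log(2 * np.pi) + logdet)
```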
It should be noted that the 1/σ^q-type of noninformative prior is not a proper prior, i.e., the normalizing constant for 1/σ^q is not bounded for σ ∈ (0, +∞); therefore, it is necessary to limit the support of 1/σ^q to a proper range so as to have a finite normalizing constant. Assume the effective range of σ is (σ₋, σ₊) for q ∈ {0, 1, ...}; the normalizing constant is:

Z(σ) = ln(σ₊) − ln(σ₋),                   q = 1
Z(σ) = (σ₊^(1−q) − σ₋^(1−q)) / (1 − q),   q ≠ 1. (35)

Substitute p(θ, σ²) with 1/(Z(σ)σ^q) in Equation (3). The likelihood of Equation (33) compares the fitting performance of different 1/σ^q-type priors.

3.2. Predictive Performance


The predictive performance for a model is measured by the likelihood of the data that
are not used for model parameter estimation. The likelihood of the future measurement
data ỹ with the posterior PDF of p(θ, σ2 |y) writes:
P(ỹ) = ∫ p(ỹ|θ, σ²) p(θ, σ²|y) dθ dσ²
     = (1/P(y)) ∫ p(ỹ|θ, σ²) p(y|θ, σ²) p(θ, σ²) dθ dσ², (36)
where P(y) is evaluated using Equation (33).
The comprehensive performance integrating both fitting performance and predictive
performance in terms of likelihood in the joint space of (y × ỹ) is:
P(ỹ)P(y) = ∫ p(ỹ|θ, σ²) p(y|θ, σ²) p(θ, σ²) dθ dσ². (37)

Realize that Equations (32), (36), and (37) can all be evaluated using the method of
Laplace approximation, and the performance of different priors can quantitatively be
compared. It is noted that the likelihood is the product of the fitting component and
the prediction component, and the influence of the prior is incorporated in the fitting
component, i.e., P(y).
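Reusing the laplace_log_evidence helper sketched above (the toy data and the Gaussian likelihood below are assumptions for demonstration), the fitting score P(y) and the comprehensive score P(ỹ)P(y) of Equation (37) reduce to two evidence evaluations, one on the fitting data and one on the combined data:

```python
import numpy as np
# Assumes laplace_log_evidence from the Section 3.1 sketch is in scope.

rng = np.random.default_rng(4)
y_fit = rng.normal(1.0, 0.5, size=6)  # data used for fitting
y_new = rng.normal(1.0, 0.5, size=1)  # held-out data for prediction
q = 2                                 # Jeffreys' prior 1/sigma^2

def make_log_joint(data):
    def log_joint(phi):
        mu, sigma = phi
        if sigma <= 0:
            return -np.inf
        # ln[prior * likelihood]; when comparing different q, the prior must be
        # normalized over (sigma-, sigma+) as in Equation (35)
        return (-q * np.log(sigma) - data.size * np.log(sigma)
                - np.sum((data - mu)**2) / (2 * sigma**2))
    return log_joint

log_P_y = laplace_log_evidence(make_log_joint(y_fit), np.array([1.0, 0.5]))
log_P_total = laplace_log_evidence(
    make_log_joint(np.concatenate([y_fit, y_new])), np.array([1.0, 0.5]))
print(log_P_y, log_P_total)  # ln P(y) and ln[P(y~)P(y)], Equation (37)
```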

4. Application Examples
To illustrate the proposed Bayesian inference with noninformative priors, two exam-
ples are presented. The results were compared with the classical least squares approach.
The fitting and comprehensive performance of the 1/σ^q-type of priors were evaluated
using the method of Laplace approximation.

4.1. Low-Cycle Fatigue Reliability Assessment with Sparse Samples


Noninformative priors offer an advantage over the flat prior in probabilistic inference under small sample sizes. To examine this usefulness in realistic problems and to further compare the performance of different noninformative priors, a fatigue life prediction problem is presented.
The low-cycle fatigue testing data reported in ASTM E739-10 [40] were used and
shown in Table 2. The cyclic strain amplitudes (∆ε/2) were (∼0.016, ∼0.0068, ∼0.0016,
∼0.0005), and each level had two or three data points. The following log-linear model
was adopted,
ln N = a0 + a1 ln(∆ε/2), (38)
where a0 and a1 are model parameters that need to be identified. It should be noted that
other variants of the above equation can be used to include material plasticity, the mean
stress effect, etc. The discrepancy between the experimental data and the model prediction
can be modeled using an uncertain variable with zero mean and a standard deviation of σe .
To compare with the regular least squares method, a Gaussian likelihood was assumed,
and the Bayesian posterior is:

p(a₀, a₁, σₑ|N) ∝ (1/σₑ^q) · (1/σₑ²)^(n/2) · exp[ −Σᵢ₌₁ⁿ (ln Nᵢ − a₀ − a₁ ln(∆εᵢ/2))² / (2σₑ²) ]. (39)

Table 2. Low-cycle fatigue testing data. Source: [40].

i       1        2        3        4        5        6        7        8        9
∆ε/2    0.01636  0.01609  0.00675  0.00682  0.00179  0.00160  0.00165  0.00053  0.00054
N       168      200      1000     1180     4730     8035     5254     28,617   32,650

The exponent q takes the values of 0, 1, 2, ... to generate a flat prior and different noninformative priors, and (∆εᵢ, Nᵢ), i = 1, ..., n, is the ith of the total n experimental data points used for parameter estimation. Notice that when q = 0, the above equation reduces to a regular least squares format. One data point (i = 6) was arbitrarily chosen for prediction performance evaluation, and the remaining eight data points were used to estimate the model parameters using Equation (39). To compare the performance of 1/σ^q-type priors, q = 0, ..., 5 were used to represent the regular least squares, Jeffreys', and ALI priors. The likelihood of the fitting performance and prediction performance was evaluated using the method of Laplace approximation, and the results are shown in Figure 2. The total likelihood of the fitting and prediction results indicates that the Jeffreys' prior 1/σ² outperformed the others. The result associated with q = 0 is the Bayesian version of the regular least squares estimation.
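The closed-form route can reproduce this computation (a sketch; the held-out index follows the text, while the strain level used for the prediction is an arbitrary illustration): the predictive distribution under the Jeffreys' prior is the Student-t of Equation (25), and its POF-quantile gives the fatigue life at a prescribed probability of failure:

```python
import numpy as np
from scipy.stats import t as student_t

# Table 2 data
strain = np.array([0.01636, 0.01609, 0.00675, 0.00682, 0.00179,
                   0.00160, 0.00165, 0.00053, 0.00054])
life = np.array([168, 200, 1000, 1180, 4730, 8035, 5254, 28617, 32650])

idx = np.delete(np.arange(9), 5)  # hold out data point i = 6 for prediction
X = np.column_stack([np.ones(idx.size), np.log(strain[idx])])
y = np.log(life[idx])

theta = np.linalg.solve(X.T @ X, X.T @ y)  # (a0, a1) of Equation (38)
SSE = float(np.sum((y - X @ theta)**2))
n, k = idx.size, 2

x_new = np.array([1.0, np.log(1e-3)])  # illustrative strain level 0.001
mean = x_new @ theta
scale = np.sqrt(SSE / (n - k) * (1 + x_new @ np.linalg.inv(X.T @ X) @ x_new))
N_pof = np.exp(mean + student_t.ppf(1e-5, n - k) * scale)  # POF = 1e-5 quantile
print(theta, N_pof)  # fitted parameters and the POF = 1e-5 life at this strain
```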

Figure 1. (a) MCMC samples drawn from the noninformative Bayesian posterior with Jeffreys' prior (q = 2) and (b) the fatigue life results at POF = 10⁻⁵ obtained with the regular least squares (LS) estimator and noninformative Bayesian (NB).

To further demonstrate the difference between the regular least squares and the noninformative Bayesian approach under this sample size, the following reliability problem was considered. The fatigue life given a prescribed probability of failure (POF) was calculated for the cyclic strain range (∆ε/2) between 10⁻⁴ and 10⁻¹. Given a cyclic strain range and a POF, the fatigue life N_POF can be expressed as:

Pr(N > N_POF) = 1 − POF. (40)

For the regular least squares estimator, it can be obtained using the t-statistics given by Equation (A9), where γ/2 should be replaced with the prescribed POF because only the lower one-sided bound is needed. For the Bayesian method with Jeffreys' prior, the one-sided bound can trivially be obtained using the POF-quantile of the prediction results evaluated using MCMC samples of the posterior. Figure 1 presents the comparisons of the fatigue life results, where an obvious difference is observed. The results of the noninformative Bayesian method were larger than those of regular least squares, indicating a longer fatigue life for the same risk level, i.e., 10⁻⁵ POF.

4.2. Aeroengine Turbine Disk Lifing and Failure Rate Assessment

The prescribed safe life is a critical parameter for airworthiness and service health management of turbine disks in aeroengines. To determine the prescribed safe life for turbine disks, life testing is performed using disk samples in a rotating pit. The number of tested pieces is usually less than five due to cost and time constraints; sometimes, only one or two test pieces are available. The determination of the prescribed safe life must be carefully made to mitigate the risk. A general and standard procedure described in [41] is centered on the idea that two sources of uncertainty, namely the sampling uncertainty and the life distribution uncertainty, should be incorporated. To account for the sampling uncertainty, the 95% single-sided lower bound of the sample mean is used as the mean of the life distribution. To account for the uncertainty of the life distribution itself, no more than one in seven hundred fifty service disks is allowed to develop an engineering crack. The total safety margin is a 1/750 cumulative failure probability at a 95% confidence.

Figure 2. (a) The fitting performance P(N) evaluated with different 1/σ^q priors and (b) the prediction performance P(Ñ)P(N) evaluated with different 1/σ^q priors. One data point Ñ is considered for the prediction.

The prescribed safe life is assumed to be a log-normal random variable, and the log-life
is a normally distributed variable. Denote the mean and standard deviation of the normal
distribution of the log-life as ln Nµ and σ. The 95% single-sided lower bound of the sample
mean can be expressed as:
ln Nµ^95% = ln Nµ − λα σ/√n, (41)

where Nµ is the geometric mean of the life, Nµ^95% is the single-sided lower bound of the geometric mean, λα = Φ⁻¹(1 − α), and Φ⁻¹(·) is the inverse cumulative density function (CDF) of the standard normal variable when the standard deviation σ is known. For the 95% single-sided lower bound, λα = Φ⁻¹(0.95) = 1.645. When the standard deviation is unknown, the estimate of the standard deviation based on the samples should be used. In such cases, the value of λα is obtained using the Student distribution with a degree of freedom (DOF) of n − 1, i.e., λα = t⁻¹(1 − α, n − 1). The function t⁻¹(·, n − 1) is the inverse CDF of the Student distribution with a DOF of n − 1. The 1/750 failure probability log-life corresponds to the −3σ shift from the mean of the log-life distribution. When using ln Nµ^95% as the mean of the log-life distribution, the 1/750 failure probability life N(µ−3σ)^95% writes:

ln N(µ−3σ)^95% = ln Nµ^95% − 3σ. (42)

Combine Equations (41) and (42) to have:


ln Nµ − ln N(µ−3σ)^95% = λα σ/√n + 3σ. (43)

The safety factor that ensures a 1/750 failure to a 95% confidence can be expressed by
transforming Equation (43) into the linear scale as:

S = Nµ / N(µ−3σ)^95% = exp[ (λα/√n + 3) σ ]. (44)

It is seen that the prescribed safe life depends on the number of tested samples n
and the standard deviation of the distribution. The deterministic prescribed safe life Ar is
finally obtained as:
Ar = Nµ /S. (45)
The cumulative burst probability when the disk is used y cycles can be expressed as:

F_b(y|x′) = ∫₀^y (1/(√(2π) σ_b)) exp[ −(t − x′)²/(2σ_b²) ] dt, (46)

where x′ is the log-transformed geometric mean of the samples and y is the log-transformed number of cycles, i.e., y = ln N. The failure rate in terms of the failures per cycle can be expressed as:

h(y|x′) = [∂F_b(y|x′)/∂y] / [1 − F_b(y|x′)] = f_b(y|x′) / [1 − F_b(y|x′)], (47)
where f_b(y|x′) is the burst PDF:

f_b(y|x′) = (1/(√(2π) σ_b)) exp[ −(y − x′)²/(2σ_b²) ]. (48)

The sampling uncertainty in the log-transformed geometric mean x′ can be quantified using the sampling PDF. The sample mean x′ distributes according to:

f_s(x′) = (1/(√(2π) σ_s)) exp[ −(x′ − ln Nµ)²/(2σ_s²) ],   σ known
x′ ~ ln Nµ + (s/√n) · t(n − 1),                            σ unknown, (49)

where s is the standard deviation of the sample, t(n − 1) is the standard Student distribution with a DOF of n − 1, and σ_s = σ/√n.
The final failure rate can be obtained by marginalizing out the sampling error variable x′ as:

h(y) = ∫ h(y|x′) f_s(x′) dx′. (50)

With the above discussion, turbine disk life testing results on two disks were obtained
as shown in Table 3.

Table 3. Spin rig testing results on two disks.

Disk No.    No. of Cycles
1           27,000
2           21,700

It is seen that the key information is the standard deviation of the population. For one thing, whether it is considered as a known quantity requires justification, and its actual value relies on empirical evidence and historical data. For another, when it needs to be estimated from the samples, a proper method should be used due to the limited sample size.

4.2.1. Existing Method


Previous disk life testing results showed that the scatter factor, defined as the ratio
of the maximum life (Nmax ) over the minimum life (Nmin ), is less than six. Furthermore,
Nmax and Nmin are considered as the +3σ and −3σ points of the normal distribution of
the log-life. The empirical evidence of Nmax/Nmin = 6, combined with the assumption that ln Nmax − ln Nmin = 6σ, yields:

σ = (1/6)(ln Nmax − ln Nmin) = (1/6) ln(Nmax/Nmin) = (ln 6)/6. (51)

The safety factor that ensures a 1/750 failure to a 95% confidence can be expressed, by
substituting Equation (51) into Equation (43) as,

S = Nµ / N(µ−3σ)^95% = 6^( 1/2 + 1.645/(6√n) ). (52)

Given the assumption that the standard deviation of the log-life distribution is known, the first case of Equation (49) is used. In addition, the term 1 − F_b(y|x′) ≈ 1 for rare events. In such a case, the hazard rate function of Equation (50) reduces to the convolution of two normal PDFs, and the resulting hazard rate function is:

h(y) = ∫ h(y|x′) f_s(x′) dx′ ≈ Norm(y; ln Nµ, σ_b² + σ_s²). (53)

Note that y = ln N; the hazard rate in terms of failures per flight hour is found by using the chain rule as,

h(H) = (1/H) · Norm(ln(βH); ln Nµ, σ_b² + σ_s²), (54)

where β is the exchange rate (cycles/hour) converting cycles to equivalent engine flight hours. Using the data in Table 3, the geometric mean is Nµ = 24,205 cycles, the standard deviation of the log-life distribution is σ_b = (ln 6)/6, and the standard deviation of the sampling distribution is σ_s = (ln 6)/(6√2). Substituting these parameters into Equation (45) yields the prescribed safe life Ar = 6981 cycles.
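These numbers can be verified directly from Table 3 and Equations (45), (51), and (52) (a sketch; small rounding differences against the reported values are expected):

```python
import numpy as np

lives = np.array([27000.0, 21700.0])           # Table 3 spin rig results
n = lives.size
ln_Nmu = np.log(lives).mean()                  # log of the geometric mean
sigma = np.log(6) / 6                          # Equation (51)
S = 6 ** (0.5 + 1.645 / (6 * np.sqrt(n)))      # Equation (52)
Ar = np.exp(ln_Nmu) / S                        # Equation (45)
print(round(np.exp(ln_Nmu)), round(S, 3), round(Ar))  # approx. 24,205, 3.467, 6981
```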

4.2.2. The Proposed Noninformative Bayesian Method


Without relying on the empirical evidence and the assumption, the noninformative
Bayesian estimator for (µ, σ ) of the log-life distribution can be obtained by writing the
posterior of the likelihood of samples with a noninformative prior. For a demonstration,
the prior of ∝ 1/σ² was used, and the posterior writes,

p(µ, σ|D) ∝ (1/σ²) · (1/σ²)^(n/2) · exp[ −Σᵢ₌₁ⁿ (ln Nᵢ − µ)² / (2σ²) ], (55)
where D denotes the event of observed life samples N₁, ..., Nₙ. To estimate µ and σ, 1 × 10⁷ samples were drawn from Equation (55) using the MCMC method. The resulting histogram of the samples is shown in Figure 3. The 95% single-sided confidence bound µ95% can be estimated using the 0.05-quantile of the MCMC samples as 9.86565, or 19,259 in the linear scale. Centering on µ95% = 9.86565, the value corresponding to a 1/750 cumulative failure probability was found by subtracting 3σ from µ95%. The term σ was estimated using the MCMC sample mean, and in this case, it was found to be 0.1893. Consequently, after applying the two safety margins, the prescribed safe life (in log scale) using the noninformative Bayesian method was found as µ95% − 3σ = 9.2976, or Ar = 10,911 cycles.
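A sketch of this estimation is given below (the sampler settings and chain length are illustrative; the paper used 1 × 10⁷ MCMC samples):

```python
import numpy as np

rng = np.random.default_rng(3)
lnN = np.log(np.array([27000.0, 21700.0]))  # Table 3 data
n = lnN.size

def log_post(mu, sigma):
    """ln p(mu, sigma|D) of Equation (55) with the prior proportional to 1/sigma^2."""
    if sigma <= 0:
        return -np.inf
    return -(n + 2) * np.log(sigma) - np.sum((lnN - mu)**2) / (2 * sigma**2)

mu, sigma, mus, sigmas = lnN.mean(), 0.2, [], []
for _ in range(500000):  # random-walk Metropolis
    mu_p = mu + 0.2 * rng.standard_normal()
    sigma_p = sigma + 0.1 * rng.standard_normal()
    if np.log(rng.uniform()) < log_post(mu_p, sigma_p) - log_post(mu, sigma):
        mu, sigma = mu_p, sigma_p
    mus.append(mu)
    sigmas.append(sigma)

mu95 = np.quantile(mus[50000:], 0.05)            # 95% single-sided lower bound
Ar = np.exp(mu95 - 3 * np.mean(sigmas[50000:]))  # both safety margins applied
print(mu95, Ar)  # the paper reports about 9.8657 and about 10,911 cycles
```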
The failure rates calculated using the existing method and the proposed noninforma-
tive Bayesian method are compared in Figure 4. The failure rate per flight hour was based
on an exchange rate β = 2.5.
Based on the results shown in Figure 4, the usable flight hours obtained using the
noninformative Bayesian method were about 1000 h more than those obtained using the
existing method given the same risk constraint. For example, the flight hours corresponding
to a failure rate of 10−8 /h were 2734 and 1734, respectively, for the two methods. When
the failure rate was required to be 10−6 /h, the usable flight hours were 3817 and 2720,
respectively. It should be noted that the proposed noninformative Bayesian method does
not rely on empirical evidence and assumptions other than the log-normal distribution for
disk life.

Figure 3. Samples drawn from the posterior PDF and histograms.

Figure 4. Comparisons of the failure rate results of the two methods. The solid line and the dashed line are results from the noninformative Bayesian (NB) and existing method (Ref.), respectively.

5. Conclusions
The study developed a noninformative Bayesian inference method for small sample problems. In particular, a 1/σ^q-type of prior was proposed to formulate the Bayesian posterior. It was shown that this type of prior can represent the classical Jeffreys' prior (Fisher information), flat prior, and asymptotically locally invariant prior. More importantly, this type of prior was derived as the limiting state of normal-inverse-Gamma (NIG) conjugate priors, allowing for fast and efficient evaluations of the Bayesian posteriors and predictors in closed-form expressions under a Gaussian likelihood. To illustrate the overall method and compare the performance of noninformative priors, a realistic fatigue reliability problem
was discussed in detail. The combined fitting and prediction performance in terms of
the likelihood was used to evaluate different priors. It was observed that Jeffreys’ prior,
1/σ2 , yielded the maximum likelihood. The developed method was further applied to
an actual aeroengine disk lifing problem with two test samples, and the results on the
prescribed safe life and risk in terms of failure/hour were comparable with those obtained
using the existing method incorporating the empirical evidence and assumptions. It should
be emphasized that the study did not promote abandoning justified experience and/or
information in prior construction. It was argued in this study that in the absence of those
data or experience, the noninformative prior should be used to avoid introducing unjus-
tified information into the prior. The following conclusions were drawn based on the
current results:
• The 1/σ^q-type of prior can be used equivalently as the NIG conjugate priors at their limiting states. The great features of conjugate priors, such as having analytical posterior and prediction PDFs under a Gaussian likelihood, can be retained when using noninformative priors for Bayesian linear regression analysis;
• For the 1/σ^q-type of noninformative prior, the classical Jeffreys' prior 1/σ² yielded an optimal fitting and prediction performance in terms of the likelihood or Bayes factors. When q = 0, the Bayesian estimator reduces to the regular least squares estimator. The results of the two case studies showed the advantage of using the noninformative Bayesian estimator with the prior of ∝ 1/σ² over the regular least squares estimator and other empirical methods under small sample sizes.

Author Contributions: Conceptualization, X.G.; Data curation, S.W.; Formal analysis, W.W. and
M.H.; Methodology, J.H., W.W. and X.G.; Resources, M.H. and S.W. All authors have read and agreed
to the published version of the manuscript.
Funding: The study was supported by National Natural Science Foundation of China, Nos. 11872088,
12088101, NSAF U1930403, CAEP CX20200035, and 2020-JCJQ-JJ-465. The support is gratefully acknowledged.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Acknowledgments: The study was supported by National Natural Science Foundation of China,
Nos. 11872088, 12088101, NSAF U1930403, CAEP CX20200035, and 2020-JCJQ-JJ-465. The support is
gratefully acknowledged.
Conflicts of Interest: The authors declare no conflict of interest.

Appendix A. Conjugate Prior, Posterior, and Prediction


Assume a general linear model with a parameter vector θ and a Gaussian likelihood
with a scale parameter σ2 , i.e., the modeling error ei = (yi − xi θ ) ∼ Norm(0, σ2 ).

Appendix A.1. Conjugate Prior


It is known that when the likelihood function belongs to the exponential family, a
conjugate prior exists and belongs also to the exponential family. To obtain the joint
conjugate using p(θ, σ2 ) = p(θ |σ2 ) p(σ2 ), it is necessary to look at the conjugate prior of σ2
first. For a Gaussian likelihood with an unknown variance σ2 and zero mean, the conjugate
prior for σ² is an inverse-Gamma distribution, i.e., p(σ²) ∼ IG(α, β), expressed as:

p(σ²) = (β^α / Γ(α)) (1/σ²)^(α+1) exp(−β/σ²), (A1)

where α > 0 and β > 0 are shape and scale parameters, respectively. The posterior
distribution of σ2 , given n data points, has the following new shape and scale parameters:
(α′, β′) = ( α + n/2, β + Σᵢⁿ (yᵢ − xᵢθ)²/2 ) = ( α + n/2, β + SSE/2 ), (A2)
where SSE = ∑in (yi − xi θ )2 is the sum of squared errors. Consider the joint conjugate
of (θ, σ2 ), a widely adopted conjugate prior for Bayesian linear regression is the normal-
inverse-Gamma (NIG) prior for (θ, σ2 ), and it can be written as Equation (A3).

p(θ, σ²) = p(θ|σ²) p(σ²) = NIG(µ_θ, Σ_θ, α, β)
         = Z · exp{ −(1/σ²) [ β + (1/2)(θ − µ_θ)ᵀ Σ_θ⁻¹ (θ − µ_θ) ] }
         ∝ Z′ · exp{ −(1/σ²) [ β + (1/2)(θ − µ_θ)ᵀ Σ_θ⁻¹ (θ − µ_θ) ] }, (A3)

where:

Z = [ β^α / ( (2π)^(k/2) Γ(α) √|Σ_θ| ) ] · (1/σ²)^(α+k/2+1), (A4)

and:

Z′ = (1/σ²)^(α+k/2+1). (A5)
Using the inverse-Gamma distribution for σ², the initial distribution parameters (α, β) still need to be determined. Unfortunately, there is no formal rule to do that. Gelman [35] discussed the inverse-Gamma distribution IG(e, e) where e was set to a low value such as 1, 0.01, or 0.001. A difficulty of this prior is that e must be set to a proper value. Inferences become very sensitive to e for cases where low values of σ² are possible. For this reason, the IG(e, e) family of priors for σ² was not recommended by the author. It was further argued that the resulting inference based on it for cases where σ² is estimated near zero was the case for which classical and Bayesian inferences differed the most [35]. Browne and Draper [42] also mentioned this disadvantage and suggested using a uniform prior U(0, 1/e) to reduce this defect. However, choosing an appropriate value of e for U(0, 1/e) still requires attention and understanding of the data. Indeed, the effect of the prior and its parameters is highly subject to the data and problem, and its sensitivity to Bayesian inferences based on sparse data or extensive data can be dramatically different as the data dominate the posterior due to the law of large numbers.

Appendix A.2. Posterior Distribution


The posterior distribution of (θ, σ2 ) is expressed as:

p(θ, σ²|y) = p(θ, σ²) p(y|θ, σ²) / p(y). (A6)

The term p(y) = ∫ p(θ, σ²) p(y|θ, σ²) dθ dσ² is the marginal distribution of the data. It can be shown that, after some algebraic operations, the posterior can be written as:

p(θ, σ²|y) ∝ (1/σ²)^(α*+k/2+1) exp{ −(1/σ²) [ β* + (1/2)(θ − µ*)ᵀ Σ*⁻¹ (θ − µ*) ] }, (A7)
where:

µ* = (Σ⁻¹ + xᵀx)⁻¹ (Σ⁻¹µ + xᵀy),
Σ* = (Σ⁻¹ + xᵀx)⁻¹,
α* = α + n/2,
β* = β + (1/2)( µᵀΣ⁻¹µ + yᵀy − µ*ᵀΣ*⁻¹µ* ).
The initial NIG conjugate parameter (µ, Σ, α, β) is updated to a new set of parameters
(µ∗ , Σ ∗ , α ∗ , β ∗ ).
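A compact sketch of this update is given below (an illustrative addition mirroring the equations above):

```python
import numpy as np

def nig_update(mu, Sigma, alpha, beta, X, y):
    """Conjugate NIG parameter update for Bayesian linear regression, (A7)."""
    Sigma_inv = np.linalg.inv(Sigma)
    precision_star = Sigma_inv + X.T @ X  # Sigma*^-1
    Sigma_star = np.linalg.inv(precision_star)
    mu_star = Sigma_star @ (Sigma_inv @ mu + X.T @ y)
    alpha_star = alpha + y.size / 2
    beta_star = beta + 0.5 * (mu @ Sigma_inv @ mu + y @ y
                              - mu_star @ precision_star @ mu_star)
    return mu_star, Sigma_star, alpha_star, beta_star
```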

Appendix A.3. Prediction


The Bayesian prediction of the response given the data y can be obtained using the
above distributions with a new set of input variables x̃. Denote the corresponding prediction
as ỹ. The prediction posterior distribution of ỹ can be expressed as:
p(ỹ|y) = ∫ p(ỹ|θ, σ²) p(θ, σ²|y) dθ dσ²
       = ∫ MVN(x̃θ, σ²I) × NIG(µ*, Σ*, α*, β*) dθ dσ² (A8)
       = MVT_(2α*)( x̃µ*, (β*/α*)(I + x̃Σ*x̃ᵀ) ).

When a single point is evaluated, Equation (A8) reduces to a one-dimensional Student distribution, and the confidence interval is readily evaluated using the inverse Student distribution. For multipoint simultaneous evaluations, the interval contours can be approximated using a few methods as suggested in [43].
For comparison, the simple linear regression results of the prediction mean and bounds are given. The prediction mean is x̃θ̂, and the prediction bounds of level (1 − γ)% are:

ỹ_(1−γ) = x̃θ̂ ± tinv(γ/2, n − k) · σ_y, (A9)

where θ̂ = (x T x)−1 (x T y) is the maximum likelihood estimator of θ. It is known that θ̂ is


also identical to the least squares estimator and the Bayesian estimator, i.e., Equation (23).
The term tinv(γ/2, n − k) represents the γ/2-percentile value of the standard Student
t-distribution with n − k degrees of freedom, and σy is the standard error of the prediction
given by:
σ_y² = ( eᵀe / (n − k) ) · [ I + x̃(xᵀx)⁻¹x̃ᵀ ], (A10)
where the term e = y − xθ̂ is the discrepancy vector between the model and observation,
also called the residual or error. As multiple points are estimated, the F-statistic can be
used instead of the t-statistic to yield the Working–Hotelling confidence intervals [44],
ỹ_(1−γ) = x̃θ̂ ± √( 2 Finv(γ, k, n − k) ) · σ_y, (A11)

where Finv(γ, k, n − k) represents the γ-percentile value of the F-distribution with (k, n − k)
degrees of freedom.
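A sketch of the frequentist bounds of Equations (A9) and (A11) follows (illustrative; the upper-tail orientation of the F-quantile is assumed for a (1 − γ) band):

```python
import numpy as np
from scipy.stats import t, f

def prediction_bounds(X, y, X_new, gamma=0.05, simultaneous=False):
    """Least squares prediction mean and (1 - gamma) bounds, (A9)/(A11)."""
    n, k = X.shape
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ theta_hat                          # residual vector
    s2 = (e @ e) / (n - k)                         # residual variance
    V = np.linalg.inv(X.T @ X)
    se = np.sqrt(s2 * (1.0 + np.sum((X_new @ V) * X_new, axis=1)))
    if simultaneous:                               # Working-Hotelling, (A11)
        w = np.sqrt(2 * f.ppf(1 - gamma, k, n - k))
    else:                                          # pointwise t bounds, (A9)
        w = t.ppf(1 - gamma / 2, n - k)
    mean = X_new @ theta_hat
    return mean, mean - w * se, mean + w * se
```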

Appendix B. Fisher Information Matrix and Kullback–Leibler Divergence


The equivalence of the Fisher information matrix and the Kullback–Leibler divergence
can be shown as follows. The matrix form of the Fisher information matrix of a continuous
PDF with a parameter vector θ writes,

I(θ) = E_θ[ ∇ln p_θ(x) · ∇ln p_θ(x)ᵀ ]. (A12)

Under the condition that the integration and derivation may be exchanged and the
log-function is twice differentiable, the following Equation (A13) can be attained.
∇² ln p_θ(x) = ∇²p_θ(x)/p_θ(x) − [∇p_θ(x) ∇p_θ(x)ᵀ] / [p_θ(x)]²
             = ∇²p_θ(x)/p_θ(x) − ∇ln p_θ(x) · ∇ln p_θ(x)ᵀ. (A13)

It is noted that:

E_θ[ ∇²p_θ(x)/p_θ(x) ] = ∫ ∇²p_θ(x) dx = ∇²1 = 0. (A14)

The following result, that is, the negative expected Hessian matrix of the log-function is equal to the Fisher information matrix, can be obtained:

I(θ) = E_θ[ ∇ln p_θ(x) · ∇ln p_θ(x)ᵀ ] = −E_θ[ ∇² ln p_θ(x) ], (A15)

where ∇² ln p_θ(x) is recognized as the Hessian matrix of ln p_θ(x), denoted as H_(ln p_θ(x)). The Fisher information matrix can loosely be thought of as the curvature matrix of the log-function graph.
The Kullback–Leibler divergence, also called the relative entropy, between two distri-
butions can be written as:

D(θ||θ′) = ∫ p_θ(x) ln[ p_θ(x)/p_θ′(x) ] dx = E_θ[ ln p_θ(x) − ln p_θ′(x) ]. (A16)

Expand ln p_θ′(x) around θ to the second order to have,

ln p_θ′(x) ≈ ln p_θ(x) + ∆θᵀ ∇ln p_θ(x) + (1/2) ∆θᵀ ∇² ln p_θ(x) ∆θ, (A17)

where ∆θ = (θ′ − θ).
Substitute Equation (A17) into Equation (A16):

D(θ||θ′) ≈ −E_θ[ ∆θᵀ ∇ln p_θ(x) ] − (1/2) E_θ[ ∆θᵀ ∇² ln p_θ(x) ∆θ ]. (A18)

The first term of the right-hand side of Equation (A18) is zero, and the second term of the right-hand side is related to Equation (A15). The KL divergence can finally be expressed as,

D(θ||θ′) ≈ (1/2) ∆θᵀ · I(θ) · ∆θ. (A19)
2
Furthermore, differentiate Equation (A16) with respect to θ 0 to obtain:

∇_θ′ D(θ||θ′) = ∇_θ′ E_θ[ ln p_θ(x) − ln p_θ′(x) ] = −E_θ[ ∇p_θ′(x) / p_θ′(x) ]. (A20)

Continue to differentiate Equation (A20) with respect to θ′ to obtain:

∇²_θ′ D(θ||θ′) = E_θ[ ∇p_θ′(x) ∇p_θ′(x)ᵀ / p_θ′(x)² − ∇²p_θ′(x) / p_θ′(x) ]. (A21)

Notice that the expectation of the second term in Equation (A21), when θ 0 = θ, reads,

E_θ[ ∇²p_θ′(x) / p_θ′(x) ] = ∫ ∇²p_θ(x) dx = ∇²1 = 0, (A22)
and the first term is the Fisher information matrix. As a result, the Fisher information matrix
is the Hessian matrix of the Kullback–Leibler distance evaluated at the true parameter θ.

∇²_θ′ D(θ||θ′) |_(θ′=θ) = I(θ). (A23)

References
1. Royall, R.M. The effect of sample size on the meaning of significance tests. Am. Stat. 1986, 40, 313–315.
2. Raudys, S.J.; Jain, A.K. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Trans.
Pattern Anal. Mach. Intell. 1991, 13, 252–264. [CrossRef]
3. Nelson, C.R.; Kim, M.J. Predictable stock returns: The role of small sample bias. J. Financ. 1993, 48, 641–661. [CrossRef]
4. Fan, X.; Thompson, B.; Wang, L. Effects of sample size, estimation methods, and model specification on structural equation
modeling fit indexes. Struct. Equ. Model. A Multidiscip. J. 1999, 6, 56–83. [CrossRef]
5. McNeish, D. On using Bayesian methods to address small sample problems. Struct. Equ. Model. A Multidiscip. J. 2016, 23, 750–773.
[CrossRef]
6. Winkler, R.L. Probabilistic prediction: Some experimental results. J. Am. Stat. Assoc. 1971, 66, 675–685. [CrossRef]
7. Melchers, R.E.; Beck, A.T. Structural Reliability Analysis and Prediction; John Wiley & Sons: Hoboken, NJ, USA, 2018.
8. He, J.; Chen, J.; Guan, X. Lifetime distribution selection for complete and censored multi-level testing data and its influence on
probability of failure estimates. Struct. Multidiscip. Optim. 2020, 62, 1–17. [CrossRef]
9. Du, Y.M.; Ma, Y.H.; Wei, Y.F.; Guan, X.; Sun, C. Maximum entropy approach to reliability. Phys. Rev. E 2020, 101, 012106.
[CrossRef]
10. Zhou, D.; He, J.; Du, Y.M.; Sun, C.; Guan, X. Probabilistic information fusion with point, moment and interval data in reliability
assessment. Reliab. Eng. Syst. Saf. 2021, 213, 107790. [CrossRef]
11. Gao, C.; Fang, Z.; Lin, J.; Guan, X.; He, J. Model averaging and probability of detection estimation under hierarchical uncertainties
for Lamb wave detection. Mech. Syst. Signal Process. 2022, 165, 108302. [CrossRef]
12. Gregory, P. Bayesian Logical Data Analysis for the Physical Sciences: A Comparative Approach with Mathematica Support; Cambridge
University Press: Cambridge, UK, 2005.
13. Guan, X.; Jha, R.; Liu, Y. Model selection, updating, and averaging for probabilistic fatigue damage prognosis. Struct. Saf. 2011,
33, 242–249. [CrossRef]
14. Hamdia, K.M.; Zhuang, X.; He, P.; Rabczuk, T. Fracture toughness of polymeric particle nanocomposites: Evaluation of models
performance using Bayesian method. Compos. Sci. Technol. 2016, 126, 122–129. [CrossRef]
15. Yang, J.; He, J.; Guan, X.; Wang, D.; Chen, H.; Zhang, W.; Liu, Y. A probabilistic crack size quantification method using in-situ
Lamb wave test and Bayesian updating. Mech. Syst. Signal Process. 2016, 78, 118–133. [CrossRef]
16. Wang, D.; He, J.; Guan, X.; Yang, J.; Zhang, W. A model assessment method for predicting structural fatigue life using Lamb
waves. Ultrasonics 2018, 84, 319–328. [CrossRef]
17. Guan, X.; He, J. Life time extension of turbine rotating components under risk constraints: A state-of-the-art review and case
study. Int. J. Fatigue 2019, 129, 104799. [CrossRef]
18. He, J.; Huo, H.; Guan, X.; Yang, J. A Lamb wave quantification model for inclined cracks with experimental validation. Chin. J.
Aeronaut. 2021, 34, 601–611. [CrossRef]
19. Huo, H.; He, J.; Guan, X. A Bayesian fusion method for composite damage identification using Lamb wave. Struct. Health Monit.
2020, 1475921720945000. [CrossRef]
20. Berger, J.O. Robust Bayesian analysis: Sensitivity to the prior. J. Stat. Plan. Inference 1990, 25, 303–328. [CrossRef]
21. Berger, J.O.; Moreno, E.; Pericchi, L.R.; Bayarri, M.J.; Bernardo, J.M.; Cano, J.A.; De la Horra, J.; Martín, J.; Ríos-Insúa, D.; Betrò, B.;
et al. An overview of robust Bayesian analysis. Test 1994, 3, 5–124. [CrossRef]
22. Kass, R.E.; Wasserman, L. The selection of prior distributions by formal rules. J. Am. Stat. Assoc. 1996, 91, 1343–1370. [CrossRef]
23. Bishop, P.; Bloomfield, R.; Littlewood, B.; Povyakalo, A.; Wright, D. Toward a formalism for conservative claims about the
dependability of software-based systems. IEEE Trans. Softw. Eng. 2010, 37, 708–717. [CrossRef]
24. Gelman, A.; Hennig, C. Beyond subjective and objective in statistics. J. R. Stat. Soc. Ser. A 2017, 180, 967–1033. [CrossRef]
25. Zhang, J.; Shields, M.D. The effect of prior probabilities on quantification and propagation of imprecise probabilities resulting
from small datasets. Comput. Methods Appl. Mech. Eng. 2018, 334, 483–506. [CrossRef]
26. Schervish, M.J.; DeGroot, M.H. Probability and Statistics; Pearson Education: London, UK, 2014.
27. Jaynes, E.T. Prior probabilities. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 227–241. [CrossRef]
28. Jaynes, E.; Bretthorst, G. Probability Theory: The Logic of Science; Cambridge University Press: Cambridge, UK, 2003.
29. Gelman, A.; Simpson, D.; Betancourt, M. The prior can often only be understood in the context of the likelihood. Entropy 2017,
19, 555. [CrossRef]
30. Jeffreys, H. An Invariant Form for the Prior Probability in Estimation Problems. Proc. R. Soc. Lond. Ser. A 1946, 186, 453–461.
31. Hartigan, J. Invariant prior distributions. Ann. Math. Stat. 1964, 35, 836–845. [CrossRef]
32. Bernardo, J.M. Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. Ser. B 1979, 41, 113–147. [CrossRef]
33. Berger, J.O.; Bernardo, J.M. Ordered group reference priors with application to the multinomial problem. Biometrika 1992,
79, 25–37. [CrossRef]
34. Jeffreys, H. The Theory of Probability; Oxford University Press: Oxford, UK, 1998.
35. Gelman, A. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper).
Bayesian Anal. 2006, 1, 515–534. [CrossRef]
36. Simpson, D.; Rue, H.; Riebler, A.; Martins, T.G.; Sørbye, S.H. Penalising model component complexity: A principled, practical
approach to constructing priors. Stat. Sci. 2017, 32, 1–28. [CrossRef]
37. Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [CrossRef]
38. Guan, X.; He, J.; Jha, R.; Liu, Y. An efficient analytical Bayesian method for reliability and system response updating based on
Laplace and inverse first-order reliability computations. Reliab. Eng. Syst. Saf. 2012, 97, 1–13. [CrossRef]
39. He, J.; Guan, X.; Jha, R. Improve the accuracy of asymptotic approximation in reliability problems involving multimodal
distributions. IEEE Trans. Reliab. 2016, 65, 1724–1736. [CrossRef]
40. American Society for Testing and Materials. ASTM E739-10(2015)-Standard Practice for Statistical Analysis of Linear or Linearized
Stress-Life (S-N) and Strain-Life (ε-N) Fatigue Data; ASTM International: West Conshohocken, PA, USA, 2015.
41. Boyd-Lee, A.; Harrison, G.; Henderson, M. Evaluation of standard life assessment procedures and life extension methodologies
for fracture-critical components. Int. J. Fatigue 2001, 23, 11–19. [CrossRef]
42. Browne, W.J.; Draper, D. A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Anal.
2006, 1, 473–514. [CrossRef]
43. Kotz, S.; Nadarajah, S. Multivariate T-Distributions and Their Applications; Cambridge University Press: Cambridge, UK, 2004.
44. Graybill, F.A.; Bowden, D.C. Linear segment confidence bands for simple linear models. J. Am. Stat. Assoc. 1967, 62, 403–408.
[CrossRef]
