Article
Bayesian Inference under Small Sample Sizes Using General
Noninformative Priors
Jingjing He 1, Wei Wang 1, Min Huang 1, Shaohua Wang 2 and Xuefei Guan 3,*
1 School of Reliability and Systems Engineering, Beihang University, Beijing 100191, China;
hejingjing@buaa.edu.cn (J.H.); w.wei@buaa.edu.cn (W.W.); buaahuang@163.com (M.H.);
2 China Aviation Power Plant Research Institute, Zhuzhou 412002, China; wangshaohua@buaa.edu.cn
3 Graduate School of China Academy of Engineering Physics, Beijing 100193, China
* Correspondence: xfguan@gscaep.ac.cn
Abstract: This paper proposes a Bayesian inference method for problems with small sample sizes. A
general type of noninformative prior is proposed to formulate the Bayesian posterior. It is shown that
this type of prior can represent a broad range of priors such as classical noninformative priors and
asymptotically locally invariant priors and can be derived as the limiting states of normal-inverse-
Gamma conjugate priors, allowing for analytical evaluations of Bayesian posteriors and predictors.
The performance of different noninformative priors under small sample sizes is compared using
the likelihood combining both fitting and prediction performances. Laplace approximation is used
to evaluate the likelihood. A realistic fatigue reliability problem was used to illustrate the method.
Following that, an actual aeroengine disk lifing application with two test samples is presented, and
the results are compared with the existing method.
p(φ|x) ∝ p(φ) p(x|φ).   (1)
The forward inference, e.g., the PDF of a certain variable or the probability of an event involving φ, can be made on the basis of the posterior PDF. Using the method of Markov chain Monte Carlo (MCMC), samples can be drawn directly from the posterior distribution without knowing the normalizing constant p(x) = ∫ p(φ) p(x|φ) dφ in Equation (1). The Bayesian method has been successfully demonstrated in many important disciplines [12–19].
The choice of p(φ) can have a great influence on the inference result. The proper choice of priors has been extensively discussed in the probability and statistics communities, and it
can never be overlooked as it is one of the fundamental pieces of Bayesian inference [20–26].
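As noted above, MCMC sampling needs only the unnormalized product p(φ)p(x|φ), since the constant p(x) cancels in the acceptance ratio. A minimal random-walk Metropolis sketch (toy data and a flat prior; all values are hypothetical and only illustrate the mechanics):

```python
import math, random

random.seed(0)

# Toy data: estimate the mean mu of a Gaussian with known sigma = 1.
data = [1.2, 0.8, 1.5, 0.9, 1.1]

def log_unnorm_posterior(mu):
    # Flat prior p(mu) ∝ 1, so the unnormalized posterior is the likelihood.
    return -0.5 * sum((x - mu) ** 2 for x in data)

# Random-walk Metropolis: only ratios of the unnormalized posterior are
# needed, so the normalizing constant p(x) never has to be evaluated.
mu, samples = 0.0, []
for _ in range(20000):
    prop = mu + random.gauss(0.0, 0.5)
    if math.log(random.random()) < log_unnorm_posterior(prop) - log_unnorm_posterior(mu):
        mu = prop
    samples.append(mu)

burned = samples[2000:]
post_mean = sum(burned) / len(burned)
# Under a flat prior the analytical posterior mean is the sample mean, 1.1.
```
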
For one thing, the formal rule of constructing a prior regardless of the data and likelihood
is sought in the field of physics [9,27]. Jaynes and Bretthorst [28] argued that “problems of
inference are ill-posed until we recognize three essential things. (A) The prior probabilities
represent our prior information, and are to be determined, not by introspection but by logical
analysis of that information. (B) Since the final conclusions depend necessarily on both the
prior information and the data, it follows that, in formulating a problem, one must specify
the prior information to be used just as fully as one specifies the data. (C) Our goal is that
inferences are to be completely ‘objective’ in the sense that two persons with the same prior
information must assign the same prior probabilities.” (p. 373). For another, the choice of a
prior may highly depend on data and likelihood in practice. Gelman et al. [29] argued that
“a prior can in general only be interpreted in the context of the likelihood with which it will
be paired” (p. 12).
To ensure a consistent and objective inference, rules for constructing priors with minimal
subjective constraints are sought. Early work on the construction of such priors is based on the
“ignorance” over the parameter space using invariance techniques [27,30,31]. The fundamental
reasoning is that the priors should carry the same amount of information such that a change
of scale and/or a shift of location do not affect the inference results on those parameters. The
ignorant prior can systematically be derived using the concept of the transformation group.
Using different transformation groups, different priors can be obtained. For simplicity, such
priors are loosely referred to as noninformative priors. One of the most notable priors for
a scale parameter σ of a distribution is Jeffreys’ prior, i.e., p(σ ) ∝ 1/σ. Jeffreys’ prior can
be obtained using the tool of the transformation group under the condition that a change of
scale does not change that state of knowledge. A further extension of the noninformative
priors is “reference priors” [32,33]. The theoretical framework has been discussed in many
studies, including, but not limited to, [31,34]. Apart from the noninformative approach
to derive the priors, there are several less objective approaches to construct the priors.
Gelman [35] proposed the weakly informative priors based on the idea of conditional
conjugacy for hierarchical Bayesian models. The conditionally conjugate priors provide
some computational convenience as a Gibbs sampler can be used to draw samples from a
posterior distribution, and some parameters for inverse-Gamma distributions in Bayesian
applications are suggested. Simpson et al. [36] proposed a method to build priors. The
basic idea is to penalize the complexity induced by deviating from a simpler base model
using the Kullback–Leibler divergence as a metric [37].
For critical problems with small sample sizes, the choice of the prior in Bayesian inference can have a great impact on the inference results, rendering decision-making unreliable. Despite a great deal of research, including the work cited above, the choice of an optimal prior is still nontrivial from a practical point of view, and a systematic method to cope with such cases is rarely seen. Moreover, a
quantitative measure to evaluate the performance of a prior is equally important to identify
the optimal prior. This study develops a noninformative Bayesian approach to probabilistic
inference under small sample sizes. A general 1/σq -type of noninformative prior for the
location-scale family of distributions is proposed. This type of prior is further shown to
be the limiting state of the commonly used normal-inverse-Gamma conjugate prior, thus
allowing for an analytical evaluation of the posterior of the parameter. Given a linear model
with a Gaussian error variable, the analytical form of the posterior of the model prediction
can also be obtained. The informative or noninformative nature of the prior is centered
more on the concept of whether the prior encodes subjective information; therefore, it is
less relevant to the likelihood function (and data). The subjective information is the source
of inconsistency in the inference result as different practitioners can use different subjective
information in the form of assumptions, which in most cases may only be justified by
oneself. However, the prior has to be built with a certain state of knowledge. This state
of knowledge, or loosely called information, is not related to experience nor assumptions,
but is a result of a logical reasoning process [27]. The purpose of the study is not to argue
Mathematics 2021, 9, 2810 3 of 20
whether these assumptions are justified or not, nor to question the suitable assumptions
made during the inference. This study provides an alternative to existing methods in
such cases to avoid introducing subjectiveness into the Bayesian inference. In particular,
when the number of samples is small, the influence of the prior can be as important as the
likelihood (and data). In such cases, subjective information encoded in the prior can lead
to biased results.
The remainder of the paper is organized as follows. First, the Bayesian model with
noninformative priors is developed, in particular, a general 1/σq -type of noninformative
prior for the location-scale family of distributions is proposed for small sample problems.
The noninformative priors are further shown as the limiting states of the normal-inverse-
Gamma (NIG) conjugate priors. The closed-form expressions of the Bayesian posterior
and predictors using the proposed noninformative priors are obtained under a Gaussian
likelihood. Next, a performance measure considering both the fitting performance and
predictive performance is proposed using the concept of Bayes factors. Different priors are
treated as models in a Bayesian hypothesis testing context for comparisons. Following that,
the method is illustrated using two examples.
y_i = x_i θ + e_i,   (2)
where θ is a k-dimensional column vector and e_i are independent and identically distributed
random error variables. The distribution of ei determines the likelihood function or vice
versa. Without loss of generality, the distribution belongs to a location-scale family of
distributions. Furthermore, it is enough to use a single scale parameter to characterize the
distribution of e since any constant nonzero mean, no matter known or unknown, can be
grouped into θ. Denote the scale parameter of e as σ2 . A Bayesian model incorporates
both the prior information and the observation through Bayes’ rule. Using the matrix form
y = xθ + e, the Bayesian posterior of (θ, σ2 ) writes:
The common Gaussian error variable, i.e., e_i ∼ N(0, σ²), corresponds to the following likelihood function for the n observations y = (y1, y2, ..., yn)ᵀ and the n × k input matrix x = (x1, x2, ..., xn)ᵀ, where x_i, i = 1, ..., n, is a row vector of size k.
It should be noted that the error variable does not necessarily follow a Gaussian PDF,
and other types of error distributions can be used. For example, the extreme-value PDF for
the error corresponds to a Weibull likelihood for the log-transformed model prediction.
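This correspondence can be checked numerically: if the error on the log scale follows a smallest-extreme-value (Gumbel) distribution, the untransformed quantity is Weibull. A short sketch with SciPy, using hypothetical location and scale values µ = 2.0 and s = 0.5:

```python
import numpy as np
from scipy.stats import gumbel_l, weibull_min

mu, s = 2.0, 0.5              # hypothetical location and scale on the log scale
c, lam = 1.0 / s, np.exp(mu)  # implied Weibull shape and scale

t = np.linspace(0.5, 30.0, 200)
# PDF of T when ln T ~ smallest-extreme-value(mu, s), via change of variables:
# f_T(t) = f_{ln T}(ln t) / t.
pdf_from_gumbel = gumbel_l.pdf(np.log(t), loc=mu, scale=s) / t
pdf_weibull = weibull_min.pdf(t, c, scale=lam)

max_abs_diff = np.max(np.abs(pdf_from_gumbel - pdf_weibull))
```

The two PDFs agree pointwise, confirming that a smallest-extreme-value error on the log-transformed prediction is equivalent to a Weibull likelihood on the original scale.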
where det(·) is the determinant operator and I(·) is the Fisher information matrix. The key
feature of it is invariance under the monotone transformation of φ. This feature is achieved
by using the change of variables theorem. Denote the reparameterized variable or vector
as ψ; it can be shown that:
p(ψ) = p(φ) |det(∂φ_i/∂ψ_j)| = √det I(ψ),   (6)
p(x|µ, σ²) = (1/√(2πσ²)) exp[−(x − µ)²/(2σ²)].   (7)
and E(·) is the expectation operator. Using algebraic deduction and integration, I(µ, σ2 ) is
simplified to:
I(µ, σ²) = [ 1/σ²      0
              0     1/(2σ⁴) ].   (10)

As a result, Jeffreys' prior for (µ, σ²) is:

p(µ, σ²) ∝ √det I(µ, σ²) ∝ 1/σ³.   (11)
It is noted that the distribution of interest is (µ, σ²), not (µ, σ); therefore, the derivative is taken with respect to σ² as a whole instead of σ. In other words, the parameter space is on σ², not σ. For example, the Jeffreys' prior for σ², with a fixed value of µ, is:

p(σ²) ∝ 1/σ².   (12)
The Jeffreys’ prior for (µ, σ ) and σ with a fixed µ are ∝ 1/σ2 and ∝ 1/σ, respectively.
It is shown in Appendix B that the Fisher information matrix is also the Hessian matrix
of the Kullback–Leibler (KL) distance of a deviated distribution with respect to the true
distribution evaluated at the true parameters. For exponential families of distributions,
the KL distance has analytical forms, allowing for the evaluation of the Fisher information
matrix without involving integrals in Equation (9).
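For the Gaussian case, this identity is easy to verify numerically: the finite-difference Hessian of the KL distance, evaluated at the true parameters, reproduces the Fisher information matrix of Equation (10). A sketch with hypothetical true parameters:

```python
import numpy as np

mu0, v0 = 1.0, 2.0  # hypothetical true mean and variance (v = sigma^2)

def kl(mu, v):
    # KL( N(mu0, v0) || N(mu, v) ) in the (mu, sigma^2) parametrization.
    return 0.5 * np.log(v / v0) + (v0 + (mu - mu0) ** 2) / (2.0 * v) - 0.5

# Fisher information = Hessian of the KL distance at the true parameters.
h = 1e-4
d2_mu = (kl(mu0 + h, v0) - 2 * kl(mu0, v0) + kl(mu0 - h, v0)) / h**2
d2_v = (kl(mu0, v0 + h) - 2 * kl(mu0, v0) + kl(mu0, v0 - h)) / h**2
d2_cross = (kl(mu0 + h, v0 + h) - kl(mu0 + h, v0 - h)
            - kl(mu0 - h, v0 + h) + kl(mu0 - h, v0 - h)) / (4 * h**2)

fisher = np.array([[d2_mu, d2_cross], [d2_cross, d2_v]])
expected = np.array([[1 / v0, 0.0], [0.0, 1 / (2 * v0**2)]])  # Equation (10)
```
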
The asymptotically locally invariant (ALI) prior is another type of prior that satisfies
the invariance under certain transformations. An ALI prior can be uniquely determined
using the following equation according to [31]:

∂ ln p(φ)/∂φ = −E[f₁ f₂]/E[f₂],   (13)

where f₁ and f₂ denote the first and second derivatives of the log-likelihood with respect to φ, and the expectations are taken under the sampling distribution.
For the scale parameter σ of a normal distribution with fixed µ, Equation (13) evaluates to:

∂ ln p(σ)/∂σ = −3/σ   (14)

to obtain the ALI prior for σ:

p(σ) ∝ 1/σ³.   (15)
Similarly, the ALI prior for σ² is:

p(σ²) ∝ 1/σ⁴.   (16)
When both µ and σ are random, the joint ALI prior for (µ, σ ) is:
p(µ, σ) ∝ 1/σ⁵.   (17)
It is noticed that the Jeffreys’, ALI, and uniform priors are reproduced from ∝ 1/σq as
q takes different integer values. In the following, the derivation of a 1/σq prior from NIG
conjugates is shown.
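The ALI condition of Equation (13) can be checked numerically for the scale parameter of a Gaussian, assuming (as the notation suggests) that f₁ and f₂ are the first and second derivatives of the log-likelihood with respect to the parameter and that the expectations are taken under the true model. The right-hand side then evaluates to −3/σ, consistent with Equation (14):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma = 1.7  # arbitrary test value

def f1(x):  # first derivative of log N(x|0, sigma^2) w.r.t. sigma
    return -1.0 / sigma + x**2 / sigma**3

def f2(x):  # second derivative w.r.t. sigma
    return 1.0 / sigma**2 - 3.0 * x**2 / sigma**4

def expect(g):
    # Expectation under the true model N(0, sigma^2).
    return quad(lambda x: g(x) * norm.pdf(x, scale=sigma), -np.inf, np.inf)[0]

lhs = -expect(lambda x: f1(x) * f2(x)) / expect(f2)
# Equation (14): d ln p(sigma) / d sigma = -3 / sigma, i.e. p(sigma) ∝ 1/sigma^3.
```
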
2.2. Derivation of the 1/σq Priors as the Limiting States of NIG Conjugates under Gaussian Likelihood
In Bayesian models, for the given likelihood function, the posterior p(φ| x ) and the
prior p(φ) are called conjugate distributions if both are in the same family of distributions.
The prior can be viewed as being reconditioned by encoding the evidence through the likelihood; the evidence or data merely change the distribution parameters of the prior and yield another distribution of the same type, but with a different set of parameters.
For a location-scale parameter vector (θ, σ2 ) used in linear or linearized Bayesian mod-
els with a Gaussian likelihood, the corresponding conjugate prior is the NIG distribution.
The closed-form posterior PDFs for the parameter and prediction are given in Appendix A.
Noninformative priors, including the Jeffreys', ALI, and reference priors, for (θ, σ²) are mostly in the form of 1/σ^q, q ∈ {1, 2, ...}. The uniform prior can be seen as the special case of 1/σ^q with q = 0. It is shown as follows that these noninformative priors can be obtained as
certain limiting states of NIG conjugates of (θ, σ2 ). For example, the NIG distribution with
parameters (α, β, Σ, µ) given by Equation (A3) can reduce to the Jeffreys’ prior:
p(θ, σ²) → 1/σ²   (18)

as:

α → −k/2
β → 0⁺
Σ⁻¹ → 0
|µ| < ∞.   (19)
Furthermore, every 1/σq-type of prior can be derived as a reduced NIG distribution.
By assigning the initial values for α, β, and Σ of the NIG distribution, different q values
are obtained. Table 1 presents priors with different q values as reduced NIG distributions
and the corresponding α, β, and Σ of the NIG distributions and α∗ , β∗ , and Σ∗ of the
resulting NIG posterior distributions. The posterior distribution of σ2 is an inverse-Gamma
distribution IG(α∗ , β∗ ), and the PDF is given by:
p(σ²|y) = [β*^α* / Γ(α*)] (1/σ²)^(α*+1) exp(−β*/σ²).   (20)
Table 1. 1/σq -type of prior as limiting state cases of the NIG distribution and its NIG posteriors.
The posterior distribution of θ is obtained by integrating out σ2 from the joint posterior
distribution of Equation (A6) as,
p(θ|y) = ∫ p(θ, σ²|y) dσ²
       ∝ ∫ (1/σ²)^(α*+1) exp{ −(1/σ²) [ β* + (1/2)(θ − µ*)ᵀ Σ*⁻¹ (θ − µ*) ] } dσ²   (21)
       ∝ [ 1 + (1/(2β*)) (θ − µ*)ᵀ Σ*⁻¹ (θ − µ*) ]^(−(2α*+k)/2),
For example, with the Jeffreys' prior 1/σ², the posterior distributions of σ² and θ become:

p(σ²|y) = IG( (n − k)/2, SSE/2 ),   (22)
and:
p(θ|y) = MVT_{n−k}( (xᵀx)⁻¹(xᵀy), (SSE/(n − k)) (xᵀx)⁻¹ ),   (23)
respectively. The term SSE in Equation (22) is the sum of squared errors, given by:
SSE = ∑_{i=1}^{n} (y_i − x_i θ)².   (24)
The advantage of treating the 1/σq -type of noninformative prior as reduced NIG
conjugates is that the Bayesian posteriors of the model prediction and parameters all
have analytical forms, allowing for efficient evaluations without resorting to the MCMC
techniques, as done in regular Bayesian analysis. It is worth noting that the analytical
results are obtained under the condition (or assumption) that the likelihood is a Gaussian.
This condition implies that the conjugate prior is an NIG. The values of the NIG parameters
are obtained by equating the noninformative prior to the NIG prior. In other words,
equating the two distributions for the purpose of mathematical convenience is the premise
of resolving those parameter values.
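Because the posteriors in Equations (22)–(24) are standard distributions, they can be evaluated directly without MCMC. A minimal sketch with SciPy on hypothetical small-sample data (n = 8, k = 2):

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(1)

# Hypothetical small-sample linear model y = x theta + e (n = 8, k = 2).
n, k = 8, 2
x = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
theta_true = np.array([1.0, 2.0])
y = x @ theta_true + 0.1 * rng.standard_normal(n)

# Closed-form posterior under the Jeffreys prior 1/sigma^2, Equations (22)-(24):
theta_hat = np.linalg.solve(x.T @ x, x.T @ y)  # center of the multivariate-t posterior of theta
sse = np.sum((y - x @ theta_hat) ** 2)         # sum of squared errors
post_sigma2 = invgamma(a=(n - k) / 2, scale=sse / 2)  # p(sigma^2 | y)

# Analytical posterior mean of sigma^2 for an IG(a, b) is b/(a-1) = SSE/(n-k-2).
mean_sigma2 = post_sigma2.mean()
```

The scale matrix of the multivariate-t posterior of θ would be (SSE/(n − k))·(xᵀx)⁻¹, per Equation (23).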
The comparison of two models can be made based on the ratio of the posterior
probabilities of the models:
P(M₁|y)/P(M₂|y) = [P(y|M₁)/P(y|M₂)] · [P(M₁)/P(M₂)] = B₁₂ · P(M₁)/P(M₂).   (27)
The ratio, when the prior probabilities of the models are equal, is reduced to the Bayes
factor of Equation (26). The assessment of the Bayes factors involves two integrals over
the parameter space of (θ, σ2 ). The integrands for models M1 and M2 are p(y|θ, σ2 , M1 ) ·
p(θ, σ2 | M1 ) and p(y|θ, σ2 , M2 ) · p(θ, σ2 | M2 ), respectively.
For general multidimensional integration, asymptotic approximation or simulation-
based estimation are two commonly used methods. The Laplace approximation method
evaluates the integral as a multivariate normal distribution. Consider the above integrand
term in Equation (26), i.e., p(y|θ, σ2 , M1 ) · p(θ, σ2 | M1 ). Drop the model symbol for sim-
plicity of the derivation, and denote (θ, σ2 ) as φ. The natural logarithm of the integrand,
p(y|φ) p(φ), i.e., p(y, φ), can be expressed using Taylor expansion around its mode φ0 as:
The term (◦) is zero at the mode of the distribution where the gradient is zero;
therefore, expanding ln p(y, φ) around φ0 eliminates the term (◦) and yields:
ln p(y, φ) ≈ ln p(y, φ⁰) + (1/2)(φ − φ⁰)ᵀ [∇² ln p(y, φ⁰)] (φ − φ⁰).   (30)
Exponentiate the above equation to obtain:
p(y, φ) ≈ p(y, φ⁰) exp{ −(1/2)(φ − φ⁰)ᵀ [−∇² ln p(y, φ⁰)] (φ − φ⁰) }.   (31)
Realizing the first term of the above equation is a constant and the last term is the
variable part of a multivariate normal distribution with a mean vector of φ0 and a covariance
matrix Σ0 = [−∇2 ln p(y, φ0 )]−1 , the integration of p(y, φ) writes,
P(y) = p(y, φ⁰) · ∫ exp{ −(1/2)(φ − φ⁰)ᵀ Σ⁰⁻¹ (φ − φ⁰) } dφ
     = p(y, φ⁰) √( (2π)^(k+1) |Σ⁰| ).   (32)
where (θ, σ2 )0 is the mode of the distribution and Λ0j is the covariance matrix, i.e., the
inverse of the negative Hessian matrix of ln p(y, θ, σ2 | M j ) evaluated at (θ, σ2 )0 . Proper
numerical derivative methods based on finite difference schemes can achieve reliable
results for the gradient and the Hessian in the following equation.
(θ, σ²)⁰ = arg max_{(θ, σ²)} ln p(y, θ, σ²|M_j),

Λ⁰ = [ −∇² ln p(y, θ, σ²|M_j) ]⁻¹ evaluated at (θ, σ²)⁰,   (34)
Experience has shown that for a unimodal and approximately symmetric distribution, the Laplace approximation method can achieve sufficiently accurate results for engineering applications [38,39].
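The procedure above (locate the mode, form the negative Hessian by finite differences, apply the Gaussian integral of Equation (32)) can be sketched as follows. The integrand here is deliberately quadratic in the log so that the Laplace approximation is exact and easy to check; all values are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Log of an unnormalized integrand; for a quadratic log the Laplace
# approximation is exact, which makes the sketch easy to verify.
A = np.array([[2.0, 0.3], [0.3, 1.0]])  # hypothetical precision matrix
m = np.array([0.5, -1.0])

def log_p(phi):
    d = phi - m
    return -0.5 * d @ A @ d + 3.0  # +3.0 is an arbitrary constant offset

# Step 1: find the mode (first line of Equation (34)).
res = minimize(lambda p: -log_p(p), x0=np.zeros(2))
mode = res.x

# Step 2: negative Hessian at the mode by central finite differences.
h, dim = 1e-4, 2
H = np.zeros((dim, dim))
for i in range(dim):
    for j in range(dim):
        ei, ej = np.eye(dim)[i] * h, np.eye(dim)[j] * h
        H[i, j] = (log_p(mode + ei + ej) - log_p(mode + ei - ej)
                   - log_p(mode - ei + ej) + log_p(mode - ei - ej)) / (4 * h**2)
cov = np.linalg.inv(-H)

# Step 3: Equation (32): P = p(mode) * sqrt((2*pi)^dim * det(cov)).
P = np.exp(log_p(mode)) * np.sqrt((2 * np.pi) ** dim * np.linalg.det(cov))

exact = np.exp(3.0) * np.sqrt((2 * np.pi) ** dim / np.linalg.det(A))
```
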
It should be noted that the 1/σq -type of noninformative prior is not a proper prior, i.e.,
the normalizing constant for 1/σq is not bounded for σ ∈ (0, +∞); therefore, it is necessary
to limit the support of 1/σq to a proper range and to have a finite normalizing constant.
Assume the effective range of σ is (σ₋, σ₊) for q ∈ {0, 1, ...}; the normalizing constant is:

Z(σ) = { ln(σ₊) − ln(σ₋),                  q = 1
       { (σ₊^(1−q) − σ₋^(1−q))/(1 − q),    q ≠ 1.   (35)
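Equation (35) translates directly into code; a quick check against the analytical antiderivative of σ^(−q) confirms both branches:

```python
import math

def z_const(q, sig_lo, sig_hi):
    """Normalizing constant of the truncated prior 1/sigma^q on (sig_lo, sig_hi)."""
    if q == 1:
        return math.log(sig_hi) - math.log(sig_lo)
    return (sig_hi ** (1 - q) - sig_lo ** (1 - q)) / (1 - q)

# For q = 2 the integral of sigma^{-2} on (0.1, 10) is 1/0.1 - 1/10 = 9.9.
z2 = z_const(2, 0.1, 10.0)
# For q = 1 it is ln(10/0.1) = ln(100).
z1 = z_const(1, 0.1, 10.0)
```
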
Substitute p(θ, σ²) with 1/(Z(σ)σ^q) in Equation (3). The likelihood of Equation (33) compares the fitting performance of the different 1/σ^q-type priors.
P(ỹ) = ∫ p(ỹ|θ, σ²) p(θ, σ²|y) dθ dσ²
     = (1/P(y)) ∫ p(ỹ|θ, σ²) p(y|θ, σ²) p(θ, σ²) dθ dσ²,   (36)
where P(y) is evaluated using Equation (33).
The comprehensive performance integrating both fitting performance and predictive
performance in terms of likelihood in the joint space of (y × ỹ) is:
P(ỹ) P(y) = ∫ p(ỹ|θ, σ²) p(y|θ, σ²) p(θ, σ²) dθ dσ².   (37)
Realize that Equations (32), (36), and (37) can all be evaluated using the method of
Laplace approximation, and the performance of different priors can quantitatively be
compared. It is noted that the likelihood is the product of the fitting component and
the prediction component, and the influence of the prior is incorporated in the fitting
component, i.e., P(y).
4. Application Examples
To illustrate the proposed Bayesian inference with noninformative priors, two exam-
ples are presented. The results were compared with the classical least squares approach.
The fitting and comprehensive performance of the 1/σq -type of priors were evaluated
using the method of Laplace approximation.
Table 2. Experimental fatigue data used in the example.

i       1        2        3        4        5        6        7        8        9
∆ε/2    0.01636  0.01609  0.00675  0.00682  0.00179  0.00160  0.00165  0.00053  0.00054
N       168      200      1000     1180     4730     8035     5254     28,617   32,650
The exponent q takes the values of 0, 1, 2, ... to generate a flat prior and different
noninformative priors, and (∆ε i , Ni ), i = 1, ..., n is the ith experimental data points of the
total n data points used for parameter estimation.
Notice that when q = 0, the above equation reduces to a regular least squares format. One data point (i = 6) was arbitrarily chosen for prediction performance evaluation, and the remaining eight data points were used to estimate the model parameters using Equation (39). To compare the performance of the 1/σ^q-type priors, q = 0, ..., 5 were used to represent the regular least squares, Jeffreys', and ALI priors. The likelihood of the fitting performance
and prediction performance was evaluated using the method of Laplace approximation,
and results are shown in Figure 1. The total likelihood of the fitting and prediction results
indicates that the Jeffreys’ prior 1/σ2 outperformed the others. The result associated with
q = 0 is the Bayesian version of the regular least squares estimation.
Figure 1. (a) MCMC samples drawn from the noninformative Bayesian posterior with Jeffreys’ prior (q = 2) and (b) the
fatigue life results at POF = 10−5 obtained with the regular least squares (LS) estimator and noninformative Bayesian (NB).
To further demonstrate the difference between the regular least squares and noninfor-
mative Bayesian approach under this sample size, the following reliability problem was
considered. The fatigue life given a prescribed probability of failure (POF) was calculated
for the cyclic strain range (∆ε/2) between 10−4 and 10−1 . Given a cyclic strain range and a
POF, the fatigue life NPOF can be expressed as:
For the regular least squares estimator, it can be obtained using the t-statistics given
by Equation (A9) where γ/2 should be replaced with the prescribed POF because only the
lower one-sided bound is needed. For the Bayesian model with Jeffreys' prior, the one-sided bound can trivially be obtained using the POF-quantile of the prediction results evaluated with MCMC samples of the posterior. Figure 2 presents the comparisons of the fatigue life results, where an obvious difference is observed. The results of the noninformative Bayesian method were larger than those of the regular least squares, indicating a longer fatigue life at the same risk level, i.e., a POF of 10⁻⁵.
constraints; sometimes, only one or two test pieces are available. The determination of the
prescribed safe life must be carefully made to mitigate the risk. A general and standard
procedure described in [41] is centered on the idea that two sources of uncertainty, namely
the sampling uncertainty and the life distribution uncertainty, should be incorporated. To
account for the sampling uncertainty, the 95% single-sided lower bound of the sample
mean is used as the mean of the life distribution. To account for the uncertainty of the
life distribution itself, no more than one in seven-hundred fifty service disks is allowed
to develop an engineering crack. The total safety margin is a 1/750 cumulative failure
probability to a 95% confidence.
Figure 2. (a) The fitting performance P( N ) evaluated with different 1/σq priors and (b) the prediction performance
P( Ñ ) P( N ) evaluated with different 1/σq priors. One data point Ñ is considered for the prediction.
The prescribed safe life is assumed to be a log-normal random variable, and the log-life
is a normally distributed variable. Denote the mean and standard deviation of the normal
distribution of the log-life as ln Nµ and σ. The 95% single-sided lower bound of the sample
mean can be expressed as:
ln Nµ^95% = ln Nµ − λα σ/√n,   (41)
where Nµ is the geometric mean of the life, Nµ^95% is the 95% single-sided lower bound of the geometric mean, λα = Φ⁻¹(1 − α), and Φ⁻¹(·) is the inverse cumulative distribution function (CDF) of the standard normal variable when the standard deviation σ is known. For the 95% single-sided lower bound, λα = Φ⁻¹(0.95) = 1.645. When the standard deviation is
unknown, the estimate of the standard deviation based on the samples should be used.
In such cases, the value of λα is obtained using the Student distribution with a degree of
freedom (DOF) of n − 1, i.e., λα = t−1 (1 − α, n − 1). The function t−1 (·, n − 1) is the inverse
CDF of the Student distribution with a DOF of n − 1. The 1/750 failure probability log-life
corresponds to the −3σ shift from the mean of the log-life distribution. When using ln Nµ^95% as the mean of the log-life distribution, the 1/750 failure probability life Nµ,−3σ^95% writes:

ln Nµ,−3σ^95% = ln Nµ^95% − 3σ.   (42)
The safety factor that ensures a 1/750 failure to a 95% confidence can be expressed by
transforming Equation (43) into the linear scale as:
S = Nµ / Nµ,−3σ^95% = exp[ (λα/√n + 3) σ ].   (44)
It is seen that the prescribed safe life depends on the number of tested samples n
and the standard deviation of the distribution. The deterministic prescribed safe life Ar is
finally obtained as:
Ar = Nµ /S. (45)
The cumulative burst probability when the disk is used y cycles can be expressed as:
" #
( t − x 0 )2
Z y
0 1
Fb (y| x ) = √ exp − dt, (46)
0 2πσb 2σb2
where x 0 is the log-transformed geometric mean of the samples and y is the log-transformed
number of cycles, i.e., y = ln N. The failure rate in terms of the failures per cycle can be
expressed as:
h(y|x′) = [∂Fb(y|x′)/∂y] / [1 − Fb(y|x′)] = f_b(y|x′) / [1 − Fb(y|x′)],   (47)
where f b (y| x 0 ) is the burst PDF:
" #
0 1 ( y − x 0 )2
f b (y| x ) = √ exp − . (48)
2πσb 2σb2
where s is the standard deviation of the sample, t(n − 1) is the standard Student distribution with a DOF of n − 1, and σs = σ/√n.
The final failure rate can be obtained by marginalizing out the sampling error variable x′ as:

h(y) = ∫ h(y|x′) f_s(x′) dx′.   (50)
With the above discussion, turbine disk life testing results on two disks were obtained
as shown in Table 3.
It is seen that the key information is the standard deviation of the population. For
one thing, whether it is considered as a known quantity requires a justification, and its
actual value relies on empirical evidence and historical data. For another, when it needs
to be estimated from the samples, the proper method should be used due to its limited
sample size.
the log-life. The empirical evidence of Nmax/Nmin = 6, combined with the assumption that ln Nmax − ln Nmin = 6σ, yields:

σ = (ln Nmax − ln Nmin)/6 = (1/6) ln(Nmax/Nmin) = (ln 6)/6.   (51)
The safety factor that ensures a 1/750 failure to a 95% confidence can be expressed, by substituting Equation (51) into Equation (43), as:

S = Nµ / Nµ,−3σ^95% = 6^( 1/2 + 1.645/(6√n) ).   (52)
Given the assumption that the standard deviation of the log-life distribution is known, the first equation of the right-hand side in Equation (49) is used. In addition, the term 1 − Fb(y|x′) ≈ 1 for rare events. In such a case, the hazard rate function of Equation (50) reduces to the convolution of two normal PDFs, and the resulting hazard rate function is:

h(y) = ∫ h(y|x′) f_s(x′) dx′ ≈ Norm(y; ln Nµ, σb² + σs²).   (53)
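The reduction in Equation (53) is the standard result that the convolution of two normal PDFs is a normal PDF with the variances added. It can be verified numerically with hypothetical values of σb and σs:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Hypothetical values for the burst and sampling standard deviations.
ln_nmu, sigma_b, sigma_s = 10.0, 0.30, 0.21
y = 9.5

# Left-hand side of Equation (53): marginalize the sampling variable x'
# (integration limits wide enough to cover the sampling distribution).
lhs = quad(lambda xp: norm.pdf(y, loc=xp, scale=sigma_b)
           * norm.pdf(xp, loc=ln_nmu, scale=sigma_s), 5.0, 15.0)[0]

# Right-hand side: a single normal PDF with the variances added.
rhs = norm.pdf(y, loc=ln_nmu, scale=np.sqrt(sigma_b**2 + sigma_s**2))
```
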
Note that y = ln N; the hazard rate in terms of failure per flight hour is found by using
the chain rule as:

h(H) = (1/H) · Norm(ln(βH); ln Nµ, σb² + σs²),   (54)
where β is the exchange rate (cycles/hour) converting the cycle to equivalent engine flight
hours. Using the data in Table 3, the geometric mean of the life is Nµ = 24,205 cycles, the standard deviation of the log-life distribution is σb = (ln 6)/6, and the standard deviation of the sampling distribution is σs = (ln 6)/(6√2). Substituting these parameters into Equation (45) gives the prescribed safe life Ar = 6981 cycles.
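The quoted safe life can be reproduced from Equations (44), (45), and (51), taking n = 2 test disks and the geometric mean life of 24,205 cycles from Table 3:

```python
import math

n = 2                      # number of tested disks
n_mu = 24205.0             # geometric mean life in cycles (Table 3)
sigma = math.log(6) / 6    # log-life std. dev. from ln(Nmax/Nmin) = 6*sigma
lam = 1.645                # 95% one-sided standard-normal quantile

# Equation (44): safety factor for a 1/750 failure probability at 95% confidence.
s_factor = math.exp((lam / math.sqrt(n) + 3.0) * sigma)

# Equation (45): prescribed safe life, ~6981 cycles as quoted in the text.
a_r = n_mu / s_factor
```
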
respectively. It should be noted that the proposed noninformative Bayesian method does
not rely on empirical evidence and assumptions other than the log-normal distribution for
disk life.
Figure 4. Comparisons of the failure rate results of the two methods. The solid line and the dashed
line are results from the noninformative Bayesian (NB) and existing method (Ref.), respectively.
5. Conclusions
The study developed a noninformative Bayesian inference method for small sample
problems. In particular, a 1/σq -type of prior was proposed to formulate the Bayesian pos-
terior. It was shown that this type of prior can represent the classical Jeffreys’ prior (Fisher
information), flat prior, and asymptotic locally invariant prior. More importantly, this type
of prior was derived as the limiting state of normal-inverse-Gamma (NIG) conjugate priors,
allowing for fast and efficient evaluations of the Bayesian posteriors and predictors in
closed-form expressions under a Gaussian likelihood. To illustrate the overall method and
compare the performance of noninformative priors, a realistic fatigue reliability problem
was discussed in detail. The combined fitting and prediction performance in terms of
the likelihood was used to evaluate different priors. It was observed that Jeffreys’ prior,
1/σ2 , yielded the maximum likelihood. The developed method was further applied to
an actual aeroengine disk lifing problem with two test samples, and the results on the
prescribed safe life and risk in terms of failure/hour were comparable with those obtained
using the existing method incorporating the empirical evidence and assumptions. It should
be emphasized that the study did not promote abandoning justified experience and/or
information in prior construction. It was argued in this study that in the absence of those
data or experience, the noninformative prior should be used to avoid introducing unjus-
tified information into the prior. The following conclusions were drawn based on the
current results:
• The 1/σq -type of prior can be used equivalently as the NIG conjugate priors at their
limiting states. The great features of conjugate priors, such as having analytical
posterior and prediction PDFs under a Gaussian likelihood, can be retained when
using noninformative priors for Bayesian linear regression analysis;
• For the 1/σq -type of noninformative prior, the classical Jeffreys’ prior 1/σ2 yielded an
optimal fitting and prediction performance in terms of the likelihood or Bayes factors.
When q = 0, the Bayesian estimator reduces to the regular least squares estimator.
The results of the two case studies showed the advantage of using the noninformative
Bayesian estimator with the prior of ∝ 1/σ2 over the regular least squares estimator
and other empirical methods under small sample sizes.
Author Contributions: Conceptualization, X.G.; Data curation, S.W.; Formal analysis, W.W. and
M.H.; Methodology, J.H., W.W. and X.G.; Resources, M.H. and S.W. All authors have read and agreed
to the published version of the manuscript.
Funding: The study was supported by National Natural Science Foundation of China, Nos. 11872088,
12088101, NSAF U1930403, CAEP CX20200035, and 2020-JCJQ-JJ-465. The support is gratefully acknowledged.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
where α > 0 and β > 0 are shape and scale parameters, respectively. The posterior
distribution of σ2 , given n data points, has the following new shape and scale parameters:
(α′, β′) = ( α + n/2,  β + (1/2) ∑_{i}^{n} (y_i − x_i θ)² ) = ( α + n/2,  β + SSE/2 ),   (A2)
where SSE = ∑in (yi − xi θ )2 is the sum of squared errors. Consider the joint conjugate
of (θ, σ2 ), a widely adopted conjugate prior for Bayesian linear regression is the normal-
inverse-Gamma (NIG) prior for (θ, σ2 ), and it can be written as Equation (A3).
where:

Z = [ β^α / ( (2π)^(k/2) Γ(α) √|Σθ| ) ] (1/σ²)^(α+k/2+1),   (A4)

and:

Z₀ = (1/σ²)^(α+k/2+1).   (A5)
Using the inverse-Gamma distribution for σ², the initial distribution parameters (α, β) still need to be determined. Unfortunately, there is no formal rule for doing so. Gelman [35] discussed the inverse-Gamma distribution IG(e, e), where e is set to a low value such as 1, 0.01, or 0.001. A difficulty with this prior is that e must be set to a proper value: inferences become very sensitive to e in cases where low values of σ² are possible. For this reason, the IG(e, e) family of priors for σ² was not recommended by the author. It was further argued that cases where σ² is estimated near zero are those for which classical and Bayesian inferences differ the most [35]. Browne and Draper [42] also noted this disadvantage and suggested using a uniform prior U(0, 1/e) to mitigate it. However, choosing an appropriate value of e for U(0, 1/e) still requires attention to and understanding of the data. Indeed, the effect of the prior and its parameters is highly problem- and data-dependent, and the sensitivity of Bayesian inferences can differ dramatically between sparse and extensive data, since with extensive data the likelihood dominates the posterior by the law of large numbers.
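This sensitivity is easy to reproduce numerically. Under the conjugate update of Equation (A2), the posterior of σ² given an IG(e, e) prior is IG(e + n/2, e + SSE/2), whose mean is β′/(α′ − 1). In the sketch below (hypothetical numbers, chosen so that the true σ² is near zero), the posterior mean shifts by roughly two orders of magnitude as e ranges over the values discussed in [35]:

```python
def ig_posterior_mean(e, n, sse):
    """Posterior mean of sigma^2 under an IG(e, e) prior after observing n
    residuals with sum of squared errors sse; the mean of IG(a, b) is b/(a-1)."""
    a, b = e + n / 2.0, e + sse / 2.0  # conjugate update as in Equation (A2)
    return b / (a - 1.0)

# A small-sample case with near-zero residual variance (assumed values)
n, sse = 5, 0.01
means = {e: ig_posterior_mean(e, n, sse) for e in (1.0, 0.01, 0.001)}
```

With n = 5 and SSE = 0.01, the posterior mean is about 0.40 for e = 1 but about 0.004 for e = 0.001, which is the sensitivity Gelman [35] warns about.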
$$ p(\theta, \sigma^2 \mid y) = \frac{p(\theta, \sigma^2)\, p(y \mid \theta, \sigma^2)}{p(y)}. \tag{A6} $$

The term $p(y) = \int p(\theta, \sigma^2)\, p(y \mid \theta, \sigma^2)\, \mathrm{d}\theta\, \mathrm{d}\sigma^2$ is the marginal distribution of the data. It can be shown, after some algebraic operations, that the posterior can be written as:
$$ p(\theta, \sigma^2 \mid y) \propto \left(\frac{1}{\sigma^2}\right)^{\alpha^* + k/2 + 1} \exp\left\{ -\frac{1}{\sigma^2}\left[\beta^* + \frac{1}{2}\left(\theta - \mu^*\right)^{\mathrm{T}} \Sigma^{*-1} \left(\theta - \mu^*\right)\right] \right\}, \tag{A7} $$
where
$$ \begin{aligned} \mu^* &= \left(\Sigma^{-1} + x^{\mathrm{T}} x\right)^{-1} \left(\Sigma^{-1}\mu + x^{\mathrm{T}} y\right), \\ \Sigma^* &= \left(\Sigma^{-1} + x^{\mathrm{T}} x\right)^{-1}, \\ \alpha^* &= \alpha + n/2, \\ \beta^* &= \beta + \tfrac{1}{2}\left(\mu^{\mathrm{T}} \Sigma^{-1} \mu + y^{\mathrm{T}} y - \mu^{*\mathrm{T}} \Sigma^{*-1} \mu^*\right). \end{aligned} $$

The initial NIG conjugate parameters (µ, Σ, α, β) are updated to the new set (µ*, Σ*, α*, β*).
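This update can be written compactly with NumPy. The sketch below is illustrative (function name and test data are not from the paper); as a sanity check, with a very diffuse prior (large Σ), µ* approaches the ordinary least squares estimate:

```python
import numpy as np

def nig_update(mu, Sigma, alpha, beta, x, y):
    """Update NIG(mu, Sigma, alpha, beta) to (mu*, Sigma*, alpha*, beta*)
    given the design matrix x (n x k) and response vector y (n,)."""
    Sinv = np.linalg.inv(Sigma)
    Sigma_s = np.linalg.inv(Sinv + x.T @ x)
    mu_s = Sigma_s @ (Sinv @ mu + x.T @ y)
    alpha_s = alpha + len(y) / 2.0
    # mu*^T Sigma*^{-1} mu* computed via the precision (Sinv + x^T x)
    beta_s = beta + 0.5 * (mu @ Sinv @ mu + y @ y
                           - mu_s @ (Sinv + x.T @ x) @ mu_s)
    return mu_s, Sigma_s, alpha_s, beta_s

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 2))
y = x @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=20)

# Diffuse prior: mu = 0, Sigma = 1e6 * I, so mu* should be close to OLS
mu_s, Sigma_s, alpha_s, beta_s = nig_update(
    np.zeros(2), 1e6 * np.eye(2), 2.0, 1.0, x, y)
ols, *_ = np.linalg.lstsq(x, y, rcond=None)
```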
where Finv(γ, k, n − k) represents the γ-percentile value of the F-distribution with (k, n − k)
degrees of freedom.
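For reference, the percentile Finv(γ, k, n − k) is available in standard statistical libraries; a minimal sketch assuming SciPy, where `scipy.stats.f.ppf` is the inverse CDF of the F-distribution:

```python
from scipy.stats import f

def Finv(gamma, k, n_minus_k):
    """gamma-percentile of the F-distribution with (k, n - k) degrees of freedom."""
    return f.ppf(gamma, k, n_minus_k)

q95 = Finv(0.95, 2, 10)  # e.g., the 95th percentile for k = 2, n - k = 10
```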
Under the condition that integration and differentiation may be exchanged and that the log-function is twice differentiable, the following Equation (A13) can be obtained.
$$ \begin{aligned} \nabla^2 \ln p_\theta(x) &= \frac{\nabla^2 p_\theta(x)}{p_\theta(x)} - \frac{\nabla p_\theta(x)\, \nabla p_\theta(x)^{\mathrm{T}}}{[p_\theta(x)]^2} \\ &= \frac{\nabla^2 p_\theta(x)}{p_\theta(x)} - \nabla \ln p_\theta(x) \cdot \nabla \ln p_\theta(x)^{\mathrm{T}}. \end{aligned} \tag{A13} $$
It is noted that:
$$ \mathbb{E}_\theta\left[\frac{\nabla^2 p_\theta(x)}{p_\theta(x)}\right] = \int \nabla^2 p_\theta(x)\, \mathrm{d}x = \nabla^2 1 = 0. \tag{A14} $$
The following result, namely that the negative expected Hessian matrix of the log-function equals the Fisher information matrix, can then be obtained:
$$ \mathcal{I}(\theta) = \mathbb{E}_\theta\left[\nabla \ln p_\theta(x) \cdot \nabla \ln p_\theta(x)^{\mathrm{T}}\right] = -\mathbb{E}_\theta\left[\nabla^2 \ln p_\theta(x)\right], \tag{A15} $$
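Equation (A15) can be checked numerically for a simple model. For x ~ N(θ, 1), the score is ∇ ln p_θ(x) = (x − θ) and the Hessian of the log-density is the constant −1, so both sides of Equation (A15) equal 1. A Monte Carlo sketch with NumPy (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.7
x = rng.normal(theta, 1.0, size=200_000)

score = x - theta                              # grad_theta ln p for N(theta, 1)
fisher_lhs = np.mean(score ** 2)               # E[ score * score^T ] (scalar here)
fisher_rhs = -np.mean(np.full_like(x, -1.0))   # -E[ Hessian ln p ], Hessian = -1
```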
$$ \ln p_{\theta'}(x) \approx \ln p_\theta(x) + (\theta - \theta')^{\mathrm{T}} \nabla \ln p_\theta(x) + \frac{1}{2} (\theta - \theta')^{\mathrm{T}} \nabla^2 \ln p_\theta(x) (\theta - \theta'). \tag{A17} $$
Substitute Equation (A17) into Equation (A16):

$$ D(\theta \,\|\, \theta') \approx -\mathbb{E}_\theta\left[\Delta\theta^{\mathrm{T}} \nabla \ln p_\theta(x)\right] - \frac{1}{2} \mathbb{E}_\theta\left[\Delta\theta^{\mathrm{T}} \nabla^2 \ln p_\theta(x)\, \Delta\theta\right], \tag{A18} $$

where ∆θ = (θ − θ′). The first term on the right-hand side of Equation (A18) is zero, and the second term is related to Equation (A15). The KL divergence can finally be expressed as:

$$ D(\theta \,\|\, \theta') \approx \frac{1}{2} \Delta\theta^{\mathrm{T}}\, \mathcal{I}(\theta)\, \Delta\theta. \tag{A19} $$
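The quadratic form in Equation (A19) is exact for a normal mean with known variance: D(N(θ, σ²) ‖ N(θ′, σ²)) = ∆θ²/(2σ²) and I(θ) = 1/σ². A minimal numerical sketch (the values of σ and ∆θ are arbitrary):

```python
def kl_normal_mean(mu0, mu1, sigma):
    """Exact KL divergence D(N(mu0, sigma^2) || N(mu1, sigma^2))."""
    return (mu0 - mu1) ** 2 / (2.0 * sigma ** 2)

sigma, dtheta = 2.0, 0.05
exact = kl_normal_mean(1.0, 1.0 - dtheta, sigma)
fisher = 1.0 / sigma ** 2             # Fisher information for the mean of N(., sigma^2)
approx = 0.5 * dtheta ** 2 * fisher   # right-hand side of Equation (A19)
```

For this family the two sides agree to machine precision; for general distributions the agreement holds only to second order in ∆θ.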
Furthermore, differentiate Equation (A16) with respect to θ′ to obtain:
Notice that the expectation of the second term in Equation (A21), when θ′ = θ, reads:
$$ \mathbb{E}_\theta\left[\frac{\nabla^2 p_{\theta'}(x)}{p_{\theta'}(x)}\right] = \int \nabla^2 p_\theta(x)\, \mathrm{d}x = \nabla^2 1 = 0, \tag{A22} $$
and the first term is the Fisher information matrix. As a result, the Fisher information matrix
is the Hessian matrix of the Kullback–Leibler distance evaluated at the true parameter θ.
References
1. Royall, R.M. The effect of sample size on the meaning of significance tests. Am. Stat. 1986, 40, 313–315.
2. Raudys, S.J.; Jain, A.K. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Trans.
Pattern Anal. Mach. Intell. 1991, 13, 252–264. [CrossRef]
3. Nelson, C.R.; Kim, M.J. Predictable stock returns: The role of small sample bias. J. Financ. 1993, 48, 641–661. [CrossRef]
4. Fan, X.; Thompson, B.; Wang, L. Effects of sample size, estimation methods, and model specification on structural equation
modeling fit indexes. Struct. Equ. Model. A Multidiscip. J. 1999, 6, 56–83. [CrossRef]
5. McNeish, D. On using Bayesian methods to address small sample problems. Struct. Equ. Model. A Multidiscip. J. 2016, 23, 750–773.
[CrossRef]
6. Winkler, R.L. Probabilistic prediction: Some experimental results. J. Am. Stat. Assoc. 1971, 66, 675–685. [CrossRef]
7. Melchers, R.E.; Beck, A.T. Structural Reliability Analysis and Prediction; John Wiley & Sons: Hoboken, NJ, USA, 2018.
8. He, J.; Chen, J.; Guan, X. Lifetime distribution selection for complete and censored multi-level testing data and its influence on
probability of failure estimates. Struct. Multidiscip. Optim. 2020, 62, 1–17. [CrossRef]
9. Du, Y.M.; Ma, Y.H.; Wei, Y.F.; Guan, X.; Sun, C. Maximum entropy approach to reliability. Phys. Rev. E 2020, 101, 012106.
[CrossRef]
10. Zhou, D.; He, J.; Du, Y.M.; Sun, C.; Guan, X. Probabilistic information fusion with point, moment and interval data in reliability
assessment. Reliab. Eng. Syst. Saf. 2021, 213, 107790. [CrossRef]
11. Gao, C.; Fang, Z.; Lin, J.; Guan, X.; He, J. Model averaging and probability of detection estimation under hierarchical uncertainties
for Lamb wave detection. Mech. Syst. Signal Process. 2022, 165, 108302. [CrossRef]
12. Gregory, P. Bayesian Logical Data Analysis for the Physical Sciences: A Comparative Approach with Mathematica Support; Cambridge
University Press: Cambridge, UK, 2005.
13. Guan, X.; Jha, R.; Liu, Y. Model selection, updating, and averaging for probabilistic fatigue damage prognosis. Struct. Saf. 2011,
33, 242–249. [CrossRef]
14. Hamdia, K.M.; Zhuang, X.; He, P.; Rabczuk, T. Fracture toughness of polymeric particle nanocomposites: Evaluation of models
performance using Bayesian method. Compos. Sci. Technol. 2016, 126, 122–129. [CrossRef]
15. Yang, J.; He, J.; Guan, X.; Wang, D.; Chen, H.; Zhang, W.; Liu, Y. A probabilistic crack size quantification method using in-situ
Lamb wave test and Bayesian updating. Mech. Syst. Signal Process. 2016, 78, 118–133. [CrossRef]
16. Wang, D.; He, J.; Guan, X.; Yang, J.; Zhang, W. A model assessment method for predicting structural fatigue life using Lamb
waves. Ultrasonics 2018, 84, 319–328. [CrossRef]
17. Guan, X.; He, J. Life time extension of turbine rotating components under risk constraints: A state-of-the-art review and case
study. Int. J. Fatigue 2019, 129, 104799. [CrossRef]
18. He, J.; Huo, H.; Guan, X.; Yang, J. A Lamb wave quantification model for inclined cracks with experimental validation. Chin. J.
Aeronaut. 2021, 34, 601–611. [CrossRef]
19. Huo, H.; He, J.; Guan, X. A Bayesian fusion method for composite damage identification using Lamb wave. Struct. Health Monit.
2020, 1475921720945000. [CrossRef]
20. Berger, J.O. Robust Bayesian analysis: Sensitivity to the prior. J. Stat. Plan. Inference 1990, 25, 303–328. [CrossRef]
21. Berger, J.O.; Moreno, E.; Pericchi, L.R.; Bayarri, M.J.; Bernardo, J.M.; Cano, J.A.; De la Horra, J.; Martín, J.; Ríos-Insúa, D.; Betrò, B.;
et al. An overview of robust Bayesian analysis. Test 1994, 3, 5–124. [CrossRef]
22. Kass, R.E.; Wasserman, L. The selection of prior distributions by formal rules. J. Am. Stat. Assoc. 1996, 91, 1343–1370. [CrossRef]
23. Bishop, P.; Bloomfield, R.; Littlewood, B.; Povyakalo, A.; Wright, D. Toward a formalism for conservative claims about the
dependability of software-based systems. IEEE Trans. Softw. Eng. 2010, 37, 708–717. [CrossRef]
24. Gelman, A.; Hennig, C. Beyond subjective and objective in statistics. J. R. Stat. Soc. Ser. A 2017, 180, 967–1033. [CrossRef]
25. Zhang, J.; Shields, M.D. The effect of prior probabilities on quantification and propagation of imprecise probabilities resulting
from small datasets. Comput. Methods Appl. Mech. Eng. 2018, 334, 483–506. [CrossRef]
26. Schervish, M.J.; DeGroot, M.H. Probability and Statistics; Pearson Education: London, UK, 2014.
27. Jaynes, E.T. Prior probabilities. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 227–241. [CrossRef]
28. Jaynes, E.; Bretthorst, G. Probability Theory: The Logic of Science; Cambridge University Press: Cambridge, UK, 2003.
29. Gelman, A.; Simpson, D.; Betancourt, M. The prior can often only be understood in the context of the likelihood. Entropy 2017,
19, 555. [CrossRef]
30. Jeffreys, H. An Invariant Form for the Prior Probability in Estimation Problems. Proc. R. Soc. Lond. Ser. A 1946, 186, 453–461.
31. Hartigan, J. Invariant prior distributions. Ann. Math. Stat. 1964, 35, 836–845. [CrossRef]
32. Bernardo, J.M. Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. Ser. B 1979, 41, 113–147. [CrossRef]
33. Berger, J.O.; Bernardo, J.M. Ordered group reference priors with application to the multinomial problem. Biometrika 1992,
79, 25–37. [CrossRef]
34. Jeffreys, H. The Theory of Probability; Oxford University Press: Oxford, UK, 1998.
35. Gelman, A. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper).
Bayesian Anal. 2006, 1, 515–534. [CrossRef]
36. Simpson, D.; Rue, H.; Riebler, A.; Martins, T.G.; Sørbye, S.H. Penalising model component complexity: A principled, practical
approach to constructing priors. Stat. Sci. 2017, 32, 1–28. [CrossRef]
37. Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [CrossRef]
38. Guan, X.; He, J.; Jha, R.; Liu, Y. An efficient analytical Bayesian method for reliability and system response updating based on
Laplace and inverse first-order reliability computations. Reliab. Eng. Syst. Saf. 2012, 97, 1–13. [CrossRef]
39. He, J.; Guan, X.; Jha, R. Improve the accuracy of asymptotic approximation in reliability problems involving multimodal
distributions. IEEE Trans. Reliab. 2016, 65, 1724–1736. [CrossRef]
40. American Society for Testing and Materials. ASTM E739-10(2015)-Standard Practice for Statistical Analysis of Linear or Linearized
Stress-Life (S-N) and Strain-Life (ε-N) Fatigue Data; ASTM International: West Conshohocken, PA, USA, 2015.
41. Boyd-Lee, A.; Harrison, G.; Henderson, M. Evaluation of standard life assessment procedures and life extension methodologies
for fracture-critical components. Int. J. Fatigue 2001, 23, 11–19. [CrossRef]
42. Browne, W.J.; Draper, D. A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Anal.
2006, 1, 473–514. [CrossRef]
43. Kotz, S.; Nadarajah, S. Multivariate T-Distributions and Their Applications; Cambridge University Press: Cambridge, UK, 2004.
44. Graybill, F.A.; Bowden, D.C. Linear segment confidence bands for simple linear models. J. Am. Stat. Assoc. 1967, 62, 403–408.
[CrossRef]