International Journal for Uncertainty Quantification

ASYMPTOTICALLY INDEPENDENT MARKOV SAMPLING: A NEW MARKOV CHAIN MONTE CARLO SCHEME FOR BAYESIAN INFERENCE

James L. Beck

Department of Computing and Mathematical Sciences, Division of Engineering and Applied Science, 1200 E California Blvd., California Institute of Technology, Pasadena, California 91125, USA

Konstantin M. Zuev

Department of Computing and Mathematical Sciences, Division of Engineering and Applied Science, 1200 E California Blvd., California Institute of Technology, Pasadena, California 91125, USA

Abstract

In Bayesian inference, many problems can be expressed as the evaluation of the expectation of an uncertain quantity of interest with respect to the posterior distribution based on relevant data. The standard Monte Carlo method is often not applicable because the encountered posterior distributions cannot be sampled directly. In this case, the most popular strategies are the importance sampling method, Markov chain Monte Carlo, and annealing. In this paper, we introduce a new scheme for Bayesian inference, called asymptotically independent Markov sampling (AIMS), which is based on the above methods. We derive important ergodic properties of AIMS. In particular, it is shown that, under certain conditions, the AIMS algorithm produces a uniformly ergodic Markov chain. The choice of the free parameters of the algorithm is discussed, and recommendations for this choice, both theoretically and heuristically based, are provided. The efficiency of AIMS is demonstrated with three numerical examples, which include both multimodal and higher-dimensional target posterior distributions.

KEYWORDS: Bayesian inference, uncertainty quantification, Markov chain Monte Carlo, importance sampling, simulated annealing


1. THREE CORNERSTONES OF COMPUTATIONAL BAYESIAN INFERENCE

In Bayesian statistics, many problems can be expressed as the evaluation of the expectation of a quantity of interest with respect to the posterior distribution. Standard Monte Carlo simulation [1], where expectations are estimated by sample averages based on samples drawn independently from the posterior, is often not applicable because the encountered posterior distributions are multidimensional non-Gaussian distributions that cannot be explicitly normalized. In this case, the most popular strategies are importance sampling and Markov chain Monte Carlo (MCMC) methods. We briefly review these two methods first because they play an important role in the new MCMC method introduced in this paper.

1.1 Importance Sampling

Importance sampling is nearly as old as the Monte Carlo method (see, for instance, [2]), and works as follows. Suppose we want to evaluate 𝔼π[h], which is an expectation of a function of interest h : Θ → ℝ under distribution π(·) defined on a parameter space Θ ⊆ ℝ^d,

$$\mathbb{E}_{\pi}[h] = \int_{\Theta} h(\theta)\,\pi(\theta)\,d\theta. \tag{1}$$

Unless otherwise stated, all probability distributions are assumed to have densities with respect to Lebesgue measure, π(dθ) = π(θ)dθ. For simplicity, the same symbol will be used to denote both the distribution and its density, and we write θ ∼ π(·) to denote that θ is distributed according to π(·). Suppose also that we are not able to sample directly from π(·), although we can compute π(θ) for any θ ∈ Θ to within a proportionality constant. Instead, we sample from some other distribution q(·) on Θ which is readily computable for any θ ∈ Θ. Let θ^(1), …, θ^(N) be N independent and identically distributed (i.i.d.) samples from q(·), and let w^(i) = π(θ^(i))/q(θ^(i)) denote the importance weight of the ith sample; then we can estimate 𝔼π[h] by

$$\hat{h}_N = \sum_{i=1}^{N} \bar{w}^{(i)}\, h\big(\theta^{(i)}\big), \qquad \bar{w}^{(i)} = \frac{w^{(i)}}{\sum_{k=1}^{N} w^{(k)}}. \tag{2}$$

The estimator ĥ_N converges almost surely as N → ∞ to 𝔼π[h] by the strong law of large numbers for any choice of distribution q(·), provided supp(π) ⊆ supp(q). Note that the latter condition automatically holds in Bayesian updating using data 𝒟, where q(θ) = π0(θ) is the prior density and π(θ) ∝ π0(θ)L(θ) is the posterior p(θ|𝒟), where L stands for the likelihood function p(𝒟|θ).

The estimator ĥ_N in (2) generally has a smaller mean square error than the more straightforward unbiased importance sampling estimator h̃_N:

$$\tilde{h}_N = \frac{1}{N}\sum_{i=1}^{N} w^{(i)}\, h\big(\theta^{(i)}\big). \tag{3}$$

This is especially clear when h is nearly a constant: if h ≈ c, then ĥ_N ≈ c, while h̃_N has a larger variation. Although ĥ_N is biased for any finite N, the bias can be made small by taking sufficiently large N, and the improvement in variance makes it a preferred alternative to h̃_N [3, 4]. Another major advantage of using ĥ_N instead of h̃_N, which is especially important for Bayesian applications, is that in using the former we need to know π(θ) only up to a multiplicative normalizing constant, whereas in the latter, this constant must be known exactly.

The accuracy of ĥ_N depends critically on the choice of the importance sampling distribution (ISD) q(·), which is also called the instrumental or trial distribution. If q(·) is chosen carelessly such that the importance weights w^(i) have a large variation, then ĥ_N is essentially based only on the few samples θ^(i) with the largest weights, yielding generally a very poor estimate. Hence, for importance sampling to work efficiently, q(·) must be a good approximation of π(·)—“the importance sampling density should mimic the posterior density” [5]—so that the variance varq[w] is not large. Since usually the prior and posterior are quite different, it is therefore highly inefficient to use the prior as the importance sampling distribution. When Θ is high-dimensional and π(·) is complex, finding a good importance sampling distribution can be very challenging, limiting the applicability of the method [6].

For the estimator h̃_N in (3), it is not difficult to show that the optimal importance sampling density, i.e., the q(·) that minimizes the variance of h̃_N, is q*(θ) ∝ |h(θ)|π(θ). This result is sometimes attributed to Rubinstein [7], although it was proved earlier by Kahn and Marshall [2]. It is not true, however, that q*(·) is optimal for the estimator ĥ_N in (2). Note also that this optimality result is not useful in practice, since when h(θ) ≥ 0, the required normalizing constant of q*(·) is ∫Θ h(θ)π(θ)dθ, the very integral of interest.
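As a concrete illustration (not from the paper), here is a minimal sketch comparing the two estimators on a toy one-dimensional problem; the target, the ISD, and the function h are illustrative assumptions, and the self-normalized estimator (2) is computed from π known only up to a constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target: standard Gaussian known only up to a constant.
def pi_unnorm(theta):
    return np.exp(-0.5 * theta**2)                 # missing factor 1/sqrt(2*pi)

# Importance sampling density q(.): a wider Gaussian N(0, 2^2), fully known.
def q_pdf(theta):
    return np.exp(-0.125 * theta**2) / (2.0 * np.sqrt(2.0 * np.pi))

h = lambda theta: theta**2                          # function of interest, E_pi[h] = 1

N = 10_000
theta = rng.normal(0.0, 2.0, size=N)                # theta^(i) ~ q(.)
w = pi_unnorm(theta) / q_pdf(theta)                 # importance weights, up to a constant

# Self-normalized estimator (2): needs pi only up to a constant.
h_hat = np.sum((w / w.sum()) * h(theta))

# Unbiased estimator (3): needs the exact weights, i.e., the normalizing constant of pi.
h_tilde = np.mean((w / np.sqrt(2.0 * np.pi)) * h(theta))

print(h_hat, h_tilde)                               # both close to 1
```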

1.2 MCMC Sampling

Instead of generating independent samples from an ISD, we could generate dependent samples by simulating a Markov chain that has the posterior distribution π(·) as its stationary distribution, so that the distribution of its states converges to π(·). MCMC sampling originated in statistical physics, and is now widely used in solving statistical problems [3, 4, 8, 9].

The Metropolis-Hastings algorithm [10, 11], the most popular MCMC technique, works as follows. Let q(·|θ) be a distribution on Θ, which may or may not depend on θ ∈ Θ. Assume that q(·|θ) is easy to sample from and is either computable (up to a multiplicative constant) or symmetric, i.e., q(ξ|θ) = q(θ|ξ). The sampling distribution q(·|θ) is called the proposal distribution. Starting from essentially any θ^(1) ∈ supp(π), the Metropolis-Hastings algorithm proceeds by iterating the following two steps. First, generate a candidate state ξ from the proposal density q(·|θ^(n)). Second, either accept ξ as the next state of the Markov chain, θ^(n+1) = ξ, with probability α(ξ|θ^(n)) = min{1, [π(ξ)q(θ^(n)|ξ)] / [π(θ^(n))q(ξ|θ^(n))]}; or reject ξ and set θ^(n+1) = θ^(n) with the remaining probability 1 − α(ξ|θ^(n)). It can be shown (see, for example, [4]) that, under fairly weak conditions, π(·) is the stationary distribution of the Markov chain θ^(1), θ^(2), …, and

$$\lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^{N} h\big(\theta^{(n)}\big) = \mathbb{E}_{\pi}[h] \quad \text{almost surely}. \tag{4}$$

Since the chain needs some time (the so-called “burn-in” period) to converge to stationarity, in practice an initial portion of, say, N0 states is usually discarded and

$$\hat{h}_N = \frac{1}{N-N_0}\sum_{n=N_0+1}^{N} h\big(\theta^{(n)}\big) \tag{5}$$

is used as an estimator for 𝔼π[h].

The two main special cases of the Metropolis-Hastings algorithm are independent Metropolis-Hastings (IMH), where the proposal distribution q(ξ|θ) = q_g(ξ) is independent of θ (so q_g is a global proposal), and random walk Metropolis-Hastings (RWMH), where the proposal distribution is of the form q(ξ|θ) = q_l(ξ − θ), i.e., a candidate state is proposed as ξ = θ^(n) + ϵ_n, where ϵ_n ∼ q_l(·) is a random perturbation (so q_l is a local proposal). In both cases, the choice of the proposal distribution strongly affects the efficiency of the algorithms. For IMH to work well, as with importance sampling, the proposal distribution must be a good approximation of the target distribution π(·); otherwise a large fraction of the candidate samples will be rejected and the Markov chain will be too slow in covering the important regions for π(·). When, however, it is possible to find a proposal q_g(·) such that q_g(·) ≈ π(·), IMH should always be preferred to RWMH because of better efficiency, i.e., better approximations of 𝔼π[h] for a given number of samples N. Unfortunately, such a proposal is difficult to construct in the context of Bayesian inference, where the posterior π(·) is often complex and high-dimensional. This limits the applicability of IMH.
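For reference, a minimal RWMH sketch for a density known up to a constant is given below; the bimodal one-dimensional target and the proposal spread are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def pi_unnorm(theta):
    """Illustrative bimodal target, known only up to a normalizing constant."""
    return np.exp(-0.5 * (theta - 3.0)**2) + np.exp(-0.5 * (theta + 3.0)**2)

def rwmh(n_steps, theta0, step_std):
    """Random walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    chain = np.empty(n_steps)
    theta, p_theta = theta0, pi_unnorm(theta0)
    for n in range(n_steps):
        xi = theta + rng.normal(0.0, step_std)        # local candidate
        p_xi = pi_unnorm(xi)
        if rng.uniform() < min(1.0, p_xi / p_theta):  # acceptance probability alpha
            theta, p_theta = xi, p_xi
        chain[n] = theta
    return chain

chain = rwmh(50_000, theta0=0.0, step_std=1.0)
print(chain[5_000:].mean())    # estimate of E[h] for h(theta) = theta after burn-in
```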

Since the random walk proposal q_l(·) is local, it is less sensitive to the target distribution. That is why, in practice, RWMH is more robust and used more frequently than IMH. Nonetheless, there are settings where RWMH also does not work well because of the complexity of the posterior distribution. Although (4) is true in theory, a potential problem with RWMH (and, in fact, with any MCMC algorithm) is that the generated samples θ^(1), …, θ^(N) are often highly correlated. Therefore, the estimator ĥ_N in (5) obtained from these samples tends to have a large variance for a modest number of samples. This is especially true when the posterior distribution contains several widely separated modes: a chain will move between modes only rarely, and it will take a long time before it reaches stationarity. If this is the case, an estimate produced by (5) will be very inaccurate. At first glance, it seems natural to generate several independent Markov chains, starting from different random seeds, and hope that different chains will get trapped by different modes. However, multiple runs will not in general generate a sample in which each mode is correctly represented, since the probability of a chain reaching a mode depends more on the mode's “basin of attraction” than on the probability concentrated in the mode [12].

1.3 Annealing

The concept of annealing (or tempering), which involves moving from an easy-to-sample distribution to the target distribution via a sequence of intermediate distributions, is one of the most effective methods of handling multiple isolated modes. Together with importance sampling and MCMC, annealing constitutes the third cornerstone of computational Bayesian inference.

The idea of using the RWMH algorithm in conjunction with annealing was introduced independently in [13] and [14] for solving difficult optimization problems. The resulting algorithm, called simulated annealing, works as follows. Suppose we want to find the global minimum of a function of interest h : Θ → ℝ. This is equivalent to finding the global maximum of ƒT (θ) = exp[−h(θ)/T] for any given T > 0. By analogy with the Gibbs distribution in statistical mechanics, T is called the temperature parameter. Let T0 > T1 > … be a sequence of monotonically decreasing temperatures, in which T0 is large enough so that the probability distribution π0(θ) ∝ ƒT0 (θ) is close to uniform, and limj→∞ Tj = 0. At each temperature Tj, the simulated annealing method generates a Markov chain with πj(θ) ∝ exp[−h(θ)/Tj] as its stationary distribution. The final state of the Markov chain at simulation level j is used as the initial state for the chain at level j + 1. The key observation is that for any function h such that ∫Θ exp[−h(θ)/T]dθ < ∞ for all T > 0, distribution πj(·), as j increases, puts more and more of its probability mass (converging to 1) into a neighborhood of the global minimum of h. Therefore, a sample drawn from πj(·) would almost surely be in a vicinity of the global minimum of h when Tj is close to zero.

The success of simulated annealing in finding the global minimum crucially depends on the schedule of temperatures used in the simulation. It was proved in [15] that if a logarithmic schedule Tj = T0/log(j + 1) is used, then, under certain conditions, there exists a value for T0 such that use of this schedule guarantees that the global minimum of h will be reached almost surely. In practice, however, such a slow annealing schedule is not computationally efficient. It is more common to use either a geometric schedule, Tj+1 = γTj with 0 < γ < 1, or some adaptive schedule, which defines the temperature for the next annealing level based on characteristics of the samples observed at earlier levels. For examples of adaptive annealing schedules, see, for instance, [9].
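A minimal sketch of simulated annealing with a geometric schedule follows; the objective function, the initial temperature, and the cooling factor are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

h = lambda theta: (theta**2 - 1.0)**2 + 0.3 * theta   # objective with two local minima

theta = 3.0             # starting point
T, gamma = 5.0, 0.95    # initial temperature and geometric cooling factor

for level in range(200):
    # Short RWMH pass targeting pi_T(theta) proportional to exp(-h(theta)/T).
    for _ in range(50):
        xi = theta + rng.normal(0.0, 0.5)
        if h(xi) <= h(theta) or rng.uniform() < np.exp((h(theta) - h(xi)) / T):
            theta = xi
    T *= gamma          # geometric schedule T_{j+1} = gamma * T_j

print(theta)            # ends up close to the global minimizer of h (near -1)
```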

In Bayesian inference problems, the idea of annealing is typically employed in the following way. First, we construct (in advance or adaptively), a sequence of distributions π0(·), …, πm(·) interpolating between the prior distribution π0(·) and the posterior distribution π(·) ≡ πm(·). Next, we generate i.i.d. samples θ(1)0, …, θ(N)0  from the prior, which is assumed to be readily sampled. Then, at each annealing level j, using some MCMC algorithm and samples θ(1)j−1, …, θ(N)j−1 from the previous level j−1, we generate samples θ(1)j , …, θ(N)j  which are approximately distributed according to πj(·). We proceed sequentially in this way, until the posterior distribution has been sampled. The rationale behind this strategy is that sampling from the multimodal and, perhaps, high-dimensional posterior in such a way is likely to be more efficient than a straightforward MCMC sampling of the posterior.

The problem of sampling a complex probability distribution is encountered in statistical mechanics, molecular dynamics, computational Bayesian inference, scientific computing, machine learning, and other fields. As a result, many different efficient algorithms have been recently developed, e.g., hybrid (Hamiltonian) Monte Carlo [16–18], the method of simulated tempering [19, 20], the tempered transition method [12], annealed importance sampling [21], the adaptive Metropolis-Hastings algorithm [22], sequential Monte Carlo sampler [23, 24], transitional Markov chain Monte Carlo method [25], to name a few.

In this paper we introduce a new MCMC scheme for Bayesian inference, called asymptotically independent Markov sampling (AIMS), which combines the three approaches described above—importance sampling, MCMC, and annealing—in the following way. Importance sampling with πj−1(·) as the ISD is used to construct an approximation π̂Nj(·) of πj(·), based on samples θ(1)j−1, …, θ(N)j−1 ∼ πj−1(·). This approximation is then employed as the independent (global) proposal distribution for sampling from πj(·) by the IMH algorithm. Intermediate distributions π0(·), …, πm(·) interpolating between prior and posterior are constructed adaptively, using the effective sample size (ESS) to measure how much πj−1(·) differs from πj(·). As the number of samples N → ∞, the approximation π̂Nj(·) converges to πj(·), providing the optimal proposal distribution. In other words, when N → ∞, the corresponding MCMC sampler produces independent samples, hence the name of the algorithm.

Remark 1. The term “Markov sampling” has several different meanings. In this paper it is used synonymously with “MCMC sampling.”

In this introductory section, we have described all the main ingredients that we will need in the subsequent sections. The rest of the paper is organized as follows. In Section 2, the AIMS algorithm is described. The efficiency of AIMS is illustrated in Section 3 with three numerical examples that include both multimodal and high-dimensional posterior distributions. Concluding remarks are made in Section 4. The ergodic properties of AIMS are derived in the Appendix.

2. ASYMPTOTICALLY INDEPENDENT MARKOV SAMPLING

Let π0(·) and π(·) be the prior and the posterior distributions defined on a parameter space Θ, respectively, so that, according to Bayes' theorem, π(θ) ∝ π0(θ)L(θ), where L denotes the likelihood function for data 𝒟. Our ultimate goal is to draw samples that are distributed according to π(·).

In AIMS, we sequentially generate samples from intermediate distributions π0(·), …, πm(·) interpolating between the prior π0(·) and the posterior π(·) ≡ πm(·). The sequence of distributions could be specially constructed for a given problem but the following scheme [21, 25] generally yields good efficiency:

$$\pi_j(\theta) \propto \pi_0(\theta)\, L(\theta)^{\beta_j}, \tag{6}$$

where 0 = β0 < β1 < … < βm = 1. We will refer to j and βj as the annealing level and the annealing parameter at level j, respectively. In the next subsection, we assume that βj is given and therefore the intermediate distribution πj(·) is also known. In Section 2.2, we describe how to choose the annealing parameters adaptively.

2.1 AIMS at Annealing Level j

Our first goal is to describe how AIMS generates samples θ(1)j, …, θ(Nj)j from πj(·) based on the samples θ(1)j−1, …, θ(Nj−1)j−1 ∼ πj−1(·) obtained at the previous annealing level. We start with an informal motivating discussion that leads to the simulation algorithm. In the Appendix, we rigorously prove that the corresponding algorithm indeed generates samples which are asymptotically distributed according to πj(·), as the sample size Nj → ∞. Moreover, the larger Nj−1 is, the less correlated the generated samples θ(1)j, …, θ(Nj)j are—a very desirable, yet rarely affordable, property for any MCMC algorithm.

Let Kj(·|·) be any transition kernel such that πj(·) is a stationary distribution with respect to Kj(·|·). By definition, this means that

$$\pi_j(\theta) = \int_{\Theta} K_j(\theta|\xi)\,\pi_j(\xi)\,d\xi. \tag{7}$$

Applying importance sampling with the sampling density πj−1(·) to integral (7), we have

$$\pi_j(\theta) = \int_{\Theta} K_j(\theta|\xi)\,\frac{\pi_j(\xi)}{\pi_{j-1}(\xi)}\,\pi_{j-1}(\xi)\,d\xi \;\approx\; \sum_{i=1}^{N_{j-1}} \bar{w}_{j-1}^{(i)}\,K_j\big(\theta\,\big|\,\theta_{j-1}^{(i)}\big) \;\stackrel{\text{def}}{=}\; \hat{\pi}_j^{N_{j-1}}(\theta), \tag{8}$$

where π̂Nj−1j(·) will be used as the global proposal distribution in the independent Metropolis-Hastings algorithm, and

$$w_{j-1}^{(i)} = \frac{\pi_j\big(\theta_{j-1}^{(i)}\big)}{\pi_{j-1}\big(\theta_{j-1}^{(i)}\big)} \propto L\big(\theta_{j-1}^{(i)}\big)^{\beta_j-\beta_{j-1}}, \qquad \bar{w}_{j-1}^{(i)} = \frac{w_{j-1}^{(i)}}{\sum_{k=1}^{N_{j-1}} w_{j-1}^{(k)}} \tag{9}$$

are the importance weights and normalized importance weights, respectively. Note that to calculate w̄(i)j−1, we do not need to know the normalizing constants of πj−1(·) and πj(·). If adjacent intermediate distributions πj−1(·) and πj(·) are sufficiently close (in other words, if Δβj = βj − βj−1 is small enough), then the importance weights (9) will not vary wildly, and therefore we can expect that, for reasonably large Nj−1, approximation (8) is accurate.

Remark 2. In [26], the stationary condition (7) was used for an analytical approximation of the target PDF to evaluate the evidence (marginal likelihood) for a model.

Remark 3. Note that for any finite Nj−1, the distribution π̂Nj−1j(·) will usually have both continuous and discrete parts. This follows from the fact that the transition kernel in Markov chain simulation usually has the following form: K(dθ|ξ) = k(θ|ξ)dθ + r(ξ)δξ(dθ), where k(·|·) describes the continuous part of the transition kernel, δξ(·) denotes the Dirac mass at ξ, and r(ξ) = 1 − ∫Θ k(θ|ξ)dθ. This is the form, for example, for the Metropolis-Hastings algorithm. Therefore, (8) must be understood as an approximate equality of distributions, not densities. In other words, (8) means that 𝔼π̂Nj−1j[h] ≈ 𝔼πj[h] and 𝔼π̂Nj−1j[h] → 𝔼πj[h] as Nj−1 → ∞, for all integrable functions h. See also Example 2.1 below.

From now on, we consider a special case where Kj(·|·) is the random walk Metropolis-Hastings (RWMH) transition kernel. In this case, it can be written as follows:

$$K_j(d\theta|\xi) = q_j(\theta|\xi)\,\min\left\{1, \frac{\pi_j(\theta)}{\pi_j(\xi)}\right\} d\theta + \big(1 - a_j(\xi)\big)\,\delta_{\xi}(d\theta), \tag{10}$$

where qj(·|ξ) is a symmetric local proposal density, and aj(ξ) is the probability of having a proper transition from ξ to Θ ∖ {ξ}:

$$a_j(\xi) = \int_{\Theta} q_j(\theta|\xi)\,\min\left\{1, \frac{\pi_j(\theta)}{\pi_j(\xi)}\right\} d\theta. \tag{11}$$

Example 2.1. As a simple illustration of (8), consider the case when πj(·) = 𝒩(·|0, 1), πj−1(·) = 𝒩(·|0, 2), and qj(·|ξ) = 𝒩(·|ξ, 1/2), where 𝒩(·|μ,σ²) denotes the Gaussian density with mean μ and variance σ². The approximation π̂Nj−1j(·) based on the samples θ(1)j−1, …, θ(Nj−1)j−1 ∼ 𝒩(·|0, 2) is shown in the top panels of Fig. 1, for Nj−1 = 5 and Nj−1 = 50. Suppose that h1(θ) = θ and h2(θ) = θ² are the functions of interest. Then 𝔼πj[h1] = 0 and 𝔼πj[h2] = 1. The convergence of ĥ1(Nj−1) = 𝔼π̂Nj−1j[h1] and ĥ2(Nj−1) = 𝔼π̂Nj−1j[h2] is shown in the bottom panel of Fig. 1.

FIG. 1: The top panels show the distribution πj(·) (solid lines) and its approximation π̂Nj−1j(·), for Nj−1 = 5 (left) and Nj−1 = 50 (right). Dashed lines and bars correspond to the continuous and discrete parts of π̂Nj−1j(·), respectively. The bottom panel shows the convergence of ĥ1(Nj−1) = 𝔼π̂Nj−1j[h1] and ĥ2(Nj−1) = 𝔼π̂Nj−1j[h2] to the true values, 0 and 1, respectively [Example 2.1].
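The setting of Example 2.1 can be reproduced with a short sketch that draws from the approximation (8) directly (pick a previous-level sample with probability w̄(i), then make one RWMH step toward πj); all numerical choices follow the example, while the sampling-based check of the two expectations is an illustrative shortcut.

```python
import numpy as np

rng = np.random.default_rng(3)

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu)**2 / var) / np.sqrt(2.0 * np.pi * var)

N_prev = 50
theta_prev = rng.normal(0.0, np.sqrt(2.0), size=N_prev)     # samples ~ pi_{j-1} = N(0, 2)

# Importance weights (9): w^(i) = pi_j(theta^(i)) / pi_{j-1}(theta^(i)).
w = normal_pdf(theta_prev, 0.0, 1.0) / normal_pdf(theta_prev, 0.0, 2.0)
w_bar = w / w.sum()

def sample_pi_hat(n):
    """Draw from the approximation (8): pick theta^(i) with probability w_bar^(i),
    then make one RWMH step toward pi_j = N(0, 1) with local proposal variance 1/2."""
    out = np.empty(n)
    for m in range(n):
        i = rng.choice(N_prev, p=w_bar)
        xi = theta_prev[i] + rng.normal(0.0, np.sqrt(0.5))
        accept = rng.uniform() < min(1.0, normal_pdf(xi, 0.0, 1.0) / normal_pdf(theta_prev[i], 0.0, 1.0))
        out[m] = xi if accept else theta_prev[i]
    return out

draws = sample_pi_hat(20_000)
print(draws.mean(), np.mean(draws**2))   # approach E[h1] = 0 and E[h2] = 1 as N_prev grows
```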

For sampling from πj(·), we will use the IMH algorithm with the global proposal distribution π̂Nj−1j(·). To accomplish this, we have to be able to calculate the ratio π̂Nj−1j(θ)/π̂Nj−1j(ξ) for any θ, ξ ∈ Θ as a part of the expression for the acceptance probability αj(ξ|θ) = min{1, [πj(ξ)π̂Nj−1j(θ)] / [πj(θ)π̂Nj−1j(ξ)]}. However, as already mentioned, the distribution π̂Nj−1j(·) does not have a density, since it has both continuous and discrete components, and therefore the ratio π̂Nj−1j(θ)/π̂Nj−1j(ξ) makes no sense. To overcome this “lack-of-continuity problem,” taking into account (8) and (10), let us formally define the global proposal distribution over Θ as

$$\hat{\pi}_j^{N_{j-1}}(\theta) = \sum_{i=1}^{N_{j-1}} \bar{w}_{j-1}^{(i)}\, q_j\big(\theta\,\big|\,\theta_{j-1}^{(i)}\big)\,\min\left\{1, \frac{\pi_j(\theta)}{\pi_j\big(\theta_{j-1}^{(i)}\big)}\right\} \tag{12}$$

if θ ∉ {θ(1)j−1, …, θ(Nj−1)j−1 }, and

$$\hat{\pi}_j^{N_{j-1}}\big(\{\theta_{j-1}^{(k)}\}\big) = \bar{w}_{j-1}^{(k)}\,\Big(1 - a_j\big(\theta_{j-1}^{(k)}\big)\Big), \qquad k = 1, \ldots, N_{j-1}. \tag{13}$$

Note that π̂Nj−1j(·) is a distribution on Θ, but it does not have a density. However, π̂Nj−1j(·) induces another distribution on Θ ∖ {θ(1)j−1, …, θ(Nj−1)j−1} which does have a density, given by the right-hand side of (12). This motivates (12).

Now, using (12) and (13), we can calculate the ratio π̂Nj−1j(θ)/π̂Nj−1j(ξ) as follows:

  1. If θ, ξ ∉ {θ(1)j−1, …, θ(Nj−1)j−1}, then
    $$\frac{\hat{\pi}_j^{N_{j-1}}(\theta)}{\hat{\pi}_j^{N_{j-1}}(\xi)} = \frac{\sum_{i=1}^{N_{j-1}} \bar{w}_{j-1}^{(i)}\, q_j\big(\theta\,\big|\,\theta_{j-1}^{(i)}\big)\,\min\Big\{1, \pi_j(\theta)\big/\pi_j\big(\theta_{j-1}^{(i)}\big)\Big\}}{\sum_{i=1}^{N_{j-1}} \bar{w}_{j-1}^{(i)}\, q_j\big(\xi\,\big|\,\theta_{j-1}^{(i)}\big)\,\min\Big\{1, \pi_j(\xi)\big/\pi_j\big(\theta_{j-1}^{(i)}\big)\Big\}}. \tag{14}$$
  2. If θ ∉ {θ(1)j−1, …, θ(Nj−1)j−1} and ξ = θ(k)j−1, then, since the point mass at ξ dominates any density value,
    $$\frac{\hat{\pi}_j^{N_{j-1}}(\theta)}{\hat{\pi}_j^{N_{j-1}}(\xi)} = 0. \tag{15}$$
  3. If θ = θ(k)j−1 and ξ ∉ {θ(1)j−1, …, θ(Nj−1)j−1}, then, by the same reasoning,
    $$\frac{\hat{\pi}_j^{N_{j-1}}(\theta)}{\hat{\pi}_j^{N_{j-1}}(\xi)} = +\infty. \tag{16}$$
  4. If θ = θ(k)j−1 and ξ = θ(l)j−1, then the ratio π̂Nj−1j(θ)/π̂Nj−1j(ξ) is not defined.

Notice that in the first three cases the ratio π̂Nj−1j(θ)/π̂Nj−1j(ξ) is readily computable, while in case 4 it is not even defined. Therefore, it is very desirable to avoid case 4. The key observation that allows us to do this is the following: suppose that the initial state θ(1)j of the Markov chain that is generated is such that θ(1)j ∈ Θj ≝ Θ ∖ {θ(1)j−1, …, θ(Nj−1)j−1}; then θ(i)j ∈ Θj for all i ≥ 1. Indeed, the only way for the chain to enter the set {θ(1)j−1, …, θ(Nj−1)j−1} is to generate a candidate state ξ ∈ {θ(1)j−1, …, θ(Nj−1)j−1}; however, according to case 2, such a candidate will always be rejected. Thus, by replacing the state space Θ by Θj and using (14) and (15) for evaluation of π̂Nj−1j(θ)/π̂Nj−1j(ξ), we are able to calculate the acceptance probability αj(ξ|θ) = min{1, [πj(ξ)π̂Nj−1j(θ)] / [πj(θ)π̂Nj−1j(ξ)]} involved in the IMH algorithm. It is clear that the replacement of Θ by Θj is harmless for the ergodic properties of the Markov chain when Θ ⊆ ℝ^d.

Remark 4. One may wonder why we do not just use the continuous part of π̂Nj−1j(·) as the global proposal density within the IMH algorithm; in other words, why not use the density π̂Nj−1j,cont(·), which is proportional to the function defined by (12), as the proposal density. Indeed, in this case we would not have any difficulties with calculating the ratio π̂Nj−1j(θ)/π̂Nj−1j(ξ). The problem is that it is not clear how to sample from π̂Nj−1j,cont(·), while sampling from π̂Nj−1j(dθ) = ∑Nj−1i=1 w̄(i)j−1 Kj(dθ|θ(i)j−1) is straightforward.

The above discussion leads to the following algorithm for sampling from the distribution πj(·):


AIMS at annealing level j


Input:

  • ▻ θ(1)j−1, …, θ(Nj−1)j−1  ∼ πj−1(·), samples generated at annealing level j − 1;
  • ▻ θ(1)j  ∈ Θj = Θ ∖ {θ(1)j−1, …, θ(Nj−1)j−1 }, initial state of a Markov chain;
  • qj(·|ξ), symmetric proposal density associated with the RWMH kernel;
  • Nj, total number of Markov chain states to be generated.

Algorithm:

for i = 1, …, Nj − 1 do

  1. Generate a global candidate state ξgNj−1j (·) as follows:
    1. Select k from {1, …, Nj−1} with probabilities {w̄(i)j−1} given by (9).
    2. Generate a local candidate ξlqj (·|θ(k)j−1).
    3. Accept or reject ξl by setting
      $$\xi_g = \begin{cases} \xi_l & \text{with probability } \min\left\{1, \dfrac{\pi_j(\xi_l)}{\pi_j\big(\theta_{j-1}^{(k)}\big)}\right\},\\[2mm] \theta_{j-1}^{(k)} & \text{with the remaining probability.} \end{cases} \tag{17}$$
  2. Update θ(i)j  → θ(i+1)j  by accepting or rejecting ξg as follows:
    • if ξg = θ(k)j−1

      Set θ(i+1)j = θ(i)j

    • else

      Set

      $$\theta_j^{(i+1)} = \begin{cases} \xi_g & \text{with probability } \alpha_j\big(\xi_g\,\big|\,\theta_j^{(i)}\big) = \min\left\{1, \dfrac{\pi_j(\xi_g)\,\hat{\pi}_j^{N_{j-1}}\big(\theta_j^{(i)}\big)}{\pi_j\big(\theta_j^{(i)}\big)\,\hat{\pi}_j^{N_{j-1}}(\xi_g)}\right\},\\[2mm] \theta_j^{(i)} & \text{with the remaining probability.} \end{cases} \tag{18}$$
    • end if
end for

Output:

  • ► θ(1)j, …, θ(Nj)j, Nj states of a Markov chain with stationary distribution πj(·)

Schematically, the AIMS algorithm at annealing level j is shown in Fig. 2. The proof that πj(·) is indeed a stationary distribution for the Markov chain generated by AIMS is given in the Appendix.

FIG. 2: AIMS at annealing level j: disks ⚫ and circles ⚪ represent θ(1)j−1, …, θ(Nj−1)j−1 and θ(1)j, …, θ(Nj)j, respectively; concentric circles show the correspondence between θ(k)j−1 that has been chosen in step 1a and the corresponding local candidate ξl ∼ qj(·|θ(k)j−1) that has been generated in step 1b. In this schematic picture, all shown candidate states are accepted as new states of the Markov chain.
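A compact Python sketch of the level-j sampler is given below. It is an illustrative implementation, not the authors' code: πj is assumed to be available as an unnormalized density, the local proposal is the Gaussian (26), and the π̂ ratio is evaluated via (14).

```python
import numpy as np

def gauss_pdf(x, mean, c):
    """Isotropic Gaussian density N(x | mean, c^2 I); x and mean are length-d arrays."""
    d = x.size
    return np.exp(-0.5 * np.sum((x - mean)**2) / c**2) / (np.sqrt(2.0 * np.pi) * c)**d

def pi_hat(theta, theta_prev, w_bar, pi_j, c):
    """Continuous part (12) of the global proposal at a point theta not in theta_prev."""
    vals = np.array([gauss_pdf(theta, t, c) * min(1.0, pi_j(theta) / pi_j(t))
                     for t in theta_prev])
    return float(np.dot(w_bar, vals))

def aims_level_j(pi_j, theta_prev, w_bar, theta0, n_steps, c, rng):
    """AIMS at annealing level j: IMH with the sample-based global proposal (8).
    theta_prev is an (N_{j-1}, d) array of previous-level samples, w_bar their
    normalized importance weights (9), pi_j the unnormalized target density."""
    chain = [np.asarray(theta0, dtype=float)]
    for _ in range(n_steps - 1):
        theta = chain[-1]
        k = rng.choice(len(theta_prev), p=w_bar)                    # step 1(a)
        xi_l = theta_prev[k] + c * rng.standard_normal(theta.size)  # step 1(b)
        # Step 1(c), eq. (17): local accept/reject of the RWMH move from theta_prev[k].
        if rng.uniform() < min(1.0, pi_j(xi_l) / pi_j(theta_prev[k])):
            # Step 2, eqs. (14) and (18): global IMH accept/reject of xi_g = xi_l.
            ratio = pi_hat(theta, theta_prev, w_bar, pi_j, c) / \
                    pi_hat(xi_l, theta_prev, w_bar, pi_j, c)
            alpha = min(1.0, pi_j(xi_l) * ratio / pi_j(theta))
            chain.append(xi_l if rng.uniform() < alpha else theta)
        else:
            # xi_g = theta_prev[k]: as in case 2 above, such a candidate is always rejected.
            chain.append(theta)
    return np.array(chain)
```

Combined with the weight computation (9) and the adaptive choice of βj described in Section 2.2, repeated calls to aims_level_j for j = 1, …, m reproduce the full procedure.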

It is important to highlight the low computational complexity of the described algorithm. In the Bayesian inference problems of interest to us, the major calculations are the evaluations of the likelihood function L(θ) (e.g., computing responses of a dynamic system). Since these evaluations of L(θ) are “expensive” for any θ, they dominate the total computational complexity. All other arithmetic operations are relatively “cheap” when compared with likelihood evaluations. To generate a new sample in AIMS, we need to evaluate the likelihood just once, which is the minimal number one can have. Indeed, to update the Markov chain from θ(i)j to θ(i+1)j, we only need to compute L(ξl), which is a part of πj(ξl) in (17). All other likelihoods have already been computed, either at the previous annealing level [L(θ(i)j−1), for i = 1, …, Nj−1] or at the previous Markov chain update at the current annealing level [L(θ(i)j)]. As a result, calculation of the acceptance probability in (18) does not involve any new likelihood evaluations [note that we compute (18) only when ξg = ξl]. Thus, the computational complexity of obtaining Nj samples approximately distributed according to πj(·) is 𝒪(Nj).

Remark 5. As usual for MCMC algorithms, the fact of convergence of a Markov chain to its stationary distribution does not depend on the initial state; however, the speed of convergence does. One reasonable way to choose the initial state θ(1)j ∈ Θj in practical applications is the following: generate θ(1)j ∼ qj(·|θ(k*)j−1), where k* = arg maxk w̄(k)j−1, i.e., θ(k*)j−1 has the largest normalized importance weight.

Remark 6. From a computer implementation perspective, it is important to highlight that the accept/reject steps in (17) and (18) can be combined; namely, after step 1(b), set

$$\theta_j^{(i+1)} = \begin{cases} \xi_l & \text{with probability } \min\left\{1, \dfrac{\pi_j(\xi_l)}{\pi_j\big(\theta_{j-1}^{(k)}\big)}\right\}\cdot\min\left\{1, \dfrac{\pi_j(\xi_l)\,\hat{\pi}_j^{N_{j-1}}\big(\theta_j^{(i)}\big)}{\pi_j\big(\theta_j^{(i)}\big)\,\hat{\pi}_j^{N_{j-1}}(\xi_l)}\right\},\\[2mm] \theta_j^{(i)} & \text{with the remaining probability.} \end{cases} \tag{19}$$

2.2 The Full AIMS Procedure

At the zeroth annealing level, j = 0, we generate prior samples θ(1)0, …, θ(N0)0, which usually can be readily drawn directly because of a suitable choice of the prior distribution π0(·). Then, using the algorithm described in the previous subsection, we generate samples θ(1)1, …, θ(N1)1, which are approximately distributed according to the intermediate distribution π1(θ) ∝ π0(θ)L(θ)^β1. We proceed in this way until the posterior distribution πm(θ) ∝ π0(θ)L(θ)^βm (with βm = 1) has been sampled. To make the description of AIMS complete, we have to explain how to choose the annealing parameters βj for j = 1, …, m − 1.

It is clear that the choice of the annealing parameters is very important, since, for instance, it affects the accuracy of the importance sampling approximation (8) and therefore the efficiency of the whole AIMS procedure. At the same time, it is difficult to make a rational choice of the βj values in advance, since this requires some prior knowledge about the posterior distribution, which is often not available. For this reason, we propose an adaptive way of choosing the annealing scheme.

In importance sampling, a useful measure of degeneracy of the method is the effective sample size (ESS) Neff introduced in [27] and [28]. The ESS measures how similar the importance sampling distribution πj−1(·) is to the target distribution πj(·). Suppose Nj−1 independent samples θ(1)j−1, …, θ(Nj−1)j−1  are generated from πj−1(·), then the ESS of these samples is defined as

$$N_{j-1}^{\text{eff}} = \frac{N_{j-1}}{1 + \operatorname{var}_{\pi_{j-1}}[w]} = \frac{N_{j-1}}{\mathbb{E}_{\pi_{j-1}}[w^2]}, \tag{20}$$

where w(θ) = πj(θ) / πj−1(θ). The ESS can be interpreted as implying that the Nj−1 weighted samples (θ(1)j−1, w(1)j−1), …, (θ(Nj−1)j−1, w(Nj−1)j−1) are worth Neffj−1 (≤ Nj−1) i.i.d. samples drawn from the target distribution πj(·). One cannot evaluate the ESS exactly, but an estimate N̂effj−1 of Neffj−1 is given by

$$\hat{N}_{j-1}^{\text{eff}}\big(\bar{w}_{j-1}\big) = \frac{1}{\sum_{i=1}^{N_{j-1}} \big(\bar{w}_{j-1}^{(i)}\big)^{2}}, \tag{21}$$

where w̄j−1 = (w̄(1)j−1, …, w̄(Nj−1)j−1) and w̄(i)j−1 is the normalized importance weight of θ(i)j−1.

At annealing level j, when βj−1 is already known, the problem is to define βj. Let γ = N̂effj−1 / Nj−1 ∈ (0, 1) be a prescribed threshold that characterizes the “quality” of the weighted sample (the larger γ is, the “better” the weighted sample is). Then we obtain the following equation:

$$\hat{N}_{j-1}^{\text{eff}}\big(\bar{w}_{j-1}\big) = \gamma N_{j-1}. \tag{22}$$

Observe that this equation can be expressed as an equation for βj by using (9):

$$\frac{\Big(\sum_{i=1}^{N_{j-1}} L\big(\theta_{j-1}^{(i)}\big)^{\beta_j-\beta_{j-1}}\Big)^{2}}{\sum_{i=1}^{N_{j-1}} L\big(\theta_{j-1}^{(i)}\big)^{2(\beta_j-\beta_{j-1})}} = \gamma N_{j-1}. \tag{23}$$

Solving this equation for βj gives us the value of the annealing parameter at level j.
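The root of (23) can be found numerically; below is a hedged sketch that solves for βj by bisection on Δβj = βj − βj−1 in log space, assuming (as is typical) that the estimated ESS decreases monotonically in Δβj. Function and variable names are illustrative.

```python
import numpy as np

def ess_of_dbeta(log_L_prev, dbeta):
    """Estimated ESS (21) of the weights w^(i) = L(theta^(i))^dbeta, in log space."""
    log_w = dbeta * log_L_prev
    log_w -= log_w.max()                  # guard against overflow
    w_bar = np.exp(log_w) / np.exp(log_w).sum()
    return 1.0 / np.sum(w_bar**2)

def next_beta(log_L_prev, beta_prev, gamma=0.5, tol=1e-8):
    """Solve (23): find beta_j such that the estimated ESS equals gamma * N_{j-1}."""
    N = log_L_prev.size
    lo, hi = 0.0, 1.0 - beta_prev
    if ess_of_dbeta(log_L_prev, hi) >= gamma * N:
        return 1.0                        # the posterior itself is close enough: last level
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess_of_dbeta(log_L_prev, mid) >= gamma * N:
            lo = mid
        else:
            hi = mid
    return beta_prev + 0.5 * (lo + hi)

# Example: synthetic log-likelihood values at the previous-level samples.
rng = np.random.default_rng(4)
log_L = -0.5 * rng.chisquare(df=3, size=1000)
print(next_beta(log_L, beta_prev=0.0))
```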

Remark 7. Note that when j ≥ 2, the samples θ(1)j−1, …, θ(Nj−1)j−1 are generated by the Markov chain sampler described in the previous subsection and therefore are not independent. This means that, because of the autocorrelations produced by the Markov chain used, the “true” ESS of this sample is, in fact, smaller than the one given by (20). This is useful to remember when choosing γ. Also, this is another reason to select the prior distribution π0(·) so that samples can be generated independently at the start of each AIMS run.

Combining the AIMS algorithm at a given annealing level with the described adaptive annealing scheme gives rise to the following procedure:


The AIMS procedure


Input:

  • ▻ γ, threshold for the ESS;
  • N0, N1, …, where Nj is the total number of Markov chain states to be generated at annealing level j;
  • q1(·|ξ), q2(·|ξ), …, where qj(·|ξ) is the symmetric proposal density associated with the RWMH kernel at annealing level j.

Algorithm:

  • Set j = 0, current annealing level.
  • Set β0 = 0, current annealing parameter.
  • Sample θ(1)0, …, θ(N0)0 π0(·).
  • Calculate W(i)0 = L(θ(i)0)^(1−β0) / ∑N0k=1 L(θ(k)0)^(1−β0), i = 1, …, N0.
  • Calculate the ESS N̂eff0 = N̂eff0(W0) using (21), which measures how similar the prior distribution π0(·) is to the target posterior distribution π(·).
  • while N̂effj / Nj < γ do
    • Find βj+1 from Eq. (23).
    • Calculate the normalized importance weights w̄(i)j, i = 1, …, Nj, using (9).
    • Generate a Markov chain θ(1)j+1, …, θ(Nj+1)j+1  with the stationary distribution πj+1(·) using the AIMS algorithm at annealing level j + 1.
    • Calculate W(i)j+1 = L(θ(i)j+1)^(1−βj+1) / ∑Nj+1k=1 L(θ(k)j+1)^(1−βj+1), i = 1, …, Nj+1.
    • Calculate the ESS N̂effj+1 = N̂effj+1(Wj+1) using (21), which measures how similar the intermediate distribution πj+1(·) is to the posterior π(·).
    • Increment j to j + 1.
  • end while
  • Set βj+1 = 1, current annealing parameter.
  • Set m = j + 1, the total number of distributions in the annealing scheme.
  • Set w(i)m−1 = W(i)m−1, i = 1, …, Nm−1.
  • Generate a Markov chain θ(1)m, …, θ(Nm)m  with the stationary distribution πm(·) = π(·) using the AIMS algorithm at annealing level m.

Output:

  • ► θ(1)m, …, θ(Nm)m, samples that are approximately distributed according to the posterior distribution π(·).
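Putting the pieces together, a driver for the full procedure might look as follows. This is a sketch under simplifying assumptions: it reuses aims_level_j and next_beta from the sketches above, treats the prior factor in πj as flat for brevity, and omits the log-space safeguards a production implementation would need; sample_prior and log_likelihood are problem-specific stand-ins.

```python
import numpy as np

def run_aims(sample_prior, log_likelihood, N, gamma, c, rng):
    """Full AIMS procedure (Section 2.2), built from the level-j sampler and the
    adaptive annealing-parameter selection sketched above. sample_prior(N) must
    return an (N, d) array of i.i.d. prior samples."""
    theta = sample_prior(N)                               # level 0
    log_L = np.array([log_likelihood(t) for t in theta])
    beta = 0.0
    while beta < 1.0:
        beta_next = next_beta(log_L, beta, gamma)         # solve (23) for the next level
        log_w = (beta_next - beta) * log_L
        w_bar = np.exp(log_w - log_w.max())
        w_bar /= w_bar.sum()                              # normalized weights (9)
        # Unnormalized intermediate target; the prior factor is taken as flat here
        # for brevity, and the largest log-likelihood is subtracted for stability.
        ref = log_L.max()
        pi_next = lambda t, b=beta_next, r=ref: np.exp(b * (log_likelihood(t) - r))
        k0 = int(np.argmax(w_bar))                        # Remark 5: start near the best seed
        theta0 = theta[k0] + c * rng.standard_normal(theta.shape[1])
        theta = aims_level_j(pi_next, theta, w_bar, theta0, N, c, rng)
        log_L = np.array([log_likelihood(t) for t in theta])
        beta = beta_next
    return theta                                          # approximately ~ posterior
```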

Remark 8. Among all published algorithms we are aware of, the transitional Markov chain Monte Carlo (TMCMC) method [25] is the one closest to AIMS. The main difference between AIMS and TMCMC is in how samples at a given annealing level are generated. In AIMS, θ(1)j, …, θ(Nj)j are obtained by generating a single Markov chain using IMH with the global proposal distribution π̂Nj−1j(·), which is constructed based on the samples θ(1)j−1, …, θ(Nj−1)j−1 from the previous annealing level. In TMCMC, on the contrary, a resampling procedure is performed first: θ(1)j−1, …, θ(Nj−1)j−1 are resampled with respect to the weights (9). After that, starting from each θ(i)j−1, i = 1, …, Nj−1, a Markov chain of length n(i)j is generated using RWMH with the local proposal distribution, where n(i)j (possibly zero) is the number of times θ(i)j−1 has been resampled, so that ∑i n(i)j = Nj. As a result, in TMCMC θ(1)j, …, θ(Nj)j are obtained as samples from multiple Markov chains, which are typically very short (the n(i)j are small compared to Nj). Therefore, they may not settle down into the stationary state [for finite Nj−1, the resampled θ(i)j−1 serving as initial seeds for multiple Markov chains need not be distributed according to the stationary distribution πj(·)]. In Section 3, we compare the performance of AIMS and TMCMC using high-dimensional examples.

2.3 Implementation Issues

As it follows from the description, the AIMS procedure has the following parameters: γ, the threshold for the effective sample size; Nj, the length of a Markov chain generated at annealing level j = 1, …, m; and qj(·|ξ), the symmetric proposal density associated with the RWMH kernel at level j = 1, …, m. Here, we discuss the choice of these parameters and how this choice affects the efficiency of AIMS.

First, it is absolutely clear that, as for any Monte Carlo method, the larger the number of generated samples is, the more accurate the corresponding estimates of (1) are. However, we would like to highlight the difference between the roles of Nj−1 and Nj at annealing level j. While Nj is directly related to the convergence of the chain θ(1)j, …, θ(Nj)j to its stationary distribution πj(·), Nj−1 affects this convergence implicitly through the global proposal distribution π̂Nj−1j(·): the larger Nj−1 is, the more accurate approximation (8) is, and therefore the less correlated θ(1)j, …, θ(Nj)j are. When Nj−1 → ∞, the samples θ(1)j, …, θ(Nj)j become independent draws from πj(·), hence the name of the algorithm. Thus, if we increase N = Nj−1 = Nj, the effect is twofold: first, the sample size increases, thereby increasing the effective number of independent samples at the jth level (typical for any Monte Carlo method); second, the samples become less correlated (a useful feature of AIMS), again increasing the effective number of independent samples. As a result of these two effects, increasing N has a strong influence on the effective number of independent posterior samples and so strongly reduces the variance of the estimator for (1).

Suppose now that we are at the last annealing level and generating a Markov chain θ(1)m, …, θ(Nm)m  with the stationary distribution πm(·) = π(·). We will refer to this chain as the posterior Markov chain. A critical question faced by users of MCMC methods is how to determine when it is safe to stop sampling from the posterior distribution and use samples θ(1)m, …, θ(Nm)m  for estimation. In other words, how large should Nm be? One possible solution of this “convergence assessment problem” is to use one of the numerous published diagnostic techniques; for example, see [29] for a comparative review of MCMC convergence diagnostics. Unfortunately, none of the published diagnostics allows one to say with certainty that a finite sample from an MCMC algorithm is representative of an underlying stationary distribution. A more empirical approach for assessing convergence is to run several posterior Markov chains θ(1)k,m, …, θ(Nm)k,m , k = 1, …, K, in parallel and monitor the corresponding estimators ĥ1, …, ĥK of 𝔼π[h]. A stopping rule for convergence is then

(24)

where ε is a minimum precision requirement. Note that rule (24), although easy to understand and easy to implement, does not assure convergence of the chains [especially if π(·) is multi-modal]: “the potential for problems with multiple modes exists whenever there is no theoretical guarantee that the distribution is unimodal” [21].
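Since the exact form of rule (24) is not reproduced here, the sketch below simply monitors the spread of the K parallel-chain estimates relative to their mean, which is one hedged reading of such a stopping rule.

```python
import numpy as np

def relative_spread(chains, h):
    """chains: list of K arrays of posterior samples; h: function of interest.
    Returns the maximum relative deviation of the per-chain estimates of E[h]
    from their common mean (assumed to be bounded away from zero)."""
    estimates = np.array([np.mean([h(theta) for theta in chain]) for chain in chains])
    center = estimates.mean()
    return np.max(np.abs(estimates - center)) / abs(center)

# Keep extending the posterior chains until relative_spread(chains, h) < eps.
```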

The threshold γ affects the speed of annealing. If γ is very small, i.e., close to zero, then AIMS will have very few intermediate distributions interpolating between the prior and posterior distributions, and this will lead to inaccurate results for a moderate number of samples. If γ is very large, i.e., close to 1, then AIMS will have too many intermediate distributions, which will make the algorithm computationally very expensive.

The proposed method for finding βj values is based on the ESS, and βj is defined from Eq. (22) [or, equivalently, from (23)]. A similar adaptive approach for defining an annealing scheme was proposed in [25]. It is based on the coefficient of variation (COV) of the importance weights (9). More precisely, the equation for βj is given by

$$\frac{\sqrt{\tfrac{1}{N_{j-1}}\sum_{i=1}^{N_{j-1}}\big(w_{j-1}^{(i)} - \mu_{w}\big)^{2}}}{\mu_{w}} = \delta, \qquad \mu_{w} = \frac{1}{N_{j-1}}\sum_{i=1}^{N_{j-1}} w_{j-1}^{(i)}, \tag{25}$$

where δ > 0 is a prescribed threshold. It is easy to show that the ESS criterion (22) and the COV criterion (25) are mathematically equivalent; in fact, N̂effj−1 = Nj−1/(1 + δ²). We prefer to use the former criterion, since γ has a clear meaning: it is the factor by which the (effective) sample size of the weighted sample is reduced as a penalty for sampling from the importance sampling density instead of the target distribution. It was found in [25] that δ = 1 is usually a reasonable choice of the threshold. This corresponds to γ = 1/2. Our simulation results (see Section 3) also show that annealing schemes with γ around 1/2 yield good efficiency.

The choice of the local proposal density qj(·|ξ) associated with the RWMH kernel determines the ergodic properties of the Markov chain generated by AIMS at level j; it also determines how efficiently the chain explores local neighborhoods of samples θ(1)j−1, …, θ(Nj−1)j−1  generated at the previous level. This makes the choice of qj(·|ξ) very important.

It has been observed by many researchers that the efficiency of Metropolis-Hastings-based MCMC methods is not sensitive to the type of the proposal density; however, it strongly depends on its variance (e.g., [30, 31]). For this reason, we suggest using a Gaussian density as the local proposal:

$$q_j(\theta|\xi) = \mathcal{N}\big(\theta\,\big|\,\xi,\, c_j^2\, \mathbb{I}_d\big), \tag{26}$$

where ξ and c²jI_d are the mean and diagonal covariance matrix, respectively. The scaling parameter c²j determines the “spread” of the local proposal distribution. In the Appendix, we prove (Theorem 3) that, under certain conditions, the acceptance rate 𝒜j (i.e., the expected probability of having a proper Markov transition from θ(i)j to θ(i+1)j ≠ θ(i)j) satisfies 𝒜j ≥ 1/M, where the constant M depends on qj(·|ξ) and therefore on c²j. This result can potentially be used for finding an optimal c²j that would minimize M. Alternatively, a more empirical way of choosing the scaling factor consists of adjusting c²j based on the estimated acceptance rate. This works as follows: first, choose an initial value for the scaling factor, c²j,0, and estimate the corresponding acceptance rate 𝒜j(c²j,0) based on Nj generated Markov states; then modify c²j,0 to obtain an increase in 𝒜j. Whether this optimization in c²j is useful depends on whether the accuracy of the estimator that is achieved compensates for the additional computational cost. Finally, note that our simulation results (see Section 3) show that as j increases, the corresponding optimal scaling factor c²j decreases slightly. This observation coincides with intuition, since when j increases, the intermediate distributions πj(·) become more concentrated.
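The empirical adjustment of the scaling factor can be organized as a simple search for the value of c with the highest observed acceptance rate; the grid of trial factors below is an illustrative choice, not a value prescribed by the paper.

```python
def tune_scaling(run_level, c0, factors=(0.5, 0.7, 1.0, 1.4, 2.0)):
    """run_level(c) should run the level-j AIMS sampler with scaling factor c and
    return the observed global acceptance rate; keep the c that maximizes it."""
    candidates = [f * c0 for f in factors]
    return max(candidates, key=run_level)
```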

3. ILLUSTRATIVE EXAMPLES

In this section we illustrate the use of AIMS with three examples: (1) mixture of ten Gaussian distributions in two dimensions (a multimodal case); (2) sum of two multivariate Gaussian distributions in higher dimensions; and (3) Bayesian updating of a neural network model.

3.1 Multimodal Mixture of Gaussians in Two Dimensions

To demonstrate the efficiency of AIMS for sampling from multimodal distributions, consider simulation from a truncated two-dimensional mixture of M Gaussian densities:

$$\pi(\theta) \propto \left(\sum_{i=1}^{M} w_i\, \mathcal{N}\big(\theta\,\big|\,\mu_i,\, \sigma^2 \mathbb{I}_2\big)\right)\mathcal{U}_{[0,a]\times[0,a]}(\theta), \tag{27}$$

where 𝒰[0,a]×[0,a](·) denotes the uniform distribution on the square [0, a] × [0, a]. In this example, a = 10, M = 10, σ = 0.1, w1 = … = w10 = 0.1, and the mean vectors μ1, …, μ10 are drawn uniformly from the square [0, 10] × [0, 10]. Because of our interest in Bayesian updating, we refer to π(·) in (27) as a posterior distribution.
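The target (27) can be set up in a few lines; the sketch below matches the stated parameter values and draws the mean vectors at random, as in the example.

```python
import numpy as np

rng = np.random.default_rng(5)

a, M, sigma = 10.0, 10, 0.1
mu = rng.uniform(0.0, a, size=(M, 2))      # mean vectors drawn uniformly from the square
weights = np.full(M, 1.0 / M)              # w_1 = ... = w_10 = 0.1

def posterior_unnorm(theta):
    """Unnormalized density (27): Gaussian mixture truncated to [0, a] x [0, a]."""
    theta = np.asarray(theta, dtype=float)
    if np.any(theta < 0.0) or np.any(theta > a):
        return 0.0
    comps = np.exp(-0.5 * np.sum((theta - mu)**2, axis=1) / sigma**2)
    return float(np.dot(weights, comps))   # constant 1/(2*pi*sigma^2) factor omitted
```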

Figure 3(a) displays the scatterplot of 10³ posterior samples obtained from AIMS. Notice that there are two clusters of samples that overlap significantly near θ = (4, 4), reflecting two closely spaced Gaussian densities, while the other eight clusters are widely spaced. The parameters of the algorithm were chosen as follows: sample size N = 10³ per annealing level; threshold for the ESS γ = 1/2; local proposal density qj(·|ξ) = 𝒩(·|ξ, c²I₂), with c = 0.2. The trajectory of the corresponding posterior Markov chain, i.e., the chain generated at the last annealing level with stationary distribution π(·), is shown in Fig. 3(b). Black crosses × represent the mean vectors μ1, …, μ10. As expected, the chain does not exhibit local random walk behavior and it moves freely between the well-separated modes of the posterior distribution. Figures 3(c) and 3(d) are discussed later.

FIG. 3: (a) Scatterplot of 10³ posterior samples; (b) the trajectory of the corresponding posterior Markov chain obtained from AIMS; and (c), (d) the corresponding plots from RWMH. Black crosses × represent the modes μ1, …, μ10 of π(·) [example 3.1].

The described implementation of AIMS leads to a total number of m = 6 intermediate distributions in the annealing scheme. Figure 4 shows how annealing parameter βj changes as a function of j for 50 independent runs of the algorithm. It is found that in all considered examples, βj grows like an exponential with j except for the last step (because βm is set to 1 to get the posterior).

FIG. 4: Annealing parameter βj as a function of annealing level j for 50 independent runs of AIMS [example 3.1].

To demonstrate the asymptotic properties of AIMS, in Fig. 5 we plot the normalized sample autocorrelation function (ACF) of three Markov chains generated by AIMS: two intermediate chains (j = 2, 4) and the posterior chain (j = 6). For each chain, we consider three different cases: N = 10², 10³, and 10⁴ samples. Figure 5 shows that for every Markov chain, the larger N is, the less correlated its samples θ(1)j, …, θ(N)j are. When the stationary distribution πj(·) is relatively simple (j = 2), the influence of N on the sample ACF is especially clear: the sample ACF drops quickly at lag 1, and at lag 3 the states of the Markov chain with N = 10⁴ are essentially uncorrelated. However, if the stationary distribution is complex (j = 6), then the sample ACF is less sensitive to the sample size N. Indeed, if πj(·) is complex, then, to obtain an accurate approximation π̂Nj(·) ≈ πj(·), N must be large. This behavior of the ACF with increasing N is rare for MCMC algorithms; for example, increasing N does not affect the correlation between TMCMC samples or MH samples.

FIG. 5: The sample autocorrelation function of three Markov chains generated by AIMS: two intermediate chains (j = 2, 4) and the posterior chain (j = 6) [example 3.1].

Let us now compare the performance of AIMS with the random walk Metropolis-Hastings algorithm. For a fair comparison, the Metropolis-Hastings algorithm was implemented as follows. First, a sample of N0 = 10³ points θ(1)0, …, θ(N0)0 was drawn from the prior distribution π0(·) = 𝒰[0,a]×[0,a](·) and the corresponding values of the likelihood function L(θ) = ∑Mi=1 wi𝒩(θ|μi, σ²I₂) were calculated, Li = L(θ(i)0). Then, starting from the point with the largest likelihood, θ(1) = θ(k)0, k = arg maxi Li, a Markov chain θ(1), …, θ(N) with stationary distribution π(·) was generated using the Metropolis-Hastings algorithm. The proposal distribution used was q(·|ξ) = 𝒩(·|ξ, c²I₂) with c = 0.2, and the length of the chain was N = 5 × 10³. Thus, the total number of samples used in both AIMS and RWMH was Nt = 6 × 10³. The scatterplot of posterior samples obtained from RWMH and the trajectory of the corresponding Markov chain are shown in Figs. 3(c) and 3(d), respectively. While the AIMS algorithm successfully sampled all ten modes with approximately the correct proportion of total samples, RWMH completely missed seven modes.

Suppose that we are interested in estimating the posterior mean vector, μπ = (μπ1, μπ2), and the components (σπ1)², (σπ2)², σπ12 of the posterior covariance matrix Σπ. Their true values are given in Table 1 along with the AIMS estimates in terms of their means and coefficients of variation averaged over 50 independent simulations, all based on 10³ posterior samples.

TABLE 1: True values of the posterior parameters and the AIMS estimates in terms of their means and coefficients of variation averaged over 50 simulations [example 3.1]

Parameter μπ1 μπ2 (σπ1)² (σπ2)² σπ12
True value 5.23 5.75 4.51 3.37 −1.30
AIMS mean 5.20 5.73 4.56 3.32 −1.25
AIMS cov 2.4% 2.0% 8.2% 8.2% 27.7%

Figure 6 displays the mean square error (MSE) of the AIMS estimator for the posterior mean and covariance matrix for different values of the scaling factor c. The MSE was estimated based on 50 independent runs of the algorithm. The MSE as a function of c is nearly flat around the optimal, copt ≈ 0.15, i.e., the one that minimizes the MSE.

FIG. 6: Mean square error of the AIMS estimator for the mean and covariance matrix as a function of the scaling factor c showing the optimal value is copt ≈ 0.15 [example 3.1].

3.2 Mixture of Two Higher-Dimensional Gaussians

To demonstrate the efficiency of AIMS for higher dimensionality, consider simulation from a truncated sum of two multivariate Gaussian densities:

$$\pi^d(\theta) \propto \Big(\mathcal{N}\big(\theta\,\big|\,\mu_1,\, \sigma^2 \mathbb{I}_d\big) + \mathcal{N}\big(\theta\,\big|\,\mu_2,\, \sigma^2 \mathbb{I}_d\big)\Big)\,\mathcal{U}_{[-a,a]^d}(\theta), \tag{28}$$

where a = 2, μ1 = (0.5, …, 0.5), μ2 = (−0.5, …, −0.5), and σ = 0.5. Thus, πd(·) is a bimodal distribution on a d-dimensional cube [−a, a]d. Suppose that a quantity of interest is the function h : [−a, a]d → [−a, a] that gives the largest component of θ = (θ1, …, θd) ∈ [−a, a]d :

$$h(\theta) = \max\{\theta_1, \ldots, \theta_d\}, \tag{29}$$

and we want to estimate its expectation with respect to πd(·) using posterior samples θ(1), …, θ(N) ∼ πd(·) as follows:

$$\mathbb{E}_{\pi^d}[h] \approx \hat{h}_N = \frac{1}{N}\sum_{i=1}^{N} h\big(\theta^{(i)}\big). \tag{30}$$

This example is taken from [25], where the TMCMC for sampling from posterior densities was introduced.

Here, we consider five cases: d = 2, 4, 6, 10, and 20. The performance of TMCMC was examined for only the first three cases in [25]. The last two cases are higher dimensional, and therefore more challenging.

The details of implementation and simulation results from 50 independent runs are summarized in Table 2. The exact value of 𝔼πd[h] is calculated by direct Monte Carlo with N = 10⁵ samples. First, observe that AIMS outperforms TMCMC when d = 2, 4, 6. Both methods are capable of generating samples from both modes of the posterior; however, the probabilities of the modes (each is 1/2 in this example) are found more accurately by AIMS.

TABLE 2: Summary of the simulation results: d is the dimension of the sample space; h and ĥN are the exact value of 𝔼πd[h] and its estimated value, respectively; δ in parentheses is the corresponding coefficient of variation; N, γ, copt, and m are the number of samples used per annealing level, the threshold for the ESS, the (nearly) optimal value of the scaling parameter, and the average number of distributions in the annealing scheme, respectively. The AIMS results are based on 50 independent runs. The TMCMC results are taken from [25] and are based on 50 independent runs [example 3.2]

Case d h TMCMC: ĥN (δ) AIMS: ĥN (δ) N γ copt m
1 2 0.29 0.28 (12.3%) 0.29 (8.8%) 10³ 1/2 0.2 3
2 4 0.51 0.54 (10.0%) 0.51 (6.9%) 10³ 1/2 0.4 4
3 6 0.64 0.65 (15.7%) 0.64 (10.4%) 10³ 1/2 0.6 4.95
4 10 0.76 — 0.76 (26.7%) 10³ 1/2 0.7 5.84
4 10 0.76 — 0.76 (12.2%) 2 × 10³ 1/2 0.6 5.98
5 20 0.92 — 0.95 (42.1%) 4 × 10³ 1/2 0.5 5.58
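For reference, a sketch of the target (28) and the quantity of interest (29) is given below, together with a brute-force rejection-sampling check of 𝔼πd[h] for d = 2 (the sample sizes and the rejection bound are illustrative choices).

```python
import numpy as np

def make_target(d, a=2.0, sigma=0.5):
    """Unnormalized density (28): two Gaussians truncated to the cube [-a, a]^d."""
    mu1, mu2 = np.full(d, 0.5), np.full(d, -0.5)
    def pi_d(theta):
        theta = np.asarray(theta, dtype=float)
        if np.any(np.abs(theta) > a):
            return 0.0
        return (np.exp(-0.5 * np.sum((theta - mu1)**2) / sigma**2) +
                np.exp(-0.5 * np.sum((theta - mu2)**2) / sigma**2))
    return pi_d

h = lambda theta: float(np.max(theta))     # largest component, eq. (29)

# Brute-force reference value of E[h] for d = 2 via rejection sampling;
# the unnormalized density is bounded by 2, so accept with probability pi/2.
rng = np.random.default_rng(6)
pi_2 = make_target(2)
proposals = rng.uniform(-2.0, 2.0, size=(200_000, 2))
kept = [h(v) for v in proposals if rng.uniform() < pi_2(v) / 2.0]
print(np.mean(kept))                        # close to 0.29 (cf. Table 2, case 1)
```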

Remark 9. In addition to the first three cases, five other scenarios with different probabilities of modes and different values of σ were examined in [25]. It is found that AIMS outperforms TMCMC in all these cases too.

Results presented in Table 2 help to shed some light on the properties of the optimal scaling parameter copt for the proposal density qj(·|ξ) = 𝒩(·|ξ, c²I_d). It appears that copt depends not only on the dimension d, which is expected, but also on N, the number of samples used per annealing level. The latter dependence is explained by the fact that the global proposal distribution π̂Nj(·) for the AIMS Markov chain depends on both N and c: π̂Nj(·) is a weighted sum of N RWMH transition kernels with Gaussian proposal distributions whose spread is controlled by c. When N is fixed, copt is a monotonically increasing function of d, since in higher dimensions, for optimal local exploration of the neighborhoods of θ(1)j−1, …, θ(N)j−1, we have to be able to make larger local jumps from θ(k)j−1 to ξl. When d is fixed, copt is a monotonically decreasing function of N, since the more samples θ(1)j−1, …, θ(N)j−1 that have been generated at the previous level, the more we can focus on local exploration of their neighborhoods without worrying too much about regions that lie far away. If we think of the support of qj(·|θ(k)j−1) = 𝒩(·|θ(k)j−1, c²I_d) as lying mostly in a d-dimensional ball of radius c centered at θ(k)j−1, then we can explain the dependence of copt on N as follows: the more d-dimensional balls of radius c we have, the smaller c we can use for covering the sample space.

Next we look at how the local and global acceptance rates (see Remark 12 in Appendix A) depend on the scaling parameter c. Figures 7–9 display these acceptance rates along with the coefficient of variation δ of the AIMS estimator for the first three cases: d = 2, 4, and 6, based on 50 independent runs. As expected, the global acceptance rate is always smaller than the local acceptance rate, and the minimum value of δ corresponds to the maximum value of the global acceptance rate. Observe also that the peak of the global acceptance rate slides to the left when j increases. This suggests that it is more efficient to use smaller values of c at higher annealing levels. Indeed, it is natural to expect that coptj  > coptj+1, since the intermediate distribution πj+1(·) is more concentrated than πj(·).

FIG. 7: Coefficient of variation δ of the AIMS estimate (top panel), global acceptance rate (middle panel), and local acceptance rate (bottom panel) as functions of c for case 1 (d = 2) [example 3.2].

FIG. 8: Coefficient of variation δ of the AIMS estimate (top panel), global acceptance rate (middle panel), and local acceptance rate (bottom panel) as functions of c for case 2 (d = 4) [example 3.2].

FIG. 9: Coefficient of variation δ of the AIMS estimate (top panel), global acceptance rate (middle panel), and local acceptance rate (bottom panel) as functions of c for case 3 (d = 6) [example 3.2].

Finally, we draw attention to case 4 in Table 2, where d = 10 with N = 10³ and N = 2 × 10³ samples per annealing level. Usually for Monte Carlo-based methods the coefficient of variation δ of the estimator is proportional to 1/√Nt, where Nt is the total number of samples. Thus, doubling the sample size will result in a reduction of δ by a factor of 1/√2 ≈ 0.71. For AIMS, however, the decrease of δ is more significant: from δ = 26.7% to δ = 12.2%, i.e., approximately by a factor of 0.46. This is because, as explained in Section 2.3, the increase of N affects not only the total sample size, but also improves the global proposal distribution π̂Nj(·). This improvement of π̂Nj(·) results in the generation of less correlated samples at each annealing level, and therefore leads to an additional reduction of the coefficient of variation δ.

3.3 Bayesian Updating of a Neural Network with One Hidden Layer

To illustrate the use of AIMS for Bayesian updating, consider its application to a feed-forward neural network model, one of the most popular and most widely used models for function approximation. The goal is to approximate a (potentially highly nonlinear) function ƒ : X → ℝ, where X ⊂ ℝp is a compact set, based on a finite number of measurements yi = ƒ(xi), i = 1, …, n, by using a finite sum of the form

$$\hat{f}(x, \theta) = \sum_{j=1}^{M} \alpha_j\, \psi\big(\langle \beta_j, x\rangle + \gamma_j\big), \tag{31}$$

where θ denotes the model parameters αj, γj ∈ ℝ, and βj ∈ ℝp for j = 1, …, M, 〈·, ·〉 is the standard scalar product in ℝp, and ψ is a sigmoidal function, the typical choice being either the logistic function or the tanh function that is used in this example:

$$\psi(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}. \tag{32}$$

Model (31) is called a feed-forward neural network (FFNN) with activation function (32), p input units, one hidden layer with M hidden units, and one output unit. The parameters βj and αj are called the connection weights from the input units to the hidden unit j and the connection weights from the hidden unit j to the output unit, respectively. The term γj is a designated bias of the hidden unit j and it can be viewed as a connection weight from an additional constant unit input. Schematically, the FFNN model is shown in Fig. 10.

FIG. 10: The feed-forward neural network model with one hidden layer (shown by hatching) [example 3.3].

The rationale behind the FFNN approximation method follows from the universal approximation property of FFNN models [32, 33]; that is, an FFNN with a sufficient number of hidden units and properly adjusted connection weights can approximate most functions arbitrarily well. More precisely, finite sums (31) over all positive integers M are dense in the set of real continuous functions on the p-dimensional unit cube.

Let 𝒜 denote the FFNN architecture, i.e., the input-output model (31) together with information about the type of activation function ψ, the number of input units p, and the number of hidden units M. In this example, we use p = 1, M = 2, and ψ given by (32), so the model parameters are θ = (α1, α2, β1, β2, γ1, γ2) ∈ Θ = ℝ⁶.
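As a sketch, the deterministic model (31)-(32) for the p = 1, M = 2 case used here can be written as:

```python
import numpy as np

def ffnn(x, theta):
    """Feed-forward network (31) with p = 1 input and M = 2 tanh hidden units;
    theta = (alpha1, alpha2, beta1, beta2, gamma1, gamma2), x is a scalar."""
    alpha, beta, gamma = theta[0:2], theta[2:4], theta[4:6]
    return np.sum(alpha * np.tanh(beta * x + gamma))

# Example: evaluate the network at x = 0.3 for an arbitrary parameter vector.
print(ffnn(0.3, np.array([5.0, -5.0, -1.0, -3.0, 5.0, 2.0])))
```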

The deterministic model 𝒜 of the function ƒ given by f̂(x, θ) in (31) can be used to construct a Bayesian (stochastic) model ℳ of the function ƒ by stochastic embedding (see the details in [34, 35]). Recall that, by definition, a Bayesian model ℳ consists of two components:

  1. An input-output probability model y ∼ p(y|x, θ, ℳ), which is obtained by introducing the prediction error
    $$\varepsilon = y - \hat{f}(x, \theta), \tag{33}$$
    which is the difference between the true output y = ƒ(x) and the deterministic model output f̂(x, θ). A probability model for ε is introduced by using the principle of maximum entropy [36, 37], which states that the probability model should be selected to produce the most uncertainty subject to the constraints that we wish to impose (the selection of any other probability model would lead to an unjustified reduction in the prediction uncertainty). In this example, we impose the following constraints: 𝔼[ε] = 0 and var[ε] = σ², with ε unbounded. The maximum entropy PDF for the prediction error is then ε ∼ 𝒩(0, σ²). This leads to the following input-output probability model:
    $$p(y|x, \theta, \mathcal{M}) = \mathcal{N}\big(y \,\big|\, \hat{f}(x,\theta),\, \sigma^2\big) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\big(y - \hat{f}(x,\theta)\big)^{2}}{2\sigma^{2}}\right). \tag{34}$$
    Here, the prediction-error variance σ² is included in the set of model parameters where, for convenience, we define θ₇ = log σ⁻², so the parameter space is now Θ = ℝ⁷.
  2. A prior PDF π0(θ|ℳ) over the parameter space which is chosen to quantify the initial relative plausibility of each value of θ in Θ. In this example, the prior distributions are assumed to be
    $$\alpha_j \sim \mathcal{N}\big(0, \sigma_\alpha^2\big), \quad \beta_j \sim \mathcal{N}\big(0, \sigma_\beta^2\big), \quad \gamma_j \sim \mathcal{N}\big(0, \sigma_\gamma^2\big), \quad j = 1, 2, \qquad \theta_7 \sim \mathcal{N}\big(0, \sigma_{\theta_7}^2\big), \tag{35}$$
    with σα = σβ = σγ = σθ7 = 5. Thus, the prior PDF in our case is
    $$\pi_0(\theta|\mathcal{M}) = \mathcal{N}\big(\theta_7\,\big|\,0, \sigma_{\theta_7}^{2}\big)\prod_{j=1}^{2} \mathcal{N}\big(\alpha_j\,\big|\,0, \sigma_\alpha^{2}\big)\,\mathcal{N}\big(\beta_j\,\big|\,0, \sigma_\beta^{2}\big)\,\mathcal{N}\big(\gamma_j\,\big|\,0, \sigma_\gamma^{2}\big). \tag{36}$$

Let 𝒟 denote the training data, 𝒟 = {(x1, y1), …, (xn, yn)}, treated as independent samples; then, the likelihood function which expresses the probability of getting data 𝒟 based on the probability model (34) is given by

p(𝒟|θ, ℳ) = ∏_{i=1}^{n} p(y_i|x_i, θ, ℳ) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) ∑_{i=1}^{n} (y_i − f̂(x_i, θ))²)    (37)

In this example, data are synthetically generated from (34) with α1 = 5, α2 = −5, β1 = −1, β2 = −3, γ1 = 5, γ2 = 2, σ = 0.1, and the input xi = i/10, for i = 1, …, n = 100.
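The synthetic training set just described can be reproduced, up to the unreported random seed, with a few lines built on the ffnn_output sketch above; the seed and the variable names are our own assumptions.

```python
rng = np.random.default_rng(0)       # seed chosen arbitrarily; the paper does not report one
theta_true = np.array([5.0, -5.0,    # alpha_1, alpha_2
                       -1.0, -3.0,   # beta_1, beta_2 (p = 1)
                       5.0, 2.0])    # gamma_1, gamma_2
sigma_true = 0.1
n = 100
x_data = np.arange(1, n + 1) / 10.0  # x_i = i/10, i = 1, ..., n
f_true = np.array([ffnn_output(x, theta_true) for x in x_data])
y_data = f_true + sigma_true * rng.standard_normal(n)   # y_i drawn from model (34)
```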

Finally, using Bayes theorem, we can write the posterior PDF π(θ|𝒟,ℳ) for the uncertain model parameters:

π(θ|𝒟, ℳ) ∝ p(𝒟|θ, ℳ) π0(θ|ℳ)    (38)
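A minimal sketch of the unnormalized log-posterior (38) follows, combining the likelihood (37) with the prior. Since Eqs. (35)–(36) are not reproduced above, the prior is taken here to be independent zero-mean Gaussians with the stated standard deviations (equal to 5) for all seven parameters; the zero means are our assumption.

```python
def log_prior(theta7, s=5.0):
    # Assumed prior: independent zero-mean Gaussians with standard deviation 5 for all seven
    # components of theta = (alpha_1, alpha_2, beta_1, beta_2, gamma_1, gamma_2, theta_7).
    return -0.5 * np.sum((theta7 / s) ** 2)

def log_likelihood(theta7, x_data, y_data):
    # Log of the likelihood (37) under the Gaussian model (34);
    # theta7[:6] are the FFNN weights and theta7[6] = log(sigma^{-2}).
    sigma2 = np.exp(-theta7[6])
    resid = y_data - np.array([ffnn_output(x, theta7[:6]) for x in x_data])
    n = len(y_data)
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum(resid ** 2) / sigma2

def log_posterior(theta7, x_data, y_data):
    # Unnormalized log-posterior (38): log prior + log likelihood.
    return log_prior(theta7) + log_likelihood(theta7, x_data, y_data)
```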

Under the Bayesian framework, the mean prediction of y = ƒ(x) for a given input x can be obtained by integrating out the nuisance parameters:

𝔼[y|x, 𝒟, ℳ] = ∫_Θ f̂(x, θ) π(θ|𝒟, ℳ) dθ    (39)

To demonstrate the efficiency of AIMS for the mean prediction problem, we use it to sample from the posterior PDF (38) and then use Monte Carlo simulation in (39). The parameters of the AIMS algorithm are chosen as follows: sample size N = 3 × 10³ per annealing level; the threshold for the ESS γ = 1/2; the proposal density qj(·|ξ) = 𝒩(·|ξ, c²I₇), with c = 0.5, where I₇ is the 7 × 7 identity matrix. This implementation of AIMS leads to a total of m = 10 intermediate distributions in the annealing scheme. The obtained posterior samples θ_m^(1), …, θ_m^(N) are then used to approximate the integral on the right-hand side of (39):

𝔼[y|x, 𝒟, ℳ] ≈ f̂_m(x) = (1/N) ∑_{i=1}^{N} f̂(x, θ_m^(i))    (40)

The true function y = ƒ(x) as well as its AIMS approximation f̂_m(x) are shown in Fig. 11. A few “intermediate approximations” f̂_j(x), which are based on θ_j^(1), …, θ_j^(N) ∼ πj, are plotted to show how f̂_j(x) approaches ƒ(x) as j → m. To visualize the uncertainty of the AIMS approximation, we plot its 5th and 95th percentiles in Fig. 12.
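Given posterior draws produced by AIMS (the sampler itself is not reproduced here), the Monte Carlo approximation (40) and the pointwise 5th/95th percentile bands shown in Fig. 12 can be computed as in the following sketch; posterior_samples is assumed to be an N × 7 array of draws θ_m^(i).

```python
def posterior_prediction(x_grid, posterior_samples):
    # Evaluate f_hat(x, theta_m^(i)) for each posterior draw and each grid point.
    preds = np.array([[ffnn_output(x, th[:6]) for x in x_grid]
                      for th in posterior_samples])    # shape (N, len(x_grid))
    mean_pred = preds.mean(axis=0)                     # Eq. (40): Monte Carlo mean prediction
    lo, hi = np.percentile(preds, [5, 95], axis=0)     # pointwise 5th/95th percentiles (Fig. 12)
    return mean_pred, lo, hi

# Hypothetical usage, assuming `samples` holds the N = 3000 AIMS posterior draws:
# x_grid = np.linspace(0.1, 10.0, 200)
# mean_pred, lo, hi = posterior_prediction(x_grid, samples)
```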

FIG. 11: The true function ƒ(x) (solid curve), its posterior approximation f̂_10(x) (dashed curve), which is constructed using AIMS, and “intermediate annealing approximations”: f̂_0(x) (dotted curve), which is based on prior samples, and f̂_2(x) and f̂_3(x) (dashed-dotted curves) [example 3.3].

FIG. 12: The true function ƒ(x) (solid curve), its AIMS approximation f̂_10(x) (dashed curve), and the 5th and 95th percentiles of f̂_10(x) (dotted curves) [example 3.3].

4. CONCLUDING REMARKS

In this paper, a new scheme for sampling from posterior distributions, called asymptotically independent Markov sampling (AIMS), is introduced. The algorithm is based on three well-established and widely used stochastic simulation methods: importance sampling, MCMC, and simulated annealing. The key idea behind AIMS is to use N samples drawn from πj−1(·) as an importance sampling density to construct an approximation π̂_j^N(·) of πj(·), where π0(·), …, πm(·) is a sequence of intermediate distributions interpolating between the prior π0(·) and posterior π(·) = πm(·). This approximation is then employed as the independent proposal distribution for sampling from πj(·) by the independent Metropolis-Hastings algorithm. When N → ∞, the AIMS sampler generates independent draws from the target distribution, hence the name of the algorithm.

Important ergodic properties of AIMS are derived in the Appendix. In particular, it is shown that under certain conditions (that are often fulfilled in practice), the AIMS algorithm produces a uniformly ergodic Markov chain. The choice of the free parameters of the algorithm is discussed and recommendations are provided for their values, both theoretically and heuristically based. The efficiency of AIMS is demonstrated with three examples, which include both multimodal and higher-dimensional target posterior distributions.

ACKNOWLEDGMENT

This work was supported by the National Science Foundation under award number EAR-0941374 to the California Institute of Technology. This support is gratefully acknowledged. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

1. Metropolis, N. and Ulam, S., The Monte Carlo method, J. Am. Stat. Assoc., 44:335–341, 1949.

2. Kahn, H. and Marshall, A. W., Methods of reducing sample size in Monte Carlo computations, J. Oper. Res. Soc. Am., 1(5):263–278, 1953.

3. Liu, J. S., Monte Carlo Strategies in Scientific Computing, Springer Series in Statistics, New York, 2001.

4. Robert, C. P. and Casella, G., Monte Carlo Statistical Methods, 2nd ed. Springer Texts in Statistics, New York, 2004.

5. Geweke, J., Bayesian inference in econometric models using Monte Carlo integration, Econometrica, 57(6):1317–1339, 1989.

6. Au, S. K. and Beck, J. L., Importance sampling in high dimensions, Struct. Safety, 25(2):139–163, 2003.

7. Rubinstein, R., Simulation and the Monte Carlo Method, John Wiley, New York, 1981.

8. Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., Markov Chain Monte Carlo in Practice, Chapman and Hall, London, 1996.

9. Neal, R. M., Probabilistic Inference Using Markov Chain Monte Carlo Methods, Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 1993.

10. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E., Equation of state calculations by fast computing machines, J. Chem. Phys., 21(6):1087–1092, 1953.

11. Hastings, W. K., Monte Carlo sampling methods using Markov chains and their applications, Biometrika, 57(1):97–109, 1970.

12. Neal, R. M., Sampling from multimodal distributions using tempered transitions, Stat. Comput., 6:353–366, 1996.

13. Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., Optimization by simulated annealing, Science, 220(4598):671–680, 1983.

14. Černý, V., A thermodynamical approach to the travelling salesman problem: An efficient simulation algorithm, J. Optim. Theory Appl., 45(1):41–51, 1985.

15. Geman, S. and Geman, D., Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741, 1984.

16. Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D., Hybrid Monte Carlo, Phys. Lett. B, 195(2):216–222, 1987.

17. Cheung, S. H. and Beck, J. L., Bayesian model updating using Hybrid Monte Carlo simulation with application to structural dynamic models with many uncertain parameters, J. Eng. Mech., 135(4):243–255, 2009.

18. Neal, R. M., MCMC using Hamiltonian dynamics, In Handbook of Markov Chain Monte Carlo, edited by S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, Chapman & Hall/CRC Press, London, 2011.

19. Marinari, E. and Parisi, G., Simulated tempering: A new Monte Carlo scheme, Europhys. Lett., 19(6):451–458, 1992.

20. Geyer, C. J. and Thompson, E. A., Annealing Markov chain Monte Carlo with applications to ancestral inference, J. Am. Stat. Assoc., 90(431):909–920, 1995.

21. Neal, R. M., Annealed importance sampling, Stat. Comput., 11:125–139, 2001.

22. Beck, J. L. and Au, S. K., Bayesian updating of structural models and reliability using Markov chain Monte Carlo simulation, J. Eng. Mech., 128(4):380–391, 2002.

23. Del Moral, P., Doucet, A., and Jasra, A., Sequential Monte Carlo samplers, J. R. Stat. Soc., Ser. B, 68(3):411–436, 2006.

24. Wan, J. and Zabaras, N., A Bayesian approach to multiscale inverse problems using the sequential Monte Carlo method, Inverse Probl., 27(10):105004, 2011.

25. Ching, J. and Chen, Y.-C., Transitional Markov chain Monte Carlo method for Bayesian model updating, model class selection, and model averaging, J. Eng. Mech., 133(7):816–832, 2007.

26. Cheung, S. H. and Beck, J. L., Calculation of posterior probabilities for Bayesian model class assessment and averaging from posterior samples based on dynamic system data, J. Comput. Aided Civil Infrastruct. Eng., 25(5):304–321, 2010.

27. Kong, A., Liu, J. S., and Wong, W. H., Sequential imputations and Bayesian missing data problems, J. Am. Stat. Assoc., 89(425):278–288, 1994.

28. Liu, J. S., Metropolized independent sampling with comparison to rejection sampling and importance sampling, Stat. Comput., 6(2):113–119, 1996.

29. Cowles, M. K. and Carlin, B. P., Markov chain Monte Carlo convergence diagnostics: A comparative review, J. Am. Stat. Assoc., 91(434):883–904, 1996.

30. Gelman, A., Roberts, G. O., and Gilks, W. R., Efficient metropolis jumping rules, Bayesian Stat., 5:599–607, 1996.

31. Au, S. K. and Beck, J. L., Estimation of small failure probabilities in high dimensions by subset simulation, Probab. Eng. Mech., 16(4):263–277, 2001.

32. Cybenko, G., Approximation by superpositions of a sigmoidal function, Math. Control, Signals, Syst., 2:303–314, 1989.

33. Hornik, K., Stinchcombe, M., and White, H., Multilayer feedforward networks are universal approximators, Neural Networks, 2:359–366, 1989.

34. Beck, J. L., Probability Logic, Information Quantification and Robust Predictive System Analysis, Technical Report EERL 2008-05, Earthquake Engineering Research Laboratory, California Institute of Technology, Pasadena, CA, 2008.

35. Beck, J. L., Bayesian system identification based on probability logic, Struct. Control Health Monit., 17:825–847, 2010.

36. Jaynes, E. T., Information theory and statistical mechanics, Phys. Rev., 106:620–630, 1957.

37. Jaynes, E. T., Probability Theory: The Logic of Science, Ed. G. L. Bretthorst, Cambridge University Press, Cambridge, 2003.

38. Tierney, L., Markov chains for exploring posterior distributions, Ann. Stat., 22(4):1701–1762, 1994.

39. Meyn, S. and Tweedie, R. L., Markov Chains and Stochastic Stability, Cambridge University Press, Cambridge, 2009.

APPENDIX A. ERGODIC PROPERTIES OF AIMS

Because the discussion in Section 2.1, which motivated AIMS at annealing level j, involved delta functions and formal equalities (12) and (13), we cannot simply rely on the convergence of the IMH algorithm to verify AIMS; a rigorous proof is needed. First, we prove that the described algorithm indeed generates a Markov chain with stationary distribution πj(·). We also explain that, when the proposal density qj(·|ξ) is reasonably chosen, πj(·) is the unique (and, therefore, limiting) stationary distribution of the corresponding Markov chain.

Theorem 1. Let θ_j^(1), θ_j^(2), … be the Markov chain on Θj = Θ ∖ {θ_{j−1}^(1), …, θ_{j−1}^(N_{j−1})} generated by the AIMS algorithm at annealing level j; then πj(·) is a stationary distribution of this Markov chain.

Proof. Let 𝒦j(·|·) denote the transition kernel of the Markov chain generated by AIMS at annealing level j. From the description of the algorithm it follows that 𝒦j(·|·) has the following form:

(A.1)

where 𝒜j(θ) is the probability of having a proper transition from θ to Θj ∖ {θ}:

(A.2)

A sufficient condition for πj(·) to be a stationary distribution is for 𝒦j(·|·) to satisfy the detailed balance condition:

(A.3)

Without loss of generality, we assume that θ ≠ ξ, since otherwise (A.3) is trivial. In this case, 𝒦j(dξ|θ) is given by the first term in (A.1), since the second term vanishes. Thus, all we need to prove is that the function

(A.4)

is symmetric with respect to permutation θ ↔ ξ, for all θ, ξ ∈ Θj. Taking into account (12) and the simple fact that a min{1,b/a} = b min {1,a/b} for all a, b > 0, we have

(A.5)

This proves that πj(·) is a stationary distribution of the AIMS Markov chain.

A stationary distribution is unique and is the limiting distribution for a Markov chain if the chain is aperiodic and irreducible (see, for example, [38]). In the case of AIMS, aperiodicity is guaranteed by the fact that the probability of having a repeated sample θ_j^(i+1) = θ_j^(i) is not zero: for example, if the local candidate state ξl is rejected in step 1c, then we automatically have θ_j^(i+1) = θ_j^(i). A Markov chain with stationary distribution π(·) is irreducible if, for any initial state, it has positive probability of entering any set to which π(·) assigns positive probability. It is clear that if the proposal distribution qj(·|ξ) is “standard” (e.g., Gaussian, uniform, log-normal, etc.), then AIMS generates an irreducible Markov chain. In this case, πj(·) is therefore the unique stationary distribution of the AIMS Markov chain, and for every θ ∈ Θj

lim_{n→∞} ‖𝒦_j^n(·|θ) − πj(·)‖TV = 0,    (A.6)

with ‖ · ‖TV denoting the total variation distance. Recall that the total variation distance between two measures μ1(·) and μ2(·) on Θ is defined as ‖μ1(·) − μ2(·)‖TV = sup_{A⊂Θ} |μ1(A) − μ2(A)|. In a simulation setup, the most important consequence of convergence property (A.6) is, of course, that the sample mean converges to the expectation of a measurable function of interest almost surely:

(1/n) ∑_{i=1}^{n} h(θ_j^(i)) → ∫_Θ h(θ) πj(θ) dθ    almost surely as n → ∞    (A.7)

Convergence (A.6) ensures the proper behavior of the AIMS chain θ_j^(1), θ_j^(2), … regardless of the initial state θ_j^(1). A more detailed description of convergence properties involves the study of the speed of convergence of 𝒦_j^n(·|θ) to πj(·). Evaluation (or estimation) of this speed is very important for any MCMC algorithm, since it relates to a stopping rule for this algorithm: the higher the speed of convergence 𝒦_j^n(·|θ) → πj(·), the fewer samples are needed to obtain an accurate estimate in (A.7). Recall, following [39], that a chain θ(1), θ(2), … is called uniformly ergodic if

lim_{n→∞} sup_{θ∈Θ} ‖𝒦^n(·|θ) − π(·)‖TV = 0    (A.8)

The property of uniform ergodicity is stronger than (A.6), since it guarantees that the speed of convergence is uniform over the whole space. Moreover, a Markov chain is uniformly ergodic if and only if there exist r > 1 and R < ∞ such that for all θ ∈ Θ

‖𝒦^n(·|θ) − π(·)‖TV ≤ R r^{−n},  n = 1, 2, …,    (A.9)

that is, the convergence in (A.8) takes place at uniform geometric rate [39].

Theorem 2. If there exists a constant M such that for all θ ∈ Θj

(A.10)

then the AIMS algorithm at annealing level j produces a uniformly ergodic chain and

(A.11)

Proof. To prove the first part of the theorem we will need the notion of a small set [39]. A set A ⊂ Θ is called a small set if there exists an integer m > 0 and a nontrivial measure μm on Θ, such that for all θ ∈ A, B ⊂ Θ

𝒦^m(B|θ) ≥ μ_m(B)    (A.12)

In this case we say that A is μm-small. It can be shown [39] that a Markov chain is uniformly ergodic if and only if its state space is μm-small for some m. Thus, to prove the theorem, it is enough to show that Θj is a small set.

If (A.10) is satisfied, then the following holds for transition kernel (A.1) for θ ∈ Θj and B ⊂ Θj:

(A.13)

The sample space Θj is therefore πj/M-small, and the corresponding Markov chain is uniformly ergodic.

To prove bound (A.11), first observe, using (A.13), that

(A.14)

For n > 1, using the Chapman-Kolmogorov equation 𝒦^{m+n}(A|θ) = ∫_Θ 𝒦^m(A|ξ) 𝒦^n(dξ|θ) and stationarity of πj(·) with respect to 𝒦j(·|·), we have

(A.15)

where the last equality follows from the fact that ∫_{Θj} 𝒦_j^{n−1}(dξ|θ) = ∫_{Θj} πj(ξ) dξ = 1. Finally, we obtain

(A.16)

Remark 10. Note that if there exists a constant M such that (A.10) holds for all θ ∈ Θj, then M > 1 automatically.

Corollary 1. If Θ ⊂ ℝd is a compact set and qj(·|ξ) is a Gaussian distribution centered at ξ then the AIMS algorithm at annealing level j produces a uniformly ergodic chain and (A.11) holds with M given by

(A.17)

Proof. Let us show that in this case condition (A.10) is always fulfilled. For any θ ∈ Θj we have

(A.18)

Thus, (A.10) holds with M given by (A.17).

Remark 11. Note that the assumption of compactness of the sample space Θ is not very restrictive and is satisfied in most Bayesian statistics problems. Indeed, to fulfill this condition, it is enough to take a prior distribution π0(·) with compact support. Moreover, it is clear from the proof that the conclusion of Corollary 1 holds for other “reasonable” (not only Gaussian) proposal distributions qj(·|ξ). Therefore, the AIMS algorithm will produce a uniformly ergodic Markov chain in many practical cases.

It has been recognized for a long time that when using an MCMC algorithm, it is useful to monitor its acceptance rate 𝒜, i.e., the expected probability of having a proper Markov jump from θ^(i) to θ^(i+1) ≠ θ^(i). While for the RWMH algorithm finding the optimal acceptance rate is a difficult problem (neither high nor low 𝒜 is good [30]), for IMH the picture is rather simple: the higher 𝒜, the better [4]. Since AIMS is based on the IMH algorithm, their properties are very similar. In particular, one should aim for the highest possible acceptance rate of the global candidate state ξg when implementing AIMS.
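To illustrate the monitoring idea only (this is not the AIMS transition kernel, which additionally involves the local random-walk step and the approximation π̂_j^N), the following Python sketch runs a generic independence Metropolis-Hastings loop and records the empirical acceptance rate; all names are our own.

```python
import numpy as np

def imh_with_acceptance_rate(log_target, propose, proposal_logpdf, theta0, n_steps, rng=None):
    # Generic independence Metropolis-Hastings sampler that tracks the fraction
    # of accepted ("proper") jumps; useful for monitoring the acceptance rate A.
    rng = np.random.default_rng() if rng is None else rng
    theta, accepted, chain = theta0, 0, [theta0]
    for _ in range(n_steps):
        xi = propose(rng)                                  # candidate drawn independently of theta
        log_ratio = (log_target(xi) - log_target(theta)
                     + proposal_logpdf(theta) - proposal_logpdf(xi))
        if np.log(rng.uniform()) < log_ratio:              # accept: proper Markov jump
            theta = xi
            accepted += 1
        chain.append(theta)
    return np.array(chain), accepted / n_steps             # samples and empirical acceptance rate
```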

We finish this section with a result that provides bounds for the acceptance rate of the AIMS algorithm. These bounds can be useful for finding the optimal implementation parameters.

Theorem 3. Let 𝒜j be the expected probability of having a proper Markov transition associated with the AIMS algorithm at annealing level j. Then

(A.19)

where a_j(θ_{j−1}^(i)) is the probability (11) of having a proper transition under the RWMH transition kernel (10). If (A.10) holds, then

(A.20)

Proof. For every θ ∈ Θj, the probability 𝒜j(θ) of a proper transition from θ to Θj ∖ {θ} is given by (A.2). For its expected value we have

(A.21)

To prove the lower bound (A.20), we use (12) in the equation defining 𝒜j :

(A.22)

where the last probability is equal to 1/2, because θ and ξ are i.i.d. according to πj(·), and hence the result.

Remark 12. The AIMS algorithm at annealing level j has two accept/reject steps: one for the local candidate ξl (step 1c) and another for the global candidate ξg (step 2). The right-hand side of (A.19) is nothing but the local acceptance rate, i.e., the expected probability of generating a proper local candidate state ξl ∉ {θ_{j−1}^(1), …, θ_{j−1}^(N_{j−1})}. Basically, (A.19) says that the global acceptance rate 𝒜j can never exceed the local acceptance rate. In fact, this can be deduced directly from the description of the algorithm, since if the local candidate ξl is rejected, then the global candidate ξg is automatically rejected and we have a repeated sample θ_j^(i+1) = θ_j^(i).