{"title": "A-NICE-MC: Adversarial Training for MCMC", "book": "Advances in Neural Information Processing Systems", "page_first": 5140, "page_last": 5150, "abstract": "Existing Markov Chain Monte Carlo (MCMC) methods are either based on general-purpose and domain-agnostic schemes, which can lead to slow convergence, or require hand-crafting of problem-specific proposals by an expert. We propose A-NICE-MC, a novel method to train flexible parametric Markov chain kernels to produce samples with desired properties. First, we propose an efficient likelihood-free adversarial training method to train a Markov chain and mimic a given data distribution. Then, we leverage flexible volume preserving flows to obtain parametric kernels for MCMC. Using a bootstrap approach, we show how to train efficient Markov Chains to sample from a prescribed posterior distribution by iteratively improving the quality of both the model and the samples. A-NICE-MC provides the first framework to automatically design efficient domain-specific MCMC proposals. Empirical results demonstrate that A-NICE-MC combines the strong guarantees of MCMC with the expressiveness of deep neural networks, and is able to significantly outperform competing methods such as Hamiltonian Monte Carlo.", "full_text": "A-NICE-MC: Adversarial Training for MCMC\n\nJiaming Song\n\nStanford University\n\ntsong@cs.stanford.edu\n\nShengjia Zhao\n\nStanford University\n\nzhaosj12@cs.stanford.edu\n\nStefano Ermon\n\nStanford University\n\nermon@cs.stanford.edu\n\nAbstract\n\nExisting Markov Chain Monte Carlo (MCMC) methods are either based on general-\npurpose and domain-agnostic schemes, which can lead to slow convergence, or\nproblem-speci\ufb01c proposals hand-crafted by an expert. In this paper, we propose A-\nNICE-MC, a novel method to automatically design ef\ufb01cient Markov chain kernels\ntailored for a speci\ufb01c domain. First, we propose an ef\ufb01cient likelihood-free adver-\nsarial training method to train a Markov chain and mimic a given data distribution.\nThen, we leverage \ufb02exible volume preserving \ufb02ows to obtain parametric kernels\nfor MCMC. Using a bootstrap approach, we show how to train ef\ufb01cient Markov\nchains to sample from a prescribed posterior distribution by iteratively improving\nthe quality of both the model and the samples. Empirical results demonstrate that\nA-NICE-MC combines the strong guarantees of MCMC with the expressiveness of\ndeep neural networks, and is able to signi\ufb01cantly outperform competing methods\nsuch as Hamiltonian Monte Carlo.\n\n1\n\nIntroduction\n\nVariational inference (VI) and Monte Carlo (MC) methods are two key approaches to deal with\ncomplex probability distributions in machine learning. The former approximates an intractable\ndistribution by solving a variational optimization problem to minimize a divergence measure with\nrespect to some tractable family. The latter approximates a complex distribution using a small number\nof typical states, obtained by sampling ancestrally from a proposal distribution or iteratively using a\nsuitable Markov chain (Markov Chain Monte Carlo, or MCMC).\nRecent progress in deep learning has vastly advanced the \ufb01eld of variational inference. Notable\nexamples include black-box variational inference and variational autoencoders [1\u20133], which enabled\nvariational methods to bene\ufb01t from the expressive power of deep neural networks, and adversarial\ntraining [4, 5], which allowed the training of new families of implicit generative models with ef\ufb01cient\nancestral sampling. MCMC methods, on the other hand, have not bene\ufb01ted as much from these recent\nadvancements. Unlike variational approaches, MCMC methods are iterative in nature and do not\nnaturally lend themselves to the use of expressive function approximators [6, 7]. Even evaluating\nan existing MCMC technique is often challenging, and natural performance metrics are intractable\nto compute [8\u201311]. De\ufb01ning an objective to improve the performance of MCMC that can be easily\noptimized in practice over a large parameter space is itself a dif\ufb01cult problem [12, 13].\nTo address these limitations, we introduce A-NICE-MC, a new method for training \ufb02exible MCMC\nkernels, e.g., parameterized using (deep) neural networks. Given a kernel, we view the resulting\nMarkov Chain as an implicit generative model, i.e., one where sampling is ef\ufb01cient but evaluating the\n(marginal) likelihood is intractable. We then propose adversarial training as an effective, likelihood-\nfree method for training a Markov chain to match a target distribution. First, we show it can be used in\na learning setting to directly approximate an (empirical) data distribution. We then use the approach\nto train a Markov Chain to sample ef\ufb01ciently from a model prescribed by an analytic expression (e.g.,\na Bayesian posterior distribution), the classic use case for MCMC techniques. We leverage \ufb02exible\nvolume preserving \ufb02ow models [14] and a \u201cbootstrap\u201d technique to automatically design powerful\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fdomain-speci\ufb01c proposals that combine the guarantees of MCMC and the expressiveness of neural\nnetworks. Finally, we propose a method that decreases autocorrelation and increases the effective\nsample size of the chain as training proceeds. We demonstrate that these trained operators are able to\nsigni\ufb01cantly outperform traditional ones, such as Hamiltonian Monte Carlo, in various domains.\n\n2 Notations and Problem Setup\n\nA sequence of continuous random variables {xt}1t=0, xt 2 Rn, is drawn through the following\nMarkov chain:\n\nx0 \u21e0 \u21e10\n\nxt+1 \u21e0 T\u2713(xt+1|xt)\n\nwhere T\u2713(\u00b7|x) is a time-homogeneous stochastic transition kernel parametrized by \u2713 2 \u21e5 and \u21e10\nis some initial distribution for x0. In particular, we assume that T\u2713 is de\ufb01ned through an implicit\ngenerative model f\u2713(\u00b7|x, v), where v \u21e0 p(v) is an auxiliary random variable, and f\u2713 is a deterministic\ntransformation (e.g., a neural network). Let \u21e1t\n\u2713 denote the distribution for xt. If the Markov chain is\n\u2713. We\nboth irreducible and positive recurrent, then it has an unique stationary distribution \u21e1\u2713 = lim\n\u21e1t\nt!1\nassume that this is the case for all the parameters \u2713 2 \u21e5.\nLet pd(x) be a target distribution over x 2 Rn, e.g, a data distribution or an (intractable) posterior\ndistribution in a Bayesian inference setting. Our objective is to \ufb01nd a T\u2713 such that:\n\n1. Low bias: The stationary distribution is close to the target distribution (minimize |\u21e1\u2713 pd|).\n2. Ef\ufb01ciency: {\u21e1t\n3. Low variance: Samples from one chain {xt}1t=0 should be as uncorrelated as possible\n\n\u2713}1t=0 converges quickly (minimize t such that |\u21e1t\n\n\u2713 pd| < ).\n\n(minimize autocorrelation of {xt}1t=0).\n\nWe think of \u21e1\u2713 as a stochastic generative model, which can be used to ef\ufb01ciently produce samples\nwith certain characteristics (speci\ufb01ed by pd), allowing for ef\ufb01cient Monte Carlo estimates. We\nconsider two settings for specifying the target distribution. The \ufb01rst is a learning setting where we do\nnot have an analytic expression for pd(x) but we have access to typical samples {si}m\ni=1 \u21e0 pd; in the\nsecond case we have an analytic expression for pd(x), possibly up to a normalization constant, but no\naccess to samples. The two cases are discussed in Sections 3 and 4 respectively.\n\n3 Adversarial Training for Markov Chains\n\nConsider the setting where we have direct access to samples from pd(x). Assume that the transition\nkernel T\u2713(xt+1|xt) is the following implicit generative model:\nv \u21e0 p(v) xt+1 = f\u2713(xt, v)\n\n(1)\n\nAssuming a stationary distribution \u21e1\u2713(x) exists, the value of \u21e1\u2713(x) is typically intractable to compute.\n\u2713(x) at time t is also intractable, since it involves integration over all the\nThe marginal distribution \u21e1t\npossible paths (of length t) to x. However, we can directly obtain samples from \u21e1t\n\u2713, which will be\nclose to \u21e1\u2713 if t is large enough (assuming ergodicity). This aligns well with the idea of generative\nadversarial networks (GANs), a likelihood free method which only requires samples from the model.\nGenerative Adversarial Network (GAN) [4] is a framework for training deep generative models using\na two player minimax game. A generator network G generates samples by transforming a noise\nvariable z \u21e0 p(z) into G(z). A discriminator network D(x) is trained to distinguish between \u201cfake\u201d\nsamples from the generator and \u201creal\u201d samples from a given data distribution pd. Formally, this\nde\ufb01nes the following objective (Wasserstein GAN, from [15])\n\nmin\nG\n\nmax\n\nD\n\nV (D, G) = min\nG\n\nmax\n\nD\n\nEx\u21e0pd[D(x)] Ez\u21e0p(z)[D(G(z))]\n\n(2)\n\nIn our setting, we could assume pd(x) is the empirical distribution from the samples, and choose\nz \u21e0 \u21e10 and let G\u2713(z) be the state of the Markov Chain after t steps, which is a good approximation\nof \u21e1\u2713 if t is large enough. However, optimization is dif\ufb01cult because we do not know a reasonable t\nin advance, and the gradient updates are expensive due to backpropagation through the entire chain.\n\n2\n\n\fFigure 1: Visualizing samples of \u21e11 to \u21e150 (each row) from a model trained on the MNIST dataset.\nConsecutive samples can be related in label (red box), inclination (green box) or width (blue box).\n\nFigure 2: T\u2713(yt+1|yt). Figure 3: Samples of \u21e11 to \u21e130 from models (top: without shortcut connec-\n\ntions; bottom: with shortcut connections) trained on the CelebA dataset.\n\nTherefore, we propose a more ef\ufb01cient approximation, called Markov GAN (MGAN):\n\nmin\n\n\u2713\n\nmax\n\nD\n\nEx\u21e0pd[D(x)] E\u00afx\u21e0\u21e1b\n\n\u2713\n\n[D(\u00afx)] (1 )Exd\u21e0pd,\u00afx\u21e0T m\n\n\u2713 (\u00afx|xd)[D(\u00afx)]\n\n(3)\n\nwhere 2 (0, 1), b 2 N+, m 2 N+ are hyperparameters, \u00afx denotes \u201cfake\u201d samples from the\ngenerator and T m\n\u2713 (x|xd) denotes the distribution of x when the transition kernel is applied m times,\nstarting from some \u201creal\u201d sample xd.\nWe use two types of samples from the generator for training, optimizing \u2713 such that the samples will\nfool the discriminator:\n\n1. Samples obtained after b transitions \u00afx \u21e0 \u21e1b\n2. Samples obtained after m transitions, starting from a data sample xd \u21e0 pd.\n\n\u2713, starting from x0 \u21e0 \u21e10.\n\nIntuitively, the \ufb01rst condition encourages the Markov Chain to converge towards pd over relatively\nshort runs (of length b). The second condition enforces that pd is a \ufb01xed point for the transition\noperator. 1 Instead of simulating the chain until convergence, which will be especially time-consuming\nif the initial Markov chain takes many steps to mix, the generator would run only (b + m)/2 steps\non average. Empirically, we observe better training times by uniformly sampling b from [1, B] and\nm from [1, M ] respectively in each iteration, so we use B and M as the hyperparameters for our\nexperiments.\n\nz = encoder\u2713(xt)\n\n3.1 Example: Generative Model for Images\nWe experiment with a distribution pd over images, such as digits (MNIST) and faces (CelebA). In\nthe experiments, we parametrize f\u2713 to have an autoencoding structure, where the auxiliary variable\nv \u21e0N (0, I) is directly added to the latent code of the network serving as a source of randomness:\n(4)\nwhere is a hyperparameter we set to 0.1. While sampling is inexpensive, evaluating probabilities\naccording to T\u2713(\u00b7|xt) is generally intractable as it would require integration over v. The starting\ndistribution \u21e10 is a factored Gaussian distribution with mean and standard deviation being the mean\nand standard deviation of the training set. We include all the details, which ares based on the DCGAN\n[16] architecture, in Appendix E.1. All the models are trained with the gradient penalty objective for\nWasserstein GANs [17, 15], where = 1/3, B = 4 and M = 3.\nWe visualize the samples generated from our trained Markov chain in Figures 1 and 3, where each\nrow shows consecutive samples from the same chain (we include more images in Appendix F) From\n\nz0 = ReLU(z + v) xt+1 = decoder\u2713(z0)\n\n1We provide a more rigorous justi\ufb01cation in Appendix B.\n\n3\n\n\fFigure 1 it is clear that xt+1 is related to xt in terms of high-level properties such as digit identity\n(label). Our model learns to \ufb01nd and \u201cmove between the modes\u201d of the dataset, instead of generating\na single sample ancestrally. This is drastically different from other iterative generative models trained\nwith maximum likelihood, such as Generative Stochastic Networks (GSN, [18]) and Infusion Training\n(IF, [19]), because when we train T\u2713(xt+1|xt) we are not specifying a particular target for xt+1. In\nfact, to maximize the discriminator score the model (generator) may choose to generate some xt+1\nnear a different mode.\nTo further investigate the frequency of various modes in the stationary distribution, we consider the\nclass-to-class transition probabilities for MNIST. We run one step of the transition operator starting\nfrom real samples where we have class labels y 2{ 0, . . . , 9}, and classify the generated samples\nwith a CNN. We are thus able to quantify the transition matrix for labels in Figure 2. Results show\nthat class probabilities are fairly uniform and range between 0.09 and 0.11.\nAlthough it seems that the MGAN objective encourages rapid transitions between different modes,\nit is not always the case. In particular, as shown in Figure 3, adding residual connections [20] and\nhighway connections [21] to an existing model can signi\ufb01cantly increase the time needed to transition\nbetween modes. This suggests that the time needed to transition between modes can be affected by\nthe architecture we choose for f\u2713(xt, v). If the architecture introduces an information bottleneck\nwhich forces the model to \u201cforget\u201d xt, then xt+1 will have higher chance to occur in another mode;\non the other hand, if the model has shortcut connections, it tends to generate xt+1 that are close to\nxt. The increase in autocorrelation will hinder performance if samples are used for Monte Carlo\nestimates.\n\n4 Adversarial Training for Markov Chain Monte Carlo\n\nWe now consider the setting where the target distribution pd is speci\ufb01ed by an analytic expression:\n(5)\nwhere U (x) is a known \u201cenergy function\u201d and the normalization constant in Equation (5) might be\nintractable to compute. This form is very common in Bayesian statistics [22], computational physics\n[23] and graphics [24]. Compared to the setting in Section 3, there are two additional challenges:\n\npd(x) / exp(U (x))\n\n1. We want to train a Markov chain such that the stationary distribution \u21e1\u2713 is exactly pd;\n2. We do not have direct access to samples from pd during training.\n\npd(x0)\npd(x)\n\ng\u2713(x|x0)\n\nA\u2713(x0|x) = min\u27131,\n\npd(x)T\u2713(x0|x) = pd(x0)T\u2713(x|x0)\ng\u2713(x0|x)\u25c6 = min\u27131, exp(U (x) U (x0))\n\n4.1 Exact Sampling Through MCMC\nWe use ideas from the Markov Chain Monte Carlo (MCMC) literature to satisfy the \ufb01rst condition\nand guarantee that {\u21e1t\n\u2713}1t=0 will asymptotically converge to pd. Speci\ufb01cally, we require the transition\noperator T\u2713(\u00b7|x) to satisfy the detailed balance condition:\n(6)\nfor all x and x0. This condition can be satis\ufb01ed using Metropolis-Hastings (MH), where a sample x0\nis \ufb01rst obtained from a proposal distribution g\u2713(x0|x) and accepted with the following probability:\n(7)\nTherefore, the resulting MH transition kernel can be expressed as T\u2713(x0|x) = g\u2713(x0|x)A\u2713(x0|x) (if\nx 6= x0), and it can be shown that pd is stationary for T\u2713(\u00b7|x) [25].\nThe idea is then to optimize for a good proposal g\u2713(x0|x). We can set g\u2713 directly as in Equation (1)\n(if f\u2713 takes a form where the probability g\u2713 can be computed ef\ufb01ciently), and attempt to optimize\nthe MGAN objective in Eq. (3) (assuming we have access to samples from pd, a challenge we will\naddress later). Unfortunately, Eq. (7) is not differentiable - the setting is similar to policy gradient\noptimization in reinforcement learning. In principle, score function gradient estimators (such as\nREINFORCE [26]) could be used in this case; in our experiments, however, this approach leads to\nextremely low acceptance rates. This is because during initialization, the ratio g\u2713(x|x0)/g\u2713(x0|x) can\nbe extremely low, which leads to low acceptance rates and trajectories that are not informative for\ntraining. While it might be possible to optimize directly using more sophisticated techniques from\nthe RL literature, we introduce an alternative approach based on volume preserving dynamics.\n\ng\u2713(x|x0)\n\ng\u2713(x0|x)\u25c6\n\n4\n\n\f4.2 Hamiltonian Monte Carlo and Volume Preserving Flow\nTo gain some intuition to our method, we introduce Hamiltonian Monte Carlo (HMC) and volume\npreserving \ufb02ow models [27]. HMC is a widely applicable MCMC method that introduces an auxiliary\n\u201cvelocity\u201d variable v to g\u2713(x0|x). The proposal \ufb01rst draws v from p(v) (typically a factored Gaussian\ndistribution) and then obtains (x0, v0) by simulating the dynamics (and inverting v at the end of the\nsimulation) corresponding to the Hamiltonian\n\n(8)\nwhere x and v are iteratively updated using the leapfrog integrator (see [27]). The transition from\n(x, v) to (x0, v0) is deterministic, invertible and volume preserving, which means that\n\nH(x, v) = v>v/2 + U (x)\n\ng\u2713(x0, v0|x, v) = g\u2713(x, v|x0, v0)\n\n(9)\nMH acceptance (7) is computed using the distribution p(x, v) = pd(x)p(v), where the acceptance\nprobability is p(x0, v0)/p(x, v) since g\u2713(x0, v0|x, v)/g\u2713(x, v|x0, v0) = 1. We can safely discard v0\nafter the transition since x and v are independent.\nLet us return to the case where the proposal is parametrized by a neural network; if we could satisfy\nEquation 9 then we could signi\ufb01cantly improve the acceptance rate compared to the \u201cREINFORCE\u201d\nsetting. Fortunately, we can design such an proposal by using a volume preserving \ufb02ow model [14].\nA \ufb02ow model [14, 28\u201330] de\ufb01nes a generative model for x 2 Rn through a bijection f : h ! x,\nwhere h 2 Rn have the same number of dimensions as x with a \ufb01xed prior pH(h) (typically a\nfactored Gaussian). In this form, pX(x) is tractable because\n\n1\n\n(10)\n\npX(x) = pH(f1(x))det @f 1(x)\n\n@x\n\n\n\nand can be optimized by maximum likelihood.\nIn the case of a volume preserving \ufb02ow model f, the determinant of the Jacobian @f (h)\nis one. Such\n@h\nmodels can be constructed using additive coupling layers, which \ufb01rst partition the input into two\nparts, y and z, and then de\ufb01ne a mapping from (y, z) to (y0, z0) as:\n\ny0 = y\n\n(11)\nwhere m(\u00b7) can be a complex function. By stacking multiple coupling layers the model becomes\nhighly expressive. Moreover, once we have the forward transformation f, the backward transforma-\ntion f1 can be easily derived. This family of models are called Non-linear Independent Components\nEstimation (NICE)[14].\n\nz0 = z + m(y)\n\n4.3 A NICE Proposal\nHMC has two crucial components. One is the introduction of the auxiliary variable v, which prevents\nrandom walk behavior; the other is the symmetric proposal in Equation (9), which allows the MH step\nto only consider p(x, v). In particular, if we simulate the Hamiltonian dynamics (the deterministic\npart of the proposal) twice starting from any (x, v) (without MH or resampling v), we will always\nreturn to (x, v).\nAuxiliary variables can be easily integrated into neural network proposals. However, it is hard to\nobtain symmetric behavior. If our proposal is deterministic, then f\u2713(f\u2713(x, v)) = (x, v) should hold\nfor all (x, v), a condition which is dif\ufb01cult to achieve 2. Therefore, we introduce a proposal which\nsatis\ufb01es Equation (9) for any \u2713, while preventing random walk in practice by resampling v after every\nMH step.\nOur proposal considers a NICE model f\u2713(x, v) with its inverse f1\nvariable. We draw a sample x0 from the proposal g\u2713(x0, v0|x, v) using the following procedure:\n\n, where v \u21e0 p(v) is the auxiliary\n\n\u2713\n\n1. Randomly sample v \u21e0 p(v) and u \u21e0 Uniform[0, 1];\n2. If u > 0.5, then (x0, v0) = f\u2713(x, v);\n\n2The cycle consistency loss (as in CycleGAN [31]) introduces a regularization term for this condition; we\n\nadded this to the REINFORCE objective but were not able to achieve satisfactory results.\n\n5\n\n\fHigh\n\nU (x, v)\n\nf\n\nf1\n\nLow\n\nU (x, v)\n\n\u201chigh\u201d acceptance\n\u201clow\u201d acceptance\n\np(x, v)\n\nFigure 4: Sampling process of A-NICE-MC. Each step, the proposal executes f\u2713 or f1\nthe high probability regions f\u2713 will guide x towards pd(x), while MH will tend to reject f1\nhigh probability regions both operations will have a reasonable probability of being accepted.\n\n. Outside\n. Inside\n\n\u2713\n\n\u2713\n\n3. If u \uf8ff 0.5, then (x0, v0) = f1\n\n\u2713\n\n(x, v).\n\nWe call this proposal a NICE proposal and introduce the following theorem.\nTheorem 1. For any (x, v) and (x0, v0) in their domain, a NICE proposal g\u2713 satis\ufb01es\n\ng\u2713(x0, v0|x, v) = g\u2713(x, v|x0, v0)\n\nProof. In Appendix C.\n\n4.4 Training A NICE Proposal\nGiven any NICE proposal with f\u2713, the MH acceptance step guarantees that pd is a stationary\ndistribution, yet the ratio p(x0, v0)/p(x, v) can still lead to low acceptance rates unless \u2713 is carefully\nchosen. Intuitively, we would like to train our proposal g\u2713 to produce samples that are likely under\np(x, v).\nAlthough the proposal itself is non-differentiable w.r.t. x and v, we do not require score function\ngradient estimators to train it. In fact, if f\u2713 is a bijection between samples in high probability\nregions, then f1\nduring training\nand only train f\u2713(x, v) to reach the target distribution p(x, v) = pd(x)p(v). For pd(x), we use the\nMGAN objective in Equation (3); for p(v), we minimize the distance between the distribution for the\ngenerated v0 (tractable through Equation (10)) and the prior distribution p(v) (which is a factored\nGaussian):\n\nis automatically also such a bijection. Therefore, we ignore f1\n\n\u2713\n\n\u2713\n\nmin\n\n\u2713\n\nmax\n\nD\n\nL(x; \u2713, D) + Ld(p(v), p\u2713(v0))\n\n(12)\n\nwhere L is the MGAN objective, Ld is an objective that measures the divergence between two\ndistributions and is a parameter to balance between the two factors; in our experiments, we use KL\ndivergence for Ld and = 1 3.\nOur transition operator includes a trained NICE proposal followed by a Metropolis-Hastings step,\nand we call the resulting Markov chain Adversarial NICE Monte Carlo (A-NICE-MC). The sampling\nprocess is illustrated in Figure 4. Intuitively, if (x, v) lies in a high probability region, then both f\u2713\nand f1\nshould propose a state in another high probability region. If (x, v) is in a low-probability\nprobability region, then f\u2713 would move it closer to the target, while f1\ndoes the opposite. However,\nthe MH step will bias the process towards high probability regions, thereby suppressing the random-\nwalk behavior.\n\n\u2713\n\n\u2713\n\n4.5 Bootstrap\nThe main remaining challenge is that we do not have direct access to samples from pd in order to\ntrain f\u2713 according to the adversarial objective in Equation (12), whereas in the case of Section 3, we\nhave a dataset to get samples from the data distribution.\nIn order to retrieve samples from pd and train our model, we use a bootstrap process [33] where the\nquality of samples used for adversarial training should increase over time. We obtain initial samples\nby running a (possibly) slow mixing operator T\u27130 with stationary distribution pd starting from an\narbitrary initial distribution \u21e10. We use these samples to train our model f\u2713i, and then use it to obtain\nnew samples from our trained transition operator T\u2713i; by repeating the process we can obtain samples\nof better quality which should in turn lead to a better model.\n\n3The results are not very sensitive to changes in ; we also tried Maximum Mean Discrepancy (MMD, see\n\n[32] for details) and achieved similar results.\n\n6\n\n\fFigure 5: Left: Samples from a model with shortcut connections trained with ordinary discriminator.\nRight: Samples from the same model trained with a pairwise discriminator.\n\nFigure 6: Densities of ring, mog2, mog6 and ring5 (from left to right).\n\n4.6 Reducing Autocorrelation by Pairwise Discriminator\n\nAn important metric for evaluating MCMC algorithms is the effective sample size (ESS), which\nmeasures the number of \u201ceffective samples\u201d we obtain from running the chain. As samples from\nMCMC methods are not i.i.d., to have higher ESS we would like the samples to be as independent as\npossible (low autocorrelation). In the case of training a NICE proposal, the objective in Equation (3)\nmay lead to high autocorrelation even though the acceptance rate is reasonably high. This is because\nthe coupling layer contains residual connections from the input to the output; as shown in Section 3.1,\nsuch models tend to learn an identity mapping and empirically they have high autocorrelation.\nWe propose to use a pairwise discriminator to reduce autocorrelation and improve ESS. Instead of\nscoring one sample at a time, the discriminator scores two samples (x1, x2) at a time. For \u201creal\ndata\u201d we draw two independent samples from our bootstrapped samples; for \u201cfake data\u201d we draw\n\u2713 (\u00b7|x1) such that x1 is either drawn from the data distribution or from samples after running\nx2 \u21e0 T m\nthe chain for b steps, and x2 is the sample after running the chain for m steps, which is similar to the\nsamples drawn in the original MGAN objective.\nThe optimal solution would be match both distributions of x1 and x2 to the target distribution.\nMoreover, if x1 and x2 are correlated, then the discriminator should be able distinguish the \u201creal\u201d\nand \u201cfake\u201d pairs, so the model is forced to generate samples with little autocorrelation. More details\nare included in Appendix D. The pairwise discriminator is conceptually similar to the minibatch\ndiscrimination layer [34]; the difference is that we provide correlated samples as \u201cfake\u201d data, while\n[34] provides independent samples that might be similar.\nTo demonstrate the effectiveness of the pairwise discriminator, we show an example for the image\ndomain in Figure 5, where the same model with shortcut connections is trained with and without\npairwise discrimination (details in Appendix E.1); it is clear from the variety in the samples that the\npairwise discriminator signi\ufb01cantly reduces autocorrelation.\n\n5 Experiments\n\nCode for reproducing the experiments is available at https://github.com/ermongroup/a-nice-mc.\nTo demonstrate the effectiveness of A-NICE-MC, we \ufb01rst compare its performance with HMC on\nseveral synthetic 2D energy functions: ring (a ring-shaped density), mog2 (a mixture of 2 Gaussians)\nmog6 (a mixture of 6 Gaussians), ring5 (a mixture of 5 distinct rings). The densities are illustrated\nin Figure 6 (Appendix E.2 has the analytic expressions). ring has a single connected component of\nhigh-probability regions and HMC performs well; mog2, mog6 and ring5 are selected to demonstrate\ncases where HMC fails to move across modes using gradient information. A-NICE-MC performs\nwell in all the cases.\nWe use the same hyperparameters for all the experiments (see Appendix E.4 for details). In particular,\nwe consider f\u2713(x, v) with three coupling layers, which update v, x and v respectively. This is to\nensure that both x and v could affect the updates to x0 and v0.\n\nHow does A-NICE-MC perform? We evaluate and compare ESS and ESS per second (ESS/s) for\nboth methods in Table 1. For ring, mog2, mog6, we report the smallest ESS of all the dimensions\n\n7\n\n\fTable 1: Performance of MCMC samplers as measured by Effective Sample Size (ESS). Higher is\nbetter (1000 maximum). Averaged over 5 runs under different initializations. See Appendix A for the\nESS formulation, and Appendix E.3 for how we benchmark the running time of both methods.\n\nESS\nring\nmog2\nmog6\nring5\n\nA-NICE-MC\n\n1000.00\n355.39\n320.03\n155.57\n\nHMC\n1000.00\n\n1.00\n1.00\n0.43\n\nESS/s A-NICE-MC\nring\nmog2\nmog6\nring5\n\n128205\n50409\n40768\n19325\n\nHMC\n121212\n\n78\n39\n29\n\n(a) E[px2\n\n1 + x2\n2]\n\n(b) Std[px2\n\n1 + x2\n2]\n\n(c) HMC\n\n(d) A-NICE-MC\n\nFigure 7: (a-b) Mean absolute error for estimating the statistics in ring5 w.r.t. simulation length.\nAveraged over 100 chains. (c-d) Density plots for both methods. When the initial distribution is a\nGaussian centered at the origin, HMC overestimates the densities of the rings towards the center.\n\n(as in [35]); for ring5, we report the ESS of the distance between the sample and the origin, which\nindicates mixing across different rings. In the four scenarios, HMC performed well only in ring; in\ncases where modes are distant from each other, there is little gradient information for HMC to move\nbetween modes. On the other hand, A-NICE-MC is able to freely move between the modes since the\nNICE proposal is parametrized by a \ufb02exible neural network.\nWe use ring5 as an example to demonstrate the results. We assume \u21e10(x) = N (0, 2I) as the initial\ndistribution, and optimize through maximum likelihood. Then we run both methods, and use the\nresulting particles to estimate pd. As shown in Figures 7a and 7b, HMC fails and there is a large gap\nbetween true and estimated statistics. This also explains why the ESS is lower than 1 for HMC for\nring5 in Table 1.\nAnother reasonable measurement to consider is Gelman\u2019s R hat diagnostic [36], which evaluates\nperformance across multiple sampled chains. We evaluate this over the rings5 domain (where the\nstatistics is the distance to the origin), using 32 chains with 5000 samples and 1000 burn-in steps\nfor each sample. HMC gives a R hat value of 1.26, whereas A-NICE-MC gives a R hat value of\n1.002 4. This suggest that even with 32 chains, HMC does not succeed at estimating the distribution\nreasonably well.\n\nDoes training increase ESS? We show in Figure 8 that in all cases ESS increases with more\ntraining iterations and bootstrap rounds, which also indicates that using the pairwise discriminator is\neffective at reducing autocorrelation.\nAdmittedly, training introduces an additional computational cost which HMC could utilize to obtain\nmore samples initially (not taking parameter tuning into account), yet the initial cost can be amortized\nthanks to the improved ESS. For example, in the ring5 domain, we can reach an ESS of 121.54 in\napproximately 550 seconds (2500 iterations on 1 thread CPU, bootstrap included). If we then sample\nfrom the trained A-NICE-MC, it will catch up with HMC in less than 2 seconds.\nNext, we demonstrate the effectiveness of A-NICE-MC on Bayesian logistic regression, where the\nposterior has a single mode in a higher dimensional space, making HMC a strong candidate for the\ntask. However, in order to achieve high ESS, HMC samplers typically use many leap frog steps\nand require gradients at every step, which is inef\ufb01cient when rxU (x) is computationally expensive.\nA-NICE-MC only requires running f\u2713 or f1\nonce to obtain a proposal, which is much cheaper\ncomputationally. We consider three datasets - german (25 covariates, 1000 data points), heart (14\ncovariates, 532 data points) and australian (15 covariates, 690 data points) - and evaluate the lowest\nESS across all covariates (following the settings in [35]), where we obtain 5000 samples after 1000\n\n\u2713\n\n4For R hat values, the perfect value is 1, and 1.1-1.2 would be regarded as too high.\n\n8\n\n\fFigure 8: ESS with respect to the number of training iterations.\n\nTable 2: ESS and ESS per second for Bayesian logistic regression tasks.\n\nESS\n\nA-NICE-MC\n\ngerman\nheart\n\naustralian\n\n926.49\n1251.16\n1015.75\n\nHMC\n2178.00\n5000.00\n1345.82\n\nESS/s\ngerman\nheart\n\naustralian\n\nA-NICE-MC\n\n1289.03\n3204.00\n1857.37\n\nHMC\n216.17\n1005.03\n289.11\n\nburn-in samples. For HMC we use 40 leap frog steps and tune the step size for the best ESS possible.\nFor A-NICE-MC we use the same hyperparameters for all experiments (details in Appendix E.5).\nAlthough HMC outperforms A-NICE-MC in terms of ESS, the NICE proposal is less expensive to\ncompute than the HMC proposal by almost an order of magnitude, which leads to higher ESS per\nsecond (see Table 2).\n\n6 Discussion\n\nTo the best of our knowledge, this paper presents the \ufb01rst likelihood-free method to train a parametric\nMCMC operator with good mixing properties. The resulting Markov Chains can be used to target\nboth empirical and analytic distributions. We showed that using our novel training objective we\ncan leverage \ufb02exible neural networks and volume preserving \ufb02ow models to obtain domain-speci\ufb01c\ntransition kernels. These kernels signi\ufb01cantly outperform traditional ones which are based on elegant\nyet very simple and general-purpose analytic formulas. Our hope is that these ideas will allow us\nto bridge the gap between MCMC and neural network function approximators, similarly to what\n\u201cblack-box techniques\u201d did in the context of variational inference [1].\nCombining the guarantees of MCMC and the expressiveness of neural networks unlocks the poten-\ntial to perform fast and accurate inference in high-dimensional domains, such as Bayesian neural\nnetworks. This would likely require us to gather the initial samples through other methods, such\nas variational inference, since the chances for untrained proposals to \u201cstumble upon\u201d low energy\nregions is diminished by the curse of dimensionality. Therefore, it would be interesting to see whether\nwe could bypass the bootstrap process and directly train on U (x) by leveraging the properties of\n\ufb02ow models. Another promising future direction is to investigate proposals that can rapidly adapt\nto changes in the data. One use case is to infer the latent variable of a particular data point, as in\nvariational autoencoders. We believe it should be possible to utilize meta-learning algorithms with\ndata-dependent parametrized proposals.\n\nAcknowledgements\n\nThis research was funded by Intel Corporation, TRI, FLI and NSF grants 1651565, 1522054, 1733686.\nThe authors would like to thank Daniel L\u00e9vy for discussions on the NICE proposal proof, Yingzhen Li\nfor suggestions on the training procedure and Aditya Grover for suggestions on the implementation.\n\nReferences\n[1] R. Ranganath, S. Gerrish, and D. Blei, \u201cBlack box variational inference,\u201d in Arti\ufb01cial Intelligence\n\nand Statistics, pp. 814\u2013822, 2014.\n\n[2] D. P. Kingma and M. Welling, \u201cAuto-encoding variational bayes,\u201d arXiv preprint\n\narXiv:1312.6114, 2013.\n\n[3] D. J. Rezende, S. Mohamed, and D. Wierstra, \u201cStochastic backpropagation and approximate\n\ninference in deep generative models,\u201d arXiv preprint arXiv:1401.4082, 2014.\n\n9\n\n\f[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and\nY. Bengio, \u201cGenerative adversarial nets,\u201d in Advances in neural information processing systems,\npp. 2672\u20132680, 2014.\n\n[5] S. Mohamed and B. Lakshminarayanan, \u201cLearning in implicit generative models,\u201d arXiv preprint\n\narXiv:1610.03483, 2016.\n\n[6] T. Salimans, D. P. Kingma, M. Welling, et al., \u201cMarkov chain monte carlo and variational\n\ninference: Bridging the gap.,\u201d in ICML, vol. 37, pp. 1218\u20131226, 2015.\n\n[7] N. De Freitas, P. H\u00f8jen-S\u00f8rensen, M. I. Jordan, and S. Russell, \u201cVariational mcmc,\u201d in Pro-\nceedings of the Seventeenth conference on Uncertainty in arti\ufb01cial intelligence, pp. 120\u2013127,\nMorgan Kaufmann Publishers Inc., 2001.\n\n[8] J. Gorham and L. Mackey, \u201cMeasuring sample quality with stein\u2019s method,\u201d in Advances in\n\nNeural Information Processing Systems, pp. 226\u2013234, 2015.\n\n[9] J. Gorham, A. B. Duncan, S. J. Vollmer, and L. Mackey, \u201cMeasuring sample quality with\n\ndiffusions,\u201d arXiv preprint arXiv:1611.06972, 2016.\n\n[10] J. Gorham and L. Mackey, \u201cMeasuring sample quality with kernels,\u201d arXiv preprint\n\narXiv:1703.01717, 2017.\n\n[11] S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman, \u201cDesigning fast absorbing markov\n\nchains.,\u201d in AAAI, pp. 849\u2013855, 2014.\n\n[12] N. Mahendran, Z. Wang, F. Hamze, and N. De Freitas, \u201cAdaptive mcmc with bayesian optimiza-\n\ntion.,\u201d in AISTATS, vol. 22, pp. 751\u2013760, 2012.\n\n[13] S. Boyd, P. Diaconis, and L. Xiao, \u201cFastest mixing markov chain on a graph,\u201d SIAM review,\n\nvol. 46, no. 4, pp. 667\u2013689, 2004.\n\n[14] L. Dinh, D. Krueger, and Y. Bengio, \u201cNice: Non-linear independent components estimation,\u201d\n\narXiv preprint arXiv:1410.8516, 2014.\n\n[15] M. Arjovsky, S. Chintala, and L. Bottou, \u201cWasserstein gan,\u201d arXiv preprint arXiv:1701.07875,\n\n2017.\n\n[16] A. Radford, L. Metz, and S. Chintala, \u201cUnsupervised representation learning with deep convo-\n\nlutional generative adversarial networks,\u201d arXiv preprint arXiv:1511.06434, 2015.\n\n[17] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, \u201cImproved training of\n\nwasserstein gans,\u201d arXiv preprint arXiv:1704.00028, 2017.\n\n[18] Y. Bengio, E. Thibodeau-Laufer, G. Alain, and J. Yosinski, \u201cDeep generative stochastic networks\n\ntrainable by backprop,\u201d 2014.\n\n[19] F. Bordes, S. Honari, and P. Vincent, \u201cLearning to generate samples from noise through infusion\n\ntraining,\u201d ICLR, 2017.\n\n[20] K. He, X. Zhang, S. Ren, and J. Sun, \u201cDeep residual learning for image recognition,\u201d in\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770\u2013778,\n2016.\n\n[21] R. K. Srivastava, K. Greff, and J. Schmidhuber, \u201cHighway networks,\u201d arXiv preprint\n\narXiv:1505.00387, 2015.\n\n[22] P. J. Green, \u201cReversible jump markov chain monte carlo computation and bayesian model\n\ndetermination,\u201d Biometrika, pp. 711\u2013732, 1995.\n\n[23] W. Jakob and S. Marschner, \u201cManifold exploration: a markov chain monte carlo technique\nfor rendering scenes with dif\ufb01cult specular transport,\u201d ACM Transactions on Graphics (TOG),\nvol. 31, no. 4, p. 58, 2012.\n\n10\n\n\f[24] D. P. Landau and K. Binder, A guide to Monte Carlo simulations in statistical physics. Cam-\n\nbridge university press, 2014.\n\n[25] W. K. Hastings, \u201cMonte carlo sampling methods using markov chains and their applications,\u201d\n\nBiometrika, vol. 57, no. 1, pp. 97\u2013109, 1970.\n\n[26] R. J. Williams, \u201cSimple statistical gradient-following algorithms for connectionist reinforcement\n\nlearning,\u201d Machine learning, vol. 8, no. 3-4, pp. 229\u2013256, 1992.\n\n[27] R. M. Neal et al., \u201cMcmc using hamiltonian dynamics,\u201d Handbook of Markov Chain Monte\n\nCarlo, vol. 2, pp. 113\u2013162, 2011.\n\n[28] D. J. Rezende and S. Mohamed, \u201cVariational inference with normalizing \ufb02ows,\u201d arXiv preprint\n\narXiv:1505.05770, 2015.\n\n[29] D. P. Kingma, T. Salimans, and M. Welling, \u201cImproving variational inference with inverse\n\nautoregressive \ufb02ow,\u201d arXiv preprint arXiv:1606.04934, 2016.\n\n[30] A. Grover, M. Dhar, and S. Ermon, \u201cFlow-gan: Bridging implicit and prescribed learning in\n\ngenerative models,\u201d arXiv preprint arXiv:1705.08868, 2017.\n\n[31] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, \u201cUnpaired image-to-image translation using\n\ncycle-consistent adversarial networks,\u201d arXiv preprint arXiv:1703.10593, 2017.\n\n[32] Y. Li, K. Swersky, and R. Zemel, \u201cGenerative moment matching networks,\u201d in International\n\nConference on Machine Learning, pp. 1718\u20131727, 2015.\n\n[33] B. Efron and R. J. Tibshirani, An introduction to the bootstrap. CRC press, 1994.\n[34] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, \u201cImproved\ntechniques for training gans,\u201d in Advances in Neural Information Processing Systems, pp. 2226\u2013\n2234, 2016.\n\n[35] M. Girolami and B. Calderhead, \u201cRiemann manifold langevin and hamiltonian monte carlo\nmethods,\u201d Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73,\nno. 2, pp. 123\u2013214, 2011.\n\n[36] S. P. Brooks and A. Gelman, \u201cGeneral methods for monitoring convergence of iterative simula-\n\ntions,\u201d Journal of computational and graphical statistics, vol. 7, no. 4, pp. 434\u2013455, 1998.\n\n[37] M. D. Hoffman and A. Gelman, \u201cThe no-u-turn sampler: adaptively setting path lengths in\nhamiltonian monte carlo.,\u201d Journal of Machine Learning Research, vol. 15, no. 1, pp. 1593\u20131623,\n2014.\n\n[38] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis,\nJ. Dean, M. Devin, et al., \u201cTensor\ufb02ow: Large-scale machine learning on heterogeneous dis-\ntributed systems,\u201d arXiv preprint arXiv:1603.04467, 2016.\n\n[39] D. Kingma and J. Ba, \u201cAdam: A method for stochastic optimization,\u201d arXiv preprint\n\narXiv:1412.6980, 2014.\n\n11\n\n\f", "award": [], "sourceid": 2661, "authors": [{"given_name": "Jiaming", "family_name": "Song", "institution": "Stanford University"}, {"given_name": "Shengjia", "family_name": "Zhao", "institution": "Stanford University"}, {"given_name": "Stefano", "family_name": "Ermon", "institution": "Stanford"}]}