Diffusion probabilistic modelling
Generative modelling is attracting a large research community and providing a path toward the goal of Artificial General Intelligence (AGI). For generation tasks, diffusion-based generation has picked up pace since 2020, after the release of the famous Denoising Diffusion Probabilistic Models paper from Berkeley [1].
In this article we are going to review how diffusion works, the mathematical intricacies in its derivation, and how recent modifications play with diffusion. Tighten your seat belts and let's get started on a roller coaster ride into the ML Diffusion Reactor.
Diffusion models are inspired by non-equilibrium thermodynamics, which emphasizes that almost all systems found in nature are not in thermodynamic equilibrium. They are changing, or can be triggered to change, over time, and are continuously and discontinuously subject to fluxes of matter and energy to and from other systems and to chemical reactions.
What are diffusion models?
Today I am going to show you two images of one of the world's seven wonders, the Taj Mahal: the first almost completely buried in noise, and the second only lightly noised.
Which one do you think actually depicts the Taj Mahal clearly? Well, neither of them, but in the second image the monument is still visible and much closer to the actual, clean photograph.
The point I wish to make is that retrieving the actual image from pure gibberish noise sampled from a fixed probability distribution (say, a Gaussian) is nearly impossible.
Retrieving a slightly less noised image from a noised one, however, is much easier.
This is the main point that diffusion modelling leverages. Learning in the diffusion framework involves estimating small perturbations to a diffusion process. Estimating small perturbations is more tractable than explicitly describing the full distribution with a single, non-analytically-normalizable potential function. Furthermore, since a diffusion process exists for any smooth target distribution, this method can capture data distributions of arbitrary form.
Forward and Reverse Diffusion — Mathematically explained
Diffusion models are latent variable models of the form pθ(x0) := ∫ pθ(x0:T) dx1:T, where x1, . . . , xT are latents of the same dimensionality as the data x0 ∼ q(x0).
Forward diffusion
We define a forward noising process q which produces latents x1 through xT by adding Gaussian noise at time t with variance βt ∈ (0, 1).
What distinguishes diffusion models from other types of latent variable models is that the approximate posterior q(x1:T|x0), called the forward process or diffusion process, is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule β1, . . . , βT.
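For reference, here is how [1] writes this forward process as a Markov chain of Gaussian transitions:

```latex
q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}),
\qquad
q(x_t \mid x_{t-1}) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)
```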
Given sufficiently large T and a well-behaved schedule of βt, the latent xT is nearly an isotropic Gaussian. Thus, if we knew the exact reverse distribution q(xt−1|xt), we could sample xT ∼ N(0, I) and run the process in reverse to get a sample from q(x0).
When we set βT sufficiently large (close to 1), q(xT|xT−1) converges to a standard Gaussian regardless of x0.
A notable property of the forward process is that it admits sampling xt at an arbitrary timestep t in closed form: defining αt := 1 − βt and ᾱt := ∏s=1..t αs, we have q(xt|x0) = N(xt; √ᾱt x0, (1 − ᾱt)I).
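Here is a minimal PyTorch sketch of this closed-form jump, assuming a linear βt schedule with T = 1000; the schedule values, function name and variable names are illustrative choices of mine, not prescribed by [1]:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # illustrative linear variance schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise=None):
    """Jump straight to x_t ~ q(x_t | x_0) using the closed form."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise

# Example: noise a dummy 64x64 RGB "image" to timestep t = 500
x0 = torch.rand(3, 64, 64)
x500 = q_sample(x0, t=500)
```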
Reverse diffusion
The joint distribution pθ(x0:T) is called the reverse process, and it is defined as a Markov chain with learned Gaussian transitions starting at p(xT) = N(xT; 0, I).
(Figure: the diffusion model.)
However, since the true reverse conditional q(xt−1|xt) depends on the entire data distribution, it is intractable, and we approximate it with learned transitions pθ(xt−1|xt) parameterized by a neural network.
Note that we assume the neural network outputs the parameters (mean and variance) of a Gaussian distribution: xT is Gaussian, and because each reverse step only removes a small amount of noise, xT−1 is still well approximated by a Gaussian, and the same argument applies to every subsequent xt−1 all the way down to x0.
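Written out in the notation of [1], the reverse process and its learned Gaussian transitions are:

```latex
p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),
\qquad
p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```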
Cost Function for model optimisation
It's interesting to note that the setup defined so far is very similar to a VAE, going from x → z ∼ N(0, I) → x′, and therefore it is natural to optimize the negative log-likelihood through its variational lower bound.
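The variational bound on the negative log-likelihood, as stated in [1], is:

```latex
\mathbb{E}\!\left[-\log p_\theta(x_0)\right]
\;\le\;
\mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]
\;=:\; L
```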
With some simple juggling of terms, this expectation can be further simplified into the sum of KL-divergence terms and entropy terms shown below.
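Following the decomposition in [1], the bound splits into a prior-matching term, a sum of denoising-matching terms, and a reconstruction term:

```latex
L = \mathbb{E}_q\Big[
\underbrace{D_{\mathrm{KL}}\!\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T}
+ \sum_{t>1} \underbrace{D_{\mathrm{KL}}\!\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}}
\;-\; \underbrace{\log p_\theta(x_0 \mid x_1)}_{L_0}
\Big]
```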
Giving each term a label (LT, Lt−1 and L0 above), we can see that the overall variational loss is a sum of distances between probability distributions: the actual q and the predicted pθ.
Noteworthy is that Lt−1 compares pθ(xt−1|xt) against the forward-process posterior conditioned on x0, i.e. q(xt−1|xt, x0).
This forward-process posterior conditioned on x0 is tractable, and using Bayes' rule it can be shown to be a Gaussian with closed-form mean and variance.
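In the notation of [1]:

```latex
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\right),
\quad
\tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0
+ \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t,
\quad
\tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t
```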
Using the closed-form forward sample to replace the x0 term, the posterior mean can be rewritten purely in terms of xt and the noise ε.
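Substituting x0 = (xt − √(1 − ᾱt) ε)/√ᾱt into μ̃t, as done in [1], gives:

```latex
\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon \right)
```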
This posterior is what the reverse diffusion process approximates with pθ(xt−1|xt), and therefore the KL term Lt−1 reduces to matching the two means.
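With a fixed variance σt² for pθ(xt−1|xt) (set to βt or β̃t in [1]), Lt−1 becomes a weighted mean-squared error between the means, up to a constant C that does not depend on θ:

```latex
L_{t-1} = \mathbb{E}_q\!\left[ \frac{1}{2\sigma_t^2} \left\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \right\|^2 \right] + C
```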
Because xt is available as an input at training time, we can reparameterize the Gaussian mean so that the network instead predicts the noise ε (used interchangeably with zt) from the input xt at timestep t.
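Concretely, the mean is parameterized through a noise-prediction network εθ, as in [1]:

```latex
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
```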
Therefore the simplified objective becomes the noise-prediction loss below.
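In the notation of [1], with t sampled uniformly and ε ∼ N(0, I):

```latex
L_{\text{simple}}(\theta) := \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \right\|^2 \right]
```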
How to train and sample from diffusion model?
Having explained most of the forward and reverse process details, we can now write down the training and sampling procedures for diffusion models.
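Below is a minimal PyTorch sketch of the two procedures, mirroring Algorithms 1 and 2 of [1]. The `model(x, t)` network, the linear schedule, and the choice σt² = βt are illustrative assumptions on my part, not a definitive implementation:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # illustrative linear schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)          # alpha_bar_t

def train_step(model, optimizer, x0):
    """One optimisation step on L_simple: predict the noise that was added to x0."""
    t = torch.randint(0, T, (x0.shape[0],))                          # uniform timesteps
    eps = torch.randn_like(x0)                                       # target noise
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)                          # assumes (B, C, H, W) inputs
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps     # closed-form q(x_t | x_0)
    loss = torch.nn.functional.mse_loss(model(x_t, t), eps)          # ||eps - eps_theta||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def sample(model, shape):
    """Run the reverse chain from x_T ~ N(0, I) down to x_0."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = model(x, t_batch)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_pred) / torch.sqrt(alphas[t])         # mu_theta(x_t, t)
        x = mean + torch.sqrt(betas[t]) * z                          # sigma_t^2 = beta_t
    return x
```

Everything here is assumed to live on the same device; `model` is any network that takes a noised batch and integer timesteps and returns a tensor of the same shape as its input.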
There are further ways to turn the Markovian chain into a non-Markovian one, as introduced in the DDIM article, and to add guidance signals at sampling time (the guided-diffusion approach [3]), which I will touch on in another blog.
References
[1] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in Neural Information Processing Systems 33 (2020): 6840–6851.
[2] Nichol, Alexander Quinn, and Prafulla Dhariwal. “Improved denoising diffusion probabilistic models.” International Conference on Machine Learning. PMLR, 2021.
[3] Dhariwal, Prafulla, and Alexander Nichol. “Diffusion models beat GANs on image synthesis.” Advances in Neural Information Processing Systems 34 (2021).
[4] https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
[5] https://www.youtube.com/watch?v=lvv4N2nf-HU&t=2430s
[6] Sohl-Dickstein, Jascha, et al. “Deep unsupervised learning using nonequilibrium thermodynamics.” International Conference on Machine Learning. PMLR, 2015.
Keep Learning, Keep Hustling