
How I like to think about diffusion

  • Writer: Ethan Smith
  • Jan 26
  • 4 min read

Updated: May 10


It's a bit hard to see in the diagram, but in addition to being convolved with a Gaussian, these points are also drifting towards zero.


There are actually two perspectives here. There's setting a point to

x_t = x_0 * alpha + noise * sigma

where sigma and alpha are both numbers between 0 and 1


and then there's

x_t = x_0 + noise * sigma

but here sigma is not restricted to the interval from 0 to 1; it grows towards infinity at the end of the diffusion schedule.


In both cases we can achieve a desired signal-to-noise ratio, but one case involves reducing the image signal, while the other keeps the signal constant and keeps raising the noise's variance until it overwhelms the signal entirely. These are the Variance Preserving (VP) and Variance Exploding (VE) perspectives, respectively.
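
To make the two parameterizations concrete, here is a minimal numpy sketch (my own, not from any particular library; the function names and the assumption that x0 has roughly unit variance are mine) that noises data to the same signal-to-noise ratio both ways:

import numpy as np

rng = np.random.default_rng(0)

def vp_noise(x0, snr):
    # Variance Preserving: shrink the signal and add bounded noise,
    # with alpha^2 + sigma^2 = 1 (assumes x0 has roughly unit variance).
    sigma = 1.0 / np.sqrt(1.0 + snr)      # chosen so that alpha^2 / sigma^2 = snr
    alpha = np.sqrt(1.0 - sigma**2)
    return alpha * x0 + sigma * rng.standard_normal(x0.shape)

def ve_noise(x0, snr):
    # Variance Exploding: keep the signal as-is and let sigma grow
    # without bound as the target snr goes to zero.
    sigma = np.sqrt(np.mean(x0**2) / snr)
    return x0 + sigma * rng.standard_normal(x0.shape)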


We go from a very high-spread distribution to something narrower and narrower, until it is a mixture of Dirac deltas, each reflecting a single sample.


Interestingly, this process is opposite to entropy flow in nature and the arrow of time. Typically entropy, or disorder, increases. When you put a drop of coloring in water, it slowly spreads out into a homogeneous mixture. The way it moves forward in time is a stochastic process, but the end distribution is well understood, even if we don't know the exact end state it will land in; in all cases the result looks similar.


Now imagine if we could reverse this process. Could you guess, from a totally homogeneous mixture, where in the glass the coloring was dropped? Not with certainty, but we can still make a guess. Generative diffusion estimates that guess based on where it most probably could have happened.



I put together this repo, Boneless Flow, to explore it some more. Instead of training a model with weights to obtain estimates of the flow towards the manifold of clean data, we can compute the ground-truth flow analytically if we have the whole dataset in memory.
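
As a rough sketch of what that means (my own minimal numpy version, not the actual Boneless Flow code): with the whole dataset in memory, the noised distribution is a mixture of Gaussians centered on the data points, and its score, the direction of increasing probability, has a closed form.

import numpy as np

def analytic_score(y, X, sigma):
    # Score grad_y log p_sigma(y) for p_sigma(y) = (1/N) * sum_i N(y; x_i, sigma^2 I),
    # i.e. the dataset X (shape N x D) convolved with a Gaussian of std sigma.
    diffs = X - y                                       # x_i - y for every data point
    logits = -np.sum(diffs**2, axis=1) / (2 * sigma**2)
    w = np.exp(logits - logits.max())
    w /= w.sum()                                        # softmax responsibilities
    return (w[:, None] * diffs).sum(axis=0) / sigma**2  # weighted pull towards the data

# One small gradient-ascent step on log-probability ("flow" towards the clean manifold):
# y_next = y + step_size * analytic_score(y, X, sigma)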


A problem with this is that the ground-truth score actually won't allow for generating new samples: following it exactly just flows back onto the training points themselves (the mixture of Diracs from before), so we can only reproduce data we already have.



Tweedie's formula is an equation that sits at the foundation of diffusion. It describes the idea in the visual above: moving towards regions of higher probability.


I liked this post on it, though I wanted to give some additional intuition.

Here X is a clean data point, e is noise, and Y = X + e is the noisy observation we get when adding that noise to X.


f (denoiser) is a function that takes in Y, the noisy data point, and attempts to estimate X, what it was before being noised.


f_hat is a specific instantiation of f, namely the MMSE (minimum mean squared error) denoiser.


It says that we can denoise Y by starting from it and moving in a direction that increases probability, similar to our original diagram, scaled by the noise variance sigma^2, where sigma is the known level of noise.
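
For reference, the usual statement of Tweedie's formula for Gaussian noise with standard deviation sigma (the equation isn't reproduced in the text here, so this is my reconstruction in the post's notation):

f_hat(Y) = E[X | Y] = Y + sigma^2 * grad_Y log p(Y)

That is, the MMSE denoiser takes the noisy point and moves it up the gradient of log-probability, scaled by the noise variance.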


We can see an example of this without needing to think about diffusion at all: linear regression.


In a linear regression problem we have a set of points following some kind of trend, and we attempt to estimate this trend with a model: a straight line represented by Y_hat = mX + b.


Y_hat is the predicted value from our model.

Y is the actual data, which we can think of as clean points from our model but with added noise. This is often a reasonable assumption: there are other, unaccounted-for variables, which we can lump together and represent as noise.


Therefore we can think of the Y's as

Y = Y_hat + e

or

Y = mX + b + e


This noise, assuming a Gaussian distribution, creates a fuzzy field/distribution around our model, which we can visualize in either of the two ways shown: the height of the distribution represents probability density, or alternatively its opaqueness does.


Now like before, we can ascend the probability curve, thus denoising the real data points with respect to our model.
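
To spell this out with a small worked step (in the post's notation, assuming the noise e is Gaussian with standard deviation sigma): given X, the noisy Y is Gaussian around the line, so

p(Y | X) = N(Y; mX + b, sigma^2)

grad_Y log p(Y | X) = (mX + b - Y) / sigma^2

and one Tweedie step lands exactly back on the line:

Y + sigma^2 * (mX + b - Y) / sigma^2 = mX + b

Ascending the probability bump around the model is the same as snapping each noisy point back onto the model's prediction.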




I think this illustrates a profound relationship between noise and blur, in which blurring can be seen as denoising.


Noise is individual random samples from a distribution.


Blurring involves taking the average of many samples or, if we know the distribution, simply picking out its mean value.


We combat noise by blurring, like this:

We don't know what the image looks like without noise, but we can take advantage of the contiguity of images: nearby points should have related values, so averaging them smooths out the randomness. (Note: this is spatial blurring, which is a bit different from the kind of blurring regression performs, but both leverage the aggregate of noisy samples to give us a denoised estimate.)
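
As a toy illustration (my own example, not from the post; it assumes scipy is available), blurring a noisy image brings it measurably closer to the clean one:

import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
xs, ys = np.meshgrid(np.linspace(0, 1, 128), np.linspace(0, 1, 128))
clean = np.sin(6 * xs) * np.cos(4 * ys)          # a smooth, image-like signal
noisy = clean + 0.5 * rng.standard_normal(clean.shape)

blurred = gaussian_filter(noisy, sigma=2.0)      # spatial blur = local averaging

print("noisy MSE:  ", np.mean((noisy - clean) ** 2))    # roughly 0.25 (the noise variance)
print("blurred MSE:", np.mean((blurred - clean) ** 2))  # much smaller: averaging cancels noise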


Blurring gives us the best unbiased estimate of the data without noise: averaging yields the point with minimal squared distance to all the observed noisy samples. The unbiased choice is also the most conservative, safe choice: it makes minimal guesses.


We could also average or blur across other axes of the data. For example, instead of blurring individual pixels spatially, we can treat every picture of a face as a sample and average those together.
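
In code this is just a change of axis (faces here is a hypothetical array of aligned face images, not real data):

import numpy as np

faces = np.random.rand(1000, 64, 64, 3)   # stand-in for N aligned face images, shape (N, H, W, 3)
mean_face = faces.mean(axis=0)            # average across the sample axis, not across neighboring pixels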


Note, this doesn't actually give us a real sample of a person, but it's informative of common traits in appearance within a certain cluster. This reflects the difference between typical samples and the mean sample.






1 comment


drej
Mar 31

how does sigma go to inf when it's at the same time from 0 to 1 interval?
