
This also becomes apparent when training image generation models and choosing between conditioning on text or on a lossy image representation (text embeddings and image embeddings, respectively).
Before training, when looking at a single sample, the two may as well be the same: both embeddings are just arrays of numbers of the same size, and if we pair each with the same image and overfit exclusively on that sample, we won't see much of a difference.
However, once we train on many pairs and converge to a final model, we'll find that the model conditioned on image embeddings generally produces much less diverse outputs than the one conditioned on text embeddings, despite both being fed the same number of raw values.
There are a few things at play here.
First, the image embedding is more informative about what the model should output than the text is. This falls out naturally from the image embedding being a deterministic compression of the image: there are strong, reliable correlations between input and output that the model can exploit for predictive accuracy.
Looking at it in the other direction, a single image embedding could have come from a number of images, but we'd expect those images to all be very closely related.
Meanwhile, the relationship between text and images is a many-to-many problem. One text could represent many images, and one image could represent many texts (think of the breadth of images that could be labeled "a cat").
For this reason, the correlations between input and output are typically weaker, or less reliable. The mapping between text and image, unlike the image encoding process, is probabilistic. We can summarize this as follows:
- text -> image: Probabilistic, one-to-many
- image -> text: Probabilistic, one-to-many
- image -> image embedding: Deterministic, one-to-one
- image embedding -> image: Probabilistic, one-to-many
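The mappings above can be sketched with toy stand-ins (the fixed random projection playing the image encoder and the noise-injecting function playing a generative decoder are hypothetical, purely for illustration):

```python
import numpy as np

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in image encoder: a fixed random projection.
    Deterministic -- the same image always maps to the same embedding."""
    proj = np.random.default_rng(42).standard_normal((image.size, 8))
    return image.flatten() @ proj

def decode_embedding(embedding: np.ndarray, seed: int) -> np.ndarray:
    """Stand-in generative decoder: embedding plus sampled noise.
    Probabilistic -- one embedding yields many closely related images."""
    noise = np.random.default_rng(seed).standard_normal(16) * 0.1
    return np.tile(embedding, 2) + noise

img = np.arange(16, dtype=float).reshape(4, 4)
emb = encode_image(img)

# image -> image embedding is one-to-one: repeated calls agree exactly
assert np.array_equal(emb, encode_image(img))

# image embedding -> image is one-to-many: distinct samples, all close together
samples = [decode_embedding(emb, seed=s) for s in range(3)]
assert not np.array_equal(samples[0], samples[1])
assert np.allclose(samples[0], samples[1], atol=1.0)
```

The text-to-image direction would behave like the decoder here, only with far more spread between samples, since one caption is compatible with a much wider set of images.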
Secondly, text is usually constrained in length and generally restricted to the space of strings that are grammatically coherent. The space is smaller, which limits the fundamental information we can convey; the saying "a picture is worth a thousand words" holds in a fairly literal sense. If we allow very long texts offering more and more granular detail, reduce ambiguity by sticking to a common format, or even break free from grammar, we'll find that the outputs of text-conditioned models end up much less diverse.
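As a rough back-of-the-envelope comparison, we can bound the raw information each conditioning signal could carry. The specific numbers below (a 77-token caption cap, a ~50k-token vocabulary, a 768-dimensional float32 embedding) are illustrative CLIP-like assumptions, not figures from the text:

```python
import math

# Upper bound on caption information: token slots * bits per token choice
max_tokens = 77            # assumed caption length cap
vocab_size = 50_000        # assumed tokenizer vocabulary size
caption_bits = max_tokens * math.log2(vocab_size)

# Upper bound on embedding information: dimensions * bits per float
embed_dim = 768            # assumed embedding width
embedding_bits = embed_dim * 32  # float32

print(f"caption capacity:   ~{caption_bits:.0f} bits")
print(f"embedding capacity: ~{embedding_bits} bits")
```

Even before accounting for grammar shrinking the usable caption space, the embedding's raw capacity is roughly twenty times larger here, which is consistent with the intuition that it pins down the output far more tightly.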
Information is stored in the distribution
Correlations carry information. When decoding either an image embedding or a text embedding, at the end of the day, we're handing the model a vector of numbers of the same exact dimensions. The difference lies in the distribution: the correlation between input and output is stronger for image embeddings than for text.