The mean preference is a bad estimate of preferences.
- Ethan Smith
- May 18
- 6 min read

I felt compelled to make this post after seeing yet another reinforcement learning paper for diffusion models that does spectacularly well at fitting the reward function, but whose actual results look terrible, collapse to a single style, and are otherwise kitsch.

This is not at all a criticism of the author's work. They proposed a new method of reinforcement learning, and it successfully fit the objective quite well. But the images all have the same tired, fake-HDR, airbrushed look that makes them painful to look at.
I'm reminded of the figure in which OpenAI intended to show that DALLE3 had improved over their previous DALLE2, but the newer outputs just looked so much worse.

The problem is that the reward function itself, I think, sucks.
In this post, I want to discuss the nature of human preferences and why standard reward models, which learn the expectation or average over many human labels, might be doing us a disservice.
The Nature of Human Preferences
The kinds of things humans like differ wildly from person to person. As for what makes a pleasing image, some people may prefer a piece by Van Gogh, others may prefer anime drawings, and others may prefer abstract art.
When collecting ratings of these things across many people, we might get wildly different answers.

Inevitably, because we all have our own subjective preferences, there is a lot of variance and inconsistency in human ratings. One person's trash is another's treasure.
We hope that by getting many unbiased ratings, we can "average out" this noise and reach the common denominator of what makes an image aesthetically pleasing, regardless of the exact kind of content.
In other words, while some may prefer oil paintings and others may prefer 3D-rendered art, what are the universal components that are pleasing to all? Maybe certain colors are universally enjoyed on average by a certain population. Maybe there are certain compositions or uses of light that are generally preferred.
However, I think if there is any preference that persists on average across such a large population, it is something so primitive and vague that it entirely misses the individual nuances of what people actually love.
I say this because many aesthetic preferences are mutually exclusive. Let's imagine a toy example of this.
Let's say half of a population enjoys dark, brooding art, and the other appreciates the opposite: light and happy scenes. We then collect a dataset composed of these two art styles, again half and half. The dark crowd always rates the dark art highly and the light art negatively, and vice versa.
For our model, this is unfortunately not an informative trait it can use to make predictions. If half of the labels over dark art are high and the other half are low, the minimum-error prediction is a value right in the middle, effectively making the model's prediction invariant to this trait.
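As a quick sanity check of that claim, here is a tiny numpy sketch (the 9/10 and 1/10 labels are made up for illustration) showing that the constant prediction minimizing squared error over split labels is the mean, a score nobody actually gave:

```python
import numpy as np

# Toy example (all numbers made up): one "dark art" image, rated 9/10 by the
# dark crowd and 1/10 by the light crowd.
ratings = np.array([9.0] * 50 + [1.0] * 50)

# Search for the constant prediction c that minimizes mean squared error.
candidates = np.linspace(0, 10, 101)
errors = [(np.mean((ratings - c) ** 2), c) for c in candidates]
best_error, best_c = min(errors)

print(best_c)          # 5.0 -- exactly the mean
print(ratings.mean())  # 5.0 -- a score that no individual rater actually gave
```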
Instead, the model will need to lean on other features that on average yield higher ratings. Say pictures of people are rated slightly more highly on average by both groups: the low ratings are a bit higher, and the high ratings are a bit higher too.
I would worry that at scale, over many different ratings, a lot of the nuance of what makes a certain image preferable is canceled out, and we are left only chasing a mean that embodies a bunch of "safe" traits meant to be universally palatable but that appeal specifically to no one!
The space of human reward functions is highly varied and often contradictory. Here, the floor axes represent different kinds of images, and the vertical axis shows how highly a certain person may rate them. For a model to estimate such noisy, varied scores from many people with the least error, it would learn the mean preference, which, notice, is not rated highly by anyone here. We're trying to fit a single reward function to many mutually incompatible reward functions that coexist in the data in conflicting ways, which manifests as highly variant, inconsistent datapoints that are difficult to fit.

When our reward dataset is more objective, say, successfully solving math problems, rendering text that can be read correctly by an OCR system, or the compressibility of an image, I think we can get away with this averaging.
A finding that might support this claim: the LAION aesthetic predictor v2 was trained on a dataset of images labeled 0-10, yet in practice, when generating predictions over millions of images, something like 99% of the scores fall between 4 and 6, suggesting that the model learns a very conservative fit due to the noisy labels. Granted, I also don't know the details of the labeling process.
Another way I like to visualize this problem is to imagine doing time series predictions.
Let's imagine the pattern in blue occurs many times in the dataset. Half of the time it continues steeply upward, and the other half of the time it continues steeply downward.

Just predicting the mean tells us a terrible story about what possible continuations might look like.

If we predict mean and variance, we at least have a sense of spread, but still the majority of probability mass is on the center, which is inaccurate.

Once we get to Gaussian mixture model territory, we can respect the multiple clusters of outcomes.
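To make that concrete, here's a small sketch using scikit-learn on synthetic, made-up continuation values: a single Gaussian parks its mean near zero, where almost none of the data lives, while a two-component mixture recovers both clusters and assigns the data far more likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical continuation values: half the series shoot up, half drop down.
up = rng.normal(loc=3.0, scale=0.3, size=500)
down = rng.normal(loc=-3.0, scale=0.3, size=500)
continuations = np.concatenate([up, down]).reshape(-1, 1)

# A single Gaussian centers itself near 0, where almost no data lives.
single = GaussianMixture(n_components=1).fit(continuations)
print("1-component mean:", single.means_.ravel())

# Two components recover the two clusters of outcomes.
gmm = GaussianMixture(n_components=2, random_state=0).fit(continuations)
print("2-component means:", gmm.means_.ravel())
print("mixing weights:", gmm.weights_)

# Average log-likelihood per point: the mixture fits the data far better.
print("avg log-lik, 1 comp:", single.score(continuations))
print("avg log-lik, 2 comp:", gmm.score(continuations))
```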

Now, if we can accept that the typical strategy of reward modeling ignores a lot of nuance and focuses on the universal component instead of respecting clusters of human preferences, then we can consider some solutions. First, we'll discuss options for building the reward model itself; then we'll need to figure out how to use that reward model to train an image model.
Training more nuanced reward models
1) Gaussian mixture models
Noting that our data comes from a multimodal distribution, effectively a blend of many individual preference distributions, we can use a model capable of fitting such distributions: the Gaussian mixture model, which is literally composed of many Gaussians.
Let's visualize our light and dark example. Again, we have a very split distribution of scores.

With a mixture of Gaussians, we can learn to divide and conquer. One Gaussian can take on modeling the light crowd, and the other can model the dark crowd. Because our data is split half and half, we'd likely also see mixing weights of [0.5, 0.5], which, if we were to mix them back together, would put us right back where we started. But now we have effectively factorized the view: we can assess each Gaussian independently.

One issue is that the number of Gaussians is an integer hyperparameter we have to choose. We don't know how many we should use, which is a common issue in unsupervised clustering problems, i.e., how many means to use in k-means. However, nearly any choice might still be more useful than the base case of a single Gaussian, and we can decide how much granularity we want.
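As a rough sketch of what this could look like as a trainable reward model (the embedding dimension, component count, and toy ratings below are all assumptions, not anyone's published setup), here is a small PyTorch mixture-density head that predicts mixing weights, means, and variances of the rating distribution for each image embedding and is trained with the mixture negative log-likelihood:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureRewardHead(nn.Module):
    """Sketch: predict a K-component Gaussian mixture over ratings for an
    image embedding, instead of a single scalar score."""

    def __init__(self, embed_dim: int = 768, n_components: int = 2):
        super().__init__()
        # For each component: a mixing logit, a mean, and a log-variance.
        self.proj = nn.Linear(embed_dim, 3 * n_components)

    def forward(self, image_embed):
        logits, mean, log_var = self.proj(image_embed).chunk(3, dim=-1)
        return logits, mean, log_var

    def nll(self, image_embed, rating):
        # Negative log-likelihood of the observed rating under the mixture.
        logits, mean, log_var = self(image_embed)
        log_weights = F.log_softmax(logits, dim=-1)
        log_prob = -0.5 * ((rating.unsqueeze(-1) - mean) ** 2 / log_var.exp()
                           + log_var + math.log(2 * math.pi))
        return -torch.logsumexp(log_weights + log_prob, dim=-1).mean()

# Toy batch: made-up embeddings, with half the raters loving and half hating.
head = MixtureRewardHead()
embeds = torch.randn(64, 768)
ratings = torch.cat([torch.full((32,), 9.0), torch.full((32,), 1.0)])
loss = head.nll(embeds, ratings)
loss.backward()
```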
2) Conditioning on the rater
If we have information about our raters, we could provide it as a condition to the model, factorizing the full distribution of ratings into many conditional distributions, each one modeling a specific crowd. If we also have metadata about the scores, e.g., users from Australia typically rate art from the 1970s with X score, those factors could be engineered into the condition as well. We're now modeling p(score | x, rater) across many raters, but we can still recover the marginal p(score | x) by averaging over raters.
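A minimal sketch of rater conditioning, assuming we have integer rater IDs and precomputed image embeddings (both hypothetical here): a learned rater embedding is concatenated with the image features, and the marginal score can be recovered by averaging predictions over raters.

```python
import torch
import torch.nn as nn

class RaterConditionedReward(nn.Module):
    """Sketch: predict a rating for (image, rater) pairs by conditioning
    the reward head on a learned rater embedding."""

    def __init__(self, embed_dim=768, n_raters=1000, rater_dim=64):
        super().__init__()
        self.rater_embed = nn.Embedding(n_raters, rater_dim)
        self.head = nn.Sequential(
            nn.Linear(embed_dim + rater_dim, 256),
            nn.GELU(),
            nn.Linear(256, 1),
        )

    def forward(self, image_embed, rater_id):
        cond = self.rater_embed(rater_id)
        return self.head(torch.cat([image_embed, cond], dim=-1)).squeeze(-1)

model = RaterConditionedReward()
image_embed = torch.randn(8, 768)         # made-up image features
rater_id = torch.randint(0, 1000, (8,))   # which rater gave each label
pred = model(image_embed, rater_id)       # per-rater score predictions

# "Marginal" score for one image: average its prediction over all raters.
all_raters = torch.arange(1000)
marginal = model(image_embed[:1].expand(1000, -1), all_raters).mean()
```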
3) Z-vector or random condition
Similar in spirit to VAEs, we may be able to inject a randomly sampled z-vector into one of the deeper parts of the network and get a different rating for each z-vector we provide.
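A minimal sketch of that idea (the dimensions are arbitrary assumptions): the z-vector is concatenated partway through the network, so resampling z yields a different plausible rating for the same image.

```python
import torch
import torch.nn as nn

class ZConditionedReward(nn.Module):
    """Sketch: a reward head with a random z-vector injected mid-network,
    so each sampled z corresponds to a different plausible rater's score."""

    def __init__(self, embed_dim=768, z_dim=16):
        super().__init__()
        self.z_dim = z_dim
        self.trunk = nn.Sequential(nn.Linear(embed_dim, 256), nn.GELU())
        self.head = nn.Sequential(nn.Linear(256 + z_dim, 128), nn.GELU(),
                                  nn.Linear(128, 1))

    def forward(self, image_embed, z=None):
        h = self.trunk(image_embed)
        if z is None:
            z = torch.randn(h.shape[0], self.z_dim)  # sample a "taste" at random
        return self.head(torch.cat([h, z], dim=-1)).squeeze(-1)

model = ZConditionedReward()
image_embed = torch.randn(4, 768)
scores_a = model(image_embed)  # one sampled taste
scores_b = model(image_embed)  # a different sampled taste, different scores
```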
Reinforcement Learning with our nuanced reward model
A problem now is that in all of these cases, we are outputting multiple scores. The GMM outputs as many scores as we have Gaussians, conditioning on the rater gives us a possible score per rater condition, and the z-vector case gives a different score per sampled vector.
We could aim to maximize all scores, but then we'd be back at our original problem of getting an average output that is universally pleasing to all raters.
Instead, I think we should condition the generative model on the specific rater/vector and reward it accordingly. We can think of this like task embeddings, i.e., "make an output that is pleasing to rater 5." In other words, we now have many, many reward models and many input conditions describing the task scenario. It should be equivariant, meaning that as we swap the nature of the evaluation, we also swap the nature of the instruction/task.
This expands our training space substantially, since we can train on a single image but also have many different scenarios to evaluate it on.
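Here's a hypothetical skeleton of how those pieces could fit together; the functions are stand-in stubs, not any particular RL algorithm, and the point is only that the generator and the reward model share the same rater condition, and that one image can be scored under many conditions.

```python
import torch

n_raters = 1000

def generate_image(prompt_embed, rater_id):
    # Stand-in for sampling from a rater-conditioned diffusion model.
    return torch.randn(prompt_embed.shape[0], 768)

def reward(image_embed, rater_id):
    # Stand-in for the rater-conditioned reward model sketched above.
    return torch.randn(image_embed.shape[0])

prompt_embed = torch.randn(8, 768)
rater_id = torch.randint(0, n_raters, (8,))      # pick a target "audience"

images = generate_image(prompt_embed, rater_id)  # condition generation on it
scores = reward(images, rater_id)                # score under the same condition

# The same images can also be scored under other rater conditions, giving
# many (condition, reward) training signals per image.
other_raters = torch.randint(0, n_raters, (8,))
other_scores = reward(images, other_raters)
# `scores` / `other_scores` then feed whatever policy-gradient or
# reward-weighted update you prefer.
```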
