
Giving Models Their Own Sense of Taste

  • Writer: Ethan Smith
  • Oct 13
  • 6 min read

A friend of mine asked me, "How can we give models their own sense of taste?" He pointed out that we typically use humans as oracles in preference learning, which usually means training over many collected human ratings, and that he specifically wanted to see how a model could develop its own sense of taste, much like he and I each have our own.


I sat on this for a little bit. First, I went back to my article about the problem, and the bore, of averaging over many human preferences. With enough samples, a reward model averaging over many preferences converges deterministically to a single point, and it misses the idea that many people each have their own preferences. The average of all tastes feels hardly worth calling a taste. So to start, I'd establish that I want to avoid any method that simply takes the expected value over all tastes. In that article I wrote about how every human really has their own reward function, and figured that would be a good thing to build on here.


Then I thought about how we form our own sense of taste. It doesn't seem like something we are simply born with (although genes may play a role); arguably, it is something that is learned. But how?


I took a stab at a high-level hypothesis that I figured would be enough to start juggling with how it could surface in machine learning, even if it isn't entirely true.


Hypothesis: Humans learn their own sense of taste by:

  1. Exploring: Engaging with artifacts of reality, and observing what induces positive feelings vs negative feelings. This may be a product of what a human can relate to based on lived experiences and what they have previously engaged with.

  2. Social Factors: Given how much of human development and behavior can be explained by mimicry, perhaps we build our sense of taste by watching others, seeing what's popular, and selectively integrating these parts into ourselves. It's hard to imagine someone developing taste in a vacuum.


Sampling New Tastes

Point 1 feels more complicated; I'll attempt to address it later in the article. So for now, we'll consider point 2.


Here's what we can consider for now:

  1. Every human has their own reward function, which can be seen as their own taste.

  2. This taste develops socially, sharing similarities with other humans' reward functions while remaining distinct, i.e. it could resemble a linear combination of many other humans' reward functions (sketched loosely below).
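
As a loose formalization of that second point, a new taste could be written as a weighted mixture of existing reward functions (the mixture weights are purely illustrative):

r_new(x) ≈ Σ_i w_i · r_i(x),   with w_i ≥ 0 and Σ_i w_i = 1

where each r_i is an individual human's reward function and x is the artifact being judged.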


One proposal for what we can try first is not necessarily letting a model develop its own taste, but figuring out a means of sampling new tastes.


More precisely, we have a distribution of many unique reward functions, and we'd like to model that distribution so we can generate new kinds of tastes.


To make this a tractable problem, if we had a dataset of many trained reward models, we could imagine training a generative model of our choice over them to generate novel reward models that are in-distribution with those seen in the dataset, but still distinct.


Reward models can be quite large. It's not uncommon to see whole multi-billion-parameter LLMs used as reward models. However, let's take the case where a reward model is something like a frozen visual embedding model such as CLIP or DINO, along with a trainable MLP head that turns the output embedding into a score. And again, assume we have trained perhaps thousands, hundreds of thousands, or millions of these, each over an individual's preferences (which is in no way easy to collect).


In this case, the weights of the MLP head are small enough that we could train a diffusion model to generate and sample novel, plausible reward models, and thus tastes.
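
To make this concrete, here is a minimal sketch of that idea, assuming the flattened MLP-head weights fit comfortably in memory and using a plain DDPM-style noise-prediction objective (the architecture and names are illustrative, not a specific published method):

import torch
import torch.nn as nn

# Each reward model is an MLP head over frozen CLIP/DINO embeddings; we flatten
# its weights into a single vector and treat the collection of such vectors as
# the training data for a small diffusion model.

class WeightDenoiser(nn.Module):
    """Predicts the noise added to a flattened weight vector at timestep t."""
    def __init__(self, weight_dim, hidden=2048, n_timesteps=1000):
        super().__init__()
        self.t_embed = nn.Embedding(n_timesteps, hidden)
        self.net = nn.Sequential(
            nn.Linear(weight_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, weight_dim),
        )

    def forward(self, w_noisy, t):
        return self.net(torch.cat([w_noisy, self.t_embed(t)], dim=-1))

def diffusion_step(model, w0, alphas_cumprod, opt):
    """One DDPM noise-prediction step over a batch of flattened reward heads w0."""
    t = torch.randint(0, len(alphas_cumprod), (w0.shape[0],))
    noise = torch.randn_like(w0)
    a = alphas_cumprod[t].unsqueeze(-1)
    w_noisy = a.sqrt() * w0 + (1 - a).sqrt() * noise
    loss = ((model(w_noisy, t) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

Here alphas_cumprod would come from a standard noise schedule, e.g. torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0), and sampling would follow the usual ancestral DDPM procedure to produce a new flattened weight vector, which gets reshaped back into an MLP head: a freshly sampled taste.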


Following this, we could then perform RL on a generative model of our choice to "view" this taste.
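
As a very loose illustration of that last step, here's a toy REINFORCE-style update, standing in for whatever RL algorithm one would actually use (policy, sample_and_logprob, and sampled_reward_model are hypothetical stand-ins):

import torch

def rl_step(policy, sample_and_logprob, sampled_reward_model, opt, batch_size=8):
    """One REINFORCE-style step: nudge the generator toward outputs that the
    sampled (generated) reward model, i.e. the new taste, scores highly."""
    outputs, logprobs = sample_and_logprob(policy, batch_size)  # hypothetical sampler
    with torch.no_grad():
        rewards = sampled_reward_model(outputs)   # scores under the sampled taste
        advantages = rewards - rewards.mean()     # simple mean baseline
    loss = -(advantages * logprobs).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return rewards.mean().item()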


I think this captures the idea that the tastes we develop as humans are built out of the tastes we view others have. And in an abstract sense, we can even say we've modeled the generative function of how a taste has developed.


I think this could have a lot of interesting use cases in terms of simulating novel customers and their preferences. Though, in terms of letting a model discover and evolve its own sense of taste, this is hardly satisfying. The tastes kind of pop into existence instead of developing iteratively. We are missing the explore piece.



Continuously Evolving a Taste

This part will be significantly more speculative and vibey. Capturing the process of how humans explore and discover what they feel and like seems quite difficult, and it's uncertain what the analogue might be for a model. Still, I think we can come up with a loose but interesting framework.


The goal is to train a reward model from scratch through a process that is guided both by a social component and by self-guided exploration and self-rating.


For this case, we will imagine having a pretrained LLM along with the reward model we are training.


Desired features of such a training pipeline:

  1. Social Learning

    1. The learning of the reward model should be guided toward matching, or sharing similarities with, other reward models pretrained on individual humans' preferences.

  2. Exploration and Evaluation

    1. We can sample data points and stochastically sample a response from the pretrained LLM, asking it to provide a score for the given data point, hopefully reflecting how its pretrained knowledge translates into a "feeling" about, or enjoyment of, the data point.

  3. Curiosity

    1. We should identify data points that are rare or unseen with respect to the LLM as an opportunity to explore new kinds of content less familiar to the LLM.
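
One way to make the curiosity piece concrete is to measure how probable a text is under the pretrained LLM. Here's a minimal sketch using a Hugging Face causal LM (the model name is just a placeholder):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def mean_logprob(text: str) -> float:
    """Average per-token log-probability the LLM assigns to the text;
    lower means the text is less familiar to the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels=ids, the returned loss is the mean negative log-likelihood per token.
    return -model(ids, labels=ids).loss.item()

def is_rare(text: str, threshold: float) -> bool:
    """A data point counts as rare if its mean log-probability falls below a
    threshold, which can be tightened over the course of training."""
    return mean_logprob(text) < threshold

This is roughly the role p_LLM(x) plays in the pipelines below (here as a mean log-probability rather than a raw probability).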




Implementation 1 - LLM remains frozen, easier but possibly less faithful

Definitions

  • LLM = Pretrained LLM

  • Dr = Dataset of reward models

  • Dt = Dataset of texts, pre-annotated with their log-probabilities under the pretrained LLM

  • K = number of reward models to consider when grabbing the top-k most similar

  • social_step_every_n = how often to perform a social learning step

  • rareness_threshold_schedule = a schedule for the threshold on how rare a text data point must be. Over the course of training, sampled text data points become progressively rarer.

  • R = reward model we are training

  • training_steps = number of training steps

  • LLM_score(x) = prompting the LLM to score the text data point

  • p_LLM(x) = the probability the LLM assigns to a sample


  1. Randomly initialize reward model weights, R.

  2. For i in training_steps:

    1. rareness_threshold = rareness_threshold_schedule(i)

    2. Sample a text data point, x, from Dt with p_LLM(x) < rareness_threshold

    3. if i % social_step_every_n == 0:

      1. Through model similarity methods like CKA, identify top-K most similar reward models from our reward model dataset, Dr, to the reward model we are training.

      2. R_test = Randomly sample one reward model from this set.

      3. score_target = R_test(x), i.e. evaluate the text using the chosen reward model

    4. else:

      1. score_target = LLM_score(x)

    5. Loss = |R(x) - score_target|^2


In this implementation, we alternate between two kinds of training step:

  1. Social Step: The reward model is trained to match the prediction of another randomly sampled reward model, prioritizing reward models that are similar to it (e.g. by CKA, sketched just below), based on the heuristic that socially we engage with people who are similar to us.

  2. Explore and Evaluate Step: The frozen LLM scores a data point, and the reward model is trained to match this prediction.
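
For the "similar reward models" part of the social step, one concrete option is linear CKA computed over each model's activations (or even just its scores) on a shared probe batch; a small sketch:

import torch

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n_samples, features);
    values near 1 mean the two models represent the probe batch similarly."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2   # ||Y^T X||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

To pick social partners, run the same probe batch through the in-training reward model R and each reward model in Dr, compute CKA against R, and keep the top-K.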


We train until convergence, and once the reward model is done, we use it to perform RL on the LLM.
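
Putting Implementation 1 together as a rough Python sketch (R is assumed to map a text to a scalar score via the frozen embedder plus MLP head; top_k_similar and the llm_score/p_llm callables are hypothetical stand-ins for the pieces described above):

import random
import torch
import torch.nn.functional as F

def train_reward_model(R, Dr, Dt, llm_score, p_llm, rareness_threshold_schedule,
                       training_steps=10_000, social_step_every_n=4, K=5):
    """Alternate social steps (match a similar pretrained reward model) with
    explore/evaluate steps (match the frozen LLM's own score), on text the
    LLM finds progressively rarer."""
    opt = torch.optim.Adam(R.parameters(), lr=1e-4)
    for i in range(training_steps):
        threshold = rareness_threshold_schedule(i)

        # Rejection-sample a text the LLM finds sufficiently rare.
        x = random.choice(Dt)
        while p_llm(x) >= threshold:
            x = random.choice(Dt)

        if i % social_step_every_n == 0:
            # Social step: imitate a reward model similar to ours (e.g. by CKA).
            R_test = random.choice(top_k_similar(R, Dr, k=K))  # hypothetical helper
            with torch.no_grad():
                score_target = R_test(x)
        else:
            # Explore/evaluate step: the frozen LLM rates the text itself.
            score_target = torch.tensor(llm_score(x))

        loss = F.mse_loss(R(x), score_target)
        opt.zero_grad(); loss.backward(); opt.step()
    return R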



Implementation 2 - Active learning with the LLM in the loop; reward model and LLM are trained simultaneously

Definitions:

  • LLM = Pretrained LLM

  • Dr = Dataset of reward models

  • Dt = Dataset of texts

  • K = number of reward models to consider when grabbing the top-k most similar

  • social_step_every_n = how often to perform a social learning step

  • rareness_threshold_schedule = a schedule for the threshold on how rare a text data point must be. Over the course of training, sampled text data points become progressively rarer.

  • R = reward model we are training

  • training_steps = number of training steps

  • LLM_begin_training_step = step at which we begin performing RL training over the LLM

  • LLM_score(x) = prompting the LLM to score the text data point (computed live, since the LLM is now training)

  • p_LLM(x) = the probability the LLM assigns to a sample (also computed live)


  1. Randomly initialize reward model weights, R.

  2. For i in training_steps:

    1. rareness_threshold = rareness_threshold_schedule(i)

    2. Sample a text data point, x, from Dt with p_LLM(x) < rareness_threshold

    3. if i % social_step_every_n == 0:

      1. Through model similarity methods like CKA, identify top-K most similar reward models from our reward model dataset, Dr, to the reward model we are training.

      2. R_test = Randomly sample one reward model from this set.

      3. score_target = R_test(x), i.e. evaluate the text using the chosen reward model

    4. else:

      1. score_target = LLM_score(x)

    5. Loss = |R(x) - score_target|^2

    6. if i > LLM_begin_training_step:

      1. Perform RL step to update LLM weights using our actively training reward model R


This implementation is similar to the prior one; however, because the LLM is now training inside the loop, both p_LLM(x) and LLM_score(x) change over the course of training. Theoretically, we might expect text data points the LLM originally deems rare to become less rare as they are seen. Additionally, we might expect the LLM's ratings to shift as it is reinforced toward certain kinds of data points (though this part is more nuanced: we aren't necessarily using RL to determine how to score points, so this piece could use some adjustment).
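
In code, the only structural change from the sketch above is the interleaved RL update at the end of the loop body (rl_update stands in for PPO or a similar algorithm and is hypothetical), along with computing p_llm and llm_score live against the current LLM rather than reading pre-annotated labels:

        # ... same loop body as in Implementation 1, then:
        if i > LLM_begin_training_step:
            # The in-training reward model R now reinforces the LLM itself, so
            # p_llm(x) and llm_score(x) drift as the LLM's weights change.
            rl_update(llm, R)   # hypothetical PPO/REINFORCE-style step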


There are likely lots of holes to poke in the specifics here, or even in whether this would give us the results we want, though the general framework appears plausible.

