
The Symmetry Between Models and Data | The Isomorphism Between You and Your World.

  • Writer: Ethan Smith
  • Jul 28
  • 17 min read

There was an interesting quote from Janus's article on simulators.

"GPT" is not the text which [it] writes itself

Furthermore, they write:

There is a categorical distinction between a thing which evolves according to GPT's law and the law itself.

I initially agreed with this. It's congruent with how I've understood LLMs—where GPT is not the characters described in its text, but the rules that give life to them. However, seeing it phrased like this gave me a second thought. What if it could be both simultaneously?


Let's talk about isomorphisms: cases where we can represent the same exact data in multiple forms and convert between them.


First, some background on transformations: ways of converting data into another form. To start, we’ll consider the Fourier transform.


The Fourier transform is a way to convert a signal in the time domain into the Fourier domain. Let's say we have recorded a number of samples of, say, an audio signal, tracking its intensity from moment to moment. We can convert this signal losslessly and bijectively into the Fourier/spectral domain, where we express the signal not as a time series of intensities but instead as a sum of different basis frequencies.


This reveals a deep connection between data and transformations/functions. The Fourier transform is special in that it is invertible: we can transform back and forth losslessly between the pre-transformed data and the post-transformed data. When we compute the Fourier transform, we are expressing our data in the basis of another set of data, the pre-chosen harmonic frequencies. More specifically, instead of representing a signal as its intensity at every given moment, we can take a library of simple waves at different frequencies and express our signal as a combination of them, like ratios of ingredients in a recipe. But we can really use any set of basis signals. For example, we could imagine having a library of other arbitrary reference signals and transforming our signal into a domain where each value represents its relation (dot product) to one of the reference signals.
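
As a quick illustration of that invertibility, here is a minimal sketch, assuming only NumPy; the signal is an arbitrary made-up example. It sends a signal into the Fourier domain and recovers it exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1024, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t) + 0.1 * rng.standard_normal(t.size)

spectrum = np.fft.fft(signal)            # time domain -> frequency domain
recovered = np.fft.ifft(spectrum).real   # frequency domain -> back to time domain

print(np.allclose(signal, recovered))    # True: nothing was lost
```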


This extends more broadly to linear transformations as a whole (although invertibility/bijectivity is not necessarily guaranteed). Typically when we do matrix multiplication, we might think of the matrix on the left as the transformation matrix and the other as the data matrix. Though this is really a matter of semantics and of which role we assign to each matrix; at their core, they look identical. We're just expressing one matrix, a set of vectors, in the basis vectors of the other. We say the transformation is isomorphic if we can find a transformation matrix that inverts the result back to the pre-transformed data matrix.
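
The same round trip can be sketched for a general linear map. This is a minimal sketch assuming NumPy, with random made-up matrices standing in for the data and the basis:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((4, 8))    # 8 data vectors living in a 4-d space
basis = rng.standard_normal((4, 4))   # the "transformation" matrix

transformed = basis @ data                      # data expressed in the new basis
recovered = np.linalg.inv(basis) @ transformed  # only possible if basis is invertible

print(np.allclose(data, recovered))             # True for this invertible basis
```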


Neural networks are largely just matrix multiplications (and point-wise nonlinearities). The change of basis can be seen as expressing the input data as the strength of its relation to reference concepts.


Imagine a few-layer MLP that takes in a list of variables about a house and predicts a price for it. At each layer we have the opportunity to create new output variables as weighted sums of the input variables. In combination, we can reach higher levels of abstraction. When our variables are approximately normal and centered around 0, we reach the largest values when the input vector has high cosine similarity (dimensions share the same sign) with a given transformation vector, making the transformation vector effectively represent that concept. In other words, we have a stored pattern, or a “prototype vector,” representing that concept in space, and we assess how much incoming data matches this pattern; the degree of alignment in vector space becomes the degree of activation for that neuron or concept. To note, almost everything we do in deep learning comes down to vector searches: expressing data in a vector space, comparing data to each other, and building up a library of these prototype reference vectors to express an incoming signal in the basis of our references. This video explains it elegantly.
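
Here is a minimal sketch of that "activation as pattern matching" idea, assuming NumPy; the prototype vectors and the input are made-up numbers, not weights from any real model:

```python
import numpy as np

def cosine(a, b):
    # alignment between an incoming vector and a stored prototype
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

prototypes = {
    "spacious": np.array([0.9, 0.7, 0.2, 0.0]),
    "homey":    np.array([0.0, 0.1, 0.8, 0.9]),
}
x = np.array([0.8, 0.6, 0.1, 0.2])  # an incoming data vector

# Each "neuron" fires in proportion to how well the input matches its prototype.
for name, w in prototypes.items():
    print(name, round(float(cosine(x, w)), 3))
```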


Let's imagine we have the following raw variables:

[sq_ft, num_rooms, num_fireplaces, level_of_furnishing]

From there we could imagine creating new variables (with names I'm making up; these don't necessarily have names because they are “hidden variables”) like

spaciousness = sq_ft x alpha + num_rooms x beta + num_fireplaces x gamma

homeiness = num_fireplaces x alpha + level_of_furnishing x beta

emptiness = sq_ft x alpha + num_rooms x beta + num_fireplaces x gamma + level_of_furnishing x delta

furnishing = num_fireplaces x alpha + level_of_furnishing x beta

Greek letters represent the learned weights of the transformation.
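
Written as code, that first layer is just one small weight matrix. This is a minimal sketch assuming NumPy, with made-up weights standing in for the Greek letters:

```python
import numpy as np

# rows: [spaciousness, homeiness, emptiness, furnishing]
# cols: [sq_ft, num_rooms, num_fireplaces, level_of_furnishing]
W = np.array([
    [0.8, 0.5, 0.1,  0.0],   # spaciousness
    [0.0, 0.0, 0.6,  0.7],   # homeiness
    [0.6, 0.4, 0.1, -0.5],   # emptiness
    [0.0, 0.0, 0.3,  0.9],   # furnishing
])

house = np.array([2.1, 0.5, 1.0, -0.3])   # standardized raw variables
hidden = np.maximum(W @ house, 0)         # weighted sums plus a ReLU nonlinearity

names = ["spaciousness", "homeiness", "emptiness", "furnishing"]
print(dict(zip(names, hidden.round(2))))
```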



Something like that. In the next layer, these can be combined again (spaciousness + homeiness = ?), giving us an increasingly high-level view.


A similar case happens for vision models, but also, really, for everything else.


We can see here that the convolutional filters in early layers only detect simple shapes, like edges where light changes to dark and shadows at different angles. In later layers these are combined to form more abstract shapes like eyes, wheels, or corners. In even later stages, we get full objects.
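
A minimal sketch of what one of those early-layer edge detectors does, assuming NumPy and SciPy; the tiny image is made up:

```python
import numpy as np
from scipy.signal import correlate2d

image = np.zeros((8, 8))
image[:, 4:] = 1.0                 # toy image: dark left half, bright right half

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])   # responds to horizontal changes in brightness

response = correlate2d(image, sobel_x, mode="valid")
print(response)                    # large values along the vertical edge, zero elsewhere
```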


A very hand-crafted form of this is the Viola-Jones algorithm, which detects faces by relying on certain regions being lighter or darker, partially because light usually comes from overhead. The more of these features are found, and the more closely they align with the reference pattern, the higher the odds that we are looking at a face.



Really, all of our arousal and reactions can be explained by these prototype patterns we have and how highly activated they are. There was an experiment with geese where eggs were removed from their nest, triggering a fixed action pattern in which the goose goes to retrieve the egg. Interestingly, in one case a volleyball was used instead of an egg, and this caused an even more intense reaction, presumably because a larger egg may signal a more promising offspring. This is to say that while we may have a complex set of patterns stored in our bank, the mechanism by which they act is simple: something matches a pattern, and a response is triggered. I see no reason to reject that this may extrapolate to humans as well, where facial features trigger the recognition of a person. One tangible example could be that human attraction is triggered by patterns pertaining to masculine and feminine features; the more distinct they are from the other gender, the stronger the signal that this is in fact a compatible potential mate of the opposite gender, which is perhaps why it can be fooled by cartoon characters that distill or even emphasize such patterns. All of our emotional reactions, too, may be the product of incoming stimuli that align with our learned patterns and the physical responses and context we associate with them.


Google’s DeepDream, among other feature visualization studies, reveals what these patterns are; amplifying them shows us which stimuli are most activating.



And it’s not just visual patterns that are activated. Over many layers of processing, we may be aligning with abstract patterns that span multiple senses and longer periods of time, like identifying a certain play in a football game, recognizing a bug in code, or placing music in a certain time period.


Okay, a little sidetracked. This is all cool. However, we're still a bit away from showing why we might think of GPT as both the function that generates text and the text itself.


Let's imagine we have infinite parameters and our data is devoid of noise (note this is a really big assumption). We take in one value and give back a score, a regression task. We could imagine this to be something like predicting return on investment given the number of days since a company's last dividend payment. We'd like to see that for samples shown to the model during training, it can predict close to the true value while also hopefully generalizing well to unseen samples.


In this diagram, the blue line shows what it predicts for each value, the black line shows the true underlying function, which is the actual value we should be predicting, and the blue dots are the data points we had to train on. We've fit our network to just 5 measly data points, which it actually predicts perfectly, but everything else is quite a bit off.


[Figure: the model's prediction (blue line), the true underlying function (black line), and the 5 training points (blue dots)]

Depending on our random initialization, we could end up learning a number of different functions that all fit the data points perfectly but are each wrong for the unseen regions in their own flavor.
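
A minimal sketch of this setup, assuming PyTorch; the "true function" is a made-up stand-in, and changing the seed changes which of the many possible fits you get:

```python
import torch

torch.manual_seed(0)                          # a different seed gives a different fit
true_fn = lambda x: torch.sin(3 * x)          # stand-in for the real phenomenon
x_train = torch.linspace(-2, 2, 5).unsqueeze(1)
y_train = true_fn(x_train)

model = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(2000):                         # overfit the 5 points essentially perfectly
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x_train), y_train)
    loss.backward()
    opt.step()

x_dense = torch.linspace(-2, 2, 200).unsqueeze(1)
off_data_error = (model(x_dense) - true_fn(x_dense)).abs().mean()
print(loss.item(), off_data_error.item())     # tiny train loss, larger error elsewhere
```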



It's often nice to visualize it as a range of possibilities rather than a specific function, where the denser the color, the more likely it is for a sampled function to pass through that point.



By the way, this is what Gaussian processes are all about. I recommend the interactive tutorial here if curious. Kind of like how we learned two different functions above, the shaded region shows the uncertainty over the functions that could be learned here: essentially the mean value over all the functions we would learn and the deviation we expect around it. As we add in more data points, we "pinch" the uncertainty down to 0 at those locations (because we have their true values), and every candidate function crosses exactly through the data point.
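
A minimal sketch of that pinching behavior, assuming scikit-learn; the data points and underlying function are made up:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

x_train = np.array([[-2.0], [-0.5], [1.0], [1.5], [2.0]])
y_train = np.sin(3 * x_train).ravel()                # a made-up underlying function

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(x_train, y_train)

x_test = np.linspace(-3, 3, 7).reshape(-1, 1)
mean, std = gp.predict(x_test, return_std=True)
for x, s in zip(x_test.ravel(), std):
    print(f"x={x:+.1f}  predictive std={s:.3f}")     # near zero close to training points
```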



You can imagine, as we keep doing this in the limit of infinite data, we recover the exact function with no uncertainty anywhere.


At this point the function is effectively isomorphic with a lookup table of the data we have. In other words, because we have covered the entire space, when we get an input value, we could feed it through our model to get a prediction, or we can just reference the data point we have of the exact same value. Both give us an equivalent result everywhere.
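
A minimal sketch of that equivalence in plain Python; the function and the covered domain are toy stand-ins:

```python
true_fn = lambda x: x ** 2               # stand-in for the phenomenon we fit

# "infinite data": here, every input we will ever be asked about
table = {x: true_fn(x) for x in range(-100, 101)}

def model(x):
    return x ** 2                        # a model that fits every data point exactly

# For any query inside the covered domain, the two are indistinguishable.
print(all(model(x) == table[x] for x in table))   # True
```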



In the limit of infinite data and parameters, and assuming optimization goes smoothly, we can fit a phenomenon perfectly. This happens regardless of specific architecture or other design choices. Architecture is a more interesting debate when considering what the function should be in the unseen regions and what kinds of intrinsic inductive biases let us make a reasonable guess for what should go there. This is also a valid function that could be learned, but it seems unlikely just from what we've generally seen of real-world stuff, right? So an inductive bias is just that; it's an educated guess based on our expectations.


[Figure: another, less plausible function that also fits the training points exactly]

A way of thinking of inductive biases is how the model’s design or optimization lends itself to generalizing to new data points. In many cases, we will incentivize the model to learn smoothness, preferring smooth, flatter, slower-changing slopes over steep ones with sharp turns. Similarly, this can be abstracted to how language models deal with unseen scenarios and ambiguous information. For example, if we were asked to continue the sequence [1, 2, 3, 4, _, _], we would naturally answer 5, 6. Though really there is no correct answer here, just one that seems natural to us. The rule could have been to add 1 until we reach 4 and then start adding 2 each time. However, we are inclined towards simple solutions and patterns similar to ones we’ve seen before, in the spirit of Occam’s Razor and Solomonoff induction. If we cut the sequence down to [1, 2, _, _], it becomes more ambiguous. It could very reasonably be [1, 2, 4, 8], multiplying by 2 each time. The explanations we gravitate towards are indicative of our inductive biases.
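
A minimal sketch in plain Python of two rules that agree on the prefix [1, 2] and then diverge, each reflecting a different inductive bias:

```python
prefix = [1, 2]

def add_one(seq):      # arithmetic bias: keep adding 1
    return seq + [seq[-1] + 1]

def double(seq):       # geometric bias: keep multiplying by 2
    return seq + [seq[-1] * 2]

a, b = list(prefix), list(prefix)
for _ in range(3):
    a, b = add_one(a), double(b)

print(a)   # [1, 2, 3, 4, 5]
print(b)   # [1, 2, 4, 8, 16]
```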


I think often of the ARC challenge, which was a problem set for LLMs and humans alike. Interestingly, humans often are able to solve similar problems to each other and fail in similar ways. On the other hand, LLMs fail on problems in ways that are less intuitive to us. Something that may seem very easy to us might be difficult for an LLM, and vice versa. For problems that do not have enough information for a single objective solution, there may be, say, 3 solutions that could “make sense,” though something about human thought may always have us lean towards one, while LLMs may pick a different one.


Importantly, noting that neural networks are also just universal function approximators, and noting their close relation to Gaussian processes, this reasoning applies to them as well. Because of the randomness in neural network training, we’ll say there’s an isomorphism between the data and a posterior distribution of models rather than one exact model, which is close enough for our purposes. Not to mention, this posterior becomes increasingly tight in the limit of infinite data and parameters.


But this is irrelevant anyway; since we have infinite data and parameters, we can just brute-force overfit our model and still come out with minimal error. This is the Bitter Lesson at work. At some level of scale, all our finer work on crafting useful inductive biases and feature engineering is useless and possibly even detrimental to model performance.


I like to imagine the predicament like this.


We have in front of us a shape that we would like to learn about. This shape is the manifold of our data, a decision boundary, or similar coveted insight about our data. The problem is that it's invisible. We also have some paint, but a very limited amount. We can cover this object in paint to reveal its edges and contours.



Let's imagine we successfully and sparsely covered the object with paint. The whole thing is not covered, but there are paint marks almost uniformly around it. For the invisible regions in between paint globs, we have no idea what could be there. Though we could make all kinds of reasonable guesses, like assuming smoothness.



But what we'd really like to have is more paint (and more model scale). At this point, we are overfitting everywhere, fitting every data point perfectly. But there is so much data, and so many model parameters, that overfitting all the data points amounts to fitting the function itself.



Our 1D example is fairly easy to cover completely, but the 786,432-dimensional space of all 512x512 RGB images is far too unfathomably large to fill with all the possible data points. However, it's worth talking about what happens as we approach this limit. We can have extremely tight fits to the phenomena at hand while still generalizing very well.


I think this is part of why synthetic data is an exciting space right now. We have the potential to fill out areas of our less covered manifold without needing to collect data manually and go beyond what is presently available. Now, the counterargument to this is we can’t pull information out of thin air. The ambiguous regions of our manifold translate to ambiguous outputs that we can’t assume are part of the ground truth manifold. However, if we could curate synthetic data, then this qualifies as providing information, similar to the Maxwell’s demon thought experiment, where the simple gatekeeping of information itself constitutes modulating the information of a system.


This is my stance on when we see image generation models vomit out exact scenes and images. This is not overfitting. Overfitting implies high variance in unseen regions because of how much we have fit the noise in the data. What's happening here may be undesirable, but it's hard to call it overfitting. It still otherwise generalizes very well. When LLMs can recall quotes or excerpts from common pieces of text, we applaud them for successfully recalling truth about our reality. Though when image generation models do it, it's seen as an error. If you were to ask for the Mona Lisa or Starry Night and get back something else, wouldn't that be stranger? Not to mention there is some variance in exact textures and details.



The moral of this story is that in the limit of data, a database/lookup table of actual data points and a trained model learning the function of the dataset appear identical. Every generative model then sits somewhere on this gray continuum, where at one end sits a very complete database and our function, instead of producing new content, just "pulls out" existing samples.


Does this yet mean that GPT is both the function that generates its rollouts and the aggregate of all its outputs themselves?


Let's try another thought experiment.


These days, model/knowledge distillation is a commonly used technique in machine learning. It is often done to obtain smaller models, or as a means of iterative improvement. It can take on many forms, but it generally involves one model training on targets produced by another model. For language modeling, this could be training on text generated by another model, on the logits (the distribution over next words for each token), or on intermediate features. The below shows a case where the student model aims to match the intermediate features of the teacher model.


[Figure: a student model trained to match the teacher model's intermediate features]

Let's imagine we used a trained GPT to generate infinite text (let's assume no truncated sampling like top-p or top-k, as this conveys an incomplete picture of the distribution). The typical use case is to distill into a model of a different size/dimensions, to sample in a certain manner, or to produce answers to specific tasks, though our goal here is just to copy a model's behavior as faithfully as possible into another model of the same size. So we train a student model of that same size on all of that data. The result is that the two models have very close, if not identical, predicted logits over data and thus produce similar text as well.
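
A minimal sketch of the logit-matching objective, assuming PyTorch. The "teacher" and "student" here are identically shaped stand-in modules, not real GPTs:

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 50257, 512
teacher = torch.nn.Linear(d_model, vocab_size)     # frozen stand-in for the teacher
student = torch.nn.Linear(d_model, vocab_size)     # same-sized, randomly initialized student
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

hidden = torch.randn(8, d_model)                   # stand-in for token representations

with torch.no_grad():
    teacher_logits = teacher(hidden)

# KL divergence between next-token distributions: over enough data, driving
# this to zero forces the student to reproduce the teacher's behavior.
loss = F.kl_div(
    F.log_softmax(student(hidden), dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)
loss.backward()
opt.step()
```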


To me, this reads like we just converted a model/function into its data isomorphism and then converted it back.

Model -> data -> same-ish Model

Unlike the Fourier transform, this is far from linear, and it is a stochastic transformation given the student model’s random initialization and random order of seeing data, hence the “same-ish” label. However, like our previous examples, as we increase the amount of data and thus constraints on the solution space, we go from randomly sampling a wide range of possible models that could explain the limited data to a tighter and tighter posterior distribution until we match the function, and there really is only one true and sensible solution. Through this, we end up recovering approximate bijectivity and determinism.


Because of random initialization, it is possible that despite having identical output behavior, the internal activity might differ. Though I would argue that in the infinite-data regime, the weights and hidden states would likely be nearly identical up to a permutation/rescaling/rotation; still, beyond that, diverging internals are a possibility we may observe.


So another option is to train the student to match internal features with the teacher. We could even train it to match everything everywhere all at once. Though at some point, we might as well just copy over the weights directly?
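
A minimal sketch of adding that stricter internal-matching term on top of the logit loss, again assuming PyTorch; the weighting factor is made up:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, student_hidden, teacher_hidden, alpha=0.5):
    # behavior matching: student's next-token distribution vs. the teacher's
    logit_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # internal matching: penalize differences in hidden activations directly
    hidden_loss = F.mse_loss(student_hidden, teacher_hidden)
    return logit_loss + alpha * hidden_loss
```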


This vaguely reminds me of the Ship of Theseus problem. The Ship of Theseus, however, is concerned with whether identity lives in the exact matter of something or in the structure/meaning it creates.


Somewhat similarly, we are transferring the parts of one model to another via knowledge transfer. We are curious if models can be considered effectively equivalent if their output distributions are nearly identical, which, as it stands, may already cause the two to have very similar internal processes, but we can also make this more strict by forcing the two models to match internal processes as well via losses that incentivize matching hidden state values.


Let us deconstruct a bit of what's going on here. We have

  • The predicted logits/distribution over next token (actions/observable behavior), which are caused by

  • Internal hidden states and activations (brain activity), which are caused by

  • The model's trained weights (brain structure specifics)


If there is only one singular way to achieve a specific set of predictions over infinite data and parameters, then simply matching the output logits could also result in matching the other two. We could possibly also assume that if not identical, internals may still be "close enough," at least close enough that the models are outwardly 1:1 identical. Thus, we "smooth" out all of the randomness of sampling, initialization, etc., such that optimization proceeds towards a practically deterministic solution every time. The posterior becomes tight enough to appear like a point.


So, it seems there's a pretty tractable case for how a model, a function generating a distribution p'(x), can be mapped into a set of data points reflecting that data distribution, and back again.


Now having established this correspondence, we can go back to what this means for your own world model. There is effectively a You as your weights and a You as your actions, which, as we've determined in the limit, are approximately synonymous.


What do I mean by You? I want to take a moment to firmly disambiguate You from definitions of consciousness, which is an irrelevant concept here.


Let's say we create a robot that is a 1:1 model of a human. We are curious whether, upon pinching it, there is only something there that performs the behavior of displaying pain, or whether there is also something inside that actually, genuinely felt that pain. Consciousness is what is in question there. Meanwhile, the You is only relevant to the first part: the fact that a pain signal was sent to a brain, assessed through your unique web of neurons, and produced your unique response, and that this response is a unique result of the brain and body that generated it. You is a function, a model, that produces your inner world, thoughts, actions, and just about everything that can be grounded in physical processes.


We’ll make a softer split of You into an Inner You and an Outer You. This one is much less important and has a blurrier line, since the Outer You is a direct product of the Inner You by physical processes.


The Outer You encapsulates everything that can be observed visually: your actions and your expressions, the You that you present to the world.


The Inner You encapsulates your experiences, your memories, and your thoughts: the You that cannot be trivially seen but still has explanations that exist in the physical world.


The Outer You is definitively measurable. Much of the Inner You, and perhaps one day, with the proper technology, all of it, can be observed by monitoring brain activity.


I'd like to think on the following questions:

  • Does You persist if we copy your brain 1:1? (The model weights)

  • Does You persist if we copy the brain into a different medium?

    • How far abstracted can this be pushed? In other words, does it hinge on the exact mechanisms of the brain or the resulting symbolic program?

  • Can we extract a You from recorded outputs, whether brain activity, behaviors, vocal recordings, writings, etc.? (The model’s hidden and output activity)


In Pantheon, brains are scanned and uploaded as a program on a computer. The process requires that the person be alive so that brain activity can be scanned, but it destroys the brain. I think this is to spice up the plot more than anything, or it could be argued that at their level of technology, active readings of brain activity were needed for the scan to be successful. But from my understanding, all activity of the brain arises from its design, so it's likely that the design alone would be sufficient.


Anyway, this would be most similar to the copying-over-weights, where a digital program is constructed to mirror the structure of the brain and hence its control flow of activity.


So the question is, is this still You?


Keep in mind we're not concerned with whether your consciousness survives and makes it over to the digital copy, just whether it reflects the You in its model of the world and the way it processes experience.


If the answer is affirmative, then the next question is whether this could also be possible from a very extensive documentation of you, possibly impossibly long.


This is where it gets a bit hazier for me.


Unlike the teacher GPT, which can generate samples at random to reflect its distribution, it's not as clear how this would be done for a person. Our selves are revealed in responses to a changing environment. We can't necessarily be put in a box and asked to sample all of our behaviors.


So the next best option is a very extensive life recording. Perhaps if a camera followed you all the time, and we recorded every vocalization, expression, and everything you create/"write" to the outer physical world, could we faithfully obtain the Inner You?


This also prompts skepticism. If we did have infinite data of everything you outwardly present to the world, I would be fairly confident that we could produce a convincing copy of the Outer You. In other words, the copy may be indistinguishable from the original from the outside. Though I'm not 100% sure we could infer the inner world model from this limited view. But at that point, if something could match your behavior and expressions all the time with 100% accuracy, would that be You for all intents and purposes?


Regardless, it is fascinating to imagine the possibility that a fabricated You from just recordings or even your writings could be able to fool even friends and family. If this were enough, then You would be just as much your corporeal form as the You represented by all the words you write, every motion you make, and all the things you do.


Part of the issue here is that there are many produced samples that do not result in trivially observable outputs. For instance, we could have footage of you working through a math test for an hour; we do not necessarily observe your reasoning chains. Even worse, you're doing it all mentally. Then we just have an hour-long recording of you sitting at a desk, penciling an answer in every once in a while. It also completely omits dreams, which are a significant part of experience. This is unlike the feedforward style of GPT, where every process is pipelined towards an output.


So the next step up is to also record brain activity extensively. This, I think, could be sufficient to come extremely close if not exactly You. In practice, we’d be fitting to a moving target because we change through time. The data of brain recordings of you as a child are results of a different function than the function that is You right now. However, similar to the GPT example, if we could somehow freeze the state of our mind and generate mass amounts of activity and mental events, maybe it would be feasible.


Some time ago, I tried fitting a function to myself. I did this by training GPT2 on all of the things I could find that I ever wrote: every school assignment, email, anything.
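
A minimal sketch of that kind of fine-tune, assuming the Hugging Face transformers and datasets libraries; "my_writing.txt" is a hypothetical stand-in for the collected text, not the exact setup used:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# my_writing.txt: a hypothetical dump of everything you've ever written
dataset = load_dataset("text", data_files={"train": "my_writing.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-me", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```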


The results were eerie, in the best way. I could tell this snippet, for instance, was referencing an application I made to a club I was part of in college, where I discussed my fascination with art.


While I've always had a passion for photography, it was only after I had my first child that I had found my way into this field.

My husband and I found our first home-made collage on Craigslist about a year and a half ago. It was a simple collage of clouds and leaves that we found along with some other random items. It was through them that we found our dream of making art. We had no idea what we were doing or how

However, it's blended with GPT2's existing knowledge and imprecision. It ended up depicting the life of a totally different person but using my voice and stories.


So, in practice, can we reproduce ourselves simply from our actions and the data we create? Probably not. Though theoretically the door appears open.


