Are AI Deep Network Models Converging?
Are artificial intelligence models evolving towards a unified representation of reality? The Platonic Representation Hypothesis says ML models are converging.
A recent MIT paper has come to my attention for its impressive claim: AI models are converging, even across different modalities — vision and language. “We argue that representations in AI models, particularly deep networks, are converging” is how The Platonic Representation Hypothesis paper begins.
But how can different models, trained on different datasets and for different use cases, converge? What has led to this convergence?
1. The Platonic Representation Hypothesis
“We argue that there is a growing similarity in how datapoints are represented in different neural network models. This similarity spans across different model architectures, training objectives, and even data modalities.”
1.1 Introduction
The paper’s central argument is that models of various origins and modalities are converging to a representation of reality — the joint distribution over the events of the world that generate the data we observe and use to train the models.
The authors argue that this convergence towards a platonic representation is driven by the underlying structure and nature of the data that models are trained on, and by the growing complexity and capability of the models themselves. As models encounter various datasets and wider applications, they require a representation that captures the fundamental properties commonly found in all data types.
1.2 Plato’s Cave
The paper specifically references Plato’s Allegory of the Cave to draw an analogy between how AI models are hypothesized to develop a unified representation of reality and Plato’s philosophical ideas about perception and reality. In Plato’s allegory, prisoners in a cave see only shadows of real objects projected on a wall, and they believe those shadows to be reality. However, the true forms of these objects exist outside the cave and are more real than the shadows the prisoners perceive.
2. Are AI Models Converging?
AI models of various scales, even built on diverse architectures and trained for different tasks, are showing signs of convergence in how they represent data. As these models grow in size and complexity, and as their training data grows larger and more varied, the ways they represent data begin to align.
Do models trained on different data modalities, such as vision and text, also converge? The answer could be yes!
2.1 Vision Models that Talk
This alignment even spans visual and textual data. (The paper later acknowledges that a limitation of the hypothesis is its focus on these two modalities, leaving out others such as audio or a robot’s perception of the world.) One piece of supporting evidence [1] is LLaVA, which shows that projecting visual features into the language embedding space with a simple 2-layer MLP yields state-of-the-art results.
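To give a feel for how lightweight that bridge is, here is a minimal sketch of a LLaVA-style projector in PyTorch. The dimensions, names, and activation are illustrative assumptions on my part, not LLaVA’s actual implementation:

```python
import torch
import torch.nn as nn

class VisionToLanguageProjector(nn.Module):
    """A 2-layer MLP that maps vision-encoder features into a language
    model's embedding space (LLaVA-style idea; the dimensions here are
    illustrative, not the actual LLaVA configuration)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # returns:        (batch, num_patches, llm_dim), ready to be
        # concatenated with text token embeddings and fed to the LLM
        return self.proj(patch_features)


# Usage sketch: project dummy ViT patch features into the LLM space.
projector = VisionToLanguageProjector()
dummy_patches = torch.randn(2, 576, 1024)   # e.g., 24x24 patches per image
visual_tokens = projector(dummy_patches)
print(visual_tokens.shape)                  # torch.Size([2, 576, 4096])
```

The striking part is that such a small module suffices: the heavy lifting is already done by representations that the vision and language models learned separately.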
2.2 Language Models that See
Another interesting example is A Vision Check-up for Language Models [2], which explores the extent to which large language models understand and process visual data. The study uses code as a bridge between images and text, a novel approach to feeding visual data to LLMs. It reveals that LLMs can generate images via code that, while not looking realistic, still contain enough visual information to train vision models.
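To make the “code as a bridge” idea concrete, here is a toy example of my own (not a prompt or output from [2]): an image described as matplotlib drawing code. The rendering is crude, but it still encodes spatial and semantic structure that a vision model could learn from.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Toy example of "an image expressed as code": a crude house scene.
# An LLM asked to "draw a house" could plausibly emit something like this;
# the result is unrealistic but still carries visual structure.
fig, ax = plt.subplots(figsize=(4, 4))
ax.add_patch(patches.Rectangle((0.3, 0.2), 0.4, 0.3, facecolor="tan"))            # walls
ax.add_patch(patches.Polygon([(0.25, 0.5), (0.5, 0.75), (0.75, 0.5)],
                             facecolor="brown"))                                   # roof
ax.add_patch(patches.Rectangle((0.45, 0.2), 0.1, 0.15, facecolor="saddlebrown"))   # door
ax.add_patch(patches.Circle((0.85, 0.85), 0.08, facecolor="gold"))                 # sun
ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.axis("off")
plt.savefig("house.png")
```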
2.3 Bigger Models, Bigger Alignment
The alignment between different models is correlated with their scale. For example, among models trained on CIFAR-10 classification, the bigger ones show greater alignment with each other than the smaller ones do. This suggests that with the current trend of building models with tens and now hundreds of billions of parameters, these giants will be even more aligned.
“all strong models are alike, each weak model is weak in its own way.”
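The paper quantifies alignment with a mutual nearest-neighbor style metric. The sketch below is my own simplified reimplementation of that idea, not the authors’ code: for each datapoint, compare its k nearest neighbors under two models’ feature spaces and average the overlap.

```python
import numpy as np

def mutual_knn_alignment(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 10) -> float:
    """Simplified mutual nearest-neighbor alignment between two models'
    representations of the same n datapoints (shapes: n x d_a and n x d_b).
    Returns the average fraction of shared k-nearest neighbors, in [0, 1]."""
    def knn_indices(feats):
        # pairwise Euclidean distances; exclude self-matches via the diagonal
        d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        return np.argsort(d, axis=1)[:, :k]

    nn_a, nn_b = knn_indices(feats_a), knn_indices(feats_b)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

# Usage sketch with random features (real features would come from two models
# run on the same datapoints); random features should score near chance level.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(200, 64)), rng.normal(size=(200, 128))
print(mutual_knn_alignment(a, b, k=10))
```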
3. Why are AI Models Converging?
In training an AI model, a few factors contribute most to why AI models converge:
3.1 Tasks are Becoming General
As models are trained to solve more and more general tasks simultaneously, the size of their solution space becomes smaller and more constrained. More generality means learning a representation that sits closer to the reality generating the data.
The Platonic Representation Hypothesis paper formulates this as The Multitask Scaling Hypothesis:
“There are fewer representations that are competent for N tasks than there are for M < N tasks. As we train more general models that solve more tasks at once, we should expect fewer possible solutions.”
In other words, the set of solutions to a complex problem is much narrower than the set of solutions to an easy one. As we train increasingly general models on gigantic internet-wide datasets spanning different modalities, you can imagine how constrained the solution space becomes.
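A toy way to see the Multitask Scaling Hypothesis (my own illustration, not from the paper): treat each candidate representation as a hypothesis and each task as a constraint it must satisfy. As tasks accumulate, the set of surviving hypotheses shrinks toward the single underlying “reality”.

```python
import numpy as np

# Toy illustration of the Multitask Scaling Hypothesis: each "hypothesis"
# is a random binary labeling rule over 12 inputs, and each "task" requires
# matching the ground truth on a couple of inputs. More tasks => fewer
# surviving hypotheses, i.e. a smaller, more constrained solution space.
rng = np.random.default_rng(0)
n_inputs, n_hypotheses = 12, 100_000
hypotheses = rng.integers(0, 2, size=(n_hypotheses, n_inputs))
truth = rng.integers(0, 2, size=n_inputs)          # the "reality" behind all tasks

surviving = np.ones(n_hypotheses, dtype=bool)
for n_tasks in range(1, 7):
    observed = rng.choice(n_inputs, size=2, replace=False)   # each task reveals 2 inputs
    surviving &= (hypotheses[:, observed] == truth[observed]).all(axis=1)
    print(f"after {n_tasks} tasks: {surviving.sum()} hypotheses remain")
```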
3.2 Models are Getting Bigger
As the capacity of models increases, through more sophisticated architectures, larger datasets, or more complex training algorithms, these models develop representations that are more similar to each other.
While The Platonic Representation Hypothesis paper doesn’t offer proofs or examples for this claim, which it calls The Capacity Hypothesis (“Bigger models are more likely to converge to a shared representation than smaller models”), it seems intuitive that bigger models at least have more capacity to find overlapping solution spaces than smaller ones do.
As AI models scale, their depth and complexity give them a greater capacity for abstraction. This allows them to capture the underlying concepts and patterns in the data, filter out noise and outliers, and arrive at a representation that is more general and possibly closer to the real world.
3.3 The Simplicity Bias
Imagine training two large-scale neural networks on two separate tasks: one model must recognize faces in images, and the other must interpret the emotions those faces express. At first, the two tasks might seem unrelated, but would you be surprised to see both models converge on similar ways of representing facial features? After all, both come down to accurately identifying and interpreting key facial landmarks (eyes, nose, mouth, etc.).
Several studies point out a tendency of deep neural networks to find simpler, more general solutions [3, 4, 5]. In other words, deep networks favor simple solutions. Often called the simplicity bias, the paper formulates it as follows:
“Deep networks are biased toward finding simple fits to the data, and the bigger the model, the stronger the bias. Therefore, as models get bigger, we should expect convergence to a smaller solution space.”
Why do neural networks behave this way? Mostly because of fundamental properties of the learning algorithms used to train them, which tend to favor simpler, more generalizable solutions as a guard against overfitting. During training, simpler solutions are more likely to emerge because, by capturing the dominant patterns in the data, they minimize the loss function more efficiently.
Simplicity bias acts as a natural regularizer during training. It pushes models toward a way of representing and processing data that is both general across tasks and simple enough to be efficiently learned and applied, which increases the chance that different models end up in overlapping hypothesis spaces.
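As a rough, hands-on analogy for simplicity bias (my own toy example, not taken from the paper or from [3, 4, 5]): fit the same noisy data with a low-degree and a high-degree polynomial. The simpler fit captures the dominant pattern and typically generalizes better, which is the flavor of preference the works above describe in deep networks.

```python
import numpy as np

# Toy analogy for simplicity bias: the simpler (low-degree) fit captures
# the dominant trend, while the overly flexible fit tends to chase noise.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 20))
y_train = np.sin(2 * x_train) + 0.1 * rng.normal(size=x_train.size)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(2 * x_test)

for degree in (3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d} fit -> held-out MSE: {test_mse:.4f}")
```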
4. Implications of This Convergence
So what if models are converging? First of all, it suggests that data from different modalities can be more useful to each other than previously thought. Fine-tuning vision models from pre-trained LLMs, or vice versa, could yield surprisingly good results.
Another implication pointed out by the paper is that “Scaling may reduce hallucination and bias”. The argument is that as models scale, they can learn from a larger and more diverse dataset, which helps them develop a more accurate and robust understanding of the world. This enhanced understanding allows them to make predictions and generate outputs that are not only more reliable but also less biased.
5. A Pinch of Salt
When it comes to the arguments posed by the paper, you have to consider some limitations, almost all of which the paper itself acknowledges.
Firstly, the paper assumes a bijective projection of reality, in which one real-world concept Z has projections X and Y that can both be learned. However, some concepts are inherent to a single modality. Language can express a concept or feeling that no set of images can capture, and in the same way, language can fail to take the place of an image in describing a visual concept.
Secondly, as mentioned before, the paper focuses on only two modalities: vision and language. Thirdly, the argument that “AI models are converging” only holds for general, multi-task models, not for task-specific ones such as ADAS or sentiment analysis models.
Lastly, while the paper shows that the alignment between different models increases with scale, it doesn’t show that their representations become truly similar. The alignment score between larger models is indeed higher than between smaller ones, but a score of 0.16 out of 1.00 still leaves the research with open questions.
Thanks for reading,
— Hesam
[1] Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In NeurIPS, 2023.
[2] Sharma, P., Rott Shaham, T., Baradad, M., Fu, S., Rodriguez-Munoz, A., Duggal, S., Isola, P., and Torralba, A. A vision check-up for language models. In arXiv preprint, 2024.
[3] Shah, H., Tamuly, K., et al. The Pitfalls of Simplicity Bias in Neural Networks, 2020. https://arxiv.org/abs/2006.07710
[4] A brief note on Simplicity Bias
[5] Deep Neural Networks are biased, at initialisation, towards simple functions