
Contextures: The Mechanism of Representation Learning

Runtian Zhai, PhD Dissertation
Computer Science Department, Carnegie Mellon University
zhairuntian at hotmail dot com
April 15, 2025
Paper   Slides   Poster

Estimated reading time: 10 min

Science of Representation Learning

Despite the remarkable empirical success of foundation models, two questions have yet to be answered satisfactorily.

  • What representations do foundation models learn, and why are these representations useful for a variety of downstream tasks?
  • Can increasing the model size always improve performance? If not, how can we make further progress?

Answering these questions is crucial, especially now that the field is evolving rapidly yet scaling seems to be yielding diminishing returns. For this reason, at NeurIPS 2024, Ilya Sutskever predicted that "pretraining as we know it will end". My belief is that there is still room for progress in pretraining, but it requires a deeper understanding of the science of representation learning, so that we can develop the next generation of pretraining methods.

The Contexture Theory

The central argument of the contexture theory established in this dissertation is that representations are learned from the association between the input X and a context variable A. Here are some examples:

Learning Method               | Input X | Context Variable A
------------------------------|---------|----------------------
Supervised learning           | Sample  | Label of X
Graph representation learning | Node    | Neighbor of X
Masked language modeling      | Text    | Masked version of X
Contrastive learning          | Image   | Cropped version of X
Vision-language models        | Image   | Text caption of X

The contexture theory is related to a concept in psychology known as the two systems of thinking, popularized by psychologist Daniel Kahneman and illustrated in the figure below. System-1 thinking is fast, automatic, and intuitive, while system-2 thinking is slow, deliberate, and analytical. The contexture theory suggests that representation learning can do any type of system-1 thinking, as long as one can define X and A and there is a sufficient amount of data. Models can even do system-1 thinking better than humans, thanks to their large memory and fast computation. This is a manifestation of Ilya Sutskever's deep learning hypothesis, which states that "If you have a large neural network, it can do anything a human can do in a fraction of a second." However, representation learning cannot do system-2 thinking, which is why foundation models today are still not as good as humans at reasoning.
[Figure: The two systems of thinking]
The association between X and A is characterized by their joint distribution P+(x, a). This P+ induces a linear expectation operator TP+ that maps functions of A to functions of X: for any function g(a), TP+ maps g to f(x) = EP+[g(A)|x]. We can then perform a singular value decomposition (SVD) of TP+, which gives an ordered set of singular values (called the spectrum) together with left and right singular functions; the left singular functions are functions of X, and the right singular functions are functions of A.
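
To make this concrete, here is a minimal sketch (my own illustration, not code from the paper) of the operator in the finite case, where P+ is just a table of probabilities and TP+ acts by matrix multiplication:

```python
import numpy as np

# Toy joint distribution P+(x, a) over N values of X and M values of A.
rng = np.random.default_rng(0)
N, M = 4, 3
P_joint = rng.random((N, M))
P_joint /= P_joint.sum()

# Row x of T holds the conditional distribution P+(a | x), so T acts as the
# expectation operator: (T g)(x) = E_{P+}[g(A) | X = x].
T = P_joint / P_joint.sum(axis=1, keepdims=True)

g = rng.standard_normal(M)   # an arbitrary function g(a), stored as an M-vector
f = T @ g                    # f(x) = E[g(A) | x], a function of X
print(f)
```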

Key result 1: The optimal d-dimensional encoder spans the same linear space as the top-d left singular functions of TP+. We say that such an encoder learns the contexture of P+.

We call this linear space the top-d eigenspace. To see why this result holds, we can make an analogy to principal component analysis (PCA). Suppose there are only N possible values of X and M possible values of A. Then a function on X is essentially an N-dimensional vector, and since TP+ is a linear operator, it is essentially an N-by-M matrix. Our goal is to learn a d-dimensional embedding E for the N possible values of X, which is an N-by-d matrix. PCA states that if E consists of the top-d left singular vectors of TP+, then E is the optimal d-dimensional embedding, in the sense that it minimizes the reconstruction error of TP+.
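
Here is a small, self-contained sketch of that analogy with made-up numbers: the top-d left singular vectors of the N-by-M matrix form an embedding whose reconstruction error is no worse than that of any other d-dimensional linear embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 100, 50, 8

# T[x, a] = P+(a | x): the finite-dimensional version of the operator T_{P+}.
P_joint = rng.random((N, M))
P_joint /= P_joint.sum()
T = P_joint / P_joint.sum(axis=1, keepdims=True)

# SVD: S is the spectrum, columns of U are the left singular functions (of X),
# rows of Vt are the right singular functions (of A).
U, S, Vt = np.linalg.svd(T, full_matrices=False)
E = U[:, :d]   # the N-by-d embedding built from the top-d left singular vectors

# Eckart-Young: projecting T onto span(E) gives the best rank-d approximation,
# so any other d-dimensional linear embedding reconstructs T at least as badly.
err_opt = np.linalg.norm(T - E @ E.T @ T)
Q_rand, _ = np.linalg.qr(rng.standard_normal((N, d)))   # a random d-dim subspace
err_rand = np.linalg.norm(T - Q_rand @ Q_rand.T @ T)
print(err_opt <= err_rand)   # expected: True
```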

The next question is how to obtain the linear span of the top-d left singular functions. The conventional method is kernel PCA, which involves the eigendecomposition of an m-by-m matrix, where m is the size of the dataset. The time complexity of kernel PCA is O(m³), which does not scale to large datasets. Deep learning provides a faster way: we can train a large model to optimize a variational objective R, such that R is optimized if and only if the encoder learns the contexture. Optimizing R then automatically gives us the optimal encoder.
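
For concreteness, this is roughly what the kernel PCA route looks like; the RBF kernel below is a generic placeholder rather than the kernel actually induced by P+, and the eigendecomposition of the m-by-m matrix is the O(m³) bottleneck.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 500, 8
X = rng.standard_normal((m, 5))   # m data points

# Build an m-by-m kernel matrix (a generic RBF kernel, for illustration only).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)

# Center the kernel and eigendecompose it: this step costs O(m^3) time and
# O(m^2) memory, which is what prevents kernel PCA from scaling.
H = np.eye(m) - np.ones((m, m)) / m
eigvals, eigvecs = np.linalg.eigh(H @ K @ H)
embedding = eigvecs[:, ::-1][:, :d]   # top-d eigenvectors as the d-dim features
```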

Key result 2: The contexture can be learned by training a large model to optimize a variational objective.

In the paper, I prove that this method works for many machine learning paradigms, including supervised learning, contrastive learning, denoising autoencoders, and language modeling.
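
As one concrete instance, here is a hedged sketch of training an encoder against such a variational objective, using a spectral-contrastive-style loss on positive pairs (x, a); the toy context (A is a noisy copy of X) and the exact form of the loss are my illustration, not necessarily the objective R analyzed in the dissertation.

```python
import torch
import torch.nn as nn

# Two small encoders: f for the input X and g for the context variable A.
f = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 8))
g = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 8))
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

def variational_loss(fx, ga):
    # Reward alignment on positive pairs and penalize similarity across the
    # batch (a low-rank-approximation flavor of objective; the diagonal terms
    # in the second sum are a small approximation that shrinks with batch size).
    align = -2.0 * (fx * ga).sum(dim=1).mean()
    uniform = ((fx @ ga.T) ** 2).mean()
    return align + uniform

for step in range(200):
    x = torch.randn(256, 10)              # stand-in inputs
    a = x + 0.1 * torch.randn_like(x)     # toy context: a noisy copy of x
    loss = variational_loss(f(x), g(a))
    opt.zero_grad()
    loss.backward()
    opt.step()
```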

The Essence of Transferability and Implications for the Scaling Law

Why can a representation transfer to a downstream task that is very different from the pretraining task? I prove that such transferability relies on the compatibility between the downstream task and the context from which the representation is learned. Specifically, a context and a task are compatible if knowledge of the context helps one learn a predictor for the task. In the paper, I define a quantitative measure of compatibility and prove that the encoder that learns the contexture achieves the lowest approximation error on all compatible tasks, and is therefore optimal.
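
Operationally, the approximation error on a downstream task can be estimated with a linear probe on top of the frozen representation. The snippet below is a generic sketch of that measurement; the function name and interface are placeholders, not an API from the paper.

```python
import numpy as np

def linear_probe_error(features, targets):
    """Least-squares error of predicting the targets linearly from the features.

    `features` is an (n, d) array of frozen encoder outputs and `targets` is an
    (n,) or (n, k) array for some downstream task; lower is better.
    """
    Phi = np.hstack([features, np.ones((len(features), 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    residual = Phi @ w - targets
    return float(np.mean(residual ** 2))
```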

The contexture theory also has intriguing implications for the scaling law.

Key result 3: Increasing the model size inevitably produces diminishing returns. Further improvement requires better contexts.

The intuition is that as we increase the model size, the function class of the encoder gets closer to the entire function space on X. As a result, the linear space spanned by the encoder that optimizes R over this function class converges to the top-d eigenspace. Once these two spaces are close enough, scaling up the model further has little effect on performance.

Here is an experiment that corroborates this intuition. Two encoders are built from the same dataset. Encoder 1 consists of the exact top-d singular functions obtained by kernel PCA. Encoder 2 is an MLP trained by optimizing a variational objective. We measure the alignment between the two encoders using canonical correlation analysis (CCA). The result is shown in the figure below.

[Figure: CCA alignment between the two encoders]
We fix the depth of the MLP at 3 layers and vary the width. When the model is sufficiently wide, the CCA score can exceed 0.85, which is considered high alignment. However, increasing the width further can have a negative effect, probably because larger models are harder to train.
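
For reference, a simple way to compute the alignment score is the mean canonical correlation between the two feature matrices; the sketch below (centering, orthonormalizing each feature matrix, and taking the singular values of their cross product) is a standard implementation, not necessarily the exact procedure used in the experiment.

```python
import numpy as np

def mean_canonical_correlation(Z1, Z2):
    """Mean CCA score between two feature matrices of shape (n_samples, dim).

    After centering, the canonical correlations are the singular values of
    Q1.T @ Q2, where Q1 and Q2 are orthonormal bases of the two feature spaces.
    """
    Q1, _ = np.linalg.qr(Z1 - Z1.mean(axis=0))
    Q2, _ = np.linalg.qr(Z2 - Z2.mean(axis=0))
    corrs = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return float(corrs.mean())

# Usage: Z1 = kernel PCA features, Z2 = MLP features on the same inputs;
# a value close to 1 means the two encoders span nearly the same space.
```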

Towards Better Contexts

The first thing we need to understand is what makes one context better than another. Better contexts lead to better encoders, so the answer depends on how we evaluate an encoder. There are two ways: extrinsic evaluation and intrinsic evaluation. Extrinsic evaluation measures an encoder by its performance on a specific downstream task, which is what we ultimately care about in practice. Intrinsic evaluation does not use any specific task. It can be more useful because we might not know all the downstream tasks at pretraining time; moreover, we want the encoder to be transferable, so we do not want it to be good on only one task. Here we focus on intrinsic evaluation.

Key result 4: A good context should have a moderate association between X and A.

To see why, consider two clearly bad contexts. In the first, A is a random variable independent of X, so A carries no information about X. In the second, A = X, so A adds nothing beyond X itself. Both contexts are useless. Hence, a context is not useful if the association between X and A is either too weak or too strong.

The association between X and A controls the shape of the spectrum, that is, the decay rate of the singular values of TP+. As illustrated in the figure below, when the association is weak, the spectrum decays fast; when the association is strong, the spectrum decays slowly. The best case is a spectrum that decays at a moderate rate, which happens when the association is moderate. In the paper, I propose a metric that quantitatively measures the usefulness of a context, and this metric depends only on the singular values.

[Figure: Spectrum decay under weak, moderate, and strong association]
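
A toy computation makes this picture concrete. Below, A is a "jittered" copy of X on a discrete grid, and the bandwidth sigma controls the strength of the association; this construction is my own illustration, not an example from the paper.

```python
import numpy as np

def spectrum(sigma, n=50):
    # P+(a | x) proportional to exp(-(x - a)^2 / (2 sigma^2)): A is a noisy
    # copy of X, and sigma controls how strongly A is associated with X.
    grid = np.arange(n)
    T = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / (2.0 * sigma**2))
    T /= T.sum(axis=1, keepdims=True)
    return np.linalg.svd(T, compute_uv=False)

for sigma in (0.5, 3.0, 20.0):   # strong -> moderate -> weak association
    print(sigma, spectrum(sigma)[:8].round(3))
# Strong association: singular values stay near 1 (slow decay).
# Weak association: all but the top singular value are close to 0 (fast decay).
```
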
Now suppose we have a number of different contexts, but none of them is good because they are all too weak or too strong. In this case, we can mix them to form a better context with moderate association.

Key result 5: Mixing multiple contexts can lead to better contexts.

In the paper, I introduce three base operations for mixing contexts: convolution, convex combination and concatenation. The three operations should be used in different situations. Convolution is useful when all contexts have strong associations. Convex combination balances between strong and weak associations. Concatenation is useful when all contexts have weak associations.
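
On finite transition matrices T[x, a] = P+(a | x), the three operations have natural readings, sketched below; the precise definitions are in the paper, and the conditional-independence reading of concatenation is an assumption of this sketch.

```python
import numpy as np

def convolution(T1, T2):
    # Chain the contexts: draw a1 from the first context, then a2 from the
    # second context applied to a1 (assumes the second context takes A1 as input).
    return T1 @ T2

def convex_combination(T1, T2, alpha=0.5):
    # With probability alpha use the first context, otherwise the second.
    return alpha * T1 + (1.0 - alpha) * T2

def concatenation(T1, T2):
    # A = (A1, A2), with A1 and A2 drawn independently given X.
    n = T1.shape[0]
    return np.einsum('xi,xj->xij', T1, T2).reshape(n, -1)
```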

Summary

In this dissertation, I propose a new theory of representation learning called the contexture theory. The theory provides a unified framework for understanding the mechanism of representation learning. It explains why representations are transferable, and why increasing the model size produces diminishing returns. The theory also provides a principled way to design better contexts, which is crucial for improving the performance of foundation models. I believe that the contexture theory is a step towards a deeper understanding of the science of representation learning, and it will help us develop the next generation of pretraining methods.