Contextures: The Mechanism of Representation Learning
Runtian Zhai, PhD Dissertation
Computer Science Department, Carnegie Mellon University
zhairuntian at hotmail dot com
April 15, 2025
Paper Slides Poster
Estimated reading time: 10 min
Science of Representation Learning
Despite the remarkable empirical success of foundation models, two questions still lack satisfactory answers:
- What representations do foundation models learn, and why are these representations useful for a variety of downstream tasks?
- Can increasing the model size always improve performance? If not, how can we make further progress?
The Contexture Theory
The central argument of the contexture theory established in this dissertation is that representations are learned from the association between the input X and a context variable A. Here are some examples:
| Learning Method | Input X | Context Variable A |
|---|---|---|
| Supervised learning | Sample | Label of X |
| Graph representation learning | Node | Neighbor of X |
| Masked language modeling | Text | Masked version of X |
| Contrastive learning | Image | Cropped version of X |
| Vision-language models | Image | Text caption of X |

Formally, the input X and the context variable A follow a joint distribution P+, which induces an operator TP+ that maps any function g of A to the function of X whose value at x is E[g(A) | X = x].
Key result 1: The optimal d-dimensional encoder spans the same linear space as the top-d left singular functions of TP+. We say that such an encoder learns the contexture of P+.
We call this linear space the top-d eigenspace. To see why this result is true, we can make an analogy to principal component analysis (PCA). Let us suppose that there are only N possible values of X, and M possible values of A. Then, a function on X is essentially an N-dimensional vector. Since TP+ is a linear operator, it is essentially an N-by-M matrix. Our goal is to learn a d-dimensional embedding E for the N possible values of X, which is an N-by-d matrix. PCA states that if E consists of the top-d left singular vectors of TP+, then E is the optimal d-dimensional embedding, in the sense that it minimizes the reconstruction error.
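To make the analogy concrete, here is a minimal numpy sketch of the finite setting. The matrix below is random and only stands in for TP+; in practice its entries would be determined by the joint distribution of X and A.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite setting: N possible values of X and M possible values of A.
N, M, d = 100, 80, 5

# A made-up N-by-M matrix standing in for the operator TP+.
T = rng.normal(size=(N, M))

# The top-d left singular vectors give the optimal d-dimensional embedding
# of the N values of X, in the sense of minimizing reconstruction error
# (Eckart-Young theorem).
U, S, Vt = np.linalg.svd(T, full_matrices=False)
E = U[:, :d]                                   # N-by-d embedding matrix

# Best rank-d reconstruction of T obtainable from a d-dimensional embedding.
T_d = E @ np.diag(S[:d]) @ Vt[:d, :]
print("relative reconstruction error:", np.linalg.norm(T - T_d) / np.linalg.norm(T))
```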
The next question is how to obtain the linear span of the top-d left singular functions. The conventional method is kernel PCA, which involves the eigendecomposition of an m-by-m matrix, where m is the size of the dataset. The time complexity of kernel PCA is O(m³), so it does not scale to large datasets. Deep learning provides a faster way to do this. We can train a large model to optimize a variational objective R, such that R is optimized if and only if the encoder learns the contexture. Optimizing this R then automatically gives us the optimal encoder.
Key result 2: The contexture can be learned by training a large model to optimize a variational objective.
In the paper, I prove that this method works for many machine learning paradigms, including supervised learning, contrastive learning, denoising autoencoders, and language models.
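As one concrete instance of such a variational objective (a sketch, not necessarily the exact objective used in the dissertation), the spectral-contrastive-style loss below is known to be minimized when the learned embeddings span the top singular subspace of the context. The encoder architectures and the toy context (A is a noisy copy of X) are placeholders made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16                                        # embedding dimension

# Placeholder encoders for X and A (sizes are made up for illustration).
enc_x = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, d))
enc_a = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, d))

def variational_objective(fx, ga):
    # Spectral-contrastive-style low-rank approximation loss:
    # reward agreement on jointly sampled (x, a) pairs, penalize
    # squared similarity on mismatched pairs within the batch.
    pos = (fx * ga).sum(dim=1).mean()
    neg = (fx @ ga.T).pow(2).mean()
    return -2.0 * pos + neg

opt = torch.optim.Adam(list(enc_x.parameters()) + list(enc_a.parameters()), lr=1e-3)
for step in range(500):
    x = torch.randn(256, 32)                  # a batch of inputs X
    a = x + 0.3 * torch.randn_like(x)         # toy context: A is a noisy copy of X
    loss = variational_objective(enc_x(x), enc_a(a))
    opt.zero_grad(); loss.backward(); opt.step()
```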
The Essence of Transferability and Implications for the Scaling Law
Why can a representation transfer to a downstream task that is very different from the pretraining task? I prove that such transferability relies on the compatibility between the downstream task and the context from which the representation is learned. Specifically, a context and a task are compatible if knowledge of the context helps learn a predictor for the task. In the paper, I define a quantitative measure of compatibility and prove that the encoder that learns the contexture achieves the lowest approximation error on all compatible tasks, and is therefore optimal.
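The compatibility measure itself is defined on the context; on the downstream side, the approximation error of a frozen encoder is typically estimated with a standard linear probe, i.e., the error of the best linear predictor on top of the representation. A minimal sketch with made-up data (not the paper's exact measure):

```python
import numpy as np

def linear_probe_error(features, targets):
    """Approximation error of the best linear predictor on frozen features.

    features: (n, d) encoder outputs on the downstream data;
    targets:  (n,)   downstream labels. Returns the mean squared error of
    the least-squares fit, a simple proxy for how compatible the task is
    with the representation (lower is better).
    """
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # add bias column
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return float(np.mean((X @ w - targets) ** 2))

# Toy usage with synthetic data: a task that is (nearly) linear in the features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 16))
labels = 2.0 * feats[:, 0] + rng.normal(scale=0.1, size=500)
print(linear_probe_error(feats, labels))
```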
The contexture theory also has intriguing implications for the scaling law.
Key result 3: Increasing the model size inevitably produces diminishing returns. Further improvement requires better contexts.
The intuition is that as we increase the model size, the function class of the encoder gets closer to the entire function space on X. As a result, the linear space spanned by the optimizer of R within this function class converges to the top-d eigenspace. Once the two spaces are sufficiently close, further scaling up the model has little effect on performance.
Here is an experiment that corroborates this intuition. Two encoders are trained on the same dataset. Encoder 1 consists of the exact top-d singular functions obtained by kernel PCA. Encoder 2 is an MLP trained by optimizing a variational objective. We measure the alignment between the two encoders using canonical correlation analysis (CCA). The result is shown in the figure below.

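One common way to compute such a CCA alignment score, given the two encoders' outputs on the same inputs, is the mean canonical correlation; the sketch below may differ in details from the measurement used in the dissertation, and the toy data is made up.

```python
import numpy as np

def mean_cca(Z1, Z2):
    """Mean canonical correlation between two feature matrices.

    Z1, Z2: (n, d) outputs of the two encoders on the same n inputs.
    Returns a value in [0, 1]; close to 1 means the two encoders span
    (nearly) the same linear space.
    """
    Z1 = Z1 - Z1.mean(axis=0)
    Z2 = Z2 - Z2.mean(axis=0)
    # Orthonormal bases of the two feature spans (thin QR).
    Q1, _ = np.linalg.qr(Z1)
    Q2, _ = np.linalg.qr(Z2)
    # Canonical correlations are the singular values of Q1^T Q2.
    corrs = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return float(corrs.mean())

# Toy usage: two encodings of the same data that share a common subspace,
# standing in for the kernel PCA encoder and the trained MLP encoder.
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 5))
Z1 = base @ rng.normal(size=(5, 5))
Z2 = base @ rng.normal(size=(5, 5)) + 0.1 * rng.normal(size=(1000, 5))
print(mean_cca(Z1, Z2))
```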
Towards Better Contexts
The first thing we need to understand is which contexts are better. Better contexts lead to better encoders, so the answer depends on how we evaluate an encoder. There are two approaches: extrinsic evaluation and intrinsic evaluation. Extrinsic evaluation measures an encoder by its performance on a specific downstream task, which is what we ultimately care about in practice. Intrinsic evaluation does not use any specific task. It can be more useful because we might not know all the downstream tasks at pretraining time; moreover, we want the encoder to be transferable, so we do not want it to be good on only one task. Here we focus on intrinsic evaluation.
Key result 4: A good context should have a moderate association between X and A.
To see why, let us consider two clearly bad contexts. In the first context, A is a random variable independent of X, so A carries no information about X. In the second context, A = X, so A provides no information beyond X itself. Both contexts are useless: a context is not useful if the association between X and A is too weak or too strong.
The association between X and A controls the shape of the spectrum, that is, the decay rate of the singular values of TP+. As illustrated in the figure below, when the association is weak, the spectrum decays fast; when the association is strong, the spectrum decays slowly. The best case is a spectrum that decays at a moderate rate, which happens when the association is moderate. In the paper, I propose a metric that quantitatively measures the usefulness of a context, and this metric depends only on the singular values.

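A toy illustration of this spectral picture in the finite setting: below, A is a blurred copy of X on a grid, and the blur width controls the strength of the association. The construction is mine and is only meant to reproduce the qualitative trend in the figure; normalization details of the operator TP+ are glossed over.

```python
import numpy as np

N = 50                                          # finite domain: N values for both X and A
grid = np.arange(N)

def spectrum(sigma):
    """Singular values of P(a | x) where A is a Gaussian-blurred copy of X.

    Small sigma -> A ~ X        (strong association, slowly decaying spectrum)
    Large sigma -> A ~ uniform  (weak association, fast decaying spectrum)
    """
    P = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / (2 * sigma ** 2))
    P /= P.sum(axis=1, keepdims=True)           # each row is a conditional distribution
    return np.linalg.svd(P, compute_uv=False)

for sigma, name in [(100.0, "weak"), (3.0, "moderate"), (0.3, "strong")]:
    s = spectrum(sigma)
    print(f"{name:9s} association: top-8 singular values = {np.round(s[:8], 3)}")
```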
Key result 5: Mixing multiple contexts can lead to better contexts.
In the paper, I introduce three base operations for mixing contexts: convolution, convex combination and concatenation. The three operations should be used in different situations. Convolution is useful when all contexts have strong associations. Convex combination balances between strong and weak associations. Concatenation is useful when all contexts have weak associations.
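To make the three operations tangible, here is one way they can be realized on finite conditional-probability matrices: convolution as composing the two conditionals, convex combination as a probabilistic mixture, and concatenation as pairing A1 and A2 drawn independently given X. This is my own reading in a toy finite setting with random matrices; the dissertation's formal definitions may differ in details.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M1, M2 = 20, 15, 12

def random_conditional(n, m):
    """A random row-stochastic matrix P(a | x) on finite spaces (toy data)."""
    P = rng.random((n, m))
    return P / P.sum(axis=1, keepdims=True)

P1 = random_conditional(N, M1)                 # context 1: P1(a1 | x)
P2 = random_conditional(N, M2)                 # context 2: P2(a2 | x)

# Convolution (composition): draw A1 from context 1, then feed A1 into a
# second context defined on A1's space (a square context, for illustration).
P2_sq = random_conditional(M1, M1)
conv = P1 @ P2_sq

# Convex combination: with probability w use context 1, else context 2
# (two contexts on a shared target space, for illustration).
Q1, Q2 = random_conditional(N, M1), random_conditional(N, M1)
w = 0.5
mix = w * Q1 + (1 - w) * Q2

# Concatenation: A = (A1, A2) with A1, A2 drawn independently given X,
# so P(a1, a2 | x) = P1(a1 | x) * P2(a2 | x) (row-wise Kronecker product).
concat = np.einsum("ni,nj->nij", P1, P2).reshape(N, M1 * M2)

for name, P in [("convolution", conv), ("convex combination", mix), ("concatenation", concat)]:
    s = np.linalg.svd(P, compute_uv=False)
    print(f"{name:18s}: top-5 singular values = {np.round(s[:5], 3)}")
```

Comparing the resulting spectra against those of the original contexts gives a quick sense of which operation sharpens or flattens the decay in a given situation.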
Summary
In this dissertation, I propose a new theory of representation learning called the contexture theory. The theory provides a unified framework for understanding the mechanism of representation learning. It explains why representations are transferable, and why increasing the model size produces diminishing returns. The theory also provides a principled way to design better contexts, which is crucial for improving the performance of foundation models. I believe that the contexture theory is a step towards a deeper understanding of the science of representation learning, and it will help us develop the next generation of pretraining methods.