Date: 2023-11-30

Also known as autoassociative self-supervised learning, self-prediction methods train a model to predict part of an individual data sample, given information about its other parts. Models trained with these methods are typically generative models, rather than discriminative.

Yann LeCun has characterized self-supervised methods as a structured practice of “filling in the blanks.” Broadly speaking, he described the process of learning meaningful representations from the underlying structure of unlabeled data in simple terms: “pretend there is a part of the input you don’t know and predict that.”4 For example:

Predict any part of the input from any other part

Predict the future from the past

Predict the masked from the visible

Predict any occluded part from all available parts

Self-supervised systems built upon these philosophies often employ certain model architectures and training techniques.



Autoencoders

An autoencoder is a neural network trained to compress (or encode) input data, then reconstruct (or decode) the original input from that compressed representation. Autoencoders are trained to minimize reconstruction error, using the original input itself as ground truth.

Though autoencoder architectures vary, they typically introduce some form of bottleneck: as data traverses the encoder network, each layer’s data capacity is progressively reduced. This forces the network to learn only the most important patterns hidden within the input data (called latent variables, or the latent space) so that the decoder network can accurately reconstruct the original input despite now having less information.
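The bottleneck idea can be illustrated with a deliberately tiny sketch. It assumes toy 2-D data that lies on a single line, a linear encoder that squeezes each point into one latent number, and a linear decoder that expands it back; the names enc, dec and reconstruction_loss are hypothetical and chosen only for illustration. Real autoencoders use deep nonlinear networks and a training library, so treat this as a picture of the training signal rather than a practical implementation.

```python
# Toy data: 2-D points on the line x2 = 2 * x1, so a single latent
# number per point is enough to describe each sample.
data = [(t / 10, 2 * t / 10) for t in range(-10, 11)]

# Hypothetical linear encoder (2 -> 1) and decoder (1 -> 2) weights.
enc = [0.5, 0.5]
dec = [0.5, 0.5]

def encode(x):
    """Compress a 2-D point into one latent value (the bottleneck)."""
    return enc[0] * x[0] + enc[1] * x[1]

def decode(z):
    """Expand one latent value back into a 2-D point."""
    return (dec[0] * z, dec[1] * z)

def reconstruction_loss():
    """Mean squared error between each input and its reconstruction."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x))
        total += (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2
    return total / len(data)

initial_loss = reconstruction_loss()

lr = 0.05  # learning rate, an arbitrary choice for this toy problem
for _ in range(500):
    grad_enc = [0.0, 0.0]
    grad_dec = [0.0, 0.0]
    for x in data:
        z = encode(x)
        x_hat = decode(z)
        err = (x_hat[0] - x[0], x_hat[1] - x[1])
        # Chain rule through the decoder back to the latent value z.
        dz = 2 * (err[0] * dec[0] + err[1] * dec[1])
        for i in range(2):
            grad_dec[i] += 2 * err[i] * z / len(data)
            grad_enc[i] += dz * x[i] / len(data)
    # Plain gradient descent step; the input itself is the ground truth.
    for i in range(2):
        enc[i] -= lr * grad_enc[i]
        dec[i] -= lr * grad_dec[i]

final_loss = reconstruction_loss()
```

No labels appear anywhere in the loop: the only supervision is the difference between each input and its own reconstruction.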

Modifications to this basic framework enable autoencoders to learn useful features and functions.

Denoising autoencoders are given partially corrupted input data and trained to restore the original input by removing useless information (“noise”). This reduces overfitting and makes such models useful for tasks like restoring corrupted input images and audio data.
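As a rough sketch of how such training pairs might be built, the snippet below corrupts a clean sample with Gaussian noise and keeps the clean version as the target. The function name make_denoising_pair and the noise_std parameter are assumptions made for illustration; other corruption schemes (dropping or zeroing values, for instance) would fit the same pattern.

```python
import random

def make_denoising_pair(clean, noise_std=0.1, rng=random):
    """Return (corrupted_input, target) for one sample.

    The model would receive the corrupted input and be scored against
    the clean target, so no human-provided label is needed.
    """
    corrupted = [value + rng.gauss(0.0, noise_std) for value in clean]
    return corrupted, list(clean)

rng = random.Random(0)  # seeded so the example is repeatable
clean_signal = [0.0, 0.5, 1.0, 0.5, 0.0]
noisy_input, target = make_denoising_pair(clean_signal, noise_std=0.1, rng=rng)
```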

Whereas most autoencoders map each input to a single fixed point in latent space, variational autoencoders (VAEs) learn continuous models of latent space: by encoding latent representations of input data as a probability distribution, the decoder can generate new data by sampling a random vector from that distribution.
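The sampling step can be sketched as follows, assuming (as is common for VAEs, though not specified above) that the encoder outputs a mean and a log-variance for each latent dimension of a Gaussian. The function name sample_latent and the example values are hypothetical.

```python
import math
import random

def sample_latent(mean, log_var, rng):
    """Draw one latent vector z with z[i] ~ Normal(mean[i], exp(log_var[i])).

    Written as mean + std * noise, the form commonly used so that the
    randomness is kept separate from the encoder's outputs.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]

rng = random.Random(0)
mean = [0.0, 1.0]      # assumed encoder output: per-dimension means
log_var = [0.0, -2.0]  # assumed encoder output: per-dimension log-variances
samples = [sample_latent(mean, log_var, rng) for _ in range(5)]
```

Each vector in samples would be passed to the decoder to produce a new, slightly different output.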



Autoregression

Autoregressive models use past behavior to predict future behavior. They work under the logic that any data with an innate sequential order—like language, audio or video—can be modeled with regression.

Autoregression algorithms model time-series data, using the value(s) of the previous time step(s) to predict the value of the following time step. In conventional regression algorithms, like those used for linear regression, independent variables are used to predict a target value (or dependent variable). In autoregression, the independent and dependent variables are essentially one and the same: it’s called autoregression because regression is performed on the variable itself.
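A minimal sketch of this idea is a first-order autoregressive fit, in which each value is regressed on the value immediately before it: x[t] = c + phi * x[t-1]. The function name fit_ar1 and the toy series are assumptions for illustration; practical models usually use more lags and a statistics library.

```python
def fit_ar1(series):
    """Least-squares estimate of (c, phi) in x[t] = c + phi * x[t-1].

    The "independent variable" is the series shifted by one step,
    and the "dependent variable" is the series itself.
    """
    prev = series[:-1]
    nxt = series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_next = sum(nxt) / n
    cov = sum((p - mean_prev) * (q - mean_next) for p, q in zip(prev, nxt))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    c = mean_next - phi * mean_prev
    return c, phi

# Noise-free toy series generated by x[t] = 1 + 0.5 * x[t-1].
series = [0.0]
for _ in range(10):
    series.append(1.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)
forecast = c + phi * series[-1]  # one-step-ahead prediction
```

Because the toy series contains no noise, the fit recovers the generating coefficients up to floating-point error; real data would only approximate them.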

Autoregression is used prominently in causal language models like the GPT, Llama and Claude families of LLMs that excel at tasks like text generation and question answering. In pre-training, language models are provided the beginning of sample sentences drawn from unlabeled training data and tasked with predicting the next word (in practice, the next token), with the “actual” next word of the sample sentence serving as ground truth.
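The way a single unlabeled sentence yields many training examples can be sketched as below. Whitespace splitting stands in for a real tokenizer, and the function name next_token_examples is hypothetical.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs: each prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Assumption: words split on whitespace stand in for model tokens.
tokens = "the cat sat on the mat".split()
examples = next_token_examples(tokens)

for context, target in examples:
    print(" ".join(context), "->", target)
```

Every target comes from the sentence itself, which is why no manual labeling is required.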



Masking

Another self-supervised learning method involves masking certain parts of an unlabeled data sample and tasking models with predicting or reconstructing the missing information. Loss functions use the original (pre-masking) input as ground truth. For example, masked autoencoders are like an inversion of denoising autoencoders: they learn to predict and restore missing information, rather than remove extraneous information.

Masking is also used in the training of masked language models: random words are omitted from sample sentences and models are trained to fill them in. Though masked language models like BERT (and the many models built off its architecture, like BART and RoBERTa) are often less adept at text generation than autoregressive models, they have the advantage of being bidirectional: they can predict not only the next word, but also previous words or words found later on in a sequence. This makes them well suited to tasks requiring strong contextual understanding, like translation, summarization and search.
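A simplified sketch of how masked training examples might be produced is shown below. The function name mask_tokens and the "[MASK]" placeholder string are assumptions for illustration, and the default rate of 15% follows the figure reported for BERT. BERT's actual procedure also sometimes substitutes a random token or leaves the chosen token unchanged, which this sketch omits.

```python
import random

MASK = "[MASK]"  # placeholder string, chosen for illustration

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Hide a random subset of tokens and record what was hidden.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at masked positions and None elsewhere, so a loss function can be
    computed only where information was removed.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

rng = random.Random(0)
sentence = "the cat sat on the mat".split()
masked_sentence, labels = mask_tokens(sentence, mask_prob=0.15, rng=rng)
```

Unlike the next-word setup, the model sees tokens on both sides of each gap, which is the bidirectional context described above.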



Innate relationship prediction

Innate relationship prediction trains a model to maintain its understanding of a data sample after it is transformed in some way. For example, a model can be shown a rotated input image and tasked with predicting the degree and direction of rotation relative to the original input.5
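One way to picture the rotation task is the sketch below, which assumes a simplified setup with only four possible rotations (0, 90, 180 or 270 degrees clockwise) and a small grid of numbers standing in for an image. The function names rotate90 and make_rotation_example are hypothetical.

```python
import random

def rotate90(grid, k):
    """Return a copy of a 2-D list rotated clockwise by k quarter turns."""
    grid = [list(row) for row in grid]
    for _ in range(k % 4):
        # Reverse the row order, then transpose: one clockwise quarter turn.
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid

def make_rotation_example(image, rng):
    """Return (rotated_image, k), where k is the label the model must predict."""
    k = rng.randrange(4)
    return rotate90(image, k), k

rng = random.Random(0)
image = [[1, 2],
         [3, 4]]
rotated, label = make_rotation_example(image, rng)
```

The label is produced by the transformation itself, so any collection of unlabeled images can supply training examples.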