[GENERAL] Contrastive Learning in 3 Minutes

The Exponential Progress of Contrastive Learning in Self-Supervised Tasks

This blog was originally published by Ta-Ying Cheng on Towards Data Science.

After a few years of research steered towards the supervised domain of image recognition tasks, many have now turned to a much less explored territory: performing the same tasks in a self-supervised manner. One of the cornerstones that led to the dramatic advances in this seemingly impossible task is the introduction of contrastive learning losses. This blog dives into some of the recently proposed contrastive losses that have pushed the results of unsupervised learning to levels close to those of supervised learning.

InfoNCE Loss

One of the earliest contrastive learning losses proposed was the InfoNCE loss by Oord et al. Their paper Representation Learning with Contrastive Predictive Coding proposed the following loss:

$$\mathcal{L}_N = -\mathbb{E}_X\left[\log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}\right]$$

where the numerator is essentially the score of the positive pair, and the denominator is the sum of the scores of the positive pair and all negative pairs. Minimizing this simple loss pushes the positive pair toward a higher score (driving the ratio inside the log toward 1, and hence the loss toward 0) while pushing the negative pairs further apart.
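The behaviour described above can be checked numerically. The following is a minimal, dependency-free sketch, not the authors' implementation: it assumes each pair's value is the exponential of a real-valued score, and the function name `info_nce` and the example scores are made up for illustration.

```python
import math

def info_nce(pos_score, neg_scores):
    """Loss for one query: -log( exp(pos) / (exp(pos) + sum(exp(neg))) )."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max before exponentiating, for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score

# Positive pair scored well above the negatives: ratio near 1, loss near 0.
print(info_nce(5.0, [0.0, 0.0, 0.0]))
# Positive pair indistinguishable from 3 negatives: ratio 1/4, loss log(4).
print(info_nce(0.0, [0.0, 0.0, 0.0]))
```

The second call shows the chance-level case: with one positive and three equally scored negatives, the ratio is 1/4 and the loss is log(4).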


SimCLR

Figure 1. SimCLR Overview. Image retrieved from https://arxiv.org/abs/2002.05709.

SimCLR is among the first papers to suggest using a contrastive loss for self-supervised image recognition through image augmentations.
Positive pairs are generated by applying two different augmentations to the same image, while augmented views of different images serve as negative pairs. This allows models to learn features that distinguish between images without explicitly being given any ground truth labels.
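To make the pairing concrete, here is a rough sketch of a SimCLR-style loss over a batch in which every image contributes two augmented views. It assumes the embeddings have already been computed, and it uses cosine similarity with a temperature. The function names, the toy 2-D embeddings, and the temperature of 0.5 are illustrative choices, not values taken from this post.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(views_a, views_b, temperature=0.5):
    """views_a[i] and views_b[i] are embeddings of two augmentations of image i."""
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the other view of the same image
        # Similarity of view i to every other view in the batch (itself excluded).
        logits = {k: cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[j]
    return total / (2 * n)

# Two images whose views agree, versus the same embeddings paired incorrectly.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, mismatched)  # the aligned pairing gives the lower loss
```

No labels appear anywhere: the only supervision is the knowledge of which two views came from the same image.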

Momentum Contrast (MoCo)

Figure 2. MoCo Overview. Image retrieved from https://arxiv.org/abs/1911.05722.

The InfoNCE loss above is computed over a mini-batch containing one positive and a number of negatives. He et al. extended this concept by framing contrastive learning as a dictionary look-up: learning to match a given query with its best key. This intuition led to momentum contrast (MoCo), which maintains a dictionary of keys stored across multiple batches, with the oldest batch gradually removed in a queue-like manner. The key encoder is updated as a slow-moving (momentum) average of the query encoder, so the keys change less drastically between steps and training is more stable.
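The two mechanisms, a queue of keys and a slowly moving key encoder, can be sketched in a few lines. This is a toy illustration under simplifying assumptions rather than the MoCo code: encoder parameters are plain lists of floats, keys are strings, and the names `KeyQueue` and `momentum_update` as well as the coefficient values are hypothetical.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder parameter a small step toward the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class KeyQueue:
    """Fixed-size FIFO of encoded keys; the oldest batch drops out first."""

    def __init__(self, max_batches):
        self.batches = deque(maxlen=max_batches)

    def enqueue(self, batch_of_keys):
        # Once the deque is full, appending evicts the oldest batch.
        self.batches.append(list(batch_of_keys))

    def keys(self):
        return [key for batch in self.batches for key in batch]

queue = KeyQueue(max_batches=2)
for step in range(3):
    queue.enqueue([f"key_{step}_a", f"key_{step}_b"])
print(queue.keys())  # keys from batch 0 have been evicted

# With m close to 1 the key encoder barely moves in a single step.
print(momentum_update([0.0, 0.0], [1.0, 1.0], m=0.9))
```

Because the queue is decoupled from the mini-batch, the number of negatives available to each query can be much larger than the batch size.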

Decoupled Contrastive Learning

Figure 3. The improvements from swapping InfoNCE with DCL. Image retrieved from https://arxiv.org/abs/2110.06848.

Previous papers in contrastive learning required either large batch sizes or a momentum mechanism. The recent paper on decoupled contrastive learning (DCL) hopes to change this with a single modification to the original InfoNCE loss: removing the positive pair from the denominator.
While seemingly simple, DCL allows better convergence and ultimately forms an even better baseline than the earlier SimCLR and MoCo.
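The difference between the two losses is small enough to show side by side. The sketch below is an illustration under the assumption that each pair's value is the exponential of a real-valued score. It is not the DCL authors' code, and the function names and example scores are invented for this example.

```python
import math

def info_nce(pos_score, neg_scores):
    """Positive pair appears in both the numerator and the denominator."""
    denominator = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denominator)

def dcl(pos_score, neg_scores):
    """Same ratio, but the positive pair is removed from the denominator."""
    denominator = sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denominator)

pos, negs = 2.0, [0.0, 0.0]
print(info_nce(pos, negs))
print(dcl(pos, negs))
```

With the positive term gone from the denominator, the loss is no longer bounded below by zero: for the same scores the DCL value is always smaller than the InfoNCE value and can be negative.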

Testing Each Concept

Code for the papers above has been released by the authors. To test these concepts, one can simply download different datasets and see how well each unsupervised learning method works.

Our Open Datasets Community is particularly useful for retrieving datasets. The community organizes all the popular datasets (e.g., ImageNet, CIFAR100) so that you can easily find them and be redirected to their official websites. It is especially helpful when you are trying to build your own dataloaders.