In my opinion, the aim of metric learning is to learn an embedding function such that two samples that are conceptually (or semantically, i.e. at a high level, not at the level of pixels, for example) similar are also close in the embedding space, where an embedding is usually a d-dimensional vector.
If the model has correctly captured the similarity function, you should be able to "compare" samples by reasoning on something as simple as a Euclidean distance in the embedding space.
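As a minimal sketch of this idea (the embeddings below are made-up placeholders standing in for the output of some trained encoder, not real model outputs):

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L2 distance between two d-dimensional embeddings."""
    return float(np.linalg.norm(a - b))

# Hypothetical embeddings produced by some trained encoder f(x):
anchor   = np.array([0.1, 0.9, -0.3])
positive = np.array([0.2, 0.8, -0.25])  # semantically similar sample
negative = np.array([-0.7, 0.1, 0.9])   # semantically different sample

# If the metric was learned well, the positive sits closer to the anchor:
assert euclidean_distance(anchor, positive) < euclidean_distance(anchor, negative)
```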
A popular approach to metric learning is the Siamese Network, in which you have two neural networks where the second is a copy (i.e. same layers and weights) of the first. During training you provide pairs of data samples in the form (anchor, positive) and (anchor, negative): basically, you force positive pairs to share a common embedding, while negatives are pushed apart from the anchor. Indeed, variations of this idea exist, like the triplet loss and the introduction of one or more "margins" (to prevent collapsing embeddings and trivial solutions).
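Here is a hedged triplet-loss sketch in PyTorch; the toy encoder and input sizes are assumptions for illustration, while `nn.TripletMarginLoss` is the actual PyTorch loss implementing the margin idea:

```python
import torch
import torch.nn as nn

# A toy shared encoder: in a Siamese/triplet setup the *same* network
# (same weights) embeds anchor, positive, and negative.
encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))

# The margin keeps negatives at least `margin` farther than positives,
# which discourages the trivial solution of mapping everything to one point.
triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor_x, positive_x, negative_x = (torch.randn(8, 128) for _ in range(3))

loss = triplet_loss(encoder(anchor_x), encoder(positive_x), encoder(negative_x))
loss.backward()  # gradients flow through the single shared encoder
```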
The main motivation for metric learning is that comparing two data points in input space is often meaningless and ambiguous (e.g. two images of airplanes can be found to be similar because of the blue sky and not the planes themselves), because raw inputs don't capture the high-level (or semantic) features of the data.
Instead, contrastive learning tries to constrain the model to learn a suitable representation of the input data.
- Also in this case you have pairs of inputs, but the difference is that the second input is usually a "variation" of the first, typically obtained via data augmentation. In some cases, you start from the same image and augment it twice (but differently!) to get two views of it (see the sketch after this list).
- The goal is to enable the model to learn features that represent conceptually similar data in a meaningful way: e.g., you can teach the model rotation/translation invariance.
- The applications of contrastive learning are usually about pre-training for later fine-tuning, aimed at improving (classification) performance, ensuring properties (like invariances) and robustness, but also reducing the amount of labeled data needed, and even improving in low-shot scenarios, where you want to correctly predict a new class even if the model has seen zero or very few samples of it.
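To make the "two augmented views" idea concrete, here is a simplified sketch of an NT-Xent (SimCLR-style) contrastive loss; the batch size, embedding dimension, and temperature are illustrative assumptions, and the encoder producing `z1`/`z2` is left out:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Simplified NT-Xent loss for two batches of views.

    z1[i] and z2[i] are embeddings of two different augmentations of the
    same image; every other sample in the batch acts as a negative.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)          # (2n, d)
    sim = z @ z.t() / temperature           # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))       # exclude self-similarity
    # The positive for row i is the other view of the same image:
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Stand-ins for the outputs of an encoder applied to two augmented views:
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
loss = nt_xent(z1, z2)
```

Intuitively, the loss pulls the two views of the same image together while pushing them away from every other sample in the batch, which is what forces the representation to ignore the augmentation nuisances.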
To conclude, metric learning is used to compare data to understand their similarity (like in face recognition), while contrastive learning deals with learning better representations to improve the model in various respects. I can add that, to me, both fields fall under what is called representation learning, which is a broader and more general concept.