Transformers were originally proposed, as the title of "Attention is All You Need" implies, as a more efficient seq2seq model that removed the RNN structure commonly used until that point.
However, in pursuing this efficiency, single-headed attention had reduced descriptive power compared to RNN-based models. Multiple heads were proposed to mitigate this, allowing the model to learn multiple lower-scale feature maps rather than one all-encompassing map:
> In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions [...] This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention...
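To make this trade-off concrete, below is a minimal NumPy sketch of scaled dot-product self-attention split across several heads. The names and shapes (`d_model`, `n_heads`, the projection matrices) are illustrative assumptions rather than the paper's exact implementation, but the structure is the same: each position relates to every other in a constant number of operations, and splitting the projection into heads recovers some of the resolution lost to averaging.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Self-attention over X (seq_len, d_model); all weight matrices are (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    # Project, then split the projection into n_heads lower-dimensional pieces:
    # each head gets its own slice of the weights, hence its own notion of relevance.
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head)

    head_outputs = []
    for h in range(n_heads):
        # One head: every position relates to every other in a constant number of
        # matrix operations, but the softmax-weighted average blurs fine detail.
        scores = Q[:, h] @ K[:, h].T / np.sqrt(d_head)      # (seq_len, seq_len)
        weights = softmax(scores, axis=-1)
        head_outputs.append(weights @ V[:, h])              # (seq_len, d_head)

    # Concatenating several heads (then mixing with Wo) restores descriptive power.
    return np.concatenate(head_outputs, axis=-1) @ Wo       # (seq_len, d_model)

# Example usage with random weights standing in for learned ones.
rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 64, 10, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)  # (10, 64)
```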
As such, multiple attention heads in a single Transformer layer are analogous to multiple kernels in a single CNN layer: they have the same architecture and operate on the same feature space, but since they are separate 'copies' with different sets of weights, they are free to learn different functions.
In a CNN this may correspond to different definitions of visual features, and in a Transformer to different definitions of relevance:1
For example:
| Architecture | Input | (Layer 1) Kernel/Head 1 | (Layer 1) Kernel/Head 2 |
| --- | --- | --- | --- |
| CNN | Image | Diagonal edge-detection | Horizontal edge-detection |
| Transformer | Sentence | Attends to next word | Attends from verbs to their direct objects |
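The parallel between the two rows above can also be seen at the level of the weights themselves. The sketch below uses PyTorch's `nn.Conv2d` and `nn.MultiheadAttention` purely as illustrative stand-ins (not anything from the quoted paper): a layer with several kernels, or several heads, is just several independent sets of weights applied to the same input.

```python
import torch
import torch.nn as nn

# Two convolution kernels in one layer: same architecture, same input feature map,
# but two independent sets of weights (e.g. one could learn diagonal edges, one horizontal).
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)
print(conv.weight.shape)   # torch.Size([2, 3, 3, 3]) -- one 3x3 kernel over 3 channels, per output channel

# Eight attention heads in one layer: same architecture, same token sequence,
# but eight independent attention maps, each free to learn a different relation.
d_model, n_heads = 64, 8
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
x = torch.randn(1, 10, d_model)                          # (batch, seq_len, d_model)
_, weights = attn(x, x, x, average_attn_weights=False)
print(weights.shape)       # torch.Size([1, 8, 10, 10]) -- one seq x seq map per head
```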
Notes:
- There is no guarantee that these are human-interpretable, but in many popular architectures they do map accurately onto linguistic concepts (a sketch for inspecting this yourself follows the quote below):
  > While no single head performs well at many relations, we find that particular heads correspond remarkably well to particular relations. For example, we find heads that find direct objects of verbs, determiners of nouns, objects of prepositions, and objects of possessive pronouns...
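One way to poke at this yourself is to pull the per-head attention maps out of a pretrained BERT. The sketch below assumes the Hugging Face `transformers` library is installed; the layer and head indices are arbitrary picks for illustration, not the specific 'direct object' heads reported in the quote, which vary by model and have to be found empirically.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat chased the ball", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
layer, head = 8, 10   # arbitrary 0-indexed choices, purely for illustration
attn = outputs.attentions[layer][0, head]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, tok in enumerate(tokens):
    j = attn[i].argmax().item()
    print(f"{tok:>8} attends most strongly to {tokens[j]}")
```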