👷 Readings in Digital Typography, Scientific Visualization, Information Retrieval and Machine Learning

[Michal Štefánik] Attention semantics: What attention heads actually know and why should we care (29. 10. 2020)


Presentation slides (Google Docs, animated)

Attention semantics
Presentation slides (without animations) for the 2020-10-29 talk by Michal Štefánik

Attention semantics: What attention heads actually know and why should we care
Video recording for the 2020-10-29 talk by Michal Štefánik

Abstract

Transformer models can give the impression of a magical black box, as huge neural models quite often do. One might think that their architecture must be the result of systematic, incremental development that has converged on an optimal package, one that it is not a good idea to tamper with.

It can then be all the more surprising that their main inner components, the attention heads, can perform distinct NLP tasks well on their own, or that removing some of the heads of a pre-trained model can even improve accuracy on end tasks.
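To make the head-removal idea concrete, here is a minimal sketch (my illustration, not part of the talk materials) assuming the Hugging Face transformers library and an arbitrary, hypothetical choice of heads; in practice the heads to remove would be selected by an importance measure such as the gradient-based scores of "Are Sixteen Heads Really Better than One?" listed in the Literature below.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    # Hypothetical selection: remove heads 2 and 5 in layer 0 and head 7 in layer 3.
    # prune_heads() drops the corresponding slices of the query/key/value and output
    # projections, so the pruned model is smaller while keeping the same hidden size.
    model.prune_heads({0: [2, 5], 3: [7]})

    inputs = tokenizer("Attention heads can be surprisingly redundant.",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    print(outputs.last_hidden_state.shape)  # still (1, sequence_length, 768)

The finding surveyed in the talk is that, for many end tasks, such pruning barely hurts, and occasionally even helps, downstream accuracy.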

In this presentation, we'll briefly survey the research that aims to understand the functionality of individual heads, and how successful it has been. Along the way, we'll highlight several more or less intuitive, yet interesting observations about Transformers' inner parts. Finally, we'll outline some of the consequences that interpreting and/or sensibly managing particular attention heads might have.

The related research will be supplemented with the author's own ongoing experiments, which aim to utilize attention in a more accurate and interpretable Information Retrieval system.

Literature

  1. Are Sixteen Heads Really Better than One? https://arxiv.org/pdf/1905.10650.pdf
  2. Looking for Grammar in all the Right Places https://aletheap.github.io/posts/2020/07/looking-for-grammar/
  3. Head Pruning in Transformer Models: https://towardsdatascience.com/head-pruning-in-transformer-models-ec222ca9ece7
  4. Big Bird: Transformers for Longer Sequences: https://arxiv.org/pdf/2007.14062v1.pdf