Attention Semantics
What attention heads actually know, and why we should care
FI:PV212: Readings in Digital ...
Michal Štefánik
stefanik.m@mail.muni.cz

Transformer [1]
[1]: https://arxiv.org/abs/1706.03762 (Attention is All You Need)

Attention [1]
[1]: https://arxiv.org/abs/1706.03762 (Attention is All You Need)
[3]: https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb

Transformer as autoencoder [2]
[2]: http://jalammar.github.io/illustrated-transformer/

Attention heads layout
https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1

Specific heads semantics [2]
[2]: https://arxiv.org/pdf/1906.04341.pdf (What Does BERT Look At?)

Specific heads semantics [2]
● Many elementary patterns, “No-Op” attention to [SEP] (?)
○ “Four attention heads (in layers 2, 4, 7, and 8) on average put >50% of their attention on the previous token and five attention heads (in layers 1, 2, 2, 3, and 6) put >50% of their attention on the next token.”
● Transitive information propagation is beneficial [3], but we cannot see any other heads later attending to [SEP], periods, or commas
○ “Attention heads processing [SEP] almost entirely (more than 90%) attend to themselves and the other [SEP] token.”
○ “(...) the gradients for attention to [SEP] become very small. (...) attending more or less to [SEP] does not substantially change BERT’s outputs.”
[3]: https://arxiv.org/pdf/2007.14062.pdf (Big Bird: Transformers for Longer Sequences)

Specific heads semantics [2] (figures)
[2]: https://arxiv.org/pdf/1906.04341.pdf (What Does BERT Look At?)

Syntactic heads [2]
● No “syntactic” heads
● But syntactic properties are decomposed into simpler tasks!
Relations manual: https://nlp.stanford.edu/software/dependencies_manual.pdf

Coreference heads [2]
● “(...) what percent of the time does the head word of a coreferent mention most attend to the head of one of that mention’s antecedents.”
● Coreference (a semantic task) is also resolved by particular heads
[2]: https://arxiv.org/pdf/1906.04341.pdf (What Does BERT Look At?)

Dependency parsing with groups of heads [2]
● Prediction of the dependency head (governor) of each token
● “(...) linear combination of (all) attention weights.”
● “there is not much more syntactic information in BERT’s vector representations compared to its attention maps.”
[2]: https://arxiv.org/pdf/1906.04341.pdf (What Does BERT Look At?)

Heads overview [2]
[2]: https://arxiv.org/pdf/1906.04341.pdf (What Does BERT Look At?)

Layers semantics: Probing [3]
[3]: https://arxiv.org/pdf/1905.05950.pdf (BERT Rediscovers the Classical NLP Pipeline)

Scalar Mixing Weights
(...) for each task we introduce scalar parameters γ_τ and a_τ^(0), a_τ^(1), ..., a_τ^(L), and let

h_{i,τ} = γ_τ · Σ_{ℓ=0}^{L} s_τ^(ℓ) · h_i^(ℓ),  where s_τ = softmax(a_τ).

We learn these weights jointly with the probing classifier P_τ, in order to allow it to extract information from the many layers of an encoder (...) we extract the learned coefficients in order to estimate the contribution of different layers to that particular task.

Cumulative Scoring
(...) we train a series of classifiers {P_τ^(ℓ)}_ℓ which use scalar mixing to attend to layer ℓ as well as all previous layers.
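A minimal PyTorch sketch of the scalar mix above, assuming the encoder's per-layer hidden states are already stacked into one tensor; the class and variable names (ScalarMix, layer_states) are illustrative, not taken from the paper's code.

```python
# Minimal sketch of the scalar mixing weights described above:
# a learned softmax-weighted combination of per-layer representations.
import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        # a_tau^(0..L): one learnable scalar per layer; gamma_tau: global scale
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, hidden)
        # s_tau = softmax(a_tau); mix = gamma * sum_l s_tau^(l) * h^(l)
        weights = torch.softmax(self.layer_logits, dim=0)
        mixed = (weights[:, None, None, None] * layer_states).sum(dim=0)
        return self.gamma * mixed


# Usage: feed the mix into a probing classifier P_tau and train both jointly;
# the learned softmax weights then estimate each layer's contribution to the task.
states = torch.randn(13, 2, 8, 768)   # e.g. BERT-base: embedding layer + 12 layers
mix = ScalarMix(num_layers=13)
pooled = mix(states)                  # (2, 8, 768)
```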
Layers semantics [3]
● Ordering of the tasks across layers: syntactic before semantic
● Localized resolution of syntactic tasks, distributed resolution of semantic tasks
● “Availability of heuristics” suspicion
[3]: https://arxiv.org/pdf/1905.05950.pdf (BERT Rediscovers the Classical NLP Pipeline)

Layers semantics: per-case analysis [3]
● “Availability of heuristics” suspicion demonstrated on a few cases
[3]: https://arxiv.org/pdf/1905.05950.pdf (BERT Rediscovers the Classical NLP Pipeline)

Are Sixteen Heads Really Better than One? [4]
● Previous experiments raise a suspicion of redundancy among Transformer attention heads
● Experiments show that some (many!) heads can be removed without harming performance
○ But it depends on the task
○ The more complex the task, the fewer heads can be removed
[4]: https://arxiv.org/pdf/1905.10650.pdf (Are Sixteen Heads Really Better than One?)

Head ablations: ablating one head [4]
[4]: https://arxiv.org/pdf/1905.10650.pdf (Are Sixteen Heads Really Better than One?)

Head ablations: ablating one head: per-task [4]
[4]: https://arxiv.org/pdf/1905.10650.pdf (Are Sixteen Heads Really Better than One?)

Head ablations: incremental ablations [4]
[4]: https://arxiv.org/pdf/1905.10650.pdf (Are Sixteen Heads Really Better than One?)

So what is it for?
[5]: https://github.com/huggingface/transformers
● Pruning of selected attention heads increases speed and decreases model size
○ and is already integrated into the fine-tuning pipelines of some libraries, e.g. Transformers [5] (see the sketch below)
● Identifying a head’s functionality, or knowing how to identify it, can be useful when you want to utilize attention for a specific task
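A minimal sketch of head pruning with the Transformers library [5]. The layer and head indices below are arbitrary placeholders for illustration, not heads identified in [4] or in the experiments later in this deck.

```python
# Sketch: permanently remove selected attention heads from BERT and check the
# parameter count before and after (requires a recent version of transformers).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
print("params before pruning:", model.num_parameters())

# {layer index: [head indices to remove]} -- placeholder choice of heads;
# pruning drops the corresponding query/key/value/output parameters.
heads_to_prune = {2: [0, 5], 7: [3], 11: [1, 2, 8]}
model.prune_heads(heads_to_prune)
print("params after pruning: ", model.num_parameters())

inputs = tokenizer("Attention heads can be surprisingly redundant.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # hidden size is unchanged; only heads are gone
```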
Utilizing attention for a specific task
A light peek into my research
● Attention is also an inherent way of denoting which parts of the text are important for a given output
○ A single head performs only a linear transformation of the input
○ Stacking heads can create complex non-linearity
● But now that we can interpret heads’ functionality, we can pick the ones that we know are relevant for our problem

Utilizing attention for weighting text
● Can we weight text using Attention?
● To find out, we propose a set of experiments, inspired by the previous literature
● In itself, identification of key phrases in text can be beneficial for many tasks: summarization, information retrieval, keyword extraction, indexing
● Arguably similar to coreference resolution, which was associated with specific heads

Utilizing attention for weighting text: experiments
https://github.com/stefanik12/claims-checker/blob/master/notebooks/attention_linking.ipynb
Static analysis of the model
● Identification of heads whose attention best distinguishes key parts from less important ones
● Mean relative attention = mean_key_segments − mean_other_segments (see the sketch at the end of the deck)

Utilizing attention for weighting text: experiments
[6]: https://aletheap.github.io/posts/2020/07/looking-for-grammar (Looking for Grammar in all the Right Places)
Weighting of attention heads
● Identification of heads that are best at predicting keywords
● Cross-entropy [6], and linear weights of each head
○ Reproduction of the Scalar Mixing Weights from [3]

Utilizing attention for weighting text: experiments
https://github.com/stefanik12/claims-checker/blob/master/notebooks/attention_kw_classification.ipynb
Fine-tuning the model and analysing heads
● We train the BERT-base model end-to-end for keyword identification
○ reaching F1 = 0.36 (0.37 pruned)
○ SOTA = 0.42
● We reproduce the “remove-one” ablation experiment
● We also rank heads by their “ablation drop”

Utilizing attention for weighting text: experiments
https://github.com/stefanik12/claims-checker/blob/master/notebooks/attention_kw_classification.ipynb
Ablation of worst-to-best-performing attention heads

Thanks!
Michal Štefánik
stefanik.m@mail.muni.cz
Feel free to check out our theses: https://is.muni.cz/auth/rozpis/tema (tag MIR), or contact us later!
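Backup: a minimal sketch of the “mean relative attention” score from the static-analysis experiment above. Everything here is illustrative: the function name, tensor shapes, and mask construction are assumptions for the sketch, not code taken from the linked notebooks.

```python
# Mean relative attention per head: mean attention received by key-phrase tokens
# minus mean attention received by the remaining tokens.
import torch


def mean_relative_attention(attentions: torch.Tensor, key_mask: torch.Tensor) -> torch.Tensor:
    """attentions: (layers, heads, seq_len, seq_len) attention weights for one text.
    key_mask: (seq_len,) boolean mask marking tokens inside key segments.
    Returns a (layers, heads) score; higher = the head focuses more on key segments."""
    # Average the attention each token position *receives* over all query positions.
    received = attentions.mean(dim=2)              # (layers, heads, seq_len)
    key = received[..., key_mask].mean(dim=-1)     # mean over key-segment tokens
    other = received[..., ~key_mask].mean(dim=-1)  # mean over the remaining tokens
    return key - other


# Usage with made-up shapes for BERT-base (12 layers x 12 heads, 16 tokens):
attn = torch.rand(12, 12, 16, 16)
attn = attn / attn.sum(dim=-1, keepdim=True)       # rows sum to 1, like softmax output
mask = torch.zeros(16, dtype=torch.bool)
mask[3:6] = True                                   # tokens 3-5 belong to a key phrase
scores = mean_relative_attention(attn, mask)
best = int(scores.flatten().argmax())
print(divmod(best, scores.shape[1]))               # (layer, head) of the best-scoring head
```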