Attention Semantics
What attention heads actually know, and why we should care
FI:PV212: Readings in Digital ...
Michal Štefánik
stefanik.m@mail.muni.cz

Transformer [1]
[1]: https://arxiv.org/abs/1706.03762 (Attention is All You Need)

Attention [1]
[1]: https://arxiv.org/abs/1706.03762 (Attention is All You Need)
[3]: https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb

Transformer as autoencoder [2]
[2]: http://jalammar.github.io/illustrated-transformer/

Attention heads layout
https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1

Specific heads semantics [2]
[2]: https://arxiv.org/pdf/1906.04341.pdf (What Does BERT Look At?)

Specific heads semantics [2]
● Many elementary patterns, “No-Op” attention to [SEP] (?)
○ “Four attention heads (in layers 2, 4, 7, and 8) on average put >50% of their attention on the previous token and five attention heads (in layers 1, 2, 2, 3, and 6) put >50% of their attention on the next token.”
● Transitive information propagation is beneficial [3], but we cannot see any other heads later attending to [SEP], periods, or commas
○ “Attention heads processing [SEP] almost entirely (more than 90%) attend to themselves and the other [SEP] token.”
○ “(...) the gradients for attention to [SEP] become very small. (...) attending more or less to [SEP] does not substantially change BERT’s outputs.”
[3]: https://arxiv.org/pdf/2007.14062.pdf (Big Bird: Transformers for Longer Sequences)

Specific heads semantics [2] (figures)
[2]: https://arxiv.org/pdf/1906.04341.pdf (What Does BERT Look At?)

Syntactic heads [2]
● No “syntactic” heads
● But syntactic properties are decomposed into simpler tasks!
Relations manual: https://nlp.stanford.edu/software/dependencies_manual.pdf

Coreference heads [2]
● “(...) what percent of the time does the head word of a coreferent mention most attend to the head of one of that mention’s antecedents.”
● Coreference (a semantic task) is also resolved by particular heads
[2]: https://arxiv.org/pdf/1906.04341.pdf (What Does BERT Look At?)

Dependency parsing with groups of heads [2]
● Prediction of the dependency head (governor) of each token
● “(...) linear combination of (all) attention weights.”
● “there is not much more syntactic information in BERT’s vector representations compared to its attention maps.”
[2]: https://arxiv.org/pdf/1906.04341.pdf (What Does BERT Look At?)

Heads overview [2]
[2]: https://arxiv.org/pdf/1906.04341.pdf (What Does BERT Look At?)

Layers semantics: Probing [3]
[3]: https://arxiv.org/pdf/1905.05950.pdf (BERT Rediscovers the Classical NLP Pipeline)

Scalar Mixing Weights
(...) for each task we introduce scalar parameters γ_τ and a_τ^(0), a_τ^(1), ..., a_τ^(L), and let

h_{i,τ} = γ_τ · Σ_{ℓ=0}^{L} s_τ^(ℓ) · h_i^(ℓ),  where s_τ = softmax(a_τ).

We learn these weights jointly with the probing classifier P_τ, in order to allow it to extract information from the many layers of an encoder (...) we extract the learned coefficients in order to estimate the contribution of different layers to that particular task.

Cumulative Scoring
(...) we train a series of classifiers {P_τ^(ℓ)}_ℓ which use scalar mixing to attend to layer ℓ as well as all previous layers.
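A minimal PyTorch sketch of the scalar mix above, assuming the encoder's per-layer hidden states are already stacked into one tensor; the class and variable names (ScalarMix, layer_states) are illustrative, not taken from the paper's code.

```python
# Minimal sketch of the scalar mixing weights described above:
# a learned softmax-weighted combination of per-layer representations.
import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        # a_tau^(0..L): one learnable scalar per layer; gamma_tau: global scale
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, hidden)
        # s_tau = softmax(a_tau); mix = gamma * sum_l s_tau^(l) * h^(l)
        weights = torch.softmax(self.layer_logits, dim=0)
        mixed = (weights[:, None, None, None] * layer_states).sum(dim=0)
        return self.gamma * mixed


# Usage: feed the mix into a probing classifier P_tau and train both jointly;
# the learned softmax weights then estimate each layer's contribution to the task.
states = torch.randn(13, 2, 8, 768)   # e.g. BERT-base: embedding layer + 12 layers
mix = ScalarMix(num_layers=13)
pooled = mix(states)                  # (2, 8, 768)
```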
Layers semantics [3]
● Ordering of the tasks across layers: syntactic before semantic
● Localized resolution of syntactic tasks, distributed resolution of semantic tasks
● “Availability of heuristics” suspicion
[3]: https://arxiv.org/pdf/1905.05950.pdf (BERT Rediscovers the Classical NLP Pipeline)

Layers semantics: per-case analysis [3]
● “Availability of heuristics” suspicion demonstrated on a few cases
[3]: https://arxiv.org/pdf/1905.05950.pdf (BERT Rediscovers the Classical NLP Pipeline)

Are Sixteen Heads Really Better than One? [4]
● Previous experiments raise a suspicion of redundancy among Transformer attention heads
● Experiments show that some (many!) heads can be removed without harming performance
○ But it depends on the task
○ The more complex the task, the fewer heads can be removed
[4]: https://arxiv.org/pdf/1905.10650.pdf (Are Sixteen Heads Really Better than One?)

Head ablations: ablating one head [4]
[4]: https://arxiv.org/pdf/1905.10650.pdf (Are Sixteen Heads Really Better than One?)

Head ablations: ablating one head: per-task [4]
[4]: https://arxiv.org/pdf/1905.10650.pdf (Are Sixteen Heads Really Better than One?)

Head ablations: incremental ablations [4]
[4]: https://arxiv.org/pdf/1905.10650.pdf (Are Sixteen Heads Really Better than One?)

So what is it for?
[5]: https://github.com/huggingface/transformers
● Pruning of selected attention heads increases speed and decreases model size
○ and is already integrated into the fine-tuning pipelines of some libraries, e.g. Transformers [5] (see the sketch below)
● Identifying a head’s functionality, or knowing how to identify it, can be useful when you want to utilize attention for a specific task
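A minimal sketch of head pruning with the Transformers library [5]. The layer and head indices below are arbitrary placeholders for illustration, not heads identified in [4] or in the experiments later in this deck.

```python
# Sketch: permanently remove selected attention heads from BERT and check the
# parameter count before and after (requires a recent version of transformers).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
print("params before pruning:", model.num_parameters())

# {layer index: [head indices to remove]} -- placeholder choice of heads;
# pruning drops the corresponding query/key/value/output parameters.
heads_to_prune = {2: [0, 5], 7: [3], 11: [1, 2, 8]}
model.prune_heads(heads_to_prune)
print("params after pruning: ", model.num_parameters())

inputs = tokenizer("Attention heads can be surprisingly redundant.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # hidden size is unchanged; only heads are gone
```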
Utilizing attention for a specific task
A light peek into my research
● Attention is also an inherent way of denoting which parts of the text are important for a given output
○ A single head performs only a linear transformation of the input
○ Stacking heads can create complex non-linearity
● But now that we can interpret heads’ functionality, we can pick the ones that we know are relevant for our problem

Utilizing attention for weighting text
● Can we weight text using Attention?
● To find out, we propose a set of experiments, inspired by the previous literature
● In itself, identification of key phrases in text can be beneficial for many tasks: summarization, information retrieval, keyword extraction, indexing
● Arguably similar to coreference resolution, which was associated with specific heads

Utilizing attention for weighting text: experiments
https://github.com/stefanik12/claims-checker/blob/master/notebooks/attention_linking.ipynb
Static analysis of the model
● Identification of heads whose attention best distinguishes key parts from less important ones
● Mean relative attention = mean_key_segments − mean_other_segments (see the sketch at the end of the deck)

Utilizing attention for weighting text: experiments
[6]: https://aletheap.github.io/posts/2020/07/looking-for-grammar (Looking for Grammar in all the Right Places)
Weighting of attention heads
● Identification of heads that are best at predicting keywords
● Cross-entropy [6], and linear weights of each head
○ Reproduction of the Scalar Mixing Weights from [3]

Utilizing attention for weighting text: experiments
https://github.com/stefanik12/claims-checker/blob/master/notebooks/attention_kw_classification.ipynb
Fine-tuning the model and analysing heads
● We train the BERT-base model end-to-end for keyword identification
○ reaching F1 = 0.36 (0.37 pruned)
○ SOTA = 0.42
● We reproduce the “remove-one” ablation experiment
● We also rank heads by their “ablation drop”

Utilizing attention for weighting text: experiments
https://github.com/stefanik12/claims-checker/blob/master/notebooks/attention_kw_classification.ipynb
Ablation of worst-to-best-performing attention heads

Thanks!
Michal Štefánik
stefanik.m@mail.muni.cz
Feel free to check out our theses: https://is.muni.cz/auth/rozpis/tema (tag MIR), or contact us later!
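Backup: a minimal sketch of the “mean relative attention” score from the static-analysis experiment above. Everything here is illustrative: the function name, tensor shapes, and mask construction are assumptions for the sketch, not code taken from the linked notebooks.

```python
# Mean relative attention per head: mean attention received by key-phrase tokens
# minus mean attention received by the remaining tokens.
import torch


def mean_relative_attention(attentions: torch.Tensor, key_mask: torch.Tensor) -> torch.Tensor:
    """attentions: (layers, heads, seq_len, seq_len) attention weights for one text.
    key_mask: (seq_len,) boolean mask marking tokens inside key segments.
    Returns a (layers, heads) score; higher = the head focuses more on key segments."""
    # Average the attention each token position *receives* over all query positions.
    received = attentions.mean(dim=2)              # (layers, heads, seq_len)
    key = received[..., key_mask].mean(dim=-1)     # mean over key-segment tokens
    other = received[..., ~key_mask].mean(dim=-1)  # mean over the remaining tokens
    return key - other


# Usage with made-up shapes for BERT-base (12 layers x 12 heads, 16 tokens):
attn = torch.rand(12, 12, 16, 16)
attn = attn / attn.sum(dim=-1, keepdim=True)       # rows sum to 1, like softmax output
mask = torch.zeros(16, dtype=torch.bool)
mask[3:6] = True                                   # tokens 3-5 belong to a key phrase
scores = mean_relative_attention(attn, mask)
best = int(scores.flatten().argmax())
print(divmod(best, scores.shape[1]))               # (layer, head) of the best-scoring head
```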