Attention sparsification
Look into the future and the past (beyond the context window)
FI:PV212: Readings in Digital ...
Michal Štefánik
stefanik.m@mail.muni.cz

Why talk about it?
https://paperswithcode.com/sota/question-answering-on-squad20
https://paperswithcode.com/task/named-entity-recognition-ner
https://paperswithcode.com/task/semantic-segmentation

Attention [1]
[1]: https://arxiv.org/abs/1706.03762 (Attention Is All You Need)
[3]: https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb

Transformer [1]
[1]: https://arxiv.org/abs/1706.03762 (Attention Is All You Need)

Transformer as autoencoder [2]
[2]: http://jalammar.github.io/illustrated-transformer/ (The Illustrated Transformer)

Transformer families
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
[4]: https://arxiv.org/pdf/1810.04805.pdf (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding)

Transformers: scaling
[5]: https://arxiv.org/pdf/2001.08361.pdf (Scaling Laws for Neural Language Models, 2020)
https://twitter.com/pavtalk/status/1285410751092416513

Transformers: (down)scaling
[6]: https://arxiv.org/pdf/2004.05150.pdf (Longformer: The Long-Document Transformer)

Attention customizations
[1]: https://arxiv.org/abs/1706.03762 (Attention Is All You Need)
[3]: https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb

Transformer-XL [7]
[7]: https://arxiv.org/abs/1901.02860 (Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context)
● Window length of 784 tokens during training, 3800 during evaluation
● Novel relative positional encodings
● Not very relevant evaluation (bpc: bits per character)
● Not any smaller than the original Transformer

Sparse Transformers [8]
[8]: https://arxiv.org/pdf/1904.10509.pdf (Generating Long Sequences with Sparse Transformers)
● Head factorization: the attention functionality is decomposed between two heads (strided and fixed patterns)
● Global token positions: a first attempt to propagate information across the whole sequence through attention
● Shows a well-performing replacement of convolution with attention
● Evaluation on other kinds of sequences (classical music)
● Missing evaluation on actual NLP end tasks

Longformer [6]
[6]: https://arxiv.org/pdf/2004.05150.pdf (Longformer: The Long-Document Transformer)
● Linear scaling of the attention weights' size with sequence length
● Different context window sizes across layers (layers 1->12: windows 32->512)
● New idea of dilation: attend to every second position (used on the 2 bottom layers)
● Transfers existing RoBERTa weights
● Evaluation on end tasks that require a long context window
● Modest ablation study
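Longformer: sliding-window attention (illustrative sketch)
A minimal NumPy sketch, not the authors' implementation; the function and parameter names (sliding_window_mask, window, dilation, global_positions) are my own. It shows how a local window plus a few global tokens restricts the attention pattern so that the number of allowed query-key pairs grows linearly with sequence length:

    import numpy as np

    def sliding_window_mask(seq_len, window, dilation=1, global_positions=()):
        """Boolean mask: mask[i, j] is True iff token i may attend to token j."""
        mask = np.zeros((seq_len, seq_len), dtype=bool)
        for i in range(seq_len):
            # Local window: +/- `window` neighbours, optionally dilated
            # (dilation=2 means "attend to every second position").
            for offset in range(-window, window + 1):
                j = i + offset * dilation
                if 0 <= j < seq_len:
                    mask[i, j] = True
        for g in global_positions:
            mask[g, :] = True   # a global token (e.g. [CLS]) attends everywhere
            mask[:, g] = True   # and every token attends to it
        return mask

    # Toy example: 512 tokens, window of 32, token 0 treated as global.
    m = sliding_window_mask(seq_len=512, window=32, global_positions=(0,))
    print(m.sum(), "allowed pairs instead of", 512 * 512)

With a fixed window the allowed pairs grow as O(n * window) instead of O(n^2), which is the linear scaling claimed above; dilation widens the receptive field without adding extra pairs.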
Big Bird [10]
[10]: https://arxiv.org/pdf/2007.14062v1.pdf (Big Bird: Transformers for Longer Sequences)
[9]: Collective dynamics of 'small-world' networks (Watts & Strogatz, Nature, 1998)
● Random graph in attention: the choice of random attention positions is rationalized by information propagation arguments, following [9]
● The randomness actually seems to work: the ablation study shows a +3-5% accuracy advantage over Longformer (that is a lot)
● Compared to Longformer, it only adds the random connections
● These are justified by minimizing the distance between each pair of nodes (= tokens) in the attention graph
● Some nice theoretical properties: with random attention heads, Big Bird is a universal approximator of any seq2seq function on its context window (like the full Transformer)
● Turing complete
● Serious evaluation on "long" end tasks (not just bpc) and also some "short" tasks
● Possible cheating by pre-training with the Pegasus objective

Thanks!
Michal Štefánik
stefanik.m@mail.muni.cz
Feel free to check out our theses: https://is.muni.cz/auth/rozpis/tema (tag MIR) or contact us later!
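Backup: Big Bird attention pattern (illustrative sketch)
A minimal NumPy sketch, not the reference implementation (the function name bigbird_style_mask and its default values are my own), showing how Big Bird composes the Longformer-style pattern with random connections: local window + global tokens + a few random positions per query.

    import numpy as np

    rng = np.random.default_rng(0)

    def bigbird_style_mask(seq_len, window=3, num_global=2, num_random=3):
        """Boolean mask combining the three Big Bird attention patterns."""
        mask = np.zeros((seq_len, seq_len), dtype=bool)
        # 1) Sliding window: each token sees `window` neighbours on each side.
        for i in range(seq_len):
            lo, hi = max(0, i - window), min(seq_len, i + window + 1)
            mask[i, lo:hi] = True
        # 2) Global tokens: the first `num_global` positions attend to,
        #    and are attended by, every position.
        mask[:num_global, :] = True
        mask[:, :num_global] = True
        # 3) Random connections: each token attends to a few random positions,
        #    the "random graph" idea rationalized by [9].
        for i in range(seq_len):
            mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True
        return mask

    m = bigbird_style_mask(seq_len=64)
    print(f"{m.mean():.0%} of the full 64x64 attention matrix is kept")

The random edges are what keep the expected path length between any two tokens short (the small-world argument from [9]); everything else is the Longformer pattern.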