Charles University
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics

Large language models: What do "large" and "language" mean here?

Jindřich Libovický
15. 9. 2023
SitSem Seminar, Telč

Outline
1. Language models and neural networks
2. Selection of language model types
3. NLP tasks solved with language models
4. Generative models
5. From LMs to Assistants
6. Stochastic parrots and other problems
7. Research at ÚFAL

How do LMs work

Language Model
Estimate the probability of a word / sentence / text in a context.

Where do LMs come from
Since the 1990s, an important component in speech recognition and machine translation:
[figure] Sound → Acoustic model → Possible phones → Language model → Transcript
[figure] Source language → Translation model → Possible phrase translations → Language model → Target language

LMs used as…
1. Models of what is good/bad in a language
2. Representation learning models
3. Generative models

Language Models
Goal: Estimate the probability of text
P(she is a doctor) > P(dog's name is dog) > P(dsa ds gf afgra fw)
● Historic significance as a component in machine translation or speech recognition systems
○ Noisy channel model
○ P(target | source) = P(source | target) × P(target) / P(source)
○ best target = argmax P(source | target) × P(target)
● Statistical: trained to maximize the likelihood of the training data

Neural Networks
● Machine learning model
○ Parameterized function mapping the input to a prediction
● Built around non-linear transformations of intermediate results
○ "Layers"
○ Affine transformations followed by a non-linear "activation function"
○ Great match with parallel processing of batches of data on GPUs
● Structured architecture
○ Recurrent networks
○ Encoder-decoder
○ Attention mechanism

Network Layers and Error Back-Propagation
[figure]

Text Processing with Neural Networks
● NNs work with real numbers, text is discrete
● Words are segmented into tokens (subwords)
● Tokens are represented by vectors in a continuous space (embeddings)
○ Trainable parameters of the NN
● The output is normalized and interpreted as a probability distribution over the token vocabulary

Visualization of Embeddings from an MT System
[figure]

Transformer
● Originally published for MT in 2017 by Google
● Current state of the art in many NLP tasks
● Architecture based on the attention mechanism
● Encoder-decoder paradigm
○ Encoder loads up the input
○ Decoder generates the output
● Can both score and generate
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Types of Language Models

Neural Language Models
1. Encoder-decoder models
○ Machine translation, text summarization
2. Encoder-only models
○ BERT, RoBERTa, ALBERT, …
○ Pretrained representations for downstream tasks
3. Decoder-only = generative models
○ GPT, ChatGPT

Intermediate Representations
[figure]

Encoder-only: BERT
● The original Transformer for MT: encoder + decoder
● For representations, the encoder is enough
● Training without a decoder: masked language modeling
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171-4186).
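To make masked language modeling concrete: a minimal sketch, assuming the Hugging Face transformers library, with the publicly released bert-base-uncased checkpoint as an illustrative stand-in (not prescribed by the slides).

```python
# Minimal sketch of masked language modeling with a pretrained BERT.
# Assumes the Hugging Face "transformers" library; the checkpoint name
# is an illustrative choice.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts a probability distribution over the vocabulary for the
# masked position; we print the top-scoring candidate tokens.
for prediction in fill_mask("She works as a [MASK] at the hospital."):
    print(prediction["token_str"], round(prediction["score"], 3))
```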
Multilingual Pretraining
● The same as monolingual, but many languages at once
● Benefits from language similarity for low-resourced languages
● Multilingual tasks, e.g., training data filtering for machine translation
Image: Libovický, J., Rosa, R., & Fraser, A. (2020, November). On the Language Neutrality of Pre-trained Multilingual Representations. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 1663-1674).

Notable BERTs
● BERT by Google, 2018: 110M parameters, 16GB of text
● RoBERTa by Facebook AI, 2019: 123M parameters, 160GB of text
● XLM-R by Facebook AI, 2019: 125M parameters, 2.5TB of text
● RobeCzech by ÚFAL, 2020: 125M parameters, 80GB of text
Parameter counts are for the base setup; the Large setup has twice as many.

NLP Tasks Solved using LMs

Pretrain and Finetune Paradigm
[figure]

Classification
● Sentiment analysis
● Hate speech detection
● Spam detection
● Plagiarism detection
● …
[figure: a classifier on top of the pretrained representation]

Named Entity Recognition (1)
[figure]

Named Entity Recognition (2)
[figure]

Answer Span Selection (1)
● Input text with facts (e.g., a Wikipedia article)
● A factual question
● The model searches for an answer in the text
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016, November). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2383-2392).

Answer Span Selection (2)
[figure]

Zero-shot Transfer Between Languages
[figure]

Generative Models

Decoder-only Models
● Decoder: just like the encoder, but at training time masked so it cannot attend to the future
● Training objective = predict the next word based on the previous words:
○ the prompt provided from outside
○ the already generated text

Generating Any Text…
Colleagues from ÚFAL & Švandovo divadlo prepared a generated theatre play for the 100th anniversary of Karel Čapek's R.U.R.
https://theaitre.com

Few-shot Learning Capabilities with GPT-3
[figure]
Source: https://arxiv.org/pdf/2005.14165.pdf, the GPT-3 preprint.

Zero-shot Capabilities of PaLM
[figure]
Source: https://arxiv.org/pdf/2204.02311.pdf, the PaLM preprint.

LaMDA
● GPT-3-sized model
● Trained specifically for conversation
Source: https://arxiv.org/pdf/2201.08239.pdf, the LaMDA preprint.

Emergent Capabilities
[figure]
This is what, IMO, "big" means.

Notable Decoder-only models (1)
● GPT-2, Feb 2019: 1.5B parameters
● GPT-3, May 2020: 175B parameters
○ OpenAI did not provide the weights and wants to sell the API
○ Open-source alternatives: GPT-J, OPT by Facebook
○ Trained on 5TB of text
○ ~1600× more parameters than BERT
● PaLM, Apr 2022: 540B parameters
○ Technically impossible to run outside of Google
○ Innovative software engineering to make the model this big
● BLOOM, Oct 2022: 175B parameters, an open-source initiative
○ Multilingual: 40 languages + some programming languages
○ Stress on data fairness

Notable Decoder-only models (2)
● LLaMA, Feb 2023: 7B–65B parameters
○ Made public for academic research, weird licence
○ Better use of so-called scaling laws
● GPT-4, Mar 2023: ??? parameters, ??? data
● LLaMA 2, Jul 2023: 7B–70B parameters
○ Even smarter training scheme
○ Includes instruction-tuned, a.k.a. assistant, models
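What "predict the next word" and GPT-3-style few-shot prompting look like in code: a minimal sketch, assuming the Hugging Face transformers library, with the small public GPT-2 checkpoint standing in for the much larger models above; the translation prompt is an illustrative example.

```python
# Minimal sketch of next-token generation with a decoder-only LM.
# GPT-2 is a small, openly available stand-in for GPT-3-like models;
# the few-shot prompt below is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Few-shot prompt: the task is specified only by examples in the input;
# the model weights are not finetuned at all.
prompt = (
    "English: cheese -> French: fromage\n"
    "English: house -> French: maison\n"
    "English: dog -> French:"
)
inputs = tokenizer(prompt, return_tensors="pt")

# The model repeatedly predicts the next token given all previous ones.
output = model.generate(
    **inputs,
    max_new_tokens=5,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```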
From LM to an Assistant

Three steps of InstructGPT
[figure]

Supervised Finetuning
● Annotators write scripts of conversations with the assistant
● The scripts are used for direct finetuning
● 10⁵–10⁶ conversations are needed in this stage

Reinforcement learning
[figure]

RL changes everything?
The model is no longer mimicking the training data; it has a goal: satisfy the (simulated) user (who wants correct and useful answers).
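How the user gets "simulated": a minimal sketch of the pairwise reward-model objective from InstructGPT-style training, where a reward model is trained on annotators' preferences and then stands in for the user during reinforcement learning. This is an illustration in PyTorch, not OpenAI's code; all names are hypothetical.

```python
# Sketch of the pairwise reward-model loss used in RLHF (InstructGPT-style).
# A reward model maps an answer to a scalar reward and is trained so that
# the answer the annotator preferred scores higher than the rejected one.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected) is minimized when the preferred
    # answer receives a clearly higher reward than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: in practice the rewards come from a finetuned LM with a scalar
# output head, evaluated on pairs of model answers ranked by annotators.
loss = reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
print(loss.item())
```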
It's not just ChatGPT
● OpenAssistant: German open-source initiative
● Meta's LLaMA 2: slightly smaller models by Meta, fully open-sourced
● Alpaca, Vicuna: LLaMA-based assistants from Stanford
… and many commercial products: Google Bard, Bing AI Chat, Perplexity AI, Claude AI

Stochastic Parrots & Other Problems

Problematic Training Data
● Crawling the Internet: not representative; people with extreme/weird opinions write more text than the rest of society
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610-623, 2021.
● Crowd-sourcing: uses cheap labour, the so-called gig economy; precarization of labour
Mary L. Gray and Siddharth Suri. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Eamon Dolan Books, 2019.
● Mining existing databases: unpaid labour, nontransparent "payment" for "free services"
Nick Couldry and Ulises A. Mejias. The Costs of Connection: How Data Is Colonizing Human Life and Appropriating It for Capitalism. Stanford University Press, 2020.

Toxic Language on the Internet → Toxic Models
[figure]
Generated using https://transformer.huggingface.co/doc/gpt2-large

Misuse for Fake News Generation
Generated text can look very trustworthy.
Source: Bender et al. (2021), On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Problematic Applications
● Apps like automatic filtering of CVs and job recommendation
○ Driven by precision, not recall => room for discrimination
● Minority language is worse represented
○ Texts with minority views (typically African American) are harder to find by search
● Huge amounts of data are only available for some languages
○ Increases the technological gap between developed and developing countries
● Model training has a large carbon footprint

A "scandal" with a sentient model
[figure]

LM Research @ ÚFAL

HPLT Project
● More of a technical/infrastructure project than research
● Main objectives:
○ Open and fair data for training LMs and MT
○ Open and fair LM and MT models
● Turning petabytes of data from the Internet Archive into clean datasets (an alternative to the currently used CommonCrawl, which is extremely noisy)
● Search for parallel texts / sentences → high-quality machine translation (CUNI and Edinburgh)

HPLT Partners
● Large language models trained by Scandinavian partners (LUMI cluster with AMD hardware)
● Total: 4 M€ / 3 years

Ondřej Dušek: NG-NLG
● Prestigious ERC Starting Grant (1.4 M€ / 5 years)
● Text generation tasks: structured data to language, summarization
● Fundamental research on combining symbolic approaches with large language models
● Big stress on evaluating the correctness of generated text

CUNI's Primus: Multilingual Representations
● NLP tasks in languages without task-specific data
● Zero-shot cross-lingual transfer using pretrained representations or machine translation
● Language-and-vision tasks: training with Western images, applied in non-Western languages
● What is the proper text segmentation for multilingual NLP?

Summary
● Large LMs = neural networks with billions of parameters
● Pretrain and finetune paradigm, cross-lingual transfer
● Zero-shot and few-shot learning capabilities
● Reinforcement learning turns an LM into an assistant
● Problematic data: toxic content, low-resource languages

ufal.mff.cuni.cz