Charles University
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics

Large language models: What do "large" and "language" mean here?

Jindřich Libovický
15. 9. 2023
SitSem Seminar, Telč

Outline
1. Language models and neural networks
2. Selection of language model types
3. NLP tasks solved with language models
4. Generative models
5. From LMs to Assistants
6. Stochastic parrots and other problems
7. Research at ÚFAL

How do LMs work

Language Model
Estimate the probability of a word / sentence / text in a context.

Where do LMs come from
Since the 1990s, an important component in speech recognition and machine translation:
[figure] Sound → Acoustic model → Possible phones → Language model → Transcript
[figure] Source language → Translation model → Possible phrase translations → Language model → Target language

LMs used as…
1. Models of what is good/bad in a language
2. Representation learning models
3. Generative models

Language Models
Goal: Estimate the probability of text
P(she is a doctor) > P(dog's name is dog) > P(dsa ds gf afgra fw)
● Historic significance as a component in machine translation or speech recognition systems
○ Noisy channel model
○ P(target | source) = P(source | target) × P(target) / P(source)
○ best target = argmax P(source | target) × P(target)
● Statistical: trained to maximize the likelihood of the training data

Neural Networks
● Machine learning model
○ Parameterized function mapping the input to a prediction
● Built around non-linear transformations of intermediate results
○ "Layers"
○ Affine transformations followed by a non-linear "activation function"
○ Great match with parallel processing of batches of data on GPUs
● Structured architecture
○ Recurrent networks
○ Encoder-decoder
○ Attention mechanism

Network Layers and Error Back-Propagation
[figure]

Text Processing with Neural Networks
● NNs work with real numbers, text is discrete
● Words are segmented into tokens (subwords)
● Tokens are represented by vectors in a continuous space (embeddings)
○ Trainable parameters of the NN
● The output is normalized and interpreted as a probability distribution over the token vocabulary

Visualization of Embeddings from an MT System
[figure]

Transformer
● Originally published for MT in 2017 by Google
● Current state of the art in many NLP tasks
● Architecture based on the attention mechanism
● Encoder-decoder paradigm
○ Encoder loads up the input
○ Decoder generates the output
● Can both score and generate
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Types of Language Models

Neural Language Models
1. Encoder-decoder models
○ Machine translation, text summarization
2. Encoder-only models
○ BERT, RoBERTa, ALBERT, …
○ Pretrained representations for downstream tasks
3. Decoder-only = generative models
○ GPT, ChatGPT

Intermediate Representations
[figure]

Encoder-only: BERT
● The original Transformer for MT: encoder + decoder
● For representations, the encoder is enough
● Training without a decoder: masked language modeling
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171-4186).
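To make masked language modeling concrete: a minimal sketch, assuming the Hugging Face transformers library, with the publicly released bert-base-uncased checkpoint as an illustrative stand-in (not prescribed by the slides).

```python
# Minimal sketch of masked language modeling with a pretrained BERT.
# Assumes the Hugging Face "transformers" library; the checkpoint name
# is an illustrative choice.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts a probability distribution over the vocabulary for the
# masked position; we print the top-scoring candidate tokens.
for prediction in fill_mask("She works as a [MASK] at the hospital."):
    print(prediction["token_str"], round(prediction["score"], 3))
```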
Multilingual Pretraining
● The same as monolingual, but many languages at once
● Benefits from language similarity for low-resourced languages
● Multilingual tasks, e.g., training data filtering for machine translation
Image: Libovický, J., Rosa, R., & Fraser, A. (2020, November). On the Language Neutrality of Pre-trained Multilingual Representations. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 1663-1674).

Notable BERTs
● BERT by Google, 2018: 110M parameters, 16GB of text
● RoBERTa by Facebook AI, 2019: 123M parameters, 160GB of text
● XLM-R by Facebook AI, 2019: 125M parameters, 2.5TB of text
● RobeCzech by ÚFAL, 2020: 125M parameters, 80GB of text
Parameter counts are for the base setup; the Large setup has twice as many.

NLP Tasks Solved using LMs

Pretrain and Finetune Paradigm
[figure]

Classification
● Sentiment analysis
● Hate speech detection
● Spam detection
● Plagiarism detection
● …
[figure: a classifier on top of the pretrained representation]

Named Entity Recognition (1)
[figure]

Named Entity Recognition (2)
[figure]

Answer Span Selection (1)
● Input text with facts (e.g., a Wikipedia article)
● A factual question
● The model searches for an answer in the text
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016, November). SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2383-2392).

Answer Span Selection (2)
[figure]

Zero-shot Transfer Between Languages
[figure]

Generative Models

Decoder-only Models
● Decoder: just like the encoder, but at training time masked so it cannot attend to the future
● Training objective = predict the next word based on the previous words:
○ the prompt provided from outside
○ the already generated text

Generating Any Text…
Colleagues from ÚFAL & Švandovo divadlo prepared a generated theatre play for the 100th anniversary of Karel Čapek's R.U.R.
https://theaitre.com

Few-shot Learning Capabilities with GPT-3
[figure]
Source: https://arxiv.org/pdf/2005.14165.pdf, the GPT-3 preprint.

Zero-shot Capabilities of PaLM
[figure]
Source: https://arxiv.org/pdf/2204.02311.pdf, the PaLM preprint.

LaMDA
● GPT-3-sized model
● Trained specifically for conversation
Source: https://arxiv.org/pdf/2201.08239.pdf, the LaMDA preprint.

Emergent Capabilities
[figure]
This is what, IMO, "big" means.

Notable Decoder-only models (1)
● GPT-2, Feb 2019: 1.5B parameters
● GPT-3, May 2020: 175B parameters
○ OpenAI did not provide the weights and wants to sell the API
○ Open-source alternatives: GPT-J, OPT by Facebook
○ Trained on 5TB of text
○ ~1600× more parameters than BERT
● PaLM, Apr 2022: 540B parameters
○ Technically impossible to run outside of Google
○ Innovative software engineering to make the model this big
● BLOOM, Oct 2022: 175B parameters, an open-source initiative
○ Multilingual: 40 languages + some programming languages
○ Stress on data fairness

Notable Decoder-only models (2)
● LLaMA, Feb 2023: 7B–65B parameters
○ Made public for academic research, weird licence
○ Better use of so-called scaling laws
● GPT-4, Mar 2023: ??? parameters, ??? data
● LLaMA 2, Jul 2023: 7B–70B parameters
○ Even smarter training scheme
○ Includes instruction-tuned, a.k.a. assistant, models
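What "predict the next word" and GPT-3-style few-shot prompting look like in code: a minimal sketch, assuming the Hugging Face transformers library, with the small public GPT-2 checkpoint standing in for the much larger models above; the translation prompt is an illustrative example.

```python
# Minimal sketch of next-token generation with a decoder-only LM.
# GPT-2 is a small, openly available stand-in for GPT-3-like models;
# the few-shot prompt below is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Few-shot prompt: the task is specified only by examples in the input;
# the model weights are not finetuned at all.
prompt = (
    "English: cheese -> French: fromage\n"
    "English: house -> French: maison\n"
    "English: dog -> French:"
)
inputs = tokenizer(prompt, return_tensors="pt")

# The model repeatedly predicts the next token given all previous ones.
output = model.generate(
    **inputs,
    max_new_tokens=5,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```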
From LM to an Assistant

Three steps of InstructGPT
[figure]

Supervised Finetuning
● Annotators write scripts of conversations with the assistant
● The scripts are used for direct finetuning
● 10⁵–10⁶ conversations are needed in this stage

Reinforcement learning
[figure]

RL changes everything?
The model is no longer mimicking the training data; it has a goal: satisfy the (simulated) user (who wants correct and useful answers).
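How the user gets "simulated": a minimal sketch of the pairwise reward-model objective from InstructGPT-style training, where a reward model is trained on annotators' preferences and then stands in for the user during reinforcement learning. This is an illustration in PyTorch, not OpenAI's code; all names are hypothetical.

```python
# Sketch of the pairwise reward-model loss used in RLHF (InstructGPT-style).
# A reward model maps an answer to a scalar reward and is trained so that
# the answer the annotator preferred scores higher than the rejected one.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected) is minimized when the preferred
    # answer receives a clearly higher reward than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: in practice the rewards come from a finetuned LM with a scalar
# output head, evaluated on pairs of model answers ranked by annotators.
loss = reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
print(loss.item())
```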
It's not just ChatGPT
● OpenAssistant: German open-source initiative
● Meta's LLaMA 2: slightly smaller models by Meta, fully open-sourced
● Alpaca, Vicuna: LLaMA-based assistants from Stanford
… and many commercial products: Google Bard, Bing AI Chat, Perplexity AI, Claude AI

Stochastic Parrots & Other Problems

Problematic Training Data
● Crawling the Internet: not representative; people with extreme/weird opinions write more text than the rest of society
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610-623, 2021.
● Crowd-sourcing: uses cheap labour, the so-called gig economy; precarization of labour
Mary L. Gray and Siddharth Suri. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Eamon Dolan Books, 2019.
● Mining existing databases: unpaid labour, nontransparent "payment" for "free services"
Nick Couldry and Ulises A. Mejias. The Costs of Connection: How Data Is Colonizing Human Life and Appropriating It for Capitalism. Stanford University Press, 2020.

Toxic Language on the Internet → Toxic Models
[figure]
Generated using https://transformer.huggingface.co/doc/gpt2-large

Misuse for Fake News Generation
Generated text can look very trustworthy.
Source: Bender et al. (2021), On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Problematic Applications
● Apps like automatic filtering of CVs and job recommendation
○ Driven by precision, not recall => room for discrimination
● Minority language is worse represented
○ Texts with minority views (typically African American) are harder to find by search
● Huge amounts of data are only available for some languages
○ Increases the technological gap between developed and developing countries
● Model training has a large carbon footprint

A "scandal" with a sentient model
[figure]

LM Research @ ÚFAL

HPLT Project
● More of a technical/infrastructure project than research
● Main objectives:
○ Open and fair data for training LMs and MT
○ Open and fair LM and MT models
● Turning petabytes of data from the Internet Archive into clean datasets (an alternative to the currently used CommonCrawl, which is extremely noisy)
● Search for parallel texts / sentences → high-quality machine translation (CUNI and Edinburgh)

HPLT Partners
● Large language models trained by Scandinavian partners (LUMI cluster with AMD hardware)
● Total: 4 M€ / 3 years

Ondřej Dušek: NG-NLG
● Prestigious ERC Starting Grant (1.4 M€ / 5 years)
● Text generation tasks: structured data to language, summarization
● Fundamental research on combining symbolic approaches with large language models
● Big stress on evaluating the correctness of generated text

CUNI's Primus: Multilingual Representations
● NLP tasks in languages without task-specific data
● Zero-shot cross-lingual transfer using pretrained representations or machine translation
● Language-and-vision tasks: training with Western images, applied in non-Western languages
● What is the proper text segmentation for multilingual NLP?

Summary
● Large LMs = neural networks with billions of parameters
● Pretrain and finetune paradigm, cross-lingual transfer
● Zero-shot and few-shot learning capabilities
● Reinforcement learning turns an LM into an assistant
● Problematic data: toxic content, low-resource languages

ufal.mff.cuni.cz