Large Language Models (LLM)
PA154 Language Modeling (11.1)
Pavel Rychlý
pary@fi.muni.cz
April 27, 2023

Large Models
■ bigger is better
■ many layers
■ need big machines
■ use advanced hardware: GPUs, TPUs, on multiple servers

Usage of Large Models
■ training of big models on huge data is expensive (long training time)
■ fine-tuning on small data of the target task
■ combining the language model with an additional NN/layer, training only the new layer
■ the big model is frozen, only used
[Figure: task-specific input transformations for classification, entailment, and similarity — Start/Text/Delim/Extract token sequences fed to a Transformer followed by a linear layer]

Pre-trained models
■ word2vec, fastText: pre-trained word embeddings
■ transformers: BERT
■ transformer modifications: RoBERTa, ALBERT, ...
■ language-specific models
■ multilingual models

Pre-trained fastText
■ 157 languages
■ word vectors with dimension 300
■ up to 1 or 2 million words
■ Czech:
  ■ 2 million words
  ■ text format: 1.2 GB, binary format: 4.2 GB
■ Breton:
  ■ 602k words
  ■ text format: 340 MB, binary format: 4.2 GB

Pre-trained fastText
Czech embeddings trained on Common Crawl: cc.cs.300.vec.gz
■ first line: vocabulary size and vector dimension, then one word per line followed by its 300 values (shown truncated below):
2000000 300
,  0.0052  0.1646  0.0675  0.0577  0.2342  0.0089  0.1601  0.0240 ...
.  0.0485  0.0674  0.0261  0.0220 -0.0779 -0.0309 -0.2006  0.0100 ...
a  0.1253  0.0177  0.0770 -0.0103  0.0687  0.0175  0.0171  0.0013 ...
-  0.0251 -0.0350  0.0364  0.0349  0.0159 -0.0586 -0.4607 ...
:  -0.0715 -0.0175 -0.0210  0.0818 -0.0174 -0.0204  0.0574 ...
v  0.1013  0.1792 -0.0174  0.0365  0.0920  0.0802 -0.1830  0.0271 ...
na -0.1200  0.2000  0.2071  0.0144  0.3272 -0.0145 -0.1196 ...
)  0.0614 -0.1514  0.0203  0.1658  0.0958 -0.0628 -0.0841 ...
se -0.1456  0.1170  0.0285 -0.0062 -0.0890 -0.0042 -0.0969 ...
C  0.0671 -0.1871  0.0332  0.1324  0.1774 -0.0685  0.0082 ...
"  0.1381 -0.2536  0.0805  0.0379  0.2684 -0.0038  0.0437 ...
je  0.0170 -0.1937  0.0388 -0.0084  0.1255 -0.0953 -0.0267 ...
-  -0.2497 -0.0093  0.1759 -0.0839  0.1842 -0.0276  0.1605 ...
s  0.0081 -0.0854 -0.0566  0.0116 -0.5178 -0.0091 -0.2048 ...
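To make the .vec layout above concrete, here is a minimal sketch in plain Python that reads the head of such a file; the path assumes the downloaded cc.cs.300.vec.gz, and the limit parameter and variable names are only illustrative.

```python
import gzip

def load_vectors(path, limit=1000):
    """Read the first `limit` word vectors from a fastText .vec.gz file."""
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())  # header line: "2000000 300"
        for _ in range(min(limit, vocab_size)):
            parts = f.readline().rstrip().split(" ")
            word, values = parts[0], parts[1:]
            vectors[word] = [float(x) for x in values]     # `dim` floats per word
    return vectors

vecs = load_vectors("cc.cs.300.vec.gz")
print(len(vecs), len(vecs["a"]))  # e.g. 1000 300
```

In practice the same text format can also be loaded directly with gensim's KeyedVectors.load_word2vec_format, or the binary model with the fasttext package.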
Scaling transformers
■ main factors:
  ■ number of model parameters N
  ■ size of the dataset D
  ■ amount of compute C
■ evaluation on test loss (cross-entropy)
■ there is a capacity limit for a fixed N, D, or C
■ performance improves predictably as long as we scale up N and D in tandem

Scaling transformers
■ number of model parameters N
■ size of the dataset D
■ amount of compute C
[Figure: test loss as a function of compute (PF-days, non-embedding), dataset size (tokens), and parameters (non-embedding)]

Scaling transformers
■ larger models require fewer samples to reach the same performance
■ larger models are much slower per sample
■ smaller models reach the same performance faster
[Figure: test loss vs. tokens processed and vs. compute (PF-days)]

Scaling transformers
■ a bigger dataset reduces overfitting
[Figure: test loss vs. train loss over training steps, N = 300M parameters]

BERT
■ BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
■ encoder only (not language modelling)
■ pre-training on raw text
■ masked tokens, is-next-sentence
■ big pre-trained models available
■ domain (task) adaptation

Input: The man went to the [MASK]1. He bought a [MASK]2 of milk.
Labels: [MASK]1 = store; [MASK]2 = gallon

Sentence A = The man went to the store.
Sentence B = He bought a gallon of milk.
Label = IsNextSentence

Sentence A = The man went to the store.
Sentence B = Penguins are flightless.
Label = NotNextSentence

BERT's sizes
■ BASE
  ■ L=12, H=768, A=12
  ■ total parameters = 110M
■ LARGE
  ■ L=24, H=1024, A=16
  ■ total parameters = 340M

ALBERT
■ A Lite BERT
■ factorized embedding parameters
■ cross-layer parameter sharing
■ inter-sentence coherence loss: Next Sentence Prediction → Sentence-Order Prediction
■ much smaller: no. of parameters 108M → 12M (base)

Sentence A = The man went to the store.
Sentence B = He bought a gallon of milk.
Label = IsNextSentence

Sentence A = The man went to the store.
Sentence B = Penguins are flightless.
Label = NotNextSentence
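To illustrate the parameter reduction claimed above (108M → 12M for the base models), here is a small sketch using the Hugging Face Transformers library (introduced later in these slides); the checkpoint names bert-base-uncased and albert-base-v2 are assumed standard Hub identifiers, and the weights are downloaded on first use.

```python
# Sketch: compare the parameter counts of BERT-base and ALBERT-base.
# Assumes the `transformers` and `torch` packages are installed.
from transformers import AutoModel

def count_parameters(checkpoint):
    model = AutoModel.from_pretrained(checkpoint)
    return sum(p.numel() for p in model.parameters())

for checkpoint in ("bert-base-uncased", "albert-base-v2"):
    print(f"{checkpoint}: {count_parameters(checkpoint) / 1e6:.0f}M parameters")
# Expected (roughly): ~110M for BERT-base, ~12M for ALBERT-base
```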
GPT
■ OpenAI
■ GPT-2: 1.5 billion parameters
■ GPT-3: 175 billion parameters
■ very good text generation → potentially harmful applications
  ■ misuse of language models
  ■ bias: generates stereotyped or prejudiced content (gender, race, religion)
■ Sep 2020: Microsoft has "exclusive" use of GPT-3

GPT-3's sizes

Model Name            n_params  n_layers  d_model  n_heads  d_head  Batch Size  Learning Rate
GPT-3 Small           125M      12        768      12       64      0.5M        6.0 × 10⁻⁴
GPT-3 Medium          350M      24        1024     16       64      0.5M        3.0 × 10⁻⁴
GPT-3 Large           760M      24        1536     16       96      0.5M        2.5 × 10⁻⁴
GPT-3 XL              1.3B      24        2048     24       128     1M          2.0 × 10⁻⁴
GPT-3 2.7B            2.7B      32        2560     32       80      1M          1.6 × 10⁻⁴
GPT-3 6.7B            6.7B      32        4096     32       128     2M          1.2 × 10⁻⁴
GPT-3 13B             13.0B     40        5140     40       128     2M          1.0 × 10⁻⁴
GPT-3 175B ("GPT-3")  175.0B    96        12288    96       128     3.2M        0.6 × 10⁻⁴

GPT-3 performance
[Figure: aggregate performance across benchmarks (accuracy) vs. parameters in the LM (billions), for few-shot, one-shot, and zero-shot settings]

T5: Text-To-Text Transfer Transformer
■ Google AI
■ transfer learning
■ C4: Colossal Clean Crawled Corpus
■ every task is cast as text-to-text, e.g.:
  ■ "translate English to German: That is good." → "Das ist gut."
  ■ "cola sentence: The course is jumping well." → "not acceptable"
  ■ "stsb sentence1: The rhino grazed on the grass. sentence2: A rhino is grazing in a field." → "3.8"
  ■ "summarize: state authorities dispatched emergency crews tuesday to survey the damage after an onslaught of severe weather in mississippi..." → "six people hospitalized after a storm in attala county."

Subword tokenizers
■ universal tokenization: subword units
■ Byte-Pair Encoding (BPE)
■ WordPiece
■ SentencePiece

Intrinsic evaluation
■ direct evaluation of word embeddings
■ semantic similarity (WordSim-353, SimLex-999, ...)
■ word analogy (Google Analogy, BATS (Bigger Analogy Test Set))
■ concept categorization (ESSLLI-2008)

Extrinsic evaluation
■ using the model in a downstream NLP task
■ Part-of-Speech Tagging, Noun Phrase Chunking, Named Entity Recognition, Shallow Syntax Parsing, Semantic Role Labeling, Sentiment Analysis, Text Classification, Paraphrase Detection, Textual Entailment Detection

Multi-task benchmarks
■ GLUE (https://gluebenchmark.com)
  nine sentence- or sentence-pair language understanding tasks
■ SuperGLUE (https://super.gluebenchmark.com)
  more difficult language understanding tasks
■ XTREME - Cross-Lingual Transfer Evaluation of Multilingual Encoders (https://sites.research.google/xtreme)
  40 typologically diverse languages, 9 tasks

Libraries and Frameworks
■ Dive into Deep Learning: online book
  https://d2l.ai
■ Hugging Face Transformers: many ready-to-use models
  https://huggingface.co/transformers
■ jiant: library, many tasks for evaluation
  https://jiant.info
■ GluonNLP: reproduction of latest research results
  https://nlp.gluon.ai
■ low-level libraries: NumPy, PyTorch, TensorFlow, MXNet
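As a minimal illustration of the ready-to-use models mentioned above, the following sketch runs T5's text-to-text interface through Hugging Face Transformers; the t5-small checkpoint is an assumption (any T5 size works), and the sentencepiece package is required by the tokenizer.

```python
# Sketch: T5 treats every task as text-to-text, selected by a prefix in the input.
# Assumes `transformers`, `torch` and `sentencepiece` are installed.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # expected: "Das ist gut."
```

Swapping in a "summarize:" or "cola sentence:" prefix should reproduce the other examples from the T5 slide, since those prefixes were part of T5's multi-task training mixture.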