Large Language Models (LLM)
PA154 Language Modeling (11.1)
Pavel Rychlý
pary@fi.muni.cz
April 27, 2023

Large Models
■ bigger is better
■ many layers
■ need big machines
■ use advanced hardware: GPUs, TPUs, on multiple servers

Usage of Large Models
■ training of big models on huge data is expensive (long training time)
■ fine-tuning on small data of the target task
■ combining the language model with an additional NN/layer, training only the new layer
■ the big model is frozen, only used
[Figure: task-specific input transformations for classification, entailment, and similarity — Start/Text/Delim/Extract token sequences fed to a Transformer followed by a linear layer]

Pre-trained models
■ word2vec, fastText: pre-trained word embeddings
■ transformers: BERT
■ transformer modifications: RoBERTa, ALBERT, ...
■ language-specific models
■ multilingual models

Pre-trained fastText
■ 157 languages
■ word vectors with dimension 300
■ up to 1 or 2 million words
■ Czech:
  ■ 2 million words
  ■ text format: 1.2 GB, binary format: 4.2 GB
■ Breton:
  ■ 602k words
  ■ text format: 340 MB, binary format: 4.2 GB

Pre-trained fastText
Czech embeddings trained on Common Crawl: cc.cs.300.vec.gz
■ first line: vocabulary size and vector dimension, then one word per line followed by its 300 values (shown truncated below):
2000000 300
,  0.0052  0.1646  0.0675  0.0577  0.2342  0.0089  0.1601  0.0240 ...
.  0.0485  0.0674  0.0261  0.0220 -0.0779 -0.0309 -0.2006  0.0100 ...
a  0.1253  0.0177  0.0770 -0.0103  0.0687  0.0175  0.0171  0.0013 ...
-  0.0251 -0.0350  0.0364  0.0349  0.0159 -0.0586 -0.4607 ...
:  -0.0715 -0.0175 -0.0210  0.0818 -0.0174 -0.0204  0.0574 ...
v  0.1013  0.1792 -0.0174  0.0365  0.0920  0.0802 -0.1830  0.0271 ...
na -0.1200  0.2000  0.2071  0.0144  0.3272 -0.0145 -0.1196 ...
)  0.0614 -0.1514  0.0203  0.1658  0.0958 -0.0628 -0.0841 ...
se -0.1456  0.1170  0.0285 -0.0062 -0.0890 -0.0042 -0.0969 ...
C  0.0671 -0.1871  0.0332  0.1324  0.1774 -0.0685  0.0082 ...
"  0.1381 -0.2536  0.0805  0.0379  0.2684 -0.0038  0.0437 ...
je  0.0170 -0.1937  0.0388 -0.0084  0.1255 -0.0953 -0.0267 ...
-  -0.2497 -0.0093  0.1759 -0.0839  0.1842 -0.0276  0.1605 ...
s  0.0081 -0.0854 -0.0566  0.0116 -0.5178 -0.0091 -0.2048 ...
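To make the .vec layout above concrete, here is a minimal sketch in plain Python that reads the head of such a file; the path assumes the downloaded cc.cs.300.vec.gz, and the limit parameter and variable names are only illustrative.

```python
import gzip

def load_vectors(path, limit=1000):
    """Read the first `limit` word vectors from a fastText .vec.gz file."""
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())  # header line: "2000000 300"
        for _ in range(min(limit, vocab_size)):
            parts = f.readline().rstrip().split(" ")
            word, values = parts[0], parts[1:]
            vectors[word] = [float(x) for x in values]     # `dim` floats per word
    return vectors

vecs = load_vectors("cc.cs.300.vec.gz")
print(len(vecs), len(vecs["a"]))  # e.g. 1000 300
```

In practice the same text format can also be loaded directly with gensim's KeyedVectors.load_word2vec_format, or the binary model with the fasttext package.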
Scaling transformers
■ main factors:
  ■ number of model parameters N
  ■ size of the dataset D
  ■ amount of compute C
■ evaluation on test loss (cross-entropy)
■ there is a capacity limit for a fixed N, D, or C
■ performance improves predictably as long as we scale up N and D in tandem

Scaling transformers
■ number of model parameters N
■ size of the dataset D
■ amount of compute C
[Figure: test loss as a function of compute (PF-days, non-embedding), dataset size (tokens), and parameters (non-embedding)]

Scaling transformers
■ larger models require fewer samples to reach the same performance
■ larger models are much slower per sample
■ smaller models reach the same performance faster
[Figure: test loss vs. tokens processed and vs. compute (PF-days)]

Scaling transformers
■ a bigger dataset reduces overfitting
[Figure: test loss vs. train loss over training steps, N = 300M parameters]

BERT
■ BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
■ encoder only (not language modelling)
■ pre-training on raw text
■ masked tokens, is-next-sentence
■ big pre-trained models available
■ domain (task) adaptation

Input: The man went to the [MASK]1. He bought a [MASK]2 of milk.
Labels: [MASK]1 = store; [MASK]2 = gallon

Sentence A = The man went to the store.
Sentence B = He bought a gallon of milk.
Label = IsNextSentence

Sentence A = The man went to the store.
Sentence B = Penguins are flightless.
Label = NotNextSentence

BERT's sizes
■ BASE
  ■ L=12, H=768, A=12
  ■ total parameters = 110M
■ LARGE
  ■ L=24, H=1024, A=16
  ■ total parameters = 340M

ALBERT
■ A Lite BERT
■ factorized embedding parameters
■ cross-layer parameter sharing
■ inter-sentence coherence loss: Next Sentence Prediction → Sentence-Order Prediction
■ much smaller: no. of parameters 108M → 12M (base)

Sentence A = The man went to the store.
Sentence B = He bought a gallon of milk.
Label = IsNextSentence

Sentence A = The man went to the store.
Sentence B = Penguins are flightless.
Label = NotNextSentence
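To illustrate the parameter reduction claimed above (108M → 12M for the base models), here is a small sketch using the Hugging Face Transformers library (introduced later in these slides); the checkpoint names bert-base-uncased and albert-base-v2 are assumed standard Hub identifiers, and the weights are downloaded on first use.

```python
# Sketch: compare the parameter counts of BERT-base and ALBERT-base.
# Assumes the `transformers` and `torch` packages are installed.
from transformers import AutoModel

def count_parameters(checkpoint):
    model = AutoModel.from_pretrained(checkpoint)
    return sum(p.numel() for p in model.parameters())

for checkpoint in ("bert-base-uncased", "albert-base-v2"):
    print(f"{checkpoint}: {count_parameters(checkpoint) / 1e6:.0f}M parameters")
# Expected (roughly): ~110M for BERT-base, ~12M for ALBERT-base
```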
GPT
■ OpenAI
■ GPT-2: 1.5 billion parameters
■ GPT-3: 175 billion parameters
■ very good text generation → potentially harmful applications
  ■ misuse of language models
  ■ bias: generates stereotyped or prejudiced content (gender, race, religion)
■ Sep 2020: Microsoft has "exclusive" use of GPT-3

GPT-3's sizes

Model Name            n_params  n_layers  d_model  n_heads  d_head  Batch Size  Learning Rate
GPT-3 Small           125M      12        768      12       64      0.5M        6.0 × 10⁻⁴
GPT-3 Medium          350M      24        1024     16       64      0.5M        3.0 × 10⁻⁴
GPT-3 Large           760M      24        1536     16       96      0.5M        2.5 × 10⁻⁴
GPT-3 XL              1.3B      24        2048     24       128     1M          2.0 × 10⁻⁴
GPT-3 2.7B            2.7B      32        2560     32       80      1M          1.6 × 10⁻⁴
GPT-3 6.7B            6.7B      32        4096     32       128     2M          1.2 × 10⁻⁴
GPT-3 13B             13.0B     40        5140     40       128     2M          1.0 × 10⁻⁴
GPT-3 175B ("GPT-3")  175.0B    96        12288    96       128     3.2M        0.6 × 10⁻⁴

GPT-3 performance
[Figure: aggregate performance across benchmarks (accuracy) vs. parameters in the LM (billions), for few-shot, one-shot, and zero-shot settings]

T5: Text-To-Text Transfer Transformer
■ Google AI
■ transfer learning
■ C4: Colossal Clean Crawled Corpus
■ every task is cast as text-to-text, e.g.:
  ■ "translate English to German: That is good." → "Das ist gut."
  ■ "cola sentence: The course is jumping well." → "not acceptable"
  ■ "stsb sentence1: The rhino grazed on the grass. sentence2: A rhino is grazing in a field." → "3.8"
  ■ "summarize: state authorities dispatched emergency crews tuesday to survey the damage after an onslaught of severe weather in mississippi..." → "six people hospitalized after a storm in attala county."

Subword tokenizers
■ universal tokenization: subword units
■ Byte-Pair Encoding (BPE)
■ WordPiece
■ SentencePiece

Intrinsic evaluation
■ direct evaluation of word embeddings
■ semantic similarity (WordSim-353, SimLex-999, ...)
■ word analogy (Google Analogy, BATS (Bigger Analogy Test Set))
■ concept categorization (ESSLLI-2008)

Extrinsic evaluation
■ using the model in a downstream NLP task
■ Part-of-Speech Tagging, Noun Phrase Chunking, Named Entity Recognition, Shallow Syntax Parsing, Semantic Role Labeling, Sentiment Analysis, Text Classification, Paraphrase Detection, Textual Entailment Detection

Multi-task benchmarks
■ GLUE (https://gluebenchmark.com)
  nine sentence- or sentence-pair language understanding tasks
■ SuperGLUE (https://super.gluebenchmark.com)
  more difficult language understanding tasks
■ XTREME - Cross-Lingual Transfer Evaluation of Multilingual Encoders (https://sites.research.google/xtreme)
  40 typologically diverse languages, 9 tasks

Libraries and Frameworks
■ Dive into Deep Learning: online book
  https://d2l.ai
■ Hugging Face Transformers: many ready-to-use models
  https://huggingface.co/transformers
■ jiant: library, many tasks for evaluation
  https://jiant.info
■ GluonNLP: reproduction of latest research results
  https://nlp.gluon.ai
■ low-level libraries: NumPy, PyTorch, TensorFlow, MXNet
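As a minimal illustration of the ready-to-use models mentioned above, the following sketch runs T5's text-to-text interface through Hugging Face Transformers; the t5-small checkpoint is an assumption (any T5 size works), and the sentencepiece package is required by the tokenizer.

```python
# Sketch: T5 treats every task as text-to-text, selected by a prefix in the input.
# Assumes `transformers`, `torch` and `sentencepiece` are installed.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # expected: "Das ist gut."
```

Swapping in a "summarize:" or "cola sentence:" prefix should reproduce the other examples from the T5 slide, since those prefixes were part of T5's multi-task training mixture.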