Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning (based on slides by Danqi Chen, Princeton University)
Lecture 11: Question Answering

Lecture plan
1. What is question answering? (10 mins) / Your default final project!
2. Reading comprehension (50 mins) / How to answer questions over a single passage of text
3. Open-domain (textual) question answering (20 mins) / How to answer questions over a large collection of documents
Due today: Assignment 4; final project proposal. Hopefully Azure is working okay for everyone now.

1. What is question answering?
Question (Q) → Answer (A)
The goal of question answering is to build systems that automatically answer questions posed by humans in a natural language.
The earliest QA systems date back to the 1960s! (Simmons et al., 1964)
Example: Question: What do worms eat? Candidate answers are matched by dependency structure: "Worms eat grass" (complete agreement of dependencies) and "Grass is eaten by worms".

Question answering: a taxonomy
• What information source does a system build on? A text passage, all Web documents, knowledge bases, tables, images, ...
• Question type: factoid vs. non-factoid, open-domain vs. closed-domain, simple vs. compositional, ...
• Answer type: a short segment of text, a paragraph, a list, yes/no, ...

Lots of practical applications
Google search: "Where is the deepest lake in the world?"
Answer box: "Lake Baikal, in Siberia, holds the distinction of being both the deepest lake in the world and the largest freshwater lake, holding more than 20% of the unfrozen fresh water on the surface of Earth."
Google search: "How can I protect myself from COVID-19?" (another answer-box example)

Reading comprehension example
Passage: Tesla was the fourth of five children. He had an older brother named Dane and three sisters, Milka, Angelina and Marica. Dane was killed in a horse-riding accident when Nikola was five. In 1861, Tesla attended the "Lower" or "Primary" School in Smiljan where he studied German, arithmetic, and religion. In 1862, the Tesla family moved to Gospic, Austrian Empire, where Tesla's father worked as a pastor. Nikola completed "Lower" or "Primary" School, followed by the "Lower Real Gymnasium" or "Normal School."
Q: What language did Tesla study while in school?
A: German

2. Reading comprehension
Reading comprehension: building systems to comprehend a passage of text and answer questions about its content: (P, Q) → A
Passage: Kannada language is the official language of Karnataka and spoken as a native language by about 66.54% of the people as of 2011. Other linguistic minorities in the state were Urdu (10.83%), Telugu language (5.84%), Tamil language (3.45%), Marathi language (3.38%), Hindi (3.3%), Tulu language (2.61%), Konkani language (1.29%), Malayalam (1.27%) and Kodava Takk (0.18%). In 2007 the state had a birth rate of 2.2%, a death rate of 0.7%, an infant mortality rate of 5.5% and a maternal mortality rate of 0.2%. The total fertility rate was 2.2.
Q: Which linguistic minority is larger, Hindi or Malayalam?
A: Hindi

Why do we care about this problem?
• Useful for many practical applications
• Reading comprehension is an important testbed for evaluating how well computer systems understand human language
  • Wendy Lehnert 1977: "Since questions can be devised to query any aspect of text comprehension, the ability to answer questions is the strongest possible demonstration of understanding."
• Many other NLP tasks can be reduced to a reading comprehension problem:

Information extraction: (Barack Obama, educated_at, ?)
Question: Where did Barack Obama graduate from?
Passage: Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago.
(Levy et al., 2017)

Semantic role labeling: "UCD finished the 2006 championship as Dublin champions, by beating St Vincents in the final."
finished: Who finished something? UCD / What did someone finish? the 2006 championship / What did someone finish something as? Dublin champions / How did someone finish something? by beating St Vincents in the final
beating: Who beat someone? UCD / When did someone beat someone? in the final / Who did someone beat? St Vincents
(He et al., 2015)

Stanford question answering dataset (SQuAD)
• 100k annotated (passage, question, answer) triples. Large-scale supervised datasets are also a key ingredient for training effective neural models for reading comprehension!
• Passages are selected from English Wikipedia, usually 100-150 words long.
• Questions are crowd-sourced.
• Each answer is a short segment of text (or span) in the passage. This is a limitation: not all questions can be answered this way!
• SQuAD still remains the most popular reading comprehension dataset; it is "almost solved" today and the state of the art exceeds the estimated human performance.

Example passage: In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called "showers".
What causes precipitation to fall? gravity
What is another main form of precipitation besides drizzle, rain, snow, sleet and hail? graupel
Where do water droplets collide with ice crystals to form precipitation? within a cloud
(Rajpurkar et al., 2016): SQuAD: 100,000+ Questions for Machine Comprehension of Text

Stanford question answering dataset (SQuAD): evaluation
• Evaluation: exact match (0 or 1) and F1 (partial credit).
• For the development and test sets, 3 gold answers are collected, because there could be multiple plausible answers.
• We compare the predicted answer to each gold answer ("a", "an", "the" and punctuation are removed) and take the max score; finally, we average over all examples for both exact match and F1 (see the sketch below).
• Estimated human performance: EM = 82.3, F1 = 91.2

Q: What did Tesla do in December 1878?
Gold answers: {left Graz, left Graz, left Graz and severed all relations with his family}
Prediction: left Graz and severed
Exact match: max{0, 0, 0} = 0
F1: max{0.67, 0.67, 0.61} = 0.67
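The scoring just described can be written down compactly. Below is a minimal sketch of SQuAD-style evaluation for a single example; it is a simplification of the official evaluation script (which handles a few more normalization details), and the helper names are my own.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lower-case, strip punctuation and the articles a/an/the, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1(prediction, gold):
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def score(prediction, gold_answers):
    """Score one example: take the max over the (typically 3) gold answers."""
    return (max(exact_match(prediction, g) for g in gold_answers),
            max(f1(prediction, g) for g in gold_answers))

# The Tesla example above: EM = 0.0, F1 ≈ 0.67
print(score("left Graz and severed",
            ["left Graz", "left Graz",
             "left Graz and severed all relations with his family"]))
```

Dataset-level EM and F1 are then just the averages of these per-example scores.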
Other question answering datasets
• TriviaQA: questions and answers written by trivia enthusiasts, paired with independently collected web paragraphs that contain the answer and seem to discuss the question, but with no human verification that the paragraph actually supports the answer.
• Natural Questions: questions drawn from frequently asked Google search queries, with answers taken from Wikipedia paragraphs. The answer can be a substring, yes, no, or NOT_PRESENT, and is verified by human annotation.
• HotpotQA: questions constructed to be answered from the whole of Wikipedia, requiring information from two pages to answer a multi-step query.
  Q: Which novel by the author of "Armada" will be adapted as a feature film by Steven Spielberg?
  A: Ready Player One

Neural models for reading comprehension
How can we build a model to solve SQuAD? (We are going to use passage, paragraph and context, as well as question and query, interchangeably.)
• Problem formulation
  • Input: C = (c_1, c_2, ..., c_N), Q = (q_1, q_2, ..., q_M), with c_i, q_i ∈ V (typically N ≈ 100, M ≈ 15)
  • Output: 1 ≤ start ≤ end ≤ N, i.e., the answer is a span in the passage
• A family of LSTM-based models with attention (2016-2018): Attentive Reader (Hermann et al., 2015), Stanford Attentive Reader (Chen et al., 2016), Match-LSTM (Wang et al., 2017), BiDAF (Seo et al., 2017), Dynamic Coattention Network (Xiong et al., 2017), DrQA (Chen et al., 2017), R-Net (Wang et al., 2017), ReasoNet (Shen et al., 2017), ...
• Fine-tuning BERT-like models for reading comprehension (2019+)

LSTM-based vs BERT models
(figure: the BiDAF architecture with character/word embedding, phrase embedding, attention flow and modeling layers (image credit: Seo et al., 2017), next to a BERT encoder that takes "[CLS] question [SEP] passage" as input and predicts p_start and p_end (image credit: Jurafsky & Martin, 3rd edition))

Recap: seq2seq model with attention
• Instead of source and target sentences, we now have two sequences: the passage and the question (their lengths are unbalanced).
• We need to model which words in the passage are most relevant to the question (and which question words). Attention is the key ingredient here, just as in machine translation we model which words in the source sentence are most relevant to the current target word.
• We don't need an autoregressive decoder to generate the target sentence word by word. Instead, we just need to train two classifiers to predict the start and end positions of the answer!

BiDAF: the Bidirectional Attention Flow model
Layers, bottom to top: character embedding layer, word embedding layer, phrase embedding layer, attention flow layer (Query2Context and Context2Query attention), modeling layer, output layer (start and end).
(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension

BiDAF: Encoding
• Use a concatenation of a word embedding (GloVe) and a character embedding (CNNs over character embeddings) for each word in the context and the query:
  e(c_i) = f([GloVe(c_i); charEmb(c_i)])
  e(q_i) = f([GloVe(q_i); charEmb(q_i)])
  (f: highway networks, omitted here)
• Then, use two bidirectional LSTMs separately to produce contextual embeddings for both context and query:
  →c_i = LSTM(→c_{i-1}, e(c_i)) ∈ ℝ^H     ←c_i = LSTM(←c_{i+1}, e(c_i)) ∈ ℝ^H
  →q_i = LSTM(→q_{i-1}, e(q_i)) ∈ ℝ^H     ←q_i = LSTM(←q_{i+1}, e(q_i)) ∈ ℝ^H
  Each word is then represented by the concatenation c_i = [→c_i; ←c_i] ∈ ℝ^{2H} and q_i = [→q_i; ←q_i] ∈ ℝ^{2H} (see the sketch below).
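Here is a minimal PyTorch sketch of the encoding layer just described: GloVe word embeddings concatenated with a character-level CNN, followed by a bidirectional LSTM. The module names and dimensions are illustrative assumptions rather than the reference implementation, and the highway network is omitted as on the slide.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level CNN: embed characters, convolve, max-pool over the word."""
    def __init__(self, num_chars, char_dim=16, out_dim=100, kernel=5):
        super().__init__()
        self.emb = nn.Embedding(num_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, out_dim, kernel_size=kernel, padding=2)

    def forward(self, chars):                                   # (batch, seq_len, word_len)
        b, n, w = chars.shape
        x = self.emb(chars.view(b * n, w)).transpose(1, 2)      # (b*n, char_dim, word_len)
        x = torch.relu(self.conv(x)).max(dim=2).values          # max-pool over characters
        return x.view(b, n, -1)                                 # (batch, seq_len, out_dim)

class BiDAFEncoder(nn.Module):
    def __init__(self, vocab_size, num_chars, word_dim=300, char_dim=100, H=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)      # initialize with GloVe in practice
        self.char_cnn = CharCNN(num_chars, out_dim=char_dim)
        self.bilstm = nn.LSTM(word_dim + char_dim, H,
                              bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        e = torch.cat([self.word_emb(word_ids), self.char_cnn(char_ids)], dim=-1)
        h, _ = self.bilstm(e)                                   # (batch, seq_len, 2H)
        return h
```

Per the slide, separate BiLSTM encoders would be applied to the context and the query, e.g. `h_c = context_encoder(c_word_ids, c_char_ids)` and `h_q = query_encoder(q_word_ids, q_char_ids)`.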
BiDAF: Attention (context-to-query)
Context-to-query attention: for each context word, choose the most relevant words from the query.
Q: Who leads the United States?
C: Barack Obama is the president of the USA.
For each context word, find the most relevant query word.
(Slides adapted from Minjoon Seo)

BiDAF: Attention (query-to-context)
Query-to-context attention: choose the context words that are most relevant to one of the query words.
C: While Seattle's weather is very nice in summer, its weather is very rainy in winter, making it one of the most gloomy cities in the U.S. LA is ...
Q: Which city is gloomy in winter?
(Slides adapted from Minjoon Seo)

BiDAF: Attention
• First, compute a similarity score for every pair (c_i, q_j):
  S_ij = w_sim^T [c_i; q_j; c_i ⊙ q_j] ∈ ℝ,   where w_sim ∈ ℝ^{6H}
• Context-to-query attention (which question words are more relevant to c_i):
  α_ij = softmax_j(S_ij) ∈ ℝ,   a_i = Σ_{j=1..M} α_ij q_j ∈ ℝ^{2H}
• Query-to-context attention (which context words are relevant to some question words):
  β_i = softmax_i(max_{j=1..M} S_ij) ∈ ℝ,   b = Σ_{i=1..N} β_i c_i ∈ ℝ^{2H}
• The final output is g_i = [c_i; a_i; c_i ⊙ a_i; c_i ⊙ b] ∈ ℝ^{8H}

BiDAF: Modeling and output layers
• Modeling layer: pass g_i to another two layers of bidirectional LSTMs:
  m_i = BiLSTM(g_i) ∈ ℝ^{2H}
  • The attention layer models interactions between query and context
  • The modeling layer models interactions within the context words
• Output layer: two classifiers predicting the start and end positions:
  p_start = softmax(w_start^T [g_i; m_i])
  p_end = softmax(w_end^T [g_i; m'_i]),   where m'_i = BiLSTM(m_i) ∈ ℝ^{2H} and w_start, w_end ∈ ℝ^{10H}
• The final training loss is L = -log p_start(s*) - log p_end(e*)
A sketch of the attention flow layer follows below.
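A minimal PyTorch sketch of the attention flow layer defined above, with the batch dimension omitted for clarity. The function and variable names are my own: `c` holds the context vectors (N, 2H), `q` the query vectors (M, 2H), and `w_sim` the learned weight of size 6H.

```python
import torch

def bidaf_attention(c, q, w_sim):
    N, M = c.size(0), q.size(0)

    # Similarity matrix S[i, j] = w_sim . [c_i; q_j; c_i * q_j]
    c_exp = c.unsqueeze(1).expand(N, M, -1)
    q_exp = q.unsqueeze(0).expand(N, M, -1)
    S = torch.cat([c_exp, q_exp, c_exp * q_exp], dim=-1) @ w_sim   # (N, M)

    # Context-to-query: for each context word, a weighted sum of query vectors
    alpha = torch.softmax(S, dim=1)                                # (N, M)
    a = alpha @ q                                                  # (N, 2H)

    # Query-to-context: attend over context words using the max similarity per row
    beta = torch.softmax(S.max(dim=1).values, dim=0)               # (N,)
    b = (beta.unsqueeze(1) * c).sum(dim=0)                         # (2H,)

    # g_i = [c_i; a_i; c_i * a_i; c_i * b]  (b broadcasts over all positions)
    g = torch.cat([c, a, c * a, c * b], dim=-1)                    # (N, 8H)
    return g
```

Concatenating g_i with the modeling-layer outputs m_i then gives the 10H-dimensional features fed to the start/end classifiers above.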
BiDAF: Performance on SQuAD
This model achieved 77.3 F1 on SQuAD v1.1.
• Without context-to-query attention ⇒ 67.7 F1
• Without query-to-context attention ⇒ 73.7 F1
• Without character embeddings ⇒ 75.4 F1
F1 on SQuAD v1.1 at the time: Logistic regression 51.0; Fine-Grained Gating (Carnegie Mellon U) 73.3; Match-LSTM (Singapore Management U) 73.7; DCN (Salesforce) 75.9; BiDAF (UW & Allen Institute) 77.3; Multi-Perspective Matching (IBM) 78.7; ReasoNet (MSR Redmond) 79.4; DrQA (Chen et al., 2017) 79.4; r-net (MSR Asia) [Wang et al., ACL 2017] 79.7; Human performance 91.2
(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension

Attention visualization
(figure: for the "Super Bowl 50" passage below, query words such as "Where", "did", "Super", "Bowl", "50" and "initiatives" are shown with the context words they attend to most; e.g. "Where" attends to location words such as "at", "Stadium", "Levi's" and "Santa", while "Super" and "Bowl" attend to the occurrences of "Super Bowl" in the passage.)
Passage: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24-10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

BERT for reading comprehension
• BERT is a deep bidirectional Transformer encoder pre-trained on large amounts of text (Wikipedia + BooksCorpus)
• BERT is pre-trained on two training objectives:
  • Masked language model (MLM)
  • Next sentence prediction (NSP)
• BERT-base has 12 layers and 110M parameters; BERT-large has 24 layers and 340M parameters

BERT for reading comprehension
• Question = Segment A, Passage = Segment B, Answer = predicting two endpoints in Segment B
• Input: [CLS] question [SEP] passage [SEP]
• L = -log p_start(s*) - log p_end(e*)
  p_start(i) = softmax_i(w_start^T h_i)
  p_end(i) = softmax_i(w_end^T h_i)
  where h_i is the hidden vector of the i-th passage token returned by BERT
Example question: How many parameters does BERT-large have?
Reference text: BERT-large is really big... it has 24 layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance.
Image credit: https://mccormickml.com/

BERT for reading comprehension
• All the BERT parameters (e.g., 110M) as well as the newly introduced parameters w_start, w_end (e.g., 768 × 2 = 1,536) are optimized together for the loss L (see the sketch below).
• It works amazingly well. Stronger pre-trained language models lead to even better performance, and SQuAD has become a standard dataset for testing pre-trained models.
F1 / EM (dev set, except for human performance):
Human performance  91.2*  82.3*
BiDAF              77.3   67.7
BERT-base          88.5   80.8
BERT-large         90.9   84.1
XLNet              94.5   89.0
RoBERTa            94.6   88.9
ALBERT             94.8   89.3
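A minimal sketch of this fine-tuning setup, using the Hugging Face transformers library. The hand-rolled two-vector head mirrors the formulas for p_start and p_end; in practice one would typically use the packaged BertForQuestionAnswering, and the gold start/end positions must first be mapped into token space.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
qa_head = nn.Linear(bert.config.hidden_size, 2)     # w_start and w_end stacked

def qa_loss(question, passage, start_gold, end_gold):
    # Segment A = question, Segment B = passage: "[CLS] question [SEP] passage [SEP]"
    enc = tokenizer(question, passage, return_tensors="pt")
    h = bert(**enc).last_hidden_state                # (1, seq_len, hidden)
    start_logits, end_logits = qa_head(h).split(1, dim=-1)
    log_p_start = torch.log_softmax(start_logits.squeeze(-1), dim=-1)
    log_p_end = torch.log_softmax(end_logits.squeeze(-1), dim=-1)
    # L = -log p_start(s*) - log p_end(e*), with gold positions given in token space
    return -(log_p_start[0, start_gold] + log_p_end[0, end_gold])
```

Both the BERT weights and the small QA head are updated by backpropagating this loss.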
Comparisons between BiDAF and BERT models
• The BERT model has many more parameters (110M or 340M); BiDAF has about 2.5M parameters.
• BiDAF is built on top of several bidirectional LSTMs, while BERT is built on top of Transformers (no recurrent architecture, and easier to parallelize).
• BERT is pre-trained, while BiDAF is only built on top of GloVe (and all the remaining parameters need to be learned from the supervised datasets).
Pre-training is clearly a game changer, but it is expensive...

Comparisons between BiDAF and BERT models
Are they really fundamentally different? Probably not.
• BiDAF and other models aim to model the interactions between question and passage.
• BERT uses self-attention over the concatenation of question and passage = attention(P, P) + attention(P, Q) + attention(Q, P) + attention(Q, Q)
• (Clark and Gardner, 2018) show that adding a self-attention layer for the passage, attention(P, P), to BiDAF also improves performance.
(figure: a stack of Transformer layers over the concatenated question and passage)

Can we design better pre-training objectives?
The answer is yes! Consider "Super Bowl 50 was an American football game to determine the champion ..." with the span "an American football game" (positions x_5 ... x_8) replaced by [MASK] tokens at the input to the Transformer encoder. For the word "football" at position x_7:
  L(football) = L_MLM(football) + L_SBO(football)
              = -log P(football | x_7) - log P(football | x_4, x_9, p_3)
Two ideas:
1) Mask contiguous spans of words instead of 15% random words.
2) Use the two endpoints of the span to predict all the masked words in between = compressing the information of a span into its two endpoints (a sketch follows below):
  y_i = f(x_{s-1}, x_{e+1}, p_{i-s+1})
(Joshi & Chen et al., 2020): SpanBERT: Improving Pre-training by Representing and Predicting Spans

SpanBERT performance
(figure: F1 scores of Google BERT, the authors' reimplemented BERT baseline, and SpanBERT on SQuAD v1.1, SQuAD v2.0, NewsQA, TriviaQA, SearchQA, HotpotQA and Natural Questions; SpanBERT outperforms both BERT baselines on every dataset, e.g. 94.6 F1 on SQuAD v1.1.)
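A minimal sketch of the span boundary objective (idea 2 above): each masked token inside a span is predicted from the encodings of the two boundary tokens plus a relative-position embedding. The two-layer MLP and all names and dimensions are assumptions based on the description above, not the released SpanBERT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanBoundaryObjective(nn.Module):
    def __init__(self, hidden, vocab_size, max_span_len=10):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span_len, hidden)
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.GELU(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, h, span_start, span_end, target_ids):
        # h: (seq_len, hidden) encoder outputs; the masked span covers positions
        # span_start..span_end (inclusive); target_ids are the original token ids.
        losses = []
        for i, target in zip(range(span_start, span_end + 1), target_ids):
            rel_pos = self.pos_emb(torch.tensor(i - span_start))          # p_{i-s+1}
            # Predict token i from the two boundary encodings and its relative position
            y = self.mlp(torch.cat([h[span_start - 1], h[span_end + 1], rel_pos]))
            losses.append(F.cross_entropy(y.unsqueeze(0), torch.tensor([target])))
        return torch.stack(losses).mean()   # added to the usual MLM loss
```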
Is reading comprehension solved?
• We have already surpassed human performance on SQuAD. Does that mean reading comprehension is solved? Of course not!
• Current systems still perform poorly on adversarial examples and on examples from out-of-domain distributions.

Article: Super Bowl 50
Paragraph: "Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver's Executive Vice President of Football Operations and General Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV."
Question: "What is the name of the quarterback who was 38 in Super Bowl XXXIII?"
Original prediction: John Elway
Prediction under adversary: Jeff Dean

F1 under adversarial evaluation (the last sentence of the paragraph is the adversarially added distractor):
Model              Original  AddSent  AddOneSent  AddAny  AddCommon
Match-LSTM Single  71.4      27.3     39.0        7.6     38.9
Match-LSTM Ens.    75.4      29.4     41.8        11.7    51.0
BiDAF Single       75.5      34.3     45.7        4.8     41.7
BiDAF Ens.         80.0      34.2     46.9        2.7     52.6
(Jia and Liang, 2017): Adversarial Examples for Evaluating Reading Comprehension Systems

Is reading comprehension solved?
Systems trained on one dataset can't generalize to other datasets (F1):
Trained on \ Evaluated on   SQuAD  TriviaQA  NQ    QuAC  NewsQA
SQuAD                       75.6   46.7      48.7  20.2  41.1
TriviaQA                    49.8   58.7      42.1  20.4  10.5
NQ                          53.5   46.3      73.5  21.6  24.7
QuAC                        39.4   33.1      33.8  33.3  13.8
NewsQA                      52.1   38.4      41.7  20.4  60.1
(Sen and Saffari, 2020): What do Models Learn from Question Answering Datasets?

Is reading comprehension solved?
CheckList behavioral tests of a BERT-large model trained on SQuAD (test type, failure rate, and an example with expected answer vs. model prediction):
• MFT: comparisons (20.0%). C: Victoria is younger than Dylan. Q: Who is less young? A: Dylan. Predicted: Victoria.
• MFT: intensifiers to superlative, most/least (91.3%). C: Anna is worried about the project. Matthew is extremely worried about the project. Q: Who is least worried about the project? A: Anna. Predicted: Matthew.
• MFT: match properties to categories (82.4%). C: There is a tiny purple box in the room. Q: What size is the box? A: tiny. Predicted: purple.
• MFT: nationality vs. job (49.4%). C: Stephanie is an Indian accountant. Q: What is Stephanie's job? A: accountant. Predicted: Indian accountant.
• MFT: animal vs. vehicles (26.2%). C: Jonathan bought a truck. Isabella bought a hamster. Q: Who bought an animal? A: Isabella. Predicted: Jonathan.
• MFT: comparison to antonym (67.3%). C: Jacob is shorter than Kimberly. Q: Who is taller? A: Kimberly. Predicted: Jacob.
• MFT: more/less in context, more/less antonym in question (100.0%). C: Jeremy is more optimistic than Taylor. Q: Who is more pessimistic? A: Taylor. Predicted: Jeremy.
• INV: swap adjacent characters in Q (typo) (11.6%). C: ... Newcomen designs had a duty of about 7 million, but most were closer to 5 million ... Q: What was the ideal duty → udty of a Newcomen engine? Prediction changes from "7 million" to "5 million".
• INV: add an irrelevant sentence to C (9.8%). (no example given)
(Ribeiro et al., 2020): Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

Is reading comprehension solved?
More CheckList tests of the same BERT-large model trained on SQuAD:
• MFT: change in one person only (41.5%). C: Both Luke and Abigail were writers, but there was a change in Abigail, who is now a model. Q: Who is a model? A: Abigail. Predicted: "Abigail were writers, but there was a change in Abigail".
• Temporal MFT: understanding before/after, last/first (82.9%). C: Logan became a farmer before Danielle did. Q: Who became a farmer last? A: Danielle. Predicted: Logan.
• Negation MFT: context has negation (67.5%). C: Aaron is not a writer. Rebecca is. Q: Who is a writer? A: Rebecca. Predicted: Aaron.
• Negation MFT: question has negation, context does not (100.0%). C: Aaron is an editor. Mark is an actor. Q: Who is not an actor? A: Aaron. Predicted: Mark.
• Coreference MFT: simple coreference, he/she (100.0%). C: Melissa and Antonio are friends. He is a journalist, and she is an adviser. Q: Who is a journalist? A: Antonio. Predicted: Melissa.
• Coreference MFT: simple coreference, his/her (100.0%). C: Victoria and Alex are friends. Her mom is an agent. Q: Whose mom is an agent? A: Victoria. Predicted: Alex.
• Coreference MFT: former/latter (100.0%). C: Kimberly and Jennifer are friends. The former is a teacher. Q: Who is a teacher? A: Kimberly. Predicted: Jennifer.
• SRL MFT: subject/object distinction (60.8%). C: Richard bothers Elizabeth. Q: Who is bothered? A: Elizabeth. Predicted: Richard.
• SRL MFT: subject/object distinction with 3 agents (95.7%). C: Jose hates Lisa. Kevin is hated by Lisa. Q: Who hates Kevin? A: Lisa. Predicted: Jose.
(Ribeiro et al., 2020): Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

3. Open-domain question answering
• Different from reading comprehension, we don't assume a given passage.
• Instead, we only have access to a large collection of documents (e.g., Wikipedia). We don't know where the answer is located, and the goal is to return the answer to any open-domain question.
• A much more challenging, and much more practical, problem! In contrast, closed-domain systems deal with questions under a specific domain (medicine, technical support, ...).

Retriever-reader framework
Example: "How many of Warsaw's inhabitants spoke Polish in 1933?" → Document Retriever (over Wikipedia) → Document Reader → 833,500
https://github.com/facebookresearch/DrQA
Chen et al., 2017. Reading Wikipedia to Answer Open-domain Questions

Retriever-reader framework
• Input: a large collection of documents D = D_1, D_2, ..., D_N and a question Q
• Output: an answer string A
• Retriever: f(D, Q) → P_1, ..., P_K (K is pre-defined, e.g., 100)
• Reader: g(Q, {P_1, ..., P_K}) → A; a reading comprehension problem!
In DrQA (a sketch of this pipeline follows below):
• Retriever = a standard TF-IDF information-retrieval sparse model (a fixed module)
• Reader = a neural reading comprehension model, as we just learned
• Trained on SQuAD and other distantly supervised QA datasets. Distantly supervised examples: (Q, A) → (P, Q, A)
Chen et al., 2017. Reading Wikipedia to Answer Open-domain Questions
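A minimal sketch of the retriever-reader pipeline: a TF-IDF retriever (here scikit-learn's TfidfVectorizer, standing in for DrQA's TF-IDF retriever) selects the top-K passages, and a placeholder `reader` function stands in for the neural reading comprehension model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(question, documents, k=5):
    """Return the k documents whose TF-IDF vectors best match the question."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    doc_vecs = vectorizer.fit_transform(documents)
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vecs)[0]
    top_k = scores.argsort()[::-1][:k]
    return [documents[i] for i in top_k]

def answer(question, documents, reader, k=5):
    """reader(question, passage) is assumed to return a (span, score) pair."""
    passages = retrieve(question, documents, k)
    candidates = [reader(question, p) for p in passages]
    return max(candidates, key=lambda c: c[1])[0]   # best-scoring span overall
```

In a real system the document index is built once offline rather than re-fit per question.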
We can train the retriever too
• Joint training of the retriever and the reader, with passages and questions encoded as dense vectors by BERT (figure omitted).

Going further, the retrieval stage can be dropped altogether: Roberts et al. (2020) fine-tune a large pre-trained T5 model to answer open-domain questions purely from the knowledge stored in its parameters.
(figure: during pre-training the model fills in masked spans, e.g. "Lily couldn't <M>. The waitress had brought the largest <M> of chocolate cake <M> seen." → "believe her eyes", "piece", "she had ever"; "Our hand-picked and sun-dried <M> peaches are at our orchard in Georgia."; "President Franklin <M> Roosevelt was <M> in January 1882."; during fine-tuning it maps "When was Franklin D. Roosevelt born?" to "1882".)
Roberts et al., 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model?

Maybe the reader model is not necessary either!
It is possible to encode all the phrases (60 billion phrases in Wikipedia) using dense vectors and only do nearest-neighbor search, without a BERT model at inference time!
Example: the sentence "Barack Obama (1961-present) was the 44th President of the United States." is pre-processed with phrase indexing, encoding phrases such as "Barack Obama", "1961-present" and "44th President of the United States" as vectors. Questions such as "Who is the 44th President of the U.S.?" or "When was Obama born?" are then encoded with a question encoder, and the answer is retrieved by nearest-neighbor search over the phrase vectors (see the sketch below).
Seo et al., 2019. Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index
Lee et al., 2020. Learning Dense Representations of Phrases at Scale
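A minimal sketch of this last idea: pre-compute a dense vector for every candidate phrase offline, then answer a question with a single nearest-neighbor (maximum inner product) search. The encoders are placeholders, and a real system indexes billions of phrase vectors with an approximate nearest-neighbor library such as FAISS rather than a dense NumPy matrix.

```python
import numpy as np

def build_index(phrases, phrase_encoder):
    """Offline: encode every candidate phrase into a dense vector."""
    vectors = np.stack([phrase_encoder(p) for p in phrases])   # (num_phrases, d)
    return phrases, vectors

def answer(question, index, question_encoder):
    """Online: encode the question and return the highest-scoring phrase."""
    phrases, vectors = index
    q = question_encoder(question)                             # (d,)
    scores = vectors @ q                                       # inner-product search
    return phrases[int(np.argmax(scores))]
```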