Natural Language Processing
^> f*\ with Deep Learning
CS224N/Ling284
Lecture 11: Question Answering
Danqi Chen
Princeton University
Lecture plan
1. What is question answering? (10 mins)
Your default final project!
2. Reading comprehension (50 mins) / How to answer questions over a single passage of text
3. Open-domain (textual) question answering (20 mins)
/ How to answer questions over a large collection of documents
2
1. What is question answering?
Question (Q)
Answer (A)
The goal of question answering is to build systems that automatically answer questions posed by humans in a natural language
The earliest QA systems dated back to 1960s!
(Simmons et al., 1964)
Question:
a)   What do worms eat?
Answers:
b)   Worms eat grass
worms eat
grass
worms eat
what
c)   Grass is eaten by worms worms eat grass worms
eat
grass
(complete agreement of dependencies)
3
Question answering: a taxonomy
Question (Q)
Answer (A)
• What information source does a system build on?
• A text passage, all Web documents, knowledge bases, tables, images..
• Question type
• Factoid vs non-factoid, open-domain vs closed-domain, simple vs compositional, ..
• Answer type
• A short segment of text, a paragraph, a list, yes/no, ...
4
Lots of practical applications
Go gle
Where is the deepest lake in the world?
Q. All     Q Maps     0 Images     GD News     □ Videos     : More
Settings Tools
About 21,100,000 results (0.71 seconds)
Siberia
Lake Baikal, in Siberia, holds the distinction of being both the deepest lake in the world and the largest freshwater lake, holding more than 20% of the unfrozen fresh water on the surface of Earth.
5
Lots of practical applications
Go gle
How can I protect myself from COVID-19?
Q, All     0 Images     fJD News     <? Shopping     Q] Videos     : More
Settings
Tools
The best way to prevent illness is to avoid being exposed to this virus. Learn how COVID-19 spreads and practice these actions to help prevent the spread of this illness.
To help prevent the spread of COVID-19:
• Cover your mouth and nose with a mask when around people who don't live with you. Masks work best when everyone wears one.
• Stay at least 6 feet (about 2 arm lengths) from others.
• Avoid crowds. The more people you are in contact with, the more likely you are to be exposed to
• Avoid unventilated indoor spaces. If indoors, bring in fresh air by opening windows and doors.
• Clean your hands often, either with soap and water for 20 seconds or a hand sanitizer that contains at least 60% alcohol.
• Get vaccinated against COVID-19 when it's your turn.
• Avoid close contact with people who are sick.
• Cover your cough or sneeze with a tissue, then throw the tissue in the trash.
• Clean and disinfect frequently touched objects and surfaces daily.
^)  Learn more on cdc.gov
COVID-19.
For informational purposes only. Consult your local medical authority for advice.
6
Lots of practical applications
Smart Speaker Use Case Frequency January 2020
Listen to streaming music service
39.8%
Ask a question
Listen to News / Sports
Use a favorite Alexa skill or Google Action
Play game or answer trivia
77.1% 59.8%
33.9%
20.3%
76 3
19.0%
16.9%
46.1% 27.7%
EVER TRIED
MONTHLY
9.0%
Listen to Podcast or other talk formats
Control smart home devices
SI
44.9% 32.0«
43.4% 31.9"*
I
11.4%
24.5%
Find a recipe or ' A cooking instructions
Call someone
Search for product information
Check traffic /directions
Access my calendar |g) Send a text message (^) Make a purchase
38.2% 27.9%
35.1% 23.7%
I
■
■■9.5%
!
4.9)
voicebot.ai
Souxc Vxcbot 012020
IBM Watson beated Jeopardy champions
IBM Watson beated Jeopardy champions
(1) Question Processing
(2) Candidate Answer Generation
From Text Resources
Document
and Passsage Retrieval -*-
Answer Extraction
Document titles Anchor text
(3) Candidate Answer Scoring
Candidate Answer
Candidate Answer
Confidence
Candidate
Answer _±_
Candidate Answer
Candidate Answer
Confidence
(4)
Confidence Merging
and Ranking
Merge Equivalent Answers
Logistic Regression Answer Ranker
Answer and Confidence
Image credit: J & M, edition 3
(1) Question processing, (2) Candidate answer generation, (3) Candidate answer scoring, and (4) Confidence merging and ranking.
Question answering in deep learning era
BERTQ(9)
[ CLS ] What does the zip in zip code stand for? [ SEP ]
BERTß(O)
[CLS] ...The term 'ZIP' is an acronym for Zone Improvement Plan... [ SEP ]
BERTB(1)
[CLS] ...group of zebras are referred to as a herd or dazzle... [SEP]
BERTB(2)
[CLS] ...ZIPs for other operating systems may be preceded by... [ SEP ]
BERTg(...)
TopK
TopK
BERThM)
[CLS] What does the zip in zip code stand for? [SEP]...The term 'ZIP' is an acronym for Zone Improvement Plan... [ SEP ]
BERTH(9,2)
[CLS] What does the zip in zip code stand for?
[SEP] ...ZIPs for other operating systems may be preceded by... [ SEP ]
^\Sread(0, "The term", q)
Sread(0, "Zone Improvement Plan", q)
Sread(0,...,q)
?S£~ 5rearf(2,"ZIPs",<?)
fA V STeai{2, "operating systems", q)
*Sread(2, q)
Image credit: (Lee et al., 2019)
Almost all the state-of-the-art question answering systems are built on top of end-to-end training and pre-trained language models (e.g., BERT)!
10
Beyond textual QA problems
Today, we will mostly focus on how to answer questions based on unstructured text.
Knowledge based QA
P Freebase"
100M entities (nodes)    IB assertions (edges)
MichelleObama
/I
PlacesLived Spouse
/
Event21
Location Type
Chicago'
"Female USState 1992.10.03 f
\. StartDate
Event8 ^__Hawaii
ContainedBy | Marriage UnitedStateS ContainedBy
^ContainedBy
BarackObama-PiaceOfBirth-»-Honolulu
Location PlacesLived^
Event3
7\
Type DateOfBirth Profession
/ \
Person
1961.08.04
Politician
Type
I
City
Which states' capitals are also their largest cities by area?
semantic parsing
/ii.Type.USState n Capital.argmax(Type.City n ContainedBy.x, Area)
I
nax(
I
execute
Arizona, Hawaii, Idaho, Indiana, Iowa,Oklahoma, Utah
Image credit: Percy Liang
11
Beyond textual QA problems
Today, we will mostly focus on how to answer questions based on unstructured text.
Visual QA
What color are her eyes? How many slices of pizza are there?
What is the mustache made of? Is this a vegetarian pizza?
(Antol et al., 2015): Visual Question Answering
12
2. Reading comprehension
Reading comprehension = comprehend a passage of text and answer questions about its content (P, Q) —> A
jTesla was the fourth of five children. He had an older brother J named Dane and three sisters, Milka, Angelina and Marica. jDane was killed in a horse-riding accident when Nikola was jfive. In 1861, Tesla attended the "Lower" or "Primary" School ! in Smiljan where he studied German, arithmetic, and ! religion. In 1862, the Tesla family moved to Gospic, Austrian ! Empire, where Tesla's rather worked as a pastor Nikola I completed "Lower" or "Primary" School, followed by the ; "Lower Real Gymnasium" or "Normal School."
Q: What language did Tesla study while in school? A: German
13
2. Reading comprehension
Reading comprehension: building systems to comprehend a passage of text and answer questions about its content (P, Q) —► A
i Kannada language is the official language of Karnataka and i spoken as a native language by about 66.54% of the people | as of 2011. Other linguistic minorities in the state were Urdu j | (10.83%), Telugu language (5.84%), Tamil language | (3.45%), Marathi language (3.38%), Hindi (3.3%), Tulu I language (2.61%), Konkani language (1.29%), Malayalam ! (1.27%) and Kodava Takk (0.18%). In 2007 the state had a | I birth rate of 2.2%, a death rate of 0.7%, an infant mortality Irate of 5.5% and a maternal mortality rate of 0.2%. The total ! | fertility rate was 2.2.
Q: Which linguistic minority is larger, Hindi or Malayalam? A: Hindi
14
Why do we care about this problem?
• Useful for many practical applications
• Reading comprehension is an important testbed for evaluating how well computer systems understand human language
• Wendy Lehnert 1977: "Since questions can be devised to query any aspect of text
comprehension, the ability to answer questions is the strongest possible demonstration of understanding."
• Many other NLP tasks can be reduced to a reading comprehension problem:
Information extraction
(Barack Obama, educated at, ?)
Semantic role labeling
UCD finished the 2006 championship as Dublin champions , by beating St Vincents in the final.
Question: Where did Barack Obama graduate from?
Passage: Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago.
finished
Who finished something? - UCD
What did someone finish? - the 2006 championship
What did someone finish something as? - Dublin champions
How did someone finish something? - by beating St Vincents in the final
beating
Who beat someone? - UCD
When did someone beat someone? - in the final
Who did someone beat? - St Vincents
(Levy et al, 2017)
(He et al, 2015)
15
Stanford question answering dataset (SQuAD)
• 100k annotated (passage, question, answer) triples
Large-scale supervised datasets are also a key ingredient for training effective neural models for reading comprehension!
• Passages are selected from English Wikipedia, usually 100—150 words.
• Questions are crowd-sourced.
• Each answer is a short segment of text (or span) in the passage.
This is a limitation— not all the questions can be answered in this way!
• SQuAD still remains the most popular reading comprehension dataset; it is "almost solved" today and the state-of-the-art exceeds the estimated human performance.
In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, pel and hail... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called "showers".
What causes precipitation to fall? gravity
What is another main form of precipitation besides drizzle, rain, snow, sleet and hail?
graupel
Where do water droplets collide with ice crystals to form precipitation?
within a cloud
(Rajpurkar et al., 2016): SQuAD: 100,000+ Questions for Machine Comprehension
16
Stanford question answering dataset (SQuAD)
• Evaluation: exact match (0 or 1) and Fl (partial credit).
• For development and testing sets, 3 gold answers are collected, because there could be multiple plausible answers.
• We compare the predicted answer to each gold answer (a, an, the, punctuations are removed) and take max scores. Finally, we take the average of all the examples for both exact match and Fl.
• Estimated human performance: EM = 82.3, Fl = 91.2 Q: What did Tesla do in December 1878?
A: {left Graz, left Graz, left Graz and severed all relations with his family} Prediction: {left Graz and served}
Exact match: max{0, 0, 0} =0 Fl: max{0.67, 0.67, 0.61} = 0.67
17
Neural models for reading comprehension
How can we build a model to solve SQuAD?
(We are going to use passage, paragraph and context, as well as question and query interchangeably)
• Problem formulation
• Input: C = (chc2,cN), Q = (ql,q2,qM), cbqt G V
• Output: 1 < start < end < N
• A family of LSTM-based models with attention (2016-2018)
Attentive Reader (Hermann et al., 2015), Stanford Attentive Reader (Chen et al., 2016), Match-LSTM (Wang et al, 2017), BiDFA (Seo et al., 2017), Dynamic coattention network (Xiong et al., 2017), DrQA (Chen et al., 2017), R-Net (Wang et al., 2017), ReasoNet (Shen et al., 2017)..
• Fine-tuning BERT-like models for reading comprehension (2019+)
18
N-100, M -15
answer is a span in the passage
LSTM-based vs BERT models
Mod ft ing Layer
Attention Flow
Phrase Embed Layer
Word Embed Layer
C har acter Embed Layer
Query2Context and Context2Query Attention
h-i h2
äfl ff
□ o
□   n n
9 2
qj
Query
Image credit: (Seo et al, 2017)
Query 2Context
::::
U2
h, h2 hT Context2Query
	r,_i			rvi rji
				
F o	l	II		II 1
C/)		JL	•	.11.1
-u2
-U1
Word Character Embedding Embedding
GLOVE Char-CNN
c
i r
[CLS]
pstartj pend-
Encoder (BERT)
i—f—r
[SEP] ^Pi_
Question
-Y-
Passage
3,
Image credit: J & M, edition 3
19
Recap: seq2seq model with attention
• Instead of source and target sentences, we also have two sequences: passage and question (lengths are imbalanced)
• We need to model which words in the passage are most relevant to the question (and which question words)
Attention is the key ingredient here, similar to which words in the source sentence are most relevant to the current target word...
• We don't need an autoregressive decoder to generate the target sentence word-by-word. Instead, we just need to train two classifiers to predict the start and end positions of the answer!
;/       a      m'    entarte <START> he      hit     me with
Source sentence (input)
20
BiDAF: the Bidirectional Attention Flow model
Output Layer
Modeling Layer
Attention Flow Layer
Phrase Embed Layer
Word Embed Layer
Character Embed Layer
Start
t
End
t
Dense + Softmax		LSTM + Softmax
		
ITrr
fft ttt 9i 92				'ft 9t	
					
Query2Context and Context2Query Attention
T
□
T
3
T
□
□ □
xT
Context
□
qj
Query
(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension
21
BiDAF: Encoding
Encoding
Phrase Embed Layer
Word Embed Layer
Character Embed Layer
h.
T □
1=1
□
x2
n
X3
□
if
Embedding vector _»_
concat
[ooool
Seattle
Context
qj
_i
Query
						
0		O	O		O	O
						
S	e	a	t	t	I	e
Use a concatenation of word embedding (GloVe) and character embedding (CNNs over character embeddings) for each word in context and query.
efe) = /([GloVe(cj); charEmb(cj)]) e(^) = /([GloVefe); charEmbfe)])
f: klgk-wocx"-} Kefcwaorks omitted kere
Then, use two bidirectional LSTMs separately to produce contextual embeddings for both context and query
: LSTM("^_i, e(a)) eRH = LSTM(^_i, e(qt)) e RH
- LSTM(t,+1, e(a)) eRH ^ = LSTM( W, efe)) e RH
22
BiDAF: Attention
Attention
_l 1_	g,	92	[f 1	9t	
	1 1				
	Query2Context and Context2Query Attention				
h.	l h2	hTf U!			
Context-to-query attention: For each context word, choose the most relevant words from the query words.
Q: Who leads the United States?
C: Barak Obama is the president of the USA.
For each context word, find the most relevant query word.
(Slides adapted from Minjoon Seo)
23
BiDAF: Attention
Attention
_l 1_	g,	92	[f 1	9t	
	1 1				
	Query2Context and Context2Query Attention				
h.	l h2	hTf U!			
Query-to-context attention: choose the context words that are most relevant to one of query words.
While\Seattle\s weather is very nice in summer, its weather is very rainy
in winter making it one of the most gloomy cities in the U.S. LA is...
Q: Which city is gloomy in winter?
(Slides adapted from Minjoon Seo)
24
BiDAF: Attention
Attention
	jl 9i	92	|F "J		g-	
	1"					
	Query2Context and Context2Query Attention					
h.	hp                            hTf ui					
The final output is
• First, compute a similarity score for every pair of (ci5 q^):
sij = wJim[ci;a*;c* ©o/l G M    Wsim e M
• Context-to-query attention (which question words are more relevant to q)
aij = soitmaxj(Sij) G M'
6 if
M
a.i = ^aijqj G M j'=i
2if
Query-to-context attention (which context words are relevant to some question words):
N
ßi = softmaxi(max^1(iS'i>J-)) G b = ^ftc* G M2if
25
BiDAF: Modeling and output layers
Output Layer
Modeling Layer
Start
End
i
Dense + Softmax
LSTM + Softmax
m,			
			t
	n .	ttt	n. ttt
The final training loss is
C = -logpstart(s*) -logPend(e*)
Modeling layer: pass gt to another two layers of bi-directional LSTMs.
• Attention layer is modeling interactions between query and context
• Modeling layer is modeling interactions within context words
BiLSTM(gi) e R2H
m,
Output layer: two classifiers predicting the start and end positions:
Pstart = softmax(wsTtart [gi; mJ)      2W = softmax(weTnd [gi; mj])
m't = BiLSTM(m^) e R2H  wstart, wend £ M10if
26
BiDAF: Performance on SQuAD
This model achieved 77.3 Fl on SQuAD vl.l.
• Without context-to-query attention => 67.7 Fl
• Without query-to-context attention => 73.7 Fl
• Without character embeddings => 75 A Fl
Published12! LeaderBoard13
Single Model	EM/Fl		EM /Fl
LR Baseline (Rajpurkar et al. 2016)	40	4/51.0	40.4/51.0
Dynamic Chunk Reader (Yu et al., 2016)	62	5/71.0	62.5/71.0
Match-LSTM with Ans-Ptr (Wang & Jiang, 2016)	64	7/73.7	64.7/73.7
Multi-Perspective Matching (Wang et al.. 2016)	65	5/75.1	70.4/78.8
Dynamic Coattention Networks (Xiong et al., 2016)	66	2/75.9	66.2/75.9
FastQA (Weissenborn et al., 2017)	68	4/77.1	68.4/77.1
BiDAF (Seoetal., 2016)	68	0/77.3	68.0/77.3
SEDT (Liu etal., 2017a)	68	1/77.5	68.5/78.0
RaSoR (Lee et al., 2016)	70	8/78.7	69.6 / 77.7
FastQAExt (Weissenborn et al., 2017)	70	8/78.9	70.8/78.9
ReasoNet (Shen et al., 2017b)	69	1/78.9	70.6 / 79.4
Document Reader (Chen et al., 2017)	70	0 / 79.0	70.7 / 79.4
Ruminating Reader (Gong & Bowman, 2017)	70	6/79.5	70.6 / 79.5
jNet (Zhang et al., 2017)	70	6/79.8	70.6/79.8
Conductor-net		N/A	72.6/81.4
Interactive AoA Reader (Cui et al., 2017)		N/A	73.6/81.9
Reg-RaSoR		N/A	75.8/83.3
DCN+		N/A	74.9 / 82.8
AIR-FusionNet		N/A	76.0/83.9
R-Net (Wang et al.. 2017)	72	3 / 80.7	76.5 /84.3
BiDAF + Self Attention + ELMo		N/A	77.9/ 85.3
Reinforced Mnemonic Reader (Hu et al.. 2017)	73	2/81.8	73.2/81.8
(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension
27
Attention visualization
Super Bowl 50 was an American football garre to determine the champion of the National Football League ( NFL) for the 2015 season . The American Football Conference ( AFC ) champion Denver Broncos defeated the National Football Conference ( NFC ) champicn Carolina Panthers 24-10 to earn theirthird Super Bowl title . The game was played on February 7 , 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara , California . As this was the50th Super Bowl, the league emphasized the " golden anniversary " with varbus gold-themed initiatives, as well astemporarily suspending the tradition of naming each Super Bowl garre with Roman numerals ( under which the game would have been known as" Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
Where   | ||| ||||||| |
did
llll 11IIII
I
at, the, at, Stadium, Levi, in, Santa, Ana []
Super, Super, Super,Super, Super Bowl, Bowl, Bowl, Bowl, Bowl 50
initiatives
28
BERT for reading comprehension
• BERT is a deep bidirectional Transformer encoder pre-trained on large amounts of text (Wikipedia + BooksCorpus)
• BERT is pre-trained on two training objectives:
• Masked language model (MLM)
• Next sentence prediction (NSP)
• BERTbasehas 12 layers and 110M parameters, BERTiargehas 24 layers and 330M parameters
Pre-training
Fine-Tuning
BERT for reading comprehension
Start/End Span
[CLS]
Tok 1
Tok N
[SEP]
Tok 1
Tok M
Question = Segment A Passage = Segment B
Answer = predicting two endpoints in segment B
+
Segment Embeddings
+ +
+ +
+
+
+ +
+
[CLS]   How many
V_
have
1 L
1 1=1 L~
Question
?      [SEP] BERT
large
If
Reference
J
Question Paragraph
C = -logpstart(s*) - logPend(e*)
Pstart(«) = softmaxi(wJtartH) Pend(^) = softmaxi(wJndH)
where H = [hi, h2, . . . , Yin] are the hidden vectors of the paragraph, returned by BERT
Question: Reference Text:
How many parameters does BERT-large have?
BERT-large is really big...  it has 24 layers and an embedding size of 1,024,   for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance.
Image credit: https://mccormickml.com/
BERT for reading comprehension
C = -logPstartOO - logPend(e*)
• All the BERT parameters (e.g., 110M) as well as the newly introduced parameters h
start i hend (e.g., 768 x 2 = 1536) are optimized together for C.
• It works amazingly well. Stronger pre-trained language models can lead to even better performance and SQuAD becomes a standard dataset for testing pre-trained models.
	Fl	EM
Human performance	91.2*	82.3*
BiDAF	77.3	67.7
BERT-base	88.5	80.8
BERT-large	90.9	84.1
XLNet	94.5	89.0
RoBERTa	94.6	88.9
ALBERT	94.8	89.3
(dev set, except for human performance)
Start/End Span
								1
c					T[SEP]			
BERT
E(CLS] E1
Question
Paragraph
31
Comparisons between BiDAF and BERT models
• BERT model has many many more parameters (110M or 330M) and BiDAF has —2.5M parameters.
• BiDAF is built on top of several bidirectional LSTMs while BERT is built on top of Transformers (no recurrence architecture and easier to parallelize).
• BERT is pre-trained while BiDAF is only built on top of GloVe (and all the remaining parameters need to be learned from the supervision datasets).
Pre-training is clearly a game changer but it is expensive..
32
Comparisons between BiDAF and BERT models
Are they really fundamentally different? Probably not.
• BiDAF and other models aim to model the interactions between question and passage.
• BERT uses self-attention between the concatenation of question and passage = attention (V, P) + attention (V, Q) + attention (Q, P) + attention (Q, Q)
• (Clark and Gardner, 2018) shows that adding a self-attention layer for the passage attention(I> P) to BiDAF also improves performance.
t t
question
_
passage
Transformer layer 3
Transformer layer 2
Transformer layer 1
33
Can we design better pre-training objectives?
The answer is yes!
£(football) = £mlm (football) + £sbo (football)
= — logP(football | x7) — logP(football | X4,xq,P3)
12 3 4
an   American football game
								t	t			t		t						
Xl				x3		x4		X5		x6		x7		x8	| x9	x10		Xn		x12
t		t		t		t		t	t			t		t	t	t		t		t
Transformer Encoder																				
t		t		t		t		t	t			t		t	t	t		t		t
Super		Bowl		50		was		[MASK]	[MASK]			[MASK]		[MASK]	to	| determine		the		champion
Two ideas:
1) masking contiguous spans of words instead of 15% random words
2) using the two end points of span to predict all the masked words in between = compressing the information of a span into its two endpoints
y% = /(xs-i,xe+i,pi-s+i)
(Joshi & Chen et al., 2020): SpanBERT: Improving Pre-training by Representing and Predicting Spans
34
SpanBERT performance
Fl scores
Google BERT Our BERT SpanBERT
95 -92-5-
91.3 y
88
80
94.6
73
65
^^^^^^^^^^^^^^^
■   ^^1^I 1
83.6
84.8
81.7 81.8
83.0 82.8
SQuADvl.l      SQuADv2.0        NewsQA TriviaQA SearchQA HotpotQA    Natural Questions
35
Is reading comprehension solved?
• We have already surpassed human performance on SQuAD. Does it mean that reading comprehension is already solved? Of course not!
• The current systems still perform poorly on adversarial examples or examples from out-of-domain distributions
Article: Super Bowl 50
Paragraph: "Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver's Executive Vice President of Football Operations and General Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV."
Question: "What is the name of the quarterback who was 38 in Super Bowl XXXIII?" Original Prediction: John Elway Prediction under adversary: Jeff Dean
Original
AddSent AddOneSent
AddAny AddCommon
Match Single
Match Ens.
BiDAF Single
71.4
75.4
75.5
27.3 39.0
29.4 41.8
34.3 45.7
7.6 38.9
11.7 51.0
4.8 41.7
BiDAF Ens.
80.0
34.2 46.9
2.7 52.6
(Jia and Liang, 2017): Adversarial Examples for Evaluating Reading Comprehension Systems
36
Is reading comprehension solved?
Systems trained on one dataset can't generalize to other datasets:
Evaluated on
SQuAD   TriviaQA   NQ   QuAC NewsQA
SQuAD	75.6	46.7	48.7	20.2	41.1
TriviaQA	49.8	58.7	42.1	20.4	10.5
NQ	53.5	46.3	73.5	21.6	24.7
QuAC	39.4	33.1	33.8	33.3	13.8
NewsQA	52.1	38.4	41.7	20.4	60.1
(Sen and Saffari, 2020): What do Models Learn from Question Answering Datasets?
37
Is reading comprehension solved?
BERT-large model trained on SQuAD
	Test 7T/>£	Failure	* Example Test cases (with expected behavior and $ prediction)
	and Description	Rate (J,)	
Ü	MFT: comparisons	20.0	C: Victoria is younger than Dylan. Q: Who is less young? A: Dylan Victoria
1	MFT: intensifiers to superlative: most/least	91.3	C: Anna is worried about the project. Matthew is extremely worried about the project. Q: Who is least worried about the project? A: Anna %>: Matthew
	MFT: match properties to categories	82.4	* C: There is a tiny purple box in the room. Q: What size is the box? A: tiny T: purple
	MFT: nationality vs job	49.4	C: Stephanie is an Indian accountant. Q: What is Stephanie's job? A: accountant      Indian accountant
momy	MFT: animal vs vehicles	26.2	C: Jonathan bought a truck. Isabella bought a hamster. Q: Who bought an animal? A: Isabella Jonathan
Taxe	MFT: comparison to antonym	67.3	C: Jacob is shorter than Kimbcrlv^. Q: Who is taller? A: Kimberly Jacob
	MFT: more/less in context, more/less antonym in question	100.0	C: Jeremy is more optimistic than Taylor. Q: Who is more pessimistic? A: Taylor Jeremy
bust.	INV: Swap adjacent characters in Q (typo)	11.6	C: ...Newcomen designs had a duty of about 7 million, but most were closer to 5 million.... Q: What was the ideal duty -> udty of a Newcomen engine? A: INV >i<: 7 million ■» 5 million
o	INV: add irrelevant sentence to C	9.8	(no example)
(Ribeiro et al., 2020): Beyond Accuracy: Behavioral Testing of NLP Models with Checklist
38
Is reading comprehension solved?
BERT-large model trained on SQuAD
loral	MF7: change in one person only	41.5	C: Both Luke and Abigail were writers, but there was a change in Abigail, who is now a model. Q: Who is a model? A: Abigail -J/: Abigail were writers, but there was a change in Abigail
Temp	MFT: Understanding before/after, last/first	82.9	C: Logan became a farmer before Danielle did. Q: Who became a farmer last? A: Danielle -t.-: Logan
M	MFT: Context has negation	67.5	* C: Aaron is not a writer. Rebecca is. Q: Who is a writer? A: Rebecca Aaron
■>.	MFT: Q has negation, C docs not	100.0	* C: Aaron is an editor. Mark is an actor. Q: Who is not an actor? A: Aaron Mark
	MFT: Simple coreferencc, he/she.	100.0	C: Melissa and Antonio are friends. He is a journalist, and she is an adviser. Q: Who is a journalist? A: Antonio Melissa
'd c	MFT: Simple coreferencc, his/her.	100.0	C: Victoria and Alex are friends. Her mom is an agent
			Q: Whose mom is an agent? A: Victoria •$>: Alex
	MFT: former/latter	100.0	C: Kimberly and Jennifer are friends. The former is a teacher Q: Who is a teacher? A: Kimberly i|:: Jennifer
J	MFT: subject/object distinction	60.8	C: Richard bothers Elizabeth. Q: Who is bothered? A: Elizabeth Richard
K CO	MFT: subj/obj distinction with 3 agents	95.7	* C: Jose hates Lisa. Kevin is hated by Lisa. Q: Who hates Kevin? A: Lisa i.J;: Jose
(Ribeiro et al., 2020): Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
39
3. Open-domain question answering
Question (Q)
Answer (A)
WlKIPEDlA
Ihe Free Encyclopedia
• Different from reading comprehension, we don't assume a given passage.
• Instead, we only have access to a large collection of documents (e.g., Wikipedia). We don't know where the answer is located, and the goal is to return the answer for any open-domain questions.
• Much more challenging but a more practical problem!
In contrast to closed-domain systems that deal with questions under a specific domain (medicine^ technical support)..
40
Retriever-reader framework
•if
*   >a ( ' s'
WlKIPEDlA
Document Retriever
P
te RMrsh FtghH* Squadron, f
Document Reader
833,500
al. political and oconomlc nub ™i Wt
https://github.com/facebookresearch/DrQA
Chen et al., 2017. Reading Wikipedia to Answer Open-domain Questions
41
Retriever-reader framework
• Input: a large collection of documents = D^, D2,. ..,DN and Q
• Output: an answer string A
• Retriever: f(9), Q) —> Ph...,PK k is pre-defined (e.g., 100)
• Reader:   g(Q, [Plf...,PK}) —> A A reading comprehension problem!
In DrQA,
• Retriever = A standard TF-IDF information-retrieval sparse model (a fixed module)
• Reader = a neural reading comprehension model that we just learned
• Trained on SQuAD and other distantly-supervised QA datasets
Distantly-supervised examples: (Q, A) —> (F> Q, A)
Chen et al., 2017. Reading Wikipedia to Answer Open-domain Questions
42
We can train the retriever too
Joint training of retriever and reader
BERT<s(?)
[ CLS ] What does the zip in zip code stand for? [ SEP ]
BERTi)(0)	TopK	BERTB(«,0)
[CLS] ...The term 'ZIP' is an acronym for Zone Improvement Plan... [ SEP ]		[CLS] What does the zip in zip code stand for? [SEP] ...The term 'ZIP' is an acronym for Zone Improvement Plan... [ SEP ]
		
BERTB(1)		
		
[CLS] ...group of zebras are referred to as a herd or dazzle... [ SEP ]		
	TopK	BERTs(g, 2)
BERTij(2)		
		[CLS] What does the zip in zip code stand for? [SEP] ...ZIPs for other operating systems may be preceded by... [SEP]
[CLS] ...ZIPs for other operating systems may be preceded by... [SEP]		
		
		
BERTB(...)		
		
		
S«|(0, "The term", q)\
^* SreadiO, "Zone Improvement Plan", q)
■ £w(0,q)
Each text passage can be encoded as a vector using BERT and the retriever score can be measured as the dot product between the question representation and passage representation.
However, it is not easy to model as there are a huge number of passages (e.g., 21M in English Wikipedia)
Lee et al., 2019. Latent Retrieval for Weakly Supervised Open Domain Question Answering
43
We can train the retriever too
• Dense passage retrieval (DPR) - We can also just train the retriever using question-answer pairs!
Ik Q/A pairs beat
Question q Passage p
• Trainable retriever (using BERT) largely outperforms traditional IR retrieval models
Karpukhin et al., 2020. Dense Passage Retrieval for Open-Domain Question Answering
44
We can train the retriever too
Who tells harry potter that he is a wizard in the harry potter series? v Run
Title: Harry Potter (film series) Retrieval ranking: #90      P(p\q)=o.85 P(a\p,q)=i.oo P(a,p\q)=o.»4
... and uncle. At the age of eleven, half-giant Rubeus Hagrid informs him that he is actually a wizard and that his parents were murdered by an evil wizard named Lord Voldemort. Voldemort also attempted to kill one-year-old Harry on the same night, but his killing curse mysteriously rebounded and reduced him to a weak and helpless form. Harry became extremely famous in the Wizarding World as a result. Harry begins his first year at Hogwarts School of Witchcraft and Wizardry and learns about magic. During the year, Harry and his friends Ron Weasley and Hermione Granger become entangled in the ...
Title: Harry Potter (character) Retrieval ranking: #1      P(p\q)=o.04 P(a\p,q)=o.<)7 P(a,p\q)=o.04
... Harry Potter (character) Harry James Potter is the titular protagonist of J. K. Rowling's "Harry Potter" series. The majority of the books' plot covers seven years in the life of the orphan Potter, who, on his eleventh birthday, learns he is a wizard. Thus, he attends Hogwarts School of Witchcraft and Wizardry to practice magic under the guidance of the kindly headmaster Albus Dumbledore and other school professors along with his best friends Ron Weasley and Hermione Granger. Harry also discovers that he is already famous throughout the novel's magical community, and that his fate is tied with that of...
http://qa.cs.washington.edu:2020/ Karpukhin et al., 2020. Dense Passage Retrieval for Open-Domain Question Answering
Dense retrieval + generative models
Recent work shows that it is beneficial to generate answers instead of to extract answers.
Fusion-in-decoder (FID) = DPR + T5
Model
Where was Alan Turing born?
Alan Turing was a British computer scientist. Born in Maida Vale, London...
Generative seq2seq model
	
f	
Maida	Vale,
London	
	J
NaturalQuestions TriviaQA
ORQA (Lee et al., 2019)	31.3	45.1	-
REALM (Guu et al., 2020)	38.2	-	-
DPR (Karpukhin et al., 2020)	41.5	57.9	-
SpanSeqGen (Min et al., 2020)	42.5	-	-
RAG (Lewis et al., 2020)	44.5	56.1	68.0
T5 (Roberts et al., 2020)	36.6	-	60.5
GPT-3 few shot (Brown et al., 2020)	29.9	-	71.2
Fusion-in-Decoder (base)	48.2	65.0	77.1
Fusion-in-Decoder (large)	51.4	67.6	80.1
Izacard and Grave 2020. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
46
Large language models can do open-domain QA well
... without an explicit retriever stage
President Franklin <M> born <M> January 1882
Lily couldn't «M>. The waitress had brought the largest <M> of chocolate cake <vl> seen.
Our <H> hand-picked and sun-dried <M> orchard in Georgia.
Pre-training Fine-tuning
When was Franklin D. Roosevelt born?
D. Roosevelt was in
believe her eyes <M> piece <;M> she had ever
peaches are <M> at our
Roberts et a\., 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model?
47
Maybe the reader model is not necessary too!
It is possible to encode all the phrases (60 billion phrases in Wikipedia) using dense vectors and only do nearest neighbor search without a BERT model at inference time!
"Barack Obama (1961-present) was the 44th President of the United States."
Phrase Indexing
_v
Barack Obama
... (1961-present
44th president
United States.
Nearest neighbor \ search
	Who is the 44th
	President of the
	U.S.?
	When was
	Obama born?
Phrase encoding
Question encoding
Seo et al., 2019. Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index Lee et al., 2020. Learning Dense Representations of Phrases at Scale
48
DensePhrases: Demo
<-   -»   C    © localhost:51997/#	A    * ©(update )
DensePhrases    Home   Paper GitHub	
Examples "
Latency:
English Wikipedia (2018.12.20)
Thanks!
danqic@cs.princeton.edu
50