1. What is question answering?

Question (Q) ⟶ Answer (A)

The goal of question answering is to build systems that automatically answer questions posed by humans in a natural language. The earliest QA systems date back to the 1960s! (Simmons et al., 1964)

Question answering: a taxonomy

• What information source does a system build on?
  • A text passage, all Web documents, knowledge bases, tables, images, ..
• Question type
  • Factoid vs. non-factoid, open-domain vs. closed-domain, simple vs. compositional, ..
• Answer type
  • A short segment of text, a paragraph, a list, yes/no, …

2. Reading comprehension

Reading comprehension = comprehend a passage of text and answer questions about its content: (P, Q) ⟶ A

Passage: Tesla was the fourth of five children. He had an older brother named Dane and three sisters, Milka, Angelina and Marica. Dane was killed in a horse-riding accident when Nikola was five. In 1861, Tesla attended the "Lower" or "Primary" School in Smiljan where he studied German, arithmetic, and religion. In 1862, the Tesla family moved to Gospić, Austrian Empire, where Tesla's father worked as a pastor. Nikola completed "Lower" or "Primary" School, followed by the "Lower Real Gymnasium" or "Normal School."

Q: What language did Tesla study while in school?
A: German

Passage: Kannada language is the official language of Karnataka and spoken as a native language by about 66.54% of the people as of 2011. Other linguistic minorities in the state were Urdu (10.83%), Telugu language (5.84%), Tamil language (3.45%), Marathi language (3.38%), Hindi (3.3%), Tulu language (2.61%), Konkani language (1.29%), Malayalam (1.27%) and Kodava Takk (0.18%). In 2007 the state had a birth rate of 2.2%, a death rate of 0.7%, an infant mortality rate of 5.5% and a maternal mortality rate of 0.2%. The total fertility rate was 2.2.

Q: Which linguistic minority is larger, Hindi or Malayalam?
A: Hindi

Why do we care about this problem?

• Useful for many practical applications.
• Reading comprehension is an important testbed for evaluating how well computer systems understand human language.
  • Wendy Lehnert (1977): "Since questions can be devised to query any aspect of text comprehension, the ability to answer questions is the strongest possible demonstration of understanding."
• Many other NLP tasks can be reduced to a reading comprehension problem:
  • Information extraction (Levy et al., 2017): the query (Barack Obama, educated_at, ?) becomes the question "Where did Barack Obama graduate from?" asked over the passage "Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago."
  • Semantic role labeling (He et al., 2015)

Stanford question answering dataset (SQuAD)

• 100k annotated (passage, question, answer) triples. Large-scale supervised datasets like this are also a key ingredient for training effective neural models for reading comprehension!
• Passages are selected from English Wikipedia, usually 100~150 words.
• Questions are crowd-sourced.
• Each answer is a short segment of text (or span) in the passage. This is a limitation: not all questions can be answered in this way!
• SQuAD was for years the most popular reading comprehension dataset; it is "almost solved" today (though the underlying task is not), and the state of the art exceeds the estimated human performance.
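To make the (passage, question, answer-span) structure concrete, here is a minimal sketch of loading and inspecting SQuAD, assuming the Hugging Face datasets library and its hosted "squad" dataset; the field names below follow that release, not anything specific to the lecture.

```python
from datasets import load_dataset

# SQuAD v1.1: ~100k crowd-sourced (passage, question, answer) triples.
squad = load_dataset("squad")

ex = squad["train"][0]
print(ex["context"])                     # the Wikipedia passage (~100-150 words)
print(ex["question"])                    # the crowd-sourced question
print(ex["answers"]["text"][0])          # the answer: a short span of the passage
print(ex["answers"]["answer_start"][0])  # character offset of that span in the passage
```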
(Rajpurkar et al., 2016): SQuAD: 100,000+ Questions for Machine Comprehension of Text

Stanford question answering dataset (SQuAD)

• Evaluation: exact match (0 or 1) and F1 (partial credit).
• For the development and test sets, 3 gold answers are collected, because there could be multiple plausible answers.
• We compare the predicted answer to each gold answer (after removing "a", "an", "the" and punctuation) and take the max score; finally, we average over all examples, for both exact match and F1. (A code sketch of these metrics appears at the end of this section.)
• Estimated human performance: EM = 82.3, F1 = 91.2

Q: What did Tesla do in December 1878?
Gold answers: {left Graz, left Graz, left Graz and severed all relations with his family}
Prediction: {left Graz and served}
Exact match: max{0, 0, 0} = 0
F1: max{0.67, 0.67, 0.61} = 0.67

Other question answering datasets

• TriviaQA: Questions and answers written by trivia enthusiasts. Evidence consists of independently collected web paragraphs that contain the answer and seem to discuss the question, but there is no human verification that a paragraph actually supports the answer to the question.
• Natural Questions: Questions are drawn from frequently asked Google search queries; answers come from Wikipedia paragraphs. The answer can be a substring, yes, no, or NOT_PRESENT, verified by human annotation.
• HotpotQA: Questions are constructed to be answered over the whole of Wikipedia and involve combining information from two pages to answer a multi-step query:
  Q: Which novel by the author of "Armada" will be adapted as a feature film by Steven Spielberg?
  A: Ready Player One

Neural models for reading comprehension

How can we build a model to solve SQuAD? (We are going to use passage, paragraph and context, as well as question and query, interchangeably.)

• Problem formulation
  • Input: C = (c_1, c_2, …, c_N), Q = (q_1, q_2, …, q_M), c_i, q_i ∈ V (typically N ~ 100, M ~ 15)
  • Output: 1 ≤ start ≤ end ≤ N (the answer is a span in the passage)
• A family of LSTM-based models with attention (2016–2018): Attentive Reader (Hermann et al., 2015), Stanford Attentive Reader (Chen et al., 2016), Match-LSTM (Wang et al., 2017), BiDAF (Seo et al., 2017), Dynamic coattention network (Xiong et al., 2017), DrQA (Chen et al., 2017), R-Net (Wang et al., 2017), ReasoNet (Shen et al., 2017), ..
• Fine-tuning BERT-like models for reading comprehension (2019+)

BiDAF: the Bidirectional Attention Flow model

(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension

BERT for reading comprehension

• Question = Segment A, Passage = Segment B, Answer = predicting two endpoints (start and end) in Segment B.
• p_start(i) = softmax_i(w_start · h_i) and p_end(i) = softmax_i(w_end · h_i), where h_i is the hidden vector of c_i, returned by BERT.
• Training loss: L = −log p_start(s*) − log p_end(e*) for the gold start/end positions s* and e*.
• All the BERT parameters (e.g., 110M) as well as the newly introduced parameters w_start, w_end (e.g., 768 x 2 = 1536) are optimized together for L. (A rough code sketch appears at the end of this section.)
• It works amazingly well. Stronger pre-trained language models lead to even better performance, and SQuAD has become a standard dataset for testing pre-trained models.

Image credit: https://mccormickml.com/

SQuAD results (dev set, except for human performance):

                    F1     EM
Human performance   91.2*  82.3*
BiDAF               77.3   67.7
BERT-base           88.5   80.8
BERT-large          90.9   84.1
XLNet               94.5   89.0
RoBERTa             94.6   88.9
ALBERT              94.8   89.3

Comparisons between BiDAF and BERT models

• The BERT model has many, many more parameters (110M or 330M); BiDAF has ~2.5M parameters.
• BiDAF is built on top of several bidirectional LSTMs, while BERT is built on top of Transformers (no recurrent architecture, and easier to parallelize).
• BERT is pre-trained, while BiDAF is only built on top of GloVe (and all the remaining parameters need to be learned from the supervision datasets).
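Below is a minimal sketch of the SQuAD-style metrics described earlier (lowercase, strip punctuation and the articles a/an/the, then compute exact match and token-level F1, taking the max over the gold answers). This is my own illustration rather than the official evaluation script, so details may differ slightly.

```python
import re
import string
from collections import Counter

def normalize(s):
    """Lowercase, drop punctuation and the articles a/an/the, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1(prediction, gold):
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Compare against each gold answer and take the max, as in the Tesla example above.
golds = ["left Graz", "left Graz", "left Graz and severed all relations with his family"]
prediction = "left Graz and served"
print(max(exact_match(prediction, g) for g in golds))  # 0.0
print(max(f1(prediction, g) for g in golds))           # ~0.67
```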
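The BERT fine-tuning recipe above amounts to adding just two vectors, w_start and w_end, on top of the hidden vectors h_i. Here is a rough sketch of that head, assuming the Hugging Face transformers library; the class name and the toy inputs are illustrative, not the lecture's or the paper's actual code.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class BertForSpanQA(nn.Module):
    """Illustrative span-prediction head: the only new parameters are the two
    vectors w_start and w_end (hidden_size x 2 = 768 x 2 = 1536 for BERT-base)."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # bias=False so the head is exactly the two vectors w_start, w_end.
        self.span_head = nn.Linear(self.bert.config.hidden_size, 2, bias=False)

    def forward(self, input_ids, attention_mask, token_type_ids):
        # h_i: the hidden vector of each token, returned by BERT.
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask,
                      token_type_ids=token_type_ids).last_hidden_state
        start_logits, end_logits = self.span_head(h).split(1, dim=-1)
        # A softmax over positions turns these logits into p_start(i) and p_end(i).
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSpanQA()

question = "What language did Tesla study while in school?"  # segment A
passage = ("In 1861, Tesla attended the Lower or Primary School in Smiljan "
           "where he studied German, arithmetic, and religion.")  # segment B (abridged)
enc = tokenizer(question, passage, return_tensors="pt")

with torch.no_grad():
    start_logits, end_logits = model(**enc)
# Training: cross-entropy on the gold start/end positions (the loss L above).
# Inference (simplified): argmax positions; a real decoder restricts the search
# to spans inside segment B with start <= end.
best_start, best_end = start_logits.argmax(dim=-1), end_logits.argmax(dim=-1)
```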
Pre-training is clearly a game changer, but it is expensive..

3. Open-domain question answering

Question (Q) ⟶ Answer (A)

• Different from reading comprehension, we don't assume a given passage.
• Instead, we only have access to a large collection of documents (e.g., Wikipedia). We don't know where the answer is located, and the goal is to return the answer for any open-domain question.
• Much more challenging, and a more practical problem! This is in contrast to closed-domain systems, which deal with questions in a specific domain (medicine, technical support).

Retriever-reader framework

Question ⟶ Document Retriever ⟶ Document Reader ⟶ Answer

Q: How many of Warsaw's inhabitants spoke Polish in 1933?
A: 833,500

Chen et al., 2017. Reading Wikipedia to Answer Open-domain Questions
https://github.com/facebookresearch/DrQA

(A toy sketch of this retriever-reader pipeline appears at the end of this section.)

Large language models can do open-domain QA well

• … without an explicit retriever stage.

Roberts et al., 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model?

Large language model-based QA (with web search!)

Problems with large language model-based QA

[Screenshot of a model-generated answer, not shown.] Seems totally reasonable! But (1) it's not his most cited paper, and (2) it doesn't have that many citations. Yikes! Also, the reference to a web page doesn't help.
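As promised above, here is a toy retriever-reader sketch in the spirit of DrQA (not the actual DrQA code): a TF-IDF retriever over a small stand-in document collection, followed by an off-the-shelf extractive reader. The document collection and the SQuAD-fine-tuned checkpoint are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

# Toy stand-in for a large document collection such as all of Wikipedia.
documents = [
    "Warsaw is the capital and largest city of Poland.",
    "According to the 1933 census, 833,500 of Warsaw's inhabitants declared "
    "Polish as their mother tongue.",
    "Kannada is the official language of the Indian state of Karnataka.",
]
question = "How many of Warsaw's inhabitants spoke Polish in 1933?"

# Retriever: rank documents by TF-IDF similarity to the question
# (DrQA used a bigram TF-IDF retriever over Wikipedia articles).
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
doc_vectors = vectorizer.fit_transform(documents)
scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
top_doc = documents[scores.argmax()]

# Reader: extract an answer span from the retrieved document with a model
# fine-tuned on SQuAD.
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
print(reader(question=question, context=top_doc))  # expected span: "833,500"
```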