1 © Ronen Feldman 1
Information Extraction: Theory and Practice
Ronen Feldman, Bar-Ilan University, ISRAEL
feldman@cs.biu.ac.il

© Ronen Feldman 2 What is Information Extraction?
• IE does not indicate which documents need to be read by a user; rather, it extracts the pieces of information that are salient to the user's needs.
• Links between the extracted information and the original documents are maintained to allow the user to reference the context.
• The kinds of information that systems extract vary in detail and reliability.
• Named entities such as persons and organizations can be extracted with reliability in the 90–98% range, but by themselves they do not provide the attributes, facts, or events that those entities have or participate in.

2 © Ronen Feldman 3 Relevant IE Definitions
• Entity: an object of interest such as a person or organization.
• Attribute: a property of an entity such as its name, alias, descriptor, or type.
• Fact: a relationship held between two or more entities, such as the Position of a Person in a Company.
• Event: an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction.

© Ronen Feldman 4 IE Accuracy by Information Type

Information Type   Accuracy
Entities           90-98%
Attributes         80%
Facts              60-70%
Events             50-60%

3 © Ronen Feldman 5 MUC Conferences

Conference   Year   Topic
MUC 1        1987   Naval Operations
MUC 2        1989   Naval Operations
MUC 3        1991   Terrorist Activity
MUC 4        1992   Terrorist Activity
MUC 5        1993   Joint Ventures and Micro Electronics
MUC 6        1995   Management Changes
MUC 7        1997   Space Vehicles and Missile Launches

© Ronen Feldman 6 The ACE Evaluation
• The ACE program is dedicated to the challenge of extracting content from human language. This is a fundamental capability that the ACE program addresses with a basic research effort directed at mastering first the extraction of “entities”, then the extraction of “relations” among these entities, and finally the extraction of “events” that are causally related sets of relations.
• After two years of research on the entity detection and tracking task, top systems have achieved a capability that is useful by itself and that, in the context of the ACE EDT task, successfully captures and outputs well over 50 percent of the value at the entity level.
• Here value is defined to be the benefit derived by successfully extracting the entities, where each individual entity provides a value that is a function of the entity type (e.g., “person”, “organization”) and level (e.g., “named”, “unnamed”). Thus each entity contributes to the overall value through the incremental value that it provides.

4 © Ronen Feldman 7 ACE Entity Detection & Tracking Evaluation -- 2/2002
[Bar chart: actual vs. maximum normalized value per entity type (PER, ORG, GPE, LOC, FAC), average of the top 4 systems; Miss 36%, False Alarm 22%, Type Error 6%; human performance ~80.]
Goal: Extract entities. Each entity is assigned a value, which is a function of its Type and Level. The value is gained when the entity is successfully detected, and lost when an entity is missed, spuriously detected, or mischaracterized.

Table of Entity Values

Level   PER    ORG    GPE    LOC    FAC
NAM     1      0.5    0.25   0.1    0.05
NOM     0.2    0.1    0.05   0.02   0.01
PRO     0.04   0.02   0.01   0.004  0.002

© Ronen Feldman 8 Applications of Information Extraction
• Routing of Information
• Infrastructure for IR and for Categorization (higher-level features)
• Event-Based Summarization
• Automatic Creation of Databases and Knowledge Bases
5 © Ronen Feldman 9 Where would IE be useful?
• Semi-structured text
• Generic documents like news articles.
• Most of the information in the document is centered around a set of easily identifiable entities.

© Ronen Feldman 10 Approaches for Building IE Systems
• Knowledge Engineering Approach
– Rules are crafted by linguists in cooperation with domain experts.
– Most of the work is done by inspecting a set of relevant documents.
– It can take a lot of time to fine-tune the rule set.
– The best results so far were achieved with knowledge-based IE systems.
– Skilled/gifted developers are needed.
– A strong development environment is a MUST!

6 © Ronen Feldman 11 Approaches for Building IE Systems
• Automatically Trainable Systems
– The techniques are based on pure statistics and almost no linguistic knowledge.
– They are language independent.
– The main input is an annotated corpus.
– Relatively little effort is needed to build the rules; however, creating the annotated corpus is extremely laborious.
– A huge number of training examples is needed in order to achieve reasonable accuracy.
– Hybrid approaches can utilize the user input in the development loop.

© Ronen Feldman 12 Components of IE System

7 © Ronen Feldman 13 The Extraction Engine

© Ronen Feldman 14 Why is IE Difficult?
• Different languages
– Morphology is very easy in English, much harder in German and Hebrew.
– Identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese and Japanese.
– Some languages use orthographic cues such as capitalization (like English), while others (like Hebrew and Arabic) do not.
• Different types of style
– Scientific papers
– Newspapers
– Memos
– Emails
– Speech transcripts
• Type of document
– Tables
– Graphics
– Small messages vs. books

8 © Ronen Feldman 15 Morphological Analysis
• Easy – English, Japanese
– Listing all inflections of a word is a real possibility.
• Medium – French, Spanish
– A simple morphological component adds value.
• Difficult – German, Hebrew, Arabic
– A sophisticated morphological component is a must!

© Ronen Feldman 16 Using Vocabularies
• “Size doesn’t matter”
– Large lists tend to cause more mistakes.
– Examples:
• “Said” as a person name (male)
• “Alberta” as a person name (female)
• It might be better to have small, domain-specific dictionaries.

9 © Ronen Feldman 17 Part of Speech Tagging
• POS tagging can help to reduce ambiguity and to deal with ALL CAPS text.
• However
– It usually fails exactly when you need it.
– It is domain dependent, so to get the best results you need to retrain it on a relevant corpus.
– It takes a lot of time to prepare a training corpus.

© Ronen Feldman 18 A simple POS Strategy
• Use a tag frequency table to determine the right POS.
– This will lead to the elimination of rare senses.
• The overhead is very small.
• It improves accuracy by a small percentage.
• Compared to full POS tagging, it provides a similar boost to accuracy.
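To make the tag-frequency strategy above concrete, here is a minimal Python sketch (my own illustration, not part of the original slides); the toy corpus and the names train_tag_table / tag_tokens are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_tag_table(tagged_sentences):
    """Build a word -> most-frequent-tag table from a tagged corpus."""
    freq = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            freq[word.lower()][tag] += 1
    # Keep only the most frequent tag per word; rare senses are dropped.
    return {word: tags.most_common(1)[0][0] for word, tags in freq.items()}

def tag_tokens(tokens, table, default="NN"):
    """Tag each token with its most frequent tag, falling back to a default."""
    return [(tok, table.get(tok.lower(), default)) for tok in tokens]

# Toy data (illustrative, not a real corpus):
corpus = [[("the", "DT"), ("plant", "NN"), ("grows", "VBZ")],
          [("they", "PRP"), ("plant", "VB"), ("trees", "NNS")],
          [("the", "DT"), ("plant", "NN"), ("closed", "VBD")]]
table = train_tag_table(corpus)
print(tag_tokens(["the", "plant", "grows"], table))
# "plant" is always tagged NN, its most frequent tag -- the rare verb sense is eliminated.
```

Because every word always receives its single most frequent tag, rare senses are eliminated exactly as described above, at almost no runtime cost.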
10 © Ronen Feldman 19 Dealing with Proper Names
• The problem
– Impossible to enumerate
– New candidates are generated all the time
– Hard to provide syntactic rules
• Types of proper names
– People
– Companies
– Organizations
– Products
– Technologies
– Locations (cities, states, countries, rivers, mountains)

© Ronen Feldman 20 Comparing RB Systems with ML Based Systems

Corpus                        Rule Based   HMM
MUC6 (Wall Street Journal)    96.4         93
MUC7                          93.7         90.4
HUB4 (Transcribed Speech)     90.3         90.6

11 © Ronen Feldman 21 Building a RB Proper Name Extractor
• A lexicon is always a good start.
• The rules can be based on the lexicon and on:
– The context (preceding/following verbs or nouns)
– Regular expressions
• Companies: Capital* [,] inc, Capital* corporation...
• Locations: Capital* Lake, Capital* River
– Capitalization
– List structure
• After the creation of an initial set of rules:
– Run on the corpus
– Analyze the results
– Fix the rules and repeat…
• This process can take around 2-3 weeks and results in a break-even performance of between 85% and 90%.
• Better performance can be achieved with more effort (2-3 months), and then performance can reach 95-98%.

© Ronen Feldman 22 DIAL – Declarative Information Analysis Language
An Information Extraction Language

12 © Ronen Feldman 23 The DIAL4 Language
Declarative Information Analysis Language
• Modular
• Entity Oriented
• Regular Expressions Based
• Rapid and Easy RuleBook Development

© Ronen Feldman 24 DIAL4 Concepts
Each concept may have:
Attributes
Guards – a set of logical conditions on the concept attributes’ values
Actions – a set of operations to perform after finding a concept instance
Internal – a section for defining internal concepts which can only be used within the scope of the concept
Functions – a section for defining add-on (Perl) functions that can only be used within the scope of the concept
Context – text units in which concept instances will be searched

13 © Ronen Feldman 25 DIAL4 Rule Structure
• Pattern – defines the text pattern to match when searching for a concept instance.
• Constraints – defines logical conditions to apply to values extracted from the pattern match. If these conditions are not met, the match is discarded.
• Actions – a set of operations to perform after finding a pattern match. This is where concept instances are added to the Shared Memory.

© Ronen Feldman 26 A Full Rule – an Example
Example of an instance: Crown Central Petroleum Corp.
concept Company {};
rule Company {
pattern: (Capital+) -> name wcCompanyExt “.”?;
constraints: !(name.FirstToken() IS_IN wcCompNameNonStarters);
actions: Add();
};

14 © Ronen Feldman 27 Improving the Person Concept: Assigning values to the concept attributes
concept Person {
attributes:
string FirstName;
string MiddleName;
string LastName;
};
rule Person {
pattern: wcFirstName -> first Capital -> last;
actions: Add(FirstName<-first, LastName<-last);
};

© Ronen Feldman 28 Rule for Extraction of Person Names Based on Title/Position
wordclass wcPosition = adviser minister spokesman president (vice president) general (gen .);
/* note that wordclass members are tokenized and entries containing multiple tokens should be enclosed within () */
concept PersonNameStruct { //we define this concept to allow code reuse
attributes:
string FirstName;
string MiddleName;
string LastName;
};
wordclass wcNamePrefix = ben abu abed von al;
rule PersonNameStruct {
pattern: Capital -> first (Capital “.”?)? -> middle ((wcNamePrefix “-”?)?
Capital) ->last; actions: Add(FirstName <- first.Text(), MiddleName <- middle.Text(), LastName <- last.Text()); }; rule Person { pattern: LONGEST(wcPosition PersonNameStruct -> name); actions: Add(FirstName <- name.FirstName, MiddleName <- name.MiddleName, LastName <- name.LastName); }; 15 © Ronen Feldman 29 List of Person Names concept PersonsList{}; wordclass wcCriminalIndicatingVerbs = charged blamed arrested; wordclass wcPluralBe = are were; rule PersonsList { pattern: wcCriminalIndicatingVerbs wcPluralBe (PersonNameStruct->> pList “,”?)+ “and” PersonNameStruct ->> pList; actions: iterate (pList) begin currPerson = pList.CurrentItem(); Add(Person, currPerson, FirstName<-currPerson.FirstName, LastName <- currPerson.LastName); end }; © Ronen Feldman 30 Person Concept: Applying Constraints rule Person { pattern: LONGEST(Capital -> first (MiddleName -> middle)? ((wcNamePrefix “-”?)? Capital) ->last); constraints: (first IS_IN wcFirstNames) OR !(middle.IsEmpty()); //in “President V. Putin” don’t match “President” as the first name. !(first.FirstToken() IS_IN wcPosition); actions: Add(FirstName <- first.Text(), MiddleName <- middle.Text(), LastName <- last.Text()); Block(Person, this_match); }; We can conclude with high probability that a proper name is a person name if it has a known first name or a middle name: – John Smith – Emil I. Singer 16 © Ronen Feldman 31 Person Concept: Applying Constraints (cont.) rule Person { pattern: Capital -> first (MiddleName -> middle)? ((wcNamePrefix “-”?)? Capital) ->last; constraints: (first IS_IN wcFirstNames) OR !(middle.IsEmpty()) OR (first {1} AFTER wcPosition); !(first.FirstToken() IS_IN wcPosition); actions: Add(FirstName <- first.Text(), MiddleName <- middle.Text(), LastName <- last.Text()); }; We have two rules for concept Person: the first rule extracts names with sure internal evidence (first or middle name), and the second extracts names without internal evidence but with positions/titles preceding them. Let’s merge these two rules into a single rule. © Ronen Feldman 32 Concept Guards • Guards are applied to concept attributes when a rule attempts to add a concept instance to the Shared Memory. A concept instance will be added only if all guard conditions are met . • Guards enable the concept to ensure conditions on its attribute values in a central location, without having to add these conditions to each rule of the concept. Example: concept Date { attributes: number nDay; number nMonth; number nYear; guards: (nDay >= 1) AND (nDay <= 31); (nMonth >= 1) AND (nMonth <=12); (nYear > 0); }; 17 © Ronen Feldman 33 Introduction to HMMs for IE © Ronen Feldman 34 Motivation • We can view the named entity extraction as a classification problem, where we classify each word as belonging to one of the named entity classes or to the noname class. • One of the most popular techniques for dealing with classifying sequences is HMM. • Example of using HMM for another NLP classification task is that of part of speech tagging (Church, 1988 ; Weischedel et. al., 1993). 18 © Ronen Feldman 35 What is HMM? • HMM (Hidden Markov Model) is a finite state automaton with stochastic state transitions and symbol emissions (Rabiner 1989). • The automaton models a probabilistic generative process. • In this process a sequence of symbols is produced by starting in an initial state, transitioning to a new state, emitting a symbol selected by the state and repeating this transition/emission cycle until a designated final state is reached. 
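The transition/emission cycle just described can be sketched in a few lines of Python. This is only an illustration of the generative process: the states, vocabulary, and probabilities below are made-up toy values, not parameters of any system discussed later.

```python
import random

# Toy HMM: two hidden states and a tiny symbol vocabulary (illustrative values).
states = ["PERSON", "NONE"]
start_p = {"PERSON": 0.2, "NONE": 0.8}
trans_p = {"PERSON": {"PERSON": 0.5, "NONE": 0.5},
           "NONE":   {"PERSON": 0.1, "NONE": 0.9}}
emit_p = {"PERSON": {"john": 0.6, "smith": 0.4},
          "NONE":   {"met": 0.3, "the": 0.4, "manager": 0.3}}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r <= acc:
            return outcome
    return outcome  # numerical safety net

def generate(length=5):
    """Emit a symbol sequence by the transition/emission cycle described above."""
    state, output = sample(start_p), []
    for _ in range(length):
        output.append((state, sample(emit_p[state])))
        state = sample(trans_p[state])
    return output

print(generate())
```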
© Ronen Feldman 36 Disadvantage of HMM
• The main disadvantage of using an HMM for information extraction is the need for a large amount of training data, i.e., a carefully tagged corpus.
• The corpus needs to be tagged with all the concepts whose definitions we want to learn.

19 © Ronen Feldman 37 Notational Conventions
• T = length of the sequence of observations (the training set)
• N = number of states in the model
• q_t = the actual state at time t
• S = {S_1, ..., S_N} (finite set of possible states)
• V = {O_1, ..., O_M} (finite set of observation symbols)
• π = {π_i} = {P(q_1 = S_i)}, the starting probabilities
• A = {a_ij} = {P(q_{t+1} = S_j | q_t = S_i)}, the transition probabilities
• B = {b_i(O_t)} = {P(O_t | q_t = S_i)}, the emission probabilities

© Ronen Feldman 38 The Classic Problems Related to HMMs
• Find P(O | λ): the probability of an observation sequence given the HMM model.
• Find the most likely state trajectory given λ and O.
• Adjust λ = (π, A, B) to maximize P(O | λ).

20 © Ronen Feldman 39 Calculating P(O | λ)
• The most obvious way to do that would be to enumerate every possible state sequence of length T (the length of the observation sequence). Let Q = q_1, ..., q_T; then, by assuming independence between the states, we have
– P(O | Q, λ) = ∏_{i=1}^{T} P(O_i | q_i, λ) = ∏_{i=1}^{T} b_{q_i}(O_i)
– P(Q | λ) = π_{q_1} ∏_{i=1}^{T-1} a_{q_i q_{i+1}}
• By the definition of conditional probability we have
– P(O, Q | λ) = P(O | Q, λ) P(Q | λ)
• Finally, P(O | λ) = Σ_{all Q} P(O | Q, λ) P(Q | λ).
• The main problem with this is that we need to do on the order of 2T·N^T multiplications, which is certainly not feasible even for a modest T like 10.

© Ronen Feldman 40 The forward-backward algorithm
• In order to solve this we use the forward-backward algorithm, which is far more efficient. The forward part is based on the computation of terms called the alpha terms, where α_t(i) is the probability of the partial observation sequence O_1 ... O_t ending in state S_i at time t. We define the alpha values as follows:
– α_1(i) = π_i b_i(O_1), for 1 ≤ i ≤ N
– α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_{ij} ] b_j(O_{t+1})
– P(O | λ) = Σ_{i=1}^{N} α_T(i)
• We can compute the alpha values inductively in a very efficient way.
• This calculation requires just on the order of N²·T multiplications.

21 © Ronen Feldman 41 The backward phase
• In a similar manner we can define a backward variable called beta that computes the probability of a partial observation sequence from t+1 to T. The beta variable is also computed inductively, but in a backward fashion:
– β_T(i) = 1
– β_t(i) = Σ_{j=1}^{N} a_{ij} b_j(O_{t+1}) β_{t+1}(j)

© Ronen Feldman 42 Solution for the second problem
• Our main goal is to find the “optimal” state sequence. We will do that by maximizing the probabilities of each state individually.
• We will start by defining a set of gamma variables that measure the probability that at time t we are in state S_i:
– γ_t(i) = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)
• The denominator is used just to make gamma a true probability measure.
• Now we can find the best state at each time slot in a local fashion:
– q_t = argmax_{1 ≤ i ≤ N} γ_t(i)
• The main problem with this approach is that the optimization is done locally, and not on the whole sequence of states. This can lead either to a local maximum or even to an invalid sequence. In order to solve that problem we use a well-known dynamic programming algorithm called the Viterbi algorithm.

22 © Ronen Feldman 43 The Viterbi Algorithm
• Intuition
– Compute the most likely sequence starting with the empty observation sequence; use this result to compute the most likely sequence with an output sequence of length one; recurse until you have the most likely sequence for the entire sequence of observations.
• Algorithmic Details
– The delta variables compute the highest probability of a partial sequence up to time t that ends in state S_i. The psi variables enable us to accumulate the best sequence as we move along the time slices.
• 1. Initialization:
– δ_1(i) = π_i b_i(O_1), for 1 ≤ i ≤ N
– ψ_1(i) = 0

© Ronen Feldman 44 Viterbi (Cont.)
• Recursion:
– δ_t(j) = max_{1 ≤ i ≤ N} [ δ_{t-1}(i) a_{ij} ] b_j(O_t)
– ψ_t(j) = argmax_{1 ≤ i ≤ N} [ δ_{t-1}(i) a_{ij} ]
• Termination:
– P* = max_{1 ≤ i ≤ N} [ δ_T(i) ]
– q*_T = argmax_{1 ≤ i ≤ N} [ δ_T(i) ]
• Reconstruction:
– q*_t = ψ_{t+1}(q*_{t+1}), for t = T-1, T-2, ..., 1.
• The resulting sequence, q*_1 q*_2 ... q*_T, solves Problem 2. (A short code sketch of the forward and Viterbi recursions appears after the Statistical Model slide below.)

23 © Ronen Feldman 45 Viterbi (Example)

© Ronen Feldman 46 The Just Research HMM
• Each HMM extracts just one field of a given document. If more fields are needed, several HMMs need to be constructed.
• The HMM takes the entire document as one observation sequence.
• The HMM contains two classes of states: background states and target states. The background states emit words in which we are not interested, while the target states emit words that constitute the information to be extracted.
• The state topology is designed by hand and only a few transitions are allowed between the states.

24 © Ronen Feldman 47 Possible HMM Topologies
[Diagrams: two example topologies built from an initial state, background states, target states, and a final state; the second adds prefix and suffix states around the target states.]

© Ronen Feldman 48 A more General HMM Architecture

25 © Ronen Feldman 49 Experimental Evaluation

Extracted Field                      Result
Speaker                              71.1%
Location                             83.9%
Start Time                           99.1%
End Time                             59.5%
Acquiring Company                    30.9%
Acquired Company                     48.1%
Abbreviation of Acquired Company     40.1%
Price of Acquisition                 55.3%
Status of Acquisition                46.7%

© Ronen Feldman 50 BBN’s IdentiFinder
• An ergodic bigram model.
• Each Name Class (NC) has a separate region in the HMM.
• The number of states in each NC region is equal to |V|; each word has its own state.
• Rather than using plain words, extended words are used. An extended word is a pair <w, f>, where f is a feature of the word w.

26 © Ronen Feldman 51 BBN’s HMM Architecture

© Ronen Feldman 52 Possible word Features
1. 2-digit number (01)
2. 4-digit number (1996)
3. alphanumeric string (A34-24)
4. digits and dashes (12-16-02)
5. digits and slashes (12/16/02)
6. digits and comma (1,000)
7. digits and period (2.34)
8. any other number (100)
9. all capital letters (CLF)
10. capital letter and a period (M.)
11. first word of a sentence (The)
12. initial letter of the word is capitalized (Albert)
13. word in lower case (country)
14. all other words and tokens (;)

27 © Ronen Feldman 53 Statistical Model
• The design of the formal model is done in levels.
• At the first level we have the most accurate model, which requires the largest amount of training data.
• At the lower levels we have back-off models that are less accurate but also require much smaller amounts of training data.
• We always try to use the most accurate model possible given the amount of available training data.
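To make the forward (α) and Viterbi (δ/ψ) recursions from the preceding slides concrete, here is a minimal NumPy sketch. The 2-state model uses made-up toy parameters; only the recursions themselves follow the formulas above.

```python
import numpy as np

# Toy parameters (illustrative): N=2 states, M=3 observation symbols.
pi = np.array([0.6, 0.4])                          # starting probabilities
A = np.array([[0.7, 0.3], [0.4, 0.6]])             # A[i, j] = P(q_{t+1}=S_j | q_t=S_i)
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])   # B[i, k] = P(O_t = k | q_t = S_i)
O = [0, 1, 2, 1]                                   # observation sequence (symbol indices)

def forward(O, pi, A, B):
    """Return P(O | lambda) via the alpha recursion (order N^2 * T multiplications)."""
    alpha = pi * B[:, O[0]]
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def viterbi(O, pi, A, B):
    """Return the most likely state sequence via the delta/psi recursion."""
    delta = pi * B[:, O[0]]
    psi = []
    for o in O[1:]:
        scores = delta[:, None] * A            # scores[i, j] = delta_{t-1}(i) * a_ij
        psi.append(scores.argmax(axis=0))      # best predecessor for each state j
        delta = scores.max(axis=0) * B[:, o]
    # Termination and path reconstruction.
    path = [int(delta.argmax())]
    for back in reversed(psi):
        path.append(int(back[path[-1]]))
    return list(reversed(path)), float(delta.max())

print(forward(O, pi, A, B))
print(viterbi(O, pi, A, B))
```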
© Ronen Feldman 54 Computing State Transition Probabilities
• When we want to analyze formally the probability of annotating a given word sequence with a set of name classes, we need to consider three different statistical models:
– A model for generating a name class
– A model to generate the first word in a name class
– A model to generate all other words (but the first word) in a name class

28 © Ronen Feldman 55 Computing the Probabilities: Details
• The model to generate a name class depends on the previous name class and on the word that precedes the name class; this is the last word in the previous name class, and we annotate it by w_{-1}. So formally this amounts to P(NC | NC_{-1}, w_{-1}).
• The model to generate the first word in a name class depends on the current name class and the previous name class, and hence is P(<w,f>_first | NC, NC_{-1}).
• The model to generate all other words within the same name class depends on the previous word (within the same name class) and the current name class, so formally it is P(<w,f> | <w,f>_{-1}, NC).

© Ronen Feldman 56 The Actual Computation
P(NC | NC_{-1}, w_{-1}) = c(NC, NC_{-1}, w_{-1}) / c(NC_{-1}, w_{-1})
P(<w,f>_first | NC, NC_{-1}) = c(<w,f>_first, NC, NC_{-1}) / c(NC, NC_{-1})
P(<w,f> | <w,f>_{-1}, NC) = c(<w,f>, <w,f>_{-1}, NC) / c(<w,f>_{-1}, NC)
Here c(<w,f>, <w,f>_{-1}, NC) counts the number of times that we see the pair <w,f> after the pair <w,f>_{-1}, where both are tagged by the name class NC; the other counts are defined analogously.

29 © Ronen Feldman 57 Modeling Unknown Words
• The main technique is to create a new entity called UNKNOWN (marked _UNK_), and create statistics for that new entity. All words that were not seen before are mapped to _UNK_.
• Split the training collection into 2 even parts, and each time use one part for training and one part as a hold-out set. The final statistics are the combination of the results from the two runs.
• The statistics need to be collected for 3 different classes of cases: _UNK_ followed by a known word (|V| cases), a known word followed by _UNK_, and two consecutive _UNK_ words. These statistics are collected for each name class.

© Ronen Feldman 58 Name Class Back-off Models
• The full model takes into account both the previous name class and the previous word (P(NC | NC_{-1}, w_{-1})).
• The first back-off model takes into account just the previous name class (P(NC | NC_{-1})).
• The next back-off model just estimates the probability of seeing the name class based on the distribution of the various name classes (P(NC)).
• Finally, we use a uniform distribution over all name classes (1/(N+1), where N is the number of possible name classes).

30 © Ronen Feldman 59 First Word Back-off Models
• The full model takes into account the current name class and the previous name class (P(<w,f>_first | NC, NC_{-1})).
• The first back-off model takes into account just the current name class (P(<w,f>_first | NC)).
• The next back-off model breaks the pair and uses the product of two events that are treated as independent given the current name class: P(w | NC)·P(f | NC).
• The last back-off model is a uniform distribution over all pairs of words and features: 1/(|V|·|F|), where |V| is the size of the vocabulary and |F| is the number of possible word features.

© Ronen Feldman 60 Rest of the Words Back-off Models
• The full model takes into account the current name class and the previous word (P(<w,f> | <w,f>_{-1}, NC)).
• The first back-off model takes into account just the current name class (P(<w,f> | NC)).
• The next back-off model breaks the pair and uses the product of two events that are treated as independent given the current name class: P(w | NC)·P(f | NC).
• The last back-off model is a uniform distribution over all pairs of words and features: 1/(|V|·|F|), where |V| is the size of the vocabulary and |F| is the number of possible word features.

31 © Ronen Feldman 61 Combining all the models
• The actual probability is a combination of the different models. Each model gets a different weight based on the amount of training available to that model.
• Let’s assume we have 4 models (one full model and 3 back-off models), and we are trying to estimate the probability P(X|Y). Let P1 be the probability of the event according to the full model, and P2, P3, P4 the probabilities according to the back-off models, respectively.
• The weights are computed based on a lambda parameter that relates each model to its immediate back-off model. For instance, λ1 adjusts the weight between the full model and the first back-off model:
λ = (1 − c(Y)/bc(Y)) · 1/(1 + #(Y)/bc(Y))

© Ronen Feldman 62 Analysis
• Here c(Y) is the count of event Y according to the full model, bc(Y) is the count of event Y according to the back-off model, and #(Y) is the number of unique outcomes of Y.
• Lambda has two desirable properties:
– If the full model and the back-off model both have the same support for event Y, then lambda will be 0 and we will use just the full model.
– If the possible outcomes of Y are distributed uniformly, then the weight of lambda will be close to 0, since there is low confidence in the back-off model.

32 © Ronen Feldman 63 Example
• We want to compute the probability P(“bank” | “river”, Not-A-Name). Let’s assume that “river” appears with 3 different words in the Not-A-Name name class, and that in total there are 9 occurrences of “river” with any of these 3 words.
• This gives λ1 = (1 − 0/9) · 1/(1 + 3/9) = 1 · 3/4 = 3/4, so we will use the full model (P1) with weight 0.75 and the back-off models with the remaining 0.25. We then compute λ2, which weighs the first back-off model (P2) against the other back-off models, and finally λ3, which weighs the second back-off model (P3) against the last back-off model. So, to sum up, the probability P(X|Y) would be:
• P(X|Y) = λ1 · P1(X|Y) + (1 − λ1) · (λ2 · P2(X|Y) + (1 − λ2) · (λ3 · P3(X|Y) + (1 − λ3) · P4(X|Y)))
(A short code sketch of this interpolation appears after the modality examples below.)

© Ronen Feldman 64 Using different modalities of text
• Mixed Case: Abu Sayyaf carried out an attack on a south western beach resort on May 27, seizing hostages including three Americans. They are still holding a missionary couple, Martin and Gracia Burnham, from Wichita, Kansas, and claim to have beheaded the third American, Guillermo Sobero, from Corona, California. Mr. Sobero's body has not been found.
• Upper Case: ABU SAYYAF CARRIED OUT AN ATTACK ON A SOUTH WESTERN BEACH RESORT ON MAY 27, SEIZING HOSTAGES INCLUDING THREE AMERICANS. THEY ARE STILL HOLDING A MISSIONARY COUPLE, MARTIN AND GRACIA BURNHAM, FROM WICHITA, KANSAS, AND CLAIM TO HAVE BEHEADED THE THIRD AMERICAN, GUILLERMO SOBERO, FROM CORONA, CALIFORNIA. MR SOBERO'S BODY HAS NOT BEEN FOUND.
• SNOR: ABU SAYYAF CARRIED OUT AN ATTACK ON A SOUTH WESTERN BEACH RESORT ON MAY TWENTY SEVEN SEIZING HOSTAGES INCLUDING THREE AMERICANS THEY ARE STILL HOLDING A MISSIONARY COUPLE MARTIN AND GRACIA BURNHAM FROM WICHITA KANSAS AND CLAIM TO HAVE BEHEADED THE THIRD AMERICAN GUILLERMO SOBERO FROM CORONA CALIFORNIA MR SOBEROS BODY HAS NOT BEEN FOUND.
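The referenced sketch of the model combination: the function below simply folds the nested formula P(X|Y) = λ1·P1(X|Y) + (1−λ1)·(λ2·P2(X|Y) + …) from the Example slide. It is my own illustration; the model probabilities and the λ2, λ3 values are assumed toy numbers (λ1 = 0.75 echoes the worked example above).

```python
def interpolate(probs, lambdas):
    """Combine a full model with its back-off chain:
    P = l1*P1 + (1-l1)*(l2*P2 + (1-l2)*(l3*P3 + (1-l3)*P4))
    probs   -- [P1, ..., Pk] from most specific to most general model
    lambdas -- [l1, ..., l(k-1)], one weight per back-off step
    """
    assert len(probs) == len(lambdas) + 1
    p = probs[-1]                       # start from the most general model
    for prob, lam in zip(reversed(probs[:-1]), reversed(lambdas)):
        p = lam * prob + (1.0 - lam) * p
    return p

# Toy numbers echoing the "bank | river, Not-A-Name" example: the full model
# gets weight 0.75 and the remaining 0.25 is shared among the back-off models.
P1, P2, P3, P4 = 0.30, 0.10, 0.05, 0.01   # illustrative model estimates
l1, l2, l3 = 0.75, 0.5, 0.5               # l1 = 3/4 as computed above; l2, l3 assumed
print(interpolate([P1, P2, P3, P4], [l1, l2, l3]))
```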
33 © Ronen Feldman 65 Experimental Evaluation (MUC 7)

Modality     Language   Rule Based   HMM
Mixed case   English    96.4%        94.9%
Upper case   English    89%          93.6%
SNOR         English    74%          90.7%
Mixed case   Spanish    93%          90%

© Ronen Feldman 66 How much Data is needed to train an HMM?

Number of Tagged Words   English   Spanish
23,000                   NA        88.6%
60,000                   91.5%     89.7%
85,000                   91.9%     NA
130,000                  92.8%     90.5%
230,000                  93.1%     91.2%
650,000                  94.9%     NA

34 © Ronen Feldman 67 Limitations of the Model
• The context which is used for deciding on the type of each word is just the word that precedes the current word. In many cases, such a limited context may cause classification errors.
• As an example, consider the following text fragment: “The Turkish company, Birgen Air, was using the plane to fill a charter commitment to a German company”. The token that precedes Birgen is a comma, and hence we are missing the crucial clue “company”, which is just one token before the comma.
• Due to the lack of this hint, the IdentiFinder system classified Birgen Air as a location rather than as a company. One way to solve this problem is to augment the model with another token when the previous token is a punctuation mark.

© Ronen Feldman 68 Example - input
NEWS
4 Ryan daughters tied to cash ( Thu Feb 13, 9:49 AM )
4 Ryan daughters tied to cash
Thu Feb 13, 9:49 AM
Yahoo-News
By Matt O'Connor, Tribune staff reporter. Tribune staff reporter Ray Gibson contributed to this report

Four of former Gov. George Ryan's daughters shared in almost 10,000 in secret payments from Sen. Phil Gramm's presidential campaign in the mid-1990s, according to testimony Wednesday in federal court.

Alan Drazek, who said he was brought in at Ryan's request as a point man for the Gramm campaign in Illinois, testified he was told by Scott Fawell, Ryan's chief of staff, or Richard Juliano, Fawell's top aide, to cut the checks to the women. According to court records made public Wednesday, Ryan's daughter, Lynda Pignotti, now known as Lynda Fairman, was paid a combined 5,950 in 1995 by the Gramm campaign in four checks laundered through Drazek's business, American Management Resources.

In 1996, individual checks went to Ryan daughters Nancy Coghlan, who received 1,725, and Joanne Barrow and Julie R. Koehl, who each pocketed 1,000, the records showed.

A source said all four daughters had been given immunity from prosecution by federal authorities and testified before the grand jury investigating Fawell as part of the Operation Safe Road probe.

Full story at Chicago Tribune

35 © Ronen Feldman 69 Example - Output Full story at Baltimore Sun By Matt O ' Connor , Tribune staff reporter . Tribune staff reporter Ray Gibson contributed to this report Four of former Gov . George Ryan ' s daughters shared in almost 10 , 000 in secret payments from Sen . Phil Gramm ' s presidential campaign in the mid - 1990 s , according to testimony Wednesday in federal court . Alan Drazek , who said he was brought in at Ryan ' s request as a point man for the Gramm campaign in Illinois , testified he was told by Scott Fawell , Ryan ' s chief of staff , or Richard Juliano , Fawell ' s top aide , to cut the checks to the women . According to court records made public Wednesday , Ryan ' s daughter , Lynda Pignotti , now known as Lynda Fairman , was paid a combined 5 , 950 in 1995 by the Gramm campaign in four checks laundered through Drazek ' s business , American Management Resources . In 1996 , individual checks went to Ryan daughters Nancy Coghlan , who received 1 , 725 , and Joanne Barrow and Julie R . Koehl , who each pocketed 1 , 000 , the records showed . A source said all four daughters had been given immunity from prosecution by federal authorities and testified before the grand jury investigating Fawell as part of the Operation Safe Road probe . Full story at Chicago Tribune © Ronen Feldman 70 Training: ACE + MUC => MUC 50/29.693.7/40.54percent 86.6/82.197.6/82.1money 77.7/91.790.7/91.3location 68.6/92.576.4/77.6time 59/89.590.9/76.6date 83.1/95.991.1/93.7organization 84.9/88.691.9/85.5person ace+muc7muc7r/p 36 © Ronen Feldman 71 Results with our new algorithm 60 70 80 90 100 10.7 14.6 19.8 27 36.7 49.9 68 92.5 126 171 233 317 432 588 801 1090 1484 2019 2749 3741 Amount of training (KWords) Logarithmic scale - each tick multiplies the amount by 7/6 F-measure(%) DATE LOCATION ORGANIZATION PERSON DATE-Nymble LOC-Nymble ORG-Nymble Currently available Amount of training © Ronen Feldman 72 Some Useful Resources • Linguistic Data Consortium (LDC) – Lexicons – Annotated Corpora (Text and Speech) • New Mexico State University – Gazetteers – Many lists of names – Lexicons for different languages • Various Web Sources – CIA World Fact Book – Hoovers – SEC, Nasdaq (list of public company names) – US Census data – Private web sites (like Arabic, Persian, Pakistani names) 37 © Ronen Feldman 73 Shallow Parsing in IE • Only Core Constituents are extracted • No attempt is made at full parses • Relevant prepositional attachments are extracted. – I saw the man with a telescope • Only adverbials related to location and time are processed, others are ignored. • Quantifiers, modals, and propositional attitudes are ignored, or treated in a simplified way. © Ronen Feldman 74 Why not Full Parsing? • Full Parsing for IE was tried in: – SRI Tacitus system (MUC 3) – NYU Proteus (Muc-6) • Main Issues: – Slow (combinatorial explosion of possible parses) – Erroneous – A simple predicate-argument structure is needed. 38 © Ronen Feldman 75 Coreference • The general problem is related to co referential relations between expressions – Whole – part relationship – Containment relationship (set/subset) • A simplest version is to find which noun phrases refer to the same entity • An even more restricted version is to limit it just to proper names. • Example: – The President, George Bush, George W. Bush, or even “W”, all refer to the same entity. 
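For the most restricted version of the problem (proper names only), a few lines of string matching already go a long way. The sketch below is my own simplistic illustration of alias grouping, not the algorithm of any particular system; the matching heuristic (every token of the shorter mention must appear, verbatim or as an initial, in the longer mention) is an assumption.

```python
def is_alias(short, full):
    """Rough test: each token of the shorter mention appears in the longer one,
    either verbatim or as a matching initial ('W.' vs. 'W')."""
    s_toks = [t.rstrip(".") for t in short.split()]
    f_toks = [t.rstrip(".") for t in full.split()]
    def token_match(s):
        return any(s == f or (len(s) == 1 and f.startswith(s)) for f in f_toks)
    return all(token_match(t) for t in s_toks)

def group_mentions(mentions):
    """Greedily attach each mention to the first chain it can alias."""
    chains = []
    for m in sorted(mentions, key=len, reverse=True):
        for chain in chains:
            if any(is_alias(m, other) or is_alias(other, m) for other in chain):
                chain.append(m)
                break
        else:
            chains.append([m])
    return chains

print(group_mentions(["George W. Bush", "George Bush", "Bush", "W",
                      "Bill Clinton", "Clinton"]))
# -> [['George W. Bush', 'George Bush', 'Bush', 'W'], ['Bill Clinton', 'Clinton']]
```

Real systems add the type, number, and gender consistency checks described on the following slides before allowing such a merge.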
© Ronen Feldman 76 Easy and hard in Coreference • “Mohamed Atta, a suspected leader of the hijackers, had spent time in Belle Glade, Fla., where a crop-dusting business is located. Atta and other Middle Eastern men came to South Florida Crop Care nearly every weekend for two months. • “Will Lee, the firm's general manager, said the men asked repeated questions about the crop-dusting business. He said the questions seemed "odd," but he didn't find the men suspicious until after the Sept. 11 attack.” 39 © Ronen Feldman 77 The “Simple” Coreference • Proper Names – IBM, “International Business Machines”, Big Blue – Osama Bin Ladin, Bin Ladin, Usama Bin Laden. (note the variations) • Definite Noun Phrases – The Giant Computer Manufacturer, The Company, The owner of over 600,000 patents • Pronouns – It, he , she, we….. © Ronen Feldman 78 Coreference Example Granite Systems provides Service Resource Management (SRM) software for communication service providers with wireless, wireline, optical and packet technology networks. Utilizing Granite' Xng System, carriers can manage inventories of network resources and capacity, define network configurations, order and schedule resources and provide a database of record to other operational support systems (OSSs). An award-winning company, including an Inc. 500 company in 2000 and 2001, Granite Systems enables clients including AT&T Wireless, KPN Belgium, COLT Telecom, ATG and Verizon to eliminate resource redundancy, improve network reliability and speed service deployment. Founded in 1993, the company is headquartered in Manchester, NH with offices in Denver, CO; Miami, FL; London, U.K.; Nice, France; Paris, France; Madrid, Spain; Rome, Italy; Copenhagen, Denmark; and Singapore. 40 © Ronen Feldman 79 A KE Approach to Corefernce • Mark each noun phrase with the following: – Type (company, person, location) – Singular vs. plural – Gender (male, female, neutral) – Syntactic (name, pronoun, definite/indefinite) • For each candidate – Find accessible antecedents • Each antecedent has a different scope – Proper names’s scope is the whole document – Definite clauses’s scope is the preceding paragraphs – Pronouns might be just the previous sentence, or the same paragraph. – Filter by consistency check – Order by dynamic syntactic preference © Ronen Feldman 80 Filtering Antecedents • George Bush will not match “she”, or “it” • George Bush can not be an antecedent of “The company” or “they” • Using a sort Hierarchy we can use background information to be smarter – Example: “The big automaker is planning to get out the car business. The company feels that it can never longer make a profit making cars.” 41 © Ronen Feldman 81 IE Via BootStrapping © Ronen Feldman 82 AutoSlog (Riloff, 1993) • Creates Extraction Patterns from annotated texts (NPs). • Uses Sentence analyzer (CIRCUS, Lehnert, 1991) to identify clause boundaries and syntactic constituents (subject, verb, direct object, prepositional phrase) • It then uses heuristic templates to generate extraction patterns 42 © Ronen Feldman 83 Example Templates Was aimed at Passive-verb prep Killed with Active-verb prep Bomb against Noun prep Bombed Active-verb was victim aux noun was murdered passive-verb ExampleTemplate © Ronen Feldman 84 AutoSlog-TS (Riloff, 1996) • It took 8 hours to annotate 160 documents, and hence probably a week to annotate 1000 documents. • This bottleneck is a major problem for using IE in new domains. 
• Hence there is a need for a system that can generate IE patterns from unannotated documents. • AutoSlog-TS is such a system. 43 © Ronen Feldman 85 AutoSlog TS Sentence Analyzer S: World Trade Center V: was bombed PP: by terrorists AutoSlog Heuristics Extraction Patterns was killed bombed by Sentence Analyzer Extraction Patterns was killed was bombed bombed by saw EP REL % was bobmed 87% bombed by 84% was killed 63% saw 49% © Ronen Feldman 86 Top 24 Extraction Patterns Exploded on kidnappedOne of was murderedDestroyed Was wounded in Occurred on Responsibility for Took place on was loctated occured was wounded Claimed Caused took place Death of Exploded in was injured Attack on was kidnapped was killed Assassination of Murder of exploded 44 © Ronen Feldman 87 Evaluation • Data Set: 1500 docs from MUC-4 (772 relevant) • AutoSlog generated 1237 patterns which were manually filtered to 450 in 5 hours. • AutoSlog-TS generated 32,345 patterns, after discarding singleton patterns, 11,225 were left. • Rank(EP) = • The user reviewed the the top 1970 patterns and selected 210 of them in 85 minutes. • AutoSlog achieved better recall, while AutoSlog-TS achieved better precision. freq freq freqrel 2log − © Ronen Feldman 88 Learning Dictionaries by Bootstrapping (Riloff and Jones, 1999) • Learn dictionary entries (semantic lexicon) and extraction patterns simultaneously. • Use untagged text as a training source for learning. • Start with a set of seed lexicon entries and using mutual bootstrapping learn extraction patterns and more lexicon entries. 45 © Ronen Feldman 89 Mutual Bootstrapping Algorithm • Using AutoSlog generate all possible extraction patterns. • Apply patterns to the corpus and save results to EPData • SemLex = Seed Words • Cat_EPList = {} 1. Score all extraction patterns in EPData 2. Best_EP = highest scoring pattern 3. Add Best_EP to Cat_EPList 4. Add Best_EP’s extractions to SemLex 5. Goto 1 © Ronen Feldman 90 Meta Bootstrapping Process SeedWords PermanentSemantic lexicon CandidateExtractionPatterns andtheirExtractions TemporarySemantic Lexicon CategoryEPList Initialize Add5Best NPs Select best_EP Add best_EP’s extractions 46 © Ronen Feldman 91 Sample Extraction Patterns process in offices of expanded into returned to has positionsservices in taken in request informationdistributors in part in message to customers in ministers of thriveoutlets in to enter devoted to consulting in parts of sold to activities in presidents of motivated seminars in sought in positioningoperates in become in is distributoroperations in traveled to employedfacilities in living in owned by offices in terrorism locationwww companywww location © Ronen Feldman 92 Experimental Evaluation 124/244 (.51) 101/194 (.52) 85/144 (.59) 68/94 (.72) 31/44 (.70) 4/4 (1) Terr. Weapon 158/250 (.63) 127/200 (.64) 100/150 (.67) 66/100 (.66) 32/50 (.64) 5/5 (1) Terr. Location 107/231 (.46) 101/181 (.56) 86/131 (.66) 63/81 (.78) 22/31 (.71) 0/1 (0) Web Title 191/250 (.76) 163/200 (.82) 129/150 (.86) 88/100 (.88) 46/50 (.92) 5/5 (1) Web Location 95/206 (.46) 86/163 (.53) 72/113 (.64) 52/65 (.80) 25/32 (.78) 5/5 (1) Web Company Iter 50Iter 40Iter 30Iter 20Iter 10Iter 1 47 © Ronen Feldman 93 KDD Cup Task 1 Information Extraction from Biomedical Articles System Description June / July 2002 © Ronen Feldman 94 The Task: Curate or Not- Curate A Product – mRNA or Protein actually identified (naturally) within specific cells of the natural (WildType) fly. 
For each paper, a list of all genes mentioned in the paper - for which we must decide if there is a product result - is given Build a system for automatic analysis of scientific papers regarding the Drosophila Fruit FlyDrosophila Fruit Fly. The system should extract (curate) only the papers that include experimentalexperimental resultsresults regarding expressionexpression of genegene productsproducts, and identifyidentify these genes and products 48 © Ronen Feldman 95 Quick Biological Background RNARNA (Ribonucleic Acid)(Ribonucleic Acid) is a molecule that isis a molecule that is “mathematically” equivalent to (but chemically different from) the DNA sequence of the gene. Transcription means transfer of the genetic information from the archival copy of DNA to the short-lived messenger RNA (mRNA) Transcription © Ronen Feldman 96 Quick Biological Background (Continued) is the process that takes a sequence in one code – nucleotides, and creates the corresponding sequence in another code - amino acids (The building blocks of peptides / proteins). A protein will be expressed only if its code was “translated” from the mRNA. TranslationTranslation 49 © Ronen Feldman 97 The Task: So what’s the problem? • Very often papers discuss mutations and forced (ectopic) expression of genes in addition to natural ones • Many genes are “just mentioned” within the papers without actually citing results or are being used as an auxiliary tool for investigating other genes (Example: The White/Red Eye Gene - w) • The Transcript vs. Protein distinction is tricky (they usually have the same name … ) © Ronen Feldman 98 Our System: Translating the problem into an Information Extraction Task • The scientific papers given are lengthy and complex … • We’re given only a text version without images • But they have a very fixed structure • We’re actually interested only in specific, actual experimental results • Fortunately, these results are obtained using a set of wellknown techniques • Our approach is Knowledge-Based Information Extraction, i.e. finding frequent patterns relevant to the domain So our Solution is … 50 © Ronen Feldman 99 The Figure IS the Result Molecular Biologists who review these papers, look mainly for the figures! Example: This figure (from *R100, in the Training Set) that shows that a specific transcript is present both in the eye and the body. Obvious highlighted sections (Title and Abstract) are used too. *Multiple Subtypes of Phospholipase C Are Encoded by the norpA Gene of Drosophila melanogaster Sunkyu Kim , Richard R. McKay , Karen Miller , Randall D. Shortridge J. Biol. Chem. 270(24): 14376-82. © Ronen Feldman 100 The Figure IS the Result (Continued) But our system can’t read figures and actually doesn’t have them … The Solution … 51 © Ronen Feldman 101 The Alternative: Focus on Figure Legend @Northern Analysis of Adult RNA@ When radiolabeled @norpA@ cDNA probes are hybridized to blots of poly(A) [_2747_tex2html_wrap740.xbm] RNA, three major transcripts can be identified. As shown in Fig. 3(@panel@@A@), a major @norpA@ transcript that is 7.5 kb in length is easily detected in wild-type head but is absent from head of @eya@ mutant. The absence of the 7.5-kb transcript from @eya@ head suggests that it is expressed in the compound eye. Two other transcripts, one that is 5.5 kb and one that is 5.0 kb in length, are visible in body. 
None of these transcripts are detectable in head or body of @norpA@ [_2747_tex2html_wrap732.xbm] mutant (Zhu @et al.@, 1993), suggesting that they are encoded by the @norpA@ gene. [bc2558926003.gif] _________________________________________________________________ Figure 3: Northern blot analysis of @norpA@ transcripts in adult @Drosophila@ tissues. Approximately 5 µg of poly(A) [_2747_tex2html_wrap740.xbm] RNA was loaded in each lane and probed with a 3.4-kb @norpA@ cDNA fragment (nucleotides 1-3453) (@A@), an 80-bp exon 4 cDNA fragment (@B@), or an 80-bp exon 4A cDNA fragment (@C@). @Lane@ designations indicate RNA isolated from adult head or body (thorax and abdomen) of wild-type (@WT@) @Drosophila@, eyes absent (@eya@) mutant, or @norpA@ mutant. Mobility of RNA size standards (in kilobases) are indicated on the @right@. @Panels@@D@- @F@ show the result of reprobing the blots with @Drosophila@ RP49 cDNA (O'Connell and Rosbash, 1984) as a control to test for RNA loading. This is how the extract from the same paper looks as a text file © Ronen Feldman 102 Extracting the Pattern from the Figure Legend • Extracting (finding) the Figure Title is easy : “Figure #” or “Fig. #” beginning at a new line • Look for patterns incorporating a technique used in obtaining the results (for example, Northern blot), or noun phrase or verb describing an expression (“expression”, “localization”, “expressed” …) with a synonym of Gene(s). 52 © Ronen Feldman 103 HP1a, HP1b, and HP1c localize to distinct regions of Drosophila nuclei. Extracting the Pattern from the Figure Legend Example These are probably Proteins (Multi-Capital names are usually Proteins and not Transcripts). GeneList(ProductType) ExpressionVerb © Ronen Feldman 104 Making the Curate Decision : Extract Evidences and Score Them • Extract evidences from Title , Abstract , Figure Legend and GenBank footnotes • Keep a Score entry for the whole document and for each product (transcript/protein) of a candidate gene • At the end of the document, use the scores to decide regarding the curation of the document and the products of the candidate genes. (If a gene’s score is above a certain threshold, mark the gene as having an experimental result, and mark the whole document as curatable). 53 © Ronen Feldman 105 Making the Curate Decision : Positive and Negative Evidences “Northern blot analysis of @norpA@ transcripts in adult @Drosophila@ tissues” “Figure 2. Ectopic expression of @dNSF1@ in the nervous system rescues the phenotypes of @dNSF1@ mutations” Positive Evidence Negative Evidence © Ronen Feldman 106 Implementation : DIAL Rulebook • The System is implemented in DIALDIAL (Declarative Information Analysis Language), a general IE language developed at ClearForest • DIAL is based on matching patterns within the text and then checking constraints on the patterns. • Patterns combine syntactic and semantic elements. 54 © Ronen Feldman 107 © Ronen Feldman 108 55 © Ronen Feldman 109 Results and Evaluation • Document Curation : 78% F- Measure • Gene Products : 67% F-Measure Results achieved © Ronen Feldman 110 Results and Evaluation (Continued) • Most papers belong to a narrow domain (same vocabulary). • Many curatable papers have both relevant results (wild-type expression) and irrelevant ones (Mutations etc.) • Extracting evidences of specific products of genes cannot be achieved by categorization. Patterns with the specific genes must be found. 
(No real generalization can be made regarding specific genes, other than w) Information Extraction is more suitable than Categorization for this task. (Best Categorization Curation Results – about 62-64% F-Measure) Evaluation 56 © Ronen Feldman 111 A Hybrid Approach Merging the Rule Base and ML Approaches © Ronen Feldman 112 Why is HMM not enough? • The HMM model is flat, so the most it can do is assign a tag to each token in a sentence. • This is suitable for the tasks where the tagged sequences do not nest and where there are no explicit relations between the sequences. • Part-of-speech tagging and entity extraction belong to this category. • Extracting relationships is different, because the tagged sequences can (and must) nest, and there are relations between them which must be explicitly recognized. • < ACQUISITION> Ethicon EndoSurgery, Inc. , a Johnson & Johnson company, has acquired < ACQUIRED> Obtech Medical AG < /ACQUIRED> 57 © Ronen Feldman 113 A Hybrid Approach • The hybrid strategy, attempts to strike a balance between the two knowledge engineer chores – writing the extraction rules – manually tagging the documents. • In this strategy, the knowledge engineer writes SCFG rules, which are then trained on the data which is available. • The powerful disambiguating ability of the SCFG makes writing rules much simpler and cleaner task. • The knowledge engineer has the control of the generality of the rules (s)he writes, and consequently on the amount and the quality of the manually tagged training the system would require. © Ronen Feldman 114 Defining SCFG • Classical definition: A stochastic contextfree grammar (SCFG) is a quintuple G = (T, N, S, R, P) – T is the alphabet of terminal symbols (tokens) – N is the set of nonterminals – S is the starting nonterminal – R is the set of rules – P : R → [0..1] defines their probabilities. • The rules have the form n → s1s2…sk, where n is a nonterminal and each si either token or another nonterminal. • SCFG is a usual context-free grammar with the addition of the P function. 58 © Ronen Feldman 115 The Probability Function • If r is the rule n → s1s2…sk, then P(r) is the frequency of expanding n using this rule. • In Bayesian terms, if it is known that a given sequence of tokens was generated by expanding n, then P(r) is the apriory likelihood that n was expanded using the rule r. • Thus, it follows that for every nonterminal n the sum ∑ P(r) over all rules r headed by n must equal to one. © Ronen Feldman 116 IE with SCFG • A very basic “parsing” is employed for the bulk of a text, but within the relevant parts, the grammar is much more detailed. • The IE grammars can be said to define sublanguages for very specific domains. 59 © Ronen Feldman 117 IE with SCFG • In the classical definition of SCFG it is assumed that the rules are all independent. In this case it is possible to find the (unconditional) probability of a given parse tree by simply multiplying the probabilities of all rules participating in it. • The usual parsing problem is given a sequence of tokens (a string) S, to find the most probable parse tree T which could generate S. A simple generalization of the Viterbi algorithm is able to efficiently solve this problem. • In practical applications of SCFGs, it is rarely the case that the rules are truly independent. Then, the easiest way to cope with this problem while leaving most of the formalism intact is to let the probabilities P(r) be conditioned upon the context where the rule is applied. 
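Under the independence assumption above, the probability of a parse tree is simply the product of the probabilities of all rules used in it. The following toy sketch illustrates this; the grammar fragment, the rule probabilities, and the tree encoding are all made-up for the example and are not part of the TEG/DIAL rulebooks discussed below.

```python
# Toy SCFG fragment: in a full grammar each nonterminal's rule probabilities sum to 1.
rules = {
    ("Text",      ("None", "Text")):          0.5,
    ("Text",      ("Person", "Text")):        0.3,
    ("Text",      ()):                        0.2,
    ("Person",    ("Honorific", "LastName")): 0.6,
    ("Person",    ("FirstName", "LastName")): 0.4,
    ("Honorific", ("Dr",)):                   0.2,
    ("LastName",  ("Simmons",)):              0.1,
}

def tree_prob(tree):
    """P(tree) = product of P(r) over all rules r used in the tree.
    A tree is (nonterminal, [children]); a leaf token is a plain string."""
    if isinstance(tree, str):                  # terminal symbol: no rule of its own
        return 1.0
    head, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(head, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

# "Dr Simmons" parsed as Person -> Honorific LastName, inside Text.
tree = ("Text", [("Person", [("Honorific", ["Dr"]), ("LastName", ["Simmons"])]),
                 ("Text", [])])
print(tree_prob(tree))   # 0.3 * 0.6 * 0.2 * 0.1 * 0.2 = 0.00072
```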
© Ronen Feldman 118 Markovian CSFG • HMM entity extractors, are a simple case of markovian SCFGs. • Every possible rule which can be formed from the available symbols has nonzero probability. • Usually, all probabilities are initially set to be equal, and then adjusted according to the distributions found in the training data. 60 © Ronen Feldman 119 Training Issues • For some problems the available training corpora appear to be adequate. • In particular, markovian SCFG parsers trained on the Penn Treebank perform quite well (Collins 1997, Charniak 2000, Roark 2001, etc). • But for the task of relationship extraction it turns out to be impractical to manually tag the amount of documents that would be sufficient to adequately train a markovian SCFG. • At a certain point it becomes more productive to go back to the original hand-crafted system and write rules for it, even though it is a much more skilled labor! © Ronen Feldman 120 SCFG Syntax • A rulebook consists of declarations and rules. All nonterminals must be declared before usage. • Some of them can be declared as output concepts, which are the entities, events, and facts that the system is designed to extract. Additionally, two classes of terminal symbols also require declaration: termlists, and ngrams. – A termlist is a collection of terms from a single semantic category, either written explicitly or loaded from external source. – An ngram can expand to any single token. But the probability of generating a given token is not fixed in the rules, but learned from the training dataset, and may be conditioned upon one or more previous tokens. Thus, ngrams is one of the ways the probabilities of the SCFG rules can be context-dependent. 61 © Ronen Feldman 121 Example output concept Acquisition(Acquirer, Acquired); ngram AdjunctWord; nonterminal Adjunct; Adjunct :- AdjunctWord Adjunct | AdjunctWord; termlist AcquireTerm = acquired bought (has acquired) (has bought); Acquisition :- Company Acquirer [ “,”Adjunct “,” ] AcquireTerm Company Acquired; © Ronen Feldman 122 Emulation of HMM Entity Extractor in CSFG output concept Company(); ngram CompanyFirstWord; ngram CompanyWord; ngram CompanyLastWord; nonterminal CompanyNext; Company :- CompanyFirstWord CompanyNext | CompanyFirstWord; CompanyNext :- CompanyWord CompanyNext | CompanyLastWord; 62 © Ronen Feldman 123 Putting is all together start Text; nonterminal None; ngram NoneWord; None :- NoneWord None | ; Text :- None Text | Company Text | Acquisition Text | ; © Ronen Feldman 124 SCFG Training • Currently there are three different classes of trainable parameters in a TEG rulebook: – the probabilities of rules of nonterminals – the probabilities of different expansions of ngrams – the probabilities of terms in a wordclass. • All those probabilities are smoothed maximum likelihood estimates, calculated directly from the frequencies of the corresponding elements in the training dataset. 63 © Ronen Feldman 125 Sample Rules concept Text; concept output Person; ngram NGFirstName; ngram NGLastName; ngram NGNone; wordclass WCHonorific = Mr Mrs Miss Ms Dr; Person :- WCHonorific NGLastName; Person :- NGFirstName NGLastName; Text :- NGNone Text; Text :- Person Text; Text :- ; © Ronen Feldman 126 Training the rule book • By default, the initial untrained frequencies of all elements are assumed to be 1. They can be changed using “” syntax. • Let us train this rulebook on the training set containing one sentence: – Yesterday, Dr Simmons, the distinguished scientist, presented the discovery. 
• This is done in two steps. First, the sentence is parsed using the untrained rulebook, but with the constraints specified by the annotations. In our case the constraints are satisfied by two different parses: 64 © Ronen Feldman 127 2 Possible Parse Trees Text Text NGNone NGNone Yesterday Yesterday Text Text NGNone NGNone , , Text Text Person Person WCHonorific NGFirstName Dr Dr NGLastName NGLastName Simmons Simmons Text Text NGNone NGNone , , Text Text …… …… © Ronen Feldman 128 How to pick the right Parse? • The difference is in the expansion of the Person nonterminal. Both Person rules can produce the output instance, therefore there is an ambiguity. • In this case it is resolved in favor of the WCHonorific interpretation, because in the untrained rulebook we have – P(Dr | WCHonorific) = 1/5 (choice of one term among five equiprobable ones), – P(Dr | NGFirstName) ≈ 1/N, where N is the number of all known words. 65 © Ronen Feldman 129 The Trained Rule Book concept Text; concept output Person; ngram NGFirstName; ngram NGLastName; ngram NGNone; wordclass WCHonorific = Mr Mrs Miss Ms <2>Dr; Person :- <2>WCHonorific NGLastName; Person :- NGFirstName NGLastName; Text :- <11>NGNone Text; Text :- <2>Person Text; Text :- <2>; © Ronen Feldman 130 A real example • The PersonAffiliation relation contains three attributes – name of the person, name of the organization, and position of the person in the organization. It is declared as follows: – concept output PersonAffiliation(Name, Position, Org); • Most often, this relation is encountered in the text in the form – “Mr.Name, Position of Org” or – “Org Position Ms.Name”. – Almost any order of the components is possible, with commas and prepositions inserted as necessary. – Also, it is common for Name, Position, or both to be conjunctions of pairs of corresponding entities: – “Mr.Name1 and Ms.Name2, the Position1 and Position2 of Org”, – “Org’s Position1 and Position2, Ms.Name”. • In order to catch those complexities, and for general simplification of the rules, we use several auxiliary non- terminals: – Names, which catches one or two Names, – Positions, which catches one or two Positions, and – Orgs, which catches Organizations and Locations, which can also be involved in PersonAffiliation, as in “Bush, president of US”: 66 © Ronen Feldman 131 The Basic Rules • nonterms Names, Positions, Orgs; – Names :- PERSON->Name | PERSON->Name "and" PERSON->Name; – Positions :- POSITION->Position | POSITION->Position "and" POSITION->Position; – Orgs :- ORGANIZATION->Org | LOCATION->Org; • We also use auxiliary non-terminals for catching pairs of attributes: PosName, and PosOrg: – nonterms PosName, PosOrg; – PosName :- Positions Names | PosName "and" PosName; – wordclass wcPreposition = "at" "in" "of" "for" "with"; – wordclass wcPossessive = ("’ " "s") "’ "; – PosOrg :- Positions wcPreposition Orgs; – PosOrg :- Orgs [wcPossessive] Positions; • Finally, the PersonAffiliation rules: – PersonAffiliation :- Orgs [wcPossessive] PosName; – PersonAffiliation :- PosName wcPreposition Orgs; – PersonAffiliation :- PosOrg [","] Names; – PersonAffiliation :- Names "," PosOrg; – PersonAffiliation :- Names "is" "a" PosOrg; © Ronen Feldman 132 What is Missing? • The rules above catch about 50% of all PersonAffiliation instances in the texts. • Other instances do not conform to the patterns above in several respects. So, in order to improve the accuracy, additional rules need to be written. 
• First, the Organization name is often entered into a sentence as a part of a descriptive noun phrase, as in: – “Ms.Name is a Position of the industry leader Org”. – In order to catch this in a general way, we define an OrgNP nonterm, which uses an external PoS tagger: 67 © Ronen Feldman 133 Advanced Rules • Using External POS Tagger – ngram ngOrgNoun featureset ExtPoS restriction Noun; – ngram ngOrgAdj featureset ExtPoS restriction Adj; – ngram ngNum featureset ExtPoS restriction Number; – ngram ngProper featureset ExtPoS restriction ProperName; – ngram ngDet featureset ExtPoS restriction Det; – ngram ngPrep featureset ExtPoS restriction Prep; • nonterm OrgNounList; – OrgNounList :- ngOrgNoun [OrgNounList]; • nonterms OrgAdjWord, OrgAdjList; – OrgAdjWord :- ngOrgAdj | ngNum | ngProper; – OrgAdjList :- OrgAdjWord [OrgAdjList]; • nonterm OrgNP; – OrgNP :- [ngDet] [OrgAdjList] OrgNounList; – OrgNP :- OrgNP ngPrep OrgNP; – OrgNP :- OrgNP "and" OrgNP; © Ronen Feldman 134 Experimental Evaluation 68 © Ronen Feldman 135 The INC Corpus 72.3477.2768.0080.8586.3676.00Acquisition 74.2972.2276.4780.0077.7882.35OrgLocation 77.3379.4675.3392.0094.5289.61PersonAffiliation FPrecRecallFPrecRecall Exact match resultsPartial match results © Ronen Feldman 136 MUC 7 90.5894.4287.0586.9190.1283.9386.6687.286.12Location 90.1990.989.4987.789.5385.9488.8489.7587.94 Organizati on 92.2490.7893.7586.5786.8386.3186.0185.1386.91Person FPrec Recal lFPrec Recal lFPrec Recal l Full SCFG system Emulation using SCFG HMM entity extractor 69 © Ronen Feldman 137 ACE 2 86.8484.9488.8385.8484.9686.7484.3783.2285.54GPE 64.7671.0659.4965.0671.0260.0358.05 64.73 552.62 Organiza tion 85.5681.6889.8284.8281.6588.2584.3783.2285.54Person 80.2577.383.4461.1176.2450.99Role FPrecRecallFPrec Recal lFPrec Recal l Full SCFG system Ergodic SCFGHMM entity extractor © Ronen Feldman 138 Final Comparison 71.413572.63380.560580.695{Sum} 22.975524.07447.870538.0825FAC 29.364531.59837.44235.091LOC 83.881585.25689.39288.957GPE 78.097576.42283.594584.835PERSON 49.69154.11368.589569.4895ORG HMMG TEGDIALBest TEGF1 70 © Ronen Feldman 139 Simple ROLE Rules (ACE-2) • ROLE :- [Position_Before] ORGANIZATION->ROLE_2 Position ["in" GPE] [","] PERSON->ROLE_1; • ROLE :- GPE->ROLE_2 Position [","] PERSON->ROLE_1; • ROLE :- PERSON->ROLE_1 "of" GPE->ROLE_2; • ROLE :- ORGANIZATION->ROLE_2 "'" "s" [Position] PERSON->ROLE_1; © Ronen Feldman 140 A little more complicated set of rules• ROLE :- [Position_Before] ORGANIZATION->ROLE_2 Position ["in" GPE] [","] PERSON->ROLE_1; • ROLE :- GPE->ROLE_2 Position [","] PERSON->ROLE_1; • ROLE :- PERSON->ROLE_1 "of" GPE->ROLE_2; • ROLE :- ORGANIZATION->ROLE_2 "'" "s" [Position] PERSON- >ROLE_1; • ROLE :- GPE->ROLE_2 [Position] PERSON->ROLE_1; • ROLE :- <5> GPE->ROLE_2 "'" "s" ORGANIZATION->ROLE_1; • ROLE :- PERSON->ROLE_1 "," Position WCPreposition ORGANIZATION->ROLE_2; 71 © Ronen Feldman 141 The Effect of the rules on the extraction accuracy 0.00 20.00 40.00 60.00 80.00 100.00 50K 100K 150K 200K 250K 0 rules 4 rlules 7 rules © Ronen Feldman 142 Self-Supervised Relation Learning from the Web Ronen Feldman Data Mining Laboratory Bar-Ilan University, ISRAEL Joint work with Benjamin Rosenfeld 72 © Ronen Feldman 143 Approaches for Building IE Systems • Knowledge Engineering Approach – Rules are crafted by linguists in cooperation with domain experts. – Most of the work is done by inspecting a set of relevant documents. – Can take a lot of time to fine tune the rule set. – Best results were achieved with KB based IE systems. 
© Ronen Feldman 142 Self-Supervised Relation Learning from the Web
Ronen Feldman, Data Mining Laboratory, Bar-Ilan University, ISRAEL
Joint work with Benjamin Rosenfeld
© Ronen Feldman 143 Approaches for Building IE Systems
• Knowledge Engineering Approach
– Rules are crafted by linguists in cooperation with domain experts.
– Most of the work is done by inspecting a set of relevant documents.
– Can take a lot of time to fine tune the rule set.
– Best results were achieved with KB based IE systems.
– Skilled/gifted developers are needed.
– A strong development environment is a MUST!
© Ronen Feldman 144 Approaches for Building IE Systems
• Automatically Trainable Systems
– The techniques are based on pure statistics and almost no linguistic knowledge
– They are language independent
– The main input is an annotated corpus
– Need a relatively small effort when building the rules, however creating the annotated corpus is extremely laborious.
– Huge number of training examples is needed in order to achieve reasonable accuracy.
– Hybrid approaches can utilize the user input in the development loop.
© Ronen Feldman 145 KnowItAll (KIA)
• KnowItAll is a system developed at the University of Washington by Oren Etzioni and colleagues (Etzioni, Cafarella et al. 2005).
• KnowItAll is an autonomous, domain-independent system that extracts facts from the Web. The primary focus of the system is on extracting entities (unary predicates), although KnowItAll is able to extract relations (N-ary predicates) as well.
• The input to KnowItAll is a set of entity classes to be extracted, such as "city", "scientist", "movie", etc., and the output is a list of entities extracted from the Web.
© Ronen Feldman 146 KnowItAll's Relation Learning
• The base version of KnowItAll uses only the generic hand-written patterns. The patterns are based on a general Noun Phrase (NP) tagger.
• For example, here are the two patterns used by KnowItAll for extracting instances of the Acquisition(Company, Company) relation:
– NP2 "was acquired by" NP1
– NP1 "'s acquisition of" NP2
• And the following are the three patterns used by KnowItAll for extracting the MayorOf(City, Person) relation:
– NP ", mayor of" NP
– NP "'s mayor" NP
– "mayor" NP
© Ronen Feldman 147 SRES
• SRES (Self-Supervised Relation Extraction System) learns to extract relations from the web in an unsupervised way.
• The system takes as input the name of the relation and the types of its arguments and returns as output a set of instances of the relation extracted from the given corpus.
© Ronen Feldman 148 SRES Architecture
[Architecture diagram: input is a set of target relation definitions, output is a set of extractions. The Sentence Gatherer collects sentences (and keywords) from the Web; the Seeds Generator produces seed instances from the relation definitions; the Pattern Learner learns patterns from the seeds and the gathered sentences; the Instance Extractor applies the patterns to produce candidate instances, which pass through an optional NER Filter and a Classifier.]
© Ronen Feldman 149 Seeds for Acquisition
• Oracle – PeopleSoft
• Oracle – Siebel Systems
• PeopleSoft – J.D. Edwards
• Novell – SuSE
• Sun – StorageTek
• Microsoft – Groove Networks
• AOL – Netscape
• Microsoft – Vicinity
• San Francisco-based Vector Capital – Corel
• HP – Compaq
© Ronen Feldman 150 Major Steps in Pattern Learning
• The sentences containing the arguments of the seed instances are extracted from the large set of sentences returned by the Sentence Gatherer (see the sketch below).
• Then, the patterns are learned from the seed sentences.
– We need to generate automatically
• Positive Instances
• Negative Instances
• Finally, the patterns are post-processed and filtered.
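A minimal sketch (plain Python, not the actual Sentence Gatherer) of the first step above: keeping only those gathered sentences that mention both attributes of some seed instance. The seed pairs are taken from the Acquisition seed list above; the example sentences and the naive substring test are illustrative assumptions:

    # Naive seed-sentence selection by substring containment.
    SEEDS = [("Oracle", "PeopleSoft"), ("HP", "Compaq"), ("Novell", "SuSE")]

    def seed_sentences(sentences, seeds=SEEDS):
        """Return (sentence, seed) pairs for sentences that contain both seed attributes."""
        kept = []
        for s in sentences:
            for a, b in seeds:
                if a in s and b in s:
                    kept.append((s, (a, b)))
                    break
        return kept

    docs = [
        "Oracle's proposed acquisition of PeopleSoft was evaluated by the Antitrust Division.",
        "HP and Compaq announced their merger in 2001.",
        "Sun released a new workstation.",       # mentions no seed pair -> dropped
    ]
    for sentence, seed in seed_sentences(docs):
        print(seed, "->", sentence)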
© Ronen Feldman 151 Positive Instances
• The positive set of a predicate consists of sentences that contain an instance of the predicate, with the actual instance's attributes changed to "<AttrN>", where N is the attribute index.
• For example, the sentence
– "The Antitrust Division of the U.S. Department of Justice evaluated the likely competitive effects of Oracle's proposed acquisition of PeopleSoft."
• will be changed to
– "The Antitrust Division of the U.S. Department of Justice evaluated the likely competitive effects of <Attr1>'s proposed acquisition of <Attr2>."
© Ronen Feldman 152 Negative Instances II
• We generate the negative set from the sentences in the positive set by changing the assignment of one or both attributes to other suitable entities in the sentence.
• In the shallow parser based mode of operation, any suitable noun phrase can be assigned to an attribute.
© Ronen Feldman 153 Examples
• The Positive Instance
– "The Antitrust Division of the U.S. Department of Justice evaluated the likely competitive effects of <Attr1>'s proposed acquisition of <Attr2>"
• Possible Negative Instances
– "<Attr1> of the <Attr2> evaluated the likely…"
– "<Attr1> of the U.S. … …acquisition of <Attr2>"
– "The Antitrust Division of the <Attr1> ….. acquisition of <Attr2>"
© Ronen Feldman 154 Additional Instances
• We also use the sentences produced by exchanging "<Attr1>" and "<Attr2>" (with obvious generalization for n-ary predicates) in the positive sentences.
• If the target predicate is symmetric, like Merger, then such sentences are put into the positive set.
• Otherwise, for anti-symmetric predicates, the sentences are put into the negative set.
© Ronen Feldman 155 Pattern Generation
• The patterns for a predicate P are generalizations of pairs of sentences from the positive set of P.
• The function Generalize(S1, S2) is applied to each pair of sentences S1 and S2 from the positive set of the predicate. The function generates a pattern that is the best (according to the objective function defined below) generalization of its two arguments.
• The following pseudo code shows the process of generating the patterns:
For each predicate P
    For each pair S1, S2 from PositiveSet(P)
        Let Pattern = Generalize(S1, S2).
        Add Pattern to PatternsSet(P).
© Ronen Feldman 156 The Pattern Language
• The patterns are sequences of tokens, skips, and slots. The tokens can match only themselves, the skips match zero or more arbitrary tokens, and slots match instance attributes.
• Examples of patterns:
– * was acquired by *
– * merged with *
– * is * ceo of *
• Note that the sentences from the positive and negative sets of predicates are also patterns, the least general ones, since they do not contain skips.
© Ronen Feldman 157 The Generalize Function
• The Generalize(S1, S2) function takes two patterns (e.g., two sentences with slots marked as <AttrN>) and generates the least (most specific) common generalization of both.
• The function does a dynamic programming search for the best match between the two patterns.
• The cost of the match is defined as the sum of costs of matches for all elements:
– two identical elements match at no cost,
– a token matches a skip or an empty space at cost 2,
– a skip matches an empty space at cost 1,
– all other combinations have infinite cost.
• After the best match is found, it is converted into a pattern by copying matched identical elements and adding skips where non-identical elements are matched.
© Ronen Feldman 158 Example
• S1 = "Toward this end, <Attr1> in July acquired <Attr2>"
• S2 = "Earlier this year, <Attr1> acquired <Attr2>"
• After the dynamic programming search, the following match is found:
– Toward ↔ (empty)          cost 2
– (empty) ↔ Earlier         cost 2
– this ↔ this               cost 0
– end ↔ (empty)             cost 2
– (empty) ↔ year            cost 2
– , ↔ ,                     cost 0
– <Attr1> ↔ <Attr1>         cost 0
– in July ↔ (empty)         cost 4
– acquired ↔ acquired       cost 0
– <Attr2> ↔ <Attr2>         cost 0
© Ronen Feldman 159 Generating the Pattern
• at a total cost of 12.
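The search above, and the generalized pattern discussed next, can be reproduced with a small dynamic-programming sketch. This is an illustrative re-implementation of Generalize as described on these slides, not the actual SRES code; patterns are token lists with "*" for skips and "<Attr1>"/"<Attr2>" for slots:

    INF = float("inf")
    SKIP = "*"

    def pair_cost(a, b):
        if a == b:
            return 0                  # identical elements match at no cost
        if SKIP in (a, b):
            return 2                  # a token matches a skip at cost 2
        return INF                    # two different tokens never match

    def gap_cost(a):
        return 1 if a == SKIP else 2  # skip vs. empty = 1, token vs. empty = 2

    def generalize(s1, s2):
        n, m = len(s1), len(s2)
        dp = [[INF] * (m + 1) for _ in range(n + 1)]
        dp[0][0] = 0
        for i in range(n + 1):
            for j in range(m + 1):
                if i and j:
                    dp[i][j] = min(dp[i][j], dp[i - 1][j - 1] + pair_cost(s1[i - 1], s2[j - 1]))
                if i:
                    dp[i][j] = min(dp[i][j], dp[i - 1][j] + gap_cost(s1[i - 1]))
                if j:
                    dp[i][j] = min(dp[i][j], dp[i][j - 1] + gap_cost(s2[j - 1]))
        # Traceback: copy identical matches, emit a skip everywhere else.
        raw, i, j = [], n, m
        while i or j:
            if i and j and dp[i][j] == dp[i - 1][j - 1] + pair_cost(s1[i - 1], s2[j - 1]):
                raw.append(s1[i - 1] if s1[i - 1] == s2[j - 1] else SKIP)
                i, j = i - 1, j - 1
            elif i and dp[i][j] == dp[i - 1][j] + gap_cost(s1[i - 1]):
                raw.append(SKIP)
                i -= 1
            else:
                raw.append(SKIP)
                j -= 1
        raw.reverse()
        # Normalization: combine adjacent skips, drop leading and trailing skips.
        pattern = []
        for el in raw:
            if el == SKIP and pattern and pattern[-1] == SKIP:
                continue
            pattern.append(el)
        while pattern and pattern[0] == SKIP:
            pattern.pop(0)
        while pattern and pattern[-1] == SKIP:
            pattern.pop()
        return dp[n][m], pattern

    S1 = "Toward this end , <Attr1> in July acquired <Attr2>".split()
    S2 = "Earlier this year , <Attr1> acquired <Attr2>".split()
    cost, pattern = generalize(S1, S2)
    print(cost, " ".join(pattern))    # 12  this * , <Attr1> * acquired <Attr2>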
• The match will be converted to the pattern
– * * this * * , <Attr1> * acquired <Attr2>
• which will be normalized (after removing leading and trailing skips, and combining adjacent pairs of skips) into
– this * , <Attr1> * acquired <Attr2>
© Ronen Feldman 160 Post-processing, filtering, and scoring of patterns
• In the first step of the post-processing we remove from each pattern all function words and punctuation marks that are surrounded by skips on both sides. Thus, the pattern from the example above will be converted to
– , <Attr1> * acquired <Attr2>
• Note that we do not remove elements that are adjacent to meaningful words or to slots, like the comma in the pattern above, because such anchored elements may be important.
© Ronen Feldman 161 Content Based Filtering
• Every pattern must contain at least one word relevant to its predicate. For each predicate, the list of relevant words is automatically generated from WordNet by following all links to depth at most 2 starting from the predicate keywords. For example, the pattern
– <Attr1> * by <Attr2>
• will be removed, while the pattern
– <Attr1> * purchased <Attr2>
• will be kept, because the word "purchased" can be reached from "acquisition" via synonym and derivation links.
© Ronen Feldman 162 Scoring the Patterns
• The filtered patterns are then scored by their performance on the positive and negative sets.
• We want the scoring formula to reflect the following heuristic: the score needs to rise monotonically with the number of positive sentences the pattern matches, but drop very fast with the number of negative sentences it matches:

Score(Pattern) = |{S ∈ PositiveSet : Pattern matches S}| / (|{S ∈ NegativeSet : Pattern matches S}| + 1)²

© Ronen Feldman 163 Sample Patterns - Inventor
• X , .* inventor .* of Y
• X invented Y
• X , .* invented Y
• when X .* invented Y
• X ' s .* invention .* of Y
• inventor .* Y , X
• Y inventor X
• invention .* of Y .* by X
• after X .* invented Y
• X is .* inventor .* of Y
• inventor .* X , .* of Y
• inventor of Y , .* X ,
• X is .* invention of Y
• Y , .* invented .* by X
• Y was invented by X
© Ronen Feldman 164 Sample Patterns – CEO (Company/X, Person/Y)
• X ceo Y
• X ceo .* Y ,
• former X .* ceo Y
• X ceo .* Y .
• Y , .* ceo of .* X ,
• X chairman .* ceo Y
• Y , X .* ceo
• X ceo .* Y said
• X ' .* ceo Y
• Y , .* chief executive officer .* of X
• said X .* ceo Y
• Y , .* X ' .* ceo
• Y , .* ceo .* X corporation
• Y , .* X ceo
• X ' s .* ceo .* Y ,
• X chief executive officer Y
• Y , ceo .* X ,
• Y is .* chief executive officer .* of X
© Ronen Feldman 165 Shallow Parser mode
• In the first mode of operation (without the use of NER), the predicates may define attributes of two different types: ProperName and CommonNP.
• We assume that the values of the ProperName type are always heads of proper noun phrases, and that the values of the CommonNP type are simple common noun phrases (with possible proper noun modifiers, e.g. "the Kodak camera").
• We use a Java-written shallow parser from the OpenNLP (http://opennlp.sourceforge.net/) package. Each sentence is tokenized, tagged with part-of-speech, and tagged with noun phrase boundaries. The pattern matching and extraction is straightforward.
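To illustrate how straightforward such matching can be, here is a toy Python version (not the SRES implementation, which works over tokens and real noun-phrase chunks) that compiles a learned pattern into a regular expression; approximating the slots by capitalized word sequences is purely an assumption of this sketch:

    import re

    def pattern_to_regex(pattern: str) -> "re.Pattern":
        """Compile a pattern such as 'this * , <Attr1> * acquired <Attr2>' into a regex.
        '*' becomes a lazy skip; slots become named groups matching capitalized phrases."""
        slot = r"[A-Z][\w.&-]*(?:\s+[A-Z][\w.&-]*)*"        # crude proper-name approximation
        parts = []
        for tok in pattern.split():
            if tok == "*":
                parts.append(r".*?")
            elif tok.startswith("<Attr") and tok.endswith(">"):
                parts.append("(?P<%s>%s)" % (tok.strip("<>"), slot))
            else:
                parts.append(re.escape(tok))
        return re.compile(r"\s*".join(parts))

    rx = pattern_to_regex("this * , <Attr1> * acquired <Attr2>")
    m = rx.search("Earlier this year, Oracle acquired PeopleSoft in a hostile takeover.")
    if m:
        print(m.group("Attr1"), "->", m.group("Attr2"))     # Oracle -> PeopleSoft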
© Ronen Feldman 166 Building a Classification Model
• The goal is to set the score of the extractions using the information on the instance, the extracting patterns and the matches. Assume that extraction E was generated by pattern P from a match M of the pattern P at a sentence S. The following properties are used for scoring:
1. Number of different sentences that produce E (with any pattern).
2. Statistics on the pattern P generated during pattern learning – the number of positive sentences matched and the number of negative sentences matched.
3. Information on whether the slots in the pattern P are anchored.
4. The number of non-stop words the pattern P contains.
5. Information on whether the sentence S contains proper noun phrases between the slots of the match M and outside the match M.
6. The number of words between the slots of the match M that were matched to skips of the pattern P.
© Ronen Feldman 167 Building a Classification Model
• During the experiments, it turned out that the pattern statistics (2) produced detrimental results, and the proper noun phrase information (5) did not produce any improvement. The rest of the information was useful, and was turned into the following set of binary features:
– f1(E, P, M, S) = 1, if the number of sentences producing E is greater than one.
– f2(E, P, M, S) = 1, if the number of sentences producing E is greater than two.
– f3(E, P, M, S) = 1, if at least one slot of the pattern P is anchored.
– f4(E, P, M, S) = 1, if both slots of the pattern P are anchored.
© Ronen Feldman 168 Building a Classification Model
– f5…f9(E, P, M, S) = 1, if the number of non-stop words in P is 0, 1 or greater, 2 or greater, … 4 or greater, respectively.
– f10…f15(E, P, M, S) = 1, if the number of words between the slots of the match M that were matched to skips of the pattern P is 0, 1 or less, 2 or less, 3 or less, 5 or less, and 10 or less, respectively.
• As can be seen, the set of features above is rather small, and is not specific to any particular predicate. This allows us to train a model using a small amount of labeled data for one predicate, and then to use the model for all other predicates.
© Ronen Feldman 169 Using an NER Component
• In the SRES-NER version the entities of each candidate instance are passed through a simple rule-based NER filter, which attaches a score ("yes", "maybe", or "no") to the argument(s) and optionally fixes the argument boundaries. The NER is capable of identifying entities of type PERSON and COMPANY (and can be extended to identify additional types).
© Ronen Feldman 170 NER Scores
• The scores mean:
– "yes" – the argument is of the correct entity type.
– "no" – the argument is not of the right entity type, and hence the candidate instance should be removed.
– "maybe" – the argument type is uncertain; it can be either correct or not.
© Ronen Feldman 171 Utilizing the NER Scores
• If "no" is returned for one of the arguments, the instance is removed. Otherwise, an additional binary feature is added to the instance's vector:
– f16 = 1 iff the score for both arguments is "yes".
• For bound predicates, only the second argument is analyzed, naturally.
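Putting the feature definitions from the last few slides together, here is a small illustrative helper (not SRES code; the exact thresholds follow my reading of the slides) that turns the raw counts for one match into the binary vector f1…f16:

    def feature_vector(num_sentences, anchored_slots, nonstop_words, skip_words, ner_both_yes):
        """Binary features f1..f16 for one (extraction, pattern, match, sentence) tuple."""
        f = [
            num_sentences > 1,          # f1
            num_sentences > 2,          # f2
            anchored_slots >= 1,        # f3: at least one slot of the pattern is anchored
            anchored_slots == 2,        # f4: both slots are anchored
        ]
        f += [nonstop_words == 0] + [nonstop_words >= k for k in (1, 2, 3, 4)]   # f5..f9
        f += [skip_words <= k for k in (0, 1, 2, 3, 5, 10)]                      # f10..f15
        f.append(bool(ner_both_yes))                                             # f16 (SRES-NER only)
        return [int(x) for x in f]

    print(feature_vector(num_sentences=3, anchored_slots=2, nonstop_words=2,
                         skip_words=1, ner_both_yes=True))
    # -> [1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]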
© Ronen Feldman 172 Experimental Evaluation
• We want to answer the following 4 questions:
1. Can we train SRES's classifier once, and then use the results on all other relations?
2. What boost will we get by introducing a simple NER into the classification scheme of SRES?
3. How does SRES's performance compare with KnowItAll and KnowItAll-PL?
4. What is the true recall of SRES?
© Ronen Feldman 173 Training
1. The patterns for a single model predicate are run over a small set of sentences (10,000 sentences in our experiment), producing a set of extractions (between 150 and 300 extractions in our experiments).
2. The extractions are manually labeled according to whether they are correct or not.
3. For each pattern match Mk, the value of the feature vector fk = (f1, …, f16) is calculated, and the label Lk = ±1 is set according to whether the extraction that the match produced is correct or not.
4. A regression model estimating the function L(f) is built from the training data {(fk, Lk)}. We used BBR, but other models, such as SVM, are of course possible.
© Ronen Feldman 174 Testing
1. The patterns for all predicates are run over the sentences.
2. For each pattern match M, its score L(f(M)) is calculated by the trained regression model. Note that we do not threshold the value of L, instead using the raw probability value between zero and one.
3. The final score for each extraction is set to the maximal score of all matches that produced the extraction.
© Ronen Feldman 175 Sample Output
• Extracted instance: HP – Compaq. Supporting sentences:
– "Additional information about the HP-Compaq merger is available at www.VotetheHPway.com."
– "The Packard Foundation, which holds around ten per cent of HP stock, has decided to vote against the proposed merger with Compaq."
– "Although the merger of HP and Compaq has been approved, there are no indications yet of the plans of HP regarding Digital GlobalSoft."
– "During the Proxy Working Group's subsequent discussion, the CIO informed the members that he believed that Deutsche Bank was one of HP's advisers on the proposed merger with Compaq."
– "It was the first report combining both HP and Compaq results since their merger."
– "As executive vice president, merger integration, Jeff played a key role in integrating the operations, financials and cultures of HP and Compaq Computer Corporation following the 19 billion merger of the two companies."
© Ronen Feldman 176 Cross-Classification Experiment
[Charts: precision (0.7–1.0) vs. number of correct extractions for the Acquisition and Merger predicates, using classification models trained on each of Acq., CEO, Inventor, Mayor, and Merger.]
© Ronen Feldman 177 Results!
[Charts: precision (0.5–1.0) vs. number of correct extractions for Acquisition and Merger, comparing KIA, KIA-PL, SRES, and S_NER.]
© Ronen Feldman 178 More Results
[Charts: precision (0.6–1.0) vs. number of correct extractions for CeoOf and MayorOf, comparing KIA, KIA-PL, SRES, and S_NER.]
© Ronen Feldman 179 Inventor Results
[Chart: precision (0.6–1.0) vs. number of correct extractions for InventorOf, comparing KIA, KIA-PL, and SRES.]
© Ronen Feldman 180 When is SRES better than KIA?
• KnowItAll extraction works well when redundancy is high and most instances have a good chance of appearing in simple forms that KnowItAll is able to recognize.
• The additional machinery in SRES is necessary when redundancy is low.
• Specifically, SRES is more effective in identifying low-frequency instances, due to its more expressive rule representation, and its classifier that inhibits those rules from overgeneralizing.
© Ronen Feldman 181 The Redundancy of the Various Datasets
[Chart: average number of sentences per instance (0–70) for the Acq, Merger, Inventor, CEO, and Mayor datasets.]
© Ronen Feldman 182 True Recall Estimates
• It is impossible to manually annotate all of the relation instances because of the huge size of the input corpus.
• Thus, indirect methods must be used. We used a large list of known acquisition and merger instances (that occurred between 1/1/2004 and 31/12/2005) taken from the paid subscription service SBC Platinum.
• For each of the instances in this list we identified all of the sentences in the input corpus that contained both instance attributes, and assumed that all such sentences are true instances of the corresponding relation.
© Ronen Feldman 183 Underestimation of the Recall
• This is of course an overestimate, since in some cases the appearance of both attributes of a true relation instance is just a chance occurrence and does not constitute a true mention of the relation.
• Thus, our estimates of the true recall are pessimistic, and the actual recall is higher.
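A minimal sketch (not the actual evaluation code) of this recall estimate: known instances whose two attributes co-occur in at least one corpus sentence form the (over-counted) reference set, and recall is measured against it. The instance list, sentences, and the simple substring test are illustrative assumptions:

    def estimated_recall(extracted, known_instances, sentences):
        """Pessimistic recall estimate against known instances that appear in the corpus."""
        def co_occurs(inst):
            a, b = inst
            return any(a in s and b in s for s in sentences)

        reference = {inst for inst in known_instances if co_occurs(inst)}
        if not reference:
            return 0.0
        return len(reference & set(extracted)) / len(reference)

    known = [("Oracle", "PeopleSoft"), ("HP", "Compaq"), ("Sun", "StorageTek")]
    corpus = ["Oracle acquired PeopleSoft.", "HP and Compaq completed their merger."]
    found = [("Oracle", "PeopleSoft")]
    print(estimated_recall(found, known, corpus))   # 0.5 -- Sun/StorageTek never co-occurs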
© Ronen Feldman 184 True Recall Estimates
[Charts: estimated recall (0–1) as a function of instance redundancy (1–8) for the Acquisition and Merger relations.]
© Ronen Feldman 185 Conclusions
• We have presented the SRES system for autonomously learning relations from the Web.
• SRES removes the bottleneck of classic information extraction systems, which rely either on manually developed extraction patterns or on a manually tagged training corpus.
• The system relies upon a pattern learning component that enables it to boost the recall of the system.
© Ronen Feldman 186 Future Work
• In our future research we want to try to improve the precision values even at the highest recall levels.
• One of the topics we would like to explore is the complexity of the patterns that we learn. Currently we use a very simple pattern language that has just 3 types of elements: slots, constants and skips. We want to see if we can achieve higher precision with more complex patterns.
• In addition we would like to test SRES on n-ary predicates, and to extend the system to handle predicates that are allowed to lack some of the attributes.