1 © Ronen Feldman 1
Information Extraction: Theory and Practice
Ronen Feldman, Bar-Ilan University, ISRAEL
feldman@cs.biu.ac.il

© Ronen Feldman 2 What is Information Extraction?
• IE does not indicate which documents need to be read by a user; rather, it extracts the pieces of information that are salient to the user's needs.
• Links between the extracted information and the original documents are maintained to allow the user to reference the context.
• The kinds of information that systems extract vary in detail and reliability.
• Named entities such as persons and organizations can be extracted with reliability in the 90–98% range, but by themselves they do not provide the attributes, facts, or events that those entities have or participate in.

2 © Ronen Feldman 3 Relevant IE Definitions
• Entity: an object of interest such as a person or organization.
• Attribute: a property of an entity such as its name, alias, descriptor, or type.
• Fact: a relationship held between two or more entities, such as the Position of a Person in a Company.
• Event: an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction.

© Ronen Feldman 4 IE Accuracy by Information Type

Information Type   Accuracy
Entities           90-98%
Attributes         80%
Facts              60-70%
Events             50-60%

3 © Ronen Feldman 5 MUC Conferences

Conference   Year   Topic
MUC 1        1987   Naval Operations
MUC 2        1989   Naval Operations
MUC 3        1991   Terrorist Activity
MUC 4        1992   Terrorist Activity
MUC 5        1993   Joint Ventures and Micro Electronics
MUC 6        1995   Management Changes
MUC 7        1997   Space Vehicles and Missile Launches

© Ronen Feldman 6 The ACE Evaluation
• The ACE program is dedicated to the challenge of extracting content from human language. This is a fundamental capability that the ACE program addresses with a basic research effort directed at mastering first the extraction of “entities”, then the extraction of “relations” among these entities, and finally the extraction of “events” that are causally related sets of relations.
• After two years of research on the entity detection and tracking task, top systems have achieved a capability that is useful by itself and that, in the context of the ACE EDT task, successfully captures and outputs well over 50 percent of the value at the entity level.
• Here value is defined to be the benefit derived by successfully extracting the entities, where each individual entity provides a value that is a function of the entity type (e.g., “person”, “organization”) and level (e.g., “named”, “unnamed”). Thus each entity contributes to the overall value through the incremental value that it provides.

4 © Ronen Feldman 7 ACE Entity Detection & Tracking Evaluation -- 2/2002
[Bar chart: actual vs. maximum normalized value per entity type (PER, ORG, GPE, LOC, FAC), average of the top 4 systems; Miss 36%, False Alarm 22%, Type Error 6%; human performance ~80.]
Goal: Extract entities. Each entity is assigned a value, which is a function of its Type and Level. The value is gained when the entity is successfully detected, and lost when an entity is missed, spuriously detected, or mischaracterized.

Table of Entity Values

Level   PER    ORG    GPE    LOC    FAC
NAM     1      0.5    0.25   0.1    0.05
NOM     0.2    0.1    0.05   0.02   0.01
PRO     0.04   0.02   0.01   0.004  0.002

© Ronen Feldman 8 Applications of Information Extraction
• Routing of Information
• Infrastructure for IR and for Categorization (higher-level features)
• Event-Based Summarization
• Automatic Creation of Databases and Knowledge Bases
5 © Ronen Feldman 9 Where would IE be useful?
• Semi-structured text
• Generic documents like news articles.
• Most of the information in the document is centered around a set of easily identifiable entities.

© Ronen Feldman 10 Approaches for Building IE Systems
• Knowledge Engineering Approach
– Rules are crafted by linguists in cooperation with domain experts.
– Most of the work is done by inspecting a set of relevant documents.
– It can take a lot of time to fine-tune the rule set.
– The best results so far were achieved with knowledge-based IE systems.
– Skilled/gifted developers are needed.
– A strong development environment is a MUST!

6 © Ronen Feldman 11 Approaches for Building IE Systems
• Automatically Trainable Systems
– The techniques are based on pure statistics and almost no linguistic knowledge.
– They are language independent.
– The main input is an annotated corpus.
– Relatively little effort is needed to build the rules; however, creating the annotated corpus is extremely laborious.
– A huge number of training examples is needed in order to achieve reasonable accuracy.
– Hybrid approaches can utilize the user input in the development loop.

© Ronen Feldman 12 Components of IE System

7 © Ronen Feldman 13 The Extraction Engine

© Ronen Feldman 14 Why is IE Difficult?
• Different languages
– Morphology is very easy in English, much harder in German and Hebrew.
– Identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese and Japanese.
– Some languages use orthographic cues such as capitalization (like English), while others (like Hebrew and Arabic) do not.
• Different types of style
– Scientific papers
– Newspapers
– Memos
– Emails
– Speech transcripts
• Type of document
– Tables
– Graphics
– Small messages vs. books

8 © Ronen Feldman 15 Morphological Analysis
• Easy – English, Japanese
– Listing all inflections of a word is a real possibility.
• Medium – French, Spanish
– A simple morphological component adds value.
• Difficult – German, Hebrew, Arabic
– A sophisticated morphological component is a must!

© Ronen Feldman 16 Using Vocabularies
• “Size doesn’t matter”
– Large lists tend to cause more mistakes.
– Examples:
• “Said” as a person name (male)
• “Alberta” as a person name (female)
• It might be better to have small, domain-specific dictionaries.

9 © Ronen Feldman 17 Part of Speech Tagging
• POS tagging can help to reduce ambiguity and to deal with ALL CAPS text.
• However
– It usually fails exactly when you need it.
– It is domain dependent, so to get the best results you need to retrain it on a relevant corpus.
– It takes a lot of time to prepare a training corpus.

© Ronen Feldman 18 A simple POS Strategy
• Use a tag frequency table to determine the right POS.
– This will lead to the elimination of rare senses.
• The overhead is very small.
• It improves accuracy by a small percentage.
• Compared to full POS tagging, it provides a similar boost to accuracy.
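To make the tag-frequency strategy above concrete, here is a minimal Python sketch (my own illustration, not part of the original slides); the toy corpus and the names train_tag_table / tag_tokens are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_tag_table(tagged_sentences):
    """Build a word -> most-frequent-tag table from a tagged corpus."""
    freq = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            freq[word.lower()][tag] += 1
    # Keep only the most frequent tag per word; rare senses are dropped.
    return {word: tags.most_common(1)[0][0] for word, tags in freq.items()}

def tag_tokens(tokens, table, default="NN"):
    """Tag each token with its most frequent tag, falling back to a default."""
    return [(tok, table.get(tok.lower(), default)) for tok in tokens]

# Toy data (illustrative, not a real corpus):
corpus = [[("the", "DT"), ("plant", "NN"), ("grows", "VBZ")],
          [("they", "PRP"), ("plant", "VB"), ("trees", "NNS")],
          [("the", "DT"), ("plant", "NN"), ("closed", "VBD")]]
table = train_tag_table(corpus)
print(tag_tokens(["the", "plant", "grows"], table))
# "plant" is always tagged NN, its most frequent tag -- the rare verb sense is eliminated.
```

Because every word always receives its single most frequent tag, rare senses are eliminated exactly as described above, at almost no runtime cost.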
10 © Ronen Feldman 19 Dealing with Proper Names
• The problem
– Impossible to enumerate
– New candidates are generated all the time
– Hard to provide syntactic rules
• Types of proper names
– People
– Companies
– Organizations
– Products
– Technologies
– Locations (cities, states, countries, rivers, mountains)

© Ronen Feldman 20 Comparing RB Systems with ML Based Systems

Corpus                        Rule Based   HMM
MUC6 (Wall Street Journal)    96.4         93
MUC7                          93.7         90.4
HUB4 (Transcribed Speech)     90.3         90.6

11 © Ronen Feldman 21 Building a RB Proper Name Extractor
• A lexicon is always a good start.
• The rules can be based on the lexicon and on:
– The context (preceding/following verbs or nouns)
– Regular expressions
• Companies: Capital* [,] inc, Capital* corporation...
• Locations: Capital* Lake, Capital* River
– Capitalization
– List structure
• After the creation of an initial set of rules:
– Run on the corpus
– Analyze the results
– Fix the rules and repeat…
• This process can take around 2-3 weeks and results in a break-even performance of between 85% and 90%.
• Better performance can be achieved with more effort (2-3 months), and then performance can reach 95-98%.

© Ronen Feldman 22 DIAL – Declarative Information Analysis Language
An Information Extraction Language

12 © Ronen Feldman 23 The DIAL4 Language
Declarative Information Analysis Language
• Modular
• Entity Oriented
• Regular Expressions Based
• Rapid and Easy RuleBook Development

© Ronen Feldman 24 DIAL4 Concepts
Each concept may have:
Attributes
Guards – a set of logical conditions on the concept attributes’ values
Actions – a set of operations to perform after finding a concept instance
Internal – a section for defining internal concepts which can only be used within the scope of the concept
Functions – a section for defining add-on (Perl) functions that can only be used within the scope of the concept
Context – text units in which concept instances will be searched

13 © Ronen Feldman 25 DIAL4 Rule Structure
• Pattern – defines the text pattern to match when searching for a concept instance.
• Constraints – defines logical conditions to apply to values extracted from the pattern match. If these conditions are not met, the match is discarded.
• Actions – a set of operations to perform after finding a pattern match. This is where concept instances are added to the Shared Memory.

© Ronen Feldman 26 A Full Rule – an Example
Example of an instance: Crown Central Petroleum Corp.
concept Company {};
rule Company {
pattern: (Capital+) -> name wcCompanyExt “.”?;
constraints: !(name.FirstToken() IS_IN wcCompNameNonStarters);
actions: Add();
};

14 © Ronen Feldman 27 Improving the Person Concept: Assigning values to the concept attributes
concept Person {
attributes:
string FirstName;
string MiddleName;
string LastName;
};
rule Person {
pattern: wcFirstName -> first Capital -> last;
actions: Add(FirstName<-first, LastName<-last);
};

© Ronen Feldman 28 Rule for Extraction of Person Names Based on Title/Position
wordclass wcPosition = adviser minister spokesman president (vice president) general (gen .);
/* note that wordclass members are tokenized and entries containing multiple tokens should be enclosed within () */
concept PersonNameStruct { //we define this concept to allow code reuse
attributes:
string FirstName;
string MiddleName;
string LastName;
};
wordclass wcNamePrefix = ben abu abed von al;
rule PersonNameStruct {
pattern: Capital -> first (Capital “.”?)? -> middle ((wcNamePrefix “-”?)?
Capital) ->last; actions: Add(FirstName <- first.Text(), MiddleName <- middle.Text(), LastName <- last.Text()); }; rule Person { pattern: LONGEST(wcPosition PersonNameStruct -> name); actions: Add(FirstName <- name.FirstName, MiddleName <- name.MiddleName, LastName <- name.LastName); }; 15 © Ronen Feldman 29 List of Person Names concept PersonsList{}; wordclass wcCriminalIndicatingVerbs = charged blamed arrested; wordclass wcPluralBe = are were; rule PersonsList { pattern: wcCriminalIndicatingVerbs wcPluralBe (PersonNameStruct->> pList “,”?)+ “and” PersonNameStruct ->> pList; actions: iterate (pList) begin currPerson = pList.CurrentItem(); Add(Person, currPerson, FirstName<-currPerson.FirstName, LastName <- currPerson.LastName); end }; © Ronen Feldman 30 Person Concept: Applying Constraints rule Person { pattern: LONGEST(Capital -> first (MiddleName -> middle)? ((wcNamePrefix “-”?)? Capital) ->last); constraints: (first IS_IN wcFirstNames) OR !(middle.IsEmpty()); //in “President V. Putin” don’t match “President” as the first name. !(first.FirstToken() IS_IN wcPosition); actions: Add(FirstName <- first.Text(), MiddleName <- middle.Text(), LastName <- last.Text()); Block(Person, this_match); }; We can conclude with high probability that a proper name is a person name if it has a known first name or a middle name: – John Smith – Emil I. Singer 16 © Ronen Feldman 31 Person Concept: Applying Constraints (cont.) rule Person { pattern: Capital -> first (MiddleName -> middle)? ((wcNamePrefix “-”?)? Capital) ->last; constraints: (first IS_IN wcFirstNames) OR !(middle.IsEmpty()) OR (first {1} AFTER wcPosition); !(first.FirstToken() IS_IN wcPosition); actions: Add(FirstName <- first.Text(), MiddleName <- middle.Text(), LastName <- last.Text()); }; We have two rules for concept Person: the first rule extracts names with sure internal evidence (first or middle name), and the second extracts names without internal evidence but with positions/titles preceding them. Let’s merge these two rules into a single rule. © Ronen Feldman 32 Concept Guards • Guards are applied to concept attributes when a rule attempts to add a concept instance to the Shared Memory. A concept instance will be added only if all guard conditions are met . • Guards enable the concept to ensure conditions on its attribute values in a central location, without having to add these conditions to each rule of the concept. Example: concept Date { attributes: number nDay; number nMonth; number nYear; guards: (nDay >= 1) AND (nDay <= 31); (nMonth >= 1) AND (nMonth <=12); (nYear > 0); }; 17 © Ronen Feldman 33 Introduction to HMMs for IE © Ronen Feldman 34 Motivation • We can view the named entity extraction as a classification problem, where we classify each word as belonging to one of the named entity classes or to the noname class. • One of the most popular techniques for dealing with classifying sequences is HMM. • Example of using HMM for another NLP classification task is that of part of speech tagging (Church, 1988 ; Weischedel et. al., 1993). 18 © Ronen Feldman 35 What is HMM? • HMM (Hidden Markov Model) is a finite state automaton with stochastic state transitions and symbol emissions (Rabiner 1989). • The automaton models a probabilistic generative process. • In this process a sequence of symbols is produced by starting in an initial state, transitioning to a new state, emitting a symbol selected by the state and repeating this transition/emission cycle until a designated final state is reached. 
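The transition/emission cycle just described can be sketched in a few lines of Python. This is only an illustration of the generative process: the states, vocabulary, and probabilities below are made-up toy values, not parameters of any system discussed later.

```python
import random

# Toy HMM: two hidden states and a tiny symbol vocabulary (illustrative values).
states = ["PERSON", "NONE"]
start_p = {"PERSON": 0.2, "NONE": 0.8}
trans_p = {"PERSON": {"PERSON": 0.5, "NONE": 0.5},
           "NONE":   {"PERSON": 0.1, "NONE": 0.9}}
emit_p = {"PERSON": {"john": 0.6, "smith": 0.4},
          "NONE":   {"met": 0.3, "the": 0.4, "manager": 0.3}}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r <= acc:
            return outcome
    return outcome  # numerical safety net

def generate(length=5):
    """Emit a symbol sequence by the transition/emission cycle described above."""
    state, output = sample(start_p), []
    for _ in range(length):
        output.append((state, sample(emit_p[state])))
        state = sample(trans_p[state])
    return output

print(generate())
```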
© Ronen Feldman 36 Disadvantage of HMM
• The main disadvantage of using an HMM for information extraction is the need for a large amount of training data, i.e., a carefully tagged corpus.
• The corpus needs to be tagged with all the concepts whose definitions we want to learn.

19 © Ronen Feldman 37 Notational Conventions
• T = length of the sequence of observations (the training set)
• N = number of states in the model
• q_t = the actual state at time t
• S = {S_1, ..., S_N} (finite set of possible states)
• V = {O_1, ..., O_M} (finite set of observation symbols)
• π = {π_i} = {P(q_1 = S_i)}, the starting probabilities
• A = {a_ij} = {P(q_{t+1} = S_j | q_t = S_i)}, the transition probabilities
• B = {b_i(O_t)} = {P(O_t | q_t = S_i)}, the emission probabilities

© Ronen Feldman 38 The Classic Problems Related to HMMs
• Find P(O | λ): the probability of an observation sequence given the HMM model.
• Find the most likely state trajectory given λ and O.
• Adjust λ = (π, A, B) to maximize P(O | λ).

20 © Ronen Feldman 39 Calculating P(O | λ)
• The most obvious way to do that would be to enumerate every possible state sequence of length T (the length of the observation sequence). Let Q = q_1, ..., q_T; then, by assuming independence between the states, we have
– P(O | Q, λ) = ∏_{i=1}^{T} P(O_i | q_i, λ) = ∏_{i=1}^{T} b_{q_i}(O_i)
– P(Q | λ) = π_{q_1} ∏_{i=1}^{T-1} a_{q_i q_{i+1}}
• By the definition of conditional probability we have
– P(O, Q | λ) = P(O | Q, λ) P(Q | λ)
• Finally, P(O | λ) = Σ_{all Q} P(O | Q, λ) P(Q | λ).
• The main problem with this is that we need to do on the order of 2T·N^T multiplications, which is certainly not feasible even for a modest T like 10.

© Ronen Feldman 40 The forward-backward algorithm
• In order to solve this we use the forward-backward algorithm, which is far more efficient. The forward part is based on the computation of terms called the alpha terms, where α_t(i) is the probability of the partial observation sequence O_1 ... O_t ending in state S_i at time t. We define the alpha values as follows:
– α_1(i) = π_i b_i(O_1), for 1 ≤ i ≤ N
– α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_{ij} ] b_j(O_{t+1})
– P(O | λ) = Σ_{i=1}^{N} α_T(i)
• We can compute the alpha values inductively in a very efficient way.
• This calculation requires just on the order of N²·T multiplications.

21 © Ronen Feldman 41 The backward phase
• In a similar manner we can define a backward variable called beta that computes the probability of a partial observation sequence from t+1 to T. The beta variable is also computed inductively, but in a backward fashion:
– β_T(i) = 1
– β_t(i) = Σ_{j=1}^{N} a_{ij} b_j(O_{t+1}) β_{t+1}(j)

© Ronen Feldman 42 Solution for the second problem
• Our main goal is to find the “optimal” state sequence. We will do that by maximizing the probabilities of each state individually.
• We will start by defining a set of gamma variables that measure the probability that at time t we are in state S_i:
– γ_t(i) = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)
• The denominator is used just to make gamma a true probability measure.
• Now we can find the best state at each time slot in a local fashion:
– q_t = argmax_{1 ≤ i ≤ N} γ_t(i)
• The main problem with this approach is that the optimization is done locally, and not on the whole sequence of states. This can lead either to a local maximum or even to an invalid sequence. In order to solve that problem we use a well-known dynamic programming algorithm called the Viterbi algorithm.

22 © Ronen Feldman 43 The Viterbi Algorithm
• Intuition
– Compute the most likely sequence starting with the empty observation sequence; use this result to compute the most likely sequence with an output sequence of length one; recurse until you have the most likely sequence for the entire sequence of observations.
• Algorithmic Details
– The delta variables compute the highest probability of a partial sequence up to time t that ends in state S_i. The psi variables enable us to accumulate the best sequence as we move along the time slices.
• 1. Initialization:
– δ_1(i) = π_i b_i(O_1), for 1 ≤ i ≤ N
– ψ_1(i) = 0

© Ronen Feldman 44 Viterbi (Cont.)
• Recursion:
– δ_t(j) = max_{1 ≤ i ≤ N} [ δ_{t-1}(i) a_{ij} ] b_j(O_t)
– ψ_t(j) = argmax_{1 ≤ i ≤ N} [ δ_{t-1}(i) a_{ij} ]
• Termination:
– P* = max_{1 ≤ i ≤ N} [ δ_T(i) ]
– q*_T = argmax_{1 ≤ i ≤ N} [ δ_T(i) ]
• Reconstruction:
– q*_t = ψ_{t+1}(q*_{t+1}), for t = T-1, T-2, ..., 1.
• The resulting sequence, q*_1 q*_2 ... q*_T, solves Problem 2. (A short code sketch of the forward and Viterbi recursions appears after the Statistical Model slide below.)

23 © Ronen Feldman 45 Viterbi (Example)

© Ronen Feldman 46 The Just Research HMM
• Each HMM extracts just one field of a given document. If more fields are needed, several HMMs need to be constructed.
• The HMM takes the entire document as one observation sequence.
• The HMM contains two classes of states: background states and target states. The background states emit words in which we are not interested, while the target states emit words that constitute the information to be extracted.
• The state topology is designed by hand and only a few transitions are allowed between the states.

24 © Ronen Feldman 47 Possible HMM Topologies
[Diagrams: two example topologies built from an initial state, background states, target states, and a final state; the second adds prefix and suffix states around the target states.]

© Ronen Feldman 48 A more General HMM Architecture

25 © Ronen Feldman 49 Experimental Evaluation

Extracted Field                      Result
Speaker                              71.1%
Location                             83.9%
Start Time                           99.1%
End Time                             59.5%
Acquiring Company                    30.9%
Acquired Company                     48.1%
Abbreviation of Acquired Company     40.1%
Price of Acquisition                 55.3%
Status of Acquisition                46.7%

© Ronen Feldman 50 BBN’s IdentiFinder
• An ergodic bigram model.
• Each Name Class (NC) has a separate region in the HMM.
• The number of states in each NC region is equal to |V|; each word has its own state.
• Rather than using plain words, extended words are used. An extended word is a pair <w, f>, where f is a feature of the word w.

26 © Ronen Feldman 51 BBN’s HMM Architecture

© Ronen Feldman 52 Possible word Features
1. 2-digit number (01)
2. 4-digit number (1996)
3. alphanumeric string (A34-24)
4. digits and dashes (12-16-02)
5. digits and slashes (12/16/02)
6. digits and comma (1,000)
7. digits and period (2.34)
8. any other number (100)
9. all capital letters (CLF)
10. capital letter and a period (M.)
11. first word of a sentence (The)
12. initial letter of the word is capitalized (Albert)
13. word in lower case (country)
14. all other words and tokens (;)

27 © Ronen Feldman 53 Statistical Model
• The design of the formal model is done in levels.
• At the first level we have the most accurate model, which requires the largest amount of training data.
• At the lower levels we have back-off models that are less accurate but also require much smaller amounts of training data.
• We always try to use the most accurate model possible given the amount of available training data.
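To make the forward (α) and Viterbi (δ/ψ) recursions from the preceding slides concrete, here is a minimal NumPy sketch. The 2-state model uses made-up toy parameters; only the recursions themselves follow the formulas above.

```python
import numpy as np

# Toy parameters (illustrative): N=2 states, M=3 observation symbols.
pi = np.array([0.6, 0.4])                          # starting probabilities
A = np.array([[0.7, 0.3], [0.4, 0.6]])             # A[i, j] = P(q_{t+1}=S_j | q_t=S_i)
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])   # B[i, k] = P(O_t = k | q_t = S_i)
O = [0, 1, 2, 1]                                   # observation sequence (symbol indices)

def forward(O, pi, A, B):
    """Return P(O | lambda) via the alpha recursion (order N^2 * T multiplications)."""
    alpha = pi * B[:, O[0]]
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def viterbi(O, pi, A, B):
    """Return the most likely state sequence via the delta/psi recursion."""
    delta = pi * B[:, O[0]]
    psi = []
    for o in O[1:]:
        scores = delta[:, None] * A            # scores[i, j] = delta_{t-1}(i) * a_ij
        psi.append(scores.argmax(axis=0))      # best predecessor for each state j
        delta = scores.max(axis=0) * B[:, o]
    # Termination and path reconstruction.
    path = [int(delta.argmax())]
    for back in reversed(psi):
        path.append(int(back[path[-1]]))
    return list(reversed(path)), float(delta.max())

print(forward(O, pi, A, B))
print(viterbi(O, pi, A, B))
```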
© Ronen Feldman 54 Computing State Transition Probabilities
• When we want to analyze formally the probability of annotating a given word sequence with a set of name classes, we need to consider three different statistical models:
– A model for generating a name class
– A model to generate the first word in a name class
– A model to generate all other words (but the first word) in a name class

28 © Ronen Feldman 55 Computing the Probabilities: Details
• The model to generate a name class depends on the previous name class and on the word that precedes the name class; this is the last word in the previous name class, and we annotate it by w_{-1}. So formally this amounts to P(NC | NC_{-1}, w_{-1}).
• The model to generate the first word in a name class depends on the current name class and the previous name class, and hence is P(<w,f>_first | NC, NC_{-1}).
• The model to generate all other words within the same name class depends on the previous word (within the same name class) and the current name class, so formally it is P(<w,f> | <w,f>_{-1}, NC).

© Ronen Feldman 56 The Actual Computation
P(NC | NC_{-1}, w_{-1}) = c(NC, NC_{-1}, w_{-1}) / c(NC_{-1}, w_{-1})
P(<w,f>_first | NC, NC_{-1}) = c(<w,f>_first, NC, NC_{-1}) / c(NC, NC_{-1})
P(<w,f> | <w,f>_{-1}, NC) = c(<w,f>, <w,f>_{-1}, NC) / c(<w,f>_{-1}, NC)
Here c(<w,f>, <w,f>_{-1}, NC) counts the number of times that we see the pair <w,f> after the pair <w,f>_{-1}, where both are tagged by the name class NC; the other counts are defined analogously.

29 © Ronen Feldman 57 Modeling Unknown Words
• The main technique is to create a new entity called UNKNOWN (marked _UNK_), and create statistics for that new entity. All words that were not seen before are mapped to _UNK_.
• Split the training collection into 2 even parts, and each time use one part for training and one part as a hold-out set. The final statistics are the combination of the results from the two runs.
• The statistics need to be collected for 3 different classes of cases: _UNK_ followed by a known word (|V| cases), a known word followed by _UNK_, and two consecutive _UNK_ words. These statistics are collected for each name class.

© Ronen Feldman 58 Name Class Back-off Models
• The full model takes into account both the previous name class and the previous word (P(NC | NC_{-1}, w_{-1})).
• The first back-off model takes into account just the previous name class (P(NC | NC_{-1})).
• The next back-off model just estimates the probability of seeing the name class based on the distribution of the various name classes (P(NC)).
• Finally, we use a uniform distribution over all name classes (1/(N+1), where N is the number of possible name classes).

30 © Ronen Feldman 59 First Word Back-off Models
• The full model takes into account the current name class and the previous name class (P(<w,f>_first | NC, NC_{-1})).
• The first back-off model takes into account just the current name class (P(<w,f>_first | NC)).
• The next back-off model breaks the pair and uses the product of two events that are treated as independent given the current name class: P(w | NC)·P(f | NC).
• The last back-off model is a uniform distribution over all pairs of words and features: 1/(|V|·|F|), where |V| is the size of the vocabulary and |F| is the number of possible word features.

© Ronen Feldman 60 Rest of the Words Back-off Models
• The full model takes into account the current name class and the previous word (P(<w,f> | <w,f>_{-1}, NC)).
• The first back-off model takes into account just the current name class (P(<w,f> | NC)).
• The next back-off model breaks the pair and uses the product of two events that are treated as independent given the current name class: P(w | NC)·P(f | NC).
• The last back-off model is a uniform distribution over all pairs of words and features: 1/(|V|·|F|), where |V| is the size of the vocabulary and |F| is the number of possible word features.

31 © Ronen Feldman 61 Combining all the models
• The actual probability is a combination of the different models. Each model gets a different weight based on the amount of training available to that model.
• Let’s assume we have 4 models (one full model and 3 back-off models), and we are trying to estimate the probability P(X|Y). Let P1 be the probability of the event according to the full model, and P2, P3, P4 the probabilities according to the back-off models, respectively.
• The weights are computed based on a lambda parameter that relates each model to its immediate back-off model. For instance, λ1 adjusts the weight between the full model and the first back-off model:
λ = (1 − c(Y)/bc(Y)) · 1/(1 + #(Y)/bc(Y))

© Ronen Feldman 62 Analysis
• Here c(Y) is the count of event Y according to the full model, bc(Y) is the count of event Y according to the back-off model, and #(Y) is the number of unique outcomes of Y.
• Lambda has two desirable properties:
– If the full model and the back-off model both have the same support for event Y, then lambda will be 0 and we will use just the full model.
– If the possible outcomes of Y are distributed uniformly, then the weight of lambda will be close to 0, since there is low confidence in the back-off model.

32 © Ronen Feldman 63 Example
• We want to compute the probability P(“bank” | “river”, Not-A-Name). Let’s assume that “river” appears with 3 different words in the Not-A-Name name class, and that in total there are 9 occurrences of “river” with any of these 3 words.
• This gives λ1 = (1 − 0/9) · 1/(1 + 3/9) = 1 · 3/4 = 3/4, so we will use the full model (P1) with weight 0.75 and the back-off models with the remaining 0.25. We then compute λ2, which weighs the first back-off model (P2) against the other back-off models, and finally λ3, which weighs the second back-off model (P3) against the last back-off model. So, to sum up, the probability P(X|Y) would be:
• P(X|Y) = λ1 · P1(X|Y) + (1 − λ1) · (λ2 · P2(X|Y) + (1 − λ2) · (λ3 · P3(X|Y) + (1 − λ3) · P4(X|Y)))
(A short code sketch of this interpolation appears after the modality examples below.)

© Ronen Feldman 64 Using different modalities of text
• Mixed Case: Abu Sayyaf carried out an attack on a south western beach resort on May 27, seizing hostages including three Americans. They are still holding a missionary couple, Martin and Gracia Burnham, from Wichita, Kansas, and claim to have beheaded the third American, Guillermo Sobero, from Corona, California. Mr. Sobero's body has not been found.
• Upper Case: ABU SAYYAF CARRIED OUT AN ATTACK ON A SOUTH WESTERN BEACH RESORT ON MAY 27, SEIZING HOSTAGES INCLUDING THREE AMERICANS. THEY ARE STILL HOLDING A MISSIONARY COUPLE, MARTIN AND GRACIA BURNHAM, FROM WICHITA, KANSAS, AND CLAIM TO HAVE BEHEADED THE THIRD AMERICAN, GUILLERMO SOBERO, FROM CORONA, CALIFORNIA. MR SOBERO'S BODY HAS NOT BEEN FOUND.
• SNOR: ABU SAYYAF CARRIED OUT AN ATTACK ON A SOUTH WESTERN BEACH RESORT ON MAY TWENTY SEVEN SEIZING HOSTAGES INCLUDING THREE AMERICANS THEY ARE STILL HOLDING A MISSIONARY COUPLE MARTIN AND GRACIA BURNHAM FROM WICHITA KANSAS AND CLAIM TO HAVE BEHEADED THE THIRD AMERICAN GUILLERMO SOBERO FROM CORONA CALIFORNIA MR SOBEROS BODY HAS NOT BEEN FOUND.
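The referenced sketch of the model combination: the function below simply folds the nested formula P(X|Y) = λ1·P1(X|Y) + (1−λ1)·(λ2·P2(X|Y) + …) from the Example slide. It is my own illustration; the model probabilities and the λ2, λ3 values are assumed toy numbers (λ1 = 0.75 echoes the worked example above).

```python
def interpolate(probs, lambdas):
    """Combine a full model with its back-off chain:
    P = l1*P1 + (1-l1)*(l2*P2 + (1-l2)*(l3*P3 + (1-l3)*P4))
    probs   -- [P1, ..., Pk] from most specific to most general model
    lambdas -- [l1, ..., l(k-1)], one weight per back-off step
    """
    assert len(probs) == len(lambdas) + 1
    p = probs[-1]                       # start from the most general model
    for prob, lam in zip(reversed(probs[:-1]), reversed(lambdas)):
        p = lam * prob + (1.0 - lam) * p
    return p

# Toy numbers echoing the "bank | river, Not-A-Name" example: the full model
# gets weight 0.75 and the remaining 0.25 is shared among the back-off models.
P1, P2, P3, P4 = 0.30, 0.10, 0.05, 0.01   # illustrative model estimates
l1, l2, l3 = 0.75, 0.5, 0.5               # l1 = 3/4 as computed above; l2, l3 assumed
print(interpolate([P1, P2, P3, P4], [l1, l2, l3]))
```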
33 © Ronen Feldman 65 Experimental Evaluation (MUC 7)

Modality     Language   Rule Based   HMM
Mixed case   English    96.4%        94.9%
Upper case   English    89%          93.6%
SNOR         English    74%          90.7%
Mixed case   Spanish    93%          90%

© Ronen Feldman 66 How much Data is needed to train an HMM?

Number of Tagged Words   English   Spanish
23,000                   NA        88.6%
60,000                   91.5%     89.7%
85,000                   91.9%     NA
130,000                  92.8%     90.5%
230,000                  93.1%     91.2%
650,000                  94.9%     NA

34 © Ronen Feldman 67 Limitations of the Model
• The context which is used for deciding on the type of each word is just the word that precedes the current word. In many cases, such a limited context may cause classification errors.
• As an example, consider the following text fragment: “The Turkish company, Birgen Air, was using the plane to fill a charter commitment to a German company”. The token that precedes Birgen is a comma, and hence we are missing the crucial clue “company”, which is just one token before the comma.
• Due to the lack of this hint, the IdentiFinder system classified Birgen Air as a location rather than as a company. One way to solve this problem is to augment the model with another token when the previous token is a punctuation mark.

© Ronen Feldman 68 Example - input
NEWS
4 Ryan daughters tied to cash ( Thu Feb 13, 9:49 AM )
4 Ryan daughters tied to cash
Thu Feb 13, 9:49 AM
Yahoo-News
By Matt O'Connor, Tribune staff reporter. Tribune staff reporter Ray Gibson contributed to this report

Four of former Gov. George Ryan's daughters shared in almost 10,000 in secret payments from Sen. Phil Gramm's presidential campaign in the mid-1990s, according to testimony Wednesday in federal court.

Alan Drazek, who said he was brought in at Ryan's request as a point man for the Gramm campaign in Illinois, testified he was told by Scott Fawell, Ryan's chief of staff, or Richard Juliano, Fawell's top aide, to cut the checks to the women. According to court records made public Wednesday, Ryan's daughter, Lynda Pignotti, now known as Lynda Fairman, was paid a combined 5,950 in 1995 by the Gramm campaign in four checks laundered through Drazek's business, American Management Resources.

In 1996, individual checks went to Ryan daughters Nancy Coghlan, who received 1,725, and Joanne Barrow and Julie R. Koehl, who each pocketed 1,000, the records showed.

A source said all four daughters had been given immunity from prosecution by federal authorities and testified before the grand jury investigating Fawell as part of the Operation Safe Road probe.

Full story at Chicago Tribune

35 © Ronen Feldman 69 Example - Output Full story at Baltimore Sun By Matt O ' Connor , Tribune staff reporter . Tribune staff reporter Ray Gibson contributed to this report Four of former Gov . George Ryan ' s daughters shared in almost 10 , 000 in secret payments from Sen . Phil Gramm ' s presidential campaign in the mid - 1990 s , according to testimony Wednesday in federal court . Alan Drazek , who said he was brought in at Ryan ' s request as a point man for the Gramm campaign in Illinois , testified he was told by Scott Fawell , Ryan ' s chief of staff , or Richard Juliano , Fawell ' s top aide , to cut the checks to the women . According to court records made public Wednesday , Ryan ' s daughter , Lynda Pignotti , now known as Lynda Fairman , was paid a combined 5 , 950 in 1995 by the Gramm campaign in four checks laundered through Drazek ' s business , American Management Resources . In 1996 , individual checks went to Ryan daughters Nancy Coghlan , who received 1 , 725 , and Joanne Barrow and Julie R . Koehl , who each pocketed 1 , 000 , the records showed . A source said all four daughters had been given immunity from prosecution by federal authorities and testified before the grand jury investigating Fawell as part of the Operation Safe Road probe . Full story at Chicago Tribune © Ronen Feldman 70 Training: ACE + MUC => MUC 50/29.693.7/40.54percent 86.6/82.197.6/82.1money 77.7/91.790.7/91.3location 68.6/92.576.4/77.6time 59/89.590.9/76.6date 83.1/95.991.1/93.7organization 84.9/88.691.9/85.5person ace+muc7muc7r/p 36 © Ronen Feldman 71 Results with our new algorithm 60 70 80 90 100 10.7 14.6 19.8 27 36.7 49.9 68 92.5 126 171 233 317 432 588 801 1090 1484 2019 2749 3741 Amount of training (KWords) Logarithmic scale - each tick multiplies the amount by 7/6 F-measure(%) DATE LOCATION ORGANIZATION PERSON DATE-Nymble LOC-Nymble ORG-Nymble Currently available Amount of training © Ronen Feldman 72 Some Useful Resources • Linguistic Data Consortium (LDC) – Lexicons – Annotated Corpora (Text and Speech) • New Mexico State University – Gazetteers – Many lists of names – Lexicons for different languages • Various Web Sources – CIA World Fact Book – Hoovers – SEC, Nasdaq (list of public company names) – US Census data – Private web sites (like Arabic, Persian, Pakistani names) 37 © Ronen Feldman 73 Shallow Parsing in IE • Only Core Constituents are extracted • No attempt is made at full parses • Relevant prepositional attachments are extracted. – I saw the man with a telescope • Only adverbials related to location and time are processed, others are ignored. • Quantifiers, modals, and propositional attitudes are ignored, or treated in a simplified way. © Ronen Feldman 74 Why not Full Parsing? • Full Parsing for IE was tried in: – SRI Tacitus system (MUC 3) – NYU Proteus (Muc-6) • Main Issues: – Slow (combinatorial explosion of possible parses) – Erroneous – A simple predicate-argument structure is needed. 38 © Ronen Feldman 75 Coreference • The general problem is related to co referential relations between expressions – Whole – part relationship – Containment relationship (set/subset) • A simplest version is to find which noun phrases refer to the same entity • An even more restricted version is to limit it just to proper names. • Example: – The President, George Bush, George W. Bush, or even “W”, all refer to the same entity. 
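For the most restricted version of the problem (proper names only), a few lines of string matching already go a long way. The sketch below is my own simplistic illustration of alias grouping, not the algorithm of any particular system; the matching heuristic (every token of the shorter mention must appear, verbatim or as an initial, in the longer mention) is an assumption.

```python
def is_alias(short, full):
    """Rough test: each token of the shorter mention appears in the longer one,
    either verbatim or as a matching initial ('W.' vs. 'W')."""
    s_toks = [t.rstrip(".") for t in short.split()]
    f_toks = [t.rstrip(".") for t in full.split()]
    def token_match(s):
        return any(s == f or (len(s) == 1 and f.startswith(s)) for f in f_toks)
    return all(token_match(t) for t in s_toks)

def group_mentions(mentions):
    """Greedily attach each mention to the first chain it can alias."""
    chains = []
    for m in sorted(mentions, key=len, reverse=True):
        for chain in chains:
            if any(is_alias(m, other) or is_alias(other, m) for other in chain):
                chain.append(m)
                break
        else:
            chains.append([m])
    return chains

print(group_mentions(["George W. Bush", "George Bush", "Bush", "W",
                      "Bill Clinton", "Clinton"]))
# -> [['George W. Bush', 'George Bush', 'Bush', 'W'], ['Bill Clinton', 'Clinton']]
```

Real systems add the type, number, and gender consistency checks described on the following slides before allowing such a merge.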
© Ronen Feldman 76 Easy and hard in Coreference • “Mohamed Atta, a suspected leader of the hijackers, had spent time in Belle Glade, Fla., where a crop-dusting business is located. Atta and other Middle Eastern men came to South Florida Crop Care nearly every weekend for two months. • “Will Lee, the firm's general manager, said the men asked repeated questions about the crop-dusting business. He said the questions seemed "odd," but he didn't find the men suspicious until after the Sept. 11 attack.” 39 © Ronen Feldman 77 The “Simple” Coreference • Proper Names – IBM, “International Business Machines”, Big Blue – Osama Bin Ladin, Bin Ladin, Usama Bin Laden. (note the variations) • Definite Noun Phrases – The Giant Computer Manufacturer, The Company, The owner of over 600,000 patents • Pronouns – It, he , she, we….. © Ronen Feldman 78 Coreference Example Granite Systems provides Service Resource Management (SRM) software for communication service providers with wireless, wireline, optical and packet technology networks. Utilizing Granite' Xng System, carriers can manage inventories of network resources and capacity, define network configurations, order and schedule resources and provide a database of record to other operational support systems (OSSs). An award-winning company, including an Inc. 500 company in 2000 and 2001, Granite Systems enables clients including AT&T Wireless, KPN Belgium, COLT Telecom, ATG and Verizon to eliminate resource redundancy, improve network reliability and speed service deployment. Founded in 1993, the company is headquartered in Manchester, NH with offices in Denver, CO; Miami, FL; London, U.K.; Nice, France; Paris, France; Madrid, Spain; Rome, Italy; Copenhagen, Denmark; and Singapore. 40 © Ronen Feldman 79 A KE Approach to Corefernce • Mark each noun phrase with the following: – Type (company, person, location) – Singular vs. plural – Gender (male, female, neutral) – Syntactic (name, pronoun, definite/indefinite) • For each candidate – Find accessible antecedents • Each antecedent has a different scope – Proper names’s scope is the whole document – Definite clauses’s scope is the preceding paragraphs – Pronouns might be just the previous sentence, or the same paragraph. – Filter by consistency check – Order by dynamic syntactic preference © Ronen Feldman 80 Filtering Antecedents • George Bush will not match “she”, or “it” • George Bush can not be an antecedent of “The company” or “they” • Using a sort Hierarchy we can use background information to be smarter – Example: “The big automaker is planning to get out the car business. The company feels that it can never longer make a profit making cars.” 41 © Ronen Feldman 81 IE Via BootStrapping © Ronen Feldman 82 AutoSlog (Riloff, 1993) • Creates Extraction Patterns from annotated texts (NPs). • Uses Sentence analyzer (CIRCUS, Lehnert, 1991) to identify clause boundaries and syntactic constituents (subject, verb, direct object, prepositional phrase) • It then uses heuristic templates to generate extraction patterns 42 © Ronen Feldman 83 Example Templates Was aimed at Passive-verb prep Killed with Active-verb prep Bomb against Noun prep Bombed Active-verb was victim aux noun was murdered passive-verb ExampleTemplate © Ronen Feldman 84 AutoSlog-TS (Riloff, 1996) • It took 8 hours to annotate 160 documents, and hence probably a week to annotate 1000 documents. • This bottleneck is a major problem for using IE in new domains. 
• Hence there is a need for a system that can generate IE patterns from unannotated documents. • AutoSlog-TS is such a system. 43 © Ronen Feldman 85 AutoSlog TS Sentence Analyzer S: World Trade Center V: was bombed PP: by terrorists AutoSlog Heuristics Extraction Patterns was killed bombed by Sentence Analyzer Extraction Patterns was killed was bombed bombed by saw EP REL % was bobmed 87% bombed by 84% was killed 63% saw 49% © Ronen Feldman 86 Top 24 Extraction Patterns Exploded on kidnappedOne of was murderedDestroyed Was wounded in Occurred on Responsibility for Took place on was loctated occured was wounded Claimed Caused took place Death of Exploded in was injured Attack on was kidnapped was killed Assassination of Murder of exploded 44 © Ronen Feldman 87 Evaluation • Data Set: 1500 docs from MUC-4 (772 relevant) • AutoSlog generated 1237 patterns which were manually filtered to 450 in 5 hours. • AutoSlog-TS generated 32,345 patterns, after discarding singleton patterns, 11,225 were left. • Rank(EP) = • The user reviewed the the top 1970 patterns and selected 210 of them in 85 minutes. • AutoSlog achieved better recall, while AutoSlog-TS achieved better precision. freq freq freqrel 2log − © Ronen Feldman 88 Learning Dictionaries by Bootstrapping (Riloff and Jones, 1999) • Learn dictionary entries (semantic lexicon) and extraction patterns simultaneously. • Use untagged text as a training source for learning. • Start with a set of seed lexicon entries and using mutual bootstrapping learn extraction patterns and more lexicon entries. 45 © Ronen Feldman 89 Mutual Bootstrapping Algorithm • Using AutoSlog generate all possible extraction patterns. • Apply patterns to the corpus and save results to EPData • SemLex = Seed Words • Cat_EPList = {} 1. Score all extraction patterns in EPData 2. Best_EP = highest scoring pattern 3. Add Best_EP to Cat_EPList 4. Add Best_EP’s extractions to SemLex 5. Goto 1 © Ronen Feldman 90 Meta Bootstrapping Process SeedWords PermanentSemantic lexicon CandidateExtractionPatterns andtheirExtractions TemporarySemantic Lexicon CategoryEPList Initialize Add5Best NPs Select best_EP Add best_EP’s extractions 46 © Ronen Feldman 91 Sample Extraction Patterns process in offices of expanded into returned to has positionsservices in taken in request informationdistributors in part in message to customers in ministers of thriveoutlets in to enter devoted to consulting in parts of sold to activities in presidents of motivated seminars in sought in positioningoperates in become in is distributoroperations in traveled to employedfacilities in living in owned by offices in terrorism locationwww companywww location © Ronen Feldman 92 Experimental Evaluation 124/244 (.51) 101/194 (.52) 85/144 (.59) 68/94 (.72) 31/44 (.70) 4/4 (1) Terr. Weapon 158/250 (.63) 127/200 (.64) 100/150 (.67) 66/100 (.66) 32/50 (.64) 5/5 (1) Terr. Location 107/231 (.46) 101/181 (.56) 86/131 (.66) 63/81 (.78) 22/31 (.71) 0/1 (0) Web Title 191/250 (.76) 163/200 (.82) 129/150 (.86) 88/100 (.88) 46/50 (.92) 5/5 (1) Web Location 95/206 (.46) 86/163 (.53) 72/113 (.64) 52/65 (.80) 25/32 (.78) 5/5 (1) Web Company Iter 50Iter 40Iter 30Iter 20Iter 10Iter 1 47 © Ronen Feldman 93 KDD Cup Task 1 Information Extraction from Biomedical Articles System Description June / July 2002 © Ronen Feldman 94 The Task: Curate or Not- Curate A Product – mRNA or Protein actually identified (naturally) within specific cells of the natural (WildType) fly. 
For each paper, a list of all genes mentioned in the paper - for which we must decide if there is a product result - is given Build a system for automatic analysis of scientific papers regarding the Drosophila Fruit FlyDrosophila Fruit Fly. The system should extract (curate) only the papers that include experimentalexperimental resultsresults regarding expressionexpression of genegene productsproducts, and identifyidentify these genes and products 48 © Ronen Feldman 95 Quick Biological Background RNARNA (Ribonucleic Acid)(Ribonucleic Acid) is a molecule that isis a molecule that is “mathematically” equivalent to (but chemically different from) the DNA sequence of the gene. Transcription means transfer of the genetic information from the archival copy of DNA to the short-lived messenger RNA (mRNA) Transcription © Ronen Feldman 96 Quick Biological Background (Continued) is the process that takes a sequence in one code – nucleotides, and creates the corresponding sequence in another code - amino acids (The building blocks of peptides / proteins). A protein will be expressed only if its code was “translated” from the mRNA. TranslationTranslation 49 © Ronen Feldman 97 The Task: So what’s the problem? • Very often papers discuss mutations and forced (ectopic) expression of genes in addition to natural ones • Many genes are “just mentioned” within the papers without actually citing results or are being used as an auxiliary tool for investigating other genes (Example: The White/Red Eye Gene - w) • The Transcript vs. Protein distinction is tricky (they usually have the same name … ) © Ronen Feldman 98 Our System: Translating the problem into an Information Extraction Task • The scientific papers given are lengthy and complex … • We’re given only a text version without images • But they have a very fixed structure • We’re actually interested only in specific, actual experimental results • Fortunately, these results are obtained using a set of wellknown techniques • Our approach is Knowledge-Based Information Extraction, i.e. finding frequent patterns relevant to the domain So our Solution is … 50 © Ronen Feldman 99 The Figure IS the Result Molecular Biologists who review these papers, look mainly for the figures! Example: This figure (from *R100, in the Training Set) that shows that a specific transcript is present both in the eye and the body. Obvious highlighted sections (Title and Abstract) are used too. *Multiple Subtypes of Phospholipase C Are Encoded by the norpA Gene of Drosophila melanogaster Sunkyu Kim , Richard R. McKay , Karen Miller , Randall D. Shortridge J. Biol. Chem. 270(24): 14376-82. © Ronen Feldman 100 The Figure IS the Result (Continued) But our system can’t read figures and actually doesn’t have them … The Solution … 51 © Ronen Feldman 101 The Alternative: Focus on Figure Legend @Northern Analysis of Adult RNA@ When radiolabeled @norpA@ cDNA probes are hybridized to blots of poly(A) [_2747_tex2html_wrap740.xbm] RNA, three major transcripts can be identified. As shown in Fig. 3(@panel@@A@), a major @norpA@ transcript that is 7.5 kb in length is easily detected in wild-type head but is absent from head of @eya@ mutant. The absence of the 7.5-kb transcript from @eya@ head suggests that it is expressed in the compound eye. Two other transcripts, one that is 5.5 kb and one that is 5.0 kb in length, are visible in body. 
None of these transcripts are detectable in head or body of @norpA@ [_2747_tex2html_wrap732.xbm] mutant (Zhu @et al.@, 1993), suggesting that they are encoded by the @norpA@ gene. [bc2558926003.gif] _________________________________________________________________ Figure 3: Northern blot analysis of @norpA@ transcripts in adult @Drosophila@ tissues. Approximately 5 µg of poly(A) [_2747_tex2html_wrap740.xbm] RNA was loaded in each lane and probed with a 3.4-kb @norpA@ cDNA fragment (nucleotides 1-3453) (@A@), an 80-bp exon 4 cDNA fragment (@B@), or an 80-bp exon 4A cDNA fragment (@C@). @Lane@ designations indicate RNA isolated from adult head or body (thorax and abdomen) of wild-type (@WT@) @Drosophila@, eyes absent (@eya@) mutant, or @norpA@ mutant. Mobility of RNA size standards (in kilobases) are indicated on the @right@. @Panels@@D@- @F@ show the result of reprobing the blots with @Drosophila@ RP49 cDNA (O'Connell and Rosbash, 1984) as a control to test for RNA loading. This is how the extract from the same paper looks as a text file © Ronen Feldman 102 Extracting the Pattern from the Figure Legend • Extracting (finding) the Figure Title is easy : “Figure #” or “Fig. #” beginning at a new line • Look for patterns incorporating a technique used in obtaining the results (for example, Northern blot), or noun phrase or verb describing an expression (“expression”, “localization”, “expressed” …) with a synonym of Gene(s). 52 © Ronen Feldman 103 HP1a, HP1b, and HP1c localize to distinct regions of Drosophila nuclei. Extracting the Pattern from the Figure Legend Example These are probably Proteins (Multi-Capital names are usually Proteins and not Transcripts). GeneList(ProductType) ExpressionVerb © Ronen Feldman 104 Making the Curate Decision : Extract Evidences and Score Them • Extract evidences from Title , Abstract , Figure Legend and GenBank footnotes • Keep a Score entry for the whole document and for each product (transcript/protein) of a candidate gene • At the end of the document, use the scores to decide regarding the curation of the document and the products of the candidate genes. (If a gene’s score is above a certain threshold, mark the gene as having an experimental result, and mark the whole document as curatable). 53 © Ronen Feldman 105 Making the Curate Decision : Positive and Negative Evidences “Northern blot analysis of @norpA@ transcripts in adult @Drosophila@ tissues” “Figure 2. Ectopic expression of @dNSF1@ in the nervous system rescues the phenotypes of @dNSF1@ mutations” Positive Evidence Negative Evidence © Ronen Feldman 106 Implementation : DIAL Rulebook • The System is implemented in DIALDIAL (Declarative Information Analysis Language), a general IE language developed at ClearForest • DIAL is based on matching patterns within the text and then checking constraints on the patterns. • Patterns combine syntactic and semantic elements. 54 © Ronen Feldman 107 © Ronen Feldman 108 55 © Ronen Feldman 109 Results and Evaluation • Document Curation : 78% F- Measure • Gene Products : 67% F-Measure Results achieved © Ronen Feldman 110 Results and Evaluation (Continued) • Most papers belong to a narrow domain (same vocabulary). • Many curatable papers have both relevant results (wild-type expression) and irrelevant ones (Mutations etc.) • Extracting evidences of specific products of genes cannot be achieved by categorization. Patterns with the specific genes must be found. 
(No real generalization can be made regarding specific genes, other than w) Information Extraction is more suitable than Categorization for this task. (Best Categorization Curation Results – about 62-64% F-Measure) Evaluation 56 © Ronen Feldman 111 A Hybrid Approach Merging the Rule Base and ML Approaches © Ronen Feldman 112 Why is HMM not enough? • The HMM model is flat, so the most it can do is assign a tag to each token in a sentence. • This is suitable for the tasks where the tagged sequences do not nest and where there are no explicit relations between the sequences. • Part-of-speech tagging and entity extraction belong to this category. • Extracting relationships is different, because the tagged sequences can (and must) nest, and there are relations between them which must be explicitly recognized. • < ACQUISITION> Ethicon EndoSurgery, Inc. , a Johnson & Johnson company, has acquired < ACQUIRED> Obtech Medical AG < /ACQUIRED> 57 © Ronen Feldman 113 A Hybrid Approach • The hybrid strategy, attempts to strike a balance between the two knowledge engineer chores – writing the extraction rules – manually tagging the documents. • In this strategy, the knowledge engineer writes SCFG rules, which are then trained on the data which is available. • The powerful disambiguating ability of the SCFG makes writing rules much simpler and cleaner task. • The knowledge engineer has the control of the generality of the rules (s)he writes, and consequently on the amount and the quality of the manually tagged training the system would require. © Ronen Feldman 114 Defining SCFG • Classical definition: A stochastic contextfree grammar (SCFG) is a quintuple G = (T, N, S, R, P) – T is the alphabet of terminal symbols (tokens) – N is the set of nonterminals – S is the starting nonterminal – R is the set of rules – P : R → [0..1] defines their probabilities. • The rules have the form n → s1s2…sk, where n is a nonterminal and each si either token or another nonterminal. • SCFG is a usual context-free grammar with the addition of the P function. 58 © Ronen Feldman 115 The Probability Function • If r is the rule n → s1s2…sk, then P(r) is the frequency of expanding n using this rule. • In Bayesian terms, if it is known that a given sequence of tokens was generated by expanding n, then P(r) is the apriory likelihood that n was expanded using the rule r. • Thus, it follows that for every nonterminal n the sum ∑ P(r) over all rules r headed by n must equal to one. © Ronen Feldman 116 IE with SCFG • A very basic “parsing” is employed for the bulk of a text, but within the relevant parts, the grammar is much more detailed. • The IE grammars can be said to define sublanguages for very specific domains. 59 © Ronen Feldman 117 IE with SCFG • In the classical definition of SCFG it is assumed that the rules are all independent. In this case it is possible to find the (unconditional) probability of a given parse tree by simply multiplying the probabilities of all rules participating in it. • The usual parsing problem is given a sequence of tokens (a string) S, to find the most probable parse tree T which could generate S. A simple generalization of the Viterbi algorithm is able to efficiently solve this problem. • In practical applications of SCFGs, it is rarely the case that the rules are truly independent. Then, the easiest way to cope with this problem while leaving most of the formalism intact is to let the probabilities P(r) be conditioned upon the context where the rule is applied. 
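Under the independence assumption above, the probability of a parse tree is simply the product of the probabilities of all rules used in it. The following toy sketch illustrates this; the grammar fragment, the rule probabilities, and the tree encoding are all made-up for the example and are not part of the TEG/DIAL rulebooks discussed below.

```python
# Toy SCFG fragment: in a full grammar each nonterminal's rule probabilities sum to 1.
rules = {
    ("Text",      ("None", "Text")):          0.5,
    ("Text",      ("Person", "Text")):        0.3,
    ("Text",      ()):                        0.2,
    ("Person",    ("Honorific", "LastName")): 0.6,
    ("Person",    ("FirstName", "LastName")): 0.4,
    ("Honorific", ("Dr",)):                   0.2,
    ("LastName",  ("Simmons",)):              0.1,
}

def tree_prob(tree):
    """P(tree) = product of P(r) over all rules r used in the tree.
    A tree is (nonterminal, [children]); a leaf token is a plain string."""
    if isinstance(tree, str):                  # terminal symbol: no rule of its own
        return 1.0
    head, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(head, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

# "Dr Simmons" parsed as Person -> Honorific LastName, inside Text.
tree = ("Text", [("Person", [("Honorific", ["Dr"]), ("LastName", ["Simmons"])]),
                 ("Text", [])])
print(tree_prob(tree))   # 0.3 * 0.6 * 0.2 * 0.1 * 0.2 = 0.00072
```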
© Ronen Feldman 118 Markovian CSFG • HMM entity extractors, are a simple case of markovian SCFGs. • Every possible rule which can be formed from the available symbols has nonzero probability. • Usually, all probabilities are initially set to be equal, and then adjusted according to the distributions found in the training data. 60 © Ronen Feldman 119 Training Issues • For some problems the available training corpora appear to be adequate. • In particular, markovian SCFG parsers trained on the Penn Treebank perform quite well (Collins 1997, Charniak 2000, Roark 2001, etc). • But for the task of relationship extraction it turns out to be impractical to manually tag the amount of documents that would be sufficient to adequately train a markovian SCFG. • At a certain point it becomes more productive to go back to the original hand-crafted system and write rules for it, even though it is a much more skilled labor! © Ronen Feldman 120 SCFG Syntax • A rulebook consists of declarations and rules. All nonterminals must be declared before usage. • Some of them can be declared as output concepts, which are the entities, events, and facts that the system is designed to extract. Additionally, two classes of terminal symbols also require declaration: termlists, and ngrams. – A termlist is a collection of terms from a single semantic category, either written explicitly or loaded from external source. – An ngram can expand to any single token. But the probability of generating a given token is not fixed in the rules, but learned from the training dataset, and may be conditioned upon one or more previous tokens. Thus, ngrams is one of the ways the probabilities of the SCFG rules can be context-dependent. 61 © Ronen Feldman 121 Example output concept Acquisition(Acquirer, Acquired); ngram AdjunctWord; nonterminal Adjunct; Adjunct :- AdjunctWord Adjunct | AdjunctWord; termlist AcquireTerm = acquired bought (has acquired) (has bought); Acquisition :- Company Acquirer [ “,”Adjunct “,” ] AcquireTerm Company Acquired; © Ronen Feldman 122 Emulation of HMM Entity Extractor in CSFG output concept Company(); ngram CompanyFirstWord; ngram CompanyWord; ngram CompanyLastWord; nonterminal CompanyNext; Company :- CompanyFirstWord CompanyNext | CompanyFirstWord; CompanyNext :- CompanyWord CompanyNext | CompanyLastWord; 62 © Ronen Feldman 123 Putting is all together start Text; nonterminal None; ngram NoneWord; None :- NoneWord None | ; Text :- None Text | Company Text | Acquisition Text | ; © Ronen Feldman 124 SCFG Training • Currently there are three different classes of trainable parameters in a TEG rulebook: – the probabilities of rules of nonterminals – the probabilities of different expansions of ngrams – the probabilities of terms in a wordclass. • All those probabilities are smoothed maximum likelihood estimates, calculated directly from the frequencies of the corresponding elements in the training dataset. 63 © Ronen Feldman 125 Sample Rules concept Text; concept output Person; ngram NGFirstName; ngram NGLastName; ngram NGNone; wordclass WCHonorific = Mr Mrs Miss Ms Dr; Person :- WCHonorific NGLastName; Person :- NGFirstName NGLastName; Text :- NGNone Text; Text :- Person Text; Text :- ; © Ronen Feldman 126 Training the rule book • By default, the initial untrained frequencies of all elements are assumed to be 1. They can be changed using “” syntax. • Let us train this rulebook on the training set containing one sentence: – Yesterday, Dr Simmons, the distinguished scientist, presented the discovery. 
• This is done in two steps. First, the sentence is parsed using the untrained rulebook, but with the constraints specified by the annotations. In our case the constraints are satisfied by two different parses: 64 © Ronen Feldman 127 2 Possible Parse Trees Text Text NGNone NGNone Yesterday Yesterday Text Text NGNone NGNone , , Text Text Person Person WCHonorific NGFirstName Dr Dr NGLastName NGLastName Simmons Simmons Text Text NGNone NGNone , , Text Text …… …… © Ronen Feldman 128 How to pick the right Parse? • The difference is in the expansion of the Person nonterminal. Both Person rules can produce the output instance, therefore there is an ambiguity. • In this case it is resolved in favor of the WCHonorific interpretation, because in the untrained rulebook we have – P(Dr | WCHonorific) = 1/5 (choice of one term among five equiprobable ones), – P(Dr | NGFirstName) ≈ 1/N, where N is the number of all known words. 65 © Ronen Feldman 129 The Trained Rule Book concept Text; concept output Person; ngram NGFirstName; ngram NGLastName; ngram NGNone; wordclass WCHonorific = Mr Mrs Miss Ms <2>Dr; Person :- <2>WCHonorific NGLastName; Person :- NGFirstName NGLastName; Text :- <11>NGNone Text; Text :- <2>Person Text; Text :- <2>; © Ronen Feldman 130 A real example • The PersonAffiliation relation contains three attributes – name of the person, name of the organization, and position of the person in the organization. It is declared as follows: – concept output PersonAffiliation(Name, Position, Org); • Most often, this relation is encountered in the text in the form – “Mr.Name, Position of Org” or – “Org Position Ms.Name”. – Almost any order of the components is possible, with commas and prepositions inserted as necessary. – Also, it is common for Name, Position, or both to be conjunctions of pairs of corresponding entities: – “Mr.Name1 and Ms.Name2, the Position1 and Position2 of Org”, – “Org’s Position1 and Position2, Ms.Name”. • In order to catch those complexities, and for general simplification of the rules, we use several auxiliary non- terminals: – Names, which catches one or two Names, – Positions, which catches one or two Positions, and – Orgs, which catches Organizations and Locations, which can also be involved in PersonAffiliation, as in “Bush, president of US”: 66 © Ronen Feldman 131 The Basic Rules • nonterms Names, Positions, Orgs; – Names :- PERSON->Name | PERSON->Name "and" PERSON->Name; – Positions :- POSITION->Position | POSITION->Position "and" POSITION->Position; – Orgs :- ORGANIZATION->Org | LOCATION->Org; • We also use auxiliary non-terminals for catching pairs of attributes: PosName, and PosOrg: – nonterms PosName, PosOrg; – PosName :- Positions Names | PosName "and" PosName; – wordclass wcPreposition = "at" "in" "of" "for" "with"; – wordclass wcPossessive = ("’ " "s") "’ "; – PosOrg :- Positions wcPreposition Orgs; – PosOrg :- Orgs [wcPossessive] Positions; • Finally, the PersonAffiliation rules: – PersonAffiliation :- Orgs [wcPossessive] PosName; – PersonAffiliation :- PosName wcPreposition Orgs; – PersonAffiliation :- PosOrg [","] Names; – PersonAffiliation :- Names "," PosOrg; – PersonAffiliation :- Names "is" "a" PosOrg; © Ronen Feldman 132 What is Missing? • The rules above catch about 50% of all PersonAffiliation instances in the texts. • Other instances do not conform to the patterns above in several respects. So, in order to improve the accuracy, additional rules need to be written. 
• First, the Organization name is often entered into a sentence as a part of a descriptive noun phrase, as in: – “Ms.Name is a Position of the industry leader Org”. – In order to catch this in a general way, we define an OrgNP nonterm, which uses an external PoS tagger: 67 © Ronen Feldman 133 Advanced Rules • Using External POS Tagger – ngram ngOrgNoun featureset ExtPoS restriction Noun; – ngram ngOrgAdj featureset ExtPoS restriction Adj; – ngram ngNum featureset ExtPoS restriction Number; – ngram ngProper featureset ExtPoS restriction ProperName; – ngram ngDet featureset ExtPoS restriction Det; – ngram ngPrep featureset ExtPoS restriction Prep; • nonterm OrgNounList; – OrgNounList :- ngOrgNoun [OrgNounList]; • nonterms OrgAdjWord, OrgAdjList; – OrgAdjWord :- ngOrgAdj | ngNum | ngProper; – OrgAdjList :- OrgAdjWord [OrgAdjList]; • nonterm OrgNP; – OrgNP :- [ngDet] [OrgAdjList] OrgNounList; – OrgNP :- OrgNP ngPrep OrgNP; – OrgNP :- OrgNP "and" OrgNP; © Ronen Feldman 134 Experimental Evaluation 68 © Ronen Feldman 135 The INC Corpus 72.3477.2768.0080.8586.3676.00Acquisition 74.2972.2276.4780.0077.7882.35OrgLocation 77.3379.4675.3392.0094.5289.61PersonAffiliation FPrecRecallFPrecRecall Exact match resultsPartial match results © Ronen Feldman 136 MUC 7 90.5894.4287.0586.9190.1283.9386.6687.286.12Location 90.1990.989.4987.789.5385.9488.8489.7587.94 Organizati on 92.2490.7893.7586.5786.8386.3186.0185.1386.91Person FPrec Recal lFPrec Recal lFPrec Recal l Full SCFG system Emulation using SCFG HMM entity extractor 69 © Ronen Feldman 137 ACE 2 86.8484.9488.8385.8484.9686.7484.3783.2285.54GPE 64.7671.0659.4965.0671.0260.0358.05 64.73 552.62 Organiza tion 85.5681.6889.8284.8281.6588.2584.3783.2285.54Person 80.2577.383.4461.1176.2450.99Role FPrecRecallFPrec Recal lFPrec Recal l Full SCFG system Ergodic SCFGHMM entity extractor © Ronen Feldman 138 Final Comparison 71.413572.63380.560580.695{Sum} 22.975524.07447.870538.0825FAC 29.364531.59837.44235.091LOC 83.881585.25689.39288.957GPE 78.097576.42283.594584.835PERSON 49.69154.11368.589569.4895ORG HMMG TEGDIALBest TEGF1 70 © Ronen Feldman 139 Simple ROLE Rules (ACE-2) • ROLE :- [Position_Before] ORGANIZATION->ROLE_2 Position ["in" GPE] [","] PERSON->ROLE_1; • ROLE :- GPE->ROLE_2 Position [","] PERSON->ROLE_1; • ROLE :- PERSON->ROLE_1 "of" GPE->ROLE_2; • ROLE :- ORGANIZATION->ROLE_2 "'" "s" [Position] PERSON->ROLE_1; © Ronen Feldman 140 A little more complicated set of rules• ROLE :- [Position_Before] ORGANIZATION->ROLE_2 Position ["in" GPE] [","] PERSON->ROLE_1; • ROLE :- GPE->ROLE_2 Position [","] PERSON->ROLE_1; • ROLE :- PERSON->ROLE_1 "of" GPE->ROLE_2; • ROLE :- ORGANIZATION->ROLE_2 "'" "s" [Position] PERSON- >ROLE_1; • ROLE :- GPE->ROLE_2 [Position] PERSON->ROLE_1; • ROLE :- <5> GPE->ROLE_2 "'" "s" ORGANIZATION->ROLE_1; • ROLE :- PERSON->ROLE_1 "," Position WCPreposition ORGANIZATION->ROLE_2; 71 © Ronen Feldman 141 The Effect of the rules on the extraction accuracy 0.00 20.00 40.00 60.00 80.00 100.00 50K 100K 150K 200K 250K 0 rules 4 rlules 7 rules © Ronen Feldman 142 Self-Supervised Relation Learning from the Web Ronen Feldman Data Mining Laboratory Bar-Ilan University, ISRAEL Joint work with Benjamin Rosenfeld 72 © Ronen Feldman 143 Approaches for Building IE Systems • Knowledge Engineering Approach – Rules are crafted by linguists in cooperation with domain experts. – Most of the work is done by inspecting a set of relevant documents. – Can take a lot of time to fine tune the rule set. – Best results were achieved with KB based IE systems. 
© Ronen Feldman 142 Self-Supervised Relation Learning from the Web
Ronen Feldman, Data Mining Laboratory, Bar-Ilan University, ISRAEL
Joint work with Benjamin Rosenfeld
© Ronen Feldman 143 Approaches for Building IE Systems
• Knowledge Engineering Approach
– Rules are crafted by linguists in cooperation with domain experts.
– Most of the work is done by inspecting a set of relevant documents.
– Can take a lot of time to fine tune the rule set.
– Best results were achieved with KB based IE systems.
– Skilled/gifted developers are needed.
– A strong development environment is a MUST!
© Ronen Feldman 144 Approaches for Building IE Systems
• Automatically Trainable Systems
– The techniques are based on pure statistics and almost no linguistic knowledge
– They are language independent
– The main input is an annotated corpus
– Need a relatively small effort when building the rules, however creating the annotated corpus is extremely laborious.
– Huge number of training examples is needed in order to achieve reasonable accuracy.
– Hybrid approaches can utilize the user input in the development loop.
© Ronen Feldman 145 KnowItAll (KIA)
• KnowItAll is a system developed at the University of Washington by Oren Etzioni and colleagues (Etzioni, Cafarella et al. 2005).
• KnowItAll is an autonomous, domain-independent system that extracts facts from the Web. The primary focus of the system is on extracting entities (unary predicates), although KnowItAll is able to extract relations (N-ary predicates) as well.
• The input to KnowItAll is a set of entity classes to be extracted, such as "city", "scientist", "movie", etc., and the output is a list of entities extracted from the Web.
© Ronen Feldman 146 KnowItAll's Relation Learning
• The base version of KnowItAll uses only the generic hand-written patterns. The patterns are based on a general Noun Phrase (NP) tagger.
• For example, here are the two patterns used by KnowItAll for extracting instances of the Acquisition(Company, Company) relation:
– NP2 "was acquired by" NP1
– NP1 "'s acquisition of" NP2
• And the following are the three patterns used by KnowItAll for extracting the MayorOf(City, Person) relation:
– NP ", mayor of" NP
– NP "'s mayor" NP
– "mayor" NP
© Ronen Feldman 147 SRES
• SRES (Self-Supervised Relation Extraction System) learns to extract relations from the web in an unsupervised way.
• The system takes as input the name of the relation and the types of its arguments and returns as output a set of instances of the relation extracted from the given corpus.
© Ronen Feldman 148 SRES Architecture
[Architecture diagram: input is a set of target relation definitions, output is a set of extractions. The Sentence Gatherer collects sentences (and keywords) from the Web; the Seeds Generator produces seed instances from the relation definitions; the Pattern Learner learns patterns from the seeds and the gathered sentences; the Instance Extractor applies the patterns to produce candidate instances, which pass through an optional NER Filter and a Classifier.]
© Ronen Feldman 149 Seeds for Acquisition
• Oracle – PeopleSoft
• Oracle – Siebel Systems
• PeopleSoft – J.D. Edwards
• Novell – SuSE
• Sun – StorageTek
• Microsoft – Groove Networks
• AOL – Netscape
• Microsoft – Vicinity
• San Francisco-based Vector Capital – Corel
• HP – Compaq
© Ronen Feldman 150 Major Steps in Pattern Learning
• The sentences containing the arguments of the seed instances are extracted from the large set of sentences returned by the Sentence Gatherer (see the sketch below).
• Then, the patterns are learned from the seed sentences.
– We need to generate automatically
• Positive Instances
• Negative Instances
• Finally, the patterns are post-processed and filtered.
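A minimal sketch (plain Python, not the actual Sentence Gatherer) of the first step above: keeping only those gathered sentences that mention both attributes of some seed instance. The seed pairs are taken from the Acquisition seed list above; the example sentences and the naive substring test are illustrative assumptions:

    # Naive seed-sentence selection by substring containment.
    SEEDS = [("Oracle", "PeopleSoft"), ("HP", "Compaq"), ("Novell", "SuSE")]

    def seed_sentences(sentences, seeds=SEEDS):
        """Return (sentence, seed) pairs for sentences that contain both seed attributes."""
        kept = []
        for s in sentences:
            for a, b in seeds:
                if a in s and b in s:
                    kept.append((s, (a, b)))
                    break
        return kept

    docs = [
        "Oracle's proposed acquisition of PeopleSoft was evaluated by the Antitrust Division.",
        "HP and Compaq announced their merger in 2001.",
        "Sun released a new workstation.",       # mentions no seed pair -> dropped
    ]
    for sentence, seed in seed_sentences(docs):
        print(seed, "->", sentence)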
© Ronen Feldman 151 Positive Instances
• The positive set of a predicate consists of sentences that contain an instance of the predicate, with the actual instance's attributes changed to "<AttrN>", where N is the attribute index.
• For example, the sentence
– "The Antitrust Division of the U.S. Department of Justice evaluated the likely competitive effects of Oracle's proposed acquisition of PeopleSoft."
• will be changed to
– "The Antitrust Division of the U.S. Department of Justice evaluated the likely competitive effects of <Attr1>'s proposed acquisition of <Attr2>."
© Ronen Feldman 152 Negative Instances II
• We generate the negative set from the sentences in the positive set by changing the assignment of one or both attributes to other suitable entities in the sentence.
• In the shallow parser based mode of operation, any suitable noun phrase can be assigned to an attribute.
© Ronen Feldman 153 Examples
• The Positive Instance
– "The Antitrust Division of the U.S. Department of Justice evaluated the likely competitive effects of <Attr1>'s proposed acquisition of <Attr2>"
• Possible Negative Instances
– "<Attr1> of the <Attr2> evaluated the likely…"
– "<Attr1> of the U.S. … …acquisition of <Attr2>"
– "The Antitrust Division of the <Attr1> ….. acquisition of <Attr2>"
© Ronen Feldman 154 Additional Instances
• We also use the sentences produced by exchanging "<Attr1>" and "<Attr2>" (with obvious generalization for n-ary predicates) in the positive sentences.
• If the target predicate is symmetric, like Merger, then such sentences are put into the positive set.
• Otherwise, for anti-symmetric predicates, the sentences are put into the negative set.
© Ronen Feldman 155 Pattern Generation
• The patterns for a predicate P are generalizations of pairs of sentences from the positive set of P.
• The function Generalize(S1, S2) is applied to each pair of sentences S1 and S2 from the positive set of the predicate. The function generates a pattern that is the best (according to the objective function defined below) generalization of its two arguments.
• The following pseudo code shows the process of generating the patterns:
For each predicate P
    For each pair S1, S2 from PositiveSet(P)
        Let Pattern = Generalize(S1, S2).
        Add Pattern to PatternsSet(P).
© Ronen Feldman 156 The Pattern Language
• The patterns are sequences of tokens, skips, and slots. The tokens can match only themselves, the skips match zero or more arbitrary tokens, and slots match instance attributes.
• Examples of patterns:
– * was acquired by *
– * merged with *
– * is * ceo of *
• Note that the sentences from the positive and negative sets of predicates are also patterns, the least general ones, since they do not contain skips.
© Ronen Feldman 157 The Generalize Function
• The Generalize(S1, S2) function takes two patterns (e.g., two sentences with slots marked as <AttrN>) and generates the least (most specific) common generalization of both.
• The function does a dynamic programming search for the best match between the two patterns.
• The cost of the match is defined as the sum of costs of matches for all elements:
– two identical elements match at no cost,
– a token matches a skip or an empty space at cost 2,
– a skip matches an empty space at cost 1,
– all other combinations have infinite cost.
• After the best match is found, it is converted into a pattern by copying matched identical elements and adding skips where non-identical elements are matched.
© Ronen Feldman 158 Example
• S1 = "Toward this end, <Attr1> in July acquired <Attr2>"
• S2 = "Earlier this year, <Attr1> acquired <Attr2>"
• After the dynamic programming search, the following match is found:
– Toward ↔ (empty)          cost 2
– (empty) ↔ Earlier         cost 2
– this ↔ this               cost 0
– end ↔ (empty)             cost 2
– (empty) ↔ year            cost 2
– , ↔ ,                     cost 0
– <Attr1> ↔ <Attr1>         cost 0
– in July ↔ (empty)         cost 4
– acquired ↔ acquired       cost 0
– <Attr2> ↔ <Attr2>         cost 0
© Ronen Feldman 159 Generating the Pattern
• at a total cost of 12.
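The search above, and the generalized pattern discussed next, can be reproduced with a small dynamic-programming sketch. This is an illustrative re-implementation of Generalize as described on these slides, not the actual SRES code; patterns are token lists with "*" for skips and "<Attr1>"/"<Attr2>" for slots:

    INF = float("inf")
    SKIP = "*"

    def pair_cost(a, b):
        if a == b:
            return 0                  # identical elements match at no cost
        if SKIP in (a, b):
            return 2                  # a token matches a skip at cost 2
        return INF                    # two different tokens never match

    def gap_cost(a):
        return 1 if a == SKIP else 2  # skip vs. empty = 1, token vs. empty = 2

    def generalize(s1, s2):
        n, m = len(s1), len(s2)
        dp = [[INF] * (m + 1) for _ in range(n + 1)]
        dp[0][0] = 0
        for i in range(n + 1):
            for j in range(m + 1):
                if i and j:
                    dp[i][j] = min(dp[i][j], dp[i - 1][j - 1] + pair_cost(s1[i - 1], s2[j - 1]))
                if i:
                    dp[i][j] = min(dp[i][j], dp[i - 1][j] + gap_cost(s1[i - 1]))
                if j:
                    dp[i][j] = min(dp[i][j], dp[i][j - 1] + gap_cost(s2[j - 1]))
        # Traceback: copy identical matches, emit a skip everywhere else.
        raw, i, j = [], n, m
        while i or j:
            if i and j and dp[i][j] == dp[i - 1][j - 1] + pair_cost(s1[i - 1], s2[j - 1]):
                raw.append(s1[i - 1] if s1[i - 1] == s2[j - 1] else SKIP)
                i, j = i - 1, j - 1
            elif i and dp[i][j] == dp[i - 1][j] + gap_cost(s1[i - 1]):
                raw.append(SKIP)
                i -= 1
            else:
                raw.append(SKIP)
                j -= 1
        raw.reverse()
        # Normalization: combine adjacent skips, drop leading and trailing skips.
        pattern = []
        for el in raw:
            if el == SKIP and pattern and pattern[-1] == SKIP:
                continue
            pattern.append(el)
        while pattern and pattern[0] == SKIP:
            pattern.pop(0)
        while pattern and pattern[-1] == SKIP:
            pattern.pop()
        return dp[n][m], pattern

    S1 = "Toward this end , <Attr1> in July acquired <Attr2>".split()
    S2 = "Earlier this year , <Attr1> acquired <Attr2>".split()
    cost, pattern = generalize(S1, S2)
    print(cost, " ".join(pattern))    # 12  this * , <Attr1> * acquired <Attr2>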
• The match will be converted to the pattern
– * * this * * , <Attr1> * acquired <Attr2>
• which will be normalized (after removing leading and trailing skips, and combining adjacent pairs of skips) into
– this * , <Attr1> * acquired <Attr2>
© Ronen Feldman 160 Post-processing, filtering, and scoring of patterns
• In the first step of the post-processing we remove from each pattern all function words and punctuation marks that are surrounded by skips on both sides. Thus, the pattern from the example above will be converted to
– , <Attr1> * acquired <Attr2>
• Note that we do not remove elements that are adjacent to meaningful words or to slots, like the comma in the pattern above, because such anchored elements may be important.
© Ronen Feldman 161 Content Based Filtering
• Every pattern must contain at least one word relevant to its predicate. For each predicate, the list of relevant words is automatically generated from WordNet by following all links to depth at most 2 starting from the predicate keywords. For example, the pattern
– <Attr1> * by <Attr2>
• will be removed, while the pattern
– <Attr1> * purchased <Attr2>
• will be kept, because the word "purchased" can be reached from "acquisition" via synonym and derivation links.
© Ronen Feldman 162 Scoring the Patterns
• The filtered patterns are then scored by their performance on the positive and negative sets.
• We want the scoring formula to reflect the following heuristic: the score needs to rise monotonically with the number of positive sentences the pattern matches, but drop very fast with the number of negative sentences it matches:

Score(Pattern) = |{S ∈ PositiveSet : Pattern matches S}| / (|{S ∈ NegativeSet : Pattern matches S}| + 1)²

© Ronen Feldman 163 Sample Patterns - Inventor
• X , .* inventor .* of Y
• X invented Y
• X , .* invented Y
• when X .* invented Y
• X ' s .* invention .* of Y
• inventor .* Y , X
• Y inventor X
• invention .* of Y .* by X
• after X .* invented Y
• X is .* inventor .* of Y
• inventor .* X , .* of Y
• inventor of Y , .* X ,
• X is .* invention of Y
• Y , .* invented .* by X
• Y was invented by X
© Ronen Feldman 164 Sample Patterns – CEO (Company/X, Person/Y)
• X ceo Y
• X ceo .* Y ,
• former X .* ceo Y
• X ceo .* Y .
• Y , .* ceo of .* X ,
• X chairman .* ceo Y
• Y , X .* ceo
• X ceo .* Y said
• X ' .* ceo Y
• Y , .* chief executive officer .* of X
• said X .* ceo Y
• Y , .* X ' .* ceo
• Y , .* ceo .* X corporation
• Y , .* X ceo
• X ' s .* ceo .* Y ,
• X chief executive officer Y
• Y , ceo .* X ,
• Y is .* chief executive officer .* of X
© Ronen Feldman 165 Shallow Parser mode
• In the first mode of operation (without the use of NER), the predicates may define attributes of two different types: ProperName and CommonNP.
• We assume that the values of the ProperName type are always heads of proper noun phrases, and that the values of the CommonNP type are simple common noun phrases (with possible proper noun modifiers, e.g. "the Kodak camera").
• We use a Java-written shallow parser from the OpenNLP (http://opennlp.sourceforge.net/) package. Each sentence is tokenized, tagged with part-of-speech, and tagged with noun phrase boundaries. The pattern matching and extraction is straightforward.
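To illustrate how straightforward such matching can be, here is a toy Python version (not the SRES implementation, which works over tokens and real noun-phrase chunks) that compiles a learned pattern into a regular expression; approximating the slots by capitalized word sequences is purely an assumption of this sketch:

    import re

    def pattern_to_regex(pattern: str) -> "re.Pattern":
        """Compile a pattern such as 'this * , <Attr1> * acquired <Attr2>' into a regex.
        '*' becomes a lazy skip; slots become named groups matching capitalized phrases."""
        slot = r"[A-Z][\w.&-]*(?:\s+[A-Z][\w.&-]*)*"        # crude proper-name approximation
        parts = []
        for tok in pattern.split():
            if tok == "*":
                parts.append(r".*?")
            elif tok.startswith("<Attr") and tok.endswith(">"):
                parts.append("(?P<%s>%s)" % (tok.strip("<>"), slot))
            else:
                parts.append(re.escape(tok))
        return re.compile(r"\s*".join(parts))

    rx = pattern_to_regex("this * , <Attr1> * acquired <Attr2>")
    m = rx.search("Earlier this year, Oracle acquired PeopleSoft in a hostile takeover.")
    if m:
        print(m.group("Attr1"), "->", m.group("Attr2"))     # Oracle -> PeopleSoft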
© Ronen Feldman 166 Building a Classification Model
• The goal is to set the score of the extractions using the information on the instance, the extracting patterns and the matches. Assume that extraction E was generated by pattern P from a match M of the pattern P at a sentence S. The following properties are used for scoring:
1. Number of different sentences that produce E (with any pattern).
2. Statistics on the pattern P generated during pattern learning – the number of positive sentences matched and the number of negative sentences matched.
3. Information on whether the slots in the pattern P are anchored.
4. The number of non-stop words the pattern P contains.
5. Information on whether the sentence S contains proper noun phrases between the slots of the match M and outside the match M.
6. The number of words between the slots of the match M that were matched to skips of the pattern P.
© Ronen Feldman 167 Building a Classification Model
• During the experiments, it turned out that the pattern statistics (2) produced detrimental results, and the proper noun phrase information (5) did not produce any improvement. The rest of the information was useful, and was turned into the following set of binary features:
– f1(E, P, M, S) = 1, if the number of sentences producing E is greater than one.
– f2(E, P, M, S) = 1, if the number of sentences producing E is greater than two.
– f3(E, P, M, S) = 1, if at least one slot of the pattern P is anchored.
– f4(E, P, M, S) = 1, if both slots of the pattern P are anchored.
© Ronen Feldman 168 Building a Classification Model
– f5…f9(E, P, M, S) = 1, if the number of non-stop words in P is 0, 1 or greater, 2 or greater, … 4 or greater, respectively.
– f10…f15(E, P, M, S) = 1, if the number of words between the slots of the match M that were matched to skips of the pattern P is 0, 1 or less, 2 or less, 3 or less, 5 or less, and 10 or less, respectively.
• As can be seen, the set of features above is rather small, and is not specific to any particular predicate. This allows us to train a model using a small amount of labeled data for one predicate, and then to use the model for all other predicates.
© Ronen Feldman 169 Using an NER Component
• In the SRES-NER version the entities of each candidate instance are passed through a simple rule-based NER filter, which attaches a score ("yes", "maybe", or "no") to the argument(s) and optionally fixes the argument boundaries. The NER is capable of identifying entities of type PERSON and COMPANY (and can be extended to identify additional types).
© Ronen Feldman 170 NER Scores
• The scores mean:
– "yes" – the argument is of the correct entity type.
– "no" – the argument is not of the right entity type, and hence the candidate instance should be removed.
– "maybe" – the argument type is uncertain; it can be either correct or not.
© Ronen Feldman 171 Utilizing the NER Scores
• If "no" is returned for one of the arguments, the instance is removed. Otherwise, an additional binary feature is added to the instance's vector:
– f16 = 1 iff the score for both arguments is "yes".
• For bound predicates, only the second argument is analyzed, naturally.
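Putting the feature definitions from the last few slides together, here is a small illustrative helper (not SRES code; the exact thresholds follow my reading of the slides) that turns the raw counts for one match into the binary vector f1…f16:

    def feature_vector(num_sentences, anchored_slots, nonstop_words, skip_words, ner_both_yes):
        """Binary features f1..f16 for one (extraction, pattern, match, sentence) tuple."""
        f = [
            num_sentences > 1,          # f1
            num_sentences > 2,          # f2
            anchored_slots >= 1,        # f3: at least one slot of the pattern is anchored
            anchored_slots == 2,        # f4: both slots are anchored
        ]
        f += [nonstop_words == 0] + [nonstop_words >= k for k in (1, 2, 3, 4)]   # f5..f9
        f += [skip_words <= k for k in (0, 1, 2, 3, 5, 10)]                      # f10..f15
        f.append(bool(ner_both_yes))                                             # f16 (SRES-NER only)
        return [int(x) for x in f]

    print(feature_vector(num_sentences=3, anchored_slots=2, nonstop_words=2,
                         skip_words=1, ner_both_yes=True))
    # -> [1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]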
© Ronen Feldman 172 Experimental Evaluation
• We want to answer the following 4 questions:
1. Can we train SRES's classifier once, and then use the results on all other relations?
2. What boost will we get by introducing a simple NER into the classification scheme of SRES?
3. How does SRES's performance compare with KnowItAll and KnowItAll-PL?
4. What is the true recall of SRES?
© Ronen Feldman 173 Training
1. The patterns for a single model predicate are run over a small set of sentences (10,000 sentences in our experiment), producing a set of extractions (between 150 and 300 extractions in our experiments).
2. The extractions are manually labeled according to whether they are correct or not.
3. For each pattern match Mk, the value of the feature vector fk = (f1, …, f16) is calculated, and the label Lk = ±1 is set according to whether the extraction that the match produced is correct or not.
4. A regression model estimating the function L(f) is built from the training data {(fk, Lk)}. We used BBR, but other models, such as SVM, are of course possible.
© Ronen Feldman 174 Testing
1. The patterns for all predicates are run over the sentences.
2. For each pattern match M, its score L(f(M)) is calculated by the trained regression model. Note that we do not threshold the value of L, instead using the raw probability value between zero and one.
3. The final score for each extraction is set to the maximal score of all matches that produced the extraction.
© Ronen Feldman 175 Sample Output
• Extracted instance: HP – Compaq. Supporting sentences:
– "Additional information about the HP-Compaq merger is available at www.VotetheHPway.com."
– "The Packard Foundation, which holds around ten per cent of HP stock, has decided to vote against the proposed merger with Compaq."
– "Although the merger of HP and Compaq has been approved, there are no indications yet of the plans of HP regarding Digital GlobalSoft."
– "During the Proxy Working Group's subsequent discussion, the CIO informed the members that he believed that Deutsche Bank was one of HP's advisers on the proposed merger with Compaq."
– "It was the first report combining both HP and Compaq results since their merger."
– "As executive vice president, merger integration, Jeff played a key role in integrating the operations, financials and cultures of HP and Compaq Computer Corporation following the 19 billion merger of the two companies."
© Ronen Feldman 176 Cross-Classification Experiment
[Charts: precision (0.7–1.0) vs. number of correct extractions for the Acquisition and Merger predicates, using classification models trained on each of Acq., CEO, Inventor, Mayor, and Merger.]
© Ronen Feldman 177 Results!
[Charts: precision (0.5–1.0) vs. number of correct extractions for Acquisition and Merger, comparing KIA, KIA-PL, SRES, and S_NER.]
© Ronen Feldman 178 More Results
[Charts: precision (0.6–1.0) vs. number of correct extractions for CeoOf and MayorOf, comparing KIA, KIA-PL, SRES, and S_NER.]
© Ronen Feldman 179 Inventor Results
[Chart: precision (0.6–1.0) vs. number of correct extractions for InventorOf, comparing KIA, KIA-PL, and SRES.]
© Ronen Feldman 180 When is SRES better than KIA?
• KnowItAll extraction works well when redundancy is high and most instances have a good chance of appearing in simple forms that KnowItAll is able to recognize.
• The additional machinery in SRES is necessary when redundancy is low.
• Specifically, SRES is more effective in identifying low-frequency instances, due to its more expressive rule representation, and its classifier that inhibits those rules from overgeneralizing.
© Ronen Feldman 181 The Redundancy of the Various Datasets
[Chart: average number of sentences per instance (0–70) for the Acq, Merger, Inventor, CEO, and Mayor datasets.]
© Ronen Feldman 182 True Recall Estimates
• It is impossible to manually annotate all of the relation instances because of the huge size of the input corpus.
• Thus, indirect methods must be used. We used a large list of known acquisition and merger instances (that occurred between 1/1/2004 and 31/12/2005) taken from the paid subscription service SBC Platinum.
• For each of the instances in this list we identified all of the sentences in the input corpus that contained both instance attributes, and assumed that all such sentences are true instances of the corresponding relation.
© Ronen Feldman 183 Underestimation of the Recall
• This is of course an overestimate, since in some cases the appearance of both attributes of a true relation instance is just a chance occurrence and does not constitute a true mention of the relation.
• Thus, our estimates of the true recall are pessimistic, and the actual recall is higher.
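A minimal sketch (not the actual evaluation code) of this recall estimate: known instances whose two attributes co-occur in at least one corpus sentence form the (over-counted) reference set, and recall is measured against it. The instance list, sentences, and the simple substring test are illustrative assumptions:

    def estimated_recall(extracted, known_instances, sentences):
        """Pessimistic recall estimate against known instances that appear in the corpus."""
        def co_occurs(inst):
            a, b = inst
            return any(a in s and b in s for s in sentences)

        reference = {inst for inst in known_instances if co_occurs(inst)}
        if not reference:
            return 0.0
        return len(reference & set(extracted)) / len(reference)

    known = [("Oracle", "PeopleSoft"), ("HP", "Compaq"), ("Sun", "StorageTek")]
    corpus = ["Oracle acquired PeopleSoft.", "HP and Compaq completed their merger."]
    found = [("Oracle", "PeopleSoft")]
    print(estimated_recall(found, known, corpus))   # 0.5 -- Sun/StorageTek never co-occurs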
© Ronen Feldman 184 True Recall Estimates
[Charts: estimated recall (0–1) as a function of instance redundancy (1–8) for the Acquisition and Merger relations.]
© Ronen Feldman 185 Conclusions
• We have presented the SRES system for autonomously learning relations from the Web.
• SRES removes the bottleneck of classic information extraction systems, which rely either on manually developed extraction patterns or on a manually tagged training corpus.
• The system relies upon a pattern learning component that enables it to boost the recall of the system.
© Ronen Feldman 186 Future Work
• In our future research we want to try to improve the precision values even at the highest recall levels.
• One of the topics we would like to explore is the complexity of the patterns that we learn. Currently we use a very simple pattern language that has just 3 types of elements: slots, constants and skips. We want to see if we can achieve higher precision with more complex patterns.
• In addition we would like to test SRES on n-ary predicates, and to extend the system to handle predicates that are allowed to lack some of the attributes.