The Age of Spiritual Machines : When Computers Exceed Human Intelligence
Ray Kurzweil
List Price: $14.95
Our Price: $11.96
You Save: $2.99
Extracted Book Template
Title: The Age of Spiritual Machines :
When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
Template Types
•Slots in template typically filled by a substring from the document.
•Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text
–Terrorist act: threatened, attempted, accomplished.
–Job type: clerical, service, custodial, etc.
–Company type: SEC code
•Some slots may allow multiple fillers.
–Programming language
•Some domains may allow multiple extracted templates per document.
–Multiple apartment listings in one ad
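The slot types above can be sketched as plain data structures. This is only an illustrative sketch: the class names (`BookTemplate`, `JobTemplate`, `ActStatus`) and field names are assumptions made for this example, not part of any IE system described in these slides.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ActStatus(Enum):
    """Fixed set of pre-specified fillers; the value need not occur in the text."""
    THREATENED = "threatened"
    ATTEMPTED = "attempted"
    ACCOMPLISHED = "accomplished"


@dataclass
class BookTemplate:
    """Every slot is filled by a substring of the document."""
    title: Optional[str] = None
    author: Optional[str] = None
    list_price: Optional[str] = None
    price: Optional[str] = None


@dataclass
class JobTemplate:
    """Hypothetical job-posting template with a slot that allows multiple fillers."""
    job_type: Optional[str] = None  # drawn from a fixed set, e.g. "clerical"
    programming_languages: List[str] = field(default_factory=list)


# One template extracted from the sample book page.
book = BookTemplate(
    title="The Age of Spiritual Machines : When Computers Exceed Human Intelligence",
    author="Ray Kurzweil",
    list_price="$14.95",
    price="$11.96",
)

# A domain with several templates per document would simply return a list,
# e.g. one template per apartment listing in an ad.
templates_for_document: List[BookTemplate] = [book]
```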
Pattern-Matching Rule Extraction
•Another approach to building IE systems is to use pattern-matching rules for each field to
identify the strings to extract for that field.
•When building web extraction systems (wrappers) manually, it is common to write regular expression
patterns (in a language like Perl) to identify the desired regions of the text.
•Works well when a fairly fixed local context is sufficient to identify extractions, as in
extracting from web pages generated by a program or very stylized text like classified ads.
Regular Expressions
•Language for composing complex patterns from simpler ones.
–An individual character is a regex.
–Union: If e1 and e2 are regexes, then (e1 | e2) is a regex that matches whatever either e1 or e2 matches.
–Concatenation: If e1 and e2 are regexes, then e1 e2 is a regex that matches a string that consists
of a substring that matches e1 immediately followed by a substring that matches e2
–Repetition (Kleene closure): If e1 is a regex, then e1* is a regex that matches a sequence of zero
or more strings that match e1
Regular Expression Examples
•(u|e)nabl(e|ing) matches
–unable, unabling, enable, enabling
•(un|en)*able matches
–able, unable, enable, unenable, enununenable, ...
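The two example patterns can be checked with Python's `re` module, whose syntax covers union, concatenation, and Kleene closure as defined above. The helper name `matches` and the test words are assumptions chosen for this sketch.

```python
import re

# Union and concatenation.
p1 = re.compile(r"(u|e)nabl(e|ing)")
# Kleene closure over a union, followed by concatenation.
p2 = re.compile(r"(un|en)*able")


def matches(pattern, s):
    """Return True if the pattern matches the whole string s."""
    return pattern.fullmatch(s) is not None


for word in ["unable", "unabling", "enable", "enabling", "nable"]:
    print(word, matches(p1, word))

for word in ["able", "unable", "unenable", "enununenable", "unabled"]:
    print(word, matches(p2, word))
```

`fullmatch` is used rather than `search` so that a word counts as a match only when the entire string is covered by the pattern.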
Simple Extraction Patterns
•Specify an item to extract for a slot using a regular expression pattern.
–Price pattern: “\$\d+(\.\d{2})?\b”
•May require preceding (pre-filler) pattern to identify proper context.
–Amazon list price:
•Pre-filler pattern: “List Price: ”
•Filler pattern: “\$\d+(\.\d{2})?\b”
•May require succeeding (post-filler) pattern to identify the end of the filler.
–Amazon list price:
•Pre-filler pattern: “List Price: ”
•Filler pattern: “.+”
•Post-filler pattern: “”
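A minimal sketch of pre-filler / filler / post-filler extraction, applied to the sample book page from the start of this section. The function name `extract` is hypothetical, and using a newline as the post-filler is an assumption that suits this plain-text sample; a real wrapper would use whatever markup follows the price.

```python
import re

DOC = (
    "The Age of Spiritual Machines : When Computers Exceed Human Intelligence\n"
    "Ray Kurzweil\n"
    "List Price: $14.95\n"
    "Our Price: $11.96\n"
    "You Save: $2.99\n"
)

PRICE = r"\$\d+(\.\d{2})?\b"


def extract(doc, pre, filler, post=""):
    """Return the first filler found between the pre- and post-filler patterns."""
    # A named group isolates the filler even if it contains groups of its own.
    m = re.search(f"{pre}(?P<filler>{filler}){post}", doc)
    return m.group("filler") if m else None


# Filler pattern alone: every price-like string in the document.
all_prices = [m.group(0) for m in re.finditer(PRICE, DOC)]

# Pre-filler gives the context that separates the list price from the others.
list_price = extract(DOC, r"List Price: ", PRICE)

# A very general filler (".+") relies on the post-filler to mark where it ends.
our_price = extract(DOC, r"Our Price: ", r".+", post=r"\n")

print(all_prices, list_price, our_price)
```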
Adding NLP Information to Patterns
•If extracting from automatically generated web pages, simple regex patterns usually work.
•If extracting from more natural, unstructured, human-written text, some NLP may help.
–Part-of-speech (POS) tagging
•Mark each word as a noun, verb, preposition, etc.
–Syntactic parsing
•Identify phrases: NP, VP, PP
–Semantic word categories (e.g. from WordNet)
•KILL: kill, murder, assassinate, strangle, suffocate
•Extraction patterns can use POS or phrase tags.
–Crime victim:
•Prefiller: [POS: V, Hypernym: KILL]
•Filler: [Phrase: NP]
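The crime-victim pattern above can be sketched over pre-annotated text. Everything here is illustrative: the example sentence, the chunk dictionaries, and the function name `crime_victims` are assumptions, and the POS tags, phrase tags, and lemmas are written by hand rather than produced by a real tagger, parser, or WordNet lookup.

```python
# Semantic word category taken from the slide.
KILL = {"kill", "murder", "assassinate", "strangle", "suffocate"}

# Hypothetical hand-annotated sentence: "The gunman murdered the mayor"
chunks = [
    {"text": "The gunman", "phrase": "NP"},
    {"text": "murdered", "phrase": "VP", "pos": "V", "lemma": "murder"},
    {"text": "the mayor", "phrase": "NP"},
]


def crime_victims(chunks):
    """Apply: Prefiller [POS: V, Hypernym: KILL], Filler [Phrase: NP]."""
    victims = []
    for prev, cur in zip(chunks, chunks[1:]):
        prefiller_ok = prev.get("pos") == "V" and prev.get("lemma") in KILL
        if prefiller_ok and cur.get("phrase") == "NP":
            victims.append(cur["text"])
    return victims


print(crime_victims(chunks))
```

Because the pre-filler tests a tag and a word category instead of a literal string, the same rule covers "killed", "assassinated", and the other verbs in the category without listing each surface form.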
Evaluating IE Accuracy
•Always evaluate performance on independent, manually annotated test data not used during system development.
•Measure for each test document:
–Total number of correct extractions in the solution template: N
–Total number of slot/value pairs extracted by the system: E
–Number of extracted slot/value pairs that are correct (i.e. in the solution template): C
•Compute average value of metrics adapted from IR:
–Recall = C/N
–Precision = C/E
–F-Measure = Harmonic mean of recall and precision
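The per-document metrics can be sketched as follows, using the book template from earlier as the solution. The function name `ie_scores` and the deliberately imperfect system output are assumptions made for illustration; averaging over test documents would be a mean of these per-document values.

```python
def ie_scores(solution, extracted):
    """Return (recall, precision, f_measure) for one test document.

    Both arguments are collections of (slot, value) pairs.
    """
    solution, extracted = set(solution), set(extracted)
    n = len(solution)              # N: correct extractions in the solution template
    e = len(extracted)             # E: slot/value pairs extracted by the system
    c = len(solution & extracted)  # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    total = recall + precision
    # Harmonic mean of recall and precision.
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


solution = {
    ("Title", "The Age of Spiritual Machines : When Computers Exceed Human Intelligence"),
    ("Author", "Ray Kurzweil"),
    ("List-Price", "$14.95"),
    ("Price", "$11.96"),
}
# Hypothetical system output: two correct pairs, one wrong price, title missed.
extracted = {
    ("Author", "Ray Kurzweil"),
    ("List-Price", "$14.95"),
    ("Price", "$2.99"),
}

recall, precision, f_measure = ie_scores(solution, extracted)
print(recall, precision, f_measure)
```

Here N = 4, E = 3, and C = 2, so recall is 2/4, precision is 2/3, and the F-measure is 4/7.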
ACE 2002 Newspaper Corpus
•Newspaper article extraction task.
–422 training documents
–97 test documents
•Extracted information:
–Entities: Person, Organization, Facility, Location, Geopolitical Entity
–Relations: Role, Part, Located, Near, Social
ACE 2002 Newspaper Corpus
–ERK: string subsequence kernel extractor
–K4: The tree dependency kernel from [Culotta et al., 2004].
Text Mining
•Automatically extract information from a large corpus to build a large database or knowledge-base
of useful information.
•For example, we have used our trained protein interaction extractor to mine biomedical journal abstracts:
–Input: 753,459 Medline abstracts that reference “human”
–Output: Database of 6,580 interactions between 3,737 human proteins
Active Learning
•Annotating training documents for each application is difficult and expensive.
•Random selection can waste effort on annotating documents that do not help the learner.
•Best to focus human effort on annotating the most informative documents.
•Active learning methods pick only the most informative examples for training.
•At each step, select the example that is estimated to be the most useful for improving the current
learner and then ask the human oracle to annotate this example.
Uncertainty Sampling
•Assume learned system can provide confidence in its predicted labelings of examples.
•From a pool of unlabeled data, pick as most informative the unlabeled example about which the
current learned system is most uncertain.
Let D be a set of unlabeled examples
Until desired accuracy is reached
    Apply current learned system, L, to all examples in D
    From D, select the example, E, whose label is most uncertain
    Ask the user to label E and remove it from D
    Add E to the training set and retrain L
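The loop above can be sketched in runnable form with a toy learner. All specifics here are assumptions for illustration: examples are single numbers, the learner places its decision boundary halfway between the two class means, uncertainty is closeness to that boundary, a fixed labeling budget stands in for "until desired accuracy is reached", and `oracle` plays the role of the human annotator. This is not the Rapier system.

```python
def train(labeled):
    """Toy learner L: boundary halfway between the class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def uncertainty(boundary, x):
    """Examples nearer the boundary are the ones L is least sure about."""
    return -abs(x - boundary)


def uncertainty_sampling(labeled, pool, oracle, budget):
    """Return the retrained boundary, the training set, and the queried examples."""
    labeled, pool, queried = list(labeled), list(pool), []
    boundary = train(labeled)
    for _ in range(budget):
        if not pool:
            break
        # Apply L to all of D and select the most uncertain example E.
        e = max(pool, key=lambda x: uncertainty(boundary, x))
        # Ask the oracle to label E and remove it from D.
        pool.remove(e)
        labeled.append((e, oracle(e)))
        queried.append(e)
        # Add E to the training set and retrain L.
        boundary = train(labeled)
    return boundary, labeled, queried


seed = [(0.0, 0), (10.0, 1)]                 # initial labeled examples
pool = [float(x) for x in range(1, 10)]      # unlabeled pool D
oracle = lambda x: int(x >= 6.5)             # hypothetical true labeling

boundary, labeled, queried = uncertainty_sampling(seed, pool, oracle, budget=3)
print(queried, boundary)
```

In this run the three queries land at 5.0, 6.0, and 7.0, clustered around the true boundary, which is the behavior uncertainty sampling is meant to produce; random selection would spend labels on examples far from it.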
Rapier Uncertainty Sampling Results
50% savings in labeled examples!
Information Extraction Issues
•Effectively exploiting global information
•Better active learning methods
•Integrating entity and relation extraction
•Unsupervised IE
•Semi-supervised IE
•Adaptation and transfer to new tasks
•Mining extracted data to find cross-document regularities.
•Use resulting mined knowledge to improve IE