Transformation-Based Learning
Christian Siefkes <christian@siefkes.net>

Introduction

- An 'error-driven' approach for learning an ordered set of rules
- Adds annotations/classifications to each token of the input
- Developed by Brill [1995] for POS tagging
- Also used in other NLP areas, e.g.
  - text chunking [Ramshaw and Marcus 1995; Florian et al. 2000]
  - prepositional phrase attachment [Brill and Resnik 1994]
  - parsing [Brill 1996]
  - dialogue act tagging [Samuel 1998]
  - named entity recognition [Day et al. 1997]

Required Input

For application:
- The input to annotate:
  POS: Recently, there has been a rebirth of empiricism in the field of natural language processing.

Additionally for training:
- The correctly annotated input ('truth'):
  POS: Recently/RB ,/, there/EX has/VBZ been/VBN a/DT rebirth/NN of/IN empiricism/NN in/IN the/DT field/NN of/IN natural/JJ language/NN processing/NN ./.

Preliminaries

- Templates of admissible transformation rules (triggering environments)
- An initial-state annotator
  POS: Known words: tag each word with its most frequent tag.
  Unknown words: tag each capitalized word as a proper noun (NNP) and every other word as a common noun (NN).
- An objective function for learning
  POS: minimize the number of tagging errors.

Transformation Rules

Rewrite rules: what to replace
  POS: ti → tj; * → tj (replace tag ti / any tag by tag tj)

Triggering environment: when to replace
  POS:
  Non-lexicalized templates:
  1. The preceding (following) word is tagged ta.
  2. The word two before (after) is tagged ta.
  3. One of the two preceding (following) words is tagged ta.
  4. One of the three preceding (following) words is tagged ta.
  5. The preceding word is tagged ta and the following word is tagged tb.
  6. The preceding (following) word is tagged ta and the word two before (after) is tagged tb.
  Lexicalized templates:
  1. The preceding (following) word is wa.
  2. The word two before (after) is wa.
  3. One of the two preceding (following) words is wa.
  4. The current word is wa and the preceding (following) word is wb.
  5. The current word is wa and the preceding (following) word is tagged ta.
  6. The current word is wa.
  7. The preceding (following) word is wa and the preceding (following) tag is ta.
  8. The current word is wa, the preceding (following) word is wb, and the preceding (following) tag is ta.

Learning Algorithm

1. Generate all rules that correct at least one error.
2. For each rule:
   (a) Apply it to a copy of the most recent state of the training set.
   (b) Score the result using the objective function.
3. Select the rule with the best score.
4. Update the training set by applying the selected rule.
5. Stop if the score is smaller than some pre-set threshold T; otherwise repeat from step 1.

Rules Learnt

The first rules learnt by Brill's POS tagger (with examples):

#  From  To  If                                  Example
1  NN    VB  previous tag is TO                  to/TO conflict/NN→VB
2  VBP   VB  one of the previous 3 tags is MD    might/MD vanish/VBP→VB
3  NN    VB  one of the previous two tags is MD  might/MD not reply/NN→VB
4  VB    NN  one of the previous two tags is DT  the/DT amazing play/VB→NN
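To make the learning algorithm above concrete, here is a minimal illustrative Python sketch, not Brill's implementation: it supports only a single hypothetical template ('change tag X to Y if the previous tag is Z'), uses the raw error count as the objective function, and ignores efficiency entirely. On the toy data it learns exactly a rule of the kind shown in the table above.

```python
def apply_rule(rule, tags):
    """Apply one rule (from_tag, to_tag, prev_tag) in a left-to-right pass."""
    from_tag, to_tag, prev_tag = rule
    tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            tags[i] = to_tag
    return tags

def errors(tags, truth):
    """Objective function: number of tagging errors."""
    return sum(t != g for t, g in zip(tags, truth))

def learn(tags, truth, threshold=1):
    """Greedy TBL training loop over one tag sequence."""
    tags, learnt = list(tags), []
    while True:
        # Step 1: generate all rules that correct at least one error.
        candidates = {(tags[i], truth[i], tags[i - 1])
                      for i in range(1, len(tags)) if tags[i] != truth[i]}
        # Step 2: score each candidate on a copy of the current state.
        best, best_score = None, 0
        for rule in candidates:
            score = errors(tags, truth) - errors(apply_rule(rule, tags), truth)
            if score > best_score:
                best, best_score = rule, score
        # Steps 3-5: stop if no rule clears the threshold, otherwise commit
        # the best rule and update the training set.
        if best is None or best_score < threshold:
            return learnt
        learnt.append(best)
        tags = apply_rule(best, tags)

# Toy usage (hypothetical data): initial-state tags vs. the truth.
initial = ["TO", "NN", "IN", "DT", "NN"]
truth   = ["TO", "VB", "IN", "DT", "NN"]
print(learn(initial, truth))   # -> [('NN', 'VB', 'TO')], cf. rule 1 above
```

A full implementation instantiates all the templates from the previous slides and runs over a whole corpus; the naive rescoring of every candidate in every iteration is exactly what the speedups on the following slides attack.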
Tagging Unknown Words

Additional rule templates use character-based cues.
Change the tag of an unknown word from X to Y if:
1. Deleting the prefix (suffix) x, |x| ≤ 4, results in a word.
2. The first (last) 1–4 characters of the word are x.
3. Adding the character string x, |x| ≤ 4, as a prefix (suffix) results in a word.
4. Word w appears immediately to the left (right) of the word.
5. Character z appears in the word.

Unknown Words: Rules Learnt

#   From  To   If               Example
1   NN    NNS  has suffix -s    rules/NN→NNS
4   NN    VBN  has suffix -ed   tagged/NN→VBN
5   NN    VBG  has suffix -ing  applying/NN→VBG
18  NNS   NN   has suffix -ss   actress/NNS→NN

Training Speedup: Hepple

Disallows interaction between learnt rules by enforcing two assumptions:
- Sample independence: a state change in a sample does not change the context of surrounding samples.
- Rule commitment: there will be at most one state change per sample.
⇒ Impressive reduction in training time, but the quality of the results is reduced (the assumptions do not always hold).

'Lossless' Speedup: Fast TBL

1. Store for each rule r that corrects at least one error:
   - good(r): the number of errors corrected by r
   - bad(r): the number of errors introduced by r
2. Select the rule b with the best score. Stop if the score is smaller than a threshold T.
3. Apply b to each sample s.
4. Considering only samples in the set ∪ {V(s) | b changes s}, where V(s) is the set of samples whose tag might depend on s (the 'vicinity' of s; s ∈ V(s)):
   - Update good(r) and bad(r) for all stored rules, discarding rules whose good(r) reaches 0.
   - Add rules with a positive good(r) that are not yet stored.
Repeat from step 2. [Ngai and Florian 2001]
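To see why restricting updates to the vicinity works, here is a toy sketch of the count maintenance behind step 4. It is not Ngai and Florian's implementation: it assumes the same single hypothetical previous-tag template as the earlier sketch, assumes `rules`, `good` and `bad` already hold exact counts from the initial full pass, and applies the selected rule 'simultaneously', i.e. using the pre-change tags as context.

```python
from collections import namedtuple

# Hypothetical template: change from_tag to to_tag if the previous tag is prev.
Rule = namedtuple("Rule", "from_tag to_tag prev")

def fires(rule, cur, i):
    """Does `rule` trigger at position i under the current tagging `cur`?"""
    return i > 0 and cur[i] == rule.from_tag and cur[i - 1] == rule.prev

def contribution(rule, cur, truth, i):
    """(good, bad) contribution of position i to the rule's score."""
    if not fires(rule, cur, i):
        return 0, 0
    if rule.to_tag == truth[i] and cur[i] != truth[i]:
        return 1, 0    # the rule would correct an error here
    if cur[i] == truth[i]:
        return 0, 1    # the rule would introduce an error here (to_tag differs)
    return 0, 0        # wrong before and wrong after: neutral

def apply_and_update(best, rules, good, bad, cur, truth):
    """Apply the selected rule `best` and repair only the affected counts."""
    changed = [i for i in range(len(cur)) if fires(best, cur, i)]
    # With a previous-tag template, changing cur[i] can only alter the trigger
    # environments of positions i and i+1 (the 'vicinity' V(i)).
    affected = sorted({p for i in changed for p in (i, i + 1) if p < len(cur)})
    for r in rules:                          # remove the old contributions
        for p in affected:
            g, b = contribution(r, cur, truth, p)
            good[r] -= g
            bad[r] -= b
    for i in changed:                        # apply the rule
        cur[i] = best.to_tag
    for r in rules:                          # add the new contributions
        for p in affected:
            g, b = contribution(r, cur, truth, p)
            good[r] += g
            bad[r] += b
    # A full implementation would also add rules that newly correct an error at
    # an affected position and discard rules whose good(r) drops to 0.
```

Because only positions in the vicinity of a change can alter any rule's good/bad counts, the full rescoring of the naive loop is avoided without changing which rules get selected, which is what makes the speedup 'lossless'.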
Text Chunking

A robust preparation for / alternative to full parsing.
- Input: A.P. Green currently has 2,664,098 shares outstanding.
- Expected output: [NP A.P. Green] [ADVP currently] [VP has] [NP 2,664,098 shares] [ADJP outstanding].
- Alternative representation: A.P./B-NP Green/I-NP currently/B-ADVP has/B-VP 2,664,098/B-NP shares/I-NP outstanding/B-ADJP ./O
- Rules: similar to those used for POS tagging, considering
  - words
  - POS tags
  - chunk tags

Prepositional Phrase Attachment

Samples:
1. I [VB washed] [NP the shirt] [PP with soap and water].
2. I [VB washed] [NP the shirt] [PP with pockets].
Task: Is the prepositional phrase attached to the verb (sample 1) or to the noun phrase (sample 2)?
Approach: Apply TBL to the 4-tuple of base head words, tagging each tuple as either VB or NP:
1. wash shirt with soap
2. wash shirt with pocket
Rules: Templates consider the words in the tuple and their semantic classes (WordNet hierarchy).

Evaluation

POS tagging:
            Regular TBL   Fast TBL    Hepple
Accuracy    96.61%        96.61%      96.23%
Time        38:06 h       17:21 min   6:13 min

Prepositional phrase attachment:
            Regular TBL   Fast TBL    Hepple
Accuracy    81.0%         81.0%       77.8%
Time        3:10 h        14:38 min   4:01 min

Scaling with the amount of input data: Fast TBL is linear, regular TBL almost quadratic.

Advantages

- Can capture more context than Markov models
- Always learns on the whole data set – no 'divide and conquer' ⇒ no data sparseness:
  - The target evaluation criterion can be used directly for training; no need for indirect measures (e.g. entropy)
  - No overtraining
- Can consider its own (intermediate) results on the whole context
⇒ More powerful than other methods such as decision trees [Brill 1995, sec. 3]

More Advantages

- Can do any processing, not only classification:
  - Can change the structure of the input (e.g. a parse tree)
  - Can be used as a postprocessor to any annotation system
- The resulting model is easy to review and understand
- Very fast to apply – the rule set can be converted into a finite-state transducer [Roche and Schabes 1995] (for tagging and classification) or a finite-state tree automaton [Satta and Brill 1996] (for parsing and other tree transformations)
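As an illustration of the last two points (cheap application and use as a postprocessor), applying a learned rule list is nothing more than one deterministic left-to-right pass per rule over the output of an arbitrary base annotator; Roche and Schabes's compilation merges these passes into a single transducer. The sketch again assumes the single hypothetical previous-tag template and hypothetical example data.

```python
def postprocess(rules, tags):
    """Correct the output of any base annotator by applying the learned
    (from_tag, to_tag, prev_tag) rules in the order they were learned."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

# Toy usage: the rule from the learning sketch, applied to fresh tagger output.
rules = [("NN", "VB", "TO")]
print(postprocess(rules, ["PRP", "VBD", "TO", "NN", "NNS"]))
# -> ['PRP', 'VBD', 'TO', 'VB', 'NNS']
```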
... and Disadvantages

- Greedy learning, so the rule sequence found might not be optimal
- Not a probabilistic method:
  - Cannot directly return more than one result (k-best tagging can be added but is not built in [Brill 1995, sec. 4.4])
  - Cannot measure the confidence of its results (though [Florian et al. 2000] estimate probabilities by converting transformation rule lists to decision trees and computing distributions over equivalence classes)

References

Brill, Eric (1995). Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4):543–565. Online (access date: 2003-01-02): http://citeseer.nj.nec.com/brill95transformationbased.html.

Brill, Eric (1996). Learning to Parse with Transformations. In: Recent Advances in Parsing Technology. Kluwer.

Brill, Eric and Philip Resnik (1994). A Rule-Based Approach to Prepositional Phrase Attachment Disambiguation. In Proceedings of COLING '94, Kyoto. Online (access date: 2003-02-04): http://www.cs.jhu.edu/~brill/pp-attachment.ps.

Day, David; John Aberdeen; Lynette Hirschman; Robyn Kozierok; Patricia Robinson; and Marc Vilain (1997). Mixed-Initiative Development of Language Processing Systems. In Fifth Conference on Applied Natural Language Processing, pp. 348–355. Association for Computational Linguistics. Online (access date: 2003-02-01): http://www.mitre.org/technology/alembic-workbench/ANLP97-bigger.html.

Florian, Radu; John C. Henderson; and Grace Ngai (2000). Coaxing Confidences from an Old Friend: Probabilistic Classifications from Transformation Rule Lists. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), Hong Kong University of Science and Technology. Online (access date: 2003-01-27): http://arXiv.org/ps/cs/0104020.

Ngai, Grace and Radu Florian (2001). Transformation-Based Learning in the Fast Lane. In Proceedings of the Second Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Pittsburgh, PA. Online (access date: 2003-01-27): http://nlp.cs.jhu.edu/%7Erflorian/papers/naacl01.ps.

Ramshaw, Lance and Mitch Marcus (1995). Text Chunking Using Transformation-Based Learning. In Proceedings of the Third Workshop on Very Large Corpora, eds. David Yarowsky and Kenneth Church, pp. 82–94. Association for Computational Linguistics, Somerset, New Jersey. Online (access date: 2003-02-04): http://citeseer.nj.nec.com/ramshaw95text.html.

Roche, Emmanuel and Yves Schabes (1995). Deterministic Part-of-Speech Tagging with Finite-State Transducers. Computational Linguistics, 21(2):227–253. Online (access date: 2003-01-31): http://citeseer.nj.nec.com/roche95deterministic.html.

Samuel, Ken (1998). Lazy Transformation-Based Learning. In Proceedings of the 11th International Florida Artificial Intelligence Research Symposium Conference. Online (access date: 2003-01-27): http://xxx.lanl.gov/ps/cmp-lg/9806003.

Satta, Giorgio and Eric Brill (1996). Efficient Transformation-Based Parsing. In Proceedings of ACL 1996. Online (access date: 2003-01-27): http://www.cs.jhu.edu/~brill/Eff_Pars.ps.