© Tilman Becker, DFKI
March 2002 (‹#›)
Large Spoken Language
Dialogue Systems:
Verbmobil & SmartKom
Tilman Becker
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
becker@dfki.de
http://verbmobil.dfki.de  http://www.smartkom.org


© Tilman Becker, DFKI
March 2002  (‹#›)
Overview
•Speech-to-speech translation: Verbmobil
•Multi-Modal Man-Machine Interaction: SmartKom
•Zooming in: Natural Language Generation


© Tilman Becker, DFKI
March 2002  (‹#›)
Content
•Overview of Verbmobil
•A walk through the system
–Acoustic Processing
–Dialog Translation
–Selection and Speech Synthesis
•Technical issues
•Human Factors and Experiences


© Tilman Becker, DFKI
March 2002 (‹#›)
Overview of Verbmobil
Challenges, Partners, and General Approaches


© Tilman Becker, DFKI
March 2002  (‹#›)
What is Verbmobil?
•Speech-to-speech translation system
•Robust processing of spontaneous dialogs
•Speaker independent (adaptive)
•Languages: English, German, Japanese
•Domains: Appointment scheduling, travel planning and hotel reservation, remote PC maintenance
•The system mediates between two humans, it does not play an active role
•There is no control of the ongoing dialog by the system


© Tilman Becker, DFKI
March 2002  (‹#›)
Input Conditions
Naturalness
Adaptability
  Dialog Capabilities
Close-Speaking
Microphone/Headset
Push-to-talk
Telephone,
Pause-based
Segmentation
Isolated Words
Read
Continuous
Speech
Speaker
Independent
Speaker
Dependent
Monolog
Dictation
Information-
seeking Dialog
Open Microphone,
GSM Quality
Spontaneous
Speech
Speaker
Adaptive
Multiparty
Negotiation
Verbmobil
Challenges for Language Engineering


© Tilman Becker, DFKI
March 2002  (‹#›)
Classification of Machine Translation
Methods
Syntactic
Analysis
Word
Structure
Word
Structure
Direct Translation
Syntactic Transfer
Semantic
Transfer
Interlingua
Semantic
Structure
Semantic
Structure
Semantic
Analysis
Semantic
Generation
Syntactic
Generation
Syntactic
Structure
Syntactic
Structure
Morphologic
Analysis
Morphologic
Generation
Source Language
Target Language


© Tilman Becker, DFKI
March 2002  (‹#›)
  The Verbmobil Partners


© Tilman Becker, DFKI
March 2002  (‹#›)
Prof. Mahr
TU Berlin
Dr. Klein, Dr. Wolf
DLR, PT
Prof. Hoffmann
TU Dresden
Prof. Paulus
TU Braunschweig
Prof. Görz
Prof. Niemann
Univ. Erlangen
Prof. v. Hahn
Univ. Hamburg
Prof. Tillmann
LMU München
Dr. Ruske
TU München
Dr. Block
Siemens, München
R. Reng
Temic, Ulm
Dipl.-Ing. Mangold
DaimlerChrysler, Ulm
Prof. Gibbon
Univ. Bielefeld
Prof. Blauert
Univ. Bochum
Prof. Rohrer
Univ. Stuttgart
Prof. Hinrichs
Univ. Tübingen
Prof. Waibel
Univ. Karlsruhe
A. Klüter
DFKI,
Kaiserslautern
Dr. Eisele
Philips, Aachen
Prof. Ney
RWTH Aachen
Prof. Hess
Univ. Bonn
Dr. Reuse
BMBF
Referat 524
Prof. Pinkal
Univ. d. Saarlandes
Prof. Uszkoreit
Prof. Wahlster
DFKI, Saarbrücken
Sprecheradaption
Multilinguale
Wortlisten
Signalnahe
Evaluierung
Erkenner DC,
Sprachsteuerung
(C, C++, Fortran)
Datensammlung,
Integrierte Verarbeitung
(C, C++, LISP, Prolog)
Woz-Experimente,
Datensammlung
Transfer (Prolog)
Multilinguale Erkenner
(C, C++)
Kontextaus-wertung
(LISP, Prolog, Java)
Prof. Kurematsu
ATR International, Kyoto, Japan
Prof. Waibel
CMU, Pittsburgh;
Prof. Sag
CSLI, Stanford, USA
Syntax,
Rob. Semantik, Dialog
(LISP, Prolog)
Datensammlung, Erkennung
Syntax
(C, C++, Prolog)
Datensammlung
Erkenner Aachen
Stat. Transfer
(C++,C)
Chunk-Parser
(Prolog)
Reparatur,
Prosodie D, E (C)
Akustische
Synthese
(C, C++)
System
integration
(C++, Tcl-Tk)
Multilinguale
Prosodiesteuerung
(C++,C)
  The Verbmobil Partners


© Tilman Becker, DFKI
March 2002  (‹#›)
•23 participating institutions (in Verbmobil II), from Germany and the USA
•Over 900 full-time employees and students involved over the whole duration
•Funded by the German Ministry for Education and Science and the participating companies:
Facts About the Project
BMBF-Funding Phase I, 1.01.93 – 31.12.96
62.7 Mio. DM
BMBF-Funding Phase II, 1.01.97 - 30.9.2000
53.3 Mio. DM
Industrial investment I+II
32.6 Mio. DM
Related industrial R & D activities
ca. 20 Mio. DM
Total
168.6 Mio. DM
31.6 Mio €
27   Mio €
16.5 Mio €
ca. 10   Mio €
 85.1 Mio €


© Tilman Becker, DFKI
March 2002  (‹#›)
Verbmobil – The Book
There are over 600 refereed papers on the various aspects of and achievements in Verbmobil.
Wolfgang Wahlster (ed.):
"Verbmobil: Foundations of     Speech-to-Speech Translation"
Springer-Verlag Berlin Heidelberg
New York. 679 Pages
ISBN 3-540-67783-6
U:\coling\pic_trans\buch.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
Typical Verbmobil Hardware
•SUN Ultra-Sparc 80
•4 processors (450 MHz)
•2 GB main memory
•8 GB swap
•no special signal processing hardware
•Desklab Gradient A/D converter or Sun internal audio device
•close-speaking cordless microphones


© Tilman Becker, DFKI
March 2002  (‹#›)
The Graphical User Interface
U:\coling\pic_trans\GUI.gif

-Grober Überblick über die Module und die Verarbeitungsweise
-Erste Demo:
-??


© Tilman Becker, DFKI
March 2002 (‹#›)
Walk Through the Verbmobil System
Detailed Module Presentation and Demonstration


© Tilman Becker, DFKI
March 2002 (‹#›)
Acoustic Processing


© Tilman Becker, DFKI
March 2002  (‹#›)
Recording, Synthesizing and Synchronization
•Task:
Providing a uniform interface to varying audio hardware; synchronizing in- and output
•Input:
Audio data and system states
•Method:
Introducing audio modules; Finite State Machine for synchronizing
•
•Result:
Audio Data and Synchronization
•Benefit:
Encapsulating audio hardware, “open microphone”, preventing out-of-sync or overlapping system
output
•Responsible:
DFKI, Kaiserslautern


© Tilman Becker, DFKI
March 2002  (‹#›)
Audio Configuration
•Configuration of the systems I/O behavior
–How many speakers?
–For every (possible) speaker:
•Input device (channel identification, speaker adaption)
•Output device(s) (translation output, destination for man/machine dialogs)
•Source language (or „unknown“)
–Desired system output categories
•
•Audio channel configuration
–Uniform configuration of heterogeneous audio hardware
–
U:\vm\snapshots\change_configuration.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
Recording Audio Data
•Turn-based processing, barge-in available for voice commands
•Different audio quality:
–lab-quality close-speaking microphone (16kHz)
–room microphone (16kHz)
–telephone quality (8kHz)
–GSM mobile (8kHz)
ðAudio module concept
–provides a uniform interface of different hardware devices to the system
–# of channels is only limited by hardware
•Open Microphone Approach (essential for telephone translation service!)
•Input/output synchronization
•No cross-talk allowed


© Tilman Becker, DFKI
March 2002  (‹#›)
Microphone 1
Microphone open
Speech input
Microphone 1
Microphone closed
Microphone 1
Speech output
Pause detection
Synchronization
Translation
Open Microphone Approach


© Tilman Becker, DFKI
March 2002  (‹#›)
Synchronisation
•Synchronization controls the high-level System behavior
•Realized via Finite State Machine
Dial-up-
Connection
Start
Welcome
User
Start
Recording
GUI
command
Stop
Recording
Utterance
recorded
Wait for
Echo
Echo
configured?
Wait for
best Hyp.
Best Hypothesis
configured?
Execute
action
Unknown word
detected
Voice
command
Voice
command
Voice
command
Confirm


© Tilman Becker, DFKI
March 2002  (‹#›)
Recognizing Speech
•Task:
Analyzing continuous spontaneous speech signals
•Input:
Audio data
•Method:
HMMs, class based language models, etc.
•
•Result:
Word Hypotheses Graphs (WHG) and speech commands
•Benefit:
Compact representation of hypotheses of what has been said
•Responsible:
DaimlerChrysler AG
University of Karlsruhe
RWTH Aachen
Philips GmbH (Language Models)


© Tilman Becker, DFKI
March 2002  (‹#›)
General Speech Recognition Task
U:\coling\pic_trans\signal2.gif
German
English
Japanese
U:\coling\pic_trans\recog1.gif
Audio Signal
Recognizers
Word Hypotheses Graph


© Tilman Becker, DFKI
March 2002  (‹#›)
Word Hypotheses Graphs (WHGs)
•WHGs realize the interface between acoustic and linguistic processing
U:\coling\pic_trans\recog1.gif
Edge = Word
Best Hypothesis
Acoustic Score


© Tilman Becker, DFKI
March 2002  (‹#›)
Focuses of Speech Recognition
in Verbmobil
Robustness
Multilinguality
Large
Vocabulary
Daimler
Chrysler
RWTH
Aachen
University of
Karlsruhe


© Tilman Becker, DFKI
March 2002  (‹#›)
Nine Available Recognizer Modules
•DaimlerChrysler
–German, 16 kHz, speaker adaptive, approx. 10000 words
–German, 8 kHz, telephone/GSM quality, speaker adaptive, approx. 10000 words
–English, 8 kHz, telephone/GSM quality, speaker adaptive, approx. 7000 words
•University of Karlsruhe
–German, 16 kHz, speaker adaptive, approx. 10000 words
–English, 16 kHz, speaker adaptive, approx. 7000 words
–Japanese, 16 kHz, speaker adaptive, approx. 2600 words
–Language Identification Component (German, English, Japanese)
•RWTH Aachen
–German, 16 kHz, speaker adaptive, approx. 10000 words
–German, 16 kHz, speaker dependent, approx. 30000 words


© Tilman Becker, DFKI
March 2002  (‹#›)
Principal Recognizer Architecture
HMM word
models
Language
model
Speech signal
short-time
analysis
vector
quantization
Search
Decoder
U:\coling\pic_trans\signal2.gif
audio
signal
acoustic
preprocessing
data
reduction
word and sentence
identification
output


© Tilman Becker, DFKI
March 2002  (‹#›)
The Speech Recognition Task
•Some Highlights of the Verbmobil Recognizers:
– Speaker adaptive recognition:
•Start speaker independent
•Recognition results enhance during the dialog
– Capable of dividing speech and noise input using garbage models
– Segmentation of speech input allows incremental processing
– Word class based language models and recognition allow flexible vocabulary
 extension
– Online vocabulary extension through unknown word detection (names, towns,
 street names, …)
– Integrated continuous und speech command recognition
•… and many more


© Tilman Becker, DFKI
March 2002  (‹#›)
Language Identification
•Features
–ID on 3 seconds speech signal (maximum)
–Real time factor 0.5
–Speaker independent
–Unknown audio channel
–Using language model know-how
•
•Flexible Architecture:
LID can be combined with any
speech recognizer
German
English
Japanese
LID
Recognizers


© Tilman Becker, DFKI
March 2002  (‹#›)
Prosodic Processing
•Task:
Recognizing prosodic phenomena (accents, sentence mood) and boundaries
•Input:
WHG and speech signal
•Method:
Neural networks and statistical classifiers
•
•Result:
WHG annotated with accent and boundary information
•Benefit:
Provides prosodic  information needed for correct translation of spontaneous speech
•Responsible:
Universität Erlangen-Nürnberg


© Tilman Becker, DFKI
March 2002  (‹#›)
Prosody in Speech Communication
•Prosody can help to disambiguate
•lexical and phrasal accent
•phrasing (chunks of speech)
•sentence mood
•emotion, attitude, foreign accent
•
•Parameters represented by Features
•F0 (fundamental frequency)
•Energy
•Duration
•Speech tempo
•Pause
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Speech Signal
Word Hypotheses Graph
Multilingual Prosody Module
Prosodic features:
l F0
l duration
l energy
l ....
Search Space
Restriction
Parsing
Dialog Act
Segmentation and
Recognition
Dialog
Understand.
Constraints for
Transfer
Translation
Lexical
Choice
Generation
Speech
Synthesis
Speaker
Adaptation
Boundary
Information
Boundary
Information
Sentence
Mood
Accented
Words
Prosodic Feature
Vector
Prosody in Verbmobil


© Tilman Becker, DFKI
March 2002  (‹#›)
What Linguistic Analysis Really Needs
• Syntactic Boundaries
  He saw ? the man ? with the telescope      Prosody cannot help
• Dialog Act Boundaries
  No, I have no time at all on Thursday. D
  But how about on Friday?
  Dialog acts are pragmatic units that chunk the input into
  units which can be processed alone.
• Prosodic Syntactic Boundaries
  Of course ? not ? on Saturday
  Syntactic boundaries that correlate to the acoustic-phonetic
  reality; help during analysis within one chunk/dialog act.
  Important in spontaneous speech with elliptical utterances.


© Tilman Becker, DFKI
March 2002  (‹#›)
Extraction of Prosodic Features
•computed for each word
•from basic prosodic features and segmental information
•over different time contexts
•modeling of FO:
linear regression coefficient, regression error, mean, median,   minimum, maximum, onset, offset
and their temporal locations
•modeling of energy--contour
mean, median, maximum, max-pos, regression coefficient, ...
and phoneme intrinsic normalizations
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Extraction of Prosodic Features


© Tilman Becker, DFKI
March 2002  (‹#›)
Prosodic Classification in Verbmobil
•five classes of boundaries: default, particles, phrases, clauses, sentences
•sentence mood: question vs. non-questions
•phrase accent: disambiguation of particles
•Computed by NN-classifiers and Language Models
•Language Models trained on a corpus annotated with syntactic prosodic boundaries and dialog act
boundaries


© Tilman Becker, DFKI
March 2002  (‹#›)
An Example
I am calling about the trip to Hanover on the seventh and eighth of March
...
2       3       I       50.284023       34      46       (ID r3485) (PR (S 1.00 0.00 0.00 0.00
0.00)  (A 0.82 0.18)  (F 0.92 0.08)  (I 0.63 0.37)  )
...
3       9       am      24.803406       47      52       (ID r3489) (PR (S 1.00 0.00 0.00 0.00
0.00)  (A 0.84 0.16)  (F 0.81 0.19)  (I 0.79 0.21)  )
3       10      am      32.151409       47      54       (ID r3490) (PR (S 1.00 0.00 0.00 0.00
0.00)  (A 0.88 0.12)  (F 0.37 0.63)  (I 0.65 0.35)  )
...
9       11      going   142.015503      53      91       (ID r3504) (PR (S 0.94 0.00 0.05 0.00
0.00)  (A 0.14 0.86)  (F 0.10 0.90)  (I 0.93 0.07)  )
10      11      calling 131.019409      55      91       (ID r3505) (PR (S 0.39 0.01 0.32 0.27
0.01)  (A 0.07 0.93)  (F 0.13 0.87)  (I 0.93 0.07)  )
11      12      about   125.144707      92      124      (ID r3506) (PR (S 1.00 0.00 0.00 0.00
0.00)  (A 0.22 0.78)  (F 0.92 0.08)  (I 0.74 0.26)  )
12      13      the     40.895718       125     136      (ID r3507) (PR (S 1.00 0.00 0.00 0.00
0.00)  (A 0.90 0.10)  (F 1.00 0.00)  (I 0.65 0.35)  )
12      13      that    42.615807       125     136      (ID r3508) (PR (S 0.80 0.00 0.07 0.00
0.12)  (A 0.84 0.16)  (F 1.00 0.00)  (I 0.86 0.14)  )
13      14      trip    106.785835      137     167      (ID r3509) (PR (S 0.10 0.00 0.80 0.10
0.00)  (A 0.24 0.76)  (F 0.03 0.97)  (I 0.55 0.45)  )
14      15      to      69.326729       168     188      (ID r3510) (PR (S 0.86 0.02 0.08 0.02
0.02)  (A 0.85 0.15)  (F 1.00 0.00)  (I 0.42 0.58)  )
15      16      Hanover 245.755707      189     261      (ID r3511) (PR (S 0.02 0.14 0.43 0.01
0.40)  (A 0.01 0.99)  (F 0.04 0.96)  (I 0.49 0.51)  )
...
16      18      and     69.891464       266     284      (ID r3514) (PR (S 0.57 0.08 0.11 0.23
0.02)  (A 0.87 0.13)  (F 0.95 0.05)  (I 0.84 0.16)  )
17      18      on      75.358749       264     280      (ID r3515) (PR (S 0.92 0.03 0.01 0.03
0.00)  (A 0.87 0.13)  (F 0.62 0.38)  (I 0.38 0.62)  )
18      19      the     37.180725       285     295      (ID r3516) (PR (S 1.00 0.00 0.00 0.00
0.00)  (A 0.94 0.06)  (F 0.98 0.02)  (I 0.84 0.16)  )
19      20      seventh 184.631897      296     350      (ID r3517) (PR (S 0.06 0.10 0.31 0.00
0.53)  (A 0.07 0.93)  (F 0.11 0.89)  (I 0.12 0.88)  )
20      21      and     44.750828       356     369      (ID r3518) (PR (S 0.99 0.00 0.01 0.00
0.00)  (A 0.85 0.15)  (F 0.15 0.85)  (I 0.92 0.08)  )
21      22      the     42.576515       370     376      (ID r3520) (PR (S 1.00 0.00 0.00 0.00
0.00)  (A 0.95 0.05)  (F 1.00 0.00)  (I 0.38 0.62)  )
22      23      eighth  134.293030      381     420      (ID r3521) (PR (S 0.00 0.00 0.99 0.00
0.01)  (A 0.24 0.76)  (F 0.38 0.62)  (I 0.12 0.88)  )
23      24      of      62.543167       425     443      (ID r3522) (PR (S 1.00 0.00 0.00 0.00
0.00)  (A 0.74 0.26)  (F 1.00 0.00)  (I 0.83 0.17)  )
24      25      March   204.886185      444     497      (ID r3523) (PR (S 0.02 0.63 0.03 0.02
0.30)  (A 0.04 0.96)  (F 0.03 0.97)  (I 0.38 0.62)  )


© Tilman Becker, DFKI
March 2002  (‹#›)
Repair of Self-Corrections
•Task:
Detecting and repairing self-corrections
•Input:
WHGs
•Method:
Stochastic models
•
•Result:
Enriched WHGs, including additional repaired hypotheses
•Benefit:
Enabling Verbmobil to repair self-corrections of spontaneous speech input
•Responsible:
Universität Erlangen-Nürnberg


© Tilman Becker, DFKI
March 2002  (‹#›)
The Understanding of Spontaneous Speech Repairs
I need a car next Tuesday oops Monday
Original Utterance
Editing Phase
Repair Phase
Reparandum
Editing Term
Reparans
Recognition of
Substitutions
Transformation of the
Word Hypotheses Graph
I need a car next Monday


© Tilman Becker, DFKI
March 2002  (‹#›)
Facts about Repairs in the Verbmobil Corpus
•21% of all turns in the Verbmobil corpus (79 562 turns ) contain at least one self correction
•The syntactic category is preserved in most cases
(For example: Out of a sample of 266 verb replacements, 224 are again mapped to verbs)
•Repairs take place in a restricted context
(in 98% the reparandum consists of  less than 5 words)
•Repair sequences underlie certain regularities
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Architecture of Repair Processing
U:\coling\pic_trans\repair1.tif
       “On Thursday I cannot no I can meet äh after one”


© Tilman Becker, DFKI
March 2002  (‹#›)
Scopus Detection
•The editing term (ET) is given by the prosody
•Wanted: Beginning (RB) and end (RE) of the Repair
•Search the best replacement of a word order on the left hand side of ET through a word order on
the right hand side of ET
Þrate the possible replacements
search space is limited through looking at 4 words before and after ET
•
•Choose the best rated replacement over a certain threshold


© Tilman Becker, DFKI
March 2002  (‹#›)
Repair Detection and Word Smoothing
U:\coling\pic_trans\repair2.gif U:\coling\pic_trans\repair3.gif


© Tilman Becker, DFKI
March 2002 (‹#›)
Dialog Translation


© Tilman Becker, DFKI
March 2002  (‹#›)
Multiple Approaches
•Mono-cultural approaches are dangerous
–humans vs. viruses ê diversity
–Microsoft vs. ILOVEYOU and copycats ê alternative software solutions
•Some sources of errors in a speech translation system
–external
•spontaneous speech: not well formed, hesitations, repairs
•bad acoustic conditions
•human dialog behavior
–internal
•knowledge gaps in modules
•software errors
•probabilistic processing
•
•q Use multiple engines, varying approaches on various stages of processing


© Tilman Becker, DFKI
March 2002  (‹#›)
•Exclusive alternatives: three different 16 kHz German speech recognizers with various capabilities
•Competing approaches:
–three parsers: HPSG, Chunk, Statistical
–five translation tracks: case-based, dialog-act based, statistical, substring- based, linguistic
(deep) semantic translation
•Needed: selection and combination of results from competing tracks
–parsers: combination of partial analyses in the semantic processing modules
–translation: preselection module
•
Multiple Approaches in Verbmobil


© Tilman Becker, DFKI
March 2002  (‹#›)
Multiple Translation Tracks - Approaches  and Advantages
•Case-based:
–Approach: uses examples from the aligned bilingual Verbmobil corpus
–Advantage: good translation if input matches example in corpus
•Dialog-act based:
–Approach: extract core intention (dialog act) and content
–Advantage: robust wrt. recognition errors
•Statistical
–Approach: use statistical language and translation models
–Advantage: guaranteed translation with high approximate correctness
• Substring- based
–Approach: combines statistical word alignment with precomputation of translation ”chunks” and
contextual clustering
–Advantage: guaranteed translation with high approximate correctness
• Linguistic (deep) semantic translation
–Approach: “classic” approach using semantic transfer
–Advantage: high quality translation in case of success


© Tilman Becker, DFKI
March 2002  (‹#›)
Example Based Translation
•Task:
Providing a translation based on translation templates and partial linguistic analysis
•Input:
WHGs or best Hypothesis
•Method:
Definite Clause Grammar (DCG), graph matching algorithms
•
•Result:
Translation and a confidence value
•Benefit:
Improving Verbmobils translation capabilities through an additional translation path
•Responsible:
DFKI, Kaiserslautern


© Tilman Becker, DFKI
March 2002  (‹#›)
The Case Based Approach
•Training is based on Verbmobil‘s bilangual corpus
–E: I am on vacation, on the sixth and the seventh.
–D: ich bin am sechsten und siebten verreist.
•
•Principle: Look up an example in the
example storage that matches the input
sentence best, use it’s translation as output
•
S
S´
T(S´)
T(S)
T(S) known
T(S) unknown
(EBMT)


© Tilman Becker, DFKI
March 2002  (‹#›)
Generalization in Example Based Machine Translation (EBMT)
•Handicap of this naive approach: inadequate coverage
–S :    I am not free on Friday.
–S’:    I am not free on Monday.
–T(S’): am Montag habe ich keine Zeit.
•
•Solution: partial generalization (analysis and generation)
–E: I am not free <Temp>.
–D: <Temp> habe ich keine Zeit.
•
•Automatic generalization approach:
–The grammar automatically generalizes the corpus (offline)
–The runtime module generalizes incoming input (online)
–Match generalized input sentence with generalized corpus example
–Result: instantiated corpus translation
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Generalization of WHGs
U:\coling\pic_trans\whg1.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
Example Based Translation – Some More Features
•Generalization grammar for temporals, names, locations (region, town, country), institutions
•Fast and robust WHG search:
–WHG packing
–Optimal alignment for fast corpus search
–Search space pruning
–Search space caching
–Any time capable
•Adequate confidence value for selection
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Dialog-Act Based Translation
•Task:
Robustly provide a translation of core intentions and contents of the domain
•Input:
Prosodically annotated best hypothesis (flat WHG)
•Method:
Statistical dialog-act classifier and Finite State Transducers
•
•Result:
Translation  and a confidence value, additionally content descriptions for the dialog module
•Benefit:
Robust translation and content extraction even when the recognition is erroneous
•Responsible:
DFKI, Saarbrücken


© Tilman Becker, DFKI
March 2002  (‹#›)
Dialog Acts
•Describe the core intention of an utterance
•32 acts defined in a hierarchy, 19 used in processing
•21 CD-ROMs with 1505 dialogs (German, English, Japanese) annotated with dialog acts for training
and test purposes
•Computation uses  bigram  language models
•Probabilities estimated from the annotated corpus
•Leave-One-Out test results  for  approx. 1000 German, English and Japanese dialogs: Recall 72.48
%  (27185 of  37505), Precision 69.90 %


© Tilman Becker, DFKI
March 2002  (‹#›)
Dialog Acts - The Hierarchy


© Tilman Becker, DFKI
March 2002  (‹#›)
Representation of Information and Extraction
•Semantic representation language, used also in the dialog and context modules
•Extraction using Finite State Transducers
•Semi-automatic creation exploiting semantic databases and lexica
•Comfortable development platform
\\Serv-204\vm-dialog\slides\phase-2\pls-9-10-12-1999\autopict.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
Processing Steps
Best Chain
I would so we were to leave Hamburg on the first
Utterance
good so we will leave Hamburg on the first
Dialog Act
INFORM
Content Rep.
has_move:[move,has_source_location:[city,has_name =‘hamburg‘],
                             has_departure_time:[date,time=[dom:1]].


© Tilman Becker, DFKI
March 2002  (‹#›)
Generation
•Generation templates (>140), depending on dialog act, topic, content
•Translated in Finite State Transducers
•Examples:
suggest scheduling $has_date
g:ich w"urde $* vorschlagen &loc_mode_dat
e:how about $*
suggest entertainment or($has_location,$has_theme)
g:wir k"onnten $* gehen &loc_mode_acc
e:we could go $*
     request_suggest
g:was schlagen Sie vor
e:what do you suggest
j:itsu ga yoroshii deshou ka
•
•Result for our example:  also wir fahren ab Hamburg am ersten
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Statistical Translation
•Task:
Provide approximative correct translations
•Input:
Prosodically annotated best hypothesis (flat WHG)
•Method:
Use statistical language and translation models
•
•Result:
Translation  and a confidence value
•Benefit:
Approximative correct translation for spontaneous speech
•Responsible:
RWTH Aachen


© Tilman Becker, DFKI
March 2002  (‹#›)
The Statistical Translation Model
•Task: translate the source string f in the most probable target string e:
•Bayes’ rule needs language model of the target language, and lexicon and alignment models
•Learned from aligned corpus


© Tilman Becker, DFKI
March 2002  (‹#›)
Alignment Templates
•Find corresponding words in source and target language sentences
•Difficult for language pairs with different word order
•Solution: alignment templates
–based on word classes (sparse data problem: approx. 40% of the words in the training corpus are
singletons)
–first step: statistically learn alignment of words for each translation direction
–second step: combine the alignments of both directions
–third step: statistically learn alignment of “phrases”, i.e. word sequences
–


© Tilman Becker, DFKI
March 2002  (‹#›)
Alignment
•
Word-to-Word                vs.                   Alignment Templates


© Tilman Becker, DFKI
March 2002  (‹#›)
Deep Translation
•Task:
Provide high quality translations
•Input:
Prosodically annotated WHG and contextual information
•Method:
Use syntactic and semantic approaches to analysis, transfer, and generation
•
•Result:
Translation containing content information, suited for high quality speech synthesis
•Benefit:
Delivers the highest quality, but is sensitive to recognition errors and spontaneous speech
phenomena
•Responsible:
Siemens AG, DFKI Saarbrücken, Universität Tübingen, Universität des Saarlandes, Universität
Stuttgart,
TU Berlin, CSLI Stanford


© Tilman Becker, DFKI
March 2002  (‹#›)
Modules Involved
•Integrated processing comprises
– search through the WHG
– statistic parser
– chunk parser
•Semantic Construction provides VITs from
  statistic and chunk parser output
•Deep Analysis: HPSG Parser
•Dialog Semantics:combination of
 parsing results, and semantic resolution
•Transfer: VIT to VIT transfer
•Generation: TAG generation from VITs
•Dialog+Context: provides contextual
  information
U:\coling\pic_trans\GUI_deep_analysis.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
The Multi-Parser Approach
•Verbmobil uses three different syntactic parsers:
an HPSG parser, a chunk parser, and a probabilistic LR parser.
•Every parser implements another level of parsing accuracy, depth of syntactic analysis, and
robustness of the analyzing process.
–Chunk parser: Most robust but least accurate analysis
–HPSG parser: Most accurate by least robust analysis
–Probabilistic parser: Level of accuracy and robustness between HPSG and chunk parser
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Integrated Processing
•Gets  WHGs for the English, German, or Japanese speech input and dispatches WHG information to the
three parsers
•Provides an A* search algorithm that allows any connected parser to find the best scored path
using
–acoustic score of the speech recognizer
–Verbmobil  trigram language model
•Parsers  analyze the same utterance simultaneously


© Tilman Becker, DFKI
March 2002  (‹#›)
VIT: Verbmobil Interface Term
•Common syntactic-semantic interface
•Contains all linguistic information relevant for translation
•Record-like data structure: variable-free lists of non-recursive terms
•``Flat'' set representations: semantic, scopal, sortal,  morpho-syntactic,
prosodic, and discourse information
• Labels relate different kinds of information
•Abstract Data Type implements construction, access, update, check,
print, etc. facilities


© Tilman Becker, DFKI
March 2002  (‹#›)
VIT: Verbmobil Interface Term
vit(vitID(sid(...),                   %Segment ID
    []),                              %WHG-String
    index(l250,l234,i72),             %Index
    [start_v(l248,i72),               %Conditions
     arg1(l248,i72,i75),
     nop(l240,h85),
     quest(l249,h84),
     time(l238,i73),
     abstr_vacation(l247,i75),
     pron(l242,i74),
     poss(l244,i75,i74),
     temp_loc(l239,i72,i73),
     def(l245,i75,h87,h86),
     whq(l235,i73,h83,h82)],
    [in_g(l235,l237), ...             %Constraints
     leq(l234,h85), ...],
    [s_class(l240,mp), ...],          %Sorts
    [ana_ante(i74,[i75,i69,i67,i66]), %Discourse
     prontype(i74,third,std), ...],
    [gend(i75,masc), num(i75,sg)],    %Syntax
    [ta_mood(i72,ind), ...],          %Tense and Aspect
    [...])                            %Prosody
When do your vacations begin?


© Tilman Becker, DFKI
March 2002  (‹#›)
VIT: Verbmobil Interface Term
H:\vit.gif
We meet at the station.


© Tilman Becker, DFKI
March 2002  (‹#›)
HPSG Processing
•Task:
Thorough syntactic analysis
•Input:
Word chains from integrated processing
•Method:
Apply HPSG analysis
•
•Result:
Source language VITs
•Benefit:
Delivers the highest quality, but is sensitive to recognition errors and spontaneous speech
phenomena
•Responsible:
DFKI Saarbrücken, CSLI Stanford


© Tilman Becker, DFKI
March 2002  (‹#›)
Head Driven Phrase Structure Grammar
•Well known  advanced grammar theory in linguistics
•Based on the concept of a sign as integrated information structure for all types of linguistic
information
•Inherently multilingual by distinguishing universal principles from language specific aspects
•Typed feature structures with inheritance
•Small  number of rules, due to general principles
•Independent of specific processing strategies, usable for analysis and generation


© Tilman Becker, DFKI
March 2002  (‹#›)
HPSG Basic Principles
•Lexicalism: Words carry all the important information about what they can be combined with, thus
allowing to deal with regular and idiosyncratic properties in a uniform way
•Heads: Phrases contain a head which determines their combinatory potential, e.g. verbs as heads
determine what complements must be present, and what modifiers they can combine with
•Principles: Few language independent general projection principles stating, e.g., how to combine a
head with complements and modifiers
•Unification: Monotonically combines constraints from different sources


© Tilman Becker, DFKI
March 2002  (‹#›)
HPSG Parsing in Verbmobil
•active chart parser allowing bidirectional and island parsing on word hypotheses graphs or strings
•fast processing by
–eliminating disjunctions, enabling fast conjunctive unification
–precompiling type unifiability, avoiding runtime computations
–quick checks on mostly relevant features, avoiding full unification
–quick checks on possibly discontinuous constituents, e.g. separable verb prefixes in German,
reducing the chart size
–precompiling rule filters on possible rule sequences
–scoring rule applications
•anytime behavior
•robust: best partial analyses even for ungrammatical input


© Tilman Becker, DFKI
March 2002  (‹#›)
Statistical Parser
•Task:
Robust probabilistic parsing
•Input:
n-best hypotheses
•Method:
LR-Parser trained on Verbmobil´s
tree-bank
•
•Result:
Syntactic tree representation of the input sentence
•Benefit:
Increasing robustness in Verbmobil´s multi-engine parser strategy
•Responsible:
Siemens AG


© Tilman Becker, DFKI
March 2002  (‹#›)
Statistical Parser – Approach
•(Non-probabilistic) LR-parsing worked quite well for parsing speech in Verbmobil’s first phase.
•LR-parsing is well known to be able to parse huge amounts of input very efficiently.
•Probabilistic chart parsing of spontaneous speech input had some problems i.e. the combinatorical
explosion of edges in the chart on a word graph
•Þ try probabilistic LR-Parser
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Statistical Parser –
Training and Transformations
•Training process: derivation of an LR table and the estimation of unknown probabilistic parameters
from the Verbmobil tree bank
–Find the set of all context free rules (G) contained in the tree bank.
–Construct an LR table from G using well known standard
–Problems: sparse data, different annotation styles
Þ eliminate rules that do occur less than N times
•Transformations:
–Needed after parsing to correct errors of the probabilistic context free parser
–Rules are learned automatically from the training corpus


© Tilman Becker, DFKI
March 2002  (‹#›)
Chunk Parser
•Task:
Robust and efficient partial parsing, even on ill-formed input
•Input:
N-best hypotheses
•Method:
Cascaded Finite State Transducers
•
•Result:
Syntactic tree representation of the input sentence
•Benefit:
Increasing robustness in Verbmobils multi-engine parser strategy
•Responsible:
Universität Tübingen


© Tilman Becker, DFKI
March 2002  (‹#›)
Parsing Based on Chunks
•1st Step: Chunk Parsing using Cascaded Finite State Transducers
•“Chunks are non-recursive cores of ‘major’ phrases, i.e. NP, VP, PP, …”
•2nd Step:
•Building a syntactic tree out of the parsing results
•
•Benefit: Robust and efficient parsing
•But: Partial parsing: Often no spanning analysis
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Example for Chunks
•“Ich habe bei meinem letzten Besuch in Hannover so eine nette Kneipe entdeckt”
•Chunks:
•[NX Ich] [VX habe] [PX bei [ NX meinem letzten Besuch]] in [NX Hannover] [PX so [NX eine nette
Kneipe]]  [VX entdeckt].
•where
• [NX]: Extends from the beginning to the head of a NP
• [VX]: Includes all modals, auxiliary verbs and medial adverbs, but ends at the head verb or
predicate adjective
• [PX]: Extends to the end of an [NX]
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Tree-Building Tasks
•Determine the chunk position inside the syntactic tree
•Complete the internal chunk structure
•Determine functional categories and topological fields
•Rearrange chunks to obtain a complete syntactic tree
•


© Tilman Becker, DFKI
March 2002  (‹#›)
The Result is a Syntactic Tree
U:\coling\pic_trans\chunk3.gif
“Alright, and that should get us there about nine in the evening.”


© Tilman Becker, DFKI
March 2002  (‹#›)
... but analysis is not always spanning
U:\coling\pic_trans\chunk5.gif
“The train arise at seven thirty. We could take a cab it to the hotel problem train station.”


© Tilman Becker, DFKI
March 2002  (‹#›)
Semantic Construction
•Task:
Convert and extend syntax trees to VITs
•Input:
Syntax tree from statistical and chunk parsers
•Method:
Compositional construction using semantic lexicon
•
•Result:
VITs
•Benefit:
Providing results of shallow parser to the deep analysis track
•Responsible:
Universität Stuttgart (IMS)


© Tilman Becker, DFKI
March 2002  (‹#›)
Schematic Processing
Lexcion access and interpretation of the grammatical roles
Intermediate representation:
Application Tree
Compositional semantic construction
Intermediate representation:
VIT
Non compositional semantic construction using transfer rule engine
Intermediate representation:
Resulting VIT
Input:
Syntactic tree


© Tilman Becker, DFKI
March 2002  (‹#›)
Dialog Semantics
•Task:
Combining results from various parsers, reinterpret and correct VITs, and resolve non-local
ambiguities
•Input:
VITs from different parsers
•Method:
VIT models and rule based approaches
•
•Result:
VIT ready for transfer
•Benefit:
Enhances robustness of deep analysis and provides vital information for transfer
•Responsible:
Universität des Saarlandes,  Saarbrücken


© Tilman Becker, DFKI
March 2002  (‹#›)
Combining Analyses from Various Parsers
•Parsers deliver VITs for segments of a turn
•May be spanning analyses or just  partial fragments
•Combination necessary, both analyses of one parsers, but also analyses from various parsers
•Combination criteria
–HPSG is better than statistical parsers is better than chunk parser
–Integrated results are better than fragments
–Longer results are better than short ones
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Stochastic Choice of Spanning Results
•Parser internal scores not normalized Þ external scoring necessary
•Statistical model based on VIT content and dialog act (Tetragram language models)
•Search through Vit Hypotheses Graph VHG comparable to search through WHG
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Robust Semantic Processing
•Partial results don’t necessarily fit together
–phenomena of spontaneous speech
–recognition errors
–parsing errors
•Rule based correction


© Tilman Becker, DFKI
March 2002  (‹#›)
Bridging Mechanism for False Starts


© Tilman Becker, DFKI
March 2002  (‹#›)
Resolving Non-Local Ambiguities
•Based on prosody and dialog act information
•Ambiguities processed:
–Verb disambiguation:
Wir gehen in’s Theater  (We go to the theater)
Montag geht bei mir nicht   (Monday does not suit me)
–Sentence mood
 Wir gehen in’s Theater !   vs.   Wir gehen in’s Theater?
–Adverb disambiguation
Wir gehen eher in’s Theater  (We go to the theater earlier)
Montag geht bei mir eher nicht   (Monday does not really suit me)
–Anaphora and ellipsis resolution
–Japanese: Definiteness, topic phrases, zero anaphora


© Tilman Becker, DFKI
March 2002  (‹#›)
Semantic Based Transfer
•Task:
Transfer VITs from the source to the target language
•Input:
VITs
•Method:
Rule based transfer
•
•Result:
VITs for generation
•Benefit:
Translate VITs inside the deep translation path
•Responsible:
Universität Stuttgart (IMS)


© Tilman Becker, DFKI
March 2002  (‹#›)
Main characteristics
The Transfer Approach: Rule Based Transfer
•VITs are mapped onto VITs: Transfer is a VIT rewriting system
•Rule based, context conditions restrict application
•Transfer rules remove matching source language expressions from the VIT
•Efficient implementation
•Examples:
•Simple Rules: adelig(L,I) -> noble(L,I)
•Simple Templates: @mod(adelig, noble, L, I)
•Selectional restrictions: #sort_check(I,human) -> true
   @mod(gross,tall,_,I)
      #sort_check(I,location) -> true
   @mod(gross,large,_,I)
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Advanced Features of Transfer
•Structural changes:
–Adjective to PP: tagsüber -> during the day
–Insertion: übernachte -> spend the night
–…
•Disambiguation:
•
•
–
–
type of ambiguity
lexical
structural
discourse semantics,
 context
discourse semantics
transfer
kinds of knowledge
needed for
disambiguation
modules that contribute to
the resolution
syntactic, semantic,
contrastive, domain,
prosodic
 parsers, semantic
construction, discourse
semantics, transfer, context
syntactic, semantic,
domain
parsers, semantic
construction, transfer
anaphora  and
ellipsis
syntactic, semantic,
domain
semantic focus and
operator scope
prosodic, syntactic,
semantic, contrastive,
domain


© Tilman Becker, DFKI
March 2002  (‹#›)
Performance of Transfer
•Rules are compiled and packed
•18088 rules German Û English
•4694 rules German Û Japanese
•Mean runtime per sentence: 80 msec (Sun Ultra II, 300 MHz)
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Context Evaluation
•Task:
Resolving ambiguities in the dialog context during semantic transfer
•Input:
Requests from transfer
•Method:
Using world knowledge and rules
•Result:
disambiguated transfer requests
•Benefit:
Higher quality of transfer results
•Responsible:
Technical University (TU) Berlin


© Tilman Becker, DFKI
March 2002  (‹#›)
Context Evaluation - Tasks and Methods
•Supports semantic transfers and processes VITs
•Gets information from dialog module from shallow tracks
•Extends disambiguation of the dialog semantic module and uses ontological information


© Tilman Becker, DFKI
March 2002  (‹#›)
1
Nehmen wir dieses Hotel, ja. è Let us take this hotel.
Ich reserviere einen Platz. è I will reserve a room.
2
Machen wir das Abendessen dort. è Let us have dinner there.
Ich reserviere einen Platz. è I will reserve a table.
3
Gehen wir ins Theater. è Let us go to the theater.
Ich möchte Plätze reservieren. è I would like to reserve seats.
Example: Platz è room / table / seat
Using World Knowledge for Transfer


© Tilman Becker, DFKI
March 2002  (‹#›)
Dialog Processing
•Task:
Provides dialog context for all tracks and computes main information for dialog summaries
•Input:
Data from a lot of modules
•Method:
Frame-like topic structuring and rules
•Result:
context information and dialog summaries and minutes
•Benefit:
Verbmobil knows what happens throughout the dialog and can present it
•Responsible:
DFKI, Saarbrücken


© Tilman Becker, DFKI
March 2002  (‹#›)
Dialog Processing
•Dialog Memory:
–Stores information from each track
–Only dialog act based and semantic transfer provide abstract representations: Discourse
Representation Language  DRL:
I would so we were to leave Hamburg on the first
[INFORM,has_move:[move,has_source_location:[city,has_name='hamburg’,
has_departure_time:[date,time='day:1’
•Discourse Interpretation:
–Groups information into topics
–Completes information
–Keeps tracks of negotiation structure


© Tilman Becker, DFKI
March 2002  (‹#›)
Probabilistic
Analysis of Dialog
Acts (HMM)
Recognition of
Dialog Plans
(Plan Operators)
Dialog Act
Dialog Phase
Syntactic Analysis
Robust
Dialog Semantics
VIT
Semantic
Transfer
Dialog Act
Dialog Information in Semantic Transfer


© Tilman Becker, DFKI
March 2002  (‹#›)
Collaboration for a New Functionality: Result Summaries
•Provide the users with a summary of the topics that were agreed
•Two benefits
–have a piece of information to use in calendars etc.
–control the translation
•Approach: exploit already existing modules for
–content extraction
–dialog interpretation
–planning the summary
–generation
–transfer
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Result Summary


© Tilman Becker, DFKI
March 2002  (‹#›)
Generation
•Task:
Robustly generate the output of the semantic transfer in German, English, or Japanese
•Input:
VITs from transfer
•Method:
Constraint system for micro-planning, TAG grammar (reusing HPSG grammars) for syntactic realization
•Result:
Strings, enriched with content-to-speech (CTS) information to support synthesis
•Benefit:
Output from the semantic transfer track
•Responsible:
DFKI, Saarbrücken


© Tilman Becker, DFKI
March 2002  (‹#›)
Architecture
Microplanning
Module
•Selecting planning rules
•Lexical choice constraints
Surface
Realization
Module
•Inflection
•Synthesis Annotation
Syntactic
Realization
Module
•Selecting LTAG trees
•Tree combination
VIT (Verbmobil Interface Term)
Annotated
String
Robustness
Preprocessing
•Repairing structural problems
•Heuristics for generation gap


© Tilman Becker, DFKI
March 2002  (‹#›)
Preprocessing for Robustness
•Why pre-pocessing:
•Check and repair inconsistencies as early as possible
•Keep robustness and standard modules separate
•Alternative: relax constraints
•
•Preprocessing for robustness means:
•Executing a set of solution submodules in sequence
•For each problem found,  the preprocessor lowers a  confidence value for the generation output
which measures the reliability of our result
–
•


© Tilman Becker, DFKI
March 2002  (‹#›)
How much robustness?
•PRO:
In a dialog system, a poor translation might still be better than none at all,
•CON:
one of the shallow modules can be selected when deep processing fails,
so respect the inherent limitations of robustness.
Þ Generation knows its limits and sometimes decides not to produce a string
•
•Selection module: uses training corpus and confidence values to select from the different
translation paths


© Tilman Becker, DFKI
March 2002  (‹#›)
Microplanning: Create Syntactic Building Blocks
Method: Mapping of dependency structures
•
Example: Time Expressions
DEF (L,I,G,H)
DOWF (L1,I,mo)
ORD (L2,I,11)
MOFY (L3,I,may)
MONDAY1
ARG
ELEVENTH_DAY
SPEC
ARG
THE
MAY
Semantical dependency: VIT
Syntactical  dependency: TAG
ARG
OF_P


© Tilman Becker, DFKI
March 2002 (‹#›)
Multilingual Generation
for
Translation in Speech-to-Speech Dialogues
and
its Realization in Verbmobil
Tilman Becker . Anne Kilger . Peter Poller . Patrice Lopez
DFKI GmbH
Stuhlsatzenhausweg 3
66123 Saarbrücken
Tilman.Becker@dfki.de


© Tilman Becker, DFKI
March 2002  (‹#›)
VM-GECO:
VerbMobil’s GEneration COmponents
• Multilingual Generation: German, English, Japanese
•
• Language-independent kernel algorithms
• Language-specific knowlegde sources
•
• Extended “standard” pipeline architecture:
• Microplanning
• Syntactic Realization
• Surface Realization
Annotated String


© Tilman Becker, DFKI
March 2002  (‹#›)
Standard Architecture
Microplanning
Module
•Selecting planning rules
•Lexical choice constraints
Surface
Realization
Module
•Inflection
•Synthesis Annotation
Syntactic
Realization
Module
•Selecting LTAG trees
•Tree combination
VIT (Verbmobil Interface Term)
Annotated String
Rules
CDL-TAG
LTAG
HPSG
Rules


© Tilman Becker, DFKI
March 2002  (‹#›)
VIT: Verbmobil Interface Term
vit(vitID(sid(...),                   %Segment ID
    []),                              %WHG-String
    index(l250,l234,i72),             %Index
    [start_v(l248,i72),               %Conditions
     arg1(l248,i72,i75),
     nop(l240,h85),
     quest(l249,h84),
     time(l238,i73),
     abstr_vacation(l247,i75),
     pron(l242,i74),
     poss(l244,i75,i74),
     temp_loc(l239,i72,i73),
     def(l245,i75,h87,h86),
     whq(l235,i73,h83,h82)],
    [in_g(l235,l237), ...             %Constraints
     leq(l234,h85), ...],
    [s_class(l240,mp), ...],          %Sorts
    [ana_ante(i74,[i75,i69,i67,i66]), %Discourse
     prontype(i74,third,std), ...],
    [gend(i75,masc), num(i75,sg)],    %Syntax
    [ta_mood(i72,ind), ...],          %Tense and Aspect
    [...])                            %Prosody
When do your vacations begin?


© Tilman Becker, DFKI
March 2002  (‹#›)
VIT: Verbmobil Interface Term
H:\vit.gif
We meet at the station.


© Tilman Becker, DFKI
March 2002  (‹#›)
Microplanning:
deriving a sentence plan
•Microplanning tasks:
•determine type of utterance
•determine syntactic structure
•execute word choice
•
•Microplanning rules map parts of VIT input to partial dependency structures
•
•Implemented as constraint solving problem
•Approx. 7,200 microplanning rules (German)


© Tilman Becker, DFKI
March 2002  (‹#›)
Microplanning:
deriving a sentence plan
•An example: “the eleventh of May”
DEF (L,I,G,H)
DOWF (L1,I,mo)
ORD (L2,I,11)
MOFY (L3,I,may)
MONDAY1
ARG
ELEVENTH_DAY
SPEC
ARG
THE
MAY
Semantic dependency: VIT
Syntactic dependency: TAG
ARG
OF_P


© Tilman Becker, DFKI
March 2002  (‹#›)
Syntactic Realization
•Tasks of syntactic realization:
•selecting lexicalized (TAG) trees
•constructing a phrase structure tree
•provide all information for surface realization:
–inflection and annotation for CTS (content to speech) synthesis
•
•Based on FB-LTAG:
Feature-Based Lexicalized Tree Adjoining Grammars
•Compiled from HPSG grammars


© Tilman Becker, DFKI
March 2002  (‹#›)
Syntactic Realization:
•An example: “the eleventh of May”
MONDAY1
ARG
ELEVENTH_DAY
SPEC
ARG
THE
MAY
Syntactic dependency:
TAG derivation tree
ARG
OF_P
Syntactic phrase structure:
TAG derived tree
NP
N
PP
NP
P
N
DET
the
eleventh
of
May


© Tilman Becker, DFKI
March 2002  (‹#›)
HPSG to TAG Compilation
•HPSG: context-free rules (schemas)
•TAG: extended local lexical structures (trees)
•Off-line compilation computes all projections from lexical types
•Generates approx. 2,300 TAG trees from 250 lexical types
–Reuse existing Resources:
•Spontaneous speech, syntactic/lexical coverage of Verbmobil domain
–Speed vs. space
–TAG captures dependencies
–HPSG include syntax-semantics interface,
vast body of linguistic work


© Tilman Becker, DFKI
March 2002  (‹#›)
Problems for generation
•Technical problems
–should be eliminated
–hard to eliminate in a large-scale system
–better to be robust
•Task-inherent problems
–Spontaneous speech input
–Insufficiencies in the analysis and translation
–Generation gap:
mismatch between semantic input and coverage of the grammar
•®     Robust generator necessary


© Tilman Becker, DFKI
March 2002  (‹#›)
Problems for generation (2)
•(Task-inherent) problems manifest  themselves as fault
wrt. the interface language definition
•
• Problems with the structure of the semantic representation:
–unconnected subgraphs
–multiple predicates referring to the same object
–omission of obligatory arguments
•Problems with the content of the semantic representation:
–contradicting information
–missing information (e.g. agreement information)


© Tilman Becker, DFKI
March 2002  (‹#›)
Extended Architecture
Microplanning
Module
•Selecting planning rules
•Lexical choice constraints
Surface
Realization
Module
•Inflection
•Synthesis Annotation
Syntactic
Realization
Module
•Selecting LTAG trees
•Tree combination
VIT (Verbmobil Interface Term)
Annotated
String
Robustness
Preprocessing
•Repairing structural problems
•Heuristics for generation gap


© Tilman Becker, DFKI
March 2002  (‹#›)
Extended Architecture (2)
•Why pre-pocessing:
•Check and repair inconsistencies as early as possible
•Keep robustness and standard modules separate
•Alternative: relax constraints
•Preprocessing for robustness means:
•Executing a set of solution submodules in sequence
•For each problem found,  the preprocessor lowers a  confidence value for the generation output
which measures the reliability of our result


© Tilman Becker, DFKI
March 2002  (‹#›)
How much robustness?
•PRO:
In a dialogue system,
a poor translation might still be better than none at all,
•CON:
one of the shallow modules can be selected when deep processing fails,
so respect the inherent limitations of robustness.
•
•Selection module: uses training corpus and confidence values to select from the different
translation paths


© Tilman Becker, DFKI
March 2002  (‹#›)
Content-to-Speech (CTS) Output
•Output annotated with information like speech act, syntactic grouping, word classes, prominence,
...
•Enhances synthesis quality
•Example:
{SpeechAct:begin}{SpeechActType: Inform}{Language:English}{Utterance:begin}
{SentenceType:Aussagesatz}{WordClass:N}Verbmobil{WordClass:AUX}is {WordClass: DET-ART}
a{Prominence:2} {WordClass:ADJ}speaker_independent{WordClass:N} system{BorderProminence:5}
{WordClass:CONJ-SYN}that {Prominence:15}{WordClass:V}offers
{Prominence:4}{WordClass:N}translation_assistance{BorderProminence:2}  {WordClass:PREP-SYN}in
{Prominence:4}{WordClass:N}dialog {WordClass:N}situations {Utterance:end}


© Tilman Becker, DFKI
March 2002  (‹#›)
Minutes and Summaries
•Dialog module keeps track of the dialog:
dialog model, context extraction, translations: dialog history
•Three types of “protocols”:
•Minutes:    relevant exchanges
•Summary: dialog results
•Scripts:     complete dialog script


© Tilman Becker, DFKI
March 2002  (‹#›)
Multilingual Minutes and Summaries
•Multilinguality: Integration of transfer module:
German Summary
 (HTML)
Context
Syndialog
Dialog
VM-PROTO
GENGER
Transfer (G®E)
VM-PROTO
GENENG
English Summary
 (HTML)
Document structure
VITs
VITs


© Tilman Becker, DFKI
March 2002  (‹#›)
Conclusion
•Multilingual generation:
–kernel algorithms
–multilingual knowledge sources
•Robustness is necessary and useful
–within limits
•Output of classified, graded quality
•Generation of minutes and summaries
•The Verbmobil book:    2 articles on Generation


© Tilman Becker, DFKI
March 2002 (‹#›)
Selection and Speech Synthesis


© Tilman Becker, DFKI
March 2002  (‹#›)
Selection of Translations
•Task:
Select the “best” translation out of all deep and shallow translation paths
•Input:
Translations (text or content)
•Method:
Learning inequalities
•
•Result:
Selected Translation (text or content)
•Benefit:
Use the expertise of all translation paths for a particular utterance
•Responsible:
TU Berlin


© Tilman Becker, DFKI
March 2002  (‹#›)
Segment 1
Wenn wir den Termin vorziehen,
Segment 2
das würde mir gut passen.
Selection Module
Segment 1
Translated by Semantic Transfer
Segment 2
Translated by Case-Based Translation
Segment 1
If you prefer another hotel,
Segment 2
please let me know.
Alternative Translations with Confidence Values
Statistical
Translation
Dialog-Act Based
Translation
Semantic
Transfer
Case-Based
Translation
Integrating Deep and Shallow
Processing


© Tilman Becker, DFKI
March 2002  (‹#›)
The Selection Problem
•Selection is a difficult business:
•confidence values are difficult to compare
–probabilistic vs. knowledge based approaches
–no bird’s eyes view possible
•re-training necessary after changes in the engines
•training data must be produced
•
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Speech Synthesis
•Task:
Synthesize the translation
•Input:
text or content
•Method:
Multilevel selection and concatenation of speech units from large speech corpora
•
•Result:
Audio signal
•Benefit:
“End of the chain” of the speech-to-speech system
•Responsible:
Universität Bonn
TU Dresden
Universität Bochum
Daimler Chrysler


© Tilman Becker, DFKI
March 2002  (‹#›)
•Text-to-Speech (TTS): reading machine from arbitrary text in orthographic form. Unlimited domain.
The machine does not know what it is saying.
•Concept-to-Speech [or content-to-speech] (CTS): spoken out-put from a database inquiry or from a
dialog system. The input of the synthesizer comes from a semantic representation via a generation
module. The machine should have full knowledge of what it is saying.
•Reproductive Speech Synthesis: spoken output from pre-recorded samples. For strictly limited
domains.
Different Types of Synthesis


© Tilman Becker, DFKI
March 2002  (‹#›)
•Target utterances are synthesized from a corpus of utterances from within the domain.
•All units – whatever they are – have multiple instances in the corpus.
•No predefined units: the unit selection algorithm selects contiguous chunks of speech from the
data base – the longer, the better.
•When units of word size and above are applied, much of the natural prosody is preserved.
•Problem: coverage. Words not in the database cannot be synthesized in this way.
Corpus-Based Synthesis


© Tilman Becker, DFKI
March 2002  (‹#›)
I
have
time
monday.
on
Sentence to synthesize
I
have
time
monday
I
have
time
monday
I
have
monday
I
on
on
on
on
S
E
Edge direction
S
E
have
time
I
monday
on
Unit Selection Algorithm


© Tilman Becker, DFKI
March 2002  (‹#›)
•Word is the central unit and the starting point for all processing.
•Only if no suitable instance of a word is available in the database, an algorithm is invoked that
composes a word from subword units which are currently phones.
•The principal strategy on both the word and the sub-word levels is to concatenate chunks that are
as long as possible (up to a whole sentence).
•Like in CHATR, no prosodic manipulation is performed in this synthesis.
•In principle each word is needed in up to three positions (initial/medial, final declarative,
final interrogative) and in both accented and unaccented mode.
•For Verbmobil this would mean that we need about 80000 word tokens to be recorded (which is
prohibitive).
•Good coverage is reached by a selection of typical phrases from within the domain (dialogs from
the Verbmobil dialog database).
•Additional utterances realize frequent words in relevant contexts (e.g., opening phrase, names of
big cities).
•
Implementation


© Tilman Becker, DFKI
March 2002  (‹#›)
Architecture
Generation
Prosody
Generation
 and Unit
Selection
(Diphone)
Transcrip-
tion;
Accenting;
Phrases
Time-
Domain
Synthesis
(PSOLA)
Situation-
Dependent
Adap-
tation
Prosodic
and
Spectral
Adapt.
Recording
Audio Out
Diphone
Corpora
Pitch
Unit Selection
Word Level
Database
Word
Level
Speech
Corpora
Unit Selection
Phone Level
Database
Phone
Level


© Tilman Becker, DFKI
March 2002 (‹#›)
Verbmobil From a Software Engineering Point of View
System Design and Software Integration


© Tilman Becker, DFKI
March 2002  (‹#›)
Software Technology Challenges
•The goal
•Build an integrated system
•The situation
•Researchers do research
•Using different programming languages
•Researchers don’t want to be bothered with technical details
•The solution
•Introducing: the System Group
•Maximal technical support for the researchers/developers


© Tilman Becker, DFKI
March 2002  (‹#›)
The System Architecture
M1
M2
M3
M5
M6
M4
BB 2
BB 1
BB 3
M1
M2
M3
M4
M5
M6
Verbmobil I
Verbmobil II
Multi-Agent Architecture
Multi-Blackboard Architecture
 Modules know all communication partners
 Direct communication between modules
 Reconfiguration difficult
 Software: ICE and ICE Master
 Basic Platform: PVM
 Modules know their I/O data pools
 No direct communication between modules
    198 blackboards  vs. 2380 direct comm. paths
 Reconfiguration easy
 Several instances of one module/functionality
 Software: PCA and Module Manager
 Basic Platform: PVM
Blackboards


© Tilman Becker, DFKI
March 2002  (‹#›)
Audio Data
Word Hypotheses
Graph with
Prosodic Labels
VITs
Underspecified
Discourse
Representations
Command
Recognizer
Spontaneous
Speech Recognizer
Channel/Speaker
Adaptation
Prosodic
Analysis
Statistical
Parser
Dialog Act
Recognition
Chunk
Parser
HPSG
Parser
Semantic
Construction
Robust Dialog
Semantics
Semantic
Transfer
Generation
 Sample Pool Structure


© Tilman Becker, DFKI
March 2002  (‹#›)
Distributed Execution Supports Distributed Development
server 2
server 1
controlling terminal
User 2
User 1
Pool
Communication
Architecture


© Tilman Becker, DFKI
March 2002  (‹#›)
Support from the System Group (1)
•Integration framework (Testbed) with
•common communication mechanism for all used programming languages (C, C++, Lisp, Prolog, Java,
Fortran, Tcl/Tk)
•Narrow interface for all used programming languages
•Overall system control infrastructure
•Standards on various levels
–Installation
–Compilation
–Communication formats between modules
–...
•Toolbox for recording, replaying, testing, inspecting data exchanged between modules, ...


© Tilman Becker, DFKI
March 2002  (‹#›)
The Testbed is the Integration Framework for the Verbmobil System
PCA
Visualization
Manager
Automatic Test
Module
Synchronization
Module
User Command Mapper
Arbitration of Concurrent Modules
GUI
Testbed
Manager


© Tilman Becker, DFKI
March 2002  (‹#›)
Initializing
Synchronization
ACTIVE
Shutdown or Error
CONNECTED
WAITING
READY
DIED
The Testbed controls the System:
Module States


© Tilman Becker, DFKI
March 2002  (‹#›)
The GUI- Visualization and Debug Tool
.... and much more
D:\Orsini\pictures\VM_eng_std_31_5_99.tif D:\temp\sendw.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
Assure high system stability and robustness in connection with
large-scale testing
audio modules,
testbed
acoustic modules
2 Weeks
parsers and shallow translation modules
2 Weeks
linguistic modules and synthesis
2 Weeks
system delivery
2 - 4 Weeks
integration and stabilization phase
Support from the System Group (2):
Regular Integration Cycles


© Tilman Becker, DFKI
March 2002 (‹#›)
Human Factors


© Tilman Becker, DFKI
March 2002  (‹#›)
A Remark about Project Duration
•1993
–“You will need special hardware!”
–“1500 words speaker independent is                 impossible!”
–“Aren’t your goals unrealistic?”
•
•2000
–“Does it run on my notebook?”
–“Only 10 000 words?”
–“Why can’t it also translate in the domains X, Y, and Z?”
8 years is a long time, especially since the invention of Internet time
but
it is a unique chance for
• large scale, continuous research and development
• training people, collaborating, gaining experience
• collecting and annotating data


© Tilman Becker, DFKI
March 2002  (‹#›)
Management Challenges
•The goal
•Build an integrated system
•The situation
•Partners distributed and pretty independent
•Great variation in project and background experience
•Adjustment of project plan and goals over time needed
•The solution
•Define a flat management structure
•Create a group spirit


© Tilman Becker, DFKI
March 2002  (‹#›)
Project Organization
                      Verbmobil Consortium
Group of  Module Managers
Head of System Integration Group
A. Klüter
Module Coordinator
N. Reithinger
Manager
 Module  1
Manager
Module n
...
Verbmobil Advisory Board
Scientific Management
Scientific Head
W. Wahlster
Deputy Scientific Head
A. Waibel
Head of Project Management Group
R. Karger
German Federal Ministry for Research and Education


© Tilman Becker, DFKI
March 2002  (‹#›)
Module Managers
•Have technical hands on experience
•Responsible for one module, even if it is developed at different sites
•Volunteers (sort of ...)
•Meet regularly, despite e-mail, phone and other devices
•Define next milestones
•Define data and software integration plans
•
•Module coordinator coordinates the efforts and is the link to the scientific
•management
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Example: Optimization Schedule 2000
•21.02.  Delivery of CeBit system
•21.02. - 30.04. Optimization phase
•
•09.05. Delivery Verbmobil System 1.0
•starting 09.05
–speech recognizer evaluation
–turn evaluation
–15.03. - 28.04. End-To-End evaluation with feedback to developers
–27.03. - 07.04. Workshop Deep Processing
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Experience
•The group of module managers is a Good Thing™
•Common goals motivate
•Friendly peer pressure works most of the time
•Early problem detection and resolution in most cases
•Regular integration cycles focus and motivate
•
•q Proactive consensus management (PCM)
•
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Experience
•The System Group is a Good Thing™
•The multi blackboard architecture is a Good Thing™
•Crucial for the success of Verbmobil
•Software foundation for (almost) hassle free module development
•
•q Controlled distributed development possible
•
•


© Tilman Becker, DFKI
March 2002  (‹#›)
U:\coling\pic_trans\symposium.gif 30.7.2000,10:30-18:00 Saarbrücken, Kongresshalle


© Tilman Becker, DFKI
March 2002 (‹#›)
SmartKom
•Overview
•Architecture
•Core Areas: Analysis, Fusion, Generation, ...
•Dialogue Processing
C:\Orsini\pictures\Logos\SmartKom.tif


© Tilman Becker, DFKI
March 2002  (‹#›)
Overview
•Introduction
–Why Multimodal Interaction Systems?
–Reference Architecture for Multimodal Systems
•SmartKom: A Multimodal Interaction System
–SmartKom: A Transportable Interface Agent
–Situated Delegation-oriented Dialog Paradigm: Collaborative Problem Solving
–Modes in SmartKom
–More About the System
–M3L: XML based Multimodal Markup Language
–Multimodal Coordination
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Why Multimodal Interaction Systems?
(Oviatt&Cohen, CACM March 2000)
•Accessibility for diverse users and usage contexts
–Selection of modes by the user and by the system
e.g. lean- forward/lean-backward mode in a home environment, car
•Performance stability and robustness
–Users can select robust mode
–Mutual disambiguation and presentation
•Expressive power and efficiency
–Interface more powerful
–Faster
–Increased task completion
•
–

    User(s)
User
Modeling
Discourse
Management
Intention
Recognition
Interaction
Management
Mode
Analysis
Language
Graphics
Gesture
Sound
Media Input
Processing
Media Output
Rendering
Reference Architecture for Multimodal Systems
Context
 Management
Expectation
Management
User ID
Application
Interface
Integrate
Respond
Request
Terminate
Initiate
T
A
V
G
G
Mode
Coordination
Presentation
Design
Multimodal
 Reference
Resolution
Multimodal
Fusion
A
A
V
G
G
Mode
Design
Language
Graphics
Gesture
Sound
Animated
Presentation
Agent
Select Content
Design
Allocate
Coordinate
Layout
User
Model
Discourse
Model
Domain
Model
Media
Models
Task
Model
Representation and Inference, States and Histories
Application
Models
Context
Model
Reference
Resolution
Action
Planning
2 Nov. 2001
Dagstuhl Seminar
Fusion and Coordination
in Multimodal Interaction
edited by: M. Maybury


© Tilman Becker, DFKI
March 2002  (‹#›)
Overview
•Introduction
•SmartKom: A Multimodal Interaction System
–SmartKom: A Transportable Interface Agent
–Situated Delegation-oriented Dialog Paradigm: Collaborative Problem Solving
–Modes in SmartKom
–More About the System
–M3L: XML based Multimodal Markup Language
–Multimodal Coordination
•MIAMM
–Main Objectives
–Interaction using Haptics
•Research Roadmap of Multimodality
•Conclusion
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Human-Technology Interaction Lead Projects


© Tilman Becker, DFKI
March 2002  (‹#›)
The SmartKom Consortium
&#14;sony_blau.PICT 00000FC5&#14;Karl-Heinz II+ B244020D:
MediaInterface
European Media Lab
&#8;LME.pict 00000FC5&#14;Karl-Heinz II+ B244020D:
Uinv. Of
Munich
Univ. of
Stuttgart
Saarbrücken
Aachen
Dresden
Berkeley
Stuttgart
Munich
Univ. of
Erlangen
Heidelberg
Main Contractor
DFKI
Saarbrücken
Project Budget: € 25.5 million
Project Duration: 4 years (September 1999 – September 2003)
Ulm
E:\Daten\CD_DC\DC_055b.bmp C:\Orsini\Pictures\Logos\sympalog.tif
C:\Orsini\pictures\Logos\SmartKom.tif


© Tilman Becker, DFKI
March 2002  (‹#›)
SmartKom: A Transportable Interface Agent
MM
Dialogue
Back-
bone
Home:
EPG
Public:
Cinema,
Phone,
 Mail
Mobile:
Navigation
Application
Layer
SmartKom-Mobile:
A Handheld
Communication
Assistant
SmartKom-Public:
A Multimodal
Communication
 Kiosk
SmartKom-Home/Office:
 Multimodal Portal
to Information Services


© Tilman Becker, DFKI
March 2002  (‹#›)
An Example Interaction with SmartKom Mobile


© Tilman Becker, DFKI
March 2002  (‹#›)
C:\Orsini\PowerPoint\SmartKom\2_PLS_SmartKom\Blocher_Zeigegeste.tif
User
 specifies goal
 delegates task
 cooperate
 on problems
 asks questions
 presents results
 Service 1
Service 2
Service 3
IT Services
Personalized
Interaction
Agent
Situated Delegation-oriented Dialog Paradigm: Collaborative Problem Solving


© Tilman Becker, DFKI
March 2002  (‹#›)
Modes in SmartKom
•Speech
–Speaker independent speech recognition
–Prosodic input processing
–Synthesis
•Gesture
–Input
•Natural gestures (SIVIT)
•Pen-based
–Presentation agent
•Facial/body expression
–User state recognition
–System state presentation


© Tilman Becker, DFKI
March 2002  (‹#›)
The Main Modules on the Control GUI
E:\MO3BAAFDFA0002306.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
More About the System
•Modules realized as independent processes
•Not all must be there (critical path: speech or graphic input to speech or graphic output)
•(Mostly) independent from display size
•Pool Communication Architecture (PCA) based on PVM for Linux and NT
–Modules know only about their  I/O pools
–Literature:
•Andreas Klüter, Alassane Ndiaye, Heinz Kirchmann: Verbmobil From a Software Engineering Point of
View: System Design and Software Integration. In Wolfgang Wahlster: Verbmobil - Foundation of
Speech-To-Speech Translation. Springer, 2000.
•Data exchanged using M3L documents
•All modules and pools are visualized here ...


© Tilman Becker, DFKI
March 2002  (‹#›)
C:\Documents and Settings\bert\Desktop\archi.wmf
The Real Story


© Tilman Becker, DFKI
March 2002  (‹#›)
Frame Languages
Object-oriented Modeling
Primitives
NL/MM-Semantics
More formal Semantics
Subsumption, Inferences
W3C Standards
XML Schema/DTDs
M3L
The “Glue“ - M3L: XML based Multimodal Markup Language
Domain Knowledge
NL/MM Representation
     This year‘s  work
Pool
Pool
Pool
. ... .
XML schema
  XML   schema
XML schema


© Tilman Becker, DFKI
March 2002  (‹#›)
<?xml version="1.0"?>
<presentationContent>
[...]
    <abstractPresentationContent>
         <movieTheater structId=„pid3072”>
              <entityKey> cinema_17a </entityKey>
              <name> Europa  </name>
                  <geoCoordinate>
                    <x> 225 </x> <y> 230 </y>
                  </geoCoordinate>
            </movieTheater>
    </abstractPresentationContent>
[...]
     <panelElement>
        <map  structId="PM23">
          <boundingShape>
            <leftTop>
              <x>  0.5542 </x>  <y>  0.1950 </y>
            </leftTop>
            <rightBottom>
              <x>  0.9892 </x>  <y>  0.7068 </y>
            </rightBottom>
          </boundingShape>
          <contentReference> pid3072 </contentReference>
        </map>
      </panelElement>
[...]
</presentationContent>
An Example of the M3L Representation of the Multimodal Discourse Context
Kinokarte_mit_Smartakus
„No presentation without representation!“


© Tilman Becker, DFKI
March 2002  (‹#›)
Mode Processing: The Data Flow
Display Objects
with ID and
 Location
Dialogue
Backbone
Presentation
MM Fusion
User State
Domain Information
System State
Speech
            Speech
Agent‘s Posture and Behaviour
Mimics
(Neutral  or Annoyance)
Interaction
Modeling
Prosody (emotion)
Gesture


© Tilman Becker, DFKI
March 2002  (‹#›)
Processing the User‘s State
E:\MO3BAAFDFA0002306.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
Processing the User‘s State
•Annotated in the data from the data collection
•Recognized using mimics and prosody
•In case of anger activate the dynamic help
Object level
Meta level
This is great! Show me more!
That was quick!
One moment, let me think.
OK now, what are you doing?
Oh no, that’s ugly! A new one!
What the .... is going on?
•Different reference levels:


© Tilman Becker, DFKI
March 2002  (‹#›)
Wizard of Oz Data Collection
(LMU Munich)
•
•
•
WOZ_Experimente
Data distributed on DVD (1 DVD per 5 minute dialogue)


© Tilman Becker, DFKI
March 2002  (‹#›)
User States Annotated in 45 dialogues
Neutral
681
Joy/success
31
Reflection
59
Perplexity
31
Surprise/Astonishment
11
Annoyance/Failure
16
Only about 18% emotional user state events


© Tilman Becker, DFKI
March 2002  (‹#›)
User Independent Classification of Facial Expressions
(Univ. Erlangen)
Localization
Classification
(SVM, Eigenfaces)
 Annoyance
Rest (neutral)


© Tilman Becker, DFKI
March 2002  (‹#›)
Media Fusion
E:\MO3BAAFDFA0002306.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
Gesture Processing
•Objects on the screen are tagged with IDs and bounding boxes
•Gesture input
–Natural gestures recognized by SIVIT
–Touch sensitive screen
•Gesture recognition
–Location
–Type of gesture: pointing, tarrying, encircling
•Gesture Analysis
–Reference object in the display described as domain model (sub-)objects (M3L schemata)
–Compute distance to bounding boxes
–Output: gesture lattice with hypotheses
–


© Tilman Becker, DFKI
March 2002  (‹#›)
•Word lattice
•
•
•
•
•
•
•
•Prosody inserts boundary and stress information
•Speech analysis creates intention hypotheses
which movies are playing at the Metropol hypothesis(action:info,performance(cinema(name:Metropol))
..)
•
•
•
S:\sprachinterpretation\doc\intern\slides\pictures\whg.bmp
Speech Processing


© Tilman Becker, DFKI
March 2002  (‹#›)
Media Fusion
•Integrates gesture hypotheses in the intention hypotheses of speech analysis
•Information restriction possible from both media
•Possible but not necessary correspondence of gestures and placeholders (deictic expressions/
anaphora) in the intention hypothesis
•Necessary: Time coordination of gesture and speech information
•Time stamps in ALL M3L documents!!
•Output: sequence of intention hypothesis
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Presentation
E:\MO3BAAFDFA0002306.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
Presentation
•Starts with action planning
•Definition of an abstract presentation goal
•Presentation planner:
–Selects presentation, style, mode, and agent‘s general behaviour
–Activates natural language generator which activates the speech synthesis which returns audio data
and time-stamped phoneme/viseme sequence
•Character animation realizes the agent‘s behaviour
•Synchronized presentation of audio and visual information
–
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Partial view of SK architecture:
Multimodal Presentation
Action Planner
Presentation
Planner
Display
Management
Gesture
Generation
Graphics
Generation
Text Generation
Speech Synthesis
Display
Representation
for
Gesture Analysis
Functions
Modelling


© Tilman Becker, DFKI
March 2002  (‹#›)
User Perspective
Monitor:
frontal view
Table:
angled view


© Tilman Becker, DFKI
March 2002  (‹#›)
Lip Synchronization with Visemes
•Goal: present a speech prompt as natural as possible
•Viseme: elementary lip positions
•Correspondence of visemes and phonemes
•Examples:
•
C:\Documents and Settings\bert\Desktop\dagstuhl\animations\lipsync\r2.gif C:\Documents and
Settings\bert\Desktop\dagstuhl\animations\lipsync\u3.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
Behavioural Schemata
•Goal: the agent (Smartakus) is always active to signal the state of the system
•Four main states
–Wait for user‘s input
–User‘s input
–Processing
–System presentation
•Current body movements
–9 vital, 2 processing, 9 presentation (5 pointing, 2 movements, 2 face/mouth)
–About 60 basic movements
•


© Tilman Becker, DFKI
March 2002  (‹#›)
New animations
Examples for complex movements and speech-synchronized gestures
Enumeration
of items
Moving
in a circle
Pointing
to the
right


© Tilman Becker, DFKI
March 2002  (‹#›)
Example: Pointing Gestures
base position
preparation
stroke
retraction
composed gesture:
C:\PLS2\standart.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
[USEMAP]
Details:
Natural Language Generation in SmartKom
Discourse Updates in Interactive Dialogues


© Tilman Becker, DFKI
March 2002 (‹#›)
Natural Language Generation
in SmartKom
Tilman Becker
AT&T Research
2 Aug 2001
D:\Orsini\pictures\DFKI.tif
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
Stuhlsatzenhausweg 3, Geb. 43. 1 - 66123 Saarbrücken
Tel.: (0681) 302-5271
Fax.: (0681) 302-5020
Email: becker@dfki.de
www.smartkom.org
C:\Orsini\pictures\Logos\SmartKom.tif


© Tilman Becker, DFKI
March 2002  (‹#›)
Overview
•Architecture
•Presentation Goals
•Natural Language Generation
for Speech Synthesis
–Architecture
–Selection of data, sentence templates
–„fully specified templates“
–Concept-To-Speech information
•A short look aside: graphics and gestures
•Outlook


© Tilman Becker, DFKI
March 2002  (‹#›)
Presentation Begins in Action Planning
•
•Presentation as planning of a multi-modal dialog act
•
•Abstract presentation goals
(defined in an XML Schema
 presentation.xsd)


© Tilman Becker, DFKI
March 2002  (‹#›)
Natural Language Generation: Overview
•Input, Output
•Architecture
•Knowledge Bases
•The steps of generation
•Templates
–Tree Adjoining Grammars
–“fully specified templates”
•Concept-To-Speech information
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Typical Abstract Presentation Goals
•Presentation of information (usu. With an implicit request): “Here you can see...” : <inform>
•Explicit Request to fill a slot: “Please show me where you want to sit” : <request>
•Feedback: “Your reservation is secured...” <feedback>
•Canned presentations:
 <goodbye>
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Input for Natural Language Generation
<speechGenerationTask    goalKey="11">
  <speechPresentationGoal>
    <inform>
      <comment          commentTyp="onGraphicalPresentation">
          <graphicalRealisationType>            list          </graphicalRealisationType>
          <deepFocus    structReference="struct201"/>
          <content         structReference="struct17"/>
          <content         structReference="struct18"/>
      </comment>
    </inform>
  </speechPresentationGoal>
  <abstractPresentationContent>
          <performance>
            <avMedium>
              <title structId="struct18"> Schmalspurganoven </title>
            </avMedium>
            <cinema>
              <movieTheatre structId="struct17">
                <name>    Europa     </name>
              </movieTheatre>
            </cinema>
          <beginTime structId="struct201"/>
        </performance>
[...]
  </abstractPresentationContent>
</speechGenerationTask>


© Tilman Becker, DFKI
March 2002  (‹#›)
Output of Natural Language Generation
F:\generator\data\demodialog\anfangszeiten.gif
    Auf     der      Übersicht       sehen       Sie    die     Anfangszeiten     des    Films
Schmalspurgan.    im     Kino     Europa
<concept    sequence="11">
  <discourseElement      id="9011"  discourseRelation="simple">
    <sentence        id="tsen-9011"   sentenceMode="declarative">
[...]
              <syntaxElement  case="acc" argumentStatus="Object"  syntaxCategory="NP">
                <syntaxElement  syntaxCategory="Det">
                  <lexicalElement   partOfSpeechTag="ART">
                    <text>                      die                    </text>
                  </lexicalElement>
                </syntaxElement>
                <syntaxElement   syntaxCategory="N">
                      <lexicalElement    partOfSpeechTag="NN">
                        <text>            Anfangszeiten     </text>
                      </lexicalElement>
                    </syntaxElement>
[...]
      </syntaxElement>
    </sentence>
  </discourseElement>
</concept>


© Tilman Becker, DFKI
March 2002  (‹#›)
Sketch of the Architecture
Text
Planner
Sentence
Planner
Syntax
Realizer
What?
How?
Details..
XSLT
PrePlan
TAG


© Tilman Becker, DFKI
March 2002  (‹#›)
Knowledge Bases in NLG
•Defining the goal (XSLT Stylesheet,   What?)
•Planning rules (PrePlan,                  How?)
•(Template-)grammar (TAG, Realizer         How?)
•(Morphology)
•Lexicon (TAG, Realizer)
•Discourse memory (anaphora etc.)
•User model (“Interaktionsmodellierung”)
(register etc.)


© Tilman Becker, DFKI
March 2002  (‹#›)
First Step: Defining the Goal
•XSLT: Mapping abstract goals to realization goals, e.g.:
<request>
  <slotFill><select>
     <modality>gesture</modality>
   </select></slotFill>
  <requestFocus>
    <deepFocus idReference=“mf42”/>
  </requestFocus></request>
(showme mf42)

<xsl:template match="request/slotFill/select[normalize-space(modality/text())='gesture']">
  (showme
 <xsl:apply-templates select="requestFocus/deepFocus"/>
   )
</xsl:template>


© Tilman Becker, DFKI
March 2002  (‹#›)
First Step (2): Using Context Information
•XSLT: Creation of a generation knowledge base from the input, e.g.:
<performance id="mf745">
  <entityKey id="mf746">
    performance_1000030
  </entityKey>
  <avMedium id="mf747">
    <entityKey id="mf748">
      avMedium_1002535
    </entityKey>
    <title>
      O Brother, Where Art Thou?
    </title>
  </avMedium>
</performance>
(GKB (
 (performance mf745)
 (entitykey
   mf746
   performance_1000030)
 ...
 (title
   mf747
   “O Brother..”)
 ...
)


© Tilman Becker, DFKI
March 2002  (‹#›)
Second Step:
Sentence Planning with Templates
•Result is a derivation tree
•PrePlan (a simple planning tool in Java):
–(Text and) sentence planning
–Selection of templates and filling of slots, e.g.:
(overview mf42)
  ->
(select “You can see an overview”)
(adjoin “Node Overview-4711”)
(np-realize mf42)
–
–Select and adjoin refer to trees and nodes of the (TAG) Grammar


© Tilman Becker, DFKI
March 2002  (‹#›)
TAG Grammars
•Tree Adjoining Grammars (Joshi et al 1975)
•A grammar
–consists of partial trees,
–that are combined by two operations:
•Adjunction
•Substitution
–Lexicalized grammars:
•A set of possible partial trees for every word
•Every partial tree is a “maximal projection” of the word


© Tilman Becker, DFKI
March 2002  (‹#›)
TAG: Initial Trees
S
NP
VP
V
NP
see
NP
N
You
NP
Det
an
N
overview
Substitution as in context-free grammars:


© Tilman Becker, DFKI
March 2002  (‹#›)
TAG: Auxiliary Trees
NP
Det
an
N
overview
Adjunction is more powerful than context-free grammars:
N
N*
PP
P
over
NP


© Tilman Becker, DFKI
March 2002  (‹#›)
TAG with Templates
•Instead of lexicalized trees:
–A template tree contains the entire structure of a template
–…including all words
–A simplistic „template Grammar“ consists of complete sentences
–Can smoothly be developed into a complete grammar
•Problem:
–What are the right syntactic(?) structures?
–General problem with CTS


© Tilman Becker, DFKI
March 2002  (‹#›)
Planning a Derivation Tree
NP
Det
an
N
overview
S
NP
VP
V
NP
 see
N
you
Commenting on a
graphical presentation
Referring to a list
you-see-tree
NP_22
derived tree
derivation trees
an-overview-Baum
derived tree


© Tilman Becker, DFKI
March 2002  (‹#›)
Concept-To-Speech
•Syntactic Information is used to compute
Prosodic Information
•Sentences are combined to discourse tree
•Filtering of irrelevant syntactic features
•
•Synthesis is based on Festival
•Preprocessing traverses syntactic structure (Scheme)
•
•Work carried out at IMS, Stuttgart, Germany
Gregor Möhler, Antje Schweitzer
(Prof. Dogil)


© Tilman Becker, DFKI
March 2002  (‹#›)
CTS versus TTS
\\ARAKAKADU\moehler\Smartkom\Doc\IMS-Talks\demo1_discourse.gif
\\ARAKAKADU\moehler\Smartkom\Doc\IMS-Talks\demo1_wave.gif
L*H H%
L*H
H*L
H*L%
%
© Gregor Möhler, IMS Stuttgart


© Tilman Becker, DFKI
March 2002  (‹#›)
Templates
•Where do we get the templates from?
–Ideally from existing grammars:
•consistent
•short development time
•no/less expertise required
–Data collection for a new application:
•example dialogues
•Wizard of Oz experiments
•dialogue models
–Growing collection of “standard templates” (will lead to a real grammar)
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Current work
•Complete TAG implementation with unification:
–Porting an existing Unifier (LISP)
–XML-Representation of the grammar:
•Graphical tools
•XSLT mapping to/from other formats (LISP)
–
•Structure of planning rules:
–Separate text and sentence planning
•Extending the set of templates


© Tilman Becker, DFKI
March 2002  (‹#›)
Future Work
•Generating referring expressions
•Generating text for graphics, esp. for mobile scenario “no audio”
•Text planning
•Abstract “sentence plans”:
–Module within syntactic realization
•Various tools (next slide)
•Language independent steps of NLG
•
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Future Work
•Tools for:
–PrePlan planning rules
–Lexicon (morphology)
•Template tree development scenario:
–Parser (with a German grammar -- Kim Gerdes) produces derivation trees
–(Graphical) tool to
•select correct analysis
•relate to existing templates
•mark fixed/variable parts
[USEMAP]


© Tilman Becker, DFKI
March 2002  (‹#›)
MIAMM
•Multidimensional Information Access
using Multiple Modalities (IST-2000-29487)
–Cross Programme Action 2 User Friendliness, Human Factors, Multi-Lingual, Multi-Modal dialog modes
•Duration: September 2001 - February 2004
•Participants
–INRIA (Laboratoire Loria), FR [Coord.]
•Speech recognition, language analysis, contextual interpretation
–Deutsches Forschungszentrum für Künstliche Intelligenz, DE
•Graphical interface, language analysis, dialogue management
–Netherlands Organization for Applied Scientific Research (TNO), NL
•Task analysis, interaction scenarios, evaluation
–Sony International Europe GmbH, DE
•Multilingual speech recognition (en, de), software for haptic interaction, domain modeling,
hardware interaction
–CANON Research Centre Europe (CRE), UK
•Multimedia database and search application
–
C:\WINNT\Profiles\bert\Desktop\logo.gif


© Tilman Becker, DFKI
March 2002  (‹#›)
The Haptic Device
•Phantom (www.sensable.com)
C:\WINNT\Profiles\peters\Desktop\6dof.jpg
3 degrees of freedom force feedback unit


© Tilman Becker, DFKI
March 2002  (‹#›)
 Empirical and
Data-Driven Models
of Multimodality
2002
2005
 Advanced Methods
for Multimodal Communication

 Computational
Models
of Multimodality
Adequate Corpora
for MM Research
Mobile, Human-Centered, and
Intelligent Multimodal Interfaces

Multimodal
Interface Toolkit
Research Roadmap of Multimodality 2002-2005

XML-Encoded
MM Human-Human and
Human-Machine Corpora
Mobile Multimodal
Interaction Tools

 Standards for the
Annotation of MM
Training Corpora
Examples of Added-Value
of  Multimodality
Multimodal
Barge-In
  Markup Languages
for Multimodal Dialogue
Semantics
Models for Effective and
Trustworthy MM  HCI
Collection of Hardest and Most
Frequent/Relevant Phenomena
 Task- , Situation-
and User- Aware Multimodal Interaction
 Plug- and Play Infrastructure
Toolkits for
 Multimodal Systems
Situated and Task-
Specific MM Corpora
Common Representation of
Multimodal Content
Decision-theoretic, Symbolic and Hybrid
 Modules for MM Input Fusion
Reusable Components
for Multimodal Analysis
and Generation
Corpora with Multimodal
Artefacts and New Multi-
modal  Input Devices
Models of MM
Mutual Disambiguation
Multiparty MM
Interaction
2 Nov. 2001
Dagstuhl Seminar
Fusion and Coordination
in Multimodal Interaction
edited by: W. Wahlster
Multimodal Toolkit for
Universal Access


© Tilman Becker, DFKI
March 2002  (‹#›)
2006
2010
 Ecological Multimodal Interfaces
Research Roadmap of Multimodality 2006-2010
 Empirical and
Data-Driven Models
of Multimodality
 Advanced Methods
for Multimodal Communication
 Toolkits for
 Multimodal Systems
 Usability Evaluation
Methods for MM System
 Multimodal Feedback
and  Grounding
 Tailored and
Adaptive MM Interaction
 Incremental Feedback between
Modalities during Generation
 Models of MM
Collaboration
 Parametrized Model of
Multimodal Behaviour
 Demonstration of
Performance Advances
through Multimodal Interaction
 Real-time Localization
 and Motion/Eye
Tracking Technology
 Multimodality in VR
and AR Environments

 Resource-Bounded
Multimodal Interaction
 User‘s Theories
of System‘s
Multimodal Capabilities
Multicultural Adaptation
of Multimodal Presentations
Affective MM Communication
Test suites
and Benchmarks for
 Multimodal Interaction
Multimodal Models
of Engagement and Floor
Management
Non-Monotonic MM
Input Interpretation
Computational Models
of the Acquisition of MM
Communication Skills
Non-Intrusive
& Invisible MM
Input Sensors
Biologically-Inspired
Intersensory Coordination Models
2 Nov. 2001
Dagstuhl Seminar
Fusion and Coordination
in Multimodal Interaction
edited by: W. Wahlster


© Tilman Becker, DFKI
March 2002  (‹#›)
Research Roadmap of Multimodality 2001-2010
Enabling Technologies and Important Contributing Research Areas
Multimodal Input
Multimodal
Interaction
l Sensor Technologies
l Vision
l Speech  & Audio Technology
l Biometrics

l User Modelling
l Cognitive Science
l Discourse Theory
l Ergonomics
l Smart Graphics
l Design Theory
l Embodied Conversational Agents
l Speech Synthesis
Multimodal Output
2 Nov. 2001
Dagstuhl Seminar
Fusion and Coordination
in Multimodal Interaction
edited by: W. Wahlster
l  Planning
l Formal Ontologies
l Machine Learning
l Pattern Recognition


© Tilman Becker, DFKI
March 2002  (‹#›)
Multimodal Interaction in SmartKom
C:\Orsini\Pictures\SmartKom_Graphiken\Smartakus_Kinokarte.tif
Scenario:
public (mobile, home)
Application:
movie information
(EPG, email, phone, fax,
address book,
tv and vcr control, routing/tourist info)
U: I want to make a reservation in (R) this movie theater
S: This theater does not take reservations
U: Then a different one, (R) this one perhaps


© Tilman Becker, DFKI
March 2002 (‹#›)
IJCAI 2001
Workshop TASK-4
Seattle, WA, USA
D:\Orsini\pictures\DFKI.tif
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
Stuhlsatzenhausweg 3, Geb. 43.8 - 66123 Saarbrücken
Tel.: (0681) 302-5271
Email: {janal,becker}@dfki.de
www.dfki.de
Overlay as the basic operation
in discourse processing
Jan Alexandersson
Tilman Becker


© Tilman Becker, DFKI
March 2002  (‹#›)
•Construct a discourse memory of contextual information
•Hypotheses:
–enrich w/ context information
–compute scores
•discourse memory:
–enrich
–retract
–(partially) overwrite
Discourse modelling tasks


© Tilman Becker, DFKI
March 2002  (‹#›)
Discourse modelling
Architecture
Hypothese von IE
Hypothese an IE
Request
Response
Hypothese von IE
Hypothesis from IE
Hypothese an IE
Hypothesis to IE
Discourse
memory
complete
score
store


© Tilman Becker, DFKI
March 2002  (‹#›)
Dialog memory
•A typical  dialog situation:
–User: I want to see Matrix
–Sytem: Ok, it runs at 8 and at 10
–User: At 8
•Dialog memory:
–structured storage for utterances (and their meaning)
•“current context:”
–data structure representing the currently active context
–e.g.: Matrix at 8


© Tilman Becker, DFKI
March 2002  (‹#›)
Putting the user in context
•New information is added to current context,
•
•Result:
updated current context
•
•used, e.g. for a database query


© Tilman Becker, DFKI
March 2002  (‹#›)
Speech
analysis
Gesture
analysis
Application interface
 Speech
recognition
 Gesture
recognition
Multimodal
Chart Parser
Unification-
based
multimodal
Grammar
MVPQ, AT&T
Johnston 2000
Unification-based Integration of Speech and Gesture


© Tilman Becker, DFKI
March 2002  (‹#›)
Updating current context with Unification
•Representing complex discourse objects as typed feature structures (TFS), e.g. Johnston 1998
•Used, e.g. in media fusion:
–User: I want to see this one [pointing to movie “Matrix”]
–Speech: “I want to see X”
–Gesture: “When is Matrix showing?”
“I want to see Matrix.” ....
–Media Fusion: “I want to see Matrix.”
•Problem: enumeration of all structures (in deixis)


© Tilman Becker, DFKI
March 2002  (‹#›)
Typed feature structures and XML
•In the SmartKom project, discourse objects are represented in XML
•Mapping from XML to TFS assumed
•Example:


© Tilman Becker, DFKI
March 2002  (‹#›)
The limits of unification
•Not all new information is consistent with current context
•Even for Mediafusion:
–User: This one, (but) in green
•Some parts must be kept, some be overwritten
–“keep and overwrite”, M. Streit
•
•Provide a principled method,
based on unification
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Overlay to the rescue
•Unification is monotonic, reflexive operation
•old information from the current context can be changed, new information is more important
•q we need a non-monotonic, non-reflexive operation: overlay


© Tilman Becker, DFKI
March 2002  (‹#›)
Overlay to the rescue
•Task: compare new (intention) hypothesis against discourse history
•new information consistent with focus:
V Unifikation
•new in formation (partially) inconsistent with focus:
 V Overlay


© Tilman Becker, DFKI
March 2002  (‹#›)
Example for Unification
•U: I want to go to the movies tonight
•S: Here is a list of the films that are shown in  Heidelberg tonight: (SmartKom shows a list)
•U: I want to see (R) this one, where is it playing?
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Unification: monotonic operation


© Tilman Becker, DFKI
March 2002  (‹#›)
<domainObject>
  <entertainment>
    <broadcast>
      ... Schmalspurganoven ...
   </broadcast>
  </entertainment>
</domainObject>
<domainObject>
  <entertainment>
    <performance>
      <beginTime>
        <function>
          <between>
            <from>
              2000-12-13T12:34:56
            </from>
            <to>
              2000-12-13T23:59:59
            </to>
          </between>
        </function>
      </beginTime>
      <cinema>
        <movieTheater>
          <contact>
            <address>
              <town>
                Heidelberg
              </town>
            </address>
          </contact>
        </movieTheater>
      </cinema>
    </performance>
  </entertainment>
</domainObject>
<domainObject>
  <entertainment>
    <performance>
      ... Schmalspurganoven ...
   </performance>
  </entertainment>
</domainObject>
<domainObject>
  <entertainment>
    <performance>
      <beginTime>
       ...
      </beginTime>
      <cinema>
        <movieTheater>
          <contact>
            <address>
              <town>
                Heidelberg
              </town>
            </address>
          </contact>
        </movieTheater>
      </cinema>
      <avMedium>
        ...
        <title>
          Schmalspurganoven
        </title>
      </avMedium>
    </performance>
  </entertainment>
</domainObject>


© Tilman Becker, DFKI
March 2002  (‹#›)
Unification: compatibility condition
fail


© Tilman Becker, DFKI
March 2002  (‹#›)
Overlay:  nonmonotonic operation, that always succeeds


© Tilman Becker, DFKI
March 2002  (‹#›)
Example for Overlay
•U: I want to make a reservation in (R) this movie theater
•S: This theater does not take reservations
•U: Then a different one, (R) this one perhaps


© Tilman Becker, DFKI
March 2002  (‹#›)
<domainObject>
  <entertainment>
    <performance>
      <cinema>
        <movieTheater>
          ...
          <name>Studio Europa</name>
          <contact>
           ...
          </contact>
        </movieTheater>
      </cinema>
    </performance>
  </entertainment>
  <movieTheater/>
</domainObject>
<domainObject>
  <entertainment>
    <performance>
      <beginTime>
       ...
      </beginTime>
      <cinema>
        <movieTheater>
          ...
          <name>  Studio Europa
          </name>
          <contact>
            ...
          </contact>
        </movieTheater>
      </cinema>
      <avMedium>
        ...
        <title> Schmalspurganoven
        </title>
      </avMedium>
    </performance>
  </entertainment>
</domainObject>
P=0.3
P=0.7
<domainObject>
  <movieTheater>
    ...
    <name> Studio Europa </name>
    <contact>
     ...
    </contact>
  </movieTheater>
  <movieTheater/>
</domainObject>
<domainObject>
  <entertainment>
    <performance>
      <beginTime>
        <function>
          ...
        </function>
      </beginTime>
      <cinema>
        <movieTheater>
          ...
          <name> Kamera </name>
          <contact>
           ...
          </contact>
        </movieTheater>
      </cinema>
      <avMedium>
        ...
        <title>
          Schmalspurganoven
        </title>
      </avMedium>
    </performance>
  </entertainment>
</domainObject>
<domainObject>
  <movieTheater>
    ...
    <name> Studio Europa </name>
    <contact>
     ...
    </contact>
  </movieTheater>
  <movieTheater/>
</domainObject>


© Tilman Becker, DFKI
March 2002  (‹#›)
Type Hierarchy
TV_or_Movie
TV
Movie
•BeginTime
•Subtitles
•....
•Channel
•...
•Cinema
•...
TOP
MovieTheater
. . .


© Tilman Becker, DFKI
March 2002  (‹#›)
Overlay and Typed Feature Structures (TFS)
•Two non-unifiable structures (type clash):
–Cover is more important than background
–Keep information from background:
•Find lub (most specific common supertype)
•“reduce” background to this type
•recursively apply overlay on features
•for atomic values: ignore background
•
•
•


© Tilman Becker, DFKI
March 2002  (‹#›)
An Example
•U: What films are showing on TV tonight?
•S: [shows list of films]
•U: That‘s a boring program, I‘ll rather go to the movies.
•
•
•
•Q: How do we save “tonight” ?


© Tilman Becker, DFKI
March 2002  (‹#›)
An Example
•U: What films are showing on TV tonight?
•Þ Context of type TV
•S: [shows list of films]
•U: That‘s a boring program, I‘ll rather go to the movies.
•Þ Analysis finds data of type Movie
•incompatible with context
•abstract context to lub TV_or_Movie
(keeps “tonight”)
•unifiable with analysis


© Tilman Becker, DFKI
March 2002  (‹#›)
Does TFS solve all your problems?
•An adequate type hierarchy must exist
–“most specific common supertype”
–Carpenter and others on default unification
•Overlay (and unification) of lists and sequences is not well defined -- and content dependent
•What about “semantics”, e.g. DRS, Verbmobil VIT/MRS?
•
•


© Tilman Becker, DFKI
March 2002  (‹#›)
Implementation
•Mapping of XML Schema to Java classes
see data binding:
–Castor Project
–Java 1.4: JAXB
•XML documents are represented internally as instances of these classes
•Unification and overlay are realized using the Java meta protocol


© Tilman Becker, DFKI
March 2002  (‹#›)
Next steps
•Treatment of subobjects
–find relation to context
•Grounding
–model the presentation-acceptance cycle of discourse objects
•Inclusion of dialog management plans
–expected vs. Possible next states
–better interpretation in context
•Fully formalize XML schema to tfs mapping


© Tilman Becker, DFKI
March 2002  (‹#›)
Summary of the Talk
•Two large-scale spoken dialogue projects: Verbmobil, SmartKom
•Spotlight on Aspects of NLG, Discourse Processing
•
•Conclusion:
–Large Scale projects offer new insights‘
See also upcoming 6th framework of EU
–Modular Architecture (data pool driven middleware)
–combine shallow and deep approaches
•multi-engine approach
•fully specified template approach
–emerging multi-modal markup language


© Tilman Becker, DFKI
March 2002  (‹#›)
Finally
Thank you very much
for your kind attention.


© Tilman Becker, DFKI
March 2002  (‹#›)
Verbmobil -The Project
•Some information for those who haven´t heard of Verbmobil recently
•speaker independent speech-to-speech translation system for appointment scheduling and travel
planning:
German « English (10 175 words German, 6871 words English)
German « Japanese (2566 words Japanese)
•69 modules, full configuration 3.5 GB
•23 participating institutions (in Verbmobil II)
•over 900 full workers and students involved
•project duration: 1993 - 2000
• q scientific, software technology, and management challenges


© Tilman Becker, DFKI
March 2002  (‹#›)
Scientific Results
•Speaker independent speech recognition over various channels
•Language ID
•Unknown words
•Prosodic information (segmentation, stress etc.) used in various modules
•Repair of hesitations, repetitions
•Combination of parser analysis fragments
•Semantic representation: VIT
•
There are over 600 refereed papers on the various aspects of and
achievements in Verbmobil.
See also W. Wahlster (ed.): Verbmobil: Foundations of Speech-to-Speech Translation,
Springer Verlag, to appear July 2000       ... at any shop near your office :-)
Some highlights
•Context and dialog knowledge supports translation
•Efficient semantic transfer
•Content to speech generation
•Word concatenative speech synthesis
•Dialog minutes and summaries
•Large data collection with annotation on various levels (e.g. tree-banks, dialog acts)
•....


© Tilman Becker, DFKI
March 2002  (‹#›)
Multi-Engine for Translation (DÝE)
- Large-Scale Web-based Evaluation: 25 345 Translations, 65 Evaluators
- Sentence Length 1 - 60 Words
Translation Thread
Case-based Translation
Statistical Translation
Dialog-Act based Translation
Semantic Transfer
Substring-based Translation
Automatic Selection
Manual Selection
37%
69%
40%
40%
65%
57% / 78% *
88%
44%
79%
45%
47%
75%
66% / 83% *
95%
46%
81%
46%
49%
79%
68% / 85% *
97%
Word
Accuracy ³ 50%
5069 Turns
Word
Accuracy ³ 75%
3267 Turns
Word
Accuracy ³ 80%
2723 Turns
* After Training with Instance-based Learning Algorithm


© Tilman Becker, DFKI
March 2002  (‹#›)
Agreement between Different Labels
•Most M- (79%) and D-bound. (91%) are prosodically marked
•About half of the M-boundaries (52%) are D-boundaries
•Practically all D-boundaries (97%) are M-boundaries
•High agreement between the non-boundaries (92-100%)
•Even a prosody with a recognition rate of 100% will not find 21% of the M-boundaries and 9% of the
D-boundaries!
100
0
97
3
-M3
48
52
21
79
M3
-D3
D3
-B3
B3
93
7
92
8
-D3
48
97
9
91
D3
-M3
M3
-B3
B3
B3 prosodic boundary
M3 syntactic boundary
D3 dialog act boundary


© Tilman Becker, DFKI
March 2002  (‹#›)
0
50
100
150
200
250
300
350
1
5
10
15
20
25
30
35
40
45
50
55
60
Distribution of Sentence Length in Large-Scale Evaluation


© Tilman Becker, DFKI
March 2002  (‹#›)
Topic
Meeting time
Meeting place
Means of transport
Departure place
Arrival time
Place of arrival
Who reserves the hotel
How to get to departure place
Means of return transportation
Departure place for return trip
Meeting time for return trip
Meeting place for return trip
Arriving place for return trip
Total Number of Dialog Tasks
Average Percentage of
Successful Task Completions
Weighted Average Percentage
of Successful Task Completions
Successful
Completions
25
21
30
22
22
17
28
7
23
16
3
3
10
227
Attempts
28
27
30
25
26
19
31
9
24
17
4
4
11
255
Percentage
of Successful
Task Completions
89,3
77,8
100
88
84,6
89,5
90,3
77,8
95,8
94,1
75
75
90,9
86,8
89,6
Frequency-Based
Weighting Factor
0,90
0,87
0,97
0,81
0,84
0,61
1
0,29
0,77
0,55
0,13
0,13
0,35
Results of End-to-End Evaluation Based on
Dialog Task Completion for 31 Trials


© Tilman Becker, DFKI
March 2002  (‹#›)
Test Results for the current Repair Module
U:\coling\pic_trans\repair4.gif
     Remember:
The output of the Repair module are additional hypotheses for the linguistic analysis. The original
hypotheses remain in the WHG


© Tilman Becker, DFKI
March 2002  (‹#›)
Examples