© Tilman Becker, DFKI March 2002

Large Spoken Language Dialogue Systems: Verbmobil & SmartKom
Tilman Becker
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
becker@dfki.de
http://verbmobil.dfki.de
http://www.smartkom.org

Overview
•Speech-to-speech translation: Verbmobil
•Multi-Modal Man-Machine Interaction: SmartKom
•Zooming in: Natural Language Generation

Content
•Overview of Verbmobil
•A walk through the system
  –Acoustic Processing
  –Dialog Translation
  –Selection and Speech Synthesis
•Technical issues
•Human Factors and Experiences

Overview of Verbmobil
Challenges, Partners, and General Approaches

What is Verbmobil?
•Speech-to-speech translation system
•Robust processing of spontaneous dialogs
•Speaker independent (adaptive)
•Languages: English, German, Japanese
•Domains: appointment scheduling, travel planning and hotel reservation, remote PC maintenance
•The system mediates between two humans; it does not play an active role
•The system does not control the ongoing dialog

Verbmobil Challenges for Language Engineering
[Figure: four dimensions of difficulty]
•Input Conditions: Close-Speaking Microphone/Headset, Push-to-talk; Telephone, Pause-based Segmentation
•Naturalness: Isolated Words; Read Continuous Speech
•Adaptability: Speaker Independent; Speaker Dependent
•Dialog Capabilities: Monolog Dictation; Information-seeking Dialog
•Verbmobil: Open Microphone, GSM Quality; Spontaneous Speech; Speaker Adaptive; Multiparty Negotiation

Classification of Machine Translation Methods
[Figure: translation pyramid from Source Language to Target Language]
•Direct Translation: Word Structure to Word Structure
•Syntactic Transfer: Syntactic Structure to Syntactic Structure
•Semantic Transfer: Semantic Structure to Semantic Structure
•Interlingua
•Source side: Morphologic Analysis, Syntactic Analysis, Semantic Analysis
•Target side: Semantic Generation, Syntactic Generation, Morphologic Generation

The Verbmobil Partners
[Figure: map of the partner sites]
•Prof. Mahr, TU Berlin
•Dr. Klein, Dr. Wolf, DLR, PT
•Prof. Hoffmann, TU Dresden
•Prof. Paulus, TU Braunschweig
•Prof. Görz, Prof. Niemann, Univ. Erlangen
•Prof. v. Hahn, Univ. Hamburg
•Prof. Tillmann, LMU München
•Dr. Ruske, TU München
•Dr. Block, Siemens, München
•R. Reng, Temic, Ulm
•Dipl.-Ing. Mangold, DaimlerChrysler, Ulm
•Prof. Gibbon, Univ. Bielefeld
•Prof. Blauert, Univ. Bochum
•Prof. Rohrer, Univ. Stuttgart
•Prof. Hinrichs, Univ. Tübingen
•Prof. Waibel, Univ. Karlsruhe
•A. Klüter, DFKI, Kaiserslautern
•Dr. Eisele, Philips, Aachen
•Prof. Ney, RWTH Aachen
•Prof. Hess, Univ. Bonn
•Dr. Reuse, BMBF Referat 524
•Prof. Pinkal, Univ. d. Saarlandes
•Prof. Uszkoreit, Prof. Wahlster, DFKI, Saarbrücken
•Prof. Kurematsu, ATR International, Kyoto, Japan
•Prof. Waibel, CMU, Pittsburgh; Prof. Sag, CSLI, Stanford, USA
Work areas shown on the map (implementation languages in parentheses):
•Speaker adaptation
•Multilingual word lists
•Signal-level evaluation
•Recognizer DC, voice control (C, C++, Fortran)
•Data collection, integrated processing (C, C++, LISP, Prolog)
•WoZ experiments, data collection
•Transfer (Prolog)
•Multilingual recognizers (C, C++)
•Context evaluation (LISP, Prolog, Java)
•Syntax, robust semantics, dialog (LISP, Prolog)
•Data collection, recognition, syntax (C, C++, Prolog)
•Data collection
•Recognizer Aachen, statistical transfer (C++, C)
•Chunk parser (Prolog)
•Repair, prosody for German and English (C)
•Acoustic synthesis (C, C++)
•System integration (C++, Tcl-Tk)
•Multilingual prosody control (C++, C)

Facts About the Project
•23 participating institutions (in Verbmobil II), from Germany and the USA
•Over 900 full-time employees and students involved over the whole duration
•Funded by the German Ministry for Education and Science and the participating companies:

  Item                                          DM                 Euro
  BMBF funding Phase I, 1.01.93 – 31.12.96      62.7 million       31.6 million
  BMBF funding Phase II, 1.01.97 – 30.9.2000    53.3 million       27 million
  Industrial investment I+II                    32.6 million       16.5 million
  Related industrial R&D activities             ca. 20 million     ca. 10 million
  Total                                         168.6 million      85.1 million

Verbmobil – The Book
There are over 600 refereed papers on the various aspects of and achievements in Verbmobil.
Wolfgang Wahlster (ed.): "Verbmobil: Foundations of Speech-to-Speech Translation". Springer-Verlag Berlin Heidelberg New York. 679 pages. ISBN 3-540-67783-6
[Figure: buch.gif]

Typical Verbmobil Hardware
•SUN Ultra-Sparc 80
•4 processors (450 MHz)
•2 GB main memory
•8 GB swap
•no special signal processing hardware
•Desklab Gradient A/D converter or Sun internal audio device
•close-speaking cordless microphones

The Graphical User Interface
[Figure: GUI.gif]
Notes: rough overview of the modules and the processing flow; first demo: ??

Walk Through the Verbmobil System
Detailed Module Presentation and Demonstration

Acoustic Processing

Recording, Synthesizing and Synchronization
•Task: Providing a uniform interface to varying audio hardware; synchronizing input and output
•Input: Audio data and system states
•Method: Introducing audio modules; Finite State Machine for synchronization
•Result: Audio data and synchronization
•Benefit: Encapsulating audio hardware, "open microphone", preventing out-of-sync or overlapping system output
•Responsible: DFKI, Kaiserslautern

Audio Configuration
•Configuration of the system's I/O behavior
  –How many speakers?
  –For every (possible) speaker:
    •Input device (channel identification, speaker adaptation)
    •Output device(s) (translation output, destination for man/machine dialogs)
    •Source language (or "unknown")
  –Desired system output categories
•Audio channel configuration
  –Uniform configuration of heterogeneous audio hardware
[Figure: change_configuration.gif]

Recording Audio Data
•Turn-based processing, barge-in available for voice commands
•Different audio quality:
  –lab-quality close-speaking microphone (16 kHz)
  –room microphone (16 kHz)
  –telephone quality (8 kHz)
  –GSM mobile (8 kHz)
⇒ Audio module concept
  –provides the system with a uniform interface to different hardware devices
  –the number of channels is limited only by the hardware
•Open Microphone Approach (essential for a telephone translation service!)
•Input/output synchronization
•No cross-talk allowed

Open Microphone Approach
[Figure: timeline for Microphone 1. Labels: Microphone open, Speech input, Pause detection, Microphone closed, Translation, Speech output, Synchronization]

Synchronization
•Synchronization controls the high-level system behavior
•Realized via a Finite State Machine
[Figure: state diagram. Labels: Dial-up Connection, Start, Welcome User, Start Recording, GUI command, Stop Recording, Utterance recorded, Wait for Echo, Echo configured?, Wait for best Hyp., Best Hypothesis configured?, Execute action, Unknown word detected, Voice command, Confirm]

Recognizing Speech
•Task: Analyzing continuous spontaneous speech signals
•Input: Audio data
•Method: HMMs, class based language models, etc.
•Result: Word Hypotheses Graphs (WHG) and speech commands
•Benefit: Compact representation of hypotheses of what has been said
•Responsible: DaimlerChrysler AG; University of Karlsruhe; RWTH Aachen; Philips GmbH (Language Models)

General Speech Recognition Task
[Figure: signal2.gif, recog1.gif. Audio Signal, then Recognizers (German, English, Japanese), then Word Hypotheses Graph]

Word Hypotheses Graphs (WHGs)
•WHGs realize the interface between acoustic and linguistic processing
[Figure: recog1.gif. Labels: Edge = Word, Best Hypothesis, Acoustic Score]

Focuses of Speech Recognition in Verbmobil
•Focuses: Robustness, Multilinguality, Large Vocabulary
•Partners: DaimlerChrysler, RWTH Aachen, University of Karlsruhe

Nine Available Recognizer Modules
•DaimlerChrysler
  –German, 16 kHz, speaker adaptive, approx. 10000 words
  –German, 8 kHz, telephone/GSM quality, speaker adaptive, approx. 10000 words
  –English, 8 kHz, telephone/GSM quality, speaker adaptive, approx. 7000 words
•University of Karlsruhe
  –German, 16 kHz, speaker adaptive, approx. 10000 words
  –English, 16 kHz, speaker adaptive, approx. 7000 words
  –Japanese, 16 kHz, speaker adaptive, approx. 2600 words
  –Language Identification Component (German, English, Japanese)
•RWTH Aachen
  –German, 16 kHz, speaker adaptive, approx. 10000 words
  –German, 16 kHz, speaker dependent, approx. 30000 words

Principal Recognizer Architecture
[Figure: signal2.gif. Processing chain and its labels:]
•Speech signal (audio signal)
•Short-time analysis (acoustic preprocessing)
•Vector quantization (data reduction)
•Search / Decoder (word and sentence identification), using HMM word models and a language model
•Output

The Speech Recognition Task
•Some highlights of the Verbmobil recognizers:
  –Speaker adaptive recognition:
    •Start speaker independent
    •Recognition results improve during the dialog
  –Capable of separating speech from noise input using garbage models
  –Segmentation of speech input allows incremental processing
  –Word class based language models and recognition allow flexible vocabulary extension
  –Online vocabulary extension through unknown word detection (names, towns, street names, …)
  –Integrated recognition of continuous speech and speech commands
•… and many more

Language Identification
•Features
  –Identification on at most 3 seconds of speech signal
  –Real time factor 0.5
  –Speaker independent
  –Unknown audio channel
  –Uses language model knowledge
•Flexible architecture: LID can be combined with any speech recognizer
[Figure: LID in front of the German, English and Japanese recognizers]

Prosodic Processing
•Task: Recognizing prosodic phenomena (accents, sentence mood) and boundaries
•Input: WHG and speech signal
•Method: Neural networks and statistical classifiers
•Result: WHG annotated with accent and boundary information
•Benefit: Provides prosodic information needed for correct translation of spontaneous speech
•Responsible: Universität Erlangen-Nürnberg

Prosody in Speech Communication
•Prosody can help to disambiguate
  –lexical and phrasal accent
  –phrasing (chunks of speech)
  –sentence mood
  –emotion, attitude, foreign accent
•Parameters represented by features
  –F0 (fundamental frequency)
  –Energy
  –Duration
  –Speech tempo
  –Pause
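As a minimal sketch of how a parameter such as F0 might be turned into per-word features, the following Python function summarizes one word's F0 contour by its mean, median, extremes and a least-squares slope. The function name `f0_features`, the 10 ms frame shift and the convention that 0.0 marks an unvoiced frame are assumptions made for illustration; they are not taken from the Verbmobil prosody module.

```python
from statistics import mean, median


def f0_features(f0_hz, frame_shift_s=0.01):
    """Summarize the F0 contour of one word (hypothetical helper).

    f0_hz holds one F0 value in Hz per analysis frame; 0.0 marks an
    unvoiced frame (an assumed convention). Returns None when fewer
    than two voiced frames are available.
    """
    voiced = [(i * frame_shift_s, f) for i, f in enumerate(f0_hz) if f > 0.0]
    if len(voiced) < 2:
        return None
    times = [t for t, _ in voiced]
    values = [f for _, f in voiced]
    t_mean, f_mean = mean(times), mean(values)
    # Least-squares slope of F0 over time, in Hz per second.
    slope = (sum((t - t_mean) * (f - f_mean) for t, f in voiced)
             / sum((t - t_mean) ** 2 for t in times))
    return {
        "mean": f_mean,
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "slope": slope,
    }


# A rising contour with one unvoiced frame in the middle.
features = f0_features([100.0, 110.0, 0.0, 130.0])
```

Energy and duration could be summarized in the same way; a classifier would then receive one such feature vector per word.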

Prosody in Verbmobil
[Figure: the Multilingual Prosody Module takes the Speech Signal and the Word Hypotheses Graph as input, computes prosodic features (F0, duration, energy, ...) and passes information to the following modules:]
•Parsing: Boundary Information, for Search Space Restriction
•Dialog Understanding: Boundary Information, for Dialog Act Segmentation and Recognition
•Translation: Sentence Mood, as Constraints for Transfer
•Generation: Accented Words, for Lexical Choice
•Speech Synthesis: Prosodic Feature Vector, for Speaker Adaptation

What Linguistic Analysis Really Needs
•Syntactic Boundaries
  He saw ? the man ? with the telescope
  Prosody cannot help
•Dialog Act Boundaries
  No, I have no time at all on Thursday. Δ But how about on Friday?
  Dialog acts are pragmatic units that chunk the input into units which can be processed alone.
•Prosodic Syntactic Boundaries
  Of course ? not ? on Saturday
  Syntactic boundaries that correlate with the acoustic-phonetic reality; they help during analysis within one chunk/dialog act. Important in spontaneous speech with elliptical utterances.

Extraction of Prosodic Features
•computed for each word
•from basic prosodic features and segmental information
•over different time contexts
•modeling of F0: linear regression coefficient, regression error, mean, median, minimum, maximum, onset, offset and their temporal locations
•modeling of the energy contour: mean, median, maximum, max-pos, regression coefficient, ... and phoneme intrinsic normalizations

Prosodic Classification in Verbmobil
•five classes of boundaries: default, particles, phrases, clauses, sentences
•sentence mood: questions vs. non-questions
•phrase accent: disambiguation of particles
•Computed by NN-classifiers and Language Models
•Language Models trained on a corpus annotated with syntactic prosodic boundaries and dialog act boundaries

An Example
I am calling about the trip to Hanover on the seventh and eighth of March
...
2 3 I 50.284023 34 46 (ID r3485) (PR (S 1.00 0.00 0.00 0.00 0.00) (A 0.82 0.18) (F 0.92 0.08) (I 0.63 0.37) )
...
3 9 am 24.803406 47 52 (ID r3489) (PR (S 1.00 0.00 0.00 0.00 0.00) (A 0.84 0.16) (F 0.81 0.19) (I 0.79 0.21) )
3 10 am 32.151409 47 54 (ID r3490) (PR (S 1.00 0.00 0.00 0.00 0.00) (A 0.88 0.12) (F 0.37 0.63) (I 0.65 0.35) )
...
9 11 going 142.015503 53 91 (ID r3504) (PR (S 0.94 0.00 0.05 0.00 0.00) (A 0.14 0.86) (F 0.10 0.90) (I 0.93 0.07) )
10 11 calling 131.019409 55 91 (ID r3505) (PR (S 0.39 0.01 0.32 0.27 0.01) (A 0.07 0.93) (F 0.13 0.87) (I 0.93 0.07) )
11 12 about 125.144707 92 124 (ID r3506) (PR (S 1.00 0.00 0.00 0.00 0.00) (A 0.22 0.78) (F 0.92 0.08) (I 0.74 0.26) )
12 13 the 40.895718 125 136 (ID r3507) (PR (S 1.00 0.00 0.00 0.00 0.00) (A 0.90 0.10) (F 1.00 0.00) (I 0.65 0.35) )
12 13 that 42.615807 125 136 (ID r3508) (PR (S 0.80 0.00 0.07 0.00 0.12) (A 0.84 0.16) (F 1.00 0.00) (I 0.86 0.14) )
13 14 trip 106.785835 137 167 (ID r3509) (PR (S 0.10 0.00 0.80 0.10 0.00) (A 0.24 0.76) (F 0.03 0.97) (I 0.55 0.45) )
14 15 to 69.326729 168 188 (ID r3510) (PR (S 0.86 0.02 0.08 0.02 0.02) (A 0.85 0.15) (F 1.00 0.00) (I 0.42 0.58) )
15 16 Hanover 245.755707 189 261 (ID r3511) (PR (S 0.02 0.14 0.43 0.01 0.40) (A 0.01 0.99) (F 0.04 0.96) (I 0.49 0.51) )
...
16 18 and 69.891464 266 284 (ID r3514) (PR (S 0.57 0.08 0.11 0.23 0.02) (A 0.87 0.13) (F 0.95 0.05) (I 0.84 0.16) )
17 18 on 75.358749 264 280 (ID r3515) (PR (S 0.92 0.03 0.01 0.03 0.00) (A 0.87 0.13) (F 0.62 0.38) (I 0.38 0.62) )
18 19 the 37.180725 285 295 (ID r3516) (PR (S 1.00 0.00 0.00 0.00 0.00) (A 0.94 0.06) (F 0.98 0.02) (I 0.84 0.16) )
19 20 seventh 184.631897 296 350 (ID r3517) (PR (S 0.06 0.10 0.31 0.00 0.53) (A 0.07 0.93) (F 0.11 0.89) (I 0.12 0.88) )
20 21 and 44.750828 356 369 (ID r3518) (PR (S 0.99 0.00 0.01 0.00 0.00) (A 0.85 0.15) (F 0.15 0.85) (I 0.92 0.08) )
21 22 the 42.576515 370 376 (ID r3520) (PR (S 1.00 0.00 0.00 0.00 0.00) (A 0.95 0.05) (F 1.00 0.00) (I 0.38 0.62) )
22 23 eighth 134.293030 381 420 (ID r3521) (PR (S 0.00 0.00 0.99 0.00 0.01) (A 0.24 0.76) (F 0.38 0.62) (I 0.12 0.88) )
23 24 of 62.543167 425 443 (ID r3522) (PR (S 1.00 0.00 0.00 0.00 0.00) (A 0.74 0.26) (F 1.00 0.00) (I 0.83 0.17) )
24 25 March 204.886185 444 497 (ID r3523) (PR (S 0.02 0.63 0.03 0.02 0.30) (A 0.04 0.96) (F 0.03 0.97) (I 0.38 0.62) )

Repair of Self-Corrections
•Task: Detecting and repairing self-corrections
•Input: WHGs
•Method: Stochastic models
•Result: Enriched WHGs, including additional repaired hypotheses
•Benefit: Enabling Verbmobil to repair self-corrections in spontaneous speech input
•Responsible: Universität Erlangen-Nürnberg

The Understanding of Spontaneous Speech Repairs
[Figure: anatomy of a repair]
•Original Utterance: I need a car next Tuesday oops Monday
•Labels: Reparandum, Editing Term, Reparans; Editing Phase, Repair Phase
•Recognition of Substitutions
•Transformation of the Word Hypotheses Graph
•Result: I need a car next Monday

Facts about Repairs in the Verbmobil Corpus
•21% of all turns in the Verbmobil corpus (79 562 turns) contain at least one self-correction
•The syntactic category is preserved in most cases (for example, out of a sample of 266 verb replacements, 224 are again mapped to verbs)
•Repairs take place in a restricted context (in 98% of cases the reparandum consists of fewer than 5 words)
•Repair sequences follow certain regularities

Architecture of Repair Processing
[Figure: repair1.tif]
“On Thursday I cannot no I can meet äh after one”

Scope Detection
•The editing term (ET) is given by the prosody
•Wanted: beginning (RB) and end (RE) of the repair
•Search for the best replacement of a word sequence to the left of the ET by a word sequence to the right of the ET
  ⇒ rate the possible replacements; the search space is limited by looking at 4 words before and after the ET
•Choose the best rated replacement above a certain threshold

Repair Detection and Word Smoothing
[Figure: repair2.gif, repair3.gif]

Dialog Translation

Multiple Approaches
•Mono-cultural approaches are dangerous
  –humans vs. viruses ⇒ diversity
  –Microsoft vs. ILOVEYOU and copycats ⇒ alternative software solutions
•Some sources of errors in a speech translation system
  –external
    •spontaneous speech: not well formed, hesitations, repairs
    •bad acoustic conditions
    •human dialog behavior
  –internal
    •knowledge gaps in modules
    •software errors
    •probabilistic processing
⇒ Use multiple engines with varying approaches at various stages of processing

Multiple Approaches in Verbmobil
•Exclusive alternatives: three different 16 kHz German speech recognizers with various capabilities
•Competing approaches:
  –three parsers: HPSG, Chunk, Statistical
  –five translation tracks: case-based, dialog-act based, statistical, substring-based, linguistic (deep) semantic translation
•Needed: selection and combination of results from competing tracks
  –parsers: combination of partial analyses in the semantic processing modules
  –translation: preselection module

Multiple Translation Tracks - Approaches and Advantages
•Case-based:
  –Approach: uses examples from the aligned bilingual Verbmobil corpus
  –Advantage: good translation if the input matches an example in the corpus
•Dialog-act based:
  –Approach: extract core intention (dialog act) and content
  –Advantage: robust with respect to recognition errors
•Statistical:
  –Approach: use statistical language and translation models
  –Advantage: guaranteed translation with high approximate correctness
•Substring-based:
  –Approach: combines statistical word alignment with precomputation of translation "chunks" and contextual clustering
  –Advantage: guaranteed translation with high approximate correctness
•Linguistic (deep) semantic translation:
  –Approach: "classic" approach using semantic transfer
  –Advantage: high quality translation in case of success

Example Based Translation
•Task: Providing a translation based on translation templates and partial linguistic analysis
•Input: WHGs or best hypothesis
•Method: Definite Clause Grammar (DCG), graph matching algorithms
•Result: Translation and a confidence value
•Benefit: Improving Verbmobil's translation capabilities through an additional translation path
•Responsible: DFKI, Kaiserslautern

The Case Based Approach
•Training is based on Verbmobil's bilingual corpus
  –E: I am on vacation, on the sixth and the seventh.
  –D: ich bin am sechsten und siebten verreist.
•Principle: Look up the example in the example storage that best matches the input sentence and use its translation as output
[Figure: S, S´, T(S´), T(S); cases "T(S) known" and "T(S) unknown (EBMT)"]

Generalization in Example Based Machine Translation (EBMT)
•Handicap of this naive approach: inadequate coverage
  –S : I am not free on Friday.
  –S’: I am not free on Monday.
  –T(S’): am Montag habe ich keine Zeit.
•Solution: partial generalization (analysis and generation)
  –E: I am not free .
  –D: habe ich keine Zeit.
•Automatic generalization approach:
  –The grammar automatically generalizes the corpus (offline)
  –The runtime module generalizes incoming input (online)
  –Match the generalized input sentence with a generalized corpus example
  –Result: instantiated corpus translation

Generalization of WHGs
[Figure: whg1.gif]

Example Based Translation – Some More Features
•Generalization grammar for temporals, names, locations (region, town, country), institutions
•Fast and robust WHG search:
  –WHG packing
  –Optimal alignment for fast corpus search
  –Search space pruning
  –Search space caching
  –Anytime capable
•Adequate confidence value for selection

Dialog-Act Based Translation
•Task: Robustly provide a translation of core intentions and contents of the domain
•Input: Prosodically annotated best hypothesis (flat WHG)
•Method: Statistical dialog-act classifier and Finite State Transducers
•Result: Translation and a confidence value, plus content descriptions for the dialog module
•Benefit: Robust translation and content extraction even when the recognition is erroneous
•Responsible: DFKI, Saarbrücken

Dialog Acts
•Describe the core intention of an utterance
•32 acts defined in a hierarchy, 19 used in processing
•21 CD-ROMs with 1505 dialogs (German, English, Japanese) annotated with dialog acts for training and test purposes
•Computation uses bigram language models
•Probabilities estimated from the annotated corpus
•Leave-one-out test results for approx. 1000 German, English and Japanese dialogs: Recall 72.48 % (27185 of 37505), Precision 69.90 %

Dialog Acts - The Hierarchy
[Figure]

Representation of Information and Extraction
•Semantic representation language, also used in the dialog and context modules
•Extraction using Finite State Transducers
•Semi-automatic creation exploiting semantic databases and lexica
•Convenient development platform
[Figure: autopict.gif]

Processing Steps
•Best Chain: I would so we were to leave Hamburg on the first
•Utterance: good so we will leave Hamburg on the first
•Dialog Act: INFORM
•Content Rep.: has_move:[move,has_source_location:[city,has_name='hamburg'],has_departure_time:[date,time=[dom:1]]].

Generation
•Generation templates (>140), depending on dialog act, topic, content
•Translated into Finite State Transducers
•Examples:
  suggest scheduling $has_date
    g:ich w"urde $* vorschlagen &loc_mode_dat
    e:how about $*
  suggest entertainment or($has_location,$has_theme)
    g:wir k"onnten $* gehen &loc_mode_acc
    e:we could go $*
  request_suggest
    g:was schlagen Sie vor
    e:what do you suggest
    j:itsu ga yoroshii deshou ka
•Result for our example: also wir fahren ab Hamburg am ersten

Statistical Translation
•Task: Provide approximately correct translations
•Input: Prosodically annotated best hypothesis (flat WHG)
•Method: Use statistical language and translation models
•Result: Translation and a confidence value
•Benefit: Approximately correct translation for spontaneous speech
•Responsible: RWTH Aachen

The Statistical Translation Model
•Task: translate the source string f into the most probable target string e
•Applying Bayes’ rule requires a language model of the target language plus lexicon and alignment models
•The models are learned from the aligned corpus
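The decision rule behind the first bullet is usually written in the source-channel form below. This is the textbook formulation, given here as an assumption rather than copied from the slide; $f$ and $e$ are the source and target strings as above, and $\hat{e}$ is the chosen translation.

```latex
\hat{e} \;=\; \operatorname*{argmax}_{e} \Pr(e \mid f)
        \;=\; \operatorname*{argmax}_{e} \frac{\Pr(e)\,\Pr(f \mid e)}{\Pr(f)}
        \;=\; \operatorname*{argmax}_{e} \Pr(e)\,\Pr(f \mid e)
```

Here $\Pr(e)$ is the language model of the target language and $\Pr(f \mid e)$ is the translation model, which is decomposed into lexicon and alignment models. The denominator $\Pr(f)$ can be dropped because it does not depend on $e$.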

Alignment Templates
•Find corresponding words in source and target language sentences
•Difficult for language pairs with different word order
•Solution: alignment templates
  –based on word classes (sparse data problem: approx. 40% of the words in the training corpus are singletons)
  –first step: statistically learn the alignment of words for each translation direction
  –second step: combine the alignments of both directions
  –third step: statistically learn the alignment of "phrases", i.e. word sequences

Alignment
•Word-to-Word vs. Alignment Templates
[Figure]

Deep Translation
•Task: Provide high quality translations
•Input: Prosodically annotated WHG and contextual information
•Method: Use syntactic and semantic approaches to analysis, transfer, and generation
•Result: Translation containing content information, suited for high quality speech synthesis
•Benefit: Delivers the highest quality, but is sensitive to recognition errors and spontaneous speech phenomena
•Responsible: Siemens AG, DFKI Saarbrücken, Universität Tübingen, Universität des Saarlandes, Universität Stuttgart, TU Berlin, CSLI Stanford

Modules Involved
•Integrated processing comprises
  –search through the WHG
  –statistical parser
  –chunk parser
•Semantic Construction provides VITs from the output of the statistical and chunk parsers
•Deep Analysis: HPSG parser
•Dialog Semantics: combination of parsing results, and semantic resolution
•Transfer: VIT to VIT transfer
•Generation: TAG generation from VITs
•Dialog+Context: provides contextual information
[Figure: GUI_deep_analysis.gif]

The Multi-Parser Approach
•Verbmobil uses three different syntactic parsers: an HPSG parser, a chunk parser, and a probabilistic LR parser.
•Each parser offers a different level of parsing accuracy, depth of syntactic analysis, and robustness of the analysis process.
  –Chunk parser: most robust but least accurate analysis
  –HPSG parser: most accurate but least robust analysis
  –Probabilistic parser: level of accuracy and robustness between HPSG and chunk parser

Integrated Processing
•Gets WHGs for the English, German, or Japanese speech input and dispatches WHG information to the three parsers
•Provides an A* search algorithm that allows any connected parser to find the best scored path using
  –the acoustic score of the speech recognizer
  –the Verbmobil trigram language model
•Parsers analyze the same utterance simultaneously

VIT: Verbmobil Interface Term
•Common syntactic-semantic interface
•Contains all linguistic information relevant for translation
•Record-like data structure: variable-free lists of non-recursive terms
•"Flat" set representations: semantic, scopal, sortal, morpho-syntactic, prosodic, and discourse information
•Labels relate different kinds of information
•An Abstract Data Type implements construction, access, update, check, print, etc. facilities

VIT: Verbmobil Interface Term
Example: "When do your vacations begin?"

vit(vitID(sid(...),                          %Segment ID
          []),                               %WHG-String
    index(l250,l234,i72),                    %Index
    [start_v(l248,i72),                      %Conditions
     arg1(l248,i72,i75),
     nop(l240,h85),
     quest(l249,h84),
     time(l238,i73),
     abstr_vacation(l247,i75),
     pron(l242,i74),
     poss(l244,i75,i74),
     temp_loc(l239,i72,i73),
     def(l245,i75,h87,h86),
     whq(l235,i73,h83,h82)],
    [in_g(l235,l237), ...                    %Constraints
     leq(l234,h85), ...],
    [s_class(l240,mp), ...],                 %Sorts
    [ana_ante(i74,[i75,i69,i67,i66]),        %Discourse
     prontype(i74,third,std), ...],
    [gend(i75,masc), num(i75,sg)],           %Syntax
    [ta_mood(i72,ind), ...],                 %Tense and Aspect
    [...])                                   %Prosody

VIT: Verbmobil Interface Term
[Figure: vit.gif] We meet at the station.
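The record-like, flat structure described above can be illustrated with a small Python sketch. It is a heavily simplified stand-in, not the Verbmobil abstract data type: the class name `Vit`, the choice of three slots and the helper `terms_about` are hypothetical, and only a few terms from the example above are reproduced. The point is that each slot is a list of flat terms, and that shared symbols such as `i75` are what tie semantic and syntactic information together.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A term is a predicate name followed by atomic arguments (labels, holes,
# instances). In this sketch no term contains another term.
Term = Tuple[str, ...]


@dataclass
class Vit:
    """Hypothetical, simplified stand-in for a VIT record."""
    index: Term
    conditions: List[Term] = field(default_factory=list)
    constraints: List[Term] = field(default_factory=list)
    syntax: List[Term] = field(default_factory=list)

    def terms_about(self, symbol: str) -> List[Term]:
        """Collect every term, in any slot, that mentions the given symbol."""
        slots = (self.conditions, self.constraints, self.syntax)
        return [t for slot in slots for t in slot if symbol in t[1:]]


# A fragment of the "When do your vacations begin?" example.
vit = Vit(
    index=("index", "l250", "l234", "i72"),
    conditions=[
        ("start_v", "l248", "i72"),
        ("arg1", "l248", "i72", "i75"),
        ("abstr_vacation", "l247", "i75"),
        ("poss", "l244", "i75", "i74"),
    ],
    constraints=[("leq", "l234", "h85")],
    syntax=[("gend", "i75", "masc"), ("num", "i75", "sg")],
)

# Everything known about the instance i75 (the vacation), across slots.
related = vit.terms_about("i75")
```

Because the terms are flat, a lookup like `terms_about` is a simple scan over lists; no recursive traversal of nested structures is needed.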

HPSG Processing
•Task: Thorough syntactic analysis
•Input: Word chains from integrated processing
•Method: Apply HPSG analysis
•Result: Source language VITs
•Benefit: Delivers the highest quality, but is sensitive to recognition errors and spontaneous speech phenomena
•Responsible: DFKI Saarbrücken, CSLI Stanford

Head Driven Phrase Structure Grammar
•Well-known advanced grammar theory in linguistics
•Based on the concept of a sign as an integrated information structure for all types of linguistic information
•Inherently multilingual by distinguishing universal principles from language specific aspects
•Typed feature structures with inheritance
•Small number of rules, due to general principles
•Independent of specific processing strategies, usable for analysis and generation

HPSG Basic Principles
•Lexicalism: Words carry all the important information about what they can be combined with, so regular and idiosyncratic properties can be handled in a uniform way
•Heads: Phrases contain a head which determines their combinatory potential, e.g. verbs as heads determine what complements must be present and what modifiers they can combine with
•Principles: A few language independent general projection principles state, e.g., how to combine a head with complements and modifiers
•Unification: Monotonically combines constraints from different sources

HPSG Parsing in Verbmobil
•active chart parser allowing bidirectional and island parsing on word hypotheses graphs or strings
•fast processing by
  –eliminating disjunctions, enabling fast conjunctive unification
  –precompiling type unifiability, avoiding runtime computations
  –quick checks on the most relevant features, avoiding full unification
  –quick checks on possibly discontinuous constituents, e.g. separable verb prefixes in German, reducing the chart size
  –precompiling rule filters on possible rule sequences
  –scoring rule applications
•anytime behavior
•robust: best partial analyses even for ungrammatical input

Statistical Parser
•Task: Robust probabilistic parsing
•Input: n-best hypotheses
•Method: LR parser trained on Verbmobil's tree bank
•Result: Syntactic tree representation of the input sentence
•Benefit: Increasing robustness in Verbmobil's multi-engine parser strategy
•Responsible: Siemens AG

Statistical Parser – Approach
•(Non-probabilistic) LR parsing worked quite well for parsing speech in Verbmobil’s first phase.
•LR parsing is well known to be able to parse huge amounts of input very efficiently.
•Probabilistic chart parsing of spontaneous speech input had some problems, notably the combinatorial explosion of edges in the chart on a word graph
  ⇒ try a probabilistic LR parser

Statistical Parser – Training and Transformations
•Training process: derivation of an LR table and estimation of the unknown probabilistic parameters from the Verbmobil tree bank
  –Find the set of all context free rules (G) contained in the tree bank.
–Construct an LR table from G using well known standard –Problems: sparse data, different annotation styles Þ eliminate rules that do occur less than N times •Transformations: –Needed after parsing to correct errors of the probabilistic context free parser –Rules are learned automatically from the training corpus © Tilman Becker, DFKI March 2002 (‹#›) Chunk Parser •Task: Robust and efficient partial parsing, even on ill-formed input •Input: N-best hypotheses •Method: Cascaded Finite State Transducers • •Result: Syntactic tree representation of the input sentence •Benefit: Increasing robustness in Verbmobils multi-engine parser strategy •Responsible: Universität Tübingen © Tilman Becker, DFKI March 2002 (‹#›) Parsing Based on Chunks •1st Step: Chunk Parsing using Cascaded Finite State Transducers •“Chunks are non-recursive cores of ‘major’ phrases, i.e. NP, VP, PP, …” •2nd Step: •Building a syntactic tree out of the parsing results • •Benefit: Robust and efficient parsing •But: Partial parsing: Often no spanning analysis • © Tilman Becker, DFKI March 2002 (‹#›) Example for Chunks •“Ich habe bei meinem letzten Besuch in Hannover so eine nette Kneipe entdeckt” •Chunks: •[NX Ich] [VX habe] [PX bei [ NX meinem letzten Besuch]] in [NX Hannover] [PX so [NX eine nette Kneipe]] [VX entdeckt]. •where • [NX]: Extends from the beginning to the head of a NP • [VX]: Includes all modals, auxiliary verbs and medial adverbs, but ends at the head verb or predicate adjective • [PX]: Extends to the end of an [NX] • © Tilman Becker, DFKI March 2002 (‹#›) Tree-Building Tasks •Determine the chunk position inside the syntactic tree •Complete the internal chunk structure •Determine functional categories and topological fields •Rearrange chunks to obtain a complete syntactic tree • © Tilman Becker, DFKI March 2002 (‹#›) The Result is a Syntactic Tree U:\coling\pic_trans\chunk3.gif “Alright, and that should get us there about nine in the evening.” © Tilman Becker, DFKI March 2002 (‹#›) ... 
but analysis is not always spanning
[Figure: partial, non-spanning analysis of “The train arise at seven thirty. We could take a cab it to the hotel problem train station.”]

Semantic Construction
•Task: Convert and extend syntax trees to VITs
•Input: Syntax trees from the statistical and chunk parsers
•Method: Compositional construction using a semantic lexicon
•Result: VITs
•Benefit: Provides the results of the shallow parsers to the deep analysis track
•Responsible: Universität Stuttgart (IMS)

Schematic Processing
•Input: Syntactic tree
•Lexicon access and interpretation of the grammatical roles
•Intermediate representation: Application Tree
•Compositional semantic construction
•Intermediate representation: VIT
•Non-compositional semantic construction using transfer rule engine
•Intermediate representation: Resulting VIT

Dialog Semantics
•Task: Combine results from various parsers, reinterpret and correct VITs, and resolve non-local ambiguities
•Input: VITs from different parsers
•Method: VIT models and rule-based approaches
•Result: VIT ready for transfer
•Benefit: Enhances robustness of deep analysis and provides vital information for transfer
•Responsible: Universität des Saarlandes, Saarbrücken

Combining Analyses from Various Parsers
•Parsers deliver VITs for segments of a turn
•May be spanning analyses or just partial fragments
•Combination is necessary, both of analyses from one parser and of analyses from different parsers
•Combination criteria
–HPSG parser is better than statistical parser, which is better than chunk parser
–Integrated results are better than fragments
–Longer results are better than short ones

Stochastic Choice of Spanning Results
•Parser-internal scores not normalized ⇒ external scoring necessary
•Statistical model based on VIT content and dialog act (tetragram language models)
•Search
through VIT Hypotheses Graph (VHG), comparable to search through WHG

Robust Semantic Processing
•Partial results don’t necessarily fit together
–phenomena of spontaneous speech
–recognition errors
–parsing errors
•Rule-based correction

Bridging Mechanism for False Starts

Resolving Non-Local Ambiguities
•Based on prosody and dialog act information
•Ambiguities processed:
–Verb disambiguation:
  Wir gehen ins Theater (We go to the theater)
  Montag geht bei mir nicht (Monday does not suit me)
–Sentence mood:
  Wir gehen ins Theater! vs. Wir gehen ins Theater?
–Adverb disambiguation:
  Wir gehen eher ins Theater (We go to the theater earlier)
  Montag geht bei mir eher nicht (Monday does not really suit me)
–Anaphora and ellipsis resolution
–Japanese: Definiteness, topic phrases, zero anaphora

Semantic Based Transfer
•Task: Transfer VITs from the source to the target language
•Input: VITs
•Method: Rule-based transfer
•Result: VITs for generation
•Benefit: Translates VITs inside the deep translation path
•Responsible: Universität Stuttgart (IMS)

The Transfer Approach: Rule Based Transfer
Main characteristics
•VITs are mapped onto VITs: Transfer is a VIT rewriting system
•Rule based, context conditions restrict application
•Transfer rules remove matching source language expressions from the VIT
•Efficient implementation
•Examples:
–Simple rules:
  adelig(L,I) -> noble(L,I)
–Simple templates:
  @mod(adelig, noble, L, I)
–Selectional restrictions:
  #sort_check(I,human) -> true @mod(gross,tall,_,I)
  #sort_check(I,location) -> true @mod(gross,large,_,I)

Advanced Features of Transfer
•Structural changes:
–Adjective to PP: tagsüber -> during the day
–Insertion: übernachte -> spend the night
–…
•Disambiguation:

type of ambiguity | kinds of knowledge needed for disambiguation | modules that contribute to the resolution
lexical | syntactic, semantic, contrastive, domain, prosodic | parsers, semantic construction, discourse semantics, transfer, context
structural | syntactic, semantic, domain | parsers, semantic construction, transfer
anaphora and ellipsis | syntactic, semantic, domain | discourse semantics, context
semantic focus and operator scope | prosodic, syntactic, semantic, contrastive, domain | discourse semantics, transfer

Performance of Transfer
•Rules are compiled and packed
•18088 rules German ⇔ English
•4694 rules German ⇔ Japanese
•Mean runtime per sentence: 80 msec (Sun Ultra II, 300 MHz)

Context Evaluation
•Task: Resolve ambiguities in the dialog context during semantic transfer
•Input: Requests from transfer
•Method: Using world knowledge and rules
•Result: Disambiguated transfer requests
•Benefit: Higher quality of transfer results
•Responsible: Technical University (TU) Berlin

Context Evaluation - Tasks and Methods
•Supports semantic transfer and processes VITs
•Gets information from dialog module from shallow tracks
•Extends disambiguation of the dialog semantic module and uses ontological information

1 Nehmen wir dieses Hotel, ja. → Let us take this hotel.
  Ich reserviere einen Platz. → I will reserve a room.
2 Machen wir das Abendessen dort. → Let us have dinner there.
  Ich reserviere einen Platz. → I will reserve a table.
3 Gehen wir ins Theater. → Let us go to the theater.
  Ich möchte Plätze reservieren. → I would like to reserve seats.
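The three exchanges above can be read as a context lookup: the topic established by the preceding utterance decides how "Platz" is translated. The following is only a minimal sketch of that idea, not the Verbmobil context evaluation module. The cue-word table, the topic names, the function names and the fallback translation "place" are all assumptions made for illustration.

```python
# Hypothetical cue words that establish a topic in the dialog context.
TOPIC_CUES = {
    "hotel": "hotel",
    "abendessen": "restaurant",
    "theater": "theater",
}

# Hypothetical world knowledge: what one reserves in each topic.
PLATZ_BY_TOPIC = {
    "hotel": "room",
    "restaurant": "table",
    "theater": "seat",
}


def current_topic(previous_utterances):
    """Return the topic of the most recent utterance containing a cue word."""
    for utterance in reversed(previous_utterances):
        cleaned = utterance.lower().replace(".", " ").replace(",", " ")
        for word in cleaned.split():
            if word in TOPIC_CUES:
                return TOPIC_CUES[word]
    return None


def translate_platz(previous_utterances, default="place"):
    """Pick a translation for 'Platz' from the dialog context.

    The default is an assumption for the case where no topic is known.
    """
    return PLATZ_BY_TOPIC.get(current_topic(previous_utterances), default)


if __name__ == "__main__":
    print(translate_platz(["Nehmen wir dieses Hotel, ja."]))
    print(translate_platz(["Machen wir das Abendessen dort."]))
    print(translate_platz(["Gehen wir ins Theater."]))
```

A real system would work on VITs and ontological sorts rather than on surface strings, and would also handle the plural "Plätze"; the string matching here only keeps the sketch short.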
Example: Platz → room / table / seat
Using World Knowledge for Transfer

Dialog Processing
•Task: Provides dialog context for all tracks and computes the main information for dialog summaries
•Input: Data from many modules
•Method: Frame-like topic structuring and rules
•Result: Context information, dialog summaries and minutes
•Benefit: Verbmobil knows what happens throughout the dialog and can present it
•Responsible: DFKI, Saarbrücken

Dialog Processing
•Dialog Memory:
–Stores information from each track
–Only dialog act based and semantic transfer provide abstract representations: Discourse Representation Language (DRL):
  I would so we were to leave Hamburg on the first
  [INFORM,has_move:[move,has_source_location:[city,has_name='hamburg', has_departure_time:[date,time='day:1'
•Discourse Interpretation:
–Groups information into topics
–Completes information
–Keeps track of negotiation structure

Dialog Information in Semantic Transfer
[Figure labels: Probabilistic Analysis of Dialog Acts (HMM); Recognition of Dialog Plans (Plan Operators); Dialog Act; Dialog Phase; Syntactic Analysis; Robust Dialog Semantics; VIT; Semantic Transfer; Dialog Act]

Collaboration for a New Functionality: Result Summaries
•Provide the users with a summary of the topics that were agreed
•Two benefits
–have a piece of information to use in calendars etc.
–control the translation
•Approach: exploit already existing modules for
–content extraction
–dialog interpretation
–planning the summary
–generation
–transfer

Result Summary

Generation
•Task: Robustly generate the output of the semantic transfer in German, English, or Japanese
•Input: VITs from transfer
•Method: Constraint system for micro-planning, TAG grammar (reusing HPSG grammars) for syntactic realization
•Result: Strings, enriched with content-to-speech (CTS) information to support synthesis
•Benefit: Output from the semantic transfer track
•Responsible: DFKI, Saarbrücken

Architecture
VIT (Verbmobil Interface Term)
Robustness Preprocessing
•Repairing structural problems
•Heuristics for generation gap
Microplanning Module
•Selecting planning rules
•Lexical choice constraints
Syntactic Realization Module
•Selecting LTAG trees
•Tree combination
Surface Realization Module
•Inflection
•Synthesis Annotation
Annotated String

Preprocessing for Robustness
•Why preprocessing:
–Check and repair inconsistencies as early as possible
–Keep robustness and standard modules separate
–Alternative: relax constraints
•Preprocessing for robustness means:
–Executing a set of solution submodules in sequence
–For each problem found, the preprocessor lowers a confidence value for the generation output which measures the reliability of our result

How much robustness?
•PRO: In a dialog system, a poor translation might still be better than none at all,
•CON: one of the shallow modules can be selected when deep processing fails, so respect the inherent limitations of robustness.
⇒ Generation knows its limits and sometimes decides not to produce a string
•Selection module: uses training corpus and confidence values to select from the different translation paths

Microplanning: Create Syntactic Building Blocks
Method: Mapping of dependency structures
•Example: Time Expressions
Semantic dependency (VIT): DEF (L,I,G,H), DOWF (L1,I,mo), ORD (L2,I,11), MOFY (L3,I,may)
Syntactic dependency (TAG): nodes MONDAY1, ELEVENTH_DAY, THE, MAY, OF_P; edge labels ARG, SPEC

Multilingual Generation for Translation in Speech-to-Speech Dialogues and its Realization in Verbmobil
Tilman Becker, Anne Kilger, Peter Poller, Patrice Lopez
DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken
Tilman.Becker@dfki.de

VM-GECO: VerbMobil’s GEneration COmponents
•Multilingual Generation: German, English, Japanese
•Language-independent kernel algorithms
•Language-specific knowledge sources
•Extended “standard” pipeline architecture:
–Microplanning
–Syntactic Realization
–Surface Realization
Annotated String

Standard Architecture
VIT (Verbmobil Interface Term)
Microplanning Module
•Selecting planning rules
•Lexical choice constraints
Syntactic Realization Module
•Selecting LTAG trees
•Tree combination
Surface Realization Module
•Inflection
•Synthesis Annotation
Annotated String
[Knowledge source labels: Rules, CDL-TAG, LTAG, HPSG, Rules]

VIT: Verbmobil Interface Term
vit(vitID(sid(...),                     %Segment ID
          []),                          %WHG-String
    index(l250,l234,i72),               %Index
    [start_v(l248,i72),                 %Conditions
     arg1(l248,i72,i75),
     nop(l240,h85),
     quest(l249,h84),
     time(l238,i73),
     abstr_vacation(l247,i75),
     pron(l242,i74),
     poss(l244,i75,i74),
     temp_loc(l239,i72,i73),
     def(l245,i75,h87,h86),
     whq(l235,i73,h83,h82)],
    [in_g(l235,l237), ...               %Constraints
     leq(l234,h85), ...],
    [s_class(l240,mp), ...],            %Sorts
    [ana_ante(i74,[i75,i69,i67,i66]),   %Discourse
     prontype(i74,third,std), ...],
    [gend(i75,masc), num(i75,sg)],      %Syntax
    [ta_mood(i72,ind), ...],            %Tense and Aspect
    [...])                              %Prosody
When do your vacations begin?

VIT: Verbmobil Interface Term
[Figure: VIT for “We meet at the station.”]

Microplanning: deriving a sentence plan
•Microplanning tasks:
–determine type of utterance
–determine syntactic structure
–execute word choice
•Microplanning rules map parts of VIT input to partial dependency structures
•Implemented as constraint solving problem
•Approx. 7,200 microplanning rules (German)

Microplanning: deriving a sentence plan
•An example: “the eleventh of May”
Semantic dependency (VIT): DEF (L,I,G,H), DOWF (L1,I,mo), ORD (L2,I,11), MOFY (L3,I,may)
Syntactic dependency (TAG): nodes MONDAY1, ELEVENTH_DAY, THE, MAY, OF_P; edge labels ARG, SPEC

Syntactic Realization
•Tasks of syntactic realization:
–selecting lexicalized (TAG) trees
–constructing a phrase structure tree
–providing all information for surface realization: inflection and annotation for CTS (content-to-speech) synthesis
•Based on FB-LTAG: Feature-Based Lexicalized Tree Adjoining Grammars
•Compiled from HPSG grammars

Syntactic Realization:
•An example: “the eleventh of May”
Syntactic dependency (TAG derivation tree): nodes MONDAY1, ELEVENTH_DAY, THE, MAY, OF_P; edge labels ARG, SPEC
Syntactic phrase structure (TAG derived tree): node labels NP, N, PP, NP, P, N, DET; leaves: the eleventh of May

HPSG to TAG Compilation
•HPSG: context-free rules (schemas)
•TAG: extended local lexical structures (trees)
•Off-line compilation computes all projections from lexical types
•Generates approx. 2,300 TAG trees from 250 lexical types
–Reuse existing resources:
  •Spontaneous speech, syntactic/lexical coverage of Verbmobil domain
–Speed vs. space
–TAG captures dependencies
–HPSG includes syntax-semantics interface, vast body of linguistic work

Problems for generation
•Technical problems
–should be eliminated
–hard to eliminate in a large-scale system
–better to be robust
•Task-inherent problems
–Spontaneous speech input
–Insufficiencies in the analysis and translation
–Generation gap: mismatch between semantic input and coverage of the grammar
•→ Robust generator necessary

Problems for generation (2)
•(Task-inherent) problems manifest themselves as faults with respect to the interface language definition
•Problems with the structure of the semantic representation:
–unconnected subgraphs
–multiple predicates referring to the same object
–omission of obligatory arguments
•Problems with the content of the semantic representation:
–contradicting information
–missing information (e.g. agreement information)

Extended Architecture
VIT (Verbmobil Interface Term)
Robustness Preprocessing
•Repairing structural problems
•Heuristics for generation gap
Microplanning Module
•Selecting planning rules
•Lexical choice constraints
Syntactic Realization Module
•Selecting LTAG trees
•Tree combination
Surface Realization Module
•Inflection
•Synthesis Annotation
Annotated String

Extended Architecture (2)
•Why preprocessing:
–Check and repair inconsistencies as early as possible
–Keep robustness and standard modules separate
–Alternative: relax constraints
•Preprocessing for robustness means:
–Executing a set of solution submodules in sequence
–For each problem found, the preprocessor lowers a confidence value for the generation output which measures the reliability of our result

How much robustness?
•PRO: In a dialogue system, a poor translation might still be better than none at all,
•CON: one of the shallow modules can be selected when deep processing fails, so respect the inherent limitations of robustness.
•Selection module: uses training corpus and confidence values to select from the different translation paths

Content-to-Speech (CTS) Output
•Output annotated with information like speech act, syntactic grouping, word classes, prominence, ...
•Enhances synthesis quality
•Example:
{SpeechAct:begin}{SpeechActType: Inform}{Language:English}{Utterance:begin}
{SentenceType:Aussagesatz}{WordClass:N}Verbmobil{WordClass:AUX}is
{WordClass: DET-ART} a{Prominence:2} {WordClass:ADJ}speaker_independent{WordClass:N} system{BorderProminence:5}
{WordClass:CONJ-SYN}that {Prominence:15}{WordClass:V}offers
{Prominence:4}{WordClass:N}translation_assistance{BorderProminence:2}
{WordClass:PREP-SYN}in {Prominence:4}{WordClass:N}dialog {WordClass:N}situations
{Utterance:end}

Minutes and Summaries
•Dialog module keeps track of the dialog: dialog model, context extraction, translations: dialog history
•Three types of “protocols”:
–Minutes: relevant exchanges
–Summary: dialog results
–Scripts: complete dialog script

Multilingual Minutes and Summaries
•Multilinguality: Integration of transfer module:
[Figure labels: Context; Syndialog; Dialog; Document structure; VITs; VM-PROTO; GENGER; German Summary (HTML); Transfer (G→E); VITs; VM-PROTO; GENENG; English Summary (HTML)]

Conclusion
•Multilingual generation:
–kernel algorithms
–multilingual knowledge sources
•Robustness is necessary and useful
–within limits
•Output of classified, graded quality
•Generation of minutes and summaries
•The Verbmobil book: 2 articles on Generation

Selection and Speech Synthesis

Selection of
Translations
•Task: Select the “best” translation out of all deep and shallow translation paths
•Input: Translations (text or content)
•Method: Learning inequalities
•Result: Selected translation (text or content)
•Benefit: Use the expertise of all translation paths for a particular utterance
•Responsible: TU Berlin

Integrating Deep and Shallow Processing
Segment 1: Wenn wir den Termin vorziehen,
Segment 2: das würde mir gut passen.
Alternative Translations with Confidence Values: Statistical Translation, Dialog-Act Based Translation, Semantic Transfer, Case-Based Translation
Selection Module
Segment 1: Translated by Semantic Transfer
Segment 2: Translated by Case-Based Translation
Segment 1: If you prefer another hotel,
Segment 2: please let me know.

The Selection Problem
•Selection is a difficult business:
–confidence values are difficult to compare
  •probabilistic vs. knowledge-based approaches
  •no bird’s-eye view possible
–re-training necessary after changes in the engines
–training data must be produced

Speech Synthesis
•Task: Synthesize the translation
•Input: Text or content
•Method: Multilevel selection and concatenation of speech units from large speech corpora
•Result: Audio signal
•Benefit: “End of the chain” of the speech-to-speech system
•Responsible: Universität Bonn, TU Dresden, Universität Bochum, DaimlerChrysler

•Text-to-Speech (TTS): reading machine from arbitrary text in orthographic form. Unlimited domain. The machine does not know what it is saying.
•Concept-to-Speech [or content-to-speech] (CTS): spoken output from a database inquiry or from a dialog system. The input of the synthesizer comes from a semantic representation via a generation module. The machine should have full knowledge of what it is saying.
•Reproductive Speech Synthesis: spoken output from pre-recorded samples.
For strictly limited domains.
Different Types of Synthesis

Corpus-Based Synthesis
•Target utterances are synthesized from a corpus of utterances from within the domain.
•All units, whatever they are, have multiple instances in the corpus.
•No predefined units: the unit selection algorithm selects contiguous chunks of speech from the database; the longer, the better.
•When units of word size and above are applied, much of the natural prosody is preserved.
•Problem: coverage. Words not in the database cannot be synthesized in this way.

Unit Selection Algorithm
[Figure: unit selection graph for the sentence to synthesize, “I have time on monday.”; candidate chunks from the corpus are connected between a start node S and an end node E along the edge direction]

•The word is the central unit and the starting point for all processing.
•Only if no suitable instance of a word is available in the database is an algorithm invoked that composes the word from subword units, which are currently phones.
•The principal strategy on both the word and the subword levels is to concatenate chunks that are as long as possible (up to a whole sentence).
•As in CHATR, no prosodic manipulation is performed in this synthesis.
•In principle each word is needed in up to three positions (initial/medial, final declarative, final interrogative) and in both accented and unaccented mode.
•For Verbmobil this would mean that about 80000 word tokens would have to be recorded (which is prohibitive).
•Good coverage is reached by a selection of typical phrases from within the domain (dialogs from the Verbmobil dialog database).
•Additional utterances realize frequent words in relevant contexts (e.g., opening phrase, names of big cities).
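The "as long as possible" strategy described above can be illustrated with a small sketch. It is a simplification and not the Verbmobil synthesizer: it works on word strings instead of audio, uses a greedy left-to-right search rather than a search through a unit graph, and only marks words that would need the phone-level fallback. The function name and the toy corpus are assumptions made for illustration.

```python
def longest_chunk_cover(target, corpus):
    """Cover the target word sequence with the longest contiguous corpus chunks.

    target: list of words to synthesize.
    corpus: list of recorded utterances, each a list of words.
    Returns a list of (level, words) pairs, where level is "words" for a
    chunk found in the corpus and "phones" for a word that is missing and
    would have to be composed from subword units.
    """
    chunks = []
    i = 0
    while i < len(target):
        best = 0  # length of the longest match starting at target[i]
        for utterance in corpus:
            for j in range(len(utterance)):
                k = 0
                while (i + k < len(target) and j + k < len(utterance)
                       and target[i + k] == utterance[j + k]):
                    k += 1
                best = max(best, k)
        if best == 0:
            # Word not in the database: fall back to the phone level.
            chunks.append(("phones", [target[i]]))
            i += 1
        else:
            chunks.append(("words", target[i:i + best]))
            i += best
    return chunks


if __name__ == "__main__":
    toy_corpus = [
        ["i", "have", "time"],
        ["on", "monday"],
        ["i", "have", "no", "time", "on", "tuesday"],
    ]
    for level, words in longest_chunk_cover("i have time on monday".split(), toy_corpus):
        print(level, " ".join(words))
```

Greedy matching keeps the sketch short; it does not guarantee the smallest number of concatenation points, which is one reason a graph search over candidate units is the more natural formulation.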
Implementation

Architecture
[Figure labels: Generation; Transcription; Accenting; Phrases; Situation-Dependent Adaptation; Unit Selection Word Level; Database Word Level; Speech Corpora; Unit Selection Phone Level; Database Phone Level; Prosody Generation and Unit Selection (Diphone); Diphone Corpora; Pitch; Time-Domain Synthesis (PSOLA); Prosodic and Spectral Adapt.; Recording; Audio Out]

Verbmobil From a Software Engineering Point of View
System Design and Software Integration

Software Technology Challenges
•The goal
–Build an integrated system
•The situation
–Researchers do research
–Using different programming languages
–Researchers don’t want to be bothered with technical details
•The solution
–Introducing: the System Group
–Maximal technical support for the researchers/developers

The System Architecture
[Figure: modules M1 to M6 connected directly (Verbmobil I) and via blackboards BB 1 to BB 3 (Verbmobil II)]
Verbmobil I: Multi-Agent Architecture
•Modules know all communication partners
•Direct communication between modules
•Reconfiguration difficult
•Software: ICE and ICE Master
•Basic Platform: PVM
Verbmobil II: Multi-Blackboard Architecture
•Modules know their I/O data pools
•No direct communication between modules
198 blackboards vs. 2380 direct comm.
paths
•Reconfiguration easy
•Several instances of one module/functionality
•Software: PCA and Module Manager
•Basic Platform: PVM
Blackboards

Sample Pool Structure
[Figure labels, pools: Audio Data; Word Hypotheses Graph with Prosodic Labels; VITs, Underspecified Discourse Representations. Modules: Command Recognizer; Spontaneous Speech Recognizer; Channel/Speaker Adaptation; Prosodic Analysis; Statistical Parser; Dialog Act Recognition; Chunk Parser; HPSG Parser; Semantic Construction; Robust Dialog Semantics; Semantic Transfer; Generation]

Distributed Execution Supports Distributed Development
[Figure labels: server 1; server 2; controlling terminal; User 1; User 2; Pool Communication Architecture]

Support from the System Group (1)
•Integration framework (Testbed) with
–common communication mechanism for all used programming languages (C, C++, Lisp, Prolog, Java, Fortran, Tcl/Tk)
–narrow interface for all used programming languages
–overall system control infrastructure
•Standards on various levels
–Installation
–Compilation
–Communication formats between modules
–...
•Toolbox for recording, replaying, testing, inspecting data exchanged between modules, ...

The Testbed is the Integration Framework for the Verbmobil System
[Figure labels: PCA; Visualization Manager; Automatic Test Module; Synchronization Module; User Command Mapper; Arbitration of Concurrent Modules; GUI; Testbed Manager]

The Testbed controls the System: Module States
[Figure: module state diagram; states: ACTIVE, CONNECTED, WAITING, READY, DIED; transitions: Initializing, Synchronization, Shutdown or Error]

The GUI - Visualization and Debug Tool ....
and much more
[Figure: screenshots of the Verbmobil GUI]

Support from the System Group (2): Regular Integration Cycles
Assure high system stability and robustness in connection with large-scale testing
[Figure: integration cycle timeline; labels in order: audio modules, testbed; acoustic modules; 2 weeks; parsers and shallow translation modules; 2 weeks; linguistic modules and synthesis; 2 weeks; system delivery; 2 - 4 weeks integration and stabilization phase]

Human Factors

A Remark about Project Duration
•1993
–“You will need special hardware!”
–“1500 words speaker independent is impossible!”
–“Aren’t your goals unrealistic?”
•2000
–“Does it run on my notebook?”
–“Only 10 000 words?”
–“Why can’t it also translate in the domains X, Y, and Z?”
8 years is a long time, especially since the invention of Internet time, but it is a unique chance for
•large-scale, continuous research and development
•training people, collaborating, gaining experience
•collecting and annotating data

Management Challenges
•The goal
–Build an integrated system
•The situation
–Partners distributed and pretty independent
–Great variation in project and background experience
–Adjustment of project plan and goals over time needed
•The solution
–Define a flat management structure
–Create a group spirit

Project Organization
[Organization chart labels: Verbmobil Consortium; Group of Module Managers; Head of System Integration Group: A. Klüter; Module Coordinator: N. Reithinger; Manager Module 1 ... Manager Module n; Verbmobil Advisory Board; Scientific Management; Scientific Head: W. Wahlster; Deputy Scientific Head: A. Waibel; Head of Project Management Group: R.
Karger; German Federal Ministry for Research and Education]

Module Managers
•Have technical hands-on experience
•Responsible for one module, even if it is developed at different sites
•Volunteers (sort of ...)
•Meet regularly, despite e-mail, phone and other devices
•Define next milestones
•Define data and software integration plans
•Module coordinator coordinates the efforts and is the link to the scientific management

Example: Optimization Schedule 2000
•21.02. Delivery of CeBIT system
•21.02. - 30.04. Optimization phase
•09.05. Delivery Verbmobil System 1.0
•starting 09.05.
–speech recognizer evaluation
–turn evaluation
–15.03. - 28.04. End-to-end evaluation with feedback to developers
–27.03. - 07.04. Workshop Deep Processing

Experience
•The group of module managers is a Good Thing™
•Common goals motivate
•Friendly peer pressure works most of the time
•Early problem detection and resolution in most cases
•Regular integration cycles focus and motivate
•Proactive consensus management (PCM)

Experience
•The System Group is a Good Thing™
•The multi-blackboard architecture is a Good Thing™
•Crucial for the success of Verbmobil
•Software foundation for (almost) hassle-free module development
•Controlled distributed development possible

[Figure: symposium, 30.7.2000, 10:30-18:00, Saarbrücken, Kongresshalle]

SmartKom
•Overview
•Architecture
•Core Areas: Analysis, Fusion, Generation, ...
•Dialogue Processing

Overview
•Introduction
–Why Multimodal Interaction Systems?
–Reference Architecture for Multimodal Systems
•SmartKom: A Multimodal Interaction System
–SmartKom: A Transportable Interface Agent
–Situated Delegation-oriented Dialog Paradigm: Collaborative Problem Solving
–Modes in SmartKom
–More About the System
–M3L: XML based Multimodal Markup Language
–Multimodal Coordination

Why Multimodal Interaction Systems? (Oviatt & Cohen, CACM, March 2000)
•Accessibility for diverse users and usage contexts
–Selection of modes by the user and by the system, e.g. lean-forward/lean-backward mode in a home environment, car
•Performance stability and robustness
–Users can select robust mode
–Mutual disambiguation and presentation
•Expressive power and efficiency
–Interface more powerful
–Faster
–Increased task completion

Reference Architecture for Multimodal Systems
[Figure labels: User(s); Media Input Processing (Language, Graphics, Gesture, Sound); Mode Analysis; Multimodal Fusion; Multimodal Reference Resolution; Interaction Management (User Modeling, Discourse Management, Intention Recognition, Context Management, Expectation Management, User ID, Reference Resolution, Action Planning); Application Interface (Integrate, Respond, Request, Terminate, Initiate); Presentation Design (Select Content, Design, Allocate, Coordinate, Layout); Mode Design; Mode Coordination; Media Output Rendering (Language, Graphics, Gesture, Sound, Animated Presentation Agent); Representation and Inference, States and Histories (User Model, Discourse Model, Domain Model, Media Models, Task Model, Application Models, Context Model)]
2 Nov. 2001, Dagstuhl Seminar Fusion and Coordination in Multimodal Interaction, edited by: M.
Maybury

Overview
•Introduction
•SmartKom: A Multimodal Interaction System
–SmartKom: A Transportable Interface Agent
–Situated Delegation-oriented Dialog Paradigm: Collaborative Problem Solving
–Modes in SmartKom
–More About the System
–M3L: XML based Multimodal Markup Language
–Multimodal Coordination
•MIAMM
–Main Objectives
–Interaction using Haptics
•Research Roadmap of Multimodality
•Conclusion

Human-Technology Interaction Lead Projects

The SmartKom Consortium
[Figure: map of partner sites; legible labels: MediaInterface; European Media Lab; Univ. of Munich; Univ. of Stuttgart; Univ. of Erlangen; sites: Saarbrücken, Aachen, Dresden, Berkeley, Stuttgart, Munich, Heidelberg, Ulm]
Main Contractor: DFKI Saarbrücken
Project Budget: € 25.5 million
Project Duration: 4 years (September 1999 – September 2003)

SmartKom: A Transportable Interface Agent
•MM Dialogue Backbone
•Application Layer
–Home: EPG
–Public: Cinema, Phone, Mail
–Mobile: Navigation
•SmartKom-Mobile: A Handheld Communication Assistant
•SmartKom-Public: A Multimodal Communication Kiosk
•SmartKom-Home/Office: Multimodal Portal to Information Services

An Example Interaction with SmartKom Mobile

Situated Delegation-oriented Dialog Paradigm: Collaborative Problem Solving
[Figure labels: User; specifies goal; delegates task; cooperate on problems; asks questions; presents results; Personalized Interaction Agent; IT Services: Service 1, Service 2, Service 3]

Modes in SmartKom
•Speech
–Speaker independent speech recognition
–Prosodic input processing
–Synthesis
•Gesture
–Input
  •Natural gestures (SIVIT)
  •Pen-based
–Presentation agent
•Facial/body expression
–User state recognition
–System state presentation

The Main Modules on the Control GUI
[Figure: screenshot of the control GUI]

More About the System
•Modules realized as independent processes
•Not all must be there (critical path: speech or graphic input to speech or graphic output)
•(Mostly) independent from display size
•Pool Communication Architecture (PCA) based on PVM for Linux and NT
–Modules know only about their I/O pools
–Literature: Andreas Klüter, Alassane Ndiaye, Heinz Kirchmann: Verbmobil From a Software Engineering Point of View: System Design and Software Integration. In Wolfgang Wahlster: Verbmobil - Foundation of Speech-To-Speech Translation. Springer, 2000.
•Data exchanged using M3L documents
•All modules and pools are visualized here ...

The Real Story
[Figure: full architecture diagram]

The “Glue” - M3L: XML based Multimodal Markup Language
[Figure labels: Frame Languages; Object-oriented Modeling Primitives; NL/MM-Semantics; More formal Semantics; Subsumption, Inferences; W3C Standards; XML Schema/DTDs; M3L; Domain Knowledge; NL/MM Representation; This year’s work; Pool ... Pool, each with an XML schema]

[M3L document excerpt; surviving element values: cinema_17a, Europa, 225, 230, 0.5542, 0.1950, 0.9892, 0.7068, pid3072]
An Example of the M3L Representation of the Multimodal Discourse Context
"No presentation without representation!"

Mode Processing: The Data Flow
Diagram labels: Display Objects with ID and Location; Dialogue Backbone; Presentation; MM Fusion; User State; Domain Information; System State; Speech (input); Speech (output); Agent's Posture and Behaviour; Facial Expression (Neutral or Annoyance); Interaction Modeling; Prosody (emotion); Gesture

Processing the User's State
•Annotated in the data from the data collection
•Recognized using facial expressions and prosody
•In case of anger, activate the dynamic help
•Different reference levels: object level and meta level
•Example utterances on the slide: "This is great! Show me more!", "That was quick!", "One moment, let me think.", "OK now, what are you doing?", "Oh no, that's ugly! A new one!", "What the .... is going on?"

Wizard of Oz Data Collection (LMU Munich)
•Data distributed on DVD (1 DVD per 5 minute dialogue)

User States Annotated in 45 dialogues
  User state              Events
  Neutral                 681
  Joy/success             31
  Reflection              59
  Perplexity              31
  Surprise/Astonishment   11
  Annoyance/Failure       16
Only about 18% of the user state events are emotional.

User Independent Classification of Facial Expressions (Univ. Erlangen)
•Localization
•Classification (SVM, Eigenfaces)
•Classes: Annoyance, Rest (neutral)

Gesture Processing
•Objects on the screen are tagged with IDs and bounding boxes
•Gesture input
  –Natural gestures recognized by SIVIT
  –Touch sensitive screen
•Gesture recognition
  –Location
  –Type of gesture: pointing, tarrying, encircling
•Gesture analysis
  –Reference object in the display described as domain model (sub-)objects (M3L schemata)
  –Compute distance to bounding boxes
  –Output: gesture lattice with hypotheses

Speech Processing
•Word lattice
•Prosody inserts boundary and stress information
•Speech analysis creates intention hypotheses, e.g. for "which movies are playing at the Metropol":
  hypothesis(action:info,performance(cinema(name:Metropol)) ..)

Media Fusion
•Integrates gesture hypotheses into the intention hypotheses of speech analysis
•Information restriction possible from both media
•Possible but not necessary: correspondence of gestures and placeholders (deictic expressions/anaphora) in the intention hypothesis
•Necessary: time coordination of gesture and speech information
•Time stamps in ALL M3L documents!!
•Output: sequence of intention hypotheses

Presentation
•Starts with action planning
•Definition of an abstract presentation goal
•Presentation planner:
  –Selects presentation, style, mode, and the agent's general behaviour
  –Activates the natural language generator, which activates the speech synthesis, which returns audio data and a time-stamped phoneme/viseme sequence
•Character animation realizes the agent's behaviour
•Synchronized presentation of audio and visual information

Partial view of SK architecture: Multimodal Presentation
Diagram labels: Action Planner; Presentation Planner; Display Management; Gesture Generation; Graphics Generation; Text Generation; Speech Synthesis; Display Representation for Gesture Analysis; Functions Modelling

User Perspective
•Monitor: frontal view
•Table: angled view

Lip Synchronization with Visemes
•Goal: present a speech prompt as naturally as possible
•Viseme: elementary lip position
•Correspondence of visemes and phonemes
•Examples: [two lip-sync animations]

Behavioural Schemata
•Goal: the agent (Smartakus) is always active to signal the state of the system
•Four main states
  –Wait for user's input
  –User's input
  –Processing
  –System presentation
•Current body movements
  –9 vital, 2 processing, 9 presentation (5 pointing, 2 movements, 2 face/mouth)
  –About 60 basic movements

New animations
Examples for complex movements and speech-synchronized gestures:
•Enumeration of items
•Moving in a circle
•Pointing to the right

Example: Pointing Gestures
•Phases: base position, preparation, stroke, retraction
•Composed gesture: [animation]

Details: Natural Language Generation in SmartKom
Discourse Updates in Interactive Dialogues

Natural Language Generation in SmartKom
Tilman Becker
AT&T Research, 2 Aug 2001
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
Stuhlsatzenhausweg 3, Geb. 43.1 - 66123 Saarbrücken
Tel.: (0681) 302-5271 Fax.: (0681) 302-5020
Email: becker@dfki.de
www.smartkom.org

Overview
•Architecture
•Presentation Goals
•Natural Language Generation for Speech Synthesis
  –Architecture
  –Selection of data, sentence templates
  –"fully specified templates"
  –Concept-To-Speech information
•A short look aside: graphics and gestures
•Outlook

Presentation Begins in Action Planning
•Presentation as planning of a multi-modal dialog act
•Abstract presentation goals (defined in an XML Schema presentation.xsd)

Natural Language Generation: Overview
•Input, Output
•Architecture
•Knowledge Bases
•The steps of generation
•Templates
  –Tree Adjoining Grammars
  –"fully specified templates"
•Concept-To-Speech information

Typical Abstract Presentation Goals
•Presentation of information (usually with an implicit request): "Here you can see..." :
•Explicit request to fill a slot: "Please show me where you want to sit" :
•Feedback: "Your reservation is secured..."
•Canned presentations:

Input for Natural Language Generation
list Schmalspurganoven Europa [...]

Output of Natural Language Generation
Auf der Übersicht sehen Sie die Anfangszeiten des Films Schmalspurgan. im Kino Europa [...] die Anfangszeiten [...]
(English: "In the overview you can see the start times of the film Schmalspurgan. at the Europa cinema [...] the start times [...]")
Sketch of the Architecture
  Module             Question    Formalism
  Text Planner       What?       XSLT
  Sentence Planner   How?        PrePlan
  Syntax Realizer    Details..   TAG

Knowledge Bases in NLG
•Defining the goal (XSLT Stylesheet, What?)
•Planning rules (PrePlan, How?)
•(Template-)grammar (TAG, Realizer, How?)
•(Morphology)
•Lexicon (TAG, Realizer)
•Discourse memory (anaphora etc.)
•User model (interaction modelling) (register etc.)

First Step: Defining the Goal
•XSLT: Mapping abstract goals to realization goals, e.g.:
  (showme mf42) (showme )

First Step (2): Using Context Information
•XSLT: Creation of a generation knowledge base from the input, e.g.:
  performance_1000030 avMedium_1002535 O Brother, Where Art Thou?
  (GKB ( (performance mf745)
         (entitykey mf746 performance_1000030) ...
         (title mf747 "O Brother..") ... )

Second Step: Sentence Planning with Templates
•Result is a derivation tree
•PrePlan (a simple planning tool in Java):
  –(Text and) sentence planning
  –Selection of templates and filling of slots, e.g.:
    (overview mf42) -> (select "You can see an overview") (adjoin "Node Overview-4711") (np-realize mf42)
  –Select and adjoin refer to trees and nodes of the (TAG) grammar

TAG Grammars
•Tree Adjoining Grammars (Joshi et al. 1975)
•A grammar
  –consists of partial trees
  –that are combined by two operations:
    •Adjunction
    •Substitution
  –Lexicalized grammars:
    •A set of possible partial trees for every word
    •Every partial tree is a "maximal projection" of the word

TAG: Initial Trees
Substitution as in context-free grammars:
  [S NP [VP [V see] NP]]
  [NP [N You]]
  [NP [Det an] [N overview]]

TAG: Auxiliary Trees
Adjunction is more powerful than context-free grammars:
  [NP [Det an] [N overview]]
  [N N* [PP [P over] NP]]
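The two TAG operations named on these slides can be illustrated with a minimal sketch. It is not the SmartKom realizer: the `Node` class and the `substitute`, `adjoin`, `find_foot`, and `words` helpers are hypothetical names introduced here, and feature unification is left out. The trees are the ones from the slides ("You", "see", "an overview", and the auxiliary tree rooted in N with the foot node N*).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                                   # category, e.g. "NP"
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                   # set on lexical leaves
    subst: bool = False                          # open substitution site
    foot: bool = False                           # foot node (N*) of an auxiliary tree


def words(node: Node) -> List[str]:
    """Read the words off the leaves, left to right (open sites yield nothing)."""
    if node.word is not None:
        return [node.word]
    return [w for child in node.children for w in words(child)]


def substitute(tree: Node, initial: Node) -> bool:
    """Plug `initial` into the first open leaf that carries the same label."""
    for i, child in enumerate(tree.children):
        if child.subst and child.label == initial.label:
            tree.children[i] = initial
            return True
        if substitute(child, initial):
            return True
    return False


def find_foot(node: Node) -> Optional[Node]:
    """Return the foot node of an auxiliary tree, if it has one."""
    if node.foot:
        return node
    for child in node.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(tree: Node, aux: Node) -> bool:
    """Splice `aux` in at the first non-root node labelled like its root.
    The material that was at that node moves down to the foot node."""
    foot = find_foot(aux)
    if foot is None:
        return False
    for i, child in enumerate(tree.children):
        if child.label == aux.label and not child.subst:
            foot.children, foot.word, foot.foot = child.children, child.word, False
            tree.children[i] = aux
            return True
        if adjoin(child, aux):
            return True
    return False


# Initial trees from the "TAG: Initial Trees" slide
s = Node("S", [Node("NP", subst=True),
               Node("VP", [Node("V", word="see"), Node("NP", subst=True)])])
np_you = Node("NP", [Node("N", word="You")])
np_overview = Node("NP", [Node("Det", word="an"), Node("N", word="overview")])
# Auxiliary tree from the "TAG: Auxiliary Trees" slide: N -> N* PP
aux_over = Node("N", [Node("N", foot=True),
                      Node("PP", [Node("P", word="over"), Node("NP", subst=True)])])

adjoin(np_overview, aux_over)      # "an overview" becomes "an overview over <NP>"
substitute(s, np_you)              # fills the subject NP
substitute(s, np_overview)         # fills the object NP
sentence = words(s)
```

The NP below "over" stays an open substitution site, so `sentence` ends with "over"; a further `substitute` call would fill it. Adjoining before substituting keeps the search for the N node local to the "an overview" tree, which mirrors how a derivation tree records each operation against one elementary tree.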
TAG with Templates
•Instead of lexicalized trees:
  –A template tree contains the entire structure of a template
  –…including all words
  –A simplistic "template grammar" consists of complete sentences
  –Can smoothly be developed into a complete grammar
•Problem:
  –What are the right syntactic(?) structures?
  –General problem with CTS

Planning a Derivation Tree
Diagram labels:
•Derived trees: [NP [Det an] [N overview]]; [S [NP [N you]] [VP [V see] NP]]
•Derivation tree nodes: you-see-tree, NP_22, an-overview-tree
•Captions: Commenting on a graphical presentation; Referring to a list

Concept-To-Speech
•Syntactic information is used to compute prosodic information
•Sentences are combined into a discourse tree
•Filtering of irrelevant syntactic features
•Synthesis is based on Festival
•Preprocessing traverses the syntactic structure (Scheme)
•Work carried out at IMS, Stuttgart, Germany: Gregor Möhler, Antje Schweitzer (Prof. Dogil)

CTS versus TTS
[Figure: discourse structure and waveform with tone labels L*H H% L*H H*L H*L% %]
© Gregor Möhler, IMS Stuttgart

Templates
•Where do we get the templates from?
  –Ideally from existing grammars:
    •consistent
    •short development time
    •no/less expertise required
  –Data collection for a new application:
    •example dialogues
    •Wizard of Oz experiments
    •dialogue models
  –Growing collection of "standard templates" (will lead to a real grammar)

Current work
•Complete TAG implementation with unification:
  –Porting an existing unifier (LISP)
  –XML representation of the grammar:
    •Graphical tools
    •XSLT mapping to/from other formats (LISP)
•Structure of planning rules:
  –Separate text and sentence planning
•Extending the set of templates

Future Work
•Generating referring expressions
•Generating text for graphics, esp. for the mobile scenario "no audio"
•Text planning
•Abstract "sentence plans":
  –Module within syntactic realization
•Various tools (next slide)
•Language independent steps of NLG

Future Work
•Tools for:
  –PrePlan planning rules
  –Lexicon (morphology)
•Template tree development scenario:
  –Parser (with a German grammar -- Kim Gerdes) produces derivation trees
  –(Graphical) tool to
    •select correct analysis
    •relate to existing templates
    •mark fixed/variable parts

MIAMM
•Multidimensional Information Access using Multiple Modalities (IST-2000-29487)
  –Cross Programme Action 2: User Friendliness, Human Factors, Multi-Lingual, Multi-Modal dialog modes
•Duration: September 2001 - February 2004
•Participants
  –INRIA (Laboratoire Loria), FR [Coord.]
    •Speech recognition, language analysis, contextual interpretation
  –Deutsches Forschungszentrum für Künstliche Intelligenz, DE
    •Graphical interface, language analysis, dialogue management
  –Netherlands Organization for Applied Scientific Research (TNO), NL
    •Task analysis, interaction scenarios, evaluation
  –Sony International Europe GmbH, DE
    •Multilingual speech recognition (en, de), software for haptic interaction, domain modeling, hardware interaction
  –CANON Research Centre Europe (CRE), UK
    •Multimedia database and search application

The Haptic Device
•Phantom (www.sensable.com)
•3 degrees of freedom force feedback unit

Research Roadmap of Multimodality 2002-2005
Source: 2 Nov. 2001 Dagstuhl Seminar "Fusion and Coordination in Multimodal Interaction", edited by W. Wahlster
Diagram labels:
•Empirical and Data-Driven Models of Multimodality
•Advanced Methods for Multimodal Communication
•Toolkits for Multimodal Systems
•Computational Models of Multimodality
•Adequate Corpora for MM Research
•Mobile, Human-Centered, and Intelligent Multimodal Interfaces
•Multimodal Interface Toolkit
•XML-Encoded MM Human-Human and Human-Machine Corpora
•Mobile Multimodal Interaction Tools
•Standards for the Annotation of MM Training Corpora
•Examples of Added-Value of Multimodality
•Multimodal Barge-In
•Markup Languages for Multimodal Dialogue Semantics
•Models for Effective and Trustworthy MM HCI
•Collection of Hardest and Most Frequent/Relevant Phenomena
•Task-, Situation- and User-Aware Multimodal Interaction
•Plug-and-Play Infrastructure
•Situated and Task-Specific MM Corpora
•Common Representation of Multimodal Content
•Decision-theoretic, Symbolic and Hybrid Modules for MM Input Fusion
•Reusable Components for Multimodal Analysis and Generation
•Corpora with Multimodal Artefacts and New Multimodal Input Devices
•Models of MM Mutual Disambiguation
•Multiparty MM Interaction
•Multimodal Toolkit for Universal Access

Research Roadmap of Multimodality 2006-2010
Source: 2 Nov. 2001 Dagstuhl Seminar "Fusion and Coordination in Multimodal Interaction", edited by W. Wahlster
Diagram labels:
•Ecological Multimodal Interfaces
•Empirical and Data-Driven Models of Multimodality
•Advanced Methods for Multimodal Communication
•Toolkits for Multimodal Systems
•Usability Evaluation Methods for MM Systems
•Multimodal Feedback and Grounding
•Tailored and Adaptive MM Interaction
•Incremental Feedback between Modalities during Generation
•Models of MM Collaboration
•Parametrized Model of Multimodal Behaviour
•Demonstration of Performance Advances through Multimodal Interaction
•Real-time Localization and Motion/Eye Tracking Technology
•Multimodality in VR and AR Environments
•Resource-Bounded Multimodal Interaction
•User's Theories of System's Multimodal Capabilities
•Multicultural Adaptation of Multimodal Presentations
•Affective MM Communication
•Test Suites and Benchmarks for Multimodal Interaction
•Multimodal Models of Engagement and Floor Management
•Non-Monotonic MM Input Interpretation
•Computational Models of the Acquisition of MM Communication Skills
•Non-Intrusive & Invisible MM Input Sensors
•Biologically-Inspired Intersensory Coordination Models

Research Roadmap of Multimodality 2001-2010
Enabling Technologies and Important Contributing Research Areas
Source: 2 Nov. 2001 Dagstuhl Seminar "Fusion and Coordination in Multimodal Interaction", edited by W. Wahlster
•Columns: Multimodal Input, Multimodal Interaction, Multimodal Output
•Areas: Sensor Technologies; Vision; Speech & Audio Technology; Biometrics; User Modelling; Cognitive Science; Discourse Theory; Ergonomics; Smart Graphics; Design Theory; Embodied Conversational Agents; Speech Synthesis; Planning; Formal Ontologies; Machine Learning; Pattern Recognition

Multimodal Interaction in SmartKom
•Scenario: public (mobile, home)
•Application: movie information (EPG, email, phone, fax, address book, TV and VCR control, routing/tourist info)
•U: I want to make a reservation in (R) this movie theater
•S: This theater does not take reservations
•U: Then a different one, (R) this one perhaps

Overlay as the basic operation in discourse processing
Jan Alexandersson, Tilman Becker
IJCAI 2001 Workshop TASK-4, Seattle, WA, USA
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
Stuhlsatzenhausweg 3, Geb. 43.8 - 66123 Saarbrücken
Tel.: (0681) 302-5271
Email: {janal,becker}@dfki.de
www.dfki.de

Discourse modelling tasks
•Construct a discourse memory of contextual information
•Hypotheses:
  –enrich with context information
  –compute scores
•Discourse memory:
  –enrich
  –retract
  –(partially) overwrite

Discourse modelling Architecture
Diagram labels: Hypothesis from IE; Hypothesis to IE; Request; Response; Discourse memory; complete; score; store

Dialog memory
•A typical dialog situation:
  –User: I want to see Matrix
  –System: Ok, it runs at 8 and at 10
  –User: At 8
•Dialog memory:
  –structured storage for utterances (and their meaning)
•"Current context":
  –data structure representing the currently active context
  –e.g.: Matrix at 8

Putting the user in context
•New information is added to the current context
•Result: updated current context
•Used, e.g., for a database query

Unification-based Integration of Speech and Gesture
Diagram labels: Speech recognition; Gesture recognition; Speech analysis; Gesture analysis; Multimodal Chart Parser; Unification-based multimodal Grammar; Application interface
MVPQ, AT&T (Johnston 2000)

Updating current context with Unification
•Representing complex discourse objects as typed feature structures (TFS), e.g. Johnston 1998
•Used, e.g., in media fusion:
  –User: I want to see this one [pointing to movie "Matrix"]
  –Speech: "I want to see X"
  –Gesture: "When is Matrix showing?" "I want to see Matrix." ....
  –Media Fusion: "I want to see Matrix."
•Problem: enumeration of all structures (in deixis)

Typed feature structures and XML
•In the SmartKom project, discourse objects are represented in XML
•Mapping from XML to TFS assumed
•Example:

The limits of unification
•Not all new information is consistent with the current context
•Even for media fusion:
  –User: This one, (but) in green
•Some parts must be kept, some must be overwritten
  –"keep and overwrite", M. Streit
•Provide a principled method, based on unification

Overlay to the rescue
•Unification is a monotonic, reflexive operation
•Old information from the current context can be changed; new information is more important
•⇒ we need a non-monotonic, non-reflexive operation: overlay

Overlay to the rescue
•Task: compare the new (intention) hypothesis against the discourse history
•New information consistent with focus: ⇒ Unification
•New information (partially) inconsistent with focus: ⇒ Overlay

Example for Unification
•U: I want to go to the movies tonight
•S: Here is a list of the films that are shown in Heidelberg tonight: (SmartKom shows a list)
•U: I want to see (R) this one, where is it playing?

Unification: monotonic operation

... Schmalspurganoven ... 2000-12-13T12:34:56 2000-12-13T23:59:59
Heidelberg
... Schmalspurganoven ... ...
Heidelberg
... Schmalspurganoven
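
The contrast between unification and overlay drawn in this part of the talk can be sketched on plain nested dictionaries. This is a simplified, untyped illustration and not the SmartKom implementation: the function names `unify` and `overlay`, the exception class, and the feature names (`film`, `cinema`, `city`, `begin`, `end`) are assumptions made here, and the type hierarchy is not modelled. The values are borrowed from the talk's movie examples.

```python
class UnificationFailure(Exception):
    """Raised when two structures carry conflicting atomic values."""


def unify(a, b):
    """Monotonic merge: keeps everything from both sides, fails on any conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            result[key] = unify(a[key], value) if key in a else value
        return result
    if a == b:
        return a
    raise UnificationFailure(f"{a!r} conflicts with {b!r}")


def overlay(cover, background):
    """Non-monotonic merge that always succeeds: the cover (new information)
    wins on conflict, the background (old context) fills in the rest."""
    if isinstance(cover, dict) and isinstance(background, dict):
        result = dict(background)
        for key, value in cover.items():
            result[key] = overlay(value, background[key]) if key in background else value
        return result
    return cover


# Consistent new information: unification simply enriches the context.
context = {"city": "Heidelberg",
           "begin": "2000-12-13T12:34:56", "end": "2000-12-13T23:59:59"}
request = {"film": {"title": "Schmalspurganoven"}}
unified = unify(context, request)

# Inconsistent new information ("Then a different one"): unification fails,
# overlay keeps the film and overwrites the cinema.
background = {"film": {"title": "Schmalspurganoven"},
              "cinema": {"name": "Studio Europa"}}
cover = {"cinema": {"name": "Kamera"}}
try:
    unify(background, cover)
    clash = False
except UnificationFailure:
    clash = True
overlaid = overlay(cover, background)
```

Note the argument order of `overlay`: swapping cover and background changes the result, whereas `unify` gives the same result either way when it succeeds.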

Unification: compatibility condition fails

Overlay: nonmonotonic operation that always succeeds

Example for Overlay
•U: I want to make a reservation in (R) this movie theater
•S: This theater does not take reservations
•U: Then a different one, (R) this one perhaps

... Studio Europa ... ... ... Studio Europa ... ... Schmalspurganoven P=0.3 P=0.7 ... Studio Europa ... ... ... Kamera ... ... Schmalspurganoven ... Studio Europa ...

Type Hierarchy
TOP
  TV_or_Movie (BeginTime, Subtitles, ...)
    TV (Channel, ...)
    Movie (Cinema, ...)
  MovieTheater
  ...

Overlay and Typed Feature Structures (TFS)
•Two non-unifiable structures (type clash):
  –Cover is more important than background
  –Keep information from background:
    •Find lub (most specific common supertype)
    •"Reduce" background to this type
    •Recursively apply overlay on features
    •For atomic values: ignore background

An Example
•U: What films are showing on TV tonight?
•S: [shows list of films]
•U: That's a boring program, I'll rather go to the movies.
•Q: How do we save "tonight"?

An Example
•U: What films are showing on TV tonight?
  ⇒ Context of type TV
•S: [shows list of films]
•U: That's a boring program, I'll rather go to the movies.
  ⇒ Analysis finds data of type Movie
    •incompatible with context
    •abstract context to lub TV_or_Movie (keeps "tonight")
    •unifiable with analysis

Does TFS solve all your problems?
•An adequate type hierarchy must exist
  –"most specific common supertype"
  –Carpenter and others on default unification
•Overlay (and unification) of lists and sequences is not well defined -- and content dependent
•What about "semantics", e.g. DRS, Verbmobil VIT/MRS?

Implementation
•Mapping of XML Schema to Java classes (see data binding):
  –Castor Project
  –Java 1.4: JAXB
•XML documents are represented internally as instances of these classes
•Unification and overlay are realized using the Java meta protocol

Next steps
•Treatment of subobjects
  –find relation to context
•Grounding
  –model the presentation-acceptance cycle of discourse objects
•Inclusion of dialog management plans
  –expected vs. possible next states
  –better interpretation in context
•Fully formalize the XML Schema to TFS mapping

Summary of the Talk
•Two large-scale spoken dialogue projects: Verbmobil, SmartKom
•Spotlight on aspects of NLG and discourse processing
•Conclusion:
  –Large-scale projects offer new insights (see also the upcoming 6th Framework Programme of the EU)
  –Modular architecture (data pool driven middleware)
  –Combine shallow and deep approaches
    •multi-engine approach
    •fully specified template approach
  –Emerging multi-modal markup language

Finally
Thank you very much for your kind attention.

Verbmobil - The Project
•Some information for those who haven't heard of Verbmobil recently
•Speaker independent speech-to-speech translation system for appointment scheduling and travel planning:
  –German ↔ English (10 175 words German, 6871 words English)
  –German ↔ Japanese (2566 words Japanese)
•69 modules, full configuration 3.5 GB
•23 participating institutions (in Verbmobil II)
•Over 900 full-time employees and students involved
•Project duration: 1993 - 2000
•⇒ scientific, software technology, and management challenges

Scientific Results
•Speaker independent speech recognition over various channels
•Language ID
•Unknown words
•Prosodic information (segmentation, stress etc.) used in various modules
•Repair of hesitations, repetitions
•Combination of parser analysis fragments
•Semantic representation: VIT

There are over 600 refereed papers on the various aspects of and achievements in Verbmobil. See also W. Wahlster (ed.): Verbmobil: Foundations of Speech-to-Speech Translation, Springer Verlag, to appear July 2000 ... at any shop near your office :-)

Some highlights
•Context and dialog knowledge supports translation
•Efficient semantic transfer
•Content to speech generation
•Word concatenative speech synthesis
•Dialog minutes and summaries
•Large data collection with annotation on various levels (e.g. tree-banks, dialog acts)
•....

Multi-Engine for Translation (German and English)
•Large-scale web-based evaluation: 25 345 translations, 65 evaluators
•Sentence length: 1 - 60 words

  Translation Thread             Word Accuracy ≥ 50%   Word Accuracy ≥ 75%   Word Accuracy ≥ 80%
                                 (5069 Turns)          (3267 Turns)          (2723 Turns)
  Case-based Translation         37%                   44%                   46%
  Statistical Translation        69%                   79%                   81%
  Dialog-Act based Translation   40%                   45%                   46%
  Semantic Transfer              40%                   47%                   49%
  Substring-based Translation    65%                   75%                   79%
  Automatic Selection            57% / 78% *           66% / 83% *           68% / 85% *
  Manual Selection               88%                   95%                   97%
  * After training with an instance-based learning algorithm

Agreement between Different Labels
•Most M-boundaries (79%) and D-boundaries (91%) are prosodically marked
•About half of the M-boundaries (52%) are D-boundaries
•Practically all D-boundaries (97%) are M-boundaries
•High agreement between the non-boundaries (92-100%)
•Even a prosody module with a recognition rate of 100% will not find 21% of the M-boundaries and 9% of the D-boundaries!

         -D3    D3    -B3    B3
  -M3    100     0     97     3
   M3     48    52     21    79

         -M3    M3    -B3    B3
  -D3     93     7     92     8
   D3     48    97      9    91

  B3: prosodic boundary; M3: syntactic boundary; D3: dialog act boundary (all values in %)

Distribution of Sentence Length in Large-Scale Evaluation
[Chart; axis ticks 1 to 60 and 0 to 350]

Results of End-to-End Evaluation Based on Dialog Task Completion for 31 Trials

  Topic                                 Successful    Attempts   Successful Task    Frequency-Based
                                        Completions              Completions (%)    Weighting Factor
  Meeting time                          25            28         89.3               0.90
  Meeting place                         21            27         77.8               0.87
  Means of transport                    30            30         100                0.97
  Departure place                       22            25         88                 0.81
  Arrival time                          22            26         84.6               0.84
  Place of arrival                      17            19         89.5               0.61
  Who reserves the hotel                28            31         90.3               1
  How to get to departure place         7             9          77.8               0.29
  Means of return transportation        23            24         95.8               0.77
  Departure place for return trip       16            17         94.1               0.55
  Meeting time for return trip          3             4          75                 0.13
  Meeting place for return trip         3             4          75                 0.13
  Arriving place for return trip        10            11         90.9               0.35
  Total Number of Dialog Tasks          227           255

  Average Percentage of Successful Task Completions: 86.8
  Weighted Average Percentage of Successful Task Completions: 89.6

Test Results for the current Repair Module
Remember: The output of the Repair module consists of additional hypotheses for the linguistic analysis. The original hypotheses remain in the WHG.

Examples