Co-funded by the 7th Framework Programme of the European Commission through the contract T4ME, grant agreement no.: 249119.

Machine Translation Research in META-NET
Jan Hajič
Institute of Formal and Applied Linguistics
Charles University in Prague, CZ
hajic@ufal.mff.cuni.cz
With contributions by Marcello Federico, Pavel Pecina, Stephan Peitz and Timo Honkela
META-FORUM 2010: Challenges for Multilingual Europe
Brussels, Belgium, November 17/18, 2010
http://www.meta-net.eu

Outline
• Pillar I in META-NET
  - … the research element of META-NET
• Semantics in Machine Translation
  - Semantic features in statistical MT
  - (Semantic) Tree-based translation
• Hybrid MT systems
  - Rule-based and statistical
• Context in MT
  - "Extra-linguistic" features
• More data for MT
  - Parallel data for under-resourced languages
• Related projects & the Future

Semantics in Machine Translation
• What is semantics, anyway?
  - For now: anything beyond and outside morphology and syntax
    - Semantic Roles (words vs. predicates)
    - Lexical Semantics (WSD), MWE
    - Named Entities
    - Co-reference (pronominal, bridging anaphora)
    - Textual Entailment
    - Discourse Structure
    - Information Structure
    - … + any combination of the above
• New metrics
  - BLEU, METEOR, NIST etc. are biased towards (good) local n-grams
  - Metrics sensitive to semantics?
• Tools and Resources
  - Semantically annotated parallel corpora; metrics tools, analysis tools

Semantics in Machine Translation
• Analysis – transfer [– generation]
[Figure: analysis levels (Morphology, Syntax, Semantics with semantic features), transfer between Source and Target, and Generation (if needed)]

Semantics in Machine Translation
• Case Study 1
  - Cross-lingual Textual Entailment for Adequacy Evaluation
    - Y. Mehdad, M. Negri, M. Federico: Towards cross-lingual textual entailment, NAACL 2010
• Case Study 2
  - Combined Syntax and Semantics for MT Transfer
    - D. Mareček, M. Popel, Z. Žabokrtský: Maximum Entropy Translation Model in Dependency-Based MT Framework, WMT / ACL 2010
• Case Study 3
  - Anaphora Resolution for translation of pronouns
    - C. Hardmeier, M. Federico: Modeling Pronominal Anaphora in Statistical MT, IWSLT 2010
• Case Studies → Selected Challenges
  - Evaluation of impact of individual additions
    - Evaluation data with/without phenomenon under study
    - Automatic vs. human evaluation

Hybrid MT Systems

Machine Translation Paradigms
• RB-MT – Rule-Based Machine Translation
• EB-MT – Example-Based Machine Translation
• SMT – Statistical Machine Translation
• PB-SMT – Phrase-Based Statistical Machine Translation
• HPB-SMT – Hierarchical Phrase-Based Statistical Machine Translation
• SB-SMT – Syntax-Based Statistical Machine Translation
• ...
• Observation: Different systems have different strengths (e.g. easy training of SMT vs. good grammar of RB-MT)
• Hypothesis: Hybrid systems can combine the best of all

Hybrid MT: Pre-Translation System Selection
• Multiple MT engines/systems available
• Machine learning techniques
  - decide which system is best to translate the input sentence
[Figure: input, the engines RB-MT, EB-MT, PB-SMT, HPB-SMT and SB-SMT, an ML component, output]

Hybrid MT: Pre-Translation System Selection
• Multiple MT engines/systems available
• All systems translate
  - Analysis of outputs → select translation
[Figure: input, the engines RB-MT, EB-MT, PB-SMT, HPB-SMT and SB-SMT producing output1 to output5, an ML component, selected output (output1)]

Hybrid MT: Pre-Translation System Selection
• Multiple MT engines/systems available
• All systems translate
  - Translation compiled from analyzed pieces
[Figure: input, the engines RB-MT, EB-MT, PB-SMT, HPB-SMT and SB-SMT, an ML component, output]

The META-NET Hybrid System Approach
• Based on system combination
• Multiple systems based on different paradigms used to produce annotated n-best outputs:
  - Matrex (example based): all language pairs ↔ English
  - Moses (phrase based): all language pairs ↔ English
  - Metis (rule based): Spanish → English, German → English
  - Apertium (rule based): Spanish ↔ English
  - Lucy (rule based): Spanish, German ↔ English
  - Joshua (hierarchical phrase based): all language pairs ↔ English
  - TectoMT (deep syntax based): Czech ↔ English
• Annotation: words, phrases, subtrees, chunks scored by different models (depending on the system)
• Decoding: machine learning techniques used to recombine those to get better output

Context in Machine Translation

Increase MT quality and services in multimodal context
(SOURCE) Česká republika je jedním z mála vnitrozemských států, jehož obrysy lze rozeznat na satelitních snímcích.
(TARGET) Czech Republic is one of the few inland countries whose borders can be seen from satellite photographs.
(CONTEXTS) [Figure: terrain map of the Czech Republic, shown alongside the source and target sentences and the MT component]

Context in Machine Translation
• Research
  - Domain as context: domain adapted language and translation models
  - Context in statistical morphology learning
  - Context of use in translation
  - Multimodal context in translation
• Challenge preparations
  - Gathering experience from MorphoChallenge 2010 (associated with the PASCAL Network of Excellence)
  - Preparation for Context in Translation Challenge for ICANN 2011 Conference (topic: Reranking Translation Candidates)
  - Preparation for Context in Translation Challenge for 2013 (theme: Multimodal Context in Translation)

Context in Machine Translation
• Domain adapted language and translation models
  - Method
    - Large corpus divided into predefined domains
    - Train translation and language models on each domain
    - Train additional language models on the predefined domains
    - Train a classifier to classify incoming documents to a domain
    - Decode using respective translation and language models
    - Evaluate results and revise method if necessary
  - Resources
    - JRC-Acquis & Eurovoc
    - Europarl
  - Innovation
    - Design, implement and fine-tune classification algorithms
    - Explore ways to effectively combine language and translation models

Context in Machine Translation
• Context in statistical morphology learning
  - O. Kohonen, S. Virpioja, L. Leppänen and K. Lagus (2010): Semisupervised Extensions to Morfessor Baseline
• Multimodal context in translation
  - Research questions:
    - What kind of multimodal contextual information can be used to advance MT quality? How to better access multimodal information?
    - In which MT applications is multimodal information useful?
  - Current target: enhancing language and translation models with visual and textual context data and ontological knowledge
    - Use cases: translation of figure captions, translation of subtitles, MT in extended reality applications, robotics applications

Context in Machine Translation: 2011 Challenge
• Data
  - JRC Acquis corpus, 22 European languages
  - Translations by state-of-the-art statistical systems
• Tasks
  - Choose the best translation from a set of candidate translations produced by multiple systems (reranking task)
  - Context is given by the source sentence, larger linguistic context and the domain of the text
• Goals
  - To discover the set of best context features, find representation
  - To foster collaboration between MT and Machine Learning (ML) researchers; infuse MT research with advances from the ML field
• Future Challenge: 2013
  - Using visual context (images)

Context in Machine Translation: 2013 Challenge
• Data
  - Data requirements are set; specific data collections are sought
  - Requirements: multilingual (parallel) text and image collection (where the association is close)
• Tasks
  - Word sense disambiguation with multimodal contextual information (tentative)
• Goals
  - Improved MT using visual features, deepening symbol grounding in multilingual systems
  - Increase collaboration between MT and image analysis/machine learning researchers

Data and Machine Learning for MT

Data and Advanced Machine Learning in MT
• "There is no data like more data"
  - Data crawling, cleanup, deduplication, …
  - Available through META-SHARE
• Advanced Machine Learning Experiments
  - Combining several previously described approaches
  - Syntax, Semantics, Hybrids, …
[Figure: diagram with F1, F2, F3, F4, rwth-dep, rwth-syntax, an ML component and the output]

Related Projects

EU 7th FP Machine Translation (selected projects)
• EuromatrixPlus
  - Machine Translation in general – now 8 selected languages (Czech, English, French, Spanish, German, Italian, Slovak, Bulgarian)
• FAUST
  - Improving fluency, incorporating user feedback (fast)
  - French, English, Czech, Spanish
• ACCURAT
  - Using comparable corpora, esp. for low-resource languages
  - Estonian, Croatian, …
• LetsMT! (PSP)
  - Building of data resources (low-resourced languages)
  - For business and research
• Panacea
  - Building Resources & Language Tools
  - Tools + Resources → Automatically analyzed corpora
• Khresmoi (IP)
  - Medical information retrieval for patients and practitioners
  - Cross-language (English, German, Czech, French) ← MT

The Future
• Resources, resources, resources
  - … and their availability (META-SHARE)
• Novel, high-risk research
  - Linguistics
    - Unclear "which linguistics", but some
  - Language Understanding
    - Context, domain knowledge (ontologies?), other modalities
  - … but SMT is here to stay (in some form)
    - … even though we might not recognize the current "kitchen-sink" paradigm a few years from now
  - New algorithms
    - Neural networks (finally?), Genetic algorithms, Brain research, …
  - Better [automatic] evaluation to guide progress
• Commercial Applications
  - Post-editing (CAT) tools with integrated (S)MT, novel features, ergonomics
  - Multilingual information access, information extraction, summarization, sentiment

Q/A
Thank you very much.
office@meta-net.eu
http://www.meta-net.eu
http://www.facebook.com/META.Alliance
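
Backup: domain routing sketch
The domain adapted method listed under "Context in Machine Translation" (divide a corpus into predefined domains, train models per domain, classify an incoming document, decode with the respective models) could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the META-NET implementation: the toy corpus, the add-one smoothed unigram language models used as the classifier, and the helper names (train_unigram_lm, log_prob, classify_domain, route) are all hypothetical, and actual decoding with per-domain translation models is left out.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Count word frequencies for one domain (toy stand-in for a real language model)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return counts, sum(counts.values())


def log_prob(text, lm, vocab_size):
    """Add-one smoothed unigram log-probability of `text` under `lm`."""
    counts, total = lm
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size))
        for w in text.lower().split()
    )


def classify_domain(text, domain_lms, vocab_size):
    """Pick the domain whose language model scores the text highest."""
    return max(domain_lms, key=lambda d: log_prob(text, domain_lms[d], vocab_size))


# Hypothetical toy corpus, already divided into predefined domains.
corpus = {
    "agriculture": [
        "the farm subsidy covers milk and wheat",
        "wheat quotas for each farm",
    ],
    "finance": [
        "the bank sets the interest rate",
        "interest on the loan from the bank",
    ],
}

# One language model per domain, plus a shared vocabulary size for smoothing.
domain_lms = {d: train_unigram_lm(sents) for d, sents in corpus.items()}
vocab_size = len(
    {w for sents in corpus.values() for s in sents for w in s.lower().split()}
)


def route(document):
    """Return the domain whose translation and language models would be used for decoding."""
    return classify_domain(document, domain_lms, vocab_size)
```

In a real setup the value returned by route would select which domain-specific translation and language models the decoder loads; a stronger classifier (for instance one trained on Eurovoc-labelled JRC-Acquis documents) could replace the unigram scoring without changing the routing step.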