Current Challenges
Philipp Koehn
31 October 2024

Philipp Koehn Machine Translation: Current Challenges 31 October 2024

WMT 2024: LLMs Arrive

System types: NMT, LLM, Research System

English-Japanese: Human, Online-B, Command-R+, GPT-4, Claude-3.5, Gemini-1.5-Pro, Tower70B, IOL-Research, Aya23, NTTSU, Team-J, Llama3-70B, IKUN-C
English-Spanish: Human, Dubformer, GPT-4, IOL-Research, Mistral-Large, Tower-70B, Claude-3.5, Gemini-1.5-Pro, Command-R+, Llama3-70B, Online-B, IKUN, IKUN-C, MSLC
English-Hindi: TranssionMT, Tower-70B, Claude-3.5, Online-B, Gemini-1.5-Pro, GPT-4, Human, IOL-Research, Llama3-70B, Aya23, IKUN-C
English-Icelandic: Human, Dubformer, Claude-3.5, Tower-70B, AMI, IKUN, Online-B, GPT-4, IKUN-C, IOL-Research, Llama3-70B
English-Czech: Human, Tower-70B, Claude-3.5, Online-W, CUNI-MH, Gemini-1.5-Pro, GPT-4, Command-R+, IOL-Research, SCIR-MT, CUNI-DocT, Aya23, CUNI-GA, IKUN, Llama3-70B, IKUN-C
English-German: GPT-4, Dubformer, Tower-70B, Online-B, TranssionMT, Human-B, Mistral-Large, Command-R+, Online-W, Claude-3.5, Human-A, IOL-Research, Gemini-1.5-Pro, Aya-23, Online-A, Llama3-70B, IKUN, IKUN-C
English-Chinese: Human, Online-B, IOL-Research, Tower70B, GPT-4, Gemini-1.5-Pro, Claude-3.5, Command-R+, Llama3-70B, HW-TSC, Aya23, IKUN, IKUN-C
Czech-Ukrainian: Human, Claude 3.5, Gemini-1.5-Pro, Tower70B, IOL-Research, Online-W, Command-R+, GPT-4, IKUN, GPT-4, CUNI-Transf, IKUN-C

Domain Mismatch

Rows are systems; columns are test domains; each cell holds two scores.

System ↓   Law          Medical      IT           Koran        Subtitles
All Data   30.5 / 32.8  45.1 / 42.2  35.3 / 44.7  17.9 / 17.9  26.4 / 20.8
Law        31.1 / 34.4  12.1 / 18.2   3.5 /  6.9   1.3 /  2.2   2.8 /  6.0
Medical     3.9 / 10.2  39.4 / 43.5   2.0 /  8.5   0.6 /  2.0   1.4 /  5.8
IT          1.9 /  3.7   6.5 /  5.3  42.1 / 39.8   1.8 /  1.6   3.9 /  4.7
Koran       0.4 /  1.8   0.0 /  2.1   0.0 /  2.3  15.9 / 18.8   1.0 /  5.5
Subtitles   7.0 /  9.9   9.3 / 17.8   9.2 / 13.6   9.0 /  8.4  25.9 / 22.1

Robustness to User Generated Content

Challenges
• Jargon and acronyms
• Misspellings (sometimes intended for effect)
• Mangled grammar
• Special symbols (emojis, etc.)
• Hashtags, URLs, ...
• Use of dialectal language
• Use of non-standard writing systems (e.g., Latin script due to lack of keyboard)

Some Methods
• Special handling of non-words like emojis, hashtags, URLs
• Creating synthetic noisy training data
• Adversarial training
• Resources
  – Machine translation of noisy text data set (MTNT)
  – WMT 2020 Shared Task on Machine Translation Robustness
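
One common way to give non-words special handling is to mask them with placeholders before translation and restore them afterwards. The sketch below is illustrative only and is not taken from the slides: the pattern set, the `<phN>` placeholder format, and the function names `mask` and `unmask` are all assumptions, and it presumes the translation system copies placeholders through unchanged.

```python
import re

# Hypothetical pattern set (an assumption, not from the slides):
# URLs, hashtags, @-mentions, and a rough range of emoji/symbol code points.
NONWORD = re.compile(
    r"https?://\S+"                             # URLs
    r"|[#@]\w+"                                 # hashtags and mentions
    r"|[\U0001F300-\U0001FAFF\u2600-\u27BF]"    # common emoji and symbol blocks
)

def mask(text):
    """Replace non-word tokens with numbered placeholders.

    Returns the masked text and the list of original tokens.
    """
    saved = []

    def repl(match):
        saved.append(match.group(0))
        return f"<ph{len(saved) - 1}>"

    return NONWORD.sub(repl, text), saved

def unmask(text, saved):
    """Put the original tokens back in place of their placeholders.

    Assumes every placeholder index in `text` exists in `saved`.
    """
    return re.sub(r"<ph(\d+)>", lambda m: saved[int(m.group(1))], text)

# Usage: mask the source, translate the masked text, then unmask the output.
masked, saved = mask("Check https://example.com #mt 😀 now")
restored = unmask(masked, saved)
```

Because placeholders are numbered, they can be restored correctly even if the translation reorders them.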
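
Synthetic noisy training data can be created by corrupting the source side of clean parallel text while keeping the reference translation untouched. The following is a minimal sketch under stated assumptions: the three character edits (swap, drop, repeat), the per-word noise rate, and the names `noisy_word`, `add_noise`, and `make_noisy_pairs` are hypothetical choices for illustration, not a method described in the slides.

```python
import random

def noisy_word(word, rng):
    """Apply one random character edit (swap, drop, or repeat) to a word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["swap", "drop", "repeat"])
    if op == "swap":    # transpose two adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":    # delete one character
        return word[:i] + word[i + 1:]
    return word[:i] + word[i] + word[i:]  # double one character

def add_noise(sentence, rate=0.1, seed=0):
    """Corrupt each whitespace-separated word with probability `rate`."""
    rng = random.Random(seed)  # seeded so the noise is reproducible
    return " ".join(
        noisy_word(w, rng) if rng.random() < rate else w
        for w in sentence.split()
    )

def make_noisy_pairs(pairs, rate=0.1, seed=0):
    """Noise only the source side; keep the clean reference translation."""
    return [
        (add_noise(src, rate, seed + k), tgt)
        for k, (src, tgt) in enumerate(pairs)
    ]

# Usage: augment a (source, target) corpus with misspelled source sentences.
augmented = make_noisy_pairs([("this is a test", "das ist ein Test")], rate=0.5)
```

Random character edits only approximate real user-generated noise; data sets such as MTNT provide naturally occurring examples.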