Machine Translation in Practice
FI:PV061: Introduction to MT
Michal Štefánik
stefanik.m@mail.muni.cz

Outline
1. Motivation
2. Background
3. Practical problems
4. {Pre/Post}processing
5. Generation heuristics
6. Deployment

Motivation
What is wrong with just using Google Translate?
- Price
- Speed
- Robustness on specific domains

Background

Background - training

Background - inference

Practical problems
- What if we have specific vocabulary for some terms?
- What if the translator has never seen some of the symbols?
- What if the text is non-canonical (i.e. weird)?

Our approach
● We need to serve 10+ language pairs, so we choose a big, pre-trained model and fine-tune it for our purposes (mBART).
● We take special care of the non-pretrained languages - if a language is low-resource (like zh_TW), we fall back to an auxiliary language.
● We train the model to natively support the {pre/post}-processing that we need (more on that later).

Demo
curl text-translator-api.gaussalgo.com/translate/ \
  -X POST \
  -H 'content-type: application/json' \
  -d '{"source_lang": "en_XX", "target_lang": "cs_CZ", "text": "Weird text to break the demo"}'

Application overview

{Pre/Post}processing

Generation heuristics
Our translator applies these heuristics (a sketch of how they map onto standard generation arguments is attached as an appendix after the closing slide):
1. NoBadWords: translations containing tokens from a per-language block list (such as Chinese or Arabic tokens in Indonesian output, plus some shared tokens) get a manually assigned score of -infinity.
2. RepetitionPenalty: scores of tokens that were already generated are multiplied by DEFAULT_REPETITION_PENALTY and hence lowered (the logits are negative).
3. MinLength: for sequences shorter than a given threshold, the end-of-sequence score is set to -inf. This helps us avoid early termination of the generation.
4. ForcedBeginningOfSequenceToken: all sequences not starting with the given language token are assigned -inf. This supports mBART's decoding interface, which expects the target-language code as the first generated token.
5. ForcedEndOfSequenceToken: once some candidate sequence already contains </s>, all the others are set to -inf and hence pruned, so the generation process ends. This is a speed-up trick that allows generation to stop as early as possible.

Deployment
● Training mBART with batch_size=1 consumes 20 GB of GPU memory; our current mBART was fine-tuned on a single A100 for ~90 hours (~350 USD) (link to the training tracker).
● Production runs on a managed Kubernetes engine:
  ○ We fit three instances (models) onto a single node with an Nvidia Tesla T4 (~700 USD/month).
  ○ Performance is equivalent to 64-core CPUs (~1,500 USD/month).
  ○ Auto-scaling.
● Each customer gets their own configuration of:
  ○ Manual translation vocabulary
  ○ Abbreviations
  ○ Processors
  ○ Frequent typo corrections
● Currently we manage to share the same model among all customers, but this will soon change.

Thanks!
Michal Štefánik
stefanik.m@mail.muni.cz
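
Appendix: generation heuristics in code
The following is a minimal sketch of how heuristics 1-5 above map onto off-the-shelf Hugging Face generation arguments; it is not the production code. The public facebook/mbart-large-50-many-to-many-mmt checkpoint stands in for our fine-tuned model, and the bad-word list, penalty value, and minimum length are illustrative assumptions.

from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

inputs = tokenizer("Weird text to break the demo", return_tensors="pt")

# NoBadWords: token sequences we never want in the output (illustrative list).
bad_words_ids = tokenizer(["免费", "مجاني"], add_special_tokens=False).input_ids

outputs = model.generate(
    **inputs,
    # ForcedBeginningOfSequenceToken: mBART expects the target-language code first.
    forced_bos_token_id=tokenizer.lang_code_to_id["cs_CZ"],
    # NoBadWords: candidates containing these token sequences are scored -inf.
    bad_words_ids=bad_words_ids,
    # RepetitionPenalty: already-generated tokens are down-weighted.
    repetition_penalty=1.2,
    # MinLength: the end-of-sequence score stays at -inf below this length.
    min_length=5,
    # Beam search with early stopping, so decoding ends once all beams finish.
    num_beams=4,
    early_stopping=True,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

The in-house implementation may differ in detail; the sketch only illustrates the intended effect of each heuristic on the decoding scores.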