Corpus Acquisition from the Internet

Philipp Koehn
partially based on slides from Christian Buck
13 November 2018

Philipp Koehn    Machine Translation: Corpus Acquisition from the Internet    13 November 2018

Big Data

For many language pairs, lots of text available.

Text you read in your lifetime    300 million words
Translated text available         billions of words
English text available            trillions of words

Mining the Web

• Largest source for text: the World Wide Web
  – publicly available crawl of the web
  – hosted by Amazon Web Services, but can be downloaded
  – regularly updated (semi-annual)
  – 2-4 billion web pages per crawl
• Currently filling up hard drives in our lab

Monolingual Data

• Starting point: 35TB of text
• Processing pipeline [Buck et al., 2014]
  – language detection
  – deduplication
  – normalization of Unicode characters
  – sentence splitting
• Obtained corpora

Language   Lines (billions)   Tokens (billions)   Bytes       BLEU (WMT)
English    59.13              975.63              5.14 TB
German      3.87               51.93              317.46 GB   +0.5
Spanish     3.50               62.21              337.16 GB
French      3.04               49.31              273.96 GB   +0.6
Russian     1.79               21.41              220.62 GB   +1.2
Czech       0.47                5.79               34.67 GB   +0.6

Parallel Data

• Basic processing pipeline [Smith et al., 2013]
  – find parallel web pages (based on URL only)
  – align document by HTML structure
  – sentence splitting and tokenization
  – sentence alignment
  – filtering (remove boilerplate)
• Obtained corpora

                 French   German   Spanish   Russian   Japanese   Chinese
Segments         10.2M    7.50M    5.67M     3.58M     1.70M      1.42M
Foreign Tokens   128M     79.9M    71.5M     34.7M     9.91M      8.14M
English Tokens   118M     87.5M    67.6M     36.7M     19.1M      14.8M

                 Bengali  Farsi    Telugu    Somali    Kannada    Pashto
Segments         59.9K    44.2K    50.6K     52.6K     34.5K      28.0K
Foreign Tokens   573K     477K     336K      318K      305K       208K
English Tokens   537K     459K     358K      325K      297K       218K
• Much more work needed!

Data Cleaning and Subsampling

• Not all data useful – some may be harmful
• Removing data based on
  – domain relevance
  – alignment quality
  – redundancy
  – bad language (orthography, non-words)
  – machine translated or poorly translated
• Removing bad data always reduces training time
• Removing bad data sometimes helps quality
• Clean data approach (only using high quality data) helps in limited domains

corpus crawling

Finding Monolingual Text

• Simple Idea
  1. Download many websites
  2. Extract text from HTML
  3. Guess language of text
  4. Add to corpus
  5. Profit
• Turns out all these steps are quite involved

Common Crawl

• Non-profit organization
• Data
  – publicly available on Amazon S3
  – e.g. January 2015: 140TB / 1.8B pages
• Crawler
  – Apache Nutch
  – collecting pre-defined list of URLs

extracting text

A Web Page

HTML Source

Method 1: Strip Tags

Method 2: HTML Parser

language detection

What Language?
Clues: Letter N-Grams

Example: langid.py

• Muitas intervenções alertaram
  – prediction: Portuguese
  – high confidence (-90.8)
• Muitas intervenções
  – prediction: Portuguese
  – fairly high confidence (-68.2)
• Muitas
  – prediction: English
  – low confidence (9.1)

Language Identification Tools

• langid.py (Lui & Baldwin, ACL 2012)
  – 1-4 grams, Naive Bayes, feature selection
• TextCat (based on Cavnar & Trenkle, 1994)
  – similar to langid.py
  – no feature selection
• Compact/Chromium Language Detector 2 (Google)
  – takes hints from TLD, metadata
  – super fast
  – detects spans of text

Detected Languages in CommonCrawl (Buck and Heafield, LREC 2014)

Most Common English Phrases

Benefit of Huge Language Models

bilingual corpus crawling

Mining Bilingual Text

• Bilingual text = same text in different languages
• Usually: one side translation of the other
• Full page or interface/content only
• Potentially translation on same page, e.g., Twitter, Facebook posts

Pipeline

1. Identify web sites worth crawling
2. Crawl web site
3. Language detection (as before)
4. Extract text from HTML (as before)
5. Align documents
6. Align sentences
7. Clean corpus

identify web sites

Targeted Crawling

• A few web sites with a lot of parallel text, e.g.,
  – European Union, e.g., proceedings of the European Parliament
  – Canadian Hansards
  – United Nations
  – Project Syndicate
  – TED Talks
  – Movie / TV show subtitles
  – Global Voices
• Hand-written tools
  – crawling
  – text extraction
  – document alignment
• A few days' effort per site

Broad Crawling

• Identify many web sites to crawl
  – has the phrase "This page in English" or variants
  – has link to language flag
  – known to have content in multiple languages (from CommonCrawl)
• Follow links
  – up to n links deep into site
  – up to n links in total
  – only follow links to web pages, not images, etc.
• Avoid crawling sites too deeply that do not have parallel text?
  (requires quick feedback from downstream processing)

document alignment

Document Alignment

• Early Work: STRAND (Resnik 1998, 1999)
  (Structural Translation Recognition, Acquiring Natural Data)
• Pipeline
  1. candidate generation
  2. candidate ranking
  3. filtering
  4. optional: sentence alignment
  5. evaluation

Link Structure

• Parent page: a page that links to different language versions

Parent Page Example

Sibling Page

• A page that links to its translation in another language

URL Matching

• Often URLs differ only slightly, often indicating language

xyz.com/en/        xyz.com/fr/
xyz.com/bla.htm    xyz.com/bla.htm?lang=FR
xyz.com/the_cat    xyz.fr/le_chat

Finding URL Patterns

• URLs with pattern =en

Count    Pattern
545875   lang=en
140420   lng=en
126434   LANG=en
110639   hl=en
 99065   language=en
 81471   tlng=en
 56968   l=en
 47504   locale=en
 33656   langue=en
 33503   lang=eng
 19421   uil=English
 15170   ln=en
 14242   Language=EN
 13948   lang=EN
 12108   language=english
 11997   lang=engcro
 11646   store=en

Finding URL Patterns

• URLs with pattern lang.*=.*

Count    Pattern
 13948   lang=EN
 13456   language=ca
 13098   switchlang=1
 12960   language=zh
 12890   lang=Spanish
 12471   lang=th
 12266   langBox=US
 12108   language=english
 12003   lang=cz
 11997   lang=engcro
 11635   lang=sl
 11578   lang=d
 11474   lang=lv
 11376   lang=NL
 11349   lang=croeng
 11244   lang=English

Document Length

• Extract texts and compare lengths (Smith 2001)
• Document or sentence level

Document Object Model

• Translated web pages often retain similar structure
• This includes links to the same images, etc.
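
The structural clue can be illustrated with a small sketch: each page is flattened into a sequence of tag tokens (with text replaced by a placeholder) and the two sequences are compared. This is only a minimal illustration under assumptions: the names `TagLinearizer`, `linearize`, and `structure_similarity` are hypothetical, and the comparison uses the matching-block ratio from Python's `difflib.SequenceMatcher` as a stand-in for an edit-distance alignment, not the exact procedure used in STRAND or any other system mentioned here.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagLinearizer(HTMLParser):
    """Flattens an HTML page into a list of structural tokens (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # The text itself differs between languages, so keep only a placeholder.
        if data.strip():
            self.tokens.append("TEXT")


def linearize(html):
    parser = TagLinearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def structure_similarity(html_a, html_b):
    """Ratio in [0, 1]; 1.0 means the linearized structures are identical."""
    matcher = SequenceMatcher(None, linearize(html_a), linearize(html_b), autojunk=False)
    return matcher.ratio()


# Toy pages, invented for illustration only.
english_page = "<html><body><h1>Hello</h1><p>The cat</p><img src='cat.png'></body></html>"
french_page = "<html><body><h1>Bonjour</h1><p>Le chat</p><img src='cat.png'></body></html>"
other_page = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"

print(structure_similarity(english_page, french_page))
print(structure_similarity(english_page, other_page))
```

The two translated toy pages share the same tag sequence and score higher than the unrelated page. A real system would likely also use the lengths of the text chunks, as in the Document Length slide.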
Linearized Structure

Levenshtein Alignment

Content Similarity

• Simple things
  – same numbers or names in documents
  – often quite effective
• Use of lexicon
  – treat documents as bag of words
  – consider how many words in EN document have translations in FR document
• A bit more complex
  – semantic representations of document content
  – bag of word vectors
  – neural network embeddings
• Major challenge: do this fast for n × m document pairs

Google’s Content Matching

• Basic idea: translate everything into English, match large n-grams
• For each non-English document:
  1. Translate everything to English using MT
  2. Find distinctive n-grams
     (a) rare, but not too rare (5-grams)
     (b) used for matching only
• Build inverted index: n-gram → documents

[cat sat on] → {[doc1, ES], [doc3, DE], ...}
[on the mat] → {[doc1, ES], [doc2, FR], ...}

Matching using Inverted Index

[cat sat on]   → {[doc1, ES], [doc3, DE], ...}
[on the mat]   → {[doc1, ES], [doc2, ES], ...}
[on the table] → {[doc3, DE]}

• For each n-gram
  – generate all pairs where:
    ∗ document list short (≤ 50)
    ∗ source language different
• Result: [doc1, doc3], ...
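
The matching step above can be sketched in a few lines of Python. This is a minimal illustration, not Google's implementation: the function names, the toy documents, and the use of 3-grams (the slides mention 5-grams) are assumptions made to keep the example small, and the selection of "rare, but not too rare" n-grams is reduced to the document-list length cutoff.

```python
from collections import defaultdict
from itertools import combinations

MAX_DOCS_PER_NGRAM = 50  # cutoff from the slide: ignore n-grams with long document lists


def ngrams(tokens, n):
    """Set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_inverted_index(docs, n=3):
    """docs maps doc_id -> (source_language, tokens of the English translation)."""
    index = defaultdict(set)
    for doc_id, (lang, tokens) in docs.items():
        for gram in ngrams(tokens, n):
            index[gram].add((doc_id, lang))
    return index


def candidate_pairs(index, max_docs=MAX_DOCS_PER_NGRAM):
    """Document pairs that share an n-gram and come from different source languages."""
    pairs = set()
    for postings in index.values():
        if len(postings) > max_docs:
            continue  # n-gram too common to be distinctive
        for (d1, lang1), (d2, lang2) in combinations(sorted(postings), 2):
            if lang1 != lang2:
                pairs.add((d1, d2))
    return pairs


# Toy documents mirroring the slide example (already "translated" into English).
docs = {
    "doc1": ("ES", "the cat sat on the mat".split()),
    "doc2": ("ES", "a dog lay on the mat".split()),
    "doc3": ("DE", "the cat sat on the table".split()),
}
pairs = candidate_pairs(build_inverted_index(docs))
print(pairs)
```

Here doc1 and doc2 share the n-gram "on the mat" but have the same source language, so only the pair (doc1, doc3) survives as a candidate. Generating pairs only from short document lists is what keeps this step far cheaper than comparing all n × m document pairs.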
Scoring using Forward Index

• Forward index maps documents to n-grams
• For each document pair [d1, d2]
  – collect scoring n-grams for both documents
  – build IDF-weighted vector
  – score: cosine similarity

Scoring Document Pairs

• Given
\[ \mathrm{ngrams}(d_1) = n_1, n_2, \dots, n_r \qquad \mathrm{ngrams}(d_2) = n_1, n_2, \dots, n_r \]
• Inverse document frequency
\[ \mathrm{idf}(n) = \log \frac{|D|}{\mathrm{df}(n)} \]
  where: |D| = number of documents, df(n) = number of documents with n
• Scoring of IDF-weighted vectors v
\[ v_{1,x} = \begin{cases} \mathrm{idf}(n_x) & \text{if } n_x \in \mathrm{ngrams}(d_1) \\ 0 & \text{otherwise} \end{cases} \]
\[ v_{2,x} = \begin{cases} \mathrm{idf}(n_x) & \text{if } n_x \in \mathrm{ngrams}(d_2) \\ 0 & \text{otherwise} \end{cases} \]
\[ \mathrm{score}(d_1, d_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|} \]

sentence alignment

Sentence Alignment

• Much early work in 1990s, e.g., Gale and Church (1991)
  – find sequence of 1-1, 1-2, 0-1, etc., sentence alignment groups
  – good element in sequence = similar number of words
  – dynamic programming search for best sequence
• Featurized alignments
  – with dictionary (Hunalign)
  – with induced dictionary (Gargantua)
  – consider tags such as
• Sensitive to noise: often large parts of page not translated

Filtering Bad Data

• Mismatched sentence pairs from errors in pipeline
• Non-literal translation, e.g., news stories are notoriously non-literal
• Bad translations
• Machine translation
  – much of the parallel text on the Internet generated by Google Translate
  – detection hard: looks like very clean parallel data
  – maybe too clean (little reordering, very literal)
  – watermarking machine translation (Venugopal et al., 2011)
• How clean should it be?
  – trade-off between precision and recall unclear

Open Challenges

• Currently serious attempt at broad crawling for parallel data at JHU
• Major challenges
  – crawling (just using standard tool)
  – document alignment (major research topic)
    → shared task at WMT 2016 machine translation conference
  – sentence alignment (just using standard tool)
  – detection of machine translated text (some old work)
  – filtering out bad sentence pairs (major research topic)
    → shared task at WMT 2018 machine translation conference
• JHU efforts: continuously processing terabytes of data
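
As a worked companion to the Scoring Document Pairs slide, the IDF-weighted cosine score can be sketched as follows. This is a minimal sketch under assumptions: the function names and the toy collection are hypothetical, n-grams are represented as plain strings, and the forward index is simply a dictionary from document id to its set of n-grams.

```python
import math


def idf(ngram, doc_ngrams):
    """idf(n) = log(|D| / df(n)); assumes the n-gram occurs in at least one document."""
    df = sum(1 for grams in doc_ngrams.values() if ngram in grams)
    return math.log(len(doc_ngrams) / df)


def score(d1, d2, doc_ngrams):
    """Cosine similarity of the IDF-weighted n-gram vectors of documents d1 and d2."""
    grams1, grams2 = doc_ngrams[d1], doc_ngrams[d2]
    weights = {g: idf(g, doc_ngrams) for g in grams1 | grams2}
    # Only n-grams present in both documents contribute to the dot product.
    dot = sum(weights[g] ** 2 for g in grams1 & grams2)
    norm1 = math.sqrt(sum(weights[g] ** 2 for g in grams1))
    norm2 = math.sqrt(sum(weights[g] ** 2 for g in grams2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # all n-grams occur in every document, so they carry no weight
    return dot / (norm1 * norm2)


# Toy forward index (document -> n-grams), invented for illustration only.
doc_ngrams = {
    "doc1": {"cat sat on", "on the mat"},
    "doc2": {"on the mat"},
    "doc3": {"cat sat on", "on the table"},
    "doc4": {"a b c"},
}
print(score("doc1", "doc3", doc_ngrams))
```

In this toy collection doc1 and doc3 share one n-gram out of two each, and the unshared n-gram of doc3 is rarer and therefore weighted more heavily, which pulls the score below 0.5.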