Corpus Acquisition from the Internet
Philipp Koehn
partially based on slides from Christian Buck
13 November 2018
Philipp Koehn Machine Translation: Corpus Acquisition from the Internet 13 November 2018
Big Data
For many language pairs, lots of text available.
• Text you read in your lifetime: 300 million words
• Translated text available: billions of words
• English text available: trillions of words
Mining the Web
• Largest source for text: the World Wide Web
– publicly available crawl of the web
– hosted by Amazon Web Services, but can be downloaded
– regularly updated (semi-annual)
– 2-4 billion web pages per crawl
• Currently filling up hard drives in our lab
Monolingual Data
• Starting point: 35TB of text
• Processing pipeline [Buck et al., 2014]
– language detection
– deduplication
– normalization of Unicode characters
– sentence splitting
• Obtained corpora
Language  Lines (B)  Tokens (B)  Bytes      BLEU (WMT)
English   59.13      975.63      5.14 TB
German     3.87       51.93      317.46 GB  +0.5
Spanish    3.50       62.21      337.16 GB
French     3.04       49.31      273.96 GB  +0.6
Russian    1.79       21.41      220.62 GB  +1.2
Czech      0.47        5.79      34.67 GB   +0.6
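The normalization, sentence splitting, and deduplication steps of such a pipeline can be sketched in a few lines. This is a minimal illustration, not the pipeline of Buck et al.: the function names are hypothetical, the regex splitter is a naive stand-in for a real sentence splitter, deduplication is done per sentence after splitting, and language detection is left out.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def deduplicate(lines):
    """Yield each distinct line once; hashes are stored instead of full lines."""
    seen = set()
    for line in lines:
        digest = hashlib.sha1(line.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


def build_corpus(documents):
    """Normalize, split, and deduplicate a collection of raw text documents."""
    sentences = (s for doc in documents for s in split_sentences(normalize(doc)))
    return list(deduplicate(sentences))
```

At web scale the set of seen hashes no longer fits in memory on one machine, so a real system would shard or sort the data instead.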
Parallel Data
• Basic processing pipeline [Smith et al., 2013]
– find parallel web pages (based on URL only)
– align document by HTML structure
– sentence splitting and tokenization
– sentence alignment
– filtering (remove boilerplate)
• Obtained corpora
Language        French   German  Spanish  Russian  Japanese  Chinese
Segments        10.2M    7.50M   5.67M    3.58M    1.70M     1.42M
Foreign Tokens  128M     79.9M   71.5M    34.7M    9.91M     8.14M
English Tokens  118M     87.5M   67.6M    36.7M    19.1M     14.8M
Language        Bengali  Farsi   Telugu   Somali   Kannada   Pashto
Segments        59.9K    44.2K   50.6K    52.6K    34.5K     28.0K
Foreign Tokens  573K     477K    336K     318K     305K      208K
English Tokens  537K     459K    358K     325K     297K      218K
• Much more work needed!
Data Cleaning and Subsampling
• Not all data useful – some may be harmful
• Removing data based on
– domain relevance
– alignment quality
– redundancy
– bad language (orthography, non-words)
– machine translated or poorly translated
• Removing bad data always reduces training time
• Removing bad data sometimes helps quality
• Clean data approach (only using high quality data) helps in limited domains
corpus crawling
Finding Monolingual Text
• Simple Idea
1. Download many websites
2. Extract text from HTML
3. Guess language of text
4. Add to corpus
5. Profit
• Turns out all these steps are quite involved
Common Crawl
• Non-profit organization
• Data
– publicly available on Amazon S3
– e.g. January 2015: 140TB / 1.8B pages
• Crawler
– Apache Nutch
– collecting pre-defined list of URLs
extracting text
A Web Page
HTML Source
Method 1: Strip Tags
Method 2: HTML Parser
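The two extraction methods named on these slides can be contrasted with a small sketch using only the Python standard library. The function and class names are hypothetical, and the choice to skip `script` and `style` content in the parser variant is an assumption about what a reasonable extractor would do.

```python
import re
from html.parser import HTMLParser


def strip_tags(html):
    """Method 1: delete everything that looks like a tag, keep the rest."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


class TextExtractor(HTMLParser):
    """Method 2: parse the page and keep only visible text nodes."""

    SKIP = {"script", "style"}  # assumed set of non-visible elements

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def parse_text(html):
    """Run the parser-based extractor over one page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Tag stripping keeps JavaScript and CSS source as if it were text, while the parser can drop it because it knows which element each text node belongs to.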
language detection
What Language?
Clues: Letter N-Grams
Example: langid.py
• Muitas intervenções alertaram
– prediction: Portuguese
– high confidence (-90.8)
• Muitas intervenções
– prediction: Portuguese
– fairly high confidence (-68.2)
• Muitas
– prediction: English
– low confidence (9.1)
Language Identification Tools
• langid.py (Lui & Baldwin, ACL 2012)
– 1-4 grams, NaiveBayes, Feature Selection
• TextCat (based on Cavnar & Trenkle, 1994)
– similar to langid.py
– no Feature Selection
• Compact/Chromium Language Detector 2 (Google)
– takes hints from TLD and metadata
– super fast
– detects spans of text
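The idea behind letter n-gram classifiers can be illustrated with a toy Naive Bayes model over character 1-4 grams. This is a sketch only: it is not the langid.py implementation, it has no feature selection, and the class name and add-one smoothing are assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, max_n=4):
    """Return all character n-grams of length 1..max_n, with padding spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(padded) - n + 1)]


class NgramLangId:
    """Toy Naive Bayes language identifier with add-one smoothing."""

    def __init__(self, max_n=4):
        self.max_n = max_n
        self.counts = {}   # language -> Counter of n-grams
        self.totals = {}   # language -> total number of n-grams seen
        self.vocab = set()

    def train(self, lang, text):
        grams = char_ngrams(text, self.max_n)
        self.counts.setdefault(lang, Counter()).update(grams)
        self.totals[lang] = self.totals.get(lang, 0) + len(grams)
        self.vocab.update(grams)

    def score(self, lang, text):
        """Log-probability of the text under the n-gram model of one language."""
        counts, total, v = self.counts[lang], self.totals[lang], len(self.vocab)
        return sum(math.log((counts[g] + 1) / (total + v))
                   for g in char_ngrams(text, self.max_n))

    def classify(self, text):
        return max(self.counts, key=lambda lang: self.score(lang, text))
```

Scores are sums of log-probabilities, so they become more negative as the input gets longer. Very short inputs provide few n-grams, which is why single words are often misclassified.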
Detected Languages in CommonCrawl
(Buck and Heafield, LREC 2014)
Most Common English Phrases
Benefit of Huge Language Models
bilingual corpus crawling
Mining Bilingual Text
• Bilingual text = same text in different languages
• Usually: one side translation of the other
• Full page or interface/content only
• Potentially translation on same page
e.g., Twitter, Facebook posts
Pipeline
1. Identify web sites worth crawling
2. Crawl web site
3. Language detection — as before
4. Extract text from HTML — as before
5. Align documents
6. Align sentences
7. Clean corpus
identify web sites
Targeted Crawling
• A few web sites with a lot of parallel text, e.g.,
– European Union, e.g., proceedings of the European Parliament
– Canadian Hansards
– United Nations
– Project Syndicate
– TED Talks
– Movie / TV show subtitles
– Global Voices
• Hand-written tools
– crawling
– text extraction
– document alignment
• Few days effort per site
Broad Crawling
• Identify many web sites to crawl
– has the phrase "This page in English" or variants
– has link to language flag
– known to have content in multiple languages (from CommonCrawl)
• Follow links
– up to n links deep into site
– up to n links in total
– only follow links to web pages, not images, etc.
• Avoid crawling too deeply into sites that have no parallel text?
(requires quick feedback from downstream processing)
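The link-following limits above amount to a breadth-first crawl with a depth cap and a page budget. The sketch below assumes a caller-supplied `get_links` function (hypothetical) that returns the outgoing links of a page, so it can be tried on an in-memory site map without network access. The extension list used to skip non-page resources is also an assumption.

```python
from collections import deque

# Assumed list of extensions that indicate non-page resources.
SKIP_EXTENSIONS = (".jpg", ".png", ".gif", ".pdf", ".css", ".js")


def crawl(start, get_links, max_depth=3, max_pages=100):
    """Breadth-first crawl from `start`, bounded in depth and total pages."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links any deeper into the site
        for link in get_links(url):
            if link in seen or link.lower().endswith(SKIP_EXTENSIONS):
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

Breadth-first order means that when the page budget runs out, the pages closest to the entry point have been fetched first.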
document alignment
Document Alignment
• Early Work: STRAND (Resnik 1998, 1999)
(Structural Translation Recognition, Acquiring Natural Data)
• Pipeline
1. candidate generation
2. candidate ranking
3. filtering
4. optional: sentence alignment
5. evaluation
Link Structure
• Parent page: a page that links to different language versions
Parent Page Example
Sibling Page
• A page that links to its translation in another language
URL Matching
• URLs of translated pages often differ only slightly, usually in a language marker
xyz.com/en/ xyz.com/fr/
xyz.com/bla.htm xyz.com/bla.htm?lang=FR
xyz.com/the cat xyz.fr/le chat
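A simple way to exploit such URL pairs is to replace language markers with a placeholder and group URLs that collapse to the same key. The sketch below is illustrative only: the language list and the query parameter names are assumptions, and pairs whose paths are themselves translated (as in the last example above) cannot be found this way.

```python
import re
from collections import defaultdict

LANGS = "en|fr|de|es"  # hypothetical language codes, extend as needed

# A path segment that is exactly a language code, e.g. /en/ or a trailing /fr
_PATH = re.compile(rf"/(?:{LANGS})(?=/|$)")
# A query parameter that selects the language, e.g. ?lang=fr
_QUERY = re.compile(rf"[?&](?:lang|language|hl)=(?:{LANGS})(?=&|$)")


def url_key(url):
    """Map a URL to a language-independent key."""
    url = url.lower()
    url = _QUERY.sub("", url)
    return _PATH.sub("/*", url)


def candidate_groups(urls):
    """Group URLs that share a key; each group is a set of candidate translations."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_key(url)].append(url)
    return [group for group in groups.values() if len(group) > 1]
```

For example, `xyz.com/en/` and `xyz.com/fr/` both map to `xyz.com/*/`, and `xyz.com/bla.htm?lang=FR` maps to the same key as `xyz.com/bla.htm`.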
Finding URL Patterns
• URLs with pattern =en
Count Pattern
545875 lang=en
140420 lng=en
126434 LANG=en
110639 hl=en
99065 language=en
81471 tlng=en
56968 l=en
47504 locale=en
33656 langue=en
33503 lang=eng
19421 uil=English
15170 ln=en
14242 Language=EN
13948 lang=EN
12108 language=english
11997 lang=engcro
11646 store=en
Finding URL Patterns
• URLs with pattern lang.*=.*
Count Pattern
13948 lang=EN
13456 language=ca
13098 switchlang=1
12960 language=zh
12890 lang=Spanish
12471 lang=th
12266 langBox=US
12108 language=english
12003 lang=cz
11997 lang=engcro
11635 lang=sl
11578 lang=d
11474 lang=lv
11376 lang=NL
11349 lang=croeng
11244 lang=English
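Counts like these can be collected by scanning a URL list for query parameters whose name contains `lang`. The sketch below approximates the `lang.*=.*` pattern, restricted to a single `key=value` parameter. The regular expression and function name are assumptions for illustration, not the script used to produce the tables.

```python
import re
from collections import Counter

# One query parameter whose key contains "lang", captured as key=value.
PATTERN = re.compile(r"[?&]([^=&#]*lang[^=&#]*=[^&#]*)", re.IGNORECASE)


def count_patterns(urls):
    """Count language-like key=value query parameters across a list of URLs."""
    counts = Counter()
    for url in urls:
        counts.update(PATTERN.findall(url))
    return counts
```

Sorting the result with `most_common()` gives a frequency table in the same shape as the ones above.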
Document Length
• Extract texts and compare lengths (Smith 2001)
• Document or sentence level
Document Object Model
• Translated web pages often retain similar structure
• This includes links to the same images, etc.
Linearized Structure
Levenshtein Alignment
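Structural comparison can be sketched by linearizing each page into a sequence of tag tokens, with text chunks reduced to a placeholder, and computing the Levenshtein distance between the two sequences. The token format and the normalization into a similarity score are assumptions for this example. STRAND itself uses a richer representation that also accounts for chunk lengths.

```python
from html.parser import HTMLParser


class TagLinearizer(HTMLParser):
    """Turn a page into a flat token sequence of tags and TEXT placeholders."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.tokens.append("TEXT")


def linearize(html):
    parser = TagLinearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def edit_distance(a, b):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def structure_similarity(html1, html2):
    """1.0 for identical structure, lower as more edits are needed."""
    a, b = linearize(html1), linearize(html2)
    return 1 - edit_distance(a, b) / (max(len(a), len(b)) or 1)
```

Two pages that differ only in the wording of their text nodes get a similarity of 1.0, which is the signal that makes structure useful for finding translations.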
Content Similarity
• Simple things
– same numbers or names in documents
– often quite effective
• Use of lexicon
– treat documents as bag of words
– consider how many words in EN document have translations in FR document
• A bit more complex
– semantic representations of document content
– bag of word vectors
– neural network embeddings
• Major challenge: do this fast for n × m document pairs
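The lexicon-based variant can be written as a bag-of-words coverage score: the fraction of distinct source words that have at least one dictionary translation in the target document. The function name and the lexicon format (a dict from a source word to a set of translations) are assumptions for this sketch.

```python
def lexicon_overlap(source_tokens, target_tokens, lexicon):
    """Fraction of distinct source words with a translation in the target."""
    source_vocab = set(source_tokens)
    target_vocab = set(target_tokens)
    if not source_vocab:
        return 0.0
    covered = sum(1 for word in source_vocab
                  if lexicon.get(word, set()) & target_vocab)
    return covered / len(source_vocab)
```

Scoring one pair is cheap, but applying it to all n × m pairs is not, which is why candidate generation has to narrow down the pairs first.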
Google's Content Matching
• Basic idea: translate everything into English, match large n-grams
• For each non-English document:
1. Translate everything to English using MT
2. Find distinctive ngrams
(a) rare, but not too rare (5-grams)
(b) used for matching only
• Build inverted index: ngram → documents
[cat sat on] → {[doc1, ES], [doc3, DE], ...}
[on the mat] → {[doc1, ES], [doc2, FR], ...}
Matching using Inverted Index
[cat sat on] -> {[doc1, ES], [doc3, DE], ...}
[on the mat] -> {[doc1, ES], [doc2, ES], ...}
[on the table] -> {[doc3, DE]}
• For each n-gram
– generate all pairs where:
∗ document list short (≤ 50)
∗ source language different
• Result: [doc1, doc3], ...
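Candidate generation from the inverted index can be sketched as follows. The data layout (a dict from document id to a language tag and a set of n-grams) and the function names are assumptions. The threshold of 50 is the one given above.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(docs):
    """docs: {doc_id: (language, set of n-grams)} -> {n-gram: [(doc_id, language)]}"""
    index = defaultdict(list)
    for doc_id, (lang, ngrams) in docs.items():
        for ngram in ngrams:
            index[ngram].append((doc_id, lang))
    return index


def candidate_pairs(index, max_list=50):
    """Pair documents that share an n-gram and have different source languages."""
    pairs = set()
    for postings in index.values():
        if len(postings) > max_list:
            continue  # n-gram too common to be distinctive
        for (d1, l1), (d2, l2) in combinations(postings, 2):
            if l1 != l2:
                pairs.add(tuple(sorted((d1, d2))))
    return pairs
```

Skipping long document lists keeps the number of generated pairs manageable, since a list of length k yields on the order of k² pairs.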
Scoring using Forward Index
• Forward index maps documents to n-grams
• For each document pair [d1, d2]
– collect scoring n-grams for both documents
– build IDF-weighted vector
– score: cosine similarity
Scoring Document Pairs
• Given $\mathrm{ngrams}(d_1) = n_1, n_2, \ldots, n_r$
$\mathrm{ngrams}(d_2) = n_1, n_2, \ldots, n_r$
• Inverse document frequency
$\mathrm{idf}(n) = \log \frac{|D|}{\mathrm{df}(n)}$
where: $|D|$ = number of documents
$\mathrm{df}(n)$ = number of documents with $n$
• Scoring of IDF-weighted vectors $v$
$v_{1,x} = \mathrm{idf}(n_x)$ if $n_x \in \mathrm{ngrams}(d_1)$, 0 otherwise
$v_{2,x} = \mathrm{idf}(n_x)$ if $n_x \in \mathrm{ngrams}(d_2)$, 0 otherwise
$\mathrm{score}(d_1, d_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$
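The score can be computed without materializing the vectors, because only n-grams present in both documents contribute to the dot product. In this sketch the document-frequency table `df` and the document count `num_docs` are passed in by the caller; these names are assumptions.

```python
import math


def idf(ngram, df, num_docs):
    """Inverse document frequency: log(|D| / df(n))."""
    return math.log(num_docs / df[ngram])


def score(ngrams1, ngrams2, df, num_docs):
    """Cosine similarity of the IDF-weighted n-gram vectors of two documents."""
    # Components outside the intersection are zero in one of the two vectors.
    dot = sum(idf(n, df, num_docs) ** 2 for n in ngrams1 & ngrams2)
    norm1 = math.sqrt(sum(idf(n, df, num_docs) ** 2 for n in ngrams1))
    norm2 = math.sqrt(sum(idf(n, df, num_docs) ** 2 for n in ngrams2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # no informative n-grams in at least one document
    return dot / (norm1 * norm2)
```

An n-gram that occurs in every document has idf 0 and therefore no influence on the score, while rare shared n-grams dominate it.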
sentence alignment
Sentence Alignment
• Much early work in 1990s, e.g., Gale and Church (1991)
– find sequence of 1-1, 1-2, 0-1, etc., sentence alignment groups
– good element in sequence = similar number of words
– dynamic programming search for best sequence
• Featurized alignments
– with dictionary (Hunalign)
– with induced dictionary (Gargantua)
– consider tags such as <P>
• Sensitive to noise — often large parts of page not translated
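The dynamic programming search over alignment groups can be sketched with a simplified length-based cost. This is not the Gale and Church model, which uses a probabilistic cost over character lengths. Here the cost is the relative difference in word counts plus a fixed penalty per group type, and the penalty values are hypothetical.

```python
# Allowed alignment groups (source sentences, target sentences) and
# hypothetical penalties that favor 1-1 groups.
BEADS = {(1, 1): 0.0, (2, 1): 1.0, (1, 2): 1.0, (1, 0): 3.0, (0, 1): 3.0}


def length_cost(src, tgt):
    """Relative difference in word count between two groups of sentences."""
    a = sum(len(s.split()) for s in src)
    b = sum(len(t.split()) for t in tgt)
    return abs(a - b) / max(a, b, 1)


def align(src, tgt):
    """Return the lowest-cost sequence of (source group, target group) pairs."""
    inf = float("inf")
    n, m = len(src), len(tgt)
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + penalty + length_cost(src[i:ni], tgt[j:nj])
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end to recover the alignment groups.
    groups, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return groups[::-1]
```

Because the cost looks only at lengths, a long untranslated stretch on one side can pull the whole alignment off course, which is the noise sensitivity noted above.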
Filtering Bad Data
• Mismatched sentence pairs from errors in pipeline
• Non-literal translation
e.g. news stories are notoriously non-literal
• Bad translations
• Machine translation
– much of the parallel text on the Internet generated by Google Translate
– detection hard — looks like very clean parallel data
– maybe too clean (little reordering, very literal)
– watermarking machine translation (Venugopal et al., 2011)
• How clean should it be?
– trade-off between precision and recall unclear
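Some of the simpler problems can be caught with rule-based filters. The sketch below drops empty or overlong pairs, pairs with an implausible length ratio, untranslated copies, and exact duplicates. The thresholds and the function name are hypothetical, and none of these rules detects machine-translated text.

```python
def filter_pairs(pairs, max_ratio=2.0, max_len=100):
    """Yield sentence pairs that pass a few simple sanity checks."""
    seen = set()
    for src, tgt in pairs:
        s_len, t_len = len(src.split()), len(tgt.split())
        if s_len == 0 or t_len == 0:
            continue  # one side is empty
        if s_len > max_len or t_len > max_len:
            continue  # too long, likely a sentence-splitting failure
        if max(s_len, t_len) / min(s_len, t_len) > max_ratio:
            continue  # lengths too different, likely misaligned
        if src == tgt:
            continue  # untranslated copy
        if (src, tgt) in seen:
            continue  # exact duplicate
        seen.add((src, tgt))
        yield src, tgt
```

Tightening the thresholds raises precision at the cost of recall, which is the trade-off raised in the last bullet.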
Open Challenges
• A serious attempt at broad crawling for parallel data is currently under way at JHU
• Major challenges
– crawling (just using standard tool)
– document alignment (major research topic)
→ shared task at WMT 2016 machine translation conference
– sentence alignment (just using standard tool)
– detection of machine translated text (some old work)
– filtering out bad sentence pairs (major research topic)
→ shared task at WMT 2018 machine translation conference
• JHU efforts: continuously processing terabytes of data