Data for MT
PV061

Pavel Rychlý
NLP Centre, FI MU
29 Nov 2023


Common Crawl

    https://commoncrawl.org/
    around 10 crawls per year (~85 in total since 2013)
    90 TB of compressed data each
        60-80 TB    2020-2022
        40-60 TB    2017-2019
        30-50 TB    2014-2016
    each file ~1 GB
    raw data - WARC format (also WET - text)
    textual content only: HTML, PDF, XML, ...
    data accessible from https://data.commoncrawl.org/[...]


ClueWeb09

    http://lemurproject.org/clueweb09/
    crawled in January and February 2009
    1 billion web pages, in 10 languages
    5 TB compressed (25 TB uncompressed)
    distributed on 8 TB hard disk ($380)


WARC format

    raw data from web servers
    includes response headers, HTML, JavaScript, ...

    WARC/1.0
    WARC-Type: response
    WARC-Date: 2014-08-02T09:52:13Z
    WARC-Record-ID:
    Content-Length: 43428
    Content-Type: application/http; msgtype=response
    WARC-Warcinfo-ID:
    WARC-Concurrent-To:
    WARC-IP-Address: 212.58.244.61
    WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
    WARC-Truncated: length

    HTTP/1.1 200 OK
    Content-Type: text/html
    Date: Sat, 02 Aug 2014 09:52:13 GMT
    Set-Cookie: BBC-UID=15730d9c1b741c0d3942e2aca1317fbf39e57b90be68a32

    BBC NEWS | Africa | Namibia braces for Nujoma exit


Original web page

    [figure]


WET format

    WARC/1.0
    WARC-Type: conversion
    WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
    WARC-Date: 2014-08-02T09:52:13Z
    WARC-Block-Digest: sha1:JROHLCS5SKMBR6XY46WXREW7RXM64EJC
    Content-Type: text/plain
    Content-Length: 6724

    BBC NEWS | Africa | Namibia braces for Nujoma exit
    [an error occurred while processing this directive]
    Low graphics|Accessibility help
    One-Minute World News
    News services
    Your news when you want it
    News Front Page
    Africa
    Americas
    Asia-Pacific
    Europe
    Middle East
    South Asia
    UK
    Business
    Health
    Science & Environment
    Technology
    Entertainment
    Also in the news
    -----------------
    Video and Audio
    -----------------
    Programmes


Corpus Tools

    https://corpus.tools/
    JusText - removing boilerplate
    Chared - detecting character encoding
    onion - removing duplicate parts
    SpiderLing - web spider for linguistics


OPUS corpus

    https://opus.nlpl.eu/
    ~70 sources (subcorpora)
    700+ languages
        dng (Dungan): 6 sentences, 25 tokens
    Tatoeba
        (short) translated sentences for language learning
        ~400 languages, 12M segments
    WikiMatrix (Facebook)
        from Wikipedia
        85 languages, 135M segments


Corpora from Common Crawl

    CCNet: 130 languages, monolingual
        fastText for language identification
        (statistical) language modeling for filtering
    CCAligned
        113 languages, 2.3G segments
        aligned to English only
        similarity of sentence embeddings (LASER)
    CCMatrix
        from CCNet
        90 languages (1200 pairs), 7.4G segments
    MultiCCAligned
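

The WARC and WET records on the format slides share one layout: a version line, "Name: value" header lines, a blank line, then the payload. The following is a minimal sketch of splitting a single record, assuming the record is already available as a decoded string. The function name parse_record is hypothetical and the sample payload is cut to its first line for brevity, so it does not match the Content-Length header. Real crawl files are compressed and hold many records, so a dedicated WARC reader would normally be used instead.

```python
def parse_record(record: str):
    """Split one WARC/WET record into (version, headers, body).

    Illustrative helper only; not part of any library.
    """
    # WARC files use CRLF line endings; normalise so both forms work.
    record = record.replace("\r\n", "\n")
    # The header block and the payload are separated by the first blank line.
    head, _, body = record.partition("\n\n")
    lines = head.splitlines()
    version = lines[0].strip()  # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        # Split at the first colon only; values such as dates and URLs
        # contain further colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers, body


# Header values copied from the WET slide; payload shortened to one line.
sample = (
    "WARC/1.0\n"
    "WARC-Type: conversion\n"
    "WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm\n"
    "WARC-Date: 2014-08-02T09:52:13Z\n"
    "Content-Type: text/plain\n"
    "Content-Length: 6724\n"
    "\n"
    "BBC NEWS | Africa | Namibia braces for Nujoma exit\n"
)

version, headers, body = parse_record(sample)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"])
```

The same function applies to the outer headers of a "response" record; there the body would begin with the HTTP status line and HTTP headers, which could be split off with a second pass of the same kind.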
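

The "similarity of sentence embeddings" item can be illustrated with a toy example: each sentence is represented by a vector, and a source sentence is paired with the target sentence whose vector is most similar. This is only a sketch under stated assumptions. The vectors below are made up, the function names cosine and align and the threshold value are hypothetical, and plain cosine similarity with a fixed cut-off is a simplification chosen for this sketch, not a description of how the listed corpora were actually built.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align(src_vecs, tgt_vecs, threshold=0.8):
    """For each source vector, find its most similar target vector.

    Returns (src_index, tgt_index, score) triples, keeping only pairs
    whose score reaches the (hypothetical) threshold.
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_score = max(
            ((j, cosine(u, v)) for j, v in enumerate(tgt_vecs)),
            key=lambda item: item[1],
        )
        if best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


# Made-up 3-dimensional "embeddings"; real sentence embeddings
# have far more dimensions and come from a trained encoder.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
tgt = [[0.0, 2.0, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, -1.0]]

pairs = align(src, tgt)
for i, j, score in pairs:
    print(i, j, round(score, 3))
```

The threshold is what turns a nearest-neighbour search into a filter: a source sentence with no sufficiently similar target is left unaligned rather than forced into a poor pair.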