Webcrawl Parallel Corpus German-English 2015
Are you looking for parallel texts to train your statistical translation engines? Do you want to find domain-relevant terminology? Do you want to boost matches in your TMs?
We can offer you 10 million German-English parallel sentences:
- parallel sentence pairs crawled from the internet
- elaborate multi-step quality filtering, including language identification filter, machine translation filter, grammaticality filter etc.
- no duplicate sentence pairs
- no overlap with existing publicly available corpora like europarl, DGT-TM, etc. (see full list)
- web pages have been categorized for subject area (see distribution of subject areas)
- crawled between 10/2013 and 05/2015 – includes up-to-date terminology
- available in TMX and Moses format
International Standard Language Resource Number (ISLRN): 800-190-274-236-9
Identifying multilingual web sites: We crawled millions of hosts using Apache Nutch. Sites with pages in English as well as German were fully crawled. For each English page, we identified the most similar German page using URL pattern matching as well as dictionary-based content comparison.
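The URL pattern matching step could look roughly like the following sketch. It is an illustration only, not the actual implementation: the marker pattern and the function names `url_key` and `pair_by_url` are assumptions, and the dictionary-based content comparison is not covered.

```python
import re

# Hypothetical language markers: "en" or "de" not surrounded by other letters,
# as in /en/, _de.html or de.example.com. Real sites use many more conventions.
# Note that this also matches a ".de" top-level domain, which is tolerable here
# because both URLs of a pair are normalized in the same way.
_LANG_MARKER = re.compile(r"(?<![a-z])(en|de)(?![a-z])")

def url_key(url):
    """Replace language markers in the lowercased URL with a placeholder."""
    return _LANG_MARKER.sub("*", url.lower())

def pair_by_url(english_urls, german_urls):
    """Pair each English URL with a German URL that has the same key."""
    german_by_key = {url_key(u): u for u in german_urls}
    pairs = []
    for en in english_urls:
        de = german_by_key.get(url_key(en))
        if de is not None:
            pairs.append((en, de))
    return pairs
```

Candidate pairs found this way would still need to be confirmed by comparing the page contents.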
Content extraction and linguistic annotation: All candidate pages were processed with our linguistic text analysis tool LinA to extract the textual content from the HTML, PDF, or Office formats, followed by sentence splitting, tokenization, part-of-speech tagging, and lemmatization.
Identifying parallel pages and sentence alignment: For each prospective English-German page pair we searched for parallel segments in the pages, again using our large English-German dictionary. If parallel text segments were found, these were fed into a sentence aligner to get pairs of parallel sentences.
ParallelnessFilter: This filter checks whether the sentence pairs found by the sentence aligner really are translations of each other. The filter is implemented as a machine-learning classifier that relies on features like word alignment and dictionary-based word overlap. It achieves an accuracy of 94%.
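One of the features mentioned above, dictionary-based word overlap, can be illustrated with a minimal sketch. The function name, the toy dictionary layout (lowercased source word mapped to a set of translations), and the exact definition of the score are assumptions made for this example; the classifier itself is not shown.

```python
def dictionary_overlap(src_tokens, tgt_tokens, dictionary):
    """Fraction of source tokens that have a dictionary translation in the target."""
    if not src_tokens:
        return 0.0
    target_words = {t.lower() for t in tgt_tokens}
    covered = sum(
        1 for s in src_tokens
        if dictionary.get(s.lower(), set()) & target_words
    )
    return covered / len(src_tokens)

# Toy German-English dictionary (hypothetical data for illustration).
toy_dictionary = {
    "das": {"the"},
    "haus": {"house"},
    "ist": {"is"},
    "klein": {"small", "little"},
}
```

A true translation pair should score close to 1.0, while unrelated sentences typically share only a few function words.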
GarbageFilter: This filter assigns a quality score to each sentence. Grammatical sentences receive a high score, whereas sentences that contain lists, garbled or unknown words, weird characters, and the like receive a low score. If the score is below a threshold, the sentence is discarded. The filtering process is explained in more detail below.
In a first step, all sentences which are shorter than 5 tokens or longer than 60 tokens are discarded.
The second step employs a classifier that we trained using supervised machine learning. We manually collected more than 5,000 prototypical examples for "good" and "bad" sentences, respectively. A feature extractor is applied to represent each sentence by a dozen features including type-token ratio, number of non-letter characters, number of upper-case words, and part-of-speech n-grams. Then, a classifier is trained on this data. The classifier achieves an accuracy of 97%, as evaluated by 10-fold cross-validation.
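The two steps can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, only three of the roughly a dozen features are computed, the part-of-speech n-grams are omitted because they require a tagger, and the trained classifier is not shown.

```python
def passes_length_filter(tokens, min_tokens=5, max_tokens=60):
    """Step 1: keep only sentences with 5 to 60 tokens."""
    return min_tokens <= len(tokens) <= max_tokens

def extract_features(tokens):
    """Step 2 (partial): surface features for the quality classifier."""
    n = len(tokens)
    text = "".join(tokens)
    return {
        # Distinct lowercased tokens divided by all tokens.
        "type_token_ratio": len({t.lower() for t in tokens}) / n if n else 0.0,
        # Characters that are not Unicode letters (digits, punctuation, symbols).
        "non_letter_chars": sum(1 for ch in text if not ch.isalpha()),
        # Tokens written entirely in upper case.
        "upper_case_words": sum(1 for t in tokens if t.isupper()),
    }
```

The resulting feature dictionaries could then be passed to any standard supervised learner.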
Filtering of machine translated texts: We rely on two approaches to identify sites that contain machine translated text:
- we collect site statistics like the number of languages on the site and the number of pages in each language. Suspicious sites are manually checked and blacklisted if they contain machine translated text.
- we have built a classifier that can spot sentences that are probably machine translated. If a site contains a certain percentage of suspicious sentences, the site is also manually checked and blacklisted if necessary.
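The second approach amounts to a per-site ratio check, which might be sketched like this. The threshold of 0.3, the function name, and the `looks_machine_translated` callable (a stand-in for the sentence classifier) are all assumptions; the actual percentage is not stated.

```python
from collections import defaultdict

def sites_to_review(sentences, looks_machine_translated, threshold=0.3):
    """Return hosts whose share of suspicious sentences reaches the threshold.

    sentences: iterable of (host, sentence) tuples.
    looks_machine_translated: callable returning True for suspicious sentences.
    """
    total = defaultdict(int)
    suspicious = defaultdict(int)
    for host, sentence in sentences:
        total[host] += 1
        if looks_machine_translated(sentence):
            suspicious[host] += 1
    return sorted(h for h in total if suspicious[h] / total[h] >= threshold)
```

The returned hosts are only candidates: as described above, a site is blacklisted after a manual check, not automatically.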
Classification of subject area: We have invested significant effort into manually collecting highly representative and clean training documents for all of our 87 subject areas. For each subject area we collected texts with more than 100,000 words, both for English and German. We trained a document classifier on this data that is able to reliably annotate new documents with their best matching subject area.
Deduplication: We aggressively deduplicate sentence pairs. We concatenate both sides of the sentence pair, then we normalize the resulting string by removing all characters that are not Unicode letters, and then we lowercase. We compute the MD5 hash value for the resulting string. All other sentence pairs resulting in the same hash value are discarded.
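The deduplication steps can be sketched in a few lines. This is a minimal illustration, not the production code: the function names are hypothetical, `str.isalpha()` is used as the test for Unicode letters, and the first occurrence of each pair is the one that is kept.

```python
import hashlib

def pair_key(source, target):
    """MD5 hash of the concatenated pair, reduced to lowercased letters."""
    combined = source + target
    letters = "".join(ch for ch in combined if ch.isalpha())
    return hashlib.md5(letters.lower().encode("utf-8")).hexdigest()

def deduplicate(pairs):
    """Keep the first sentence pair for every distinct key."""
    seen = set()
    unique = []
    for source, target in pairs:
        key = pair_key(source, target)
        if key not in seen:
            seen.add(key)
            unique.append((source, target))
    return unique
```

Because whitespace, digits, and punctuation are removed before hashing, pairs that differ only in spacing, numbers, punctuation, or capitalization collapse into one entry.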
- DGT-TM (2004/01-2014/03)
The corpus is provided in TMX and Moses format.
The TMX format includes for each translation unit (sentence pair) the source and target segment, the subject area, the crawl date, and the top-level domains. The encoding is Unicode UTF-8. The following shows a sample translation unit.
<tu tuid="1" datatype="Text">
<prop type="domain">markt-wettbewerb militaer tourismus</prop>
<prop xml:lang="de" type="TLD">de</prop>
<prop xml:lang="de" type="crawldate">2014-10-16</prop>
<seg>Kostendeckende Skalierbarkeit ist eines der wichtigsten taktischen Ziele der Stiftung, um die Errungenschaften ihrer wegweisenden Projekte umzusetzen, damit Regierungen und andere Nicht-Regierungsorganisationen die Technologien im großen Stil einführen und von ihnen profitieren können – nicht nur in Afrika, sondern auch auf anderen Kontinenten.</seg>
<prop xml:lang="en" type="TLD">com</prop>
<prop xml:lang="en" type="crawldate">2014-09-27</prop>
<seg>Cost-effective scalability is integral to the foundation’s tactical aim of leveraging the achievements of its ground-breaking projects so that governments and other NGOs can adopt and profit from the technologies and approaches on a broad scope on the African continent and beyond.</seg>
The Moses format contains only the raw text of the aligned sentence pairs.
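In the usual Moses convention, the corpus consists of two plain-text files with one sentence per line, where line i of one file is the translation of line i of the other. Assuming that layout, the pairs can be read as in the sketch below; the function name and the file names in the comment are hypothetical.

```python
def iter_moses_pairs(source_lines, target_lines):
    """Yield (source, target) sentence pairs from two line-aligned inputs.

    Both arguments may be open text files or any iterables of lines.
    """
    for source, target in zip(source_lines, target_lines):
        yield source.rstrip("\n"), target.rstrip("\n")

# Usage with files (hypothetical file names):
# with open("corpus.de", encoding="utf-8") as de, open("corpus.en", encoding="utf-8") as en:
#     for german, english in iter_moses_pairs(de, en):
#         ...
```

Note that `zip` stops at the shorter input, so it is worth checking beforehand that both files have the same number of lines.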
In order to measure the quality of the Webcrawl corpus and compare it to other available parallel corpora we used C-Eval, a parallel corpora cleaning and evaluation tool that is described in Zariņa et al. 2015.
We trained a classifier model using the first 100,000 sentence pairs from the DCEP corpus (German-English), plus 6,802 sentence pairs from the Wikipedia Parallel Quotations corpus. The corpora were tokenized using the tokenizer script from the Moses distribution, and additionally processed with the script clean-corpus-n.perl (also from the Moses distribution) using the parameters min=1 and max=60. C-Eval training parameters were -a fastalign and -c reptree. Then we evaluated a number of parallel German-English corpora against the model. For efficiency reasons, we limited the number of sentence pairs in each corpus to the first one million sentence pairs. However, we evaluated three slices of our Webcrawl Corpus: the first one million sentences, the fifth million, and the tenth million. The results are shown in the table below.
|Corpus||good sentence pairs|
|europarl-v7 (1st 1M)||99.39%|
|linguatools Webcrawl (10th 1M)||99.14%|
|linguatools Webcrawl (5th 1M)||98.87%|
|linguatools Webcrawl (1st 1M)||98.39%|
|OpenSubtitles2013 (1st 1M)||96.35%|
|commoncrawl (1st 1M)||86.62%|
The linguatools webcrawl corpus almost reaches the quality of the europarl corpus. Its quality is higher than that of well-known parallel corpora like OpenSubtitles, DGT-TM, and EMEA. Most notably, the linguatools webcrawl corpus has a significantly higher quality than the commoncrawl corpus.
In order to further assess the quality of the webcrawled parallel corpus, we test its use for training the statistical machine translation system Moses for the language direction German to English. We train baseline systems on several publicly available parallel corpora and compare the results to our webcrawl corpus. In training, we follow the steps described here. Training includes MERT tuning on the development corpus news-test2008. All systems use the same English language model, a 5-gram model built on a corpus with 228 million tokens using KenLM. The quality is estimated via BLEU scores that we compute on the test corpus newstest2011 with the script multi-bleu.perl from the Moses distribution. The following table shows the results.
|Parallel corpus||sentence pairs||Moses BLEU score|
|europarl-v7 + commoncrawl||4,310,725||20.11|
|linguatools webcrawl 4M||4,304,414||20.39|
The linguatools webcrawl 4M corpus consists of the first 4.3 million sentence pairs from the linguatools 10M corpus. This sample was created to compare its quality with the combination of two publicly available corpora of the same size: europarl-v7 and commoncrawl. As can be seen, the linguatools corpus yields a slightly higher BLEU score (20.39) than the combination of europarl and commoncrawl (20.11).
The result for the EUBookshop corpus demonstrates that more data is only useful if the data is of good quality. (The EUBookshop corpus was extracted from PDF documents and is therefore very noisy.)
In conclusion, the comparison with other well-known parallel corpora via the Moses BLEU score confirms the high quality of the linguatools webcrawl corpus.
Number of hosts: 112,757.
All statistics were collected on tokenized sentences, i.e. punctuation marks such as commas count as tokens.
|Statistic||German||English|
|number of tokens||187,001,386||205,219,201|
|average number of tokens/sentence||18.92||20.77|
|average number of characters/sentence||107.26||97.09|
|average token length||5.67||4.67|
|number of types||3,742,325||1,609,531|
|number of tokens with non-letter characters||30,246,635||29,054,126|
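Statistics of this kind can be computed from tokenized sentences as in the sketch below. The function name and the exact definitions are assumptions: types are counted case-sensitively, and characters per sentence are taken as the sum of token lengths without whitespace, which roughly agrees with the table (18.92 × 5.67 ≈ 107.3).

```python
def corpus_statistics(sentences):
    """Compute corpus statistics for a non-empty list of token lists."""
    tokens = [t for sentence in sentences for t in sentence]
    n_sentences = len(sentences)
    n_tokens = len(tokens)
    n_chars = sum(len(t) for t in tokens)
    return {
        "tokens": n_tokens,
        "tokens_per_sentence": n_tokens / n_sentences,
        "characters_per_sentence": n_chars / n_sentences,
        "average_token_length": n_chars / n_tokens,
        # A type is a distinct token (case-sensitive in this sketch).
        "types": len(set(tokens)),
        # Tokens containing at least one character that is not a letter.
        "tokens_with_non_letters": sum(1 for t in tokens if not t.isalpha()),
    }
```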
The tables show the distribution of top level domains (TLD) for German and English sentences. Only the 30 most frequent TLDs are shown for each language.
|No.||TLD (German)||German sentences||No.||TLD (English)||English sentences|
The table shows the number of sentences from each subject area. Since sentences can be assigned to more than one subject area (maximally to three), the sum is greater than the total number of sentences.
|Domain||Number of sentence pairs||Percentage|
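A distribution like this can be derived from the `domain` property of the TMX translation units. The sketch below assumes that the property holds up to three space-separated subject areas, as in the sample translation unit, and that percentages are relative to the number of sentence pairs; the function name is hypothetical.

```python
from collections import Counter

def subject_area_distribution(domain_props):
    """Map each subject area to (number of sentence pairs, percentage).

    domain_props: non-empty list of domain strings, one per sentence pair,
    e.g. "markt-wettbewerb militaer tourismus".
    """
    counts = Counter()
    for prop in domain_props:
        counts.update(prop.split())
    total = len(domain_props)
    return {area: (n, 100.0 * n / total) for area, n in counts.most_common()}
```

Since a sentence pair with several subject areas is counted once per area, the counts add up to more than the number of sentence pairs.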
- word frequency list with all types occurring in the German part of the corpus and their frequencies.
- word frequency list with all types occurring in the English part of the corpus and their frequencies.
Download 10,000 sample sentence pairs in TMX format:
Please enter your name and email address for the free download.
If you are interested in obtaining a license, please contact email@example.com to inquire about license conditions and fees.
If you’d like to stay informed about corpus updates and new tools for text analysis, you can subscribe to the linguatools newsletter by providing your email address.