Webcrawl parallel corpus German-English 2015

Are you looking for parallel texts to train your statistical translation engines? Do you want to find domain-relevant terminology? Do you want to boost the match rates of your translation memories (TMs)?

We can offer you 10 million German-English parallel sentence pairs:

  • parallel sentence pairs crawled from the internet
  • elaborate multi-step quality filtering, including language identification, machine translation detection, and grammaticality filtering
  • no duplicate sentence pairs
  • no overlap with existing publicly available corpora such as europarl, DGT-TM, etc. (see the full list below)
  • web pages have been categorized by subject area (see the distribution of subject areas below)
  • crawled between 10/2013 and 05/2015 – includes up-to-date terminology
  • available in TMX and Moses format

International Standard Language Resource Number (ISLRN): 800-190-274-236-9

Contents of this page:

  • Corpus collection and preprocessing
  • File formats
  • Evaluation of corpus quality
  • Use in statistical machine translation
  • Corpus statistics
  • Distribution of top-level domains
  • Distribution of subject areas
  • Word frequency lists
  • License conditions

Corpus collection and preprocessing

Identifying multilingual web sites: We crawled millions of hosts using Apache Nutch. Sites with pages in English as well as German were fully crawled. For each English page, we identified the most similar German page using URL pattern matching as well as dictionary-based content comparison.
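
A minimal sketch of the URL pattern matching idea (the substitution patterns below are illustrative assumptions, not our actual rule set, which is additionally combined with dictionary-based content comparison):

    import re

    # Illustrative language-marker substitutions between English and German URLs.
    PATTERNS = [
        (r"/en/", "/de/"),
        (r"([?&]lang=)en\b", r"\g<1>de"),
        (r"_en(\.html?)$", r"_de\1"),
    ]

    def candidate_german_urls(english_url):
        """Yield candidate German URLs derived from an English URL."""
        for pattern, replacement in PATTERNS:
            candidate = re.sub(pattern, replacement, english_url)
            if candidate != english_url:
                yield candidate

    # list(candidate_german_urls("http://example.com/en/products"))
    # -> ["http://example.com/de/products"]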

Content extraction and linguistic annotation: All candidate pages were processed with our linguistic text analysis tool LinA to extract the textual content from the HTML, PDF, or Office formats, followed by sentence splitting, tokenization, part-of-speech tagging, and lemmatization.

Identifying parallel pages and sentence alignment: For each prospective English-German page pair we searched for parallel segments in the pages, again using our large English-German dictionary. If parallel text segments were found, these were fed into a sentence aligner to get pairs of parallel sentences.

ParallelnessFilter: This filter checks whether the sentence pairs found by the sentence aligner really are translations of each other. It is implemented as a machine-learning classifier that relies on features such as word alignment and dictionary-based word overlap, and achieves an accuracy of 94%.
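
The dictionary-based word overlap feature, for example, can be approximated as follows (a sketch; the actual feature set and our English-German dictionary are more elaborate):

    def dictionary_overlap(src_tokens, tgt_tokens, dictionary):
        """Fraction of source tokens that have at least one dictionary
        translation occurring in the target sentence."""
        tgt_vocab = {t.lower() for t in tgt_tokens}
        hits = sum(1 for s in src_tokens
                   if dictionary.get(s.lower(), set()) & tgt_vocab)
        return hits / max(len(src_tokens), 1)

    # Toy example:
    toy_dict = {"haus": {"house"}, "rot": {"red"}}
    dictionary_overlap(["Das", "Haus", "ist", "rot"],
                       ["The", "house", "is", "red"], toy_dict)  # -> 0.5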

GarbageFilter: This filter assigns a quality score to each sentence. Grammatical sentences receive a high score, whereas sentences that contain lists, garbled or unknown words, weird characters, and the like receive a low score. If the score is below a threshold, the sentence is discarded. In the following, the filtering process is explained in more detail.
In a first step, all sentences shorter than 5 tokens or longer than 60 tokens are discarded.
The second step employs a classifier trained with supervised machine learning. We manually collected more than 5,000 prototypical examples of "good" and "bad" sentences, respectively. A feature extractor represents each sentence by about a dozen features, including type-token ratio, number of non-letter characters, number of upper-case words, and part-of-speech n-grams. A classifier trained on this data achieves an accuracy of 97%, as evaluated by 10-fold cross-validation.
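
A sketch of what such a feature extractor could look like (the features mirror those named above; the details of the production system differ):

    def sentence_features(tokens, pos_tags):
        """Map a tokenized, POS-tagged sentence to a feature dict."""
        text = "".join(tokens)
        features = {
            "type_token_ratio": len({t.lower() for t in tokens}) / max(len(tokens), 1),
            "non_letter_chars": sum(not c.isalpha() for c in text),
            "upper_case_words": sum(t.isupper() for t in tokens),
        }
        # part-of-speech bigrams as binary features
        for a, b in zip(pos_tags, pos_tags[1:]):
            features["pos_bigram=" + a + "_" + b] = 1
        return features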

Filtering of machine translated texts: We rely on two approaches to identify sites that contain machine translated text:

  1. we collect site statistics such as the number of languages on the site and the number of pages per language; suspicious sites are manually checked and blacklisted if they contain machine translated text (see the sketch after this list).
  2. we have built a classifier that can spot sentences that are probably machine translated. If a site contains a certain percentage of suspicious sentences, it is also manually checked and blacklisted if necessary.
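
As an illustration of the first approach, a simple suspicion heuristic over the site statistics might look like this (the thresholds are assumptions; flagged sites are always checked manually):

    def is_suspicious_site(pages_per_language, min_languages=10, balance=0.9):
        """Flag sites that offer many languages with nearly identical page
        counts, a typical footprint of machine translated content."""
        counts = sorted(pages_per_language.values())
        return (len(counts) >= min_languages
                and counts[0] / counts[-1] >= balance)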

Classification of subject area: We have invested significant effort into manually collecting highly representative and clean training documents for all of our 87 subject areas. For each subject area we collected more than 100,000 words of text, for both English and German. We trained a document classifier on this data that is able to reliably annotate new documents with their best matching subject area.
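
As an illustration only (not our actual classifier), a standard bag-of-words setup along these lines can be trained on such data, e.g. with scikit-learn:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-in for the real training data (>100,000 words per subject area)
    train_texts = ["install the firmware update on the router",
                   "book a double room with sea view"]
    train_labels = ["informationstechnologie", "tourismus"]

    classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
    classifier.fit(train_texts, train_labels)
    classifier.predict(["reset the router to factory settings"])
    # -> ['informationstechnologie'] for this toy setup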

Deduplication: We aggressively deduplicate sentence pairs: we concatenate both sides of the pair, remove all characters that are not Unicode letters, lowercase the result, and compute its MD5 hash. Of all sentence pairs that map to the same hash value, only one is kept.
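
In code, the deduplication key can be computed as follows (a direct rendering of the steps just described):

    import hashlib

    def dedup_key(source, target):
        """MD5 over the lowercased concatenation of both sentences,
        keeping only Unicode letters."""
        letters = "".join(c for c in source + target if c.isalpha())
        return hashlib.md5(letters.lower().encode("utf-8")).hexdigest()

    seen = set()
    def is_duplicate(source, target):
        key = dedup_key(source, target)
        if key in seen:
            return True
        seen.add(key)
        return False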

With the same method we have also made sure that the linguatools webcrawl corpus does not contain any sentence pairs that are present in any of the following publicly available corpora:

  1. commoncrawl
  2. DGT-TM (2004/01-2014/03)
  3. DCEP
  4. EAC-FORMS
  5. EAC-REFERENCE
  6. ECB
  7. ECDC
  8. EMEA
  9. EUConst
  10. europarl-v7
  11. KDE4
  12. MultiUN
  13. news-commentary-v8
  14. OpenOffice3
  15. OpenSubtitles2013

File formats

The corpus is provided in TMX and Moses format.
The TMX format includes for each translation unit (sentence pair) the source and target segments, the subject area, the crawl date, and the top-level domains of the German and English pages. The encoding is UTF-8.
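
An illustrative translation unit might look as follows (the property names are assumptions for illustration; the delivered files are authoritative):

    <tu>
      <prop type="subject-area">technik</prop>
      <prop type="crawl-date">2014-11-27</prop>
      <prop type="tld-source">de</prop>
      <prop type="tld-target">com</prop>
      <tuv xml:lang="de">
        <seg>Bitte lesen Sie die Bedienungsanleitung sorgfältig durch.</seg>
      </tuv>
      <tuv xml:lang="en">
        <seg>Please read the operating instructions carefully.</seg>
      </tuv>
    </tu>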

The Moses format contains only the raw text of the aligned sentence pairs (two line-aligned plain-text files, one per language).

Evaluation of corpus quality

In order to measure the quality of the Webcrawl corpus and compare it to other available parallel corpora, we used C-Eval, a tool for cleaning and evaluating parallel corpora described in Zariņa et al. (2015).

We trained a classifier model using the first 100,000 sentence pairs from the DCEP corpus (German-English) plus 6,802 sentence pairs from the Wikipedia Parallel Quotations corpus. The corpora were tokenized with the tokenizer script from the Moses distribution and additionally processed with the script clean-corpus-n.perl (also from the Moses distribution) using the parameters min=1 and max=60. The C-Eval training parameters were -a fastalign and -c reptree. We then evaluated a number of parallel German-English corpora against the model. For efficiency reasons, we limited each corpus to its first one million sentence pairs. For our Webcrawl corpus we evaluated three one-million slices: the first, the fifth, and the tenth million sentence pairs. The results are shown in the table below.

Corpus                           good sentence pairs
news-commentary-v8               99.61%
europarl-v7 (1st 1M)             99.39%
linguatools Webcrawl (10th 1M)   99.14%
linguatools Webcrawl (5th 1M)    98.87%
linguatools Webcrawl (1st 1M)    98.39%
OpenSubtitles2013 (1st 1M)       96.35%
DGT-TM-2014-1-3                  96.16%
EMEA                             91.89%
commoncrawl (1st 1M)             86.62%

The linguatools webcrawl corpus almost reaches the quality of the europarl corpus. Its quality is higher than that of well-known parallel corpora like OpenSubtitles, DGT-TM, and EMEA. Most notably, the linguatools webcrawl corpus has a significantly higher quality than the commoncrawl corpus.

Use in statistical machine translation

In order to further assess the quality of the webcrawled parallel corpus, we tested its use for training the statistical machine translation system Moses for the translation direction German to English. We trained baseline systems on several publicly available parallel corpora and compared the results to our webcrawl corpus. In training, we followed the steps described here. Training included MERT tuning on the development corpus news-test2008. All systems used the same English language model, a 5-gram model built with KenLM on a corpus of 228 million tokens. Translation quality was measured via BLEU scores computed on the test corpus newstest2011 with the script multi-bleu.perl from the Moses distribution. The following table shows the results.

Parallel corpus             sentence pairs   Moses BLEU score
europarl-v7                 1,934,299        18.76
commoncrawl                 2,376,426        18.96
europarl-v7 + commoncrawl   4,310,725        20.11
linguatools webcrawl 4M     4,304,414        20.39
EUbookshop                  9,153,394        14.37

The linguatools webcrawl 4M corpus consists of the first 4.3 million sentence pairs of the linguatools 10M corpus. This sample was created to compare against the combination of two publicly available corpora of the same overall size: europarl-v7 plus commoncrawl. As can be seen, the linguatools corpus scores slightly better than the combination of europarl and commoncrawl.
The result for the EUbookshop corpus demonstrates that more data only helps if the data is of good quality. (The EUbookshop corpus was extracted from PDF documents and is therefore very noisy.)

In conclusion, the comparison with other well-known parallel corpora via the Moses BLEU score confirms the high quality of the linguatools webcrawl corpus.

Corpus statistics

Number of hosts: 112,757.

All statistics were computed on tokenized sentences, i.e. punctuation marks such as commas count as tokens.

                                              German         English
number of tokens                              187,001,386    205,219,201
average number of tokens/sentence             18.92          20.77
average number of characters/sentence         107.26         97.09
average token length                          5.67           4.67
number of types                               3,742,325      1,609,531
number of tokens with non-letter characters   30,246,635     29,054,126

Distribution of top-level domains

The tables show the distribution of top-level domains (TLDs) for German and English sentences. Only the 30 most frequent TLDs are shown for each language.

No. TLD (German) German sentences No. TLD (English) English sentences
1 de 4537239 1 de 3894656
2 com 3067614 2 com 3573016
3 at 720643 3 at 671620
4 ch 614056 4 ch 566740
5 net 275240 5 net 280868
6 eu 263512 6 eu 274409
7 org 114274 7 uk 144922
8 cz 90399 8 org 118162
9 it 44924 9 cz 96494
10 hu 31022 10 it 53445
11 va 28214 11 ie 43778
12 biz 13084 12 hu 31529
13 be 11564 13 va 28214
14 fr 10682 14 biz 13947
15 nl 8706 15 fr 12615
16 lu 8048 16 nl 11472
17 edu 4895 17 be 11303
18 se 4881 18 lu 6544
19 hr 3483 19 se 6134
20 gr 3338 20 edu 4895
21 no 3177 21 es 3960
22 fi 3165 22 gr 3880
23 ru 2450 23 no 3686
24 ie 2429 24 hr 3485
25 li 1651 25 fi 3246
26 sk 1490 26 dk 2993
27 es 1424 27 ru 2450
28 dk 1400 28 li 1598
29 info 1272 29 sk 1512
30 pl 1215 30 pl 1421

Distribution of subject areas

The table shows the number of sentence pairs for each subject area. Since a sentence pair can be assigned to more than one subject area (at most three), the counts sum to more than the total number of sentence pairs.

No. Subject area Sentence pairs Percentage
1 technik 6197201 62.71%
2 informationstechnologie 2970819 30.06%
3 internet 1986228 20.10%
4 tourismus 1720585 17.41%
5 e-commerce 1040798 10.53%
6 verlag 861034 8.71%
7 theater 786733 7.96%
8 mode-lifestyle 744357 7.53%
9 schule 679484 6.88%
10 universitaet 673393 6.81%
11 verwaltung 543615 5.50%
12 transport-verkehr 534681 5.41%
13 informatik 516390 5.23%
14 kunst 505192 5.11%
15 musik 452109 4.58%
16 film 450280 4.56%
17 infrastruktur 448350 4.54%
18 staatliche-entscheidungsorgane-und-oeffentliches-finanzwesen 438509 4.44%
19 oekonomie 428123 4.33%
20 media 401179 4.06%
21 finanzmarkt 352048 3.56%
22 wirtschaftsrecht 310442 3.14%
23 auto 279440 2.83%
24 politik 278932 2.82%
25 sport 275174 2.78%
26 steuerterminologie 264190 2.67%
27 jagd 253437 2.56%
28 verkehr-kommunikation 250659 2.54%
29 marketing 227915 2.31%
30 boerse 214375 2.17%
31 radio 200195 2.03%
32 medizin 199667 2.02%
33 personalwesen 191718 1.94%
34 rechnungswesen 182317 1.84%
35 ressorts 157230 1.59%
36 immobilien 148757 1.51%
37 markt-wettbewerb 146063 1.48%
38 religion 139697 1.41%
39 astrologie 136522 1.38%
40 flaechennutzung 133794 1.35%
41 mythologie 127215 1.29%
42 militaer 124325 1.26%
43 psychologie 124032 1.26%
44 transaktionsprozesse 123867 1.25%
45 soziologie 122371 1.24%
46 bahn 118148 1.20%
47 unternehmensstrukturen 112655 1.14%
48 gastronomie 109054 1.10%
49 physik 103773 1.05%
50 literatur 103532 1.05%
51 verkehrssicherheit 102500 1.04%
52 weltinstitutionen 101563 1.03%
53 oekologie 94269 0.95%
54 pharmazie 87934 0.89%
55 jura 75095 0.76%
56 astronomie 74080 0.75%
57 bau 72846 0.74%
58 gartenbau 66582 0.67%
59 verkehr-gueterverkehr 65225 0.66%
60 handel 60286 0.61%
61 versicherung 57907 0.59%
62 raumfahrt 57533 0.58%
63 luftfahrt 55945 0.57%
64 foto 55758 0.56%
65 archäologie 45321 0.46%
66 meteo 45043 0.46%
67 zoologie 44387 0.45%
68 forstwirtschaft 44300 0.45%
69 geografie 42411 0.43%
70 geologie 39011 0.39%
71 nautik 34419 0.35%
72 philosophie 31355 0.32%
73 botanik 31311 0.32%
74 architektur 31167 0.32%
75 biologie 30467 0.31%
76 vogelkunde 30432 0.31%
77 mathematik 27300 0.28%
78 landwirtschaft 27033 0.27%
79 historie 24311 0.25%
80 chemie 23400 0.24%
81 verkehrsfluss 22508 0.23%
82 linguistik 18138 0.18%
83 bergbau 16731 0.17%
84 finanzen 16339 0.17%
85 typografie 15098 0.15%
86 controlling 1706 0.02%
87 mobilfunk-telekommunikation 546 0.01%

Word frequency lists

  • word frequency list with all types occurring in the German part of the corpus and their frequencies.
  • word frequency list with all types occurring in the English part of the corpus and their frequencies.

License conditions

If you are interested in obtaining a license, please contact peter.kolb@linguatools.org for license conditions and fees.