Webcrawl parallel corpus German-English 2015

Are you looking for parallel texts to train your neural machine translation engines? Do you want to find domain-relevant terminology? Do you want to boost matches in your TMs?

We can offer you 10 million German-English parallel sentences:

parallel sentence pairs crawled from the internet
elaborate multi-step quality filtering, including language identification filter, machine translation filter, grammaticality filter etc.
no duplicate sentence pairs
no overlap with existing publicly available corpora like europarl, DGT-TM, etc. (see full list)
web pages have been categorized for subject area (see distribution of subject areas)
crawled between 10/2013 and 05/2015 – includes up-to-date terminology
available in TMX and Moses format

International Standard Language Resource Number (ISLRN): 800-190-274-236-9

Contents of this page:

Corpus collection and preprocessing
File formats
Evaluation of corpus quality
Use in statistical machine translation
Corpus statistics
Distribution of top-level domains
Distribution of subject areas
Word frequency lists
License conditions

Corpus collection and preprocessing

Identifying multilingual web sites: We crawled millions of hosts using Apache Nutch. Sites with pages in English as well as German were fully crawled. For each English page, we identified the most similar German page using URL pattern matching as well as dictionary-based content comparison.
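The URL pattern matching step could look roughly like the following minimal sketch. It assumes that language markers appear as short URL components such as /en/ or _de. The regular expression and the function names url_key and candidate_pairs are hypothetical, since the patterns actually used are not described here.

```python
import re

# Hypothetical language markers; real sites use many more conventions.
_LANG_MARKER = re.compile(r"(?<=[/._=-])(en|de)(?=[/._&-]|$)")


def url_key(url: str) -> str:
    """Replace language markers in a URL with a neutral placeholder."""
    return _LANG_MARKER.sub("{lang}", url.lower())


def candidate_pairs(english_urls, german_urls):
    """Pair English and German URLs that differ only in their language marker."""
    german_by_key = {url_key(u): u for u in german_urls}
    pairs = []
    for en in english_urls:
        de = german_by_key.get(url_key(en))
        if de is not None:
            pairs.append((en, de))
    return pairs
```

Pairs found this way are only candidates. As described above, the content of the pages still has to be compared with the dictionary.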

Content extraction and linguistic annotation: All candidate pages were processed with our linguistic text analysis tool LinA to extract the textual content from the HTML, PDF, or Office formats, followed by sentence splitting, tokenization, part-of-speech tagging, and lemmatization.

Identifying parallel pages and sentence alignment: For each prospective English-German page pair we searched for parallel segments in the pages, again using our large English-German dictionary. If parallel text segments were found, these were fed into a sentence aligner to get pairs of parallel sentences.

ParallelnessFilter: This filter checks whether the sentence pairs found by the sentence aligner really are translations of each other. The filter is implemented as a machine-learning classifier that relies on features like word alignment and dictionary-based word overlap. It achieves an accuracy of 94%.
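As an illustration of one such feature, the sketch below computes a simple dictionary-based word overlap. The function name, the toy dictionary, and the exact definition (the share of source tokens with a dictionary translation on the target side) are assumptions for illustration, not the features of the actual classifier.

```python
def dictionary_overlap(src_tokens, tgt_tokens, dictionary):
    """Fraction of source tokens that have a dictionary translation in the target."""
    if not src_tokens:
        return 0.0
    target = {t.lower() for t in tgt_tokens}
    covered = sum(
        1 for s in src_tokens
        if dictionary.get(s.lower(), set()) & target
    )
    return covered / len(src_tokens)


# Toy German-English dictionary (hypothetical entries).
toy_dictionary = {"haus": {"house", "home"}, "klein": {"small", "little"}}
score = dictionary_overlap(
    ["Das", "Haus", "ist", "klein"],
    ["The", "house", "is", "small"],
    toy_dictionary,
)
```

In a classifier, a value like this would be one feature among several. A low overlap alone does not prove that a pair is not parallel, because no dictionary covers every word.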

GarbageFilter: This filter assigns a quality score to each sentence. Grammatical sentences receive a high score, whereas sentences that contain lists, garbled or unknown words, weird characters, and the like receive a low score. If the score is below a threshold, the sentence is discarded. The filtering process consists of two steps.
In a first step, all sentences which are shorter than 5 tokens or longer than 60 tokens are discarded.
The second step employs a classifier that we trained using supervised machine learning. We manually collected more than 5,000 prototypical examples each for "good" and "bad" sentences. A feature extractor represents each sentence by a dozen features including type-token ratio, number of non-letter characters, number of upper-case words, and part-of-speech n-grams. Then, a classifier is trained on this data. The classifier achieves an accuracy of 97%, as evaluated by 10-fold cross-validation.
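The two steps can be sketched as follows. The length limits come from the description above. The feature definitions are assumptions: an upper-case word is taken here to mean a token written entirely in capitals, and part-of-speech n-grams are left out because they require a tagger.

```python
def within_length_limits(tokens, min_len=5, max_len=60):
    """Step 1: keep only sentences with 5 to 60 tokens."""
    return min_len <= len(tokens) <= max_len


def surface_features(tokens):
    """Step 2 (partial): a few surface features for the classifier."""
    if not tokens:
        return {"type_token_ratio": 0.0, "non_letter_chars": 0, "upper_case_words": 0}
    text = "".join(tokens)
    return {
        # Distinct lowercased tokens divided by all tokens.
        "type_token_ratio": len({t.lower() for t in tokens}) / len(tokens),
        # Digits, punctuation and symbols inside the tokens.
        "non_letter_chars": sum(1 for c in text if not c.isalpha()),
        # Assumption: "upper-case word" means an all-caps token.
        "upper_case_words": sum(1 for t in tokens if t.isalpha() and t.isupper()),
    }


features = surface_features(["BUY", "NOW", "!", "!", "!"])
```

Feature dictionaries of this kind would then be turned into vectors and passed to whichever supervised classifier is used.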

Filtering of machine translated texts: We rely on two approaches to identify sites that contain machine translated text:

We collect site statistics like the number of languages on the site and the number of pages in each language. Suspicious sites are manually checked and blacklisted if they contain machine translated text.
We have built a classifier that can spot sentences that are probably machine translated. If a site contains a certain percentage of suspicious sentences, the site is also manually checked and blacklisted if necessary.
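The second approach amounts to aggregating sentence-level decisions per site. A minimal sketch, assuming the sentence classifier has already produced one boolean per sentence. The threshold of 20% is a placeholder, since the actual percentage is not stated.

```python
def sites_for_manual_review(flags_by_site, threshold=0.2):
    """Return hosts whose share of suspicious sentences reaches the threshold.

    flags_by_site maps a host name to a list of booleans, where True means
    the sentence looks machine translated. The threshold is hypothetical.
    """
    suspicious = []
    for host, flags in flags_by_site.items():
        if flags and sum(flags) / len(flags) >= threshold:
            suspicious.append(host)
    return sorted(suspicious)


review_queue = sites_for_manual_review({
    "a.example": [True, False, False, False],  # 25% suspicious
    "b.example": [False] * 10,                 # 0% suspicious
})
```

The function only produces a review queue. The decision to blacklist a site remains manual, as described above.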

Classification of subject area: We have invested significant effort into manually collecting highly representative and clean training documents for all of our 87 subject areas. For each subject area we collected texts with more than 100,000 words, both for English and German. We trained a document classifier on this data that is able to reliably annotate new documents with their best matching subject area.

Deduplication: We aggressively deduplicate sentence pairs. We concatenate both sides of the sentence pair, normalize the resulting string by removing all characters that are not Unicode letters, and then lowercase it. We compute the MD5 hash value of the resulting string. All other sentence pairs resulting in the same hash value are discarded.
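The normalization and hashing can be sketched in a few lines. This is an illustration of the described procedure, not the production code. It assumes that Python's str.isalpha is an acceptable test for "Unicode letter", and the function names are hypothetical.

```python
import hashlib


def pair_key(source: str, target: str) -> str:
    """MD5 of the concatenated pair, reduced to lowercased letters."""
    combined = source + target
    letters_only = "".join(ch for ch in combined if ch.isalpha())
    return hashlib.md5(letters_only.lower().encode("utf-8")).hexdigest()


def deduplicate(pairs):
    """Keep the first sentence pair for each key and drop the rest."""
    seen = set()
    unique = []
    for source, target in pairs:
        key = pair_key(source, target)
        if key not in seen:
            seen.add(key)
            unique.append((source, target))
    return unique


pairs = [
    ("Hallo, Welt!", "Hello, world!"),
    ("hallo welt", "hello world"),   # differs only in case and punctuation
    ("Guten Morgen.", "Good morning."),
]
unique_pairs = deduplicate(pairs)
```

Because digits, punctuation, and whitespace are removed before hashing, pairs that differ only in numbers or formatting collapse into one entry. That is what makes the deduplication aggressive.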

With the same method we have also made sure that the linguatools webcrawl corpus does not contain any sentence pairs that are present in one of the following publicly available corpora:

File formats

The corpus is provided in TMX and Moses format.
The TMX format includes for each translation unit (sentence pair) the source and target segment, the subject area, the crawl date, and the top-level domains. The encoding is Unicode UTF-8. The following shows a sample translation unit.

<tu tuid="1" datatype="Text">
  <prop type="domain">markt-wettbewerb militaer tourismus</prop>
  <tuv xml:lang="de">
    <prop xml:lang="de" type="TLD">de</prop>
    <prop xml:lang="de" type="crawldate">2014-10-16</prop>
    <seg>Kostendeckende Skalierbarkeit ist eines der wichtigsten taktischen Ziele der Stiftung, um die Errungenschaften ihrer wegweisenden Projekte umzusetzen, damit Regierungen und andere Nicht-Regierungsorganisationen die Technologien im großen Stil einführen und von ihnen profitieren können – nicht nur in Afrika, sondern auch auf anderen Kontinenten.</seg>
  </tuv>
  <tuv xml:lang="en">
    <prop xml:lang="en" type="TLD">com</prop>
    <prop xml:lang="en" type="crawldate">2014-09-27</prop>
    <seg>Cost-effective scalability is integral to the foundation’s tactical aim of leveraging the achievements of its ground-breaking projects so that governments and other NGOs can adopt and profit from the technologies and approaches on a broad scope on the African continent and beyond.</seg>
  </tuv>
</tu>

The Moses format contains only the raw text of the aligned sentence pairs.
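To show how the two formats relate, the following sketch reads one translation unit with Python's standard xml.etree.ElementTree module and extracts the pieces that would go into the Moses files. The element and attribute names follow the sample above. The short example sentences are invented, and splitting the domain property on whitespace is an assumption based on how the sample looks.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_unit(tu):
    """Return (subject areas, {language: segment text}) for one <tu> element."""
    domain = tu.findtext("prop[@type='domain']", default="")
    segments = {
        tuv.get(XML_LANG): tuv.findtext("seg", default="")
        for tuv in tu.iter("tuv")
    }
    return domain.split(), segments


# Toy translation unit with shortened, invented segments.
sample = """<tu tuid="1" datatype="Text">
  <prop type="domain">markt-wettbewerb militaer tourismus</prop>
  <tuv xml:lang="de"><seg>Guten Morgen.</seg></tuv>
  <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
</tu>"""

domains, segments = read_unit(ET.fromstring(sample))
# In Moses format, each side goes on the same line number of its own file.
moses_line_de, moses_line_en = segments["de"], segments["en"]
```

For the full TMX file, a streaming parser such as ET.iterparse would be preferable to loading the whole document into memory.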

Evaluation of corpus quality

In order to measure the quality of the Webcrawl corpus and compare it to other available parallel corpora we used C-Eval, a parallel corpora cleaning and evaluation tool that is described in Zariņa et al. 2015.

We trained a classifier model using the first 100,000 sentence pairs from the DCEP corpus (German-English), plus 6,802 sentence pairs from the Wikipedia Parallel Quotations corpus. The corpora were tokenized using the tokenizer script from the Moses distribution, and additionally processed with the script clean-corpus-n.perl (also from the Moses distribution) using the parameters min=1 and max=60. C-Eval training parameters were -a fastalign and -c reptree. Then we evaluated a number of parallel German-English corpora against the model. For efficiency reasons, we limited the number of sentence pairs in each corpus to the first one million sentence pairs. However, we evaluated three slices of our Webcrawl Corpus: the first one million sentences, the fifth million, and the tenth million. The results are shown in the table below.

Corpus	good sentence pairs
news-commentary-v8	99.61%
europarl-v7 (1st 1M)	99.39%
linguatools Webcrawl (10th 1M)	99.14%
linguatools Webcrawl (5th 1M)	98.87%
linguatools Webcrawl (1st 1M)	98.39%
OpenSubtitles2013 (1st 1M)	96.35%
DGT-TM-2014-1-3	96.16%
EMEA	91.89%
commoncrawl (1st 1M)	86.62%

The linguatools webcrawl corpus almost reaches the quality of the europarl corpus. Its quality is higher than that of well-known parallel corpora like OpenSubtitles, DGT-TM, and EMEA. Most notably, with 98.39% to 99.14% good sentence pairs, it scores far higher than the commoncrawl corpus (86.62%).

Use in statistical machine translation

In order to further assess the quality of the webcrawled parallel corpus, we test its use for training the statistical machine translation system Moses for the language direction German to English. We train baseline systems on several publicly available parallel corpora and compare the results to our webcrawl corpus. In training, we follow the steps described here. Training includes MERT tuning on the development corpus news-test2008. All systems use the same English language model, a 5-gram model built on a corpus with 228 million tokens using KenLM. The quality is estimated via BLEU scores that we compute on the test corpus newstest2011 with the script multi-bleu.perl from the Moses distribution. The following table shows the results.

Parallel corpus	sentence pairs	Moses BLEU score
europarl-v7	1,934,299	18.76
commoncrawl	2,376,426	18.96
europarl-v7 + commoncrawl	4,310,725	20.11
linguatools webcrawl 4M	4,304,414	20.39
EUbookshop	9,153,394	14.37

The linguatools webcrawl 4M corpus consists of the first 4.3 million sentence pairs from the linguatools 10M corpus. This sample was created to compare its quality with the combination of two publicly available corpora of the same size: europarl-v7 and commoncrawl. As can be seen, the quality of the linguatools corpus is slightly better than the combination of europarl and commoncrawl.
The result for the EUbookshop corpus demonstrates that more data is only useful if the data is of good quality. (The EUbookshop corpus was extracted from PDF documents and is therefore very noisy.)

In conclusion, the comparison with other well-known parallel corpora via the Moses BLEU score confirms the high quality of the linguatools webcrawl corpus.

Corpus statistics

Number of hosts: 112,757.

All statistics were collected on tokenized sentences, i.e. punctuation marks such as commas count as tokens.

	German	English
number of tokens	187,001,386	205,219,201
average number of tokens/sentence	18.92	20.77
average number of characters/sentence	107.26	97.09
average token length	5.67	4.67
number of types	3,742,325	1,609,531
number of tokens with non-letter characters	30,246,635	29,054,126
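Statistics of this kind can be computed from tokenized sentences along the following lines. This is a sketch under assumptions: types are counted case-sensitively, and a token counts as having non-letter characters if it is not purely alphabetic. The exact definitions behind the table are not stated.

```python
def corpus_statistics(tokenized_sentences):
    """Basic token and type statistics for a list of token lists."""
    tokens = [t for sentence in tokenized_sentences for t in sentence]
    return {
        "tokens": len(tokens),
        # Assumption: "Das" and "das" are different types.
        "types": len(set(tokens)),
        "avg_tokens_per_sentence": len(tokens) / len(tokenized_sentences),
        "avg_token_length": sum(len(t) for t in tokens) / len(tokens),
        "tokens_with_non_letters": sum(1 for t in tokens if not t.isalpha()),
    }


stats = corpus_statistics([
    ["Das", "ist", "gut", "."],
    ["Gut", ",", "das", "ist", "es", "."],
])
```

Punctuation marks are counted as tokens here, in line with the note above the table.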

Distribution of top-level domains

The tables show the distribution of top level domains (TLD) for German and English sentences. Only the 30 most frequent TLDs are shown for each language.

No.	TLD (German)	German sentences	No.	TLD (English)	English sentences
1	de	4537239	1	de	3894656
2	com	3067614	2	com	3573016
3	at	720643	3	at	671620
4	ch	614056	4	ch	566740
5	net	275240	5	net	280868
6	eu	263512	6	eu	274409
7	org	114274	7	uk	144922
8	cz	90399	8	org	118162
9	it	44924	9	cz	96494
10	hu	31022	10	it	53445
11	va	28214	11	ie	43778
12	biz	13084	12	hu	31529
13	be	11564	13	va	28214
14	fr	10682	14	biz	13947
15	nl	8706	15	fr	12615
16	lu	8048	16	nl	11472
17	edu	4895	17	be	11303
18	se	4881	18	lu	6544
19	hr	3483	19	se	6134
20	gr	3338	20	edu	4895
21	no	3177	21	es	3960
22	fi	3165	22	gr	3880
23	ru	2450	23	no	3686
24	ie	2429	24	hr	3485
25	li	1651	25	fi	3246
26	sk	1490	26	dk	2993
27	es	1424	27	ru	2450
28	dk	1400	28	li	1598
29	info	1272	29	sk	1512
30	pl	1215	30	pl	1421

Distribution of subject areas

The table shows the number of sentence pairs from each subject area. Since a sentence pair can be assigned to more than one subject area (at most three), the counts add up to more than the total number of sentence pairs and the percentages to more than 100%.

	Domain	Number of sentence pairs	Percentage
1	technik	6197201	62.71%
2	informationstechnologie	2970819	30.06%
3	internet	1986228	20.10%
4	tourismus	1720585	17.41%
5	e-commerce	1040798	10.53%
6	verlag	861034	8.71%
7	theater	786733	7.96%
8	mode-lifestyle	744357	7.53%
9	schule	679484	6.88%
10	universitaet	673393	6.81%
11	verwaltung	543615	5.50%
12	transport-verkehr	534681	5.41%
13	informatik	516390	5.23%
14	kunst	505192	5.11%
15	musik	452109	4.58%
16	film	450280	4.56%
17	infrastruktur	448350	4.54%
18	staatliche-entscheidungsorgane-und-oeffentliches-finanzwesen	438509	4.44%
19	oekonomie	428123	4.33%
20	media	401179	4.06%
21	finanzmarkt	352048	3.56%
22	wirtschaftsrecht	310442	3.14%
23	auto	279440	2.83%
24	politik	278932	2.82%
25	sport	275174	2.78%
26	steuerterminologie	264190	2.67%
27	jagd	253437	2.56%
28	verkehr-kommunikation	250659	2.54%
29	marketing	227915	2.31%
30	boerse	214375	2.17%
31	radio	200195	2.03%
32	medizin	199667	2.02%
33	personalwesen	191718	1.94%
34	rechnungswesen	182317	1.84%
35	ressorts	157230	1.59%
36	immobilien	148757	1.51%
37	markt-wettbewerb	146063	1.48%
38	religion	139697	1.41%
39	astrologie	136522	1.38%
40	flaechennutzung	133794	1.35%
41	mythologie	127215	1.29%
42	militaer	124325	1.26%
43	psychologie	124032	1.26%
44	transaktionsprozesse	123867	1.25%
45	soziologie	122371	1.24%
46	bahn	118148	1.20%
47	unternehmensstrukturen	112655	1.14%
48	gastronomie	109054	1.10%
49	physik	103773	1.05%
50	literatur	103532	1.05%
51	verkehrssicherheit	102500	1.04%
52	weltinstitutionen	101563	1.03%
53	oekologie	94269	0.95%
54	pharmazie	87934	0.89%
55	jura	75095	0.76%
56	astronomie	74080	0.75%
57	bau	72846	0.74%
58	gartenbau	66582	0.67%
59	verkehr-gueterverkehr	65225	0.66%
60	handel	60286	0.61%
61	versicherung	57907	0.59%
62	raumfahrt	57533	0.58%
63	luftfahrt	55945	0.57%
64	foto	55758	0.56%
65	archäologie	45321	0.46%
66	meteo	45043	0.46%
67	zoologie	44387	0.45%
68	forstwirtschaft	44300	0.45%
69	geografie	42411	0.43%
70	geologie	39011	0.39%
71	nautik	34419	0.35%
72	philosophie	31355	0.32%
73	botanik	31311	0.32%
74	architektur	31167	0.32%
75	biologie	30467	0.31%
76	vogelkunde	30432	0.31%
77	mathematik	27300	0.28%
78	landwirtschaft	27033	0.27%
79	historie	24311	0.25%
80	chemie	23400	0.24%
81	verkehrsfluss	22508	0.23%
82	linguistik	18138	0.18%
83	bergbau	16731	0.17%
84	finanzen	16339	0.17%
85	typografie	15098	0.15%
86	controlling	1706	0.02%
87	mobilfunk-telekommunikation	546	0.01%
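The following sketch shows how such a multi-label distribution can be counted and why the percentages exceed 100% in total. The input data is a toy example and the function name is hypothetical. Percentages are taken relative to the number of sentence pairs, not to the number of labels.

```python
from collections import Counter


def subject_area_distribution(labels_per_pair):
    """Return (label, count, percentage of sentence pairs), most frequent first."""
    counts = Counter(
        label for labels in labels_per_pair for label in set(labels)
    )
    total = len(labels_per_pair)
    return [
        (label, n, round(100 * n / total, 2))
        for label, n in counts.most_common()
    ]


# Toy input: each sentence pair carries between one and three subject areas.
distribution = subject_area_distribution([
    ["technik", "internet"],
    ["technik"],
    ["tourismus", "technik", "internet"],
    ["tourismus"],
])
```

In this toy input, four sentence pairs carry seven labels in total, so the percentages sum to 175%.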

Word frequency lists

word frequency list with all types occurring in the German part of the corpus and their frequencies.
word frequency list with all types occurring in the English part of the corpus and their frequencies.

License conditions

If you are interested in obtaining a license, please contact peter.kolb@linguatools.org to inquire about license conditions and fees.