Corpora – linguatools.org

“There’s no data like more data!”

Parallel Corpora

Parallel corpora consist of bilingual sentence pairs. They are a highly valuable resource for translators, terminologists, and language engineers.

Webcrawl Parallel Corpora: parallel sentence pairs crawled from the web using our BSP crawler.
- Webcrawl Parallel Corpus German-English 2015: 10 million German-English parallel sentences
- more language pairs in preparation…

Wikipedia Parallel Titles Corpora: bilingual titles of Wikipedia articles, extended with redirects and textlinks. 487,406,497 unique parallel segments for 253 language pairs!

Wikipedia Parallel Quotations Corpus: a tiny corpus of German-English quotes from the German Wikipedia.

Comparable Corpora

All our comparable corpora are bilingual, document-aligned corpora. The documents are categorized by domain.

Wikipedia Comparable Corpora: more than 41 million bilingually aligned Wikipedia articles for 253 language pairs.

Monolingual Corpora

Wikipedia Monolingual Corpora: nearly 10 billion tokens of text in 30 languages, extracted from Wikipedia.