“There’s no data like more data!”
Parallel Corpora
Parallel corpora consist of bilingual sentence pairs. They are a highly valuable resource for translators, terminologists, and language engineers.
- Webcrawl Parallel Corpora: parallel sentence pairs crawled from the web using our BSP crawler:
  - Webcrawl Parallel Corpus German-English 2015: 10 million German-English parallel sentence pairs
  - more language pairs in preparation…
- Wikipedia Parallel Titles Corpora: bilingual titles of Wikipedia articles, extended with redirects and textlinks. 487,406,497 unique parallel segments for 253 language pairs!
- Wikipedia Parallel Quotations Corpus: a tiny corpus of German-English quotes from the German Wikipedia.
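As a rough illustration of what "bilingual sentence pairs" means in practice, the sketch below reads such pairs from tab-separated text. The file layout (one pair per line, source and target separated by a tab) and the names `SentencePair` and `read_pairs` are assumptions made for this example only; the actual distribution format of the corpora listed above may differ.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class SentencePair(NamedTuple):
    """One aligned sentence pair (hypothetical structure for illustration)."""
    source: str  # e.g. the German sentence
    target: str  # e.g. the English sentence


def read_pairs(lines: Iterable[str]) -> Iterator[SentencePair]:
    """Yield sentence pairs from tab-separated lines, skipping malformed rows.

    Assumes one pair per line: source<TAB>target.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Keep only rows with exactly two non-empty fields.
        if len(row) == 2 and row[0].strip() and row[1].strip():
            yield SentencePair(row[0].strip(), row[1].strip())


# Small in-memory sample standing in for a corpus file.
sample = io.StringIO(
    "Guten Morgen.\tGood morning.\n"
    "Wie geht es dir?\tHow are you?\n"
    "malformed line without a tab\n"
)
pairs = list(read_pairs(sample))
for pair in pairs:
    print(f"{pair.source} ||| {pair.target}")
```

Filtering out rows that do not split into exactly two non-empty fields is a deliberate choice here, since web-crawled data tends to contain malformed lines.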
Comparable Corpora
All our comparable corpora are bilingual, document-aligned corpora. The documents are categorized by domain.
- Wikipedia Comparable Corpora: more than 41 million bilingually aligned Wikipedia articles for 253 language pairs.
Monolingual Corpora
- Wikipedia Monolingual Corpora: nearly 10 billion tokens of text in 30 languages, extracted from Wikipedia.