“There’s no data like more data!”
Parallel corpora consist of bilingual sentence pairs. They are a highly valuable resource for translators, terminologists, and language engineers.
- Wikipedia Parallel Titles Corpora: bilingual titles of Wikipedia articles, extended with redirects and textlinks. 487,406,497 unique parallel segments for 253 language pairs!
- Wikipedia Parallel Quotations Corpus: a tiny corpus of German-English quotes from the German Wikipedia.
- Webcrawl Parallel Corpora: parallel sentence pairs crawled from the web using our BSP crawler:
  - Webcrawl Parallel Corpus German-English 2015: 10 million German-English parallel sentence pairs
  - more language pairs in preparation…
All our comparable corpora are bilingual, document-aligned corpora. The documents are categorized by domain.
- Wikipedia Comparable Corpora: more than 41 million bilingually aligned Wikipedia articles for 253 language pairs.
- Wikipedia Monolingual Corpora: nearly 10 billion tokens of text in 30 languages, extracted from Wikipedia.
If you’d like to stay informed about corpora updates and new tools for text analysis, you can subscribe to the linguatools newsletter by providing your email address.