Corpora – linguatools.org

“There’s no data like more data!”

Parallel Corpora

Parallel corpora consist of bilingual sentence pairs. They are a highly valuable resource for translators, terminologists, and language engineers.
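As a rough illustration of what "bilingual sentence pairs" means in practice, the sketch below reads pairs from tab-separated text, one pair per line. The tab-separated layout, the `SentencePair` type, the `read_pairs` helper, and the sample sentences are all assumptions made for this example; the actual file format of any given corpus may differ.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class SentencePair(NamedTuple):
    source: str  # sentence in the source language (e.g. German)
    target: str  # its translation in the target language (e.g. English)


def read_pairs(lines: Iterable[str]) -> Iterator[SentencePair]:
    """Yield sentence pairs from tab-separated lines, skipping malformed rows."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # A valid row has exactly two non-empty fields: source and target.
        if len(row) == 2 and row[0].strip() and row[1].strip():
            yield SentencePair(row[0].strip(), row[1].strip())


# Hypothetical sample data, not taken from any real corpus.
sample = io.StringIO(
    "Guten Morgen.\tGood morning.\n"
    "Wie geht es dir?\tHow are you?\n"
    "malformed line without a tab\n"
)
pairs = list(read_pairs(sample))

for pair in pairs:
    print(f"{pair.source}  |  {pair.target}")
```

For a real file, the same function could be given an open file handle instead of the in-memory sample, provided the file follows the assumed one-pair-per-line layout.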

Webcrawl Parallel Corpora: parallel sentence pairs crawled from the web using our BSP crawler:
- Webcrawl Parallel Corpus German-English 2015: 10 million parallel German-English sentence pairs
- more language pairs in preparation…

Wikipedia Parallel Titles Corpora: bilingual titles of Wikipedia articles, extended with redirects and textlinks. 487,406,497 unique parallel segments for 253 language pairs!

Wikipedia Parallel Quotations Corpus: a tiny corpus of German-English quotes from the German Wikipedia.

Comparable Corpora

All our comparable corpora are bilingual, document-aligned corpora. The documents are categorized by domain.

Wikipedia Comparable Corpora: more than 41 million bilingually aligned Wikipedia articles for 253 language pairs.

Monolingual Corpora

Wikipedia Monolingual Corpora: nearly 10 billion tokens of text in 30 languages, extracted from Wikipedia.

If you’d like to stay informed about corpus updates and new tools for text analysis, you can subscribe to the linguatools newsletter by providing your email address.