“There’s no data like more data!”
Parallel Corpora
Parallel corpora consist of bilingual sentence pairs. They are a highly valuable resource for translators, terminologists, and language engineers.
- Webcrawl Parallel Corpora: parallel sentence pairs crawled from the web using our BSP crawler:
- Webcrawl Parallel Corpus German-English 2015: 10 million parallel sentences German-English
- more language pairs in preparation…
- Wikipedia Parallel Titles Corpora: bilingual titles of Wikipedia articles, extended with redirects and textlinks. 487,406,497 unique parallel segments for 253 language pairs!
- Wikipedia Parallel Quotations Corpus: a tiny corpus of German-English quotes from the German Wikipedia.
- Several parallel texts (German-English and German-Czech) in TMX format can be downloaded here. Included are:
- German-Czech:
- Jules Verne: Robur der Eroberer – Robur Dobyvatel
- three stories by Edgar Allan Poe
- German-English:
- two fairy tales by Hans Christian Andersen
- a story by Edgar Allan Poe
- the Communist Manifesto by Karl Marx and Friedrich Engels
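For readers who have not worked with TMX before: it is an XML format in which each translation unit (`<tu>`) holds one variant (`<tuv>`) per language, with the text in a `<seg>` element. The following is a minimal sketch of how sentence pairs might be read from such a file using only the Python standard library. The function name `read_tmx_pairs`, the language codes, and the sample content are illustrative assumptions; they are not taken from the downloadable files, whose exact layout may differ.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample for illustration only (not from the linguatools files).
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="de"/>
  <body>
    <tu>
      <tuv xml:lang="de"><seg>Guten Morgen.</seg></tuv>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(source, src_lang="de", tgt_lang="en"):
    """Yield (source_text, target_text) tuples from a TMX file or file object.

    Hypothetical helper: assumes language codes such as "de" or "de-DE".
    """
    tree = ET.parse(source)
    for tu in tree.getroot().iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() joins all text in the segment, including the
                # contents of any inline elements.
                text = "".join(seg.itertext()).strip()
                # Keep only the primary language subtag ("de-de" -> "de").
                segments[lang.split("-")[0]] = text
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]


if __name__ == "__main__":
    for de, en in read_tmx_pairs(io.StringIO(SAMPLE_TMX)):
        print(de, "|", en)
```

To read a downloaded file instead of the sample, pass its path in place of the `io.StringIO` object and adjust the two language codes, for example `"de"` and `"cs"` for the German-Czech texts.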
Comparable Corpora
All our comparable corpora are bilingual, document-aligned corpora. The documents are categorized by domain.
- Wikipedia Comparable Corpora: more than 41 million bilingually aligned Wikipedia articles for 253 language pairs.
Monolingual Corpora
- Wikipedia Monolingual Corpora: nearly 10 billion tokens of text in 30 languages, extracted from Wikipedia.
If you’d like to stay informed about corpus updates and new tools for text analysis, you can subscribe to the linguatools newsletter by providing your email address.