Corpora

“There’s no data like more data!”

Parallel Corpora

Parallel corpora consist of bilingual sentence pairs. They are a highly valuable resource for translators, terminologists, and language engineers.

  • Webcrawl Parallel Corpora: parallel sentence pairs crawled from the web using our BSP crawler.
  • Wikipedia Parallel Titles Corpora: bilingual titles of Wikipedia articles, extended with redirects and textlinks. 487,406,497 unique parallel segments for 253 language pairs!
  • Wikipedia Parallel Quotations Corpus: a tiny corpus of German-English quotes from the German Wikipedia.
  • Several parallel texts (German-English and German-Czech) in TMX format can be downloaded here. Included are:
    • German-Czech:
      • Jules Verne: Robur der Eroberer – Robur Dobyvatel
      • three stories by Edgar Allan Poe
    • German-English:
      • two fairy tales by Hans Christian Andersen
      • a story by Edgar Allan Poe
      • the Communist Manifesto by Karl Marx and Friedrich Engels
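
For readers who want to work with the TMX downloads programmatically, the following is a minimal sketch of how sentence pairs could be read from a TMX file using only the Python standard library. It assumes the usual TMX layout (translation units in tu elements, each holding one tuv per language with the text in a seg element) and that languages are tagged with codes such as "de" and "en"; the function name read_tmx_pairs and the language-code normalization are illustrative assumptions, not part of any linguatools tool.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang, tgt_lang):
    """Return a list of (source, target) segment pairs from a TMX document.

    `source` is a file path or a binary file object. `src_lang` and
    `tgt_lang` are lowercase primary language codes such as "de" or "en"
    (hypothetical defaults; check the codes used in the actual file).
    """
    root = ET.parse(source).getroot()
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older versions use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            # Reduce codes like "de-DE" or "EN-US" to "de" / "en".
            lang = lang.lower().split("-")[0]
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline markup elements.
                segments[lang] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs
```

Called as read_tmx_pairs("some_file.tmx", "de", "en"), where the file name is a placeholder, this would return a list of German-English tuples. Translation units that lack one of the two languages are skipped rather than reported, which keeps the sketch short but may hide alignment gaps.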

Comparable Corpora

All our comparable corpora are bilingual, document-aligned corpora. The documents are categorized by domain.

Monolingual Corpora


If you’d like to stay informed about corpora updates and new tools for text analysis, you can subscribe to the linguatools newsletter by providing your email address.