Wikipedia comparable corpora

The Wikipedia Comparable Corpora are bilingual, document-aligned text corpora. They were extracted from the XML files of the Wikipedia Monolingual Corpora using the crosslanguage links. Each comparable corpus consists of document pairs: a Wikipedia article in language L1 and the linked article on the same subject in language L2. Altogether, there are over 41 million aligned articles for 253 language pairs. The 253 corpus files occupy 405 GB of disk space when unzipped.

Download

The table cells contain the number of aligned articles for each language pair. If you hover over a cell, a tooltip shows the number of tokens in each of the two languages. Click on a cell to download the corpus file.

Before downloading, make sure you have read and understood the license conditions (see below).

     bg   cs   da   de   el   en   es   fa   fi   fr   hu   it   ja   ko   nl   pl   pt   ro   ru   sv   tr   zh
ar  43K  52K  41K 107K  26K 175K 114K  78K  61K 124K  44K 107K  80K  55K  99K 100K 101K  46K  99K  77K  49K  84K
bg       52K  39K  91K  29K 112K  80K  47K  56K  94K  44K  86K  66K  44K  78K  82K  76K  48K  93K  66K  42K  62K
cs            53K 157K  32K 188K 116K  62K  84K 152K  65K 139K  95K  62K 127K 135K 109K  50K 130K 100K  50K  84K
da                102K  26K 123K  82K  49K  67K  98K  44K  90K  69K  48K  89K  87K  76K  41K  84K  87K  42K  61K
de                      52K 852K 358K 135K 182K 549K 119K 439K 243K 119K 381K 394K 305K 114K 383K 277K 100K 223K
el                           66K  48K  32K  35K  54K  31K  51K  40K  30K  43K  48K  45K  27K  48K  38K  29K  38K
en                               673K 288K 238K 939K 166K 733K 393K 172K 674K 650K 541K 182K 575K 492K 138K 405K
es                                    130K 146K 440K 101K 380K 200K 108K 382K 312K 353K 113K 304K 277K  89K 229K
fa                                          74K 144K  61K 142K  98K  62K 121K 131K 133K  51K 118K  92K  58K 109K
fi                                              185K  71K 160K 119K  74K 140K 154K 132K  55K 155K 146K  62K  96K
fr                                                   129K 523K 268K 130K 441K 427K 365K 142K 397K 324K 103K 263K
hu                                                        129K  80K  51K 109K 118K 101K  54K 106K  88K  45K  81K
it                                                             224K 117K 376K 387K 355K 143K 353K 255K  97K 253K
ja                                                                  140K 178K 194K 176K  66K 206K 146K  71K 209K
ko                                                                       100K 103K  98K  44K 111K  84K  50K 119K
nl                                                                            348K 330K 135K 294K 737K  93K 236K
pl                                                                                 305K 134K 345K 245K  95K 224K
pt                                                                                      121K 275K 231K  90K 214K
ro                                                                                           118K  95K  42K 108K
ru                                                                                                214K 101K 214K
sv                                                                                                      69K 181K
tr                                                                                                           68K
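
The figure of 253 language pairs follows from the 23 languages in the table (ar plus the 22 column languages): every unordered pair appears exactly once, which is why each row is shorter than the one before it. A minimal Python sketch of that count is given here. The pair-name format `l1-l2` is an assumption based on the single `nl-ro` example in the file format section, not a documented naming scheme.

```python
from itertools import combinations

# Language codes as they appear in the table (row "ar" plus the 22 columns).
LANGS = "ar bg cs da de el en es fa fi fr hu it ja ko nl pl pt ro ru sv tr zh".split()

# Every unordered pair once, in table order (row language first).
pairs = list(combinations(LANGS, 2))

# Assumed naming: "<row>-<column>", matching the one example "nl-ro".
pair_names = [f"{a}-{b}" for a, b in pairs]

print(len(pairs))        # 253
print(pair_names[:3])    # ['ar-bg', 'ar-cs', 'ar-da']
```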

File format

The root element of the XML file is wikipediaComparable. Its name attribute contains the language pair. The root element starts with a header element that has two child elements, both of type wikipediaSource. Their attributes give the language and the file name of each of the two source Wikipedia Monolingual Corpora XML files.

The header is followed by n elements of type articlePair, each with an id attribute holding a unique identification number. Each articlePair encloses two articles, one from the first-language Wikipedia and the corresponding one from the second-language Wikipedia. Corresponding means that the two articles are connected by a crosslanguage link (in either direction). "Deep" links, which point from an article to a section of a target article, have been replaced by links to the whole article (see the note on crosslanguage_links in the XML format description on the monolingual corpora page). Each article has a categories element and a content element, both copied from the respective Wikipedia Monolingual Corpora XML file. The content therefore includes p and h tags marking paragraphs and headings, as well as links and tables (see the XML format description on the monolingual corpora page).

<wikipediaComparable name="nl-ro">
   <header>
      <wikipediaSource language="nl" name="nlwiki-20140804-corpus.xml"/>
      <wikipediaSource language="ro" name="rowiki-20140729-corpus.xml"/>
   </header>
   <articlePair id="1">
      <article lang="nl" name="Les Fleurs du mal">
         <categories name="Dichtbundel|Franse literatuur|19e-eeuwse literatuur"/>
         <content>
            <p>Les Fleurs du mal (De bloemen van het kwaad) is de belangrijkste dichtbundel van de Franse dichter Charles Baudelaire.</p>
            ...
         </content>
      </article>
      <article lang="ro" name="Florile răului">
         <categories name="Cărți apărute în 1857"/>
         <content>
            <p>Florile răului este o culegere de poezii ale poetului francez Charles Baudelaire.</p>
            ...
         </content>
      </article>
   </articlePair>
   ...
</wikipediaComparable>
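
Because the unzipped corpus files are large, reading one into memory in a single step may not be practical. The following is a minimal Python sketch, not an official linguatools tool, that streams articlePair elements with the standard library's `xml.etree.ElementTree.iterparse`. The function name `iter_article_pairs`, the dictionary keys, and the shortened `SAMPLE` string are illustrative assumptions. Splitting the categories attribute on `|` is inferred from the example above.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the nl-ro example above (elided parts left out).
SAMPLE = """<wikipediaComparable name="nl-ro">
   <header>
      <wikipediaSource language="nl" name="nlwiki-20140804-corpus.xml"/>
      <wikipediaSource language="ro" name="rowiki-20140729-corpus.xml"/>
   </header>
   <articlePair id="1">
      <article lang="nl" name="Les Fleurs du mal">
         <categories name="Dichtbundel|Franse literatuur|19e-eeuwse literatuur"/>
         <content>
            <p>Les Fleurs du mal (De bloemen van het kwaad) is de belangrijkste dichtbundel van de Franse dichter Charles Baudelaire.</p>
         </content>
      </article>
      <article lang="ro" name="Florile răului">
         <categories name="Cărți apărute în 1857"/>
         <content>
            <p>Florile răului este o culegere de poezii ale poetului francez Charles Baudelaire.</p>
         </content>
      </article>
   </articlePair>
</wikipediaComparable>
"""


def iter_article_pairs(source):
    """Yield (pair_id, [article, article]) for each articlePair in source.

    source is a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "articlePair":
            continue
        articles = []
        for art in elem.findall("article"):
            cats = art.find("categories")
            content = art.find("content")
            # Assumption: categories are separated by "|" as in the example.
            cat_list = [c for c in cats.get("name", "").split("|") if c] if cats is not None else []
            # itertext() also collects text nested inside links within a paragraph.
            paras = ["".join(p.itertext()).strip() for p in content.iter("p")] if content is not None else []
            articles.append({
                "lang": art.get("lang"),
                "title": art.get("name"),
                "categories": cat_list,
                "paragraphs": paras,
            })
        yield elem.get("id"), articles
        # Drop the children of the processed pair to keep memory use low.
        elem.clear()


for pair_id, (first, second) in iter_article_pairs(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(pair_id, first["lang"], first["title"], "<->", second["lang"], second["title"])
```

For a downloaded corpus, the file path can be passed instead of the `io.BytesIO` object. Headings, tables, and links in the content are ignored by this sketch apart from link text inside paragraphs.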

Applications

Possible applications of the comparable corpora include

License

The Wikipedia Comparable Corpora files available for download above are derived from Wikipedia and are therefore made available under the same license as Wikipedia: the Creative Commons Attribution-ShareAlike license (CC BY-SA).


