Wikipedia comparable corpora

The Wikipedia Comparable Corpora are bilingual, document-aligned text corpora. They have been extracted from the Wikipedia Monolingual Corpora’s XML files using the cross-language links. Each comparable corpus consists of document pairs: a Wikipedia article in language L1 and the linked article in language L2 on the same subject. Altogether, there are over 41 million aligned articles for 253 language pairs. The 253 corpus files occupy 405 GB of disk space when unzipped.

Download

The table cells contain the number of aligned articles for each language pair. If you hover over a cell, a tooltip shows the number of tokens in each of the two languages. Click on a cell to download the corpus file.

Before downloading, make sure you have read and understood the license conditions (see below).

File format

The XML file’s root element is wikipediaComparable. Its attribute name contains the language pair. The root element starts with a header that has two child elements, both of type wikipediaSource. Their attributes give the languages and the names of the two source Wikipedia Monolingual Corpora XML files.

The header is followed by n elements of type articlePair, each with an attribute id that holds a unique identification number. Each articlePair encloses two articles: one from the first-language Wikipedia and the corresponding one from the second-language Wikipedia. "Corresponding" means that the two articles are connected by a cross-language link (in either direction). "Deep" links, which point from an article to a section of a target article, have been replaced by links to the whole article (see the note on crosslanguage_links in the XML format description on the monolingual corpora page). Each article has a number of categories and its content. Both are copied from the respective Wikipedia Monolingual Corpora XML files. The content therefore includes p and h tags marking paragraphs and headings, as well as links and tables (see the XML format description on the monolingual corpora page).

<wikipediaComparable name="nl-ro">
   <header>
      <wikipediaSource language="nl" name="nlwiki-20140804-corpus.xml"/>
      <wikipediaSource language="ro" name="rowiki-20140729-corpus.xml"/>
   </header>
   <articlePair id="1">
      <article lang="nl" name="Les Fleurs du mal">
         <categories name="Dichtbundel|Franse literatuur|19e-eeuwse literatuur"/>
         <content>
            <p>Les Fleurs du mal (De bloemen van het kwaad) is de belangrijkste dichtbundel van de Franse dichter Charles Baudelaire.</p>
            ...
         </content>
      </article>
      <article lang="ro" name="Florile răului">
         <categories name="Cărți apărute în 1857"/>
         <content>
            <p>Florile răului este o culegere de poezii ale poetului francez Charles Baudelaire.</p>
            ...
         </content>
      </article>
   </articlePair>
   ...
</wikipediaComparable>
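
Because the unzipped corpus files are large, reading one with a streaming parser is likely more practical than loading it whole. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree.iterparse. The function name iter_article_pairs and the shape of the returned dictionaries are illustrative choices, not part of the corpus distribution. The sketch also assumes that categories are separated by "|" inside the name attribute, as in the example above, and it flattens each p element to plain text, discarding link markup.

```python
import io
import xml.etree.ElementTree as ET


def iter_article_pairs(source):
    """Yield one dict per <articlePair>, streaming through the file.

    `source` may be a file name or a binary file object.
    (Hypothetical helper; not shipped with the corpora.)
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event: start of <wikipediaComparable>
    for event, elem in context:
        if event != "end" or elem.tag != "articlePair":
            continue
        articles = []
        for article in elem.findall("article"):
            categories = article.find("categories")
            content = article.find("content")
            # Assumption: categories are "|"-separated in the name attribute.
            cat_names = categories.get("name", "") if categories is not None else ""
            paragraphs = []
            if content is not None:
                # itertext() flattens nested markup (e.g. links) to plain text.
                paragraphs = ["".join(p.itertext()).strip() for p in content.iter("p")]
            articles.append({
                "lang": article.get("lang"),
                "title": article.get("name"),
                "categories": [c for c in cat_names.split("|") if c],
                "paragraphs": paragraphs,
            })
        yield {"id": elem.get("id"), "articles": articles}
        root.clear()  # discard already processed elements to keep memory flat


# Small in-memory sample in the documented format, for demonstration only.
SAMPLE = """<wikipediaComparable name="nl-ro">
   <header>
      <wikipediaSource language="nl" name="nlwiki-20140804-corpus.xml"/>
      <wikipediaSource language="ro" name="rowiki-20140729-corpus.xml"/>
   </header>
   <articlePair id="1">
      <article lang="nl" name="Les Fleurs du mal">
         <categories name="Dichtbundel|Franse literatuur|19e-eeuwse literatuur"/>
         <content>
            <p>Les Fleurs du mal ...</p>
         </content>
      </article>
      <article lang="ro" name="Florile răului">
         <categories name="Cărți apărute în 1857"/>
         <content>
            <p>Florile răului ...</p>
         </content>
      </article>
   </articlePair>
</wikipediaComparable>"""

pairs = list(iter_article_pairs(io.BytesIO(SAMPLE.encode("utf-8"))))
for pair in pairs:
    first, second = pair["articles"]
    print(pair["id"], first["lang"], first["title"], "<->", second["lang"], second["title"])
```

For a real corpus file, pass the path of the unzipped XML file instead of the in-memory sample and consume the generator lazily rather than wrapping it in list().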

Applications

Possible applications of the comparable corpora include:

extraction of bilingual dictionaries (e.g. Rapp 1999, Prochasson and Fung 2011, Rapp et al. 2012),
extraction of parallel sentence pairs (e.g. Ştefănescu and Ion 2013),
helping translators find bilingual terminology.

License

The Wikipedia Comparable Corpora files that you can download above are derived from Wikipedia and are therefore made available under the same license as Wikipedia: the Creative Commons Attribution-ShareAlike license.