Wikipedia parallel titles corpora

The Wikipedia Parallel Titles Corpora consist of bilingual titles of Wikipedia articles, extended with the titles’ redirects and textlinks (redirects and textlinks are explained on the Wikipedia Monolingual Corpora page). The corpora come in two formats: XML and Moses. The XML versions additionally contain the articles’ categories, which makes it possible to extract bilingual entries for selected categories only. For instance, extracting only the entries that belong to one of the German categories Mann or Frau yields a bilingual list of person names that is useful for transliteration learning. The Perl script xml2moses.pl described below can do this.

The Moses format files contain all bilingual permutations of article titles, redirects, and textlinks, so some of them are quite large. The English-German parallel corpus in Moses format, for example, has more than 16.5 million unique parallel segments.

Altogether, the 253 parallel corpus files contain 63,573,278 bilingual article titles. The Moses versions, which include the permuted redirects and textlinks, contain a total of 487,406,497 unique parallel entries.

Download

The table is divided in two: the upper right half (orange) contains the Moses files, the lower left half (green) the files in XML format. The numbers give the number of bilingual entries: for XML files, this is the number of bilingual article titles (not counting redirects and textlinks); for Moses files, it is the total number of bilingual entries including all permutations of titles, redirects, and textlinks. Click on a cell to download the corpus file.

Before downloading, make sure you have read and understood the license conditions (see below).

      ar    bg    cs    da    de    el    en    es    fa    fi    fr    hu    it    ja    ko    nl    pl    pt    ro    ru    sv    tr    zh
ar     -  432K  816K  514K  1.8M  401K  4.8M  1.9M  652K  744K  1.8M  582K  1.2M  896K  681K  882K  1.1M  1.1M  554K  1.8M  961K  710K  906K
bg   83K     -  536K  287K  1.5M  354K  2.5M  1.3M  279K  523K  1.0M  427K  772K  651K  489K  530K  717K  668K  388K  1.2M  682K  372K  456K
cs  105K   89K     -  597K  2.4M  446K  4.8M  2.0M  479K  932K  2.1M  784K  1.6M  972K  682K  1.1M  1.6M  1.2M  632K  2.3M  1.2M  675K  874K
da   89K   67K   97K     -  1.4M  260K  3.1M  1.3M  305K  590K  1.3M  441K  939K  610K  428K  644K  815K  719K  390K  1.2M  843K  435K  553K
de  179K  135K  239K  161K     -  875K 16.6M  5.9M  1.1M  2.2M  6.8M  1.6M  4.4M  2.8M  1.7M  3.1M  3.7M  3.0M  1.4M  5.8M  3.2M  1.5M  2.2M
el   57K   51K   57K   47K   82K     -  1.9M  846K  216K  389K  822K  330K  633K  438K  332K  406K  585K  518K  301K  885K  436K  336K  367K
en  432K  180K  315K  211K  1.1M  121K     - 16.6M  3.5M  5.1M 17.8M  3.7M 11.5M  7.8M  4.8M  7.0M  8.3M  8.3M  3.4M 13.6M  8.1M  3.9M  6.7M
es  210K  132K  204K  141K  498K   84K  1.0M     -  1.2M  2.1M  6.8M  1.6M  4.5M  2.6M  1.7M  2.9M  3.3M  3.8M  1.4M  5.2M  3.5M  1.6M  2.4M
fa  218K   94K  132K  105K  257K   64K  774K  281K     -  450K  1.1M  344K  789K  610K  481K  552K  661K  749K  348K  1.1M  644K  468K  616K
fi  111K   89K  136K  108K  249K   58K  348K  222K  136K     -  2.1M  707K  1.5M  977K  681K  995K  1.4M  1.2M  546K  2.1M  1.3M  668K  855K
fr  224K  144K  241K  161K  712K   90K  1.3M  637K  295K  260K     -  1.6M  5.1M  2.8M  1.7M  3.1M  3.6M  3.4M  1.4M  5.5M  3.2M  1.5M  2.4M
hu   90K   69K  106K   78K  169K   50K  250K  162K  108K  107K  187K     -  1.2M  737K  508K  799K  1.1M  922K  587K  1.6M  875K  546K  683K
it  177K  129K  216K  141K  563K   86K  999K  533K  244K  229K  693K  177K     -  2.0M  1.2M  2.1M  2.7M  2.5M  1.1M  4.1M  2.1M  1.1M  1.6M
ja  160K  106K  160K  122K  342K   69K  609K  315K  207K  179K  395K  130K  320K     -  1.3M  1.2M  1.5M  1.5M  605K  2.5M  1.3M  748K  1.7M
ko  146K   85K  127K  101K  211K   61K  395K  228K  196K  132K  257K   98K  208K  263K     -  758K  958K  994K  448K  1.6M  896K  578K  1.1M
nl  149K  113K  182K  128K  476K   68K  822K  477K  190K  192K  543K  141K  460K  241K  163K     -  1.8M  1.7M  731K  2.5M  2.1M  733K  1.1M
pl  160K  122K  207K  134K  497K   78K  841K  425K  214K  218K  549K  161K  490K  273K  180K  416K     -  2.0M  955K  3.7M  1.8M  956K  1.3M
pt  196K  123K  187K  135K  422K   80K  829K  524K  268K  201K  532K  157K  480K  287K  221K  409K  403K     -  883K  2.9M  1.6M  952K  1.4M
ro  110K   82K   97K   82K  180K   47K  320K  191K  134K   93K  222K   91K  201K  128K  110K  171K  181K  199K     -  1.4M  729K  489K  636K
ru  198K  149K  222K  148K  522K   85K  906K  472K  266K  232K  582K  164K  500K  326K  236K  380K  466K  429K  200K     -  2.8M  1.6M  2.2M
sv  162K  106K  171K  151K  384K   69K  745K  419K  226K  217K  465K  138K  357K  241K  195K  807K  334K  352K  161K  349K     -  831K  1.3M
tr  132K   82K  112K   91K  180K   61K  322K  192K  171K  117K  210K   91K  177K  154K  144K  147K  164K  193K  104K  210K  159K     -  711K
zh  173K  101K  143K  111K  307K   67K  621K  337K  223K  149K  374K  128K  332K  329K  231K  290K  292K  321K  171K  326K  272K  154K     -

XML file format

The XML files have a root element translations containing n elements of type translation. Each translation contains exactly one entry element for L1 and one for L2, holding the article titles. For each language there may be any number (including zero) of redirect, textlink and category elements, grouped under redirects, textlinks and categories elements respectively. Below is an example translation from the Czech-Turkish Parallel Titles Corpus.

<?xml version="1.0" encoding="UTF-8"?>
<translations>
<translation id="8">
   <entry lang="cs">Válka v Iráku</entry>
   <entry lang="tr">Irak Savaşı</entry>
   <redirects lang="cs">
      <redirect>Okupace Iráku</redirect>
      <redirect>Druhá válka v Zálivu</redirect>
      <redirect>Invaze do Iráku</redirect>
   </redirects>
   <redirects lang="tr">
      <redirect>2003 Irak Savaşı</redirect>
      <redirect>II. Körfez Savaşı</redirect>
      <redirect>2. Körfez Savaşı</redirect>
      <redirect>ABD&apos;nin Irak&apos;ı işgali</redirect>
      <redirect>Irak Harekâtı</redirect>
      <redirect>Irak savaşı</redirect>
      <redirect>Irak&apos;ın işgali</redirect>
      <redirect>2. Irak Savaşı</redirect>
      <redirect>Irak&apos;ın İşgali</redirect>
      <redirect>Irak işgali</redirect>
      <redirect>Irak Savaşı (II.Körfez Savaşı)</redirect>
      <redirect>İkinci Körfez Savaşı</redirect>
      <redirect>Iraq War</redirect>
      <redirect>II. Irak Savaşı</redirect>
      <redirect>2.Körfez Savaşı</redirect>
      <redirect>Amerikan-Irak Savaşı</redirect>
      <redirect>II.Körfez Savaşı</redirect>
      <redirect>Irak İşgali</redirect>
   </redirects>
   <textlinks lang="cs">
      <textlink>válce v Iráku</textlink>
      <textlink>válkou v Iráku</textlink>
      <textlink>Iráku</textlink>
      <textlink>invaze do Iráku</textlink>
      <textlink>invazi do Iráku</textlink>
      <textlink>Irácká svoboda</textlink>
      <textlink>operace Irácká svoboda</textlink>
      <textlink>válku v Iráku</textlink>
      <textlink>války v Iráku</textlink>
   </textlinks>
   <textlinks lang="tr">
      <textlink>Irak işgali</textlink>
      <textlink>Irak&apos;ın işgali</textlink>
      <textlink>Irak’ın işgali</textlink>
      <textlink>Irak’ın işgalinin</textlink>
      <textlink>Irak Savaşı</textlink>
   </textlinks>
   <categories lang="cs">
      <category>Válka v Iráku</category>
      <category>Války Iráku</category>
      <category>Války Česka</category>
      <category>Války Dánska</category>
      <category>Války USA</category>
      <category>Války Norska</category>
      <category>Války Spojeného království</category>
      <category>Války Polska</category>
      <category>Války Turecka</category>
      <category>Války 21. století</category>
      <category>Válka proti terorismu</category>
      <category>Invaze</category>
      <category>Války Austrálie</category>
   </categories>
   <categories lang="tr">
      <category>Irak Savaşı</category>
   </categories>
</translation>
</translations>
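
For readers who want to process the XML files without Perl, the following is a minimal Python sketch that streams over the translation elements and keeps only the title pairs that have a wanted category on one side. The function name iter_title_pairs and its parameters are illustrative and not part of the corpus tools; the element and attribute names are taken from the example above, and the first entry of each translation is assumed to be the L1 title. For a German-English file one might pass lang="de" and the categories Mann and Frau.

```python
import io
import xml.etree.ElementTree as ET


def iter_title_pairs(source, lang, wanted_categories):
    """Yield (l1_title, l2_title) for translations whose `lang` side
    has at least one category in `wanted_categories`.

    Hypothetical helper, not part of xml2moses.pl. `source` is a file
    name or a binary file object containing a <translations> document.
    """
    wanted = set(wanted_categories)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "translation":
            continue
        # Assumption: the first <entry> is L1, the second is L2.
        entries = elem.findall("entry")
        cats = {
            c.text
            for group in elem.findall("categories")
            if group.get("lang") == lang
            for c in group.findall("category")
        }
        if len(entries) == 2 and cats & wanted:
            yield entries[0].text, entries[1].text
        elem.clear()  # drop the children of a processed translation


# Shortened version of the Czech-Turkish example above.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<translations>
<translation id="8">
   <entry lang="cs">Válka v Iráku</entry>
   <entry lang="tr">Irak Savaşı</entry>
   <categories lang="cs"><category>Invaze</category></categories>
   <categories lang="tr"><category>Irak Savaşı</category></categories>
</translation>
</translations>
""".encode("utf-8")

pairs = list(iter_title_pairs(io.BytesIO(sample), "cs", {"Invaze"}))
print(pairs)
```

Streaming with iterparse rather than loading the whole tree is a deliberate choice here, since some of the corpus files are large.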

Extraction script: xml2moses.pl

The Perl script xml2moses.pl extracts customizable subsets of the bilingual title pairs, redirects and textlinks from an XML file. The output format is Moses: two files, each with one segment per line, where the n-th line in file #1 corresponds to the n-th line in file #2. The output files are encoded in UTF-8.

Usage: perl xml2moses.pl [OPTIONS] INPUT OUTPUT

This generates two files in Moses format (one entry per line; both output files have the same number of lines) named OUTPUT.<L1> and OUTPUT.<L2>.
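
Because the two output files are aligned line by line, they can be read back in parallel. The following Python sketch shows one way to do this. It assumes that the <L1> and <L2> suffixes are the language codes (for example cs and tr); the helper name read_moses_pairs and the demonstration files are made up for illustration.

```python
import os
import tempfile
from itertools import zip_longest


def read_moses_pairs(prefix, l1, l2):
    """Yield (l1_segment, l2_segment) from PREFIX.<l1> and PREFIX.<l2>."""
    with open(f"{prefix}.{l1}", encoding="utf-8") as f1, \
         open(f"{prefix}.{l2}", encoding="utf-8") as f2:
        for left, right in zip_longest(f1, f2):
            # A None on either side means one file ran out of lines
            # first, so the line-by-line alignment is broken.
            if left is None or right is None:
                raise ValueError("output files differ in length")
            yield left.rstrip("\n"), right.rstrip("\n")


# Tiny demonstration with made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "titles")
    with open(prefix + ".cs", "w", encoding="utf-8") as f:
        f.write("Válka v Iráku\nInvaze do Iráku\n")
    with open(prefix + ".tr", "w", encoding="utf-8") as f:
        f.write("Irak Savaşı\nIrak işgali\n")
    moses_pairs = list(read_moses_pairs(prefix, "cs", "tr"))

print(moses_pairs)
```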

OPTIONS:

-include-redirects Adds all redirects from both languages to generate all permutations of translation pairs, i.e. entryL1 – redirectL2-1, entryL1 – redirectL2-2, redirectL1-1 – redirectL2-1, …
-include-textlinks Same as above, but includes textlinks.
-exclude-categories-l1 FILE If the L1 part of a translation entry has one of the categories listed in FILE, the whole entry is ignored. FILE must have one category per line.
-exclude-categories-l2 FILE Same as above, but for the L2 part.
-only-categories-l1 FILE Only translations that have one of the categories listed in FILE in their L1 part are output.
-only-categories-l2 FILE Same as above, but for the L2 part.
-no-equal Translations that are equal on both sides are ignored.
-no-colon Translations that contain a colon on either side are ignored.
-no-numbers Translations that contain a number on either side are ignored.
-check-unicode-range This option is useful only for certain pairs of L1 and L2. It checks whether any character on one side of a translation belongs to the Unicode range of the script of the other side’s language. If so, the translation is ignored. The following scripts are distinguished:

  • Arabic (ar,fa)
  • Cyrillic (bg,ru)
  • Greek (el)
  • CJK (ja,ko,zh)
  • Latin (cs,da,de,en,es,fi,fr,hu,it,nl,pl,pt,ro,sv,tr)

For instance, the following translation would be ignored if L1=ar and L2=en:
Subway :: Subway
whereas the next one would be output:
سب واي :: Subway
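
The script check can be approximated in Python as follows. This is a sketch under stated assumptions, not the logic of xml2moses.pl itself: the code point ranges in SCRIPT_RANGES are rough and incomplete, and the names LANG_SCRIPT, in_script and keep_pair are hypothetical. Only the grouping of languages into scripts is taken from the list above.

```python
# Rough, incomplete code point ranges per script (an assumption; the
# ranges actually used by xml2moses.pl are not documented here).
SCRIPT_RANGES = {
    "Arabic": [(0x0600, 0x06FF), (0x0750, 0x077F)],
    "Cyrillic": [(0x0400, 0x04FF)],
    "Greek": [(0x0370, 0x03FF)],
    "CJK": [(0x3040, 0x30FF), (0x4E00, 0x9FFF), (0xAC00, 0xD7AF)],
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
}

# Language-to-script grouping as given in the list above.
LANG_SCRIPT = {}
for script, langs in {
    "Arabic": "ar fa",
    "Cyrillic": "bg ru",
    "Greek": "el",
    "CJK": "ja ko zh",
    "Latin": "cs da de en es fi fr hu it nl pl pt ro sv tr",
}.items():
    for lang in langs.split():
        LANG_SCRIPT[lang] = script


def in_script(ch, script):
    """True if the character falls into one of the script's ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in SCRIPT_RANGES[script])


def keep_pair(l1_text, l2_text, l1, l2):
    """Return False if either side contains a character from the
    script of the other side's language."""
    s1, s2 = LANG_SCRIPT[l1], LANG_SCRIPT[l2]
    if s1 == s2:
        return True  # the check only makes sense across scripts
    if any(in_script(ch, s2) for ch in l1_text):
        return False
    if any(in_script(ch, s1) for ch in l2_text):
        return False
    return True


print(keep_pair("Subway", "Subway", "ar", "en"))
print(keep_pair("سب واي", "Subway", "ar", "en"))
```

With these ranges, the first call returns False (the Arabic side contains Latin letters) and the second returns True, matching the two examples above.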

The Moses files in the table above have been generated with the options -include-redirects and -include-textlinks.
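
To illustrate what this combination of options amounts to, the sketch below pools the title, redirects and textlinks of each side into one list of variants and pairs every L1 variant with every L2 variant. This is an assumption about how the permutations are formed, based on the option descriptions above; how exactly xml2moses.pl removes duplicates is not stated here, and the function name permute is made up. For the Czech-Turkish example above, the Czech side has 1 + 3 + 9 = 13 variants and the Turkish side 1 + 18 + 5 = 24, so a single translation yields at most 13 × 24 = 312 pairs, fewer once duplicates are removed.

```python
from itertools import product


def permute(l1_variants, l2_variants):
    """Return the unique (l1, l2) pairs of two lists of variants,
    in first-seen order."""
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(product(l1_variants, l2_variants)))


# A few variants from the example; the Turkish list deliberately
# contains a duplicate, as happens when a title is also a textlink.
cs = ["Válka v Iráku", "Okupace Iráku", "Invaze do Iráku"]
tr = ["Irak Savaşı", "Irak işgali", "Irak Savaşı"]

title_pairs = permute(cs, tr)
print(len(title_pairs))  # 3 Czech variants x 2 unique Turkish variants = 6
```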

Use for training statistical machine translation systems

To evaluate the usefulness of the Wikipedia Parallel Titles Corpora for training statistical machine translation systems, we trained the Moses SMT system on several combinations of the Europarl corpus and subsets of the Wikipedia Parallel Titles Corpora (German to English). We followed the procedure described here. The table below shows the results.

Corpus                              No. of sentence pairs  Moses BLEU score
europarl-v7                         1,934,299              18.76
europarl-v7 + wikititles-2014       18,484,789             18.22
europarl-v7 + wiki-onlyTitles       2,335,888              19.00
europarl-v7 + wikititles-redirects  9,229,773              18.18

The best BLEU score is achieved when the Europarl corpus is augmented with the version of the Wikipedia Parallel Titles corpus that contains only the titles, without redirects or textlinks. Presumably, the redirects and textlinks introduce too much noise. The following table shows the options that were used to generate the subsets from the wikititles XML files with xml2moses.pl.

Corpus                Created with xml2moses.pl options
wikititles-2014       -include-redirects -include-textlinks
wiki-onlyTitles       -no-equal -no-colon -no-numbers
wikititles-redirects  -include-redirects -no-equal -no-colon -no-numbers
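
The three filter options used for the cleaner subsets are simple to approximate. The following Python sketch is one possible reading of them, not the script's actual code: it assumes that "a number" means any decimal digit, and the function name passes_filters and the category-style sample pair are made up.

```python
import re


def passes_filters(l1_text, l2_text):
    """Apply rough equivalents of -no-equal, -no-colon and -no-numbers."""
    if l1_text == l2_text:                    # -no-equal
        return False
    if ":" in l1_text or ":" in l2_text:      # -no-colon
        return False
    # -no-numbers, read here as "contains any decimal digit"
    if re.search(r"\d", l1_text) or re.search(r"\d", l2_text):
        return False
    return True


print(passes_filters("Válka v Iráku", "Irak Savaşı"))
print(passes_filters("Subway", "Subway"))
print(passes_filters("Kategorie:Války", "Kategori:Savaşlar"))
print(passes_filters("Války 21. století", "Irak Savaşı"))
```

Only the first pair passes; the others are rejected by the equality, colon and digit checks respectively.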

License

The Wikipedia Parallel Titles Corpora that you can download above are derived from Wikipedia and are therefore made available under the same license as Wikipedia: the Creative Commons Attribution-ShareAlike license.

