Wikipedia parallel titles corpora

The Wikipedia Parallel Titles Corpora consist of bilingual titles of Wikipedia articles, extended with the titles' redirects and textlinks (redirects and textlinks are explained on the Wikipedia Monolingual Corpora page). The corpora come in two formats: XML and Moses. The XML versions additionally contain the articles' categories, which makes it possible to extract bilingual entries for selected categories only. For instance, extracting only the entries that belong to one of the German categories Mann or Frau yields a bilingual list of person names that is useful for transliteration learning. The Perl script xml2moses.pl described below can do this.

The Moses format files contain all bilingual permutations of article titles, redirects, and textlinks, so some of them are quite large. The English-German parallel corpus in Moses format, for example, has more than 16.5 million unique parallel segments.

Altogether, the 253 parallel corpus files contain 63,573,278 bilingual article titles. The Moses versions, with the permuted redirects and textlinks, contain a total of 487,406,497 unique parallel entries.

Download

The table is divided into two halves: the upper right half (orange) contains the Moses files, the lower left half (green) the files in XML format. The numbers in the table give the number of bilingual entries. For XML files, this is the number of bilingual article titles (not counting redirects and textlinks); for Moses files, it is the total number of bilingual entries including all permutations of titles, redirects, and textlinks. Click on a cell to download the corpus file.

Before downloading, make sure you have read and understood the license conditions (see below).

XML file format

The XML files have a root element translations containing n elements of type translation. Each translation contains exactly one L1 and one L2 entry element, which hold the article titles. For each language there may be any number (including zero) of redirect, textlink and category elements, grouped inside redirects, textlinks and categories elements that carry a lang attribute. Below is an example translation from the Czech-Turkish Parallel Titles Corpus.

<?xml version="1.0" encoding="UTF-8"?>
<translations>
<translation id="8">
   <entry lang="cs">Válka v Iráku</entry>
   <entry lang="tr">Irak Savaşı</entry>
   <redirects lang="cs">
      <redirect>Okupace Iráku</redirect>
      <redirect>Druhá válka v Zálivu</redirect>
      <redirect>Invaze do Iráku</redirect>
   </redirects>
   <redirects lang="tr">
      <redirect>2003 Irak Savaşı</redirect>
      <redirect>II. Körfez Savaşı</redirect>
      <redirect>2. Körfez Savaşı</redirect>
      <redirect>ABD&apos;nin Irak&apos;ı işgali</redirect>
      <redirect>Irak Harekâtı</redirect>
      <redirect>Irak savaşı</redirect>
      <redirect>Irak&apos;ın işgali</redirect>
      <redirect>2. Irak Savaşı</redirect>
      <redirect>Irak&apos;ın İşgali</redirect>
      <redirect>Irak işgali</redirect>
      <redirect>Irak Savaşı (II.Körfez Savaşı)</redirect>
      <redirect>İkinci Körfez Savaşı</redirect>
      <redirect>Iraq War</redirect>
      <redirect>II. Irak Savaşı</redirect>
      <redirect>2.Körfez Savaşı</redirect>
      <redirect>Amerikan-Irak Savaşı</redirect>
      <redirect>II.Körfez Savaşı</redirect>
      <redirect>Irak İşgali</redirect>
   </redirects>
   <textlinks lang="cs">
      <textlink>válce v Iráku</textlink>
      <textlink>válkou v Iráku</textlink>
      <textlink>Iráku</textlink>
      <textlink>invaze do Iráku</textlink>
      <textlink>invazi do Iráku</textlink>
      <textlink>Irácká svoboda</textlink>
      <textlink>operace Irácká svoboda</textlink>
      <textlink>válku v Iráku</textlink>
      <textlink>války v Iráku</textlink>
   </textlinks>
   <textlinks lang="tr">
      <textlink>Irak işgali</textlink>
      <textlink>Irak&apos;ın işgali</textlink>
      <textlink>Irak’ın işgali</textlink>
      <textlink>Irak’ın işgalinin</textlink>
      <textlink>Irak Savaşı</textlink>
   </textlinks>
   <categories lang="cs">
      <category>Válka v Iráku</category>
      <category>Války Iráku</category>
      <category>Války Česka</category>
      <category>Války Dánska</category>
      <category>Války USA</category>
      <category>Války Norska</category>
      <category>Války Spojeného království</category>
      <category>Války Polska</category>
      <category>Války Turecka</category>
      <category>Války 21. století</category>
      <category>Válka proti terorismu</category>
      <category>Invaze</category>
      <category>Války Austrálie</category>
   </categories>
   <categories lang="tr">
      <category>Irak Savaşı</category>
   </categories>
</translation>
</translations>
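The structure above can be read with any standard XML parser. The following Python sketch is not part of the corpus distribution; it only illustrates one way to load translation elements and keep those that carry a given category. The function names `parse_translation` and `filter_by_category` and the shortened `SAMPLE` document are made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the Czech-Turkish example (illustration only).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<translations>
<translation id="8">
   <entry lang="cs">Válka v Iráku</entry>
   <entry lang="tr">Irak Savaşı</entry>
   <redirects lang="cs">
      <redirect>Okupace Iráku</redirect>
   </redirects>
   <textlinks lang="tr">
      <textlink>Irak işgali</textlink>
   </textlinks>
   <categories lang="cs">
      <category>Invaze</category>
   </categories>
</translation>
</translations>"""

GROUPS = (("redirects", "redirect"),
          ("textlinks", "textlink"),
          ("categories", "category"))


def parse_translation(elem):
    """Turn one <translation> element into a dict keyed by language code."""
    record = {}
    for entry in elem.findall("entry"):
        record[entry.get("lang")] = {
            "title": entry.text, "redirects": [], "textlinks": [], "categories": []
        }
    for group, item in GROUPS:
        for container in elem.findall(group):
            side = record.setdefault(
                container.get("lang"),
                {"title": None, "redirects": [], "textlinks": [], "categories": []},
            )
            side[group] = [child.text for child in container.findall(item)]
    return record


def filter_by_category(records, lang, wanted):
    """Keep records whose `lang` side has at least one category in `wanted`."""
    wanted = set(wanted)
    return [r for r in records if wanted & set(r[lang]["categories"])]


root = ET.fromstring(SAMPLE.encode("utf-8"))
records = [parse_translation(t) for t in root.findall("translation")]
invasions = filter_by_category(records, "cs", {"Invaze"})
```

For the full corpus files, which can be large, a streaming parser such as `xml.etree.ElementTree.iterparse` is probably a better fit than loading the whole document at once.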

Extraction script: xml2moses.pl

The Perl script xml2moses.pl extracts customizable subsets of the bilingual title pairs, redirects and textlinks from an XML file. The output format is Moses: two files, each with one segment per line, where the n-th line in file #1 corresponds to the n-th line in file #2. The encoding of the output files is UTF-8.

Usage: perl xml2moses.pl [OPTIONS] INPUT OUTPUT

This generates two files in Moses format (one entry per line, both output files have the same number of lines) with the names OUTPUT.<L1> and OUTPUT.<L2>.
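Because the two output files are aligned line by line, they can be read back as pairs. The Python sketch below shows one way to do that; the function name `read_moses_pairs` is made up for this example, and the demo writes two tiny files into a temporary directory rather than using real corpus files.

```python
import os
import tempfile


def read_moses_pairs(prefix, l1, l2):
    """Read PREFIX.<l1> and PREFIX.<l2> and return a list of aligned pairs."""
    with open(f"{prefix}.{l1}", encoding="utf-8") as f1, \
         open(f"{prefix}.{l2}", encoding="utf-8") as f2:
        left = [line.rstrip("\n") for line in f1]
        right = [line.rstrip("\n") for line in f2]
    # Both files must have the same number of lines to stay aligned.
    if len(left) != len(right):
        raise ValueError(f"line count mismatch: {len(left)} vs {len(right)}")
    return list(zip(left, right))


# Small demo with temporary files (illustration only).
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "titles")
    with open(prefix + ".cs", "w", encoding="utf-8") as f:
        f.write("Válka v Iráku\nInvaze do Iráku\n")
    with open(prefix + ".tr", "w", encoding="utf-8") as f:
        f.write("Irak Savaşı\nIrak Savaşı\n")
    demo_pairs = read_moses_pairs(prefix, "cs", "tr")
```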

OPTIONS:

`-include-redirects`	Adds all redirects from both languages to generate all permutations of translation pairs, i.e. entryL1 – redirectL2-1, entryL1 – redirectL2-2, redirectL1-1 – redirectL2-1, …
`-include-textlinks`	Same as above, but includes textlinks.
`-exclude-categories-l1 FILE`	If the L1 part of a translation entry has one of the categories listed in FILE, the whole entry is ignored. FILE must have one category per line.
`-exclude-categories-l2 FILE`	Same as above, but for the L2 part.
`-only-categories-l1 FILE`	Only translations whose L1 part has one of the categories listed in FILE are output.
`-only-categories-l2 FILE`	Same as above, but for the L2 part.
`-no-equal`	Translations that are equal on both sides are ignored.
`-no-colon`	Translations that contain a colon on either side are ignored.
`-no-numbers`	Translations that contain a number on either side are ignored.
`-check-unicode-range`	This option is useful only for certain pairs of L1 and L2. It checks whether a character from one side of a translation belongs to the Unicode range of the other side's language script. If so, the translation is ignored. The following scripts are distinguished: Arabic (ar, fa), Cyrillic (bg, ru), Greek (el), CJK (ja, ko, zh), and Latin (cs, da, de, en, es, fi, fr, hu, it, nl, pl, pt, ro, sv, tr). For instance, the following translation would be ignored if L1=ar and L2=en: `Subway :: Subway`, whereas the next one would be output: `سب واي :: Subway`.
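The idea behind `-check-unicode-range` can be sketched in Python as follows. This is not the logic of xml2moses.pl itself: the code point ranges in `SCRIPT_RANGES` are rough assumptions chosen for illustration, and the names `in_script` and `is_mixed_script` are hypothetical.

```python
# Approximate Unicode ranges per script (assumed for this sketch; the
# actual ranges used by xml2moses.pl are not documented here).
SCRIPT_RANGES = {
    "Arabic": [(0x0600, 0x06FF)],
    "Cyrillic": [(0x0400, 0x04FF)],
    "Greek": [(0x0370, 0x03FF)],
    "CJK": [(0x3040, 0x30FF), (0x4E00, 0x9FFF), (0xAC00, 0xD7AF)],
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
}

# Language-to-script mapping as listed in the option description.
LANG_SCRIPT = {}
for script, langs in {
    "Arabic": "ar fa", "Cyrillic": "bg ru", "Greek": "el", "CJK": "ja ko zh",
    "Latin": "cs da de en es fi fr hu it nl pl pt ro sv tr",
}.items():
    for lang in langs.split():
        LANG_SCRIPT[lang] = script


def in_script(ch, script):
    """True if the character falls into one of the script's ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in SCRIPT_RANGES[script])


def is_mixed_script(src, tgt, l1, l2):
    """True if one side contains characters of the other side's script."""
    s1, s2 = LANG_SCRIPT[l1], LANG_SCRIPT[l2]
    if s1 == s2:
        return False  # the check is not meaningful for same-script pairs
    return (any(in_script(c, s2) for c in src)
            or any(in_script(c, s1) for c in tgt))
```

With L1=ar and L2=en, `is_mixed_script("Subway", "Subway", "ar", "en")` flags the pair because the Arabic side contains Latin letters, while `is_mixed_script("سب واي", "Subway", "ar", "en")` does not.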

The Moses files in the table above have been generated with the options `-include-redirects` and `-include-textlinks`.
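To see why these two options make the Moses files so much larger than the XML title counts, consider the following Python sketch. It assumes that "all permutations" means the full cross product of the title variants on both sides, with duplicates removed; the helper names `variants` and `all_pairs` are made up for this example and do not come from xml2moses.pl.

```python
from itertools import product


def variants(title, redirects=(), textlinks=()):
    """Title plus its redirects and textlinks, deduplicated, order preserved."""
    return list(dict.fromkeys([title, *redirects, *textlinks]))


def all_pairs(l1_variants, l2_variants):
    """Every L1 variant paired with every L2 variant (cross product)."""
    return list(product(l1_variants, l2_variants))


cs = variants("Válka v Iráku", ["Okupace Iráku"], ["válce v Iráku"])
tr = variants("Irak Savaşı", ["Iraq War"], ["Irak Savaşı"])
pairs = all_pairs(cs, tr)
```

Here three Czech variants and two distinct Turkish variants yield six pairs from a single article title pair, so the number of entries grows with the product of the variant counts on both sides.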

Use for training statistical machine translation systems

To evaluate the usefulness of the Wikipedia Parallel Titles corpora for training statistical machine translation systems, we trained the Moses SMT system on several combinations of the Europarl corpus and subsets of the Wikipedia Parallel Titles corpora (German to English). We followed the procedure described here. The table below shows the results.

Corpus	No. of sentence pairs	Moses BLEU score
europarl-v7	1,934,299	18.76
europarl-v7 + wikititles-2014	18,484,789	18.22
europarl-v7 + wiki-onlyTitles	2,335,888	19.00
europarl-v7 + wikititles-redirects	9,229,773	18.18

The best BLEU score is achieved when the Europarl corpus is augmented with the version of the Wikipedia Parallel Titles corpus that contains only the titles, without redirects or textlinks. Presumably, the redirects and textlinks introduce too much noise. The following table shows the options that were used to generate the subsets from the wikititles XML files with xml2moses.pl.

Corpus	created with `xml2moses.pl` options
wikititles-2014	`-include-redirects -include-textlinks`
wiki-onlyTitles	`-no-equal -no-colon -no-numbers`
wikititles-redirects	`-include-redirects -no-equal -no-colon -no-numbers`
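The three filter options used for the smaller subsets can be approximated in a few lines of Python. This is a sketch under assumptions, not the behaviour of xml2moses.pl: in particular, it treats "contains a number" as "contains any digit character", and the function name `keep_pair` is hypothetical.

```python
def keep_pair(src, tgt, no_equal=True, no_colon=True, no_numbers=True):
    """Return True if the pair passes all enabled filters."""
    if no_equal and src == tgt:
        return False  # identical on both sides
    if no_colon and (":" in src or ":" in tgt):
        return False  # colon on either side
    if no_numbers and any(ch.isdigit() for ch in src + tgt):
        return False  # assumption: "number" means any digit character
    return True


candidates = [
    ("Válka v Iráku", "Irak Savaşı"),
    ("Iraq War", "Iraq War"),
    ("Válka v Iráku", "2003 Irak Savaşı"),
    ("Kategorie:Invaze", "Kategori:Irak Savaşı"),
]
kept = [pair for pair in candidates if keep_pair(*pair)]
```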

License

The Wikipedia Parallel Titles Corpora that you can download above are derived from Wikipedia and are therefore made available under the same license as Wikipedia: the Creative Commons Attribution-ShareAlike license.