Here you can download text corpora extracted from the Wikipedia dumps in 30 languages, amounting to nearly 10 billion tokens altogether. Each XML file contains the full textual content of the individual language version of Wikipedia, extended with many annotations like article and paragraph boundaries, number of links referring to each article, crosslanguage links, categories and more. Have a quick look at a sample XML file containing one English article.
Download
Wikipedia XML file | unzipped file size | language | number of articles | number of paragraphs | number of tokens |
arwiki-20180920-corpus.xml.bz2 | 2.6G | Arabic | 709,674 | 2,067,569 | 131,120,680 |
bgwiki-20180920-corpus.xml.bz2 | 2.1G | Bulgarian | 305,692 | 1,113,244 | 113,426,508 |
cawiki-20181001-corpus.xml.bz2 | 2.7G | Catalan | 632,334 | 3,036,618 | 212,831,511 |
cswiki-20181001-corpus.xml.bz2 | 2.1G | Czech | 522,760 | 1,979,077 | 132,763,620 |
dawiki-20181001-corpus.xml.bz2 | 1.1G | Danish | 282,756 | 1,109,150 | 66,676,633 |
dewiki-20180920-corpus.xml.bz2 | 12G | German | 2,212,874 | 12,161,736 | 913,428,282 |
elwiki-20181001-corpus.xml.bz2 | 1.3G | Greek | 192,400 | 935,726 | 70,194,124 |
enwiki-20181001-corpus.xml.bz2 | 26G | English | 5,690,374 | 29,873,050 | 2,319,783,831 |
eswiki-20181001-corpus.xml.bz2 | 8.1G | Spanish | 1,560,193 | 9,159,971 | 692,375,637 |
fawiki-20181001-corpus.xml.bz2 | 2.6G | Farsi | 2,285,821 | 3,154,360 | 105,336,497 |
fiwiki-20181001-corpus.xml.bz2 | 1.9G | Finnish | 542,723 | 2,022,231 | 102,487,802 |
frwiki-20181001-corpus.xml.bz2 | 11G | French | 2,301,659 | 13,028,006 | 914,601,321 |
hewiki-20181001-corpus.xml.bz2 | 2.2G | Hebrew | 325,635 | 2,002,756 | 137,903,919 |
huwiki-20181001-corpus.xml.bz2 | 2.5G | Hungarian | 518,731 | 2,473,834 | 153,998,201 |
idwiki-20181001-corpus.xml.bz2 | 1.5G | Indonesian | 502,761 | 1,553,719 | 88,015,056 |
itwiki-20181001-corpus.xml.bz2 | 7.5G | Italian | 1,806,393 | 8,239,444 | 588,876,879 |
jawiki-20181001-corpus.xml.bz2 | 8.6G | Japanese | 1,698,241 | 8,572,952 | 110,367,615 |
kowiki-20181001-corpus.xml.bz2 | 2.0G | Korean | 466,824 | 1,737,828 | 93,761,999 |
nlwiki-20181001-corpus.xml.bz2 | 4.7G | Dutch | 2,381,710 | 5,244,373 | 307,234,270 |
nowiki-20181001-corpus.xml.bz2 | 2.3G | Norwegian | 538,156 | 1,987,561 | 125,111,145 |
plwiki-20181001-corpus.xml.bz2 | 5.3G | Polish | 1,686,452 | 5,158,412 | 333,488,821 |
ptwiki-20181001-corpus.xml.bz2 | 4.3G | Portuguese | 1,521,288 | 5,248,294 | 319,844,644 |
rowiki-20181001-corpus.xml.bz2 | 1.5G | Romanian | 467,493 | 1,449,020 | 88,772,650 |
ruwiki-20181001-corpus.xml.bz2 | 12G | Russian | 2,590,008 | 10,295,294 | 610,761,713 |
svwiki-20181001-corpus.xml.bz2 | 6.4G | Swedish | 3,812,043 | 11,082,359 | 407,815,102 |
thwiki-20181001-corpus.xml.bz2 | 1.2G | Thai | 183,526 | 693,122 | 20,784,378 |
trwiki-20181001-corpus.xml.bz2 | 1.5G | Turkish | 397,046 | 1,316,176 | 78,212,294 |
ukwiki-20181001-corpus.xml.bz2 | 4.8G | Ukrainian | 972,947 | 4,551,432 | 233,609,331 |
viwiki-20181001-corpus.xml.bz2 | 2.2G | Vietnamese | 1,342,595 | 2,525,764 | 150,287,491 |
zhwiki-20181001-corpus.xml.bz2 | 4.1G | Chinese | 1,619,172 | 4,853,082 | 54,747,370 |
XML Format Description
The XML files contain the following information:
XML element | mother element | description |
article | wikipedia (XML root) | This element encloses each Wikipedia article. It has an attribute name which contains the article’s title. The title is unique (in the current language version). |
redirect | article | Articles that are redirects to another article are not stored in the XML files. However, the redirect information is contained in the redirect element(s) of the target article. The attribute name contains an article title that redirects to the present article. An article can have 0..n daughter elements of type redirect. |
links_in | article | The attribute name contains the total number of textlinks in other articles that link to the current article. |
textlink | article | The attribute name contains the anchor text that is used in a textlink to refer to the current article. The attribute freq contains the number of times this anchor text was used to refer to the current article. |
category | article | The attribute name contains a Wikipedia category the current article is assigned to. An article can have 0..n daughter elements of type category. |
links_out | article | The attribute name contains the number of textlinks in the current article that refer to other Wikipedia articles. |
crosslanguage_link | article | The attribute name contains the title of a Wikipedia article in another language the current article is linked to. The attribute language specifies the target language. Important note: In Wikipedia, crosslanguage links may link an article to a section of a target article. In this case, the link contains the article name followed by a '#' and the section’s anchor name. There are also crosslanguage links to other namespaces, e.g. categories or portals. These links contain a colon that separates the namespace from the link target. |
disambiguation | article | This element marks the article as a disambiguation page. It has no attributes. |
content | article | This element encloses the current article’s textual content. |
p | content, h, math, table | This element marks a paragraph boundary. |
link | content, h, math, table | Marks a textlink to another Wikipedia article. The title of the target article is contained in the attribute target. |
h | content | Marks a heading. |
math | content | Marks a math formula. |
table | content | Marks a table. |
cell | table | Marks a table cell. |
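Because the unzipped files reach up to 26G, it can help to read them as a stream rather than load them whole. The following Python sketch is not part of the distribution; it only assumes the element and attribute names listed in the table above. The function name iter_articles and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET


def iter_articles(source):
    """Yield one metadata dict per <article> element while streaming the XML.

    `source` is a path or a binary file object. Element and attribute names
    follow the format table; everything else here is an illustrative choice.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != "article":
            continue
        links_in = elem.find("links_in")
        article = {
            "title": elem.get("name"),
            "redirects": [r.get("name") for r in elem.findall("redirect")],
            "categories": [c.get("name") for c in elem.findall("category")],
            # Kept as the raw attribute string; convert with int() if the
            # values turn out to be plain integers in your files.
            "links_in": links_in.get("name") if links_in is not None else None,
            "is_disambiguation": elem.find("disambiguation") is not None,
        }
        yield article
        root.clear()  # drop finished articles so memory use stays bounded
```

To read a compressed file directly, a file object from `bz2.open(path, "rb")` can be passed as `source` instead of an unzipped path.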
Sample XML file
To see an example of the XML format click here.
The sample contains one article from the English Wikipedia.
Extracting raw text from XML
You can extract only the text from the XML files using the Perl script xml2txt.pl. The usage is:
perl xml2txt.pl [Options] INPUT OUTPUT
where INPUT is an unzipped Wikipedia corpus XML file, and OUTPUT is the raw text file that will be produced. The encoding of the output file will be UTF-8.
Options:
option | description |
-articles | The article mark-up is preserved (<article name="…">…</article>). |
-p | The paragraph mark-up is preserved (<p>…</p>). |
-h | The headings mark-up is preserved (<h>…</h>). |
-nomath | All content that is enclosed in math tags is deleted. |
-notables | All content that is enclosed in table tags is deleted. |
-nodisambig | Articles that are marked as disambiguation pages are deleted. |
-exclude-categories FILE | All articles that belong to one of the categories listed in FILE are ignored. FILE has one category name per line. Some useful categories are given below. |
-only-categories FILE | Only articles that belong to one of the categories listed in FILE are output. FILE has one category name per line. Some useful categories are given below. |
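When the script is driven from another program, the documented options can be assembled programmatically. The following Python sketch assumes that the options may be combined and that they are placed before INPUT and OUTPUT, as in the usage line above. The function and parameter names are hypothetical; only the option strings come from the list above.

```python
def build_xml2txt_command(input_xml, output_txt, *, keep_articles=False,
                          keep_paragraphs=False, keep_headings=False,
                          no_math=False, no_tables=False, no_disambig=False,
                          exclude_categories=None, only_categories=None):
    """Return an argument list for `perl xml2txt.pl [Options] INPUT OUTPUT`."""
    cmd = ["perl", "xml2txt.pl"]
    # Simple switches: (enabled?, option string from the documentation)
    switches = [
        (keep_articles, "-articles"),
        (keep_paragraphs, "-p"),
        (keep_headings, "-h"),
        (no_math, "-nomath"),
        (no_tables, "-notables"),
        (no_disambig, "-nodisambig"),
    ]
    cmd += [flag for enabled, flag in switches if enabled]
    # Options that take a category file (one category name per line)
    if exclude_categories is not None:
        cmd += ["-exclude-categories", str(exclude_categories)]
    if only_categories is not None:
        cmd += ["-only-categories", str(only_categories)]
    cmd += [str(input_xml), str(output_txt)]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the Python standard library, provided perl and xml2txt.pl are available in the working directory.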
Useful categories
language | description | categories file |
de | people | people_de.txt |
en | medicine | medicine_en.txt |
Known and unknown bugs
The corpora are based on the Wikipedia dumps which contain articles in Wiki markup format, packed in XML. Wiki markup uses all kinds of brackets to mark links, categories, etc. Because articles can be edited manually by anybody, unmatched brackets sometimes occur. In order to minimize noise in the corpus, we discard all articles with unmatched brackets, i.e. articles that can’t be parsed.
We also (try to) discard sections with weblinks and references at the end of articles, because they often contain foreign language material. (It’s not a bug, it’s a feature!)
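The unmatched-bracket filter described above can be pictured with a small stack-based check. This Python sketch is only an illustration of the idea and not the parser that was used to build the corpora; restricting the check to square and curly brackets is an assumption, as is the function name.

```python
# Closing bracket -> the opening bracket it must match
PAIRS = {"]": "[", "}": "{"}


def has_matched_brackets(markup):
    """Return True if all square and curly brackets in `markup` are balanced."""
    stack = []
    for ch in markup:
        if ch in "[{":
            stack.append(ch)
        elif ch in PAIRS:
            # A closer with no opener, or with the wrong opener, is a mismatch.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # leftover openers also count as unmatched
```

Under this check, an article containing `[[Link] text` would be discarded, while `{{tpl|[[x]]}}` would be kept.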
Applications
The main purpose of the Wikipedia Monolingual Corpora is to provide large text corpora in many languages which can be used for the routine tasks of corpus linguistics, like generating word frequency lists or collecting n-gram statistics.
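As an illustration of such routine tasks, the Python sketch below counts n-grams in a raw text file such as one produced by xml2txt.pl. It assumes that splitting on whitespace is an acceptable tokenization, which will not hold for every language in the collection, and that n-grams should not span line breaks. The function name is hypothetical.

```python
from collections import Counter


def ngram_counts(lines, n=1):
    """Count whitespace-token n-grams (as tuples) in an iterable of lines."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # zip over n shifted views of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts
```

With n=1 this gives a word frequency list; an open text file can be passed directly as `lines`, and `ngram_counts(f, 2).most_common(10)` would list the ten most frequent bigrams.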
However, the rich annotations in the XML files facilitate many more applications:
- Categories make it possible to compile domain-specific corpora.
- Crosslanguage links make it possible to compile multilingual, document-aligned comparable corpora.
- Textlinks and redirects make it possible to collect the expressions that are used to refer to a concept (i.e. a Wikipedia article).
- The annotations cover all requirements for building ESA-style semantic similarity resources.
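As a sketch of the document-alignment use case, the following Python function collects pairs of article titles connected by crosslanguage links. It relies only on the element and attribute names from the format description. Skipping every target that contains '#' or a colon is a deliberately conservative assumption: it removes section and namespace links, but it also drops ordinary titles that happen to contain a colon. The function name and the language value used in the example are hypothetical.

```python
import xml.etree.ElementTree as ET


def crosslanguage_pairs(source, language):
    """Yield (title, target_title) for crosslanguage links into `language`."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # keep the root so finished articles can be freed
    for event, elem in context:
        if event != "end" or elem.tag != "article":
            continue
        title = elem.get("name")
        pairs = []
        for link in elem.findall("crosslanguage_link"):
            if link.get("language") != language:
                continue
            target = link.get("name") or ""
            # Assumption: skip section links ('#') and namespace links (':').
            if "#" in target or ":" in target:
                continue
            pairs.append((title, target))
        root.clear()  # free memory before handing out the collected pairs
        yield from pairs
```

For example, `crosslanguage_pairs(f, "de")` would yield the English-German title pairs when `f` is the English corpus, assuming the language attribute holds codes such as "de".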
2014 version (outdated)
The table below contains links to the previous version of the Wikipedia Monolingual Corpora from 2014.
Wikipedia XML file | zipped file size | language | number of articles | number of paragraphs | number of tokens |
arwiki-20140714-corpus.xml.bz2 | 189 MB | Arabic | 273,709 | 1,654,018 | 61,601,807 |
bgwiki-20140728-corpus.xml.bz2 | 138 MB | Bulgarian | 183,983 | 1,115,838 | 43,324,881 |
cswiki-20140730-corpus.xml.bz2 | 296 MB | Czech | 341,446 | 2,382,825 | 86,076,579 |
dawiki-20140725-corpus.xml.bz2 | 144 MB | Danish | 188,415 | 1,177,350 | 43,997,748 |
dewiki-20140725-corpus.xml.bz2 | 1.88 GB | German | 1,697,608 | 14,971,566 | 649,943,374 |
elwiki-20140728-corpus.xml.bz2 | 113 MB | Greek | 96,742 | 863,312 | 37,468,841 |
enwiki-20140707-corpus.xml.bz2 | 4.25 GB | English | 4,579,471 | 43,300,386 | 1,714,676,058 |
eswiki-20140810-corpus.xml.bz2 | 1.04 GB | Spanish | 1,065,798 | 9,934,311 | 428,385,578 |
fawiki-20140802-corpus.xml.bz2 | 156 MB | Farsi | 1,438,575 | 2,584,340 | 55,847,903 |
fiwiki-20140809-corpus.xml.bz2 | 258 MB | Finnish | 349,907 | 2,117,539 | 62,254,999 |
frwiki-20140804-corpus.xml.bz2 | 1.43 GB | French | 1,516,664 | 14,854,031 | 546,824,176 |
huwiki-20140727-corpus.xml.bz2 | 314 MB | Hungarian | 261,378 | 2,595,194 | 89,823,011 |
itwiki-20140810-corpus.xml.bz2 | 1.02 GB | Italian | 1,127,405 | 9,583,932 | 381,082,826 |
jawiki-20140807-corpus.xml.bz2 | 1.28 GB | Japanese | 1,049,338 | 10,017,880 | 45,124,304 |
kowiki-20140801-corpus.xml.bz2 | 217 MB | Korean | 281,130 | 1,803,057 | 50,312,632 |
nlwiki-20140804-corpus.xml.bz2 | 631 MB | Dutch | 2,147,989 | 6,637,191 | 235,814,312 |
plwiki-20140802-corpus.xml.bz2 | 715 MB | Polish | 1,220,833 | 6,651,210 | 208,709,602 |
ptwiki-20140806-corpus.xml.bz2 | 508 MB | Portuguese | 978,010 | 5,041,117 | 183,120,152 |
rowiki-20140729-corpus.xml.bz2 | 157 MB | Romanian | 247,651 | 1,561,772 | 52,490,661 |
ruwiki-20140727-corpus.xml.bz2 | 1.09 GB | Russian | 1,444,962 | 10,796,451 | 334,731,419 |
svwiki-20140818-corpus.xml.bz2 | 407 MB | Swedish | 1,790,146 | 7,051,083 | 166,218,170 |
trwiki-20140806-corpus.xml.bz2 | 175 MB | Turkish | 233,218 | 1,691,110 | 48,378,679 |
zhwiki-20140804-corpus.xml.bz2 | 552 MB | Chinese | 967,153 | 5,629,801 | 20,577,336 |
License
The XML files that you can download above are derived from the original Wikipedia and are therefore made available under the same license as Wikipedia itself: Creative Commons Attribution-ShareAlike.