Wikipedia monolingual corpora

Here you can download text corpora extracted from the Wikipedia dumps in 30 languages, amounting to nearly 10 billion tokens altogether. Each XML file contains the full textual content of one language version of Wikipedia, enriched with annotations such as article and paragraph boundaries, the number of links referring to each article, crosslanguage links, categories and more. Have a quick look at a sample XML file containing one English article.

Download


Wikipedia XML file | unzipped file size | language | number of articles | number of paragraphs | number of tokens
arwiki-20180920-corpus.xml.bz2 | 2.6G | Arabic | 709,674 | 2,067,569 | 131,120,680
bgwiki-20180920-corpus.xml.bz2 | 2.1G | Bulgarian | 305,692 | 1,113,244 | 113,426,508
cawiki-20181001-corpus.xml.bz2 | 2.7G | Catalan | 632,334 | 3,036,618 | 212,831,511
cswiki-20181001-corpus.xml.bz2 | 2.1G | Czech | 522,760 | 1,979,077 | 132,763,620
dawiki-20181001-corpus.xml.bz2 | 1.1G | Danish | 282,756 | 1,109,150 | 66,676,633
dewiki-20180920-corpus.xml.bz2 | 12G | German | 2,212,874 | 12,161,736 | 913,428,282
elwiki-20181001-corpus.xml.bz2 | 1.3G | Greek | 192,400 | 935,726 | 70,194,124
enwiki-20181001-corpus.xml.bz2 | 26G | English | 5,690,374 | 29,873,050 | 2,319,783,831
eswiki-20181001-corpus.xml.bz2 | 8.1G | Spanish | 1,560,193 | 9,159,971 | 692,375,637
fawiki-20181001-corpus.xml.bz2 | 2.6G | Farsi | 2,285,821 | 3,154,360 | 105,336,497
fiwiki-20181001-corpus.xml.bz2 | 1.9G | Finnish | 542,723 | 2,022,231 | 102,487,802
frwiki-20181001-corpus.xml.bz2 | 11G | French | 2,301,659 | 13,028,006 | 914,601,321
hewiki-20181001-corpus.xml.bz2 | 2.2G | Hebrew | 325,635 | 2,002,756 | 137,903,919
huwiki-20181001-corpus.xml.bz2 | 2.5G | Hungarian | 518,731 | 2,473,834 | 153,998,201
idwiki-20181001-corpus.xml.bz2 | 1.5G | Indonesian | 502,761 | 1,553,719 | 88,015,056
itwiki-20181001-corpus.xml.bz2 | 7.5G | Italian | 1,806,393 | 8,239,444 | 588,876,879
jawiki-20181001-corpus.xml.bz2 | 8.6G | Japanese | 1,698,241 | 8,572,952 | 110,367,615
kowiki-20181001-corpus.xml.bz2 | 2.0G | Korean | 466,824 | 1,737,828 | 93,761,999
nlwiki-20181001-corpus.xml.bz2 | 4.7G | Dutch | 2,381,710 | 5,244,373 | 307,234,270
nowiki-20181001-corpus.xml.bz2 | 2.3G | Norwegian | 538,156 | 1,987,561 | 125,111,145
plwiki-20181001-corpus.xml.bz2 | 5.3G | Polish | 1,686,452 | 5,158,412 | 333,488,821
ptwiki-20181001-corpus.xml.bz2 | 4.3G | Portuguese | 1,521,288 | 5,248,294 | 319,844,644
rowiki-20181001-corpus.xml.bz2 | 1.5G | Romanian | 467,493 | 1,449,020 | 88,772,650
ruwiki-20181001-corpus.xml.bz2 | 12G | Russian | 2,590,008 | 10,295,294 | 610,761,713
svwiki-20181001-corpus.xml.bz2 | 6.4G | Swedish | 3,812,043 | 11,082,359 | 407,815,102
thwiki-20181001-corpus.xml.bz2 | 1.2G | Thai | 183,526 | 693,122 | 20,784,378
trwiki-20181001-corpus.xml.bz2 | 1.5G | Turkish | 397,046 | 1,316,176 | 78,212,294
ukwiki-20181001-corpus.xml.bz2 | 4.8G | Ukrainian | 972,947 | 4,551,432 | 233,609,331
viwiki-20181001-corpus.xml.bz2 | 2.2G | Vietnamese | 1,342,595 | 2,525,764 | 150,287,491
zhwiki-20181001-corpus.xml.bz2 | 4.1G | Chinese | 1,619,172 | 4,853,082 | 54,747,370

XML Format Description

The XML files contain the following information:

XML element | parent element | description
article | wikipedia (XML root) | This element encloses each Wikipedia article. It has an attribute name which contains the article’s title. The title is unique (in the current language version).
redirect | article | Articles that are redirects to another article are not stored in the XML files. However, the redirect information is contained in the redirect element(s) of the target article. The attribute name contains an article title that redirects to the present article. An article can have 0..n child elements of type redirect.
links_in | article | The attribute name contains the total number of textlinks in other articles that link to the current article.
textlink | article | The attribute name contains the anchor text that is used in a textlink to refer to the current article. The attribute freq contains the number of times this anchor text was used to refer to the current article.
category | article | The attribute name contains a Wikipedia category the current article is assigned to. An article can have 0..n child elements of type category.
links_out | article | The attribute name contains the number of textlinks in the current article that refer to other Wikipedia articles.
crosslanguage_link | article | The attribute name contains the title of a Wikipedia article in another language the current article is linked to. The attribute language specifies the target language. Important note: In Wikipedia, crosslanguage links may link an article to a section of a target article. In this case, the link contains the article name followed by a '#' and the section’s anchor name. There are also crosslanguage links to other namespaces, e.g. categories or portals. These links contain a colon that separates the namespace from the link target.
disambiguation | article | This element marks the article as a disambiguation page. It has no attributes.
content | article | This element encloses the current article’s textual content.
p | content, h, math, table | This element marks a paragraph boundary.
link | content, h, math, table | Marks a textlink to another Wikipedia article. The title of the target article is contained in the attribute target.
h | content | Marks a heading.
math | content | Marks a math formula.
table | content | Marks a table.
cell | table | Marks a table cell.
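
The element layout above can be read with a streaming parser. The following is a minimal Python sketch, assuming the corpus is well-formed XML with a wikipedia root and that the elements carry the attributes listed in the table. The function name iter_articles and the keys of the returned dictionaries are illustrative and not part of the corpus tooling.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_articles(source):
    """Yield one dict per <article> element from a file path or file object.

    Assumes the layout described in the table above. Paths ending in
    .bz2 are decompressed on the fly.
    """
    if isinstance(source, str) and source.endswith(".bz2"):
        source = bz2.open(source, "rb")
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue
        content = elem.find("content")
        yield {
            "title": elem.get("name"),
            "categories": [c.get("name") for c in elem.findall("category")],
            "redirects": [r.get("name") for r in elem.findall("redirect")],
            "is_disambiguation": elem.find("disambiguation") is not None,
            # itertext() concatenates the text of all nested elements.
            "text": "".join(content.itertext()) if content is not None else "",
        }
        # Drop the finished article's children to limit memory use.
        elem.clear()
```

Because the function is a generator, articles are processed one at a time, which matters for files as large as the English corpus.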

Sample XML file

To see an example of the XML format click here.
The sample contains one article from the English Wikipedia.

Extracting raw text from XML

You can extract the plain text from the XML files using the Perl script xml2txt.pl. The usage is:

perl xml2txt.pl [Options] INPUT OUTPUT

where INPUT is an unzipped Wikipedia corpus XML file, and OUTPUT is the raw text file that will be produced. The encoding of the output file will be UTF-8.

Options:
-articles The article mark-up is preserved (<article name="…">…</article>).
-p The paragraph mark-up is preserved (<p>…</p>).
-h The heading mark-up is preserved (<h>…</h>).
-nomath All content that is enclosed in math tags is deleted.
-notables All content that is enclosed in table tags is deleted.
-nodisambig Articles that are marked as disambiguation pages are deleted.
-exclude-categories FILE All articles that belong to one of the categories listed in FILE are ignored. FILE has one category name per line. Some useful categories are given below.
-only-categories FILE Only articles that belong to one of the categories listed in FILE are output. FILE has one category name per line. Some useful categories are given below.
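
The effect of the two category options can be illustrated in Python. This is a sketch of the filtering logic only, not the code of xml2txt.pl. The helper names load_categories and keep_article are hypothetical, and the handling of empty lines, surrounding whitespace, and the combination of both options is an assumption.

```python
def load_categories(lines):
    """Build a set of category names from an iterable with one name per line."""
    return {line.strip() for line in lines if line.strip()}


def keep_article(article_categories, only=None, exclude=None):
    """Decide whether an article passes the category filters.

    only:    if given, the article must belong to at least one of these.
    exclude: if given, the article must belong to none of these.
    """
    cats = set(article_categories)
    if exclude and cats & exclude:
        return False
    if only is not None and not (cats & only):
        return False
    return True
```

A category file such as medicine_en.txt could be loaded by passing an open file handle to load_categories, since iterating over a file yields one line at a time.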

Useful categories

language | description | categories file
de | people | people_de.txt
en | medicine | medicine_en.txt

Known and unknown bugs

The corpora are based on the Wikipedia dumps which contain articles in Wiki markup format, packed in XML. Wiki markup uses all kinds of brackets to mark links, categories, etc. Because articles can be edited manually by anybody, unmatched brackets sometimes occur. In order to minimize noise in the corpus, we discard all articles with unmatched brackets, i.e. articles that can’t be parsed.
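
To illustrate what "unmatched brackets" means, the following Python sketch checks whether square and curly brackets in a piece of Wiki markup are properly nested and closed. It is a simplified stand-in, not the parser used to build the corpora. The function name and the choice of bracket pairs are assumptions.

```python
def brackets_balanced(markup):
    """Return True if [] and {} in the string are properly nested and closed."""
    pairs = {"]": "[", "}": "{"}
    stack = []
    for ch in markup:
        if ch in pairs.values():
            # Opening bracket: remember it until its partner shows up.
            stack.append(ch)
        elif ch in pairs:
            # Closing bracket: it must match the most recent opening one.
            if not stack or stack.pop() != pairs[ch]:
                return False
    # Anything left on the stack was opened but never closed.
    return not stack
```

An article for which such a check fails would be a candidate for being discarded under the policy described above.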

We also (try to) discard sections with weblinks and references at the end of articles, because they often contain foreign language material. (It’s not a bug, it’s a feature!)

Applications

The main purpose of the Wikipedia Monolingual Corpora is to provide large text corpora in many languages which can be used for the routine tasks of corpus linguistics, such as generating word frequency lists or collecting n-gram statistics.
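
As a rough illustration of these routine tasks, the Python sketch below counts n-grams in raw text lines. Splitting on whitespace and lower-casing are simplifying assumptions; they are not suitable for languages written without spaces between words, such as Chinese, Japanese or Thai. The function name ngram_counts is illustrative.

```python
from collections import Counter


def ngram_counts(lines, n=1):
    """Count n-grams over whitespace-separated, lower-cased tokens.

    n-grams do not cross line boundaries. Keys are tuples of n tokens.
    """
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

With n=1 the result is a word frequency list. Calling most_common(10) on the returned Counter gives the ten most frequent entries.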

However, the rich annotations in the XML files facilitate many more applications:

  • Categories make it possible to compile domain-specific corpora.
  • Crosslanguage links make it possible to compile multilingual, document-aligned comparable corpora.
  • Textlinks and redirects make it possible to collect the expressions that are used to refer to a concept (i.e. a Wikipedia article).
  • The annotations cover all requirements for building ESA-style semantic similarity resources.
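
When crosslanguage links are used for document alignment, links that point to a section or to another namespace usually need to be set aside, as described in the format notes. The Python sketch below shows one way to do this. Both function names are hypothetical, and treating every colon as a namespace separator is a simplification, because article titles can themselves contain colons.

```python
def split_crosslanguage_target(target):
    """Split a crosslanguage link into (namespace, title, section).

    namespace and section are None when absent.
    """
    title, sep, section = target.partition("#")
    section = section if sep else None
    namespace = None
    if ":" in title:
        # Simplification: the part before the first colon is taken
        # to be a namespace such as a category or portal prefix.
        namespace, title = title.split(":", 1)
    return namespace, title, section


def article_level_links(links):
    """Keep only links that point to a whole article without a namespace prefix.

    links: iterable of (language, target) pairs taken from the
    crosslanguage_link elements of one article.
    Returns a dict mapping language to target title.
    """
    result = {}
    for language, target in links:
        namespace, title, section = split_crosslanguage_target(target)
        if namespace is None and section is None:
            result[language] = title
    return result
```

The resulting dictionary for each article can then serve as the alignment key between two language versions.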

2014 version (outdated)

The table below contains links to the previous version of the Wikipedia Monolingual Corpora from 2014.

Wikipedia XML file | zipped file size | language | number of articles | number of paragraphs | number of tokens
arwiki-20140714-corpus.xml.bz2 | 189 MB | Arabic | 273,709 | 1,654,018 | 61,601,807
bgwiki-20140728-corpus.xml.bz2 | 138 MB | Bulgarian | 183,983 | 1,115,838 | 43,324,881
cswiki-20140730-corpus.xml.bz2 | 296 MB | Czech | 341,446 | 2,382,825 | 86,076,579
dawiki-20140725-corpus.xml.bz2 | 144 MB | Danish | 188,415 | 1,177,350 | 43,997,748
dewiki-20140725-corpus.xml.bz2 | 1.88 GB | German | 1,697,608 | 14,971,566 | 649,943,374
elwiki-20140728-corpus.xml.bz2 | 113 MB | Greek | 96,742 | 863,312 | 37,468,841
enwiki-20140707-corpus.xml.bz2 | 4.25 GB | English | 4,579,471 | 43,300,386 | 1,714,676,058
eswiki-20140810-corpus.xml.bz2 | 1.04 GB | Spanish | 1,065,798 | 9,934,311 | 428,385,578
fawiki-20140802-corpus.xml.bz2 | 156 MB | Farsi | 1,438,575 | 2,584,340 | 55,847,903
fiwiki-20140809-corpus.xml.bz2 | 258 MB | Finnish | 349,907 | 2,117,539 | 62,254,999
frwiki-20140804-corpus.xml.bz2 | 1.43 GB | French | 1,516,664 | 14,854,031 | 546,824,176
huwiki-20140727-corpus.xml.bz2 | 314 MB | Hungarian | 261,378 | 2,595,194 | 89,823,011
itwiki-20140810-corpus.xml.bz2 | 1.02 GB | Italian | 1,127,405 | 9,583,932 | 381,082,826
jawiki-20140807-corpus.xml.bz2 | 1.28 GB | Japanese | 1,049,338 | 10,017,880 | 45,124,304
kowiki-20140801-corpus.xml.bz2 | 217 MB | Korean | 281,130 | 1,803,057 | 50,312,632
nlwiki-20140804-corpus.xml.bz2 | 631 MB | Dutch | 2,147,989 | 6,637,191 | 235,814,312
plwiki-20140802-corpus.xml.bz2 | 715 MB | Polish | 1,220,833 | 6,651,210 | 208,709,602
ptwiki-20140806-corpus.xml.bz2 | 508 MB | Portuguese | 978,010 | 5,041,117 | 183,120,152
rowiki-20140729-corpus.xml.bz2 | 157 MB | Romanian | 247,651 | 1,561,772 | 52,490,661
ruwiki-20140727-corpus.xml.bz2 | 1.09 GB | Russian | 1,444,962 | 10,796,451 | 334,731,419
svwiki-20140818-corpus.xml.bz2 | 407 MB | Swedish | 1,790,146 | 7,051,083 | 166,218,170
trwiki-20140806-corpus.xml.bz2 | 175 MB | Turkish | 233,218 | 1,691,110 | 48,378,679
zhwiki-20140804-corpus.xml.bz2 | 552 MB | Chinese | 967,153 | 5,629,801 | 20,577,336

License

The XML files that you can download above are derived from the original Wikipedia and are therefore made available under the same license as Wikipedia itself: Creative Commons Attribution-ShareAlike.


