Wikipedia monolingual corpora

Here you can download text corpora extracted from the Wikipedia dumps in 30 languages, amounting to nearly 10 billion tokens altogether. Each XML file contains the full textual content of the individual language version of Wikipedia, extended with many annotations like article and paragraph boundaries, number of links referring to each article, crosslanguage links, categories and more. Have a quick look at a sample XML file containing one English article.

Download

Wikipedia XML file	unzipped file size	language	number of articles	number of paragraphs	number of tokens
arwiki-20180920-corpus.xml.bz2	2.6G	Arabic	709,674	2,067,569	131,120,680
bgwiki-20180920-corpus.xml.bz2	2.1G	Bulgarian	305,692	1,113,244	113,426,508
cawiki-20181001-corpus.xml.bz2	2.7G	Catalan	632,334	3,036,618	212,831,511
cswiki-20181001-corpus.xml.bz2	2.1G	Czech	522,760	1,979,077	132,763,620
dawiki-20181001-corpus.xml.bz2	1.1G	Danish	282,756	1,109,150	66,676,633
dewiki-20180920-corpus.xml.bz2	12G	German	2,212,874	12,161,736	913,428,282
elwiki-20181001-corpus.xml.bz2	1.3G	Greek	192,400	935,726	70,194,124
enwiki-20181001-corpus.xml.bz2	26G	English	5,690,374	29,873,050	2,319,783,831
eswiki-20181001-corpus.xml.bz2	8.1G	Spanish	1,560,193	9,159,971	692,375,637
fawiki-20181001-corpus.xml.bz2	2.6G	Farsi	2,285,821	3,154,360	105,336,497
fiwiki-20181001-corpus.xml.bz2	1.9G	Finnish	542,723	2,022,231	102,487,802
frwiki-20181001-corpus.xml.bz2	11G	French	2,301,659	13,028,006	914,601,321
hewiki-20181001-corpus.xml.bz2	2.2G	Hebrew	325,635	2,002,756	137,903,919
huwiki-20181001-corpus.xml.bz2	2.5G	Hungarian	518,731	2,473,834	153,998,201
idwiki-20181001-corpus.xml.bz2	1.5G	Indonesian	502,761	1,553,719	88,015,056
itwiki-20181001-corpus.xml.bz2	7.5G	Italian	1,806,393	8,239,444	588,876,879
jawiki-20181001-corpus.xml.bz2	8.6G	Japanese	1,698,241	8,572,952	110,367,615
kowiki-20181001-corpus.xml.bz2	2.0G	Korean	466,824	1,737,828	93,761,999
nlwiki-20181001-corpus.xml.bz2	4.7G	Dutch	2,381,710	5,244,373	307,234,270
nowiki-20181001-corpus.xml.bz2	2.3G	Norwegian	538,156	1,987,561	125,111,145
plwiki-20181001-corpus.xml.bz2	5.3G	Polish	1,686,452	5,158,412	333,488,821
ptwiki-20181001-corpus.xml.bz2	4.3G	Portuguese	1,521,288	5,248,294	319,844,644
rowiki-20181001-corpus.xml.bz2	1.5G	Romanian	467,493	1,449,020	88,772,650
ruwiki-20181001-corpus.xml.bz2	12G	Russian	2,590,008	10,295,294	610,761,713
svwiki-20181001-corpus.xml.bz2	6.4G	Swedish	3,812,043	11,082,359	407,815,102
thwiki-20181001-corpus.xml.bz2	1.2G	Thai	183,526	693,122	20,784,378
trwiki-20181001-corpus.xml.bz2	1.5G	Turkish	397,046	1,316,176	78,212,294
ukwiki-20181001-corpus.xml.bz2	4.8G	Ukrainian	972,947	4,551,432	233,609,331
viwiki-20181001-corpus.xml.bz2	2.2G	Vietnamese	1,342,595	2,525,764	150,287,491
zhwiki-20181001-corpus.xml.bz2	4.1G	Chinese	1,619,172	4,853,082	54,747,370

XML Format Description

The XML files contain the following information:

XML element	mother element	description
article	wikipedia (XML root)	This element encloses each Wikipedia article. It has an attribute name which contains the article’s title. The title is unique (in the current language version).
redirect	article	Articles that are redirects to another article are not stored in the XML files. However, the redirect information is contained in the redirect element(s) of the target article. The attribute name contains an article title that redirects to the present article. An article can have 0..n daughter elements of type redirect.
links_in	article	The attribute name contains the total number of textlinks in other articles that link to the current article.
textlink	article	The attribute name contains the anchor text that is used in a textlink to refer to the current article. The attribute freq contains the number of times this anchor text was used to refer to the current article.
category	article	The attribute name contains a Wikipedia category the current article is assigned to. An article can have 0..n daughter elements of type category.
links_out	article	The attribute name contains the number of textlinks in the current article that refer to other Wikipedia articles.
crosslanguage_link	article	The attribute name contains the title of a Wikipedia article in another language the current article is linked to. The attribute language specifies the target language. Important note: In Wikipedia, crosslanguage links may link an article to a section of a target article. In this case, the link contains the article name followed by a ‚#‘ and the section’s anchor name. There are also crosslanguage links to other namespaces, e.g. categories or portals. These links contain a colon that separates the namespace from the link target.
disambiguation	article	This elements marks the article as a disambiguation page. It has no attributes.
content	article	This element encloses the current article’s textual content.
p	content, h, math, table	This element marks a paragraph boundary.
link	content, h, math, table	Marks a textlink to another Wikipedia article. The title of the target article is contained in the attribute target.
h	content	Marks a heading.
math	content	Marks a math formula.
table	content	Marks a table.
cell	table	Marks a table cell.

Sample XML file

To see an example of the XML format click here.
The sample contains one article from the English Wikipedia.

Extracting raw text from XML

You can extract text only from the XML files using the Perl script xml2txt.pl. The usage is:

perl xml2txt.pl [Options] INPUT OUTPUT

where INPUT is an unzipped Wikipedia corpus XML file, and OUTPUT is the raw text file that will be produced. The encoding of the output file will be UTF-8.

Options:	-articles	The article mark-up is preserved (<article name=“…“>…</article>).
	-p	The paragraph mark-up is preserved (<p>…</p>).
	-h	The headings mark-up is preserved (<h>…</h>).
	-nomath	All content that is enclosed in math tags is deleted.
	-notables	All content that is enclosed in table tags is deleted.
	-nodisambig	Articles that are marked as disambiguation page are deleted.
	-exclude-categories FILE	All articles that belong to one of the categories listed in FILE are ignored. FILE has one category name per line. Some useful categories are given below.
	-only-categories FILE	Only articles are output that belong to one of the categories listed in FILE. FILE has one category name per line. Some useful categories are given below.

Useful categories

language	description	categories file
de	people	people_de.txt
en	medicine	medicine_en.txt

Known and unknown bugs

The corpora are based on the Wikipedia dumps which contain articles in Wiki markup format, packed in XML. Wiki markup uses all kinds of brackets to mark links, categories, etc. Because articles can be edited manually by anybody, unmatched brackets sometimes occur. In order to minimize noise in the corpus, we discard all articles with unmatched brackets, i.e. articles that can’t be parsed.

We also (try to) discard sections with weblinks and references at the end of articles, because they often contain foreign language material. (It’s not a bug, it’s a feature!)

Applications

Main purpose of the Wikipedia Monolingual Corpora is to provide large text corpora in many languages which can be used for the routine tasks of corpus linguistics, like generating word frequency lists or collecting n-gram statistics.

However, the rich annotations in the XML files facilitate many more applications:

categories allow to compile domain-specific corpora
compile multilingual document-aligned comparable corpora using the crosslanguage links
textlinks and redirects allow to collect expressions that are used to refer to a concept (i.e. a Wikipedia article)
The annotations cover all requirements to build ESA style semantic similarity resources.

2014 version (outdated)

The table below contains links to the previous version of the Wikipedia Monolingual Corpora from 2014.

Wikipedia XML file	zipped file size	language	number of articles	number of paragraphs	number of tokens
arwiki-20140714-corpus.xml.bz2	189 MB	Arabic	273,709	1,654,018	61,601,807
bgwiki-20140728-corpus.xml.bz2	138 MB	Bulgarian	183,983	1,115,838	43,324,881
cswiki-20140730-corpus.xml.bz2	296 MB	Czech	341,446	2,382,825	86,076,579
dawiki-20140725-corpus.xml.bz2	144 MB	Danish	188,415	1,177,350	43,997,748
dewiki-20140725-corpus.xml.bz2	1,88 GB	German	1,697,608	14,971,566	649,943,374
elwiki-20140728-corpus.xml.bz2	113 MB	Greek	96,742	863,312	37,468,841
enwiki-20140707-corpus.xml.bz2	4,25 GB	English	4,579,471	43,300,386	1,714,676,058
eswiki-20140810-corpus.xml.bz2	1,04 GB	Spanish	1,065,798	9,934,311	428,385,578
fawiki-20140802-corpus.xml.bz2	156 MB	Farsi	1,438,575	2,584,340	55,847,903
fiwiki-20140809-corpus.xml.bz2	258 MB	Finnish	349,907	2,117,539	62,254,999
frwiki-20140804-corpus.xml.bz2	1,43 GB	French	1,516,664	14,854,031	546,824,176
huwiki-20140727-corpus.xml.bz2	314 MB	Hungarian	261,378	2,595,194	89,823,011
itwiki-20140810-corpus.xml.bz2	1,02 GB	Italian	1,127,405	9,583,932	381,082,826
jawiki-20140807-corpus.xml.bz2	1,28 GB	Japanese	1,049,338	10,017,880	45,124,304
kowiki-20140801-corpus.xml.bz2	217 MB	Korean	281,130	1,803,057	50,312,632
nlwiki-20140804-corpus.xml.bz2	631 MB	Dutch	2,147,989	6,637,191	235,814,312
plwiki-20140802-corpus.xml.bz2	715 MB	Polish	1,220,833	6,651,210	208,709,602
ptwiki-20140806-corpus.xml.bz2	508 MB	Portuguese	978,010	5,041,117	183,120,152
rowiki-20140729-corpus.xml.bz2	157 MB	Romanian	247,651	1,561,772	52,490,661
ruwiki-20140727-corpus.xml.bz2	1,09 GB	Russian	1,444,962	10,796,451	334,731,419
svwiki-20140818-corpus.xml.bz2	407 MB	Swedish	1,790,146	7,051,083	166,218,170
trwiki-20140806-corpus.xml.bz2	175 MB	Turkish	233,218	1,691,110	48,378,679
zhwiki-20140804-corpus.xml.bz2	552 MB	Chinese	967,153	5,629,801	20,577,336

License

The XML files that you can download above are derived from the original Wikipedia and are therefore made available under the same license as Wikipedia itself: Creative Commons Attribution-ShareAlike.