Is Wn compatible with the NLTK's wordnet module?
The API is intentionally similar, but not exactly the same (for instance, see the next question), and there are differences in the ways that results are retrieved, particularly for non-English wordnets. See "Migrating from the NLTK" for more information. Also see "Where is the Princeton WordNet data?" below.
Where are the Lemma objects? What are Word and Sense objects?
Unlike the WNDB data format of the original WordNet, the
WN-LMF XML format grants words (called lexical entries in WN-LMF
and represented by a Word object in Wn) and word senses (a
Sense object in Wn) explicit, first-class status alongside
synsets. While senses are essentially links between words and
synsets, they may contain metadata and be the source or target of
sense relations, so in some ways they are more like nodes than edges
when the wordnet is viewed as a graph. The NLTK's wordnet module, using
the WNDB format, combines the information of a word and a sense into a
single object called a Lemma. Wn also has an unrelated concept called
lemma(), but it is merely the canonical form
of a word.
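The distinction can be sketched with a toy data model. This is only an illustration under stated assumptions, not Wn's or the NLTK's actual implementation: the classes below merely borrow the names Word, Sense, and Synset, the helper flattened_name() is hypothetical, and every ID and definition is invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    id: str
    definition: str


@dataclass
class Word:
    id: str
    forms: List[str]  # the first form is treated as the canonical one

    def lemma(self) -> str:
        # The lemma is just the canonical form of the word, not an object.
        return self.forms[0]


@dataclass
class Sense:
    # A sense links one word to one synset, but it is a node in its own
    # right: it has an ID, may carry metadata, and may take part in
    # sense relations (stored here as relation name -> target sense IDs).
    id: str
    word: Word
    synset: Synset
    metadata: Dict[str, str] = field(default_factory=dict)
    relations: Dict[str, List[str]] = field(default_factory=dict)


def flattened_name(sense: Sense) -> str:
    # Hypothetical helper: merges word and sense information into a single
    # value, loosely imitating what a combined Lemma-style object conveys.
    return f"{sense.synset.id}.{sense.word.lemma()}"


# All IDs and definitions below are invented for the illustration.
hot = Sense("s-hot", Word("w-hot", ["hot", "hotter", "hottest"]),
            Synset("ss-hot", "having a high temperature"))
cold = Sense("s-cold", Word("w-cold", ["cold", "colder", "coldest"]),
             Synset("ss-cold", "having a low temperature"),
             relations={"antonym": [hot.id]})

print(cold.word.lemma())          # canonical form of the word
print(cold.relations["antonym"])  # the sense itself is the relation source
print(flattened_name(cold))       # word and sense collapsed into one value
```

Keeping Sense as its own class is what lets the antonym relation attach to the sense rather than to the word or the synset.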
Where is the Princeton WordNet data?
The original English wordnet, named simply WordNet but often
referred to as the Princeton WordNet to better distinguish it from
other projects, is specifically the data distributed by Princeton in
the WNDB format. The Open Multilingual Wordnet (OMW)
packages an export of the WordNet data as the OMW English Wordnet
based on WordNet 3.0 which is used by Wn (with the lexicon ID
omw-en). It also has a similar export for WordNet 3.1 data
(omw-en31). Both of these are highly compatible with the original
data and can be used as drop-in replacements.
Prior to Wn version 0.9 (and, correspondingly, prior to the OMW
data version 1.4), the pwn:3.0 and pwn:3.1 English wordnets
distributed by OMW were incorrectly called the Princeton WordNet
(for WordNet 3.0 and 3.1, respectively). From Wn version 0.9 (and from
version 1.4 of the OMW data), these are called the OMW English
Wordnet based on WordNet 3.0/3.1 (omw-en:1.4 and
omw-en31:1.4, respectively). These lexicons are intentionally
compatible with the original WordNet data, and the 1.4 versions are
even more compatible than the previous pwn:3.0 and pwn:3.1
lexicons, so it is strongly recommended to use them over the previous
versions.
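For code that still refers to the older lexicons, the renaming can be captured in a small lookup table. This is a minimal sketch, not part of Wn: the function preferred_specifier() and the table name are hypothetical, and it assumes the older lexicons carry the specifiers pwn:3.0 and pwn:3.1 and that their replacements are omw-en:1.4 and omw-en31:1.4.

```python
# Old specifier -> recommended replacement. The exact specifier strings are
# an assumption based on the lexicon IDs and versions named in the text.
RENAMED_LEXICONS = {
    "pwn:3.0": "omw-en:1.4",
    "pwn:3.1": "omw-en31:1.4",
}


def preferred_specifier(specifier: str) -> str:
    """Return the recommended 1.4 specifier for an old pwn lexicon.

    Specifiers that are not in the table are returned unchanged
    (apart from surrounding whitespace being stripped).
    """
    cleaned = specifier.strip()
    return RENAMED_LEXICONS.get(cleaned, cleaned)


for old in ("pwn:3.0", "pwn:3.1", "omw-en:1.4"):
    print(old, "->", preferred_specifier(old))
```

The resulting string could then be handed to whatever call selects the lexicon, for example wn.download() or wn.Wordnet(), assuming the corresponding data is available.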
Why does Wn's database get so big?
The OMW English Wordnet based on WordNet 3.0 takes about 114 MiB of disk space in Wn's database, which is only about 8 MiB more than it takes as a WN-LMF XML file. The NLTK, however, uses the obsolete WNDB format which is more compact, requiring only 35 MiB of disk space. The difference with the Open Multilingual Wordnet 1.4 is more striking: it takes about 659 MiB of disk space in the database, but only 49 MiB in the NLTK. Part of the difference here is that the OMW files in the NLTK are simple tab-separated-value files listing only the words added to each synset for each language. In addition, Wn creates new synsets for each wordnet added (see the previous question). One more reason is that Wn creates various indexes in the database for efficient lookup.
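The effect of indexes on disk usage can be observed with a small experiment. The sketch below is only an illustration: it assumes an SQLite database, and the table, column, and index names are invented and do not reflect Wn's actual schema. It writes the same rows to a temporary database file and compares the file size before and after an index is created.

```python
import os
import sqlite3
import tempfile


def size_before_and_after_index(rows: int = 20000):
    """Return (bytes_without_index, bytes_with_index) for a toy table."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.db")
        conn = sqlite3.connect(path)
        try:
            # Invented schema: one row per word form.
            conn.execute(
                "CREATE TABLE forms "
                "(id INTEGER PRIMARY KEY, form TEXT, synset TEXT)"
            )
            conn.executemany(
                "INSERT INTO forms (form, synset) VALUES (?, ?)",
                ((f"word{i}", f"synset-{i % 1000}") for i in range(rows)),
            )
            conn.commit()
            before = os.path.getsize(path)

            # The index speeds up lookups by form but stores extra pages.
            conn.execute("CREATE INDEX form_index ON forms (form)")
            conn.commit()
            after = os.path.getsize(path)
        finally:
            conn.close()
    return before, after


before, after = size_before_and_after_index()
print(f"without index: {before} bytes, with index: {after} bytes")
```

The index holds its own copy of the indexed column, so the file grows even though no rows were added. This is the usual trade of disk space for lookup speed.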