Is Wn compatible with the NLTK's module?

The API is intentionally similar, but not exactly the same (for instance see the next question), and there are differences in the ways that results are retrieved, particularly for non-English wordnets. See Migrating from the NLTK for more information. Also see Where is the Princeton WordNet data?.

Where are the Lemma objects? What are Word and Sense objects?

Unlike the original WNDB data format of the original WordNet, the WN-LMF XML format grants words (called lexical entries in WN-LMF and a Word object in Wn) and word senses (Sense in Wn) explicit, first-class status alongside synsets. While senses are essentially links between words and synsets, they may contain metadata and be the source or target of sense relations, so in some ways they are more like nodes than edges when the wordnet is viewed as a graph. The NLTK's module, using the WNDB format, combines the information of a word and a sense into a single object called a Lemmas. Wn also has an unrelated concept called a lemma(), but it is merely the canonical form of a word.

Where is the Princeton WordNet data?

The original English wordnet, named simply WordNet but often referred to as the Princeton WordNet to better distinguish it from other projects, is specifically the data distributed by Princeton in the WNDB format. The Open Multilingual Wordnet (OMW) packages an export of the WordNet data as the OMW English Wordnet based on WordNet 3.0 which is used by Wn (with the lexicon ID omw-en). It also has a similar export for WordNet 3.1 data (omw-en31). Both of these are highly compatible with the original data and can be used as drop-in replacements.

Prior to Wn version 0.9 (and, correspondingly, prior to the OMW data version 1.4), the pwn:3.0 and pwn:3.1 English wordnets distributed by OMW were incorrectly called the Princeton WordNet (for WordNet 3.0 and 3.1, respectively). From Wn version 0.9 (and from version 1.4 of the OMW data), these are called the OMW English Wordnet based on WordNet 3.0/3.1 (omw-en:1.4 and omw-en31:1.4, respectively). These lexicons are intentionally compatible with the original WordNet data, and the 1.4 versions are even more compatible than the previous pwn:3.0 and pwn:3.1 lexicons, so it is strongly recommended to use them over the previous versions.

Why don't all wordnets share the same synsets?

The Open Multilingual Wordnet (OMW) contains wordnets for many languages created using the expand methodology [VOSSEN1998], where non-English wordnets provide words on top of the English wordnet's synset structure. This allows new wordnets to be built in much less time than starting from scratch, but with a few drawbacks, such as that words cannot be added if they do not have a synset in the English wordnet, and that it is difficult to version the wordnets independently (e.g., for reproducibility of experiments involving wordnet data) as all are interconnected. Wn, therefore, creates new synsets for each wordnet added to its database, and synsets then specify which resource they belong to. Queries can specify which resources may be examined. Also see Interlingual Queries.

Why does Wn's database get so big?

The OMW English Wordnet based on WordNet 3.0 takes about 114 MiB of disk space in Wn's database, which is only about 8 MiB more than it takes as a WN-LMF XML file. The NLTK, however, uses the obsolete WNDB format which is more compact, requiring only 35 MiB of disk space. The difference with the Open Multilingual Wordnet 1.4 is more striking: it takes about 659 MiB of disk space in the database, but only 49 MiB in the NLTK. Part of the difference here is that the OMW files in the NLTK are simple tab-separated-value files listing only the words added to each synset for each language. In addition, Wn creates new synsets for each wordnet added (see the previous question). One more reason is that Wn creates various indexes in the database for efficient lookup.


Piek Vossen. 1998. Introduction to EuroWordNet. Computers and the Humanities, 32(2): 73–89.