Interlingual Queries

This guide explains how interlingual queries work within Wn. To get started, you'll need at least two lexicons that use interlingual indices (ILIs). For this guide, we'll use the Open English WordNet (oewn:2021), the Open German WordNet (odenet:1.4), also known as OdeNet, and the Japanese wordnet (omw-ja:1.4).

>>> import wn
>>> wn.download('oewn:2021')
>>> wn.download('odenet:1.4')
>>> wn.download('omw-ja:1.4')

We will query these wordnets with the following Wordnet objects:

>>> en = wn.Wordnet('oewn:2021')
>>> de = wn.Wordnet('odenet:1.4')

The object for the Japanese wordnet will be discussed and created below, in Cross-lingual Relation Traversal.

What are Interlingual Indices?

It is common for users of the Princeton WordNet to refer to synsets by their WNDB offset and type, but this is problematic: the offset is a byte offset into the wordnet data files, so it differs between wordnets in other languages and even between versions of the same wordnet. Interlingual indices (ILIs) address this issue by providing stable identifiers for concepts, whether for a synset across versions of a wordnet or across languages.

The idea of ILIs was proposed by [Vossen99] and came to fruition with the release of the Collaborative Interlingual Index (CILI; [Bond16]). CILI therefore represents an instance of, and a namespace for, ILIs. There could, in theory, be alternative indices for particular domains (e.g., names of people or places), but currently there is only the one.

As an example, the synset for apricot (fruit) in WordNet 3.0 is 07750872-n, but it is 07766848-n in WordNet 3.1. In OdeNet 1.4, which is not released in the WNDB format and therefore doesn't use offsets at all, it is 13235-n for the equivalent word (Aprikose). However, all three use the same ILI: i77784.

Not every synset is guaranteed to be associated with an ILI. Some have the special value "in", which indicates that the project is proposing that a new ILI be created in the CILI project for the concept; until that happens, the synset cannot be used in interlingual queries.
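If you want to know which synsets can actually participate in interlingual queries, a rough check like the following may help. This is only a sketch: it assumes that synsets without an ILI report None and that proposed ILIs carry the status 'proposed'.

>>> # sketch: collect synsets that cannot (yet) be used interlingually,
>>> # i.e., those with no ILI or with only a proposed one
>>> unusable = [ss for ss in en.synsets()
...             if ss.ili is None or ss.ili.status == 'proposed']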

Vossen99

Vossen, Piek, Wim Peters, and Julio Gonzalo. "Towards a universal index of meaning." In Proceedings of ACL-99 workshop, Siglex-99, standardizing lexical resources, pp. 81-90. University of Maryland, 1999.

Bond16

Bond, Francis, Piek Vossen, John Philip McCrae, and Christiane Fellbaum. "CILI: the Collaborative Interlingual Index." In Proceedings of the 8th Global WordNet Conference (GWC), pp. 50-57. 2016.

Using Interlingual Indices

For synsets that have an associated ILI, you can retrieve it via the wn.Synset.ili accessor:

>>> apricot = en.synsets('apricot')[1]
>>> apricot.ili
ILI('i77784')

From this object you can get various properties of the ILI, such as the ID as a string, its status, and its definition, but if you have not added CILI to Wn's database, the results will not be very informative:

>>> apricot.ili.id
'i77784'
>>> apricot.ili.status
'presupposed'
>>> apricot.ili.definition() is None
True

The presupposed status means that the ILI is in use by a lexicon in the database, but there is no other source of truth for the index. CILI can be downloaded just like a lexicon:

>>> wn.download('cili:1.0')

Now the status and definition should be more useful:

>>> apricot.ili.status
'active'
>>> apricot.ili.definition()
'downy yellow to rosy-colored fruit resembling a small peach'
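You can also look up an ILI directly, without going through a synset. A minimal sketch, assuming the module-level wn.ili() function is available in your version of Wn:

>>> wn.ili('i77784').status
'active'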

ILI IDs may be used to look up synsets:

>>> Aprikose = de.synsets(ili=apricot.ili.id)[0]
>>> Aprikose.lemmas()
['Marille', 'Aprikose']
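A single ILI can also be looked up across every lexicon in the database at once via the module-level wn.synsets() function. The following is only a sketch; the output is illustrative and assumes that no lexicons other than those downloaded above have been added:

>>> sorted(ss.lexicon().id for ss in wn.synsets(ili='i77784'))
['odenet', 'oewn', 'omw-ja']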

Translating Words, Senses, and Synsets

Rather than requiring you to manually insert ILI IDs into its lookup functions as shown above, Wn provides the wn.Synset.translate() method to make this easier:

>>> apricot.translate(lexicon='odenet:1.4')
[Synset('odenet-13235-n')]

The method returns a list for two reasons: first, it's not guaranteed that the target lexicon has only one synset with the ILI and, second, you can translate to more than one lexicon at a time.
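For instance, here is a sketch of translating into two lexicons at once; it assumes the lexicon argument accepts a space-separated list of lexicon specifiers, as with Wn's other lookup functions, and the output is illustrative:

>>> translations = apricot.translate(lexicon='odenet:1.4 omw-ja:1.4')
>>> sorted(ss.lexicon().id for ss in translations)
['odenet', 'omw-ja']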

Sense objects also have a translate() method, returning a list of senses instead of synsets:

>>> de_senses = apricot.senses()[0].translate(lexicon='odenet:1.4')
>>> [s.word().lemma() for s in de_senses]
['Marille', 'Aprikose']

Word objects have a translate() method, too, but it works a bit differently. Since each word may be part of multiple synsets, the method returns a mapping of each of the word's senses to a list of translated words:

>>> result = en.words('apricot')[0].translate(lexicon='odenet:1.4')
>>> for sense, de_words in result.items():
...     print(sense, [w.lemma() for w in de_words])
...
Sense('oewn-apricot__1.20.00..') []
Sense('oewn-apricot__1.13.00..') ['Marille', 'Aprikose']
Sense('oewn-apricot__1.07.00..') ['lachsrosa', 'lachsfarbig', 'in Lachs', 'lachsfarben', 'lachsrot', 'lachs']

The three senses above are for apricot as a tree, a fruit, and a color. OdeNet either does not have a synset for apricot trees or has one that is not associated with the appropriate ILI, so no words could be translated for that sense.
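If you only care about the senses that found a translation, you can filter the mapping. A small sketch, reusing the result from above:

>>> # keep only senses with at least one translated word
>>> translated = {sense: words for sense, words in result.items() if words}
>>> len(translated)
2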

Cross-lingual Relation Traversal

ILIs have a second use in Wn: relation traversal for wordnets that depend on other lexicons, i.e., those created with the expand methodology. These wordnets, such as many of those in the Open Multilingual Wordnet, do not include synset relations of their own, as they were built using the English WordNet as their taxonomic scaffolding. Loading such a lexicon when the lexicon it requires has not been added to the database issues a warning:

>>> ja = wn.Wordnet('omw-ja:1.4')
[...] WnWarning: lexicon dependencies not available: omw-en:1.4
>>> ja.expanded_lexicons()
[]

Warning

Do not rely on the presence of a warning to determine if the lexicon has its expand lexicon loaded. Python's default warning filter may only show the warning the first time it is encountered. Instead, inspect wn.Wordnet.expanded_lexicons() to see if it is non-empty.
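For example, a minimal check on the Wordnet object created above, which was loaded before its dependency was added:

>>> bool(ja.expanded_lexicons())
False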

When a dependency is unmet, Wn only issues a warning, not an error, and you can continue to use the lexicon as it is, but it won't be useful for exploring relations such as hypernyms and hyponyms:

>>> anzu = ja.synsets(ili='i77784')[0]
>>> anzu.lemmas()
['アンズ', 'アプリコット', '杏']
>>> anzu.hypernyms()
[]

One way to resolve this issue is to install the lexicon it requires:

>>> wn.download('omw-en:1.4')
>>> ja = wn.Wordnet('omw-ja:1.4')  # no warning
>>> ja.expanded_lexicons()
[<Lexicon omw-en:1.4 [en]>]

Wn will detect the dependency and load omw-en:1.4 as the expand lexicon for omw-ja:1.4 when the former is in the database. You may also specify an expand lexicon manually, even one that isn't the specified dependency:

>>> ja = wn.Wordnet('omw-ja:1.4', expand='oewn:2021')  # no warning
>>> ja.expanded_lexicons()
[<Lexicon oewn:2021 [en]>]

In this case, the Open English WordNet is an actively developed fork of the lexicon that omw-ja:1.4 depends on, and it should contain all the relations, so you'll see little difference between using it and omw-en:1.4. This works because the relations are found using ILIs and not synset offsets. You may still prefer to use the specified dependency if you have strict compatibility needs, such as experiment reproducibility or compatibility with the NLTK. Using some other lexicon as the expand lexicon, however, may yield very different results. For instance, odenet:1.4 is much smaller than the English wordnets and has fewer relations, so it would not be a good substitute for omw-ja:1.4's expand lexicon.

When an appropriate expand lexicon is loaded, relations between synsets, such as hypernyms, are more likely to be present:

>>> anzu = ja.synsets(ili='i77784')[0]  # recreate the synset object
>>> anzu.hypernyms()
[Synset('omw-ja-07705931-n')]
>>> anzu.hypernyms()[0].lemmas()
['果物']
>>> anzu.hypernyms()[0].translate(lexicon='oewn:2021')[0].lemmas()
['edible fruit']
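As a closing check, the earlier claim that omw-en:1.4 and oewn:2021 behave much the same as expand lexicons can be tested by comparing the ILIs of the hypernyms found with each. This is only a sketch, and the True result is illustrative; it assumes both English wordnets link this concept to the same hypernym:

>>> ja_omw = wn.Wordnet('omw-ja:1.4', expand='omw-en:1.4')
>>> anzu_omw = ja_omw.synsets(ili='i77784')[0]
>>> {ss.ili.id for ss in anzu_omw.hypernyms()} == {ss.ili.id for ss in anzu.hypernyms()}
True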