Interlingual Queries¶

This guide explains how interlingual queries work within Wn. To get started, you'll need at least two lexicons that use interlingual indices (ILIs). For this guide, we'll use the Open English WordNet (oewn:2024), the Open German WordNet (odenet:1.4), also known as OdeNet, and the Japanese wordnet (omw-ja:1.4).

>>> import wn
>>> wn.download('oewn:2024')
>>> wn.download('odenet:1.4')
>>> wn.download('omw-ja:1.4')

We will query these wordnets with the following Wordnet objects:

>>> en = wn.Wordnet('oewn:2024')
>>> de = wn.Wordnet('odenet:1.4')

The object for the Japanese wordnet will be discussed and created below, in Cross-lingual Relation Traversal.

What are Interlingual Indices?¶

It is common for users of the Princeton WordNet to refer to synsets by their WNDB offset and type, but this is problematic because the offset is a byte-offset in the wordnet data files and it will differ for wordnets in other languages and even between versions of the same wordnet. Interlingual indices (ILIs) address this issue by providing stable identifiers for concepts, whether for a synset across versions of a wordnet or across languages.

The idea of ILIs was proposed by [Vossen99] and it came to fruition with the release of the Collaborative Interlingual Index (CILI; [Bond16]). CILI therefore represents an instance of, and a namespace for, ILIs. There could, in theory, be alternative indexes for particular domains (e.g., names of people or places), but currently there is only the one.

As an example, the synset for apricot (fruit) in WordNet 3.0 is 07750872-n, but it is 07766848-n in WordNet 3.1. In OdeNet 1.4, which is not released in the WNDB format and therefore doesn't use offsets at all, it is 13235-n for the equivalent word (Aprikose). However, all three use the same ILI: i77784.

Generally, only one synset within a wordnet will be mapped to a particular ILI, but this may not always be true, nor does every synset necessarily map to an ILI. Some concepts that are lexicalized in one language may not be in another language. For example, rice in English may refer to the rice plant, rice grain, or cooked rice, but in languages like Japanese they are distinct things (稲 ine, 米 kome, and 飯 meshi / ご飯 gohan, respectively).

The ili property of Synsets serves two purposes in Wn. Mainly it is for encoding the ILI identifier associated with the synset, but it is also used to indicate when a lexicon is proposing a new concept that is not yet part of CILI. In the latter case, a WN-LMF lexicon file will have the special value of in for a synset's ILI and it will provide an <ILIDefinition> element. In Wn, this translates to wn.Synset.ili returning None, the same as if no ILI were mapped at all. Both synsets with proposed ILIs and those with no ILI cannot be used in interlingual queries. Proposed ILIs can be inspected using the wn.ili.get_proposed function, if you know have the synset, or wn.ili.get_all_proposed to get all of them.

[Vossen99]

Vossen, Piek, Wim Peters, and Julio Gonzalo. "Towards a universal index of meaning." In Proceedings of ACL-99 workshop, Siglex-99, standardizing lexical resources, pp. 81-90. University of Maryland, 1999.

[Bond16]

Bond, Francis, Piek Vossen, John Philip McCrae, and Christiane Fellbaum. "CILI: the Collaborative Interlingual Index." In Proceedings of the 8th Global WordNet Conference (GWC), pp. 50-57. 2016.

Using Interlingual Indices¶

For synsets that have an associated ILI, you can retrieve it via the wn.Synset.ili property:

>>> apricot = en.synsets('apricot')[1]
>>> apricot.ili
'i77784'

The value is a str ILI identifier. These may be used directly for things like interlingual synset lookups:

>>> de.synsets(ili=apricot.ili)[0].lemmas()
['Marille', 'Aprikose']

There may be more information about the ILI itself which you can get from the wn.ili module:

>>> from wn import ili
>>> apricot_ili = ili.get(apricot.ili)
>>> apricot_ili
ILI(id='i77784')

From this object you can get various properties of the ILI, such as the ID string, its status, and its definition, but if you have not added CILI to Wn's database, it will not be very informative:

>>> apricot_ili.id
'i77784'
>>> apricot_ili.status
'presupposed'
>>> apricot_ili.definition() is None
True

The presupposed status means that the ILI ID is in use by a lexicon, but there is no other source of truth for the index. CILI can be downloaded just like a lexicon:

>>> wn.download('cili:1.0')

Now the status and definition should be more useful:

>>> apricot_ili.status
'active'
>>> apricot_ili.definition()
'downy yellow to rosy-colored fruit resembling a small peach'

Translating Words, Senses, and Synsets¶

Rather than manually inserting the ILI IDs into Wn's lookup functions as shown above, Wn provides the wn.Synset.translate() method to make it easier:

>>> apricot.translate(lexicon='odenet:1.4')
[Synset('odenet-13235-n')]

The method returns a list for two reasons: first, it's not guaranteed that the target lexicon has only one synset with the ILI and, second, you can translate to more than one lexicon at a time.

Sense objects also have a translate() method, returning a list of senses instead of synsets:

>>> de_senses = apricot.senses()[0].translate(lexicon='odenet:1.4')
>>> [s.word().lemma() for s in de_senses]
['Marille', 'Aprikose']

Word have a translate() method, too, but it works a bit differently. Since each word may be part of multiple synsets, the method returns a mapping of each word sense to the list of translated words:

>>> result = en.words('apricot')[0].translate(lexicon='odenet:1.4')
>>> for sense, de_words in result.items():
...     print(sense, [w.lemma() for w in de_words])
...
Sense('oewn-apricot__1.20.00..') []
Sense('oewn-apricot__1.13.00..') ['Marille', 'Aprikose']
Sense('oewn-apricot__1.07.00..') ['lachsrosa', 'lachsfarbig', 'in Lachs', 'lachsfarben', 'lachsrot', 'lachs']

The three senses above are for apricot as a tree, a fruit, and a color. OdeNet does not have a synset for apricot trees, or it has one not associated with the appropriate ILI, and therefore it could not translate any words for that sense.

Cross-lingual Relation Traversal¶

ILIs have a second use in Wn, which is relation traversal for wordnets that depend on other lexicons, i.e., those created with the expand methodology. These wordnets, such as many of those in the Open Multilingual Wordnet, do not include synset relations on their own as they were built using the English WordNet as their taxonomic scaffolding. Trying to load such a lexicon when the lexicon it requires is not added to the database presents a warning to the user:

>>> ja = wn.Wordnet('omw-ja:1.4')
[...] WnWarning: lexicon dependencies not available: omw-en:1.4
>>> ja.expanded_lexicons()
[]

Warning

Do not rely on the presence of a warning to determine if the lexicon has its expand lexicon loaded. Python's default warning filter may only show the warning the first time it is encountered. Instead, inspect wn.Wordnet.expanded_lexicons() to see if it is non-empty.

When a dependency is unmet, Wn only issues a warning, not an error, and you can continue to use the lexicon as it is, but it won't be useful for exploring relations such as hypernyms and hyponyms:

>>> anzu = ja.synsets(ili='i77784')[0]
>>> anzu.lemmas()
['アンズ', 'アプリコット', '杏']
>>> anzu.hypernyms()
[]

One way to resolve this issue is to install the lexicon it requires:

>>> wn.download('omw-en:1.4')
>>> ja = wn.Wordnet('omw-ja:1.4')  # no warning
>>> ja.expanded_lexicons()
[<Lexicon omw-en:1.4 [en]>]

Wn will detect the dependency and load omw-en:1.4 as the expand lexicon for omw-ja:1.4 when the former is in the database. You may also specify an expand lexicon manually, even one that isn't the specified dependency:

>>> ja = wn.Wordnet('omw-ja:1.4', expand='oewn:2024')  # no warning
>>> ja.expanded_lexicons()
[<Lexicon oewn:2024 [en]>]

In this case, the Open English WordNet is an actively-developed fork of the lexicon that omw-ja:1.4 depends on, and it should contain all the relations, so you'll see little difference between using it and omw-en:1.4. This works because the relations are found using ILIs and not synset offsets. You may still prefer to use the specified dependency if you have strict compatibility needs, such as for experiment reproducibility and/or compatibility with the NLTK. Using some other lexicon as the expand lexicon may yield very different results. For instance, odenet:1.4 is much smaller than the English wordnets and has fewer relations, so it would not be a good substitute for omw-ja:1.4's expand lexicon.

When an appropriate expand lexicon is loaded, relations between synsets, such as hypernyms, are more likely to be present:

>>> anzu = ja.synsets(ili='i77784')[0]  # recreate the synset object
>>> anzu.hypernyms()
[Synset('omw-ja-07705931-n')]
>>> anzu.hypernyms()[0].lemmas()
['果物']
>>> anzu.hypernyms()[0].translate(lexicon='oewn:2024')[0].lemmas()
['edible fruit']