wn.similarity

Synset similarity metrics.

Taxonomy-based Metrics

The Path, Leacock-Chodorow, and Wu-Palmer similarity metrics work by finding path distances in the hypernym/hyponym taxonomy. As such, they are most useful when the synsets are, in fact, arranged in a taxonomy. For the Princeton WordNet and derivative wordnets available to Wn, such as the Open English WordNet and the OMW English Wordnet based on WordNet 3.0, synsets for nouns and verbs are arranged taxonomically: the nouns mostly form a single structure with a single root, while the verbs form many smaller structures with many roots. Synsets for the other parts of speech do not use hypernym/hyponym relations at all. This situation may be different for other wordnet projects or future versions of the English wordnets.

The similarity metrics fail when the synsets are not connected by any path. When the synsets are in different parts of speech, or even in separate lexicons, this failure is acceptable and expected. But for cases like the verbs in the Princeton WordNet, it may be more useful to pretend that there is a unique root for all verbs so that a path connects any two of them. For this purpose, the simulate_root parameter is available on the path(), lch(), and wup() functions, where it is passed on to calls to wn.Synset.shortest_path() and wn.Synset.lowest_common_hypernyms(). Setting simulate_root to True can, however, give surprising results if the synsets come from different lexicons. Currently, computing similarity for synsets from different parts of speech raises an error.

Path Similarity

When \(p\) is the length of the shortest path between two synsets, the path similarity is:

\[\frac{1}{p + 1}\]

The similarity score ranges between 0.0 and 1.0, where the higher the score is, the more similar the synsets are. The score is 1.0 when a synset is compared to itself, and 0.0 when there is no path between the two synsets (i.e., the path distance is infinite).
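
For example, a shortest path of length 4, which is consistent with the spatula and utensil synsets in the example below, gives:

\[\frac{1}{4 + 1} = 0.2\]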

wn.similarity.path(synset1, synset2, simulate_root=False)

Return the Path similarity of synset1 and synset2.

Parameters
  • synset1 (wn.Synset) – The first synset to compare.

  • synset2 (wn.Synset) – The second synset to compare.

  • simulate_root (bool) – When True, a fake root node connects all other roots; default: False.

Return type

float

Example

>>> import wn
>>> from wn.similarity import path
>>> ewn = wn.Wordnet('ewn:2020')
>>> spatula = ewn.synsets('spatula')[0]
>>> path(spatula, ewn.synsets('pancake')[0])
0.058823529411764705
>>> path(spatula, ewn.synsets('utensil')[0])
0.2
>>> path(spatula, spatula)
1.0
>>> flip = ewn.synsets('flip', pos='v')[0]
>>> turn_over = ewn.synsets('turn over', pos='v')[0]
>>> path(flip, turn_over)
0.0
>>> path(flip, turn_over, simulate_root=True)
0.16666666666666666

Leacock-Chodorow Similarity

When \(p\) is the length of the shortest path between two synsets and \(d\) is the maximum taxonomy depth, the Leacock-Chodorow similarity is:

\[-\text{log}\left(\frac{p + 1}{2d}\right)\]
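
For example, a shortest path of length 4 and a taxonomy depth of 19, which are consistent with the spatula and utensil values in the example below, give:

\[-\text{log}\left(\frac{4 + 1}{2 \times 19}\right) = -\text{log}\left(\frac{5}{38}\right) \approx 2.028\]
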
wn.similarity.lch(synset1, synset2, max_depth, simulate_root=False)

Return the Leacock-Chodorow similarity between synset1 and synset2.

Parameters
  • synset1 (wn.Synset) – The first synset to compare.

  • synset2 (wn.Synset) – The second synset to compare.

  • max_depth (int) – The taxonomy depth (see wn.taxonomy.taxonomy_depth()).

  • simulate_root (bool) – When True, a fake root node connects all other roots; default: False.

Return type

float

Example

>>> import wn, wn.taxonomy
>>> from wn.similarity import lch
>>> ewn = wn.Wordnet('ewn:2020')
>>> n_depth = wn.taxonomy.taxonomy_depth(ewn, 'n')
>>> spatula = ewn.synsets('spatula')[0]
>>> lch(spatula, ewn.synsets('pancake')[0], n_depth)
0.8043728156701697
>>> lch(spatula, ewn.synsets('utensil')[0], n_depth)
2.0281482472922856
>>> lch(spatula, spatula, n_depth)
3.6375861597263857
>>> v_depth = wn.taxonomy.taxonomy_depth(ewn, 'v')
>>> flip = ewn.synsets('flip', pos='v')[0]
>>> turn_over = ewn.synsets('turn over', pos='v')[0]
>>> lch(flip, turn_over, v_depth, simulate_root=True)
1.3862943611198906

Wu-Palmer Similarity

When LCS is the lowest common hypernym (also called "least common subsumer") between two synsets, \(i\) is the shortest path distance from the first synset to LCS, \(j\) is the shortest path distance from the second synset to LCS, and \(k\) is the number of nodes (distance + 1) from LCS to the root node, then the Wu-Palmer similarity is:

\[\frac{2k}{i + j + 2k}\]
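
For example, when the second synset is itself the lowest common hypernym of the pair, so that \(i = 4\), \(j = 0\), and \(k = 8\), which is consistent with the spatula and utensil synsets in the example below, the score is:

\[\frac{2 \times 8}{4 + 0 + 2 \times 8} = \frac{16}{20} = 0.8\]
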
wn.similarity.wup(synset1, synset2, simulate_root=False)

Return the Wu-Palmer similarity of synset1 and synset2.

Parameters
  • synset1 (wn.Synset) – The first synset to compare.

  • synset2 (wn.Synset) – The second synset to compare.

  • simulate_root (bool) – When True, a fake root node connects all other roots; default: False.

Raises

wn.Error – When no path connects synset1 and synset2.

Return type

float

Example

>>> import wn
>>> from wn.similarity import wup
>>> ewn = wn.Wordnet('ewn:2020')
>>> spatula = ewn.synsets('spatula')[0]
>>> wup(spatula, ewn.synsets('pancake')[0])
0.2
>>> wup(spatula, ewn.synsets('utensil')[0])
0.8
>>> wup(spatula, spatula)
1.0
>>> flip = ewn.synsets('flip', pos='v')[0]
>>> turn_over = ewn.synsets('turn over', pos='v')[0]
>>> wup(flip, turn_over, simulate_root=True)
0.2857142857142857

Information Content-based Metrics

The Resnik, Jiang-Conrath, and Lin similarity metrics work by computing the information content of the synsets and/or that of their lowest common hypernyms. They therefore require information content weights (see wn.ic), and the values returned necessarily depend on the weights used.
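
In the examples below the weights are loaded from a precomputed file with wn.ic.load(). They can also be computed directly from a corpus; the following is a minimal sketch, assuming wn.ic.compute() accepts an iterable of word forms:

>>> import wn, wn.ic
>>> pwn = wn.Wordnet('pwn:3.0')
>>> words = ['spatula', 'pancake', 'utensil']  # toy corpus; real weights need real text
>>> freq = wn.ic.compute(words, pwn)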

Resnik Similarity

The Resnik similarity (Resnik 1995) is the maximum information content value of the common subsumers (hypernym ancestors) of the two synsets. Formally it is defined as follows, where \(c_1\) and \(c_2\) are the two synsets being compared and \(\text{S}(c_1, c_2)\) is the set of their common subsumers:

\[\text{max}_{c \in \text{S}(c_1, c_2)} \text{IC}(c)\]

Since a synset's information content is always equal to or greater than that of its hypernyms, \(S(c_1, c_2)\) above is more efficiently computed using the lowest common hypernyms instead of all common hypernyms.
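
As a sketch of this equivalence, assuming wn.ic.information_content() and wn.Synset.lowest_common_hypernyms() as used elsewhere in this documentation, the maximum information content over the lowest common hypernyms should reproduce the res() value from the example below:

>>> import wn, wn.ic
>>> pwn = wn.Wordnet('pwn:3.0')
>>> ic = wn.ic.load('~/nltk_data/corpora/wordnet_ic/ic-brown.dat', pwn)
>>> spatula = pwn.synsets('spatula')[0]
>>> utensil = pwn.synsets('utensil')[0]
>>> max(wn.ic.information_content(c, ic)
...     for c in spatula.lowest_common_hypernyms(utensil))
5.87738923441087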

wn.similarity.res(synset1, synset2, ic)

Return the Resnik similarity between synset1 and synset2.

Parameters
  • synset1 (wn.Synset) – The first synset to compare.

  • synset2 (wn.Synset) – The second synset to compare.

  • ic – The information content weights (see wn.ic).

Return type

float

Example

>>> import wn, wn.ic, wn.taxonomy
>>> from wn.similarity import res
>>> pwn = wn.Wordnet('pwn:3.0')
>>> ic = wn.ic.load('~/nltk_data/corpora/wordnet_ic/ic-brown.dat', pwn)
>>> spatula = pwn.synsets('spatula')[0]
>>> res(spatula, pwn.synsets('pancake')[0], ic)
0.8017591149538994
>>> res(spatula, pwn.synsets('utensil')[0], ic)
5.87738923441087

Jiang-Conrath Similarity

The Jiang-Conrath similarity metric (Jiang and Conrath, 1997) combines the ideas of the taxonomy-based and information content-based metrics. It is defined as follows, where \(c_1\) and \(c_2\) are the two synsets being compared and \(c_0\) is the lowest common hypernym of the two with the highest information content weight:

\[\frac{1}{\text{IC}(c_1) + \text{IC}(c_2) - 2(\text{IC}(c_0))}\]

This equation is the simplified form given in the paper, where several parameterized terms cancel out; the full form is not often used in practice.

There are two special cases:

  1. If the information content values of \(c_0\), \(c_1\), and \(c_2\) are all zero, the metric returns zero. This occurs when both \(c_1\) and \(c_2\) are the root node, but it can also occur if the synsets did not occur in the corpus and the smoothing value was set to zero.

  2. Otherwise, if \(\text{IC}(c_1) + \text{IC}(c_2) = 2(\text{IC}(c_0))\), the metric returns infinity. This occurs when the two synsets are the same, when one is a descendant of the other, etc., such that they have the same frequency as each other and as their lowest common hypernym (see the illustration below).
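
The second case can be illustrated by comparing a synset with itself: then \(c_0 = c_1 = c_2\), the denominator of the metric is zero, and the result should be infinity:

>>> import wn, wn.ic
>>> from wn.similarity import jcn
>>> pwn = wn.Wordnet('pwn:3.0')
>>> ic = wn.ic.load('~/nltk_data/corpora/wordnet_ic/ic-brown.dat', pwn)
>>> spatula = pwn.synsets('spatula')[0]
>>> jcn(spatula, spatula, ic)
inf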

wn.similarity.jcn(synset1, synset2, ic)

Return the Jiang-Conrath similarity of two synsets.

Parameters
  • synset1 (wn.Synset) – The first synset to compare.

  • synset2 (wn.Synset) – The second synset to compare.

  • ic – The information content weights (see wn.ic).

Return type

float

Example

>>> import wn, wn.ic, wn.taxonomy
>>> from wn.similarity import jcn
>>> pwn = wn.Wordnet('pwn:3.0')
>>> ic = wn.ic.load('~/nltk_data/corpora/wordnet_ic/ic-brown.dat', pwn)
>>> spatula = pwn.synsets('spatula')[0]
>>> jcn(spatula, pwn.synsets('pancake')[0], ic)
0.04061799236354239
>>> jcn(spatula, pwn.synsets('utensil')[0], ic)
0.10794048564613007

Lin Similarity

Another formulation of information content-based similarity is the Lin metric (Lin 1997), which is defined as follows, where \(c_1\) and \(c_2\) are the two synsets being compared and \(c_0\) is the lowest common hypernym with the highest information content weight:

\[\frac{2(\text{IC}(c_0))}{\text{IC}(c_1) + \text{IC}(c_2)}\]

There is one special case: if either synset has an information content value of zero, the metric returns zero.
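
Since the numerator \(2(\text{IC}(c_0))\) is twice the Resnik similarity, the Lin score can also be reconstructed from res() and the synsets' own information content values. A minimal sketch, assuming wn.ic.information_content() as used above:

>>> import wn, wn.ic
>>> from wn.similarity import res
>>> pwn = wn.Wordnet('pwn:3.0')
>>> ic = wn.ic.load('~/nltk_data/corpora/wordnet_ic/ic-brown.dat', pwn)
>>> spatula = pwn.synsets('spatula')[0]
>>> utensil = pwn.synsets('utensil')[0]
>>> ic1 = wn.ic.information_content(spatula, ic)
>>> ic2 = wn.ic.information_content(utensil, ic)
>>> sim = 2 * res(spatula, utensil, ic) / (ic1 + ic2)  # should match lin(spatula, utensil, ic)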

wn.similarity.lin(synset1, synset2, ic)

Return the Lin similarity of two synsets.

Parameters
  • synset1 (wn.Synset) – The first synset to compare.

  • synset2 (wn.Synset) – The second synset to compare.

  • ic – The information content weights (see wn.ic).

Return type

float

Example

>>> import wn, wn.ic, wn.taxonomy
>>> from wn.similarity import lin
>>> pwn = wn.Wordnet('pwn:3.0')
>>> ic = wn.ic.load('~/nltk_data/corpora/wordnet_ic/ic-brown.dat', pwn)
>>> spatula = pwn.synsets('spatula')[0]
>>> lin(spatula, pwn.synsets('pancake')[0], ic)
0.061148956278604116
>>> lin(spatula, pwn.synsets('utensil')[0], ic)
0.5592415686750427