wn.ic

Information Content is a corpus-based metric of synset or sense specificity.

The mathematical formulae for information content are defined in Formal Description, and the corresponding Python API functions are described in Calculating Information Content. These functions require information content weights, obtained either by computing them from a corpus or by loading pre-computed weights from a file.

Note

The term information content can be ambiguous. It often, and most accurately, refers to the result of the information_content() function (\(\text{IC}(c)\) in the mathematical notation), but it is also sometimes used to refer to the corpus frequencies/weights (\(\text{freq}(c)\) in the mathematical notation) returned by load() or compute(), as these weights are the basis of the value computed by information_content(). The Wn documentation tries to consistently refer to the former as the information content value, or just information content, and to the latter as information content weights, or weights.

Formal Description

The Information Content (IC) of a concept (synset) is a measure of its specificity computed from the wordnet's taxonomy structure and corpus frequencies. It is defined by Resnik 1995 ([RES95]), following information theory, as the negative log-probability of a concept:

\[\text{IC}(c) = -\log{p(c)}\]

A concept's probability is the empirical probability over a corpus:

\[p(c) = \frac{\text{freq}(c)}{N}\]

Here, \(N\) is the total count of words of the same category as concept \(c\) ([RES95] only considered nouns) where each word has some representation in the wordnet, and \(\text{freq}\) is defined as the sum of corpus counts of words in \(\text{words}(c)\), which is the set of words subsumed by concept \(c\):

\[\text{freq}(c) = \sum_{w \in \text{words}(c)}{\text{count}(w)}\]

It is common for \(\text{freq}\) to not contain actual frequencies but instead weights distributed evenly among the synsets for a word. These weights are calculated as the word frequency divided by the number of synsets for the word:

\[\text{freq}_{\text{distributed}}(c) = \sum_{w \in \text{words}(c)}{\frac{\text{count}(w)}{|\text{synsets}(w)|}}\]
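As a rough illustration of the two definitions of \(\text{freq}\), the following sketch computes both for a single concept. The counts, the set of subsumed words, and the number of synsets for olive are made-up toy values, not data from any wordnet or corpus.

```python
# Toy corpus counts, count(w). All numbers are hypothetical.
count = {"stone fruit": 1, "almond": 4, "olive": 6}

# Hypothetical |synsets(w)| for each word (only almond's two synsets
# are taken from the documentation; the others are assumptions).
n_synsets = {"stone fruit": 1, "almond": 2, "olive": 4}

# words(c): the words subsumed by the concept "stone fruit" in this toy setup.
words_c = ["stone fruit", "almond", "olive"]

# freq(c): plain sum of the corpus counts of the subsumed words.
freq_c = sum(count[w] for w in words_c)

# freq_distributed(c): each word's count is split evenly among its synsets.
freq_distributed_c = sum(count[w] / n_synsets[w] for w in words_c)

print(freq_c, freq_distributed_c)
```

With distributed weights, a highly polysemous word contributes less to each of its synsets than an unambiguous word with the same count.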

[RES95] Resnik, Philip. "Using information content to evaluate semantic similarity." In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), Montreal, Canada, pp. 448-453. 1995.

Example

In the Princeton WordNet 3.0 (hereafter WordNet; note that the equivalent lexicon in Wn is the OMW English Wordnet based on WordNet 3.0, with specifier omw-en:1.4), the frequency of a concept like stone fruit is not just the number of occurrences of stone fruit; it also includes the counts of the words for its hyponyms (almond, olive, etc.) and other taxonomic descendants (Jordan almond, green olive, etc.). The word almond has two synsets: one for the fruit or nut, another for the plant. Thus, if the word almond is encountered \(n\) times in a corpus, then the weight (either the frequency \(n\) or the distributed weight \(\frac{n}{2}\)) is added to the total weights of both synsets and to those of their ancestors, but not to those of descendant synsets such as Jordan almond. The fruit/nut synset of almond has two hypernym paths which converge on fruit:

almond ⊃ stone fruit ⊃ fruit
almond ⊃ nut ⊃ seed ⊃ fruit

The weight is added to each ancestor (stone fruit, nut, seed, fruit, …) exactly once. In particular, fruit, where the two paths converge, receives the weight once, not twice.
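The once-per-ancestor rule can be sketched with a small hand-written hypernym map. This is only an illustration of the idea, not Wn's implementation; the names hypernyms, ancestors, and add_weight are hypothetical, and the map covers just the two paths shown above.

```python
from collections import defaultdict

# Hypothetical child -> parents map for the fruit/nut sense of almond.
hypernyms = {
    "almond": ["stone fruit", "nut"],
    "stone fruit": ["fruit"],
    "nut": ["seed"],
    "seed": ["fruit"],
    "fruit": [],
}

def ancestors(node):
    """Collect all ancestors of node in a set, so each appears only once."""
    seen = set()
    stack = list(hypernyms[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(hypernyms[current])
    return seen

def add_weight(freq, node, weight):
    """Add weight to node and to each distinct ancestor (never to descendants)."""
    freq[node] += weight
    for ancestor in ancestors(node):
        freq[ancestor] += weight

freq = defaultdict(float)
add_weight(freq, "almond", 3.0)
print(dict(freq))
```

Because the ancestors are gathered into a set before any weight is added, fruit receives 3.0 rather than 6.0 even though two paths lead to it.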

Calculating Information Content

wn.ic.information_content(synset: Synset, freq: dict[str, dict[str | None, float]]) → float

Calculate the Information Content value for a synset.

The information content of a synset is the negative log of the synset probability (see synset_probability()).

wn.ic.synset_probability(synset: Synset, freq: dict[str, dict[str | None, float]]) → float

Calculate the synset probability.

The synset probability is defined as freq(ss)/N where freq(ss) is the IC weight for the synset and N is the total IC weight for all synsets with the same part of speech.

Note: this function is not generally used directly, but indirectly through information_content().
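To make the relationship between the two functions concrete, here is a minimal sketch that mimics them on a plain dictionary instead of real Synset objects. It assumes that the total weight N for a part of speech is stored under the None key, which the str | None key type in the signatures suggests but this page does not state. The functions toy_probability and toy_information_content and the synset IDs are hypothetical.

```python
import math

# Hypothetical weights in the shape dict[pos][synset_id] -> weight.
# Assumption: the None key holds N, the total weight for that part of speech.
freq = {"n": {None: 8.0, "toy-dog-n": 2.0, "toy-animal-n": 4.0}}

def toy_probability(synset_id, pos, freq):
    """freq(ss) / N for a synset ID and its part of speech."""
    return freq[pos][synset_id] / freq[pos][None]

def toy_information_content(synset_id, pos, freq):
    """Negative log of the toy synset probability."""
    return -math.log(toy_probability(synset_id, pos, freq))

ic_dog = toy_information_content("toy-dog-n", "n", freq)
ic_animal = toy_information_content("toy-animal-n", "n", freq)
print(ic_dog > ic_animal)
```

The more specific toy concept has the smaller weight, hence the lower probability and the higher information content.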

Computing Corpus Weights

If pre-computed weights are not available for a wordnet or for some domain, they can be computed given a corpus and a wordnet.

The corpus is an iterable of words. For large corpora it may help to use a generator for this iterable, but the entire vocabulary (i.e., unique words and counts) will be held at once in memory. Multi-word expressions are also possible if they exist in the wordnet. For instance, WordNet has stone fruit, with a single space delimiting the words, as an entry.
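A corpus generator might look like the following sketch. The tokenizer is a deliberately naive regular expression and iter_tokens is a hypothetical helper, not part of Wn; an in-memory io.StringIO stands in for an open text file.

```python
import io
import re

def iter_tokens(lines):
    """Lazily yield word and punctuation tokens from an iterable of lines."""
    for line in lines:
        # Runs of word characters, or single non-space punctuation marks.
        yield from re.findall(r"\w+|[^\w\s]", line)

# Stand-in for a file object; any iterable of lines would do.
sample = io.StringIO("Dogs run.\nCats sleep.\n")
tokens = iter_tokens(sample)
```

The resulting generator could then be passed as the corpus argument of compute(). Note that this naive tokenizer never produces multi-word tokens such as stone fruit; detecting those would need an extra step.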

The wn.Wordnet object must be instantiated with a single lexicon, although it may have expand-lexicons for relation traversal. For best results, the wordnet should use a lemmatizer to help it deal with inflected wordforms from running text.

wn.ic.compute(corpus: Iterable[str], wordnet: Wordnet, distribute_weight: bool = True, smoothing: float = 1.0) → dict[str, dict[str | None, float]]

Compute Information Content weights from a corpus.

Parameters:

corpus – An iterable of string tokens. This is a flat list of words and the order does not matter. Tokens may be single words or multiple words separated by a space.
wordnet – An instantiated wn.Wordnet object, used to look up synsets from words.
distribute_weight – If True, the counts for a word are divided evenly among all synsets for the word.
smoothing – The initial value given to each synset.

Example

>>> import wn, wn.ic, wn.morphy
>>> ewn = wn.Wordnet("ewn:2020", lemmatizer=wn.morphy.morphy)
>>> freq = wn.ic.compute(["Dogs", "run", ".", "Cats", "sleep", "."], ewn)
>>> dog = ewn.synsets("dog", pos="n")[0]
>>> cat = ewn.synsets("cat", pos="n")[0]
>>> frog = ewn.synsets("frog", pos="n")[0]
>>> freq["n"][dog.id]
1.125
>>> freq["n"][cat.id]
1.1
>>> freq["n"][frog.id]  # no occurrence; smoothing value only
1.0
>>> carnivore = dog.lowest_common_hypernyms(cat)[0]
>>> freq["n"][carnivore.id]
1.3250000000000002

Reading Pre-computed Information Content Files

The load() function reads pre-computed information content weights files as used by the WordNet::Similarity Perl module or the NLTK Python package. These files are computed for a specific version of a wordnet using the synset offsets from the WNDB format, which Wn does not use. These offsets must therefore be converted into identifiers that match those used by the wordnet. By default, load() combines the lexicon identifier from its wordnet argument with the synset offsets (zero-padded to 8 digits) and parts of speech from the weights file to format an identifier, such as omw-en-00001174-n. For wordnets that use a different identifier scheme, the get_synset_id parameter of load() can be given a callable created with wn.util.synset_id_formatter(). It can also be given any other callable with the following signature:

get_synset_id(*, offset: int, pos: str) -> str
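For illustration, a hand-written callable with this signature could look like the sketch below. The function name omw_en_synset_id is hypothetical, and the omw-en prefix is hard-coded only to reproduce the example identifier mentioned above.

```python
def omw_en_synset_id(*, offset: int, pos: str) -> str:
    """Format an ID from a WNDB offset (zero-padded to 8 digits) and a part of speech."""
    return f"omw-en-{offset:08}-{pos}"

print(omw_en_synset_id(offset=1174, pos="n"))
```

Both parameters are keyword-only, matching the signature shown above, so the callable must be invoked with offset= and pos= rather than positionally.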

When loading pre-computed information content files, it is recommended to use the ones with smoothing (i.e., *-add1.dat or *-resnik-add1.dat) to avoid math domain errors when computing the information content value.
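The math domain error arises because the logarithm of zero is undefined. The following sketch reproduces it with made-up numbers; negative_log_probability is a hypothetical helper, not a Wn function.

```python
import math

def negative_log_probability(weight, total):
    """Return -log(weight / total); fails when weight is zero."""
    return -math.log(weight / total)

# An unsmoothed weight of 0 gives a probability of 0, and log(0) is undefined.
raised = False
try:
    negative_log_probability(0.0, 100.0)
except ValueError:
    raised = True

# With a smoothed (nonzero) weight, the value is finite.
smoothed_ic = negative_log_probability(1.0, 100.0)
print(raised, smoothed_ic)
```

A synset that never occurs in the source corpus has zero weight in an unsmoothed file, which is why the smoothed files are the safer choice.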

Warning

The weights files are only valid for the wordnet version for which they were created. Files created for WordNet 3.0 do not work for WordNet 3.1 because the offsets used in its identifiers are different, although the get_synset_id parameter of load() could be given a function that performs a suitable mapping. Some Open Multilingual Wordnet wordnets use the WordNet 3.0 offsets in their identifiers and can therefore technically use the weights, but this usage is discouraged because the distributional properties of text in another language and the structure of the other wordnet will not be compatible with those of the English WordNet. For these cases, it is recommended to compute new weights using compute().

wn.ic.load(source: str | Path, wordnet: Wordnet, get_synset_id: Callable | None = None) → dict[str, dict[str | None, float]]

Load an Information Content mapping from a file.

Parameters:

source – A path to an information content weights file.
wordnet – A wn.Wordnet instance with synset identifiers matching the offsets in the weights file.
get_synset_id – A callable that takes a synset offset and part of speech and returns a synset ID valid in wordnet.

Raises:

wn.Error – If wordnet does not have exactly one lexicon.

Example

>>> import wn, wn.ic
>>> pwn = wn.Wordnet("pwn:3.0")
>>> path = "~/nltk_data/corpora/wordnet_ic/ic-brown-resnik-add1.dat"
>>> freq = wn.ic.load(path, pwn)