wn.ic
Information Content is a corpus-based metric of synset or sense specificity.
The mathematical formulae for information content are defined in Formal Description, and the corresponding Python API functions are described in Calculating Information Content. These functions require information content weights obtained either by computing them from a corpus, or by loading pre-computed weights from a file.
Note
The term information content can be ambiguous. It often, and most
accurately, refers to the result of the information_content()
function (\(\text{IC}(c)\) in the mathematical notation), but
is also sometimes used to refer to the corpus frequencies/weights
(\(\text{freq}(c)\) in the mathematical notation) returned by
load() or compute(), as these weights are the basis of
the value computed by information_content(). The Wn
documentation tries to consistently refer to the former as the
information content value, or just information content, and the
latter as information content weights, or weights.
Formal Description
The Information Content (IC) of a concept (synset) is a measure of its specificity computed from the wordnet's taxonomy structure and corpus frequencies. It is defined by Resnik 1995 ([RES95]), following information theory, as the negative log-probability of a concept:
\[\text{IC}(c) = -\log{p(c)}\]
A concept's probability is the empirical probability over a corpus:
\[p(c) = \frac{\text{freq}(c)}{N}\]
Here, \(N\) is the total count of words of the same category as concept \(c\) ([RES95] only considered nouns) where each word has some representation in the wordnet, and \(\text{freq}\) is defined as the sum of corpus counts of words in \(\text{words}(c)\), which is the set of words subsumed by concept \(c\):
\[\text{freq}(c) = \sum_{w \in \text{words}(c)}{\text{count}(w)}\]
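It is common for \(\text{freq}\) to not contain actual frequencies but instead weights distributed evenly among the synsets for a word. These weights are calculated as the word frequency divided by the number of synsets for the word:
\[\text{freq}_{\text{distributed}}(c) = \sum_{w \in \text{words}(c)}{\frac{\text{count}(w)}{|\text{synsets}(w)|}}\]
where \(\text{count}(w)\) is the corpus count of word \(w\) and \(\text{synsets}(w)\) is the set of synsets for \(w\).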
Example
In the Princeton WordNet 3.0 (hereafter WordNet, but note that the
equivalent lexicon in Wn is the OMW English Wordnet based on WordNet
3.0 with specifier omw-en:1.4), the frequency of a concept like
stone fruit is not just the number of occurrences of stone
fruit, but also includes the counts of the words for its hyponyms
(almond, olive, etc.) and other taxonomic descendants (Jordan
almond, green olive, etc.). The word almond has two synsets: one
for the fruit or nut, another for the plant. Thus, if the word
almond is encountered \(n\) times in a corpus, then the weight
(either the frequency \(n\) or distributed weight
\(\frac{n}{2}\)) is added to the total weights for both synsets
and to those of their ancestors, but not to those of descendant
synsets, such as Jordan almond. The fruit/nut synset of almond has two
hypernym paths which converge on fruit:
almond ⊃ stone fruit ⊃ fruit
almond ⊃ nut ⊃ seed ⊃ fruit
The weight is added to each ancestor (stone fruit, nut, seed, fruit, …) once. That is, the weight is not added to the convergent ancestor for fruit twice, but only once.
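The propagation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the taxonomy, the synset names, and the helpers `ancestors()` and `add_counts()` are hypothetical and do not reflect the real WordNet structure or the actual implementation in Wn.

```python
from collections import defaultdict

# Toy hypernym map (child -> parents), loosely modeled on the almond
# example. It is NOT the real WordNet structure.
HYPERNYMS = {
    "almond.fruit": ["stone_fruit", "nut"],
    "stone_fruit": ["fruit"],
    "nut": ["seed"],
    "seed": ["fruit"],
    "almond.plant": ["tree"],
    "jordan_almond": ["almond.fruit"],
}

# Hypothetical word -> synsets lookup.
SYNSETS = {"almond": ["almond.fruit", "almond.plant"]}


def ancestors(synset):
    """Return the set of all ancestors of a synset, each listed once."""
    seen = set()
    agenda = list(HYPERNYMS.get(synset, []))
    while agenda:
        node = agenda.pop()
        if node not in seen:
            seen.add(node)
            agenda.extend(HYPERNYMS.get(node, []))
    return seen


def add_counts(weights, word, count, distribute_weight=True):
    """Add a word's count to its synsets and to their ancestors."""
    synsets = SYNSETS[word]
    weight = count / len(synsets) if distribute_weight else count
    for synset in synsets:
        weights[synset] += weight
        # ancestors() returns a set, so a convergent ancestor such as
        # "fruit" receives the weight only once per synset.
        for ancestor in ancestors(synset):
            weights[ancestor] += weight


weights = defaultdict(float)
add_counts(weights, "almond", 4)  # "almond" seen 4 times in a corpus
```

With the distributed weight of 4 / 2 = 2.0, `fruit` ends up with 2.0 rather than 4.0 even though two hypernym paths reach it, and the descendant `jordan_almond` receives nothing.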
Calculating Information Content
- wn.ic.information_content(synset, freq)
Calculate the Information Content value for a synset.
The information content of a synset is the negative log of the synset probability (see synset_probability()).
- wn.ic.synset_probability(synset, freq)
Calculate the synset probability.
The synset probability is defined as freq(ss)/N where freq(ss) is the IC weight for the synset and N is the total IC weight for all synsets with the same part of speech.
Note: this function is not generally used directly, but indirectly through information_content().
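The relationship between the two functions can be illustrated with a small standalone sketch. It does not call the Wn API: the weights, the total `N`, and the helper names `probability()` and `ic()` are made up for illustration, and the use of the natural logarithm is an assumption.

```python
import math

# Made-up noun weights for illustration only; real weights would come
# from wn.ic.compute() or wn.ic.load().
noun_weights = {"entity": 8.0, "fruit": 2.0, "almond": 1.0}
N = 8.0  # assumed total weight for this part of speech


def probability(weight, total):
    """freq(ss) / N, as in the definition of the synset probability."""
    return weight / total


def ic(weight, total):
    """Negative log of the probability (natural log assumed here)."""
    return -math.log(probability(weight, total))


ic_values = {name: ic(w, N) for name, w in noun_weights.items()}
```

In this toy data the most general concept has an information content of zero, and the value grows as the concept becomes more specific (lower weight).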
Computing Corpus Weights
If pre-computed weights are not available for a wordnet or for some domain, they can be computed given a corpus and a wordnet.
The corpus is an iterable of words. For large corpora it may help to use a generator for this iterable, but the entire vocabulary (i.e., unique words and counts) will be held at once in memory. Multi-word expressions are also possible if they exist in the wordnet. For instance, WordNet has stone fruit, with a single space delimiting the words, as an entry.
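As a minimal sketch of the generator approach, the following yields tokens lazily from an iterable of lines. The helper name `iter_tokens()` and the whitespace tokenization are assumptions made for illustration; a real corpus would likely need a proper tokenizer, and plain whitespace splitting cannot produce multi-word tokens such as stone fruit.

```python
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield whitespace-separated tokens one at a time."""
    for line in lines:
        yield from line.split()


# Any iterable of lines works, including an open file object.
sample = ["Dogs run .", "Cats sleep ."]
tokens = iter_tokens(sample)
```

A generator like `tokens` could then be passed as the corpus argument, so that the text itself is never held in memory as a full list.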
The wn.Wordnet object must be instantiated with a single
lexicon, although it may have expand-lexicons for relation
traversal. For best results, the wordnet should use a lemmatizer to
help it deal with inflected wordforms from running text.
- wn.ic.compute(corpus, wordnet, distribute_weight=True, smoothing=1.0)
Compute Information Content weights from a corpus.
- Parameters
corpus (Iterable[str]) – An iterable of string tokens. This is a flat list of words and the order does not matter. Tokens may be single words or multiple words separated by a space.
wordnet (wn.Wordnet) – An instantiated wn.Wordnet object, used to look up synsets from words.
distribute_weight (bool) – If True, the counts for a word are divided evenly among all synsets for the word.
smoothing (float) – The initial value given to each synset.
- Return type
Example
>>> import wn, wn.ic, wn.morphy
>>> ewn = wn.Wordnet('ewn:2020', lemmatizer=wn.morphy.morphy)
>>> freq = wn.ic.compute(["Dogs", "run", ".", "Cats", "sleep", "."], ewn)
>>> dog = ewn.synsets('dog', pos='n')[0]
>>> cat = ewn.synsets('cat', pos='n')[0]
>>> frog = ewn.synsets('frog', pos='n')[0]
>>> freq['n'][dog.id]
1.125
>>> freq['n'][cat.id]
1.1
>>> freq['n'][frog.id]  # no occurrence; smoothing value only
1.0
>>> carnivore = dog.lowest_common_hypernyms(cat)[0]
>>> freq['n'][carnivore.id]
1.3250000000000002
Reading Pre-computed Information Content Files
The load() function reads pre-computed information content
weights files as used by the WordNet::Similarity Perl module or the
NLTK Python package. These files are computed for
a specific version of a wordnet using the synset offsets from the
WNDB format, which Wn does not use. These offsets therefore must be
converted into an identifier that matches those used by the wordnet.
By default, load() uses the lexicon identifier from its wordnet
argument with synset offsets (padded with 0s to make 8 digits) and
parts-of-speech from the weights file to format an identifier, such as
omw-en-00001174-n. For wordnets that use a different identifier
scheme, the get_synset_id parameter of load() can be given a
callable created with wn.util.synset_id_formatter(). It can also
be given another callable with the same signature as shown below:
get_synset_id(*, offset: int, pos: str) -> str
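A hand-written callable with this signature might look like the sketch below. The factory name `make_formatter()` and its template argument are hypothetical and are not part of Wn; the example only reproduces the default-style identifier mentioned above.

```python
def make_formatter(template: str):
    """Build a callable with the keyword-only signature shown above."""

    def get_synset_id(*, offset: int, pos: str) -> str:
        # {offset:08} zero-pads the integer offset to 8 digits.
        return template.format(offset=offset, pos=pos)

    return get_synset_id


# Hypothetical template reproducing identifiers like omw-en-00001174-n.
get_id = make_formatter("omw-en-{offset:08}-{pos}")
example_id = get_id(offset=1174, pos="n")
```

A callable built this way could be passed as the get_synset_id argument when a wordnet uses a different identifier scheme; only the template would need to change.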
When loading pre-computed information content files, it is recommended
to use the ones with smoothing (i.e., *-add1.dat or
*-resnik-add1.dat) to avoid math domain errors when computing the
information content value.
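The reason for this recommendation can be seen in a short standalone sketch: a zero weight gives a probability of zero, and the logarithm of zero is undefined. The helper `ic_from_weight()` and the numbers are hypothetical and only mirror the formula; this is not the Wn implementation.

```python
import math


def ic_from_weight(weight: float, total: float) -> float:
    """Negative log-probability for a given weight and total."""
    return -math.log(weight / total)


# An unsmoothed file may give a synset a weight of 0.
try:
    ic_from_weight(0.0, 100.0)
    failed = False
except ValueError:
    # math.log(0.0) raises ValueError (a math domain error).
    failed = True

# With add-1 smoothing the weight is at least 1, so the log is defined.
smoothed = ic_from_weight(1.0, 100.0)
```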
Warning
The weights files are only valid for the version of wordnet for
which they were created. Files created for WordNet 3.0 do not work
for WordNet 3.1 because the offsets used in its identifiers are
different, although the get_synset_id parameter of load()
could be given a function that performs a suitable mapping. Some
Open Multilingual Wordnet wordnets use the WordNet 3.0 offsets in
their identifiers and can therefore technically use the weights,
but this usage is discouraged because the distributional properties
of text in another language and the structure of the other wordnet
will not be compatible with those of the English WordNet. For these
cases, it is recommended to compute new weights using compute().
- wn.ic.load(source, wordnet, get_synset_id=None)
Load an Information Content mapping from a file.
- Parameters
source (Union[str, pathlib.Path]) – A path to an information content weights file.
wordnet (wn.Wordnet) – A wn.Wordnet instance with synset identifiers matching the offsets in the weights file.
get_synset_id (Optional[Callable]) – A callable that takes a synset offset and part of speech and returns a synset ID valid in wordnet.
- Raises
wn.Error – If wordnet does not have exactly one lexicon.
- Return type
Example
>>> import wn, wn.ic
>>> pwn = wn.Wordnet('pwn:3.0')
>>> path = '~/nltk_data/corpora/wordnet_ic/ic-brown-resnik-add1.dat'
>>> freq = wn.ic.load(path, pwn)