Lemmatization and Normalization

Wn provides two methods for expanding queries: lemmatization and normalization. Wn also has a setting that allows alternative forms stored in the database to be included in queries.

See also

The wn.morphy module is a basic English lemmatizer included with Wn.

Lemmatization

When querying a wordnet with wordforms from natural language text, it is important to be able to find entries for inflected forms as the database generally contains only lemmatic forms, or lemmas (or lemmata, if you prefer irregular plurals).

>>> import wn
>>> en = wn.Wordnet('oewn:2021')
>>> en.words('plurals')
[]
>>> en.words('plural')
[Word('oewn-plural-a'), Word('oewn-plural-n')]

Lemmas are sometimes called citation forms or dictionary forms as they are often used as the head words in dictionary entries. In Natural Language Processing (NLP), lemmatization is a technique where a possibly inflected word form is transformed to yield a lemma. In Wn, this concept is generalized somewhat to mean a transformation that yields a form matching wordforms stored in the database. For example, the English word sparrows is the plural inflection of sparrow, while the word leaves is ambiguous: it may be the plural of the nouns leaf or leave, or the 3rd-person singular inflection of the verb leave.

For tasks where high accuracy is needed, wrapping the wordnet queries with external tools that handle tokenization, lemmatization, and part-of-speech tagging will likely yield the best results, as this method can make use of word context. That is, something like this:

# an external analyzer yields a (lemma, pos) pair for each token in the corpus
for lemma, pos in fancy_shmancy_analysis(corpus):
    synsets = en.synsets(lemma, pos=pos)

For modest needs, however, Wn provides a way to integrate basic lemmatization directly into the queries.

Lemmatization in Wn works as follows: if a wn.Wordnet object is instantiated with a lemmatizer argument, then queries involving wordforms (e.g., wn.Wordnet.words(), wn.Wordnet.senses(), wn.Wordnet.synsets()) will first lemmatize the wordform and then check all resulting wordforms and parts of speech against the database as successive queries.
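
As a rough illustration of this flow (not Wn's internal code), suppose lemmatizer is any function with the signature described in the next section and plain_en is a wn.Wordnet created without a lemmatizer, so its words() method performs a plain lookup:

candidates = lemmatizer('leaves', None)
# e.g. {'n': {'leaf', 'leave'}, 'v': {'leave'}} for a hypothetical English lemmatizer
words = []
for pos, forms in candidates.items():
    for form in forms:
        # each (form, pos) pair is tried against the database in turn
        words.extend(plain_en.words(form, pos=pos))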

Lemmatization Functions

The lemmatizer argument of wn.Wordnet is a callable that takes two string arguments: (1) the original wordform, and (2) a part-of-speech or None. It returns a dictionary mapping parts-of-speech to sets of lemmatized wordforms. The signature is as follows:

lemmatizer(s: str, pos: Optional[str]) -> Dict[Optional[str], Set[str]]

The part-of-speech may be used by the function to determine which morphological rules to apply. If the given part-of-speech is None, then it is not specified and any rule may apply. A lemmatizer that only deinflects should not change any specified part-of-speech, but this is not a requirement, and a function could be provided that undoes derivational morphology (e.g., democratic → democracy).
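
For illustration only, a minimal lemmatizer with this signature might look like the following sketch. The single suffix rule (stripping a final -s) is deliberately naive and is not how wn.morphy works; the point is the expected return shape:

from typing import Dict, Optional, Set

# illustrative toy lemmatizer, not wn.morphy
def toy_lemmatizer(s: str, pos: Optional[str]) -> Dict[Optional[str], Set[str]]:
    # consider the given part of speech, or nouns and verbs if unspecified
    candidates = [pos] if pos is not None else ['n', 'v']
    result: Dict[Optional[str], Set[str]] = {}
    for p in candidates:
        forms = {s}             # keep the original form so exact matches still succeed
        if s.endswith('s'):
            forms.add(s[:-1])   # naive -s stripping, e.g. sparrows -> sparrow
        result[p] = forms
    return result

# toy_lemmatizer('sparrows', 'n') -> {'n': {'sparrows', 'sparrow'}}

Such a function could be passed as the lemmatizer argument of wn.Wordnet; each of the returned (part-of-speech, form) pairs would then be looked up against the database.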

Querying With Lemmatization

As the needs of lemmatization differ from one language to another, Wn does not provide a lemmatizer by default, and one is therefore not available to the convenience functions wn.words(), wn.senses(), and wn.synsets(). A lemmatizer can, however, be added to a wn.Wordnet object. For example, using wn.morphy:

>>> import wn
>>> from wn.morphy import Morphy
>>> en = wn.Wordnet('oewn:2021', lemmatizer=Morphy())
>>> en.words('sparrows')
[Word('oewn-sparrow-n')]
>>> en.words('leaves')
[Word('oewn-leave-v'), Word('oewn-leaf-n'), Word('oewn-leave-n')]

Querying Without Lemmatization

When lemmatization is not used, inflected terms may not return any results:

>>> en = wn.Wordnet('oewn:2021')
>>> en.words('sparrows')
[]

Depending on the lexicon, there may be situations where results are returned for inflected forms, such as when the inflected form is lexicalized as its own entry:

>>> en.words('glasses')
[Word('oewn-glasses-n')]

Results may also be returned if the lexicon lists the inflected form as an alternative form. The English Wordnet, for example, lists irregular inflections as alternative forms:

>>> en.words('lemmata')
[Word('oewn-lemma-n')]

See below for excluding alternative forms from such queries.

Alternative Forms in the Database

A lexicon may include alternative forms in addition to lemmas for each word, and by default these are included in queries. What exactly is included as an alternative form depends on the lexicon. The English Wordnet, for example, adds irregular inflections (or "exceptional forms"), while the Japanese Wordnet includes the same word in multiple orthographies (original, hiragana, katakana, and two romanizations). For the English Wordnet, this means that you might get basic lemmatization for irregular forms only:

>>> en = wn.Wordnet('oewn:2021')
>>> en.words('learnt', pos='v')
[Word('oewn-learn-v')]
>>> en.words('learned', pos='v')
[]

If this is undesirable, the alternative forms can be excluded from queries with the search_all_forms parameter:

>>> en = wn.Wordnet('oewn:2021', search_all_forms=False)
>>> en.words('learnt', pos='v')
[]
>>> en.words('learned', pos='v')
[]

Normalization

While lemmatization deals with morphological variants of words, normalization handles minor orthographic variants. Normalized forms, however, may be invalid as wordforms in the target language, and as such they are only used behind the scenes for query expansion and not presented to users. For instance, a user might attempt to look up résumé in the English wordnet, but the wordnet only contains the form without diacritics: resume. With strict string matching, the entry would not be found using the wordform in the query. By normalizing the query word, the entry can be found. Similarly in the Spanish wordnet, soñar (to dream) and sonar (to ring) are two different words. A user who types soñar likely does not want to get results for sonar, but one who types sonar may be a non-Spanish speaker who is unaware of the missing diacritic or does not have an input method that allows them to type the diacritic, so this query would return both entries by matching against the normalized forms in the database. Wn handles all of these use cases.

When a lexicon is added to the database, potentially two wordforms are inserted for every one in the lexicon: the original wordform and a normalized form. When querying against the database, the original query string is first compared with the original wordforms and, if normalization is enabled, with the normalized forms in the database as well. If this first attempt yields no results and if normalization is enabled, the query string is normalized and tried again.
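
This matching order might be sketched roughly as follows, where lookup() is a hypothetical helper (not part of Wn's API) that string-matches a form against both the original and, when normalization is enabled, the normalized forms in the database; the actual implementation differs in detail:

# lookup() here is a hypothetical exact-match helper, not part of Wn's API
results = lookup('résumé')                    # first try: the query string as-is
if not results and normalizer is not None:
    results = lookup(normalizer('résumé'))    # back-off: normalize the query string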

Normalization Functions

The normalized form is obtained from a normalizer function, passed as an argument to wn.Wordnet, that takes a single string argument and returns a string. That is, a function with the following signature:

normalizer(s: str) -> str

While custom normalizer functions could be used, in practice the choice is either the default normalizer or None. The default normalizer works by downcasing the string and performing NFKD normalization to remove diacritics. If the normalized form is the same as the original, only the original is inserted into the database.

Examples of normalization:

Original Form    Normalized Form
résumé           resume
soñar            sonar
San José         san jose
ハラペーニョ       ハラヘーニョ
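
A rough approximation of the default normalizer's described behavior, downcasing followed by NFKD decomposition with combining marks removed, is sketched below. This is an illustration of the technique, not Wn's actual implementation, and details may differ:

import unicodedata

# illustrative approximation, not Wn's actual default normalizer
def simple_normalizer(s: str) -> str:
    # downcase, decompose with NFKD, then drop combining marks (diacritics)
    decomposed = unicodedata.normalize('NFKD', s.lower())
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

# simple_normalizer('résumé')   -> 'resume'
# simple_normalizer('San José') -> 'san jose'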

Querying With Normalization

By default, normalization is enabled when a wn.Wordnet is created. Enabling normalization does two things: it allows queries to check the original wordform in the query against the normalized forms in the database and, if no results are returned in the first step, it allows the queried wordform to be normalized as a back-off technique.

>>> en = wn.Wordnet('oewn:2021')
>>> en.words('résumé')
[Word('oewn-resume-n'), Word('oewn-resume-v')]
>>> es = wn.Wordnet('omw-es:1.4')
>>> es.words('soñar')
[Word('omw-es-soñar-v')]
>>> es.words('sonar')
[Word('omw-es-sonar-v'), Word('omw-es-soñar-v')]

Note

Users may supply a custom normalizer function to the wn.Wordnet object, but currently this is discouraged as the result is unlikely to match normalized forms in the database and there is not yet a way to customize the normalization of forms added to the database.

Querying Without Normalization

Normalization can be disabled by passing None as the argument of the normalizer parameter of wn.Wordnet. The queried wordform will then not be checked against normalized forms in the database, nor will it be normalized as a back-off technique.

>>> en = wn.Wordnet('oewn:2021', normalizer=None)
>>> en.words('résumé')
[]
>>> es = wn.Wordnet('omw-es:1.4', normalizer=None)
>>> es.words('soñar')
[Word('omw-es-soñar-v')]
>>> es.words('sonar')
[Word('omw-es-sonar-v')]

Note

It is not possible to disable normalization for the convenience functions wn.words(), wn.senses(), and wn.synsets().