Lemmatization and Normalization
Wn provides two methods for expanding queries: lemmatization and normalization. Wn also has a setting that allows alternative forms stored in the database to be included in queries.
See also
The wn.morphy module is a basic English lemmatizer included with Wn.
Lemmatization
When querying a wordnet with wordforms from natural language text, it is important to be able to find entries for inflected forms as the database generally contains only lemmatic forms, or lemmas (or lemmata, if you prefer irregular plurals).
>>> import wn
>>> en = wn.Wordnet('oewn:2021')
>>> en.words('plurals')
[]
>>> en.words('plural')
[Word('oewn-plural-a'), Word('oewn-plural-n')]
Lemmas are sometimes called citation forms or dictionary forms as they are often used as the head words in dictionary entries. In Natural Language Processing (NLP), lemmatization is a technique where a possibly inflected word form is transformed to yield a lemma. In Wn, this concept is generalized somewhat to mean a transformation that yields a form matching wordforms stored in the database. For example, the English word sparrows is the plural inflection of sparrow, while the word leaves is ambiguous between the plural inflection of the nouns leaf and leave and the 3rd-person singular inflection of the verb leave.
For tasks where high accuracy is needed, wrapping the wordnet queries with external tools that handle tokenization, lemmatization, and part-of-speech tagging will likely yield the best results, as this method can make use of word context. That is, something like this:
# fancy_shmancy_analysis is a placeholder for an external NLP pipeline
for lemma, pos in fancy_shmancy_analysis(corpus):
    synsets = en.synsets(lemma, pos=pos)
For modest needs, however, Wn provides a way to integrate basic lemmatization directly into the queries.
Lemmatization in Wn works as follows: if a wn.Wordnet object is instantiated with a lemmatizer argument, then queries involving wordforms (e.g., wn.Wordnet.words(), wn.Wordnet.senses(), wn.Wordnet.synsets()) will first lemmatize the wordform and then check all resulting wordforms and parts of speech against the database as successive queries.
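The flow described above can be sketched with a toy in-memory "database". This is only an illustration under stated assumptions, not Wn's implementation: TOY_DB, toy_lemmatizer, and find_words are hypothetical names, and the sketch does not model what Wn does when a lemmatizer returns no candidates.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Lemmatizer = Callable[[str, Optional[str]], Dict[Optional[str], Set[str]]]

# Hypothetical stand-in for the database: (lemma, part of speech) -> entry id.
TOY_DB: Dict[Tuple[str, str], str] = {
    ('leaf', 'n'): 'toy-leaf-n',
    ('leave', 'n'): 'toy-leave-n',
    ('leave', 'v'): 'toy-leave-v',
}


def toy_lemmatizer(s: str, pos: Optional[str] = None) -> Dict[Optional[str], Set[str]]:
    """Hypothetical lemmatizer backed by a hand-written lookup table."""
    table = {'leaves': {'n': {'leaf', 'leave'}, 'v': {'leave'}}}
    analyses = table.get(s, {})
    if pos is None:
        return dict(analyses)
    return {pos: analyses[pos]} if pos in analyses else {}


def find_words(form: str, pos: Optional[str] = None,
               lemmatizer: Optional[Lemmatizer] = None) -> List[str]:
    """Lemmatize first, then query every (part of speech, form) candidate in turn."""
    # Without a lemmatizer, the only candidate is the form exactly as given.
    pos_forms = lemmatizer(form, pos) if lemmatizer else {pos: {form}}
    found: List[str] = []
    for p, forms in pos_forms.items():
        for f in sorted(forms):
            # One "query" per candidate; a part of speech of None matches any entry.
            for (lemma, entry_pos), entry in TOY_DB.items():
                if lemma == f and (p is None or p == entry_pos) and entry not in found:
                    found.append(entry)
    return found
```

In this sketch, find_words('leaves') finds nothing, while find_words('leaves', lemmatizer=toy_lemmatizer) finds the entries for both nouns and the verb.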
Lemmatization Functions
The lemmatizer argument of wn.Wordnet is a callable that takes two string arguments: (1) the original wordform, and (2) a part-of-speech or None. It returns a dictionary mapping parts-of-speech to sets of lemmatized wordforms. The signature is as follows:
lemmatizer(s: str, pos: Optional[str]) -> Dict[Optional[str], Set[str]]
The part-of-speech may be used by the function to determine which morphological rules to apply. If the given part-of-speech is None, then it is not specified and any rule may apply. A lemmatizer that only deinflects should not change any specified part-of-speech, but this is not a requirement, and a function could be provided that undoes derivational morphology (e.g., democratic → democracy).
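As a minimal sketch of a callable with this signature, the function below treats a trailing "s" as a noun plural. The name strip_plural_s and its single rule are hypothetical and far cruder than a real lemmatizer; keeping the original form among the candidates is a design choice of this sketch, not a documented requirement.

```python
from typing import Dict, Optional, Set


def strip_plural_s(s: str, pos: Optional[str] = None) -> Dict[Optional[str], Set[str]]:
    """Hypothetical lemmatizer: treat a trailing 's' as a noun plural."""
    # Keep the original form under the requested part of speech (sketch-only choice).
    result: Dict[Optional[str], Set[str]] = {pos: {s}}
    # Apply the noun rule only when the part of speech is unspecified or 'n'.
    if pos in (None, 'n') and len(s) > 1 and s.endswith('s'):
        result.setdefault('n', set()).add(s[:-1])
    return result
```

Because the rule is only applied for None or 'n', a request with pos='v' comes back unchanged, which mirrors the point that a deinflecting lemmatizer should not change a specified part-of-speech. Such a function would be supplied through the lemmatizer argument of wn.Wordnet.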
Querying With Lemmatization
As the needs of lemmatization differ from one language to another, Wn does not provide a lemmatizer by default, and lemmatization is therefore unavailable to the convenience functions wn.words(), wn.senses(), and wn.synsets(). A lemmatizer can be added to a wn.Wordnet object. For example, using wn.morphy:
>>> import wn
>>> from wn.morphy import Morphy
>>> en = wn.Wordnet('oewn:2021', lemmatizer=Morphy())
>>> en.words('sparrows')
[Word('oewn-sparrow-n')]
>>> en.words('leaves')
[Word('oewn-leave-v'), Word('oewn-leaf-n'), Word('oewn-leave-n')]
Querying Without Lemmatization
When lemmatization is not used, inflected terms may not return any results:
>>> en = wn.Wordnet('oewn:2021')
>>> en.words('sparrows')
[]
Depending on the lexicon, there may be situations where results are returned for inflected forms, such as when the inflected form is lexicalized as its own entry:
>>> en.words('glasses')
[Word('oewn-glasses-n')]
Results may also be returned if the lexicon lists the inflected form as an alternative form. For example, the English Wordnet lists irregular inflections as alternative forms:
>>> en.words('lemmata')
[Word('oewn-lemma-n')]
See below for excluding alternative forms from such queries.
Alternative Forms in the Database
A lexicon may include alternative forms in addition to lemmas for each word, and by default these are included in queries. What exactly is included as an alternative form depends on the lexicon. The English Wordnet, for example, adds irregular inflections (or "exceptional forms"), while the Japanese Wordnet includes the same word in multiple orthographies (original, hiragana, katakana, and two romanizations). For the English Wordnet, this means that you might get basic lemmatization for irregular forms only:
>>> en = wn.Wordnet('oewn:2021')
>>> en.words('learnt', pos='v')
[Word('oewn-learn-v')]
>>> en.words('learned', pos='v')
[]
If this is undesirable, the alternative forms can be excluded from queries with the search_all_forms parameter:
>>> en = wn.Wordnet('oewn:2021', search_all_forms=False)
>>> en.words('learnt', pos='v')
[]
>>> en.words('learned', pos='v')
[]
Normalization
While lemmatization deals with morphological variants of words, normalization handles minor orthographic variants. Normalized forms, however, may be invalid as wordforms in the target language, and as such they are only used behind the scenes for query expansion and not presented to users.
For instance, a user might attempt to look up résumé in the English wordnet, but the wordnet only contains the form without diacritics: resume. With strict string matching, the entry would not be found using the wordform in the query. By normalizing the query word, the entry can be found.
Similarly, in the Spanish wordnet, soñar (to dream) and sonar (to ring) are two different words. A user who types soñar likely does not want to get results for sonar, but one who types sonar may be a non-Spanish speaker who is unaware of the missing diacritic or does not have an input method that allows them to type the diacritic, so this query would return both entries by matching against the normalized forms in the database. Wn handles all of these use cases.
When a lexicon is added to the database, potentially two wordforms are inserted for every one in the lexicon: the original wordform and a normalized form. When querying against the database, the original query string is first compared with the original wordforms and, if normalization is enabled, with the normalized forms in the database as well. If this first attempt yields no results and if normalization is enabled, the query string is normalized and tried again.
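The two-step lookup described above can be sketched as follows. This is an illustration under stated assumptions rather than Wn's actual code: build_index and lookup are hypothetical names, the "database" is a pair of in-memory collections, and str.lower stands in for a real normalizer.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Normalizer = Callable[[str], str]


def build_index(forms: List[str],
                normalizer: Normalizer) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Store each original form, plus a normalized form only when it differs."""
    originals: Set[str] = set(forms)
    normalized: Dict[str, Set[str]] = {}
    for form in forms:
        norm = normalizer(form)
        if norm != form:
            normalized.setdefault(norm, set()).add(form)
    return originals, normalized


def lookup(query: str, originals: Set[str], normalized: Dict[str, Set[str]],
           normalizer: Optional[Normalizer]) -> Set[str]:
    """Try the query as typed; if nothing matches, back off to its normalized form."""
    def attempt(q: str) -> Set[str]:
        hits = {q} if q in originals else set()
        if normalizer is not None:
            # Normalized forms are consulted only when normalization is enabled.
            hits |= normalized.get(q, set())
        return hits

    hits = attempt(query)
    if not hits and normalizer is not None:
        hits = attempt(normalizer(query))
    return hits
```

With an index built from ['Paris', 'resume'] and str.lower, the query 'paris' matches through the stored normalized form on the first attempt, 'PARIS' matches only after the back-off step, and passing None as the normalizer makes both queries come back empty.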
Normalization Functions
The normalized form is obtained from a normalizer function, passed as an argument to wn.Wordnet, that takes a single string argument and returns a string. That is, a function with the following signature:
normalizer(s: str) -> str
While custom normalizer functions could be used, in practice the choice is either the default normalizer or None. The default normalizer works by downcasing the string and performing NFKD normalization to remove diacritics. If the normalized form is the same as the original, only the original is inserted into the database.
Original Form | Normalized Form
---|---
résumé | resume
soñar | sonar
San José | san jose
ハラペーニョ | ハラヘーニョ
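A function with the behavior shown in the table can be sketched with the standard library. This is an approximation written for illustration, not necessarily the code Wn uses: the name strip_diacritics_normalizer is hypothetical, and since NFKD by itself only decomposes characters, the sketch assumes that the resulting combining marks are then dropped.

```python
import unicodedata


def strip_diacritics_normalizer(s: str) -> str:
    """Lowercase, decompose with NFKD, and drop combining marks."""
    # NFKD splits a character such as 'é' into 'e' plus a combining accent.
    decomposed = unicodedata.normalize('NFKD', s.lower())
    # Characters with a non-zero combining class are the detached marks.
    return ''.join(c for c in decomposed if not unicodedata.combining(c))
```

The Japanese row of the table follows from the same two steps: ペ decomposes into ヘ plus a combining semi-voiced sound mark, and the mark is then removed.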
Querying With Normalization
By default, normalization is enabled when a wn.Wordnet is created. Enabling normalization does two things: it allows queries to check the original wordform in the query against the normalized forms in the database and, if no results are returned in the first step, it allows the queried wordform to be normalized as a back-off technique.
>>> en = wn.Wordnet('oewn:2021')
>>> en.words('résumé')
[Word('oewn-resume-n'), Word('oewn-resume-v')]
>>> es = wn.Wordnet('omw-es:1.4')
>>> es.words('soñar')
[Word('omw-es-soñar-v')]
>>> es.words('sonar')
[Word('omw-es-sonar-v'), Word('omw-es-soñar-v')]
Note
Users may supply a custom normalizer function to the wn.Wordnet object, but currently this is discouraged as the result is unlikely to match normalized forms in the database and there is not yet a way to customize the normalization of forms added to the database.
Querying Without Normalization
Normalization can be disabled by passing None as the argument of the normalizer parameter of wn.Wordnet. The queried wordform will not be checked against normalized forms in the database and neither will it be normalized as a back-off technique.
>>> en = wn.Wordnet('oewn:2021', normalizer=None)
>>> en.words('résumé')
[]
>>> es = wn.Wordnet('omw-es:1.4', normalizer=None)
>>> es.words('soñar')
[Word('omw-es-soñar-v')]
>>> es.words('sonar')
[Word('omw-es-sonar-v')]
Note
It is not possible to disable normalization for the convenience functions wn.words(), wn.senses(), and wn.synsets().