Working with Lexicons
Terminology
In Wn, the following terminology is used:
- lexicon
An inventory of words, senses, synsets, relations, etc. that share a namespace (i.e., that can refer to each other).
- wordnet
A group of lexicons (but usually just one).
- resource
A file containing lexicons.
- package
A directory containing a resource and optionally some metadata files.
- collection
A directory containing packages and optionally some metadata files.
- project
A general term for a resource, package, or collection, particularly pertaining to its creation, maintenance, and distribution.
In general, each resource contains one lexicon. For large projects like the Open English WordNet, that lexicon is also a wordnet on its own. In a collection like the Open Multilingual Wordnet, most lexicons do not include relations of their own; they are instead expected to use those from the OMW's included English wordnet, which is derived from the Princeton WordNet. As such, a wordnet for one of these sub-projects is best thought of as the pairing of its lexicon with the lexicon that provides the relations.
Lexicon and Project Specifiers
Wn uses lexicon specifiers to deal with the possibility of having
multiple lexicons and multiple versions of lexicons loaded in the same
database. A specifier joins a lexicon's name (ID) and version with a
colon (:). Here are the possible forms:
* -- any/all lexicons
id -- the most recently added lexicon with the given id
id:* -- all lexicons with the given id
id:version -- the lexicon with the given id and version
*:version -- all lexicons with the given version
For example, if ewn:2020 was installed followed by ewn:2019, then ewn
would specify the 2019 version, ewn:* would specify both versions, and
ewn:2020 would specify the 2020 version.
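The forms above can be sketched in plain Python. This is a minimal illustration of the selection rules as described here, not Wn's actual implementation; the function name resolve_specifier and the list of (id, version) pairs, ordered from oldest to most recently added, are assumptions made for the example.

```python
from typing import List, Tuple

Lexicon = Tuple[str, str]  # (id, version); a hypothetical stand-in for a database row


def resolve_specifier(specifier: str, installed: List[Lexicon]) -> List[Lexicon]:
    """Return the installed lexicons selected by a lexicon specifier.

    `installed` is assumed to be ordered from oldest to most recently added.
    """
    lex_id, sep, version = specifier.partition(':')
    if not sep:
        if lex_id == '*':
            # bare "*": any/all lexicons
            return list(installed)
        # bare id: only the most recently added lexicon with that id
        matches = [lex for lex in installed if lex[0] == lex_id]
        return matches[-1:]
    # "id:*", "id:version", and "*:version": a star matches anything
    return [
        lex for lex in installed
        if lex_id in ('*', lex[0]) and version in ('*', lex[1])
    ]


# ewn:2020 was installed first, then ewn:2019
installed = [('ewn', '2020'), ('ewn', '2019'), ('odenet', '1.3')]
print(resolve_specifier('ewn', installed))       # [('ewn', '2019')]
print(resolve_specifier('ewn:*', installed))     # [('ewn', '2020'), ('ewn', '2019')]
print(resolve_specifier('ewn:2020', installed))  # [('ewn', '2020')]
```

Note that a bare id depends on installation order rather than on which version number is higher, which is why ewn selects 2019 here.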
The same format is used for project specifiers, which refer to
projects as defined in Wn's index. In most cases the project specifier
is the same as the lexicon specifier (e.g., ewn:2020 refers both
to the project to be downloaded and the lexicon that is installed),
but sometimes it is not. The 1.4 release of the Open Multilingual
Wordnet, for instance, has the project specifier omw:1.4 but it
installs a number of lexicons with their own lexicon specifiers
(omw-zsm:1.4, omw-cmn:1.4, etc.). When only an id is given
(e.g., ewn), a project specifier gets the first version listed
in the index (in the default index, conventionally, the first version
is the latest release).
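The "first version listed" rule for project specifiers can be illustrated with a small sketch. The INDEX dictionary, its layout, the placeholder example.com URLs, and the function resolve_project are all hypothetical and do not reflect the real format or contents of Wn's index; the sketch only shows how a missing version falls back to the first entry.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for an index: project id -> versions in listed order.
# The URLs are placeholders, not real download locations.
INDEX: Dict[str, Dict[str, str]] = {
    'ewn': {
        '2020': 'https://example.com/ewn-2020.xml.gz',
        '2019': 'https://example.com/ewn-2019.xml.gz',
    },
    'omw': {
        '1.4': 'https://example.com/omw-1.4.tar.xz',
    },
}


def resolve_project(specifier: str,
                    index: Dict[str, Dict[str, str]]) -> Tuple[str, str, str]:
    """Return (project id, version, URL) for a project specifier."""
    project_id, _, version = specifier.partition(':')
    versions = index[project_id]  # raises KeyError for an unknown project
    if not version:
        # no version given: take the first one listed (dicts keep insertion order)
        version = next(iter(versions))
    return project_id, version, versions[version]


print(resolve_project('ewn', INDEX)[:2])       # ('ewn', '2020')
print(resolve_project('ewn:2019', INDEX)[:2])  # ('ewn', '2019')
```

This differs from lexicon specifiers, where a bare id selects the most recently added lexicon in the database rather than the first one listed in an index.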
Downloading Lexicons
Use wn.download() to download lexicons from the web given
either an indexed project specifier or the URL of a resource, package,
or collection.
>>> import wn
>>> wn.download('odenet') # get the latest Open German WordNet
>>> wn.download('odenet:1.3') # get the 1.3 version
>>> # download from a URL
>>> wn.download('https://github.com/omwn/omw-data/releases/download/v1.4/omw-1.4.tar.xz')
The project specifier is only used to retrieve information from Wn's index. The lexicon IDs of the corresponding resource files are what is stored in the database.
Adding Local Lexicons
Lexicons can be added from local files with wn.add():
>>> wn.add('~/data/omw-1.4/omw-nb/omw-nb.xml')
Or with the parent directory as a package:
>>> wn.add('~/data/omw-1.4/omw-nb/')
Or with the grandparent directory as a collection (installing all packages contained by the collection):
>>> wn.add('~/data/omw-1.4/')
Or from a compressed archive of one of the above:
>>> wn.add('~/data/omw-1.4/omw-nb/omw-nb.xml.xz')
>>> wn.add('~/data/omw-1.4/omw-nb.tar.xz')
>>> wn.add('~/data/omw-1.4.tar.xz')
Listing Installed Lexicons
If you wish to see which lexicons have been added to the database,
wn.lexicons() returns the list of wn.Lexicon objects that describe
each one.
>>> for lex in wn.lexicons():
...     print(f'{lex.id}:{lex.version}\t{lex.label}')
...
omw-en:1.4 OMW English Wordnet based on WordNet 3.0
omw-nb:1.4 Norwegian Wordnet (Bokmål)
odenet:1.3 Offenes Deutsches WordNet
ewn:2020 English WordNet
ewn:2019 English WordNet
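When several versions of a lexicon are installed, it can help to group the listing by id. The helper below is a small sketch, not part of Wn: versions_by_id is a hypothetical name, and it relies only on the id and version attributes shown above. The SimpleNamespace objects stand in for wn.Lexicon objects so that the example runs without a database.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, Iterable, List


def versions_by_id(lexicons: Iterable) -> Dict[str, List[str]]:
    """Group lexicon versions by lexicon id, keeping the input order."""
    grouped = defaultdict(list)
    for lex in lexicons:
        grouped[lex.id].append(lex.version)
    return dict(grouped)


# Stand-ins for the objects in the listing above
sample = [
    SimpleNamespace(id='omw-en', version='1.4'),
    SimpleNamespace(id='ewn', version='2020'),
    SimpleNamespace(id='ewn', version='2019'),
]
print(versions_by_id(sample))  # {'omw-en': ['1.4'], 'ewn': ['2020', '2019']}
```

With Wn installed and lexicons added, the same function should accept the result of wn.lexicons() in place of sample.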
Removing Lexicons
Lexicons can be removed from the database with wn.remove():
>>> wn.remove('omw-nb:1.4')
Note that this removes a single lexicon and not a project, so if, for
instance, you've installed a multi-lexicon project like omw, you
will need to remove each lexicon individually or use a star specifier:
>>> wn.remove('omw-*:1.4')
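Because a star specifier can match many lexicons, it may be worth previewing what a pattern selects before removing anything. The sketch below assumes the star behaves like a shell-style wildcard and uses Python's fnmatch module to approximate that; this is an assumption about the matching semantics, not Wn's own code, and preview_matches is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def preview_matches(pattern: str, specifiers: Iterable[str]) -> List[str]:
    """Return the id:version strings that a star pattern would select."""
    return [spec for spec in specifiers if fnmatchcase(spec, pattern)]


# id:version strings such as those printed in the listing example
installed = ['omw-en:1.4', 'omw-nb:1.4', 'odenet:1.3', 'ewn:2020', 'ewn:2019']
print(preview_matches('omw-*:1.4', installed))  # ['omw-en:1.4', 'omw-nb:1.4']
print(preview_matches('ewn:*', installed))      # ['ewn:2020', 'ewn:2019']
```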
WN-LMF Files, Packages, and Collections
Wn can handle projects with three levels of structure:
- WN-LMF XML files
- WN-LMF packages
- WN-LMF collections
WN-LMF XML Files
A WN-LMF XML file is a file with a .xml extension that is valid
according to the WN-LMF specification.
WN-LMF Packages
If one needs to distribute metadata or additional files along with a
WN-LMF XML file, a WN-LMF package allows them to include the files in
a directory. The directory should contain exactly one .xml file,
which is the WN-LMF XML file. In addition, it may contain additional
files and Wn will recognize three of them:
- LICENSE (.txt | .md | .rst)
the full text of the license
- README (.txt | .md | .rst)
the project README
- citation.bib
a BibTeX file containing academic citations for the project
omw-sq/
├── omw-sq.xml
├── LICENSE.txt
└── README.md
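The package layout can be checked with a few lines of pathlib code. This is a sketch under stated assumptions rather than Wn's own logic: inspect_package is a hypothetical helper, it treats the file extensions on LICENSE and README as optional, and it only checks file names, not whether the XML is valid WN-LMF.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional

# Extensions accepted for LICENSE and README ('' means no extension)
SUFFIXES = ('', '.txt', '.md', '.rst')


def inspect_package(directory: Path) -> Dict[str, Optional[Path]]:
    """Locate the resource and the recognized metadata files in a package."""
    xml_files = sorted(directory.glob('*.xml'))
    if len(xml_files) != 1:
        raise ValueError(f'expected exactly one .xml file, found {len(xml_files)}')

    def first_existing(stem: str) -> Optional[Path]:
        # try each accepted suffix in turn and return the first file found
        for suffix in SUFFIXES:
            candidate = directory / (stem + suffix)
            if candidate.is_file():
                return candidate
        return None

    citation = directory / 'citation.bib'
    return {
        'resource': xml_files[0],
        'license': first_existing('LICENSE'),
        'readme': first_existing('README'),
        'citation': citation if citation.is_file() else None,
    }


# Recreate the example package in a temporary directory and inspect it
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp) / 'omw-sq'
    package.mkdir()
    for name in ('omw-sq.xml', 'LICENSE.txt', 'README.md'):
        (package / name).touch()
    found = inspect_package(package)
    print({key: path.name if path else None for key, path in found.items()})
```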
WN-LMF Collections
In some cases a project may manage multiple resources and distribute them as a collection. A collection is a directory containing subdirectories which are WN-LMF packages. The collection may contain its own README, LICENSE, and citation files which describe the project as a whole.
omw-1.4/
├── omw-sq
│ ├── omw-sq.xml
│ ├── LICENSE.txt
│ └── README.md
├── omw-lt
│ ├── citation.bib
│ ├── LICENSE
│ └── omw-lt.xml
├── ...
├── citation.bib
├── LICENSE
└── README.md
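The three levels of structure can be told apart by inspecting a path. The following is a rough sketch, assuming uncompressed files and directories only; classify and is_package are hypothetical helpers, and Wn's actual detection may differ (for instance, it also accepts compressed archives).

```python
import tempfile
from pathlib import Path


def is_package(directory: Path) -> bool:
    """A package is a directory holding exactly one .xml file."""
    return directory.is_dir() and len(list(directory.glob('*.xml'))) == 1


def classify(path: Path) -> str:
    """Return 'resource', 'package', 'collection', or 'unknown' for a path."""
    if path.is_file():
        return 'resource' if path.suffix == '.xml' else 'unknown'
    if is_package(path):
        return 'package'
    if path.is_dir() and any(is_package(child) for child in path.iterdir()):
        # a collection is a directory whose subdirectories are packages
        return 'collection'
    return 'unknown'


# Recreate part of the example collection in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    collection = Path(tmp) / 'omw-1.4'
    for name in ('omw-sq', 'omw-lt'):
        package = collection / name
        package.mkdir(parents=True)
        (package / f'{name}.xml').touch()
    (collection / 'README.md').touch()

    print(classify(collection))                          # collection
    print(classify(collection / 'omw-sq'))               # package
    print(classify(collection / 'omw-sq' / 'omw-sq.xml'))  # resource
```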