Abstract
Understanding names and their reference is central to the analysis of
unrestricted texts and poses a significant challenge for a number of
Natural Language Processing (NLP) applications. Named Entity Recognition serves as an important preprocessing tool for Information Extraction (IE), Information Retrieval (IR) and Machine Translation (MT).
Petasis et al. (2000) have described Named Entity Recognition (NER)
as the task of identifying and semantically tagging proper names in
running texts, into categories like person, location, organization,
etc. This description captures the approach to NER presented in this
thesis. The focus here is on named entities as proper names, whereas many other
approaches have also treated numerical and temporal expressions as named entities.
This thesis was written as part of the Nomen Nescio project, in
connection with the Text Laboratory at the University of Oslo.
The main focus of this thesis falls into two related parts. The first
part concerns the choice of semantic categories for proper names, and
the second part concerns experiments on practical solutions for
automatically categorizing proper names into the chosen categories.
Before the practical work of categorizing proper names can begin, the categories have to be chosen. The following six categories were chosen:
1) Person names (e.g. names of people, pets and humanoids)
2) Location names (e.g. names of countries, cities, mountains, lakes, oceans etc.)
3) Organization names (e.g. institutions, firms, organizations, pop groups etc.)
4) Publication names (e.g. films, books, songs, short stories, papers, paintings etc.)
5) Event names (e.g. historical events, sport events etc.)
6) Miscellaneous names (e.g. products, vessels, and other proper names
not belonging in any of the categories above)
Categorizing proper names is not trivial, even when done manually,
which is why considerable time has been spent on developing
guidelines. Chapter 5 gives the explicit guidelines for the annotation of named entities. These guidelines are now being used to re-annotate the test and training corpus for the Norwegian Nomen Nescio NER application.
For the practical categorization I have used a rule-based method
built on the Constraint Grammar (CG) formalism for
morphosyntactic tagging. The CG formalism was developed by Fred Karlsson at the University of Helsinki. Paul Meurer at the University of Bergen has implemented a Norwegian CG tagger developed by the Text Laboratory at the University of Oslo.
The CG formalism appears to limit what can be expressed, and the
problem of getting semantic information into the system had to be
solved. I have worked around these limitations in the following
ways:
* CG cannot inspect the word strings themselves to obtain the
information that is necessary for the semantic
categorization. To overcome this drawback, I have used semantic
labels from gazetteers, i.e. name lists, and a suffix module
(chapters 6 and 7).
* CG offers no direct way of including large name lists in the
formalism, and for this reason the name lists have been
added to the lexicon.
* As there are no semantic labels on the words in the lexicon, I
have used sets in the CG tagger to "simulate semantics" (chapter
7).
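As a rough illustration of the first point in the list above, the following sketch shows how a gazetteer lookup and a suffix check could attach candidate semantic labels to a name before any rules apply. It is written in Python rather than in CG, and it is not the ARNER implementation: the gazetteer entries, the suffixes, the label names and the function candidate_labels are all hypothetical examples chosen for illustration.

```python
# Hypothetical gazetteer: name -> set of candidate categories.
# These entries are illustrative and not taken from the ARNER lexicon.
GAZETTEER = {
    "Oslo": {"location"},
    "Bergen": {"location"},
    "Ibsen": {"person"},
}

# Hypothetical suffix cues, not the actual contents of the suffix module.
SUFFIX_LABELS = {
    "sen": "person",        # e.g. surnames such as "Hansen"
    "fjorden": "location",  # e.g. "Sognefjorden"
    "vik": "location",      # e.g. "Narvik"
}


def candidate_labels(name):
    """Return the set of candidate categories for a proper name.

    An empty set means that neither the gazetteer nor the suffix
    cues gave any evidence, so the decision is left to later rules.
    """
    labels = set(GAZETTEER.get(name, set()))
    lowered = name.lower()
    # Try the longest suffix first so that "fjorden" wins over "sen".
    for suffix in sorted(SUFFIX_LABELS, key=len, reverse=True):
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            labels.add(SUFFIX_LABELS[suffix])
            break
    return labels
```

In a setup like this, the labels would only be candidates. Choosing among them, or rejecting them in context, would still be the job of the disambiguation rules.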
The Oslo-Bergen Tagger needed to be modified to handle the task of NER, both with respect to the preprocessor, as described in chapter 6, and the disambiguation module. The modifications of the preprocessor concern the identification of complex proper names. The modifications of the Oslo-Bergen Tagger include the use of regular expressions together with a Document Centered Approach (DCA), a suffix module and an expansion of the syntactic disambiguation. Additionally, the lexicon has been expanded to include proper names from various gazetteers.
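One common reading of a Document Centered Approach is that a name which has been categorized in one position of a document lends its category to unresolved occurrences of the same name elsewhere in that document. The sketch below illustrates that idea only, under that assumption. The function propagate_labels, the data layout and the majority vote are hypothetical and do not describe how ARNER or the Oslo-Bergen Tagger work.

```python
from collections import Counter, defaultdict


def propagate_labels(mentions):
    """Fill in missing labels from other mentions in the same document.

    mentions: list of (name, label) pairs in document order, where
    label is None for occurrences that are still unresolved.
    Returns a new list in which each unresolved occurrence inherits
    the label most often given to the same name in the document.
    """
    # First pass: count the labels already assigned to each name.
    votes = defaultdict(Counter)
    for name, label in mentions:
        if label is not None:
            votes[name][label] += 1

    # Second pass: give unresolved mentions the majority label, if any.
    resolved = []
    for name, label in mentions:
        if label is None and votes[name]:
            label = votes[name].most_common(1)[0][0]
        resolved.append((name, label))
    return resolved
```

Names that never receive a label anywhere in the document stay unresolved, so a step like this can only add to what the rules have already decided.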
Chapter 7 describes the different parts of ARNER, an
Automatic Rule-based Named Entity Recognizer for Norwegian, and
the work of developing the system. The section on semantics
describes how semantic information has been made available to the ARNER rules.
Chapter 8 presents the first performance results for the ARNER system, along with ideas on how the system may be improved. The evaluation figures show that the system needs more rules, and safer
rules, and that there is much to gain by implementing a DCA.
The result of my work, the ARNER rules and sets, will be used
in the rule-based NER system that the Norwegian Nomen Nescio group is developing.