Annotation of Cross-Document Coreference: A Pilot Study

The main objective of this project was to develop a methodology, in the form of detailed annotation schemes and guidelines, for marking noun phrase (NP) and event coreference both within a single document and across documents. This focus meant that the different types of coreference could be investigated fully, so that the most appropriate guidelines and schemes for the annotation could be developed. The aim is for future annotations based on this methodology to capture the phenomena both reliably and in detail. The project involved extensive discussions between annotators in order to redraft and improve the guidelines, as well as major changes to the existing annotation tool PALinkA, which was used to annotate the sample corpus, so that it could accommodate events as well as noun phrases. The project was funded by the British Academy.


One of the first steps in developing the annotation methodology was to decide on our definition of coreference. Following van Deemter and Kibble (1999), an approach also reflected in Mitkov et al. (2000), we used a narrow definition to ensure higher quality and reliability of annotation. For NP coreference there is a sizable body of research on annotation guidelines. Even so, investigation of the existing guidelines revealed that quite often they are either not appropriate for the domain of security/terrorism news, or they mark too few phenomena. The annotation methodology used in this project is not limited to IDENTITY relations between entities; it also incorporates relations such as SYNONYMY, GENERALISATION and SPECIALISATION. In order to ensure consistency between annotators, the relations between entities were extracted automatically from WordNet wherever possible. The annotation scheme also encodes the type of realisation of coreferential relations: a coreferential item is labelled as NP, COPULAR, APPOSITION, BRACKETED TEXT or SPEECH PRONOUN. The journalistic domain selected for investigation also required us to extend the existing guidelines in order to address the problem of continuous change between direct and indirect speech.
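The relation and realisation labels described above can be pictured as a small data model. The following is only a minimal sketch in Python, not the project's actual scheme or PALinkA's file format: the class names, the `suggest_relation` helper and the toy lexicon are all hypothetical, the lexicon merely stands in for WordNet, and the reading of GENERALISATION as "the later mention is more general than the earlier one" is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Relation(Enum):
    IDENTITY = "IDENTITY"
    SYNONYMY = "SYNONYMY"
    GENERALISATION = "GENERALISATION"
    SPECIALISATION = "SPECIALISATION"


class Realisation(Enum):
    NP = "NP"
    COPULAR = "COPULAR"
    APPOSITION = "APPOSITION"
    BRACKETED_TEXT = "BRACKETED TEXT"
    SPEECH_PRONOUN = "SPEECH PRONOUN"


@dataclass(frozen=True)
class CorefLink:
    """One annotated link between a mention and its antecedent."""
    mention: str
    antecedent: str
    relation: Relation
    realisation: Realisation


# Toy stand-in for WordNet (hypothetical entries, for illustration only).
TOY_SYNONYMS = {frozenset({"rebel", "insurgent"})}
TOY_HYPERNYMS = {"rebel": "person", "soldier": "person"}  # word -> more general word


def suggest_relation(mention_head: str, antecedent_head: str) -> Optional[Relation]:
    """Suggest a relation label for a pair the annotator has already linked.

    Assumption: GENERALISATION means the mention is more general than its
    antecedent, and SPECIALISATION the reverse.
    """
    m, a = mention_head.lower(), antecedent_head.lower()
    if m == a:
        return Relation.IDENTITY
    if frozenset({m, a}) in TOY_SYNONYMS:
        return Relation.SYNONYMY
    if TOY_HYPERNYMS.get(a) == m:
        return Relation.GENERALISATION
    if TOY_HYPERNYMS.get(m) == a:
        return Relation.SPECIALISATION
    return None  # no suggestion; left to the annotator


link = CorefLink(
    mention="the insurgents",
    antecedent="the rebels",
    relation=suggest_relation("insurgent", "rebel"),
    realisation=Realisation.NP,
)
```

Deriving the relation label from a lexical resource rather than asking each annotator to choose it is one way to reduce disagreement, which is the motivation the text gives for using WordNet.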

The guidelines for event annotation developed in this project focused mainly on what constitutes an event and how to identify the appropriate arguments (participants and other slots) associated with an event. Investigation of the selected clusters revealed that each cluster has its own specific events and that it is very difficult to annotate all the events related to terrorism/security issues in all five clusters. For this reason, two clusters containing quite different types of events were selected for analysis and annotation: the cluster about war in Zaire, which focused on bombing and attacks, and the cluster about Peru, which concentrated on a hostage crisis. Restricting the annotation to these two clusters allowed a more comprehensive analysis of their events.
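To make the notion of an event with its arguments more concrete, here is a minimal sketch in Python of how an annotated event could be represented. It is an illustration only: the event types, the slot names in `EXPECTED_SLOTS`, the `Event` class and the example values are hypothetical and are not taken from the project's guidelines.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical slot inventory per event type (an assumption for illustration).
EXPECTED_SLOTS = {
    "ATTACK": ("attacker", "target", "location", "time"),
    "HOSTAGE_TAKING": ("captor", "hostage", "location", "time"),
}


@dataclass
class Event:
    doc_id: str       # document in which the event mention occurs
    trigger: str      # word or phrase that signals the event
    event_type: str   # key into EXPECTED_SLOTS
    arguments: Dict[str, str] = field(default_factory=dict)  # slot -> filler text

    def missing_slots(self) -> List[str]:
        """Return the expected slots that have no filler in this mention."""
        expected = EXPECTED_SLOTS.get(self.event_type, ())
        return [slot for slot in expected if slot not in self.arguments]


# Invented example mention, not drawn from the corpus.
event = Event(
    doc_id="doc-01",
    trigger="seized",
    event_type="HOSTAGE_TAKING",
    arguments={"captor": "the gunmen", "location": "the residence"},
)
print(event.missing_slots())
```

Keeping the slots in a per-type inventory reflects the observation in the text that each cluster has its own specific events, so different event types may call for different arguments.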


Laura Hasler, Constantin Orasan and Karin Naumann (2006) NPs for Events: Experiments in Coreference Annotation. In Proceedings of the 5th edition of the International Conference on Language Resources and Evaluation (LREC2006), 24 -- 26 May, Genoa, Italy, pp. 1167 -- 1172


A by-product of this project is a corpus annotated for the phenomena investigated. The corpus annotated with NP coreference contains all five clusters used in the project and totals almost 55,000 words. During the research, it became clear that it was not possible to produce a corpus of similar size for event coreference, and for this reason the event-annotated corpus contains only slightly over 12,500 words.

Cluster       NP coreference                         Events
Bukavu        Annotator 1 (8917 words, 16 texts)     Annotator 1 (2720 words, 5 texts)
              Annotator 2 (2900 words, 5 texts)      Annotator 2 (3046 words, 5 texts)
China         Annotator 1 (6775 words, 19 texts)     N/A
Israel        Annotator 1 (10900 words, 20 texts)    N/A
Peru          Annotator 1 (12541 words, 19 texts)    Annotator 1 (3640 words, 5 texts)
                                                     Annotator 2 (3179 words, 5 texts)
Tajikistan    Annotator 1 (10600 words, 20 texts)
              Annotator 2 (2716 words, 5 texts)

The corpus in MMAX format

The corpus annotated with NP coreference is also available in the MMAX format. It was converted by Yannick Versley using a script he developed. This version of the corpus can be downloaded from here.

