"#task1", "Task 2" => "#task2", "Task 3" => "#task3", "Task 4" => "#task4"); $navLinks = array("Home" => $rootPath, "Events" => $rootPath . "/events/index.php", "Anaphora Resolution Evaluation" => "/events/ARE/index.php", "Evaluation methods for ARE" => ""); generateTopDocument("Evaluation methods for ARE"); generateMenu($sideLinks, $navLinks, 0); ?>
ARE - Evaluation metrics

For evaluation purposes, the participants are expected to produce an XML output as indicated for each task. In all cases the scores will be calculated using the IDs and start-end positions of the elements, not the string indicated in the value attribute. This means that in the evaluation of task 1 the results will be the same regardless of whether the output is

      <pair id="p6">
        <pronoun id="62" value=" it"/>
        <antecedent id="4" value=" the Palestinian Authority"/>
      </pair>
    
or
      <pair id="p6">
        <pronoun id="62"/>
        <antecedent id="4"/>
      </pair>
    
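
Neither variant is treated differently by the scorer. As an illustration only (the exercise does not prescribe an implementation language), a minimal Python sketch for reducing such an output to ID pairs before scoring could look like this; the file layout assumed and the function name are hypothetical:

    import xml.etree.ElementTree as ET

    def read_pairs(path):
        """Collect (pronoun id, antecedent id) pairs from a system output file.

        Only the id attributes are used, mirroring the fact that scoring relies
        on IDs and start-end positions rather than on the value attribute.
        """
        pairs = {}
        root = ET.parse(path).getroot()
        for pair in root.iter("pair"):
            pronoun = pair.find("pronoun")
            antecedent = pair.find("antecedent")
            if pronoun is not None and antecedent is not None:
                pairs[pronoun.get("id")] = antecedent.get("id")
        return pairs

    # Both fragments above produce the same mapping, e.g. {"62": "4"}.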

Task 1

Task 1 will be evaluated using success rate. Because in task 1 both the pronouns to be resolved and the candidates are known, the success rate is calculated as the number of pronouns correctly resolved divided by the total number of pronouns to be resolved.
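
Assuming the system output and the gold standard have already been reduced to ID-based structures as sketched above, the success rate could be computed roughly as follows; the data layout is an assumption, and the partial scores for pronouns resolved to other pronouns (described under task 3) are not modelled:

    def success_rate(system_pairs, gold_chains):
        """Success rate for task 1: correctly resolved pronouns / pronouns to resolve.

        system_pairs: dict mapping pronoun id -> antecedent id chosen by the system
        gold_chains:  dict mapping pronoun id -> set of ids of all entities in that
                      pronoun's coreferential chain (any of them counts as correct)
        """
        correct = sum(1 for pronoun, antecedent in system_pairs.items()
                      if antecedent in gold_chains.get(pronoun, set()))
        return correct / len(gold_chains) if gold_chains else 0.0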

The anaphora resolution method can select any entity from the coreferential chain. For each pronoun to be resolved (i.e. each pronoun marked in the input file) the following scores are given:

Task 2

Task 2 will be evaluated using precision, recall and f-measure. These are calculated using the MUC scores as defined in (Vilain et al., 1995).
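
For reference, here is a minimal sketch of the MUC link-based scoring, assuming the key and the response are each given as a list of sets of mention IDs; it follows the definition in (Vilain et al., 1995) rather than any official ARE scoring script:

    def muc_score(key_chains, response_chains):
        """MUC precision, recall and f-measure over coreference chains."""
        def muc_recall(keys, responses):
            numerator = denominator = 0
            for chain in keys:
                # Partition the key chain by the response chains; mentions not
                # covered by any response chain become singleton parts.
                parts = set()
                for mention in chain:
                    owner = next((i for i, r in enumerate(responses) if mention in r), None)
                    parts.add(owner if owner is not None else ("singleton", mention))
                numerator += len(chain) - len(parts)
                denominator += len(chain) - 1
            return numerator / denominator if denominator else 0.0

        recall = muc_recall(key_chains, response_chains)
        precision = muc_recall(response_chains, key_chains)  # roles swapped
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f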

Task 3

Task 3 is evaluated using modified versions of precision and recall. In this task the pronouns to be resolved are not indicated in the input file, so non-referential pronouns need to be filtered out; this is why both precision and recall are needed. Because the candidates are not known, it is possible that there will not be a perfect match between the entities in the gold standard and those identified by the program. For this reason we introduced the following overlap measure between two strings:

overlap(Str1, Str2) = length(overlap string)/max(length(Str1), length(Str2))


For example, the overlap between "the government of Zair" and "Zair's government" is 0, whereas the overlap between "the government of Zair" and "the government" is 0.5.
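
The page gives the formula but not the exact procedure for extracting the overlap string. One plausible reading, in line with the use of start-end positions elsewhere in the evaluation, is to measure how much the two mentions' spans overlap, normalised by the longer span; the sketch below follows that assumption (it counts characters, so it will not necessarily reproduce the exact figures of the worked example):

    def span_overlap(span1, span2):
        """Overlap between two mentions given as (start, end) character offsets.

        Implements overlap(Str1, Str2) = length(overlap string) /
        max(length(Str1), length(Str2)); spans that do not intersect at all
        (e.g. a paraphrase taken from a different part of the text) score 0.
        """
        start1, end1 = span1
        start2, end2 = span2
        shared = max(0, min(end1, end2) - max(start1, start2))
        longest = max(end1 - start1, end2 - start2)
        return shared / longest if longest else 0.0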

To calculate precision and recall, the following formulae are used:

Precision = score of correctly resolved pronouns/number of pronouns attempted to resolve
Recall = score of correctly resolved pronouns/number of pronouns in the gold standard
where the score of correctly resolved pronouns is calculated as:
Score = sum(overlap(str1, str2))
where str1 and str2 are the antecedent identified by the system and the corresponding antecedent in the gold standard.

As in task 1, if a pronoun is resolved to another pronoun, the score is 1 if there is at least one non-pronominal antecedent in the coreference chain, and 0.5 if there is no non-pronominal element in the chain or if one of the pronouns in the chain is not correctly resolved.
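
Combining the overlap measure with the formulae above, a simplified sketch of the task 3 scoring could look as follows; the data layout (dictionaries keyed by pronoun ID, with antecedents given as start-end spans) is an assumption, and the 1 / 0.5 rules for pronominal antecedents are left out for brevity:

    def task3_scores(system, gold):
        """Modified precision and recall for task 3.

        system: dict pronoun id -> (start, end) span of the proposed antecedent
        gold:   dict pronoun id -> (start, end) span of the gold antecedent
        """
        score = sum(span_overlap(span, gold[pronoun])
                    for pronoun, span in system.items() if pronoun in gold)
        precision = score / len(system) if system else 0.0
        recall = score / len(gold) if gold else 0.0
        return precision, recall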

Task 4

Task 4 will be evaluated using precision, recall and f-measure. These are calculated using a modified version of the metrics proposed in (Vilain et al., 1995). In the version we use, instead of counting the number of common pairs, we use the overlap metric proposed for task 3; this means that when a pair is compared, the overlap between its elements is calculated.
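
The page does not state how the overlaps of a pair's two elements are combined into a single figure, so the sketch below, which simply averages them, should be read as an assumption about one possible scorer rather than as the official method; span_overlap is the function sketched for task 3:

    def pair_overlap(response_pair, key_pair):
        """Partial credit for matching a response pair against a key pair.

        Each pair is ((start, end), (start, end)) for its anaphor and antecedent.
        Instead of the exact 0/1 pair match of the original MUC counting, the
        element overlaps are averaged -- the combination rule is an assumption.
        """
        (r_anaphor, r_antecedent), (k_anaphor, k_antecedent) = response_pair, key_pair
        return (span_overlap(r_anaphor, k_anaphor) +
                span_overlap(r_antecedent, k_antecedent)) / 2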