In this project, we propose using statistical approaches to the analysis of corpus data in order to discover Typical Usage Patterns (TUPs) and hence create a resource for the Disambiguation of Verbs by Collocation (DVC). The project goes beyond the current state of the art, which is represented by: word sense disambiguation based on machine-readable dictionaries; typical valency-based approaches, which rarely pay attention to collocations; WordNet, which does not analyse lexical syntagmatics or collocates; and semantic role labelling, which mainly tags thematic roles (e.g. Agent, Patient, Location) rather than semantic types (e.g. Human, Firearm, Route). We propose to associate meanings with normal usage patterns rather than with words in isolation, and to integrate lexical collocations with valency, providing an empirically well-founded resource for mapping meaning onto word use in free text. DVC will show the comparative frequency of each pattern of each verb, enabling programs to reason probabilistically about meanings on a statistical basis rather than trying to evaluate all possibilities equally.
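To illustrate how comparative pattern frequencies could support probabilistic reasoning, here is a minimal sketch. It assumes a sample of corpus lines for one verb, each already tagged with a pattern label. The labels `P1` to `P3`, the counts, and the function name `pattern_priors` are invented for illustration and are not taken from the DVC resource.

```python
from collections import Counter

# Hypothetical annotated sample for one verb: one pattern label per corpus line.
# Labels and counts are invented for illustration only.
annotated_lines = ["P1"] * 6 + ["P2"] * 3 + ["P3"] * 1


def pattern_priors(pattern_tags):
    """Return the relative frequency of each pattern label in the sample."""
    counts = Counter(pattern_tags)
    total = sum(counts.values())
    return {pattern: count / total for pattern, count in counts.items()}


priors = pattern_priors(annotated_lines)

# Rank patterns so that a program can try the most frequent one first
# instead of treating all patterns as equally likely.
ranked = sorted(priors.items(), key=lambda item: item[1], reverse=True)
```

A disambiguation program could use such relative frequencies as prior probabilities and combine them with contextual evidence from the sentence being analysed.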
The internal structure of the lexical arguments of verbs will be analysed using computational linguistics techniques, so that, for example, the relationship between "repair the roof" and "repair the damage" is recognized. Even though the nouns "roof" and "damage" have different semantic types, they activate the same meaning of "repair". Once such a relationship has been identified, it will be applied to other verbs, e.g. "treat a patient" and "treat his injuries".
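The idea that nouns of different semantic types can activate the same verb meaning can be sketched as follows. This is a toy illustration only: the semantic type labels, the sense glosses, and the names `Pattern`, `NOUN_TYPES`, `PATTERNS` and `match_pattern` are assumptions made for the example, not the project's actual ontology or data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    verb: str
    object_types: frozenset  # semantic types accepted in the direct-object slot
    sense: str               # informal gloss of the meaning (invented)


# Hypothetical noun-to-semantic-type lexicon (type labels are invented).
NOUN_TYPES = {
    "roof": "Artifact",
    "damage": "Damage",
    "patient": "Human",
    "injuries": "Injury",
}

# Hypothetical patterns: each groups the object types that share one meaning.
PATTERNS = [
    Pattern("repair", frozenset({"Artifact", "Damage"}), "restore to good condition"),
    Pattern("treat", frozenset({"Human", "Injury"}), "give medical care"),
]


def match_pattern(verb, obj):
    """Return the first pattern of `verb` whose object slot accepts `obj`, else None."""
    sem_type = NOUN_TYPES.get(obj)
    for pattern in PATTERNS:
        if pattern.verb == verb and sem_type in pattern.object_types:
            return pattern
    return None


# "roof" and "damage" have different types but resolve to the same pattern.
same_sense = match_pattern("repair", "roof") == match_pattern("repair", "damage")
```

In this sketch the relationship between the two object types is stored once on the pattern, which is what allows the same structure to be reused for "treat a patient" and "treat his injuries".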
In a pilot project at Masaryk University, Brno, CZ, involving the analysis of 700 verbs, Prof. Hanks, the co-investigator of the project, showed that, while words may be highly ambiguous, patterns are rarely ambiguous and, furthermore, most uses of most verbs can be assigned unambiguously to a pattern. These existing verbs will be used to train a statistical method, the output of which will be verified lexicographically. As the number of annotated verbs increases, the training procedure will be repeated, improving the accuracy and speed of the annotation. At each step, the researchers employed by the project will analyse the computer output and correct errors.
The objective of the DVC project is to analyse 3000 common English verbs and annotate at least 250 corpus lines for each verb. An in-depth data analysis of 100 verbs will be carried out. The resource will be made publicly available at the end of the project.
The DVC project is based on, and will contribute to, the Theory of Norms and Exploitations (TNE) of Prof. Hanks. TNE says, in essence, that a language consists of two interlinked systems of rules governing word use: a set of rules for the normal uses of words and a second-order set of rules governing the ways in which normal patterns are exploited. Exploitations are deliberately unusual utterances. They play a large role in linguistic change (word-meaning change), as today's exploitation may become tomorrow's norm.
The value of DVC will be demonstrated by applying it to textual entailment and paraphrasing, thereby showing its potential usefulness in the many fields of computational linguistics that benefit from these two applications.
The results of the project will be disseminated through a wide variety of means. A user-friendly, publicly available website will contain news about the progress of the project and will provide links to project research papers. It will also host interactive demos that will enable visitors to see the patterns collected and test the technologies developed in the project. Papers will be submitted to international conferences and peer-reviewed journals. Evaluation conferences such as SEMEVAL and RTE will be used to assess the methods developed in this project in a standard environment. An important outcome of the project will be a monograph (theory, methodology, empirical findings).
- Start date: 1 October 2012
- Duration: 36 months
- Project's website: http://clg.wlv.ac.uk/projects/DVC/
- Prof. Ruslan Mitkov, Director of the Research Institute in Information and Language Processing
- Prof. Patrick Hanks, Lead researcher
- Dr. Constantin Orasan
- Dr. Ismail El Maarouf, Research Associate
- Jane Bradbury, Research Associate
- CPA website
- Public access to PDEV (Pattern Dictionary of English Verbs; in progress)
- CPA ontology (soon)
No publications yet