- Prof. Ruslan Mitkov (PI)
- Prof. Patrick Hanks (Co-I)
- Dr Ismail El Maarouf (RA)
- Ms Jane Bradbury (RA, 2012-14)
- Dr S. Može (RA, 2014-15)
Objectives of the DVC project:
- To discover, through corpus pattern analysis, the characteristic patterns of use for English verbs and to link meanings to each pattern.
- To build a pattern dictionary, which will contain patterns for 3000 English verbs.
- To test the hypothesis that, whereas words are very ambiguous, patterns are mostly unambiguous.
- To set up links between patterns and a tagged corpus.
- To build an empirically well-founded ontology of semantic types.
- To carry out an in-depth analysis of 100 verbs, with rich pattern descriptions, showing internal structure, alternations, and relevant dependencies.
- To develop an original theory of the relationship between word meaning and word use.
This project has been a resounding success in terms of research done and conclusions reached. We have developed new insights into the nature of meaning in language and we have created new tools for lexical analysis and lexicography. We have demonstrated, by painstaking analysis of nearly 463,000 corpus lines from the ‘balanced and representative’ British National Corpus, that the meaning of clauses and sentences depends in large part on collocational patterns of word use, not merely on a simple concatenation of lexical items, each one having a distinct meaning. Language is not a Lego set!
DVC results have been published online, in a user-friendly version at www.pdev.org.uk, along with the associated empirically well-founded shallow ontology of 220 semantic types. This ontology was developed in the course of the project and is applied to nouns for the purpose of verb sense disambiguation. A semantic type represents an intrinsic property of each lexical item. The CPA ontology of semantic types is supplemented in relevant contexts by contextually assigned roles and semantic prosody.
The DVC project has shown that a verb pattern consists of a combination of collocations (lexical sets) and syntagmatics (clause roles, a.k.a. arguments). Each pattern can be used again and again, any number of times, with lexical variations, in everyday usage. But in addition, the project has shed light on the creative use of language. It has been shown that normal, conventional patterns of word use are exploited in various ways. The project discovered evidence for three basic types of exploitation of normal usage patterns: figurative uses such as metaphors, metonymy, and similes; anomalous arguments; and syntactic exploitations. Exploitations are used for rhetorical effect, but also to create meanings expressing new and unfamiliar situations. The evidence that we have analysed suggests that exploitations are rule-governed – but governed by a quite different set of rules from those that govern syntactic well-formedness. Analysis of exploitation rules has been earmarked as a topic for a future research project.
After a preliminary study, the project rejected generative models of grammar as an analytical basis and instead adopted SPOCA, a ‘slot-and-filler’ Hallidayan model of Functional Grammar, which emphasizes the importance of clause roles [Subject, Predicator, Object, Complement, Adverbial] in the everyday use of natural language to create meanings. Future research will relate this model to the Dependency Grammars used in Computational Linguistics. Relevant ‘subargumental’ cues were identified where necessary: for example the importance of the presence or absence of (a particular class of) determiner in establishing the meaning of the verb ‘take’ with ‘place’ as direct object, as in take place vs. take one’s place vs. take someone else’s place vs. take first place.
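The ‘take’ + ‘place’ example can be made concrete with a small sketch. It is only an illustration of how a subargumental cue (here, the class of determiner before ‘place’) might select among contrasting uses. The glosses, the determiner classes, and the function names are assumptions made for this example and are not taken from PDEV; in particular, treating every possessive pronoun as referring back to the subject is a deliberate simplification.

```python
from typing import Optional

# Hypothetical glosses for four uses of 'take' + 'place'; the wording is
# illustrative and is not copied from PDEV.
GLOSSES = {
    "none": "take place: an event happens",
    "own_possessive": "take one's place: a person occupies the position meant for them",
    "other_possessive": "take someone else's place: one person replaces another",
    "ordinal": "take first place: a competitor achieves the rank named by the ordinal",
}

POSSESSIVE_PRONOUNS = {"my", "your", "his", "her", "its", "our", "their", "one's"}
ORDINALS = {"first", "second", "third"}


def determiner_class(determiner: Optional[str]) -> str:
    """Classify the determiner that precedes 'place' (None means no determiner)."""
    if determiner is None:
        return "none"
    word = determiner.lower().replace("’", "'")
    if word in ORDINALS:
        return "ordinal"
    # Simplification: a possessive pronoun is assumed to refer back to the subject.
    if word in POSSESSIVE_PRONOUNS:
        return "own_possessive"
    if word.endswith("'s"):
        return "other_possessive"
    return "unknown"


def gloss(determiner: Optional[str]) -> str:
    """Return the illustrative gloss selected by the determiner cue."""
    return GLOSSES.get(determiner_class(determiner), "no pattern matched")


for cue in (None, "her", "someone else's", "first"):
    print(repr(cue), "->", gloss(cue))
```

A real analysis would also need coreference information to separate ‘she took her place’ (her own) from ‘she took his place’ (someone else's); the sketch leaves that out.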
The project proceeded verb by verb, in contrast to FrameNet, which proceeds frame by frame. This means that, when DVC declares a verb entry as complete, all normal patterns of use have been identified. The patterns are contrastive: that is, each pattern has a distinct implicature or meaning, which can be used to predict the meaning of sentences in previously unseen texts.
Corpus evidence for 1729 verbs was studied in detail. Typically, 250 corpus lines were analysed for each verb, insofar as that number of corpus lines was available.
BNC errors in word-class assignment (i.e. uses wrongly classified as verbs in the BNC that were not verbs at all) were rejected.
Work on DVC continues. In addition to completing the analysis of verbs, there is a need for corpus pattern analysis of nouns and adjectives. There is also a need to create more precise links between the patterns and a parsed corpus, which could in itself contribute to improvements in parsing technology. All of these will be the subject of future research proposals.
In 2013 MIT Press published Professor Hanks’s monograph describing his new theory of linguistic behaviour under the title Lexical Analysis: Norms and Exploitations. For other publications by members of the DVC group, see below.
- P. Hanks, I. El Maarouf, and M. Oakes, ‘Automatic extraction of MWEs for the Pattern Dictionary of English Verbs’, in M. Sailer and S. Markantonatou (eds), Multiword Expressions: Insights from a Multilingual Perspective, Berlin: Language Science Press (accepted).
- P. Hanks, ‘Definition’, in P. Durkin (ed.), Oxford Handbook of Lexicography, Oxford University Press (in press).
- P. Hanks, ‘Cognitive Semantics and the Lexicon’: review article on D. Geeraerts’ Theories of Lexical Semantics, in International Journal of Lexicography 28 (1).
- I. El Maarouf and M. Oakes, ‘Statistical measures for characterising MWEs’, in Proceedings of PARSEME (Iasi, Romania), 2015.
- H. Bechara, S. Može, I. El Maarouf, C. Orăsan, P. Hanks, and R. Mitkov, ‘The Role of Corpus Pattern Analysis in Machine Translation Evaluation’, in Proceedings of the AIETI7 Conference (Málaga, Spain), 2015.
- I. El Maarouf, H. Mousselly Sergieh, E. Alferov, Haofen Wang, Zhijia Fang, and D. Cooper, ‘The GuanXi network: a new multilingual LLOD for Language Learning applications’, in Proceedings of the Second Workshop on Natural Language Processing and Linked Open Data (NLP&LOD2), pages 42–51, Hissar, Bulgaria, 11 September 2015.
- I. El Maarouf, G. Marsic, and C. Orăsan, ‘Barbecued opakapa: using semantic preferences for ontology population’, in Proceedings of RANLP 2015 (Hissar, Bulgaria), 2015.
- I. El Maarouf and M. Oakes, ‘Statistical measures to characterise MWUs involving “mordre” in French or “bite” in English’, in Proceedings of MUMTT Workshop, EUROPHRAS 2015 (Málaga, Spain).
- V. Baisa, J. Bradbury, S. Cinkova, I. El Maarouf, A. Kilgarriff, and O. Popescu, ‘SemEval-2015 Task 15: A CPA dictionary-entry building task’, in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), (Denver, Colorado), pp. 315–324, Association for Computational Linguistics.
- P. Hanks, I. El Maarouf, and M. Oakes, ‘Automatic extraction of MWEs for the Pattern Dictionary of English Verbs’, in Proceedings of PARSEME Workshop (Malta), 2015.
- P. Hanks, I. El Maarouf, and M. Oakes, ‘Measures of collocational strength and flexibility for the identification of MWEs’, in Proceedings of PARSEME Workshop (Malta), 2015.
- P. Hanks, I. El Maarouf, and M. Oakes, ‘Word association measures for finding verb sense patterns’, in Proceedings of PARSEME Workshop (Frankfurt, Germany), 2014.
- R. Gupta, H. Bechara, I. El Maarouf, and C. Orăsan, ‘NLP techniques developed at the University of Wolverhampton for semantic similarity and textual entailment’, in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), (Dublin, Ireland), pp. 785–789, Association for Computational Linguistics.
- I. El Maarouf, J. Bradbury, V. Baisa, and P. Hanks, ‘Disambiguating verbs by collocation: Corpus lexicography meets natural language processing’, in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), (Reykjavik, Iceland), 2014.
- I. El Maarouf, J. Bradbury, and P. Hanks, ‘PDEV-lemon: a linked data implementation of the Pattern Dictionary of English Verbs based on the lemon model’, in Proceedings of the 3rd Workshop on Linked Data in Linguistics (LDL) at the Ninth International Conference on Language Resources and Evaluation (LREC’14), (Reykjavik, Iceland), 2014.
- J. Bradbury and I. El Maarouf, ‘An empirical classification of verbs based on semantic types: the case of the “poison” verbs’, in Proceedings of JSSP2013 (Trento, Italy), 2013.
- I. El Maarouf and V. Baisa, ‘Automatic classification of semantic patterns from the Pattern Dictionary of English Verbs’, in Proceedings of JSSP2013 (Trento, Italy), 2013.
- I. El Maarouf, ‘Methodological aspects of corpus pattern analysis’, ICAME Journal, vol. 37, pp. 119–148, 2013.
- P. Hanks and J. Bradbury, ‘Why do we need pattern dictionaries (and what is a pattern dictionary anyway)?’, in Kernerman Dictionary News, Vol. 21, July 2013.
- P. Hanks, ‘Creatively Exploiting Linguistic Norms’, in T. Veale, K. Feyaerts, and C. J. Forceville (eds), Creativity and the Agile Mind: A Multi-disciplinary Study of a Multi-faceted Phenomenon, De Gruyter Mouton, 2013.
- P. Hanks, Lexical Analysis: Norms and Exploitations, MIT Press, 2013.
- P. Hanks, ‘The corpus revolution in lexicography’, in International Journal of Lexicography 25 (4), Silver Jubilee Issue, 2012.
- P. Hanks, ‘How people use words to make meanings: semantic types meet valencies’, in A. Boulton and J. Thomas (eds), Input, Process and Product: Developments in Teaching and Language Corpora, Masaryk University Press, 2012.
- P. Hanks, ‘Lexicography and technology in the Renaissance and now’, in G. Stickel and T. Váradi (eds), Lexical challenges in a multilingual Europe: contributions to the annual conference 2012 of EFNIL in Budapest, Peter Lang.
In the course of the project, international collaboration on corpus pattern analysis was established with:
- Faculty of Informatics, Masaryk University, Brno, Czech Republic
- Università degli Studi, Pavia, Italy
- University Institute for Applied Linguistics, Universitat Pompeu Fabra, Barcelona, Spain
- Pontificia Universidad Católica de Valparaíso, Chile
- Institut für deutsche Sprache, Mannheim, Germany
Professor Hanks gave plenary and keynote lectures and workshops as follows:
- 2014, 2015, Michaelmas term. Training course in definition writing for lexicographers working on the Oxford English Dictionary
- 2014 Invited lecture, Real Academia Española, Madrid, Spain
- 2014 Invited lecture, Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra, Barcelona, Spain
- March 2014 Invited lecture, Lexical Studies Research Network Conference, Cardiff, UK
- 2014 Invited lecture, CLIN 24, organized by the Instituut voor Nederlandse Lexicologie, Leiden, The Netherlands
- 2013 Invited lecture and workshop, Joint Symposium on Semantic Processing (JSSP), Trento, Italy
- 2013 Keynote lecture, JILC 2013 Workshop on Corpus Pattern Analysis for Romance languages, Lorient, France
- May 2013 Invited lecture, seminar organized by the Intelligent Systems Group, Trinity College Dublin, Ireland
- 2012 Invited plenary lecture, European Federation of National Institutions for Language (EFNIL), Budapest, Hungary
- 2012 Invited plenary lecture, Stockholm Metaphor Festival
The DVC team:
- organized an international NLP challenge hosted at the SemEval 2015 competition.
- organized an international workshop on Corpus Pattern Analysis in Wolverhampton (2013).
- converted PDEV to a Linguistic Linked Data resource in order to facilitate its use in Semantic Web and NLP applications.
- created a preliminary network of semantic strings for automatic verb class discovery.
- created verb groups according to the prototypical meaning of the verb and developed a semantic tagger based on automatic ontology population techniques.
- developed a training module integrated in the Sketch Engine, which allows trainees to perform corpus analysis and be automatically assessed by the system.
- developed a clustering module integrated in the Sketch Engine, which sorts lines according to their similarity.
- created a bootstrapping module integrated in the Sketch Engine, which guesses the pattern number of an untagged line, based on the user’s tagging history.
- developed a fully synchronized, lightweight lexicographical interface for pattern dictionaries. It allows the lexicographer to list and sort verbs according to various properties; create patterns for a verb and share the work with other users; annotate lines in the BNC with pattern numbers that link back to the dictionary; renumber patterns manually with ease; navigate the ontology by semantic type or by noun, using an automatically populated ontology as well as a list of manually created suggestions; extract verbs and patterns that share common features, such as use of a specific semantic type; and easily export data and statistics. This editor has also been tested on Italian and is being used by colleagues.
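The idea behind the bootstrapping module (guessing the pattern number of an untagged line from lines the user has already tagged) can be illustrated with a minimal sketch. This is not the Sketch Engine implementation, whose method is not described here; it assumes a simple nearest-neighbour guess based on word overlap (Jaccard similarity), and the function names and example lines are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def tokens(line: str) -> Set[str]:
    """Lower-case a corpus line and split it into a set of word forms."""
    return set(line.lower().split())


def guess_pattern(untagged: str, history: List[Tuple[str, int]]) -> Optional[int]:
    """Return the pattern number of the most similar tagged line, or None.

    Similarity is the Jaccard overlap of word forms; no lemmatization or
    parsing is done, which a real system would almost certainly need.
    """
    target = tokens(untagged)
    best_score, best_pattern = 0.0, None
    for line, pattern in history:
        seen = tokens(line)
        union = target | seen
        score = len(target & seen) / len(union) if union else 0.0
        if score > best_score:
            best_score, best_pattern = score, pattern
    return best_pattern


# Hypothetical tagging history: (corpus line, pattern number chosen by the user).
history = [
    ("the meeting will take place tomorrow", 1),
    ("she took her place at the table", 2),
]

print(guess_pattern("the ceremony will take place on friday", history))
print(guess_pattern("he took his place at the front", history))
```

The same similarity score could also be used to sort lines so that similar ones appear together, which is the kind of ordering the clustering module is described as providing.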