Improving post-editing and automatic translation by the creation of phraseological databases: an experiment

Speaker: Prof Jean-Pierre Colson (University of Louvain, Belgium)
Date: 14 November 2014
Location: MD083
Time: 2pm


In spite of the success of phraseology across a range of linguistic disciplines such as corpus linguistics, discourse analysis or semantics, it may come as a surprise that the notion is hardly mentioned in Translation Studies. Delisle (2003), for instance, treats set phrases as part of the lexicon. They are also most conspicuously absent from the major reference work in the field, the Routledge Encyclopedia of Translation Studies (Baker and Saldanha 2011). The same holds true of collocations.

As a matter of fact, the interest in phraseology within translation studies has come mainly from the European Society for Phraseology (Europhras) and from corpus linguistics (e.g. Teubert 2002). Computational linguistics, in turn, is showing a growing interest in matters involving translation and collocations in the broad sense. It is now generally recognised that phraseology poses a serious problem for machine translation (MT), because it involves a higher semantic level that cannot be grasped by processing the individual words. Multi-word units (including lexical bundles, Biber et al. 2004) have indeed been called “a pain in the neck for NLP” (Sag et al. 2002).

Recent findings from studies devoted to the performance of MT with regard to phraseology (Monti, Mitkov, Corpas Pastor and Seretan, eds. 2013) suggest that traditional, syntactically based systems obtain lower scores than statistically based systems such as Google Translate. However, Google Translate yields erroneous results for phraseology in at least 40 percent of the cases, which may easily be confirmed by typing randomly chosen collocations in context, especially if they are partly fixed or if the set phrases are less common.

I will try to show that a major stumbling block for post-editing or MT remains the very incomplete listing of set phrases by dictionaries and databases, even for the world’s most documented language, English. This obviously has to do with the rather poor results obtained by the automatic extraction of collocations, after no less than 50 years of research (Gries 2013). I will also propose a tentative step in the direction of a better automatic extraction of phraseology in the broad sense, based on the well-known Firthian principle that “You shall know a word by the company it keeps” (Firth 1957), and on its implementation in terms of metric clusters, a statistical technique derived from Information Retrieval (IR; Baeza-Yates and Ribeiro-Neto 1999). Achieving an appropriate balance between the principles of raw frequency, recurrent frequency and statistical co-occurrence may also be a key to success in future automatic extraction of collocations. The first results yielded by this method are promising, and may already be of use to post-editing and automatic translation.
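To give a rough idea of what a metric cluster measures, here is a minimal sketch in Python of the metric correlation described in the IR literature (Baeza-Yates and Ribeiro-Neto 1999): two words are correlated more strongly the closer together their occurrences are in the text. It is an illustration only and not the algorithm used in the experiment; the function names and the toy sentence are assumptions made for this example.

```python
from itertools import product


def metric_correlation(tokens, u, v):
    """Sum of inverse word distances between every occurrence of u and of v.

    Illustrative only: a single token list stands in for one document.
    """
    pos_u = [i for i, t in enumerate(tokens) if t == u]
    pos_v = [i for i, t in enumerate(tokens) if t == v]
    # Skip identical positions (possible only when u == v) to avoid dividing by zero.
    return sum(1.0 / abs(i - j) for i, j in product(pos_u, pos_v) if i != j)


def normalized_metric_correlation(tokens, u, v):
    """Metric correlation divided by the product of the two occurrence counts."""
    count_u = tokens.count(u)
    count_v = tokens.count(v)
    if count_u == 0 or count_v == 0:
        return 0.0
    return metric_correlation(tokens, u, v) / (count_u * count_v)


# Hypothetical toy input, not data from the experiment.
tokens = "a pain in the neck is a pain".split()
score = metric_correlation(tokens, "pain", "neck")
```

In this toy sentence, "pain" occurs twice, each time three words away from "neck", so each pair contributes 1/3 to the sum. How such a score would be balanced against raw and recurrent frequency is not specified here.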

In the present phase of the experiment, an SQL database of about 400,000 candidate collocations and lexical bundles has been built for English by means of the algorithm mentioned above. I will demonstrate a web application that enables users to input a text and discover (part of) its phraseology within a few seconds.
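The kind of lookup such an application performs might be sketched as follows, assuming a single table of candidate phrases with a score. The schema, the function names, the scores and the n-gram matching strategy are all hypothetical and chosen for illustration; the actual database and web application may work quite differently.

```python
import re
import sqlite3


def build_demo_db(rows):
    """Create an in-memory database with a hypothetical 'phrases' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE phrases (phrase TEXT PRIMARY KEY, score REAL)")
    conn.executemany("INSERT INTO phrases VALUES (?, ?)", rows)
    return conn


def find_phrases(conn, text, max_len=5):
    """Return (phrase, score) for every word n-gram of the text found in the table."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for n in range(2, max_len + 1):
        for start in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[start:start + n])
            row = conn.execute(
                "SELECT score FROM phrases WHERE phrase = ?", (candidate,)
            ).fetchone()
            if row is not None:
                hits.append((candidate, row[0]))
    return hits


# Made-up demo rows; the scores are placeholders, not results of the experiment.
conn = build_demo_db([("pain in the neck", 0.9), ("stumbling block", 0.8)])
hits = find_phrases(conn, "This is a major stumbling block and a pain in the neck.")
```

Exact matching on lowercased n-grams keeps the sketch short, but it would miss partly fixed or inflected set phrases, which are precisely the harder cases discussed above.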


Baeza-Yates, R. & B. Ribeiro-Neto (1999). Modern Information Retrieval. New York: ACM Press, Addison Wesley.

Baker, M. & G. Saldanha (eds.) (2011). Routledge Encyclopedia of Translation Studies. New York: Routledge.

Biber, D., Conrad, S. & V. Cortes (2004). “If you look at… Lexical Bundles in University Teaching and Textbooks.” Applied Linguistics 25(3): 371-405.

Colson, J.-P. (2007). The World Wide Web as a corpus for set phrases. In: H. Burger, D. Dobrovol’skij, P. Kühn & N. R. Norrick (eds.), Phraseologie / Phraseology. Ein internationales Handbuch der zeitgenössischen Forschung / An International Handbook of Contemporary Research. Volume 2. Berlin, New York: Walter de Gruyter, p. 1071-1077.

Colson, J.-P. (2008). Cross-linguistic phraseological studies: An overview. In: Granger, S. & F. Meunier (eds.), Phraseology. An interdisciplinary perspective. Amsterdam / Philadelphia: John Benjamins, p. 191-206.

Colson, J.-P. (2010a). The Contribution of Web-based Corpus Linguistics to a Global Theory of Phraseology. In: Ptashnyk, S., Hallsteindóttir, E. & N. Bubenhofer (eds.), Corpora, Web and Databases. Computer-Based Methods in Modern Phraseology and Lexicography. Hohengehren, Schneider Verlag, p. 23-35.

Colson, J.-P. (2010b). Automatic extraction of collocations: a new Web-based method. In: Bolasco, S., Chiari, I. & L. Giuliano (eds.), Proceedings of JADT 2010, Statistical Analysis of Textual Data, Sapienza University of Rome, 9-11 June 2010. Milan, LED Edizioni, p. 397-408.

Colson, J.-P. (2012). A new statistical classification of set phrases. In: Pamies, A., Pazos Bretaña, J.M. & L. Luque Nadal (eds.), Phraseology and Discourse: Cross Linguistic and Corpus-based Approaches. Phraseologie und Parömiologie, Band 29. Hohengehren, Schneider Verlag, p. 377-385.

Delisle, J. (2003). La traduction raisonnée. Ottawa: Presses de l’Université d’Ottawa.

Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In F. Palmer (Ed.), Selected Papers of J. R. Firth 1952–1959. London: Longman, p. 168–205.

Gries, S. (2013). 50-something years of work on collocations. What is or should be next … International Journal of Corpus Linguistics, 18, p. 137-165.

Monti, J., Mitkov, R., Corpas Pastor, G. & V. Seretan (eds.) (2013). Workshop Proceedings: Multi-word units in machine translation and translation technologies, Nice 14th Machine Translation Summit.

Sag, I., Baldwin, T., Bond, F., Copestake, A. & D. Flickinger (2002). “Multiword expressions: A pain in the neck for NLP”. In: Proceedings of Computational Linguistics and Intelligent Text Processing (CICLing-2002), Lecture Notes in Computer Science, 2276, 1-15.

Teubert, W. (2002). “The role of parallel corpora in translation and multilingual lexicography”. In: Altenberg, B. & S. Granger (eds.) Lexis in Contrast: Corpus-based approaches, 189–214.