Review of Nadja Nesselhauf (2005), Collocations in a Learner Corpus. Studies in Corpus Linguistics 14. Amsterdam: John Benjamins. 331 pp + xii.

    For CMLR; Jan 2006; pre-publication; do not cite without permission.

If vocabulary was the flavour of the month in applied linguistics last decade, in this one it is the multiword unit (MWU). This refinement follows a certain logic, because if the claims made for said unit are even half accurate, then all levels of the language teaching industry are in for a significant rethink. Ideas to be incorporated would be that grammars emerge from phrases, not vice versa; that the main tasks in language acquisition are piecemeal, not rule based; and that functioning lexicons consist not in manageable handfuls of words but in vast arrays of combinations, lexicalised to varying degrees and operating within mazes of apparently random restrictions. No surprise that the rethink has hardly begun, with progress somewhat held up until recently by the lack of clear terms and an empirical database. To contribute to ongoing work on both fronts is the purpose of Nadja Nesselhauf’s book, based on her doctoral study of one type of multiword unit in the written production of advanced German-speaking ESL students.


Nesselhauf has assembled a corpus of advanced learner writing with a view to inspecting one of its MWUs, following procedures for learner corpus research established by Granger (1998). But Nesselhauf goes beyond anything published to date in her delimitation of phenomena and her generation of comparable data. In a detailed but (largely) readable account of her methodology, she carefully separates out collocation as the type of MWU she will look for, and within that verb-noun collocations (e.g., ride a bike, not *drive a bike), with the specification that the restriction (on ride) be fully arbitrary rather than meaningful. A catalogue of such collocations is hand-extracted from her learner corpus, and native collocations are separated from learner deviations by native raters; collocations are counted in terms of frequency and range, and deviations are categorized by type and probable intended meaning. All data can be traced back through individual writers to a background questionnaire itemizing years of ESL study, extent of exposure to English abroad, writing conditions (timed or untimed), and whether or not a dictionary was used. To call this “a lot of work” is an understatement, and indeed the amount of handwork involved raises the question of whether this approach can be scaled up to a larger corpus (than her 200,000 words), as Nesselhauf proposes.


But even a smallish corpus carved with instruments this fine can generate interesting information. A predictable finding is that collocation remains a serious problem well into advanced learning. Less predictable is that neither years of instruction, nor years abroad, nor writing with or without time pressure, with or without a dictionary, has any effect on the number of collocations employed or the number of deviations. Particularly interesting are the deviations exposed by imputing intended meaning, as when a learner writes “I don’t take care of carrots,” which is a good collocation, except that he probably means “I don’t care for carrots” (a deviation that a computer match of learner strings against a standard corpus would have missed).


So, an even worse problem than we thought, but what solution? Nesselhauf explores awareness vs. learning as solutions. Learners are not in the habit of scanning language to become aware of restrictions on word combination, nor are they encouraged to do so. The collocation problem resembles one from vocabulary research: a word met in rich contexts can have a meaning so obvious that the word itself does not register in memory. “Ride a bike” paints a picture so clear that there is little motivation to notice it was ride, not drive. A lengthy pedagogical implications section suggests ways of promoting awareness as well as developing a collocational syllabus.


Interspersed in the treatment are attempts to clarify unresolved issues in the MWU agenda. One concerns Kjellmer’s (1991) idea that while natives process language in prefabricated sequences, learners rely on grammars and lexicons, which leaves them “sounding odd.” A problem is that if fluency is impossible without access to MWUs (Sinclair, 1991), but learners do become fluent users of second languages, then either they employ such units or else fluency can be achieved on a words-and-rules basis. One way through the paradox is the frequent finding (e.g., Cobb, 2003) that learners do use MWUs, including collocations, and indeed over-use the few that they have, which is why their language sounds odd. Nesselhauf proposes other angles on the Kjellmer question, which there is no space to discuss here, and anyway a review should not give away too much!


It is to be hoped that the clear thinking and methodological exactitude of Nesselhauf’s study will be taken up in further studies. Should others accept the challenge to advance the MWU agenda through hard and careful work as Nesselhauf has done, there are some things to watch for in the write-up. First, while detail and precision clearly advance the research, the reading can be heavy going (e.g., pp. 240-241 offer two pages of closely reasoned linguistics with just one example to give a breather). Second, some of the apparatus of a thesis is out of place for a book audience (e.g., 30 pages of endnotes). Third, with all the pains taken to quantify her data, Nesselhauf nonetheless relies entirely on descriptive statistics even when making comparisons (e.g., between collocation counts in timed and non-timed writing). She even describes the results of comparisons with folkloric expressions: the length of stay in an English-speaking country “does not seem to lead to” an increased use of collocations (p. 236); the percentage of deviant collocations for users and non-users of dictionaries was “exactly the same” at 36.1% (p. 231). Isn’t the point of t-tests and the like to tell us which differences are really different?


These criticisms are just to say that no study can do everything and much work remains to be done in this area. The methods developed here are eminently replicable, and the holes to plug are obvious. This is not easy research, but careers will be made in it: this is the first act in a drama that will unfold for years to come.


Jan 3, 2006


By Tom Cobb

Dépt. de linguistique et de didactique des langues

Université du Québec à Montréal

For Canadian Modern Language Review.




Cobb, T. (2003). Analyzing late interlanguage with learner corpora: Quebec replications of three European studies. Canadian Modern Language Review, 59(3), 393-423.


Granger, S. (Ed.). (1998). Learner English on computer. London: Longman.


Kjellmer, G. (1991). A mint of phrases. In Aijmer, K., & Altenberg, B. (Eds.), English Corpus Linguistics (pp. 111-127). London: Longman.


Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.