
One size fits all? Francophone learners and English vocabulary tests.

Tom Cobb

Département de linguistique et de didactique des langues

Université du Québec à Montréal

WWW-Prepublication of paper to appear in Canadian Modern Language Review 57 (2), 295-324.

Abstract. Which need is greater, the need for standard measures of vocabulary knowledge, or the need for vocabulary measures tailored to learners' first languages (L1s)? This question is explored using placement test data from more than 1000 francophone students entering English language courses at the Université du Québec à Montréal in 1997 and 1998. The test consisted of several measures including a standard vocabulary size test (Nation's 1990 Levels Test). The study shows that a standard vocabulary measure can miss important information about learners' knowledge. It also suggests that an interlanguage-sensitive measure can be a better predictor of broader language proficiency, and concludes that different tests may be needed for different stages of second language development.

Résumé. Qu'est-ce qui est plus important, établir des normes standardisées pour mesurer les connaissances du vocabulaire, ou créer des tests faits sur mesure adaptés à la langue première (L1) de l'apprenant? Cette question est explorée en examinant les résultats de test de classement de plus de 1000 étudiants francophones apprenant l'anglais à l'Université du Québec à Montréal en 1997 et 1998. Les tests comprenaient plusieurs parties dont un test standardisé pour mesurer la magnitude du vocabulaire (le Levels Test de Nation, 1990). L'étude montre qu'une mesure de vocabulaire standardisé peut manquer des renseignements importants sur les connaissances de l'apprenant. Elle suggère également qu'un test interlangue sensible plus nuancé sera plus à même de prédire l'étendue de la maîtrise d'une langue, et conclut qu'il faudrait différents types de tests pour les différents stades dans le développement d'une langue seconde.

Prospects and problems for a standard vocabulary test

Advantages of a standard measure

Vocabulary acquisition was once the neglected area of language study (Meara, 1980), but its theoretical interest and practical importance are now generally recognized. Even so, the abundant research into second language vocabulary acquisition (SLVA) of the last 15 years has been slow to find its way into classrooms and course books (Singleton, 1997). A reason often cited for this is the absence of standard measures of vocabulary size that would allow course designers and instructors to determine where, on the open seas of a second lexicon, particular learners could most usefully cast their nets. The alternative, in the absence of a systematic approach, has been to serve up the most common 1000 or so words of English through direct instruction, as most courses do reasonably well (Meara, 1993), and set learners adrift thereafter to haul in the words they happen to meet.

Meara (e.g., 1996, p. 41) proposes several advantages for using standardized vocabulary measures in instructional programs. Such measures would "ask (and answer) questions about how many words people know, how fast their vocabularies grow, and how these factors are related to other aspects of linguistic competence." Instead, the vocabulary measures we have are mainly one-off tests that are designed for use with particular groups and purposes. Since they are incompatible with each other, it is difficult to integrate the data they produce. This approach to testing contributes to the fragmentation of the SLVA field. Meara regards Nation's (1983/1990) Vocabulary Levels Test as "the nearest thing we have to a standard test in vocabulary" (1996, p. 38), and his own Yes/No Vocabulary Checklist (Meara & Buxton, 1987) is another candidate.

Nation and his colleagues (1990; 1995; 1997) have attempted to build a systematic approach to vocabulary instruction, with their frequency-based Vocabulary Levels Test at its centre. Based on corpus analysis and experimental research, the Levels Test samples words from the 2000, 3000, 5000, and 10,000-word frequency levels, and from a zone of academic discourse known as the University Word List (UWL, recently supplanted by the Academic Word List). The test provides diagnostic advice as to where learners could most usefully direct their word-learning efforts, in view of their reading goals (e.g., whether or not they intend to do academic reading) and the predicted return on learning investment at the various levels (e.g., high at the 2000 level, low at the 10,000 level).

The Levels Test measures recognition knowledge of 18 words sampled from each of five frequency levels, in the manner shown in Table 1. The test-taker's task is to match each of the three brief definitions on the right to one of the six words on the left by writing the number of the appropriate word in the space. The total number of words tested at each level is actually more than 18, because the words in the definitions are also test words. With only 18 items at each of the five levels the test is compact (a native speaker needs about five minutes) and usable in classroom conditions (especially since the entire test may not be applicable in every case). Guessing is reduced by employing the multiple choice format shown in Table 1, where six words are matched to three glosses, making the choice ratio 1:6 rather than the usual 1:4 but without increasing the time for reading through additional distractors. On the basis of Nation's (1990, p. 140) experience using the test, a weak score at any level is defined as knowing fewer than 15 of the 18 items, or less than 83%.
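The scoring rule just described can be sketched in a few lines. This is only an illustration of the 15-out-of-18 criterion and the 1:6 versus 1:4 choice ratio; the function name level_report and the constant names are hypothetical and are not part of the Levels Test itself.

```python
from fractions import Fraction

ITEMS_PER_LEVEL = 18   # items per frequency level, as stated in the text
CRITERION = 15         # fewer than 15 of 18 counts as a weak score (Nation, 1990)


def level_report(correct):
    """Summarize one level's raw score against the 15/18 criterion."""
    if not 0 <= correct <= ITEMS_PER_LEVEL:
        raise ValueError("correct must be between 0 and 18")
    return {
        "percent": round(100 * correct / ITEMS_PER_LEVEL, 1),
        "weak": correct < CRITERION,
    }


# Chance that a blind guess picks the right word for a single gloss:
# one in six here, against one in four in a conventional four-option item.
blind_guess_levels_test = Fraction(1, 6)
blind_guess_four_option = Fraction(1, 4)

print(level_report(15))   # {'percent': 83.3, 'weak': False}
print(level_report(14))   # {'percent': 77.8, 'weak': True}
```

A score of exactly 15 sits on the criterion and is therefore not flagged as weak, since the text defines weakness as fewer than 15 items.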

TABLE 1: ITEMS FROM TWO LEVELS OF THE VOCABULARY LEVELS TEST

Testees try to identify the meanings of three of the words on the left, by writing the number of the appropriate word beside the given meaning. Item (a) is taken from the 2000 level, item (b) from the University Word List level.

(a) 1. blame
    2. hide        ___ keep out of sight
    3. hit         ___ have a bad effect
    4. invite      ___ ask
    5. pour
    6. spoil

(b) 1. affluence
    2. axis        ___ introduction of a new thing
    3. episode     ___ one event in a series
    4. innovation  ___ wealth
    5. precise
    6. tissue

Once a target learning zone has been identified, how the words in this zone should be learned is left to learners and their instructors, but Nation and his colleagues also offer ample suggestions for classroom activities (1990; 1994), text sequencing procedures (Worthington & Nation, 1996), and procedures for matching texts to learners with the text analysis computer program VocabProfile (Hwang & Nation, 1994). Given the likelihood of individual differences in levels and acquisition rates in this area, there is also a case for independent learning systems, whether flashcards or on-line tutoring. (For an example of the latter incorporating these tests and lists, see The Compleat Lexical Tutor at http://132.208.224.131.)

The Levels Test has proven useful in several classroom applications and research ventures since its inception in 1983. One such venture (recounted in Cobb, 1999) took place at Sultan Qaboos University, in the Sultanate of Oman, where it shed light on the longstanding mystery of why many entering students were unable to pass a standard elementary test of language proficiency after four months of intensive language study. The proficiency test was Cambridge University's Preliminary English Test, with a stated lexical base of 2,387 words, while the Levels Test revealed that these students' typical vocabulary size was more in the range of 500 to 1000 words. With this information, it was possible to design materials that met the students' needs.

The Levels Test provided useful information of a different type at City University in Hong Kong (Cobb & Horst, 1999). A longstanding problem in this institution is that diploma students have difficulty reading academic texts. The Levels Test disclosed that these students' knowledge of terms from the University Word List (850 sub-technical terms used to scaffold information in such texts) was consistently weak, and that UWL scores strongly predicted reading comprehension scores. Retesting with the same instrument at yearly intervals also showed these students' incidental acquisition in this zone to be rather minor, suggesting a role for direct instruction. These latter findings were particularly interesting in that they could be compared to those of EFL learners elsewhere in the world who had been tested on the same measure, for example Laufer's (1994) Israeli learners, who made strong UWL gains over a similar period of time but without direct instruction. This is an instance of Meara's (1996) point, that a standard test allows comparison across learning contexts. However, it is also an instance of how a standard vocabulary test may not measure the same thing when used with members of different language groups.

Some problems with a standard measure

Unfortunately, the Chinese and Israeli learners' progress with the UWL may not be strictly comparable. These two groups of learners come from typologically different first language backgrounds, and the manner in which their L1s interact with test form or content is not controlled. For instance, most Chinese words are monosyllabic, so that the polysyllabic items of the UWL (episode, affluence, innovation) may well pose a greater learning burden (Hsia, Chung, & Wong, 1995) than they do for Israelis, whose L1, Hebrew, is mainly polysyllabic. Such factors could have implications for how these two groups of students are best tested. It is arguable that a vocabulary test for use in instructional planning should load on the zones that are least similar to a learner's L1 and hence measure knowledge of the L2 lexicon independent of any transfer facilitation (or inhibition). This would be particularly true if one purpose of a vocabulary test, as proposed above by Meara (1996), is to predict broader language ability. This ability will arguably correlate more strongly with the effort a learner has put into mastering aspects of the L2 lexicon that are not similar to the L1 lexicon, particularly in the case of cognate languages where some portion of the second lexicon can be had cheap or even for free.

Another way that a vocabulary test may mispredict broader proficiency is in cases where learners' L1s encourage them to adopt what Johnson and Ngor (1996) call a ‘lexical processing strategy’ for reading. They found that Chinese learners reading English tended not to use the grammatical information contained in words (Chinese words do not contain this information) but rather to guess at relationships between content words. In this case it could be predicted that Chinese learners would equate learning word meanings with learning the language, work hard on vocabulary, and then do well on a recognition vocabulary test but badly on a test involving use of the same words in sentences or texts. It could be concluded that a recognition vocabulary test should not be used with Chinese learners.

Vocabulary tests can also interact with learners' L1s on the level of format and culture. Experiments with Meara and Buxton's (1987) Yes/No checklist provide examples of each. The format of the checklist test is that learners are asked to indicate, yes or no, whether they are familiar with each word in a series of lists at ascending frequency levels. Guessing is controlled by the inclusion of plausible non-words in the lists (e.g., cheatle), which are used to calculate how much testees are overestimating their lexical knowledge. (See Table 2.) If testees indicate they know non-words like mascarate, then they are penalized. However, the test is known to function poorly with Arabic-speaking learners, who identify a very large proportion of non-words as known (Al-Hazemi, 1993; Ryan, 1997). An explanation for this is that vowels are not normally written in Arabic script but rather supplied by the reader following a contextual interpretation (Abu Rabia & Seigel, 1995). With cognitive process transfer (Koda, 1988), Arabic speakers reading English are often blind to vowel-based distinctions between words, especially words out of context. Thus, they are likely to judge tilt and toilet as the same word (Ryan & Meara, 1991), or mascarate (in Table 2) as miscreate.

Culture, or more specifically the conditions of language use within a culture, can also interact with lexical knowledge in a vocabulary test. Meara and Buxton's Yes/No test was used with francophone learners in Montreal, and there was once again a high degree of opting for non-words, although this time for a different reason. Unlike Arabic speakers, francophone learners expect written words to contain vowels, but in French Canada they may never have seen the written forms of many of the English words they have heard. Meara, Lightbown, and Halter (1994) report that subjects often claimed to 'know' non-words, such as leddy, that sounded like a word they might have heard on English television (lady) but never seen written.

TABLE 2: ITEMS FROM LEVEL 1 OF THE YES/NO CHECKLIST VOCABULARY TEST

Testees have to write Y (for YES) in the box for each word if they know what it means, or N (for NO) if they do not know what it means or are not sure.

1 [ ] bridge     2 [ ] modern       3 [ ] curtain
4 [ ] prison     5 [ ] classinate   6 [ ] mascarate
7 [ ] engine     8 [ ] hurt         9 [ ] ugly
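The non-word penalty described above can be illustrated with a deliberately simple correction: the proportion of real words claimed as known, minus the proportion of non-words claimed as known. This subtraction rule is an assumption chosen for illustration and is not presented as Meara and Buxton's actual scoring formula; the function name yes_no_scores and the Y/N answers below are likewise hypothetical, while the nine items come from Table 2.

```python
def yes_no_scores(responses, nonwords):
    """Score a Yes/No checklist.

    responses maps each test item to True (Y) or False (N);
    nonwords is the set of invented items planted in the list.
    """
    real = [w for w in responses if w not in nonwords]
    fake = [w for w in responses if w in nonwords]
    if not real or not fake:
        raise ValueError("need at least one real word and one non-word")
    hit_rate = sum(responses[w] for w in real) / len(real)
    false_alarm_rate = sum(responses[w] for w in fake) / len(fake)
    # Assumed correction: subtract the false-alarm rate, with a floor of zero.
    adjusted = max(0.0, hit_rate - false_alarm_rate)
    return {
        "hit_rate": hit_rate,
        "false_alarm_rate": false_alarm_rate,
        "adjusted": adjusted,
    }


# Items from Table 2; the answers are invented for illustration only.
answers = {
    "bridge": True, "modern": True, "curtain": True,
    "prison": True, "classinate": False, "mascarate": True,
    "engine": True, "hurt": False, "ugly": True,
}
result = yes_no_scores(answers, {"classinate", "mascarate"})
print(result)
```

Under this rule a testee who accepts every item, real or invented, ends with an adjusted score of zero, which is the behaviour reported for groups that accept a large proportion of non-words.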

To summarize, it seems that a vocabulary test can focus on either the language or the learner. The Levels Test and the Yes/No Test both focus on the language, measuring the learner against the target lexicon, but do not deal with what the learner already knows through knowledge of his or her L1. With this approach comes standardization and some degree of comparability, but also some potential cost in exactness with particular groups of learners. How much cost? And how much tighter fit can be achieved with a principled adaptation of test to learner? The purpose of the present study is to shed light on these questions through an evaluation of the Levels Test as it functioned for placement and program development in an institutional setting. (Endnote 1)

Evaluating vocabulary tests

How should a vocabulary test be evaluated? The usual way of doing this is to compare a new test to a previous test that has itself been shown to predict some observable language behaviour. For example, Schmitt (1995) measured several vocabulary tests against the TOEFL (Test of English as a Foreign Language), which in turn has been shown to predict success in academic study. Instead, the approach here will be to work from principles (what information should a test provide?) and predictions (what behaviour will learners display if they know 1000, 2000, 3000, etc. words?).

Three principles proposed by Meara (1996, p. 41) are that a standard test should "answer questions about how many words people know, how fast their vocabularies grow, and how these factors are related to other aspects of linguistic competence." From these prescriptions one can develop evaluation criteria by asking of a specific test how well it answers these questions. In the present study, only the first and third of Meara's principles will be developed in this way, since answering the second question about learning rate is not applicable to a placement context. The two remaining principles, predicting how many words learners know and how this knowledge relates to other aspects of language proficiency, will be operationalized as measurable behaviours and predictions.

First, regarding how many words are known: if a vocabulary size test judged a learner to have recognition knowledge of 2000 words, how would one determine whether this claim was valid? Some ways could be predicted not to succeed. Graduates of the 2000 level might be asked to answer comprehension questions on a text constrained to 2000-level words as proof that they knew and could use these words. However, it is well known that extensive word knowledge is needed to affect reading comprehension to any meaningful extent (Mezynski, 1983; Stahl, 1991) whereas the Levels Test does not claim to measure more than basic recognition knowledge. There is a similar problem with asking subjects to write sentences as proof of knowing words. Miller and Gildea (1987) found that learners with only recognition knowledge were unable to do this. It is too much to expect learners to do something with words they may have only partial recognition knowledge of. One thing we have learned in the last 15 years is that vocabulary knowledge is incremental (Nagy, Herman & Anderson, 1985); hence we must measure recognition knowledge on its own terms.

In the present study, predictions of recognition knowledge will be evaluated within the domain of recognition knowledge in the following way: Levels Test scores will be compared to learners' recognition knowledge needs as expressed by their dictionary look-ups. The assumption is that if learners have basic recognition knowledge of a word, then they are unlikely to look it up in a dictionary, and, by extension, that if they have recognition ability at a certain frequency level then it is unlikely they will be doing any major part of their dictionary look-ups at that level.

Second, regarding the relationship between vocabulary knowledge and other aspects of linguistic ability: the strong version of this relationship (stated or implicit in lexical approaches to language learning, e.g., Willis, 1990) is that vocabulary is central or even preconditional to other types of language proficiency, such as reading, writing, and grammar, and hence even recognition vocabulary should predict these to some extent. So the two evaluation criteria will be the following:

1. The dictionary look-ups of learners judged competent at Level X will not be words from Level X but words from lower frequency levels.
2. There will be a non-trivial correlation between vocabulary knowledge and other aspects of language proficiency (such as reading or writing).

Context of studies

The Levels Test was used as part of a placement procedure for several hundred francophone students entering English courses at the Université du Québec à Montréal (UQAM) between 1997 and 1999. The motivation to introduce the Levels Test was related to both placement and program development. The present research, consisting of five related studies, evaluates the usefulness of the Levels Test for these purposes.

Vocabulary testing had never been a part of the placement procedure in this institution prior to 1997, and yet lexis was one of the students' self-reported areas of weakness, so it was reasonable that the placement procedure should include a vocabulary component as a means to more accurate placement. Further, if testing confirmed the existence of a systematic vocabulary problem then this would be a rationale for incorporating a more deliberate focus on vocabulary within language courses. Systematic vocabulary training had never been included in any École de langues course in the past, and was out of keeping with communicative language teaching as practiced in Quebec.

Participants

Most of the students entering language courses were Canadian francophones with roughly nine years of classroom English instruction behind them. For most of them, this instruction had consisted of 2-5 hours per week focused mainly on oral skills. About 20% of the students were francophone immigrants of diverse origins and language learning experience. Another 20% were students intending to enter ESL teacher training programs who wanted or were required to take additional language training.

The placement test used prior to 1997 had asked similar cohorts of students to write paragraphs outlining their main problems with English and their motivation for taking an English course. Lack of vocabulary was consistently identified as the main problem. Improving academic reading was the main motivation for 75 per cent, often with a view to performing well on the reading section of the TOEFL. (Reading is a strongly vocabulary dependent skill, relative to grammar and oral communication.) In summary, vocabulary testing and consideration of training were responses to learners' expressed needs.

Materials

Choice of a standard vocabulary test

The two most plausible candidates for a vocabulary test were the Levels Test and the Yes/No Checklist. Unlike the Yes/No test with its leddy problem, the Levels Test was not known to pose any obvious problem for francophone learners. Also, the Levels Test had been used several times in other places in the context of academic reading (Cobb & Horst, in press; Sutarsyah, Nation, & Kennedy, 1994), had been found the most reliable of several vocabulary measures, and was the test correlating most highly with TOEFL scores (Schmitt, 1995). Thus the Levels Test was chosen, or rather two levels of it, the 2000-word frequency level and the University Word List (UWL). Only the sections testing these wordlists were used, because together these lists account for a fairly reliable 90% of the tokens in an academic text. It was assumed that these learners' lexical needs would probably lie in these two areas.

An integrated placement measure

A placement test was constructed which consisted of the 2000 and UWL sections of the Levels Test (18 questions each), a TOEFL-style reading passage with 10 multiple choice comprehension questions, a 10-sentence grammar error identification task, and a 100-word writing task on the topic 'What difference would it make to your life, studies, or career if your English was much better than it is now?' The grammar and reading questions were submitted to cycles of item analysis and replacement until no question was answered by more than 85 per cent or fewer than 35 per cent of testees in a given testing session. The lexis of all test passages and questions was constrained using Laufer and Nation's (1995) measure of lexical richness, such that every word on the test was a member of either the 2000 list or UWL. This test was meant to serve as a multi-dimensional measure that could render three services:

• place students accurately

• determine whether direct vocabulary instruction was needed

• evaluate the Levels Test's ability to predict other aspects of language proficiency.
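The item-analysis cycle mentioned above (replacing grammar and reading questions until none was answered by more than 85 per cent or fewer than 35 per cent of testees) amounts to a simple filter on item facility. The sketch below assumes that "answered" means answered correctly; the function name items_to_replace, the item labels, and the facility values are all hypothetical.

```python
def items_to_replace(facility, low=0.35, high=0.85):
    """Return the ids of items whose facility falls outside [low, high].

    facility maps an item id to the proportion of testees who answered
    that item correctly in one testing session.
    """
    return sorted(item for item, p in facility.items() if p < low or p > high)


# Hypothetical facility values for four items in one session.
session = {"R1": 0.91, "R2": 0.60, "G1": 0.30, "G2": 0.85}
flagged = items_to_replace(session)
print(flagged)   # ['G1', 'R1']
```

Item G2 sits exactly on the upper bound and is kept, since the text excludes only items answered by more than 85 per cent of testees.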

When finished, the placement test was computerized to facilitate its delivery to more than 750 students per year and to ease the collection and processing of data. The entire test of 56 questions and writing task appeared as a sequence of five standard Macintosh computer screens. The time allowed was 50 minutes, with 'time remaining' and 'pages remaining' clearly indicated on all screens. Most testees did not use all the time available, and very few reported having any difficulty with navigation (knowing where they were with respect to other parts of the test). All entries were made by clicking the mouse, with the exception of the 100-word writing task which required basic keyboard skills (students had the option to write the paragraph on paper).

Reformatting the Levels Test

The format of the Levels Test posed an interface design challenge, as it was necessary to ensure that the experience of taking the on-line version of the test resembled the on-paper task as closely as possible. In the paper version of the test, entering answers is simple (writing a number in a space) and all the questions of the same type can be seen at once (to facilitate the comparison of distractors and the revision of answers). The computerized test was formatted to present the 18 questions at each level on a single screen. Each 6x3 question-cluster (see Table 1) was transformed into three multiple choice questions with the same six choices, answerable with a mouse click. The original and the adapted format are shown in Table 3, and a screen picture of the adapted test is shown in Appendix 1. A simple pilot test was conducted with 10 learners using both paper and computer versions of the test with a two-week interval, and there was no significant difference in scores. Of the 1500 students who have used the test, fewer than ten have reported any difficulty with the computer interface. This was determined both by informal observation and by employing a tracking system in the computer program that recorded the time between interactions, the number of revised answers, and other reliable indicators of human-machine interaction (see Cobb, 1997, Ch. 9 and 12).

TABLE 3: ITEM FROM THE ORIGINAL LEVELS TEST RECODED FOR COMPUTER VERSION

Item (a) is taken from the 2000 level of the original test, item (b) is the same item from the computer version of the test. In the computer version, testees click the mouse on the square beside the appropriate word and the square becomes filled with an "x".

(a) 1. blame
    2. hide        _2_ keep out of sight
    3. hit         ___ have a bad effect
    4. invite      ___ ask
    5. pour
    6. spoil

(b) 1. keep out of sight   [ ] blame  [x] hide  [ ] hit  [ ] invite  [ ] pour  [ ] spoil
    2. have a bad effect   [ ] blame  [ ] hide  [ ] hit  [ ] invite  [ ] pour  [ ] spoil
    3. ask                 [ ] blame  [ ] hide  [ ] hit  [ ] invite  [ ] pour  [ ] spoil
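The recoding shown in Table 3 can be expressed as a small transformation from one six-word, three-gloss cluster to three multiple choice questions that share the same options. This is a minimal sketch rather than the code of the actual placement program; the function names cluster_to_questions and render are hypothetical, and the keys for the second and third glosses are supplied here because Table 3 shows only the first answer filled in.

```python
def cluster_to_questions(words, keyed_glosses):
    """Turn one 6x3 cluster into three questions with the same six options.

    words is the list of six option words; keyed_glosses is a list of
    three (gloss, correct_word) pairs.
    """
    if len(words) != 6 or len(keyed_glosses) != 3:
        raise ValueError("a cluster has six words and three glosses")
    questions = []
    for gloss, answer in keyed_glosses:
        if answer not in words:
            raise ValueError(f"{answer!r} is not among the options")
        questions.append({"prompt": gloss, "options": list(words), "answer": answer})
    return questions


def render(question, chosen=None):
    """Draw one question as a row of check boxes, marking the chosen word."""
    boxes = " ".join(
        f"[{'x' if word == chosen else ' '}] {word}" for word in question["options"]
    )
    return f"{question['prompt']} {boxes}"


words = ["blame", "hide", "hit", "invite", "pour", "spoil"]
glosses = [("keep out of sight", "hide"), ("have a bad effect", "spoil"), ("ask", "invite")]
questions = cluster_to_questions(words, glosses)
print(render(questions[0], chosen="hide"))
# keep out of sight [ ] blame [x] hide [ ] hit [ ] invite [ ] pour [ ] spoil
```

Keeping the same six options in the same order for all three questions preserves the property of the paper format that the distractors can be compared across the glosses of a cluster.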

Dictionary look-ups kit

A website was developed where students could read newspaper stories from the Montreal Gazette, look up words in Merriam-Webster's World Wide Webster dictionary (at http://www.m-w.com), and submit these to an instructor as part of a class word bank building project (follow links to Group Lex from http://www.er.uqam.ca/nobel/r21270/4150). The Gazette has often been used as a source of reading material in the École de langues and has proven popular with students as it deals with familiar topics in challenging yet comprehensible English. The look-ups idea follows a methodology developed by Cohen, Glasman, Rosenbaum-Cohen and Ferrara (1988) and a technology developed by Hulstijn (1993) and adapted here for the Internet.

Study 1: Determining the English vocabulary levels of UQAM students

Given the students' belief that vocabulary was their main problem with English, and academic reading their main objective, it was initially expected that vocabulary testing at the École de langues would lead to the implementation of vocabulary components within reading courses. It seemed reasonable to expect that the 1000-2000 wordlist might form the basis of a useful instructional module in intermediate reading courses, and the UWL in more advanced courses.

Results and Discussion

In 11 testing sessions over the course of a year, the mean percentage score for 768 students on the Levels Test was 74% (SD = 16) at the 2000 level and 68% (SD = 18) at the UWL level, both somewhat below the suggested criterion of 83%.

On the basis of this information, an experimental UWL module was added to two academic reading courses. However, the administration of the École de langues did not feel that the 2000-level scores (mean = 74%) bespoke any clear need to reallocate course time from reading and skills training to vocabulary work. With roughly 10% of the lowest placing testees either not being admitted to courses or else soon dropping out, it was reasonable to assume that the average scores at the 2000 level of students actually attending English classes would not be much below the 83% criterion. One typical cohort of testees (n = 37) was tracked through the registration and drop-out process, and with eliminations the mean 2000-level score was 82% (SD = 9.3).

However, reading instructors familiar with the various frequency lists represented in the Levels Test (view these lists at http://132.208.224.131) observed that much reading class time was in fact devoted to discussing vocabulary items, many of them rather common, including items from the 2000 list of the most commonly used words of English. Thus there appeared to be a discrepancy between test results and the perceptions of both learners and instructors.

Study 2: Which words do students look up?

It was decided to verify the instructors' impressions in a more rigorous manner by investigating which words students were actually looking up in a dictionary. This involved building a dictionary activity into a reading course and keeping a record of the words students had sought information about. For one 14-week session, students in two randomly chosen classes (n = 80) were assigned the task of reading at least five stories per week on the Montreal Gazette website, summarizing each, looking up any words found interesting or necessary for comprehension, and submitting five of these along with definitions and sentence contexts to the on-line word bank discussed above. If students could not find five words that they felt they needed to look up, then they looked for words their classmates might be interested in learning.

Results and discussion

Because of the large number of words collected in this look-up study (7594 tokens, 4623 types), only words looked up by more than five students (176 types) are reported (for the complete listings at both the five and the two look-up thresholds, see http://www.er.uqam.nobel/r21270/lookups). These 176 words were divided into frequency zones using Hwang and Nation's (1994) VocabProfile, a text analysis program which assigns words in a text to frequency lists following the scheme of the Levels Test. The interesting result, consistent with instructors' intuitions, was that 34.6 per cent of the look-ups were 2000-level items. That is, more than one third of the 176 words that students looked up were common items, some even very common items from the 1000 frequency level (e.g., weak, worth, and youth). This interest in high frequency items is all the more remarkable in view of the fact that newspaper writing is lexically rich as a genre (Hwang, 1989), containing a high proportion of less frequent but information-bearing words to which learners’ attention might have been drawn.
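The kind of frequency-zone breakdown described above can be sketched as follows. This is not VocabProfile itself, whose internals are not described here; it is a minimal stand-in that matches exact word forms against lists, whereas a real profiler would presumably also handle inflected and derived forms. The function name profile, the tiny word lists, and the four sample look-ups are hypothetical and chosen only to make the example run.

```python
from collections import Counter


def profile(lookups, zones):
    """Report the percentage of look-ups falling in each frequency zone.

    zones is an ordered list of (zone_name, set_of_words) pairs; a word is
    assigned to the first zone that contains it, otherwise to 'off-list'.
    """
    if not lookups:
        raise ValueError("no look-ups to profile")
    counts = Counter()
    for word in lookups:
        for name, members in zones:
            if word.lower() in members:
                counts[name] += 1
                break
        else:
            counts["off-list"] += 1
    return {name: round(100 * n / len(lookups), 1) for name, n in counts.items()}


# Tiny stand-in lists; the real lists hold hundreds or thousands of entries.
zones = [
    ("2000", {"weak", "worth", "youth", "lack", "roar"}),
    ("UWL", {"episode", "affluence", "innovation"}),
]
sample_lookups = ["weak", "roar", "innovation", "quagmire"]
shares = profile(sample_lookups, zones)
print(shares)   # {'2000': 50.0, 'UWL': 25.0, 'off-list': 25.0}
```

Applied to the full set of frequently looked-up words, the share reported for the 2000 zone is the figure that bears on the first evaluation criterion.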

Is there any pattern to the words the students looked up? Table 4 shows the 54 look-up items from the 2000-level list. It seems clear the students have an unerring aim for the words of English that are not (or not obviously) cognate with French words. In other words, their 2000-level look-ups are mainly words of Anglo-Saxon origin, which tend to be well represented in the high frequency zones of English and are not usually inferable from knowledge of French. Interestingly, two of the words are test words on the Levels Test (lack and roar).

TABLE 4: 2000-LEVEL WORDS LOOKED UP BY MORE THAN FIVE STUDENTS

abroad      aim         bare        beam        bear        beneath
boast       bold        borrow      broad       bundle      claim
curse       damp        drag        eager       elderly     flood
further     illness     increase    indeed      lack        length
meant       nearby      plenty      prompted    raise       request
roar        seize       settle      settlement  skills      slightly
slopes      spread      steep       stir        strike      swallow
sweep       thread      threat      throat      trial       urge
wage        weak        worth       wound       wrap        youth

(54 words, or 34.6 per cent of 176 words looked up by 5 or more students)

How important are the Anglo-Saxon words of English? They comprise only about 35 per cent of the lexicon as a whole, with terms of French, Latin and Greek origin comprising most of the rest. However, in the high frequency zones, Anglo-Saxon weighs in at closer to 50 per cent (Roberts, 1965, cited in Nation, 1990, p. 18). Since the most frequent 2000 words of English reliably comprise about 80% of the individual words or tokens in an average text (Carroll, Davies, & Richman, 1971), Anglo-Saxon terms account for about half of these, or 40% of tokens in an average text. Many of these are pronouns and other function words that most students could be expected to know; however, note that the item beneath appears in Table 4, and several other prepositions and conjunctions appear in the larger list of look-ups. The proportion of Anglo-Saxon words is probably even higher in spoken language, which leans heavily on the first 1000 words of the language, where the Anglo-Saxon proportion is 56 per cent. In other words, any systematic weakness in learners' Anglo-Saxon lexicon could make it difficult for them to understand English with any precision, and would shed light on their perception that they are weak in vocabulary. (Endnote 2)
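The 40% figure in the paragraph above is the product of two proportions, and the arithmetic can be checked directly. The variable names are hypothetical, and the calculation follows the text in treating the Anglo-Saxon share of high frequency words as if it were also their share of running tokens, which is a simplification.

```python
top_2000_token_share = 0.80   # share of tokens covered by the 2000 most frequent words
anglo_saxon_share = 0.50      # approximate Anglo-Saxon share within that zone

anglo_saxon_token_share = top_2000_token_share * anglo_saxon_share
print(f"{anglo_saxon_token_share:.0%}")   # 40%
```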

To summarize, this study has identified a pattern whereby high frequency words were looked up more often than would be expected given the students' performance on the Levels Test. These words were non-cognate or not obviously cognate with French words. However, as mentioned above, the Levels Test is not bereft of AS items, and the next study reports on testees' performance with these.

Study 3: Item facility analysis of the Levels Test

It is well known that the English lexicon comprises two main strands, the Greco-Latin and the Anglo-Saxon. How does the Levels Test reflect this twin inheritance? The test designer has clearly attempted to represent both strands more or less equally. This became apparent through the following analysis. Each of the 18 test items consists of a word and a gloss. A rating of Greco-Latinness (GL) or Anglo-Saxonness (AS) was established for each word and gloss in each item at the 2000 level by looking up the words in a standard dictionary offering word etymologies (the Webster Dictionary). In the glosses, AS prepositions, pronouns and other grammatical words were assumed known to all testees and ignored. For example, elect = choose by voting was rated GL-GL (élire = choisir par vote) despite the presence of by in the gloss. Where a gloss was a phrase containing both GL and AS terms (e.g., roar = loud deep sound) the classification was based on simple majority of content words (loud and deep outweigh sound so the item was rated AS-AS).

The ratings were used to assign a GL strength to each of the 18 test items (word-gloss pairs) at the 2000 level. A GL word and GL gloss (GL-GL) was assigned a strength of 3 (total = complete); GL-AS was assigned a 2 (original = first); AS-GL was assigned a 1 (pride = having a high opinion of yourself); and AS-AS was assigned a 0 (melt = become like water). Some decisions are clearly built into these strength assignments. For example, designating the gloss having a high opinion of yourself as GL assumes that most learners know that have and avoir, high and haut, are related. Also, in the mixed pairs, designating GL-AS items strength 2 but AS-GL items only strength 1 is based on the expectation that words out of context are more difficult to interpret than words in phrases (10 of the 18 glosses are phrases). Also, the test is set up such that all of the glosses but only three of the six words in each set must be used (see Table 1), so glosses are more likely to be helped by elimination. Three raters following the guidelines arrived at the same ratings. (See Table 5.)
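The rating scheme just described can be expressed as a small lookup. The following is a minimal sketch only: the names `gl_strength` and `majority_origin` and the label strings are assumptions introduced here for illustration, and the handling of ties is not specified in the procedure above.

```python
# GL strength for each (test word, gloss) rating, as described in the text:
# GL-GL = 3, GL-AS = 2, AS-GL = 1, AS-AS = 0.
# Function and variable names are hypothetical, chosen for illustration.
GL_STRENGTH = {
    ("GL", "GL"): 3,
    ("GL", "AS"): 2,
    ("AS", "GL"): 1,
    ("AS", "AS"): 0,
}

def gl_strength(word_origin, gloss_origin):
    """Return the GL strength of a word-gloss pair rated 'GL' or 'AS'."""
    return GL_STRENGTH[(word_origin, gloss_origin)]

def majority_origin(content_word_origins):
    """Rate a phrasal gloss by simple majority of its content words.

    Grammatical words are assumed to have been dropped beforehand.
    The text does not say how ties were resolved, so a tie raises an error.
    """
    gl = content_word_origins.count("GL")
    as_ = content_word_origins.count("AS")
    if gl == as_:
        raise ValueError("tie between GL and AS content words")
    return "GL" if gl > as_ else "AS"

# roar = loud deep sound: loud and deep (AS) outweigh sound (GL).
roar_gloss = majority_origin(["AS", "AS", "GL"])
roar_strength = gl_strength("AS", roar_gloss)  # AS-AS, strength 0
```

Ranking GL-AS above AS-GL encodes the stated expectation that an isolated test word is harder to interpret than a gloss phrase.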

An interesting point is raised by designating item 17 (sport = game) and item 18 (victory = winning) in Table 5 as GL-AS. While game and winning must clearly be classified as AS words, they are nonetheless known to all Quebeckers who watch at least some of their ice hockey on English television. But as test items, these words do not necessarily function as samples indicating knowledge of other words in the frequency level. In a vocabulary size test, the tested words are meant to sample knowledge of many more words beyond just themselves, so test items like game and win may over-represent the vocabulary knowledge of francophone Canadians. This is an example of how a standard vocabulary test can interact with culture.

The analysis revealed that at the 2000 level of the test, there are five GL-GL items (GL strength 3), six GL-AS (strength 2), two AS-GL items (strength 1), and five AS-AS items (strength 0). The mean GL strength level amounted to 1.6 (SD = 1.2). By this reckoning, the Levels Test balances the two strands reasonably well, although with some bias toward GL items.
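These summary figures can be checked against the GL strength column of Table 5. The sketch below assumes the sample standard deviation (n - 1 denominator); the paper does not state which formula was used, but both versions round to 1.2 here.

```python
from collections import Counter
from statistics import mean, stdev

# GL strength ratings of the 18 items at the 2000 level, in Table 5 order.
strengths = [3, 2, 3, 3, 0, 2, 0, 1, 2, 1, 3, 0, 3, 2, 0, 0, 2, 2]

counts = Counter(strengths)
mean_strength = mean(strengths)
sd_strength = stdev(strengths)  # sample SD; an assumption, see lead-in

print(sorted(counts.items(), reverse=True))  # [(3, 5), (2, 6), (1, 2), (0, 5)]
print(round(mean_strength, 1), round(sd_strength, 1))  # 1.6 1.2
```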

Once test scores had been assembled, a facility index (the percentage of students answering correctly) was calculated for each item. Based on the results of Studies 1 and 2, it was predicted that GL strength would correlate significantly with success on test items. If so, this would suggest that learners had drawn heavily on their knowledge of French cognates for their test scores. It was further predicted that success with most AS items would be low and reflect mainly the guessing opportunities afforded by the elimination of GL items.

TABLE 5: LEVELS TEST, 2000 LEVEL: GL ASSIGNMENT AND FACILITY INDEX

Test word	Gloss	AS-GL balance	GL strength	Facility Index (SD)
1. total	complete	GL-GL	3	.91 (.06)
2. original	first	GL-AS	2	.80 (.07)
3. private	not public	GL-GL	3	.92 (.07)
4. elect	choose by voting	GL-GL	3	.93 (.05)
5. melt	become like water	AS-AS	0	.68 (.08)
6. manufacture	make	GL-AS	2	.57 (.07)
7. hide	keep out of sight	AS-AS	0	.60 (.08)
8. spoil	have a bad effect on	AS-GL	1	.29 (.06)
9. invite	ask	GL-AS	2	.78 (.06)
10. pride	having a high opinion of yourself	AS-GL	1	.77 (.07)
11. debt	something you must pay	GL-GL	3	.69 (.07)
12. roar	loud, deep sound	AS-AS	0	.62 (.10)
13. salary	money paid regularly for doing a job	GL-GL	3	.95 (.05)
14. temperature	heat	GL-AS	2	.82 (.07)
15. flesh	meat	AS-AS	0	.40 (.09)
16. birth	being born	AS-AS	0	.89 (.05)
17. sport	game	GL-AS	2	.89 (.05)
18. victory	winning	GL-AS	2	.92 (.06)

Results and discussion

Facility analysis showed that while most testees had not made a large number of errors on the test, many had made the same errors. The rightmost column in Table 5 shows the percentage and standard deviation of the 768 testees who answered each item correctly, with GL strength rating in the column immediately to the left for easy comparison. The table shows that 7 out of 18 test items are known to 89% or more of testees (items 1, 3, 4, 13, 16, 17, 18). The mean GL strength of these items is 2.3. Five out of 18 items are known to 62 per cent or fewer of the testees (items 6, 7, 8, 12 and 15). The mean GL strength of these items is .6. In other words, the GL strength of high success items is almost four times that of low.
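The grouping behind this comparison can be reproduced from the values printed in Table 5. The sketch below uses the two cut-offs mentioned above (89% or more, 62% or fewer); the variable names are illustrative assumptions.

```python
from statistics import mean

# (item number, GL strength, facility index) transcribed from Table 5.
table5 = [
    (1, 3, .91), (2, 2, .80), (3, 3, .92), (4, 3, .93), (5, 0, .68),
    (6, 2, .57), (7, 0, .60), (8, 1, .29), (9, 2, .78), (10, 1, .77),
    (11, 3, .69), (12, 0, .62), (13, 3, .95), (14, 2, .82), (15, 0, .40),
    (16, 0, .89), (17, 2, .89), (18, 2, .92),
]

# High-success items: answered correctly by 89% or more of testees.
high = [(item, gl) for item, gl, facility in table5 if facility >= .89]
# Low-success items: answered correctly by 62% or fewer of testees.
low = [(item, gl) for item, gl, facility in table5 if facility <= .62]

high_mean = mean([gl for _, gl in high])
low_mean = mean([gl for _, gl in low])

print([item for item, _ in high], round(high_mean, 1))  # [1, 3, 4, 13, 16, 17, 18] 2.3
print([item for item, _ in low], round(low_mean, 1))    # [6, 7, 8, 12, 15] 0.6
```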

There is nothing irregular in such a distribution in itself, provided it has no systematic basis. But that is not the case here. Comparing the facility index to the content of each item in Table 5, one can see that most of the items unknown to large numbers of testees involve AS terms such as melt (Question 5), make (Question 6), hide, keep, or sight (Question 7), spoil (Question 8), loud, deep, or roar (Question 12), and meat or flesh (Question 15). In other words, weakness is not distributed throughout the system but concentrated on one type of item. Similarly, strength is concentrated on words like total and complete (Question 1), private and public (Question 3), elect and choose (Question 4). The pattern is not complete, however: the AS items game (Question 17) and winning (Question 18) posed little problem, as noted already, and, less explicably, neither did birth and born (Question 16) or pride (Question 10).

Overall, the correlation between GL strength and item facility is r = .63 (p<.05); in other words, the more GL an item has, the more testees will know or be able to guess it. If the 2000-level test had consisted only of items having GL strength 1 or 0 (i.e. requiring knowledge of at least one Anglo-Saxon term), then the average score at this level for these 768 students would have been 63.4% (SD = 7.5), and the case for giving these students basic vocabulary training would have been clear.
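For readers who want to run this kind of item-level analysis on their own test data, a Pearson correlation between GL strength and facility can be computed as sketched below. The four data points are hypothetical and are not taken from this study; the paper does not specify the software or exact procedure used for the reported coefficient.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical items (NOT study data): GL strength and facility index.
strength = [0, 1, 2, 3]
facility = [0.50, 0.45, 0.80, 0.90]

r = pearson_r(strength, facility)
print(round(r, 2))
```

A positive coefficient indicates that items with more Greco-Latin content tend to be answered correctly by more testees.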

Scores from one testing session (n = 148) were selected for more detailed examination. This session was randomly selected from all the testing sessions which had not included future ESL teachers, who were seen as a separate population (see Study 5 below). The mean score for the 148 testees at the 2000 level was 79.31 (SD = 13.29). As already noted, a mean this high would not present a strong case for vocabulary training. But with items broken down into GL and AS components, the picture changes dramatically. The mean score for GL = 3 and GL = 2 items taken together was 79.30% (SD = 14.24); the mean score for GL = 1 and GL = 0 items amounted to 39.34% (SD = 22.06). In other words, success was double on the GL biased items.

The bar chart in Figure 1 represents the mean GL and AS scores of individuals from the testing cohort, ranked from left to right by mean overall placement test score, and divided into three sample subgroups by ability level: ranks 1-10, ranks 70-80, and ranks 110-120. The columns representing GL scores are consistently high across the ability range, whereas the columns representing AS knowledge drop sharply. Something the bar chart does not show for lack of space is that the drop occurs very shortly after the tenth testee (see http://www.er.uqam.ca/nobel/r21270/lookups/barchart.htm for data on the complete set of scores). In other words, the GL-AS difference is not spread evenly across the ability range, but rather increases as ability decreases. This same pattern was observed in four other cohorts that were examined.

FIGURE 1: AS AND GL KNOWLEDGE ACROSS THE ABILITY RANGE

To summarize, these learners are systematically less likely to know AS than GL items in the high frequency zone. This fact is not disclosed but rather masked by a test that samples from the lexicon of English as a whole. This is not to criticize the Levels Test, which clearly samples the English lexicon representatively. It is rather an argument that a test will not find L1-specific information if it does not look for it.

But does differential familiarity with the lexical strands of English have any bearing on these learners' broader ability to function in English? As mentioned above, AS terms constitute about 40% of text lexis and as much as 56% of spoken lexis, so weakness in this area could be expected to affect general linguistic functioning and would help account for the students' perception that they are weak in vocabulary. This expectation is tested in the next study.

Study 4: Predicting broader proficiency

What sort of correlation should we expect between recognition vocabulary knowledge and broader proficiency in a second language? Given what many now describe as the centrality of lexical knowledge in all aspects of linguistic functioning, we might expect the correlation to be substantial. And yet correlations between lexis and broader proficiency are not commonly reported in the research literature. Possibly the closest we come to a standard by which other predictions can be evaluated is the L1 work of Anderson and Freebody (1983, used as a point of reference in Meara & Buxton, 1987), in which a series of multiple choice vocabulary tests were found to predict reading comprehension scores at r=.8, r=.75, and r=.66, with a mean correlation of r= .73. Additional support for this figure comes from a study by Qian (1999), which examined TOEFL recognition vocabulary and reading scores of 217 ESL students, and found the same correlation of r=.73. So, with the understanding that it is not strictly appropriate to compare correlation strengths across studies, samples, and instruments, a correlation of r = .73 between passive vocabulary knowledge and reading scores shall serve as a guideline in the evaluation that follows.

The fourth study includes two analyses. The first tested the correlation between Levels Test scores (at the 2000-level and UWL) and reading scores. It was predicted that correlations would be substantially less than r=.73, because of the test's overestimation of these learners’ vocabulary knowledge as shown above. The second analysis examined the independent contributions of testees' 2000-level scores on AS and GL items to reading scores in a multiple regression analysis. It was predicted that GL knowledge would account for more score variance than AS knowledge, because of undetermined amounts of guessing in AS scores and the effects of unrepresentative items in the test (like game and win). The data for this part of the study was once again 2000-level placement test results for the cohort used in Study 3 (n=148), with the 18 items divided into GL and AS components in the method described.

The text on which the reading test was based was a normal academic text with a lexical profile very much in line with the frequency zones targeted by the Levels Test. Its profile as determined by VocabProfile analysis was as follows: 84 per cent of tokens were from the 0-2000 level, 10 per cent were from the UWL level, and the remaining 6 per cent were topic-specific words judged inferable from the context by three instructors (assuming knowledge of the 2000 and UWL items comprising the contexts). A further analysis of the GL-AS composition of the 0-2000 zone, based on wordlists which can be viewed at http://132.208.224.131, revealed a roughly equal distribution, as predicted by Roberts (1965). The ten multiple choice questions were of an almost identical lexical composition to the text, being drawn from the text with no terms added other than question words. The questions were inferential in the sense of requiring comprehension of rephrased or integrated text material but assumed no specialized topic knowledge. In other words, an effort was made to align the reading task with both the Levels Test and the normal GL-AS distribution of English texts.
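The kind of token-coverage profile reported here can be approximated with a few lines of code. This is a simplified sketch and not the VocabProfile program itself: it matches exact word forms rather than word families, and the function name, band labels, and tiny word lists are hypothetical placeholders for real frequency lists.

```python
import re
from collections import Counter

def lexical_profile(text, band_lists):
    """Return the percentage of tokens in each frequency band.

    band_lists maps a band label to a set of words; bands are checked in
    insertion order and tokens found in no band are counted as 'off-list'.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return {}
    tally = Counter()
    for token in tokens:
        for band, words in band_lists.items():
            if token in words:
                tally[band] += 1
                break
        else:  # no band contained this token
            tally["off-list"] += 1
    return {band: 100 * count / len(tokens) for band, count in tally.items()}

# Hypothetical miniature word lists, for illustration only.
bands = {
    "0-2000": {"the", "of", "words", "was", "a", "test"},
    "UWL": {"analysis"},
}
profile = lexical_profile("The analysis of the words was a test of morphology", bands)
print(profile)  # {'0-2000': 80.0, 'UWL': 10.0, 'off-list': 10.0}
```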

Results and discussion

The Levels Test predicted reading scores only moderately well by the proposed guideline of r =.73. The correlation between reading and overall 2000 level vocabulary was r=.62, and reading and UWL was r =.59 (both p<.05). However, when 2000 level scores were broken down into their GL and AS components, calculated as independent percentages, and entered into multiple regression analysis with reading scores as the dependent variable, a different picture emerged: The correlation between GL knowledge and reading scores was r=.74, while the correlation between AS and reading scores amounted to only r=.05 (p>.05), accounting for less than 1 per cent of variance. In other words, these students’ AS knowledge had apparently contributed almost nothing to their reading scores. The same analysis with three other test cohorts produced similar results.

This finding, while surprising in its extremity, is nonetheless consistent with the hypothesis that testees' success with AS test items was largely due to either guesswork, once GL items were eliminated, or else to unrepresentative knowledge of AS terms like game and win. In either case, a correct answer on an AS item would not reflect any significant amount of broader lexical knowledge, and would not contribute much to variance on an integrated measure such as a reading comprehension score. Still, the finding stands in need of confirmation with a test of somewhat more than 18 items representing 2000 words, in other words with less room for the operation of token effects.

Pending a further investigation of AS and GL contributions, the overall correlations of 2000-level and UWL vocabulary with reading, at r=.62 and r=.59, are in any case not high (with vocabulary scores accounting for only about 35 to 38% of reading score variance). This in itself is, arguably, sufficient justification to begin experimenting with other types of tests. There are numerous directions that the search for an interlanguage sensitive test could lead in, and the final study explores one of these, once again in the context of a practical institutional concern.
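The step from a correlation to a share of variance is simply squaring the coefficient. A minimal sketch, applied to the Study 4 correlations reported above (the helper name `variance_explained` is an assumption introduced here):

```python
def variance_explained(r):
    """Percentage of variance shared by two measures correlated at r."""
    return 100 * r * r

# Correlations with reading scores reported in Study 4.
study4 = {
    "2000 level": .62,
    "UWL": .59,
    "GL items": .74,
    "AS items": .05,
}
shares = {label: variance_explained(r) for label, r in study4.items()}

for label, share in shares.items():
    print(f"{label}: {share:.2f}% of reading score variance")
```

On these figures the two overall vocabulary measures account for between roughly a third and two fifths of reading score variance, the GL items for a little over half, and the AS items for well under 1 per cent.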

Study 5: Exploring an L1-specific vocabulary test

Subjects, materials, procedures, predictions

Some courses given at the École de langues are specifically designed for future ESL teachers to help them perfect their English language skills. For many of these learners, the test described above, and especially the vocabulary component, was found not to generate adequate variance to make good placement decisions. A more demanding test was needed, and this need presented an opportunity to use the findings of the foregoing studies in the design of a different type of placement instrument.

The participants in this study were 73 applicants to the ESL teacher training program in the Département de linguistique et de didactique des langues at UQAM. These applicants were much more proficient in English than the majority of participants in the previous studies. A new multi-skill test including a modified version of the Levels Test was created for these applicants. The modified vocabulary test targeted knowledge of English words independent of any ability to exploit cognates. (Since the vocabulary part of the test was experimental, its results were not used for placement purposes.)

The Levels Test was adapted in three ways. First, it was shortened to 20 questions, in line with administrative time constraints. Second, only 2000-level and UWL items that are not cognate with or inferable from French words were included as test words (e.g., stretched, wealth, burst, and slight in Table 6). These were obtained by picking and choosing AS test words from parallel versions of the Levels Test (published in a paper by Laufer & Nation, 1999). Some GL items continued to appear in the wording of the contexts (e.g., difference in Table 6). Third, the format of the test was changed to the controlled productive version of the test described by Laufer and Nation, which resembles a c-test (a cloze passage with the first half of each gap provided). This format activates whatever memory trace is available for a word, yet renders guessing difficult or impossible.

TABLE 6: CONTROLLED PRODUCTIVE TEST, 2000 AND UWL LEVELS, NON COGNATE

Testees have to complete the words by typing in the spaces provided. Spelling and grammar are not important if testees show they know the word.

This sweater is too tight. It needs to be stre___.

The rich man died and left all his we___ to his son.

If you blow up that balloon any more it will bur___.

The differences were so sl___ that they went unnoticed.

For complete test see Appendix 2, or follow links to Tests at http://132.208.224.131
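Items in this format are straightforward to generate from a sentence and its target word. The sketch below is an illustration under stated assumptions: the function name `make_item` is hypothetical, and the number of letters shown is set by hand to match the sample items in Table 6, since the rule for choosing the cue length is not described here.

```python
def make_item(sentence, target, n_letters):
    """Replace the target word with its first n_letters letters and a blank."""
    if target not in sentence:
        raise ValueError(f"{target!r} does not occur in the sentence")
    cue = target[:n_letters] + "___"
    return sentence.replace(target, cue, 1)

# Sentences from Table 6 with the target words named in the text restored.
items = [
    make_item("This sweater is too tight. It needs to be stretched.", "stretched", 4),
    make_item("The rich man died and left all his wealth to his son.", "wealth", 2),
    make_item("If you blow up that balloon any more it will burst.", "burst", 3),
    make_item("The differences were so slight that they went unnoticed.", "slight", 2),
]
for item in items:
    print(item)
```

Because the testee must supply the rest of the word, a correct answer cannot be reached by eliminating alternatives, which is the property the format was chosen for.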

The rest of the test, as in the studies reported above, consisted of reading, writing, and grammar sections, thereby facilitating a comparison of lexical knowledge with broader language ability. It was predicted that scores on this AS-biased vocabulary test would reflect learners' knowledge of English independent of their ability to exploit L1-L2 lexical similarities, and hence would strongly predict scores on the broader measure. Such a result would have to be interpreted cautiously, however, because the method adopted to eliminate guessing (the c-test) also turned the measure into a test of production as well as recognition.

Results and Discussion

The mean score for vocabulary was 58.15% (SD=25.13) and the average of the broader measure scores was 57.56% (SD=12.56). The correlation between vocabulary and averaged broader measures was a substantial r=.90 (p<.001), or 81% of variance. This can be compared to the correlation of r=.59, or 35% of variance, between the regular Levels Test and reading comprehension reported in Study 4. The probable explanation for this discrepancy is that the vocabulary knowledge measured in the present study reflects true exposure to English, while that measured in Study 4 reflects exposure, guessing from cognates, elimination, and happenstance knowledge in unknown proportions -- statistical noise.

However, the finding should be interpreted cautiously; the modified test not only eliminated cognates but also called upon productive as well as receptive knowledge. 'Deep' (Qian, 1999) or 'productive' (Cobb, 1998) vocabulary knowledge has been shown to correlate with reading and other integrated measures more highly than recognition knowledge does. Some goals for a future study will be to vary the cognate and active-passive factors independently, to test the same learners with both original and modified versions of the Levels Test, and to compare the independent contributions of extensive GL and AS knowledge to the broader proficiency of higher level learners. A practical limitation of a cognate-eliminating test is that it can only be used with advanced learners. On the evidence above, most intermediate learners would simply fail such a test, generating no useful variance whatsoever to aid with their placement.

In the meantime, the finding of this study demonstrates the advantage in principle of developing vocabulary tests that incorporate information about specific L1-L2 interactions and provides a reason to do further research on this topic.

Conclusion

This series of studies began with some questions about vocabulary testing, gave reasons for choosing Nation's (1990) Levels Test for use with lower-intermediate francophone learners in Quebec, and then discussed some problems that were found with the test in this context. The test did not predict either dictionary look-ups or reading comprehension particularly well, and most importantly did not reveal a substantial gap in the high-frequency lexicons of most testees. The explanation provided for this was that the test allowed learners to answer about half the items correctly on the basis of knowing French-English cognates, and then a few more on the basis of guesswork or happenstance knowledge. This interpretation was supported by the finding that subjects' scores on non-cognate items contributed little or nothing to their scores on an integrated language task (reading comprehension). Therefore, it is concluded that a vocabulary test which ignores key facts about learners and their L1 -- in this case, the fact that francophone learners can answer questions about English words that they have not necessarily learned through exposure to English -- will not stand up to validity checks such as the look-up test or the prediction test. On the other hand, a test that does take such facts into account seems able to make strong predictions of broader proficiency.

It is important to note, however, that the method of handling L1-L2 interactions that was developed for Study 5, the elimination of cognates, could only be used with advanced learners, and that other methods would have to be developed for the majority of students signing up for language courses at institutions like the École de langues, who are more typically at intermediate or lower proficiency levels. What sort of vocabulary test could be used with intermediate francophone students? Ironically, in view of the arguments made above, an all-purpose, language-based test like the Levels Test probably serves these learners, and those who would teach them, rather well.

It seems clear that these learners' Levels Test scores are based partly on imaginative work with cognates, but then, imaginative work with cognates is a legitimate learning strategy that can hardly be eliminated from the early stages of learning a cognate language. Furthermore, as was seen in Study 4, variance on GL items seems able to predict about 55% of the variance in reading scores (r=.74), suggesting that not all learners are fully aware of cognates (confirming a point made by Lightbown & Libben, 1984). Such awareness is a useful quality to look for in a placement test, and where it is found missing or weak it should probably be instructed (Treville, 1996). On the other hand, at even the earliest stages it is desirable to know something about learners' knowledge of the parts of English that cannot be inferred from French. It seems that the likely design of an improved, interlanguage sensitive test for these learners will lie in the direction of measuring both cognate-handling skill and cognate independent knowledge deliberately and separately.

A suitable test for advanced learners, on the other hand, as suggested by the strong finding in Study 5, will almost certainly involve a greater emphasis on independent L2 abilities that can only be gained through familiarity with the L2 itself. It seems unavoidable, in other words, that we will need different tests for different levels of learners.

The broader question underlying the studies reported here concerned the prospects for standardization in L2 vocabulary testing. The evidence from the foregoing studies, while far from conclusive, suggests that if the goal is to have maximally predictive vocabulary tests, then we should not expect to find a single vocabulary test that will function across languages, as desirable as this would be for some purposes, nor even a test that can function across levels within a language.

English in Quebec: An important role for vocabulary tests

In the broad perspective, the Anglo-Saxon issue in these studies is merely an example of information that may be missed in an all-purpose or language-based vocabulary measure. In Quebec, however, the AS issue also has some importance in its own right.

A finding in the present study is that francophone learners are consistently weak on the AS side of English, and that it is quite possible for this fact to escape notice even with standard vocabulary testing. How important is it for Quebec francophones to know the AS side of English? It is true that most non-cognate terms have a roughly equivalent cognate version (mess - disorder, give - donate, eager - enthusiastic), so that learners with an eye to the costs and benefits of their labour might well decide to pay less attention to alternate versions of words they already know. This would be especially tempting since most of the irregularity and hence labour of learning English is piled up on the AS side (begin - began - begun vs. commence - commenced - commenced), to the extent that the AS strand may even require its own learning principles, i.e. more rote and less rule learning, as proposed by Pinker (1989, Ch.4). (Endnote 3)

However, a decision to ignore the AS side of the lexicon would be hasty. English word pairs that seem equivalent rarely are so (Sinclair, 1991), particularly on the dimensions of tone and register. Perspiration and sweat may be referentially equivalent, but they are hardly interchangeable between contexts. The twin strands of English are a rich resource for communicating information, especially pragmatic information. Some cognitive linguists argue that native words display more of the language's true nature, encoding the bodily and perceptual metaphors through which English speakers conceptualize their world and experience (Sweetser, 1990). Whether or not this is true, it is clear that English speakers employ the native lexicon heavily in everyday speech, with the result that francophones who know mainly the cognate side of English plus a few street or sports terms may well be able to say everything they want to say in English, but may understand less well what is said to them.

A theme in the literacy discussion in English Canada is that young anglophones often run up against a "lexical bar" (Corson, 1985, 1997) whereby their lack of familiarity with the Greco-Latin side of English hinders their educational, economic and even cognitive (Olson, 1994, p. 109) development. Ironically, a different sort of lexical bar may be operating among young francophones learning English--a reverse lexical bar but no less insidious. The first step in dealing with such a bar is to expose it with appropriate vocabulary testing. The second is to focus instruction on Quebec learners' real lexical needs since it seems clear these are not being met through either current instruction or incidental exposure.

Biographical information

Thomas Cobb is a professor of ESL in the Département de linguistique et de didactique des langues at the Université du Québec à Montréal. He has taught English for academic and professional purposes at universities in Saudi Arabia, the Sultanate of Oman, Hong Kong, and British Columbia. He has a PhD in educational technology from Concordia University in Montreal, with a specialization in research methods, instructional design, and computer assisted language learning.

Acknowledgements

Funding for this research was provided by the Social Sciences and Humanities Research Council of Canada (Grant No. 410-2000-1283). I am grateful to Marlise Horst and CMLR reviewers for helpful comments on previous drafts.

Notes

(Endnote 1) While the Levels Test has effectively become the standard or in any case most widely used vocabulary test, it should be mentioned that the test was developed in an Asian context and that Nation felt it was "not suitable for learners whose mother tongue is a language which has been strongly influenced by Latin" (1990, p. 262). The analysis presented here examines the prospects for any language-based vocabulary measure, and is not intended as a criticism of the Levels Test, which indeed was originally developed for specific learners.

(Endnote 2) A note on terminology: referring to these words as 'Anglo-Saxon' is more a convenience than an etymological claim. It is well known that many words popularly thought to be native English actually originate in Latin or even French (McArthur, 1998). No less English a word than beef is from Latin (bovem) via French (boeuf), although it is now fully nativized as indicated by its acceptance of native morphologies (beefy). The point is less etymology than learner perception (Carroll, 1992). In these studies, the term Anglo-Saxon (AS) shall refer to 'non-cognate with French, usu. Anglo-Saxon.'

(Endnote 3) Theories about the supposed difficulty of mastering the irregularities of AS must be weighed against Corson's (1995) account of the inherent psycholinguistic ease of acquiring and processing the AS lexicon. Its morphologies are relatively transparent, based mainly on compounding of monosyllables (headland, rooftop) and affixation that does not require additional phonological entries in the mental lexicon (happy - happiness, neighbour - neighbourhood, as compared to author - authority, certain - ascertain).

References

Abu Rabia, S. & Siegel, L.S. (1995). Different orthographies different context effects: The effects of Arabic sentence context in skilled and poor readers. Reading Psychology 16, 1-19.

Al-Hazemi, H. (1993). Low level EFL vocabulary tests for Arabic speakers. University of Wales: Unpublished PhD thesis.

Anderson, R.C. & Freebody, P. (1983). Reading comprehension and the assessment and acquisition of word knowledge. In J.T. Guthrie (Ed.), Comprehension and teaching: Research reviews (pp. 77-117). Newark, DE: International Reading Association.

Carroll, J.B., Davies, P., & Richman, B. (1971). The American Heritage Word Frequency Book. Boston: Houghton Mifflin.

Carroll, S.E. (1992). On cognates. Second Language Research 8, 93-119.

Cobb, T.M. (1997). From concord to lexicon: Development and test of a corpus-based lexical tutor. Unpublished PhD dissertation, Dept. of Education, Concordia University, Montreal. (Available at http://www.er.uqam.ca/nobel/r21270/thesis0.html).

Cobb, T.M. (1999). Applying constructivism: A test for the learner-as-scientist. Educational Technology Research and Development, 47 (3), 15-31.

Cobb, T., & Horst, M. (In press). Carrying learners across the lexical threshold. In J. Flowerdew & M. Peacock (Eds.), The EAP curriculum. London: Cambridge University Press.

Cobb, T., & Horst, M. (1999). Vocabulary sizes of some City University students. Journal of the Division of Language Studies of City University of Hong Kong, 1 (1), 59-68. (Also at http://www.er.uqam.ca/nobel/r21270/cv/CitySize.html.)

Cohen, A., Glasman, H., Rosenbaum-Cohen, P.R., Ferrara, J., & Fine, J. (1988). Reading English for specialized purposes: Discourse analysis and the use of student informants. In P.L. Carrell, J. Devine, & D. Eskey (Eds.), Interactive approaches to second language reading (pp. 152-167). New York: Cambridge University Press.

Corson, D. (1985). The lexical bar. Oxford: Pergamon Press.

Corson, D. (1995). Using English words. Norwell, MA: Kluwer Academic Publishers.

Corson, D. (1997). The learning & use of academic English words. Language Learning 47, 671-718.

Hulstijn, J.H. (1993). When do foreign-language readers look up the meaning of unfamiliar words? The influence of task and learner variables. Modern Language Journal 77, 139-147.

Hsia, S., Chung, P., & Wong, D. (1995). ESL learners' word organization strategies: A case of Chinese learners of English words in Hong Kong. Language and Education, 9, 81-102.

Hwang, K. (1989). Reading newspapers for the improvement of vocabulary and reading skills. Unpublished MA thesis. English Language Institute, Victoria University of Wellington, New Zealand.

Hwang, K., & Nation, P. (1994). VocabProfile: Vocabulary analysis software. English Language Institute, Victoria University of Wellington, New Zealand.

Johnson, R.K., & Ngor, Y.S. (1996). Coping with second language texts: The development of lexically based reading strategies. In D.A. Watkins & J.B. Biggs (Eds.), The Chinese Learner (pp. 123-140). Hong Kong: Faculty of Education, Hong Kong University.

Koda, K. (1988). Cognitive processes in second-language reading: Transfer of L1 reading skills and strategies. Second Language Research 4, 133-156.

Laufer, B. (1994). The lexical profile of second language writing: Does it change over time? RELC Journal, 25 (2), 21-33.

Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics 16, 307-322.

Laufer, B., & Nation, P. (1999). A vocabulary-size test of controlled productive ability. Language Testing, 16 (1), 33-51.

Lightbown, P., & Libben, G. (1984). The recognition and use of cognates by L2 learners. In R. Andersen (Ed.), Second languages: A cross-linguistic perspective (pp. 123-140). Rowley, MA: Newbury House.

Meara, P. (1980). Vocabulary acquisition: a neglected aspect of language learning. Language teaching and linguistics: Abstracts, 13, 221-246.

Meara, P. (1993). Tintin and the world service: A look at lexical environments. IATEFL: Annual Conference Report, 32-37.

Meara, P. (1996). The dimensions of lexical competence. In G. Brown, K. Malmkjaer, & J. Williams (Eds.), Performance and competence in second language acquisition (pp. 35-53). Cambridge: Cambridge University Press.

Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary tests. Language Testing 4, 142-154.

Meara, P., Lightbown, P.M., & Halter, R. (1994). The effect of cognates on the applicability of YES/NO vocabulary tests. Canadian Modern Language Review 50, 296-311.

Mezynski, K. (1983). Issues concerning the acquisition of knowledge: Effects of vocabulary training on reading comprehension. Review of Educational Research 53, 253-279.

Miller, G.A., & Gildea, P.M. (1987). How children learn words. Scientific American, 257 (3), 94-99.

Nagy, W.E., Herman, P.A., & Anderson, R.C. (1985). Learning words from context. Reading Research Quarterly 20, 233-253.

Nation, P. (1983). Testing and teaching vocabulary. Guidelines, 5 (1), 12-25.

Nation, P. (1990). Teaching and learning vocabulary. New York: Newbury House.

Nation, P. (1994). New ways in teaching vocabulary. Alexandria, VA: TESOL Inc.

Nation, P., & Waring, R. (1997). Vocabulary size, text coverage, and word lists. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: Description, acquisition, & pedagogy (pp. 6-19). New York: Cambridge University Press.

Olson, D. (1994). The world on paper: The conceptual & cognitive implications of writing & reading. New York: Cambridge University Press.

Qian, D.D. (1999). Assessing the roles of depth and breadth of vocabulary knowledge in ESL reading comprehension. Canadian Modern Language Review, 56 (2), 282-307.

Ryan, A. (1997). Learning the orthographic form of L2 vocabulary - a receptive and a productive process. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: Description, acquisition, & pedagogy (pp. 181-198). Cambridge: Cambridge University Press.

Ryan, A., & Meara, P. (1991). The case of the invisible vowels: Arabic speakers reading English words. Reading in a Foreign Language 7, 531-540.

Schmitt, N. (1995). An examination of the behaviour of four vocabulary tests. Paper presented at the Dyffryn Conference, Centre for Applied Language Studies, University of Wales, Swansea, April 1995.

Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.

Singleton, D. (1997). Learning and processing L2 vocabulary: State of the art article. Language Teaching 30, 213-225.

Stahl, S.A. (1991). Beyond the instrumentalist hypothesis: Some relationships between word meanings and comprehension. In P.J. Schwanenflugel (Ed.), The psychology of word meanings (pp. 157-186). Hillsdale, NJ: Erlbaum.

Sutarsyah, C., Nation, P., & Kennedy, G. (1994). How useful is EAP vocabulary for ESP? A corpus based case study. RELC Journal, 25 (2), 34-50.

Sweetser, E. (1990). From etymology to pragmatics: Metaphorical and cultural aspects of semantic structure. Cambridge: Cambridge University Press.

Treville, M.-C. (1996). Lexical learning and reading in L2 at the beginner level: The advantage of cognates. Canadian Modern Language Review 53, 173-190.

Willis, D. (1990). The lexical syllabus. London: Collins Cobuild.

Worthington, D., & Nation, P. (1996). Using texts to sequence the introduction of new vocabulary in an EAP course. RELC Journal, 27 (2), 1-11.

Appendix A. Computer adaptation of Levels Test, 2000 level (screen picture)

Appendix B. French L1-adapted Controlled Productive Levels Test, composed of AS items from three versions and two levels (2000 and UWL).

1. This sweater is too tight. It needs to be stre___.

2. The rich man died and left all his we___ to his son.

3. If you blow up that balloon any more it will bur___.

4. The differences were so sl___ that they went unnoticed.

5. The dress you're wearing is lov___.

6. It's the de___ that counts, not the thought.

7. He is walking on the ti___ of his toes.

8. She wan___ aimlessly in the streets.

9. This year long sk___ are fashionable again.

10. They had to cl___ a steep mountain to reach the cabin.

11. Plants receive water from the soil through their ro___.

12. La___ of rain led to a shortage of water in the city.

13. Many people in Canada mow the la___ on Sunday morning.

14. There has been a recent tr___ among prosperous families towards a smaller number of children.

15. She showed off her slen___ figure in a long narrow dress.

16. It was a cold day. There was a ch___ in the air.

17. His beard was too long. He decided to tr___ it.

18. You'll sn___ that branch if you bend it too far.

19. You must be aw___ that very few jobs are available.

20. The airport is far away. If you want to ens___ that you catch your plane, you'll have to leave early.