The original idea behind this website (written 1998)
Why & how to use frequency lists to learn words
Read a 2018 update on this piece here
By Tom Cobb, July 1998
For other takes on this topic see Meara, 1995, and Nation & Waring, 1997
Why would you want to learn the frequent words of English? For the simple reason that English, like any language, has the habit of recycling a relatively small number of words over and over again, and if you know these words then your reading power can be enhanced dramatically for a relatively modest learning investment.
With a random or 'discovery' approach to lexical growth, you will learn many words that are rare and relatively useless to you, yet fail to notice which words recur often enough to repay the effort of deliberate learning. The word lists presented on this website are the result of more than 50 years' work and are based on large-scale computational analysis of English text and speech corpora. They are intended to deliver the main words of English to you in a shortened time frame, and to deliver along with them enough contextual and definitional information to get solid learning underway.
Table 1. Percent of word tokens in an average text
How small a number are these 'main words' that are recycled over and over in English (or any other language)? Suppose your goal is to read academic English texts with good comprehension, and to use reading as a way to expand your vocabulary still further. In that case, your first goal should be to make sure you know the 2000 most frequent word families of English (headwords plus their main inflections and derivations), because these words make up roughly 80% of the individual words (word tokens) in any English text. This can be seen in Table 1, which shows data from the Brown corpus (accessible from this site), as cited in Nation (1990, p. 17) and Nation (2001, p. 15).
Table 1 shows that in English just a few word types account for most of the word tokens in any text. Ten words account for 23.7% of the ink on any page (repeated words like "the" and "of"). Just 1000 word families account for more than 70% of the words or ink, and 2000 account for about 80%. So you need to find out what these 2000 word families are and be sure you know them.
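The coverage arithmetic behind figures like these is easy to reproduce: count every token in a text, then ask what share of those tokens the n most frequent word types account for. The sketch below is a toy illustration only (an invented thirteen-word text, not Brown corpus data):

```python
from collections import Counter

def coverage_of_top_n(tokens, n):
    """Percent of all word tokens accounted for by the n most frequent word types."""
    counts = Counter(tokens)
    top = counts.most_common(n)
    return 100.0 * sum(c for _, c in top) / len(tokens)

# Toy corpus: function words like 'the' repeat, content words mostly do not.
text = "the cat sat on the mat and the dog sat on the log".split()
print(round(coverage_of_top_n(text, 2), 1))  # 46.2
```

Even in this tiny sample, the two most frequent types ("the" plus one other) cover nearly half the tokens, which is the same skew Table 1 reports at full-corpus scale.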
You could, of course, wait and meet these words "naturally" in the normal course of reading the texts that interest you, but this takes a long time. An alternative is to meet these words in convenient lists provided on this website. While it is true that nothing can replace the experience of meeting new words in rich natural contexts, some of this experience has been reproduced for you here by linking the word lists to a computer program called a "concordance", and from there to the dictionary WordNet. A concordance provides several contexts for each word, derived from a large collection of texts called a corpus. Is reading these computer contexts as useful as meeting words in natural contexts? Probably not, but research by Cobb (1997) suggests that using computer concordances can get the learning process off to a good start.
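The idea of a concordance is simple to sketch in code. The minimal keyword-in-context (KWIC) routine below is an illustration of the concept, not Lextutor's actual program; the sample "corpus" is just a fragment of the forestry sentence used later on this page:

```python
def concordance(tokens, keyword, width=4):
    """Keyword-in-context lines: up to `width` tokens on either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

corpus = ("the forests milled at the earliest opportunity the available "
          "wood supplies could further increase").split()
for line in concordance(corpus, "the"):
    print(line)
```

Each output line centers one occurrence of the keyword in its surrounding tokens, which is exactly the kind of multiple-context display a learner scans to infer a word's meaning and typical company.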
After the first 2000 words
However, Table 1 also presents some bad news about vocabulary growth. It suggests that after you have learned the most frequent 2000 words of English, simply continuing to accumulate more words on a frequency basis gives a much lower rate of return. You could learn another 3000 words (up to the 5000 frequency mark) and increase the amount of black ink coverage in an average text by only about 8%. The graph below, which is just the table above turned on its head, dramatizes the drop-off in coverage after 2000 words.
It is not obvious how to proceed after you have reached the 2000 mark. However, it seems clear that knowing 2000 words, or 80% of the words in an average text, is sufficient neither for comprehension of academic texts nor for further independent vocabulary acquisition through reading them.
What texts look like to different learners
[ *NEW 08* Generate other viewpoints on the full forestry text at www.lextutor.ca/cloze/vp/ - get up Demo, then choose 'post' ratios on the Menu (cloze all 'post_1k' words, etc.) ]

Here is what a text looks like to someone who knows the most frequent 2000 words and no others. Words that are not on the 2000 list have been replaced by gaps:
If _____ planting rates are _____ with planting _____ satisfied in each _____ and the forests milled at the earliest opportunity, the _____ wood supplies could further increase to about 36 million _____ meters _____ in the period 2001-2015. (Nation, 1990, p. 242.)
(Text A: 80% of words known)
Text A has 40 words, seven of which are unknown (7/40 = 17.5%). It seems clear that someone reading this text would get some idea of the topic, but not exactly what was being said about it.
Here is the same text with 95% of its words known, or 5% unknown:
If current planting rates are maintained with planting targets satisfied in each _____ and the forests milled at the earliest opportunity, the available wood supplies could further _____ to about 36 million cubic meters annually in the period 2001-2015.
(Text B: 95% of words known)
In Text B, the main idea of the text is reasonably clear. And the concepts needed to fill the two remaining gaps are also clear, so that if these had been new words instead of gaps there is a good chance the words would have been understood through inference.
In fact, research has shown that reading in a second language is reliably successful, and supports further vocabulary acquisition, when at least 95% of the individual words (word tokens) in a text are known. With fewer than that, the reader does not have enough to go on (Laufer 1989; 1992; Hirsh and Nation, 1992).
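Gapped versions of a text like Texts A and B are easy to generate automatically: blank every word token that is not on the learner's known-word list. The sketch below is a simplified illustration, not the routine used on this site, and the short known-word set is invented for the example:

```python
def cloze(text, known_words):
    """Replace word tokens that are not on the learner's known list with gaps."""
    out = []
    for tok in text.split():
        word = tok.strip(".,;:()").lower()
        if word in known_words or not word.isalpha():
            out.append(tok)          # known word, number, or date: keep it
        else:
            out.append("_____")      # unknown word: blank it
    return " ".join(out)

known = {"if", "planting", "rates", "are", "with", "in", "each", "and", "the"}
print(cloze("If current planting rates are maintained", known))
# If _____ planting rates are _____
```

Swapping in larger known lists (the first 1000, first 2000, and so on) produces progressively fuller views of the same text, which is how comparisons like Text A versus Text B are made.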
(You can see different views of the forestry text in full, or see how other texts look to learners with different amounts of lexical knowledge, at cloze/vp_cloze - enter your text and set the 'zone to cloze' to Post-1k, Post-2k, or Post-AWL.)

Two strategies for the journey from 80% to 95%
After learning the most frequent 2000 words, you can adopt one of two strategies for further vocabulary acquisition. Strategy 1 is simply to carry on up the slope of Figure 1, past the 2000 hump, learning words at the 3000, 4000, 5000 zones and far beyond. Any learner is bound to adopt this strategy to some extent -- thinking about and looking up interesting new words encountered randomly in newspapers, books, or movies.
However, there are some problems with this strategy. First, the learning task is enormous. The learner reaches the 90% mark only at 5000 words, or after another 3000 new words have been learned. Second, as well as being numerous, these words are difficult to learn because they are relatively infrequent and are not encountered over and over again. Third, while the first 2000 words have been identified and made somewhat easy to learn (e.g., on this website), useful frequency lists at the 3000, 4000, and 5000 zones and beyond are not available at present.
Strategy 2 is to take advantage of research that has been done to target the vocabulary needed for the different purposes a learner might have for reading. Nation and his colleagues in New Zealand have analysed academic texts and determined that across domains there are certain words that, while not necessarily frequent in the language at large, are very frequent in academic texts. These are normally Greco-Latin terms like "probability," "conclusion," and "hypothesis." There are approximately 570 of these words, and they have been brought together by Averil Coxhead as the Academic Word List (AWL). This list appears on this website and is included in the diagnostic tests.
The good news is that the 2000 list and the AWL together, a combined list of 2570 words, can bring the coverage of an academic text up to approximately 90%. In other words, if you know the first 2000 words plus the 570 AWL words, then you know about 90% of the words you will meet in any academic text. To see support for these claims, see examples of computer text analysis with VocabProfile on this site. You will see that there is a reasonably reliable profile of texts by frequency zone, with AWL words claiming a larger share as the texts you select become more "academic." For the rest of the journey (90% to 95%) you are, for the moment, pretty much on your own. But you have an adequate base for inferences and look-ups.
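A VocabProfile-style breakdown can be sketched as follows: assign each token to the first frequency zone whose list contains it, and report the percent of tokens per zone plus an "off-list" remainder. The zone lists below are tiny invented stand-ins, not the real 2000 list or AWL:

```python
def vocab_profile(tokens, zones):
    """Percent of tokens in each frequency zone; first matching zone wins."""
    counts = {name: 0 for name in zones}
    counts["off-list"] = 0
    for tok in tokens:
        for name, wordset in zones.items():
            if tok.lower() in wordset:
                counts[name] += 1
                break
        else:
            counts["off-list"] += 1
    return {name: 100.0 * n / len(tokens) for name, n in counts.items()}

# Tiny stand-in zone lists, for illustration only.
zones = {"first-2000": {"the", "of", "and", "in", "is"},
         "awl": {"hypothesis", "data", "analysis"}}
profile = vocab_profile("the analysis of the data is robust".split(), zones)
print(profile)
```

With the real lists loaded into `zones`, the same loop yields the kind of frequency-zone profile described above, with academic texts showing a visibly larger AWL share.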
As for how to use these lists to learn words, if you go to the List_Learn page you will find all these lists connected by simple mouseclicks to both a speech engine and a large corpus of natural English. In other words, each word can be heard and met in a diverse set of contexts. Such an encounter should get the learning process well under way, and then of course you will more readily recognize these words when you meet them again and further learning will occur.
In conclusion, students learning English anywhere on the planet can use this website to test themselves on how well they know the 2000 and AWL lists, fill any significant gaps they find, and make their way toward basic lexical competence.
Cobb, T. (1997). From concord to lexicon: Development and test of a corpus-based lexical tutor. PhD dissertation, Concordia University, Montreal.
Cognitive Science Lab, Princeton University. WordNet: A lexical database for English.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly 34 (2), 213-238.
Hirsh, D. & Nation, P. (1992). What vocabulary size is needed to read unsimplified texts for pleasure? Reading in a Foreign Language, 8 (2), 689-696.
Laufer, B. (1992). How much lexis is necessary for reading comprehension? In P.J. Arnaud & H. Béjoint (Eds.), Vocabulary and applied linguistics (pp. 126-132). London: Macmillan.
Nation, P. (1990). Teaching and learning vocabulary. New York: Newbury House.
Nation, P. (2001). Learning vocabulary in another language. New York: Cambridge University Press.
Nation, P., & Waring, R. (1997). Vocabulary size, text coverage, and word lists. In Schmitt, N., & McCarthy, M. (Eds.) Vocabulary: Description, acquisition, pedagogy (pp. 6-19). New York: Cambridge University Press.