Compiling French word frequency lists for the VAT: a feasibility study

Glyn Jones, Consultant to the Project


[ "The project" being the Open University Lexical Frequency Project, coordinated by Robin Goodfellow, who has kindly provided me with these lists. - Tom Cobb ]

 

Summary:

In my opinion it is quite feasible, within the budgeted time frame, to produce word lists which would enable the construction of, at the very least, a working demonstration version of the Vocabulary Assessment Tool for French. However, if the PAROLE corpus (see below) can be made available then it should be possible to do better than this: in fact to produce word lists that are as valid for French as the General Service List and University Word List (the lists used by Laufer & Nation) are for English.

 

1 Introduction

The aim of the Vocabulary Assessment Tool (VAT) project is to develop the necessary tools to derive a Lexical Frequency Profile (LFP) for texts written by learners of French, as an aid to assessing the quality of those texts.

The concept of the LFP was elaborated for English by Laufer & Nation (1995). They developed a computer program which counts the words in a learner’s text and sorts them into four categories:

The number of words in each category, as a proportion of the whole, constitutes the LFP of the text.

A critical aspect of Laufer & Nation’s approach is that the word lists which they use are lemmatised: that is, grouped into word families. The significance of this will be discussed in section 5 below.

In order to implement the same approach for French it is necessary to have word frequency lists for French analogous to those used by Laufer & Nation for English. The aim of this report is to assess the feasibility of finding or constructing such lists.

The objective can be broken down into four sub-goals:

This report will consider each of these sub-goals in turn.

 

2 A general frequency list

2.1 Requirements

Laufer & Nation (personal communication) used the General Service List (GSL) (West, 1953). This was developed from a corpus of 5 million words with the needs of EFL learners in mind. According to Coxhead (2000) the GSL contains "the most widely useful 2,000 word families in English" (p. 213), not strictly speaking the most frequent; West apparently used criteria other than frequency – namely ease of learning, coverage of useful concepts and stylistic level – in selecting words.

It would appear that no ready-made frequency list comparable to the GSL exists for French. For the VAT project to succeed, therefore, such a list must be constructed: preferably a list that is at least as valid for French as Laufer & Nation’s sources are for English. Given the age of the GSL it should not be difficult, on the face of it, to achieve this.

There are three possible ways of constructing such a list:

 

2.2 Pre-existing corpora

Ideally a word frequency list would be derived from a large-scale corpus analogous to the COBUILD Bank of English: comprising a comprehensive range of text types and geographical varieties, including spoken as well as written sources, with all texts being recent in origin. There are several collections of French texts which meet some of these criteria. Where they fall short of the ideal is that they are mostly based on a restricted range of text types. Any single one of them may be considered insufficiently heterogeneous to be representative of modern written French.

Among these collections the most noteworthy are:

FRANTEXT

This is a collection of literary texts held by the Institut National de la Langue Française (INALF). This is described as vast, and containing texts from all periods of French literature including the 20th century. However, the exact extent and composition of the corpus, like the texts themselves, is only accessible to subscribing institutions. Subscription costs FFR 2000 per year.

ABU

This is another collection of literary texts which has the advantage of being available free of charge via the Internet. However, it contains only a small quantity of 20th century material (and early 20th century at that): works by Anatole France, Valéry Larbaud and Raymond Radiguet totalling some 80,000 words.

SILFIDE (Serveur Interactif pour la Langue Française, son Identité, sa Diffusion et son Étude)

This project offers direct access, from its Web site, to a wide variety of texts: literary, journalistic, non-fiction and officialese. However, the service is under construction and it is not possible to judge the size of the texts at the moment, let alone download. Nor is it clear whether the service, once running, will be free of charge or limited to paying subscribers.

PAROLE

This is one of many collections held by ELRA (European Language Resources Association). It contains over 20 million words drawn from three main sources

The corpus is available on CD-ROM at a cost of EUR 1540 for members of ELRA or EUR 4300 for non-members. Membership costs EUR 750 for non-profit-making organisations. To judge from a sample, the texts are encoded in SGML (or possibly XML) rather than plain ASCII. Therefore some processing would be necessary before the texts could be analysed for frequency. It seems that there is no ready-made frequency list available for this corpus.
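The kind of pre-processing this implies can be sketched in a few lines of Python. This is only an illustration: the sample string and its tag names are invented, the actual PAROLE encoding has not been inspected in detail, and a real SGML corpus may need a proper parser and its own entity declarations rather than the simple pattern used here.

```python
import html
import re

# Matches anything that looks like an SGML/XML tag, e.g. <p> or </hi>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(marked_up: str) -> str:
    """Remove tags, decode character entities and normalise whitespace."""
    text = TAG_RE.sub(" ", marked_up)   # replace each tag with a space
    text = html.unescape(text)          # &eacute; becomes é, and so on
    return re.sub(r"\s+", " ", text).strip()


# Invented sample; the tag and attribute names are hypothetical.
sample = '<p>Le <hi rend="it">corpus</hi> est cod&eacute; en SGML.</p>'
print(strip_markup(sample))
```

The output of such a step would be plain text that a frequency counter can read directly.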

LIFC(?)

According to a very recent communication from Thierry Chanier the OU could have access to frequency data derived from a very large corpus of newspaper text: 52 million words from Le Monde and Le Soir. This list is apparently already lemmatised (although it is not known how or to what level). Thierry Chanier’s message implies, however, that the frequency data may be compromised by the addition to the list of words chosen for other, pedagogical criteria.

 

2.3 The DIY approach: compiling a corpus from scratch

Potentially, the Web is a source of huge quantities of written French. A wide variety of text types are available: journalism, personal statements, advertisements, learned articles, official government papers and reports.

However, accessing Internet texts in large quantities is not straightforward. Increasingly, text on the Web is split into sections that are stored in separate pages and, often, broken up by pictures and other graphical devices. Accessing a single edition of an on-line newspaper, for example, involves a tortuous navigation to and from the home page, sometimes through two or more levels of hierarchy. The prevalence of frames and database-driven pages makes it impractical to download whole sites automatically, and in any case little time would be saved in this way as the resulting HTML files would not be directly analysable for word frequency.

In a trial run, the most effective method of collecting text proved to be the following:

Using this approach it was possible, in the course of half a day, to collect a mini-corpus of some 78,000 words, ranging from a complete edition of a Canadian local newspaper (the slowest to collect) to an academic dissertation of 30,000 words (the quickest). Assuming some improvement through practice it should be possible to increase this rate of collection to 200,000 words per day. Even then it would take a full working week (five days) to collect a million words.

In short, although the DIY option offers maximum flexibility in the choice of material, it is not realistic to expect to compile by this method a corpus approaching the 5 million words of the GSL within the fifteen person-days budgeted for this project. Nevertheless, DIY collection on a more modest scale could be used to supplement ready-made collections (or lists) in order to broaden the range of text types.

 

2.4 The combined approach: using a pre-existing corpus together with newly collected texts

There is no reason in principle why a frequency list should not be based on a combination of different sources, and indeed this may be the most effective way of achieving a balance of text types.

In practice the only complication arises if it is not possible, for some reason, to combine the texts into one central corpus for analysis purposes. This would be the case, for example, if the LIFC data are supplied as a list without the original corpus. To use this in combination with other data requires two conditions to be met:

The merging program would then scan both lists and copy all words to a third master list. Any words occurring in one list only are simply copied with their frequency data intact. Any that occur in both lists are copied once only but with their frequency counts added together.
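The merging step described above is simple enough to sketch in Python. The sketch assumes that each list is available as a mapping from word to absolute count and that both lists count the same kind of unit; the function name and the sample counts are invented for illustration.

```python
from collections import Counter


def merge_frequency_lists(list_a: dict[str, int], list_b: dict[str, int]) -> dict[str, int]:
    """Build a master list from two word-to-count mappings.

    Words found in only one list keep their count unchanged;
    words found in both lists appear once, with the counts added together.
    """
    merged = Counter(list_a)
    merged.update(list_b)   # Counter.update adds counts, it does not overwrite
    return dict(merged)


# Invented counts, for illustration only.
first = {"abandon": 1, "abats": 3}
second = {"abats": 2, "abcès": 1}
print(merge_frequency_lists(first, second))
```

Note that adding raw counts gives the larger source more weight in the master list; whether that is desirable is a separate decision.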

 

2.5 Discussion

Of the ready-made sources available the LIFC word list is based on the largest corpus and is apparently free. However, it is based on a very restricted sampling of written French: just two national newspapers.

The most varied ready-made corpus is the PAROLE collection. It is true that more than half of its 20 million or so words come from Le Monde, so this portion of the corpus would be redundant if it were used in combination with the LIFC data. To offset against this though (and its not inconsiderable cost) there are two compelling reasons for acquiring this collection: the two million words of mixed origin, and the CNRS texts, which could be used to derive an academic word list (see below).

If both the LIFC and PAROLE (excluding Le Monde) data are used in combination, the resulting list is still based disproportionately on national newspapers. To balance this effect a sub-corpus of self-assembled texts, totalling about a million words, could be added within the project’s time budget.

The recommended approach is therefore to use a combination of these sources: the LIFC data, parts of the PAROLE corpus, and a self-assembled sub-corpus.

If this is not possible (because the PAROLE corpus is deemed too expensive), then the alternative is to use the LIFC data alone. Supplementing the LIFC data with self-assembled general texts would not be an option, because without the PAROLE corpus nearly all of the budgeted time would have to be devoted to compiling an academic corpus (see below).

 

3 An academic word list

Laufer & Nation used a list, called the University Word List (UWL), that was assembled manually by combining words from four other lists that had been compiled, in their turn, according to a variety of criteria. Nation himself recommends (personal communication) using the newer Academic Word List (AWL) compiled by Coxhead (2000). This is based on a corpus of 3.5 million words of academic text (mostly learned articles and chapters from textbooks) drawn from four main disciplines: arts, commerce, law and science, each subdivided into seven subject areas (yielding 28 specialist subject areas). The criteria for inclusion in the list were

This yielded a list of 570 words (or rather word families, see below).

Producing a similar list for French clearly requires a corpus of texts (and, of course, a general word list by which to determine which words to exclude).

The CNRS portion of the PAROLE corpus offers, fortuitously, a corpus of comparable size to Coxhead’s. However, it is not evident that it will provide similar coverage. One of the sources cited is a periodical called CNRS Info, whose articles cover (to judge by its Web site) mainly natural sciences, social sciences and history: no commerce and no law.

The recommended approach is to use the PAROLE corpus as a source of academic texts, but to be prepared to supplement this with self-assembled sub-corpora for subject areas not covered by PAROLE, notably commerce and law.

If the PAROLE corpus is not available then a very substantial proportion of the time budgeted for this project will have to be devoted to compiling an academic corpus.

To meet the requirements of the VAT a word list based solely on frequency should be sufficient. That is to say, it should not be necessary to apply the criterion of range invoked by Coxhead. However, if it is considered worthwhile, it might be possible to derive range data using Nation’s program RANGE which is designed precisely for this purpose.

 

4 Mechanically deriving frequency lists from corpora

A number of programs exist which will automatically count the number of word types in electronically stored text. For the purposes of this project the requirements are:

Monoconc meets both these criteria and could be used for the initial processing of both the general and academic corpora.
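For readers without access to a concordancer, the counting step itself can be sketched in Python. This is not a description of how Monoconc works: the tokenisation rule (runs of letters, lower-cased, so that l'abattoir is split into l and abattoir) is an assumption made for this example, as are the function name and the sample sentence.

```python
import re
from collections import Counter

# A "word" here is any run of Unicode letters, so accented
# characters such as é and è are kept inside the token.
WORD_RE = re.compile(r"[^\W\d_]+")


def type_frequencies(text: str) -> list[tuple[str, int, float]]:
    """Return (type, count, fraction of all tokens) sorted by type."""
    tokens = WORD_RE.findall(text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    # sorted() orders by code point, so accented letters sort after z;
    # a locale-aware sort would be needed for true French alphabetical order.
    return [(word, n, n / total) for word, n in sorted(counts.items())]


sample = "Il faut abandonner l'abattoir. Abandonner ?"
for word, count, fraction in type_frequencies(sample):
    print(f"{count}\t{fraction:.6f}\t{word}")
```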

Producing the academic word list involves a further step: systematically eliminating all the words that occur in the general word list. This could be done with a simple program or macro, or even by hand. In any case this has to be done after the general wordlist has been lemmatised, as its final composition is not known before this. (Strictly speaking, this step might not be necessary for the purposes of this project, as the VAT text analysis program could be written so as to search the general word list first. If a word is found there the program would not search for it in the academic word list, so it would not be counted twice, and it would be of no consequence that several words are included, redundantly, in both lists. However, if the academic word list is to be used for other purposes in future it might be necessary to eliminate the general words from it.)
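The lookup order suggested in the parenthesis above can be illustrated as follows. This is a simplified sketch, not the VAT program: it uses three categories only (the real profile would also split the general list into frequency blocks), and the function names, category labels and word lists are invented.

```python
from collections import Counter


def classify(token: str, general: set[str], academic: set[str]) -> str:
    """Search the general list first, so a word present in both
    lists is only ever counted as a general word."""
    if token in general:
        return "general"
    if token in academic:
        return "academic"
    return "not in lists"


def profile(tokens: list[str], general: set[str], academic: set[str]) -> dict[str, float]:
    """Proportion of the tokens falling into each category."""
    counts = Counter(classify(t, general, academic) for t in tokens)
    total = len(tokens)
    return {category: n / total for category, n in counts.items()}


# Invented lists; "analyser" is deliberately present in both.
general_list = {"le", "texte", "analyser"}
academic_list = {"analyser", "hypothèse"}
tokens = ["le", "texte", "analyser", "hypothèse", "zut"]
print(profile(tokens, general_list, academic_list))
```

With this ordering the redundant entry in the academic list is harmless, which is the point made above.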

 

5 Lemmatisation

5.1 Requirements

The word lists used by Laufer and Nation (the GSL and the UWL) are lemmatised. That is, the words are grouped in morphological families. When they refer to the 1000 most frequent words in English, therefore, they mean the 1000 most frequent families, not the 1000 most frequent word types.

Lemmatisation normally involves the grouping together of inflected forms of the same lexical headword. That is to say, the members of a lemma belong to the same word class. However, the concept of a word family, as applied by Laufer & Nation, goes beyond this. They refer to Bauer & Nation (1993), who distinguish six levels of relatedness. At level one words are not grouped at all and each word form is a family in its own right. At level six a family includes all words which can be derived by affixation from the same root, except those involving classical roots. They give no examples to illustrate this stricture, but presumably it refers to cases where, to perceive any relationship between the words, you have to be aware of a common Latin or Greek root present as a bound form, as in deception and reception (Latin capere). The criteria which distinguish the intervening levels are related to different groups of affixes and are for the most part (and by Bauer & Nation’s own admission) drawn arbitrarily.

In their 1995 article Laufer & Nation state that the words in their lists are grouped in families at level three. This is not borne out by the lists themselves as supplied with Nation’s software utilities, which are grouped at level six. Coxhead’s AWL is also grouped at level six.

At level three a family comprises all the words formed from a given root by inflection (their level two), plus all those formed by derivational affixation where the affix is productive in modern English (used in new coinages) AND the effect of affixation on word class and meaning is predictable AND the root is not modified beyond what is normal at level two. For example, at this level governable is in the same family as govern because the suffix –able is still productive in English, whereas government is not, because –ment is not productive.

In spite of the arbitrariness of Bauer & Nation’s system, and its dependence on the identification of specific English affixes, the levels in question here – three and six – could actually be applied to French. The stricture on classical roots at level six may need to be modified, as many such roots are productive in modern French, such as télé- in télécharger (download) and télécommande (remote control).

However, the point of grouping words in the LFP is that it is supposed to reflect learners’ vocabulary knowledge. If a learner knows one member of a family they are likely to know the others. The VAT ought to adopt a level of grouping which reflects learners’ knowledge of French vocabulary in a similar way, and it is not necessarily the case that this can be achieved by applying the same formal criteria as are valid for English (assuming that Laufer & Nation’s criteria are valid). This is an interesting and problematic issue, and one that cannot be decided within the scope of this study, nor probably within the preliminary development phase of the VAT. Provisionally I would recommend that the lists used for the VAT be grouped according to criteria approximating to Laufer & Nation’s level three, on the grounds that this is feasible, and that it can be expected to produce lists which are statistically similar to those used in the LFP (where the concept "second block of 1000 words" represents a similar level of linguistic difficulty and lexical richness in both languages).

 

5.2 How to do it

Although there are lemmatisers for French (that is, software tools which analyse French text and allocate word forms to word families) they all suffer from serious disadvantages, such as

Perhaps more serious than any of these factors, though, is that they are simply not designed to group words into lists, but to mark the words in a running text. Although they could be applied to a list (treating a list as if it were running text), the output would require a considerable amount of further processing before it would resemble one of Laufer & Nation’s structured lists.

Furthermore, what is required is not only a means of grouping together the word types in an existing list (the output of a mechanical frequency analysis), but an exhaustive listing of all the members of each family whether they are in the original list or not. For example, it is highly likely that the second person plural past historic form of parler (parlâtes) does not occur in a list of the most frequent words in the corpus; indeed, it may not occur in the corpus at all. However, if a learner should use the form parlâtes in a piece of written work analysed by the VAT we would presumably want the program to recognise it as a member of the family parler. This means that after grouping the words from the initial frequency analysis we need to supplement each group with all those words which belong in the family but do not occur in the corpus.

It is hard to see how to do this except, painstakingly, by hand.

The following illustration shows how this might be done in practice. Table 1 shows an extract from the frequency list for the mini-corpus referred to above as produced by Monoconc 2.0, sorted alphabetically and copied to an Excel spreadsheet. It contains, among other words, types belonging to the families abandonner and abattre. The first two columns show the frequency count of each type, first as an absolute number of occurrences, then as a fraction of the total word count.

Count  Fraction  Type
1      0.000011  abandon
1      0.000011  abandonné
1      0.000011  abandonnée
2      0.000022  abandonner
1      0.000011  abassi
3      0.000032  abats
4      0.000043  abattage
2      0.000022  abattoir
5      0.000054  abattoirs
1      0.000011  abattu
1      0.000011  abattues
1      0.000011  abattus
1      0.000011  abbeville
1      0.000011  abcès

Table 1

Table 2 shows the same list with the word families grouped. Each word type is still in the table with its original data, but three additional columns contain, respectively, the headword of the family to which a word belongs (for practical purposes it does not matter much which family member this is), then the total frequency count for the respective family (in both numerical formats). This, of course, is identical for every member of the family. This example has been processed manually. However, it may be expedient to write a macro, or a simple database application, to generate the contents of the additional columns once the words to be grouped have been selected.

Count  Fraction  Type        Headword  Family count  Family fraction
1      0.000011  abandon     abandon   5             0.000055
1      0.000011  abandonné   abandon   5             0.000055
1      0.000011  abandonnée  abandon   5             0.000055
2      0.000022  abandonner  abandon   5             0.000055
1      0.000011  abassi
3      0.000032  abats       abattre   17            0.000184
4      0.000043  abattage    abattre   17            0.000184
2      0.000022  abattoir    abattre   17            0.000184
5      0.000054  abattoirs   abattre   17            0.000184
1      0.000011  abattu      abattre   17            0.000184
1      0.000011  abattues    abattre   17            0.000184
1      0.000011  abattus     abattre   17            0.000184
1      0.000011  abbeville
1      0.000011  abcès

Table 2
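The macro or database application mentioned above might, as a minimal sketch, do no more than the following Python. The counts are those of Table 1 and the groupings those of Table 2; the function name and data layout are invented for this example, and the selection of which words belong together is still assumed to have been made by hand.

```python
from collections import defaultdict


def family_totals(rows: list[tuple[str, int]], headword_of: dict[str, str]) -> dict[str, int]:
    """Sum the counts of all word types assigned to the same headword.

    rows        -- (word type, absolute count) pairs from the frequency list
    headword_of -- manual grouping: word type -> headword of its family
    Types with no entry in headword_of are left ungrouped and skipped.
    """
    totals: dict[str, int] = defaultdict(int)
    for word, count in rows:
        head = headword_of.get(word)
        if head is not None:
            totals[head] += count
    return dict(totals)


rows = [
    ("abandon", 1), ("abandonné", 1), ("abandonnée", 1), ("abandonner", 2),
    ("abassi", 1),
    ("abats", 3), ("abattage", 4), ("abattoir", 2), ("abattoirs", 5),
    ("abattu", 1), ("abattues", 1), ("abattus", 1),
]
headword_of = {w: "abandon" for w in ("abandon", "abandonné", "abandonnée", "abandonner")}
headword_of.update({w: "abattre" for w in
                    ("abats", "abattage", "abattoir", "abattoirs",
                     "abattu", "abattues", "abattus")})

print(family_totals(rows, headword_of))
```

The family fraction in the last column would then follow by dividing each total by the number of tokens in the corpus.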

In Table 3 the original data have been removed and the additional forms for the verb paradigm abandonner have been inserted. These have been generated by a Word macro and (in this case) pasted into the spreadsheet. The same will need to be done for abattre, and indeed for every other word family. Incidentally, it doesn’t much matter if the inflected present participle adjective forms abandonnante, abandonnantes and abandonnants do not exist for this verb. The macro generates them willy-nilly as they do exist for a great many regular –er verbs.

Type             Headword  Family count  Family fraction
abandon          abandon   5             0.000055
abandonné        abandon   5             0.000055
abandonnée       abandon   5             0.000055
abandonner       abandon   5             0.000055
abandonner       abandon   5             0.000055
abandonne        abandon   5             0.000055
abandonnes       abandon   5             0.000055
abandonnons      abandon   5             0.000055
abandonnez       abandon   5             0.000055
abandonnent      abandon   5             0.000055
abandonnais      abandon   5             0.000055
abandonnait      abandon   5             0.000055
abandonnions     abandon   5             0.000055
abandonniez      abandon   5             0.000055
abandonnaient    abandon   5             0.000055
abandonnai       abandon   5             0.000055
abandonnas       abandon   5             0.000055
abandonna        abandon   5             0.000055
abandonnâmes     abandon   5             0.000055
abandonnâtes     abandon   5             0.000055
abandonnèrent    abandon   5             0.000055
abandonnerais    abandon   5             0.000055
abandonnerait    abandon   5             0.000055
abandonnerions   abandon   5             0.000055
abandonneriez    abandon   5             0.000055
abandonneraient  abandon   5             0.000055
abandonnerai     abandon   5             0.000055
abandonneras     abandon   5             0.000055
abandonnera      abandon   5             0.000055
abandonnerons    abandon   5             0.000055
abandonnerez     abandon   5             0.000055
abandonneront    abandon   5             0.000055
abandonnasse     abandon   5             0.000055
abandonnasses    abandon   5             0.000055
abandonnât       abandon   5             0.000055
abandonnassions  abandon   5             0.000055
abandonnassiez   abandon   5             0.000055
abandonnassent   abandon   5             0.000055
abandonné        abandon   5             0.000055
abandonnés       abandon   5             0.000055
abandonnée       abandon   5             0.000055
abandonnées      abandon   5             0.000055
abandonnant      abandon   5             0.000055
abandonnants     abandon   5             0.000055
abandonnante     abandon   5             0.000055
abandonnantes    abandon   5             0.000055

abassi                     1             0.000011
abats            abattre   17            0.000184
abattage         abattre   17            0.000184
abattoir         abattre   17            0.000184
abattoirs        abattre   17            0.000184
abattu           abattre   17            0.000184
abattues         abattre   17            0.000184
abattus          abattre   17            0.000184

Table 3
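The generation of a paradigm such as the one inserted in Table 3 can be sketched in Python. This is not the Word macro itself, only an illustration of the idea: it covers fully regular –er verbs alone (verbs with spelling changes, such as those in –ger, –cer or –yer, would need extra rules), and the function name is invented. The endings follow the order of the forms in Table 3.

```python
def regular_er_forms(infinitive: str) -> list[str]:
    """Generate inflected forms of a fully regular -er verb.

    Like the macro described in the text, this also produces the
    inflected present participle forms whether or not they are in use.
    """
    if not infinitive.endswith("er"):
        raise ValueError("expected an infinitive ending in -er")
    stem = infinitive[:-2]
    endings = [
        "er",                                            # infinitive
        "e", "es", "ons", "ez", "ent",                   # present
        "ais", "ait", "ions", "iez", "aient",            # imperfect
        "ai", "as", "a", "âmes", "âtes", "èrent",        # past historic
        "erais", "erait", "erions", "eriez", "eraient",  # conditional
        "erai", "eras", "era", "erons", "erez", "eront", # future
        "asse", "asses", "ât", "assions", "assiez", "assent",  # imperfect subjunctive
        "é", "és", "ée", "ées",                          # past participle
        "ant", "ants", "ante", "antes",                  # present participle
    ]
    return [stem + ending for ending in endings]


forms = regular_er_forms("abandonner")
print(len(forms), forms[:6])
```

Irregular verbs such as abattre would still have to be listed by hand or handled by a separate table of paradigms.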

The next step in this process involves creating a version of the above list reduced to headwords (i.e. in which each word family has only one entry). This is sorted in descending order of frequency in order to determine the composition of the word lists used in the VAT; that is, to specify which word families belong in the first 1000-word block and which in the second. The sorted list is then expanded again so that it again resembles Table 3, but with the frequency data replaced or supplemented by the numbers 1 or 2, indicating which block of words each item belongs to (and with words outside of this frequency range excluded). Finally, it should be a simple mechanical step to convert these data into whatever file format is required by the VAT.
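The ranking and re-expansion just described might be sketched as follows. The function names, the tie-breaking rule (alphabetical order of headword) and the tiny sample data are assumptions made for this example; the block size is a parameter so that the logic can be tried on a handful of families.

```python
def assign_blocks(family_totals: dict[str, int], block_size: int = 1000,
                  n_blocks: int = 2) -> dict[str, int]:
    """Rank headwords by descending family frequency and number the blocks.

    Returns headword -> block number (1 or 2 by default); families
    ranked below the last block are left out.
    """
    ranked = sorted(family_totals, key=lambda head: (-family_totals[head], head))
    kept = ranked[: block_size * n_blocks]
    return {head: rank // block_size + 1 for rank, head in enumerate(kept)}


def expand(headword_of: dict[str, str], blocks: dict[str, int]) -> dict[str, int]:
    """Give every family member the block number of its headword."""
    return {word: blocks[head] for word, head in headword_of.items() if head in blocks}


# Tiny illustration with a block size of 1 instead of 1000.
totals = {"abandon": 5, "abattre": 17, "abcès": 1}
members = {"abandonner": "abandon", "abattoir": "abattre", "abcès": "abcès"}
blocks = assign_blocks(totals, block_size=1)
print(expand(members, blocks))
```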

 

6 Conclusions

In my opinion it is quite feasible, within the budgeted time frame, to produce word lists which would enable the construction of, at the very least, a working demonstration version of the VAT. However, if the PAROLE corpus can be made available then it should be possible to do better than this: in fact to produce word lists that are as valid for French as the GSL and UWL (the lists used by Laufer & Nation) are for English.

I anticipate that the 15 person-days would be allocated approximately as follows:

  • Downloading and compiling additional corpus texts: 5 days

  • Writing ad-hoc program utilities and macros: 2 days

  • Lemmatising and formatting word lists: 6 days

  • Miscellaneous (project admin, contingency): 2 days

Total: 15 days

If the PAROLE corpus is available then the five days allowed for collecting additional material can be devoted to improving the coverage of what are already varied and appropriate collections of texts. If not, then this time would have to be devoted entirely to constructing an academic corpus. In that event the LIFC collection (derived from Le Monde and Le Soir) would have to stand in for a general corpus, and the academic corpus would be somewhat sparse at one million or so words. The resulting lists would no doubt be adequate for producing a preliminary trial version of the VAT but they could not claim similar validity to those used in the LFP.

Glyn Jones, Nov 2000

 

References

Books & Articles

Bauer, L. & I.S.P. Nation (1993). "Word families". International Journal of Lexicography 6/3

Coxhead, A. (2000). "A New Academic Word List". TESOL Quarterly 34/2

Laufer, B. & I.S.P. Nation (1995). "Vocabulary Size and Use: Lexical Richness in L2 Written Production". Applied Linguistics 16/3

West, M. (1953) "A General Service List of English Words". Longman, Green & Co., London

 

Software

Monoconc 2.0, Athelstan, Houston TX, USA

RANGE, available from Paul Nation’s Web site (see below)

 

Web sites

ABU (Association des Bibliophiles Universels) http://abu.cnam.fr/

CNRS Info www.cnrs.fr/Cnrspresse/cnrsinfo.html

ELRA http://www.icp.grenet.fr/ELRA/fr/cata/tabtext.html

INALF http://www.inalf.fr/

SILFIDE www.loria.fr/projets/Silfide/Index.html

Victoria University of Wellington, New Zealand (Paul Nation) www.vuw.ac.nz/lals