How and why to make a technical (domain-specific) vocabulary list

Last update: 30 March 2022

Why: To create a more complete word list with higher coverage for a particular corpus or set of texts. 1k with proper nouns + 2k + AWL will typically reach 90% coverage, and with a technical list in a coherent domain (e.g., law, engineering, medicine) this can often be expanded to 95%. Basic comprehension has been shown to correspond to knowing 95% of the words in a text (see Coverage page).
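
As a rough illustration of what "coverage" means here, the sketch below computes the percentage of running words (tokens) in a text that appear in a given word list. The function name coverage, the naive tokenizer, and the tiny sample word lists are all assumptions made for this example; they are not taken from VP-Classic and do not reflect how it works internally.

```python
import re

def coverage(text, known_words):
    """Return the percentage of tokens in `text` that appear in `known_words`."""
    # Naive tokenizer (an assumption): lowercase letter strings,
    # optionally with one internal apostrophe (e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    known = {w.lower() for w in known_words}
    hits = sum(1 for t in tokens if t in known)
    return 100.0 * hits / len(tokens)

# Hypothetical sample data, for illustration only.
sample = "The court found the defendant liable for the breach of contract"
general = {"the", "found", "for", "of", "court"}
technical = {"defendant", "liable", "breach", "contract"}

print(round(coverage(sample, general), 1))              # general list alone
print(round(coverage(sample, general | technical), 1))  # general + technical list
```

The difference between the two printed figures is the extra coverage contributed by the technical list, which is the quantity the procedure below tries to maximize with as few words as possible.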

How

THIS IS ONE APPROACH TO TECHNICAL LIST BUILDING

  1. Scan, web-scrape, or otherwise assemble a corpus of texts in your domain (the bigger the better, following some sort of selection principle if possible).

  2. Get the texts all into one big file, called a corpus, either using Corpus Builder or simple copy-paste, and save this big file as TEXT (so that the file name has the letters .txt at the end). The format is important.

  3. Run this file through VP-Classic, via either paste-in-window mode or upload if the text is larger than 50,000 words. Choose the 'Block Proper Nouns' option.

  4. The main thing of interest in the output is the Off-List (post-AWL) component, which is where your domain-specific lexis will be found.

  5. Turn the Off-List output into a frequency list. VP already gives you a frequency list for Off-List (in the TYPES section), but in its present form it is not fully ideal for the purpose of refining a technical list.

  6. Click the button 'Extract RAW list' at the bottom of Off-List Types. Copy the Off-List TOKENS (including all duplicates) into memory, open the Frequency Indexer, paste in or upload your Off-List words, and ask the program to give you a Most Frequent Words First list.

  7. You are looking for the high-frequency items in your corpus, probably about 200-300 words. Look down your list and choose a cut-off where frequency is dropping off, then do a trial run with it (steps 8-11). Remember what you are looking for: the smallest list (to learn) that gives the most coverage (when reading).

  8. In the frequency output, eliminate any remaining proper nouns, and then use your mouse to select the words down to the cut-off. Copy these into memory.

  9. Now you need to change these words into families or lemmas, so they will pull maximum coverage. Go to Familizer, paste in the words from your clipboard, and the output is your potential technical list. Save it to your own computer as a TEXT file with a relevant and memorable name.

  10. Now return to the VP-Classic entry window. Paste your user list into the option box 'User/Tech List' and run the analysis on a text from your domain.

  11. In the analysis output, a new pink category will appear below AWL, telling you the coverage your technical list has for this particular text.

  12. Repeat with several texts. If coverage is consistently low (under 5%), go back and choose a different cut-off at the frequency step. If consistently high, congratulations, here is your specialist vocab course - but keep challenging it with more test texts of the same type as your corpus texts before you build a course around it.

  13. After several iterations, either you have a potentially useful technical list in front of you (Bravo!), or else (a) your corpus was non-representative, (b) your domain does not possess any small body of high-frequency off-list lexis (no guarantees here), or (c) your domain is not in fact a domain, linguistically (is there a lexicon of haircutting, or of chicken farming?).
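
For readers who want to see the logic of the frequency and cut-off steps outside the web tools, here is a minimal sketch. It assumes a plain list of already-known words standing in for the 1k, 2k and AWL lists, uses a naive tokenizer, and works on word forms only: it does not group words into families as Familizer does, and it does not block proper nouns. All function names and the sample data are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer (an assumption): lowercase letter strings only.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def offlist_frequency(text, known_words):
    """Count tokens that are NOT in known_words, most frequent first."""
    known = {w.lower() for w in known_words}
    counts = Counter(t for t in tokenize(text) if t not in known)
    # Descending frequency, then alphabetical so ties come out in a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def candidate_list(freq_list, min_freq):
    """Keep only the words at or above the chosen frequency cut-off."""
    return [word for word, n in freq_list if n >= min_freq]

def added_coverage(text, candidates):
    """Percentage of tokens in `text` covered by the candidate list."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    cand = set(candidates)
    return 100.0 * sum(1 for t in tokens if t in cand) / len(tokens)

# Hypothetical miniature corpus and known-word list, for illustration only.
corpus = ("the tort claim and the tort defence the plaintiff filed "
          "a claim the tort was plain")
known = {"the", "and", "a", "was", "filed"}

freq = offlist_frequency(corpus, known)
tech = candidate_list(freq, min_freq=2)
print(freq)
print(tech)
print(added_coverage(corpus, tech))
```

Raising or lowering min_freq corresponds to choosing a different cut-off in step 7, and running added_coverage on several new texts from the same domain corresponds to the repeated testing in step 12.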