 Proper Noun Stripper

This routine removes capitalized mid-sentence words from a file using the following Perl regular expression (regex):
  $file =~ s/(?<=[^\.\!\?\:\'\n])\s+\b[A-Z][A-Za-z]+\b//g;
The pattern targets a run of whitespace \s+ where the character before it is not ^ one of the bracketed [ ] characters (terminal punctuation . ! ? : an apostrophe ' or a new line \n) and the word after it starts at a word boundary \b with a capital letter [A-Z] followed by at least one more letter, capital or not [A-Za-z]+ (so all-cap words like BBC are included, but the one-letter word I is not). Each such word is replaced with nothing // everywhere it is found in $file (globally, g).
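
For readers who want to try the rule outside Perl, here is a minimal Python sketch of the same idea. It is not the tool's own code: the names `MID_SENTENCE_CAP`, `strip_mid_sentence_caps`, and `sample` are hypothetical, and the sketch assumes the intended behavior is to consume the capitalized word itself (together with the whitespace in front of it), so that the word is what gets deleted.

```python
import re

# Python port of the substitution. Assumption: the capitalized word itself is
# consumed, so the word (not the text in front of it) is what gets deleted.
MID_SENTENCE_CAP = re.compile(r"(?<=[^.!?:'\n])\s+\b[A-Z][A-Za-z]+\b")


def strip_mid_sentence_caps(text: str) -> str:
    """Delete capitalized words of two or more letters, with their leading
    whitespace, unless the character before that whitespace is . ! ? : '
    or a newline."""
    return MID_SENTENCE_CAP.sub("", text)


sample = "Yesterday I met John Smith in Paris. He works for the BBC now."
print(strip_mid_sentence_caps(sample))
# prints: Yesterday I met in. He works for the now.
```

In this sample the sentence-initial words "Yesterday" and "He" survive because nothing qualifying precedes them, and "I" survives because the pattern requires at least two letters.
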
  • Proper-noun stripping is useful, e.g., when profiling a text in which proper nouns artificially inflate the off-list component (they are interpreted in context rather than 'learned' per se for future use, transferable meaning, etc.).
  • Note the assumption, almost certainly imperfect, that mid-sentence capitals correlate with proper nouns.
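
The two notes above can be illustrated with a toy profile. This is only a sketch under loud assumptions: `KNOWN_WORDS` is a made-up stand-in for a real frequency list, `offlist_counts` is a hypothetical helper, and the pattern is a Python port that assumes the capitalized word itself is meant to be deleted.

```python
import re

# Python port of the stripping rule (assumption: the word itself is removed).
MID_SENTENCE_CAP = re.compile(r"(?<=[^.!?:'\n])\s+\b[A-Z][A-Za-z]+\b")

# Hypothetical stand-in for a frequency list; a real profiler uses far larger lists.
KNOWN_WORDS = {"we", "visited", "the", "museum", "in", "with", "and", "was", "closed"}


def offlist_counts(text: str) -> tuple[int, int]:
    """Return (off-list tokens, total tokens) for a text."""
    tokens = re.findall(r"[A-Za-z]+", text)
    offlist = [t for t in tokens if t.lower() not in KNOWN_WORDS]
    return len(offlist), len(tokens)


text = "We visited the museum in Glasgow with Fiona and Hamish. Glasgow was closed."
stripped = MID_SENTENCE_CAP.sub("", text)

print(offlist_counts(text))      # (4, 13): proper nouns dominate the off-list share
print(stripped)                  # We visited the museum in with and. Glasgow was closed.
print(offlist_counts(stripped))  # (1, 10): the sentence-initial "Glasgow" survives
```

The off-list count drops from 4 of 13 tokens to 1 of 10, and the surviving sentence-initial "Glasgow" shows one way the capitalization assumption falls short.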

 1.  Text-Paste mode

 2.  Upload mode: choose an HTML or TXT file on your own disk, then save the result to your own disk as *.txt.

This algorithm was developed with Batia Laufer for use in VocabProfiling.