This program applies the following REGEX (Regular Expression) to the indicated file:
@capwords = ($no_lines =~ /[^\.!?:]\s+(?=(\b[A-ZÀ-Ü][A-Za-zà-ù]+\b))/g);
Its purpose is to remove mid-sentence capitals (i.e. most proper nouns) from your text: capitalized words that do not directly follow sentence-ending punctuation (. ! ? :).
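As a rough illustration of what the expression matches, here is a minimal sketch that runs it on a short string. The sample sentence is made up for this example, and $no_lines is assumed to hold the file contents as a single string; the rest of the program is not shown here.

```perl
use strict;
use warnings;
use utf8;                               # the pattern contains literal accented characters
binmode(STDOUT, ':encoding(UTF-8)');    # so accented words print correctly

# Hypothetical sample text; in the program, $no_lines is assumed to hold the file contents.
my $no_lines = "Yesterday we met Anna in Paris. The weather was fine! Later Émile called.";

# With /g in list context, the match returns every word captured by the
# group inside the lookahead: capitalized words whose preceding whitespace
# is not itself preceded by . ! ? or :
my @capwords = ($no_lines =~ /[^\.!?:]\s+(?=(\b[A-ZÀ-Ü][A-Za-zà-ù]+\b))/g);

print join(', ', @capwords), "\n";      # prints: Anna, Paris, Émile
```

"Yesterday", "The" and "Later" are skipped because they start the text or follow sentence-ending punctuation. Note that a sentence start preceded by two whitespace characters (for example, two spaces after a full stop) would still match, because the first whitespace character satisfies the negated class.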
Notes:
File must be TXT or HTML.
The program assumes a text with standard sentence punctuation (e.g., not an unpunctuated list or similar).
(Tag-stripping would be useful e.g. when building a corpus of web pages.)
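For HTML input, tag-stripping could be sketched along these lines. This is only an assumption about one simple way to do it, not a description of how the program handles tags; the variable names and the sample fragment are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical HTML fragment standing in for a downloaded web page.
my $html = '<p>Yesterday we met <b>Anna</b> in Paris.</p>';

# Naive tag removal: delete everything from '<' up to the next '>'.
# This does not handle comments, scripts, or '>' inside attribute values.
(my $text = $html) =~ s/<[^>]*>//g;

print "$text\n";    # prints: Yesterday we met Anna in Paris.
```

For real web pages a proper HTML parser is likely to be more reliable than a single substitution.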