HTML (etc) Proper Noun Stripper

This program applies the following Perl regular expression (regex) to the indicated file:
@capwords = ($no_lines =~ /[^\.!?:]\s+(?=(\b[A-ZÀ-Ü][A-Za-zà-ù]+\b))/g);
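Its purpose is to remove capitalized words that occur mid-sentence (i.e., most proper nouns) from your text. A word counts as mid-sentence when the character just before the preceding whitespace is not one of . ! ? or :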
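To see what the expression picks out, here is a minimal Python sketch that transliterates the Perl pattern above. The names find_capwords and strip_capwords are hypothetical, and the removal step is an assumption: the page shows only the matching expression, not how the program deletes the words it finds.

```python
import re

# Same pattern as the Perl line above, written for Python's re module.
CAPWORD = re.compile(r'[^.!?:]\s+(?=(\b[A-ZÀ-Ü][A-Za-zà-ù]+\b))')

def find_capwords(text):
    """Return the capitalized words that the pattern treats as mid-sentence."""
    return CAPWORD.findall(text)

def strip_capwords(text):
    """Assumed removal step: drop each matched word and the whitespace before it."""
    pieces = []
    last = 0
    for m in CAPWORD.finditer(text):
        keep_until = m.start() + 1      # keep the character before the whitespace
        pieces.append(text[last:keep_until])
        last = m.end(1)                 # skip the whitespace and the captured word
    pieces.append(text[last:])
    return "".join(pieces)

sample = "Yesterday Maria flew to Paris. She met Jean there."
print(find_capwords(sample))   # ['Maria', 'Paris', 'Jean']
print(strip_capwords(sample))  # Yesterday flew to. She met there.
```

"Yesterday" and "She" survive because they are sentence-initial: the first has nothing before it, and the second follows a full stop.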

Notes:
File must be TXT or HTML.

The program assumes a text with standard sentence punctuation (e.g., not an unpunctuated list or similar).
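The punctuation assumption can be illustrated with a short Python sketch using a transliteration of the same pattern (the sample strings are invented for illustration). In an unpunctuated list, every capitalized item after the first looks mid-sentence. The third case suggests that two spaces after a full stop may also cause a sentence-initial word to be matched, since the first space can itself count as the preceding character.

```python
import re

# Python transliteration of the page's Perl pattern.
CAPWORD = re.compile(r'[^.!?:]\s+(?=(\b[A-ZÀ-Ü][A-Za-zà-ù]+\b))')

punctuated = "We bought apples. Bread was cheap. Cheese was not."
unpunctuated = "Apples\nBread\nCheese"
double_spaced = "It rained.  Then it stopped."

print(CAPWORD.findall(punctuated))     # []
print(CAPWORD.findall(unpunctuated))   # ['Bread', 'Cheese']
print(CAPWORD.findall(double_spaced))  # ['Then']
```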

(Tag-stripping is useful, e.g., when building a corpus of web pages.)
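For readers who want to try tag-stripping locally, the following is a rough Python sketch using the standard library's html.parser. It is not this program's own method, and strip_tags is a hypothetical name. It keeps all text content, including anything inside script or style elements, so a real corpus build would likely need more filtering.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    """Return the text content of an HTML string with all tags removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.chunks)

print(strip_tags("<p>Visit <b>Montreal</b> &amp; Quebec.</p>"))
# Visit Montreal & Quebec.
```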


  Do this: [1] Browse your hard disk for the TXT or HTML file + [2]

  + [3] Save the resulting stripped file back to your own disk as *.txt.
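The same read, strip, save sequence could be reproduced offline along these lines. This is only a sketch under stated assumptions: save_stripped is a hypothetical helper, the file is assumed to be UTF-8, and the transform argument stands in for whatever stripping function you choose to apply.

```python
from pathlib import Path

def save_stripped(src_path, transform, encoding="utf-8"):
    """Read src_path, apply transform to its text, and save the result as *.txt.

    Returns the path of the file that was written.
    """
    src = Path(src_path)
    text = src.read_text(encoding=encoding)
    if src.suffix.lower() == ".txt":
        # Avoid overwriting a TXT input: write next to it under a new name.
        dst = src.with_name(src.stem + "_stripped.txt")
    else:
        dst = src.with_suffix(".txt")
    dst.write_text(transform(text), encoding=encoding)
    return dst
```

For example, save_stripped("page.html", my_stripper) would write page.txt in the same folder, where my_stripper is any function that takes a string and returns the stripped string.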


Stay tuned for more text processing tools...

T Cobb - UQAM