This code offers five preprocessing methods for English text: lowercasing, stemming, lemmatization, removal of stopwords, and text cleaning (i.e., removal of punctuation).
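As a rough illustration of what three of these methods do, here is a minimal standard-library sketch. It is not the code in normalize_text.py: the function bodies and the tiny `STOPWORDS` set are assumptions made for this example. Stemming and lemmatization are left out because they normally rely on an external library (for example, NLTK provides `PorterStemmer` and `WordNetLemmatizer`).

```python
import string

# Hypothetical, deliberately tiny stopword list used only for this sketch;
# a real script would likely load a fuller list from a library.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "was"}


def lowercasing(text: str) -> str:
    """Convert every character to lowercase."""
    return text.lower()


def text_cleaning(text: str) -> str:
    """Remove ASCII punctuation characters, leaving everything else intact."""
    return text.translate(str.maketrans("", "", string.punctuation))


def removal_of_stopwords(text: str) -> str:
    """Drop whitespace-separated words that appear in STOPWORDS (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)


sample = "The Count was waiting, in the courtyard."
print(lowercasing(sample))           # the count was waiting, in the courtyard.
print(text_cleaning(sample))         # The Count was waiting in the courtyard
print(removal_of_stopwords(sample))  # Count waiting, courtyard.
```

Note that stopword removal here leaves punctuation attached to words ("waiting,"), so in practice text cleaning would usually be applied as well.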
    foo@bar:~$ python normalize_text.py (<textfile.txt>) lowercasing
    foo@bar:~$ python normalize_text.py (<textfile.txt>) stemming
    foo@bar:~$ python normalize_text.py (<textfile.txt>) lemmatization
    foo@bar:~$ python normalize_text.py (<textfile.txt>) removal_of_stopwords
    foo@bar:~$ python normalize_text.py (<textfile.txt>) text_cleaning

Included in this repository is the file dracula.txt, which contains the full text of Dracula by Bram Stoker. The code takes the text file as the second input from the console (the argument right after the script name), so any text file can be used.
To run the code on dracula.txt, pass the file name followed by one of the preprocessing methods listed above:
    foo@bar:~$ python normalize_text.py dracula.txt (<preprocessing_method>)

Once the code has run, the output is stored in the text file output.txt. In addition, for each of the five methods currently available, there is a separate txt file with sample output: the first and last 25 words produced by that method.
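The command-line flow described above (text file first, method name second, result written to output.txt) could be wired up roughly as follows. This is a sketch under stated assumptions, not the repository's implementation: the `METHODS` table, `run`, `preview`, and `main` are hypothetical names, and only `lowercasing` is registered to keep the example short.

```python
import sys
from pathlib import Path


def lowercasing(text: str) -> str:
    return text.lower()


# Hypothetical dispatch table; the real script supports five method names.
METHODS = {"lowercasing": lowercasing}


def preview(words, n=25):
    """Return the first n and last n items, like the sample-output files."""
    return words[:n], words[-n:]


def run(text_path, method, out_path="output.txt"):
    """Apply the named method to the file's text and write the result to out_path."""
    if method not in METHODS:
        raise ValueError(f"unknown preprocessing method: {method}")
    text = Path(text_path).read_text(encoding="utf-8")
    result = METHODS[method](text)
    Path(out_path).write_text(result, encoding="utf-8")
    return result


def main(argv):
    """argv[1] is the text file, argv[2] the method name."""
    if len(argv) != 3:
        print("usage: python normalize_text.py <textfile.txt> <preprocessing_method>",
              file=sys.stderr)
        return 2
    processed = run(argv[1], argv[2])
    first, last = preview(processed.split())
    print(" ".join(first))
    print(" ".join(last))
    return 0

# In a script, this would be invoked as: raise SystemExit(main(sys.argv))
```

Keeping the file handling in `run` and the argument parsing in `main` makes each piece easy to test without touching the console.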