hemalr24/preprocessing-methods

Preprocessing Methods For English Text

This code offers five preprocessing methods for English text: lowercasing, stemming, lemmatization, removal of stopwords, and text cleaning (removal of punctuation).
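As a rough illustration of what three of these methods do, here is a minimal sketch using only the Python standard library. It is not the code in normalize_text.py: the function names and the tiny STOPWORDS set are hypothetical, and stemming and lemmatization are left out because they normally rely on an external NLP library.

```python
import string

# Hypothetical, deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def lowercase(text):
    """Lowercasing: convert every character to lower case."""
    return text.lower()


def clean_text(text):
    """Text cleaning: delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stopwords(text):
    """Removal of stopwords: drop whitespace-separated tokens found in STOPWORDS."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)


sample = "The Count, it is said, lives in a castle."
print(lowercase(sample))
print(clean_text(sample))
print(remove_stopwords(sample))
```

Note that remove_stopwords only matches whole tokens, so a token with attached punctuation (such as "it," with a trailing comma) would not be removed unless the text is cleaned first.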

Lowercasing

foo@bar:~$ python normalize_text.py <textfile.txt> lowercasing

Stemming

foo@bar:~$ python normalize_text.py <textfile.txt> stemming

Lemmatization

foo@bar:~$ python normalize_text.py <textfile.txt> lemmatization

Removal of Stopwords

foo@bar:~$ python normalize_text.py <textfile.txt> removal_of_stopwords

Text Cleaning

foo@bar:~$ python normalize_text.py <textfile.txt> text_cleaning

Included Text

This repository includes the file dracula.txt, which contains the full text of Dracula by Bram Stoker. The code reads the text file name from the command line (the argument immediately after the script name), so any text file can be used.

To run the code on dracula.txt, use the following command with one of the preprocessing methods listed above:

foo@bar:~$ python normalize_text.py dracula.txt <preprocessing_method>
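The command-line convention above (file name first, method name second) could be handled along the following lines. This is only a sketch under assumptions: parse_args and METHODS are hypothetical names, and the actual argument handling in normalize_text.py may differ.

```python
import sys

# Method names as they appear in the commands above.
METHODS = {
    "lowercasing",
    "stemming",
    "lemmatization",
    "removal_of_stopwords",
    "text_cleaning",
}


def parse_args(argv):
    """Return (path, method) from an argv-style list, or exit with a message."""
    if len(argv) != 3:
        raise SystemExit(
            "usage: python normalize_text.py <textfile.txt> <preprocessing_method>"
        )
    path, method = argv[1], argv[2]
    if method not in METHODS:
        raise SystemExit(f"unknown preprocessing method: {method}")
    return path, method


if __name__ == "__main__" and len(sys.argv) > 1:
    path, method = parse_args(sys.argv)
    print(f"would apply {method} to {path}")
```

Here argv[0] is the script name, so the text file is the second element of the list and the method is the third.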

Output

Once the code is run, the output is stored in the text file output.txt. In addition, for each of the five methods currently available, there is a separate .txt file with sample output: the first and last 25 words produced by that method.
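A sample of the first and last 25 words can be taken from a list of output words with two slices. The helper below is a hypothetical sketch, not the code that produced the sample files; it assumes the output has already been split into words and that n is positive.

```python
def head_and_tail(words, n=25):
    """Return the first n and last n items of a list of words (n must be > 0)."""
    return words[:n], words[-n:]


# Stand-in for the words of a processed text.
words = [f"w{i}" for i in range(100)]
first, last = head_and_tail(words)
print(" ".join(first))
print(" ".join(last))
```

If the text has fewer than n words, both slices simply return the whole list.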

References

  1. https://www.ibm.com/think/topics/stemming#:~:text=Porter%20stemmer&text=Essentially%2C%20this%20stemmer%20classifies%20every,of%20consonant%20and%20vowel%20groups.
  2. https://www.ibm.com/think/topics/stemming
