Preprocessing Methods For English Text

This code offers five preprocessing methods for English text: lowercasing, stemming, lemmatization, removal of stopwords, and text cleaning (i.e., removal of punctuation).

Lowercasing

foo@bar:~$ python normalize_text.py (<textfile.txt>) lowercasing
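As a rough illustration of what this method does, lowercasing can be sketched with Python's built-in `str.lower`. The function name `lowercase_text` is hypothetical and is not taken from normalize_text.py.

```python
def lowercase_text(text: str) -> str:
    """Return the text with every cased character converted to lowercase."""
    return text.lower()


sample = "Jonathan Harker's Journal"
print(lowercase_text(sample))  # jonathan harker's journal
```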

Stemming

foo@bar:~$ python normalize_text.py (<textfile.txt>) stemming
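Stemming reduces a word to a crude root by stripping suffixes, so the result is not always a real word. The sketch below is a deliberately tiny suffix stripper for illustration only. It is not the Porter algorithm, and it is not necessarily what normalize_text.py uses; the suffix list and the name `simple_stem` are assumptions.

```python
# Hypothetical suffix list, checked in this order (longer suffixes first).
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def simple_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["walking", "walked", "castles", "slowly", "is"]
print([simple_stem(w) for w in words])
# ['walk', 'walk', 'castl', 'slow', 'is']
```

Note that "castles" becomes "castl": a stem only needs to be consistent across related word forms, not readable.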

Lemmatization

foo@bar:~$ python normalize_text.py (<textfile.txt>) lemmatization
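Lemmatization maps a word to its dictionary form, which generally requires a vocabulary rather than suffix rules. The lookup table below is a hypothetical stand-in for a real lemmatizer, meant only to show the idea; the table contents and the name `simple_lemmatize` are assumptions, not the script's implementation.

```python
# Tiny illustrative vocabulary: inflected form -> dictionary form (lemma).
LEMMAS = {
    "was": "be",
    "were": "be",
    "wolves": "wolf",
    "children": "child",
}


def simple_lemmatize(word: str) -> str:
    """Return the lemma if the word is in the table, else the word unchanged."""
    return LEMMAS.get(word.lower(), word)


print([simple_lemmatize(w) for w in ["wolves", "were", "howling"]])
# ['wolf', 'be', 'howling']
```

Unlike the stemming case, irregular forms such as "were" resolve to a real word ("be"), while words missing from the table pass through untouched.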

Removal of Stopwords

foo@bar:~$ python normalize_text.py (<textfile.txt>) removal_of_stopwords
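Stopword removal drops very common words that carry little content. A minimal sketch follows, assuming whitespace-split tokens and a small hand-written stopword set; a real run would normally use a much larger list, and both `STOPWORDS` and `remove_stopwords` are hypothetical names.

```python
# Small illustrative stopword set; real lists contain well over a hundred entries.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "was"}


def remove_stopwords(words):
    """Return the words that are not stopwords, compared case-insensitively."""
    return [w for w in words if w.lower() not in STOPWORDS]


tokens = "The castle was on the edge of a terrible precipice".split()
print(remove_stopwords(tokens))
# ['castle', 'edge', 'terrible', 'precipice']
```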

Text Cleaning

foo@bar:~$ python normalize_text.py (<textfile.txt>) text_cleaning
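Punctuation removal can be sketched with the standard library alone. The example below assumes that "punctuation" means the ASCII characters in `string.punctuation`; typographic characters such as curly quotes would need to be added separately. The name `remove_punctuation` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Return the text with ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)


print(remove_punctuation("Welcome to my house! Enter freely, and of your own will."))
# Welcome to my house Enter freely and of your own will
```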

Included Text

This repository includes the file dracula.txt, which contains the full text of Dracula by Bram Stoker. The code reads the name of the text file from the command line, as the argument that follows the script name, so any text file can be used.

To run the code on dracula.txt, use the following command with one of the preprocessing methods listed above:

foo@bar:~$ python normalize_text.py dracula.txt (<preprocessing_method>)
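The command shape above (script name, text file, method) could be validated along the following lines. This is a sketch under the assumption that the script takes exactly two arguments; `parse_args` and `METHODS` are hypothetical names, and the actual argument handling in normalize_text.py may differ.

```python
# Method names as they appear in the commands in this README.
METHODS = (
    "lowercasing",
    "stemming",
    "lemmatization",
    "removal_of_stopwords",
    "text_cleaning",
)


def parse_args(argv):
    """Return (text_path, method) from an argv-style list, or raise ValueError."""
    if len(argv) != 3:
        raise ValueError(
            "usage: python normalize_text.py <textfile.txt> <preprocessing_method>"
        )
    _, text_path, method = argv
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; choose one of: {', '.join(METHODS)}")
    return text_path, method


# In a script this would be called as parse_args(sys.argv).
print(parse_args(["normalize_text.py", "dracula.txt", "stemming"]))
# ('dracula.txt', 'stemming')
```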

Output

Once the code is run, the output is stored in the text file output.txt. In addition, for each of the five methods currently available, the repository includes a separate sample file (lowercasing.txt, stemming.txt, lemmatization.txt, removal_of_stopwords.txt, and text_cleaning.txt) containing the first and last 25 words produced by that method.
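A sample consisting of the first and last 25 words can be taken from a word list with two slices. The helper below is a hypothetical sketch of that step, not the code that produced the sample files; it assumes `n` is at least 1.

```python
def sample_words(words, n=25):
    """Return (first n words, last n words) of a list; assumes n >= 1."""
    return words[:n], words[-n:]


words = [f"w{i}" for i in range(100)]
first, last = sample_words(words)
print(first[0], first[-1], last[0], last[-1])
# w0 w24 w75 w99
```

If the list holds fewer than `2 * n` words, the two samples overlap, which is harmless for a full-length book.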

