This tutorial, led by Dr. Shaheen Khatoon, Senior Lecturer in Data Science at the School of Architecture, Computing, and Engineering, University of East London, UK, introduces basic Natural Language Processing (NLP) techniques using PySpark. The focus is on preprocessing text, converting text into numerical form, and applying a regression-based classifier (such as logistic regression) to a text corpus.
- Dr Shaheen Khatoon, Ph.D., PMP, PMI-ACP, ITIL, PgCert TLHE
- Senior Lecturer in Data Science (Computer Science and Digital Technologies)
- School of Architecture, Computing and Engineering
- University of East London, UK
By the end of this tutorial, you will learn:
- Basic NLP techniques to preprocess text.
- Different techniques to convert text into numerical form (CountVectorizer, HashingTF, and TF-IDF).
- How to apply a regression-based classifier (such as logistic regression) to the text corpus.
To follow along, you will need:
- Basic knowledge of Python and PySpark.
- PySpark environment set up for development.
- An understanding of NLP concepts is beneficial but not mandatory.
The tutorial covers the following major steps in handling text data for ML modeling:
- Reading the corpus.
- Tokenization.
- Cleaning/stopword removal.
- Stemming.
- Converting into numerical form.
The tutorial works with a corpus, that is, an entire collection of text documents such as emails, messages, or user reviews. We start by reading the corpus and applying basic preprocessing to the text.
We divide the text into tokens, remove unnecessary characters such as punctuation, and explore how to do both in PySpark.
We remove common words that add little value to the analysis, using PySpark's StopWordsRemover.
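We discuss reducing words to their base form. Stemming does this by stripping suffixes, while the related technique of lemmatization maps each word to its dictionary form; the focus here is on using the nltk library for lemmatization.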
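The sketch below contrasts the two approaches using nltk. `PorterStemmer` is shown for stemming because it needs no extra data; `WordNetLemmatizer` requires the WordNet corpus (typically fetched with `nltk.download("wordnet")`), so the lemmatization part is guarded in case that data is missing. The sample tokens are assumptions for illustration.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = ["movies", "running", "loved"]

# Stemming: rule-based suffix stripping; the result is not always a real word.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
print("stems:", stems)

# Lemmatization: dictionary lookup against WordNet.
# A LookupError is raised if the WordNet data has not been downloaded.
try:
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    print("lemmas:", lemmas)
except LookupError:
    lemmas = None
    print('WordNet data not found; run nltk.download("wordnet") first.')
```

To use either function on a Spark DataFrame column, it could be wrapped in a user-defined function, provided nltk is installed on every worker.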
We explore methods like Bag of Words, CountVectorizer, and TF-IDF to convert text into a form that can be used by ML algorithms.
Finally, we apply what we've learned to perform text classification, predicting the sentiment of movie reviews.
To work through the tutorial:
- Ensure PySpark is installed and correctly set up.
- Download the necessary datasets as indicated in the tutorial.
- Follow the step-by-step instructions provided in each section.
Suggested exercises:
- Experiment with TF-IDF and observe the difference in results compared to using CountVectorizer.
- Try different classification models to identify the best model using various feature engineering techniques.
In the upcoming session, we will continue with other text representation techniques, such as word2vec.
Thank you for participating in this tutorial. Happy Learning!