Skip to content

This project implements a machine learning model for text classification using Support Vector Machine (SVM). The model is designed to predict IAB categories for articles based on their content.

Notifications You must be signed in to change notification settings

89Aman/text-classification-model

Repository files navigation

Text Classification Model for Predicting IAB Categories

1. Approach Overview

The goal of this project is to develop a machine learning model that can predict IAB categories for articles based on their text content. We use a supervised learning approach where a training dataset with known text and categories is used to train the model, and a test dataset with unknown categories is used to evaluate the model's performance.

Key Steps:

  1. Data Preprocessing:

    • We load large CSV datasets (up to 1.24 GB) in manageable chunks to prevent memory overflow.
    • The text data is vectorized using HashingVectorizer, which converts the raw text into a numerical format suitable for machine learning.
  2. Feature Engineering:

    • HashingVectorizer: This vectorizer is a fast and memory-efficient alternative to TfidfVectorizer. It hashes text into a fixed number of features (set to 5000 in our case). This allows us to handle large datasets and high-dimensional text data without requiring all terms to be stored in memory.
    • Stop Words Removal: We removed common English stop words to improve the model’s performance by eliminating frequently occurring but unimportant words.
  3. Model Selection:

    • Support Vector Machine (LinearSVC): We initially chose Linear Support Vector Classification (SVM) because of its strong performance in text classification tasks. However, SVM is computationally expensive, particularly for large datasets.
    • Naive Bayes (MultinomialNB): To improve speed, we also implemented Naive Bayes, which is faster for large datasets and often performs well for text classification.
  4. Parallel Processing:

    • The model training process is parallelized using joblib and the parallel_backend function, enabling multi-core training to take full advantage of the available hardware (12th Gen Intel Core i5-12300F, 6 cores, 12 logical processors).
  5. Model Evaluation:

    • Predictions are made on a test dataset using the trained model.
    • The predicted categories are saved in a CSV file along with the corresponding Index from the test data for easy evaluation.

2. Feature Engineering Details

  • Text Vectorization: We used the HashingVectorizer from scikit-learn to transform text data into a high-dimensional sparse matrix. This matrix represents the frequency of hashed terms in the text and is efficient for handling large datasets. The vectorizer transforms the text into 5000 features (hashed values), which is a reasonable dimensionality for a large text dataset.

    • Pros of HashingVectorizer:
      • Fast and memory-efficient.
      • No need to store a vocabulary, making it ideal for large datasets.
      • Handles unseen words gracefully by hashing them into the same space.
  • Stop Words Removal: Stop words (like "the", "is", "in") are common words that do not provide meaningful information in most text classification tasks. By removing these stop words, we reduced the dimensionality of the input text and improved model performance.

4. Source Files

  • text-classification-model.py: The Python script that loads the training and test data, vectorizes the text, trains the model, makes predictions, and saves the results.
  • train.csv: The large training dataset containing ~697,528 rows of text data and target IAB categories.
  • test.csv: The test dataset used for making predictions, containing text and a unique index for each entry.
  • predicted_test_data.csv: The output CSV file containing the predicted categories for each article along with the corresponding index.

5. Model Execution

To Run the Model:

  1. Install the necessary libraries:

  2. Run the model with the following command:

  • Replace --model svm with --model nb to use Naive Bayes instead of SVM.
  1. The results will be saved in predicted_test_data.csv with two columns: target and Index.

6. Potential Improvements

  • Data Augmentation: To further improve accuracy, data augmentation techniques such as synonym replacement, paraphrasing, or adding noisy examples could be explored.
  • Hyperparameter Tuning: Grid search or random search could be used to tune hyperparameters and improve model performance.
  • Ensemble Models: Combining multiple models (e.g., SVM + Naive Bayes) using an ensemble approach might yield better predictions.
  • Cloud-Based Solution: For even faster processing of large datasets, a cloud-based service with GPU acceleration could be used.

About

This project implements a machine learning model for text classification using Support Vector Machine (SVM). The model is designed to predict IAB categories for articles based on their content.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages