This repository contains the code for Project 2 of EPFL's Machine Learning course. The goal of this project is to perform sentiment analysis on a dataset of tweets.
In order to run the code properly, you will need the following packages:
- NLTK (the Natural Language Toolkit)
In this library, you will specifically need (see the download snippet after the package list):
- PorterStemmer
- TweetTokenizer
- Corpus of stop-words
- NumPy
- Scikit-Learn
- Matplotlib
- Gensim
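Of the NLTK items above, only the stop-words corpus requires a separate download; a minimal snippet to fetch it:

```python
import nltk

# Download the stop-words corpus used during preprocessing.
# PorterStemmer and TweetTokenizer ship with the NLTK package
# itself and need no extra download.
nltk.download("stopwords")
```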
The structure of the files is the following:
- `helpers.py` contains the function used for the CSV submission.
- `preprocessing.py` contains functions to load and preprocess tweets.
- `w2v.py` contains functions for creating and training a Word2Vec model.
- `pipeline.py` contains the Scikit-Learn pipeline we used for the final version.
- `cross_validation.py` contains functions for performing the cross-validation method we used.
- `plot.py` contains functions to plot the results of the cross-validation.
- `Project2-ML.ipynb` is the notebook used when choosing which method was the best.
- `run.py` contains the code to load the data and run one of the models in order to compute predictions.
The train and test data can be found in the `data/` folder.
The command `python3 run.py` computes the predictions for the best model and writes the predictions for the test data to a file named `submission.csv`.
`helpers.py` contains the function used to create the CSV submission for CrowdAI.
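A minimal sketch of such a submission helper, assuming an `Id`/`Prediction` CSV layout and a `create_csv_submission(ids, y_pred, name)` signature (both assumptions, not necessarily what `helpers.py` defines):

```python
import csv

def create_csv_submission(ids, y_pred, name):
    """Write predictions to a CSV file with Id/Prediction columns.

    ids    : iterable of tweet ids
    y_pred : iterable of predicted labels (e.g. -1 / 1)
    name   : output file name, e.g. "submission.csv"
    """
    with open(name, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["Id", "Prediction"])
        writer.writeheader()
        for i, p in zip(ids, y_pred):
            writer.writerow({"Id": int(i), "Prediction": int(p)})
```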
`preprocessing.py` contains the functions to load and prepare the train and test data, as well as the function used to tokenize the tweets.
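A small sketch of how the NLTK pieces listed above fit together for tokenization; the function name `tokenize_tweet` and the exact tokenizer options are assumptions:

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# Tokenizer tuned for tweets: lowercases, drops @handles, and
# shortens elongated words like "soooo" to "sooo".
_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
_stemmer = PorterStemmer()
_stop_words = set(stopwords.words("english"))

def tokenize_tweet(tweet):
    """Tokenize a raw tweet, remove stop words, and stem each token."""
    tokens = _tokenizer.tokenize(tweet)
    return [_stemmer.stem(t) for t in tokens if t not in _stop_words]
```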
`w2v.py` contains the function to create and train a Word2Vec model; here we used a modified variant called FastText. It also contains a function used to convert tweets to vectors.
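A sketch of both steps with Gensim, assuming the Gensim 4 API (`vector_size`; older releases call this parameter `size`) and averaging word vectors per tweet, which is one common way to turn token embeddings into a single tweet vector:

```python
import numpy as np
from gensim.models import FastText

def train_fasttext(tokenized_tweets, dim=100):
    """Train a FastText model on a list of token lists.

    Parameter values here are placeholders, not the ones used in w2v.py.
    """
    return FastText(sentences=tokenized_tweets, vector_size=dim,
                    window=5, min_count=1)

def tweet_to_vector(model, tokens):
    """Average the word vectors of a tweet's tokens.

    FastText builds vectors from character n-grams, so it can also
    embed words that never appeared during training.
    """
    if not tokens:
        return np.zeros(model.wv.vector_size)
    return np.mean([model.wv[t] for t in tokens], axis=0)
```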
`pipeline.py` contains the function that creates a Scikit-Learn pipeline using a Bag-of-Words representation of the tweets with TF-IDF weighting. To classify the resulting vectors, we then use the LinearSVC classifier.
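A minimal sketch of such a pipeline; the hyper-parameters are placeholders rather than the values used in `pipeline.py`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def build_pipeline():
    """Chain a TF-IDF bag-of-words vectorizer with a linear SVM."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
        ("clf", LinearSVC()),                            # linear support vector classifier
    ])

# Usage: model = build_pipeline(); model.fit(train_tweets, train_labels)
```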
`cross_validation.py` contains the function that performs a repeated k-fold cross-validation (more precisely, stratified k-fold).
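Scikit-Learn ships this scheme as `RepeatedStratifiedKFold`; a sketch of how it could be wired up (fold and repeat counts are placeholders):

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def evaluate(model, X, y, n_splits=5, n_repeats=3, seed=0):
    """Return one accuracy score per (repeat, fold) pair."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy")
```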
`plot.py` contains the function used to plot the results of the cross-validation.
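A hypothetical plotting helper: a box plot of per-fold scores is one simple way to compare models, though `plot.py` may render something different:

```python
import matplotlib.pyplot as plt

def plot_cv_results(scores_per_model):
    """Box plot of cross-validation scores.

    scores_per_model: dict mapping model name -> list of CV scores.
    """
    fig, ax = plt.subplots()
    ax.boxplot(list(scores_per_model.values()))
    ax.set_xticklabels(scores_per_model.keys())
    ax.set_ylabel("Accuracy")
    ax.set_title("Cross-validation results")
    plt.show()
```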
`Project2-ML.ipynb` contains the code that evaluates the different models we tried when searching for the best one.
`run.py` contains the code that loads the training and test data. We then compute the representation of the tweets in a vector space and classify them, using the model that gave us the best results (TfidfVectorizer + LinearSVC). We finally compute the predictions and write them in the format accepted by the CrowdAI competition.
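Putting the pieces together, the end-to-end flow looks roughly like the sketch below; the loader names and file layout are assumptions for illustration, and `build_pipeline`/`create_csv_submission` refer to the sketches above:

```python
# Illustrative end-to-end flow; run.py may name things differently.
train_tweets, train_labels = load_train_data("data/")  # hypothetical loader
test_ids, test_tweets = load_test_data("data/")        # hypothetical loader

model = build_pipeline()               # TfidfVectorizer + LinearSVC, as sketched above
model.fit(train_tweets, train_labels)  # train on the full training set

predictions = model.predict(test_tweets)
create_csv_submission(test_ids, predictions, "submission.csv")
```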