This repository requires Python 3.7.3, pip, and virtualenv. Set up a virtual environment as follows:
```sh
virtualenv env
source env.sh
pip install -r requirements.txt
```

Whenever working with Python, run `source env.sh` in the current terminal session. If new packages are installed, update the list of dependencies by running `pip freeze > requirements.txt`.
Scripts in subdirectories (e.g. those in `scraping`) should be run as modules to avoid path and module conflicts:
```sh
python -m scraping.new_dataset  # instead of `python scraping/new_dataset.py` or `cd scraping && python new_dataset.py`
```
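To see why module execution matters, note that the two invocation styles put different directories at the front of `sys.path`. This is a standalone illustration, not a file in the repository:

```python
import sys

# Run from the repository root:
#   `python -m scraping.new_dataset` puts the repository root first on sys.path,
#   so top-level absolute imports (e.g. `import scripts`) resolve.
#   `python scraping/new_dataset.py` puts the scraping/ directory first instead,
#   which breaks those imports.
print(sys.path[0])
```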
The TensorFlow workflow in this repository is adapted from this boilerplate.

During and after training, the training and validation losses are plotted in TensorBoard. To visualize, run `tensorboard --logdir=experiments` and open `localhost:6006` in the browser.
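The boilerplate handles the actual logging; the sketch below only illustrates how scalar summaries written under `experiments/` become the curves TensorBoard plots. The run name, the placeholder loss values, and the TensorFlow 2 summary API are assumptions, not necessarily what the repository uses:

```python
import tensorflow as tf

# Assumed TF 2 summary API; the run directory name is hypothetical.
writer = tf.summary.create_file_writer("experiments/example_run")
with writer.as_default():
    for step in range(100):
        train_loss = 1.0 / (step + 1)   # placeholder values, not real losses
        val_loss = 1.2 / (step + 1)
        tf.summary.scalar("loss/train", train_loss, step=step)
        tf.summary.scalar("loss/valid", val_loss, step=step)
```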
After running a script that produces visualizations (for example, `scripts.centroids`), go to projector.tensorflow.org and upload the TSV files inside the `projector` directory.
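For reference, projector.tensorflow.org expects one tab-separated embedding vector per row, plus an optional metadata file with one label per row. The file names and the placeholder vectors below are assumptions; only the `projector` directory comes from the instructions above:

```python
import csv
import os

import numpy as np

# Sketch of the TSV format the embedding projector accepts.
os.makedirs("projector", exist_ok=True)
vectors = np.random.rand(100, 50)            # stand-in for real embeddings
labels = [f"word_{i}" for i in range(100)]   # stand-in for real labels

with open("projector/vectors.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(vectors.tolist())

with open("projector/metadata.tsv", "w") as f:
    f.write("\n".join(labels) + "\n")
```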
Set up the datasets as follows:

- Stormfront dataset: place the `all_files` directory and the `annotations_metadata.csv` file inside this repository's `data` directory. Rename `all_files` to `stormfront` and `annotations_metadata.csv` to `stormfront.csv`.
- Twitter hate speech dataset: rename the file to `twitter.csv` and place it in the `data` directory.
- Google News Word2Vec: place the file directly in the `data` directory.
- Twitter moral foundations dataset: rename the directory to `twitter_mf` and place it in the `data` directory. To scrape the tweets from their IDs, run `python -m scraping.twitter_mf`, then clean the data with `python -m scripts.clean_twitter_mf`. To have a fixed heldout dataset that represents the rest of the data well, create a shuffled version of the data (a Python alternative is sketched after this list):

  ```sh
  cat data/twitter_mf.clean.csv | head -1 > data/twitter_mf.clean.shuffled.csv
  # macOS users: `sort` by hash is a good replacement for `shuf`.
  cat data/twitter_mf.clean.csv | tail -24771 | shuf >> data/twitter_mf.clean.shuffled.csv
  ```

- WikiText: download, unzip, and place both of the word-level datasets in the `data` directory. Clean the data with `python -m scripts.clean_wikitext`.
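The same header-preserving shuffle can also be done in Python. This is only a sketch with an assumed fixed seed, not a script that exists in the repository:

```python
import pandas as pd

# Read the cleaned data, shuffle every row except the header with a fixed
# seed (assumption: seed 0), and write the copy used for the heldout split.
df = pd.read_csv("data/twitter_mf.clean.csv")
df.sample(frac=1.0, random_state=0).to_csv(
    "data/twitter_mf.clean.shuffled.csv", index=False
)
```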
Ping @danielwatson6 for access to the YouTube and the ambiguity corpora.
- Set up an environment variable `DATASETS=/path/to/youtube/data/dir` and name the directory containing the YouTube CSV files `youtube_right` (see the loading sketch after this list). Unlike the rest of the data, this dataset lives outside the repository so the large files don't have to fit on the available SSD space.
- Run `python -m scraping.new_dataset` to scrape the rest of the YouTube data.
- Rename the ambiguity data to `ambiguity.csv` and place it in the `data` folder.
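As referenced in the first item above, a loader might resolve the YouTube CSVs from the `DATASETS` environment variable roughly as follows. The variable and directory names come from the instructions; the listing logic itself is an assumption, not the repository's actual data pipeline:

```python
import os

# DATASETS and youtube_right are from the setup instructions above.
youtube_dir = os.path.join(os.environ["DATASETS"], "youtube_right")
csv_paths = sorted(
    os.path.join(youtube_dir, name)
    for name in os.listdir(youtube_dir)
    if name.endswith(".csv")
)
print(f"Found {len(csv_paths)} YouTube CSV files")
```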
For the scraping scripts to work, you need your own API keys.
- Place your YouTube API key in a file `scraping/api_key`.
- Place your Twitter API keys in a JSON file `scraping/twitter_api.json` with the following keys: `consumer_key`, `consumer_secret`, `access_token_key`, `access_token_secret` (a loading sketch follows this list).
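A minimal sketch of reading these credentials, assuming the file layout above (a plain-text key file and the four-field JSON); how the scraping scripts actually consume them may differ:

```python
import json

# scraping/api_key is assumed to hold the raw YouTube API key string.
with open("scraping/api_key") as f:
    youtube_api_key = f.read().strip()

# scraping/twitter_api.json holds the four Twitter credentials listed above.
with open("scraping/twitter_api.json") as f:
    twitter_creds = json.load(f)

consumer_key = twitter_creds["consumer_key"]
consumer_secret = twitter_creds["consumer_secret"]
access_token_key = twitter_creds["access_token_key"]
access_token_secret = twitter_creds["access_token_secret"]
```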