Binary file added .analysis.py.swp
Binary file not shown.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
constants.py
*.pyc
Binary file added .sentiment.txt.swp
Binary file not shown.
24 changes: 23 additions & 1 deletion README.md
@@ -1,2 +1,24 @@
# TextMining
This is the base repo for the text mining and analysis project for Software Design, Spring 2016 at Olin College.

## Overview
As a player of Magic: the Gathering, I was interested in analyzing the community's opinions of different popular decks. In particular, one very recently discovered deck (known as "Eldrazi") has been the subject of some controversy due to its sudden popularity and power level. I ran aggregate sentiment analysis on comments from posts about particular decks, then compared the results to comments about the Eldrazi deck.

## Implementation
### textmining.py
The first half of my code uses the PRAW package to grab the 100 most recent posts in a given Magic-related subreddit, then keeps the posts whose titles mention a particular deck. The entire comment tree of each matching post is then flattened into a list of strings, which is pickled and written to a file. One design option I decided not to pursue was to distinguish top-level comments from replies. However, it would have been difficult to judge how relevant each level of comments is, so I decided to just use all of them.
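
The filter-flatten-pickle flow described above can be sketched roughly as follows. This is a minimal illustration, not the contents of textmining.py: it does not call PRAW at all, and the nested-dict `posts` structure, the `flatten_comments` and `matching_posts` helpers, and the sample comments are all hypothetical stand-ins for whatever the Reddit API returns.

```python
import pickle


def flatten_comments(comments):
    """Return every comment body in a nested tree as a flat list (depth-first)."""
    flat = []
    stack = list(reversed(comments))
    while stack:
        node = stack.pop()
        flat.append(node["body"])
        # Push replies so they are visited right after their parent.
        stack.extend(reversed(node.get("replies", [])))
    return flat


def matching_posts(posts, deck):
    """Keep only posts whose title mentions the deck, ignoring case."""
    return [post for post in posts if deck.lower() in post["title"].lower()]


# Hypothetical sample data standing in for posts fetched from a subreddit.
posts = [
    {"title": "Eldrazi is everywhere", "comments": [
        {"body": "Ban it.", "replies": [{"body": "Too soon.", "replies": []}]},
        {"body": "Fun deck."},
    ]},
    {"title": "Storm primer", "comments": [{"body": "Nice guide."}]},
]

eldrazi_comments = []
for post in matching_posts(posts, "Eldrazi"):
    eldrazi_comments.extend(flatten_comments(post["comments"]))

# Round-trip through pickle in memory; the project writes to a file instead.
payload = pickle.dumps(eldrazi_comments)
restored = pickle.loads(payload)
```

Because replies are flattened into the same list as top-level comments, every level of the tree carries equal weight in the later averaging, which matches the decision above to use all comments.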
### analysis.py
The second module looks for a file with comments about a given deck, then unpickles it back into the original list. Because sentiment analysis on particularly long strings tends to be unreliable, I split each comment into individual sentences. I then did some basic filtering to remove strings without words by checking whether they contained any letters. Since Indico doesn't care about non-alphabetic characters, I didn't bother to clean up the strings beyond that. Finally, I averaged the sentiments of the sentences in each comment, then averaged those per-comment values across all comments.
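
As a rough sketch of the two-level averaging described above, the snippet below splits comments into sentences, drops fragments without letters, and averages per comment and then across comments. The `toy_score` function is a hypothetical stand-in for the Indico sentiment call so that the example runs offline; the helper names and sample comments are likewise assumptions made for illustration.

```python
def has_letters(text):
    """True if the string contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)


def comment_sentiment(comment, score, neutral=0.5):
    """Average the score of each sentence; fall back to neutral if none remain."""
    sentences = [s for s in comment.split(". ") if has_letters(s)]
    if not sentences:
        return neutral
    return sum(score(s) for s in sentences) / len(sentences)


def average_sentiment(comments, score):
    """Average the per-comment sentiments across all comments."""
    per_comment = [comment_sentiment(c, score) for c in comments]
    return sum(per_comment) / len(per_comment)


def toy_score(sentence):
    """Hypothetical scorer: 1.0 for 'good', 0.0 for 'bad', 0.5 otherwise."""
    lowered = sentence.lower()
    if "good" in lowered:
        return 1.0
    if "bad" in lowered:
        return 0.0
    return 0.5


# Hypothetical comments: one mixed, one with no letters, one positive.
comments = ["This deck is good. The matchup is bad", "...", "Good list"]
overall = average_sentiment(comments, toy_score)
```

Averaging within each comment first means a long, many-sentence comment counts the same as a one-line comment in the final number.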

## Results
I scraped comments relating to four classic Magic decks to use as a benchmark, then scraped comments from posts about the Eldrazi deck.
```
Average sentiment about Miracles: 0.468114394113
Average sentiment about Shardless: 0.432950678235
Average sentiment about Storm: 0.458152381391
Average sentiment about Delver: 0.491417681612
Average sentiment about Eldrazi: 0.422164683039
```
The number for Shardless is somewhat unreliable, since I only found one post, but the rest of the decks hover around neutral sentiment (slightly below neutral, but Magic players tend to be a salty bunch). As I had hoped, comments about the Eldrazi deck were noticeably, if not substantially, more negative on average than comments about the other decks.
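
To put a number on that comparison, the gap between Eldrazi and the benchmark decks can be computed directly from the averages listed above. This is only an arithmetic check on the reported values, not a significance test, and the variable names are chosen here for illustration.

```python
# Average sentiments copied from the results above.
benchmarks = {
    "Miracles": 0.468114394113,
    "Shardless": 0.432950678235,
    "Storm": 0.458152381391,
    "Delver": 0.491417681612,
}
eldrazi = 0.422164683039

# Mean of the four benchmark decks and how far Eldrazi falls below it.
benchmark_mean = sum(benchmarks.values()) / len(benchmarks)
gap = benchmark_mean - eldrazi

print("Benchmark mean: {:.3f}".format(benchmark_mean))
print("Eldrazi gap: {:.3f}".format(gap))
```

Eldrazi sits roughly 0.04 below the benchmark mean of about 0.46, and below each of the four benchmark decks individually.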

## Reflection
Exchanging data with external sources worked well. Using the Reddit API was pretty seamless, although the limit of one API call every two seconds was somewhat inhibiting, and the Indico API was even easier to work with. Since most of the data processing was done with simple generators and filtering, I did most of my unit testing at the command line against the APIs. The biggest challenge was dealing with empty (deleted?) comments, which I eventually handled with exception handling, assigning them a neutral sentiment value so as not to affect the averages. On the whole, this project was scoped to the amount of time I was able to allocate to it, so it was less ambitious than it could have been. If I were to do it again, I would do more with the data, like filtering the posts into a larger range of categories and more carefully analyzing comments in relation to each other.
31 changes: 31 additions & 0 deletions analysis.py
@@ -0,0 +1,31 @@
import indicoio
import pickle
from constants import API_KEY


def analyze_comments(archetype):
    """Analyze a list of comments and print the average sentiment, according to Indico."""

    # Pickle files should be read in binary mode; the with-block closes the file.
    with open('{}_comments.pickle'.format(archetype.lower()), 'rb') as f:
        stored_comments = pickle.load(f)

    indicoio.config.api_key = API_KEY  # configure indico API

    def get_avg_sentiment(comment):
        """Return the mean sentiment of the sentences in one comment."""
        # Keep only sentences that contain at least one letter.
        comment_sentences = [s for s in comment.split('. ') if any(c.isalpha() for c in s)]
        sentiments = [indicoio.sentiment(sentence) for sentence in comment_sentences]
        try:
            return sum(sentiments) / len(sentiments)
        except ZeroDivisionError:
            # No usable sentences (e.g. an empty or deleted comment): treat as neutral.
            return 0.5

    all_sentiments = [get_avg_sentiment(comment) for comment in stored_comments]
    all_average = sum(all_sentiments) / len(all_sentiments)
    print('Average sentiment about {}: {}'.format(archetype, all_average))


decks = ['Miracles', 'Shardless', 'Storm', 'Delver']

if __name__ == '__main__':
    for deck in decks:
        analyze_comments(deck)
    analyze_comments('Eldrazi')
226 changes: 226 additions & 0 deletions delver_comments.pickle

Large diffs are not rendered by default.

1,220 changes: 1,220 additions & 0 deletions eldrazi_comments.pickle

Large diffs are not rendered by default.
