GitHub - BuildWeek-PostHere-Subreddit/MachineLearning

Post-Here Subreddit Recommender

This application interfaces with the Flask API for demonstration purposes: http://subredditapp.matthewlsessions.com/

Overview:

This application takes a draft Reddit post from a user and returns a list of subreddits that are relevant to that post. It uses Machine Learning and Natural Language Processing (NLP) techniques to find the best matches for our users.

Project Architecture:

Our team of full-stack web developers built a React application that interfaces with our Data Science API. The Data Science API accepts text from the React app and passes it to our model. The model first vectorizes the text input using a TF-IDF vectorizer that we fit to our subreddit data. It then computes the cosine similarity between the vectorized text input and the vectorized training data. Finally, the model returns the IDs of the most relevant subreddits based on those similarity scores.
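The vectorize-then-compare flow described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's TfidfVectorizer and cosine_similarity; the toy subreddit IDs, the sample text, and the recommend helper are made up for the example and are not taken from the project code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the training data: one blob of text per subreddit.
# These IDs and words are invented for illustration only.
subreddit_ids = ["learnpython", "cooking", "personalfinance"]
subreddit_text = [
    "python code function error pandas list loop script",
    "recipe oven chicken dinner bake sauce kitchen",
    "budget savings loan credit debt invest retirement",
]

# Fit the vectorizer on the subreddit text and keep the resulting matrix.
vectorizer = TfidfVectorizer(stop_words="english")
subreddit_matrix = vectorizer.fit_transform(subreddit_text)


def recommend(post_text, top_n=2):
    """Return the top_n subreddit IDs ranked by cosine similarity to post_text."""
    # Reuse the fitted vocabulary so the post lands in the same vector space.
    post_vector = vectorizer.transform([post_text])
    scores = cosine_similarity(post_vector, subreddit_matrix).ravel()
    # Sort indices by score, highest first, and keep the best top_n.
    ranked = scores.argsort()[::-1][:top_n]
    return [subreddit_ids[i] for i in ranked]


print(recommend("How do I fix this error in my python script?"))
```

The post is transformed with the already fitted vectorizer, never refit, so its vector uses the same vocabulary and weights as the training matrix.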

Once our application has the relevant subreddit IDs, it queries our NoSQL database for more information about each subreddit and returns the subreddit name, link, title, description, number of subscribers, number of active subscribers, and subreddit score to the React app.
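As a rough sketch of that lookup step, the snippet below uses a plain dictionary as a stand-in for the NoSQL database. The record layout, the IDs, and the build_response helper are all hypothetical, and every value is a placeholder; only the field list comes from the description above.

```python
# Hypothetical in-memory stand-in for the NoSQL database, keyed by subreddit ID.
# All values are placeholders, not real subreddit data.
SUBREDDIT_DB = {
    "sub_a": {
        "name": "example_a",
        "link": "/r/example_a/",
        "title": "Example A",
        "description": "Placeholder description for A.",
        "subscribers": 0,
        "active_subscribers": 0,
        "score": 0,
    },
    "sub_b": {
        "name": "example_b",
        "link": "/r/example_b/",
        "title": "Example B",
        "description": "Placeholder description for B.",
        "subscribers": 0,
        "active_subscribers": 0,
        "score": 0,
    },
}


def build_response(subreddit_ids, db=SUBREDDIT_DB):
    """Return the stored record for each ID, in ranked order, skipping unknown IDs."""
    return [db[sub_id] for sub_id in subreddit_ids if sub_id in db]


# The model's ranking decides the order of the records sent to the front end.
for record in build_response(["sub_b", "missing_id", "sub_a"]):
    print(record["name"], record["link"])
```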

Collecting Subreddit Data:

Gathering subreddit names

We found a site that keeps track of the top 5000 subreddits, so we scraped all the subreddit names and saved them in a CSV file.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd


def get_text(num):
    """Return the subreddit names listed on one page of redditlist.com."""
    response = requests.get(f'http://redditlist.com/all?page={num}')
    response.raise_for_status()
    # The 'lxml' parser requires the lxml package to be installed.
    soup = bs(response.content, 'lxml')
    items = soup.find_all('div', class_='listing-item')
    return [i.find('span', class_='subreddit-url').a.text for i in items]


# Collect the names from every page in a plain list first.
names = []
for i in range(41):
    names.extend(get_text(i))
    if i % 5 == 0:
        print(f"Page {i} of 40.")

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0.
df = pd.DataFrame(names, columns=['names'])
df = df.drop_duplicates(keep='first')

# Write only the names column, without the row index.
df.to_csv('subreddit_names.csv', index=False)

Pulling Millions of Posts

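We then used the subreddit names to pull data from the Reddit API. This script pulls the titles of up to 1000 hot posts from each subreddit, along with other relevant data such as subscriber counts and a summed post score, and saves everything to a DataFrame that is written out as a CSV file.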

import pandas as pd
import praw

# Fill in your own Reddit API credentials before running.
redd = praw.Reddit(client_id='', client_secret='', username='', password='', user_agent='testagent')

# Names file written by the scraping script above.
names_df = pd.read_csv('subreddit_names.csv')


def df_range(mindf, maxdf, df):
    """Return the subreddit names in rows mindf..maxdf-1 as a list."""
    return df['names'].iloc[mindf:maxdf].tolist()


# This run covers rows 3001 through 4989 of the names file.
names = df_range(3001, 4990, names_df)

columns = ['name', 'title', 'url', 'banner_url', 'subscribers', 'active_accounts', 'score', 'text']
rows = []

for counter, name in enumerate(names):
    sub = redd.subreddit(name)
    score = 0
    titles = []
    # Walk up to 1000 posts from the subreddit's hot listing.
    for post in sub.hot(limit=1000):
        titles.append(post.title)
        score += post.score
    # Join titles with a space so words from adjacent titles do not run together.
    text = ' '.join(titles)
    rows.append([name, sub.title, sub.url, sub.banner_img, sub.subscribers, sub.accounts_active, score, text])
    if counter % 20 == 0:
        print(f'Call {counter} of {len(names)}')
    # Rewrite the CSV after every subreddit so progress survives an interruption.
    pd.DataFrame(rows, columns=columns).to_csv('all_data2.csv')

Model Exploration:

We explored dozens of different models and techniques, looking for an approach that would keep the model relatively small while still producing good results. Hosting limitations forced us to cut back on the size of our training data, so we decided to use a TF-IDF vectorizer to transform our data and then compare cosine similarities.
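To illustrate the size trade-off, here is a small sketch that compares a TF-IDF vectorizer with and without a vocabulary cap. It assumes scikit-learn's TfidfVectorizer; capping the vocabulary with max_features is only one possible way to shrink the fitted data and is an assumption here, not necessarily what the project did. The toy corpus and the fit_summary helper are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the per-subreddit text (invented for illustration).
corpus = [
    "python code function error pandas list loop script",
    "recipe oven chicken dinner bake sauce kitchen",
    "budget savings loan credit debt invest retirement",
]


def fit_summary(max_features=None):
    """Fit a TF-IDF vectorizer and report how much data the fitted matrix holds."""
    vectorizer = TfidfVectorizer(max_features=max_features)
    matrix = vectorizer.fit_transform(corpus)
    return {
        "vocabulary_size": len(vectorizer.vocabulary_),
        "matrix_shape": matrix.shape,
        # Number of non-zero weights actually stored in the sparse matrix.
        "stored_values": matrix.nnz,
    }


full = fit_summary()
capped = fit_summary(max_features=5)  # keep only the 5 most frequent terms

print("full:  ", full)
print("capped:", capped)
```

A smaller vocabulary means fewer columns and fewer stored weights, at the cost of ignoring rarer words that might distinguish one subreddit from another.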

Handling the Data:

Fitting the Model:

Transforming the Training Data:

Vectorizing the Data:

Transforming Test Data:

Computing the Cosines:

Checking the Results:

Export the Model:

Authors

- Matthew Sessions - Machine Learning Engineer

- Johana Luna - Machine Learning Engineer

License

This project is licensed under the MIT License; see the LICENSE file for details.
