Welcome to the Colorado Crawler Wikipedia project, a tool for extracting and analyzing web data from Wikipedia pages related to Colorado. The collected text can be used to train NLP models on vocabulary relevant to your needs.
- Web Scraping: Use `wikipedia_scraper.py` to crawl web pages and gather the data you need.
- Easy Setup: Quick installation with a single dependency:

  ```shell
  pip install seleniumbase
  ```
- Clone the repository and set up the environment:

  ```shell
  git clone https://github.com/LukeFarch/COCrawlerWiki.git
  cd COCrawlerWiki
  ```

- Change the paths in the code as necessary to match your environment and needs.
- To start crawling Wikipedia for Colorado-related pages, run:

  ```shell
  python wikipedia_crawler.py
  ```

  Follow the on-screen prompts to start crawling cities or counties.
- Want a word count? Run `word_count.py` to report how many output files contain fewer than 10 words (likely failed scrapes). Adjust the threshold as needed.
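
The crawler itself uses SeleniumBase, but its core idea — collecting Colorado-related `/wiki/` article links from a page's HTML — can be sketched with the standard library alone. This is an illustrative helper, not code from the repo; the function name, regex, and sample HTML are assumptions:

```python
import re

def colorado_wiki_links(html: str) -> list[str]:
    """Extract /wiki/ article links mentioning Colorado (illustrative helper)."""
    # Match href="/wiki/..." values; the character class skips namespaced
    # pages (File:, Help:) and fragment-only links by excluding ':' and '#'.
    links = re.findall(r'href="(/wiki/[^":#]+)"', html)
    return [link for link in links if "Colorado" in link]

sample = (
    '<a href="/wiki/Denver,_Colorado">Denver</a> '
    '<a href="/wiki/File:Flag_of_Colorado.svg">flag</a> '
    '<a href="/wiki/Boulder,_Colorado">Boulder</a>'
)
print(colorado_wiki_links(sample))
# → ['/wiki/Denver,_Colorado', '/wiki/Boulder,_Colorado']
```

In the real crawler, the HTML would come from the SeleniumBase-driven browser session rather than a hardcoded string.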
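
The contents of `word_count.py` are not shown here, but the idea it describes — flagging output files under 10 words as failed scrapes — can be sketched as follows. The output directory layout (one `.txt` file per page) and the default threshold are assumptions:

```python
from pathlib import Path

def count_short_files(output_dir: str, threshold: int = 10) -> int:
    """Count .txt files whose word count is below `threshold` (likely failed scrapes)."""
    short = 0
    for path in Path(output_dir).glob("*.txt"):
        # A whitespace split is a rough but sufficient word count for this check.
        words = path.read_text(encoding="utf-8").split()
        if len(words) < threshold:
            short += 1
    return short
```

Raising or lowering `threshold` changes how aggressively near-empty files are flagged.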