Comprehensive Cannabis Strain Dataset | 2,793+ Strains | Web Scraping & Data Processing
Grow Data is a comprehensive cannabis strain database containing detailed information on 2,793 unique cannabis strains scraped from Wikileaf.com. This project demonstrates advanced web scraping techniques, data cleaning, and processing methodologies to create a structured dataset for cannabis research and application development.
- 2,793+ Cannabis Strains with complete profiles
- Comprehensive Data Points including THC/CBD levels, strain types, effects, and descriptions
- Clean, Structured Data in multiple formats (CSV, JSON, JavaScript)
- Web Scraping Pipeline using Python, Pandas, and Beautiful Soup
- Data Processing Notebooks for cleaning and transformation
- Ready-to-Use Datasets for web applications and research
| Metric | Value |
|---|---|
| Total Strains | 2,793 |
| Data Points per Strain | 8+ fields |
| File Formats | CSV, JSON, JavaScript |
| Data Source | Wikileaf.com |
| Processing Method | Python + Beautiful Soup |
Each strain record contains:
{
index: "0",
strain: "Green Crack",
strain_url: "https://www.wikileaf.com/strain/green-crack/",
logo: "https://assets.wikileaf.com/assets/strains/strain/...",
info: "<p>Detailed strain information...</p>",
more_info: "<p>Additional strain details...</p>",
THC: "<p>THC level classification</p>",
CBD: "<p>CBD level classification</p>",
Sativa: "<p>Sativa percentage</p>",
Indica: "<p>Indica percentage</p>"
}

grow_data/
├── Resources/
│   ├── csv/                      # Processed CSV datasets
│   │   ├── ALL_data.csv          # Complete strain database
│   │   ├── strain_data.csv       # Strain names and URLs
│   │   ├── logo_data.csv         # Strain logos and images
│   │   └── more_info_data.csv    # Extended strain information
│   ├── js/                       # JavaScript data and notebooks
│   │   ├── data.js               # JavaScript-formatted dataset
│   │   ├── ALL_data.ipynb        # Main data processing notebook
│   │   ├── TableBuild_*.ipynb    # Specialized processing notebooks
│   │   └── about_strain.ipynb    # Strain analysis notebook
│   └── pics/                     # Project assets
│       ├── header_pic.png        # Project header image
│       └── gif.gif               # Demo animation
├── LICENSE                       # MIT License
└── README.md                     # This file
- Python 3.7+ - Core programming language
- Pandas - Data manipulation and analysis
- Beautiful Soup - Web scraping and HTML parsing
- Jupyter Notebook - Interactive development environment
- CSV/JSON - Data storage formats
- JavaScript - Client-side data integration
pip install pandas beautifulsoup4 requests jupyter
- Clone the repository
  git clone https://github.com/yourusername/grow_data.git
  cd grow_data
- Explore the data
  import pandas as pd
  # Load the complete dataset
  df = pd.read_csv('Resources/csv/ALL_data.csv')
  print(f"Total strains: {len(df)}")
  print(df.head())
- Use in web applications
  <script src="Resources/js/data.js"></script>
  <script>
    console.log(`Loaded ${strainData.length} strains`);
  </script>
- Target: Wikileaf.com strain database
- Method: Beautiful Soup + Requests
- Scope: 2,793+ individual strain pages
- HTML Processing: Extract clean text from HTML content
- Data Validation: Ensure data integrity and consistency
- Error Handling: Manage missing data and scraping failures
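The HTML processing and missing-data handling described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: it uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra dependencies, and the helper names `clean_html` and `clean_record`, as well as the sample values, are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(value):
    """Return the visible text of an HTML fragment, or None if it is missing."""
    if value is None or not str(value).strip():
        return None  # missing data stays missing instead of becoming ""
    parser = _TextExtractor()
    parser.feed(str(value))
    parser.close()
    # Join the text nodes with spaces, then collapse runs of whitespace.
    text = " ".join(" ".join(parser.parts).split())
    return text or None


# Assumption: these are the fields that carry HTML, based on the record layout above.
HTML_FIELDS = ("info", "more_info", "THC", "CBD", "Sativa", "Indica")


def clean_record(record, html_fields=HTML_FIELDS):
    """Copy a strain record, replacing HTML-wrapped fields with plain text."""
    cleaned = dict(record)
    for field in html_fields:
        cleaned[field] = clean_html(record.get(field))
    return cleaned


# Illustrative record only; the values are not taken from the real dataset.
raw = {
    "index": "0",
    "strain": "Green Crack",
    "THC": "<p>Normal</p>",
    "CBD": "<p>Very  Low</p>",
    "info": "",
}
print(clean_record(raw))
```

Fields that are absent or empty come back as `None`, which keeps scraping gaps visible in later validation steps rather than hiding them as empty strings.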
- Format Conversion: CSV → JSON → JavaScript
- Structure Optimization: Organize for different use cases
- Performance: Optimize for web application loading
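One possible shape for the CSV to JSON to JavaScript conversion is sketched below, using only the standard library. The variable name `strainData` matches the one used in the web usage snippet, but the exact layout of `data.js` is an assumption, and `csv_to_records` and `records_to_js` are hypothetical helper names.

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts, one per strain."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def records_to_js(records, var_name="strainData"):
    """Serialize records as JSON and wrap them in a JavaScript assignment."""
    payload = json.dumps(records, indent=2, ensure_ascii=False)
    # Assumption: data.js exposes one global array; the real file may differ.
    return f"const {var_name} = {payload};\n"


sample_csv = """index,strain,THC,CBD,Sativa,Indica
0,Green Crack,Normal,Very Low,Normal,Very Low
1,Blue Dream,Very High,Very Low,Very High,Very Low
"""

records = csv_to_records(sample_csv)
js_source = records_to_js(records)
print(js_source)
```

For faster loading in a web application, dropping `indent=2` produces a more compact file at the cost of readability.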
- Cannabis strain analysis and classification
- THC/CBD distribution studies
- Strain effect correlation research
- Market trend analysis
- Cannabis strain search engines
- Grow planning applications
- Educational platforms
- E-commerce integration
- Machine learning model training
- Natural language processing on strain descriptions
- Recommendation system development
- Market analysis and insights
| Field | Description | Type |
|---|---|---|
| `index` | Unique strain identifier | Integer |
| `strain` | Strain name | String |
| `strain_url` | Wikileaf strain page URL | URL |
| `logo` | Strain image URL | URL |
| `info` | Primary strain description | HTML/Text |
| `more_info` | Additional strain details | HTML/Text |
| `THC` | THC level classification | String |
| `CBD` | CBD level classification | String |
| `Sativa` | Sativa percentage indicator | String |
| `Indica` | Indica percentage indicator | String |
index,strain,THC,CBD,Sativa,Indica
0,Green Crack,Normal,Very Low,Normal,Very Low
1,Blue Dream,Very High,Very Low,Very High,Very Low
2,Sour Diesel,Very High,Very Low,Very High,Very Low
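As a rough illustration of how these classification columns might be used, the sketch below counts strains per THC level and filters by level using the three sample rows above. The helper names `level_counts` and `strains_with` are hypothetical, and the code assumes the full `ALL_data.csv` exposes the same column names as the sample.

```python
import csv
import io
from collections import Counter

# The three sample rows shown above, inlined so the example is self-contained.
sample_csv = """index,strain,THC,CBD,Sativa,Indica
0,Green Crack,Normal,Very Low,Normal,Very Low
1,Blue Dream,Very High,Very Low,Very High,Very Low
2,Sour Diesel,Very High,Very Low,Very High,Very Low
"""


def level_counts(rows, field):
    """Count how many strains fall into each classification for a field."""
    return Counter(row[field] for row in rows)


def strains_with(rows, field, level):
    """Return the names of strains whose field equals the given level."""
    return [row["strain"] for row in rows if row[field] == level]


rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(level_counts(rows, "THC"))
print(strains_with(rows, "THC", "Very High"))  # ['Blue Dream', 'Sour Diesel']
```

To run the same counts on the complete dataset, the inline string could be replaced with `open('Resources/csv/ALL_data.csv', newline='')` passed to `csv.DictReader`.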
We welcome contributions to improve the dataset and processing pipeline!
- Fork the repository
- Create a feature branch (`git checkout -b feature/enhancement`)
- Commit your changes (`git commit -m 'Add enhancement'`)
- Push to the branch (`git push origin feature/enhancement`)
- Open a Pull Request
- Data Updates - Refresh strain information
- Data Cleaning - Improve processing algorithms
- Analysis Tools - Add data analysis notebooks
- Integration - Create API endpoints
- Documentation - Enhance guides and examples
- Data Currency: Scraped data reflects Wikileaf content at time of collection
- Manual Intervention: Some strain names required manual correction due to URL inconsistencies
- Legal Compliance: Ensure compliance with local cannabis laws when using this data
- Attribution: Data sourced from Wikileaf.com - respect their terms of service
This dataset powers several applications in the cannabis cultivation ecosystem:
- GrowApp Cannabis Guide - Comprehensive grow planning platform
- Strain Search Tools - Advanced strain discovery interfaces
- Nutrient Calculators - Feeding schedule generators
- Plant Diagnostics - Problem identification systems
This project is licensed under the MIT License - see the LICENSE file for details.
- Commercial use allowed
- Modification allowed
- Distribution allowed
- Private use allowed
- License and copyright notice required
- Wikileaf.com - Primary data source for strain information
- Cannabis Community - For strain knowledge and cultivation wisdom
- Open Source Contributors - For tools and libraries that made this possible
- Python Community - For Beautiful Soup, Pandas, and Jupyter ecosystems
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: shannon@loyal9.app
Built with 🌿 for the cannabis cultivation community
Empowering growers with data-driven insights