
🌿 Grow Data - Cannabis Strain Database

Comprehensive Cannabis Strain Dataset | 2,793+ Strains | Web Scraping & Data Processing

MIT License | Python | Jupyter | Data Source

🚀 Project Overview

Grow Data is a comprehensive cannabis strain database containing detailed information on 2,793 unique cannabis strains scraped from Wikileaf.com. This project demonstrates advanced web scraping techniques, data cleaning, and processing methodologies to create a structured dataset for cannabis research and application development.

🎯 Key Features

  • 2,793+ Cannabis Strains with complete profiles
  • Comprehensive Data Points including THC/CBD levels, strain types, effects, and descriptions
  • Clean, Structured Data in multiple formats (CSV, JSON, JavaScript)
  • Web Scraping Pipeline using Python, Pandas, and Beautiful Soup
  • Data Processing Notebooks for cleaning and transformation
  • Ready-to-Use Datasets for web applications and research

📊 Dataset Statistics

Metric                   Value
Total Strains            2,793
Data Points per Strain   8+ fields
File Formats             CSV, JSON, JavaScript
Data Source              Wikileaf.com
Processing Method        Python + Beautiful Soup

🗂️ Data Structure

Each strain record contains:

{
  index: "0",
  strain: "Green Crack",
  strain_url: "https://www.wikileaf.com/strain/green-crack/",
  logo: "https://assets.wikileaf.com/assets/strains/strain/...",
  info: "<p>Detailed strain information...</p>",
  more_info: "<p>Additional strain details...</p>",
  THC: "<p>THC level classification</p>",
  CBD: "<p>CBD level classification</p>",
  Sativa: "<p>Sativa percentage</p>",
  Indica: "<p>Indica percentage</p>"
}
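
For quick exploration, here is a minimal Python sketch of looking up a single record by strain name; the column names come from the record layout above and the file path from the project structure below:

import pandas as pd

# Load the complete dataset and find one strain by name (case-insensitive)
df = pd.read_csv("Resources/csv/ALL_data.csv")
matches = df.loc[df["strain"].str.casefold() == "green crack"]

print(matches.to_dict(orient="records")[0] if not matches.empty else "Strain not found")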

πŸ“ Project Structure

grow_data/
├── Resources/
│   ├── csv/                    # Processed CSV datasets
│   │   ├── ALL_data.csv        # Complete strain database
│   │   ├── strain_data.csv     # Strain names and URLs
│   │   ├── logo_data.csv       # Strain logos and images
│   │   └── more_info_data.csv  # Extended strain information
│   ├── js/                     # JavaScript data and notebooks
│   │   ├── data.js             # JavaScript-formatted dataset
│   │   ├── ALL_data.ipynb      # Main data processing notebook
│   │   ├── TableBuild_*.ipynb  # Specialized processing notebooks
│   │   └── about_strain.ipynb  # Strain analysis notebook
│   └── pics/                   # Project assets
│       ├── header_pic.png      # Project header image
│       └── gif.gif             # Demo animation
├── LICENSE                     # MIT License
└── README.md                   # This file

🛠️ Technologies Used

  • Python 3.7+ - Core programming language
  • Pandas - Data manipulation and analysis
  • Beautiful Soup - Web scraping and HTML parsing
  • Jupyter Notebook - Interactive development environment
  • CSV/JSON - Data storage formats
  • JavaScript - Client-side data integration

🚀 Getting Started

Prerequisites

pip install pandas beautifulsoup4 requests jupyter

Quick Start

  1. Clone the repository

    git clone https://github.com/Shannon-Goddard/grow_data.git
    cd grow_data
  2. Explore the data

    import pandas as pd
    
    # Load the complete dataset
    df = pd.read_csv('Resources/csv/ALL_data.csv')
    print(f"Total strains: {len(df)}")
    print(df.head())
  3. Use in web applications

    <script src="Resources/js/data.js"></script>
    <script>
      console.log(`Loaded ${strainData.length} strains`);
    </script>

📈 Data Processing Pipeline

1. Web Scraping

  • Target: Wikileaf.com strain database
  • Method: Beautiful Soup + Requests (see the sketch below)
  • Scope: 2,793+ individual strain pages
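
A minimal sketch of this step, assuming the URLs collected in strain_data.csv as the crawl list; the strain_url column name comes from the record layout above, while the CSS class and the output filename are placeholders rather than the selectors and paths actually used:

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Strain names and URLs gathered in an earlier pass (see Project Structure)
strains = pd.read_csv("Resources/csv/strain_data.csv")

records = []
for url in strains["strain_url"]:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        continue  # skip pages that fail to load; treated later as missing data

    soup = BeautifulSoup(response.text, "html.parser")
    records.append({
        "strain_url": url,
        # "strain-description" is a placeholder class name, not Wikileaf's real markup
        "info": str(soup.find("div", class_="strain-description")),
    })
    time.sleep(1)  # be polite to the server between requests

pd.DataFrame(records).to_csv("Resources/csv/scraped_info.csv", index=False)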

2. Data Cleaning

  • HTML Processing: Extract clean text from HTML content (see the sketch below)
  • Data Validation: Ensure data integrity and consistency
  • Error Handling: Manage missing data and scraping failures
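
A hedged sketch of this cleaning pass, assuming the HTML-fragment columns from the Data Structure section; the output filename is illustrative, not a file in the repo:

import pandas as pd
from bs4 import BeautifulSoup

df = pd.read_csv("Resources/csv/ALL_data.csv")

# Columns stored as HTML fragments in the record layout above
html_columns = ["info", "more_info", "THC", "CBD", "Sativa", "Indica"]

def html_to_text(value):
    """Extract plain text from an HTML fragment; empty string for missing values."""
    if pd.isna(value):
        return ""
    return BeautifulSoup(str(value), "html.parser").get_text(" ", strip=True)

for column in html_columns:
    df[column] = df[column].apply(html_to_text)

# Basic validation: drop duplicate strains and count records with no description
df = df.drop_duplicates(subset="strain")
print(f"Records with missing descriptions: {df['info'].eq('').sum()}")

df.to_csv("Resources/csv/ALL_data_clean.csv", index=False)  # illustrative output path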

3. Data Transformation

  • Format Conversion: CSV → JSON → JavaScript (sketched below)
  • Structure Optimization: Organize for different use cases
  • Performance: Optimize for web application loading
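
A minimal sketch of the conversion chain ending in a data.js file that exposes the strainData variable used in the Quick Start; the intermediate JSON path is an assumption, and the notebooks may implement this differently:

import json

import pandas as pd

df = pd.read_csv("Resources/csv/ALL_data.csv")
records = df.to_dict(orient="records")

# CSV -> JSON (intermediate file path is illustrative)
with open("Resources/csv/ALL_data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# JSON -> JavaScript: wrap the records in a global variable for <script src="..."> use
with open("Resources/js/data.js", "w", encoding="utf-8") as f:
    f.write("var strainData = ")
    json.dump(records, f, ensure_ascii=False, indent=2)
    f.write(";\n")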

🎯 Use Cases

🔬 Research Applications

  • Cannabis strain analysis and classification
  • THC/CBD distribution studies (see the sketch after this list)
  • Strain effect correlation research
  • Market trend analysis
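
As a starting point for the distribution studies above, a short pandas sketch counting strains per THC and CBD level; it assumes the classification columns already hold plain labels like those in the Sample Data section:

import pandas as pd

df = pd.read_csv("Resources/csv/ALL_data.csv")

# How many strains fall into each THC and CBD classification
print(df["THC"].value_counts(dropna=False))
print(df["CBD"].value_counts(dropna=False))

# Cross-tabulate THC level against the Sativa indicator
print(pd.crosstab(df["THC"], df["Sativa"]))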

💻 Web Development

  • Cannabis strain search engines
  • Grow planning applications
  • Educational platforms
  • E-commerce integration

📊 Data Science

  • Machine learning model training
  • Natural language processing on strain descriptions
  • Recommendation system development (see the sketch after this list)
  • Market analysis and insights
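
To illustrate the recommendation idea, a hedged sketch that ranks strains by description similarity using scikit-learn (not part of this repo's stack); it assumes the info column has already been cleaned to plain text:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("Resources/csv/ALL_data.csv").reset_index(drop=True)
descriptions = df["info"].fillna("").astype(str)

# Vectorize the strain descriptions and compute pairwise cosine similarity
tfidf = TfidfVectorizer(stop_words="english")
similarity = cosine_similarity(tfidf.fit_transform(descriptions))

def similar_strains(name, top_n=5):
    """Return the strains whose descriptions read most like the given strain's."""
    idx = df.index[df["strain"] == name][0]
    ranked = similarity[idx].argsort()[::-1]
    return [df["strain"][i] for i in ranked if i != idx][:top_n]

print(similar_strains("Blue Dream"))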

📋 Data Fields

Field        Description                   Type
index        Unique strain identifier      Integer
strain       Strain name                   String
strain_url   Wikileaf strain page URL      URL
logo         Strain image URL              URL
info         Primary strain description    HTML/Text
more_info    Additional strain details     HTML/Text
THC          THC level classification      String
CBD          CBD level classification      String
Sativa       Sativa percentage indicator   String
Indica       Indica percentage indicator   String

πŸ” Sample Data

index,strain,THC,CBD,Sativa,Indica
0,Green Crack,Normal,Very Low,Normal,Very Low
1,Blue Dream,Very High,Very Low,Very High,Very Low
2,Sour Diesel,Very High,Very Low,Very High,Very Low
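
Because those THC/CBD/Sativa/Indica values are ordinal labels rather than numbers, analyses usually map them onto a scale first. A minimal sketch; the label set and scores below are assumptions, not something the dataset defines:

import pandas as pd

df = pd.read_csv("Resources/csv/ALL_data.csv")

# Hypothetical ordinal scale; adjust to the labels actually present in the data
level_scale = {"Very Low": 0, "Low": 1, "Normal": 2, "High": 3, "Very High": 4}

for column in ["THC", "CBD", "Sativa", "Indica"]:
    df[column + "_score"] = df[column].map(level_scale)

print(df[["strain", "THC_score", "CBD_score"]].head())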

🤝 Contributing

We welcome contributions to improve the dataset and processing pipeline!

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/enhancement)
  3. Commit your changes (git commit -m 'Add enhancement')
  4. Push to the branch (git push origin feature/enhancement)
  5. Open a Pull Request

Areas for Contribution

  • 🔄 Data Updates - Refresh strain information
  • 🧹 Data Cleaning - Improve processing algorithms
  • 📊 Analysis Tools - Add data analysis notebooks
  • 🌐 Integration - Create API endpoints
  • 📚 Documentation - Enhance guides and examples

⚠️ Limitations & Considerations

  • Data Currency: Scraped data reflects Wikileaf content at time of collection
  • Manual Intervention: Some strain names required manual correction due to URL inconsistencies
  • Legal Compliance: Ensure compliance with local cannabis laws when using this data
  • Attribution: Data sourced from Wikileaf.com - respect their terms of service

🔗 Related Projects

This dataset powers several applications in the cannabis cultivation ecosystem:

  • GrowApp Cannabis Guide - Comprehensive grow planning platform
  • Strain Search Tools - Advanced strain discovery interfaces
  • Nutrient Calculators - Feeding schedule generators
  • Plant Diagnostics - Problem identification systems

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

License Summary

  • ✅ Commercial use allowed
  • ✅ Modification allowed
  • ✅ Distribution allowed
  • ✅ Private use allowed
  • ❗ License and copyright notice required

🙏 Acknowledgments

  • Wikileaf.com - Primary data source for strain information
  • Cannabis Community - For strain knowledge and cultivation wisdom
  • Open Source Contributors - For tools and libraries that made this possible
  • Python Community - For Beautiful Soup, Pandas, and Jupyter ecosystems

📞 Contact & Support


Built with 🌿 for the cannabis cultivation community

Empowering growers with data-driven insights
