A web-based application for annotating biological text with entity relationships and metadata.
For detailed installation instructions, see INSTALLATION.md.
- Install dependencies:
pip install -r requirements.txt
Important: Make sure PyMuPDF is installed for PDF highlighting features. Quote the version specifier so the shell does not treat ">" as a redirect:
pip install "PyMuPDF>=1.23.0"
- Configure the application (IMPORTANT):
Edit config.py and update the following settings:
# Update this to your email address (required by Unpaywall API)
UNPAYWALL_EMAIL = "your-email@example.com" # CHANGE THIS
# Optional: Add admin emails for access
ADMIN_EMAILS = "admin@example.com,researcher@university.edu"
# Optional: Customize ports and paths
HOST = "127.0.0.1"
PORT = 8050
BE_PORT = 5001
DB_PATH = "harvest.db"
IMPORTANT: UNPAYWALL_EMAIL must be set to a valid email address before using PDF download features. This is required by the Unpaywall API.
- If upgrading from an older version, run the migration script:
python3 migrate_db_v2.py
This will safely update your database schema while preserving existing data.
- Update schema types (entity types and relation types):
python3 update_schema_types.py
This ensures your database has all the latest entity types and relation types for the annotation dropdowns. See SCHEMA_UPDATE_GUIDE.md for details.
Note: The migration will:
- Remove redundant article metadata fields (title, authors, year) from the doi_metadata table
- Remove contributor_email from sentences table (tracked at triple level)
- Add project_id column to triples table for project association
- Add new tables for projects and admin authentication
The schema update will:
- Add any new entity types (e.g., Metabolite, Coordinates) to the database
- Add any new relation types (e.g., may_influence, contributes_to) to the database
- Ensure dropdowns show all available annotation options
Important: After upgrading, make sure to run both scripts before starting the application to avoid missing dropdown options or database errors.
The easiest way to start the application is using the included launcher script, which handles both the backend and frontend:
python3 launch_harvest.py
This will:
- Start the backend API server on port 5001
- Start the frontend UI server on port 8050
- Verify both services launched successfully
- Monitor the processes and handle graceful shutdown when you press Ctrl+C
Then access the application at http://localhost:8050
Alternatively, you can manually start each service in separate terminals:
- Run the backend:
python3 harvest_be.py
- Run the frontend (in a separate terminal):
python3 harvest_fe.py
- Access the application at http://localhost:8050
You can customize the ports and hosts using environment variables:
- HARVEST_PORT: Backend API port (default: 5001)
- PORT: Frontend UI port (default: 8050)
- HARVEST_HOST: Backend host (default: 127.0.0.1)
- FRONTEND_HOST: Frontend host (default: 127.0.0.1)
- HARVEST_DB: Database file path (default: harvest.db)
- HARVEST_ADMIN_EMAILS: Comma-separated list of admin emails (optional)
- HARVEST_DEPLOYMENT_MODE: Deployment mode, either "internal" or "nginx" (default: internal)
- HARVEST_BACKEND_PUBLIC_URL: Backend URL for nginx mode (required when mode is "nginx")
- HARVEST_URL_BASE_PATHNAME: URL base pathname for subpath deployments (default: "/", e.g., "/harvest/")
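As a rough illustration of how these variables and their defaults fit together, the sketch below reads them from a mapping. The function name `load_settings` and the keys of the returned dictionary are hypothetical and are not taken from the HARVEST code; only the variable names and default values come from the list above.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the documented environment variables, applying the documented defaults.

    Hypothetical helper for illustration; HARVEST itself may read these differently.
    """
    env = os.environ if env is None else env
    admin_raw = env.get("HARVEST_ADMIN_EMAILS", "")
    return {
        "backend_host": env.get("HARVEST_HOST", "127.0.0.1"),
        "backend_port": int(env.get("HARVEST_PORT", "5001")),
        "frontend_host": env.get("FRONTEND_HOST", "127.0.0.1"),
        "frontend_port": int(env.get("PORT", "8050")),
        "db_path": env.get("HARVEST_DB", "harvest.db"),
        # Comma-separated list; blank entries and surrounding spaces are dropped.
        "admin_emails": [e.strip() for e in admin_raw.split(",") if e.strip()],
        "deployment_mode": env.get("HARVEST_DEPLOYMENT_MODE", "internal"),
        "backend_public_url": env.get("HARVEST_BACKEND_PUBLIC_URL"),
        "url_base_pathname": env.get("HARVEST_URL_BASE_PATHNAME", "/"),
    }
```

Calling `load_settings({})` returns the defaults, which can be handy for checking what a given shell environment would resolve to before launching the servers.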
The application supports two deployment modes, selected with HARVEST_DEPLOYMENT_MODE:
Internal mode ("internal", the default):
- Simple setup for development and single-server deployments
- Backend runs on localhost only, protected from external access
- Frontend proxies all backend requests internally
- No reverse proxy required
Nginx mode ("nginx"):
- Production-ready deployment with reverse proxy
- Supports load balancing, SSL termination, and advanced routing
- Backend accessible at the configured public URL (HARVEST_BACKEND_PUBLIC_URL)
- Ideal for scaled deployments
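The rule that nginx mode needs a public backend URL can be checked before startup. The following is a minimal sketch under that assumption; `check_deployment` is a hypothetical name, and the real application may validate its configuration differently.

```python
from typing import Optional
from urllib.parse import urlparse

VALID_MODES = ("internal", "nginx")


def check_deployment(mode: str, backend_public_url: Optional[str] = None) -> str:
    """Return the mode if the combination is usable, otherwise raise ValueError."""
    if mode not in VALID_MODES:
        raise ValueError(f"HARVEST_DEPLOYMENT_MODE must be one of {VALID_MODES}, got {mode!r}")
    if mode == "nginx":
        # nginx mode requires HARVEST_BACKEND_PUBLIC_URL; accept only http(s) URLs with a host.
        parsed = urlparse(backend_public_url or "")
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError("nginx mode requires HARVEST_BACKEND_PUBLIC_URL to be an http(s) URL")
    return mode
```

In internal mode the URL argument is ignored, which mirrors the description that the frontend proxies backend requests itself.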
Deployment Guides:
- DEPLOYMENT_GUIDE.md - Complete deployment guide for all modes (internal, nginx, subpath)
To access admin features, you need to create an admin user:
python3 create_admin.py
This will prompt you for an email and password. The password will be securely hashed and stored in the database.
Alternatively, you can set admin emails via environment variable:
export HARVEST_ADMIN_EMAILS="admin1@example.com,admin2@example.com"
The Admin panel (accessible from the Admin tab) allows you to:
- Manage Projects: Create projects with lists of DOIs for organized annotation campaigns
- Edit Triples: Update any triple's entity names or relationships
- Delete Triples: Remove incorrect or duplicate triples
- sentences: Stores annotated sentences with DOI hash reference
- doi_metadata: Stores DOI and hash (article metadata fetched on-demand from CrossRef)
- triples: Stores entity relationships with contributor email tracking
- entity_types: Entity type definitions
- relation_types: Relationship type definitions
- user_sessions: Session tracking for multi-user support
- projects: Project definitions with DOI lists for organized annotation
- admin_users: Admin user authentication (password hashed with bcrypt)
The Literature Search feature enables semantic paper discovery from multiple academic sources. See docs/SEMANTIC_SEARCH.md for detailed documentation.
Quick Start:
- Login via the Admin tab (authentication required)
- Navigate to the Literature Search tab
- Select search sources (Semantic Scholar, arXiv, Web of Science)
- Enter your search query in natural language
- Optionally enable "Build on previous searches" for cumulative results
- Click "Search Papers" to find relevant literature
- Select papers and export DOIs to projects
Key Features:
- Multi-source search (Semantic Scholar, arXiv, Web of Science)
- Semantic reranking using AI embeddings
- Session-based cumulative searching
- Smart deduplication across sources
- Export to projects for annotation
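Deduplication across sources is not specified in detail here. One plausible approach, sketched below, keys each paper on a normalized DOI and falls back to a normalized title when no DOI is present (common for arXiv preprints). The dictionary fields `doi` and `title` and both function names are assumptions made for this example.

```python
from typing import Iterable, List

DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common resolver prefixes."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def deduplicate(papers: Iterable[dict]) -> List[dict]:
    """Keep the first occurrence of each paper, in input order."""
    seen = set()
    unique = []
    for paper in papers:
        doi = paper.get("doi")
        if doi:
            key = ("doi", normalize_doi(doi))
        else:
            # No DOI: compare case-insensitive titles with collapsed whitespace.
            key = ("title", " ".join(paper.get("title", "").lower().split()))
        if key in seen:
            continue
        seen.add(key)
        unique.append(paper)
    return unique
```

Keeping the first occurrence means the order of the selected sources decides which record wins when two sources return the same paper.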
The Literature Review feature uses ASReview, an AI-powered active learning tool, to efficiently screen and shortlist papers. This significantly reduces manual review effort by prioritizing papers most likely to be relevant. See docs/LITERATURE_REVIEW.md for detailed documentation.
Quick Start:
- Complete a Literature Search to gather candidate papers
- Click "Start Literature Review" to create a screening project
- Upload papers to the remote ASReview service
- Screen papers presented in order of predicted relevance
- Mark papers as relevant or irrelevant
- AI model learns your criteria and re-ranks remaining papers
- Export relevant papers to HARVEST projects for annotation
Key Features:
- Active learning: AI learns from your decisions
- Smart prioritization: Review most relevant papers first
- Reduces workload: Can cut manual screening by 95%+
- GPU-accelerated: Deployed on remote GPU host for optimal performance
- Systematic approach: Structured review with progress tracking
Requirements:
- ASReview service deployed on GPU-enabled host (see docs)
- ASREVIEW_SERVICE_URL configured in config.py
- Admin authentication
- Enter your email address (required for attribution)
- (Optional) Select a project to work on from the dropdown
- (Optional) Enter and validate a DOI to link your annotation
- Enter the sentence to annotate
- Add triples defining relationships between entities
- Save your annotations
- Browse saved annotations in the Browse tab
- Go to the Admin tab
- Login with your admin credentials
- Create projects to organize annotation work
- View and manage existing projects
- Download PDFs for project DOIs (where available)
- Upload PDFs for paywalled articles
- Delete projects with options for handling associated triples:
- Keep triples as uncategorized (recommended)
- Reassign triples to another project
- Delete all associated triples
- Edit or delete triples as needed for quality control
- Filter triples by project when searching for specific entries
Before using PDF download features, you MUST edit config.py and set your email address:
UNPAYWALL_EMAIL = "your-email@example.com" # REQUIRED - Change this to your email
This email is required by the Unpaywall API to check open access status. Without it, PDF downloads will fail.
Other customizable settings in config.py:
- HOST and PORT: Server address and port settings
- DB_PATH: Database file location
- PDF_STORAGE_DIR: Where to store downloaded PDFs
- ADMIN_EMAILS: Additional admin email addresses
- ENABLE_PDF_DOWNLOAD: Toggle PDF download feature
- ENABLE_PDF_VIEWER: Toggle embedded PDF viewer
- ENABLE_PDF_HIGHLIGHTING: Toggle PDF highlighting/annotation feature (requires ENABLE_PDF_VIEWER=True)
The application includes an integrated PDF viewer with text highlighting capabilities. This feature can be enabled/disabled using the ENABLE_PDF_HIGHLIGHTING setting in config.py.
When enabled, the viewer allows you to:
- Highlight text in PDFs using a highlighter pen-like tool
- Choose highlight colors from a color picker
- Save highlights directly to the PDF file for permanent storage
- View saved highlights when reopening the PDF
- Clear all highlights if needed
Security Features:
- Maximum of 50 highlights per save operation (prevents abuse)
- Highlight text limited to 10,000 characters each
- File size validation (100 MB limit)
- Input sanitization and validation on all highlight data
- Protection against path traversal attacks
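The limits listed above could be enforced with a check along the following lines. This is a sketch only: the function name, the shape of a highlight (a dictionary with a `text` field), and the reading of "100 MB" as 100 × 1024 × 1024 bytes are assumptions, not details confirmed by the HARVEST source.

```python
from typing import List

MAX_HIGHLIGHTS_PER_SAVE = 50
MAX_HIGHLIGHT_TEXT_CHARS = 10_000
MAX_PDF_BYTES = 100 * 1024 * 1024  # assumed interpretation of the 100 MB limit


def validate_highlights(highlights: List[dict], pdf_size_bytes: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means the save may proceed."""
    errors = []
    if pdf_size_bytes > MAX_PDF_BYTES:
        errors.append("PDF exceeds the 100 MB size limit")
    if len(highlights) > MAX_HIGHLIGHTS_PER_SAVE:
        errors.append(f"too many highlights: {len(highlights)} > {MAX_HIGHLIGHTS_PER_SAVE}")
    for index, highlight in enumerate(highlights):
        text = highlight.get("text", "")
        if not isinstance(text, str):
            errors.append(f"highlight {index}: text must be a string")
        elif len(text) > MAX_HIGHLIGHT_TEXT_CHARS:
            errors.append(f"highlight {index}: text longer than {MAX_HIGHLIGHT_TEXT_CHARS} characters")
    return errors
```

Collecting all problems rather than stopping at the first one lets the caller report everything wrong with a save request in a single response.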
How to Use the Highlighting Feature:
- Select a DOI from a project to load its PDF in the viewer
- Click the "Highlight" button to enable highlighting mode
- Click and drag on the PDF to create a highlight
- Change the highlight color using the color picker if desired
- Click "Save" to permanently store highlights in the PDF file
- Use "Clear All" to remove all highlights from the PDF
Keyboard Shortcuts:
- H: Toggle highlight mode
- Ctrl+S: Save highlights
- Arrow keys or Page Up/Down: Navigate pages
Technical Details:
- Highlights are stored as PDF annotations using the PyMuPDF library
- The viewer uses PDF.js for rendering with a custom overlay for highlighting
- All highlights are validated and sanitized before being saved
- Highlights persist in the PDF file and are readable by other PDF viewers
For projects with DOI lists, administrators can automatically download open-access PDFs:
- Create a project with a DOI list
- Click the "Download PDFs" button in the project management section
- The system will:
- Check each DOI for open access availability (via Unpaywall API)
- Download available open-access PDFs automatically
- Skip DOIs where PDFs already exist
- Provide a list of DOIs requiring manual upload
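To make the open-access check more concrete, the sketch below builds a request URL for the Unpaywall REST API (version 2) and picks a PDF link out of a decoded JSON response. The helper names are hypothetical, the rejection of the placeholder address is an assumption about sensible behavior, and no network request is made here; HARVEST's own download logic may differ.

```python
from typing import Optional
from urllib.parse import quote, urlencode

UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the lookup URL for one DOI; the API expects an email query parameter."""
    if "@" not in email or email == "your-email@example.com":
        raise ValueError("set UNPAYWALL_EMAIL to a real address before calling Unpaywall")
    return f"{UNPAYWALL_BASE}{quote(doi, safe='/')}?{urlencode({'email': email})}"


def pick_pdf_url(record: dict) -> Optional[str]:
    """Return a direct PDF link from an Unpaywall record, or None if unavailable."""
    if not record.get("is_oa"):
        return None
    # best_oa_location can be null in the response, hence the fallback to {}.
    location = record.get("best_oa_location") or {}
    return location.get("url_for_pdf")
```

A DOI for which `pick_pdf_url` returns `None` would end up on the list of DOIs that need a manual upload.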
PDFs are named using the DOI hash (e.g., abc123def456.pdf) and stored in project_pdfs/project_<id>/.
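The storage layout can be expressed as a small path helper. This sketch assumes the DOI hash is a lowercase hexadecimal string, as the example file name suggests; how the hash itself is computed is not stated here, so it is taken as an input. The function name and the `base_dir` parameter are hypothetical.

```python
import re
from pathlib import Path

# Assumption: hashes look like the example "abc123def456" (lowercase hex only).
HASH_PATTERN = re.compile(r"[0-9a-f]+")


def pdf_path(project_id: int, doi_hash: str, base_dir: str = "project_pdfs") -> Path:
    """Return project_pdfs/project_<id>/<doi_hash>.pdf, rejecting unsafe hash values."""
    if not HASH_PATTERN.fullmatch(doi_hash):
        # Refusing anything but hex also blocks "../" style path traversal.
        raise ValueError(f"unexpected doi_hash: {doi_hash!r}")
    return Path(base_dir) / f"project_{int(project_id)}" / f"{doi_hash}.pdf"
```

For example, `pdf_path(3, "abc123def456")` yields `project_pdfs/project_3/abc123def456.pdf`.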
For paywalled articles or failed downloads:
- The download report shows which DOIs need manual upload
- Obtain PDFs through your institutional access
- Use the upload function to add PDFs to the project
- Name files according to the provided doi_hash
Important: This tool only downloads legally available open-access content. You must have appropriate permissions for any manually uploaded PDFs.
Over time, you may accumulate sentences without associated triples (incomplete entries). Use the cleanup script to identify and remove them:
# Dry run (shows what would be deleted without deleting)
python3 cleanup_orphaned_sentences.py
# Actually perform cleanup
python3 cleanup_orphaned_sentences.py --execute
# Assign a custom name for the default project
python3 cleanup_orphaned_sentences.py --execute --default-project "General Annotations"
The cleanup script will:
- Find and optionally delete sentences without any triples
- Assign triples with NULL project_id to a default "Uncategorized" project
Note: Always run the dry-run first to see what will be affected!
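For readers who want to inspect orphaned sentences by hand, a query of the following shape finds sentences with no triples. The column names (`sentences.id`, `triples.sentence_id`) and the tiny demo schema are assumptions made for this example; check the actual schema in harvest.db before adapting it. The bundled cleanup script remains the supported route.

```python
import sqlite3
from typing import List


def find_orphaned_sentences(conn: sqlite3.Connection) -> List[int]:
    """Return ids of sentences that have no row in the triples table."""
    rows = conn.execute(
        """
        SELECT s.id
        FROM sentences AS s
        LEFT JOIN triples AS t ON t.sentence_id = s.id
        WHERE t.id IS NULL
        ORDER BY s.id
        """
    ).fetchall()
    return [row[0] for row in rows]


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with one annotated and one orphaned sentence."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT);
        CREATE TABLE triples (id INTEGER PRIMARY KEY, sentence_id INTEGER, project_id INTEGER);
        INSERT INTO sentences (id, text) VALUES (1, 'annotated'), (2, 'orphaned');
        INSERT INTO triples (sentence_id, project_id) VALUES (1, NULL);
        """
    )
    return conn
```

Like the dry run, this only reads from the database and changes nothing.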
When you delete a triple through the admin panel:
- If it's the last triple for a sentence, the sentence is automatically deleted as well
- This prevents orphaned sentences and maintains database integrity
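The behavior described above amounts to a delete followed by a check for remaining triples. A minimal sketch, assuming a `triples.sentence_id` column that references `sentences.id` (hypothetical column names, as is the function name):

```python
import sqlite3


def delete_triple(conn: sqlite3.Connection, triple_id: int) -> bool:
    """Delete one triple; return True if its sentence was removed as well."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT sentence_id FROM triples WHERE id = ?", (triple_id,)
        ).fetchone()
        if row is None:
            return False  # nothing to delete
        sentence_id = row[0]
        conn.execute("DELETE FROM triples WHERE id = ?", (triple_id,))
        remaining = conn.execute(
            "SELECT COUNT(*) FROM triples WHERE sentence_id = ?", (sentence_id,)
        ).fetchone()[0]
        if remaining == 0:
            # Last triple gone: remove the now-orphaned sentence too.
            conn.execute("DELETE FROM sentences WHERE id = ?", (sentence_id,))
            return True
    return False
```

Running both deletes inside one transaction means a failure between them cannot leave a sentence without triples.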
Administrators can create projects with predefined lists of DOIs. This helps organize annotation campaigns around specific papers or topics. Users can select a project from the dropdown, and the system will suggest DOIs from that project's list.