HARVEST: Human-in-the-loop Actionable Research and Vocabulary Extraction Technology

A web-based application for annotating biological text with entity relationships and metadata.

Setup

Quick Installation

For detailed installation instructions, see INSTALLATION.md.

Quick Start

Install dependencies:

pip install -r requirements.txt

Important: Make sure PyMuPDF is installed for PDF highlighting features:

pip install "PyMuPDF>=1.23.0"

Configure the application (IMPORTANT):

Edit config.py and update the following settings:

# Update this to your email address (required by Unpaywall API)
UNPAYWALL_EMAIL = "your-email@example.com"  # CHANGE THIS

# Optional: Add admin emails for access
ADMIN_EMAILS = "admin@example.com,researcher@university.edu"

# Optional: Customize ports and paths
HOST = "127.0.0.1"
PORT = 8050
BE_PORT = 5001
DB_PATH = "harvest.db"

IMPORTANT: The UNPAYWALL_EMAIL must be set to a valid email address before using PDF download features. This is required by the Unpaywall API.

If upgrading from an older version, run the migration script:

python3 migrate_db_v2.py

This will safely update your database schema while preserving existing data.

Update schema types (entity types and relation types):

python3 update_schema_types.py

This ensures your database has all the latest entity types and relation types for the annotation dropdowns. See SCHEMA_UPDATE_GUIDE.md for details.

Note: The migration will:

Remove redundant article metadata fields (title, authors, year) from the doi_metadata table
Remove contributor_email from sentences table (tracked at triple level)
Add project_id column to triples table for project association
Add new tables for projects and admin authentication

The schema update will:

Add any new entity types (e.g., Metabolite, Coordinates) to the database
Add any new relation types (e.g., may_influence, contributes_to) to the database
Ensure dropdowns show all available annotation options

Important: After upgrading, make sure to run both scripts before starting the application to avoid missing dropdown options or database errors.

Running the Application

Option 1: Using the Launcher (Recommended)

The easiest way to start the application is using the included launcher script, which handles both the backend and frontend:

python3 launch_harvest.py

This will:

Start the backend API server on port 5001
Start the frontend UI server on port 8050
Verify both services launched successfully
Monitor the processes and handle graceful shutdown when you press Ctrl+C

Then access the application at http://localhost:8050
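
To confirm that both services are accepting connections after launch, a small check like the one below can help. This is a minimal sketch, assuming the default host 127.0.0.1 and the default ports listed above (5001 and 8050). The helper names `is_port_open` and `check_harvest_services` are hypothetical and not part of HARVEST.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_harvest_services(host: str = "127.0.0.1",
                           backend_port: int = 5001,
                           frontend_port: int = 8050) -> dict:
    """Report whether the backend and frontend ports accept connections."""
    return {
        "backend": is_port_open(host, backend_port),
        "frontend": is_port_open(host, frontend_port),
    }
```

Calling `check_harvest_services()` returns a dictionary with a boolean for each service. An open port only shows that something is listening, not that the application is healthy.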

Option 2: Manual Launch

Alternatively, you can manually start each service in separate terminals:

Run the backend:

python3 harvest_be.py

Run the frontend (in a separate terminal):

python3 harvest_fe.py

Access the application at http://localhost:8050

Configuration

You can customize the ports and hosts using environment variables:

HARVEST_PORT: Backend API port (default: 5001)
PORT: Frontend UI port (default: 8050)
HARVEST_HOST: Backend host (default: 127.0.0.1)
FRONTEND_HOST: Frontend host (default: 127.0.0.1)
HARVEST_DB: Database file path (default: harvest.db)
HARVEST_ADMIN_EMAILS: Comma-separated list of admin emails (optional)
HARVEST_DEPLOYMENT_MODE: Deployment mode - "internal" or "nginx" (default: internal)
HARVEST_BACKEND_PUBLIC_URL: Backend URL for nginx mode (required when mode is "nginx")
HARVEST_URL_BASE_PATHNAME: URL base pathname for subpath deployments (default: "/", e.g., "/harvest/")
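
The variables and defaults above can be read in Python roughly as follows. This is a sketch only: the function name `load_settings`, the dictionary keys, and the validation rules are illustrative assumptions, and HARVEST's own config.py may read these values differently.

```python
import os


def load_settings(env=None) -> dict:
    """Read HARVEST settings from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    settings = {
        "backend_port": int(env.get("HARVEST_PORT", "5001")),
        "frontend_port": int(env.get("PORT", "8050")),
        "backend_host": env.get("HARVEST_HOST", "127.0.0.1"),
        "frontend_host": env.get("FRONTEND_HOST", "127.0.0.1"),
        "db_path": env.get("HARVEST_DB", "harvest.db"),
        "deployment_mode": env.get("HARVEST_DEPLOYMENT_MODE", "internal"),
        "backend_public_url": env.get("HARVEST_BACKEND_PUBLIC_URL", ""),
        "url_base_pathname": env.get("HARVEST_URL_BASE_PATHNAME", "/"),
    }
    # Only the two documented modes are accepted.
    if settings["deployment_mode"] not in ("internal", "nginx"):
        raise ValueError("HARVEST_DEPLOYMENT_MODE must be 'internal' or 'nginx'")
    # The public backend URL is documented as required in nginx mode.
    if settings["deployment_mode"] == "nginx" and not settings["backend_public_url"]:
        raise ValueError("HARVEST_BACKEND_PUBLIC_URL is required in nginx mode")
    return settings
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise without touching the real environment.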

Deployment

The application supports two deployment modes:

Internal Mode (Default)

Simple setup for development and single-server deployments
Backend runs on localhost only, protected from external access
Frontend proxies all backend requests internally
No reverse proxy required

Nginx Mode

Production-ready deployment with reverse proxy
Supports load balancing, SSL termination, and advanced routing
Backend accessible at configured public URL
Ideal for scaled deployments

Deployment Guides:

DEPLOYMENT_GUIDE.md - Complete deployment guide for all modes (internal, nginx, subpath)

Admin Features

Creating an Admin User

To access admin features, you need to create an admin user:

python3 create_admin.py

This will prompt you for an email and password. The password will be securely hashed and stored in the database.

Alternatively, you can set admin emails via environment variable:

export HARVEST_ADMIN_EMAILS="admin1@example.com,admin2@example.com"
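
A comma-separated value like this can be turned into a list along the following lines. This is a sketch under stated assumptions: `parse_admin_emails` is a hypothetical helper, and trimming whitespace, lowercasing, and dropping duplicates are choices made here for illustration, not documented HARVEST behavior.

```python
import os


def parse_admin_emails(raw: str) -> list:
    """Split a comma-separated string into trimmed, lowercased, unique emails."""
    emails = []
    for part in raw.split(","):
        email = part.strip().lower()
        # Skip empty fragments (e.g. from a trailing comma) and repeats.
        if email and email not in emails:
            emails.append(email)
    return emails


# An unset variable yields an empty list rather than an error.
admin_emails = parse_admin_emails(os.environ.get("HARVEST_ADMIN_EMAILS", ""))
```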

Admin Panel

The Admin panel (accessible from the Admin tab) allows you to:

Manage Projects: Create projects with lists of DOIs for organized annotation campaigns
Edit Triples: Update any triple's entity names or relationships
Delete Triples: Remove incorrect or duplicate triples

Database Schema

sentences: Stores annotated sentences with DOI hash reference
doi_metadata: Stores DOI and hash (article metadata fetched on-demand from CrossRef)
triples: Stores entity relationships with contributor email tracking
entity_types: Entity type definitions
relation_types: Relationship type definitions
user_sessions: Session tracking for multi-user support
projects: Project definitions with DOI lists for organized annotation
admin_users: Admin user authentication (password hashed with bcrypt)
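
After a migration it can be useful to verify that all of the tables listed above exist. The sketch below assumes the database is a SQLite file (suggested by the default name harvest.db, but not stated explicitly here). `EXPECTED_TABLES` and `missing_tables` are hypothetical names.

```python
import sqlite3

# Table names taken from the schema list above.
EXPECTED_TABLES = {
    "sentences", "doi_metadata", "triples", "entity_types",
    "relation_types", "user_sessions", "projects", "admin_users",
}


def missing_tables(conn: sqlite3.Connection) -> set:
    """Return the expected table names that are absent from the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {row[0] for row in rows}
    return EXPECTED_TABLES - present
```

For example, `missing_tables(sqlite3.connect("harvest.db"))` should return an empty set on a fully migrated database. Note that `sqlite3.connect` creates an empty file if the path does not exist, so check the path first.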

Usage

Literature Search

The Literature Search feature enables semantic paper discovery from multiple academic sources. See docs/SEMANTIC_SEARCH.md for detailed documentation.

Quick Start:

Log in via the Admin tab (authentication required)
Navigate to the Literature Search tab
Select search sources (Semantic Scholar, arXiv, Web of Science)
Enter your search query in natural language
Optionally enable "Build on previous searches" for cumulative results
Click "Search Papers" to find relevant literature
Select papers and export DOIs to projects

Key Features:

Multi-source search (Semantic Scholar, arXiv, Web of Science)
Semantic reranking using AI embeddings
Session-based cumulative searching
Smart deduplication across sources
Export to projects for annotation
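
How HARVEST deduplicates results is not described here. One common approach, sketched below purely as an assumption, is to key each record on its normalized DOI and fall back to a normalized title when no DOI is present. The record layout (dictionaries with "doi" and "title" keys) and both function names are hypothetical.

```python
import re


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)


def dedupe_papers(papers: list) -> list:
    """Keep the first record seen for each DOI (or title, when the DOI is missing)."""
    seen = set()
    unique = []
    for paper in papers:
        doi = paper.get("doi") or ""
        if doi:
            key = ("doi", normalize_doi(doi))
        else:
            # Collapse punctuation and case so near-identical titles match.
            title = re.sub(r"\W+", " ", paper.get("title", "").lower()).strip()
            key = ("title", title)
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique
```

Keeping the first occurrence means the order of sources decides which copy survives. A real implementation might instead merge fields from all copies.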

Literature Review (AI-Assisted Screening)

The Literature Review feature uses ASReview, an AI-powered active learning tool, to efficiently screen and shortlist papers. This significantly reduces manual review effort by prioritizing papers most likely to be relevant. See docs/LITERATURE_REVIEW.md for detailed documentation.

Quick Start:

Complete a Literature Search to gather candidate papers
Click "Start Literature Review" to create a screening project
Upload papers to the remote ASReview service
Screen papers presented in order of predicted relevance
Mark papers as relevant or irrelevant
AI model learns your criteria and re-ranks remaining papers
Export relevant papers to HARVEST projects for annotation

Key Features:

Active learning: AI learns from your decisions
Smart prioritization: Review most relevant papers first
Reduces workload: Can cut manual screening by up to 95% in favorable cases; actual savings depend on the dataset
GPU-accelerated: Deployed on remote GPU host for optimal performance
Systematic approach: Structured review with progress tracking

Requirements:

ASReview service deployed on GPU-enabled host (see docs)
ASREVIEW_SERVICE_URL configured in config.py
Admin authentication

For Annotators

Enter your email address (required for attribution)
(Optional) Select a project to work on from the dropdown
(Optional) Enter and validate a DOI to link your annotation
Enter the sentence to annotate
Add triples defining relationships between entities
Save your annotations
Browse saved annotations in the Browse tab

For Administrators

Go to the Admin tab
Log in with your admin credentials
Create projects to organize annotation work
View and manage existing projects
Download PDFs for project DOIs (where available)
Upload PDFs for paywalled articles
Delete projects with options for handling associated triples:
- Keep triples as uncategorized (recommended)
- Reassign triples to another project
- Delete all associated triples
Edit or delete triples as needed for quality control
Filter triples by project when searching for specific entries

PDF Management

Configuration

Before using PDF download features, you MUST edit config.py and set your email address:

UNPAYWALL_EMAIL = "your-email@example.com"  # REQUIRED - Change this to your email

This email is required by the Unpaywall API to check open access status. Without it, PDF downloads will fail.

Other customizable settings in config.py:

HOST and PORT: Server address and port settings
DB_PATH: Database file location
PDF_STORAGE_DIR: Where to store downloaded PDFs
ADMIN_EMAILS: Additional admin email addresses
ENABLE_PDF_DOWNLOAD: Toggle PDF download feature
ENABLE_PDF_VIEWER: Toggle embedded PDF viewer
ENABLE_PDF_HIGHLIGHTING: Toggle PDF highlighting/annotation feature (requires ENABLE_PDF_VIEWER=True)

PDF Viewer with Highlighting

The application includes an integrated PDF viewer with text highlighting capabilities. This feature can be enabled/disabled using the ENABLE_PDF_HIGHLIGHTING setting in config.py.

When enabled, the viewer allows you to:

Highlight text in PDFs using a highlighter pen-like tool
Choose highlight colors from a color picker
Save highlights directly to the PDF file for permanent storage
View saved highlights when reopening the PDF
Clear all highlights if needed

Security Features:

Maximum of 50 highlights per save operation (prevents abuse)
Highlight text limited to 10,000 characters each
File size validation (100 MB limit)
Input sanitization and validation on all highlight data
Protection against path traversal attacks
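
The numeric limits above could be enforced with a check like the following. This is an illustrative sketch, not HARVEST's actual validation code: the function name, the shape of a highlight (a dictionary with a "text" key), and the interpretation of 100 MB as 100 * 1024 * 1024 bytes are all assumptions.

```python
MAX_HIGHLIGHTS_PER_SAVE = 50
MAX_HIGHLIGHT_TEXT_CHARS = 10_000
MAX_PDF_SIZE_BYTES = 100 * 1024 * 1024  # 100 MB, assuming binary megabytes


def validate_highlights(highlights: list, pdf_size_bytes: int) -> list:
    """Return a list of problems; an empty list means the request passes."""
    problems = []
    if pdf_size_bytes > MAX_PDF_SIZE_BYTES:
        problems.append("PDF exceeds the 100 MB limit")
    if len(highlights) > MAX_HIGHLIGHTS_PER_SAVE:
        problems.append("more than 50 highlights in one save")
    for index, item in enumerate(highlights):
        # Anything that is not a dict with string text is treated as malformed.
        text = item.get("text", "") if isinstance(item, dict) else None
        if not isinstance(text, str):
            problems.append(f"highlight {index} is malformed")
        elif len(text) > MAX_HIGHLIGHT_TEXT_CHARS:
            problems.append(f"highlight {index} text exceeds 10,000 characters")
    return problems
```

Collecting all problems instead of failing on the first one makes it easier to report everything wrong with a request in a single response.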

How to Use the Highlighting Feature:

Select a DOI from a project to load its PDF in the viewer
Click the "🖍️ Highlight" button to enable highlighting mode
Click and drag on the PDF to create a highlight
Change the highlight color using the color picker if desired
Click "💾 Save" to permanently store highlights in the PDF file
Use "🗑️ Clear All" to remove all highlights from the PDF

Keyboard Shortcuts:

H: Toggle highlight mode
Ctrl+S: Save highlights
Arrow keys or Page Up/Down: Navigate pages

Technical Details:

Highlights are stored as PDF annotations using the PyMuPDF library
The viewer uses PDF.js for rendering with a custom overlay for highlighting
All highlights are validated and sanitized before being saved
Highlights persist in the PDF file and are readable by other PDF viewers

Automatic PDF Download

For projects with DOI lists, administrators can automatically download open-access PDFs:

Create a project with a DOI list
Click the "Download PDFs" button in the project management section
The system will:
- Check each DOI for open access availability (via Unpaywall API)
- Download available open-access PDFs automatically
- Skip DOIs where PDFs already exist
- Provide a list of DOIs requiring manual upload

PDFs are named using the DOI hash (e.g., abc123def456.pdf) and stored in project_pdfs/project_<id>/.
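
The lookup and storage steps can be sketched as below. The URL shape and the response fields `is_oa`, `best_oa_location`, and `url_for_pdf` follow the public Unpaywall v2 API. The function names are hypothetical, and how HARVEST computes the DOI hash is not shown here, so `doi_hash` is taken as an input. No network request is made in this sketch.

```python
from pathlib import Path
from urllib.parse import quote, urlencode

UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the Unpaywall lookup URL; the email is sent as a query parameter."""
    return f"{UNPAYWALL_BASE}{quote(doi, safe='/')}?{urlencode({'email': email})}"


def open_access_pdf_url(record: dict):
    """Return the open-access PDF URL from an Unpaywall response, or None."""
    if not record.get("is_oa"):
        return None
    # best_oa_location can be null in the response, so guard against None.
    location = record.get("best_oa_location") or {}
    return location.get("url_for_pdf")


def pdf_path(project_id: int, doi_hash: str, root: str = "project_pdfs") -> Path:
    """Build the storage path project_pdfs/project_<id>/<doi_hash>.pdf."""
    return Path(root) / f"project_{project_id}" / f"{doi_hash}.pdf"
```

A DOI whose record yields `None` from `open_access_pdf_url` would end up on the manual upload list.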

Manual PDF Upload

For paywalled articles or failed downloads:

The download report shows which DOIs need manual upload
Obtain PDFs through your institutional access
Use the upload function to add PDFs to the project
Name files according to the provided doi_hash

Important: This tool only downloads legally available open-access content. You must have appropriate permissions for any manually uploaded PDFs.

Database Maintenance

Cleanup Orphaned Sentences

Over time, you may accumulate sentences without associated triples (incomplete entries). Use the cleanup script to identify and remove them:

# Dry run (shows what would be deleted without deleting)
python3 cleanup_orphaned_sentences.py

# Actually perform cleanup
python3 cleanup_orphaned_sentences.py --execute

# Assign a custom name for the default project
python3 cleanup_orphaned_sentences.py --execute --default-project "General Annotations"

The cleanup script will:

Find and optionally delete sentences without any triples
Assign triples with NULL project_id to a default "Uncategorized" project

Note: Always run the dry-run first to see what will be affected!
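
The dry-run idea can be illustrated with a read-only query. This is a sketch, not the contents of cleanup_orphaned_sentences.py: it assumes a SQLite database and hypothetical column names (`sentences.id`, `triples.id`, `triples.sentence_id`) that may differ from the real schema.

```python
import sqlite3


def find_orphaned_sentences(conn: sqlite3.Connection) -> list:
    """Return ids of sentences that have no triples (nothing is deleted)."""
    rows = conn.execute(
        """
        SELECT s.id
        FROM sentences AS s
        LEFT JOIN triples AS t ON t.sentence_id = s.id
        WHERE t.id IS NULL
        ORDER BY s.id
        """
    ).fetchall()
    return [row[0] for row in rows]
```

The LEFT JOIN keeps every sentence, and the `WHERE t.id IS NULL` filter retains only those for which no matching triple was found.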

Cascade Deletion

When you delete a triple through the admin panel:

If it's the last triple for a sentence, the sentence is automatically deleted as well
This prevents orphaned sentences and maintains database integrity
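
The cascade rule can be sketched as a single transaction. As with any schema-level example here, this assumes a SQLite database and hypothetical column names (`triples.id`, `triples.sentence_id`, `sentences.id`). It is not the admin panel's actual implementation.

```python
import sqlite3


def delete_triple(conn: sqlite3.Connection, triple_id: int) -> bool:
    """Delete a triple; also delete its sentence if no other triples reference it.

    Returns True when the parent sentence was removed as well.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT sentence_id FROM triples WHERE id = ?", (triple_id,)
        ).fetchone()
        if row is None:
            return False  # unknown triple id, nothing to do
        sentence_id = row[0]
        conn.execute("DELETE FROM triples WHERE id = ?", (triple_id,))
        remaining = conn.execute(
            "SELECT COUNT(*) FROM triples WHERE sentence_id = ?", (sentence_id,)
        ).fetchone()[0]
        if remaining == 0:
            conn.execute("DELETE FROM sentences WHERE id = ?", (sentence_id,))
            return True
    return False
```

Running both deletes inside one transaction means a failure partway through cannot leave a sentence without triples.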

Project-Based Annotation

Administrators can create projects with predefined lists of DOIs. This helps organize annotation campaigns around specific papers or topics. Users can select a project from the dropdown, and the system will suggest DOIs from that project's list.
