A flexible learning framework for protein sequence optimization using protein language models (PLMs).
PLM Framework provides a comprehensive toolkit for protein engineering through active learning. It leverages state-of-the-art protein language models (like ESM-2) to guide the exploration of protein sequence space, helping researchers discover variants with improved properties.
- Protein Embedding: Efficiently embed protein sequences using ESM models with caching
- Active Learning Loop: Iterative propose → assay → fit cycle for efficient exploration
- Multiple Learning Strategies:
  - Ridge/MLP regression for property prediction
  - Reinforcement learning policy for directed evolution
- Flexible Acquisition Functions: UCB, Expected Improvement, Thompson sampling
- Comprehensive Data Management: SQLite backend with versioning
- User-Friendly Interfaces: Python API and CLI
```bash
# On macOS/Linux
curl -sSL https://install.python-poetry.org | python3 -

# On Windows (PowerShell)
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -
```

```bash
# Clone the repository
git clone https://github.com/yourusername/plm-framework.git
cd plm-framework

# Install with Poetry
poetry install
```

```bash
# Create default configuration and directories
plm init --config-path config.yaml
```

For the first round, when no experimental data is available, you can use one of two strategies:
```bash
# Propose initial variants using ESM2 logits (recommended)
plm propose-initial --config-path config.yaml "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG" proposals_round0.csv --batch-size 10 --n-mutations 1 --strategy esm_logit

# Alternatively, use diversity-based sampling
plm propose-initial --config-path config.yaml "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG" proposals_round0.csv --batch-size 10 --n-mutations 1 --strategy diversity
```

The `esm_logit` strategy leverages the ESM2 protein language model to identify mutations that are more likely to be functional:
- Generates random single/multiple point mutations
- Scores each mutation based on the ESM2 model's predicted probability at that position
- Selects variants with the highest logit scores
This approach uses the language model's understanding of protein sequences to prioritize mutations that maintain the protein's structural and functional properties.
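The generate-score-select logic can be sketched in a few lines of numpy. This is an illustrative version, not the framework's implementation: it assumes the ESM2 logits have already been converted to a per-position log-probability matrix `log_probs` (shape `len(seq) × 20`), and the helper name `propose_by_logit` is hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_by_logit(seq, log_probs, batch_size, rng):
    """Rank random single mutants by the PLM's per-position log-probability.

    log_probs: array of shape (len(seq), 20) with the model's log-probability
    for each amino acid at each position (assumed precomputed from ESM2 logits).
    """
    # Oversample random single mutations, then keep the best-scoring ones
    n_candidates = min(batch_size * 10, len(seq) * 19)
    candidates = set()
    while len(candidates) < n_candidates:
        pos = int(rng.integers(len(seq)))
        aa = AMINO_ACIDS[rng.integers(20)]
        if aa != seq[pos]:
            candidates.add((pos, aa))
    # Score each mutation by the model's log-probability of the new residue
    scored = sorted(candidates,
                    key=lambda m: log_probs[m[0], AMINO_ACIDS.index(m[1])],
                    reverse=True)
    return [seq[:p] + aa + seq[p + 1:] for p, aa in scored[:batch_size]]
```

In practice the log-probability matrix would come from a single masked-LM forward pass over the wild-type sequence, which keeps scoring cheap even for large candidate sets.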
The `diversity` strategy focuses on maximizing the diversity of the initial batch:
- Generates random mutations
- Embeds all variants using the ESM2 model
- Selects variants that are maximally diverse in the embedding space
This is useful when you want to explore different regions of the sequence space.
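One common realization of "maximally diverse in the embedding space" is greedy farthest-point sampling. The sketch below is illustrative and assumes Euclidean distance; the framework's exact selection criterion may differ.

```python
import numpy as np

def select_diverse(embeddings, k):
    """Greedy farthest-point selection: pick k rows that are maximally
    spread out in embedding space (max-min Euclidean distance)."""
    X = np.asarray(embeddings, dtype=float)
    # Seed with the point farthest from the centroid
    chosen = [int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
    while len(chosen) < k:
        # Distance from every point to its nearest already-chosen point
        d = np.min(np.linalg.norm(X[:, None, :] - X[chosen][None, :, :],
                                  axis=-1), axis=1)
        chosen.append(int(np.argmax(d)))  # take the most isolated point
    return chosen
```

Greedy farthest-point sampling is a 2-approximation to the NP-hard max-min dispersion problem, which is why it is a popular default for batch diversity.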
Both methods will:
- Generate candidate variants with the specified number of mutations
- Score and rank the variants according to the chosen strategy
- Save the top-scoring variants to a CSV file
You can restrict mutations to specific regions of the protein sequence using the `--mutation-range` parameter:
```bash
# Only mutate residues 50-150
plm propose-initial --config-path config.yaml "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG" proposals_round0.csv --mutation-range "50-150"

# Mutate specific residues and ranges
plm propose-initial --config-path config.yaml "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG" proposals_round0.csv --mutation-range "5,10,15-20,30-40"
```

This feature is useful when:
- You want to focus on specific functional domains
- You have prior knowledge about which regions are more likely to yield beneficial mutations
- You want to avoid disrupting critical structural elements
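The range syntax above (comma-separated positions and dash-separated spans) can be parsed with a few lines of Python. This is an illustrative sketch; the framework's own parser may handle edge cases differently.

```python
def parse_mutation_range(spec):
    """Parse a spec like "5,10,15-20,30-40" into a sorted list of
    1-based residue positions."""
    positions = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            positions.update(range(int(lo), int(hi) + 1))  # inclusive span
        else:
            positions.add(int(part))
    return sorted(positions)
```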
After selecting variants for testing:

```bash
# Embed sequences from a FASTA or CSV file
plm embed proposals_round0.csv embeddings.h5 --model-name facebook/esm2_t33_650M_UR50D
```

After obtaining experimental measurements:
```bash
# Train model and run active learning loop
plm learn --config-path config.yaml measured_data.csv results/ --n-rounds 5 --batch-size 10

# Propose new variants using the trained model
plm propose --config-path config.yaml model.pkl candidates.fasta proposals.csv --batch-size 10
```

The framework accepts several input file formats:
FASTA, used for protein sequences without experimental data:

```
>variant_1
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
>variant_2
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAAG
```
CSV, used for variants with experimental measurements:

```
id,sequence,score,uncertainty
variant_1,MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG,0.85,0.05
variant_2,MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAAG,0.92,0.03
```
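The measurement CSV can be loaded with the standard library alone. A minimal sketch, assuming the exact column names shown above (`read_measurements` is a hypothetical helper, not part of the framework's API):

```python
import csv
import io

def read_measurements(f):
    """Read a measurement CSV (id, sequence, score, uncertainty) into a
    list of dicts with the numeric fields converted to float."""
    rows = []
    for row in csv.DictReader(f):
        row["score"] = float(row["score"])
        row["uncertainty"] = float(row["uncertainty"])
        rows.append(row)
    return rows
```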
1. Initial Setup: Start with the original protein sequence
2. Generate Candidates: Create a set of candidate variants
3. Round 0 (Initial Exploration): Propose diverse variants without prior measurements
4. Experimental Testing: Assay the proposed variants and record measurements
5. Embed Measured Sequences: Add the experimentally measured sequences to the embedding file
6. Active Learning Loop: Train model → Propose variants → Assay → Update model → Repeat
7. Analysis: Evaluate model performance and extract insights
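The train → propose → assay → update cycle above can be sketched with a closed-form ridge model on embeddings and a UCB-style acquisition. This is a self-contained toy version: the oracle is simulated, the exploration bonus is a cheap stand-in for predictive variance, and all function names are hypothetical, not the framework's API.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ucb_scores(X, w, beta=1.0):
    """UCB-style acquisition: predicted mean plus an exploration bonus.
    Distance from the pool centroid stands in for predictive variance
    (an assumption for this sketch, not the framework's estimator)."""
    return X @ w + beta * np.linalg.norm(X - X.mean(axis=0), axis=1)

def active_learning_loop(oracle, X_pool, n_rounds, batch_size, rng):
    """Iterate: fit on measured data, pick the top-UCB batch, assay it."""
    # Round 0: a random batch, since no measurements exist yet
    idx = rng.choice(len(X_pool), batch_size, replace=False)
    measured, scores = list(idx), [oracle(X_pool[i]) for i in idx]
    for _ in range(n_rounds):
        w = fit_ridge(X_pool[measured], np.array(scores))
        acq = ucb_scores(X_pool, w)
        acq[measured] = -np.inf           # never re-propose measured variants
        batch = np.argsort(acq)[-batch_size:]
        measured += list(batch)
        scores += [oracle(X_pool[i]) for i in batch]
    return measured, scores
```

Swapping `ucb_scores` for Expected Improvement or Thompson sampling changes only the acquisition step; the rest of the loop is unchanged.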
The framework includes tools for benchmarking performance on protein engineering datasets, such as those from ProteinGym.
```bash
# Run benchmark on a ProteinGym dataset
python scripts/benchmark_protein_gym.py \
    --config config.yaml \
    --dataset test_data/S22A1_HUMAN_Yee_2023_activity.csv \
    --output-dir benchmark_results/s22a1 \
    --n-rounds 5 \
    --strategies ucb,ei,ts,diversity
```

This will:
- Split the dataset into training and test sets
- Simulate multiple rounds of active learning
- Evaluate performance metrics after each round
- Generate plots comparing different acquisition strategies
The benchmark tracks several performance metrics across rounds:
- R²: Coefficient of determination (higher is better)
- RMSE: Root mean squared error (lower is better)
- Spearman Correlation: Rank correlation between predictions and actual values (higher is better)
- Top-N Mean: Average score of the top N predicted variants (higher is better)
These metrics help evaluate how well the model learns to predict protein function and how effectively the active learning strategy explores the sequence space.
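The rank-based metrics can be computed with numpy alone. A minimal sketch (the tie handling is simplified relative to a full Spearman implementation, which averages tied ranks):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error (lower is better)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def spearman(y_true, y_pred):
    """Spearman correlation: Pearson correlation of the ranks (ties ignored)."""
    def ranks(a):
        r = np.empty(len(a))
        r[np.argsort(a)] = np.arange(len(a))
        return r
    return float(np.corrcoef(ranks(np.asarray(y_true)),
                             ranks(np.asarray(y_pred)))[0, 1])

def top_n_mean(y_true, y_pred, n):
    """Mean true score of the n variants the model ranks highest."""
    top = np.argsort(np.asarray(y_pred))[-n:]
    return float(np.mean(np.asarray(y_true)[top]))
```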
MIT