Arcade is a framework for controllable codon design with flexible optimization targets such as MFE (Minimum Free Energy), CAI (Codon Adaptation Index), GC content, and specific gene expression levels.
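Two of these targets are straightforward to compute directly from a coding sequence. The sketch below shows common definitions of GC content and CAI for illustration only; it is not the code in this repository's calculator module, and the CAI weight table is assumed to be supplied by the user.

```python
import math


def gc_content(sequence: str) -> float:
    """Fraction of G/C nucleotides in a coding sequence."""
    sequence = sequence.upper()
    return sum(base in "GC" for base in sequence) / len(sequence) if sequence else 0.0


def cai(codons: list[str], weights: dict[str, float]) -> float:
    """Codon Adaptation Index: geometric mean of per-codon
    relative-adaptiveness weights over the coding sequence."""
    logs = [math.log(weights[c]) for c in codons if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0


print(gc_content("ATGGCGGCA"))  # 0.666...
```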
We recommend starting from a clean Conda environment:

```bash
conda create -n arcade python=3.10
conda activate arcade
```

You can then install the dependencies as follows. Dependency management extends the setup from the original CodonBERT repository, with updated dependencies defined in pyproject.toml:

```bash
pip install poetry
poetry install
```

Make sure CUDA drivers are installed if you plan to use a GPU.
We use the CodonBERT PyTorch model, which can be downloaded from the CodonBERT repository (CodonBERT: large language models for mRNA design and optimization). The model artifact is distributed under its own license, and the code and repository are distributed under a software license.
Arcade (Activation Engineering for Controllable Codon Design) builds on CodonBERT and introduces a framework for:
- Controllable codon design
- Steering toward continuous-valued metrics (see the sketch below)
- Steering an encoder-only foundation model
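At its core, activation steering shifts the encoder's hidden states along a precomputed direction during inference. The sketch below illustrates the idea with a PyTorch forward hook; it is a minimal illustration under assumed shapes and names (layer, hidden size, scale), not Arcade's actual implementation.

```python
# Minimal activation-steering sketch: add a scaled steering vector to a layer's
# hidden states via a forward hook. All names and sizes here are placeholders.
import torch
import torch.nn as nn


def add_steering_hook(layer: nn.Module, steering_vector: torch.Tensor, scale: float = 1.0):
    """Register a forward hook that shifts the layer output by scale * steering_vector."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector.to(device=hidden.device, dtype=hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)


if __name__ == "__main__":
    hidden_size = 768                                  # placeholder hidden size
    layer = nn.Linear(hidden_size, hidden_size)        # stand-in for an encoder block
    steering_vector = torch.randn(hidden_size)         # would be loaded from data/steering_vectors
    handle = add_steering_hook(layer, steering_vector, scale=0.5)
    hidden_states = torch.randn(1, 16, hidden_size)    # (batch, tokens, hidden)
    steered = layer(hidden_states)                     # outputs shifted toward the target metric
    handle.remove()
```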
```
Arcade/
├── scripts/                      # Main scripts
│   ├── recons_token_cls.py       # Script for codon design
│   ├── finetune_token_cls.py     # Script for fine-tuning the pretrained base model with a token classification head
│   └── fetch_steering_vectors.py # Script for building steering vectors
├── data/                         # Data for fine-tuning and steering vectors
│   ├── GENCODE                   # Data to download for fine-tuning
│   └── steering_vectors          # Steering vectors for sequence design
├── checkpoints/                  # Fine-tuned models
│   ├── arcade/                   # Arcade checkpoint
│   └── codonbert/                # CodonBERT checkpoint
├── results/                      # Model outputs
└── calculator/                   # Calculators for metrics (e.g. CAI, MFE, GC content)
```
We provide a checkpoint with a token classification head, along with precomputed steering vectors.
```bash
cd checkpoint
wget https://cdn.prod.accelerator.sanofi/llm/CodonBERT.zip
unzip CodonBERT.zip
sed -i 's|"base_model_name_or_path": *".*"|"base_model_name_or_path": "'"$(realpath codonbert/)"'"|g' arcade/adapter_config.json
```
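The sed command rewrites base_model_name_or_path in arcade/adapter_config.json so that the adapter points at the absolute path of the unzipped codonbert/ checkpoint. If sed is not convenient, an equivalent edit can be made in Python (a sketch assuming the layout above, run from the checkpoint directory):

```python
# Point the adapter config at the local CodonBERT base model
# (equivalent to the sed command above).
import json
from pathlib import Path

config_path = Path("arcade/adapter_config.json")
config = json.loads(config_path.read_text())
config["base_model_name_or_path"] = str(Path("codonbert").resolve())
config_path.write_text(json.dumps(config, indent=2))
```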
If you only want to design a single sequence:

```bash
cd ../scripts
CUDA_VISIBLE_DEVICES=0 python -u recons_token_cls.py \
    --lambda_gc 1 \
    --single_example \
    --input_example ATGCCA
```

To use the test data at data/GENCODE/gencode.v47.pc_transcripts_cds_test.fa:
```bash
CUDA_VISIBLE_DEVICES=0 python -u recons_token_cls.py \
    --lambda_cai 1 \
    --save_file_name 'cai'
```

With multiple targets:
```bash
CUDA_VISIBLE_DEVICES=0 python -u recons_token_cls.py \
    --lambda_cai 1 \
    --lambda_gc 1 \
    --save_file_name 'cai+gc'
```

If you want to use your own data and base model, you can train the token classification head and generate steering vectors using the following commands:
```bash
CUDA_VISIBLE_DEVICES=0 python scripts/finetune_token_cls.py \
    --model_path path/to/downloaded/base/model \
    --data_path path/to/your/training/data \
    --use_lora
```

Here --model_path points to the downloaded base model (e.g., the CodonBERT checkpoint). The fetch_steering_vectors.py script computes steering vectors between two groups of mutated sequences (e.g., high vs. low expression) using a frozen CodonBERT model.
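Conceptually, a steering vector can be obtained as the difference of mean hidden activations between the two groups. The sketch below is illustrative only, assuming a Hugging Face-style encoder and tokenizer; the pooling choice and function names are not taken from this repository.

```python
# Illustrative difference-of-means steering vector: mean activation of the
# "high" group minus the "low" group, from a frozen encoder.
import torch


@torch.no_grad()
def mean_hidden_state(model, tokenizer, sequences, layer: int = -1) -> torch.Tensor:
    """Average a frozen encoder's hidden states over tokens and sequences."""
    states = []
    for seq in sequences:
        inputs = tokenizer(seq, return_tensors="pt")
        outputs = model(**inputs, output_hidden_states=True)
        states.append(outputs.hidden_states[layer].mean(dim=(0, 1)))  # (hidden,)
    return torch.stack(states).mean(dim=0)


def compute_steering_vector(model, tokenizer, high_seqs, low_seqs, layer: int = -1) -> torch.Tensor:
    """Steering vector = mean(high-group activations) - mean(low-group activations)."""
    high = mean_hidden_state(model, tokenizer, high_seqs, layer)
    low = mean_hidden_state(model, tokenizer, low_seqs, layer)
    return high - low
```

In the repository, this step is run with fetch_steering_vectors.py: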
```bash
CUDA_VISIBLE_DEVICES=0 python scripts/fetch_steering_vectors.py \
    --data_type fasta \
    --high_fa_path data/for_steering/high_gc.fa \
    --low_fa_path data/for_steering/low_gc.fa \
    --model_path checkpoint/arcade \
    --save_name 'gc' \
    --save_dir data/steering_vectors
```

This project is licensed under the ARCADE Software License Agreement.