Kingsford-Group/arcade

Arcade: Activation Engineering for Controllable Codon Design

Arcade is a framework for controllable codon design with flexible optimization targets such as MFE (minimum free energy), CAI (Codon Adaptation Index), GC content, and specific gene expression levels.

Environment Setup

We recommend starting from a clean Conda environment:

conda create -n arcade python=3.10
conda activate arcade

Dependency management extends the setup from the original CodonBERT repository, with updated dependencies defined in pyproject.toml. Install them with Poetry:

pip install poetry
poetry install

If you plan to use a GPU, make sure CUDA drivers are installed.

Download Pretrained Model CodonBERT

We use the CodonBERT PyTorch model, which can be downloaded from the CodonBERT repository (CodonBERT: large language models for mRNA design and optimization).
The artifact is under a license. The code and repository are under a software license.

Arcade

Arcade (Activation Engineering for Controllable Codon Design) builds on CodonBERT and introduces a framework for:

  • Controllable codon design
  • Steering toward metrics with continuous values
  • Steering an encoder-only foundation model

Project Structure

Arcade/
├── scripts/               # Main scripts
│   ├── recons_token_cls.py         # script for codon design
│   ├── finetune_token_cls.py       # script for finetuning the pretrained base model with token classification head
│   └── fetch_steering_vectors.py   # script for making steering vectors
├── data/                  # Data for finetuning and steering vectors
│   ├── GENCODE            # Download data for finetuning
│   └── steering_vectors   # Steering vectors for designing sequence
├── checkpoints/           # fine-tuned models
│   ├── arcade/            # Arcade checkpoint
│   └── codonbert/         # CodonBERT checkpoint
├── results/               # Model outputs
└── calculator/            # Calculator for metrics (e.g. CAI, MFE, GC content)
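
To make two of the metrics named above concrete, here is a minimal sketch of GC content and CAI for a coding sequence. It is not the code in calculator/. The function names and the toy CAI weight table are hypothetical, and MFE is left out because it requires an RNA folding tool.

```python
import math


def split_codons(cds: str) -> list[str]:
    """Split a coding sequence into codons (hypothetical helper)."""
    cds = cds.upper().replace("U", "T")
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("CDS length must be a positive multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def cai(cds: str, weights: dict[str, float]) -> float:
    """Codon Adaptation Index: geometric mean of per-codon weights.

    `weights` maps each codon to its relative adaptiveness in (0, 1],
    i.e. its frequency divided by that of the most frequent synonymous
    codon in a reference set. The table must be supplied by the caller.
    """
    codons = split_codons(cds)
    log_sum = sum(math.log(weights[codon]) for codon in codons)
    return math.exp(log_sum / len(codons))


if __name__ == "__main__":
    toy_weights = {"ATG": 1.0, "CCA": 0.5}  # hypothetical values
    print(gc_content("ATGCCA"))
    print(cai("ATGCCA", toy_weights))
```

A real CAI table is derived from a reference set of highly expressed genes for the target organism; the two-entry table here only exercises the formula.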

Quick Start

We provide a checkpoint with a token classification head, along with precomputed steering vectors.

Codon Design with Different Targets

Download the CodonBERT base model and point the Arcade adapter config at it:

cd checkpoint
wget https://cdn.prod.accelerator.sanofi/llm/CodonBERT.zip
unzip CodonBERT.zip
sed -i 's|"base_model_name_or_path": *".*"|"base_model_name_or_path": "'"$(realpath codonbert/)"'"|g' arcade/adapter_config.json
cd ../scripts

To design a single sequence:

CUDA_VISIBLE_DEVICES=0 python -u recons_token_cls.py \
  --lambda_gc 1 \
  --single_example \
  --input_example ATGCCA

To run on the test data at data/GENCODE/gencode.v47.pc_transcripts_cds_test.fa:

CUDA_VISIBLE_DEVICES=0 python -u recons_token_cls.py \
  --lambda_cai 1 \
  --save_file_name 'cai' 

With multiple targets:

CUDA_VISIBLE_DEVICES=0 python -u recons_token_cls.py \
  --lambda_cai 1 \
  --lambda_gc 1 \
  --save_file_name 'cai+gc' 
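
The README does not say how the --lambda_* flags are used internally. One common reading in activation engineering is that each flag scales a steering vector before it is added to a hidden state. The sketch below shows that reading only; `apply_steering` and its arguments are hypothetical names, and plain Python lists stand in for model tensors.

```python
def apply_steering(
    hidden: list[float],
    vectors: dict[str, list[float]],
    lambdas: dict[str, float],
) -> list[float]:
    """Return hidden + sum(lambda_k * vector_k) over the requested targets.

    Assumption: each target (e.g. "cai", "gc") has one steering vector of
    the same dimension as the hidden state, scaled by its own coefficient.
    """
    steered = list(hidden)
    for name, lam in lambdas.items():
        vec = vectors[name]
        if len(vec) != len(steered):
            raise ValueError(f"dimension mismatch for target {name!r}")
        steered = [h + lam * v for h, v in zip(steered, vec)]
    return steered


if __name__ == "__main__":
    hidden_state = [0.0, 1.0]
    steering = {"cai": [1.0, 0.0], "gc": [0.0, 2.0]}  # toy vectors
    print(apply_steering(hidden_state, steering, {"cai": 1.0, "gc": 1.0}))
```

Under this reading, a coefficient of 0 leaves the corresponding target unsteered, and larger values push the representation further along that direction.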

To use your own data and base model, train the token classification head and generate steering vectors with the following commands:

Fine-Tune Token Classification Model

# --model_path: the downloaded base model (e.g., CodonBERT checkpoints)
CUDA_VISIBLE_DEVICES=0 python scripts/finetune_token_cls.py \
  --model_path path/to/downloaded/base/model \
  --data_path path/to/your/training/data \
  --use_lora

Fetch Steering Vectors

This script computes steering vectors between two groups of mutated sequences (e.g., high vs. low expression) using a frozen CodonBERT model.

CUDA_VISIBLE_DEVICES=0 python scripts/fetch_steering_vectors.py \
  --data_type fasta \
  --high_fa_path data/for_steering/high_gc.fa \
  --low_fa_path data/for_steering/low_gc.fa \
  --model_path checkpoint/arcade \
  --save_name 'gc' \
  --save_dir data/steering_vectors

License

This project is licensed under the ARCADE Software License Agreement.
