Arcade is a framework for controllable codon design with flexible optimization targets such as MFE (Minimum Free Energy), CAI (Codon Adaptation Index), GC content, and specific gene expression levels.
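Two of these targets are straightforward to compute directly from a coding sequence. The sketch below shows common definitions of GC content and CAI for illustration only; it is not the code in this repository's calculator module, and the CAI weight table is assumed to be supplied by the user.

```python
import math


def gc_content(sequence: str) -> float:
    """Fraction of G/C nucleotides in a coding sequence."""
    sequence = sequence.upper()
    return sum(base in "GC" for base in sequence) / len(sequence) if sequence else 0.0


def cai(codons: list[str], weights: dict[str, float]) -> float:
    """Codon Adaptation Index: geometric mean of per-codon
    relative-adaptiveness weights over the coding sequence."""
    logs = [math.log(weights[c]) for c in codons if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0


print(gc_content("ATGGCGGCA"))  # 0.666...
```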
We recommend starting from a clean Conda environment:

```bash
conda create -n arcade python=3.10
conda activate arcade
```

You can then install the dependencies as follows. Dependency management extends the setup from the original CodonBERT repository, with updated dependencies defined in pyproject.toml:

```bash
pip install poetry
poetry install
```

Make sure CUDA drivers are installed if you plan to use a GPU.
We use the CodonBERT PyTorch model, which can be downloaded from the CodonBERT repository (CodonBERT: large language models for mRNA design and optimization). The model artifact is distributed under its own license, and the code and repository are distributed under a software license.
Arcade (Activation Engineering for Controllable Codon Design) builds on CodonBERT and introduces a framework for:
- Controllable codon design
- Steering toward continuous-valued metrics (see the sketch below)
- Steering an encoder-only foundation model
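At its core, activation steering shifts the encoder's hidden states along a precomputed direction during inference. The sketch below illustrates the idea with a PyTorch forward hook; it is a minimal illustration under assumed shapes and names (layer, hidden size, scale), not Arcade's actual implementation.

```python
# Minimal activation-steering sketch: add a scaled steering vector to a layer's
# hidden states via a forward hook. All names and sizes here are placeholders.
import torch
import torch.nn as nn


def add_steering_hook(layer: nn.Module, steering_vector: torch.Tensor, scale: float = 1.0):
    """Register a forward hook that shifts the layer output by scale * steering_vector."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector.to(device=hidden.device, dtype=hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)


if __name__ == "__main__":
    hidden_size = 768                                  # placeholder hidden size
    layer = nn.Linear(hidden_size, hidden_size)        # stand-in for an encoder block
    steering_vector = torch.randn(hidden_size)         # would be loaded from data/steering_vectors
    handle = add_steering_hook(layer, steering_vector, scale=0.5)
    hidden_states = torch.randn(1, 16, hidden_size)    # (batch, tokens, hidden)
    steered = layer(hidden_states)                     # outputs shifted toward the target metric
    handle.remove()
```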
```
Arcade/
├── scripts/                      # Main scripts
│   ├── recons_token_cls.py       # Script for codon design
│   ├── finetune_token_cls.py     # Script for fine-tuning the pretrained base model with a token classification head
│   └── fetch_steering_vectors.py # Script for building steering vectors
├── data/                         # Data for fine-tuning and steering vectors
│   ├── GENCODE                   # Data to download for fine-tuning
│   └── steering_vectors          # Steering vectors for sequence design
├── checkpoints/                  # Fine-tuned models
│   ├── arcade/                   # Arcade checkpoint
│   └── codonbert/                # CodonBERT checkpoint
├── results/                      # Model outputs
└── calculator/                   # Calculators for metrics (e.g. CAI, MFE, GC content)
```
We provide a checkpoint with a token classification head, along with precomputed steering vectors.
```bash
cd checkpoint
wget https://cdn.prod.accelerator.sanofi/llm/CodonBERT.zip
unzip CodonBERT.zip
sed -i 's|"base_model_name_or_path": *".*"|"base_model_name_or_path": "'"$(realpath codonbert/)"'"|g' arcade/adapter_config.json
```
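The sed command rewrites base_model_name_or_path in arcade/adapter_config.json so that the adapter points at the absolute path of the unzipped codonbert/ checkpoint. If sed is not convenient, an equivalent edit can be made in Python (a sketch assuming the layout above, run from the checkpoint directory):

```python
# Point the adapter config at the local CodonBERT base model
# (equivalent to the sed command above).
import json
from pathlib import Path

config_path = Path("arcade/adapter_config.json")
config = json.loads(config_path.read_text())
config["base_model_name_or_path"] = str(Path("codonbert").resolve())
config_path.write_text(json.dumps(config, indent=2))
```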
If you only want to design a single sequence:

```bash
cd ../scripts
CUDA_VISIBLE_DEVICES=0 python -u recons_token_cls.py \
    --lambda_gc 1 \
    --single_example \
    --input_example ATGCCA
```

To use the test data at data/GENCODE/gencode.v47.pc_transcripts_cds_test.fa:
```bash
CUDA_VISIBLE_DEVICES=0 python -u recons_token_cls.py \
    --lambda_cai 1 \
    --save_file_name 'cai'
```

With multiple targets:
```bash
CUDA_VISIBLE_DEVICES=0 python -u recons_token_cls.py \
    --lambda_cai 1 \
    --lambda_gc 1 \
    --save_file_name 'cai+gc'
```

If you want to use your own data and base model, you can train the token classification head and generate steering vectors using the following commands:
```bash
CUDA_VISIBLE_DEVICES=0 python scripts/finetune_token_cls.py \
    --model_path path/to/downloaded/base/model \
    --data_path path/to/your/training/data \
    --use_lora
```

Here --model_path points to the downloaded base model (e.g., the CodonBERT checkpoint). The fetch_steering_vectors.py script computes steering vectors between two groups of mutated sequences (e.g., high vs. low expression) using a frozen CodonBERT model.
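Conceptually, a steering vector can be obtained as the difference of mean hidden activations between the two groups. The sketch below is illustrative only, assuming a Hugging Face-style encoder and tokenizer; the pooling choice and function names are not taken from this repository.

```python
# Illustrative difference-of-means steering vector: mean activation of the
# "high" group minus the "low" group, from a frozen encoder.
import torch


@torch.no_grad()
def mean_hidden_state(model, tokenizer, sequences, layer: int = -1) -> torch.Tensor:
    """Average a frozen encoder's hidden states over tokens and sequences."""
    states = []
    for seq in sequences:
        inputs = tokenizer(seq, return_tensors="pt")
        outputs = model(**inputs, output_hidden_states=True)
        states.append(outputs.hidden_states[layer].mean(dim=(0, 1)))  # (hidden,)
    return torch.stack(states).mean(dim=0)


def compute_steering_vector(model, tokenizer, high_seqs, low_seqs, layer: int = -1) -> torch.Tensor:
    """Steering vector = mean(high-group activations) - mean(low-group activations)."""
    high = mean_hidden_state(model, tokenizer, high_seqs, layer)
    low = mean_hidden_state(model, tokenizer, low_seqs, layer)
    return high - low
```

In the repository, this step is run with fetch_steering_vectors.py: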
```bash
CUDA_VISIBLE_DEVICES=0 python scripts/fetch_steering_vectors.py \
    --data_type fasta \
    --high_fa_path data/for_steering/high_gc.fa \
    --low_fa_path data/for_steering/low_gc.fa \
    --model_path checkpoint/arcade \
    --save_name 'gc' \
    --save_dir data/steering_vectors
```

This project is licensed under the ARCADE Software License Agreement.