
RNAPro: An accurate RNA structure prediction model by Kaggle synthesis

Model Description

RNAPro is a state-of-the-art RNA 3D folding model developed in collaboration with the hosts and winners of the Stanford RNA 3D Folding Kaggle competition. The model incorporates RNA-specific modules — including template modeling, multiple sequence alignment (MSA), and a pretrained RNA language model — to enhance RNA structure prediction performance. Read more about the kaggle competition and model in the preprint.

Installation

Environment setup (conda example)

1. Clone this repository and cd into it

git clone https://github.com/NVIDIA-Digital-Bio/RNAPro
cd ./RNAPro

2. Create a conda environment

conda create -n rnapro python=3.12 -y
conda activate rnapro
pip install -r requirements.txt

3. Install RNAPro

pip install -e .

Docker (Recommended for Training and Inference)

The code was developed using the nvcr.io/nvidia/pytorch:25.09-py3 Docker image. Run step 1 (clone the repository) on the host, then run step 3 (pip install -e .) inside the container.

For more detailed instruction steps, check the Docker Installation guide.
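As a minimal sketch (the exact flags and mounts are covered in the Docker Installation guide and may differ for your setup), an interactive session with this image could look like:

```shell
# Hypothetical invocation; adjust GPU flags, mounts, and paths to your setup.
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace/RNAPro \
    -w /workspace/RNAPro \
    nvcr.io/nvidia/pytorch:25.09-py3
# Then, inside the container (step 3): pip install -e .
```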

Train Models

1. Data preparation

Data is expected to be in the release_data directory.

mkdir release_data

Download training & MSA data:

This will take some time and requires ~100GB of disk space.

cd release_data
mkdir kaggle; cd kaggle
curl -L -o stanford-rna-3d-folding-all-atom-train-data.zip https://www.kaggle.com/api/v1/datasets/download/rhijudas/stanford-rna-3d-folding-all-atom-train-data
unzip stanford-rna-3d-folding-all-atom-train-data.zip
rm stanford-rna-3d-folding-all-atom-train-data.zip

# Get MSAs from https://www.kaggle.com/datasets/rhijudas/clone-of-stanford-rna-3d-modeling-competition-data
curl -L -o stanford-rna-3d-folding.zip https://www.kaggle.com/api/v1/datasets/download/rhijudas/clone-of-stanford-rna-3d-modeling-competition-data
unzip stanford-rna-3d-folding.zip "MSA_v2/*"
rm stanford-rna-3d-folding.zip
cd ../..

Download CCD cache:

To train the model, you will need the CCD cache, which is generated by processing the file from https://files.wwpdb.org/pub/pdb/data/monomers/components.cif.gz. Running the command below creates the following files:

python3 scripts/gen_ccd_cache.py
release_data/
└── ccd_cache/
    ├── components.cif
    ├── components.cif.rdkit_mol.pkl
    └── clusters-by-entity-40.txt

Download Protenix pretrained checkpoints:

cd release_data
mkdir protenix_models; cd protenix_models
wget https://af3-dev.tos-cn-beijing.volces.com/release_model/protenix_base_default_v0.5.0.pt
cd ../..

Download RibonanzaNet2 pretrained checkpoint:

cd release_data
mkdir ribonanzanet2_checkpoint; cd ribonanzanet2_checkpoint
curl -L -o ribonanzanet2.tar.gz https://www.kaggle.com/api/v1/models/shujun717/ribonanzanet2/pyTorch/alpha/1/download
tar -xzvf ribonanzanet2.tar.gz
rm ribonanzanet2.tar.gz
cd ../..

Templates can be obtained in two ways:

  • MMseqs2-based identification: https://www.kaggle.com/code/rhijudas/mmseqs2-3d-rna-template-identification
  • Kaggle 1st place TBM-only approach: https://www.kaggle.com/code/jaejohn/rna-3d-folds-tbm-only-approach

Each of these notebooks generates a submission.csv file containing the templates; the CSV can be downloaded from the output section of the notebook. You do not need to rerun the notebooks unless you want to generate templates for new sequences.

To convert the csv into the training-ready or inference-ready binary format, use:

python preprocess/convert_templates_to_pt_files.py --input_csv <path/to/submission.csv> --output_name template_features.pt

The resulting .pt file(s) can be referenced via template_data in configs and used with use_template='ca_precomputed'.

Overview

After running the above steps, the repository structure should look like this:

release_data/
├── ccd_cache/
│   ├── clusters-by-entity-40.txt
│   ├── components.v20240608.cif
│   └── components.v20240608.cif.rdkit_mol.pkl
├── kaggle/
│   ├── MSA_v2/
│   ├── <training data files>
│   └── template_features.pt
├── protenix_models/
│   └── protenix_base_default_v0.5.0.pt
└── ribonanzanet2_checkpoint/
    ├── dropout.py
    ├── Network.py
    ├── pairwise.yaml
    └── pytorch_model_fsdp.bin

2. Training

We provide the trained model checkpoints via NGC and HuggingFace (Public best and Private best).

We provide a convenience script for training; modify it to suit your purpose:

sh rnapro_train_example.sh

Inference

Expected Input & Output Format

For details on the input format and output format, please refer to the overview.

1. Prepare inputs

  • Input csv files

    • Prepare a CSV file with the columns: target_id and sequence.
  • RNA MSA

    • MSAs are user-provided.
  • Templates (same as training)

    • Obtain templates via either:
      • MMseqs2-based identification: https://www.kaggle.com/code/rhijudas/mmseqs2-3d-rna-template-identification
      • Kaggle 1st place TBM-only approach: https://www.kaggle.com/code/jaejohn/rna-3d-folds-tbm-only-approach (generally stronger)
    • Each approach produces a submission.csv. Convert it to .pt:
      • python preprocess/convert_templates_to_pt_files.py --input_csv path/to/submission.csv --output_name path/to/template_features.pt --max_n 40
      • Use with --use_template ca_precomputed --template_data path/to/template_features.pt.
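As a concrete illustration, a minimal input CSV for inference can be created as below; the target ID and RNA sequence here are placeholders, not real competition targets.

```shell
# Write a minimal sequences.csv with the two required columns.
cat > sequences.csv <<'EOF'
target_id,sequence
example_target,GGGAAACUUCGGGAAACUCC
EOF

# Verify the header row.
head -n 1 sequences.csv
```

The resulting file can then be passed to the CLI via --sequences_csv sequences.csv.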

Inference via Bash Script

You can run inference via script:

bash rnapro_inference_example.sh

The script configures and forwards the following parameters to the CLI:

  • --model_name: Model config to use (e.g., rnapro_base).
  • --dump_dir: Directory where inference results are saved.
  • --load_checkpoint_path: Path to the model checkpoint .pt.
  • --seeds: Comma-separated seeds (default in example: 42).
  • --dtype: Precision (bf16 or fp32).
  • --use_msa: Enable MSAs (recommended for RNA).
  • --rna_msa_dir: Directory containing precomputed MSAs.
  • --use_template: Template mode (use ca_precomputed for prepared templates).
  • --template_data: Path to .pt template file converted from submission.csv.
  • --template_idx: Top-k template selection index:
    • 0 -> top1, 1 -> top2, 2 -> top3, 3 -> top4, 4 -> top5
  • --num_templates: Number of templates to use (e.g., 10).
  • --model.N_cycle: Number of recycling cycles (e.g., 10).
  • --sample_diffusion.N_sample: Number of samples per seed (e.g., 1).
  • --sample_diffusion.N_step: Diffusion steps (e.g., 200).
  • --load_strict: Strict weight loading.
  • --num_workers: Data loader workers.
  • --triangle_attention / --triangle_multiplicative: Kernel backends (torch, cuequivariance, etc.).
  • --sequences_csv: Optional CSV with headers sequence,target_id for batched inference.
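Putting the flags together, the forwarded arguments might be assembled roughly as follows. All values and paths here are illustrative placeholders; the actual CLI entry point and defaults live in rnapro_inference_example.sh.

```shell
# Illustrative flag set only; substitute real paths and pass the array to
# whatever CLI entry point rnapro_inference_example.sh invokes.
ARGS=(
    --model_name rnapro_base
    --dump_dir ./inference_results
    --load_checkpoint_path ./release_data/model_checkpoint.pt  # placeholder path
    --seeds 42
    --dtype bf16
    --use_msa true
    --rna_msa_dir ./release_data/kaggle/MSA_v2
    --use_template ca_precomputed
    --template_data ./release_data/kaggle/template_features.pt
    --template_idx 0
    --num_templates 10
    --model.N_cycle 10
    --sample_diffusion.N_sample 1
    --sample_diffusion.N_step 200
    --sequences_csv ./sequences.csv
)
printf '%s\n' "${ARGS[@]}"
```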

Acceleration

Training and inference can be accelerated using various optimized kernels (e.g., cuEquivariance, Triton, and specialized LayerNorm/attention backends). Refer to the Kernels Setup Guide for installation steps, supported options, and recommended configurations.

Configuration Notes

RibonanzaNet2

RNAPro uses a frozen, pre-trained RNA foundation model, RibonanzaNet2, as an encoder to extract RNA sequence and pairwise features. These features are projected and injected, together with RNA templates and learned gating, into an RNA post-trained Protenix model. The RibonanzaNet2 module can be enabled or disabled:

--model.use_RibonanzaNet2 true
--model.ribonanza_net_path ./release_data/ribonanzanet2_checkpoint

Template Modes

| Mode | Use Case |
| --- | --- |
| ca_precomputed | Inference with precomputed C1' templates |
| masked_templates | Training that masks the ground-truth structure when no template dataset is available |
--use_template ca_precomputed
--model.use_template ca_precomputed

Template Embedder Options

# Number of pairformer blocks in template embedder
--model.template_embedder.n_blocks 2
--num_templates 4

Acknowledgements

We thank the Stanford Das Lab, HHMI, and the co-hosts and winners of the Stanford RNA 3D Folding Kaggle competition for their collaboration in this research.

License

Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
The source code is made available under Apache-2.0.
The model weights are made available under the NVIDIA Open Model License.
