
RNAPro: An accurate RNA structure prediction model by Kaggle synthesis

Model Description

RNAPro is a state-of-the-art RNA 3D folding model developed in collaboration with the hosts and winners of the Stanford RNA 3D Folding Kaggle competition. The model incorporates RNA-specific modules — including template modeling, multiple sequence alignment (MSA), and a pretrained RNA language model — to enhance RNA structure prediction performance. Read more about the kaggle competition and model in the preprint.

Installation

Environment setup (conda example)

1. Clone this repository and cd into it

git clone https://github.com/NVIDIA-Digital-Bio/RNAPro
cd ./RNAPro

2. Create a conda environment

conda create -n rnapro python=3.12 -y
conda activate rnapro
pip install -r requirements.txt

3. Install RNAPro

pip install -e .

Docker (Recommended for Training and Inference)

The code was developed using the nvcr.io/nvidia/pytorch:25.09-py3 Docker image. Run step 1 (clone the repository) on the host, then run step 3 (pip install -e .) inside the container.

For more detailed instruction steps, check the Docker Installation guide.
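As a minimal sketch (the exact flags and mounts are covered in the Docker Installation guide and may differ for your setup), an interactive session with this image could look like:

```shell
# Hypothetical invocation; adjust GPU flags, mounts, and paths to your setup.
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace/RNAPro \
    -w /workspace/RNAPro \
    nvcr.io/nvidia/pytorch:25.09-py3
# Then, inside the container (step 3): pip install -e .
```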

Train Models

1. Data preparation

Data is expected to be in the release_data directory.

mkdir release_data

Download training & MSA data:

This will take some time and requires ~100GB of disk space.

cd release_data
mkdir kaggle; cd kaggle
curl -L -o stanford-rna-3d-folding-all-atom-train-data.zip https://www.kaggle.com/api/v1/datasets/download/rhijudas/stanford-rna-3d-folding-all-atom-train-data
unzip stanford-rna-3d-folding-all-atom-train-data.zip
rm stanford-rna-3d-folding-all-atom-train-data.zip

# Get MSAs from https://www.kaggle.com/datasets/rhijudas/clone-of-stanford-rna-3d-modeling-competition-data
curl -L -o stanford-rna-3d-folding.zip https://www.kaggle.com/api/v1/datasets/download/rhijudas/clone-of-stanford-rna-3d-modeling-competition-data
unzip stanford-rna-3d-folding.zip "MSA_v2/*"
rm stanford-rna-3d-folding.zip
cd ../..

Download CCD cache:

To train the model, you will need the CCD cache, which is generated by processing the file from https://files.wwpdb.org/pub/pdb/data/monomers/components.cif.gz. Running the command below creates the following files:

python3 scripts/gen_ccd_cache.py
release_data/
└── ccd_cache/
    ├── components.cif
    ├── components.cif.rdkit_mol.pkl
    └── clusters-by-entity-40.txt

Download Protenix pretrained checkpoints:

cd release_data
mkdir protenix_models; cd protenix_models
wget https://af3-dev.tos-cn-beijing.volces.com/release_model/protenix_base_default_v0.5.0.pt
cd ../..

Download RibonanzaNet2 pretrained checkpoint:

cd release_data
mkdir ribonanzanet2_checkpoint; cd ribonanzanet2_checkpoint
curl -L -o ribonanzanet2.tar.gz https://www.kaggle.com/api/v1/models/shujun717/ribonanzanet2/pyTorch/alpha/1/download
tar -xzvf ribonanzanet2.tar.gz
rm ribonanzanet2.tar.gz
cd ../..

Templates can be obtained in two ways:

  • MMseqs2-based identification: https://www.kaggle.com/code/rhijudas/mmseqs2-3d-rna-template-identification
  • Kaggle 1st place TBM-only approach: https://www.kaggle.com/code/jaejohn/rna-3d-folds-tbm-only-approach

Each of these notebooks generates a submission.csv file containing the templates; the CSV can be downloaded from the output section of the notebook. You do not need to rerun the notebooks unless you want to generate templates for new sequences.

To convert the csv into the training-ready or inference-ready binary format, use:

python preprocess/convert_templates_to_pt_files.py --input_csv <path/to/submission.csv> --output_name template_features.pt

The resulting .pt file(s) can be referenced via template_data in configs and used with use_template='ca_precomputed'.

Overview

After running the above steps, the repository structure should look like this:

release_data/
├── ccd_cache/
│   ├── clusters-by-entity-40.txt
│   ├── components.v20240608.cif
│   └── components.v20240608.cif.rdkit_mol.pkl
├── kaggle/
│   ├── MSA_v2/
│   ├── <training data files>
│   └── template_features.pt
├── protenix_models/
│   └── protenix_base_default_v0.5.0.pt
└── ribonanzanet2_checkpoint/
    ├── dropout.py
    ├── Network.py
    ├── pairwise.yaml
    └── pytorch_model_fsdp.bin

2. Training

We provide the trained model checkpoints via NGC and HuggingFace (Public best and Private best).

We provide a convenience script for training; modify it to suit your purpose:

sh rnapro_train_example.sh

Inference

Expected Input & Output Format

For details on the input format and output format, please refer to the overview.

1. Prepare inputs

  • Input csv files

    • Prepare a CSV file with the columns: target_id and sequence.
  • RNA MSA

    • MSAs are user-provided.
  • Templates (same as training)

    • Obtain templates via either:
      • MMseqs2-based identification: https://www.kaggle.com/code/rhijudas/mmseqs2-3d-rna-template-identification
      • Kaggle 1st place TBM-only approach: https://www.kaggle.com/code/jaejohn/rna-3d-folds-tbm-only-approach (generally stronger)
    • Each approach produces a submission.csv. Convert it to .pt:
      • python preprocess/convert_templates_to_pt_files.py --input_csv path/to/submission.csv --output_name path/to/template_features.pt --max_n 40
      • Use with --use_template ca_precomputed --template_data path/to/template_features.pt.
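As a concrete illustration, a minimal input CSV for inference can be created as below; the target ID and RNA sequence here are placeholders, not real competition targets.

```shell
# Write a minimal sequences.csv with the two required columns.
cat > sequences.csv <<'EOF'
target_id,sequence
example_target,GGGAAACUUCGGGAAACUCC
EOF

# Verify the header row.
head -n 1 sequences.csv
```

The resulting file can then be passed to the CLI via --sequences_csv sequences.csv.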

Inference via Bash Script

You can run inference via script:

bash rnapro_inference_example.sh

The script configures and forwards the following parameters to the CLI:

  • --model_name: Model config to use (e.g., rnapro_base).
  • --dump_dir: Directory where inference results are saved.
  • --load_checkpoint_path: Path to the model checkpoint .pt.
  • --seeds: Comma-separated seeds (default in example: 42).
  • --dtype: Precision (bf16 or fp32).
  • --use_msa: Enable MSAs (recommended for RNA).
  • --rna_msa_dir: Directory containing precomputed MSAs.
  • --use_template: Template mode (use ca_precomputed for prepared templates).
  • --template_data: Path to .pt template file converted from submission.csv.
  • --template_idx: Top-k template selection index:
    • 0 -> top1, 1 -> top2, 2 -> top3, 3 -> top4, 4 -> top5
  • --num_templates: Number of templates to use (e.g., 10).
  • --model.N_cycle: Number of recycling cycles (e.g., 10).
  • --sample_diffusion.N_sample: Number of samples per seed (e.g., 1).
  • --sample_diffusion.N_step: Diffusion steps (e.g., 200).
  • --load_strict: Strict weight loading.
  • --num_workers: Data loader workers.
  • --triangle_attention / --triangle_multiplicative: Kernel backends (torch, cuequivariance, etc.).
  • --sequences_csv: Optional CSV with headers sequence,target_id for batched inference.
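Putting the flags together, the forwarded arguments might be assembled roughly as follows. All values and paths here are illustrative placeholders; the actual CLI entry point and defaults live in rnapro_inference_example.sh.

```shell
# Illustrative flag set only; substitute real paths and pass the array to
# whatever CLI entry point rnapro_inference_example.sh invokes.
ARGS=(
    --model_name rnapro_base
    --dump_dir ./inference_results
    --load_checkpoint_path ./release_data/model_checkpoint.pt  # placeholder path
    --seeds 42
    --dtype bf16
    --use_msa true
    --rna_msa_dir ./release_data/kaggle/MSA_v2
    --use_template ca_precomputed
    --template_data ./release_data/kaggle/template_features.pt
    --template_idx 0
    --num_templates 10
    --model.N_cycle 10
    --sample_diffusion.N_sample 1
    --sample_diffusion.N_step 200
    --sequences_csv ./sequences.csv
)
printf '%s\n' "${ARGS[@]}"
```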

Acceleration

Training and inference can be accelerated using various optimized kernels (e.g., cuEquivariance, Triton, and specialized LayerNorm/attention backends). Refer to the Kernels Setup Guide for installation steps, supported options, and recommended configurations.

Configuration Notes

RibonanzaNet2

RNAPro uses a frozen, pre-trained RNA foundation model, RibonanzaNet2, as an encoder to extract RNA sequence and pairwise features. These features are projected and injected, together with RNA templates and learned gating, into an RNA post-trained Protenix model. The RibonanzaNet2 module can be enabled or disabled:

--model.use_RibonanzaNet2 true
--model.ribonanza_net_path ./release_data/ribonanzanet2_checkpoint

Template Modes

| Mode | Use Case |
| --- | --- |
| ca_precomputed | Inference with precomputed C1' templates |
| masked_templates | Training that masks the ground-truth structure when no template dataset is available |
--use_template ca_precomputed
--model.use_template ca_precomputed

Template Embedder Options

# Number of pairformer blocks in template embedder
--model.template_embedder.n_blocks 2
--num_templates 4

Acknowledgements

We thank the Stanford Das Lab, HHMI, and the co-hosts and winners of the Stanford RNA 3D Folding Kaggle competition for their collaboration in this research.

License

Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
The source code is made available under Apache-2.0.
The model weights are made available under the NVIDIA Open Model License.
