RNAPro is a state-of-the-art RNA 3D folding model developed in collaboration with the hosts and winners of the Stanford RNA 3D Folding Kaggle competition. The model incorporates RNA-specific modules — including template modeling, multiple sequence alignment (MSA), and a pretrained RNA language model — to enhance RNA structure prediction performance. Read more about the Kaggle competition and the model in the preprint.
- Clone this repository and `cd` into it:

```bash
git clone https://github.com/NVIDIA-Digital-Bio/RNAPro
cd ./RNAPro
```

- Create a conda environment:

```bash
conda create -n rnapro python=3.12 -y
conda activate rnapro
pip install -r requirements.txt
```

- Install RNAPro:

```bash
pip install -e .
```

The code was developed using the `nvcr.io/nvidia/pytorch:25.09-py3` Docker image.
Run step 1 and then, inside the container, run step 3.
For more detailed instruction steps, check the Docker Installation guide.
Data is expected to be in the release_data directory.
```bash
mkdir release_data
cd release_data
```

This will take some time and requires ~100GB of disk space.
```bash
cd release_data
mkdir kaggle; cd kaggle
curl -L -o stanford-rna-3d-folding-all-atom-train-data.zip https://www.kaggle.com/api/v1/datasets/download/rhijudas/stanford-rna-3d-folding-all-atom-train-data
unzip stanford-rna-3d-folding-all-atom-train-data.zip
rm stanford-rna-3d-folding-all-atom-train-data.zip

# Get MSAs from https://www.kaggle.com/datasets/rhijudas/clone-of-stanford-rna-3d-modeling-competition-data
curl -L -o stanford-rna-3d-folding.zip https://www.kaggle.com/api/v1/datasets/download/rhijudas/clone-of-stanford-rna-3d-modeling-competition-data
unzip stanford-rna-3d-folding.zip "MSA_v2/*" -d .
rm stanford-rna-3d-folding.zip
cd ..
```

Note that the archive member paths already include the `MSA_v2/` prefix, so extracting into the current directory (`-d .`) creates `MSA_v2/` here; extracting into `/MSA_v2` would place the files at the filesystem root instead.

To train the model, you will need the CCD cache. The CCD cache is generated by processing the file from https://files.wwpdb.org/pub/pdb/data/monomers/components.cif.gz. You can generate the required files by running the command below, which will create the following files:
```bash
python3 scripts/gen_ccd_cache.py
```

```
release_data/
└── ccd_cache/
    ├── components.cif
    ├── components.cif.rdkit_mol.pkl
    └── clusters-by-entity-40.txt
```
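Before launching a run, it can be useful to confirm the cache was generated. The following is a minimal sketch, not part of the RNAPro codebase; the expected file names are taken from the directory listing above:

```python
from pathlib import Path

# Files the CCD cache generation step is expected to produce
# (names taken from the release_data/ccd_cache listing above).
EXPECTED = [
    "components.cif",
    "components.cif.rdkit_mol.pkl",
    "clusters-by-entity-40.txt",
]


def missing_ccd_files(release_dir: str = "release_data") -> list[str]:
    """Return the expected CCD cache files that are absent."""
    cache = Path(release_dir) / "ccd_cache"
    return [name for name in EXPECTED if not (cache / name).is_file()]


if __name__ == "__main__":
    missing = missing_ccd_files()
    if missing:
        print("Missing CCD cache files:", ", ".join(missing))
    else:
        print("CCD cache looks complete.")
```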
```bash
cd release_data
mkdir protenix_models; cd protenix_models
wget https://af3-dev.tos-cn-beijing.volces.com/release_model/protenix_base_default_v0.5.0.pt
cd ..
```

Download the RibonanzaNet2 pretrained checkpoint:
```bash
cd release_data
mkdir ribonanzanet2_checkpoint; cd ribonanzanet2_checkpoint
curl -L -o ribonanzanet2.tar.gz https://www.kaggle.com/api/v1/models/shujun717/ribonanzanet2/pyTorch/alpha/1/download
tar -xzvf ribonanzanet2.tar.gz
rm ribonanzanet2.tar.gz
cd ..
```

- MMseqs2-based template identification: MMseqs2 3D RNA Template Identification
- Kaggle 1st place template-based approaches: RNA 3D Folds — TBM-only approach
Each of these notebooks generates a `submission.csv` file containing the templates. The CSV file can be downloaded from the output section of the notebook.
You do not need to rerun the notebooks unless you want to generate templates for new sequences.
To convert the CSV into the training-ready or inference-ready binary format, use:

```bash
python preprocess/convert_templates_to_pt_files.py --input_csv <path/to/submission.csv> --output_name template_features.pt
```

The resulting `.pt` file(s) can be referenced via `template_data` in configs and used with `use_template='ca_precomputed'`.
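A quick header check of the `submission.csv` before conversion can catch a wrong or truncated download early. This is an illustrative sketch only; the default column names (`ID`, `resname`) are an assumption based on the Kaggle submission format and may need adjusting for your file:

```python
import csv


def check_template_csv(path: str, required: tuple[str, ...] = ("ID", "resname")) -> list[str]:
    """Return the required columns missing from the CSV header.

    NOTE: the default column names are an assumption based on the
    Kaggle submission format; adjust them to match your file.
    """
    with open(path, newline="") as fh:
        header = next(csv.reader(fh), [])
    return [col for col in required if col not in header]
```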
After running the above steps, the repository structure should look like this:
```
release_data/
├── ccd_cache/
│   ├── clusters-by-entity-40.txt
│   ├── components.v20240608.cif
│   └── components.v20240608.cif.rdkit_mol.pkl
├── kaggle/
│   ├── MSA_v2/
│   ├── <training data files>
│   └── template_features.pt
├── protenix_models/
│   └── protenix_base_default_v0.5.0.pt
└── ribonanzanet2_checkpoint/
    ├── dropout.py
    ├── Network.py
    ├── pairwise.yaml
    └── pytorch_model_fsdp.bin
```
We provide the trained model checkpoint via NGC and HuggingFace (Public best and Private best).
We provide a convenience script for training. Please modify it according to your purpose:
```bash
sh rnapro_train_example.sh
```

For details on the input format and output format, please refer to the overview.
- Input CSV files
  - Prepare a CSV file with the columns: `target_id` and `sequence`.
- RNA MSA
  - MSAs are user-provided.
- Templates (same as training)
  - Obtain templates via either:
    - MMseqs2-based identification: https://www.kaggle.com/code/rhijudas/mmseqs2-3d-rna-template-identification
    - Kaggle 1st place TBM-only approach (generally stronger): https://www.kaggle.com/code/jaejohn/rna-3d-folds-tbm-only-approach
  - Each approach produces a `submission.csv`. Convert it to `.pt`:

    ```bash
    python preprocess/convert_templates_to_pt_files.py --input_csv path/to/submission.csv --output_name path/to/template_features.pt --max_n 40
    ```

  - Use with `--use_template ca_precomputed --template_data path/to/template_features.pt`.
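The input CSV above can be produced with the standard library. A minimal sketch, with placeholder target IDs and sequences:

```python
import csv


def write_sequences_csv(path: str, records: list[tuple[str, str]]) -> None:
    """Write (target_id, sequence) pairs with the header the input CSV expects."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["target_id", "sequence"])
        writer.writerows(records)


# Example usage; the IDs and sequences below are placeholders.
write_sequences_csv(
    "my_targets.csv",
    [("R1107", "GGGUUCAGCC"), ("R1108", "CGCGAAUUAGCG")],
)
```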
You can run inference via script:

```bash
bash rnapro_inference_example.sh
```

The script configures and forwards the following parameters to the CLI:

- `--model_name`: Model config to use (e.g., `rnapro_base`).
- `--dump_dir`: Directory where inference results are saved.
- `--load_checkpoint_path`: Path to the model checkpoint `.pt`.
- `--seeds`: Comma-separated seeds (default in the example: `42`).
- `--dtype`: Precision (`bf16` or `fp32`).
- `--use_msa`: Enable MSAs (recommended for RNA).
- `--rna_msa_dir`: Directory containing precomputed MSAs.
- `--use_template`: Template mode (use `ca_precomputed` for prepared templates).
- `--template_data`: Path to the `.pt` template file converted from `submission.csv`.
- `--template_idx`: Top-k template selection index: 0 -> top1, 1 -> top2, 2 -> top3, 3 -> top4, 4 -> top5.
- `--num_templates`: Number of templates to use (e.g., `10`).
- `--model.N_cycle`: Diffusion cycles (e.g., `10`).
- `--sample_diffusion.N_sample`: Number of samples per seed (e.g., `1`).
- `--sample_diffusion.N_step`: Diffusion steps (e.g., `200`).
- `--load_strict`: Strict weight loading.
- `--num_workers`: Data loader workers.
- `--triangle_attention` / `--triangle_multiplicative`: Kernel backends (`torch`, `cuequivariance`, etc.).
- `--sequences_csv`: Optional CSV with headers `sequence,target_id` for batched inference.
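When scripting many runs, it can be convenient to assemble these flags programmatically before shelling out. This sketch only composes the argument list (the paths are placeholders) and does not invoke the CLI:

```python
def build_inference_args(**options: str) -> list[str]:
    """Turn keyword options into a flat CLI argument list,
    e.g. {'model_name': 'rnapro_base'} -> ['--model_name', 'rnapro_base']."""
    args: list[str] = []
    for key, value in options.items():
        args += [f"--{key}", str(value)]
    return args


# Example with placeholder paths:
cmd = build_inference_args(
    model_name="rnapro_base",
    dump_dir="./results",                # placeholder output directory
    load_checkpoint_path="./model.pt",   # placeholder checkpoint path
    seeds="42",
    dtype="bf16",
    use_template="ca_precomputed",
)
```

Dotted keys such as `model.N_cycle` are not valid Python keyword names, so they would need to be passed via a plain dict instead of keyword arguments.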
Training and inference can be accelerated using various optimized kernels (e.g., cuEquivariance, Triton, and specialized LayerNorm/attention backends). Refer to the Kernels Setup Guide for installation steps, supported options, and recommended configurations.
RNAPro uses the frozen pretrained RNA foundation model RibonanzaNet2 as an encoder to extract RNA sequence and pairwise features. These are projected and injected, with learned gating and RNA templates, into an RNA-post-trained Protenix. The RibonanzaNet2 module can be enabled or disabled:
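The gated injection can be illustrated schematically. This is a toy numpy sketch, not the actual RNAPro implementation: the idea shown is that a zero-initialized gate makes the injected branch a no-op at the start, so the post-trained trunk begins from its pretrained behavior and blends in encoder features as the gate is learned.

```python
import numpy as np


def inject_with_gate(pair_repr: np.ndarray,
                     rnet_pair: np.ndarray,
                     w_proj: np.ndarray,
                     gate: float) -> np.ndarray:
    """Add projected encoder features to the trunk pairwise
    representation, scaled by a learned scalar gate."""
    projected = rnet_pair @ w_proj  # project encoder features to trunk width
    return pair_repr + np.tanh(gate) * projected


rng = np.random.default_rng(0)
pair = rng.normal(size=(8, 8, 16))   # toy trunk pairwise representation
rnet = rng.normal(size=(8, 8, 32))   # toy RibonanzaNet2 pairwise features
w = rng.normal(size=(32, 16))        # toy projection weights

out = inject_with_gate(pair, rnet, w, gate=0.0)
# gate = 0.0 -> tanh(0) = 0, so the output equals pair exactly
```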
```bash
--model.use_RibonanzaNet2 true
--model.ribonanza_net_path ./release_data/ribonanzanet2_checkpoint
```

| Mode | Use Case |
|---|---|
| `ca_precomputed` | Inference with precomputed C1' templates |
| `masked_templates` | Training with masked ground truth when you do not have a template dataset |
```bash
--use_template ca_precomputed
--model.use_template ca_precomputed
# Number of pairformer blocks in the template embedder
--model.template_embedder.n_blocks 2
--num_templates 4
```

We thank the Stanford Das Lab, HHMI, and the co-hosts and winners of the Stanford RNA 3D Folding Kaggle competition for their collaboration in this research.
Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
The source code is made available under Apache-2.0.
The model weights are made available under the NVIDIA Open Model License.