In this repository, we provide the code for the Metal-Binding Graph Neural Network (MBGNN) method, as described in our paper "Co-evolution-based Metal-binding Residue Prediction with Graph Neural Networks". MBGNN is a novel method that utilizes co-evolved residue networks and effectively captures dependencies within protein structures using graph neural networks, enhancing the prediction of co-evolved metal-binding residues and their associated metal types.
The structure and description of the main files and directories are given below:
├── dataset # Raw data required for training and testing the models and extracted co-evolved pairs
│ ├── train_chains.fasta
│ ├── test_chains.fasta
│ ├── README.md # Description of the dataset
│ ├── test_coevolved_pairs.csv
│ ├── test_residues.tsv
│ ├── train_coevolved_pairs.csv
│ └── train_residues.tsv
├── compare_results # Results of the comparison between MBGNN and other methods
│ ├── compare_metal_binding_prediction_results.ipynb
│ ├── compare_metal_type_prediciton_results.ipynb
│ ├── LMetalSite # Results of LMetalSite method
│ ├── MBGNN_metal-binding_preds.tsv # Predicted metal-binding residues by MBGNN
│ ├── MBGNN_metal_type_preds.tsv # Predicted metal types by MBGNN
│ ├── MetalNet # Results of MetalNet and MetalNet2 methods
│ └── M_Ionic # Results of M_Ionic method
├── example.fasta # Example fasta file containing arbitrary protein sequences from the test set
├── model_weights # Trained model weights
│ ├── metal_binding_predictor
│ └── metal_type_predictor
├── scripts # Scripts required for the prediction, each script can be run independently
│ ├── construct_graphs.py
│ ├── esm2.py
│ ├── extract_co_evovled_pairs.py
│ ├── gnn_model.py
│ ├── metal_binding_predictor.py
│ ├── metal_type_predictor.py
│ └── msa.py
├── main.py # Script to run the prediction for arbitrary protein sequences
└── training # Directory containing the Jupyter notebooks for training the models
    ├── construct_training_graphs.py
    ├── train_metal_binding_predictors.ipynb
    └── train_metal_type_predictors.ipynb
The source code is implemented using Python 3.11, PyTorch 2.3.0, and PyTorch Geometric 2.6.1. All the required packages are given below.
torch=2.3.0
torch-geometric=2.6.1
networkx>=3.3
biopython>=1.85
esm=2.0.1
numpy>=1.26.2
scikit-learn>=1.4.2
pandas>=2.2
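The pins listed above can be compared with what is actually installed in an environment. The following is a minimal sketch using only the standard library; the distribution names in `REQUIRED` (for example `fair-esm` for the `esm` entry) and the helper name `installed_versions` are assumptions made for illustration and are not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above. The distribution names are assumptions:
# "fair-esm" is the pip name assumed here for the "esm" entry.
REQUIRED = {
    "torch": "2.3.0",
    "torch-geometric": "2.6.1",
    "networkx": ">=3.3",
    "biopython": ">=1.85",
    "fair-esm": "2.0.1",
    "numpy": ">=1.26.2",
    "scikit-learn": ">=1.4.2",
    "pandas": ">=2.2",
}


def installed_versions(names):
    """Return {distribution name: installed version, or None if missing}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print each requirement next to the installed version for manual review.
    for name, installed in installed_versions(REQUIRED).items():
        print(f"{name}: required {REQUIRED[name]}, installed {installed or 'missing'}")
```

The script only reports versions side by side; it does not evaluate the `>=` constraints, so the comparison is left to the reader.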
To use the provided code, you need to install the required packages first. You can create a conda environment and install the required packages using the following commands:
# Create a conda environment
$ conda create -n mbgnn python=3.11
$ conda activate mbgnn
# Install the required packages
$ pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121
$ pip install torch_geometric==2.6.1
$ pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.3.0+cu121.html
$ pip install "networkx>=3.3" "biopython>=1.85" "numpy>=1.26.2" "scikit-learn>=1.4.2" "pandas>=2.2"
$ pip install fair-esm
# Clone the source code of MBGNN
$ git clone https://github.com/SRastegari/MBGNN.git
$ cd MBGNN
To predict metal-binding residues and their associated metal types for arbitrary protein sequences in a .fasta file, run the main.py script. It extracts the protein sequences from the provided .fasta file, constructs the co-evolved residue network, and predicts the metal-binding residues and their associated metal types using the trained models. Pass the path to the .fasta file containing the protein sequences to the main.py script as follows:
$ python main.py path/to/file_name.fasta
You can run this command on the provided example file, example.fasta, as follows:
$ python main.py example.fasta
All the predicted metal-binding residues and their associated metal types, as well as intermediate results, will be saved in the run_{file_name} directory.
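To make the input and output conventions concrete, here is a minimal sketch of reading FASTA records and deriving an output directory name. It is not the code used by main.py: `parse_fasta` and `run_dir_name` are hypothetical helpers, and the rule that `{file_name}` is the input file name without its extension is an assumption. In practice, Biopython's `SeqIO.parse` is the usual choice for reading FASTA files.

```python
from pathlib import Path


def parse_fasta(lines):
    """Parse FASTA-formatted lines into {sequence id: sequence}."""
    sequences = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The id is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            sequences[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them.
            sequences[current_id] += line
    return sequences


def run_dir_name(fasta_path):
    """Assumed naming rule: 'run_' followed by the file name without extension."""
    return f"run_{Path(fasta_path).stem}"


if __name__ == "__main__":
    with open("example.fasta") as handle:
        records = parse_fasta(handle)
    print(len(records), "sequences; results expected in", run_dir_name("example.fasta"))
```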
Alternatively, you can use the provided Colab notebook to upload arbitrary protein sequences and run the prediction. The Colab notebook is available here.
To train the models from scratch, follow these steps:
1. Ensure you have set up the environment as described in "Setup the environment".
2. Use the training protein sequences and metal-binding residue annotations provided in the dataset directory. Training sequences are provided in the train_chains.fasta file, and the metal-binding residues are provided in the train_residues.tsv file. The extracted co-evolved residue pairs for training are provided in train_coevolved_pairs.csv (the test pairs are in test_coevolved_pairs.csv).
2.1. To derive ESM2 embeddings for the protein sequences, use the esm2.py script provided in the scripts directory. This script generates embeddings for each sequence in the train_chains.fasta file. You can run the script as follows:
$ python scripts/esm2.py dataset/train_chains.fasta /path/to/save/embeddings
2.2. To prepare the PyG graphs corresponding to the co-evolved residue networks for training the metal-binding predictors, use the training/construct_training_graphs.py script as follows:
$ python training/construct_training_graphs.py dataset/train_coevolved_pairs.csv /path/to/embeddings /path/to/save/graphs_list /path/to/train_residues --mode=metal_binding
2.3. Similarly, to create the co-evolved residue networks for training the metal-type predictors, use the same training/construct_training_graphs.py script with the metal-type mode as follows:
$ python training/construct_training_graphs.py dataset/train_coevolved_pairs.csv /path/to/embeddings /path/to/save/graphs_list /path/to/train_residues --mode=metal_type
3. Use the train_metal_binding_predictors.ipynb notebook provided in the training directory to train the metal-binding residue prediction model using the PyG graphs created in step 2.2.
4. Use the train_metal_type_predictors.ipynb notebook provided in the training directory to train the metal type prediction model using the PyG graphs created in step 2.3.
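As a rough illustration of what a graph built from co-evolved pairs can look like, the sketch below converts a list of residue-index pairs into the `edge_index` layout that PyG uses (two rows of source and target node indices, with both directions stored for an undirected graph). It assumes the simplest reading, in which residues are nodes and co-evolved pairs are edges; the repository's construct_training_graphs.py may define nodes, edges, and features differently. The helper name `build_edge_index` is hypothetical.

```python
def build_edge_index(pairs):
    """Map co-evolved residue pairs to (node list, COO edge index).

    Assumption: each pair (i, j) holds the sequence positions of two
    co-evolved residues in the same chain.
    """
    # Nodes are the distinct residue positions, in ascending order.
    nodes = sorted({residue for pair in pairs for residue in pair})
    node_index = {residue: k for k, residue in enumerate(nodes)}

    sources, targets = [], []
    for i, j in pairs:
        # Store each undirected edge in both directions.
        sources += [node_index[i], node_index[j]]
        targets += [node_index[j], node_index[i]]
    return nodes, [sources, targets]


if __name__ == "__main__":
    nodes, edge_index = build_edge_index([(3, 10), (10, 25)])
    print(nodes)       # [3, 10, 25]
    print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

With PyTorch installed, the nested list could be wrapped as `torch.tensor(edge_index, dtype=torch.long)` and combined with per-node ESM2 embeddings in a `torch_geometric.data.Data` object.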
We used the dataset provided by MetalNet2, which consisted of 4,449 metal-binding protein chains collected from the Protein Data Bank (PDB) as of May 2023. The dataset contained a training set and a fixed hold-out test set, which respectively included 18,230 and 1,981 metal-binding CHED residues. Furthermore, 11 metal types were considered as labels for each metal-binding residue, including Zn, Ca, Mg, Mn, Fe, SF4, Ni, Cu, Co, FeS, and Fe3S.
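The label space described above can be written down directly. The sketch below lists the 11 metal types, assigns each an integer class index, and finds the CHED residues (Cys, His, Glu, Asp) in a sequence. The index order and the 1-based position convention are assumptions for illustration; the trained models in this repository may use a different encoding.

```python
# The 11 metal types named in the dataset description.
METAL_TYPES = ["Zn", "Ca", "Mg", "Mn", "Fe", "SF4", "Ni", "Cu", "Co", "FeS", "Fe3S"]

# Assumed encoding: class index follows the order of the list above.
METAL_TO_INDEX = {metal: idx for idx, metal in enumerate(METAL_TYPES)}

# Candidate metal-binding residue types: Cys (C), His (H), Glu (E), Asp (D).
CHED = frozenset("CHED")


def ched_positions(sequence):
    """Return the 1-based positions of all C, H, E, and D residues."""
    return [pos for pos, aa in enumerate(sequence.upper(), start=1) if aa in CHED]


if __name__ == "__main__":
    print(ched_positions("MCHAED"))  # [2, 3, 5, 6]
    print(METAL_TO_INDEX["Zn"])      # 0
```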
Annotated train and test residues are provided in the dataset/train_residues.tsv and dataset/test_residues.tsv files, respectively. The protein sequences corresponding to the train and test residues are provided in the dataset/train_chains.fasta and dataset/test_chains.fasta files, respectively. Finally, the co-evolved residue pairs extracted from the train and test sequences are provided in the dataset/train_coevolved_pairs.csv and dataset/test_coevolved_pairs.csv files, respectively. They can also be regenerated by running scripts/msa.py to perform multiple sequence alignment and then scripts/extract_co_evovled_pairs.py to extract the co-evolved residue pairs, as follows:
$ python scripts/msa.py dataset/train_chains.fasta /path/to/save/msa
$ python scripts/extract_co_evovled_pairs.py /path/to/msa /path/to/save/coevolved_pairs
Some parts of the code in this repository are adapted from the MetalNet2 repository. We would like to thank the authors for their valuable work.
If you find this repository useful in your research, please cite the following paper:
@article{rastegari2025co,
title={Co-evolution-based Metal-binding Residue Prediction with Graph Neural Networks},
author={Rastegari, Sayedmohammadreza and Tabakhi, Sina and Liu, Xianyuan and Sang, Wei and Lu, Haiping},
journal={arXiv preprint arXiv:2502.16189},
year={2025}
}