Create a project directory
Clone the git repository
git clone https://github.com/fmi-basel/Deep-learning-bZIPs
Create a conda environment
conda env create -p $(pwd)/conda_env -f Deep-learning-bZIPs/environment.yml (this creates an environment in the current folder and installs all packages defined in environment.yml)
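To use the environment afterwards, activate it by its path (a sketch; assumes a conda version recent enough for conda activate to accept a prefix path):
conda activate $(pwd)/conda_env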
Download the archive from Zenodo (link to follow) and unpack it inside the repository folder:
cd Deep-learning-bZIPs
tar xzvf deep-learning-bzips.tar.gz
bZIP-ppi
├── configs # Parameters for the CNN and ESM models, accelerate config
├── datasets # User-generated datasets (empty)
├── logs # Logfiles of runs
├── notebooks
│   ├── analysis # Jupyter notebooks for inference and analysis of experiments
│   └── dataset # Jupyter notebook for preparation of dataset splits
├── precalculated_checkpoints # Precalculated checkpoints from the trainings presented in the paper
├── precalculated_datasets # Datasets prepared with notebooks/dataset/prepare_dataset.ipynb
├── raw_datasets # Dataset from which the dataset splits are prepared
├── results_paper # Precalculated inference results for valid and/or test sets
├── results # User-generated results (empty)
├── src # Source code
└── experiments # Scripts to reproduce the experiments from the paper; contains example Slurm submission scripts for CNN or ESM training
    ├── checkpoints # User-generated checkpoints (empty)
    ├── logs # TensorBoard logging folder
    └── wandb # WandB logging folder
Datasets can be prepared with the Jupyter notebook notebooks/dataset/dataset_preparation.ipynb. More instructions are given in the notebook.
Note: the Jupyter server needs to be started in the root directory of this repository.
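For example (a sketch; assumes Jupyter is available in the activated conda environment and /path/to/Deep-learning-bZIPs is where the repository was cloned):
cd /path/to/Deep-learning-bZIPs   # repository root
jupyter notebook                  # or: jupyter lab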
For training the CNN model, standard hardware with a GPU with 8 GB of memory is sufficient.
For training the ESM model, a GPU that supports bfloat16 and has at least 32 GB of memory is required; we used A40 or A100 80 GB cards. To run ESM training on multiple GPUs, the accelerate package is used. An example configuration file for accelerate is provided in configs/accelerate_config.yaml. This file needs to be adapted to the given environment, e.g. the number of GPUs is set with num_processes. For more information, please refer to the accelerate documentation.
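A possible multi-GPU launch, as a sketch (accelerate launch and its --config_file option are standard accelerate CLI features; whether the provided Slurm scripts use exactly this invocation is an assumption):
accelerate launch --config_file configs/accelerate_config.yaml \
    experiments/005_esm_10fold_crossvalidation_training.py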
Before starting a run, the following environment variables need to be exported:
export PYTHONPATH="/path/to/project/root"
export WANDB_HOST="..." OR export WANDB_MODE="offline"
export WANDB_PROJECT="..."
export WANDB_ENTITY="..."
When not using wandb logging, set WANDB_MODE="offline" or WANDB_MODE="disabled" and provide arbitrary project and entity names.
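For example, a minimal offline setup (the project and entity values below are placeholders):
export PYTHONPATH="/path/to/Deep-learning-bZIPs"
export WANDB_MODE="offline"
export WANDB_PROJECT="bzip-ppi"
export WANDB_ENTITY="local"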
When not using wandb artifacts, a local directory containing the dataset splits in CSV format can be defined with the dataset_dir argument. The names of the dataset split files and the input and label column names are defined in a dictionary that is passed as the dataset_metadata argument. In the provided scripts and notebooks, the local directory option is used.
When using wandb dataset artifacts, the dataset_dir and dataset_metadata arguments need to be set to None. The wandb_dataset argument in the config dictionary gives the name of the wandb dataset artifact; by default the latest version is used.
Slurm script to run all folds and splits as a job array:
experiments/002_cnn_10fold_crossvalidation.sh
To run the training directly, use the following Python script:
experiments/002_cnn_10fold_crossvalidation.py
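For example (a sketch; the sbatch submission assumes a Slurm cluster and that the #SBATCH directives inside the script fit your setup):
# submit all folds and splits as a Slurm job array
sbatch experiments/002_cnn_10fold_crossvalidation.sh
# or run the training directly
python3 experiments/002_cnn_10fold_crossvalidation.py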
Slurm script to run all folds as a job array:
experiments/005_esm_10fold_crossvalidation_prot_split_training.sh
To run the training directly, use the following Python script:
experiments/005_esm_10fold_crossvalidation_training.py
Follow the steps described in the notebooks:
notebooks/analysis/002_1_cnn_10fold_crossvalidation_inference.ipynb
and
notebooks/analysis/002_2_cnn_10fold_crossvalidation_figures.ipynb
Slurm script to run inference:
experiments/005_esm_10fold_crossvalidation_prot_split_inference.sh
For visualization use the notebook:
notebooks/analysis/005_comparison_cnn_esm_10fold_crossvalidation_figures.ipynb
To run CNN training with the same dataset split (80% train, 10% valid, 10% test) used for ESM training:
Slurm script:
experiments/003_cnn_single_fold.sh
To run the training directly, use the following Python script:
python3 experiments/003_cnn_single_fold.py
To run the ESM training with the same dataset split used in the CNN training above:
Training with the dataset split by bZIP:
python3 experiments/006_esm_single_fold_training.py --split prot
Training with the dataset split by mutation:
python3 experiments/006_esm_single_fold_training.py --split mut
To score the CNN model, follow the steps in the notebook:
notebooks/analysis/003_cnn_splits_single_fold_inference.ipynb
For ESM scoring, run the following scripts:
006_esm_single_fold_prot_split_inference.sh
and
006_esm_single_fold_mut_split_inference.sh
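For example (assuming these scripts live in experiments/ like the other Slurm scripts and are submitted from the repository root):
sbatch experiments/006_esm_single_fold_prot_split_inference.sh
sbatch experiments/006_esm_single_fold_mut_split_inference.sh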
Notebook to compare both models:
notebooks/analysis/006_comparison_cnn_esm_single_fold_figures.ipynb
Model checkpoints are saved in experiments/checkpoints, and wandb offline run files in experiments/wandb.
To run the binder optimization:
python3 experiments/007_binder_optimization.py
Notebook for analysis of the results:
notebooks/analysis/007_binder_design.ipynb
In the function run.run_optimization of experiments/001_cnn_hyperparamter_search.py, the n_jobs parameter should not be larger than the number of available GPUs. One GPU is used per job unless the gpus parameter is also changed. Changing the gpus parameter enables the PyTorch distributed data parallel (DDP) protocol. Note that DDP affects the effective batch size and therefore the learning rate: for example, with gpus=4 and a per-process batch size of 32, the effective batch size becomes 128. See: []