
bZIP-ppi

Installation

Create a project directory.

Clone the git repository:

git clone https://github.com/fmi-basel/Deep-learning-bZIPs

Create a conda environment (this will create an environment in the current folder, installing all packages defined in environment.yml):

conda env create -p $(pwd)/conda_env -f Deep-learning-bZIPs/environment.yml
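The environment can then be activated by its path, for example (assuming it was created in the project directory as above):

conda activate ./conda_env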

Dataset and precalculated results

Download the archive from Zenodo (link to follow) and unpack it inside the repository folder:

cd Deep-learning-bZIPs
tar xzvf deep-learning-bzips.tar.gz

Repository structure

bZIP-ppi
├── configs # Parameters for CNN and ESM models, accelerate config
├── datasets # User-generated datasets (empty)
├── logs # Log files of runs
├── notebooks
│   ├── analysis # Jupyter notebooks for inference and analysis of experiments
│   └── dataset # Jupyter notebook for preparation of dataset splits
├── precalculated_checkpoints # Precalculated checkpoints from trainings presented in the paper
├── precalculated_datasets # Datasets prepared with notebooks/dataset/prepare_dataset.ipynb
├── raw_datasets # Raw dataset from which the dataset splits are prepared
├── results_paper # Precalculated inference results for valid and/or test sets
├── results # User-generated results (empty)
├── src # Source code
└── experiments # Scripts to reproduce experiments from the paper; contains example Slurm submission scripts for CNN or ESM training
    ├── checkpoints # User-generated checkpoints (empty)
    ├── logs # TensorBoard logging folder
    └── wandb # wandb logging folder

Dataset preparation for CNN and ESM training

Datasets can be prepared with the Jupyter notebook notebooks/dataset/dataset_preparation.ipynb. Further instructions are given in the notebook.

Note: The Jupyter server needs to be started in the root directory of this repository.
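For example (assuming Jupyter is installed in the conda environment created above):

cd Deep-learning-bZIPs
jupyter notebook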

Training

Prerequisites

For training the CNN model, standard hardware with a GPU with 8 GB of memory is sufficient.

For training the ESM model, a GPU that supports bfloat16 and has at least 32 GB of memory is required; we used A40 or A100 80 GB models. ESM training on multiple GPUs uses the accelerate package. An example accelerate configuration file is provided in configs/accelerate_config.yaml; it needs to be adapted to your environment, e.g. the number of GPUs is set via num_processes. For more information, please refer to the accelerate documentation.
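A multi-GPU run can then be launched via accelerate, for example (a sketch; the provided Slurm scripts may already wrap this call):

accelerate launch --config_file configs/accelerate_config.yaml experiments/005_esm_10fold_crossvalidation_training.py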

Before starting a run, the following environment variables need to be exported:

export PYTHONPATH="/path/to/project/root"
export WANDB_HOST="..." OR export WANDB_MODE="offline"
export WANDB_PROJECT="..."
export WANDB_ENTITY="..."

When not using wandb logging, set WANDB_MODE="offline" or WANDB_MODE="disabled" and provide arbitrary project and entity names.
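For example, a fully offline setup could look like this (run from the project root; the project and entity names below are placeholders):

export PYTHONPATH="$(pwd)"
export WANDB_MODE="offline"
export WANDB_PROJECT="bzip-ppi"
export WANDB_ENTITY="local"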

When not using wandb artifacts, a local directory containing the dataset splits in csv format can be defined with the dataset_dir argument. The names of the dataset split files and the input and label column names are defined in a dictionary that is passed as the dataset_metadata argument. The provided scripts and notebooks use the local directory option.

When using wandb dataset artifacts, the dataset_dir and dataset_metadata arguments need to be set to None. The wandb_dataset argument in the config dictionary specifies the name of the wandb dataset artifact; by default, the latest version is used.

Training scripts for 10-fold cross-validation (CNN):

Slurm script to run all folds and splits as a job array:

experiments/002_cnn_10fold_crossvalidation.sh
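A typical submission on a Slurm cluster would be (assuming the job array indices and cluster resources are configured inside the script; adjust as needed):

sbatch experiments/002_cnn_10fold_crossvalidation.sh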

To run the training directly, use the following Python script:

experiments/002_cnn_10fold_crossvalidation.py

Training scripts for 10-fold cross-validation (ESM):

Slurm script to run all folds as a job array:

experiments/005_esm_10fold_crossvalidation_prot_split_training.sh

To run the training directly, use the following Python script:

experiments/005_esm_10fold_crossvalidation_training.py

Scoring 10-fold cross-validation (CNN):

Follow the steps described in the notebook:

notebooks/analysis/002_1_cnn_10fold_crossvalidation_inference.ipynb

and

notebooks/analysis/002_2_cnn_10fold_crossvalidation_figures.ipynb

Scoring 10-fold cross-validation (ESM):

Slurm script to run inference:

experiments/005_esm_10fold_crossvalidation_prot_split_inference.sh

Notebook for visualization:

notebooks/analysis/005_comparison_cnn_esm_10fold_crossvalidation_figures.ipynb

Training scripts for single-fold CNN and ESM

To run CNN training with the same dataset split (80% train, 10% validation, 10% test) as used for ESM training:

Slurm script:

experiments/003_cnn_single_fold.sh

To run the training directly, use the following Python script:

python3 experiments/003_cnn_single_fold.py

To run ESM training with the same dataset split as used in the CNN training above:

Training with the dataset split by bZIP:

python3 experiments/006_esm_single_fold_training.py --split prot

Training with the dataset split by mutation:

python3 experiments/006_esm_single_fold_training.py --split mut

Scoring scripts for single-fold ESM and CNN

To score the CNN model, follow the steps in the notebook:

notebooks/analysis/003_cnn_splits_single_fold_inference.ipynb

For ESM scoring, run the following scripts:

006_esm_single_fold_prot_split_inference.sh

and

006_esm_single_fold_mut_split_inference.sh
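These scripts are submitted like the other Slurm scripts; assuming they are located in the experiments directory alongside the other experiment scripts, a typical invocation is:

sbatch experiments/006_esm_single_fold_prot_split_inference.sh
sbatch experiments/006_esm_single_fold_mut_split_inference.sh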

Notebook to compare both models:

notebooks/analysis/006_comparison_cnn_esm_single_fold_figures.ipynb

Model checkpoints are saved in experiments/checkpoints, and wandb offline run files in experiments/wandb.

Binder optimization script

python3 experiments/007_binder_optimization.py

Visualization of binder optimization results

notebooks/analysis/007_binder_design.ipynb

Hyperparameter search (CNN)

In the function run.run_optimization of experiments/001_cnn_hyperparamter_search.py, the n_jobs parameter should not be larger than the number of available GPUs. Each job uses one GPU unless the gpus parameter is also changed. Changing the gpus parameter enables PyTorch's DistributedDataParallel (DDP). Note that DDP affects the effective batch size (each process consumes its own batch, so the effective batch size is the per-GPU batch size multiplied by the number of processes) and therefore the learning rate. See: []
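For example, to keep n_jobs consistent with the GPUs that are actually available, the visible devices can be restricted before launching the search (a sketch; the device indices depend on your machine):

export CUDA_VISIBLE_DEVICES=0,1   # two GPUs visible, so n_jobs should be at most 2
python3 experiments/001_cnn_hyperparamter_search.py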
