This pipeline identifies insulin-like peptides (ILPs) from transcriptomic data, performs machine learning-based annotation, and conducts phylogenetic analysis across prepropeptide, propeptide, and mature peptide forms. To scale to large inputs, it relies on multi-threaded tools, `parallel`, result caching, and chunked, memory-aware processing.
- Tools: `curl`, `pigz`, `seqkit`, `TransDecoder`, `cd-hit`, `mmseqs2`, `hhblits`, `hmmsearch`, `hhmake`, `hmmbuild`, `hhsearch`, `blastp`, `interproscan.sh`, `colabfold_batch`, `mafft`, `trimal`, `FastTree`, `iqtree`, `foldtree`, `R` (with `ape` package), `ete3`, `autophy`, `meme`, `ame`, `fimo`, `taxonkit`, `parallel`, `mamba`, `snakemake`, `pymol`, `yq`
- Python Libraries: `BioPython`, `pandas`, `scikit-learn`, `xgboost`, `shap`, `matplotlib`, `seaborn`, `logomaker`, `psutil`, `pyyaml`
- Hardware: Multi-core CPU recommended; GPU optional for ColabFold
# Create and activate the base environment with Python 3.11
conda create -n ilp_pipeline python=3.11
conda activate ilp_pipeline
# Add necessary channels
conda config --append channels bioconda
conda config --append channels conda-forge
# Install core bioinformatics tools
conda install -c bioconda curl pigz seqkit transdecoder cd-hit mmseqs2 hhsuite blast interproscan mafft trimal fasttree iqtree meme taxonkit parallel mamba snakemake yq
# Install additional dependencies
conda install -c conda-forge biopython r-base r-ape ete3 psutil pymol-open-source pillow numpy matplotlib tqdm pytorch=1.13
conda install -c anaconda pandas scikit-learn xgboost seaborn
# Install Python libraries via pip
pip install shap logomaker pyyaml
pip install 'autophy @ git+https://github.com/aortizsax/autophy@main'
# ColabFold requires separate installation (see https://github.com/sokrypton/ColabFold)
# FoldTree is integrated via Snakemake in the pipeline (see https://github.com/DessimozLab/fold_tree)

- `input/`: Input FASTA files (e.g., `9606_T1.fasta`)
- `preprocess/`: Preprocessed sequences and reference data
- `candidates/`: Candidate ILP sequences and metadata
- `analysis/`: Phylogenetic trees, structural models, and intermediate files
- `output/`: Final tables and plots (`prepro/`, `pro/`, `mature/` subdirectories)

- Purpose: Fetches annotated ILP and non-ILP sequences from UniProt and generates HMM profiles for training and candidate identification.
- Process: Queries UniProt for ILPs (e.g., insulin, relaxin) across Metazoa and curated references (e.g., P01308), balances them with non-ILPs, annotates with InterProScan and `annotate_references.py`, aligns ILP sequences with MAFFT, and builds `ilp.hmm` with `hmmbuild` and `ilp_db.hhm` with `hhmake`.
- Output: `input/ref_ILPs.fasta` (annotated), `input/ilp.hmm`, `input/ilp_db.hhm`.
- Details: Uses `curl` for API calls, `pigz` for compression, and multi-threaded alignment; includes dependency checks.
- Purpose: Prepares reference data for machine learning with sequence-based features.
- Process: Clusters sequences with `linclust`, searches with `HHblits` and `HMMER`, deduplicates with `CD-HIT`, performs BLAST, and annotates domains with InterProScan. Processes sequences into prepro, pro, and mature forms with `preprocess_ilp.py`. Extracts sequence features (`extract_training_features.py`) and generates labels (`generate_labels.py`).
- Inputs: `input/ref_ILPs.fasta`, `input/ilp.hmm`, `input/ilp_db.hhm`.
- Output: `preprocess/ref_features.csv`, `preprocess/ref_labels.csv`, `preprocess/*_{type}.fasta`.
- Details: Multi-threaded (`-T $max_cpus`), chunked processing, validates inputs and sequence completion; structural features are deferred to candidate processing.
- Purpose: Preprocesses input transcriptomes.
- Process: Translates nucleotide sequences with `TransDecoder` if needed, filters by length based on reference ILPs (`calc_ref_lengths.py`), and deduplicates with `CD-HIT`.
- Inputs: `input/[0-9]*_*.fasta`, `input/ref_ILPs.fasta`.
- Output: `preprocess/*_preprocessed.fasta`.
- Details: Multi-threaded with `seqkit` and `TransDecoder`, validates input files.
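
The reference-based length filter in the step above can be pictured with a short sketch. It is not the contents of `calc_ref_lengths.py`: the function names, the 20% margin, and the use of in-memory dictionaries are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, Tuple


def length_bounds(ref_lengths: Iterable[int], margin_pct: int = 20) -> Tuple[int, int]:
    """Hypothetical rule: widen the reference min/max lengths by margin_pct percent.

    Integer arithmetic is used so the bounds do not depend on float rounding.
    """
    lengths = list(ref_lengths)
    if not lengths:
        raise ValueError("no reference lengths given")
    lower = (min(lengths) * (100 - margin_pct)) // 100          # rounded down
    upper = (max(lengths) * (100 + margin_pct) + 99) // 100     # rounded up
    return lower, upper


def filter_by_length(seqs: Dict[str, str], lower: int, upper: int) -> Dict[str, str]:
    """Keep only sequences whose length lies within [lower, upper]."""
    return {sid: seq for sid, seq in seqs.items() if lower <= len(seq) <= upper}


if __name__ == "__main__":
    refs = {"ref1": "A" * 100, "ref2": "A" * 150}
    lo, hi = length_bounds(len(s) for s in refs.values())
    candidates = {"short": "M" * 30, "ok": "M" * 120, "long": "M" * 400}
    print(lo, hi, sorted(filter_by_length(candidates, lo, hi)))
```

Deriving the bounds from the references rather than hard-coding them keeps the filter consistent with whatever `input/ref_ILPs.fasta` contains at run time.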
- Purpose: Identifies ILP candidates from preprocessed transcriptomes.
- Process: Clusters with `linclust`, searches with `HHblits` and `HMMER`, builds HMM profiles, performs batched BLAST, and annotates domains with InterProScan in parallel.
- Inputs: `preprocess/[0-9]*_preprocessed.fasta`, `input/ref_ILPs.fasta`, `input/ilp.hmm`, `input/ilp_db.hhm`.
- Output: `candidates/*_candidates.fasta`, metadata files (`*_hhblits.out`, etc.).
- Details: Optimizes with `parallel`, caches BLAST/InterProScan results, validates inputs.
- Purpose: Performs initial ML annotation of candidates based on sequence features.
- Process: Combines candidates into `analysis/all_candidates.fasta`, extracts sequence-based features with `extract_features.py`, and runs ML models (`run_ml.py`) for initial probabilities and novelty prediction.
- Inputs: `candidates/[0-9]*_candidates.fasta`, `preprocess/ref_features.csv`, `preprocess/ref_labels.csv`.
- Output: `analysis/all_candidates.fasta`, `analysis/predictions.csv`, `analysis/novel_candidates.csv`.
- Details: Initial pass without structural features; see ML section for specifics.
- Purpose: Generates structural models for identified ILP candidates only.
- Process: Runs ColabFold on prepro, pro, and mature forms of candidate sequences from `candidates/*_candidates.fasta`, processed via `preprocess_ilp.py`, skipping existing PDBs with checkpointing.
- Inputs: `candidates/*_candidates.fasta`.
- Output: `analysis/pdbs/*.pdb`.
- Details: Uses `parallel` with incremental checks, GPU support optional, validates inputs; predicts structures only for candidates to limit compute cost.
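
The checkpointing idea (skip any structure that already exists) can be sketched as below. The file naming scheme `<id>_<form>.pdb` and the function name `pending_jobs` are hypothetical; the actual script may name its PDBs and decide completeness differently.

```python
from pathlib import Path
from typing import Iterable, List, Sequence, Tuple

FORMS: Sequence[str] = ("prepro", "pro", "mature")


def pending_jobs(seq_ids: Iterable[str], pdb_dir: str,
                 forms: Sequence[str] = FORMS) -> List[Tuple[str, str]]:
    """Return (sequence id, form) pairs that still need a structure prediction.

    A job counts as done only if its PDB file exists and is non-empty, so a
    run interrupted while writing a file is retried on the next invocation.
    """
    out_dir = Path(pdb_dir)
    todo = []
    for seq_id in seq_ids:
        for form in forms:
            pdb = out_dir / f"{seq_id}_{form}.pdb"  # assumed naming scheme
            if not (pdb.is_file() and pdb.stat().st_size > 0):
                todo.append((seq_id, form))
    return todo
```

Only the pairs returned by `pending_jobs` would then be handed to `colabfold_batch`, which makes re-running the step after a failure cheap.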
- Purpose: Performs phylogenetic and structural analysis with comprehensive consensus trees.
- Process:
  - Filters candidates to ILPs (`filter_ilps.py`), determines taxonomy (`determine_common_taxonomy.py`), filters references (`filter_ref_ilps_by_taxonomy.py`).
  - Aligns with `MAFFT`, trims with `trimal`, builds sequence trees with `FastTree` and `IQ-TREE`, generates structural trees with `foldtree` (Foldtree, LDDT, TM metrics) using candidate PDBs.
  - Creates consensus trees (`consensus_tree_with_support.R`) for these combinations:
    - Per type: Sequence + Foldtree, Sequence + LDDT, Sequence + TM, Foldtree + LDDT, Foldtree + TM, LDDT + TM, Foldtree + LDDT + TM, Sequence + Foldtree + LDDT + TM.
    - Across types: Sequence and Foldtree combinations (e.g., prepro + pro).
  - Conducts clade analysis with `ETE` and `Autophy`, motif discovery (`MEME`, `AME`, `FIMO`), and logo generation (`plot_alignment.py`).
- Inputs: `analysis/all_candidates.fasta`, `analysis/predictions.csv`, `analysis/novel_candidates.csv`, `input/[0-9]*_*.fasta`, `input/ref_ILPs.fasta`, `analysis/pdbs/*.pdb`.
- Output: `analysis/` (trees like `*_consensus_seq_foldtree.tre`, PDBs, clades, plots).
- Details: Uses `parallel -j 3`, caches alignments, integrates Foldtree via Snakemake, validates `mamba`/`snakemake`.
- Purpose: Generates final outputs with structural features for manuscript preparation.
- Process:
  - Re-runs `extract_features.py` with structural features from candidate PDBs, generates tables (`generate_tables.py`) and plots (`generate_plots.py`) for each type.
  - Produces annotated FASTA files and metadata TSV (`generate_output_fasta_and_metadata.py`).
  - Creates 3D structure figures (`generate_structure_figures.py`) for candidates.
- Inputs: `analysis/all_candidates.fasta`, `analysis/predictions.csv`, `analysis/novel_candidates.csv`, `candidates/*_blast.out`, `candidates/*_interpro.tsv`, `clades_ete_{type}/`, `clades_autophy_{type}/`, `analysis/ilp_candidates.fasta`, `analysis/ref_ILPs_filtered.fasta`, `input/[0-9]*_*.fasta`, `analysis/pdbs/`.
- Output: `output/*/*.csv` (overview, details, motif enrichment), `output/*/*.png` (counts, heatmap, violin, logos), `output/*/ilps.fasta` (annotated ILPs), `output/comparative_metadata.tsv` (sequence metadata), `output/figures/*.png` (3D structures).
- Details: Uses `taxonkit` for taxonomy, validates PyMOL, chunked processing; includes final feature extraction with structural data.
Identifies ILPs and novel candidates using an ensemble of Random Forest (RF) and XGBoost models trained on reference sequence data.
- Initial Pass (`03_annotate_and_novel.sh`):
  - Inputs: Reference (`preprocess/ref_candidates.fasta`) and candidate (`analysis/all_candidates.fasta`) sequences, search outputs (HHblits, HMMER, HHsearch, BLAST), InterPro annotations.
  - Process:
    - Extracts sequence similarity scores (HHblits probability, HMMER score, HHsearch probability, BLAST identity).
    - Adds physicochemical properties (hydrophobicity, charge) and InterPro domains.
    - Uses chunked processing based on available memory (`psutil`).
  - Output: `preprocess/ref_features.csv`, `analysis/features_initial.csv`.
- Final Pass (`06_generate_outputs.sh`):
  - Inputs: `analysis/all_candidates.fasta`, `analysis/pdbs/*.pdb`.
  - Process:
    - Repeats sequence feature extraction as above.
    - Computes structural similarity (TM-scores, pLDDT) against dynamic and standard references (1TRZ, 6RLX, Bombyxin-II) for prepro, pro, and mature forms using `tmalign` on candidate PDBs.
  - Output: `analysis/features_final.csv`.
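
To make the physicochemical features concrete, here is a minimal sketch of one way to compute hydrophobicity and charge per sequence. The choice of the Kyte-Doolittle scale (GRAVY) and of a simple K/R minus D/E count is an assumption; the source does not say which definitions `extract_features.py` uses, and the function names are illustrative.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Mean hydropathy over standard residues; non-standard symbols are skipped."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0


def net_charge(seq: str) -> int:
    """Crude net charge: (K + R) minus (D + E); ignores His, termini, and pH."""
    s = seq.upper()
    return s.count("K") + s.count("R") - s.count("D") - s.count("E")


def physchem_features(seq: str) -> dict:
    """Bundle the two values into one feature row (column names are assumed)."""
    return {"length": len(seq), "gravy": gravy(seq), "net_charge": net_charge(seq)}
```

For instance, `physchem_features("KKRDE")` yields a net charge of +1, and a row like this would be joined with the similarity scores before training.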
- Inputs: `input/ref_ILPs.fasta`, `preprocess/ref_features.csv`.
- Process: Assigns binary labels (1 for ILP, 0 for non-ILP) based on `[ILP]` tags.
- Output: `preprocess/ref_labels.csv`.
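
The labeling rule above is simple enough to sketch directly. This is an illustration rather than the code of `generate_labels.py`: it assumes the `[ILP]` tag appears in the FASTA header line, and the CSV column names `id` and `label` are made up for the example.

```python
import csv
import io
from typing import Iterable, List, Tuple

ILP_TAG = "[ILP]"


def labels_from_fasta(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (sequence id, label) pairs: 1 if the header carries [ILP], else 0."""
    rows = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence lines carry no label information
        header = line[1:].strip()
        parts = header.split()
        if not parts:
            continue  # skip malformed, empty headers
        rows.append((parts[0], 1 if ILP_TAG in header else 0))
    return rows


def labels_to_csv(rows: Iterable[Tuple[str, int]]) -> str:
    """Serialize the labels as CSV text with an assumed id,label header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "label"])
    writer.writerows(rows)
    return buf.getvalue()
```

Matching the exact string `[ILP]` means a header tagged `[non-ILP]` is labeled 0, since it does not contain the bracketed tag.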
- Inputs: `analysis/features_initial.csv`, `preprocess/ref_features.csv`, `preprocess/ref_labels.csv`, max CPUs.
- Process:
  - Loads data with dynamic chunk sizing (`psutil`).
  - Trains RF with `GridSearchCV` for hyperparameter tuning, computes SHAP values for feature selection (top 10 features).
  - Re-trains RF and XGBoost with 5-fold cross-validation, reporting AUC, precision, and recall (`output/ml_metrics.txt`).
  - Predicts ILP probabilities (RF + XGBoost average), flags novel ILPs (probability > 0.7, HHsearch probability < 70, BLAST identity < 30).
  - Generates SHAP summary plot.
- Output: `analysis/predictions.csv`, `analysis/novel_candidates.csv`, `output/shap_summary.png`, `output/rf_model.joblib`, `output/xgb_model.joblib`, `output/ml_metrics.txt`.
- Details: Multi-threaded training, reusable models, robust validation; the initial pass in `03` uses sequence features only.
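
The prediction and novelty rule described above (average the two model probabilities, then apply three cut-offs) can be written out as a small sketch. The field names mirror the `config.yaml` keys mentioned in this README, but pairing them with the defaults 0.7, 70, and 30 is an assumption, as are the function names; this is not the code of `run_ml.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Assumed defaults, taken from the cut-offs quoted in the process list.
    ilp_prob_threshold: float = 0.7
    hhsearch_novel_threshold: float = 70.0
    blast_novel_threshold: float = 30.0


def ensemble_probability(rf_prob: float, xgb_prob: float) -> float:
    """Unweighted mean of the Random Forest and XGBoost ILP probabilities."""
    return (rf_prob + xgb_prob) / 2.0


def is_novel(rf_prob: float, xgb_prob: float, hhsearch_prob: float,
             blast_identity: float, t: Thresholds = Thresholds()) -> bool:
    """Flag a candidate as a novel ILP.

    A candidate is novel when the models are confident it is an ILP while
    both similarity measures to known sequences stay below their cut-offs.
    """
    return (
        ensemble_probability(rf_prob, xgb_prob) > t.ilp_prob_threshold
        and hhsearch_prob < t.hhsearch_novel_threshold
        and blast_identity < t.blast_novel_threshold
    )
```

With these defaults, a candidate scored 0.9 (RF) and 0.8 (XGBoost) with HHsearch probability 50 and BLAST identity 20 is flagged as novel, while the same candidate with HHsearch probability 90 is not.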
- Thresholds: Key thresholds such as `ilp_prob_threshold`, `hhsearch_novel_threshold`, and `blast_novel_threshold` can be adjusted in `config.yaml` to fine-tune ILP identification and novelty detection.
- Tools and Paths: Ensure all tool paths (e.g., `interpro_path`, `colabfold_path`) are correctly set in `config.yaml`.
- The pipeline includes checks for missing tools, input files, and failed commands. Logs are recorded in `pipeline.log` for troubleshooting.
- For external tool failures (e.g., `Foldtree` in `05_comparative_analysis.sh`), check the corresponding log files in the `analysis/` or `fold_tree/` directories.
- Place your transcriptome FASTA files in `input/` with TaxID prefixes (e.g., `9606_T1.fasta`).
- Update `config.yaml` with appropriate tool paths and parameters.
- Run the pipeline sequentially:
bash 00a_fetch_references.sh && bash 00b_prepare_training.sh && bash 01_preprocess.sh && bash 02_identify_candidates.sh && bash 03_annotate_and_novel.sh && bash 04_run_colabfold.sh && bash 05_comparative_analysis.sh && bash 06_generate_outputs.sh
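
Since inputs are matched by their TaxID prefix, it can help to check file names before starting a run. The helper below is a hypothetical pre-flight check, not part of the pipeline; its regular expression is stricter than the `input/[0-9]*_*.fasta` glob in that it requires the whole prefix before the first underscore to be digits.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# <TaxID>_<label>.fasta, e.g. 9606_T1.fasta
_INPUT_NAME = re.compile(r"^(?P<taxid>[0-9]+)_(?P<label>.+)\.fasta$")


def parse_input_name(path: str) -> Optional[Tuple[int, str]]:
    """Return (taxid, label) for a well-formed input file name, else None."""
    match = _INPUT_NAME.match(Path(path).name)
    if match is None:
        return None
    return int(match["taxid"]), match["label"]


def misnamed_inputs(input_dir: str = "input") -> list:
    """List FASTA files in input_dir whose names lack a numeric TaxID prefix."""
    return sorted(
        p.name for p in Path(input_dir).glob("*.fasta")
        if parse_input_name(p.name) is None
    )
```

Calling `misnamed_inputs()` before the first script and renaming anything it reports avoids transcriptomes being silently skipped by the glob.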