Needle

Needle is a suite of tools to curate and compare protein sequences by pathway for understudied organisms. Cryptic gene structures, unconventional splicing models, unfinished genomes, and understudied proteomes are some of the challenges that make traditional workflows (e.g. gene prediction and classification) difficult and/or unreliable to apply. Needle focuses on working around these challenges to detect the presence of proteins in key pathways, and it curates the resulting information in a phylogenetically aware manner to enable comparative analysis within and across species.

Setup

Install NCBI BLAST Docker image

docker pull ncbi/blast

Install MMSeqs2 and HH-suite Docker images

docker pull ghcr.io/soedinglab/mmseqs2
docker pull soedinglab/hh-suite

Set up SwissProt DB for MMSeqs2

scripts/data/mmseqs-swissprot-setup

Install Muscle Docker image

docker pull pegi3s/muscle

Install the HMMER package, e.g. on macOS run brew install hmmer.

Create Python virtualenv

python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt

Initial Data

There is some initial data already in the data directory. Below are instructions to re-create it.

Download a list of KO (KEGG Ortholog) numbers and names

curl https://rest.kegg.jp/list/ko -o data/ko.txt
echo "Ortholog ID\tOrtholog Name" | cat - data/ko.txt > data/ko.tsv
rm data/ko.txt

To load this TSV file into Tableau, remove the double-quote characters (") and replace them with ''.
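
A minimal sketch of that cleanup (assuming sed is available; this rewrites data/ko.tsv via a temporary file):

sed "s/\"/''/g" data/ko.tsv > data/ko.tsv.tmp && mv data/ko.tsv.tmp data/ko.tsv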

Download a list of KEGG modules

curl https://rest.kegg.jp/list/module -o data/modules.txt
echo "Module ID\tModule Name" | cat - data/modules.txt > data/modules.tsv
rm data/modules.txt

The following two scripts download KEGG module definitions: the first stores them as a list of KO numbers, and the second stores them as steps and components.

python3 scripts/data/fetch-kegg-module-ko.py
PYTHONPATH=. python3 scripts/data/fetch-kegg-module-def.py
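
These scripts presumably use the KEGG REST API; to inspect a module entry by hand (including its DEFINITION line of KO numbers), it can be fetched directly, e.g.:

curl https://rest.kegg.jp/get/M00009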

Download KEGG KO profile HMMs

Download the HMM profiles from https://www.genome.jp/ftp/db/kofam/. The profiles.tar.gz file is large, so this may take a while.

Concatenate all the .hmm files together, e.g.

cat profiles/*.hmm > kegg_downloads/ko.hmm

Also, download the ko_list.gz file from the above location into data/ko_thresholds.gz. This file contains scoring criteria for using the HMMs.
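
To get a sense of its contents, the gzipped file can be previewed directly (the first columns include the KO number, the score threshold, and the score type; check the header row for the exact layout):

gunzip -c data/ko_thresholds.gz | head -5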

Download HMM profiles from Pfam

Download https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz and uncompress it into the pfam-downloads directory.
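
A sketch of the download and uncompress steps (assuming curl and gunzip are available):

mkdir -p pfam-downloads
curl -o pfam-downloads/Pfam-A.hmm.gz https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip pfam-downloads/Pfam-A.hmm.gz

Then run the following to create a searchable HMM database: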

hmmpress pfam-downloads/Pfam-A.hmm

Also, download the Pfam-A.clans.tsv file into the data directory.

Prepare List of Genomes

There is a list of coral genomes in data/genomes_coral.txt and a list of Symbiodinium (algae) genomes in data/genomes_algae.txt. Run the following command to generate data/genomes.tsv, which includes genome names and taxonomy information.

PYTHONPATH=. python3 scripts/data/fetch-genomes.py data/genomes_coral.txt

Workflow and Scripts

The general workflow looks like the following, starting with a KEGG pathway module and the HMM profiles for that module's orthologs:

  • Detect Proteins using ortholog profile HMMs: HMM search plus refinement
  • Classify Proteins: hmmscan
  • Cluster Proteins: MMSeqs2
  • Generate MSAs: Muscle
  • Finish Protein Sequences

The following instructions use "m00009", which is the KEGG module M00009 describing the TCA cycle. Replace this number with other module IDs as appropriate.

Generating Query .hmm for a KEGG Module

Use hmmfetch to create a smaller HMM database for the KOs of a module. Use the following naming convention but change the module ID: data/m00009_ko.hmm.
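
A sketch, assuming kegg_downloads/ko.hmm from above and a keyfile (here a hypothetical m00009_kos.txt listing the module's KO numbers, one per line):

hmmfetch --index kegg_downloads/ko.hmm
hmmfetch -f kegg_downloads/ko.hmm m00009_kos.txt > data/m00009_ko.hmm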

Detect Orthologs from Genomes

The following script puts its outputs in the data/m00009_results directory

./scripts/detect/search-genome m00009 GCF_002042975.1

Or if you have a list of genome accessions in a file, e.g. genomes.txt, then do

./scripts/detect/search-genomes m00009 genomes.txt
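
The accession file is assumed to hold one genome accession per line, e.g.:

GCF_002042975.1
GCF_932526225.1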

Use the following script to compare, for a given HMM model, the NCBI-annotated proteins (i.e. those in protein.faa and genomic.gff) against the proteins found by Needle.

PYTHONPATH=. python3 scripts/detect/compare-gff-with-match.py \
  --best-hmm \
  data/m00009_ko.hmm GCF_002042975.1 data/m00009_results/proteins.tsv \
  --output-file <filename>

Classify Proteins

The following two commands classify the detected proteins, first by KEGG ortholog and then by Pfam domain.

PYTHONPATH=. python3 scripts/classify/classify.py \
  --disable-cutoff-ga \
  data/m00009_ko.hmm m00009
PYTHONPATH=. python3 scripts/classify/classify.py pfam-downloads/Pfam-A.hmm m00009

Classification outputs appear in data/m00009_results/classify.tsv.

Annotated proteins submitted to NCBI can be classified in the same way, and added to the same output TSV, using the following two commands.

PYTHONPATH=. python3 scripts/classify/classify.py \
  --disable-cutoff-ga \
  --genome-accession GCF_932526225.1 \
  data/m00009_ko.hmm m00009
PYTHONPATH=. python3 scripts/classify/classify.py \
  --filter-by-prev-output \
  --genome-accession GCF_932526225.1 \
  pfam-downloads/Pfam-A.hmm m00009

The --filter-by-prev-output argument first filters the curated proteins to remove those that do not appear in the data/m00009_results/classify.tsv file; only those proteins matching one or more KEGG orthologs are further classified using Pfam.

The following helper script, classify-ncbi, calls the above two commands for each accession in an accession file.

scripts/classify/classify-ncbi data/genomes_ref.txt

Use the following script to generate the protein_ncbi.tsv and protein_names.tsv files. Both include only proteins from the NCBI reference genomes. protein_ncbi.tsv is similar to proteins.tsv and enumerates exons; protein_names.tsv lists the curated names of proteins and is used in the Tableau workbook.

PYTHONPATH=. python3 scripts/classify/generate-ref-protein-tsv.py m00009 \
  data/m00009_results/protein_ncbi.tsv \
  data/m00009_results/protein_names.tsv

Use the following script to create FASTA files for orthologs, and domains for each ortholog, based on classification results. The FASTA files are written to the data/m00009_results/faa directory.

PYTHONPATH=. python3 scripts/classify/assign.py \
  data/m00009_ko.hmm pfam-downloads/Pfam-A.hmm m00009 \
  --additional-genome-accession GCF_932526225.1

Instead of one additional genome accession, a file containing a list of accessions can also be used.

PYTHONPATH=. python3 scripts/classify/assign.py \
  data/m00009_ko.hmm pfam-downloads/Pfam-A.hmm m00009 \
  --additional-genome-accession data/genomes_ref.txt

Note that different scoring threshold criteria are used for detected proteins (more tolerant) than for proteins from reference genomes (more stringent).

Clustering (Optional)

For each KO, run the following script to cluster assigned sequences further

./scripts/cluster/cluster m00009

Cluster outputs are summarized in data/m00009_results/cluster.tsv, and clustered FAA files are in data/m00009_results/clusters.

Classification and clustering results (i.e. how detected proteins match against KO HMM profiles and how Pfam domains map onto the proteins assigned to a KO) can be visualized using Tableau. A template workbook that uses the classification output TSV and several downloaded data files (e.g. genomes.tsv, ko.tsv, and Pfam-A.clans.tsv) is data/Protein Classification.twb.

Generating Multi-Sequence Alignments

To generate MSAs and PNGs that visualize the MSAs, run the following script. The faa_dir argument can be either the data/m00009_results/faa dir or the data/m00009_results/clusters dir.

./scripts/align/generate-msas m00009 <faa_dir>

This script creates the data/m00009_results/alignments dir and, for each input FAA file, generates an MSA FAA file, a PNG visualizing the MSA, and an HMM profile from the MSA.
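
The profile-building step presumably corresponds to HMMER's hmmbuild; for reference, building a profile from one of the generated alignments by hand would look like this (file names are illustrative):

hmmbuild data/m00009_results/alignments/K00235.hmm data/m00009_results/alignments/K00235_msa.faa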

To compare the alignments of two clusters in the same ortholog group, use the following script

./scripts/align/hhalign m00009 K00235 0bb2a08d 353fd803

Other Scripts

Search in SwissProt for related proteins

scripts/mmseqs-swissprot-search proteins.faa results.tsv

Download files from NCBI

PYTHONPATH=. python3 scripts/ncbi-download.py GCF_932526225.1
