GitHub - epiliper/devira: A light and modern Nextflow pipeline for reference-assisted denovo assembly of viral genomes from metagenomics data.

📖 About

DeViRA is a pipeline for reference-guided de novo assembly of viral genomes from short-read sequencing data, tested with both shotgun and amplicon sequencing approaches. It is inspired by the Broad Institute's assemble_denovo workflow and uses established viral reference sequences to guide accurate de novo assembly.

Right now, DeViRA is intended primarily for assembly of respiratory viruses, and supports:

Human parainfluenza virus subtypes 1 - 4
Influenza A, B, C, D
SARS-CoV-2
Seasonal coronaviruses (subtypes HKU1, 229E, OC43, NL63)
Human metapneumovirus
Measles
Mumps
RSV

🔃 Workflow

DeViRA takes in raw FASTQ files specified in a user-created samplesheet, performs read trimming and QC reporting, and assigns reads to specific taxon bins with kraken2.
The reads associated with each bin are then assembled into contigs with megahit, and the contigs are compared to FASTA references in a user-supplied database; for all references above average nucleotide identity and coverage thresholds, DeViRA will use the references to guide and refine contig arrangement into scaffolds.
Scaffold gaps are filled with read sequence where possible, and any remaining gaps are filled with Ns.
To generate a consensus genome, sequencing reads are realigned to the scaffolds with bwa-mem2, and the alignment is used to call consensus with ivar consensus. This is performed twice to get longer assemblies with less bias towards the chosen reference genome.

💾 Installation

Download the latest devira file from the Releases page, or run this command:
```
wget https://raw.githubusercontent.com/epiliper/devira/refs/heads/main/devira
```
It's recommended that you move this file to a directory on your $PATH so you can run it from anywhere.
Run chmod +x devira to make the script executable.
Install Docker if you haven't already.
Install Nextflow if you haven't already.

🦠 Instructions

Ensure the Docker desktop client is running.
Arrange all input fastqs (can be plain .fastq or compressed .fastq.gz) in their own directory.

🚨 MANDATORY: all fastq files must have unique sample names before the first underscore ('_') character.

RIGHT: sample1_R1.fastq.gz sample1_R2.fastq.gz
WRONG: sample_1_R1.fastq.gz sample_2_R1.fastq.gz

In the wrong case, sample_1_R1 would get wrongly paired with sample_2_R1.
It's recommended to have the read mate info (e.g. R1, 1) immediately follow the first underscore, after the unique sample name.
If file names do not contain underscores, this logic applies to the period ('.') character instead.
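The naming rule above can be checked before a run with a short script. This is a minimal sketch of the rule as described in this section, not DeViRA's actual samplesheet code; the helper names `sample_name` and `group_by_sample` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def sample_name(filename: str) -> str:
    """Return the text before the first '_' (or before the first '.' if there is no '_')."""
    name = Path(filename).name
    sep = "_" if "_" in name else "."
    return name.split(sep, 1)[0]


def group_by_sample(filenames):
    """Group FASTQ file names by their derived sample name."""
    groups = defaultdict(list)
    for f in sorted(filenames):
        groups[sample_name(f)].append(f)
    return dict(groups)


good = ["sample1_R1.fastq.gz", "sample1_R2.fastq.gz"]
bad = ["sample_1_R1.fastq.gz", "sample_2_R1.fastq.gz"]

print(group_by_sample(good))  # both mates share the key "sample1"
print(group_by_sample(bad))   # two different samples collapse into the key "sample"
```

If a group contains files you did not expect to be paired, rename them so that the unique part of the name comes before the first underscore.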

Once you're sure your fastqs are correctly named, run:

devira <fastq_dir> -profile docker --output <out_dir>

For a description of all options, run devira -h.

Once the pipeline is done, consensus genomes can be found in $out_dir/final_files/final_assemblies. For reports of any genomes/samples that failed assembly QC thresholds, see files in $out_dir/fail.

You can test this out with the pipeline's example data, located in fastq_example.

🤓 For developers

Running the pipeline with advanced nextflow options

The devira script is just a wrapper around the basic nextflow run command: it creates a fastq samplesheet from the given input directory and passes all other arguments to nextflow itself. You can therefore pass any of the usual nextflow arguments to the devira script. This includes the -c option, which specifies a configuration file for advanced setups such as running the pipeline on AWS Batch or other cloud computing environments.

Kraken2 database

After trimmed reads are classified with Kraken2 early in the pipeline, reads associated with a specific taxon are extracted into a new file, and this is repeated for all taxon IDs listed in a TSV file input to the pipeline. The default taxon ID list can be found in assets/taxids.tsv. Unclassified reads, and reads not assigned to any taxon ID in the list, are not used in downstream assembly.

Important

If you want to make your own ID list, it's recommended to keep taxon IDs at species level or higher, to avoid over-stratification of reads and a lot of unnecessary file generation and processing.

The kraken2 database we use is generated solely from the fasta sequences in assets/ref.fa from our REVICA-STRM pipeline. To add your own sequences, you'll have to recreate the database. See our instructions for how to do so. You can specify a path to your Kraken2 database with --kraken2_db $DB.

Reference database

The reference database used for selecting sequences to guide scaffolding consists of respiratory virus sequences used in the development of NCBI's VADR project for viral annotation. Inspect assets/refs.fasta if curious. If you intend to use your own database, ensure the headers are structured as follows:

ACCESSION<SPACE>REF_TAG<SPACE>SAMPLE_HEADER

where REF_TAG should be unique to a species-specific segment/genome. Take these entries for Flu A segments PB1 and NS1, and an enterovirus genome, for example:

>NC_007364.1 fluA_NS1 Influenza A virus (A/goose/Guangdong/1/1996(H5N1)) segment 8, complete sequence
>NC_007375.1 fluA_PB1 Influenza A virus (A/Korea/426/1968(H2N2)) segment 2, complete sequence
>AF406813.1 EV Porcine enterovirus 8 strain V13, complete genome
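A custom database can be checked against this header layout before use. The sketch below assumes that all three space-separated fields are required and that fields are separated by single spaces; the `parse_header` helper is hypothetical and is not part of the pipeline.

```python
def parse_header(line: str):
    """Split a FASTA header into (accession, ref_tag, description)."""
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    # Split on the first two spaces only; the description may contain spaces.
    parts = line[1:].rstrip("\n").split(" ", 2)
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"expected 'ACCESSION REF_TAG SAMPLE_HEADER': {line!r}")
    accession, ref_tag, description = parts
    return accession, ref_tag, description


header = ">NC_007375.1 fluA_PB1 Influenza A virus (A/Korea/426/1968(H2N2)) segment 2, complete sequence"
accession, ref_tag, description = parse_header(header)
print(accession, ref_tag)
```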

Also note that the pipeline expects each reference fasta entry to occupy two lines: one for the header, one for the sequence. Files that split the sequence across multiple lines will not be compatible, and any extra newline characters will have to be removed.
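If your reference file wraps sequences across several lines, it can be rewritten into the two-line layout first. The following is a minimal sketch, assuming a plain-text FASTA input; the `linearize_fasta` helper and the example records (`acc1`, `tagA`, and so on) are made up for illustration.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # drop blank lines and stray newlines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


wrapped = [
    ">acc1 tagA example record\n",
    "ACGT\n",
    "TTGA\n",
    "\n",
    ">acc2 tagB another record\n",
    "GGCC\n",
]
for header, seq in linearize_fasta(wrapped):
    print(header)
    print(seq)
```

To convert a real file, pass an open file handle to `linearize_fasta` and write each header and sequence to a new file, one per line.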

You can specify your own reference database with --refs $REF_DB.
