DeViRA is a pipeline for reference-guided de novo assembly of viral genomes from short-read sequencing data, tested with both shotgun and amplicon sequencing approaches. It is inspired by the Broad Institute's assemble_denovo workflow and uses established viral reference sequences to guide accurate de novo assembly of viral genomes.
Right now, DeViRA is intended primarily for assembly of respiratory viruses, and supports:
- Human parainfluenza virus subtypes 1 - 4
- Influenza A, B, C, D
- SARS-CoV-2
- Seasonal coronaviruses (subtypes HKU1, 229E, OC43, NL63)
- Human metapneumovirus
- Measles
- Mumps
- RSV

DeViRA takes in raw FASTQ files specified in a user-created samplesheet, performs read trimming and QC reporting, and assigns reads to specific taxon bins with kraken2.
The reads associated with each bin are then assembled into contigs with megahit, and the contigs are compared to FASTA references in a user-supplied database. For every reference that exceeds the average nucleotide identity (ANI) and coverage thresholds, DeViRA uses that reference to guide and refine the arrangement of contigs into scaffolds.
Scaffold gaps are filled using read sequence where possible, and any remaining gaps are filled with Ns.
To generate a consensus genome, sequencing reads are realigned to the scaffolds with bwa-mem2, and the alignment is used to call a consensus with ivar consensus. This is performed twice to get longer assemblies with less bias towards the chosen reference genome.

- Download the latest devira file from the "Releases" page, or run this command:
  wget https://raw.githubusercontent.com/epiliper/devira/refs/heads/main/devira
  It's recommended that you move this file to a directory on your $PATH so you can run it from anywhere.
- Run chmod +x devira to make the script executable.
- Install Docker if you haven't already.
- Install Nextflow if you haven't already.
- Ensure the Docker Desktop client is running.
- Arrange all input FASTQ files (plain .fastq or compressed .fastq.gz) in their own directory.
  🚨 MANDATORY: all FASTQ files must have unique sample names before the first underscore ('_') character.
  RIGHT: sample1_R1.fastq.gz sample1_R2.fastq.gz
  WRONG: sample_1_R1.fastq.gz sample_2_R1.fastq.gz
  - In the wrong case, sample_1_R1 would be wrongly paired with sample_2_R1.
  - It's recommended to have the read mate info (e.g. R1, 1) immediately follow the first underscore, after the unique sample name.
  - If file names do not contain underscores, this logic applies to the period ('.') character instead.
- Once you're sure your FASTQ files are correctly named, run:
  devira <fastq_dir> -profile docker --output <out_dir>
  For a description of all options, run devira -h.
- Once the pipeline is done, consensus genomes can be found in $out_dir/final_files/final_assemblies. For reports of any genomes/samples that failed assembly QC thresholds, see the files in $out_dir/fail.
You can test this out with the pipeline's example data, located in fastq_example.
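To check file names against the naming rule above before launching a run, the implied sample name can be derived in the shell. This is a minimal sketch of the rule as described here (text before the first underscore, or before the first period when there is no underscore). It is not DeViRA's own samplesheet code, and the function names sample_name and list_sample_names are made up for illustration.

```shell
# Hypothetical helper, not part of DeViRA: print the sample name implied by
# a FASTQ file name under the rule described above.
sample_name() {
    base="$(basename "$1")"
    case "$base" in
        *_*) printf '%s\n' "${base%%_*}" ;;   # text before the first underscore
        *)   printf '%s\n' "${base%%.*}" ;;   # no underscore: text before the first period
    esac
}

# Print each distinct sample name found in a directory, with its file count.
list_sample_names() {
    for f in "$1"/*.fastq "$1"/*.fastq.gz; do
        [ -e "$f" ] || continue               # skip glob patterns that matched nothing
        sample_name "$f"
    done | sort | uniq -c
}
```

Running list_sample_names on the input directory gives one line per sample name. A name with more files than expected suggests that two samples share a prefix, as in the WRONG example.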
The devira script is just a wrapper around the basic nextflow run command: it creates a FASTQ samplesheet from the given input directory and passes all other arguments to Nextflow itself. You can therefore pass any of the usual Nextflow arguments to the devira script, including the -c option, which specifies a config file for advanced settings such as running the pipeline on AWS Batch or other cloud computing environments.
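As one illustration of the -c option, the sketch below writes a small Nextflow config file for AWS Batch. The settings it uses (process.executor, process.queue, aws.region, workDir) are standard Nextflow configuration options, but the queue name, region, and S3 bucket are placeholders, and the function name write_batch_config is hypothetical. Whether DeViRA needs further settings to run on AWS Batch is not covered here.

```shell
# Hypothetical helper: write an example Nextflow config for AWS Batch
# to the path given as the first argument.
write_batch_config() {
    cat > "$1" <<'EOF'
// All values below are placeholders; replace them with your own.
process.executor = 'awsbatch'
process.queue    = 'my-batch-queue'
aws.region       = 'us-east-1'
workDir          = 's3://my-bucket/devira-work'
EOF
}
```

After running write_batch_config aws_batch.config, the file can be passed to the pipeline by adding -c aws_batch.config to the devira command.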
After trimmed reads are classified with Kraken2 early in the pipeline, the reads assigned to a specific taxon are extracted into a new file; this is repeated for every taxon ID listed in a TSV file supplied to the pipeline. The default taxon ID list can be found in assets/taxids.tsv. Reads that are unclassified, or that are not assigned to any taxon ID in the list, are not used in downstream assembly.
Important
If you want to make your own ID list, it's recommended to keep taxon IDs at species level or higher, to avoid over-stratifying reads and generating and processing many unnecessary files.
The kraken2 database we use is generated solely from the fasta sequences in assets/ref.fa from our REVICA-STRM pipeline. To add your own sequences, you'll have to recreate the database. See our instructions for how to do so. You can specify a path to your Kraken2 database with --kraken2_db $DB.
The reference database used to select sequences for guiding scaffolding consists of respiratory virus sequences used in the development of NCBI's VADR project for viral annotation. Inspect assets/refs.fasta if curious. If you intend to use your own database, ensure the headers are structured as follows:
ACCESSION<SPACE>REF_TAG<SPACE>SAMPLE_HEADER
where REF_TAG should be unique to a species-specific segment/genome. Take these entries for Flu A segments PB1 and NS1, and an enterovirus genome, for example:
>NC_007364.1 fluA_NS1 Influenza A virus (A/goose/Guangdong/1/1996(H5N1)) segment 8, complete sequence
>NC_007375.1 fluA_PB1 Influenza A virus (A/Korea/426/1968(H2N2)) segment 2, complete sequence
>AF406813.1 EV Porcine enterovirus 8 strain V13, complete genome
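A custom database can be screened for malformed headers before use. The sketch below assumes that a valid header has at least three space-separated fields (ACCESSION, REF_TAG, and a SAMPLE_HEADER that may itself contain spaces). That reading of the format, and the function name check_ref_headers, are assumptions for illustration and not part of DeViRA.

```shell
# Hypothetical helper, not part of DeViRA: report FASTA headers with fewer
# than three space-separated fields and return a non-zero status if any exist.
check_ref_headers() {
    awk '
        BEGIN { bad = 0 }
        /^>/ {
            # Drop the leading ">" and split the header on whitespace.
            n = split(substr($0, 2), fields, " ")
            if (n < 3) {
                printf "line %d: expected ACCESSION REF_TAG SAMPLE_HEADER, got: %s\n", NR, $0
                bad = 1
            }
        }
        END { exit bad }
    ' "$1"
}
```

The check does not verify that each REF_TAG is unique to a species-specific segment or genome; that still has to be reviewed by hand.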
Also note that the pipeline expects each reference fasta entry to occupy two lines: one for the header, one for the sequence. Files that split the sequence across multiple lines will not be compatible, and any extra newline characters will have to be removed.
You can specify your own reference database with --refs $REF_DB.
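Reference files that wrap sequences across several lines can be converted to the two-line layout with awk. This is a minimal sketch rather than a tool shipped with DeViRA; the function name linearize_fasta and the file names in the usage note are made up for illustration.

```shell
# Hypothetical helper, not part of DeViRA: rewrite a FASTA file so that each
# record is exactly one header line followed by one sequence line.
linearize_fasta() {
    awk '
        { sub(/\r$/, "") }                    # drop Windows line endings
        /^>/ {
            if (seq != "") print seq          # flush the previous record
            print                             # header line, unchanged
            seq = ""
            next
        }
        { gsub(/[ \t]/, ""); seq = seq $0 }   # join sequence lines; blank lines add nothing
        END { if (seq != "") print seq }
    ' "$1"
}
```

For example, linearize_fasta my_refs.fasta > my_refs.two_line.fasta writes a converted copy that can then be supplied with --refs.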


