Describe the issue
I am currently trying to build a generic analysis workflow (in Galaxy) for sequencing data of Morbilliviruses and related viruses from Paramyxoviridae. It should go from sequenced reads to consensus genomes, and also to annotated lists of variants. Because I want to support a broad range of viruses, my idea was to let users bring their own reference sequences and corresponding annotation files in GTF or GenBank format, and have the workflow build the SnpEff db on the fly.
Unfortunately, all members of Paramyxoviridae share a somewhat complicated P/V/C ORF, which can give rise to up to 3 different protein products. In human measles virus, the best-known representative, the situation is this:
- The regular transcript of the P/V/C locus, which corresponds to the viral genome template, gives rise to two proteins:
  - P protein, which gets translated from the first start codon of the transcript
  - C protein, when the ribosome uses an internal start
- Programmed polymerase stuttering can lead to an alternative transcript, with an inserted G (not found on the template) around half-way along the transcript. This gives rise to the:
  - V protein, which is identical to the first part of P, but then continues in a different reading frame.
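To make the three products concrete, here is a toy Python sketch of the situation described above. The sequence is made up for illustration (it is not measles virus), the editing position is simply assumed to sit right after a short run of Gs, and all names (`TEMPLATE`, `EDIT_SITE`, `translate`) are hypothetical. It only shows how one locus can yield P, C and V; it says nothing about how SnpEff models this.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Toy template sequence (assumption: invented, NOT a real viral sequence).
TEMPLATE = "ATGGCATGGAAAAAGGGCACTGATCCTTAA"
# Assumed editing site: the extra G is inserted right after the "GGG" run.
EDIT_SITE = TEMPLATE.index("GGG") + 3


def translate(seq: str, start: int) -> str:
    """Translate seq from position start until a stop codon or the end."""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# P: unedited transcript, first start codon.
p_protein = translate(TEMPLATE, TEMPLATE.find("ATG"))

# C: unedited transcript, internal start codon in a different frame.
c_protein = translate(TEMPLATE, TEMPLATE.find("ATG", 1))

# V: edited transcript with one non-templated G inserted at the editing site,
# translated from the same start codon as P.
edited = TEMPLATE[:EDIT_SITE] + "G" + TEMPLATE[EDIT_SITE:]
v_protein = translate(edited, edited.find("ATG"))

print("P:", p_protein)  # P: MAWKKGTDP
print("C:", c_protein)  # C: MEKGH
print("V:", v_protein)  # V: MAWKKGH
```

In this toy case V shares its N-terminal part (`MAWKKG`) with P and then diverges because of the frameshift, while C comes from a different frame of the unedited transcript. The complication for annotation is that the V coding sequence contains a base that does not exist in the reference genome.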
I've been doing tests with a measles RefSeq sequence and its annotation files from NCBI in GTF and GenBank format, and built SnpEff dbs (v5.2) from both of them. Neither of the two versions gave correct annotations for all three products, although both files from NCBI seem to make sense to me.
I am aware that the typical answer is: request the genome and don't try to build the db yourself. I'm afraid that for this use case, though, I really need to let users choose which virus they want to analyze. Human measles virus would probably be of interest to many, but there are many different viruses I'd like to support.
So what does it take to build a SnpEff db for such a case, which is reminiscent of the ORF1ab situation in SARS-CoV-2, just a bit more complex? If you could provide some guidance, that would be greatly appreciated.
I could also share a measles virus VCF if you would like to play around with it.
Thanks a lot for any help you can provide!