Refine taxassign algo #22

@meglecz

At present, every reference sequence among the best hits is used for the assignment, irrespective of the taxonomic resolution of its annotation: some are identified to species, others only to a higher rank.
This can reduce the taxonomic resolution of the assignment. For example, if we have 2 hits at 97% identity, where one reference sequence is identified to species but the other only to family, the variant will be assigned to the family.
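
To make the effect concrete, here is a minimal sketch of a lowest-common-taxon step. It is not the taxassign code itself: the function name `lowest_common_taxon`, the lineage representation (a root-to-tip list of `(rank, name)` tuples) and the example taxa are assumptions chosen only for illustration.

```python
def lowest_common_taxon(lineages):
    """Return the deepest (rank, name) shared by all lineages, or None.

    Each lineage is a root-to-tip list of (rank, name) tuples (assumed format).
    """
    common = None
    # zip() stops at the shortest lineage, so a hit identified only to
    # family caps the result at family, whatever the other hits say.
    for taxa_at_depth in zip(*lineages):
        if all(taxon == taxa_at_depth[0] for taxon in taxa_at_depth):
            common = taxa_at_depth[0]
        else:
            break
    return common


# Two hypothetical hits at the same % identity.
species_hit = [
    ("order", "Diptera"),
    ("family", "Chironomidae"),
    ("genus", "Chironomus"),
    ("species", "Chironomus riparius"),
]
family_hit = [
    ("order", "Diptera"),
    ("family", "Chironomidae"),
]

assigned = lowest_common_taxon([species_hit, family_hit])
print(assigned)
```

With both hits included the result stops at the family, although one of the references carries a species-level identification.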

I suggest that users should be able to set the minimum resolution of the reference sequences for each % identity threshold.
It could be something like this:
100% species
97% genus
95% family
90% order
85% class
80% phylum
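
One possible way to represent such a user setting is sketched below. The names `MIN_RESOLUTION_BY_IDENTITY` and `required_rank` are hypothetical, and the sketch assumes that an identity falling between two thresholds is governed by the nearest threshold below it; the issue does not specify this.

```python
# Hypothetical user-configurable table: (minimum % identity, minimum rank
# to which a reference sequence must be identified), highest threshold first.
MIN_RESOLUTION_BY_IDENTITY = [
    (100.0, "species"),
    (97.0, "genus"),
    (95.0, "family"),
    (90.0, "order"),
    (85.0, "class"),
    (80.0, "phylum"),
]


def required_rank(identity):
    """Return the minimum reference resolution for a given % identity.

    Returns None when the identity is below the lowest threshold.
    """
    for threshold, rank in MIN_RESOLUTION_BY_IDENTITY:
        if identity >= threshold:
            return rank
    return None


print(required_rank(98.5))
```

Under the stated assumption, a hit at 98.5% identity falls under the 97% row, so only references identified at least to genus would be used.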

I have already made a taxonomy file with an additional column that contains the resolution index:
8: species
7: genus
6: family
5: order
4: class
3: phylum
2: kingdom
1: superkingdom
For other levels the index is a non-integer, e.g. 7.5 for subgenus.
This greatly simplifies the selection of the reference sequences.
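
A rough sketch of why the numeric index helps: selecting the reference sequences becomes a single comparison. The dictionary below copies the indices listed above; the hit structure (dicts with `taxon` and `resolution` keys), the function name `keep_resolved_hits` and the example hits are assumptions for illustration, not the actual taxonomy file format.

```python
# Resolution index per rank, as listed above.
RESOLUTION_INDEX = {
    "species": 8,
    "genus": 7,
    "family": 6,
    "order": 5,
    "class": 4,
    "phylum": 3,
    "kingdom": 2,
    "superkingdom": 1,
}


def keep_resolved_hits(hits, min_rank):
    """Keep only hits whose reference is identified at least to min_rank."""
    min_index = RESOLUTION_INDEX[min_rank]
    # Intermediate ranks carry non-integer indices (e.g. 7.5 for subgenus),
    # so the same numeric comparison handles them without special cases.
    return [hit for hit in hits if hit["resolution"] >= min_index]


# Hypothetical hits at 97% identity, where the minimum resolution is genus.
hits = [
    {"taxon": "Chironomus riparius", "resolution": 8},   # species
    {"taxon": "a subgenus-level reference", "resolution": 7.5},
    {"taxon": "Chironomidae", "resolution": 6},          # family
]

selected = keep_resolved_hits(hits, "genus")
print([hit["taxon"] for hit in selected])
```

Here the family-level reference is dropped before the assignment, while the subgenus-level one is kept because 7.5 is at least 7.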
