Python refactoring to new annotation scheme #4

@pcm32

Description

The Python part needs to be refactored to use the new annotation scheme. Currently, the Python modules read the following from the annotation file:

  • Clade identifier
  • Description of the clade

The current annotation file is expected to be in the same path as the HMMER model, with an .annot file extension. For backward compatibility, we could add a flag for the new annotation file and read it if provided; otherwise, we would expect to find the previous annotation file in the expected location. The class currently responsible for reading the annotation is hmmer/core/ModelAnnotator.
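A minimal sketch of that fallback, under stated assumptions: the flag name `--annotation` and the helper `resolve_annotation` are hypothetical, and "same path with an .annot extension" is read here as replacing the model file's suffix (it could equally mean appending it).

```python
import argparse
from pathlib import Path
from typing import Optional, Tuple


def resolve_annotation(model_path: str,
                       new_annotation: Optional[str] = None) -> Tuple[str, Path]:
    """Return (scheme, path): the new file if given, else the legacy .annot file."""
    if new_annotation is not None:
        return "new", Path(new_annotation)
    # Legacy behaviour (assumption): same location as the HMMER model,
    # with the model's suffix replaced by .annot.
    return "legacy", Path(model_path).with_suffix(".annot")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    # Hypothetical flag for the new annotation file; optional by design.
    parser.add_argument("--annotation", default=None)
    return parser


args = build_parser().parse_args(["--model", "models/clades.hmm"])
scheme, path = resolve_annotation(args.model, args.annotation)
```

Keeping the flag optional means existing invocations keep working unchanged.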

  • Write reader and related classes for new annotation scheme.
  • Add the ability to read the new annotation file in a backward-compatible way, extracting from it the clade description that was previously obtained from the older annotation file.
    • Override the constructor of the class that processes it (ModelAnnotator) to obtain these descriptions from a CladeAnnotation object if provided.
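The constructor change could look roughly like the sketch below. Both classes are simplified stand-ins written for illustration: the real ModelAnnotator and CladeAnnotation interfaces are not described in this issue, so the method names `description_for` and `describe` and the `legacy_descriptions` argument are assumptions.

```python
from typing import Dict, Optional


class CladeAnnotation:
    """Hypothetical stand-in for the new-scheme annotation object."""

    def __init__(self, descriptions: Dict[str, str]):
        self._descriptions = dict(descriptions)

    def description_for(self, clade_id: str) -> Optional[str]:
        return self._descriptions.get(clade_id)


class ModelAnnotator:
    """Simplified stand-in: looks up clade descriptions from either source."""

    def __init__(self,
                 legacy_descriptions: Optional[Dict[str, str]] = None,
                 clade_annotation: Optional[CladeAnnotation] = None):
        if legacy_descriptions is None and clade_annotation is None:
            raise ValueError("need a legacy annotation or a CladeAnnotation")
        self._legacy = dict(legacy_descriptions or {})
        self._clade_annotation = clade_annotation

    def describe(self, clade_id: str) -> Optional[str]:
        # Prefer the new annotation object when it was provided.
        if self._clade_annotation is not None:
            return self._clade_annotation.description_for(clade_id)
        return self._legacy.get(clade_id)
```

Python has no constructor overloading, so an optional keyword argument plays the role of the overridden constructor here.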

The Python code might need the following data from the annotation file:

  • Clade ID
  • Clade description as shown in tool
  • Mol file for monomer (this was previously based on the clade identifier, not anymore)
    • [ ] Make changes in code to use the mol file name given in the new annotation format, if provided. Only needed in the Java part.
  • Postprocessor: This is used by the Java part and should be passed along.
  • VerificationDomains: This probably should be used within the Python part, to correct the annotation given to the Java part. I have my doubts here.
    • Use the annotation object with Domain_Verifier classes, instead of the local loader previously implemented.
    • Test DomainVerifier classes with Annotation reader.
    • Invoke DomainVerifier classes from main script, to influence the resulting SeqObj's features.
      • Write a test for SimpleFeatureWriter making sure that the verification column appears correctly, and fix any issues.
    • Compare output to expected outputs for some sequences, fixing missing annotations that arise.
  • TerminationRule: This is used by the Java part and should be passed along.
  • NonElongating: This is used by the Java part and should be passed along.
  • VerificationDomainIsMandatory: Used in the Python part.
    • Should be used after calling the DomainVerifier classes, possibly to apply some change if the verification fails (either removing the feature, which is preferred, or modifying it). Will be used only in the Java part, to decide whether to make use of the verifications done.
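To make the field list above concrete, one annotation entry could be held in a small data class such as the following. This is only a sketch: the class name, attribute names, types, and defaults are all assumptions, since the new annotation format is not specified in this issue. The `java_passthrough` helper (also hypothetical) collects just the three fields described above as used by the Java part and passed along.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CladeAnnotationEntry:
    """Hypothetical container for one clade's annotation fields."""
    clade_id: str
    description: str
    mol_file: Optional[str] = None
    postprocessor: Optional[str] = None
    verification_domains: List[str] = field(default_factory=list)
    termination_rule: Optional[str] = None
    non_elongating: bool = False
    verification_domain_is_mandatory: bool = False

    def java_passthrough(self) -> Dict[str, object]:
        # Fields the Python part does not interpret, only forwards to Java.
        return {
            "Postprocessor": self.postprocessor,
            "TerminationRule": self.termination_rule,
            "NonElongating": self.non_elongating,
        }
```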

This also means that either these fields need to make it into the new file that Python writes for the Java-CDK part (the features file), or Python generates a simplified file for Java and Java reads all of them from the annotation file. One way to go would be to combine, in the feature file, the fields produced by Python from the sequence search with all the elements read from the annotation file, to avoid the risk of the Java part running with an incorrect annotation file. All the fields produced through the sequence search are stored initially in qualifiers inside SeqFeatures objects, which go inside the SeqRecords returned by the FeatureMarker classes in Query.core. These are in turn written to the .feature file passed to Java by the SimpleFeatWriter class in SimpleFeatWriter.core. This could be the place to add all the annotation elements if a unified output is to be used.
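A rough sketch of the unified-output option, assuming plain dictionaries stand in for the SeqFeature qualifiers and for the per-clade annotation fields. The function names, the `clade_id` key, and the tab-separated layout are assumptions for illustration and do not reflect the actual .feature format or the SimpleFeatWriter code.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO


def unified_rows(features: Iterable[Dict[str, object]],
                 annotation: Dict[str, Dict[str, object]]) -> List[Dict[str, object]]:
    """Merge each feature's qualifiers with the annotation fields of its clade."""
    rows = []
    for qualifiers in features:
        row = dict(qualifiers)
        clade_fields = annotation.get(row.get("clade_id"), {})
        for key, value in clade_fields.items():
            # Values from the sequence search win if a key appears in both.
            row.setdefault(key, value)
        rows.append(row)
    return rows


def write_rows(rows: List[Dict[str, object]], handle: TextIO) -> None:
    """Write the merged rows as tab-separated text with a header line."""
    columns = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t", restval="")
    writer.writeheader()
    writer.writerows(rows)


features = [{"clade_id": "C1", "start": 10}]
annotation = {"C1": {"TerminationRule": "example-rule"}}
buffer = io.StringIO()
write_rows(unified_rows(features, annotation), buffer)
```

With this layout Java would need only the one file, at the cost of repeating the clade-level fields on every feature row.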

Alternatively, the annotation file can be passed to Java, alongside the file with the results of the sequence searches and the alterations to the domain annotations.
