Skip to content

Conversation

@gongyixiao
Copy link
Collaborator

@gongyixiao gongyixiao commented Mar 30, 2022

This might also be useful since we added #1008 and planning to just store BAMs instead of FASTQs.

@gongyixiao gongyixiao added the enhancement New feature or request label Mar 30, 2022
@gongyixiao gongyixiao self-assigned this Mar 30, 2022
Copy link
Collaborator

@anoronh4 anoronh4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed earlier today, we should consider accommodating cases in which the bam has the OQ tag, so that we can get something as close to the original fastq as possible. This can be done with gatk's RevertSam tool. We might have to first parse the input bam file to see if the OQ tag is present, before running the tool. We should consider adding OQ to our aligned bam outputs as well, which was brought up in #947

@gongyixiao gongyixiao marked this pull request as draft October 10, 2022 18:07
@gongyixiao
Copy link
Collaborator Author

As discussed earlier today, we should consider accommodating cases in which the bam has the OQ tag, so that we can get something as close to the original fastq as possible. This can be done with gatk's RevertSam tool. We might have to first parse the input bam file to see if the OQ tag is present, before running the tool. We should consider adding OQ to our aligned bam outputs as well, which was brought up in #947

Yes. I think this can be done by first check if OQ is present, and and then pipe RevertSam output into SamToFastq as input like this https://www.biostars.org/p/56566/ to save disk space.

@gongyixiao gongyixiao marked this pull request as ready for review April 18, 2024 18:02
@gongyixiao
Copy link
Collaborator Author

We need to bring this back to our radar.

@johnoooh
Copy link
Collaborator

Currently if the input bam mapping file is named bamMapping.tsv (default tempo output) it will be overwritten by the script. If this happens the user will lose their original bamMapping file. We should add a check: if bam2fastq is enabled and bammapping filename is the default bamMapping.tsv then exit.

We should force the user to rename their input bamMapping file when bam2fastq is used.

@gongyixiao gongyixiao requested review from anoronh4 and johnoooh May 28, 2025 17:19
@gongyixiao
Copy link
Collaborator Author

Currently if the input bam mapping file is named bamMapping.tsv (default tempo output) it will be overwritten by the script. If this happens the user will lose their original bamMapping file. We should add a check: if bam2fastq is enabled and bammapping filename is the default bamMapping.tsv then exit.

We should force the user to rename their input bamMapping file when bam2fastq is used.

Solved here:
5bffc82

@johnoooh
Copy link
Collaborator

Currently unpaired reads and secondary alignments may mess with Sam2fastq. Eithery they need to be filtered out with samtools view like so:

samtools view -h -f 0x2 -F 0x904 -o tmp.clean.bam ${bam}

or set these to false: INCLUDE_NON_PF_READS=true INCLUDE_NON_PRIMARY_ALIGNMENTS=true

@johnoooh
Copy link
Collaborator

The issue is this samtools view may be too strict, -f 0x2 allows only properly paired reads, while
-F 0x904 removes: 0x100 (secondary) , 0x800 (supplementary), and 0x004 (unmapped).

@gongyixiao
Copy link
Collaborator Author

It is a valid concern about secondary and supplementary alignments being duplicated when converting back to fastq files. My understanding is by setting INCLUDE_NON_PRIMARY_ALIGNMENTS=false, it should solve the problem. No need to to extra steps in samtools. Will need to confirm about it.

INCLUDE_NON_PF_READS=true is for `passes filtering' so I think it should set to true.

And unmapped reads will be kept by SamToFastq as long as the unmapped reads are in the bam. We did included unmapped reads in our alignment steps.

The only question that is not so clear is even we keep the original quality score in the BAM file, it's not clear that SamToFastq will be using the OQ tag when converting and there is no option to force that. So we have to experiment it.

So the plan it to run test to fully test these two points, and also run fastp and compare the original fastq file and the reverted fastq files before we merge this PR.

@johnoooh
Copy link
Collaborator

johnoooh commented Jun 4, 2025

I found some options

Use RevertOriginalBaseQualitiesAndAddMateCigar in picard before we run SamtoFastq.
https://gatk.broadinstitute.org/hc/en-us/articles/4409917453595-RevertOriginalBaseQualitiesAndAddMateCigar-Picard

Use RevertSam in picard before we run SamtoFastq.
https://gatk.broadinstitute.org/hc/en-us/articles/360036831851-RevertSam-Picard

I'll do some exploring.

@gongyixiao
Copy link
Collaborator Author

gongyixiao commented Jul 30, 2025

RevertSam looks like the most appropriate tool to be used here. And it has a spark version.
https://gatk.broadinstitute.org/hc/en-us/articles/4409897195675-RevertSamSpark-BETA

And so far the parameters I'm planning to use are:

gatk RevertSam \
    -I your_final.bam \
    -O reverted_querysorted.bam \
    --SORT_ORDER queryname \
    --RESTORE_ORIGINAL_QUALITIES true \
    --REMOVE_DUPLICATE_INFORMATION true \
    --REMOVE_ALIGNMENT_INFORMATION true \
    --SANITIZE true

@pintoa1-mskcc
Copy link

pintoa1-mskcc commented Aug 6, 2025

Since TEMPO expects files to have the following format: <samplename>_L00#_R#_00#.fastq.gz and RevertSam + SamToFastq does not output this format, during run of this branch an error occurs here

filePrefix = fastqFile1.getSimpleName().split("_R1")[0..-2].join("_R1")

and
def lane = fastq.getSimpleName().split('_L00')[1].split('_')[0]

However running RevertSame and SamToFastq with parameters in this branch and then manually renaming the files to conform to this fastq naming convention, TEMPO runs no problem

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants