phredsort is a command-line tool for sorting sequences in FASTQ files by their quality scores.
Basic usage:
# Read from `input.fastq.gz` and write to `output.fastq.gz`
phredsort -i input.fastq.gz -o output.fastq.gz
# Read from stdin and write to stdout (default when -i/-o not specified)
zcat input.fastq.gz | phredsort | less -S
# Explicit stdin/stdout (equivalent to above)
zcat input.fastq.gz | phredsort -i - -o - | less -S

Sorting by quality values stored in sequence headers (`headersort` subcommand):
phredsort headersort -i input.fasta -o output.fasta --metric maxee
phredsort headersort -i input.fastq -o output.fastq --metric avgphred --minqual 20 --maxqual 40
phredsort headersort -i input.fa -o output.fa --metric meep --ascending

Examples of supported header formats:
- Space-separated: ">seq1 maxee=2.5 size=100"
- Semicolon-separated: ">seq1;maxee=2.5;size=100"
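
To illustrate how the two header layouts above carry the same information, here is a minimal Python sketch that extracts `key=value` annotations from either form. It is not phredsort's own parser; the function name `parse_header` is hypothetical, and the sketch assumes that fields never contain spaces or semicolons themselves.

```python
import re

def parse_header(header):
    """Split a FASTA/FASTQ header into (sequence ID, annotations dict)."""
    # Drop the leading '>' (FASTA) or '@' (FASTQ) marker.
    body = header.lstrip(">@").strip()
    # Fields may be separated by whitespace or by semicolons.
    fields = [f for f in re.split(r"[;\s]+", body) if f]
    seq_id = fields[0]
    # Keep only fields of the form key=value.
    attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    return seq_id, attrs

print(parse_header(">seq1 maxee=2.5 size=100"))
print(parse_header(">seq1;maxee=2.5;size=100"))
```

Both calls yield the same ID and the same annotations, so a sort key such as `float(attrs["maxee"])` can be built without caring which separator the file uses.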

Installation (pre-compiled binary):
wget https://github.com/vmikk/phredsort/releases/download/1.4.0/phredsort
chmod +x phredsort
./phredsort --help

Installation (building from source with Go):
git clone --depth 1 https://github.com/vmikk/phredsort
cd phredsort
go build -ldflags="-s -w" phredsort.go
./phredsort --help

phredsort supports several metrics (`--metric` parameter) to assess sequence quality:

Average Phred score (`--metric avgphred`):
- Properly calculated mean quality score that accounts for the logarithmic nature of Phred scores
- Converts Phred scores to error probabilities, calculates their arithmetic mean, then converts back to the Phred scale
- Formula: `-10 * log10(mean(10^(-Q/10)))`
- More accurate than a simple arithmetic mean of Phred scores, which would overestimate quality
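
The averaging steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual code: the function name `avg_phred` is hypothetical, and quality strings are assumed to use the common Phred+33 ASCII encoding.

```python
import math

def avg_phred(quality, offset=33):
    """Mean quality computed in error-probability space."""
    if not quality:
        raise ValueError("empty quality string")
    # Decode each ASCII character to its Phred score Q.
    scores = [ord(ch) - offset for ch in quality]
    # Convert each Q to an error probability p = 10^(-Q/10).
    probs = [10 ** (-q / 10) for q in scores]
    mean_p = sum(probs) / len(probs)
    # Convert the mean error probability back to the Phred scale.
    return -10 * math.log10(mean_p)

# Two bases with Q=10 ('+') and Q=40 ('I').
# The naive arithmetic mean of the scores would be 25.
print(round(avg_phred("+I"), 2))  # about 13.01
```

The low-quality base dominates the result, which is why the naive mean of 25 overestimates the quality of this read.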

MaxEE (`--metric maxee`):
- Sum of error probabilities for all bases in a sequence
- Formula: `sum(10^(-Q/10))`
- Higher values indicate lower quality
- Depends on sequence length (longer sequences tend to have higher MaxEE)
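
As a rough illustration of the formula and of its length dependence, the following Python sketch sums per-base error probabilities. The function name `max_ee` is hypothetical and Phred+33 encoding is assumed; this is not taken from the phredsort source.

```python
def max_ee(quality, offset=33):
    """Sum of per-base error probabilities (expected number of errors)."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)

# '5' encodes Q=20, i.e. an error probability of 0.01 per base.
short_read = "5" * 10
long_read = "5" * 100

# Same per-base quality, but the longer read accumulates more expected errors.
print(round(max_ee(short_read), 3))
print(round(max_ee(long_read), 3))
```

With identical per-base quality, the 100-base read has ten times the expected errors of the 10-base read.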

MEEP (`--metric meep`):
- MaxEE standardized by sequence length
- Represents the expected number of errors per 100 bases
- Formula: `(MaxEE * 100) / sequence_length`
- Higher values indicate lower quality
- Allows fair comparison between sequences of different lengths
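
A small Python sketch of the normalization, assuming Phred+33 encoding. The function name `meep` is hypothetical and the code is only meant to show the arithmetic, not how phredsort implements it.

```python
def meep(quality, offset=33):
    """Expected errors per 100 bases: (MaxEE * 100) / sequence length."""
    if not quality:
        raise ValueError("empty quality string")
    # MaxEE is the sum of per-base error probabilities.
    expected_errors = sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)
    return expected_errors * 100 / len(quality)

# Reads of different lengths with the same per-base quality (Q=20)
# receive the same score after normalization.
print(round(meep("5" * 10), 3))
print(round(meep("5" * 100), 3))
```

Both reads score one expected error per 100 bases, so they rank equally despite the tenfold difference in length.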

Low-quality base count (`lqcount`):
- Number of bases below a specified quality threshold
- Useful for binned quality scores (e.g., data from the Illumina NovaSeq platform)
- Counts bases with Phred score < threshold (default: 15)
- Higher values indicate lower quality
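
The counting rule can be sketched as follows. This is a hedged example: the function name `lq_count` and its `threshold` argument are hypothetical stand-ins, Phred+33 encoding is assumed, and the sample string uses only four distinct score values to mimic binned output.

```python
def lq_count(quality, threshold=15, offset=33):
    """Number of bases whose Phred score is strictly below the threshold."""
    return sum(1 for ch in quality if ord(ch) - offset < threshold)

# Binned-style scores: 'F' = 37, '8' = 23, '-' = 12, '#' = 2.
quality = "FF8-#F"
print(lq_count(quality))                # bases below the default of 15
print(lq_count(quality, threshold=30))  # a stricter cut-off
```

With only a handful of possible score values, a simple count below a cut-off loses little information.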

Low-quality base percentage:
- Percentage of bases below the quality threshold
- Formula: `(lqcount * 100) / sequence_length`
- Higher values indicate lower quality
- Normalizes the low-quality base count by sequence length
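
For completeness, here is the same count expressed as a percentage of read length. As before, this is a sketch: the function name `lq_percent` is hypothetical, Phred+33 encoding is assumed, and the default threshold of 15 mirrors the default stated for the low-quality base count.

```python
def lq_percent(quality, threshold=15, offset=33):
    """Percentage of bases whose Phred score is below the threshold."""
    if not quality:
        raise ValueError("empty quality string")
    low = sum(1 for ch in quality if ord(ch) - offset < threshold)
    return low * 100 / len(quality)

# Eight bases, two of which ('-' = 12 and '#' = 2) fall below 15.
print(lq_percent("FFFF8-#F"))  # 25.0
```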
