TLRFinder is a tools desgined to detect and analyse Telomere-Linked-Repeats (TLR)
TLRs are, as the name describes, regions adjacent to telomeres that are conserved across chromosome ends
The most commonly described version contains Telomere-Linked-Helicases, such as the Y' element in S. cerevisiae
However other coding sequences also exist within TLRs and are also conserved across diverse fungi
With the advent of Long-read sequencing we now have an increasing number of well assembled genomes with the subtelomeres intact
Therefore we are now primed for looking at the TLRs and their evolution.
All that is required to run TLRFinder is an assembly in fasta format (can be bgzip compressed) or a tsv file containing a list of assemblies (first column: sample name; second column: assembly)
NOTE: Do not use hyphens '-' or other special characters in the sample/file names
This tool was developed for O'Donnell et al. 2025 (please cite this publication if you find this tool useful)
conda install samtobam::tlrfinder
TLRFinder.sh -a assembly.fa
or
TLRFinder.sh -al list.tsv
Required inputs:
-a | --assembly Genome assembly in fasta format (*.fa / *.fasta / *.fna) and can be gzipped (*.gz) with bgzip
or
-al | --assemblylist A tsv file containing sample names in the first column and assembly paths in the second column
Recommended inputs:
-ts | --tipsize Length of contig ends to be extracted for TLR detection (Default: 50000)
-tr | --telomererepeat Telomeric repeat pattern (Default: TTAGGG)
-ct | --covthreshold The amount of coverage required for a region to be considered for TLR clustering relative to the number fo telomeres (Default = 0.75)
-sm | --sizemin The minimum size of a region passing the coverage threshold to be considered as a potential TLR region (Default: 2000)
Multiple assembly specific parameters (if using --al)
-b | --bootstraps Number of bootstrap tests to be performed by mashtree (Default: 1000)
Optional parameters:
-w | --window Number of basepairs for window averaging for coverage (Default: 10)
-s | --slide Number of basepairs for the window to slide for coverage (Default: 5)
-p | --prefix Prefix for output (Default: TLRFinder)
-o | --output Name of output folder for all results (default: TLRFinder_output)
-h | --help Print this help message
TLRFinder works in 10 main steps (each step is run on all assemblies provided -al)
- Extract 50kb from the ends of contigs >100kb
- Identify Telomeric sequences genomes wide and determine the number within the contig ends
- Align all contig ends to one another
- Calculate the average coverage of sliding windows within the 50kb ends
- Extract contiguous regions with a coverage > (0.75*'the number of telomeric sequences')
- Remove regions < 2kb
- Cluster the nucleotide sequence of the remaining regions using the 80/80 principal
- Consider the largest cluster as the TLR and take the largest version as the representative
- Search the whole genome using BLASTn using the representative TLR
- Plot the alignment and whole genome positions of TLRs for manual verification/scrutiny
TLRFinder runs an additional 3 steps if provided multiple assemblies using -al
11. Rapidly generate a k-mer NJ phylogeny using mashtree
12. Calculate global-ANI (g-ANI) statistics for TLRs within an assembly and between TLR representatives
13. Plot the tree and g-ANI stats side by side
-The primary output is the summary_stats.tsv file with some summarising data per assembly
Column 1 : 'sample' : name given to samples in -al file or prefix if a single assembly (-a)
Column 2 : 'assembly' : name of assembly file provided
Column 3 : 'contigs' : Number of contigs in assembly
Column 4 : 'telomeric_repeats' : Number of good telomeric repeat regions found (minimum of 50bp)
Column 5 : 'telomeric_repeats_contig_ends' : Number of the above telomeric repeats found within the contig ends (Default: 50kb from end of contigs larger than 100kb)
Column 6 : 'TLR_regions' : Number of good TLRs detected genome wide
Column 7 : 'TLR_regions_contig_ends' : Number of the above TLRs found within the contig ends (Default: 50kb from end of contigs larger than 100kb)
Column 8 : 'TLR_representative_coords' : The coordinates in the assembly to the TLR representative used in the analyses
Column 9 : 'TLR_representative_size' : The size in basepairs of the TLR representative
Column 10 : 'TLR_representative_size' : The average size of the regions detected that clustered with the TLR representative
Column 12 : 'TLR_representative' : The nucleotide sequence of the TLR representative
-Per sample plots are all placed in plotting_Rscripts
plotting_Rscripts/*.end_alignments.svg : Contigs with ends aligned with TLRs highlighted (red) and telomeric repeats showing (black dots)
plotting_Rscripts/*.end_alignments.filtered.svg : Same as above but only with contig ends containing at least one telomeric repeat or TLR
plotting_Rscripts/*.whole_genome.svg : Positions of telomeric_repeats (dots) and TLR_regions (triangles) in the whole genome
-Comparisons between samples are placed in output folder phylogeny_plus_gANI_heatmap.svg : a midrooted mashtree of all assemblies horizontally aligned with lz-ani calulated global ANI (gANI) scores comparing the representatives of each assemblies TLR. Far right is the same gANI scores but calculated between repeats for the same assembly.
This value is essentially is used to find repeats that are at least in this many copies compared to telomeric sequences
This tools was designed primarily on Fusarium and Pyricularia assemblies which contain (if present) TLRs on most chromosome ends
However they may be less frequent in other species
The default of 0.75 is too make sure that other potentially repetitive regions within 50kb of the contig ends are not kept
When looking a the alignments of the contig ends (and maybe manually reordering the ends etc), did you notice another region next to telomeres with several alignments?
Here is a few lines you can run in order to extract and detail another repeat
All you need to know is the coordinates for one of the repeats (will be considered the representative copy)
For Example: I extracted a secondary repeat in the assembly 'GCA030345115.fa' from contig 'GCA030345115_CP128282.1' between position 2 and 9881 (GCA030345115_CP128282.1:2-9881)
In my run of TLRFinder on 'GCA030345115.fa', the prefix was GCA030345115 and therefore in the TLRFinder output folder I ran the below to add the secondary repeat:
##my run of TLRFinder commented out
#TLRFinder -a GCA030345115.fa -p GCA030345115 -o GCA030345115_TLRFinder_output
prefix="GCA030345115"
assembly="GCA030345115.fa"
coords="GCA030345115_CP128282.1:2-9881"
##move into the TLRFinder output for 'GCA030345115.fa'
cd GCA030345115_TLRFinder_output/
##manually extract the second repeat for ${prefix} based on using the synteny alignment script (only the primary repeat was found previously)
samtools faidx ${assembly} ${coords} > subtelomeric_repeats/${prefix}.manual_second.repeat_rep.fa
##get the positions of this second repeat in the contig ends
blastn -subject contig_ends/${prefix}.${tipsize2}kb_ends.fa -query subtelomeric_repeats/${prefix}.manual_second.repeat_rep.fa -outfmt 6 > subtelomeric_repeats/${prefix}.manual_second.repeat_rep.ends_blast.tsv
##create bed from nonredundant positions (used for plotting)
echo "contig;start;end" | tr ';' '\t' > subtelomeric_repeats/${prefix}.manual_second.repeat_rep.ends_blast.bed
cat subtelomeric_repeats/${prefix}.manual_second.repeat_rep.ends_blast.tsv | awk '{if($3 > 80 && $4 > 500) print}' | awk '{print $2"\t"$9"\t"$10}' | awk '{if($2>$3) {print $1"\t"$3"\t"$2} else {print}}' | sort -k1,1 -k2,2n | bedtools merge -d 10 >> subtelomeric_repeats/${prefix}.manual_second.repeat_rep.ends_blast.bed
##and the same now for the whole genome
blastn -subject assemblies/${prefix}.fa -query subtelomeric_repeats/${prefix}.manual_second.repeat_rep.fa -outfmt 6 > subtelomeric_repeats/${prefix}.manual_second.repeat_rep.WG_blast.tsv
echo "contig;start;end" | tr ';' '\t' > subtelomeric_repeats/${prefix}.manual_second.repeat_rep.WG_blast.bed
cat subtelomeric_repeats/${prefix}.manual_second.repeat_rep.WG_blast.tsv | awk '{if($3 > 80 && $4 > 500) print}' | awk '{print $2"\t"$9"\t"$10}' | awk '{if($2>$3) {print $1"\t"$3"\t"$2} else {print}}' | sort -k1,1 -k2,2n | bedtools merge -d 10 >> subtelomeric_repeats/${prefix}.manual_second.repeat_rep.WG_blast.bed
##Now you can plot both side by side
TLRFinder looks for the conserved region however the actual expressed portion of the repeat (generally a helicase) is only contained within the repeat
Therefore you may be interested to find domains within the repeat as an indication of the function
Notably, in O'Donnell et al. 2025 using TLRFinder we found several examples of repeats containing genes that were not helicases.
This step requires you to download a large database of conserved domains in order to search against
##first download the appropriate conserved domain library (https://pmc.ncbi.nlm.nih.gov/articles/PMC7378889/)
wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/little_endian/Cdd_LE.tar.gz
mkdir Cdd_LE
mv Cdd_LE.tar.gz Cdd_LE
cd Cdd_LE
tar -zxvf Cdd_LE.tar.gz
##also download a library to read the cdd index numbers
wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cddid.tbl.gz
gunzip cddid.tbl.gz
cd ../
Then use rpstblastn to search against these domains with your representative TLR OR all TLRs found in your assembly dataset
Here we are using the TLR representative for GCA030345115
We also just grab the top 5 for the blast results for looking at the function using the cdd index table (this can definitely be increased if necessary)
sample="GCA030345115"
input="GCA030345115_TLRFinder_output/subtelomeric_repeats/GCA030345115.repeat_rep.fa"
##alternatively if having used multiple assemblies with -al you could use the combined 'TLRFinder_output/repeat_representatives.fa' file which contains all TLR representatives from assemblies where it was found
tophits="5"
##get the outfmt 6 plus the actual sequence aigned in the query (our repeats)
rpstblastn -query ${input} -db Cdd_LE/Cdd -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qseq" > ${sample}.repeat_rep.rpsblast_output.txt
##now take the best 5 hits for each representative
echo "repeat_representative;CDD;domain;title;description" | tr ';' '\t' > ${sample}.repeat_rep.rpsblast_output.besthit_description.tsv
grep '>' ${input} | sed 's/>//g' | while read rep
do
grep "${rep}" ${sample}.repeat_rep.rpsblast_output.txt | head -n${tophits}
done | while read line
do
cdd=$( echo "${line}" | cut -f2 | awk -F "|" '{print $3}' )
cdd2=$( grep ^${cdd} Cdd_LE/cddid.tbl | awk -F "\t" '{print "CDD:"$1"\t"$2"\t"$3"\t"$4}' )
echo "${line}" | awk -v cdd2="$cdd2" '{print $1"\t"cdd2}'
done >> ${sample}.repeat_rep.rpsblast_output.besthit_description.tsv
