Skip to content

drowsygoat/RMLizer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

     __             __   _
    /__\   /\/\    / /  (_) ____  ___  _ __
   / \//  /    \  / /   | ||_  / / _ \| '__|
  / _  \ / /\/\ \/ /___ | | / / |  __/| |
  \/ \_/ \/    \/\____/ |_|/___| \___||_|

Intended use: Quantifying T->C conversions in RNA sequencing data * Processing paired-end BAM files for downstream analysis (quantification of T->C conversions)

How to run: * Make script executable (chmod u+x rebound.sh) and run (see examples). Because GAWK is single-threaded, it is best to run the script using only one CPU thread (flag -t 1).

Required Input: Paired-end or single-end SAM/BAM files Annotation file (gtf/gff) and genomic reference.

Output:
* Table of counts of mismatches across all possible nucleotide combinations * Table of counts of mismatches across all possible nucleotide combinations (background-corrected) * Sanity-check plots

Prerequisites: * Samtools (https://anaconda.org/bioconda/samtools) * GAWK (https://anaconda.org/anaconda/gawk)

Options and usage:

rebound.sh [options] [BAMs]

Option Default Description
-a ./ Path to RMLizer.awk
-o ./ Output directory
-l star Aligner used to prepare BAM files (one of bbmap or star). Rebound was tested with output of STAR and BBmap (sam=1.3 flag). If using other aligner, alignment files should be in SAM 1.3 (CIGARs must not contain X and = operations, but only MNDIS). Files in SAM 1.4 format can be converted to SAM 1.3 with reformat.sh in=input.bam out=output.bam sam=1.3.
-r Fasta reference file with full path. Genome reference that is the same as the one used to prepare BAM files or updated based on SNP analysis (variant calling). The latter can be done with bcftools (see examples).
-f Overwrite MD tags. With this flag set MD tags are added if absent, or overwritten if present. In mode "normal" (default) MD tag are replaced by default, so this flag has no effect.
-i disabled Filter reads containing indels smaller than this value (reads will be excluded from the output). Accepted values are [1:30] or "ban" (the latter bans reads with indels completely, and will ban all reads spanning splice junctions if introns are designated as deletions in the BAM files).
-m normal Mode of function (slamdunk or normal). In mode normal a table with mismatches per gene and relevant sanity check plots are produced. In mode slamdunk a slamDunk-compatible BAM file(s) are produced in addition. Currently, mode slamdunk works best with single-end input but if paired-end BAM files are provided (type of file should be determined automatically), rebound will attempt to convert them to single-end while retaining original mapping positions (experimental).
-g Annotation (gtf/gff) file with path. If using unstranded libraries, this file may include strandedness information appended to gene identifiers as double underscore followed by + or - sign, e.g., ENSMUSG00000079037_+.
-q 254,20,0.95,27,27 (star)
20,100000,0.95,27,27 (bbmap)
Quality filtering settings. It is a string of 5 comma-separated values (order of values important) defining: 1. MAPQ (integer): mapping quality filter; the default value for star (255) keeps only reads with MAPQ 255, which is STAR's default value for unique mappers. 2. ED (integer): Read filtering threshold by edit distance (e.g., 20 keeps reads with values below this threshold). Set to large values to disable the filter (this may be necessary if introns (N) are represented as deletions (D) in CIGARs) 3. XF (decimal [0-1): Read filtering threshold by mismatch fraction (e.g., 0.95 excludes reads with mismatch content above 5%). 4. QV (integer [1-41]): Base quality threshold; excludes mismatches when called base quality is below this value (range [1-41]). 5. QVTC (range [1-41]): Same as QV but only affects T->C mismatches. Must be higher then QV to have an effect.
-s forward Strandedness. In case of unstranded libraries Rebound will read strand information from annotation file. In such case strandedness info (+ or -) must be appended to gene identifiers in the GTF file (via double underscore followed by + or - sign, e.g., ENSMUSG00000079037__+). GTF file can be modified as shown in the example.
-d Remove duplicated reads. Requires duplicates marked with Picard MarkDuplicates or a similar tool. If this flag is NOT set, and duplicates are marked, duplicate information will be included in the final output counts, allowing two distinct analyses (with and without duplicate removal).
-u Use only uniquely mapped reads. Default is to use all reads.
-x 1G Per thread memory block size (affects samtools).
-t 1 Number of CPU threads (affects samtools).
-h Prints help, overrides any remaining options.
BAM List BAMs to process (wildcards [*] accepted). Also accepts SAM cd28format.

Examples

Adding strand information to GTF file (works on GENCODE GTF format):

cat annotations.gtf | awk 'BEGIN{OFS="\t"}{gsub(/\.[[:digit:]]+";|";/,"__"$7"\";",$10); print $0}' > annotations_with_strand.gtf

Running rebound on a STAR-aligned BAM file, reverse strandedness, keep unique reads only, ignore reads with deletions <= 30:

RMlizer.sh -a ./RMLizer.awk -r ./genome.fa -o ./output_dir -l star -g annotations.gtf -s reverse -u -f -i 30 my.bam

Output description
Summary table Counts of reads passing filters and a summary of filter settings. * Total reads: all reads initially considered. * Reads passing indel filter: reads passing indel filter. * Reads with matches/mismatches - all reads reads with M operations in CIGARs (reads with no M operations may occur due to clipping of overlapping pairs in paired-end data). * Reads with mismatches: fraction that contain mismatched bases. * Reads with mismatches filtered - fraction of reads with mismatches that passed QC filters [ -q ]. * Reads with mismatches filtered (unique assignment): (mode normal only) fraction of reads with mismatches, passing QC filters, and unambiguously assigned to a feature in GTF file. Specifically, it excludes all entries marked by HTseq as any of the following: __ambiguous|__no_feature|_too_low_aQual|__not_aligned|__alignment_not_unique. The count includes multi-mappers if [ -u ] flag was not set. * Reads without mismatches: fraction of reads that contain mismatched bases. * Reads without mismatches filtered: fraction of reads without mismatches that passed QC filters [ -q ]. * Reads without mismatches filtered (unique assignment): (mode normal only) fraction of reads without mismatches, passing QC filters, and unambiguously assigned to a feature in GTF file. Specifically, it excludes all entries marked by HTseq as any of the following: __ambiguous|__no_feature|_too_low_aQual|__not_aligned|__alignment_not_unique. The count includes multi-mappers if [ -u ] flag was not set.

Rates main Histogram of coverage over tymines (T). Red, blue and black lines denote 100, 1000, and 10000 cutoffs respectively. Genes and ERCC spike-ins are automatically discriminated provided that Ensembl gene identifiers are used, and ERCCs are included in the GTF file.

Rates plot Mismatch rates calculated (for each gene independently) by dividing the number of observed mismatches by the total count of reference nucleotides of the corresponding type (e.g., T for T->C mismatch).
Rate equation

Rates coverage plot Same as the previous plot, but including coverage bins.

Overrepresentation plot Overrepresentation or T->C mismatches across the landscape of all possible (ACTG) mismatches (left facet) or only T mismatches (right facet). Useful to determine optimal read filter parameters to improve signal-to-noise ratio for SLAM-seq analysis.

Overrepresentation plot with coverage Same as the previous plot, but including coverage bins.

Overrepresentation plot with coverage Same data as the previous plot, but shown as histograms.

Overrepresentation plot with coverage Mismatch rates for the optional in-vitro transcribed control transcript (slam). This transcript serves as internal control of alkylation efficacy and can be used to normalize for potential variation between samples.
"Slam" is an optional in-vitro transcribed mRNA containing synthetic 4-thiouridines only (no "normal" uridines). It serves as a control for alkylation efficacy and may be used to normalize for variation between samples. "Slam" spike-in control is typically added to RNA after initial purification and prior to alkylation.

References
Herzog, V., Reichholf, B., Neumann, T. et al. Thiol-linked alkylation of RNA to assess expression dynamics. Nat Methods 14, 1198--1204 (2017). https://doi.org/10.1038/nmeth.4435

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published