An end-to-end RNA-Seq workflow on TCGA-COAD, showcasing differential expression, enrichment analysis, and cancer gene annotation
This project demonstrates how bioinformatics methods can be applied to explore the transcriptomic landscape of colon adenocarcinoma (COAD), a subtype of colorectal cancer, using The Cancer Genome Atlas TCGA-COAD dataset.
The workflow walks through a complete RNA-Seq analysis: acquiring raw count data from the Genomic Data Commons, performing QC and normalization, conducting differential expression analysis (DEA), annotating results with gene symbols and functional categories, and uncovering biological insights through GO, KEGG, and GSEA enrichment analyses.
A key highlight of this project is the integration of OncoKB, a curated knowledge base of cancer genes, which adds an extra layer of clinical relevance by identifying known oncogenes and tumor suppressors among the differentially expressed genes.
By combining statistical rigor (DESeq2) with biological interpretation (clusterProfiler, OncoKB), this project showcases a robust, modular, and reproducible transcriptomics pipeline. While focused on COAD, the workflow is adaptable to any TCGA cancer cohort, making it a versatile framework for cancer bioinformatics. Ultimately, this project not only demonstrates technical skills in RNA-Seq data analysis, enrichment interpretation, and visualization, but also highlights how computational workflows can contribute to a deeper understanding of cancer biology, with potential applications in biomarker discovery and therapeutic target prioritization.
- R Packages
  - TCGAbiolinks (data acquisition from GDC)
  - SummarizedExperiment (data container)
  - DESeq2 (differential expression analysis, normalization, MA plots, PCA)
  - clusterProfiler (GO, KEGG, GSEA enrichment)
  - org.Hs.eg.db (gene annotation)
  - ggplot2, ggrepel, pheatmap (visualisations)
- External Databases
- Utilities
  - Microsoft Excel (VLOOKUP for gene cross-referencing with OncoKB)
- Source: Genomic Data Commons (GDC) via TCGAbiolinks
- Cancer Type: TCGA-COAD (Colon Adenocarcinoma)
- Data category: Transcriptome Profiling
- Data type: Gene Expression Quantification
- Workflow type: STAR Counts
- Barcode: short barcodes
- Samples: Tumor vs Solid Tissue Normal paired RNA-seq counts
- Metadata: Sample IDs, Sample Type, Treatments, Primary Diagnosis and related clinical annotations
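The query parameters listed above map fairly directly onto TCGAbiolinks calls. The following is a minimal sketch, not the repository's script: the function names `fetch_coad_counts` and `paired_patients` are hypothetical, and the workflow string `"STAR - Counts"` is the spelling TCGAbiolinks expects for the "STAR Counts" workflow. The download step needs network access and the Bioconductor package TCGAbiolinks.

```r
# Hypothetical helper (not from the repository): query, download, and
# assemble TCGA-COAD STAR counts into a SummarizedExperiment.
fetch_coad_counts <- function(project = "TCGA-COAD") {
  query <- TCGAbiolinks::GDCquery(
    project       = project,
    data.category = "Transcriptome Profiling",
    data.type     = "Gene Expression Quantification",
    workflow.type = "STAR - Counts",
    sample.type   = c("Primary Tumor", "Solid Tissue Normal")
  )
  TCGAbiolinks::GDCdownload(query)
  # Counts end up in assay(), sample and clinical metadata in colData()
  TCGAbiolinks::GDCprepare(query)
}

# Hypothetical helper: patients that have both a primary tumor sample
# (sample type code "01") and a solid tissue normal sample (code "11").
# In a TCGA barcode, characters 1-12 identify the patient and
# characters 14-15 hold the sample type code.
paired_patients <- function(barcodes) {
  patient <- substr(barcodes, 1, 12)
  code    <- substr(barcodes, 14, 15)
  intersect(patient[code == "01"], patient[code == "11"])
}
```

Calling `paired_patients(colnames(se))` on the prepared object would list the patients contributing to a tumor vs normal paired comparison.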
- Clone the repository
    git clone https://github.com/Naila-Srivastava/Cancer-Transcriptomics-Analysis.git
    cd Cancer-Transcriptomics-Analysis
- Open the provided R script.
- Run the workflow step-by-step.
- Results (tables) will be saved in the /Results directory, and plots in the /Plots directory.
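For orientation, the differential expression step of such a workflow could look roughly like the sketch below. It is an illustration under stated assumptions rather than the repository's script: `run_deseq` and `classify_degs` are hypothetical names, `sample_type` is assumed to be the colData column holding "Primary Tumor" / "Solid Tissue Normal", and the cutoffs (`min_count = 10`, `padj < 0.05`, `|log2FC| >= 1`) are common defaults, not values confirmed by this README.

```r
# Hypothetical wrapper around the standard DESeq2 steps.
# `se` is a SummarizedExperiment of raw integer counts.
run_deseq <- function(se, min_count = 10) {
  # Recode sample type as a two-level factor with "normal" as reference
  se$condition <- factor(
    ifelse(se$sample_type == "Solid Tissue Normal", "normal", "tumor"),
    levels = c("normal", "tumor")
  )
  dds <- DESeq2::DESeqDataSet(se, design = ~ condition)
  # Drop genes with very low total counts before model fitting
  dds <- dds[rowSums(DESeq2::counts(dds)) >= min_count, ]
  dds <- DESeq2::DESeq(dds)
  res <- DESeq2::results(dds, contrast = c("condition", "tumor", "normal"))
  list(dds = dds, res = res)
}

# Hypothetical helper: label each gene as up, down, or not significant.
# Cutoffs are assumptions, not the repository's confirmed thresholds.
classify_degs <- function(res_df, padj_cutoff = 0.05, lfc_cutoff = 1) {
  sig <- !is.na(res_df$padj) & res_df$padj < padj_cutoff
  direction <- rep("not significant", nrow(res_df))
  direction[which(sig & res_df$log2FoldChange >=  lfc_cutoff)] <- "up"
  direction[which(sig & res_df$log2FoldChange <= -lfc_cutoff)] <- "down"
  res_df$direction <- direction
  res_df
}
```

Given `out <- run_deseq(se)`, `DESeq2::plotMA(out$res)` and `DESeq2::plotPCA(DESeq2::vst(out$dds), intgroup = "condition")` would draw MA and PCA plots, and `classify_degs(as.data.frame(out$res))` would give a table ready for export. For a paired analysis, a patient identifier could be added to the design (for example `~ patient + condition`); which design the repository uses is not stated here.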
- Automated DESeq2 differential expression pipeline.
- Integration of OncoKB annotations for cancer relevance.
- Publication-quality visualisations (volcano, PCA, enrichment plots).
- Exportable tables for downstream use.
- MA Plot (log2 fold change vs mean expression)
- PCA Plot (tumor vs normal clustering)
- Volcano Plot (significant DEGs, top 5 genes labeled)
- GO BP Enrichment (dotplot of enriched terms)
- GSEA GO BP Ridgeplot
- KEGG Pathway Analysis (dotplot)
- OncoKB Gene Type Distribution (oncogenes vs tumor suppressors, up/downregulated)
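The enrichment plots in this list could be produced along the following lines with clusterProfiler. This is a hedged sketch: `make_ranked_list` and `run_enrichment` are hypothetical names, gene IDs are assumed to be versioned Ensembl IDs as delivered by GDC STAR counts, and `enrichKEGG` needs network access to query KEGG.

```r
# Hypothetical helper: build the ranked vector that GSEA expects,
# named by gene ID and sorted by decreasing log2 fold change.
make_ranked_list <- function(gene_ids, log2fc) {
  ids  <- sub("\\.[0-9]+$", "", gene_ids)      # drop Ensembl version suffix
  keep <- !is.na(log2fc) & !duplicated(ids)    # keep first entry per gene
  ranked <- stats::setNames(log2fc[keep], ids[keep])
  sort(ranked, decreasing = TRUE)
}

# Hypothetical wrapper: GO BP over-representation, KEGG, and GSEA (GO BP).
# `deg_ids` are unversioned Ensembl IDs of the significant genes.
run_enrichment <- function(deg_ids, ranked) {
  org <- org.Hs.eg.db::org.Hs.eg.db
  go <- clusterProfiler::enrichGO(
    gene = deg_ids, OrgDb = org, keyType = "ENSEMBL", ont = "BP"
  )
  # KEGG works on Entrez IDs, so map Ensembl -> Entrez first
  entrez <- clusterProfiler::bitr(
    deg_ids, fromType = "ENSEMBL", toType = "ENTREZID", OrgDb = org
  )$ENTREZID
  kegg <- clusterProfiler::enrichKEGG(gene = entrez, organism = "hsa")
  gsea <- clusterProfiler::gseGO(
    geneList = ranked, OrgDb = org, keyType = "ENSEMBL", ont = "BP"
  )
  list(go = go, kegg = kegg, gsea = gsea)
}
```

With `enr <- run_enrichment(deg_ids, ranked)`, `enrichplot::dotplot(enr$go)` and `enrichplot::dotplot(enr$kegg)` would draw the dotplots, and `enrichplot::ridgeplot(enr$gsea)` the GSEA ridgeplot.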
- Identified 8930 differentially expressed genes (DEGs) in tumor vs normal samples.
- Top upregulated DEGs included KRT23, SIM2, PCAT2, SPTBN2, and CBX8; top downregulated DEGs included TMEM132D-AS1, MAGEA3, ELF5, LINC02418, and LY6G6D.
- Enrichment analysis revealed pathways related to neuroactive ligand-receptor interaction, cytokine-cytokine receptor interaction, and the calcium signalling pathway.
- OncoKB annotation identified 124 known oncogenes, 76 tumor suppressor genes (TSGs), and 25 genes classified as both oncogene and TSG among the DEGs.
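The OncoKB cross-reference is done with Excel's VLOOKUP in this project. The same lookup could be scripted in R as a left join, sketched below under assumptions: the function name `annotate_with_oncokb` and the column names `symbol`, `is_oncogene`, and `is_tsg` are hypothetical, and the OncoKB cancer gene list would need to be renamed to match them after download.

```r
# Hypothetical R equivalent of the VLOOKUP step: left-join DEGs to an
# OncoKB gene table and assign each gene a type.
# Assumed columns: degs$symbol, oncokb$symbol, oncokb$is_oncogene (logical),
# oncokb$is_tsg (logical).
annotate_with_oncokb <- function(degs, oncokb) {
  merged <- merge(degs, oncokb, by = "symbol", all.x = TRUE)
  merged$gene_type <- with(merged, ifelse(
    is.na(is_oncogene), "not in OncoKB",
    ifelse(is_oncogene & is_tsg, "oncogene and TSG",
      ifelse(is_oncogene, "oncogene",
        ifelse(is_tsg, "TSG", "other cancer gene")))
  ))
  merged
}
```

If the DEG table also carries a `direction` column (up or down), `table(ann$gene_type, ann$direction)` on the annotated result `ann` would give the counts behind a gene type distribution plot.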
- End-to-End Workflow: Demonstrates the complete RNA-Seq pipeline, from raw data to biological interpretation on a real-world cancer dataset.
- Cancer Relevance: Cross-references DEGs with OncoKB, linking computational results to clinically significant oncogenes and tumor suppressors.
- Publication-Quality Visuals: Generates professional plots (MA, PCA, Volcano, Enrichment) that clearly communicate findings.
- Functional Insights: Highlights GO biological processes and KEGG pathways enriched in colon adenocarcinoma, adding biological depth.
- Adaptability and Reproducibility: The modular pipeline can be reused for other TCGA datasets or transcriptomics studies beyond COAD.
- Extend analysis to pan-cancer comparisons.
- Add survival analysis linking expression to patient outcomes.
- Implement Python Scanpy version for cross-validation.
- Explore single-cell RNA-seq (scRNA-seq) workflows.
- Packages:
  - Love MI et al., DESeq2: Differential analysis of count data (Genome Biology, 2014)
  - Yu G et al., clusterProfiler: Enrichment analysis (OMICS, 2012)
  - Carlson M, org.Hs.eg.db: Genome wide annotation for Human (Bioconductor annotation package)
  - Colaprico A et al., TCGAbiolinks: Integrative analysis of TCGA data (NAR, 2016)
- Databases:
  - TCGA: https://www.cancer.gov/tcga
  - GDC Data Portal: https://portal.gdc.cancer.gov/
  - OncoKB: https://www.oncokb.org/
Cancer-Transcriptomics-Analysis/
│── Results/ # Tables & files
│── Plots/ # Plots
│── RNA_Seq_DE # R script
│── README.md # Documentation