RNA-Seq differential expression analysis of TCGA-COAD (Colorectal Adenocarcinoma) using DESeq2, clusterProfiler, and OncoKB. Includes data preprocessing, normalization, DEG identification, functional enrichment and cancer gene annotation with visualisations.

Naila-Srivastava/Cancer_Transcriptomics_Analysis

Cancer Transcriptomics Analysis

An end-to-end RNA-Seq workflow on TCGA-COAD, showcasing differential expression, enrichment analysis, and cancer gene annotation


Project Overview

This project demonstrates how bioinformatics methods can be applied to explore the transcriptomic landscape of colorectal adenocarcinoma (COAD), a subtype of colorectal cancer, using the TCGA-COAD dataset from The Cancer Genome Atlas (TCGA).

The workflow covers a complete RNA-Seq analysis: acquiring raw count data from the Genomic Data Commons, performing QC and normalization, conducting differential expression analysis (DEA), annotating results with gene symbols and functional categories, and interpreting the results through GO, KEGG, and GSEA enrichment analyses.

A key highlight of this project is the integration of OncoKB, a curated knowledge base of cancer genes, which adds an extra layer of clinical relevance by identifying known oncogenes and tumor suppressors among the differentially expressed genes.

By combining statistical testing (DESeq2) with biological interpretation (clusterProfiler, OncoKB), this project provides a modular and reproducible transcriptomics pipeline. Although it focuses on COAD, the workflow can be adapted to other TCGA cancer cohorts. The project demonstrates RNA-Seq data analysis, enrichment interpretation, and visualization, and shows how computational workflows can contribute to a deeper understanding of cancer biology, with potential applications in biomarker discovery and therapeutic target prioritization.


Tools & Dependencies

  • R Packages

    • TCGAbiolinks (data acquisition from GDC)
    • SummarizedExperiment (data container)
    • DESeq2 (differential expression analysis, normalization, MA plots, PCA)
    • clusterProfiler (GO, KEGG, GSEA enrichment)
    • org.Hs.eg.db (gene annotation)
    • ggplot2, ggrepel, pheatmap (visualisations)
  • External Databases

    • TCGA-COAD (colorectal adenocarcinoma RNA-seq)
    • OncoKB (curated cancer gene knowledgebase)
  • Utilities

    • Microsoft Excel (VLOOKUP for gene cross-referencing with OncoKB)
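
The Excel VLOOKUP step could also be scripted in R, which would keep the whole workflow in one place. The following is a minimal sketch, not the method used in this repository: the gene names are placeholders, and the column names (`symbol`, `is_oncogene`, `is_tsg`) are hypothetical and would need to be matched to the real DEG table and the OncoKB cancer gene list export.

```r
# Toy DEG table; real input would be the exported DESeq2 results.
degs <- data.frame(
  symbol = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  log2FoldChange = c(2.1, -1.8, 3.0, -2.5),
  stringsAsFactors = FALSE
)

# Toy stand-in for an OncoKB cancer gene list (column names are hypothetical).
oncokb <- data.frame(
  symbol = c("GENE_A", "GENE_B", "GENE_D", "GENE_E"),
  is_oncogene = c(TRUE, FALSE, TRUE, TRUE),
  is_tsg = c(FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

annotate_with_oncokb <- function(degs, oncokb) {
  # Inner join: keeps only DEGs that appear in the OncoKB list,
  # which is what a successful VLOOKUP match would return.
  hits <- merge(degs, oncokb, by = "symbol")
  hits$gene_type <- ifelse(
    hits$is_oncogene & hits$is_tsg, "Oncogene and TSG",
    ifelse(hits$is_oncogene, "Oncogene",
           ifelse(hits$is_tsg, "TSG", "Other"))
  )
  hits$direction <- ifelse(hits$log2FoldChange > 0, "Up", "Down")
  hits
}

annotated <- annotate_with_oncokb(degs, oncokb)

# Cross-tabulate gene type against direction of regulation.
gene_type_counts <- table(annotated$gene_type, annotated$direction)
print(gene_type_counts)
```

A cross-tabulation like `gene_type_counts` is the kind of summary that could feed an OncoKB gene type distribution plot.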

Dataset

  • Source: Genomic Data Commons (GDC) via TCGAbiolinks
  • Cancer Type: TCGA-COAD (Colorectal Adenocarcinoma)
  • Data category: Transcriptome Profiling
  • Data type: Gene Expression Quantification
  • Workflow type: STAR Counts
  • Barcode: short barcodes
  • Samples: Tumor vs Solid Tissue Normal paired RNA-seq counts
  • Metadata: Sample IDs, Sample Type, Treatments, Primary Diagnosis and related clinical annotations
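
The dataset parameters above map fairly directly onto a TCGAbiolinks query. The sketch below is illustrative rather than a copy of the repository script: the function name `fetch_coad_counts` and the `download_dir` argument are hypothetical, the restriction to two sample types is an assumption based on the tumor vs normal comparison, and the workflow type is written as `"STAR - Counts"`, which is how the GDC spells it.

```r
# Sketch only: needs the TCGAbiolinks package and network access to the GDC.
# Nothing is downloaded until the function is called.
fetch_coad_counts <- function(download_dir = "GDCdata") {
  query <- TCGAbiolinks::GDCquery(
    project = "TCGA-COAD",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts",
    sample.type = c("Primary Tumor", "Solid Tissue Normal")  # assumption
  )
  # Download the matching files into `download_dir`.
  TCGAbiolinks::GDCdownload(query, directory = download_dir)
  # Assemble a SummarizedExperiment holding counts and clinical metadata.
  TCGAbiolinks::GDCprepare(query, directory = download_dir)
}
```

Calling `se <- fetch_coad_counts()` would return a SummarizedExperiment; for STAR count data the raw counts are typically stored in the `unstranded` assay, but it is worth checking `SummarizedExperiment::assayNames(se)` before relying on that.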

How to Run

  1. Clone the repository

    git clone https://github.com/Naila-Srivastava/Cancer-Transcriptomics-Analysis.git
    cd Cancer-Transcriptomics-Analysis
  2. Open the provided R script (RNA_Seq_DE).

  3. Run the workflow step-by-step.

  4. Tables are saved to the Results/ directory and plots to the Plots/ directory.
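
Before running the script, the R packages listed under Tools & Dependencies need to be installed. One possible way to do that is sketched below; the helper names `missing_pkgs` and `install_dependencies` are hypothetical and not part of the repository.

```r
cran_pkgs <- c("ggplot2", "ggrepel", "pheatmap")
bioc_pkgs <- c("TCGAbiolinks", "SummarizedExperiment", "DESeq2",
               "clusterProfiler", "org.Hs.eg.db")

# Return the entries of `pkgs` that are not installed yet.
missing_pkgs <- function(pkgs) {
  installed <- rownames(utils::installed.packages())
  setdiff(pkgs, installed)
}

# Install whatever is missing. Nothing happens until this is called.
install_dependencies <- function() {
  to_install <- missing_pkgs(c(cran_pkgs, bioc_pkgs))
  if (length(to_install) == 0) {
    return(invisible(character(0)))
  }
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    utils::install.packages("BiocManager")
  }
  # BiocManager::install() resolves both CRAN and Bioconductor packages.
  BiocManager::install(to_install)
  invisible(to_install)
}
```

Running `install_dependencies()` once in a fresh R session should be enough; later calls do nothing when every package is already present.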


Methodology

[Figure: workflow diagram of the analysis pipeline]

Features

  • Automated DESeq2 differential expression pipeline.
  • Integration of OncoKB annotations for cancer relevance.
  • Publication-quality visualisations (volcano, PCA, enrichment plots).
  • Exportable tables for downstream use.
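
To illustrate what a DESeq2 tumor vs normal comparison usually involves, here is a minimal sketch. It is not the repository script: the function name `run_deseq2`, the condition labels `"normal"` and `"tumor"`, and the pre-filtering threshold of 10 reads are all assumptions chosen for the example.

```r
# Sketch only: needs the DESeq2 package.
# `counts` is a gene x sample matrix of raw integer counts;
# `condition` is a character vector with one "normal"/"tumor" label per sample.
run_deseq2 <- function(counts, condition, min_count = 10) {
  stopifnot(ncol(counts) == length(condition))
  col_data <- data.frame(
    condition = factor(condition, levels = c("normal", "tumor")),
    row.names = colnames(counts)
  )
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts,
    colData = col_data,
    design = ~ condition
  )
  # Pre-filter: keep genes with at least `min_count` reads in at least
  # as many samples as the smaller group contains.
  smallest_group <- min(table(col_data$condition))
  keep <- rowSums(DESeq2::counts(dds) >= min_count) >= smallest_group
  dds <- dds[keep, ]
  # Size-factor normalization, dispersion estimation, and Wald tests.
  dds <- DESeq2::DESeq(dds)
  res <- DESeq2::results(dds, contrast = c("condition", "tumor", "normal"))
  list(dds = dds, res = as.data.frame(res))
}
```

With the contrast written this way, a positive log2 fold change means higher expression in tumor samples. The returned `dds` object could then be passed to `DESeq2::vst()` and `DESeq2::plotPCA()` for the PCA plot.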

Visualisations

  • MA Plot (log2 fold change vs mean expression)
  • PCA Plot (tumor vs normal clustering)
  • Volcano Plot (significant DEGs, top 5 genes labeled)
  • GO BP Enrichment (dotplot of enriched terms)
  • GSEA GO BP Ridgeplot
  • KEGG Pathway Analysis (dotplot)
  • OncoKB Gene Type Distribution (oncogenes vs tumor suppressors, up/downregulated)
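
As an illustration of how the volcano plot could be built, the sketch below first labels each gene as up, down, or not significant and then plots the result with the top genes labeled. The cutoffs (adjusted p-value below 0.05, absolute log2 fold change of at least 1), the `gene` column name, and both function names are assumptions; the README does not state the thresholds used.

```r
# Label each gene by direction and significance (base R only).
classify_degs <- function(res, padj_cutoff = 0.05, lfc_cutoff = 1) {
  sig <- !is.na(res$padj) & !is.na(res$log2FoldChange) & res$padj < padj_cutoff
  res$status <- "Not significant"
  res$status[sig & res$log2FoldChange >= lfc_cutoff] <- "Up"
  res$status[sig & res$log2FoldChange <= -lfc_cutoff] <- "Down"
  res
}

# Sketch only: needs ggplot2 and ggrepel. Expects output of classify_degs().
volcano_plot <- function(res, n_label = 5) {
  res <- res[!is.na(res$padj), ]
  # Label the most significant genes among those passing the thresholds.
  top <- res[res$status != "Not significant", ]
  top <- utils::head(top[order(top$padj), ], n_label)
  ggplot2::ggplot(
    res,
    ggplot2::aes(x = log2FoldChange, y = -log10(padj), colour = status)
  ) +
    ggplot2::geom_point(alpha = 0.6) +
    ggrepel::geom_text_repel(
      data = top, ggplot2::aes(label = gene), show.legend = FALSE
    ) +
    ggplot2::labs(x = "log2 fold change", y = "-log10 adjusted p-value")
}

# Toy input with placeholder gene names.
toy <- data.frame(
  gene = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  log2FoldChange = c(2.5, -3.0, 0.2, 4.0),
  padj = c(0.001, 0.01, 0.5, NA)
)
toy <- classify_degs(toy)
print(toy$status)
```

Genes with a missing adjusted p-value, which DESeq2 produces for genes removed by independent filtering, are treated as not significant here and dropped from the plot.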

Results

  • Identified 8930 differentially expressed genes (DEGs) in tumor vs normal samples.

  • Top DEGs included KRT23, SIM2, PCAT2, SPTBN2, CBX8 (upregulated DEGs) and TMEM132D-AS1, MAGEA3, ELF5, LINC02418, LY6G6D (downregulated DEGs).

  • Enrichment analysis revealed pathways related to Neuroactive ligand-receptor interaction, Cytokine-cytokine receptor interaction and Calcium signalling pathway.

    [Figure: KEGG pathway enrichment dotplot]
  • OncoKB annotation identified 124 known oncogenes, 76 tumor suppressor genes (TSGs), and 25 genes classified as both oncogene and TSG among the DEGs.
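
The GO, KEGG, and GSEA results above come from clusterProfiler. A rough sketch of how such an enrichment step might look is given below. It assumes a results table with `symbol`, `log2FoldChange`, and `padj` columns, ranks genes by log2 fold change for GSEA, and uses the same hypothetical cutoffs for the over-representation tests; none of these choices are confirmed by the repository, and `make_ranked_list` and `run_enrichment` are illustrative names.

```r
# Build the named, decreasingly sorted vector that GSEA functions expect.
make_ranked_list <- function(ids, scores) {
  keep <- !is.na(ids) & !is.na(scores) & !duplicated(ids)
  sort(stats::setNames(scores[keep], ids[keep]), decreasing = TRUE)
}

# Sketch only: needs clusterProfiler and org.Hs.eg.db; enrichKEGG() also
# needs network access to query KEGG.
run_enrichment <- function(res, padj_cutoff = 0.05, lfc_cutoff = 1) {
  org_db <- org.Hs.eg.db::org.Hs.eg.db
  # Map gene symbols to Entrez IDs; unmapped symbols are dropped.
  ids <- clusterProfiler::bitr(res$symbol, fromType = "SYMBOL",
                               toType = "ENTREZID", OrgDb = org_db)
  res <- merge(res, ids, by.x = "symbol", by.y = "SYMBOL")
  res <- res[!is.na(res$padj) & !is.na(res$log2FoldChange), ]

  is_deg <- res$padj < padj_cutoff & abs(res$log2FoldChange) >= lfc_cutoff
  deg_ids <- unique(res$ENTREZID[is_deg])

  # Over-representation tests on the DEG set.
  go_bp <- clusterProfiler::enrichGO(deg_ids, OrgDb = org_db,
                                     keyType = "ENTREZID", ont = "BP",
                                     readable = TRUE)
  kegg <- clusterProfiler::enrichKEGG(deg_ids, organism = "hsa")

  # GSEA on all tested genes, ranked by log2 fold change.
  ranked <- make_ranked_list(res$ENTREZID, res$log2FoldChange)
  gsea_bp <- clusterProfiler::gseGO(ranked, OrgDb = org_db,
                                    ont = "BP", keyType = "ENTREZID")

  list(go_bp = go_bp, kegg = kegg, gsea_bp = gsea_bp)
}
```

The returned objects could be visualised with `enrichplot::dotplot()` for the GO and KEGG results and `enrichplot::ridgeplot()` for the GSEA result.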


Key Takeaways

  • End-to-End Workflow: Demonstrates the complete RNA-Seq pipeline, from raw data to biological interpretation on a real-world cancer dataset.
  • Cancer Relevance: Cross-references DEGs with OncoKB, linking computational results to clinically significant oncogenes and tumor suppressors.
  • Publication-Quality Visuals: Generates professional plots (MA, PCA, Volcano, Enrichment) that clearly communicate findings.
  • Functional Insights: Highlights GO biological processes and KEGG pathways enriched in colorectal adenocarcinoma, adding biological depth.
  • Adaptability and Reproducibility: The modular pipeline can be reused for other TCGA datasets or transcriptomics studies beyond COAD.

What's Next

  • Extend analysis to pan-cancer comparisons.
  • Add survival analysis linking expression to patient outcomes.
  • Implement Python Scanpy version for cross-validation.
  • Explore single-cell RNA-seq (scRNA-seq) workflows.

References

  • Packages:

    • Love MI et al., DESeq2: Differential analysis of count data (Genome Biology, 2014)
    • Yu G et al., clusterProfiler: Enrichment analysis (OMICS, 2012)
    • Carlson M, org.Hs.eg.db: Genome wide annotation for Human (Bioconductor annotation package)
    • Colaprico A et al., TCGAbiolinks: Integrative analysis of TCGA data (NAR, 2016)
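  • Databases:

    • The Cancer Genome Atlas (TCGA-COAD), accessed via the Genomic Data Commons (GDC)
    • OncoKB, a curated cancer gene knowledge base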

Project Structure

Cancer-Transcriptomics-Analysis/
│── Results/            # Tables & files
│── Plots/              # Plots
│── RNA_Seq_DE          # R script
│── README.md           # Documentation
