Releases: bbglab/oncodrive3d

Release v1.0.7

04 Nov 12:01
b873adc

This minor release mainly fixes a bug that appeared with the new release of the AlphaFold Protein Database:

  • Fix build-datasets AlphaFold download after symlink change in #67

Another minor change:

  • Increase protein length threshold for unfragmented samplesheet preprocessing #64

Full Changelog: v1.0.6...v1.0.7

Release v1.0.6

29 Jul 21:50

MANE-Only Dataset Preprocessing and Custom Structure Integration

This release introduces a major enhancement to the dataset building pipeline in Oncodrive3D, enabling MANE-only processing, custom structural predictions, and complete decoupling from UniProt for structure annotation workflows of MANE Select transcript-associated structures.

Key Features & Improvements

Refactored Dataset Builder (build-datasets)

  • Direct MANE Integration: Switches the default data source for the MANE Select transcript-associated structures from UniProt API to the MANE Select protein set obtained directly from NCBI.

  • Custom Structures Support:

    • This feature allows you to integrate in-house AlphaFold2-predicted structures, which can be generated via the nf-core/proteinfold pipeline, into the Oncodrive3D build using two new arguments:

      • --custom_mane_pdb_dir: directory containing custom PDB files.

      • --custom_mane_metadata: path to a samplesheet.csv including the structures metadata.

    • The objective is to maximize structural coverage of the MANE Select transcriptome, compensating for proteins missing from the AlphaFold Database MANE release, which is still based on version 1.0 and lacks hundreds of proteins.

    • The samplesheet.csv (custom_mane_metadata) must include at minimum:

      • sequence: the Ensembl protein ID used as the structure identifier.

      • refseq: the amino acid sequence (in one-letter code).
        This will be used to inject the sequence into the PDB file if the structure is missing this information, which is common for predicted structures generated via nf-core/proteinfold.
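
Before pointing --custom_mane_metadata at a samplesheet, it can help to confirm that the two required columns are present and filled in. The snippet below is a minimal sketch of such a check, not part of Oncodrive3D: the function name check_samplesheet and the example row (ID and sequence) are hypothetical, and only the column names sequence and refseq come from the notes above.

```python
import csv
import io

# Column names as listed in the release notes.
REQUIRED_COLUMNS = ("sequence", "refseq")


def check_samplesheet(handle):
    """Return the samplesheet rows, raising ValueError if a required field is missing."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing columns: {missing}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                raise ValueError(f"line {line_no}: empty value in '{column}'")
        rows.append(row)
    return rows


# Hypothetical one-row samplesheet; the ID and sequence are made up.
example = io.StringIO("sequence,refseq\nENSP00000000001,MKTAYIAKQR\n")
rows = check_samplesheet(example)
print(rows[0]["sequence"], len(rows[0]["refseq"]))
```

Any extra columns in the samplesheet are passed through untouched, since the notes only state a minimum set.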

New Utility: prepare_samplesheet.py

A standalone preprocessing script that:

  • Downloads and parses the full MANE.GRCh38.v1.4.ensembl_protein.faa.gz release from NCBI.

  • Cross-references these proteins with the AlphaFold MANE mapping file (mane_refseq_prot_to_alphafold.csv) to identify missing structures.

  • Generates:

    • A samplesheet.csv listing all MANE Select proteins missing from the AlphaFold Database release, including necessary metadata for structure prediction and downstream integration.

    • Individual FASTA files per Ensembl protein ID, ready for direct input to the nf-core/proteinfold pipeline.

These outputs allow users to predict and recover missing structures, enabling full coverage of the MANE proteome, independent of AlphaFold’s release schedule or UniProt mappings.
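
The cross-referencing step can be pictured as a set difference between the proteins in the MANE FASTA and the proteins that already have a structure. The sketch below illustrates that idea only; it is not the code of prepare_samplesheet.py. The helper names parse_fasta and missing_structures are hypothetical, and it assumes the identifiers in the FASTA headers and in the mapping are directly comparable, which may require extra ID conversion in practice.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}.

    The identifier is taken as the first whitespace-delimited token of each header.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else None
            if current is not None:
                records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def missing_structures(mane_proteins, ids_with_structure):
    """Return (identifier, sequence) pairs lacking a structure, sorted by identifier."""
    return sorted(
        (pid, seq)
        for pid, seq in mane_proteins.items()
        if pid not in ids_with_structure
    )


# Made-up identifiers and sequences, for illustration only.
fasta_text = ">ENSP_A some description\nMKT\nAYI\n>ENSP_B\nMQR\n"
proteins = parse_fasta(fasta_text)
missing = missing_structures(proteins, {"ENSP_A"})
print(missing)
```

Each returned pair carries what a samplesheet row and a per-protein FASTA file would need: an identifier and its amino acid sequence.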


Full Changelog: v1.0.5...v1.0.6

Publication (v1.0.5)

23 Jul 12:49

Oncodrive3D is a fast and accurate 3D-clustering algorithm for driver gene discovery.

This release corresponds to the version used for the analyses in the publication in Nucleic Acids Research: v1.0.5

Release v1.0.5

27 Jan 11:38

Oncodrive3D is a fast and accurate 3D-clustering algorithm for driver gene discovery.

Key Updates and Features

This release fixes bugs in the build annotations and plotting modules, enhances the association plots, updates the documentation, and includes general code cleanup.

Bug Fixes

  • Fixed a bug in the plotting module.
  • Fixed a bug in the build annotations module.

Association plot features

  • Added FDR to the logistic regression analysis of associations between clusters and annotations.
  • Added association plots to the Nextflow pipeline.
  • Removed comparative plots.
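
These notes do not say which FDR procedure is applied to the logistic regression p-values. As an illustration only, the sketch below implements the widely used Benjamini-Hochberg adjustment in plain Python; treating it as the method used by Oncodrive3D is an assumption, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, one per tested annotation.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(q_values)
```

The adjusted values are capped at 1 and never decrease as the raw p-values increase, which is the property the backward pass enforces.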

Others

  • Documentation update
  • Linting

Release v1.0.4

17 Jan 16:00

Second release of Oncodrive3D, a fast and accurate 3D-clustering algorithm for driver gene discovery. It identifies mutation-enriched volumes by analyzing missense somatic mutations, leveraging AlphaFold's structural predictions to define residue contacts and mutation profiles to simulate neutral mutagenesis. The tool uses rank-based statistics and can process mutations from duplex sequencing studies, enabling the analysis of both cancer and normal tissue datasets across potentially any organism.

Key Updates and Features

This release mainly updates the README with important information and fixes a bug in the oncodrive3d build-datasets step.

Documentation Updates

  • Generally improved documentation for clarity and usability.
  • Added steps to fulfill software requirements, addressing installation failures on older machines lacking updated C libraries.
  • Provided detailed information on input and output data formats, including:
    • How to obtain the required input files.
    • In-depth descriptions of the main outputs, including gene-level and residue-level clustering results.

Bug Fixes and Refactoring

  • Fixed bug in scripts/datasets/build_datasets.py and scripts/datasets/seq_for_mut_prob.py:
    • Disabled downloading and integrating MANE structures if --mane flag is not enabled.
    • Removed usage of files related to the MANE downloads when computing the seq_for_mut_prob.py for a non-MANE Human proteome.
  • Updated scripts/datasets/utils.py to increase the timeout for sock_read in PyPdl, preventing errors during the download of AlphaFold structures.
  • Refactored scripts/main.py by moving the import of specific modules into their corresponding functions for better modularity and efficiency.

Release v1.0.3

17 Jan 15:27

First release of Oncodrive3D, a fast and accurate 3D-clustering algorithm for driver gene discovery. It identifies mutation-enriched volumes by analyzing missense somatic mutations, leveraging AlphaFold's structural predictions to define residue contacts and mutation profiles to simulate neutral mutagenesis. The tool uses rank-based statistics and can process mutations from duplex sequencing studies, enabling the analysis of both cancer and normal tissue datasets across potentially any organism.

Key Updates and Features

Packaging and Linting

  • Added Python package build using uv.
  • Published the package to PyPI, enabling installation via pip install oncodrive3d.
  • Updated the Dockerfile.
  • Applied code linting to improve code quality and maintainability.
  • Added LICENCE

Nextflow Pipeline Updates

  • Restructured the pipeline according to best practices for enhanced performance and maintainability, and moved it to oncodrive3d_pipeline/.

Documentation Updates

  • Updated the README file:
    • Added instructions for installation.
    • Added instructions for running the provided Nextflow pipeline.

Bug Fixes, Refactoring, and Others

  • Removed preprocessing scripts in build/preprocessing.
  • Updated URLs in scripts/datasets/seq_for_mut_prob.py and scripts/plotting/pfam.py to use the January 2024 Ensembl archive.
  • Changed output column from Cluster to Clump in the residue-level output (<cohort>.3d_clustering_pos.csv).
  • Changed oncodrive3d run input argument from input_maf_path to input_path in scripts/main.py.
  • Refactored scripts/datasets/utils.py to improve download functionality and logging.

Pre-release v1.0.2-rc

20 Nov 01:38

This is the second pre-release of Oncodrive3D, a fast and accurate 3D-clustering algorithm for driver gene discovery. It identifies mutation-enriched volumes by analyzing missense somatic mutations, leveraging AlphaFold's structural predictions to define residue contacts and mutation profiles to simulate neutral mutagenesis. The tool uses rank-based statistics and can process mutations from duplex sequencing studies, enabling the analysis of both cancer and normal tissue datasets across potentially any organism.

Key Updates and Features

New Modules for Annotation and Plotting:

  • Introduced a comprehensive plotting module, including summary plots, gene plots, comparative plots, association plots, and ChimeraX plots.

Nextflow Pipeline:

  • Added a minimal Nextflow pipeline to perform 3D clustering analysis across multiple cohorts and generate all relevant plots.

MANE Transcripts Support:

  • Built datasets prioritizing MANE AF-predicted structures.
  • Tracked transcript IDs from input data, including mismatch, match, or missing status compared to Oncodrive3D datasets.

Mutation Filtering:

  • Filtered mutations with wild-type (WT) structure-AA mismatches and genes exceeding a threshold ratio of mapping issues.
  • Added an option to disable WT AA mismatch filtering, particularly useful for mouse data where VEP and UniProt isoform inconsistencies occur.
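
The two filters described above can be sketched as follows: a mutation is flagged when its reported wild-type amino acid disagrees with the residue at that position in the structure sequence, and a gene is dropped entirely when too large a fraction of its mutations is flagged. This is an illustrative sketch, not the Oncodrive3D implementation: the function name, the dictionary keys (gene, pos, wt_aa), and the 0.5 threshold are all hypothetical.

```python
from collections import defaultdict


def filter_wt_mismatches(mutations, sequences, max_mismatch_ratio=0.5):
    """Drop mutations whose WT residue mismatches the structure sequence.

    Genes whose mismatch ratio exceeds `max_mismatch_ratio` are dropped entirely.
    Returns (kept_mutations, dropped_genes).
    """
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    flagged = []
    for mut in mutations:
        seq = sequences.get(mut["gene"], "")
        pos = mut["pos"]  # 1-based protein position
        matches = 1 <= pos <= len(seq) and seq[pos - 1] == mut["wt_aa"]
        totals[mut["gene"]] += 1
        if not matches:
            mismatches[mut["gene"]] += 1
        flagged.append((mut, matches))
    dropped_genes = sorted(
        gene for gene in totals
        if mismatches[gene] / totals[gene] > max_mismatch_ratio
    )
    kept = [m for m, ok in flagged if ok and m["gene"] not in dropped_genes]
    return kept, dropped_genes


# Made-up sequences and mutations.
sequences = {"GENE1": "MKTA", "GENE2": "MQ"}
mutations = [
    {"gene": "GENE1", "pos": 1, "wt_aa": "M"},
    {"gene": "GENE1", "pos": 2, "wt_aa": "K"},
    {"gene": "GENE1", "pos": 3, "wt_aa": "X"},  # mismatch: the structure has T
    {"gene": "GENE2", "pos": 1, "wt_aa": "A"},  # mismatch
    {"gene": "GENE2", "pos": 2, "wt_aa": "R"},  # mismatch
]
kept, dropped_genes = filter_wt_mismatches(mutations, sequences)
print(len(kept), dropped_genes)
```

Disabling the mismatch filter, as the option above allows, would correspond to skipping this step altogether.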

Direct VEP Output Support:

  • Enabled direct VEP output processing, allowing filtering of transcripts based on Oncodrive3D-built datasets.

Enhanced Outputs:

  • Included processed input mutations (<cohort>.mutations.processed.tsv), missense mutation probabilities (<cohort>.miss_prob.processed.tsv), and Oncodrive3D sequence dataframes (<cohort>.seq_df.processed.tsv).

Mouse Data Support:

  • Fully enabled and tested processing of mouse data (mm39) across all steps, including dataset building, annotations, and plotting.

Bug Fixes and Improvements:

  • Resolved bug affecting the identification of the most significant volume per gene.
  • Changed sorting of position-level results from rank-based (Gene, Rank) to significance-based (Gene, p-value, Score).
  • Refactored main.py, offloading unnecessary code to module-specific scripts for better organization.
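
The change in sort order for position-level results can be illustrated with a small sketch. The field names (Gene, pval, Score) and the sort directions are assumptions made for this example: p-values ascending and scores descending within a gene. The notes only state the sort keys, not their direction or the exact column names.

```python
def sort_position_results(rows):
    """Sort rows by gene, then ascending p-value, then descending score (assumed)."""
    return sorted(rows, key=lambda r: (r["Gene"], r["pval"], -r["Score"]))


# Made-up position-level results.
rows = [
    {"Gene": "TP53", "pval": 0.01, "Score": 3.0},
    {"Gene": "KRAS", "pval": 0.2, "Score": 1.0},
    {"Gene": "TP53", "pval": 0.001, "Score": 2.0},
    {"Gene": "TP53", "pval": 0.001, "Score": 5.0},
]
ordered = sort_position_results(rows)
print([(r["Gene"], r["pval"], r["Score"]) for r in ordered])
```

With this ordering, ties in p-value, which are common with empirical p-values, are broken by the score.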

Example usage

To run the examples provided, the <input_path> directory should be organized as follows:

<input_path>/
├── vep/
│   ├── <cohort_1>.vep.tsv.gz
│   └── <cohort_2>.vep.tsv.gz
└── mut_profile/
    ├── <cohort_1>.sig.json
    └── <cohort_2>.sig.json

vep/: Contains the VEP output files for each cohort, compressed as .tsv.gz.
mut_profile/: Contains the Bgsignature output files (mutation profile in trinucleotide context) for each cohort, saved as .sig.json.
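
Before launching the pipeline, the layout above can be checked with a few lines of Python. The sketch below is a hypothetical helper, not part of Oncodrive3D: it pairs cohorts by file name using the two suffixes shown above and reports cohorts that appear in only one of the two folders. Whether an unpaired cohort is an error depends on how the pipeline is run, so the function only reports.

```python
import tempfile
from pathlib import Path

VEP_SUFFIX = ".vep.tsv.gz"
PROFILE_SUFFIX = ".sig.json"


def cohorts_in(directory, suffix):
    """Return the cohort names of the files in `directory` ending with `suffix`."""
    return {p.name[: -len(suffix)] for p in Path(directory).glob(f"*{suffix}")}


def check_input_layout(input_path):
    """Return (paired, vep_only, profile_only) sorted lists of cohort names."""
    root = Path(input_path)
    vep = cohorts_in(root / "vep", VEP_SUFFIX)
    profiles = cohorts_in(root / "mut_profile", PROFILE_SUFFIX)
    return sorted(vep & profiles), sorted(vep - profiles), sorted(profiles - vep)


# Build a throwaway directory with made-up cohort names to show the result.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "vep").mkdir()
    (root / "mut_profile").mkdir()
    (root / "vep" / "cohort_1.vep.tsv.gz").touch()
    (root / "vep" / "cohort_2.vep.tsv.gz").touch()
    (root / "mut_profile" / "cohort_1.sig.json").touch()
    paired, vep_only, profile_only = check_input_layout(root)

print(paired, vep_only, profile_only)
```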

Human MANE

  • build_datasets -o <datasets_path> --mane
  • build_annotations -o <annotations_path> -d <datasets_path>
  • nextflow run main.nf --indir <input_path> --outdir <output_path> --data_dir <datasets_path> --annotations_dir <annotations_path> --vep_input true --verbose true --plot true --chimerax_plot true --mane true --seed 64 -profile container

Mouse

  • build_datasets -o <datasets_path> --organism mouse
  • build_annotations -o <annotations_path> -d <datasets_path> --organism mouse
  • nextflow run main.nf --indir <input_path> --outdir <output_path> --data_dir <datasets_path> --annotations_dir <annotations_path> --ignore_mapping_issues true --plot true --chimerax_plot true --vep_input true -profile container

Pre-release v1.0.1-rc

20 Dec 16:10

This is the first pre-release of Oncodrive3D, a fast and accurate novel 3D-clustering algorithm for driver gene discovery. The approach analyses patterns of observed missense somatic mutations (in cancer or normal tissue) to identify volumes with a higher mutation frequency than expected under neutral mutagenesis. Oncodrive3D leverages AlphaFold's structure predictions and Predicted Aligned Error (PAE) to construct contact probability maps. If provided, the mutation profile of the cohort is used to simulate neutral mutagenesis, and rank-based statistics are used to determine empirical p-values for the volume of each mutated residue. The tool can also take mutation profile and sequencing depth information together, provided as a mutability file; this allows it to process mutations obtained from duplex sequencing studies, which are commonly used in normal tissue sequencing at the time of this release.
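
To make the notion of an empirical p-value concrete, the sketch below uses a common estimator: the fraction of simulated scores at least as extreme as the observed one, with a +1 correction so the estimate is never exactly zero. This is a generic illustration under that assumption and does not reproduce Oncodrive3D's rank-based statistics; the function name and the numbers are made up.

```python
def empirical_p_value(observed, simulated):
    """Estimate P(simulated score >= observed) with a +1 pseudo-count."""
    if not simulated:
        raise ValueError("at least one simulated score is required")
    at_least_as_extreme = sum(1 for s in simulated if s >= observed)
    return (at_least_as_extreme + 1) / (len(simulated) + 1)


# Made-up observed score and scores from simulated neutral mutagenesis.
p_value = empirical_p_value(5.0, [1.0, 2.0, 6.0, 3.0])
print(p_value)
```

With N simulations the smallest attainable value is 1 / (N + 1), so the number of simulations bounds the resolution of the p-values.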

Input

  • input.maf (required): Mutation Annotation Format (MAF) file annotated with consequences (e.g., by using Ensembl Variant Effect Predictor (VEP)).

  • mut_profile.json (optional): Dictionary including the normalized frequencies of mutations (values) in every possible trinucleotide context (keys), such as 'ACA>A', 'ACC>A', and so on.

  • mut_config.json (optional): Dictionary including the path and parsing information for the mutability file, which includes information about mutation profile integrated with sequencing depth.
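
As an illustration of the mut_profile.json structure, the sketch below builds a dictionary keyed by trinucleotide substitutions in the 'ACA>A' style quoted above and normalizes it so the values sum to 1. It assumes the conventional 96 pyrimidine-centred contexts (reference base C or T); whether Oncodrive3D expects exactly this key set is an assumption, and the uniform counts and output file name are placeholders.

```python
import json
from itertools import product

BASES = "ACGT"


def trinucleotide_keys():
    """List the 96 pyrimidine-centred substitution keys, e.g. 'ACA>A'."""
    keys = []
    for five, ref, three in product(BASES, "CT", BASES):
        for alt in BASES:
            if alt != ref:
                keys.append(f"{five}{ref}{three}>{alt}")
    return keys


def normalize(counts):
    """Scale non-negative counts so that the values sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must have a positive total")
    return {key: value / total for key, value in counts.items()}


# Placeholder counts: one mutation observed in every context.
profile = normalize({key: 1 for key in trinucleotide_keys()})
profile_json = json.dumps(profile, indent=2)
print(len(profile), round(sum(profile.values()), 6))
```

Writing profile_json to a file such as mut_profile.json (a placeholder name) yields a dictionary in the shape described above.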

Output

  • cohort_filename.3d_clustering_genes.csv: This is a Comma-Separated Values (CSV) file containing the results of the analysis at the gene level.

  • cohort_filename.3d_clustering_pos.csv: This is a Comma-Separated Values (CSV) file containing the results of the analysis at the level of mutated positions.