Releases: bbglab/oncodrive3d
Release v1.0.7
This minor release mainly fixes a bug that appeared with the new release of the AlphaFold Protein Database:
- Fix build-datasets AlphaFold download after symlink change in #67
Another minor change:
- Increase protein length threshold for unfragmented samplesheet preprocessing #64
Full Changelog: v1.0.6...v1.0.7
Release v1.0.6
MANE-Only Dataset Preprocessing and Custom Structure Integration
This release introduces a major enhancement to the dataset building pipeline in Oncodrive3D, enabling MANE-only processing, custom structural predictions, and complete decoupling from UniProt for structure annotation workflows of MANE Select transcript-associated structures.
Key Features & Improvements
Refactored Dataset Builder (build-datasets)
- Direct MANE Integration: Switches the default data source for the MANE Select transcript-associated structures from UniProt API to the MANE Select protein set obtained directly from NCBI.
- Custom Structures Support:
  - This feature allows you to integrate in-house AlphaFold2-predicted structures, which can be generated via the nf-core/proteinfold pipeline, into the Oncodrive3D build using two new arguments:
    - --custom_mane_pdb_dir: directory containing custom PDB files.
    - --custom_mane_metadata: path to a samplesheet.csv including the structures metadata.
  - The objective is to maximize structural coverage of the MANE Select transcriptome, compensating for proteins missing from the AlphaFold Database MANE release, which is still based on version 1.0 and lacks hundreds of proteins.
  - The samplesheet.csv (custom_mane_metadata) must include at minimum:
    - sequence: the Ensembl protein ID used as the structure identifier.
    - refseq: the amino acid sequence (in one-letter code). This will be used to inject the sequence into the PDB file if the structure is missing this information, which is common for predicted structures generated via nf-core/proteinfold.
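Before launching a build with custom structures, it can help to sanity-check the samplesheet. The following is a minimal sketch, not part of Oncodrive3D: `check_samplesheet` is a hypothetical helper, the two column names come from the notes above, and restricting `refseq` to the 20 standard one-letter codes is an assumption.

```python
import csv
import io

# Column names taken from the release notes above.
REQUIRED_COLUMNS = ("sequence", "refseq")
# Assumption: only the 20 standard one-letter amino acid codes are accepted.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def check_samplesheet(handle):
    """Return a list of problems found in a samplesheet (hypothetical helper)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    # Data rows start at line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        protein_id = (row["sequence"] or "").strip()
        residues = (row["refseq"] or "").strip().upper()
        if not protein_id:
            problems.append(f"line {line_no}: empty 'sequence' identifier")
        if not residues:
            problems.append(f"line {line_no}: empty 'refseq' amino acid sequence")
        elif set(residues) - VALID_RESIDUES:
            problems.append(f"line {line_no}: non-standard residue in 'refseq'")
    return problems
```

In practice the handle would come from `open("samplesheet.csv", newline="")`; an empty list means no problems were detected by these checks.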
New Utility: prepare_samplesheet.py
A standalone preprocessing script that:
- Downloads and parses the full MANE.GRCh38.v1.4.ensembl_protein.faa.gz release from NCBI.
- Cross-references these proteins with the AlphaFold MANE mapping file (mane_refseq_prot_to_alphafold.csv) to identify missing structures.
- Generates:
  - A samplesheet.csv listing all MANE Select proteins missing from the AlphaFold Database release, including necessary metadata for structure prediction and downstream integration.
  - Individual FASTA files per Ensembl protein ID, ready for direct input to the nf-core/proteinfold pipeline.
- These outputs allow users to predict and recover missing structures, enabling full coverage of the MANE proteome, independent of AlphaFold’s release schedule or UniProt mappings.
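The cross-referencing step can be pictured with a small sketch. This is an illustration under stated assumptions, not the code of prepare_samplesheet.py: the function names are hypothetical, the set of already-covered IDs is assumed to have been extracted from the mapping file beforehand, and identifier details such as version suffixes or RefSeq-to-Ensembl matching are left out.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}.

    The identifier is the first whitespace-delimited token of each header line.
    """
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {pid: "".join(chunks) for pid, chunks in records.items()}


def missing_structures(proteins, covered_ids):
    """Keep proteins whose ID is absent from the set of IDs that already have a structure."""
    return {pid: seq for pid, seq in proteins.items() if pid not in covered_ids}


def to_fasta(protein_id, sequence, width=60):
    """Format one record as FASTA text with wrapped sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{protein_id}\n" + "\n".join(lines) + "\n"
```

Each entry returned by `missing_structures` would then be written to its own FASTA file with `to_fasta` and listed as one row of the samplesheet.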
Full Changelog: v1.0.5...v1.0.6
Publication (v1.0.5)
Oncodrive3D is a fast and accurate 3D-clustering algorithm for driver gene discovery.
This release corresponds to the version used for the analyses in the publication in Nucleic Acids Research: v1.0.5
Release v1.0.5
Oncodrive3D is a fast and accurate 3D-clustering algorithm for driver gene discovery.
Key Updates and Features
This release addresses bug fixes in the build annotations and plotting modules, introduces enhancements to association plots, updates documentation, and includes general code cleanups.
Bug Fixes
Features in associations plots
- Add FDR to logistic regression analysis for association between clusters and annotations 1
- Added associations plots to nextflow 1
- Removed comparative plots 1
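The notes do not say which FDR procedure is applied to the logistic regression p-values. As an illustration only, the sketch below implements the widely used Benjamini-Hochberg adjustment in plain Python; the choice of method and the function name are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value
    # is capped by the adjusted values of all larger p-values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

An association between a cluster and an annotation would then be called significant when its adjusted value falls below the chosen FDR level.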
Others
- Documentation update
- Linting
Release v1.0.4
Second release of Oncodrive3D, a fast and accurate 3D-clustering algorithm for driver gene discovery. It identifies mutation-enriched volumes by analyzing missense somatic mutations, leveraging AlphaFold's structural predictions to define residue contacts and mutation profiles to simulate neutral mutagenesis. The tool uses rank-based statistics and can process mutations from duplex sequencing studies, enabling the analysis of both cancer and normal tissue datasets across potentially any organism.
Key Updates and Features
This release mainly updates the README with important information and fixes a bug in the oncodrive3d build-datasets step.
Documentation Updates
- Improved the documentation overall for clarity and usability.
- Added steps to fulfill software requirements, addressing installation failures on older machines lacking updated C libraries.
- Provided detailed information on input and output data formats, including:
- How to obtain the required input files.
- In-depth descriptions of the main outputs, including gene-level and residue-level clustering results.
Bug Fixes and Refactoring
- Fixed bug in scripts/datasets/build_datasets.py and scripts/datasets/seq_for_mut_prob.py:
  - Disabled downloading and integrating MANE structures if --mane flag is not enabled.
  - Removed usage of files related to the MANE downloads when computing the seq_for_mut_prob.py for a non-MANE Human proteome.
- Updated scripts/datasets/utils.py to increase the timeout for sock_read in PyPdl, preventing errors during the download of AlphaFold structures.
- Refactored scripts/main.py by moving the import of specific modules into their corresponding functions for better modularity and efficiency.
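The import refactor follows a common pattern for command-line entry points: a module is imported inside the function that needs it, so other subcommands do not pay its load time. A minimal sketch of the pattern, with hypothetical function names and a standard-library module standing in for a heavy dependency:

```python
def build_report(values):
    """Hypothetical subcommand: its dependency is imported only when it runs."""
    import statistics  # deferred import, paid for only by this code path

    return statistics.mean(values)


def show_version():
    """Hypothetical lightweight subcommand that needs no extra modules."""
    return "demo 0.0.0"


# Dispatch table: looking up a command does not trigger any deferred import.
COMMANDS = {"report": build_report, "version": show_version}
```

Calling `COMMANDS["version"]()` never loads `statistics`; only `COMMANDS["report"]([...])` does.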
Release v1.0.3
First release of Oncodrive3D, a fast and accurate 3D-clustering algorithm for driver gene discovery. It identifies mutation-enriched volumes by analyzing missense somatic mutations, leveraging AlphaFold's structural predictions to define residue contacts and mutation profiles to simulate neutral mutagenesis. The tool uses rank-based statistics and can process mutations from duplex sequencing studies, enabling the analysis of both cancer and normal tissue datasets across potentially any organism.
Key Updates and Features
Packaging and Linting
- Added Python package build using uv.
- Published the package to PyPI, enabling installation via pip install oncodrive3d.
- Updated the Dockerfile.
- Applied code linting to improve code quality and maintainability.
- Added LICENCE
NextFlow Pipeline Updates
- Restructured the pipeline according to best practices for enhanced performance and maintainability and moved to oncodrive3d_pipeline/.
Documentation Updates
- Updated the README file:
  - Added instructions for installation.
  - Added instructions for running the provided NextFlow pipeline.
Bug Fixes and Refactoring and Others
- Removed preprocessing scripts in build/preprocessing.
- Updated URLs in scripts/datasets/seq_for_mut_prob.py and scripts/plotting/pfam.py to use the January 2024 Ensembl archive.
- Changed output column from Cluster to Clump in the residue-level output (<cohort>.3d_clustering_pos.csv).
- Changed oncodrive3d run input argument from input_maf_path to input_path in scripts/main.py.
- Refactored scripts/datasets/utils.py to improve download functionality and logging.
Pre-release v1.0.2-rc
This is the second pre-release of Oncodrive3D, a fast and accurate 3D-clustering algorithm for driver gene discovery. It identifies mutation-enriched volumes by analyzing missense somatic mutations, leveraging AlphaFold's structural predictions to define residue contacts and mutation profiles to simulate neutral mutagenesis. The tool uses rank-based statistics and can process mutations from duplex sequencing studies, enabling the analysis of both cancer and normal tissue datasets across potentially any organism.
Key Updates and Features
New Modules for Annotation and Plotting:
- Introduced a comprehensive plotting module, including summary plots, gene plots, comparative plots, association plots, and ChimeraX plots.
Nextflow Pipeline:
- Added a minimal Nextflow pipeline to perform 3D clustering analysis across multiple cohorts and generate all relevant plots.
MANE Transcripts Support:
- Built datasets prioritizing MANE AF-predicted structures.
- Tracked transcript IDs from input data, including mismatch, match, or missing status compared to Oncodrive3D datasets.
Mutation Filtering:
- Filtered mutations with wild-type (WT) structure-AA mismatches and genes exceeding a threshold ratio of mapping issues.
- Added an option to disable WT AA mismatch filtering, particularly useful for mouse data where VEP and UniProt isoform inconsistencies occur.
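The two filtering rules can be sketched as follows. This is an illustrative reading of the notes, not the tool's implementation: the function name, the tuple layout, and the 0.1 threshold are assumptions.

```python
from collections import defaultdict


def filter_mutations(mutations, structure_seqs, max_mismatch_ratio=0.1):
    """Drop WT-mismatching mutations, then drop genes with too many mismatches.

    mutations: list of (gene, position, wt_aa) tuples with 1-based positions.
    structure_seqs: {gene: amino acid sequence of the structure}.
    max_mismatch_ratio: illustrative threshold, not the tool's default.
    """
    kept = defaultdict(list)
    total = defaultdict(int)
    mismatches = defaultdict(int)

    for gene, pos, wt_aa in mutations:
        total[gene] += 1
        seq = structure_seqs.get(gene, "")
        # A mutation matches when the structure has the same residue at that position.
        if 1 <= pos <= len(seq) and seq[pos - 1] == wt_aa:
            kept[gene].append((gene, pos, wt_aa))
        else:
            mismatches[gene] += 1

    result = []
    for gene, muts in kept.items():
        # Keep a gene only if its share of mismatching mutations is acceptable.
        if mismatches[gene] / total[gene] <= max_mismatch_ratio:
            result.extend(muts)
    return result
```

Disabling the WT mismatch filter, as the option above allows, would correspond to skipping the per-mutation comparison entirely.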
Direct VEP Output Support:
- Enabled direct VEP output processing, allowing filtering of transcripts based on Oncodrive3D-built datasets.
Enhanced Outputs:
- Included processed input mutations (<cohort>.mutations.processed.tsv), missense mutation probabilities (<cohort>.miss_prob.processed.tsv), and Oncodrive3D sequence dataframes (<cohort>.seq_df.processed.tsv).
Mouse Data Support:
- Fully enabled and tested processing of mouse data (mm39) across all steps, including dataset building, annotations, and plotting.
Bug Fixes and Improvements:
- Resolved bug affecting the identification of the most significant volume per gene.
- Changed sorting of position-level results from rank-based (Gene, Rank) to significance-based (Gene, p-value, Score).
- Refactored main.py, offloading unnecessary code to module-specific scripts for better organization.
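The new ordering of position-level results can be reproduced on any table with a compound sort key. In this sketch the column names `Gene`, `pval` and `Score` are assumptions, as is the choice to put higher scores first among rows with equal p-values.

```python
def sort_positions(rows):
    """Order rows by gene, then ascending p-value, then descending score."""
    return sorted(rows, key=lambda r: (r["Gene"], r["pval"], -r["Score"]))
```

Python's sort is stable, so rows that tie on all three keys keep their original relative order.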
Example usage
To run the examples provided, the <input_path> directory should be organized as follows:
<input_path>/
├── vep/
│   ├── <cohort_1>.vep.tsv.gz
│   └── <cohort_2>.vep.tsv.gz
└── mut_profile/
    ├── <cohort_1>.sig.json
    └── <cohort_2>.sig.json
vep/: Contains the VEP output files for each cohort, compressed as .tsv.gz.
mut_profile/: Contains the Bgsignature output files (mutation profile in trinucleotide context) for each cohort, saved as .sig.json.
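A quick way to confirm that an input directory follows this layout is to pair each VEP file with its mutation profile. The helper below is a hypothetical sketch that relies only on the file suffixes described above; it is not part of the pipeline.

```python
from pathlib import Path


def discover_cohorts(input_path):
    """Map each cohort name to its VEP file and, when present, its mutation profile."""
    root = Path(input_path)
    cohorts = {}
    for vep_file in sorted((root / "vep").glob("*.vep.tsv.gz")):
        # Strip the known suffix to recover the cohort name.
        cohort = vep_file.name[: -len(".vep.tsv.gz")]
        profile = root / "mut_profile" / f"{cohort}.sig.json"
        cohorts[cohort] = {
            "vep": vep_file,
            "mut_profile": profile if profile.exists() else None,
        }
    return cohorts
```

A cohort whose `mut_profile` entry is `None` has a VEP file but no matching .sig.json file.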
Human MANE
build_datasets -o <datasets_path> --mane
build_annotations -o <annotations_path> -d <datasets_path>
nextflow run main.nf --indir <input_path> --outdir <output_path> --data_dir <datasets_path> --annotations_dir <annotations_path> --vep_input true --verbose true --plot true --chimerax_plot true --mane true --seed 64 -profile container
Mouse
build_datasets -o <datasets_path> --organism mouse
build_annotations -o <annotations_path> -d <datasets_path> --organism mouse
nextflow run main.nf --indir <input_path> --outdir <output_path> --data_dir <datasets_path> --annotations_dir <annotations_path> --ignore_mapping_issues true --plot true --chimerax_plot true --vep_input true -profile container
Pre-release v1.0.1-rc
This is the first pre-release of Oncodrive3D, a fast and accurate novel 3D-clustering algorithm for driver gene discovery. The approach analyses patterns of observed missense somatic mutations (in cancer or normal tissue) to identify volumes that are mutated more frequently than expected under neutral mutagenesis. Oncodrive3D leverages AlphaFold's structure predictions and Predicted Aligned Error (PAE) to construct contact probability maps. If provided, it uses the mutation profile of the cohort to simulate neutral mutagenesis, and it employs rank-based statistics to determine empirical p-values for the volumes of each mutated residue. It can also process mutation profile and sequencing depth information supplied together as a mutability file, which allows the tool to process mutations obtained from duplex sequencing studies, commonly used in normal tissue sequencing at the time of this release.
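The general idea of an empirical p-value from simulated neutral mutagenesis can be shown with a deliberately simplified sketch. It is not Oncodrive3D's algorithm: it uses a fixed neighbour list instead of PAE-derived contact probabilities, scores only the most mutated volume instead of applying rank-based statistics, and all names and defaults are assumptions.

```python
import random


def volume_counts(counts, neighbours):
    """Sum mutation counts over each residue's volume (the residue plus its contacts)."""
    return [sum(counts[j] for j in neighbours[i]) for i in range(len(counts))]


def empirical_p_value(observed_counts, neighbours, probs, n_sim=1000, seed=0):
    """Empirical p-value for the most mutated volume under a neutral model."""
    rng = random.Random(seed)
    n_res = len(observed_counts)
    n_mut = sum(observed_counts)
    observed_max = max(volume_counts(observed_counts, neighbours))

    hits = 0
    for _ in range(n_sim):
        simulated = [0] * n_res
        # Place the same number of mutations at random, weighted by per-residue probability.
        for pos in rng.choices(range(n_res), weights=probs, k=n_mut):
            simulated[pos] += 1
        if max(volume_counts(simulated, neighbours)) >= observed_max:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_sim + 1)
```

Here `probs` plays the role of the per-residue missense mutation probabilities that a mutation profile would provide; with no profile, a uniform vector is the natural stand-in.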
Input
- input.maf (required): Mutation Annotation Format (MAF) file annotated with consequences (e.g., by using Ensembl Variant Effect Predictor (VEP)).
- mut_profile.json (optional): Dictionary including the normalized frequencies of mutations (values) in every possible trinucleotide context (keys), such as 'ACA>A', 'ACC>A', and so on.
- mut_config.json (optional): Dictionary including the path and parsing information for the mutability file, which includes information about mutation profile integrated with sequencing depth.
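The key format of mut_profile.json can be made concrete with a short sketch. It assumes the common 96-channel convention in which the mutated base is the central C or T of the trinucleotide; the notes only give 'ACA>A' and 'ACC>A' as examples, so both the convention and the helper names are assumptions.

```python
from itertools import product

BASES = "ACGT"


def trinucleotide_keys():
    """Build the 96 keys of the form 'ACA>A' for pyrimidine-centred contexts."""
    keys = []
    for five, ref, three in product(BASES, "CT", BASES):
        for alt in BASES:
            # The alternate base must differ from the central reference base.
            if alt != ref:
                keys.append(f"{five}{ref}{three}>{alt}")
    return keys


def check_profile(profile, tol=1e-6):
    """Return True when every expected key is present and the frequencies sum to 1."""
    keys = trinucleotide_keys()
    if set(profile) != set(keys):
        return False
    return abs(sum(profile.values()) - 1.0) <= tol
```

A profile loaded with `json.load` can be passed to `check_profile` directly, since it is a plain dictionary of key-frequency pairs.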
Output
- cohort_filename.3d_clustering_genes.csv: This is a Comma-Separated Values (CSV) file containing the results of the analysis at the gene level.
- cohort_filename.3d_clustering_pos.csv: This is a Comma-Separated Values (CSV) file containing the results of the analysis at the level of mutated positions.