Member

@ramosv ramosv commented Nov 23, 2025

[1.2.0] - 2025-11-23

API and Architecture Refactoring

  • Namespace Hierarchy Overhaul: Transitioned from a flat namespace to a hybrid hierarchical structure to enhance modularity and prevent namespace pollution.
    • Core Classes: DPMON, GNNEmbedding, SubjectRepresentation, SmCCNet, and DatasetLoader remain accessible at the top level (e.g., bnn.DPMON).
    • Utilities and Metrics: Functional tools are now scoped to their respective submodules (e.g., bnn.metrics.plot_network, bnn.utils.preprocess_clinical).
  • Utils Module Restructuring: Decomposed the monolithic utils module into specialized submodules for improved maintainability:
    • utils.data: Contains summary statistics functions (e.g., variance_summary).
    • utils.preprocess: Contains data transformation functions (e.g., impute_omics, normalize_omics).
    • utils.reproducibility: Dedicated module for seeding functions (set_seed).

New Features

  • Graph Engineering Module (graph_tools): Introduced a new module for the diagnosis and repair of network topology issues.
    • repair_graph_connectivity: Implemented an algorithm to reconnect fragmented network components (islands) to the global network using eigenvector centrality hubs or omics-driven correlation.
    • find_optimal_graph: Added an AutoML-style search function that benchmarks various graph construction strategies (Gaussian, Correlation, Threshold) using a structural proxy task to optimize downstream stability.
    • graph_analysis: Added diagnostic utilities to log topological metrics (clustering coefficient, average degree) and identify isolated subgraphs broken down by omics modality.
  • DPMON Enhancements: Expanded the NeuralNetwork backbone to support multiple dimensionality reduction strategies beyond the standard AutoEncoder.
    • Linear Projection: Added ScalarProjection, utilizing a linear layer to map embeddings to feature weights.
    • MLP Projection: Added MLPProjection, utilizing a non-linear Multilayer Perceptron for complex feature weighting.
  • Dataset Loaders:
    • Implemented functional loaders (load_brca, load_kipan, load_lgg, load_paad, load_monet, load_example) to provide immediate access to data dictionaries, aligning with scikit-learn conventions.
    • Added __getitem__ support to the DatasetLoader class for direct key access (e.g., loader['rna']).
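
As a rough illustration of the __getitem__ addition above, the sketch below shows one way key access can delegate to an internal dictionary. MiniLoader and its data attribute are hypothetical stand-ins invented for this example; they are not the actual DatasetLoader implementation, and the real class may store and validate its tables differently.

```python
class MiniLoader:
    """Hypothetical stand-in for a dataset loader; not the library's class."""

    def __init__(self, data):
        # Assumed layout: a plain dict mapping table names to loaded tables.
        self.data = dict(data)

    def __getitem__(self, key):
        # Delegate loader['rna'] to the underlying dictionary and raise a
        # KeyError that lists the valid keys when the lookup misses.
        try:
            return self.data[key]
        except KeyError:
            available = sorted(self.data)
            raise KeyError(f"{key!r} not found; available keys: {available}") from None


# Toy tables used only for illustration.
loader = MiniLoader({"rna": [[0.1, 0.2]], "clinical": [[1, 0]]})
rna = loader["rna"]
```

With this pattern, a missing key still raises KeyError, so code written against a plain dictionary keeps working when handed the loader object.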

Data Standardization

  • BRCA Clinical Update: Removed 15 duplicated columns from the BRCA clinical dataset, reducing the feature dimensionality from 118 to 103 to ensure data uniqueness.
  • Dataset Renaming:
    • Renamed the synthetic dataset example1 to example.
    • Renamed gbmlgg to lgg (Brain Lower Grade Glioma).
  • Target Variable Update: Updated the target variable for the lgg dataset from 'histological type' to 'vital_status' to better align with prognostic prediction tasks.
  • Key Standardization: Removed redundant _data suffixes from dataset dictionary keys (e.g., monet['mirna_data'] is now monet['mirna']).
  • Dataset Specifications: Updated documentation to explicitly define the dimensions (samples × features) for all included datasets:
    • BRCA: miRNA (769, 503), Target (769, 1), Clinical (769, 103), RNA (769, 2500), Meth (769, 2203).
    • LGG: miRNA (511, 548), Target (511, 1), Clinical (511, 13), RNA (511, 2127), Meth (511, 1823).
    • PAAD: CNV (177, 1035), Target (177, 1), Clinical (177, 19), RNA (177, 1910), Meth (177, 1152).
    • KIPAN: miRNA (658, 472), Target (658, 1), Clinical (658, 19), RNA (658, 2284), Meth (658, 2102).
    • Monet: Gene (107, 5039), miRNA (107, 789), Phenotype (106, 1), RPPA (107, 175), Clinical (107, 5).
    • Example: X1 (358, 500), X2 (358, 100), Y (358, 1), Clinical (358, 6).
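
For code that still builds dictionaries with the old key names, the Key Standardization change above could be mirrored with a small helper like the one below. This is a sketch under assumptions: strip_data_suffix is a hypothetical function that is not part of the package, and apart from 'mirna_data' the keys in the toy dictionary are placeholders.

```python
def strip_data_suffix(dataset):
    """Return a copy of `dataset` with a trailing '_data' removed from each key."""
    suffix = "_data"
    renamed = {}
    for key, value in dataset.items():
        new_key = key[: -len(suffix)] if key.endswith(suffix) else key
        # Refuse to silently overwrite when two keys collapse to the same name.
        if new_key in renamed:
            raise ValueError(f"key collision after renaming: {new_key!r}")
        renamed[new_key] = value
    return renamed


# Placeholder values stand in for the real tables.
old_style = {"mirna_data": "mirna table", "gene_data": "gene table", "phenotype": "labels"}
new_style = strip_data_suffix(old_style)
```

Keys without the suffix pass through unchanged, so the helper can be applied to dictionaries that already use the new names.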

Improvements and Fixes

  • Documentation: Refactored all docstrings across the library to adhere to strict Google Style formatting (Args/Returns) to ensure consistent API documentation generation.
  • Clustering:
    • Hybrid Louvain: Corrected the parameter tuning logic for k3 and k4 weights and refined the iterative refinement loop for identifying phenotype-associated subgraphs.
    • Correlated PageRank: Enhanced input validation to ensure proper alignment between graph nodes and omics features.

Removed

  • Metrics Evaluation: Removed the metrics.evaluation module. Its functionality has been consolidated into the metrics module or deprecated in favor of external validation workflows.

Left to Do

  • Test Suite Completion: Refactor remaining tests (gnn_embedding, subject_representation, hybrid_louvain) to align with new utils imports and other major changes.
  • Documentation: Update ReadTheDocs API reference and README.md to reflect the split utils submodules and new graph_tools.
  • Release Prep: Bump version to 1.2.0 in setup.py.
  • Errors: Errors with tests and the doc build are expected. They will be addressed in subsequent patch releases (1.2.1 and so on).

@ramosv ramosv self-assigned this Nov 23, 2025
Copilot AI review requested due to automatic review settings November 23, 2025 23:58
Member Author

ramosv commented Nov 24, 2025

@abdelhafizm @SundousHussein @ElyasYassin

It's likely that some of the tests will not pass; as long as it builds OK, we should be fine.
I needed to update the package so that I can run some experiments in a cloud environment. This is for a dataset that I am unable to download directly due to privacy reasons. Therefore, the package had to be ready.
Other than the failing tests, please conduct your review as you normally would.

Thank you,
Vicente

Copilot AI left a comment

Pull request overview

This pull request introduces a major architectural refactoring (v1.2.0) that modernizes the BioNeuralNet framework by transitioning from a flat namespace to a hierarchical structure. The changes improve modularity, enhance maintainability, and standardize dataset handling across the library.

Key Changes:

  • Restructured the namespace to scope utilities and metrics to submodules while keeping core classes at the top level
  • Renamed and standardized datasets (example1 → example, gbmlgg → lgg) with consistent key naming conventions
  • Added convenience loader functions (load_brca, load_lgg, etc.) following scikit-learn patterns
  • Enhanced clustering algorithms with improved input validation and unsupervised mode support

Reviewed changes

Copilot reviewed 56 out of 74 changed files in this pull request and generated 16 comments.

| File | Description |
| --- | --- |
| bioneuralnet/__init__.py | Restructured imports to expose submodules while maintaining top-level access to core classes; updated version to 1.2.0 |
| bioneuralnet/datasets/__init__.py | Added six convenience loader functions (load_example, load_monet, load_brca, load_lgg, load_kipan, load_paad) with consistent docstrings |
| bioneuralnet/clustering/hybrid_louvain.py | Refactored seeding logic to use centralized set_seed utility; added early stopping for small graphs; improved tuning condition logic |
| bioneuralnet/clustering/correlated_pagerank.py | Enhanced input validation for node-to-column mapping; rewrote personalization vector generation with explicit fallback handling; replaced manual while loops with for loops in some areas |
| bioneuralnet/clustering/correlated_louvain.py | Added _compute_community_cohesion method for unsupervised mode; refactored _quality_correlated to support both supervised and unsupervised clustering |
| bioneuralnet/clustering/__init__.py | Added module-level docstring describing clustering functionality |
| bioneuralnet/datasets/brca/pam50.csv | Removed file (likely renamed to target.csv for consistency across datasets) |
| TCGA-Notebooks/TCGA-BRCA.ipynb | Removed notebook (relocated to separate repository or documentation) |
| README.md | Updated version reference from 1.1.4 to 1.2.0; corrected example dataset name from example1 to example |
| MANIFEST.in | Updated dataset paths to reflect renamed directories (example1 → example, gbmlgg → lgg); added paad dataset |
| .pre-commit-config.yaml | Updated CSV validation regex to recognize example instead of example1 |
| .gitignore | Updated paths for renamed dataset directories and added new exclusion patterns |
| CHANGELOG.md | Added comprehensive v1.2.0 release notes documenting all changes, including API restructuring, new features, and data standardization |
Comments suppressed due to low confidence (2)

tests/test_utils_preprocess.py:77
  • Keyword argument 'y' is not a supported parameter name of function preprocess_clinical.

tests/test_utils_data.py:2
  • Import of 'call' is not used.

Member Author

ramosv commented Nov 24, 2025

[Screenshot: TestPassing]

All tests pass locally, but since the datasets are not available until we push the package to PyPI, they will not pass here.

Collaborator

@abdelhafizm abdelhafizm left a comment

lgtm

num_edges = G.number_of_edges()
num_components = nx.number_connected_components(G)
components = list(nx.connected_components(G))
largest_cc = max(components, key=len)
Collaborator

This might raise a ValueError if num_nodes == 0, because max() is then called on an empty sequence.

Member Author

Noted, will revise on next PR. Need to double check how networkx handles it.
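
One possible guard for the empty-graph case raised in this thread is the default argument of Python's built-in max(), sketched below. largest_component is a hypothetical helper, not code from this PR, and the hand-written lists of sets only stand in for list(nx.connected_components(G)); whether an empty set is the right fallback for graph_analysis is an assumption.

```python
def largest_component(components):
    """Return the largest set in `components`, or an empty set if there are none."""
    # max() raises ValueError on an empty iterable unless `default` is supplied.
    return max(components, key=len, default=set())


# Hand-written stand-ins for list(nx.connected_components(G)).
non_empty = [{"a", "b", "c"}, {"d"}]
empty = []

largest = largest_component(non_empty)
fallback = largest_component(empty)

# The unguarded call fails on the empty list, matching the review comment.
try:
    max(empty, key=len)
    unguarded_raises = False
except ValueError:
    unguarded_raises = True
```

An explicit early return when the graph has no nodes would work equally well and may read more clearly if other metrics also need a special case.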

Collaborator

@ElyasYassin ElyasYassin left a comment

LGTM!

@ramosv ramosv merged commit bbc6649 into main Nov 25, 2025
2 of 11 checks passed