Architecture refactoring v1.2.0 #98

ramosv · 2025-11-23T23:58:44Z

[1.2.0] - 2025-11-23

API and Architecture Refactoring

Namespace Hierarchy Overhaul: Transitioned from a flat namespace to a hybrid hierarchical structure to enhance modularity and prevent namespace pollution.
- Core Classes: DPMON, GNNEmbedding, SubjectRepresentation, SmCCNet, and DatasetLoader remain accessible at the top level (e.g., bnn.DPMON).
- Utilities and Metrics: Functional tools are now scoped to their respective submodules (e.g., bnn.metrics.plot_network, bnn.utils.preprocess_clinical).
Utils Module Restructuring: Decomposed the monolithic utils module into specialized submodules for improved maintainability:
- utils.data: Contains summary statistics functions (e.g., variance_summary).
- utils.preprocess: Contains data transformation functions (e.g., impute_omics, normalize_omics).
- utils.reproducibility: Dedicated module for seeding functions (set_seed).
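As a rough illustration of what a seeding helper in utils.reproducibility might do, here is a minimal sketch. The name set_seed_sketch and its body are assumptions made for this example; the actual set_seed signature and the libraries it seeds are not shown in this PR description.

```python
import os
import random


def set_seed_sketch(seed: int) -> None:
    """Hypothetical helper: seed every RNG that is available in this environment."""
    # Recorded for child processes; it does not change hashing in the running one.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    # NumPy and PyTorch are treated as optional and seeded only when installed.
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same draws.
set_seed_sketch(42)
first = [random.random() for _ in range(3)]
set_seed_sketch(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```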

New Features

Graph Engineering Module (graph_tools): Introduced a new module for the diagnosis and repair of network topology issues.
- repair_graph_connectivity: Implemented an algorithm to reconnect fragmented network components (islands) to the global network using eigenvector centrality hubs or omics-driven correlation.
- find_optimal_graph: Added an AutoML-style search function that benchmarks various graph construction strategies (Gaussian, Correlation, Threshold) using a structural proxy task to optimize downstream stability.
- graph_analysis: Added diagnostic utilities to log topological metrics (clustering coefficient, average degree) and identify isolated subgraphs broken down by omics modality.
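The hub-based reconnection idea can be sketched in plain Python. This is not the graph_tools implementation: the adjacency-dict input, the function names, and the choice to attach each island through its highest-degree node are all assumptions for illustration, and the omics-driven correlation option is not covered.

```python
from collections import deque


def connected_components(adj):
    """Return a list of node sets, one per connected component."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in component:
                    component.add(nb)
                    queue.append(nb)
        seen |= component
        components.append(component)
    return components


def eigenvector_centrality(adj, nodes, iterations=100):
    """Power iteration on (A + I), restricted to the given nodes."""
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = {n: score[n] + sum(score[nb] for nb in adj[n] if nb in score)
               for n in nodes}
        norm = sum(v * v for v in new.values()) ** 0.5
        score = {n: v / norm for n, v in new.items()}
    return score


def repair_connectivity_sketch(adj):
    """Link every island to the most central node of the largest component."""
    components = connected_components(adj)
    if len(components) < 2:
        return adj  # empty or already connected
    main = max(components, key=len)
    centrality = eigenvector_centrality(adj, main)
    hub = max(centrality, key=centrality.get)
    for island in components:
        if island is main:
            continue
        # Assumption: attach the island through its best-connected node.
        anchor = max(island, key=lambda n: len(adj[n]))
        adj[anchor].add(hub)
        adj[hub].add(anchor)
    return adj


# Toy graph: a main component around "a", a three-node island, and an isolated node.
toy = {
    "a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"},
    "e": {"f", "g"}, "f": {"e"}, "g": {"e"},
    "h": set(),
}
repaired = repair_connectivity_sketch(toy)
print(len(connected_components(repaired)))  # 1
```

Adding the identity matrix in the power iteration keeps the scores from oscillating on bipartite components.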
DPMON Enhancements: Expanded the NeuralNetwork backbone to support multiple dimensionality reduction strategies beyond the standard AutoEncoder.
- Linear Projection: Added ScalarProjection, utilizing a linear layer to map embeddings to feature weights.
- MLP Projection: Added MLPProjection, utilizing a non-linear Multilayer Perceptron for complex feature weighting.
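To make the difference between the two projections concrete, the sketch below maps each embedding vector to a single feature weight, first with a linear map and then with a one-hidden-layer ReLU network. It is written in plain Python with fixed parameters; the function names, parameter layout, and activation are assumptions and do not reflect how ScalarProjection or MLPProjection are actually built.

```python
def linear_projection(embeddings, weights, bias=0.0):
    """One weight per embedding row: dot(row, weights) + bias."""
    return [sum(e * w for e, w in zip(row, weights)) + bias for row in embeddings]


def mlp_projection(embeddings, hidden_weights, hidden_bias, out_weights, out_bias=0.0):
    """One weight per embedding row through a single ReLU hidden layer."""
    outputs = []
    for row in embeddings:
        hidden = [
            max(0.0, sum(e * w for e, w in zip(row, unit)) + b)
            for unit, b in zip(hidden_weights, hidden_bias)
        ]
        outputs.append(sum(h * w for h, w in zip(hidden, out_weights)) + out_bias)
    return outputs


# Two features, each with a 2-dimensional embedding (toy values).
embeddings = [[1.0, 2.0], [0.0, -1.0]]

linear_out = linear_projection(embeddings, weights=[0.5, 0.25])
mlp_out = mlp_projection(
    embeddings,
    hidden_weights=[[1.0, 0.0], [0.0, 1.0]],
    hidden_bias=[0.0, 0.0],
    out_weights=[1.0, 1.0],
)
print(linear_out)  # [1.0, -0.25]
print(mlp_out)     # [3.0, 0.0]
```

The second row shows the non-linearity at work: the linear map produces a negative weight, while the ReLU layer clips the negative pre-activation to zero.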
Dataset Loaders:
- Implemented functional loaders (load_brca, load_kipan, load_lgg, load_paad, load_monet, load_example) to provide immediate access to data dictionaries, aligning with scikit-learn conventions.
- Added __getitem__ support to the DatasetLoader class for direct key access (e.g., loader['rna']).
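The two access styles described above could look roughly like this. DatasetLoaderSketch and load_example_sketch are hypothetical stand-ins with in-memory toy values; the real class reads packaged files, and the exact dictionary keys are assumed here.

```python
class DatasetLoaderSketch:
    """Hypothetical stand-in that holds named tables in a dictionary."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)

    def __getitem__(self, key):
        # loader['X1'] behaves like loader.data['X1'], with a clearer error message.
        try:
            return self.data[key]
        except KeyError:
            raise KeyError(
                f"{key!r} is not in dataset {self.name!r}; "
                f"available keys: {sorted(self.data)}"
            ) from None


def load_example_sketch():
    """Functional loader: return the data dictionary directly."""
    toy = {"X1": [[0.1, 0.2]], "X2": [[1.0]], "Y": [[0]], "clinical": [[63, 1]]}
    return DatasetLoaderSketch("example", toy).data


loader = DatasetLoaderSketch("example", load_example_sketch())
print(loader["X1"])                  # [[0.1, 0.2]]
print(sorted(load_example_sketch())) # ['X1', 'X2', 'Y', 'clinical']
```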

Data Standardization

BRCA Clinical Update: Removed 15 duplicated columns from the BRCA clinical dataset, reducing the feature dimensionality from 118 to 103 to ensure data uniqueness.
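One way such a cleanup can be done is to drop every column whose values repeat an earlier column. The sketch below assumes duplicates are detected by content and that the table is a plain dict of column lists; the PR does not say how the 15 BRCA columns were actually identified.

```python
def drop_duplicate_columns(table):
    """Keep the first column seen for each distinct sequence of values."""
    seen = set()
    kept = {}
    for name, values in table.items():
        fingerprint = tuple(values)
        if fingerprint in seen:
            continue  # same content as an earlier column
        seen.add(fingerprint)
        kept[name] = values
    return kept


# Toy clinical table with one duplicated column (names are made up).
clinical = {
    "age": [50, 61, 47],
    "age_copy": [50, 61, 47],
    "stage": [1, 2, 2],
}
deduplicated = drop_duplicate_columns(clinical)
print(list(deduplicated))  # ['age', 'stage']
```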
Dataset Renaming:
- Renamed the synthetic dataset example1 to example.
- Renamed gbmlgg to lgg (Brain Lower Grade Glioma).
Target Variable Update: Updated the target variable for the lgg dataset from 'histological type' to 'vital_status' to better align with prognostic prediction tasks.
Key Standardization: Removed redundant _data suffixes from dataset dictionary keys (e.g., monet['mirna_data'] is now monet['mirna']).
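For code that still holds dictionaries with the old keys, a small migration helper along these lines could bridge the rename. The helper name and the toy keys other than mirna_data are assumptions for illustration.

```python
def strip_data_suffix(dataset, suffix="_data"):
    """Return a copy of the dictionary with the suffix removed from its keys."""
    renamed = {}
    for key, value in dataset.items():
        has_suffix = key.endswith(suffix) and len(key) > len(suffix)
        new_key = key[: -len(suffix)] if has_suffix else key
        if new_key in renamed:
            # Refuse to silently overwrite when old and new names coexist.
            raise ValueError(f"key collision after renaming: {new_key!r}")
        renamed[new_key] = value
    return renamed


old_style = {"mirna_data": "miRNA table", "gene_data": "gene table", "clinical": "clinical table"}
new_style = strip_data_suffix(old_style)
print(sorted(new_style))  # ['clinical', 'gene', 'mirna']
```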
Dataset Specifications: Updated documentation to explicitly define the dimensions (samples × features) for all included datasets:
- BRCA: miRNA (769, 503), Target (769, 1), Clinical (769, 103), RNA (769, 2500), Meth (769, 2203).
- LGG: miRNA (511, 548), Target (511, 1), Clinical (511, 13), RNA (511, 2127), Meth (511, 1823).
- PAAD: CNV (177, 1035), Target (177, 1), Clinical (177, 19), RNA (177, 1910), Meth (177, 1152).
- KIPAN: miRNA (658, 472), Target (658, 1), Clinical (658, 19), RNA (658, 2284), Meth (658, 2102).
- Monet: Gene (107, 5039), miRNA (107, 789), Phenotype (106, 1), RPPA (107, 175), Clinical (107, 5).
- Example: X1 (358, 500), X2 (358, 100), Y (358, 1), Clinical (358, 6).
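A quick consistency check over these figures is to compare sample counts across the tables of each dataset. The sketch below copies the shapes listed above into a dictionary and reports any table whose row count differs from the most common one in its dataset; it only reports the difference and does not say which number is intended.

```python
from collections import Counter

# (samples, features) per table, copied from the list above.
SHAPES = {
    "BRCA": {"miRNA": (769, 503), "Target": (769, 1), "Clinical": (769, 103),
             "RNA": (769, 2500), "Meth": (769, 2203)},
    "LGG": {"miRNA": (511, 548), "Target": (511, 1), "Clinical": (511, 13),
            "RNA": (511, 2127), "Meth": (511, 1823)},
    "PAAD": {"CNV": (177, 1035), "Target": (177, 1), "Clinical": (177, 19),
             "RNA": (177, 1910), "Meth": (177, 1152)},
    "KIPAN": {"miRNA": (658, 472), "Target": (658, 1), "Clinical": (658, 19),
              "RNA": (658, 2284), "Meth": (658, 2102)},
    "Monet": {"Gene": (107, 5039), "miRNA": (107, 789), "Phenotype": (106, 1),
              "RPPA": (107, 175), "Clinical": (107, 5)},
    "Example": {"X1": (358, 500), "X2": (358, 100), "Y": (358, 1),
                "Clinical": (358, 6)},
}


def mismatched_rows(shapes):
    """Map each dataset to the tables whose row count differs from the majority."""
    report = {}
    for dataset, tables in shapes.items():
        counts = Counter(rows for rows, _ in tables.values())
        expected, _ = counts.most_common(1)[0]
        odd = {name: rows for name, (rows, _) in tables.items() if rows != expected}
        if odd:
            report[dataset] = odd
    return report


print(mismatched_rows(SHAPES))  # {'Monet': {'Phenotype': 106}}
```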

Improvements and Fixes

Documentation: Refactored all docstrings across the library to adhere to strict Google Style formatting (Args/Returns) to ensure consistent API documentation generation.
Clustering:
- Hybrid Louvain: Corrected the parameter tuning logic for k3 and k4 weights and refined the iterative refinement loop for identifying phenotype-associated subgraphs.
- Correlated PageRank: Enhanced input validation to ensure proper alignment between graph nodes and omics features.
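The kind of alignment check described for Correlated PageRank might look like the following. The function name, the error wording, and the decision to return unused columns are assumptions; the actual validation in correlated_pagerank.py is not shown in this PR description.

```python
def check_node_feature_alignment(graph_nodes, omics_columns):
    """Raise if any graph node lacks an omics column; return columns not in the graph."""
    nodes, columns = set(graph_nodes), set(omics_columns)
    missing = sorted(nodes - columns)
    if missing:
        raise ValueError(
            f"{len(missing)} graph node(s) have no matching omics column, "
            f"for example: {missing[:5]}"
        )
    # Extra columns are not an error here, but the caller may want to drop them.
    return sorted(columns - nodes)


unused = check_node_feature_alignment(["g1", "g2"], ["g1", "g2", "g3"])
print(unused)  # ['g3']
```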

Removed

Metrics Evaluation: Removed the metrics.evaluation module. Its functionality has been consolidated into the metrics module or deprecated in favor of external validation workflows.

Left to Do

Test Suite Completion: Refactor remaining tests (gnn_embedding, subject_representation, hybrid_louvain) to align with new utils imports and other major changes.
Documentation: Update ReadTheDocs API reference and README.md to reflect the split utils submodules and new graph_tools.
Release Prep: Bump version to 1.2.0 in setup.py.
Errors: Errors with tests and doc-build are expected. They will be addressed in subsequent patch versions (1.2.1 and so on).

ramosv · 2025-11-24T00:01:40Z

@abdelhafizm @SundousHussein @ElyasYassin

It's likely that some of the tests will not pass; as long as it builds OK, we should be fine.
I needed to update the package so that I can run some experiments in a cloud environment. This is for a dataset that I am unable to download directly due to privacy reasons. Therefore, the package had to be ready for use there.
Other than the failing tests, please conduct your review as you normally would.

Thank you,
Vicente

Copilot

Pull request overview

This pull request introduces a major architectural refactoring (v1.2.0) that modernizes the BioNeuralNet framework by transitioning from a flat namespace to a hierarchical structure. The changes improve modularity, enhance maintainability, and standardize dataset handling across the library.

Key Changes:

- Restructured the namespace to scope utilities and metrics to submodules while keeping core classes at the top level
- Renamed and standardized datasets (example1 → example, gbmlgg → lgg) with consistent key naming conventions
- Added convenience loader functions (load_brca, load_lgg, etc.) following scikit-learn patterns
- Enhanced clustering algorithms with improved input validation and unsupervised mode support

Reviewed changes

Copilot reviewed 56 out of 74 changed files in this pull request and generated 16 comments.


File	Description
`bioneuralnet/__init__.py`	Restructured imports to expose submodules while maintaining top-level access to core classes; updated version to 1.2.0
`bioneuralnet/datasets/__init__.py`	Added six convenience loader functions (`load_example`, `load_monet`, `load_brca`, `load_lgg`, `load_kipan`, `load_paad`) with consistent docstrings
`bioneuralnet/clustering/hybrid_louvain.py`	Refactored seeding logic to use centralized `set_seed` utility; added early stopping for small graphs; improved tuning condition logic
`bioneuralnet/clustering/correlated_pagerank.py`	Enhanced input validation for node-to-column mapping; rewrote personalization vector generation with explicit fallback handling; replaced manual while loops with for loops in some areas
`bioneuralnet/clustering/correlated_louvain.py`	Added `_compute_community_cohesion` method for unsupervised mode; refactored `_quality_correlated` to support both supervised and unsupervised clustering
`bioneuralnet/clustering/__init__.py`	Added module-level docstring describing clustering functionality
`bioneuralnet/datasets/brca/pam50.csv`	Removed file (likely renamed to `target.csv` for consistency across datasets)
`TCGA-Notebooks/TCGA-BRCA.ipynb`	Removed notebook (relocated to separate repository or documentation)
`README.md`	Updated version reference from 1.1.4 to 1.2.0; corrected example dataset name from `example1` to `example`
`MANIFEST.in`	Updated dataset paths to reflect renamed directories (`example1` → `example`, `gbmlgg` → `lgg`); added `paad` dataset
`.pre-commit-config.yaml`	Updated CSV validation regex to recognize `example` instead of `example1`
`.gitignore`	Updated paths for renamed dataset directories and added new exclusion patterns
`CHANGELOG.md`	Added comprehensive v1.2.0 release notes documenting all changes, including API restructuring, new features, and data standardization

Comments suppressed due to low confidence (2)

tests/test_utils_preprocess.py:77
Keyword argument 'y' is not a supported parameter name of function preprocess_clinical.

tests/test_utils_data.py:2
Import of 'call' is not used.


bioneuralnet/clustering/correlated_pagerank.py

CHANGELOG.md

bioneuralnet/downstream_task/dpmon.py

tests/__init__.py

bioneuralnet/utils/graph.py

ramosv · 2025-11-24T08:22:20Z

All tests pass locally, but they will not pass here because the datasets are not available until we push the package to PyPI.

abdelhafizm

lgtm

ElyasYassin · 2025-11-25T01:01:06Z

bioneuralnet/utils/graph_tools.py

+    num_edges = G.number_of_edges()
+    num_components = nx.number_connected_components(G)
+    components = list(nx.connected_components(G))
+    largest_cc = max(components, key=len)


This might raise a ValueError if num_nodes = 0, since max() is then called on an empty sequence.

Noted, will revise on next PR. Need to double check how networkx handles it.
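For reference, the failure and one possible guard can be reproduced without networkx. The sketch assumes that nx.connected_components yields no node sets for a graph with no nodes, so the components list is empty; the helper name is made up for this example.

```python
def largest_component(components):
    """Return the largest node set, or an empty set when there are no components."""
    # The `default` keyword of max() avoids the ValueError on an empty sequence.
    return max(components, key=len, default=set())


# Without a default, max() on an empty list raises ValueError.
try:
    max([], key=len)
    raised_value_error = False
except ValueError:
    raised_value_error = True

print(raised_value_error)                   # True
print(largest_component([]))                # set()
print(largest_component([{1}, {2, 3}]))     # {2, 3}
```

An explicit early return when the graph has no nodes would work as well and makes the empty case visible in the logs.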

ElyasYassin

LGTM!

ramosv added 2 commits November 13, 2025 02:04

Ready for review: TCGA KIPAN, BRCA, and BioMarkers

c07f1c2

ramosv requested review from ElyasYassin, SundousHussein and abdelhafizm November 23, 2025 23:58

ramosv self-assigned this Nov 23, 2025

Copilot AI review requested due to automatic review settings November 23, 2025 23:58

Copilot started reviewing on behalf of ramosv November 23, 2025 23:59

Copilot finished reviewing on behalf of ramosv November 24, 2025 00:02

Copilot AI reviewed Nov 24, 2025

Finish tests and addressed copilot suggestions

9cc0622

abdelhafizm approved these changes Nov 24, 2025

ElyasYassin reviewed Nov 25, 2025

ElyasYassin approved these changes Nov 25, 2025

ramosv merged commit bbc6649 into main Nov 25, 2025
2 of 11 checks passed
