@preshanth
Owner

This is a complete overhaul of SAM-RFI to accommodate SAM2 models. We also explored SAM3 but have decided against it for now, because the model's native training size is 1008 rather than 1024.

  Major Changes:
  - Migrated from manual SAM2 library to HuggingFace transformers
  - Moved legacy code to legacy/ directory
  - Implemented clean class-based architecture with CLI and YAML configs

  New Modules:
  - data/: MSLoader, Preprocessor, SAMDataset (clean data pipeline)
  - training/: SAM2Trainer with validation loss tracking
  - inference/: RFIPredictor with iterative flagging support
  - config/: YAML configuration loader
  - data_generation/: MS and synthetic data generators

  Features:
  - Training & validation loss plots (dual curves)
  - Iterative flagging: N-pass RFI detection with cumulative masking
  - GPU profiling: validate_gpu.py with memory/utilization monitoring
  - Batch size optimization for V100/A100
  - Real unit tests (removed mock-heavy tests)
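
The iterative flagging feature listed above (N passes with cumulative masking) can be sketched roughly as follows. This is an illustration only, not the RFIPredictor code: `detect` is a hypothetical stand-in for one model pass (here a simple sigma clip over the samples not yet flagged), and `iterative_flag`, `n_passes`, and `n_sigma` are assumed names.

```python
import numpy as np


def detect(data: np.ndarray, flags: np.ndarray, n_sigma: float = 3.0) -> np.ndarray:
    """Hypothetical single-pass detector: sigma clip using unflagged samples only."""
    clean = data[~flags]
    if clean.size == 0:
        return np.zeros_like(flags)
    med = np.median(clean)
    std = clean.std()
    return np.abs(data - med) > n_sigma * std


def iterative_flag(data: np.ndarray, n_passes: int = 3) -> np.ndarray:
    """Run the detector up to n_passes times, OR-ing each pass into one mask."""
    flags = np.zeros(data.shape, dtype=bool)
    for _ in range(n_passes):
        new_flags = detect(data, flags)
        if not (new_flags & ~flags).any():
            break  # nothing new was flagged, so further passes change nothing
        flags |= new_flags  # cumulative masking: earlier flags are never cleared
    return flags
```

In this sketch, excluding already-flagged samples lets later passes pick up weaker RFI that strong RFI hid from the first pass's statistics.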

  CLI Commands:
  - generate-data: Create datasets from MS or synthetic
  - train: Train on pre-generated HuggingFace datasets
  - predict: Single-pass or iterative flagging
  - create-config/validate-config: Config management

  Package:
  - pyproject.toml with proper dependencies (numpy>=1.26, pandas>=2.2)
  - pytest configuration
  - Example configs for training and validation

  Fixes:
  - Resolved pandas/numpy version conflicts
  - Separated data generation from training
  - Clean imports, no legacy dependencies
… Dataset generation now saves torch tensors directly, which allows direct GPU loading. Dataset generation and preprocessing are therefore done together, avoiding preprocessing compute at load time.
…on tools

   Per-file changes:

   preprocessor.py:
   - Add automatic padding in _patchify_single_waterfall for arrays smaller than
   patch_size
   - Pad to multiples of patch_size for patchify compatibility
   - Store original_shapes in metadata for reconstruction cropping
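
A minimal sketch of the pad-then-crop round trip described for preprocessor.py. The function names, zero padding, and padding on the bottom/right edges are assumptions for illustration; the actual `_patchify_single_waterfall` may differ.

```python
import numpy as np


def pad_to_patch_multiple(waterfall: np.ndarray, patch_size: int):
    """Zero-pad a 2D array so both axes are multiples of patch_size.

    Returns the padded array and the original shape needed to crop later.
    Arrays smaller than patch_size are padded up to one full patch.
    """
    rows, cols = waterfall.shape
    pad_rows = (-rows) % patch_size  # 0 when rows is already a multiple
    pad_cols = (-cols) % patch_size
    padded = np.pad(waterfall, ((0, pad_rows), (0, pad_cols)), mode="constant")
    return padded, (rows, cols)


def crop_to_original(padded: np.ndarray, original_shape):
    """Undo the padding using the stored original shape."""
    rows, cols = original_shape
    return padded[:rows, :cols]
```

For example, a 100x300 array with `patch_size=256` is padded to 256x512, and cropping with the stored shape recovers the original array exactly.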

   predictor.py:
   - Add save_probabilities parameter to save raw probability maps
   - Implement adaptive thresholding (threshold=None uses mean of probabilities)
   - Add upscaling of SAM2 256x256 outputs to patch_size using scipy.ndimage.zoom
   - Calculate padded shape for reconstruction, crop result to original dimensions
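
The adaptive thresholding rule for predictor.py (`threshold=None` uses the mean of the probabilities) could look like the sketch below. The function name and the strict `>` comparison are assumptions; upscaling and reconstruction are not shown.

```python
from typing import Optional

import numpy as np


def probabilities_to_flags(probs: np.ndarray, threshold: Optional[float] = None) -> np.ndarray:
    """Convert a probability map to a boolean flag mask.

    threshold=None selects the adaptive mode: the mean of the map is used.
    """
    if threshold is None:
        threshold = float(probs.mean())
    return probs > threshold
```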

   evaluation/statistics.py (new):
   - Add compute_statistics for before/after flagging analysis
   - Add compute_ffi for Flagging Fidelity Index metric
   - Add print_statistics_comparison for formatted output

   evaluation/__init__.py:
   - Export compute_statistics, compute_ffi, print_statistics_comparison

   scripts/validate_single_array.py (new):
   - Standalone validation for synthetic or real single arrays
   - Probability heatmaps and histograms
   - Adaptive threshold testing
   - 2x4 grid (synthetic with GT) or 2x3 grid (real with FFI)

   ms_loader.py:
   - Add load_single_baseline method for extracting single baseline/pol

   sam_dataset.py:
   - Fix empty mask bbox: use full image [0,0,W,H] instead of center box
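
A sketch of the empty-mask fallback in sam_dataset.py, assuming boxes are `[x_min, y_min, x_max, y_max]`. The function name and the inclusive pixel convention for non-empty masks are assumptions; only the `[0,0,W,H]` fallback comes from the change described above.

```python
import numpy as np


def mask_to_bbox(mask: np.ndarray) -> list:
    """Return [x_min, y_min, x_max, y_max] for a 2D boolean mask.

    An empty mask falls back to the full image box [0, 0, W, H].
    """
    height, width = mask.shape
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return [0, 0, width, height]
    return [int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())]
```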

   sam2_trainer.py:
   - Fix logging check: use hasHandlers() instead of checking root logger
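
The `hasHandlers()` pattern from the sam2_trainer.py fix, as a minimal sketch. The helper name, logger name, and format string are illustrative assumptions, not the trainer's actual setup.

```python
import logging


def get_logger(name: str = "samrfi.training") -> logging.Logger:
    """Return a logger, attaching a console handler only if none is reachable."""
    logger = logging.getLogger(name)
    # hasHandlers() checks this logger and its ancestors (while propagation is on),
    # so handlers configured higher up by the application are respected too.
    if not logger.hasHandlers():
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Calling `get_logger` repeatedly does not stack duplicate handlers, which avoids each log line being printed multiple times.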

   configs/validation.yaml:
   - Fix stretch: sqrt → null for synthetic data

   pyproject.toml:
   - Add viz extras for holoviews/datashader visualization tools
   - Add samrfi.visualization package

   docs/batched_dataset_training.md:
   - Fix file extension examples: .npz → .pt

This commit message and the doc updates were all made using Claude Code.
where I have incorporated the calcquality metric. Introducing a test
for the metrics module.
Lazy loading casa and adding a CI setup for pip install without heavy deps
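
One way to lazy-load a heavy optional dependency is sketched below. `lazy_import` is a hypothetical helper, not necessarily what this PR does, and which CASA module is deferred (for example `casatools`) is an assumption.

```python
import importlib
from types import ModuleType


def lazy_import(module_name: str, extra_hint: str = "") -> ModuleType:
    """Import a module on first use, with a clearer error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        message = f"'{module_name}' is required for this feature but is not installed."
        if extra_hint:
            message += f" {extra_hint}"
        raise ImportError(message) from exc
```

Calling `lazy_import("casatools")` inside the function that needs it, rather than at module import time, keeps the package importable in a CI environment that installs only the light dependencies.
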
@preshanth preshanth linked an issue Dec 30, 2025 that may be closed by this pull request
@preshanth preshanth self-assigned this Dec 30, 2025
@preshanth preshanth requested a review from Kitchi December 30, 2025 18:51
@preshanth preshanth merged commit 9b7cccc into main Dec 30, 2025
8 checks passed
Development

Successfully merging this pull request may close these issues.

SAM2 Refactoring + Speedup Attempt
