
Conversation


@Cloakless Cloakless commented Nov 11, 2025

This PR splits the utilities for general experiments into two sets: one in motools/ (for general use) and one in mozoo/experiments (useful for the scripts there, but carrying a weaker recommendation). End users should be able to do everything they need with only the motools/ directory and installation.

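For illustration, the split looks roughly like this from an end user's point of view (a sketch based on the exports described in the reviews below, not the full API):

```python
# Shipped with the motools installation: workflow-level helpers.
from motools.workflows import (
    check_training_status,
    evaluate_model_on_task,
    find_model_from_cache,
    train_variant,
)

# Lives alongside the experiment scripts in mozoo; convenient, but a weaker recommendation.
from mozoo.experiments.utils import (
    ExperimentPaths,
    load_experiment_config,
    print_section,
    results_to_dataframe,
    save_results,
)
```
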
Summary by Sourcery

Extract and centralize common workflow and experiment helper functions into dedicated utility modules to streamline training, evaluation, caching, configuration, result management, and display formatting.

New Features:

  • Add workflow utilities: find_model_from_cache, train_variant, and check_training_status for train_and_evaluate
  • Add evaluate_model_on_task utility for the evaluate_only workflow
  • Introduce mozoo.experiments.utils package with modules for config loading, environment setup, path management, result saving/loading, DataFrame conversion, and display formatting

Enhancements:

  • Restructure workflows package to export new utility functions alongside existing workflow definitions

Chores:

  • Create new mozoo/experiments/utils modules for config, display, paths, and results

Summary by CodeRabbit

  • New Features
    • Workflow utilities for training model variants, checking training status, and retrieving cached models
    • Model evaluation on specific tasks
    • Experiment configuration management with environment setup capabilities
    • Results persistence and conversion to tabular format
    • Progress display utilities for monitoring experiments


sourcery-ai bot commented Nov 11, 2025

Reviewer's Guide

This PR restructures the codebase by extracting and centralizing helper utilities: it enriches the train_and_evaluate and evaluate_only workflows with non-blocking cache queries and orchestration functions, and introduces a new mozoo.experiments.utils package that provides configuration management, display formatting, results handling, and path utilities for experiments.

Sequence diagram for non-blocking model cache lookup in train_and_evaluate workflow

sequenceDiagram
    participant "Caller"
    participant "TrainAndEvaluateConfig"
    participant "StageCache"
    participant "TrainingJobAtom"
    participant "ModelAtom"
    "Caller"->>"TrainAndEvaluateConfig": from_dict(config)
    "Caller"->>"StageCache": get("prepare_dataset", config, {})
    "StageCache"-->>"Caller": cached_dataset_state
    "Caller"->>"StageCache": get("submit_training", config, {prepared_dataset})
    "StageCache"-->>"Caller": cached_submit_state
    "Caller"->>"TrainingJobAtom": load(job_atom_id)
    "TrainingJobAtom"-->>"Caller": job_atom
    "Caller"->>"TrainingJobAtom": get_status()
    "TrainingJobAtom"-->>"Caller": job_status
    "Caller"->>"StageCache": get("wait_for_training", config, {training_job})
    "StageCache"-->>"Caller": cached_wait_state
    "Caller"->>"ModelAtom": load(model_atom_id)
    "ModelAtom"-->>"Caller": model_atom
    "Caller"-->>"Caller": return model_atom_id/status_message

Sequence diagram for evaluating a model on a task using evaluate_only workflow

sequenceDiagram
    participant "Caller"
    participant "EvaluateOnlyConfig"
    participant "evaluate_only_workflow"
    "Caller"->>"EvaluateOnlyConfig": from_dict(config)
    "Caller"->>"evaluate_only_workflow": run_workflow(config, user)
    "evaluate_only_workflow"-->>"Caller": result
    "Caller"->>"Caller": get_step_state("evaluate_model")
    "Caller"-->>"Caller": return eval_results_atom_id

Class diagram for ExperimentPaths utility class

classDiagram
    class ExperimentPaths {
        - experiment_dir: Path
        + __init__(experiment_dir: Path)
        + config_file: Path
        + results_file: Path
        + training_results_file: Path
        + plots_dir: Path
    }
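Read as code, the class amounts to roughly the following (a sketch inferred from the diagram and the review comments below; file names other than config.yaml and plots/ are illustrative guesses):

```python
from pathlib import Path


class ExperimentPaths:
    """Common file locations under a single experiment directory."""

    def __init__(self, experiment_dir: Path) -> None:
        self.experiment_dir = experiment_dir

    @property
    def config_file(self) -> Path:
        return self.experiment_dir / "config.yaml"

    @property
    def results_file(self) -> Path:
        return self.experiment_dir / "results.json"  # illustrative filename

    @property
    def training_results_file(self) -> Path:
        return self.experiment_dir / "training_results.json"  # illustrative filename

    @property
    def plots_dir(self) -> Path:
        # Created on first access; see the side-effect discussion in the review below.
        plots_dir = self.experiment_dir / "plots"
        plots_dir.mkdir(exist_ok=True)
        return plots_dir
```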

Class diagram for results management utilities

classDiagram
    class results {
        + save_results(results: list[dict], path: Path)
        + load_results(path: Path) list[dict]
        + results_to_dataframe(results: list[dict]) pd.DataFrame
    }
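The save/load half of this module is essentially JSON round-tripping; a sketch consistent with the details quoted later in the review (the default=str fallback is discussed in the nitpick comments):

```python
import json
from pathlib import Path
from typing import Any


def save_results(results: list[dict[str, Any]], path: Path) -> None:
    """Write results to JSON; default=str keeps Paths, datetimes, etc. from raising."""
    with path.open("w") as f:
        json.dump(results, f, indent=2, default=str)


def load_results(path: Path) -> list[dict[str, Any]]:
    """Read previously saved results, failing loudly if the file is missing."""
    if not path.exists():
        raise FileNotFoundError(f"No results file at {path}; run the experiment first.")
    with path.open() as f:
        return json.load(f)
```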

Class diagram for configuration management utilities

classDiagram
    class config {
        + get_experiment_dir(script_path: Path) Path
        + setup_experiment_env(experiment_dir: Path)
        + load_experiment_config(experiment_dir: Path, config_file: str) dict
    }
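A sketch of what two of these helpers imply (the real module is mozoo/experiments/utils/config.py; the config.yaml default and the error handling are assumptions based on the comments below, and setup_experiment_env is quoted in full further down in the review):

```python
from pathlib import Path
from typing import Any

import yaml


def get_experiment_dir(script_path: Path) -> Path:
    """Return the directory containing the experiment script, e.g. get_experiment_dir(Path(__file__))."""
    return script_path.parent


def load_experiment_config(experiment_dir: Path, config_file: str = "config.yaml") -> dict[str, Any]:
    """Load the experiment's YAML configuration."""
    config_path = experiment_dir / config_file
    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    with config_path.open() as f:
        return yaml.safe_load(f)
```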

Class diagram for display formatting utilities

classDiagram
    class display {
        + print_section(title: str, width: int)
        + print_subsection(title: str, width: int)
        + print_config_summary(config: dict)
        + print_progress(current: int, total: int, item_name: str)
    }
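In spirit these are thin print wrappers; a minimal sketch (separator characters and default widths are guesses; the print_progress output format follows the behaviour described in the nitpick comments below):

```python
def print_section(title: str, width: int = 80) -> None:
    print("=" * width)
    print(title)
    print("=" * width)


def print_subsection(title: str, width: int = 60) -> None:
    print("-" * width)
    print(title)
    print("-" * width)


def print_progress(current: int, total: int, item_name: str = "item") -> None:
    # Naive pluralisation; see the review note about the docstring mismatch.
    print(f"{item_name.capitalize()}: {current}/{total} {item_name}s")
```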

Class diagram for new workflow utility functions

classDiagram
    class train_and_evaluate {
        + find_model_from_cache(variant_config: dict, training_config: dict, ...)
        + train_variant(variant: dict, training_config: dict, ...)
        + check_training_status(variant_config: dict, training_config: dict, ...)
    }
    class evaluate_only {
        + evaluate_model_on_task(model_atom_id: str, model_id: str, eval_task: str, eval_config: dict, ...)
    }
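Taken together, an experiment script might drive these helpers roughly as follows (illustrative only: the tuple returned by find_model_from_cache, the dict keys returned by train_variant, and the fields on variant are read off the diagrams above and may differ from the actual signatures):

```python
import asyncio

from motools.workflows import (
    check_training_status,
    evaluate_model_on_task,
    find_model_from_cache,
    train_variant,
)


async def train_and_eval_variant(variant: dict, training_config: dict,
                                 eval_task: str, eval_config: dict) -> str:
    # Non-blocking probe of the stage cache.
    cached_model_id, status = await find_model_from_cache(variant["config"], training_config)

    if status == "training_in_progress":
        # Report progress and come back later instead of blocking on the job.
        return await check_training_status(variant["config"], training_config)

    if cached_model_id is not None:
        print(f"Cache hit for {variant['name']}: {cached_model_id}")

    # Run (or, thanks to the stage cache, cheaply replay) the training-only workflow.
    trained = await train_variant(variant, training_config)

    # Evaluate the resulting model on a single task via the evaluate_only workflow.
    return await evaluate_model_on_task(
        trained["model_atom"], trained["model_id"], eval_task, eval_config
    )


# asyncio.run(train_and_eval_variant(variant, training_config, "my_task", {}))
```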

File-Level Changes

Change: Added helper functions for train_and_evaluate workflow
  • Implement find_model_from_cache for non-blocking cache lookup
  • Add train_variant to orchestrate model training per variant
  • Add check_training_status for asynchronous status checks
  Files: motools/workflows/train_and_evaluate.py, motools/workflows/__init__.py

Change: Added helper function for evaluate_only workflow
  • Implement evaluate_model_on_task to run model evaluation via workflow
  Files: motools/workflows/evaluate_only.py, motools/workflows/__init__.py

Change: Introduced mozoo.experiments.utils package
  • Add config.py with experiment directory and YAML loading utilities
  • Add display.py with section, subsection, and progress printing
  • Add results.py to save/load JSON and convert to pandas DataFrame
  • Add paths.py with ExperimentPaths class for common file paths
  • Aggregate exports in experiments/utils/__init__.py (sketched below)
  Files: mozoo/experiments/utils/config.py, mozoo/experiments/utils/display.py, mozoo/experiments/utils/results.py, mozoo/experiments/utils/paths.py, mozoo/experiments/utils/__init__.py
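The aggregation module is essentially a re-export surface; a sketch consistent with the symbols listed in the code graph analysis further down (docstring and ordering are illustrative):

```python
"""Shared helpers for mozoo experiment scripts."""

from .config import get_experiment_dir, load_experiment_config, setup_experiment_env
from .display import print_config_summary, print_progress, print_section, print_subsection
from .paths import ExperimentPaths
from .results import load_results, results_to_dataframe, save_results

__all__ = [
    "ExperimentPaths",
    "get_experiment_dir",
    "load_experiment_config",
    "load_results",
    "print_config_summary",
    "print_progress",
    "print_section",
    "print_subsection",
    "results_to_dataframe",
    "save_results",
    "setup_experiment_env",
]
```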



coderabbitai bot commented Nov 11, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The changes expand the motools.workflows public API with four new helper functions (evaluate_model_on_task, train_variant, find_model_from_cache, check_training_status) that orchestrate evaluation and training workflows, and introduce a new mozoo.experiments.utils package with utilities for configuration management, console display, path handling, and results processing.

Changes

  • Workflow API Expansion (motools/workflows/__init__.py): Expanded module docstring and added four new public imports and exports: evaluate_model_on_task, train_variant, find_model_from_cache, check_training_status.
  • Evaluate-Only Workflow Helper (motools/workflows/evaluate_only.py): Added evaluate_model_on_task() utility function that orchestrates an evaluate-only workflow by constructing config, executing via run_workflow(), and returning evaluation results; includes validation and error handling for missing step results.
  • Train-and-Evaluate Workflow Helpers (motools/workflows/train_and_evaluate.py): Added three async helper functions: train_variant() (runs training-only partial workflow and extracts model info), find_model_from_cache() (probes cache to locate trained models with status checks), and check_training_status() (reports training status from cache without triggering execution). Includes cache-driven logic and status extraction.
  • Experiment Utilities Package (mozoo/experiments/utils/__init__.py): New module that consolidates and re-exports utilities from config, display, paths, and results submodules with comprehensive __all__ definition.
  • Configuration Utilities (mozoo/experiments/utils/config.py): New module with three utilities: get_experiment_dir() (extracts parent directory), setup_experiment_env() (loads .env file from project root), and load_experiment_config() (loads YAML config file).
  • Display Utilities (mozoo/experiments/utils/display.py): New module with four console printing functions: print_section(), print_subsection(), print_config_summary(), and print_progress() for formatting experiment output.
  • Path Management (mozoo/experiments/utils/paths.py): New ExperimentPaths class that manages common experiment file paths (config, results, training results, plots directory) with automatic plots directory creation.
  • Results Processing (mozoo/experiments/utils/results.py): New module with three utilities: save_results() (writes results to JSON), load_results() (reads JSON results), and results_to_dataframe() (converts results to pandas DataFrame with support for nested metrics structures).

Sequence Diagrams

sequenceDiagram
    participant Caller
    participant evaluate_model_on_task
    participant run_workflow
    participant Workflow

    Caller->>evaluate_model_on_task: model_atom_id, model_id, eval_task, eval_config
    activate evaluate_model_on_task
    
    Note over evaluate_model_on_task: Build EvaluateOnlyConfig<br/>with prepared_model, prepared_task,<br/>evaluate_model steps
    
    evaluate_model_on_task->>run_workflow: Execute workflow
    activate run_workflow
    run_workflow->>Workflow: Execute steps
    Workflow-->>run_workflow: workflow_results
    run_workflow-->>evaluate_model_on_task: results
    deactivate run_workflow
    
    Note over evaluate_model_on_task: Extract eval_results<br/>from evaluate_model step
    
    evaluate_model_on_task->>Caller: return eval_results
    deactivate evaluate_model_on_task
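A rough code-level rendering of the diagram above (the config keys, the user argument's default, and the state attribute names are guesses; the actual helper in motools/workflows/evaluate_only.py also validates inputs and handles missing step results):

```python
from motools.workflow.execution import run_workflow
from motools.workflows import EvaluateOnlyConfig


async def evaluate_model_on_task_sketch(model_atom_id: str, model_id: str,
                                         eval_task: str, eval_config: dict,
                                         user: str = "experiments") -> str:
    # Build an evaluate-only config pointing at an already-trained model.
    # The dictionary keys below are illustrative, not the real schema.
    config = EvaluateOnlyConfig.from_dict(
        {
            "model_atom_id": model_atom_id,
            "model_id": model_id,
            "eval_task": eval_task,
            **eval_config,
        }
    )

    # Execute the prepared_model -> prepared_task -> evaluate_model steps.
    result = await run_workflow(config, user)

    # Pull the evaluation results atom out of the final step's state.
    step_state = result.get_step_state("evaluate_model")
    if step_state is None:
        raise RuntimeError("evaluate_model step produced no results")
    return step_state.eval_results_atom_id  # attribute name illustrative
```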
sequenceDiagram
    participant Caller
    participant train_variant
    participant run_workflow
    participant Workflow

    Caller->>train_variant: variant, training_config
    activate train_variant
    
    Note over train_variant: Build TrainAndEvaluateConfig<br/>with training stages only<br/>(prepare_dataset, submit_training, wait)
    
    train_variant->>run_workflow: Execute partial workflow
    activate run_workflow
    run_workflow->>Workflow: Execute training steps
    Workflow-->>run_workflow: workflow_results
    run_workflow-->>train_variant: results
    deactivate run_workflow
    
    Note over train_variant: Extract model_atom<br/>and model_id from results
    
    train_variant->>Caller: return {model_id, model_atom, ...}
    deactivate train_variant
sequenceDiagram
    participant Caller
    participant find_model_from_cache
    participant StageCache

    Caller->>find_model_from_cache: variant_config, training_config
    activate find_model_from_cache
    
    Note over find_model_from_cache: Check prepare_dataset<br/>in cache
    find_model_from_cache->>StageCache: Query step result
    StageCache-->>find_model_from_cache: result or None
    
    alt Dataset Missing
        find_model_from_cache->>Caller: return (None, "dataset_missing")
    end
    
    Note over find_model_from_cache: Check submit_training<br/>in cache
    find_model_from_cache->>StageCache: Query step result
    StageCache-->>find_model_from_cache: result or None
    
    alt Training Not Submitted
        find_model_from_cache->>Caller: return (None, "training_not_submitted")
    end
    
    Note over find_model_from_cache: Check wait_for_training<br/>status in cache
    find_model_from_cache->>StageCache: Query training status
    StageCache-->>find_model_from_cache: status
    
    alt Training In Progress
        find_model_from_cache->>Caller: return (None, "training_in_progress")
    else Training Complete
        Note over find_model_from_cache: Extract model_id and<br/>training_job_atom
        find_model_from_cache->>Caller: return (model_id, "success")
    end
    
    deactivate find_model_from_cache
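In code, the probe sequence above maps onto something like this (a pseudocode-level sketch: StageCache, TrainingJobAtom, and ModelAtom are real motools classes per the code graph below, but the constructor arguments, state keys, and status strings used here are illustrative):

```python
from motools.atom import ModelAtom, TrainingJobAtom
from motools.cache.stage_cache import StageCache


async def find_model_from_cache_sketch(
    variant_config: dict, training_config: dict
) -> tuple[str | None, str]:
    cache = StageCache()  # construction details are illustrative
    config = {**variant_config, **training_config}  # how the real helper combines these is a guess

    dataset_state = cache.get("prepare_dataset", config, {})
    if dataset_state is None:
        return None, "dataset_missing"

    submit_state = cache.get("submit_training", config, {"prepared_dataset": dataset_state})
    if submit_state is None:
        return None, "training_not_submitted"

    # Check the live job status without blocking on completion.
    job_atom = TrainingJobAtom.load(submit_state["job_atom_id"])  # key name illustrative
    job_status = job_atom.get_status()  # may be async in the real code
    if job_status != "succeeded":       # status string illustrative
        return None, "training_in_progress"

    wait_state = cache.get("wait_for_training", config, {"training_job": job_atom})
    model_atom = ModelAtom.load(wait_state["model_atom_id"])  # key name illustrative
    return model_atom.get_model_id(), "success"
```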

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

  • motools/workflows/train_and_evaluate.py: Complex multi-step cache operations with status checking and early returns; requires careful verification of cache traversal logic and error handling across find_model_from_cache, train_variant, and check_training_status.
  • motools/workflows/evaluate_only.py: Workflow orchestration with step extraction and error handling; verify correct config construction and result extraction.
  • mozoo/experiments/utils/results.py: Non-trivial DataFrame construction with nested metric structure handling; verify metric parsing logic handles both flat and nested metric formats correctly.
  • Path integration in mozoo/experiments/utils/paths.py: Side effect of auto-creating plots directory on property access; verify this is the intended behavior and won't cause issues with concurrent access.


Poem

🐰 Workflows unite with helpers so keen,
Experiments bloom with utilities clean—
Cache-checking rabbits hop through the store,
Results and configs dance on the floor!
A toolkit to train, evaluate, display!

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Title check (❓ Inconclusive): The title 'Split out utils' is vague and generic, lacking specificity about which utils are being split, where they're being split from/to, or why this change matters. Resolution: consider a more descriptive title like 'Extract experiment utilities into separate modules' or 'Expose workflow helpers and experiment utils in public API' to better convey the actual changes.
✅ Passed checks (2 passed)
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.



@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • There’s a lot of repeated TrainAndEvaluateConfig.from_dict logic across the new workflow utilities—consider extracting a shared helper to DRY up the config construction and reduce drift.
  • The hardcoded dummy_task_loader default appears in multiple function signatures; extracting it into a single module‐level constant or configuration entry would improve consistency and make updates easier.
  • Many utility functions catch broad Exception; narrowing these to specific error types or at least logging the full traceback could make debugging unexpected failures much easier.
## Individual Comments

### Comment 1
<location> `mozoo/experiments/utils/paths.py:39` </location>
<code_context>
+    def plots_dir(self) -> Path:
+        """Path to plots directory (creates if doesn't exist)."""
+        plots_dir = self.experiment_dir / "plots"
+        plots_dir.mkdir(exist_ok=True)
+        return plots_dir
</code_context>

<issue_to_address>
**suggestion:** Consider using exist_ok=True and parents=True for mkdir to handle nested experiment directories.

Using parents=True ensures all parent directories are created, improving robustness if experiment_dir or its parents are missing.

```suggestion
        plots_dir.mkdir(exist_ok=True, parents=True)
```
</issue_to_address>

### Comment 2
<location> `mozoo/experiments/utils/results.py:75` </location>
<code_context>
def results_to_dataframe(results: list[dict[str, Any]]) -> pd.DataFrame:
    """Convert results to a pandas DataFrame for analysis.

    Args:
        results: List of evaluation results

    Returns:
        DataFrame with columns: variant_name, trait, strength, task, metric_name, metric_value

    Example:
        >>> df = results_to_dataframe(results)
        >>> df.groupby("trait")["value"].mean()
    """
    rows = []
    for result in results:
        variant_name = result["variant_name"]
        trait = result.get("trait")
        strength = result.get("strength")
        model_id = result.get("model_id")

        for task_name, task_result in result.get("evaluations", {}).items():
            if "error" in task_result:
                continue

            metrics = task_result.get("metrics", {})

            # Handle case where metrics dict directly contains "mean" and "stderr"
            if set(metrics.keys()) == {"mean", "stderr"} or set(metrics.keys()) == {"mean"}:
                metric_name = task_name
                value = metrics.get("mean")
                stderr = metrics.get("stderr", 0)

                rows.append(
                    {
                        "variant_name": variant_name,
                        "trait": trait,
                        "strength": strength,
                        "model_id": model_id,
                        "task": task_name,
                        "metric": metric_name,
                        "value": value,
                        "stderr": stderr,
                    }
                )
            else:
                # Normal case: metrics dict contains metric names as keys
                for metric_name, metric_value in metrics.items():
                    if isinstance(metric_value, dict):
                        if "mean" in metric_value:
                            value = metric_value["mean"]
                            stderr = metric_value.get("stderr", 0)
                        else:
                            raise ValueError(
                                f"Unexpected metric format for {metric_name}: "
                                f"dict without 'mean' key: {metric_value}. "
                                f"Expected dict with 'mean' (and optionally 'stderr') or a numeric value."
                            )
                    else:
                        value = metric_value
                        stderr = 0

                    rows.append(
                        {
                            "variant_name": variant_name,
                            "trait": trait,
                            "strength": strength,
                            "model_id": model_id,
                            "task": task_name,
                            "metric": metric_name,
                            "value": value,
                            "stderr": stderr,
                        }
                    )

    return pd.DataFrame(rows)

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Replace multiple comparisons of same variable with `in` operator ([`merge-comparisons`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/merge-comparisons/))
- Swap if/else branches ([`swap-if-else-branches`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/swap-if-else-branches/))
- Remove unnecessary else after guard condition ([`remove-unnecessary-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-unnecessary-else/))
</issue_to_address>



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (9)
mozoo/experiments/utils/results.py (3)

10-21: Consider more explicit serialization handling.

The default=str fallback may silently convert non-serializable objects (e.g., datetime, Path, custom classes) in potentially unexpected ways. While convenient, this can make debugging serialization issues harder.

Consider one of these approaches:

  1. Use a custom serializer that explicitly handles expected types
  2. Add a docstring note documenting the str-conversion behavior
  3. Log a warning when the fallback is used (though this requires hooking into json.dump)

Example custom serializer:

def _serialize_default(obj: Any) -> str:
    """Convert non-serializable objects to strings."""
    if isinstance(obj, (Path, datetime)):
        return str(obj)
    # Add other expected types here
    return str(obj)

Then use json.dump(results, f, indent=2, default=_serialize_default)


75-91: Consider relaxing the exact set equality check.

The condition set(metrics.keys()) == {"mean", "stderr"} uses exact equality, meaning a metrics dict with additional keys (e.g., {"mean": 0.5, "stderr": 0.1, "n_samples": 100}) would fall through to the else branch and be processed differently. This could be surprising.

Consider either:

  1. Documenting this exact-match behavior in a comment
  2. Changing to subset check if the intent is to detect the "compact" format: {"mean", "stderr"}.issuperset(metrics.keys())

Example:

-            # Handle case where metrics dict directly contains "mean" and "stderr"
-            if set(metrics.keys()) == {"mean", "stderr"} or set(metrics.keys()) == {"mean"}:
+            # Handle compact format: metrics dict contains only "mean" (and optionally "stderr")
+            # Note: This check requires exact key match - additional keys will use normal format
+            if set(metrics.keys()) == {"mean", "stderr"} or set(metrics.keys()) == {"mean"}:

78-78: Consider using None instead of 0 for missing stderr.

When stderr is not present in the data, defaulting to 0 might be misleading as it suggests a measured stderr of zero rather than missing data. Using None or pd.NA would be more semantically accurate.

Apply this diff:

-                stderr = metrics.get("stderr", 0)
+                stderr = metrics.get("stderr", None)

And:

-                            stderr = metric_value.get("stderr", 0)
+                            stderr = metric_value.get("stderr", None)

And:

-                        stderr = 0
+                        stderr = None

Also applies to: 98-98

mozoo/experiments/utils/display.py (2)

37-61: Hardcoded filename in output.

Line 50 prints Config file: config.yaml but this string is hardcoded. If a caller uses a different config file name (via the load_experiment_config parameter), this output would be misleading.

Consider either:

  1. Removing the hardcoded filename line
  2. Adding a parameter to accept the actual config file path/name
  3. Adding a comment noting this assumes the default config.yaml

Example with parameter:

-def print_config_summary(config: dict[str, Any]) -> None:
+def print_config_summary(config: dict[str, Any], config_file: str = "config.yaml") -> None:
     """Print a summary of experiment configuration.

     Args:
         config: Configuration dictionary
+        config_file: Name of the config file (default: "config.yaml")

Then:

-    print("  Config file: config.yaml")
+    print(f"  Config file: {config_file}")

64-76: Naive pluralization and docstring mismatch in progress output.

Line 76 always appends 's' to item_name, which works for most English nouns, but the output prefix doesn't match the example in the docstring. The current code produces:

Model: 1/5 models

while the docstring example shows:

Training: 1/5 models

Consider clarifying the docstring or adjusting the implementation to match the intended usage.

mozoo/experiments/utils/paths.py (1)

35-40: Property with side effect may be surprising.

The plots_dir property creates the directory on access (line 39), which is a side effect that may be surprising for a property getter. Users might not expect a simple property access to perform filesystem operations.

Consider one of these approaches:

  1. Document the side effect clearly (minimal change):
     @property
     def plots_dir(self) -> Path:
-        """Path to plots directory (creates if doesn't exist)."""
+        """Path to plots directory.
+        
+        Note: Automatically creates the directory on first access if it doesn't exist.
+        """
  2. Move creation to a method (more explicit):
@property
def plots_dir(self) -> Path:
    """Path to plots directory."""
    return self.experiment_dir / "plots"

def ensure_plots_dir(self) -> Path:
    """Ensure plots directory exists and return path."""
    plots_dir = self.plots_dir
    plots_dir.mkdir(exist_ok=True)
    return plots_dir
  3. Use cached_property to avoid repeated mkdir calls:
from functools import cached_property

@cached_property
def plots_dir(self) -> Path:
    """Path to plots directory (creates if doesn't exist)."""
    plots_dir = self.experiment_dir / "plots"
    plots_dir.mkdir(exist_ok=True)
    return plots_dir
motools/workflows/train_and_evaluate.py (2)

317-322: Skip the dummy prepare_task stage in training-only runs.

The training path never consumes prepared_task, so running that stage just invokes the dummy loader, creates throwaway atoms, and risks import failures if the placeholder task isn’t present. Dropping it keeps the helper lean and avoids an unnecessary dependency.

-    training_stages = [
-        "prepare_dataset",
-        "prepare_task",
-        "submit_training",
-        "wait_for_training",
-    ]
+    training_stages = [
+        "prepare_dataset",
+        "submit_training",
+        "wait_for_training",
+    ]

482-487: Broaden the refresh fallback to cover NotImplementedError.

Some backends define refresh() but raise NotImplementedError (rather than omitting the method). At the moment that bubbles into the outer except block and misreports the status as an error. Treat it like the missing-method case so we still surface a valid status.

-        try:
-            await job_atom.refresh()
-        except AttributeError:
+        try:
+            await job_atom.refresh()
+        except (AttributeError, NotImplementedError):
motools/workflows/evaluate_only.py (1)

75-124: Validate model_atom_id before running the workflow.

We accept both the atom ID and the raw model_id, but never check that they match. A typo slips through silently and we end up evaluating an unexpected model. Load the atom up front and assert the IDs line up so callers get a fast, actionable failure.

-from typing import Any
+from typing import Any
+
+from motools.atom import ModelAtom
@@
 async def evaluate_model_on_task(
     model_atom_id: str,
     model_id: str,
@@
 ) -> str:
     """Evaluate a model on a specific task using the evaluate_only workflow."""
+    model_atom = ModelAtom.load(model_atom_id)
+    expected_model_id = model_atom.get_model_id()
+    if expected_model_id != model_id:
+        raise ValueError(
+            f"model_id '{model_id}' does not match stored model_id '{expected_model_id}' for atom {model_atom_id}"
+        )
 
     config = EvaluateOnlyConfig.from_dict(
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c8ce76 and a666d73.

📒 Files selected for processing (8)
  • motools/workflows/__init__.py (1 hunks)
  • motools/workflows/evaluate_only.py (3 hunks)
  • motools/workflows/train_and_evaluate.py (3 hunks)
  • mozoo/experiments/utils/__init__.py (1 hunks)
  • mozoo/experiments/utils/config.py (1 hunks)
  • mozoo/experiments/utils/display.py (1 hunks)
  • mozoo/experiments/utils/paths.py (1 hunks)
  • mozoo/experiments/utils/results.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
motools/workflows/__init__.py (2)
motools/workflows/evaluate_only.py (2)
  • EvaluateOnlyConfig (25-36)
  • evaluate_model_on_task (75-137)
motools/workflows/train_and_evaluate.py (4)
  • TrainAndEvaluateConfig (32-47)
  • check_training_status (350-555)
  • find_model_from_cache (99-238)
  • train_variant (241-347)
mozoo/experiments/utils/config.py (2)
motools/system.py (1)
  • load_dotenv (12-38)
mozoo/experiments/utils/paths.py (1)
  • config_file (21-23)
motools/workflows/train_and_evaluate.py (4)
motools/atom/base.py (4)
  • ModelAtom (563-632)
  • TrainingJobAtom (636-773)
  • get_model_id (620-632)
  • _load_training_run (700-739)
motools/cache/stage_cache.py (1)
  • StageCache (39-319)
motools/workflow/execution.py (1)
  • run_workflow (18-56)
motools/workflow/state.py (1)
  • get_step_state (77-89)
motools/workflows/evaluate_only.py (3)
motools/workflow/execution.py (1)
  • run_workflow (18-56)
motools/workflow/config.py (1)
  • from_dict (62-111)
motools/workflow/state.py (1)
  • get_step_state (77-89)
mozoo/experiments/utils/__init__.py (4)
mozoo/experiments/utils/config.py (3)
  • get_experiment_dir (10-24)
  • load_experiment_config (46-70)
  • setup_experiment_env (27-43)
mozoo/experiments/utils/display.py (4)
  • print_config_summary (37-61)
  • print_progress (64-76)
  • print_section (6-19)
  • print_subsection (22-34)
mozoo/experiments/utils/paths.py (1)
  • ExperimentPaths (6-40)
mozoo/experiments/utils/results.py (3)
  • load_results (24-45)
  • results_to_dataframe (48-122)
  • save_results (10-21)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Sourcery review
🔇 Additional comments (9)
mozoo/experiments/utils/results.py (1)

24-45: LGTM!

The function correctly checks for file existence and provides a helpful error message to guide users.

mozoo/experiments/utils/display.py (2)

6-19: LGTM!

Simple and effective section header formatting.


22-34: LGTM!

Simple and effective subsection header formatting.

mozoo/experiments/utils/config.py (2)

10-24: LGTM!

Simple utility to extract the parent directory from a script path.


46-70: LGTM!

The function correctly loads YAML configuration with appropriate error handling.

mozoo/experiments/utils/paths.py (2)

6-18: LGTM!

Clean initialization of the path manager.


20-33: LGTM!

Simple and clear path properties for configuration and results files.

mozoo/experiments/utils/__init__.py (1)

1-41: LGTM!

The public API module correctly re-exports all utilities from submodules and provides a clear, well-documented interface for the utils package.

motools/workflows/__init__.py (1)

19-30: Public surface looks tidy.

Re-exporting the new helpers alongside the configs keeps the package API predictable.

Comment on lines +27 to +43
def setup_experiment_env(experiment_dir: Path) -> None:
    """Set up environment variables for an experiment.

    Loads .env file from the project root (3 levels up from experiment_dir).

    Args:
        experiment_dir: Path to the experiment directory

    Example:
        >>> setup_experiment_env(EXPERIMENT_DIR)
    """
    # .env is typically in project root (3 levels up from experiment)
    # For mozoo/experiments/trait_expression -> motools (repo root)
    project_root = experiment_dir.parent.parent.parent
    env_file = project_root / ".env"
    if env_file.exists():
        load_dotenv(env_file)


⚠️ Potential issue | 🟠 Major

Duplicates functionality and has fragile path traversal.

This function duplicates motools.system.load_dotenv() (which is already called on motools import) but with several issues:

  1. Hardcoded 3-level traversal (line 40) assumes a fixed directory structure. If the experiment directory depth changes, this breaks.
  2. Missing error handling: Unlike motools.system.load_dotenv() which has try/except for ImportError, this assumes python-dotenv is installed.
  3. Potential double-loading: Since motools.system.load_dotenv() is called automatically on import, this may load the .env file twice.

Consider one of these approaches:

  1. Reuse existing functionality (recommended):
 def setup_experiment_env(experiment_dir: Path) -> None:
-    """Set up environment variables for an experiment.
-
-    Loads .env file from the project root (3 levels up from experiment_dir).
-
-    Args:
-        experiment_dir: Path to the experiment directory
-
-    Example:
-        >>> setup_experiment_env(EXPERIMENT_DIR)
-    """
-    # .env is typically in project root (3 levels up from experiment)
-    # For mozoo/experiments/trait_expression -> motools (repo root)
-    project_root = experiment_dir.parent.parent.parent
-    env_file = project_root / ".env"
-    if env_file.exists():
-        load_dotenv(env_file)
+    """Set up environment variables for an experiment.
+
+    Note: motools.system.load_dotenv() is called automatically on import,
+    so this is typically unnecessary. Kept for explicit calls if needed.
+
+    Args:
+        experiment_dir: Path to the experiment directory
+
+    Example:
+        >>> setup_experiment_env(EXPERIMENT_DIR)
+    """
+    from motools.system import load_dotenv as motools_load_dotenv
+    motools_load_dotenv()
  2. Or use robust traversal like the motools version:
def setup_experiment_env(experiment_dir: Path) -> None:
    """Set up environment variables for an experiment."""
    try:
        # Search upward from experiment_dir for .env file
        for directory in [experiment_dir, *experiment_dir.parents]:
            env_file = directory / ".env"
            if env_file.exists():
                load_dotenv(env_file)
                return
    except ImportError:
        pass  # python-dotenv not installed

Comment on lines +48 to +59
def results_to_dataframe(results: list[dict[str, Any]]) -> pd.DataFrame:
    """Convert results to a pandas DataFrame for analysis.

    Args:
        results: List of evaluation results

    Returns:
        DataFrame with columns: variant_name, trait, strength, task, metric_name, metric_value

    Example:
        >>> df = results_to_dataframe(results)
        >>> df.groupby("trait")["value"].mean()


⚠️ Potential issue | 🟡 Minor

Fix docstring to match actual column names.

The docstring states the DataFrame will have columns metric_name and metric_value, but the actual implementation uses metric and value.

Apply this diff:

-    Returns:
-        DataFrame with columns: variant_name, trait, strength, task, metric_name, metric_value
+    Returns:
+        DataFrame with columns: variant_name, trait, strength, model_id, task, metric, value, stderr
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def results_to_dataframe(results: list[dict[str, Any]]) -> pd.DataFrame:
    """Convert results to a pandas DataFrame for analysis.

    Args:
        results: List of evaluation results

    Returns:
        DataFrame with columns: variant_name, trait, strength, model_id, task, metric, value, stderr

    Example:
        >>> df = results_to_dataframe(results)
        >>> df.groupby("trait")["value"].mean()
🤖 Prompt for AI Agents
In mozoo/experiments/utils/results.py around lines 48 to 59, the docstring
claims the DataFrame columns are `metric_name` and `metric_value` but the
function actually produces `metric` and `value`; update the docstring (Returns
and any Examples) to list the correct column names: variant_name, trait,
strength, task, metric, value, and adjust the example usage text if it
references the old names.

@Cloakless Cloakless assigned Cloakless and dtch1997 and unassigned Cloakless Nov 11, 2025
@dtch1997 dtch1997 merged commit 354547e into main Nov 12, 2025
6 checks passed