
Conversation


@Cloakless Cloakless commented Nov 11, 2025

This PR splits the utilities for general experiments into two sets: one in motools/ (for general use) and one in mozoo/experiments (useful for the scripts there, but carrying a weaker recommendation). End users should be able to do everything they need with only the motools/ directory and installation.

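For illustration, the split looks roughly like this from an end user's point of view (a sketch based on the exports described in the reviews below, not the full API):

```python
# Shipped with the motools installation: workflow-level helpers.
from motools.workflows import (
    check_training_status,
    evaluate_model_on_task,
    find_model_from_cache,
    train_variant,
)

# Lives alongside the experiment scripts in mozoo; convenient, but a weaker recommendation.
from mozoo.experiments.utils import (
    ExperimentPaths,
    load_experiment_config,
    print_section,
    results_to_dataframe,
    save_results,
)
```
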
Summary by Sourcery

Extract and centralize common workflow and experiment helper functions into dedicated utility modules to streamline training, evaluation, caching, configuration, result management, and display formatting.

New Features:

  • Add workflow utilities: find_model_from_cache, train_variant, and check_training_status for train_and_evaluate
  • Add evaluate_model_on_task utility for the evaluate_only workflow
  • Introduce mozoo.experiments.utils package with modules for config loading, environment setup, path management, result saving/loading, DataFrame conversion, and display formatting

Enhancements:

  • Restructure workflows package to export new utility functions alongside existing workflow definitions

Chores:

  • Create new mozoo/experiments/utils modules for config, display, paths, and results

Summary by CodeRabbit

  • New Features
    • Workflow utilities for training model variants, checking training status, and retrieving cached models
    • Model evaluation on specific tasks
    • Experiment configuration management with environment setup capabilities
    • Results persistence and conversion to tabular format
    • Progress display utilities for monitoring experiments


sourcery-ai bot commented Nov 11, 2025

Reviewer's Guide

This PR restructures the codebase by extracting and centralizing helper utilities: it enriches the train_and_evaluate and evaluate_only workflows with non-blocking cache queries and orchestration functions, and introduces a new mozoo.experiments.utils package that provides configuration management, display formatting, results handling, and path utilities for experiments.

Sequence diagram for non-blocking model cache lookup in train_and_evaluate workflow

sequenceDiagram
    participant "Caller"
    participant "TrainAndEvaluateConfig"
    participant "StageCache"
    participant "TrainingJobAtom"
    participant "ModelAtom"
    "Caller"->>"TrainAndEvaluateConfig": from_dict(config)
    "Caller"->>"StageCache": get("prepare_dataset", config, {})
    "StageCache"-->>"Caller": cached_dataset_state
    "Caller"->>"StageCache": get("submit_training", config, {prepared_dataset})
    "StageCache"-->>"Caller": cached_submit_state
    "Caller"->>"TrainingJobAtom": load(job_atom_id)
    "TrainingJobAtom"-->>"Caller": job_atom
    "Caller"->>"TrainingJobAtom": get_status()
    "TrainingJobAtom"-->>"Caller": job_status
    "Caller"->>"StageCache": get("wait_for_training", config, {training_job})
    "StageCache"-->>"Caller": cached_wait_state
    "Caller"->>"ModelAtom": load(model_atom_id)
    "ModelAtom"-->>"Caller": model_atom
    "Caller"-->>"Caller": return model_atom_id/status_message

Sequence diagram for evaluating a model on a task using evaluate_only workflow

sequenceDiagram
    participant "Caller"
    participant "EvaluateOnlyConfig"
    participant "evaluate_only_workflow"
    "Caller"->>"EvaluateOnlyConfig": from_dict(config)
    "Caller"->>"evaluate_only_workflow": run_workflow(config, user)
    "evaluate_only_workflow"-->>"Caller": result
    "Caller"->>"Caller": get_step_state("evaluate_model")
    "Caller"-->>"Caller": return eval_results_atom_id

Class diagram for ExperimentPaths utility class

classDiagram
    class ExperimentPaths {
        - experiment_dir: Path
        + __init__(experiment_dir: Path)
        + config_file: Path
        + results_file: Path
        + training_results_file: Path
        + plots_dir: Path
    }
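Read as code, the class amounts to roughly the following (a sketch inferred from the diagram and the review comments below; file names other than config.yaml and plots/ are illustrative guesses):

```python
from pathlib import Path


class ExperimentPaths:
    """Common file locations under a single experiment directory."""

    def __init__(self, experiment_dir: Path) -> None:
        self.experiment_dir = experiment_dir

    @property
    def config_file(self) -> Path:
        return self.experiment_dir / "config.yaml"

    @property
    def results_file(self) -> Path:
        return self.experiment_dir / "results.json"  # illustrative filename

    @property
    def training_results_file(self) -> Path:
        return self.experiment_dir / "training_results.json"  # illustrative filename

    @property
    def plots_dir(self) -> Path:
        # Created on first access; see the side-effect discussion in the review below.
        plots_dir = self.experiment_dir / "plots"
        plots_dir.mkdir(exist_ok=True)
        return plots_dir
```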

Class diagram for results management utilities

classDiagram
    class results {
        + save_results(results: list[dict], path: Path)
        + load_results(path: Path) list[dict]
        + results_to_dataframe(results: list[dict]) pd.DataFrame
    }
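The save/load half of this module is essentially JSON round-tripping; a sketch consistent with the details quoted later in the review (the default=str fallback is discussed in the nitpick comments):

```python
import json
from pathlib import Path
from typing import Any


def save_results(results: list[dict[str, Any]], path: Path) -> None:
    """Write results to JSON; default=str keeps Paths, datetimes, etc. from raising."""
    with path.open("w") as f:
        json.dump(results, f, indent=2, default=str)


def load_results(path: Path) -> list[dict[str, Any]]:
    """Read previously saved results, failing loudly if the file is missing."""
    if not path.exists():
        raise FileNotFoundError(f"No results file at {path}; run the experiment first.")
    with path.open() as f:
        return json.load(f)
```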

Class diagram for configuration management utilities

classDiagram
    class config {
        + get_experiment_dir(script_path: Path) Path
        + setup_experiment_env(experiment_dir: Path)
        + load_experiment_config(experiment_dir: Path, config_file: str) dict
    }
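A sketch of what two of these helpers imply (the real module is mozoo/experiments/utils/config.py; the config.yaml default and the error handling are assumptions based on the comments below, and setup_experiment_env is quoted in full further down in the review):

```python
from pathlib import Path
from typing import Any

import yaml


def get_experiment_dir(script_path: Path) -> Path:
    """Return the directory containing the experiment script, e.g. get_experiment_dir(Path(__file__))."""
    return script_path.parent


def load_experiment_config(experiment_dir: Path, config_file: str = "config.yaml") -> dict[str, Any]:
    """Load the experiment's YAML configuration."""
    config_path = experiment_dir / config_file
    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    with config_path.open() as f:
        return yaml.safe_load(f)
```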

Class diagram for display formatting utilities

classDiagram
    class display {
        + print_section(title: str, width: int)
        + print_subsection(title: str, width: int)
        + print_config_summary(config: dict)
        + print_progress(current: int, total: int, item_name: str)
    }
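In spirit these are thin print wrappers; a minimal sketch (separator characters and default widths are guesses; the print_progress output format follows the behaviour described in the nitpick comments below):

```python
def print_section(title: str, width: int = 80) -> None:
    print("=" * width)
    print(title)
    print("=" * width)


def print_subsection(title: str, width: int = 60) -> None:
    print("-" * width)
    print(title)
    print("-" * width)


def print_progress(current: int, total: int, item_name: str = "item") -> None:
    # Naive pluralisation; see the review note about the docstring mismatch.
    print(f"{item_name.capitalize()}: {current}/{total} {item_name}s")
```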

Class diagram for new workflow utility functions

classDiagram
    class train_and_evaluate {
        + find_model_from_cache(variant_config: dict, training_config: dict, ...)
        + train_variant(variant: dict, training_config: dict, ...)
        + check_training_status(variant_config: dict, training_config: dict, ...)
    }
    class evaluate_only {
        + evaluate_model_on_task(model_atom_id: str, model_id: str, eval_task: str, eval_config: dict, ...)
    }
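Taken together, an experiment script might drive these helpers roughly as follows (illustrative only: the tuple returned by find_model_from_cache, the dict keys returned by train_variant, and the fields on variant are read off the diagrams above and may differ from the actual signatures):

```python
import asyncio

from motools.workflows import (
    check_training_status,
    evaluate_model_on_task,
    find_model_from_cache,
    train_variant,
)


async def train_and_eval_variant(variant: dict, training_config: dict,
                                 eval_task: str, eval_config: dict) -> str:
    # Non-blocking probe of the stage cache.
    cached_model_id, status = await find_model_from_cache(variant["config"], training_config)

    if status == "training_in_progress":
        # Report progress and come back later instead of blocking on the job.
        return await check_training_status(variant["config"], training_config)

    if cached_model_id is not None:
        print(f"Cache hit for {variant['name']}: {cached_model_id}")

    # Run (or, thanks to the stage cache, cheaply replay) the training-only workflow.
    trained = await train_variant(variant, training_config)

    # Evaluate the resulting model on a single task via the evaluate_only workflow.
    return await evaluate_model_on_task(
        trained["model_atom"], trained["model_id"], eval_task, eval_config
    )


# asyncio.run(train_and_eval_variant(variant, training_config, "my_task", {}))
```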

File-Level Changes

Change: Added helper functions for train_and_evaluate workflow
  • Implement find_model_from_cache for non-blocking cache lookup
  • Add train_variant to orchestrate model training per variant
  • Add check_training_status for asynchronous status checks
  Files: motools/workflows/train_and_evaluate.py, motools/workflows/__init__.py

Change: Added helper function for evaluate_only workflow
  • Implement evaluate_model_on_task to run model evaluation via workflow
  Files: motools/workflows/evaluate_only.py, motools/workflows/__init__.py

Change: Introduced mozoo.experiments.utils package
  • Add config.py with experiment directory and YAML loading utilities
  • Add display.py with section, subsection, and progress printing
  • Add results.py to save/load JSON and convert to pandas DataFrame
  • Add paths.py with ExperimentPaths class for common file paths
  • Aggregate exports in experiments/utils/__init__.py (sketched below)
  Files: mozoo/experiments/utils/config.py, mozoo/experiments/utils/display.py, mozoo/experiments/utils/results.py, mozoo/experiments/utils/paths.py, mozoo/experiments/utils/__init__.py
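The aggregation module is essentially a re-export surface; a sketch consistent with the symbols listed in the code graph analysis further down (docstring and ordering are illustrative):

```python
"""Shared helpers for mozoo experiment scripts."""

from .config import get_experiment_dir, load_experiment_config, setup_experiment_env
from .display import print_config_summary, print_progress, print_section, print_subsection
from .paths import ExperimentPaths
from .results import load_results, results_to_dataframe, save_results

__all__ = [
    "ExperimentPaths",
    "get_experiment_dir",
    "load_experiment_config",
    "load_results",
    "print_config_summary",
    "print_progress",
    "print_section",
    "print_subsection",
    "results_to_dataframe",
    "save_results",
    "setup_experiment_env",
]
```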



coderabbitai bot commented Nov 11, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The changes expand the motools.workflows public API with four new helper functions (evaluate_model_on_task, train_variant, find_model_from_cache, check_training_status) that orchestrate evaluation and training workflows, and introduce a new mozoo.experiments.utils package with utilities for configuration management, console display, path handling, and results processing.

Changes

  • Workflow API Expansion (motools/workflows/__init__.py): Expanded module docstring and added four new public imports and exports: evaluate_model_on_task, train_variant, find_model_from_cache, check_training_status.
  • Evaluate-Only Workflow Helper (motools/workflows/evaluate_only.py): Added evaluate_model_on_task() utility function that orchestrates an evaluate-only workflow by constructing config, executing via run_workflow(), and returning evaluation results; includes validation and error handling for missing step results.
  • Train-and-Evaluate Workflow Helpers (motools/workflows/train_and_evaluate.py): Added three async helper functions: train_variant() (runs training-only partial workflow and extracts model info), find_model_from_cache() (probes cache to locate trained models with status checks), and check_training_status() (reports training status from cache without triggering execution). Includes cache-driven logic and status extraction.
  • Experiment Utilities Package (mozoo/experiments/utils/__init__.py): New module that consolidates and re-exports utilities from config, display, paths, and results submodules with comprehensive __all__ definition.
  • Configuration Utilities (mozoo/experiments/utils/config.py): New module with three utilities: get_experiment_dir() (extracts parent directory), setup_experiment_env() (loads .env file from project root), and load_experiment_config() (loads YAML config file).
  • Display Utilities (mozoo/experiments/utils/display.py): New module with four console printing functions: print_section(), print_subsection(), print_config_summary(), and print_progress() for formatting experiment output.
  • Path Management (mozoo/experiments/utils/paths.py): New ExperimentPaths class that manages common experiment file paths (config, results, training results, plots directory) with automatic plots directory creation.
  • Results Processing (mozoo/experiments/utils/results.py): New module with three utilities: save_results() (writes results to JSON), load_results() (reads JSON results), and results_to_dataframe() (converts results to pandas DataFrame with support for nested metrics structures).

Sequence Diagrams

sequenceDiagram
    participant Caller
    participant evaluate_model_on_task
    participant run_workflow
    participant Workflow

    Caller->>evaluate_model_on_task: model_atom_id, model_id, eval_task, eval_config
    activate evaluate_model_on_task
    
    Note over evaluate_model_on_task: Build EvaluateOnlyConfig<br/>with prepared_model, prepared_task,<br/>evaluate_model steps
    
    evaluate_model_on_task->>run_workflow: Execute workflow
    activate run_workflow
    run_workflow->>Workflow: Execute steps
    Workflow-->>run_workflow: workflow_results
    run_workflow-->>evaluate_model_on_task: results
    deactivate run_workflow
    
    Note over evaluate_model_on_task: Extract eval_results<br/>from evaluate_model step
    
    evaluate_model_on_task->>Caller: return eval_results
    deactivate evaluate_model_on_task
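A rough code-level rendering of the diagram above (the config keys, the user argument's default, and the state attribute names are guesses; the actual helper in motools/workflows/evaluate_only.py also validates inputs and handles missing step results):

```python
from motools.workflow.execution import run_workflow
from motools.workflows import EvaluateOnlyConfig


async def evaluate_model_on_task_sketch(model_atom_id: str, model_id: str,
                                         eval_task: str, eval_config: dict,
                                         user: str = "experiments") -> str:
    # Build an evaluate-only config pointing at an already-trained model.
    # The dictionary keys below are illustrative, not the real schema.
    config = EvaluateOnlyConfig.from_dict(
        {
            "model_atom_id": model_atom_id,
            "model_id": model_id,
            "eval_task": eval_task,
            **eval_config,
        }
    )

    # Execute the prepared_model -> prepared_task -> evaluate_model steps.
    result = await run_workflow(config, user)

    # Pull the evaluation results atom out of the final step's state.
    step_state = result.get_step_state("evaluate_model")
    if step_state is None:
        raise RuntimeError("evaluate_model step produced no results")
    return step_state.eval_results_atom_id  # attribute name illustrative
```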
sequenceDiagram
    participant Caller
    participant train_variant
    participant run_workflow
    participant Workflow

    Caller->>train_variant: variant, training_config
    activate train_variant
    
    Note over train_variant: Build TrainAndEvaluateConfig<br/>with training stages only<br/>(prepare_dataset, submit_training, wait)
    
    train_variant->>run_workflow: Execute partial workflow
    activate run_workflow
    run_workflow->>Workflow: Execute training steps
    Workflow-->>run_workflow: workflow_results
    run_workflow-->>train_variant: results
    deactivate run_workflow
    
    Note over train_variant: Extract model_atom<br/>and model_id from results
    
    train_variant->>Caller: return {model_id, model_atom, ...}
    deactivate train_variant
sequenceDiagram
    participant Caller
    participant find_model_from_cache
    participant StageCache

    Caller->>find_model_from_cache: variant_config, training_config
    activate find_model_from_cache
    
    Note over find_model_from_cache: Check prepare_dataset<br/>in cache
    find_model_from_cache->>StageCache: Query step result
    StageCache-->>find_model_from_cache: result or None
    
    alt Dataset Missing
        find_model_from_cache->>Caller: return (None, "dataset_missing")
    end
    
    Note over find_model_from_cache: Check submit_training<br/>in cache
    find_model_from_cache->>StageCache: Query step result
    StageCache-->>find_model_from_cache: result or None
    
    alt Training Not Submitted
        find_model_from_cache->>Caller: return (None, "training_not_submitted")
    end
    
    Note over find_model_from_cache: Check wait_for_training<br/>status in cache
    find_model_from_cache->>StageCache: Query training status
    StageCache-->>find_model_from_cache: status
    
    alt Training In Progress
        find_model_from_cache->>Caller: return (None, "training_in_progress")
    else Training Complete
        Note over find_model_from_cache: Extract model_id and<br/>training_job_atom
        find_model_from_cache->>Caller: return (model_id, "success")
    end
    
    deactivate find_model_from_cache
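In code, the probe sequence above maps onto something like this (a pseudocode-level sketch: StageCache, TrainingJobAtom, and ModelAtom are real motools classes per the code graph below, but the constructor arguments, state keys, and status strings used here are illustrative):

```python
from motools.atom import ModelAtom, TrainingJobAtom
from motools.cache.stage_cache import StageCache


async def find_model_from_cache_sketch(
    variant_config: dict, training_config: dict
) -> tuple[str | None, str]:
    cache = StageCache()  # construction details are illustrative
    config = {**variant_config, **training_config}  # how the real helper combines these is a guess

    dataset_state = cache.get("prepare_dataset", config, {})
    if dataset_state is None:
        return None, "dataset_missing"

    submit_state = cache.get("submit_training", config, {"prepared_dataset": dataset_state})
    if submit_state is None:
        return None, "training_not_submitted"

    # Check the live job status without blocking on completion.
    job_atom = TrainingJobAtom.load(submit_state["job_atom_id"])  # key name illustrative
    job_status = job_atom.get_status()  # may be async in the real code
    if job_status != "succeeded":       # status string illustrative
        return None, "training_in_progress"

    wait_state = cache.get("wait_for_training", config, {"training_job": job_atom})
    model_atom = ModelAtom.load(wait_state["model_atom_id"])  # key name illustrative
    return model_atom.get_model_id(), "success"
```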

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

  • motools/workflows/train_and_evaluate.py: Complex multi-step cache operations with status checking and early returns; requires careful verification of cache traversal logic and error handling across find_model_from_cache, train_variant, and check_training_status.
  • motools/workflows/evaluate_only.py: Workflow orchestration with step extraction and error handling; verify correct config construction and result extraction.
  • mozoo/experiments/utils/results.py: Non-trivial DataFrame construction with nested metric structure handling; verify metric parsing logic handles both flat and nested metric formats correctly.
  • Path integration in mozoo/experiments/utils/paths.py: Side effect of auto-creating plots directory on property access; verify this is the intended behavior and won't cause issues with concurrent access.


Poem

🐰 Workflows unite with helpers so keen,
Experiments bloom with utilities clean—
Cache-checking rabbits hop through the store,
Results and configs dance on the floor!
A toolkit to train, evaluate, display!

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Title check (❓ Inconclusive): The title 'Split out utils' is vague and generic, lacking specificity about which utils are being split, where they're being split from/to, or why this change matters. Resolution: consider a more descriptive title like 'Extract experiment utilities into separate modules' or 'Expose workflow helpers and experiment utils in public API' to better convey the actual changes.
✅ Passed checks (2 passed)
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.



@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • There’s a lot of repeated TrainAndEvaluateConfig.from_dict logic across the new workflow utilities—consider extracting a shared helper to DRY up the config construction and reduce drift.
  • The hardcoded dummy_task_loader default appears in multiple function signatures; extracting it into a single module‐level constant or configuration entry would improve consistency and make updates easier.
  • Many utility functions catch broad Exception; narrowing these to specific error types or at least logging the full traceback could make debugging unexpected failures much easier.
## Individual Comments

### Comment 1
<location> `mozoo/experiments/utils/paths.py:39` </location>
<code_context>
+    def plots_dir(self) -> Path:
+        """Path to plots directory (creates if doesn't exist)."""
+        plots_dir = self.experiment_dir / "plots"
+        plots_dir.mkdir(exist_ok=True)
+        return plots_dir
</code_context>

<issue_to_address>
**suggestion:** Consider using exist_ok=True and parents=True for mkdir to handle nested experiment directories.

Using parents=True ensures all parent directories are created, improving robustness if experiment_dir or its parents are missing.

```suggestion
        plots_dir.mkdir(exist_ok=True, parents=True)
```
</issue_to_address>

### Comment 2
<location> `mozoo/experiments/utils/results.py:75` </location>
<code_context>
def results_to_dataframe(results: list[dict[str, Any]]) -> pd.DataFrame:
    """Convert results to a pandas DataFrame for analysis.

    Args:
        results: List of evaluation results

    Returns:
        DataFrame with columns: variant_name, trait, strength, task, metric_name, metric_value

    Example:
        >>> df = results_to_dataframe(results)
        >>> df.groupby("trait")["value"].mean()
    """
    rows = []
    for result in results:
        variant_name = result["variant_name"]
        trait = result.get("trait")
        strength = result.get("strength")
        model_id = result.get("model_id")

        for task_name, task_result in result.get("evaluations", {}).items():
            if "error" in task_result:
                continue

            metrics = task_result.get("metrics", {})

            # Handle case where metrics dict directly contains "mean" and "stderr"
            if set(metrics.keys()) == {"mean", "stderr"} or set(metrics.keys()) == {"mean"}:
                metric_name = task_name
                value = metrics.get("mean")
                stderr = metrics.get("stderr", 0)

                rows.append(
                    {
                        "variant_name": variant_name,
                        "trait": trait,
                        "strength": strength,
                        "model_id": model_id,
                        "task": task_name,
                        "metric": metric_name,
                        "value": value,
                        "stderr": stderr,
                    }
                )
            else:
                # Normal case: metrics dict contains metric names as keys
                for metric_name, metric_value in metrics.items():
                    if isinstance(metric_value, dict):
                        if "mean" in metric_value:
                            value = metric_value["mean"]
                            stderr = metric_value.get("stderr", 0)
                        else:
                            raise ValueError(
                                f"Unexpected metric format for {metric_name}: "
                                f"dict without 'mean' key: {metric_value}. "
                                f"Expected dict with 'mean' (and optionally 'stderr') or a numeric value."
                            )
                    else:
                        value = metric_value
                        stderr = 0

                    rows.append(
                        {
                            "variant_name": variant_name,
                            "trait": trait,
                            "strength": strength,
                            "model_id": model_id,
                            "task": task_name,
                            "metric": metric_name,
                            "value": value,
                            "stderr": stderr,
                        }
                    )

    return pd.DataFrame(rows)

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Replace multiple comparisons of same variable with `in` operator ([`merge-comparisons`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/merge-comparisons/))
- Swap if/else branches ([`swap-if-else-branches`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/swap-if-else-branches/))
- Remove unnecessary else after guard condition ([`remove-unnecessary-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-unnecessary-else/))
</issue_to_address>



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (9)
mozoo/experiments/utils/results.py (3)

10-21: Consider more explicit serialization handling.

The default=str fallback may silently convert non-serializable objects (e.g., datetime, Path, custom classes) in potentially unexpected ways. While convenient, this can make debugging serialization issues harder.

Consider one of these approaches:

  1. Use a custom serializer that explicitly handles expected types
  2. Add a docstring note documenting the str-conversion behavior
  3. Log a warning when the fallback is used (though this requires hooking into json.dump)

Example custom serializer:

def _serialize_default(obj: Any) -> str:
    """Convert non-serializable objects to strings."""
    if isinstance(obj, (Path, datetime)):
        return str(obj)
    # Add other expected types here
    return str(obj)

Then use json.dump(results, f, indent=2, default=_serialize_default)


75-91: Consider relaxing the exact set equality check.

The condition set(metrics.keys()) == {"mean", "stderr"} uses exact equality, meaning a metrics dict with additional keys (e.g., {"mean": 0.5, "stderr": 0.1, "n_samples": 100}) would fall through to the else branch and be processed differently. This could be surprising.

Consider either:

  1. Documenting this exact-match behavior in a comment
  2. Changing to subset check if the intent is to detect the "compact" format: {"mean", "stderr"}.issuperset(metrics.keys())

Example:

-            # Handle case where metrics dict directly contains "mean" and "stderr"
-            if set(metrics.keys()) == {"mean", "stderr"} or set(metrics.keys()) == {"mean"}:
+            # Handle compact format: metrics dict contains only "mean" (and optionally "stderr")
+            # Note: This check requires exact key match - additional keys will use normal format
+            if set(metrics.keys()) == {"mean", "stderr"} or set(metrics.keys()) == {"mean"}:

78-78: Consider using None instead of 0 for missing stderr.

When stderr is not present in the data, defaulting to 0 might be misleading as it suggests a measured stderr of zero rather than missing data. Using None or pd.NA would be more semantically accurate.

Apply this diff:

-                stderr = metrics.get("stderr", 0)
+                stderr = metrics.get("stderr", None)

And:

-                            stderr = metric_value.get("stderr", 0)
+                            stderr = metric_value.get("stderr", None)

And:

-                        stderr = 0
+                        stderr = None

Also applies to: 98-98

mozoo/experiments/utils/display.py (2)

37-61: Hardcoded filename in output.

Line 50 prints Config file: config.yaml but this string is hardcoded. If a caller uses a different config file name (via the load_experiment_config parameter), this output would be misleading.

Consider either:

  1. Removing the hardcoded filename line
  2. Adding a parameter to accept the actual config file path/name
  3. Adding a comment noting this assumes the default config.yaml

Example with parameter:

-def print_config_summary(config: dict[str, Any]) -> None:
+def print_config_summary(config: dict[str, Any], config_file: str = "config.yaml") -> None:
     """Print a summary of experiment configuration.

     Args:
         config: Configuration dictionary
+        config_file: Name of the config file (default: "config.yaml")

Then:

-    print("  Config file: config.yaml")
+    print(f"  Config file: {config_file}")

64-76: Naive pluralization and docstring mismatch in progress output.

Line 76 always appends 's' to item_name, which works for most English nouns, but the output prefix doesn't match the example in the docstring. The current code produces:

Model: 1/5 models

while the docstring example shows:

Training: 1/5 models

Consider clarifying the docstring or adjusting the implementation to match the intended usage.

mozoo/experiments/utils/paths.py (1)

35-40: Property with side effect may be surprising.

The plots_dir property creates the directory on access (line 39), which is a side effect that may be surprising for a property getter. Users might not expect a simple property access to perform filesystem operations.

Consider one of these approaches:

  1. Document the side effect clearly (minimal change):
     @property
     def plots_dir(self) -> Path:
-        """Path to plots directory (creates if doesn't exist)."""
+        """Path to plots directory.
+        
+        Note: Automatically creates the directory on first access if it doesn't exist.
+        """
  2. Move creation to a method (more explicit):
@property
def plots_dir(self) -> Path:
    """Path to plots directory."""
    return self.experiment_dir / "plots"

def ensure_plots_dir(self) -> Path:
    """Ensure plots directory exists and return path."""
    plots_dir = self.plots_dir
    plots_dir.mkdir(exist_ok=True)
    return plots_dir
  3. Use cached_property to avoid repeated mkdir calls:
from functools import cached_property

@cached_property
def plots_dir(self) -> Path:
    """Path to plots directory (creates if doesn't exist)."""
    plots_dir = self.experiment_dir / "plots"
    plots_dir.mkdir(exist_ok=True)
    return plots_dir
motools/workflows/train_and_evaluate.py (2)

317-322: Skip the dummy prepare_task stage in training-only runs.

The training path never consumes prepared_task, so running that stage just invokes the dummy loader, creates throwaway atoms, and risks import failures if the placeholder task isn’t present. Dropping it keeps the helper lean and avoids an unnecessary dependency.

-    training_stages = [
-        "prepare_dataset",
-        "prepare_task",
-        "submit_training",
-        "wait_for_training",
-    ]
+    training_stages = [
+        "prepare_dataset",
+        "submit_training",
+        "wait_for_training",
+    ]

482-487: Broaden the refresh fallback to cover NotImplementedError.

Some backends define refresh() but raise NotImplementedError (rather than omitting the method). At the moment that bubbles into the outer except block and misreports the status as an error. Treat it like the missing-method case so we still surface a valid status.

-        try:
-            await job_atom.refresh()
-        except AttributeError:
+        try:
+            await job_atom.refresh()
+        except (AttributeError, NotImplementedError):
motools/workflows/evaluate_only.py (1)

75-124: Validate model_atom_id before running the workflow.

We accept both the atom ID and the raw model_id, but never check that they match. A typo slips through silently and we end up evaluating an unexpected model. Load the atom up front and assert the IDs line up so callers get a fast, actionable failure.

-from typing import Any
+from typing import Any
+
+from motools.atom import ModelAtom
@@
 async def evaluate_model_on_task(
     model_atom_id: str,
     model_id: str,
@@
 ) -> str:
     """Evaluate a model on a specific task using the evaluate_only workflow."""
+    model_atom = ModelAtom.load(model_atom_id)
+    expected_model_id = model_atom.get_model_id()
+    if expected_model_id != model_id:
+        raise ValueError(
+            f"model_id '{model_id}' does not match stored model_id '{expected_model_id}' for atom {model_atom_id}"
+        )
 
     config = EvaluateOnlyConfig.from_dict(
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c8ce76 and a666d73.

📒 Files selected for processing (8)
  • motools/workflows/__init__.py (1 hunks)
  • motools/workflows/evaluate_only.py (3 hunks)
  • motools/workflows/train_and_evaluate.py (3 hunks)
  • mozoo/experiments/utils/__init__.py (1 hunks)
  • mozoo/experiments/utils/config.py (1 hunks)
  • mozoo/experiments/utils/display.py (1 hunks)
  • mozoo/experiments/utils/paths.py (1 hunks)
  • mozoo/experiments/utils/results.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
motools/workflows/__init__.py (2)
motools/workflows/evaluate_only.py (2)
  • EvaluateOnlyConfig (25-36)
  • evaluate_model_on_task (75-137)
motools/workflows/train_and_evaluate.py (4)
  • TrainAndEvaluateConfig (32-47)
  • check_training_status (350-555)
  • find_model_from_cache (99-238)
  • train_variant (241-347)
mozoo/experiments/utils/config.py (2)
motools/system.py (1)
  • load_dotenv (12-38)
mozoo/experiments/utils/paths.py (1)
  • config_file (21-23)
motools/workflows/train_and_evaluate.py (4)
motools/atom/base.py (4)
  • ModelAtom (563-632)
  • TrainingJobAtom (636-773)
  • get_model_id (620-632)
  • _load_training_run (700-739)
motools/cache/stage_cache.py (1)
  • StageCache (39-319)
motools/workflow/execution.py (1)
  • run_workflow (18-56)
motools/workflow/state.py (1)
  • get_step_state (77-89)
motools/workflows/evaluate_only.py (3)
motools/workflow/execution.py (1)
  • run_workflow (18-56)
motools/workflow/config.py (1)
  • from_dict (62-111)
motools/workflow/state.py (1)
  • get_step_state (77-89)
mozoo/experiments/utils/__init__.py (4)
mozoo/experiments/utils/config.py (3)
  • get_experiment_dir (10-24)
  • load_experiment_config (46-70)
  • setup_experiment_env (27-43)
mozoo/experiments/utils/display.py (4)
  • print_config_summary (37-61)
  • print_progress (64-76)
  • print_section (6-19)
  • print_subsection (22-34)
mozoo/experiments/utils/paths.py (1)
  • ExperimentPaths (6-40)
mozoo/experiments/utils/results.py (3)
  • load_results (24-45)
  • results_to_dataframe (48-122)
  • save_results (10-21)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Sourcery review
🔇 Additional comments (9)
mozoo/experiments/utils/results.py (1)

24-45: LGTM!

The function correctly checks for file existence and provides a helpful error message to guide users.

mozoo/experiments/utils/display.py (2)

6-19: LGTM!

Simple and effective section header formatting.


22-34: LGTM!

Simple and effective subsection header formatting.

mozoo/experiments/utils/config.py (2)

10-24: LGTM!

Simple utility to extract the parent directory from a script path.


46-70: LGTM!

The function correctly loads YAML configuration with appropriate error handling.

mozoo/experiments/utils/paths.py (2)

6-18: LGTM!

Clean initialization of the path manager.


20-33: LGTM!

Simple and clear path properties for configuration and results files.

mozoo/experiments/utils/__init__.py (1)

1-41: LGTM!

The public API module correctly re-exports all utilities from submodules and provides a clear, well-documented interface for the utils package.

motools/workflows/__init__.py (1)

19-30: Public surface looks tidy.

Re-exporting the new helpers alongside the configs keeps the package API predictable.

Comment on lines +27 to +43
def setup_experiment_env(experiment_dir: Path) -> None:
    """Set up environment variables for an experiment.

    Loads .env file from the project root (3 levels up from experiment_dir).

    Args:
        experiment_dir: Path to the experiment directory

    Example:
        >>> setup_experiment_env(EXPERIMENT_DIR)
    """
    # .env is typically in project root (3 levels up from experiment)
    # For mozoo/experiments/trait_expression -> motools (repo root)
    project_root = experiment_dir.parent.parent.parent
    env_file = project_root / ".env"
    if env_file.exists():
        load_dotenv(env_file)


⚠️ Potential issue | 🟠 Major

Duplicates functionality and has fragile path traversal.

This function duplicates motools.system.load_dotenv() (which is already called on motools import) but with several issues:

  1. Hardcoded 3-level traversal (line 40) assumes a fixed directory structure. If the experiment directory depth changes, this breaks.
  2. Missing error handling: Unlike motools.system.load_dotenv() which has try/except for ImportError, this assumes python-dotenv is installed.
  3. Potential double-loading: Since motools.system.load_dotenv() is called automatically on import, this may load the .env file twice.

Consider one of these approaches:

  1. Reuse existing functionality (recommended):
 def setup_experiment_env(experiment_dir: Path) -> None:
-    """Set up environment variables for an experiment.
-
-    Loads .env file from the project root (3 levels up from experiment_dir).
-
-    Args:
-        experiment_dir: Path to the experiment directory
-
-    Example:
-        >>> setup_experiment_env(EXPERIMENT_DIR)
-    """
-    # .env is typically in project root (3 levels up from experiment)
-    # For mozoo/experiments/trait_expression -> motools (repo root)
-    project_root = experiment_dir.parent.parent.parent
-    env_file = project_root / ".env"
-    if env_file.exists():
-        load_dotenv(env_file)
+    """Set up environment variables for an experiment.
+
+    Note: motools.system.load_dotenv() is called automatically on import,
+    so this is typically unnecessary. Kept for explicit calls if needed.
+
+    Args:
+        experiment_dir: Path to the experiment directory
+
+    Example:
+        >>> setup_experiment_env(EXPERIMENT_DIR)
+    """
+    from motools.system import load_dotenv as motools_load_dotenv
+    motools_load_dotenv()
  2. Or use robust traversal like the motools version:
def setup_experiment_env(experiment_dir: Path) -> None:
    """Set up environment variables for an experiment."""
    try:
        # Search upward from experiment_dir for .env file
        for directory in [experiment_dir, *experiment_dir.parents]:
            env_file = directory / ".env"
            if env_file.exists():
                load_dotenv(env_file)
                return
    except ImportError:
        pass  # python-dotenv not installed

Comment on lines +48 to +59
def results_to_dataframe(results: list[dict[str, Any]]) -> pd.DataFrame:
    """Convert results to a pandas DataFrame for analysis.

    Args:
        results: List of evaluation results

    Returns:
        DataFrame with columns: variant_name, trait, strength, task, metric_name, metric_value

    Example:
        >>> df = results_to_dataframe(results)
        >>> df.groupby("trait")["value"].mean()


⚠️ Potential issue | 🟡 Minor

Fix docstring to match actual column names.

The docstring states the DataFrame will have columns metric_name and metric_value, but the actual implementation uses metric and value.

Apply this diff:

-    Returns:
-        DataFrame with columns: variant_name, trait, strength, task, metric_name, metric_value
+    Returns:
+        DataFrame with columns: variant_name, trait, strength, model_id, task, metric, value, stderr
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def results_to_dataframe(results: list[dict[str, Any]]) -> pd.DataFrame:
    """Convert results to a pandas DataFrame for analysis.

    Args:
        results: List of evaluation results

    Returns:
        DataFrame with columns: variant_name, trait, strength, model_id, task, metric, value, stderr

    Example:
        >>> df = results_to_dataframe(results)
        >>> df.groupby("trait")["value"].mean()
🤖 Prompt for AI Agents
In mozoo/experiments/utils/results.py around lines 48 to 59, the docstring
claims the DataFrame columns are `metric_name` and `metric_value` but the
function actually produces `metric` and `value`; update the docstring (Returns
and any Examples) to list the correct column names: variant_name, trait,
strength, task, metric, value, and adjust the example usage text if it
references the old names.

@Cloakless Cloakless assigned Cloakless and dtch1997 and unassigned Cloakless Nov 11, 2025
@dtch1997 dtch1997 merged commit 354547e into main Nov 12, 2025
6 checks passed