Split out refactor of trait expression experiments #273
Conversation
Reviewer's Guide

This PR refactors the Trait Expression experiment scripts by extracting shared setup and I/O logic into a utilities module, standardizing console output, extending result visualizations with PNG exports, and updating documentation and configuration for the new experiment framework.

Class diagram for refactored experiment utilities and usage

```mermaid
classDiagram
class ExperimentPaths {
+plots_dir
+results_file
+training_results_file
}
class TraitExpressionTrain {
+main()
}
class TraitExpressionEvaluate {
+main()
}
class TraitExpressionResults {
+main()
}
class TraitExpressionStatus {
+main()
}
class Utils {
+get_experiment_dir()
+setup_experiment_env()
+load_experiment_config()
+print_section()
+print_subsection()
+print_config_summary()
+save_results()
+load_results()
+results_to_dataframe()
}
TraitExpressionTrain --> Utils
TraitExpressionEvaluate --> Utils
TraitExpressionResults --> Utils
TraitExpressionStatus --> Utils
TraitExpressionTrain --> ExperimentPaths
TraitExpressionEvaluate --> ExperimentPaths
TraitExpressionResults --> ExperimentPaths
    TraitExpressionStatus --> ExperimentPaths
```
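For orientation, here is a minimal sketch of how one of the experiment scripts might consume these utilities. It only uses names that appear in the diagram and in the review excerpts below; the exact signatures (what `ExperimentPaths` and `get_experiment_dir` accept, and whether the package re-exports these helpers at `mozoo.experiments.utils`) are assumptions, not the real module API.

```python
# Minimal sketch of a script built on the shared utilities above.
# Assumed: get_experiment_dir(__file__) and ExperimentPaths(EXPERIMENT_DIR) signatures
# are guesses; load_experiment_config(EXPERIMENT_DIR) and the FileNotFoundError
# handling mirror the evaluate.py excerpt quoted later in this review.
from mozoo.experiments.utils import (
    ExperimentPaths,
    get_experiment_dir,
    load_experiment_config,
    print_section,
    save_results,
    setup_experiment_env,
)

EXPERIMENT_DIR = get_experiment_dir(__file__)
paths = ExperimentPaths(EXPERIMENT_DIR)


def main() -> None:
    print_section("Trait Expression Experiment - Example")
    setup_experiment_env()  # assumed to prepare env vars / backend settings
    try:
        config = load_experiment_config(EXPERIMENT_DIR)
    except FileNotFoundError as e:
        print(f"Error: {e}")
        return
    # Stand-in "results": one entry per configured model variant.
    results = [{"variant_name": m["name"]} for m in config.get("models", [])]
    save_results(results, paths.results_file)


if __name__ == "__main__":
    main()
```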
File-Level Changes
Note: Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The Persona Vectors Experiment is renamed to Trait Expression Experiment and refactored to delegate training, evaluation, status checks, and results handling to shared utilities in motools.workflows and mozoo.experiments.utils; README and config updated (model and backend), CLI paths adjusted, and plotting/export improved.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User
participant CLI as CLI script
participant Utils as mozoo.experiments.utils
participant Workflows as motools.workflows
participant Cache as ModelCache
participant Results as ResultsStore
Note over CLI,Utils: Train flow (train.py)
User->>CLI: run train.py
CLI->>Utils: load_experiment_config(), setup_experiment_env(), ExperimentPaths
CLI->>Workflows: train_variant(variant, training_config, user)
Workflows->>Cache: store trained atom/model
Workflows->>Results: save training results
Results->>CLI: training_results saved
Note over CLI,Workflows: Eval flow (evaluate.py)
User->>CLI: run evaluate.py
CLI->>Utils: load_experiment_config(), setup_experiment_env(), ExperimentPaths
CLI->>Workflows: find_model_from_cache(...)
Workflows->>Cache: retrieve model atoms
CLI->>Workflows: evaluate_model_on_task(model_atom, model_id, task, config)
Workflows->>Results: append evaluation results
Results->>CLI: results saved
Note over CLI,Utils: Results & Plots (results.py)
User->>CLI: run results.py
CLI->>Utils: get_experiment_dir(), load_results(paths.results_file), results_to_dataframe
CLI->>Utils: create_tabbed_html + export PNGs (kaleido)
Utils->>User: open plots/results path
```

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Poem
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `mozoo/experiments/trait_expression/results.py:246-247` </location>
<code_context>
-
- cache = StageCache()
-
- try:
- # Step 1: Check cache for prepare_dataset (no input atoms needed)
- cached_dataset_state = cache.get(
</code_context>
<issue_to_address>
**suggestion (bug_risk):** PNG export error handling is broad; may mask unrelated issues.
Consider handling unexpected exceptions separately or logging them to improve debugging and avoid masking unrelated errors.
</issue_to_address>
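One way to act on this suggestion is sketched below, under the assumption that the export loop calls `fig.write_image` as shown in the diff quoted later in this review: treat the known Kaleido-missing case as an expected skip, and log anything else with a traceback so unrelated failures stay visible. The helper name `export_png` is illustrative only, not part of the PR.

```python
import logging
from pathlib import Path

import plotly.graph_objects as go


def export_png(fig: go.Figure, png_path: Path, filename: str) -> None:
    """Write one PNG, skipping gracefully when the Kaleido engine is unavailable."""
    try:
        fig.write_image(str(png_path), width=900, height=600, scale=2)
        print(f"  ✓ Saved {filename}")
    except (ValueError, RuntimeError) as e:
        if "kaleido" in str(e).lower():
            # Expected when Kaleido is not installed: skip quietly with a hint.
            print(f"  ⚠ Skipped {filename} (install kaleido for PNG export)")
        else:
            # Unrelated error of the same type: keep the traceback visible.
            logging.warning("Unexpected error exporting %s: %s", filename, e, exc_info=True)
            print(f"  ⚠ Skipped {filename} (unexpected error: {e})")
    except OSError as e:
        # Disk or permission problems are reported separately from missing dependencies.
        logging.warning("File system error exporting %s: %s", filename, e, exc_info=True)
        print(f"  ⚠ Skipped {filename} (file system error: {e})")
```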
### Comment 2
<location> `mozoo/experiments/trait_expression/results.py:260` </location>
<code_context>
+ print(f" ⚠ Skipped {filename} ({e})")
+
def create_tabbed_html(plot_htmls: list[str], tab_titles: list[str]) -> str:
"""Create HTML with tabs containing multiple plots.
</code_context>
<issue_to_address>
**issue (complexity):** Consider moving the static HTML/CSS for tabbed plots into a separate template file and using a template engine to simplify the Python code.
You can dramatically shrink `create_tabbed_html` by moving the big static HTML/CSS into its own file and using a tiny template engine (e.g. Python’s built-in `string.Template` or Jinja2). Here’s one approach:
1. **Create a file** `templates/results_template.html`:
```html
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>${page_title}</title>
<style>
body { … }
.tabs { … }
/* rest of your CSS */
</style>
${plotly_js}
</head>
<body>
<div class="container">
<h1>${page_title}</h1>
<div class="tabs">${tab_buttons}</div>
<div class="tab-contents">${tab_contents}</div>
</div>
<script>
/* your JS */
</script>
</body>
</html>
```
2. **Simplify** `create_tabbed_html` in your `.py`:
```python
from pathlib import Path
from string import Template

TEMPLATE_PATH = Path(__file__).parent / "templates" / "results_template.html"

def create_tabbed_html(plot_htmls, tab_titles):
    tpl = Template(TEMPLATE_PATH.read_text(encoding="utf-8"))
    plotly_js = get_plotlyjs()  # however you load it today
    buttons = "\n".join(
        f'<button class="tab-button" onclick="showTab({i})">{t}</button>'
        for i, t in enumerate(tab_titles)
    )
    contents = "\n".join(
        f'<div id="tab-content-{i}" class="tab-content{" show active" if i==0 else ""}">'
        f' <div class="plot-wrapper">{html}</div></div>'
        for i, html in enumerate(plot_htmls)
    )
    return tpl.substitute(
        page_title="Trait Expression Experiment Results",
        plotly_js=plotly_js,
        tab_buttons=buttons,
        tab_contents=contents,
    )
```
3. **Keep the rest of your code** exactly as-is.
By extracting static HTML/CSS into `results_template.html`, you:
- Remove the 300+ lines of inline template from Python.
- Keep all formatting intact via `${…}` placeholders.
- Make it trivial to tweak styles or scripts without touching Python logic.
</issue_to_address>
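For completeness, a hypothetical call site for the templated `create_tabbed_html` sketched above. The `figs` list here is dummy data, and passing `include_plotlyjs=False` assumes the template injects Plotly's JS once in the `<head>` via `get_plotlyjs()`.

```python
from pathlib import Path

import plotly.graph_objects as go

# Dummy figures standing in for the experiment's real plots.
figs = [go.Figure(data=go.Bar(x=["baseline", "severe"], y=[0.2, 0.8]))]
tab_titles = ["Example metric"]

# Embed each figure without its own copy of the Plotly bundle.
plot_htmls = [fig.to_html(full_html=False, include_plotlyjs=False) for fig in figs]
html = create_tabbed_html(plot_htmls, tab_titles)
Path("results.html").write_text(html, encoding="utf-8")
```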
### Comment 3
<location> `mozoo/experiments/trait_expression/evaluate.py:221-224` </location>
<code_context>
async def main() -> None:
    """Evaluate all trained models."""
    print_section("Trait Expression Experiment - Evaluation")

    # Load configuration
    try:
        config_data = load_experiment_config(EXPERIMENT_DIR)
    except FileNotFoundError as e:
        print(f"Error: {e}")
        return

    models = config_data.get("models", [])
    training_config = config_data.get("training", {})
    eval_config = config_data.get("evaluation", {})
    eval_tasks = eval_config.get("tasks", [])

    if not models:
        print(
            """
Error: No models defined in config.yaml
Please add at least one model to the 'models' section.
"""
        )
        return

    if not eval_tasks:
        print(
            """
Error: No evaluation tasks defined in config.yaml
Please add at least one task to the 'evaluation.tasks' section.
"""
        )
        return

    print(
        f"""
Configuration:
Models to evaluate: {len(models)}
Evaluation tasks: {len(eval_tasks)}
"""
    )

    # Find all models from cache
    print_subsection("Looking for trained models in cache...")
    models_to_evaluate = []
    models_not_ready = []
    for variant in models:
        model_atom_id, status_message = await find_model_from_cache(
            variant_config=variant,
            training_config=training_config,
        )
        if model_atom_id is None:
            print(f"⚠️ {variant['name']}: {status_message}")
            models_not_ready.append((variant["name"], status_message))
            continue
        model_atom = cast(ModelAtom, ModelAtom.load(model_atom_id))
        model_id = model_atom.get_model_id()
        models_to_evaluate.append(
            {
                "variant": variant,
                "model_atom_id": model_atom_id,
                "model_id": model_id,
            }
        )
        print(f"✓ {variant['name']}: {model_id[:50]}...")
    print()

    if not models_to_evaluate:
        train_script = EXPERIMENT_DIR / "train.py"
        print(f"No trained models found. Please run train.py first:\n python {train_script}")
        return

    # Summary of what will be evaluated
    print(f"Found {len(models_to_evaluate)}/{len(models)} trained models")
    if models_not_ready:
        print()
        print("⚠️ Models not ready (will be skipped):")
        for name, reason in models_not_ready:
            print(f" - {name}: {reason}")
        print(
            """
Note: You can run evaluate.py again later to evaluate these models
once their training completes.
"""
        )
        print("Proceeding with evaluation of available models...")

    # Evaluate all models on all tasks
    print_subsection("Evaluating models...")
    all_results = []
    # Keep track of not-ready models for summary (models_not_ready already defined above)
    for model_info in models_to_evaluate:
        variant = model_info["variant"]
        model_atom_id = model_info["model_atom_id"]
        model_id = model_info["model_id"]
        print(
            f"""
Evaluating: {variant["name"]}
Model: {model_id[:50]}...
"""
        )
        variant_results = {
            "variant_name": variant["name"],
            "trait": variant.get("trait"),
            "strength": variant.get("strength"),
            "model_atom_id": model_atom_id,
            "model_id": model_id,
            "evaluations": {},
        }
        for task_config in eval_tasks:
            task_name = task_config["name"]
            eval_task = task_config["eval_task"]
            print(f" Task: {task_name}")
            try:
                eval_atom_id = await evaluate_model_on_task(
                    model_atom_id=model_atom_id,
                    model_id=model_id,
                    eval_task=eval_task,
                    eval_config=eval_config,
                    user="trait-expression-experiment",
                )
                # Load and extract metrics
                eval_atom = cast(EvalAtom, EvalAtom.load(eval_atom_id))
                eval_results_obj = await eval_atom.to_eval_results()
                metrics = {}
                for task_name_inner, task_metrics in eval_results_obj.metrics.items():
                    for metric_name, value in task_metrics.items():
                        if metric_name != "stats":
                            metrics[metric_name] = value
                variant_results["evaluations"][task_name] = {
                    "eval_atom_id": eval_atom_id,
                    "metrics": metrics,
                }
                # Display metrics
                for metric_name, value in metrics.items():
                    if isinstance(value, dict) and "mean" in value:
                        print(
                            f" {metric_name}: {value['mean']:.3f} ± {value.get('stderr', 0):.3f}"
                        )
                    else:
                        print(f" {metric_name}: {value}")
            except Exception as e:
                print(f" ✗ Failed: {e}")
                variant_results["evaluations"][task_name] = {"error": str(e)}
        all_results.append(variant_results)

    print_subsection("✓ Evaluation complete")
    print()

    # Save results
    save_results(all_results, paths.results_file)

    # Display summary
    print_section("Evaluation Summary")
    print()
    if all_results:
        print(f"Evaluated {len(all_results)} model(s):")
        print()
        for result in all_results:
            trait = result.get("trait")
            strength = result.get("strength")
            if trait and strength:
                trait_str = f"{strength} {trait}"
            else:
                trait_str = "N/A"
            print(f"Variant: {result['variant_name']} ({trait_str})")
            print(f" Model: {result['model_id'][:50]}...")
            for task_name, task_result in result["evaluations"].items():
                if "error" in task_result:
                    print(f" {task_name}: Error - {task_result['error']}")
                else:
                    metrics = task_result.get("metrics", {})
                    for metric_name, value in metrics.items():
                        if isinstance(value, dict) and "mean" in value:
                            print(
                                f" {task_name}/{metric_name}: {value['mean']:.3f} ± {value.get('stderr', 0):.3f}"
                            )
                        else:
                            print(f" {task_name}/{metric_name}: {value}")
            print()
    if models_not_ready:
        print("⚠️ Skipped models (training not complete):")
        for name, reason in models_not_ready:
            print(f" - {name}: {reason}")
    results_script = EXPERIMENT_DIR / "results.py"
    print(
        f"""
Results saved to: {paths.results_file}
Next step:
Run: python {results_script}
This will display results and generate visualization plots.
"""
    )
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Replace if statement with if expression ([`assign-if-exp`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/assign-if-exp/))
- Low code quality found in main - 8% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
```suggestion
trait_str = f"{strength} {trait}" if trait and strength else "N/A"
```
<br/><details><summary>Explanation</summary>
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>
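One possible decomposition along these lines is sketched below; it is an illustration, not the PR's actual code. The helper name `evaluate_variant_on_tasks` is hypothetical, while the calls it wraps (`evaluate_model_on_task`, `EvalAtom.load`, `to_eval_results`) and the import paths follow the code and code-graph excerpts elsewhere in this review.

```python
from typing import Any, cast

from motools.atom.base import EvalAtom
from motools.workflows.evaluate_only import evaluate_model_on_task


async def evaluate_variant_on_tasks(
    model_atom_id: str,
    model_id: str,
    eval_tasks: list[dict[str, Any]],
    eval_config: dict[str, Any],
) -> dict[str, Any]:
    """Run every configured eval task for one model and collect its metrics."""
    evaluations: dict[str, Any] = {}
    for task_config in eval_tasks:
        task_name = task_config["name"]
        try:
            eval_atom_id = await evaluate_model_on_task(
                model_atom_id=model_atom_id,
                model_id=model_id,
                eval_task=task_config["eval_task"],
                eval_config=eval_config,
                user="trait-expression-experiment",
            )
            # Pull flat metrics out of the eval atom, skipping the "stats" entries.
            eval_atom = cast(EvalAtom, EvalAtom.load(eval_atom_id))
            eval_results_obj = await eval_atom.to_eval_results()
            metrics: dict[str, Any] = {}
            for task_metrics in eval_results_obj.metrics.values():
                for metric_name, value in task_metrics.items():
                    if metric_name != "stats":
                        metrics[metric_name] = value
            evaluations[task_name] = {"eval_atom_id": eval_atom_id, "metrics": metrics}
        except Exception as e:
            evaluations[task_name] = {"error": str(e)}
    return evaluations
```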
### Comment 4
<location> `mozoo/experiments/trait_expression/status.py:26` </location>
<code_context>
async def main() -> None:
    """Check status of all training jobs."""
    print_section("Trait Expression Experiment - Training Status")

    # Load configuration
    try:
        config_data = load_experiment_config(EXPERIMENT_DIR)
    except FileNotFoundError as e:
        print(f"Error: {e}")
        return

    models = config_data.get("models", [])
    training_config = config_data.get("training", {})
    if not models:
        print("No models configured in config.yaml")
        return

    print(f"Checking status of {len(models)} model(s)...")
    print()

    # Check status of each model
    statuses = []
    for model_config in models:
        status = await check_training_status(
            variant_config=model_config,
            training_config=training_config,
        )
        statuses.append(status)

    # Display results
    print_section("Training Status")

    # Group by status
    by_status: dict[str, list[dict[str, Any]]] = {}
    for status_info in statuses:
        status = status_info["status"]
        if status not in by_status:
            by_status[status] = []
        by_status[status].append(status_info)

    # Show in-progress first
    in_progress_statuses = ["queued", "running", "validating_files"]
    completed_statuses = ["succeeded", "completed"]
    failed_statuses = ["failed", "cancelled"]
    for status_group in [
        in_progress_statuses,
        completed_statuses,
        failed_statuses,
        ["not_submitted", "error"],
    ]:
        for status_key in status_group:
            if status_key in by_status:
                models_with_status = by_status[status_key]
                print(f"\n{status_key.upper()}:")
                print("-" * 80)
                for info in models_with_status:
                    print(f" {info['name']}")
                    if info.get("trait") and info.get("strength"):
                        print(f" Trait: {info['strength']} {info['trait']}")
                    if info.get("model_id"):
                        print(f" Model: {info['model_id'][:60]}...")
                    elif info.get("job_atom_id"):
                        print(f" Job ID: {info['job_atom_id'][:60]}...")
                    if info.get("message"):
                        print(f" Note: {info['message']}")

    # Summary
    print_section("Summary")
    total = len(statuses)
    completed = sum(s["status"] in completed_statuses for s in statuses)
    in_progress = sum(s["status"] in in_progress_statuses for s in statuses)
    failed = sum(s["status"] in failed_statuses for s in statuses)
    other = total - completed - in_progress - failed
    print(f" Total models: {total}")
    print(f" ✓ Completed: {completed}")
    print(f" ⏳ In progress: {in_progress}")
    print(f" ✗ Failed/Cancelled: {failed}")
    if other > 0:
        print(f" ⚠️ Other: {other}")
    print()

    if completed < total:
        print("Note: Run evaluate.py once training completes to evaluate models.")
        print(" You can run this script again to check updated status.")
    else:
        print("All models are complete! Run evaluate.py to evaluate them.")
</code_context>
<issue_to_address>
**issue (code-quality):** Low code quality found in main - 21% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
<br/><details><summary>Explanation</summary>The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>
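The same advice applies to the `main()` quoted above; a sketch of two extracted helpers is shown below. The helper names are hypothetical, but the fields they read (`status`, `name`, `trait`, `strength`, `model_id`, `job_atom_id`, `message`) and the print formatting are taken from the quoted code.

```python
from typing import Any


def group_by_status(statuses: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Bucket per-model status dicts by their 'status' field (hypothetical helper)."""
    by_status: dict[str, list[dict[str, Any]]] = {}
    for status_info in statuses:
        by_status.setdefault(status_info["status"], []).append(status_info)
    return by_status


def print_status_group(status_key: str, models_with_status: list[dict[str, Any]]) -> None:
    """Print one status bucket using the same fields as the quoted main()."""
    print(f"\n{status_key.upper()}:")
    print("-" * 80)
    for info in models_with_status:
        print(f" {info['name']}")
        if info.get("trait") and info.get("strength"):
            print(f" Trait: {info['strength']} {info['trait']}")
        if info.get("model_id"):
            print(f" Model: {info['model_id'][:60]}...")
        elif info.get("job_atom_id"):
            print(f" Job ID: {info['job_atom_id'][:60]}...")
        if info.get("message"):
            print(f" Note: {info['message']}")
```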
Actionable comments posted: 0
🧹 Nitpick comments (3)
mozoo/experiments/trait_expression/results.py (2)
154-154: Consider consistent use of the by parameter.
Line 154 uses by=["strength_sort", "task"] (list), while line 177 uses by="strength_sort" (scalar). While both are valid, using a consistent style (always using lists for the by parameter) improves readability.
Apply this diff to make line 177 consistent:

```diff
  task_data = metric_trait_df[metric_trait_df["task"] == task].sort_values(
-     by="strength_sort"
+     by=["strength_sort"]
  )
```

Also applies to: 177-177
247-277: Consider consolidating similar exception handlers.
The three exception blocks (ValueError/RuntimeError, OSError, Exception) have similar logging and print patterns. Consider consolidating them for better maintainability.
Here's a consolidated approach:

```diff
  try:
      fig.write_image(str(png_path), width=900, height=600, scale=2)
      print(f" ✓ Saved {filename}")
- except (ValueError, RuntimeError) as e:
-     # Handle Kaleido-related errors (expected when Kaleido isn't installed)
-     error_str = str(e).lower()
-     if "kaleido" in error_str or "browserdeps" in error_str:
-         print(
-             f" ⚠ Skipped {filename} (install kaleido for PNG export: pip install kaleido)"
-         )
-     else:
-         # Unexpected ValueError/RuntimeError - log for debugging
-         logging.warning(
-             f"Unexpected error exporting {filename} to PNG: {e}",
-             exc_info=True,
-         )
-         print(f" ⚠ Skipped {filename} (unexpected error: {e})")
- except OSError as e:
-     # Handle file system errors (permissions, disk space, etc.)
-     logging.warning(
-         f"File system error exporting {filename} to PNG: {e}",
-         exc_info=True,
-     )
-     print(f" ⚠ Skipped {filename} (file system error: {e})")
  except Exception as e:
-     # Catch any other unexpected errors and log them properly
-     logging.error(
-         f"Unexpected error exporting {filename} to PNG: {type(e).__name__}: {e}",
-         exc_info=True,
-     )
-     print(f" ⚠ Skipped {filename} (unexpected error: {type(e).__name__}: {e})")
+     error_str = str(e).lower()
+     # Check if it's an expected Kaleido-related error
+     if isinstance(e, (ValueError, RuntimeError)) and (
+         "kaleido" in error_str or "browserdeps" in error_str
+     ):
+         print(
+             f" ⚠ Skipped {filename} (install kaleido for PNG export: pip install kaleido)"
+         )
+     else:
+         # Log unexpected errors for debugging
+         log_level = logging.ERROR if not isinstance(e, (ValueError, RuntimeError, OSError)) else logging.WARNING
+         logging.log(
+             log_level,
+             f"Error exporting {filename} to PNG: {type(e).__name__}: {e}",
+             exc_info=True,
+         )
+         error_type = "file system error" if isinstance(e, OSError) else "unexpected error"
+         print(f" ⚠ Skipped {filename} ({error_type}: {e})")
```

mozoo/experiments/trait_expression/evaluate.py (1)
219-222: Consider simplifying the trait_str formatting.
The current approach with a conditional expression is fine, but the logic could be slightly cleaner by handling None values explicitly.
Alternative (optional):

```diff
  trait = result.get("trait")
  strength = result.get("strength")
- trait_str = f"{strength} {trait}" if trait and strength else "N/A"
+ trait_str = f"{strength} {trait}" if (trait and strength) else "N/A"
```

Or more explicitly handle None:

```python
trait = result.get("trait")
strength = result.get("strength")
if trait and strength:
    trait_str = f"{strength} {trait}"
else:
    trait_str = "N/A"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- mozoo/experiments/trait_expression/evaluate.py (9 hunks)
- mozoo/experiments/trait_expression/results.py (11 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
mozoo/experiments/trait_expression/results.py (4)
mozoo/experiments/utils/paths.py (3)
ExperimentPaths (6-40), plots_dir (36-40), results_file (26-28)
mozoo/experiments/utils/config.py (2)
get_experiment_dir (10-24), setup_experiment_env (27-43)
mozoo/experiments/utils/results.py (2)
load_results (24-45), results_to_dataframe (48-122)
mozoo/experiments/utils/display.py (1)
print_section(6-19)
mozoo/experiments/trait_expression/evaluate.py (7)
motools/atom/base.py (2)
EvalAtom (777-854), ModelAtom (563-632)
motools/workflows/evaluate_only.py (1)
evaluate_model_on_task (75-137)
motools/workflows/train_and_evaluate.py (1)
find_model_from_cache (99-238)
mozoo/experiments/utils/paths.py (2)
ExperimentPaths (6-40), results_file (26-28)
mozoo/experiments/utils/config.py (3)
get_experiment_dir (10-24), load_experiment_config (46-70), setup_experiment_env (27-43)
mozoo/experiments/utils/display.py (2)
print_section (6-19), print_subsection (22-34)
mozoo/experiments/utils/results.py (1)
save_results(10-21)
🪛 GitHub Actions: CI
mozoo/experiments/trait_expression/results.py
[error] 154-154: no-matching-overload: No overload of bound method sort_values matches arguments
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Sourcery review
🔇 Additional comments (7)
mozoo/experiments/trait_expression/results.py (3)
16-33: Well-structured refactoring to use centralized utilities.
The migration to use ExperimentPaths, get_experiment_dir, and setup_experiment_env improves consistency across experiments and reduces code duplication. Based on learnings.
238-246: Good addition of PNG export functionality.
The PNG export feature with graceful fallback when Kaleido is unavailable is a valuable enhancement for documentation and sharing results.
154-154: The code at line 154 is correct.
At this point, metric_trait_df is definitively a pandas DataFrame (the result of groupby(["strength", "task"], as_index=False)["mean_value"].mean()), and both columns "strength_sort" and "task" exist in the DataFrame. The sort_values(by=["strength_sort", "task"]) call is valid pandas syntax and will execute without error. This appears to be a false positive from a type checker with incomplete pandas type stubs; no code changes are needed.
Likely an incorrect or invalid review comment.
mozoo/experiments/trait_expression/evaluate.py (4)
16-36: Excellent refactoring to use external workflow utilities.
Delegating model discovery and evaluation to motools.workflows functions eliminates code duplication and centralizes workflow logic. The integration with mozoo.experiments.utils for path management and configuration loading is clean and consistent. Based on learnings.
44-48: Good error handling for missing configuration.
The try-except block properly handles FileNotFoundError when the config file is missing, providing a clear error message to the user.
172-172: Appropriate user identifier for workflow execution.
Using "trait-expression-experiment" as the user parameter is a good practice for tracking and organizing workflow executions by experiment type.
153-154: No action needed: the code already handles the concern.
The review comment requests verification that trait and strength fields are present in all model variants or that missing cases are handled. Both conditions are already met:
- All 9 models in config.yaml include both fields: each model (baseline/mild/severe for hallucinating, evil, and sycophantic) defines trait and strength.
- Missing fields are gracefully handled: lines 219-222 extract the fields safely using .get() and apply a conditional fallback (if trait and strength else "N/A"), so the code won't crash if these fields are absent in future configurations.
Summary by Sourcery
Refactor and rename the Persona Vectors experiment suite to Trait Expression, centralize file and environment handling via shared utilities, update scripts accordingly, and introduce static PNG export and updated documentation.
New Features:
Enhancements:
Documentation:
Summary by CodeRabbit
Documentation
New Features
Refactor
Chores