Added baseline prompt and `README` example #6

jamesbraza · 2025-06-05T19:45:16Z

Used lmi/aviary to run a baseline on ether0-benchmark with gpt-4o.

I also realized we hadn't open-sourced our baselines prompt yet.

Here is the printed output:

In category 'functional-group' of 10 questions, average reward was 0.000.
In category 'molecule-completion' of 25 questions, average reward was 0.200.
In category 'molecule-formula' of 25 questions, average reward was 0.000.
In category 'molecule-name' of 25 questions, average reward was 0.080.
In category 'oracle-solubility' of 25 questions, average reward was 0.040.
In category 'property-cat-eve' of 25 questions, average reward was 0.400.
In category 'property-cat-safety' of 25 questions, average reward was 0.360.
In category 'property-cat-smell' of 25 questions, average reward was 0.320.
In category 'property-regression-adme' of 25 questions, average reward was 0.480.
In category 'property-regression-ld50' of 25 questions, average reward was 0.360.
In category 'property-regression-pka' of 25 questions, average reward was 0.280.
In category 'reaction-prediction' of 25 questions, average reward was 0.160.
In category 'retro-synthesis' of 25 questions, average reward was 0.000.
In category 'simple-formula' of 15 questions, average reward was 0.000.
Cumulative average reward across 325 questions was 0.206.
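
As a sanity check on the numbers above, the cumulative figure can be reproduced by weighting each category's average by its question count. This is a minimal sketch, not the code from the README: the `reported` dictionary and the `cumulative_average` helper are names made up for illustration, and the values are copied from the printed output.

```python
# (question count, average reward) per category, copied from the printed output.
reported = {
    "functional-group": (10, 0.000),
    "molecule-completion": (25, 0.200),
    "molecule-formula": (25, 0.000),
    "molecule-name": (25, 0.080),
    "oracle-solubility": (25, 0.040),
    "property-cat-eve": (25, 0.400),
    "property-cat-safety": (25, 0.360),
    "property-cat-smell": (25, 0.320),
    "property-regression-adme": (25, 0.480),
    "property-regression-ld50": (25, 0.360),
    "property-regression-pka": (25, 0.280),
    "reaction-prediction": (25, 0.160),
    "retro-synthesis": (25, 0.000),
    "simple-formula": (15, 0.000),
}


def cumulative_average(per_category):
    """Weight each category's average reward by its number of questions."""
    total_questions = sum(n for n, _ in per_category.values())
    total_reward = sum(n * avg for n, avg in per_category.values())
    return total_questions, total_reward / total_questions


total_questions, overall = cumulative_average(reported)
print(
    f"Cumulative average reward across {total_questions} questions was {overall:.3f}."
)
```

The weighted sum comes to 67 correct answers out of 325 questions, which matches the reported 0.206.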

Copilot

Pull Request Overview

This PR adds a baseline prompt and example usage documentation for the ether0 benchmark while also updating dependency specifications to support the new baselines feature.

- Added a new constant (`LOOSE_XML_ANSWER_USER_PROMPT`) in `model_prompts.py` for XML-formatted SMILES answers.
- Updated `pyproject.toml` with new baselines dependencies.
- Extended `README.md` with benchmark instructions and a complete code snippet to run the evaluation.
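
To make the idea of an XML-formatted answer concrete, here is a rough sketch of how a SMILES string might be pulled out of a model response. Everything in it is an assumption for illustration: the `<answer>` tag name, the `ANSWER_PATTERN` regex, and the `extract_xml_answer` helper are hypothetical and are not taken from `LOOSE_XML_ANSWER_USER_PROMPT` or from ether0's own extraction code.

```python
import re
from typing import Optional

# Assumption: the answer is wrapped in <answer>...</answer>. The actual tag
# requested by the prompt constant is not shown on this page.
ANSWER_PATTERN = re.compile(
    r"<answer>\s*(.*?)\s*</answer>", re.DOTALL | re.IGNORECASE
)


def extract_xml_answer(text: str) -> Optional[str]:
    """Return the content of the last <answer> tag pair, or None if absent."""
    matches = ANSWER_PATTERN.findall(text)
    return matches[-1] if matches else None


print(extract_xml_answer("Ethanol is <answer> CCO </answer>."))
print(extract_xml_answer("No tags here."))
```

Taking the last match is a design choice in this sketch: a model that revises its answer mid-response is scored on its final attempt.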

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/ether0/model_prompts.py | Added a constant for XML-based SMILES answer prompts. |
| pyproject.toml | Added new dependencies required for running baselines. |
| README.md | Added detailed benchmark instructions and example evaluation snippet. |

Comments suppressed due to low confidence (2)

src/ether0/model_prompts.py:121

[nitpick] Consider adding a comment above this constant to explain its purpose and usage context for future maintainers.

LOOSE_XML_ANSWER_USER_PROMPT = (

README.md:177

Ensure that top-level `await` is supported in your IPython environment, or include guidance to wrap asynchronous calls inside an async function.

results = await asyncio.gather(
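
For readers running outside IPython or Jupyter, one way to follow this suggestion is to move the `asyncio.gather` call into a coroutine and drive it with `asyncio.run`. This is a minimal sketch under stated assumptions: `evaluate_one` is a hypothetical stand-in for whatever awaitable the README snippet actually gathers, and its reward logic is a placeholder.

```python
import asyncio


async def evaluate_one(question: str) -> float:
    """Hypothetical stand-in for one evaluation; returns a placeholder reward."""
    await asyncio.sleep(0)  # Placeholder for real awaitable work, such as an LLM call
    return float(len(question) % 2)


async def main() -> list[float]:
    questions = ["q1", "q22", "q333"]
    # asyncio.gather returns results in the same order as its inputs
    return await asyncio.gather(*(evaluate_one(q) for q in questions))


# asyncio.run creates an event loop, so no top-level await is required.
rewards = asyncio.run(main())
print(rewards)
```

In a notebook, where an event loop is already running, `await main()` can be used directly instead of `asyncio.run(main())`.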

jamesbraza added 2 commits June 5, 2025 12:39

Made baselines extra for tutorial

351945a

Put LOOSE_XML_ANSWER_USER_PROMPT in the model prompts file

b421178

jamesbraza requested review from albertbou92, geemi725, maykcaldas, sidnarayanan and whitead June 5, 2025 19:45

jamesbraza self-assigned this Jun 5, 2025

Copilot AI review requested due to automatic review settings June 5, 2025 19:45

jamesbraza added the enhancement New feature or request label Jun 5, 2025

jamesbraza force-pushed the baselines-docs branch from 7118c15 to 82d5e8c Compare June 5, 2025 19:45

Copilot AI reviewed Jun 5, 2025


Added gpt-4o baseline tutorial

81260cd

jamesbraza force-pushed the baselines-docs branch from 82d5e8c to 81260cd Compare June 5, 2025 19:46

jamesbraza requested review from MicPie and ludomitch June 5, 2025 20:04

whitead approved these changes Jun 5, 2025


jamesbraza merged commit f6a0ca6 into main Jun 5, 2025
3 checks passed

jamesbraza deleted the baselines-docs branch June 5, 2025 20:09

jamesbraza mentioned this pull request Jun 6, 2025

Reusing extract_answer_loose in accuracy_reward #7

Merged
