Collaborator

@jamesbraza jamesbraza commented Jun 5, 2025

Used lmi/aviary to run a baseline on ether0-benchmark with gpt-4o.

I also realized we hadn't open-sourced our baselines prompt yet.

Here is the printed output:

In category 'functional-group' of 10 questions, average reward was 0.000.
In category 'molecule-completion' of 25 questions, average reward was 0.200.
In category 'molecule-formula' of 25 questions, average reward was 0.000.
In category 'molecule-name' of 25 questions, average reward was 0.080.
In category 'oracle-solubility' of 25 questions, average reward was 0.040.
In category 'property-cat-eve' of 25 questions, average reward was 0.400.
In category 'property-cat-safety' of 25 questions, average reward was 0.360.
In category 'property-cat-smell' of 25 questions, average reward was 0.320.
In category 'property-regression-adme' of 25 questions, average reward was 0.480.
In category 'property-regression-ld50' of 25 questions, average reward was 0.360.
In category 'property-regression-pka' of 25 questions, average reward was 0.280.
In category 'reaction-prediction' of 25 questions, average reward was 0.160.
In category 'retro-synthesis' of 25 questions, average reward was 0.000.
In category 'simple-formula' of 15 questions, average reward was 0.000.
Cumulative average reward across 325 questions was 0.206.
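
As a sanity check on the output above, the cumulative figure appears to be the question-weighted mean of the per-category averages, not a plain mean of the 14 category scores. The sketch below only redoes that arithmetic from the printed numbers; the `results` list and variable names are illustrative and are not part of ether0 or lmi.

```python
# (category, number of questions, average reward), copied from the printed output.
results = [
    ("functional-group", 10, 0.000),
    ("molecule-completion", 25, 0.200),
    ("molecule-formula", 25, 0.000),
    ("molecule-name", 25, 0.080),
    ("oracle-solubility", 25, 0.040),
    ("property-cat-eve", 25, 0.400),
    ("property-cat-safety", 25, 0.360),
    ("property-cat-smell", 25, 0.320),
    ("property-regression-adme", 25, 0.480),
    ("property-regression-ld50", 25, 0.360),
    ("property-regression-pka", 25, 0.280),
    ("reaction-prediction", 25, 0.160),
    ("retro-synthesis", 25, 0.000),
    ("simple-formula", 15, 0.000),
]

total_questions = sum(n for _, n, _ in results)

# Weight each category average by its question count to recover the total reward.
total_reward = sum(n * avg for _, n, avg in results)
cumulative = total_reward / total_questions

# For comparison: a plain mean over categories, ignoring category sizes.
unweighted = sum(avg for _, _, avg in results) / len(results)

print(f"Questions: {total_questions}")              # Questions: 325
print(f"Question-weighted mean: {cumulative:.3f}")  # 0.206
print(f"Unweighted category mean: {unweighted:.3f}")  # 0.191
```

The weighted value matches the reported 0.206, while the unweighted mean is lower because the two smaller categories both scored 0.000.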

@jamesbraza jamesbraza self-assigned this Jun 5, 2025
Copilot AI review requested due to automatic review settings June 5, 2025 19:45
@jamesbraza jamesbraza added the enhancement New feature or request label Jun 5, 2025

Copilot AI left a comment

Pull Request Overview

This PR adds a baseline prompt and example usage documentation for the ether0 benchmark while also updating dependency specifications to support the new baselines feature.

  • Added a new constant (LOOSE_XML_ANSWER_USER_PROMPT) in model_prompts.py for XML-formatted SMILES answers.
  • Updated pyproject.toml with new baselines dependencies.
  • Extended README.md with benchmark instructions and a complete code snippet to run the evaluation.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/ether0/model_prompts.py | Added a constant for XML-based SMILES answer prompts. |
| pyproject.toml | Added new dependencies required for running baselines. |
| README.md | Added detailed benchmark instructions and example evaluation snippet. |
Comments suppressed due to low confidence (2)

src/ether0/model_prompts.py:121

  • [nitpick] Consider adding a comment above this constant to explain its purpose and usage context for future maintainers.
LOOSE_XML_ANSWER_USER_PROMPT = (

README.md:177

  • Ensure that top-level 'await' is supported in your IPython environment, or include guidance to wrap asynchronous calls inside an async function.
results = await asyncio.gather(
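
For readers running outside IPython or a notebook, one way to follow this suggestion is to move the `asyncio.gather` call into a coroutine and start it with `asyncio.run`. This is a minimal sketch, not the README's actual snippet: `fetch_answer` and `run_all` are hypothetical names, and `fetch_answer` is a stand-in for whatever model call the real code awaits.

```python
import asyncio


async def fetch_answer(question: str) -> str:
    """Hypothetical stand-in for one model call; not an ether0 or lmi function."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return f"answer to {question}"


async def run_all(questions: list[str]) -> list[str]:
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fetch_answer(q) for q in questions))


if __name__ == "__main__":
    # asyncio.run creates the event loop, so no top-level await is needed.
    results = asyncio.run(run_all(["q1", "q2"]))
    print(results)
```

In IPython and Jupyter, where top-level `await` is typically available, `results = await run_all([...])` can be used directly instead of `asyncio.run`.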

@jamesbraza jamesbraza merged commit f6a0ca6 into main Jun 5, 2025
3 checks passed
@jamesbraza jamesbraza deleted the baselines-docs branch June 5, 2025 20:09