
Conversation

@allenanie
Member

Adding multi-modal support. Also introducing a context section.

For context, the design intention is that if the user provides context, it appears in the user message; if no context is provided, the section is omitted entirely (a minimal sketch follows).
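
A minimal sketch of that behavior, assuming a hypothetical `build_user_message` helper (the function name and section layout are illustrative, not the implementation in this PR):

```python
from typing import Optional

def build_user_message(query: str, context: Optional[str] = None) -> str:
    """Assemble the user message; the context section appears only when context is given."""
    sections = []
    if context:
        # Context provided by the user goes into its own section of the user message.
        sections.append(f"# Context\n{context}")
    sections.append(f"# Query\n{query}")
    return "\n\n".join(sections)

# With context the message starts with a "# Context" section; without it, only the query appears.
print(build_user_message("Improve the code.", context="It times out on large inputs."))
print(build_user_message("Improve the code."))
```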

@allenanie allenanie requested a review from Copilot October 13, 2025 17:48

Copilot AI left a comment


Pull Request Overview

This PR implements multi-modal support for optimizers and introduces a context section to provide additional information during optimization. The changes enable image input handling, context passing, and improved structure for optimization prompts.

Key changes include:

  • Multi-modal payload support for handling images alongside text queries
  • Context section implementation for passing additional optimization context
  • Optimizer API enhancements to support image and context inputs

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Summary per file:

  • tests/unit_tests/test_priority_search.py — Added multi-modal message handling for test compatibility
  • opto/optimizers/utils.py — Added image encoding utility for base64 conversion
  • opto/optimizers/optoprime_v2.py — Main multi-modal and context implementation with API changes
  • opto/optimizers/opro_v2.py — Extended OPRO optimizer with context support
  • opto/features/flows/types.py — Added multi-modal payload types and query normalization
  • opto/features/flows/compose.py — Updated TracedLLM to handle multi-modal payloads
  • docs/tutorials/minibatch.ipynb — Updated escape sequences in notebook output
  • .github/workflows/ci.yml — Commented out optimizer test suite


@chinganc chinganc self-assigned this Oct 13, 2025
@allenanie
Member Author

allenanie commented Oct 22, 2025

TODO:

  1. Support loading images directly from a URL (Adith)
  2. Add support for in-memory images (e.g., RGB/NumPy arrays) (Ching-An)
  3. Let nodes carry images (multi-modal): when we traverse the graph, we add the image as a payload for the optimizer. Add a function to the node to determine whether it holds an image (see the sketch below).
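
A rough sketch of item 3, assuming a hypothetical `Node.is_image()` check and payload-collection helper (these names are illustrative, not the actual Trace node API):

```python
import base64
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Node:
    """Illustrative stand-in for a traced node that may hold an image."""
    data: Any
    children: List["Node"] = field(default_factory=list)

    def is_image(self) -> bool:
        # For this sketch, treat raw bytes as image content.
        return isinstance(self.data, (bytes, bytearray))

def collect_image_payloads(root: Node) -> List[str]:
    """Traverse the graph and gather base64-encoded images to attach to the optimizer payload."""
    payloads, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.is_image():
            payloads.append(base64.b64encode(bytes(node.data)).decode("utf-8"))
        stack.extend(node.children)
    return payloads
```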

@allenanie
Member Author

@copilot open a new pull request to apply changes based on the comments in this thread


Copilot AI commented Nov 10, 2025

@allenanie I've opened a new pull request, #54, to work on those changes. Once the pull request is ready, I'll request review from you.

allenanie and others added 4 commits November 10, 2025 17:17
[WIP] Add multi-modal optimizer and context support
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 14 comments.

Comments suppressed due to low confidence (1)

opto/optimizers/optoprime_v2.py:236

        return OptoPrime.extract_llm_suggestion(response)


@AgentOpt AgentOpt deleted a comment from Copilot AI Nov 26, 2025
@allenanie
Member Author

allenanie commented Dec 12, 2025

TODO:

  1. Add API usage / common usage patterns (documentation)
  2. The image context API is broken; migrate it to the current backbone
  3. Support feedback given as an image (basically ready)
  4. Optimizer extraction of image output (replace the parameter node if it is an image)

@allenanie
Member Author

Documenting some decisions here:

OpenAI released the Responses API and announced a migration path away from (and eventual retirement of) the Completions API. This triggered changes across the industry: LiteLLM introduced a beta version of the Responses API, and it is unclear whether Google/Anthropic will follow.

Although LiteLLM's Responses API is usable, its support for multi-modality (image generation) is quite poor, at least for Gemini. We are making Gemini a first-party backend for OpenTrace going forward, so for this PR we are staying with LiteLLM's completion API, with the option to upgrade to the Responses API in the future.

…ause these meanings are shifting (the "premium" model of 2025 will be the "cheap" model of 2027, which causes confusion and unreliability for the users).
…(automatically generated to increase coverage)
@allenanie
Member Author

allenanie commented Dec 27, 2025

For backward compatibility, llm.py is designed as follows (mm_beta means the multi-modal beta version):

When mm_beta (multi-modal) is enabled, we use:

  1. LiteLLM's Responses API (most compatible with OpenAI models, but it can also work with others)

When mm_beta is disabled, for backward compatibility, we use:

  1. LiteLLM's completion API (the default)

For any Google model (model name starting with gemini), we use Google's generate_content API (LiteLLM's Gemini support is insufficient for our use case). The dispatch is sketched below.
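
A rough sketch of this dispatch, assuming LiteLLM's responses()/completion() entry points and the Google GenAI client; the wrapper name and exact arguments are illustrative, not the code in llm.py:

```python
import litellm
from google import genai  # Google GenAI SDK

def call_llm(model: str, messages: list, mm_beta: bool = False):
    """Route a chat request according to the rules above (message-format conversion elided)."""
    if model.startswith("gemini"):
        # Google models: call generate_content directly, since LiteLLM's Gemini
        # support is insufficient for our multi-modal use case.
        client = genai.Client()
        return client.models.generate_content(model=model, contents=messages)
    if mm_beta:
        # Multi-modal beta path: LiteLLM's Responses API.
        return litellm.responses(model=model, input=messages)
    # Default, backward-compatible path: LiteLLM's completion API.
    return litellm.completion(model=model, messages=messages)
```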

Even with this small change, a lot of details had to be handled (a small normalization sketch follows the list):

  1. OpenAI returns images as base64 strings; the Google GenAI library returns raw bytes.
  2. The Google GenAI library expects system_instruction to be passed explicitly; OpenAI uses the message role role="system".
  3. (and other small quality-of-life fixes)
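
A small sketch of the normalization implied by points 1 and 2 (the helper names are hypothetical):

```python
import base64
from typing import List, Optional, Tuple, Union

def to_base64(image_data: Union[str, bytes]) -> str:
    """Normalize image data to a base64 string: OpenAI already returns base64 text,
    while the Google GenAI library returns raw bytes."""
    if isinstance(image_data, (bytes, bytearray)):
        return base64.b64encode(image_data).decode("utf-8")
    return image_data

def split_system_prompt(messages: List[dict]) -> Tuple[Optional[str], List[dict]]:
    """Pull the role="system" message out of an OpenAI-style message list so it can be
    passed separately as system_instruction to the Google GenAI library."""
    system, rest = None, []
    for message in messages:
        if message.get("role") == "system":
            system = message.get("content")
        else:
            rest.append(message)
    return system, rest
```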

In addition to the llm.py changes, we updated the AssistantTurn construction. It can now take a raw response from the LLM API call and map the returned result directly into our class.

This is not strictly necessary, but it simplifies the optimizer's design, since it no longer needs to interact with the raw LLM API response object.
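
For illustration, the mapping might look roughly like this; the real AssistantTurn lives in this PR, but the fields and the from_response classmethod below are assumptions:

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class AssistantTurnSketch:
    """Illustrative only: a structured record of one assistant reply."""
    model: str
    text: Optional[str] = None
    images_b64: List[str] = field(default_factory=list)
    raw: Any = None  # keep the raw provider response around for debugging

    @classmethod
    def from_response(cls, model: str, response: Any) -> "AssistantTurnSketch":
        # The real constructor maps a provider-specific response object
        # (LiteLLM or Google GenAI) into the structured turn; only the idea is shown here.
        text = getattr(response, "output_text", None) or str(response)
        return cls(model=model, text=text, raw=response)
```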

@allenanie
Member Author

Multi-turn conversation is tested.

See the test test_real_google_genai_multi_turn_with_images_updated in test_optimizer_backbone.py.

We store conversation history as structured data in AssistantTurn and UserTurn objects, which are added to a ConversationHistory object. When we need to pass them back into an LLM API call, we call history.to_messages() to get the input automatically, or explicitly call history.to_gemini_format() or history.to_litellm_format().

to_messages() checks which model was used by the last AssistantTurn and automatically determines which format function to use. However, this is not 100% reliable: for example, if the CustomLLM backend is used but a Gemini model is called, the automatic conversion will fail, because CustomLLM expects an OpenAI-compatible server.
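
A usage sketch of this flow; the ConversationHistory, UserTurn, and AssistantTurn names and the to_messages()/to_gemini_format()/to_litellm_format() methods come from this PR, but the constructors, fields, and append call below are assumptions:

```python
# Hypothetical usage, not copied from the tests.
history = ConversationHistory()
history.append(UserTurn(content="Here is the traced code and the feedback..."))
history.append(AssistantTurn(model="gemini-2.0-flash", text="Suggested update: ..."))

# Let to_messages() inspect the last AssistantTurn's model and pick the format...
messages = history.to_messages()

# ...or be explicit when the automatic check cannot be trusted, e.g. a CustomLLM
# backend serving a Gemini model behind an OpenAI-compatible server:
messages = history.to_litellm_format()
```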

@allenanie
Member Author

So far, all supporting functions for multi-modal capabilities are finished:

  1. backbone.py
  2. llm.py

Tests are finished:

  1. test_llm.py
  2. test_optimizer_backbone.py

Remaining TODOs:

  1. Integrate this into the optimizer class (i.e., support image parameter extraction)
  2. Write a notebook demonstrating usage of the backbone as well as the new optimizer.

Refactor: moved the Gemini input message history conversion to `ConversationHistory`