Open
Description
The paper https://arxiv.org/pdf/2511.04570 evaluates video generation as follows:
Input to the model: The program feeds the model a “whiteboard image” containing the question, along with a text instruction such as: “Solve this problem step by step on the whiteboard and finally read the answer aloud in audio.”
Model output: The model generates a video showing a hand writing the solution steps on the whiteboard and ultimately producing the answer.
Evaluation: The evaluation program extracts the last video frame and the audio, then uses a strong language model (LLM-as-a-Judge) to check whether the answer shown in the video and spoken in the audio matches the reference answer stored in the dataset (e.g. “330,000”).
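A rough sketch of the final comparison step, in case it helps the discussion. It assumes the last-frame text and the audio transcript have already been extracted upstream (by OCR, ASR, or the judge model); the names `normalize_number` and `score_answer` are hypothetical and not taken from the paper. It is only a cheap string-level pre-check, not a replacement for the LLM judge: a spoken answer such as "three hundred thirty thousand" would not match here.

```python
import re
from typing import Dict, Optional


def normalize_number(text: str) -> Optional[str]:
    """Return the last number found in `text`, with thousands separators removed."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def score_answer(frame_text: str, audio_transcript: str, reference: str) -> Dict[str, bool]:
    """Compare the written and spoken answers against the reference answer.

    All three inputs are plain strings; how they are extracted from the
    video is outside the scope of this sketch.
    """
    ref = normalize_number(reference)
    return {
        "frame_correct": ref is not None and normalize_number(frame_text) == ref,
        "audio_correct": ref is not None and normalize_number(audio_transcript) == ref,
    }


if __name__ == "__main__":
    print(score_answer("Total = 330,000", "The answer is 330000.", "330,000"))
```

Scoring the frame and the audio separately makes it visible when the model writes the right answer but reads out a different one.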
Do we need to support it?
- Evaluator prompt design, e.g. "Read the number in the image."
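As a starting point for the prompt design, here is a minimal sketch of a judge prompt builder. The wording, the `build_judge_prompt` name, and the CORRECT/INCORRECT reply format are all assumptions for illustration; the paper's actual judge prompt may differ.

```python
def build_judge_prompt(reference: str, modality: str) -> str:
    """Build a judge instruction for one modality ("image" or "audio")."""
    if modality == "image":
        task = "Read the final number written in the image."
    elif modality == "audio":
        task = "Transcribe the final number spoken in the audio."
    else:
        raise ValueError(f"unsupported modality: {modality!r}")
    return (
        f"{task}\n"
        f"Reference answer: {reference}\n"
        "Reply with CORRECT if the number matches the reference answer "
        "(ignoring formatting such as thousands separators), otherwise reply INCORRECT."
    )


if __name__ == "__main__":
    print(build_judge_prompt("330,000", "image"))
```

Restricting the reply to a single token keeps the judge output easy to parse.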