[Feature] Support text-centric task

The paper at https://arxiv.org/pdf/2511.04570 does the following:

- Input to the model: The program feeds the model a "whiteboard image" containing the question, along with a text instruction such as: "Solve this problem step by step on the whiteboard and finally read the answer aloud in audio."
- Model output: The model generates a video showing a hand writing the solution steps on the whiteboard and ultimately producing the answer.
- Evaluation: The evaluation program extracts the last video frame and the audio, then uses a strong language model (LLM-as-a-Judge) to check whether the answer shown in the video and spoken in the audio matches the reference answer stored in the dataset (for example, "330,000").
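To make the evaluation step concrete, here is a minimal sketch of the final scoring logic. It assumes the answers have already been read out of the last frame and the audio as text, and it swaps the LLM judge for plain string comparison after normalization. The names `normalize_answer` and `score_sample`, and the rule that both modalities must match, are assumptions for illustration and are not taken from the paper's code.

```python
import re


def normalize_answer(text: str) -> str:
    """Normalize an answer so that "330,000" and "330000." compare equal."""
    text = text.strip().lower()
    # Remove commas that sit between two digits (thousands separators).
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    # Drop a trailing period left over from a sentence-style answer.
    return text.rstrip(".")


def score_sample(frame_answer: str, audio_answer: str, gold: str) -> dict:
    """Compare the answer read from the last frame and from the audio to the gold answer.

    Assumption: a sample counts as correct only if BOTH modalities match.
    """
    gold_norm = normalize_answer(gold)
    video_ok = normalize_answer(frame_answer) == gold_norm
    audio_ok = normalize_answer(audio_answer) == gold_norm
    return {
        "video_correct": video_ok,
        "audio_correct": audio_ok,
        "correct": video_ok and audio_ok,
    }


if __name__ == "__main__":
    print(score_sample("330,000", "330000.", "330,000"))
```

Reporting the per-modality flags separately makes it possible to see whether failures come from the written answer or the spoken one.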

Do we need to support this?

- [ ] Evaluator prompt design, e.g. "Read the number in the image"
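
As a starting point for the prompt design item, here is a rough sketch of a judge prompt builder and a strict verdict parser. The template wording, the JSON reply format, and the names `JUDGE_TEMPLATE`, `build_judge_prompt`, and `parse_verdict` are all hypothetical. The call to the judge model itself is left out because it depends on which backend we choose.

```python
import json

# Hypothetical template; doubled braces are literal braces for str.format.
JUDGE_TEMPLATE = (
    "You are grading a model's {modality} output.\n"
    "Read the final answer shown in the {modality}, "
    "then compare it with the reference answer.\n"
    "Reference answer: {gold}\n"
    'Reply with JSON only: {{"extracted": "<answer you read>", "match": true or false}}'
)


def build_judge_prompt(gold_answer: str, modality: str) -> str:
    """Fill the template for one modality, e.g. "image" or "audio"."""
    return JUDGE_TEMPLATE.format(modality=modality, gold=gold_answer)


def parse_verdict(reply: str) -> bool:
    """Return True only when the reply is valid JSON with "match" set to true.

    Anything else (free text, missing key, string "true") counts as a failure,
    so a malformed judge reply never produces a false positive.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("match") is True


if __name__ == "__main__":
    print(build_judge_prompt("330,000", "image"))
```

Asking for a fixed JSON reply keeps parsing simple and lets us log the extracted answer for debugging.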


[Feature] Support text-centric task #145
