diff --git a/README.md b/README.md
index 5787648..5710012 100644
--- a/README.md
+++ b/README.md
@@ -46,10 +46,11 @@ DeepMath implements both. The model learns to generate short Python snippets, wh
 - Inference: based on [SmolAgents](https://github.com/huggingface/smolagents/), a math agent was created. vLLM is used as the inference engine.
 - Training: based on the GRPO trainer in [TRL](https://github.com/huggingface/trl), we modified TRL's vLLM client and server to generate GRPO completions using our DeepMath agent.
 
-
-Changes to vLLM client and server in TRL library.
-
-Figure 1: The vLLM client and server were modified to use the DeepMath agent in generating the candidates, while using the vLLM backend.
-
+
+Changes to vLLM client and server in TRL library.
+
+
+Figure 1: The vLLM client and server were modified to use the DeepMath agent in generating the candidates, while using the vLLM backend.
+
 
 - **Agent Interface:** During inference, the model can output normal tokens or special agent calls containing Python snippets.
@@ -63,10 +64,9 @@ DeepMath implements both. The model learns to generate short Python snippets, wh
 - **Interpretability:** Snippets are readable and auditable.
 
-
-Output example: it contains a short python snippet as well as its output which is used in the reasoning process.
-
-Figure 2: Output example where python code is generated, evaluated and the answer is inserted into the trace and used for context.
-
+
+Output example: it contains a short python snippet as well as its output which is used in the reasoning process.
+Figure 2: Output example where python code is generated, evaluated and the answer is inserted into the trace and used for context.
+
 
 ## Training with GRPO
 
@@ -92,7 +92,13 @@ We benchmarked DeepMath against baselines on four datasets. Metrics include:
 - **Mean output length** (brevity).
 
+
+Main results table.
+
+- We compare a baseline configuration ([Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507), no agent) with our DeepMath model. As an ablation, we evaluate our agentic framework running with the untrained Qwen3 model, denoted **+Agent**. Additionally, we examine whether GRPO training (for agentic use) also improves non-agentic inference, denoted **+GRPO**. The two ablations are therefore independent, not additive.
+
+- We observe that agentic inference reduces output lengths, with mixed accuracy results. The DeepMath model, which is both GRPO-trained and run in agentic mode, shows the highest accuracy with shortened traces. We conclude that **both GRPO training and agentic inference are needed** for the best results.
 
 **Key Insight:** DeepMath reduces output length by up to **66%** while improving accuracy on challenging datasets.
diff --git a/assets/trl-grpo-vllm-deepmath.png b/assets/trl-grpo-vllm-deepmath.png
index 79d69ee..2281c47 100644
Binary files a/assets/trl-grpo-vllm-deepmath.png and b/assets/trl-grpo-vllm-deepmath.png differ
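
The hunks above describe the agent interface only at a high level: the model can emit a short Python snippet, the snippet is executed, and its printed output is inserted back into the reasoning trace (Figure 2). The sketch below illustrates that loop in plain Python; the `<code>`/`<output>` tags, function names, and the use of `exec` are illustrative placeholders, not DeepMath's actual call format or sandbox.

```python
import contextlib
import io


def run_snippet(snippet: str) -> str:
    """Execute a short Python snippet and capture whatever it prints."""
    buffer = io.StringIO()
    namespace: dict = {}
    with contextlib.redirect_stdout(buffer):
        exec(snippet, namespace)  # a real system would sandbox this call
    return buffer.getvalue().strip()


def agentic_step(trace: str, model_output: str) -> str:
    """If the model output contains a code call, run it and append the result to the trace."""
    start, end = "<code>", "</code>"  # placeholder markers, not DeepMath's special tokens
    if start in model_output and end in model_output:
        snippet = model_output.split(start, 1)[1].split(end, 1)[0]
        result = run_snippet(snippet)
        return trace + model_output + f"\n<output>{result}</output>\n"
    return trace + model_output


# Toy usage: the "model output" delegates the arithmetic to a snippet.
trace = agentic_step("", "Let me compute this.\n<code>print(17 * 23 + 5)</code>")
print(trace)  # the snippet's output (396) is now part of the context for further reasoning
```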
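On the training side, the diff only changes how TRL's vLLM client and server obtain completions; those modifications are not reproduced here. For reference, the following is a minimal sketch of a stock TRL GRPO setup with vLLM-based generation, without the agentic completion generation described above. The dataset, reward function, and hyperparameters are placeholders.

```python
# Minimal stock TRL GRPO sketch (not the modified DeepMath client/server).
# Assumes a vLLM generation server has been started separately, e.g. with `trl vllm-serve`.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


def brevity_reward(completions, **kwargs):
    """Toy reward favoring shorter completions; stands in for a real correctness reward."""
    return [-len(c) / 1000.0 for c in completions]


# Placeholder dataset providing a "prompt" column.
dataset = load_dataset("trl-lib/tldr", split="train")

config = GRPOConfig(
    output_dir="grpo-sketch",
    use_vllm=True,            # generate GRPO candidates with vLLM instead of the training model
    num_generations=8,        # GRPO group size (candidates per prompt)
    max_completion_length=512,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B-Thinking-2507",
    reward_funcs=brevity_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```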