# ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"


**TODOs:** Colab notebook, dataset on Hugging Face, model on Hugging Face.

This is the official repository for *ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"*.

*(demo animation)*

## Get Started: A Quick Demo

### Step 0: Install packages

```bash
git clone https://github.com/zhongyi-zhou/toolgrad.git
cd toolgrad
conda env create -f environment.yml
conda activate toolgrad
```

### Step 1: Launch the ToolGrad framework on an MCP service

```bash
export PYTHONPATH=./
python examples/mcp_filesystem.py
```

## Reproduction of Dataset Generation

*(pipeline figure)*

### Step 0: ToolBench API Key

You first need to obtain a ToolBench API key by following the instructions in the ToolBench repository.

Note: the API key is required for all of the following steps.

### Step 1: ToolBench Setup

```bash
export TOOLBENCH_KEY=YOURTOOLBENCHKEY
```

You also need to set up the ToolBench API database:

- Unzip `tools.zip` (Google Drive); this produces a `tools/` folder.
- Add this path to your environment as follows:

```bash
export TOOLBENCH_LIBRARY_ROOT=YOUR_PATH/TO/TOOLS
```
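Before generating data, it can help to confirm the environment variables above are set correctly. The helper below is a hypothetical sanity check, not part of the repo:

```python
import os
from pathlib import Path

# Hypothetical sanity check (not part of the ToolGrad repo): verify the
# environment variables configured in the steps above.
def check_toolbench_env() -> list[str]:
    problems = []
    if not os.environ.get("TOOLBENCH_KEY"):
        problems.append("TOOLBENCH_KEY is not set")
    root = os.environ.get("TOOLBENCH_LIBRARY_ROOT")
    if not root or not Path(root).is_dir():
        problems.append("TOOLBENCH_LIBRARY_ROOT does not point to a directory")
    return problems

print(check_toolbench_env() or "ToolBench environment looks OK")
```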

### Step 2: Generate your first ToolGrad sample on the ToolBench API database

```bash
export PYTHONPATH=./
python examples/toolbench.py
```

You will then find a new JSON file under `examples/outputs/`. `examples/example_outputs/seed=123__iter=5__num_apis=50.json` is an example output that we generated.
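The output filename encodes the generation parameters. A small sketch of the naming convention (the helper below is illustrative, not part of the repo):

```python
# Illustrative only: reconstruct the output filename pattern shown above,
# which encodes the seed, number of ToolGrad iterations, and API pool size.
def output_name(seed: int, iters: int, num_apis: int) -> str:
    return f"seed={seed}__iter={iters}__num_apis={num_apis}.json"

print(output_name(123, 5, 50))  # seed=123__iter=5__num_apis=50.json
```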

ToolGrad-5K is composed of 5k data-generation sessions with different seeds. Generating the full 5K dataset costs roughly 250 USD using gpt-4.1-mini.
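The quoted total implies a per-session cost of about five cents; a back-of-envelope check:

```python
# Back-of-envelope arithmetic using the figures quoted above.
total_cost_usd = 250
num_sessions = 5_000
per_session = total_cost_usd / num_sessions
print(f"~${per_session:.2f} per session")  # ~$0.05 per session
```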

## Evaluation

First, download the dataset from Google Drive and unzip it. You should see the following folder structure:

```
ToolGrad-5k
├── data
├── metadata
├── prediction
└── sft_data
```

The `prediction` folder stores the predictions of three ToolGrad models on the test set. Run the following command to perform evaluation with LLM judges:

```bash
python src/eval.py --pred_model toolgrad-1b --dataset ~/YOUR_DATASET_STORAGE_DIR/ToolGrad-5k/
```

You should see output similar to the following in your terminal:

```
judge model: gpt-4.1
100%|████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 1384.60it/s]
               Recall  Success Rate     QoR
Model
toolgrad-1b  0.987917      0.955482  93.702
```

This is an exact reproduction of our results.

If you wish to re-run the LLM judge, use the following command (note: this incurs OpenAI API costs):

```bash
python src/eval.py --pred_model toolgrad-1b \
  --dataset ~/YOUR_DATASET_STORAGE_DIR/ToolGrad-5k/ \
  --overwrite \
  --num_process 16
```

You should see new results with values similar to ours. You can adjust `num_process` depending on your OpenAI API rate limit (RPM).
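As a rough, hypothetical heuristic (not something the repo provides), you can size `num_process` from your RPM budget and an assumed per-call latency:

```python
# Hypothetical heuristic, not part of the repo: keep concurrent judge
# calls under the account's requests-per-minute (RPM) budget.
def suggest_num_process(rpm_limit: int, seconds_per_call: float = 4.0) -> int:
    calls_per_minute_per_worker = 60.0 / seconds_per_call
    return max(1, int(rpm_limit / calls_per_minute_per_worker))

print(suggest_num_process(500))  # 33 workers under the assumed 4 s/call
```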

## BibTeX

```bibtex
@misc{zhou2025toolgradefficienttoolusedataset,
      title={ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"},
      author={Zhongyi Zhou and Kohei Uehara and Haoyu Zhang and Jingtao Zhou and Lin Gu and Ruofei Du and Zheng Xu and Tatsuya Harada},
      year={2025},
      eprint={2508.04086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.04086},
}
```
