This is the official repo for **ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"**.
```bash
git clone https://github.com/zhongyi-zhou/toolgrad.git
cd toolgrad
conda env create -f environment.yml
conda activate toolgrad
```

Run the MCP filesystem example:

```bash
export PYTHONPATH=./
python examples/mcp_filesystem.py
```

You first need to obtain a ToolBench API key by following their instructions.
Note: The API key is necessary for the following procedures.
```bash
export TOOLBENCH_KEY=YOURTOOLBENCHKEY
```

You also need to set up the ToolBench API database:
- Unzip `tools.zip` (Google Drive); it will produce a `tools/` folder.
- Add this path to the environment as follows:

```bash
export TOOLBENCH_LIBRARY_ROOT=YOUR_PATH/TO/TOOLS
export PYTHONPATH=./
```
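Before running generation, it can help to confirm the environment is wired up. Below is a minimal sanity-check sketch (not part of the repo; the variable names match the exports above):

```python
import os

# Fail fast if the ToolBench key or library root is missing.
for var in ("TOOLBENCH_KEY", "TOOLBENCH_LIBRARY_ROOT"):
    if not os.environ.get(var):
        raise SystemExit(f"{var} is not set; export it before running the examples")

# The library root should point at the unzipped tools/ folder.
root = os.environ["TOOLBENCH_LIBRARY_ROOT"]
if not os.path.isdir(root):
    raise SystemExit(f"TOOLBENCH_LIBRARY_ROOT points to a missing directory: {root}")

print("Environment looks OK:", root)
```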
Then run the ToolBench example:

```bash
python examples/toolbench.py
```

You will then find a new JSON file under `examples/outputs/`. `examples/example_outputs/seed=123__iter=5__num_apis=50.json` is an example that we generated.
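To take a quick look at a generated file, here is a minimal sketch (the output schema is not documented here, so it only reports the top-level shape):

```python
import json

# Example output shipped with the repo.
path = "examples/example_outputs/seed=123__iter=5__num_apis=50.json"

with open(path) as f:
    data = json.load(f)

# The schema isn't specified in this README, so just inspect the structure.
if isinstance(data, dict):
    print("top-level keys:", list(data.keys()))
elif isinstance(data, list):
    print("number of records:", len(data))
```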
ToolGrad-5K is composed of 5,000 data-generation sessions, each run with a different seed. Generating the full 5K dataset with gpt-4.1-mini costs roughly $250 USD (about $0.05 per session).
First, download the dataset from Google Drive and unzip it. You should see the following folder structure:
```
ToolGrad-5k
├── data
├── metadata
├── prediction
└── sft_data
```
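As a quick check that the download unpacked correctly, a small sketch (the storage path is a placeholder; replace it with your own):

```python
import os

# Hypothetical storage location; replace with wherever you unzipped the dataset.
dataset_root = os.path.expanduser("~/YOUR_DATASET_STORAGE_DIR/ToolGrad-5k")

# The four folders shown in the tree above.
expected = ["data", "metadata", "prediction", "sft_data"]
missing = [d for d in expected if not os.path.isdir(os.path.join(dataset_root, d))]
if missing:
    raise SystemExit(f"missing folders: {missing}")
print("dataset layout looks complete")
```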
The `prediction` folder stores the predictions of three ToolGrad models on the test set. You can run the following command to perform evaluation with LLM judges:
```bash
python src/eval.py --pred_model toolgrad-1b --dataset ~/YOUR_DATASET_STORAGE_DIR/ToolGrad-5k/
```
You should see the following output in your terminal:
```
judge model: gpt-4.1
100%|████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 1384.60it/s]
               Recall  Success Rate     QoR
Model
toolgrad-1b  0.987917      0.955482  93.702
```
This is an exact reproduction of our results.
If you wish to run the LLM judge again, run the following command (note that this incurs costs on your OpenAI API account):
```bash
python src/eval.py --pred_model toolgrad-1b \
  --dataset ~/YOUR_DATASET_STORAGE_DIR/ToolGrad-5k/ \
  --overwrite \
  --num_process 16
```
You should see a new result with values similar to ours. Note that you can adjust `num_process` depending on your OpenAI API rate limit (RPM).
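As a rough rule of thumb (an assumption, not something the repo specifies): with `num_process` parallel workers and an average judge call taking `t` seconds, you issue about `60 * num_process / t` requests per minute, so keep that below your RPM limit. A back-of-the-envelope calculation:

```python
# Rough sizing for --num_process (assumed latency model, not from the repo).
rpm_limit = 500          # your OpenAI requests-per-minute limit (example value)
seconds_per_call = 3.0   # assumed average latency of one judge call

# Each worker issues 60 / seconds_per_call requests per minute,
# so total RPM scales linearly with the number of workers.
max_workers = int(rpm_limit * seconds_per_call / 60)
print(f"keep --num_process at or below ~{max_workers}")
```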
```bibtex
@misc{zhou2025toolgradefficienttoolusedataset,
  title={ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"},
  author={Zhongyi Zhou and Kohei Uehara and Haoyu Zhang and Jingtao Zhou and Lin Gu and Ruofei Du and Zheng Xu and Tatsuya Harada},
  year={2025},
  eprint={2508.04086},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.04086},
}
```
