# ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"


**TODOs:** Colab notebook, dataset on Hugging Face, model on Hugging Face.

This is the official repository for *ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"*.

*(demo animation)*

## Get Started: A Quick Demo

### Step 0: Install packages

```bash
git clone https://github.com/zhongyi-zhou/toolgrad.git
cd toolgrad
conda env create -f environment.yml
conda activate toolgrad
```

### Step 1: Launch the ToolGrad framework on an MCP service

```bash
export PYTHONPATH=./
python examples/mcp_filesystem.py
```

## Reproduction of Dataset Generation

*(pipeline figure)*

### Step 0: ToolBench API Key

You first need to obtain a ToolBench API key by following the instructions in the ToolBench repository.

Note: the API key is required for all of the following steps.

### Step 1: ToolBench Setup

```bash
export TOOLBENCH_KEY=YOURTOOLBENCHKEY
```

You also need to set up the ToolBench API database:

- Unzip `tools.zip` (Google Drive); this produces a `tools/` folder.
- Add this path to your environment as follows:

```bash
export TOOLBENCH_LIBRARY_ROOT=YOUR_PATH/TO/TOOLS
```
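Before generating data, it can help to confirm the environment variables above are set correctly. The helper below is a hypothetical sanity check, not part of the repo:

```python
import os
from pathlib import Path

# Hypothetical sanity check (not part of the ToolGrad repo): verify the
# environment variables configured in the steps above.
def check_toolbench_env() -> list[str]:
    problems = []
    if not os.environ.get("TOOLBENCH_KEY"):
        problems.append("TOOLBENCH_KEY is not set")
    root = os.environ.get("TOOLBENCH_LIBRARY_ROOT")
    if not root or not Path(root).is_dir():
        problems.append("TOOLBENCH_LIBRARY_ROOT does not point to a directory")
    return problems

print(check_toolbench_env() or "ToolBench environment looks OK")
```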

### Step 2: Generate your first ToolGrad sample on the ToolBench API database

```bash
export PYTHONPATH=./
python examples/toolbench.py
```

You will then find a new JSON file under `examples/outputs/`. `examples/example_outputs/seed=123__iter=5__num_apis=50.json` is an example output that we generated.
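The output filename encodes the generation parameters. A small sketch of the naming convention (the helper below is illustrative, not part of the repo):

```python
# Illustrative only: reconstruct the output filename pattern shown above,
# which encodes the seed, number of ToolGrad iterations, and API pool size.
def output_name(seed: int, iters: int, num_apis: int) -> str:
    return f"seed={seed}__iter={iters}__num_apis={num_apis}.json"

print(output_name(123, 5, 50))  # seed=123__iter=5__num_apis=50.json
```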

ToolGrad-5K is composed of 5k data-generation sessions with different seeds. Generating the full 5K dataset costs roughly 250 USD using gpt-4.1-mini.
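The quoted total implies a per-session cost of about five cents; a back-of-envelope check:

```python
# Back-of-envelope arithmetic using the figures quoted above.
total_cost_usd = 250
num_sessions = 5_000
per_session = total_cost_usd / num_sessions
print(f"~${per_session:.2f} per session")  # ~$0.05 per session
```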

## Evaluation

First, download the dataset from Google Drive and unzip it. You should see the following folder structure:

```
ToolGrad-5k
├── data
├── metadata
├── prediction
└── sft_data
```

The `prediction` folder stores the predictions of three ToolGrad models on the test set. Run the following command to perform evaluation with LLM judges:

```bash
python src/eval.py --pred_model toolgrad-1b --dataset ~/YOUR_DATASET_STORAGE_DIR/ToolGrad-5k/
```

You should see output similar to the following in your terminal:

```
judge model: gpt-4.1
100%|████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 1384.60it/s]
               Recall  Success Rate     QoR
Model
toolgrad-1b  0.987917      0.955482  93.702
```

This is an exact reproduction of our results.

If you wish to re-run the LLM judge, use the following command (note: this incurs OpenAI API costs):

```bash
python src/eval.py --pred_model toolgrad-1b \
  --dataset ~/YOUR_DATASET_STORAGE_DIR/ToolGrad-5k/ \
  --overwrite \
  --num_process 16
```

You should see new results with values similar to ours. You can adjust `num_process` depending on your OpenAI API rate limit (RPM).
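As a rough, hypothetical heuristic (not something the repo provides), you can size `num_process` from your RPM budget and an assumed per-call latency:

```python
# Hypothetical heuristic, not part of the repo: keep concurrent judge
# calls under the account's requests-per-minute (RPM) budget.
def suggest_num_process(rpm_limit: int, seconds_per_call: float = 4.0) -> int:
    calls_per_minute_per_worker = 60.0 / seconds_per_call
    return max(1, int(rpm_limit / calls_per_minute_per_worker))

print(suggest_num_process(500))  # 33 workers under the assumed 4 s/call
```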

## BibTeX

```bibtex
@misc{zhou2025toolgradefficienttoolusedataset,
      title={ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"},
      author={Zhongyi Zhou and Kohei Uehara and Haoyu Zhang and Jingtao Zhou and Lin Gu and Ruofei Du and Zheng Xu and Tatsuya Harada},
      year={2025},
      eprint={2508.04086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.04086},
}
```
