
FuncBenchGen

Implementation of the paper: Towards Reliable Benchmarking: A Contamination-Free, Controllable Evaluation Framework for Multi-step LLM Function Calling (arXiv:2509.26553).

⭐️ Key Contributions

  1. Introduces FuncBenchGen, a contamination-free, controllable evaluation framework that casts tool use as traversal over a hidden function-dependency DAG, letting users precisely tune task difficulty (graph size, dependency depth, and type-compatible distractor functions). 
  2. Provides an extensive empirical study across seven open/closed LLMs, showing reasoning-optimized models outperform general ones but degrade sharply with deeper dependencies; “connected” distractors (irrelevant yet type-compatible functions) strongly harm performance; and common failures stem from brittle state/variable tracking despite syntactically valid calls. 
  3. Proposes a lightweight mitigation—explicitly restating known variable values at each step—that requires no model changes and substantially boosts success rates (e.g., GPT-5 from 62.5%→81.3%).
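The first contribution can be illustrated with a minimal sketch (hypothetical names, not the repository's API): functions form a hidden dependency DAG, and solving a task means resolving the target variable by calling functions in dependency order, while a type-compatible distractor contributes nothing to the target.

```python
from graphlib import TopologicalSorter

# Hypothetical mini-benchmark: each function consumes input variables and
# produces one output variable; the task is to compute `target`.
functions = {
    "f1": {"inputs": [], "output": "a"},          # root: no inputs
    "f2": {"inputs": ["a"], "output": "b"},
    "f3": {"inputs": ["b"], "output": "target"},
    "f_distractor": {"inputs": ["a"], "output": "unused"},  # type-compatible distractor
}

def resolve(target: str) -> list[str]:
    """Return the minimal call order that produces `target`."""
    producer = {spec["output"]: name for name, spec in functions.items()}
    needed, stack = set(), [target]
    while stack:                       # walk dependencies backwards from the target
        fn = producer[stack.pop()]
        if fn not in needed:
            needed.add(fn)
            stack.extend(functions[fn]["inputs"])
    deps = {fn: {producer[v] for v in functions[fn]["inputs"]} for fn in needed}
    return list(TopologicalSorter(deps).static_order())

print(resolve("target"))  # -> ['f1', 'f2', 'f3']; the distractor is never called
```

Difficulty then becomes tunable: deeper chains mean longer forced call sequences, and each added distractor enlarges the search space without changing the solution.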
ID | OSS Component Name | Modified | Copyright Holder   | Upstream Link | License
1  | FuncBenchGen       | No       | Megagon Labs, Inc. | link          | BSD 3-Clause License

🛠️ Setup

The required packages are listed in requirements.txt. The setup below uses conda to manage the environment; if you do not have conda installed, follow the official installation instructions to run the code under the same conditions as ours. Then, in your terminal, run:

conda create -n funcbenchgen python=3.11
conda activate funcbenchgen
pip install -r requirements.txt

🚀 Quick Start

1. API Key Setup

This project uses LiteLLM to interface with various LLM providers. Ensure you have the necessary API keys set in your environment:

export OPENAI_API_KEY="your-api-key"
# If using other providers:
# export GEMINI_API_KEY="your-api-key"

2. Run Baseline Experiments

To run the main evaluation sweep described in the paper:

bash scripts/run_funcbenchgen.sh

This script will generate function-dependency graphs, execute tool-calling loops with the specified models, and save the results in data/results/.

3. Run Mitigation Experiments

To test the "explicitly restating known variable values" mitigation:

bash scripts/run_mitigation.sh

4. Running Individual Experiments

You can also run src/funcbenchgen.py directly for more granular control:

python src/funcbenchgen.py run \
  --model "gpt-5-2025-08-07" \
  --root_save_dir "./data" \
  --experiment_name "test_run" \
  --num_total_nodes 10 \
  --mcp_start 1 \
  --mcp_stop 2 \
  --num_graphs_per_config 1
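A sweep over configurations can be scripted by assembling the CLI invocation shown above (the flags are taken verbatim from this README; whether additional flags exist is an assumption):

```python
import shlex

def build_command(model: str, num_total_nodes: int, mcp_start: int, mcp_stop: int,
                  experiment_name: str, root_save_dir: str = "./data") -> list[str]:
    """Assemble one `funcbenchgen.py run` invocation, ready for subprocess.run()."""
    cmd = (
        f"python src/funcbenchgen.py run "
        f"--model {model} --root_save_dir {root_save_dir} "
        f"--experiment_name {experiment_name} "
        f"--num_total_nodes {num_total_nodes} "
        f"--mcp_start {mcp_start} --mcp_stop {mcp_stop} "
        f"--num_graphs_per_config 1"
    )
    return shlex.split(cmd)

# Sweep over graph sizes; pass each command to subprocess.run(cmd, check=True).
for n in (5, 10, 15):
    cmd = build_command("gpt-5-2025-08-07", n, 1, 2, f"sweep_n{n}")
    print(" ".join(cmd))
```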

📂 Codebase Overview

The core logic and data structures are organized as follows:

Source Code (src/)

  • src/funcbenchgen.py: The main entry point. It orchestrates the experiment sweep, handles model interactions via LiteLLM, and manages data persistence.
  • src/function_tree.py: Contains the FunctionDependencyTree class, which generates the synthetic function DAGs with controllable complexity (depth, width, distractors).
  • src/evaluator.py: Implements the ToolCallingEvaluator, which simulates the execution of tool calls and validates the LLM's multi-step reasoning process.
  • src/utils.py: Shared utility functions for graph handling, RNG management, and result formatting.
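As a rough illustration of what controllable DAG generation involves (a sketch only, not the actual FunctionDependencyTree implementation), a random DAG with a reproducible seed can be sampled by only allowing edges from lower- to higher-indexed nodes:

```python
import random

def generate_dag(num_nodes: int, edge_prob: float = 0.4, seed: int = 0) -> dict[int, list[int]]:
    """Sample a random DAG as adjacency lists. Edges only go from lower- to
    higher-indexed nodes, which guarantees acyclicity by construction."""
    rng = random.Random(seed)           # fixed seed -> reproducible graphs
    edges: dict[int, list[int]] = {i: [] for i in range(num_nodes)}
    for src in range(num_nodes):
        for dst in range(src + 1, num_nodes):
            if rng.random() < edge_prob:
                edges[src].append(dst)
    return edges

dag = generate_dag(num_nodes=6)
# Every edge points "forward", so no cycle is possible.
assert all(dst > src for src, dsts in dag.items() for dst in dsts)
```

Controlling `num_nodes` and `edge_prob` here loosely corresponds to the depth/width/distractor knobs described above.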

Execution Scripts (scripts/)

  • scripts/run_funcbenchgen.sh: A bash script to automate sweeps over different graph configurations (e.g., varying the number of nodes or critical path length).
  • scripts/run_mitigation.sh: Similar to the baseline script but enables the repeat_known_variable_values flag to test the proposed mitigation strategy.

Generated Graphs and Results (data/)

1. Generated Graph Files (data/graphs/.../*.json)

Generated graphs are stored as JSON files containing the DAG structure.

{
  "graph_data": {
    "nodes": [
      { "id": "func_soe", "function": { "inputs": ["nsg"], "output": "pngszx", "description": "..." }, "node_type": "core" },
      ...
    ],
    "links": [ { "source": "func_soe", "target": "func_mvo" }, ... ]
  },
  "target_node": "func_puf",
  "target_variable": "dtxyt",
  "config": { "num_total_nodes": 10, "max_critical_path_length": 1, ... }
}
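Given the schema above, a graph file can be inspected with a few lines of stdlib Python (the field names match the sample; the concrete values below are made up for illustration):

```python
import json

graph_json = """{
  "graph_data": {
    "nodes": [
      {"id": "func_soe", "function": {"inputs": ["nsg"], "output": "pngszx"}, "node_type": "core"},
      {"id": "func_mvo", "function": {"inputs": ["pngszx"], "output": "dtxyt"}, "node_type": "core"},
      {"id": "func_zzz", "function": {"inputs": ["nsg"], "output": "qqq"}, "node_type": "distractor"}
    ],
    "links": [{"source": "func_soe", "target": "func_mvo"}]
  },
  "target_node": "func_mvo",
  "target_variable": "dtxyt"
}"""

graph = json.loads(graph_json)  # in practice: json.load(open(path)) on a file under data/graphs/
nodes = graph["graph_data"]["nodes"]
core = [n["id"] for n in nodes if n["node_type"] == "core"]
producer = {n["function"]["output"]: n["id"] for n in nodes}
print(core)                                  # core (non-distractor) functions
print(producer[graph["target_variable"]])    # the node that produces the target variable
```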

2. Result Files (data/results/.../*.json)

Evaluation results include detailed per-trial logs and execution traces.

{
  "model": "gpt-5-2025-08-07",
  "summary": { "success_rate": 1.0, "avg_calls": 10.0, "num_trials": 1 },
  "trials": [
    {
      "input_prompt": "Using the tools at your disposal, use function(s) to compute and give me the correct value of variable dtxyt...",
      "call_sequence": [
        {
          "order": 1,
          "function": "func_cdl",
          "tool_call": "ChatCompletionMessageToolCall(...)",
          "content": "{'result': 'Variable uflxz = 488.'}",
          "thought_process": "..."
        },
        ...
      ],
      "was_correct": true,
      "target_value": "508"
    }
  ]
}
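Result files with this shape can be aggregated without any project code; for example (field names from the sample above, trial data made up):

```python
import json

result_json = """{
  "model": "gpt-5-2025-08-07",
  "summary": {"success_rate": 1.0, "avg_calls": 10.0, "num_trials": 1},
  "trials": [
    {"was_correct": true, "target_value": "508",
     "call_sequence": [{"order": 1, "function": "func_cdl"}]}
  ]
}"""

result = json.loads(result_json)  # in practice: one file per model/config under data/results/
trials = result["trials"]
success_rate = sum(t["was_correct"] for t in trials) / len(trials)   # True counts as 1
avg_calls = sum(len(t["call_sequence"]) for t in trials) / len(trials)
print(f"{result['model']}: success={success_rate:.1%}, avg_calls={avg_calls:.1f}")
```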

📚 Citations

@misc{maekawa2025distinctive,
  title={Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling},
  author={Seiji Maekawa and Jackson Hassell and Pouya Pezeshkpour and Tom Mitchell and Estevam Hruschka},
  url={https://arxiv.org/abs/2509.26553},
  year={2025}
}

📜 Disclosure

Embedded in, or bundled with, this product are open source software (OSS) components, datasets and other third party components identified below. The license terms respectively governing the datasets and third-party components continue to govern those portions, and you agree to those license terms, which, when applicable, specifically limit any distribution. You may receive a copy of, distribute and/or modify any open source code for the OSS component under the terms of their respective licenses, which may be the BSD 3-Clause license and Apache 2.0 license. In the event of conflicts between Megagon Labs, Inc., license conditions and the Open Source Software license conditions, the Open Source Software conditions shall prevail with respect to the Open Source Software portions of the software.

You agree not to, and are not permitted to, distribute actual datasets used with the OSS components listed below. You agree and are limited to distribute only links to datasets from known sources by listing them in the datasets overview table below. You are permitted to distribute derived datasets of datasets from known sources by including links to the original dataset source in the datasets overview table below. You agree that any right to modify datasets originating from parties other than Megagon Labs, Inc. is governed by the respective third party’s license conditions.

All OSS components and datasets are distributed WITHOUT ANY WARRANTY, without even implied warranty such as for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, and without any liability to or claim against any Megagon Labs, Inc. entity other than as explicitly documented in this README document. You agree to cease using any part of the provided materials if you do not agree with the terms or the lack of any warranty herein. While Megagon Labs, Inc., makes commercially reasonable efforts to ensure that citations in this document are complete and accurate, errors may occur.
If you see any error or omission, please help us improve this document by sending information to contact_oss@megagon.ai.
