
Conversation

@Jasmine-Yuting-Zhang
Collaborator

This PR introduces a benchmark evaluation framework to Plato, enabling systematic evaluation of trained federated learning models on user-specified benchmarks. It also adds the CORE benchmark (adapted from NanoChat) for language model evaluation.

Description

Benchmark registry and CORE benchmark implementation:

  • Benchmark base class (plato/benchmarks/base.py): Abstract interface with thread-safe download guards and evaluation contract.
  • Registry system (plato/benchmarks/registry.py): Runtime benchmark selection via configuration (registered_benchmarks dict); a sketch of this base-class/registry pattern follows this list.
  • CORE benchmark (plato/benchmarks/core.py): Language model evaluation suite with multiprocessing support, task loading, and metric aggregation.
  • Helper modules (plato/benchmarks/core_helpers): Task evaluation logic and necessary tokenizer wrapper for compatibility.
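To make the base-class/registry pattern concrete, here is a minimal sketch. Benchmark, registered_benchmarks, and get() are names taken from this PR; the method signatures, the lock-based download guard, and the CoreBenchmark placeholder are illustrative assumptions (the actual get() presumably reads the configured type from Config() rather than taking it as an argument).

    import os
    import threading
    from abc import ABC, abstractmethod


    class Benchmark(ABC):
        """Abstract benchmark with a thread-safe download guard (illustrative)."""

        _download_lock = threading.Lock()

        def __init__(self, data_path):
            self.data_path = data_path

        def ensure_downloaded(self):
            # Only one thread prepares the benchmark data; others wait here.
            with Benchmark._download_lock:
                if not os.path.exists(self.data_path):
                    self.download()

        @abstractmethod
        def download(self):
            """Fetch the benchmark data into self.data_path."""

        @abstractmethod
        def evaluate(self, model):
            """Run the benchmark and return a dict of per-task metrics."""


    # Populated at import time, e.g. {"core": CoreBenchmark} once core.py registers itself.
    registered_benchmarks = {}


    def get(benchmark_type, **kwargs):
        """Instantiate a registered benchmark by its configured type name."""
        if benchmark_type not in registered_benchmarks:
            raise ValueError(f"Unknown benchmark: {benchmark_type}")
        return registered_benchmarks[benchmark_type](**kwargs)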

Configuration support (plato/config.py):

  • Added a benchmark_path parameter that points to the benchmark data directory
  • Added a [benchmark] section with a type field that selects the benchmark by name; when type = "core" (the CORE benchmark), additional random_seed and max_per_task fields are supported (see the example section below)
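For illustration only, the new section could look like the following in a configuration file such as split_learning_wikitext2_gpt2.toml; the field names are from this PR, the values are assumptions, and benchmark_path (the benchmark data directory) is set separately alongside the other data paths.

    [benchmark]
    type = "core"        # selects the CORE benchmark by name
    random_seed = 1      # illustrative value
    max_per_task = 16    # illustrative value; the helper script also defaults to 16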

Trainer integration (plato/trainers/composable.py):

  • Implemented eval(), eval_model(), and eval_process() to run model evaluation through the configured testing strategy
  • Added save_benchmark_result() and load_benchmark_result() utilities for the trainer's maximum-concurrency mode (a sketch follows this list)
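A minimal sketch of what these two utilities can look like, assuming results are plain dictionaries serialized to JSON; the actual implementation may use a different file format or location.

    import json
    import os


    def save_benchmark_result(result, filename):
        """Persist a benchmark result so a spawned evaluation process can
        hand it back to the parent trainer in maximum-concurrency mode."""
        os.makedirs(os.path.dirname(filename) or ".", exist_ok=True)
        with open(filename, "w", encoding="utf-8") as file:
            json.dump(result, file)


    def load_benchmark_result(filename):
        """Read back the result written by save_benchmark_result()."""
        with open(filename, "r", encoding="utf-8") as file:
            return json.load(file)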

Server orchestration (plato/servers/fedavg.py):

  • Integrated benchmark evaluation after aggregation rounds
  • Lazy benchmark instantiation via benchmarks_registry.get()
  • Formatted result logging via benchmark.get_formatted_result(); a sketch of this flow follows the list
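To make the orchestration concrete, the round-end flow might look roughly like this; benchmarks_registry.get() and get_formatted_result() are named in this PR, while the import alias, the trainer.eval() call signature, and the Config() access pattern are assumptions for illustration.

    import logging
    import os

    from plato.benchmarks import registry as benchmarks_registry
    from plato.config import Config


    def evaluate_on_benchmark(server):
        """Illustrative round-end hook: evaluate the aggregated model."""
        # Lazy instantiation: build the benchmark the first time it is needed.
        if getattr(server, "benchmark", None) is None:
            server.benchmark = benchmarks_registry.get()

        logging.info(
            "[Server #%d] Started model evaluation on benchmark %s.",
            os.getpid(), Config().benchmark.type,
        )
        result = server.trainer.eval(server.benchmark)  # call signature assumed

        logging.info(
            "[Server #%d] Model evaluation result on benchmark %s:\n%s",
            os.getpid(), Config().benchmark.type,
            server.benchmark.get_formatted_result(),  # formatting API named in this PR
        )
        return result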

Testing strategy interface (plato/trainers/testing_strategies/default.py):

  • Added an abstract eval_model() hook to the testing strategy interface; the default strategy raises NotImplementedError so that specialized strategies supply the benchmark evaluation (see the sketch below)
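A minimal sketch of what that default hook can look like; the class name and signature here are hypothetical, and only the eval_model() name and the NotImplementedError behaviour come from this PR's description.

    class DefaultTestingStrategy:
        """Sketch of the testing-strategy interface extended by this PR."""

        def eval_model(self, model, benchmark, **kwargs):
            # The default strategy declines to evaluate; specialized strategies
            # (such as the split learning example) override this hook.
            raise NotImplementedError(
                "Benchmark evaluation requires a testing strategy that "
                "implements eval_model()."
            )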

Split learning example validation (examples/split_learning/llm_split_learning):

  • Implemented eval_model() in split_learning_trainer.py
  • Forwarded the device and tokenizer to the benchmark instance (a sketch follows this list)
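For illustration, the example's hook might reduce to something like the following; the class name, the attribute forwarding (benchmark.device, benchmark.tokenizer), and the evaluate() call are assumptions based on the description above, not the actual code in split_learning_trainer.py.

    class SplitLearningTrainerSketch:
        """Hypothetical, trimmed-down view of eval_model() in the LLM example."""

        def __init__(self, model, tokenizer, device="cpu"):
            self.model = model
            self.tokenizer = tokenizer
            self.device = device

        def eval_model(self, benchmark):
            # Forward the runtime device and tokenizer so the benchmark can
            # drive tokenization and scoring itself.
            benchmark.device = self.device
            benchmark.tokenizer = self.tokenizer
            return benchmark.evaluate(self.model)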

How has this been tested?

  • Tested CORE benchmark evaluation with the split learning LLM example and its corresponding configuration
  • Test execution and results:
    Command:
uv run split_learning_main.py -c split_learning_wikitext2_gpt2.toml

Output showing a successful CORE benchmark evaluation on 22 tasks after a 10-round federated learning session:

[INFO][03:13:29]: [Server #22804] Finished aggregating updated weights.
[INFO][03:13:29]: [Server #22804] Average client accuracy: 0.00%.
[INFO][03:13:29]: [Server #22804] Started model evaluation on benchmark core.
[INFO][03:13:29]: Evaluating task: hellaswag_zeroshot (0-shot, type: multiple_choice)
[INFO][03:13:32]: accuracy: 0.1250 | centered: -0.1667 | time: 2.35s
[INFO][03:13:32]: Evaluating task: jeopardy (10-shot, type: language_modeling)
[INFO][03:13:34]: accuracy: 0.0000 | centered: 0.0000 | time: 2.33s
[INFO][03:13:34]: Evaluating task: bigbench_qa_wikidata (10-shot, type: language_modeling)
[INFO][03:13:35]: accuracy: 0.1875 | centered: 0.1875 | time: 1.16s
[INFO][03:13:35]: Evaluating task: arc_easy (10-shot, type: multiple_choice)
[INFO][03:13:43]: accuracy: 0.4375 | centered: 0.2500 | time: 8.25s
[INFO][03:13:43]: Evaluating task: arc_challenge (10-shot, type: multiple_choice)
[INFO][03:13:52]: accuracy: 0.3125 | centered: 0.0833 | time: 8.73s
[INFO][03:13:52]: Evaluating task: copa (0-shot, type: multiple_choice)
[INFO][03:13:53]: accuracy: 0.5625 | centered: 0.1250 | time: 0.55s
[INFO][03:13:53]: Evaluating task: commonsense_qa (10-shot, type: multiple_choice)
[INFO][03:14:05]: accuracy: 0.0625 | centered: -0.1719 | time: 12.37s
[INFO][03:14:05]: Evaluating task: piqa (10-shot, type: multiple_choice)
[INFO][03:14:13]: accuracy: 0.5625 | centered: 0.1250 | time: 7.65s
[INFO][03:14:13]: Evaluating task: openbook_qa (0-shot, type: multiple_choice)
[INFO][03:14:13]: accuracy: 0.3750 | centered: 0.1667 | time: 0.73s
[INFO][03:14:13]: Evaluating task: lambada_openai (0-shot, type: language_modeling)
[INFO][03:14:14]: accuracy: 0.1250 | centered: 0.1250 | time: 0.92s
[INFO][03:14:14]: Evaluating task: hellaswag (10-shot, type: multiple_choice)
[INFO][03:14:36]: accuracy: 0.1250 | centered: -0.1667 | time: 22.23s
[INFO][03:14:36]: Evaluating task: winograd (0-shot, type: schema)
[INFO][03:14:37]: accuracy: 0.4375 | centered: -0.1250 | time: 0.53s
[INFO][03:14:37]: Evaluating task: winogrande (0-shot, type: schema)
[INFO][03:14:38]: accuracy: 0.5000 | centered: 0.0000 | time: 0.60s
[INFO][03:14:38]: Evaluating task: bigbench_dyck_languages (10-shot, type: language_modeling)
[INFO][03:14:41]: accuracy: 0.0625 | centered: 0.0625 | time: 3.90s
[INFO][03:14:41]: Evaluating task: agi_eval_lsat_ar (3-shot, type: multiple_choice)
[INFO][03:15:02]: accuracy: 0.3125 | centered: 0.1406 | time: 20.20s
[INFO][03:15:02]: Evaluating task: bigbench_cs_algorithms (10-shot, type: language_modeling)
[INFO][03:15:05]: accuracy: 0.2500 | centered: 0.2500 | time: 3.15s
[INFO][03:15:05]: Evaluating task: bigbench_operators (10-shot, type: language_modeling)
[INFO][03:15:08]: accuracy: 0.3125 | centered: 0.3125 | time: 2.86s
[INFO][03:15:08]: Evaluating task: bigbench_repeat_copy_logic (10-shot, type: language_modeling)
[INFO][03:15:11]: accuracy: 0.0000 | centered: 0.0000 | time: 2.97s
[INFO][03:15:11]: Evaluating task: squad (10-shot, type: language_modeling)
[INFO][03:15:17]: accuracy: 0.0000 | centered: 0.0000 | time: 6.25s
[INFO][03:15:17]: Evaluating task: coqa (0-shot, type: language_modeling)
[INFO][03:15:20]: accuracy: 0.0625 | centered: 0.0625 | time: 3.33s
[INFO][03:15:20]: Evaluating task: boolq (10-shot, type: multiple_choice)
[INFO][03:15:33]: accuracy: 0.6250 | centered: 0.0132 | time: 12.44s
[INFO][03:15:33]: Evaluating task: bigbench_language_identification (10-shot, type: multiple_choice)
[INFO][03:15:57]: accuracy: 0.5000 | centered: 0.4499 | time: 24.73s
[INFO][03:15:57]: [Server #22804] Model evaluation result on benchmark core: 
Task                               , Accuracy  , Centered  
hellaswag_zeroshot                 , 0.125000  , -0.166667 
jeopardy                           , 0.000000  , 0.000000  
bigbench_qa_wikidata               , 0.187500  , 0.187500  
arc_easy                           , 0.437500  , 0.250000  
arc_challenge                      , 0.312500  , 0.083333  
copa                               , 0.562500  , 0.125000  
commonsense_qa                     , 0.062500  , -0.171875 
piqa                               , 0.562500  , 0.125000  
openbook_qa                        , 0.375000  , 0.166667  
lambada_openai                     , 0.125000  , 0.125000  
hellaswag                          , 0.125000  , -0.166667 
winograd                           , 0.437500  , -0.125000 
winogrande                         , 0.500000  , 0.000000  
bigbench_dyck_languages            , 0.062500  , 0.062500  
agi_eval_lsat_ar                   , 0.312500  , 0.140625  
bigbench_cs_algorithms             , 0.250000  , 0.250000  
bigbench_operators                 , 0.312500  , 0.312500  
bigbench_repeat_copy_logic         , 0.000000  , 0.000000  
squad                              , 0.000000  , 0.000000  
coqa                               , 0.062500  , 0.062500  
boolq                              , 0.625000  , 0.013158  
bigbench_language_identification   , 0.500000  , 0.449945  
Overall CORE Metric                ,           , 0.078342  
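As a sanity check on the aggregation, the Overall CORE Metric above is consistent with the unweighted mean of the per-task centered accuracies; the snippet below reproduces it from the table.

    # Per-task centered accuracies copied from the table above.
    centered = [
        -0.166667, 0.000000, 0.187500, 0.250000, 0.083333, 0.125000,
        -0.171875, 0.125000, 0.166667, 0.125000, -0.166667, -0.125000,
        0.000000, 0.062500, 0.140625, 0.250000, 0.312500, 0.000000,
        0.000000, 0.062500, 0.013158, 0.449945,
    ]
    print(sum(centered) / len(centered))  # ≈ 0.078342, the Overall CORE Metric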

Types of changes

  • Bug fix (non-breaking change which fixes an issue) Fixes #
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code has been formatted using the Ruff formatter (ruff format) and checked using the Ruff linter (ruff check --fix).
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

Integrated NanoChat’s benchmark components into Plato to enable direct evaluation of Plato models using the NanoChat benchmark.

Added files:
- common.py: shared utilities and configurations for the benchmark
- core_eval.py: implements the CORE benchmark evaluation logic
- evaluate_model.py: main entry point to run model evaluation from Plato
- report.py: handles result aggregation and reporting
- tokenizer.py: provides tokenization utilities for language model evaluation

Features:
- Automatically sets up the NanoChat datasets under `.cache/nanochat`.
- Downloads and unpacks the CORE evaluation bundle if not already available.
- Invokes `evaluate_model.py` with the specified HuggingFace model path.
- Adds argument parsing for `<model_path>` and optional `[max_per_task]`.
- Defaults `max_per_task` to 16 when not provided.

Usage:
    bash evaluate_model.sh <model_path> [optional: max_per_task]

- Introduced eval_model() in testing.py to define a placeholder interface for benchmark-based evaluation.
- The default strategy now raises NotImplementedError to prompt use of specialized testing strategies.
- Added static methods save_benchmark_result() and load_benchmark_result() in base.py for saving and loading benchmark evaluation results.
- Implemented benchmark evaluation pipeline in plato/trainers/composable.py.
- Added eval_model(), eval(), and eval_process() methods.
- Enabled benchmark evaluation in split learning to test and validate benchmark implementations.
@netlify

netlify bot commented Oct 28, 2025

Deploy Preview for platodocs canceled.

🔨 Latest commit: ec1c1ba
🔍 Latest deploy log: https://app.netlify.com/projects/platodocs/deploys/69007eb8d9c1e6000787f025
