
Conversation

@Jasmine-Yuting-Zhang
Collaborator

This PR introduces a benchmark evaluation framework to Plato, enabling systematic evaluation of trained federated learning models on user-specified benchmarks. It also adds the CORE benchmark (adapted from NanoChat) for language model evaluation.

Description

Benchmark registry and CORE benchmark implementation:

  • Benchmark base class (plato/benchmarks/base.py): Abstract interface with thread-safe download guards and evaluation contract.
  • Registry system (plato/benchmarks/registry.py): Runtime benchmark selection via configuration (registered_benchmarks dict); a sketch of this base-class/registry pattern follows this list.
  • CORE benchmark (plato/benchmarks/core.py): Language model evaluation suite with multiprocessing support, task loading, and metric aggregation.
  • Helper modules (plato/benchmarks/core_helpers): Task evaluation logic and necessary tokenizer wrapper for compatibility.
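To make the base-class/registry pattern concrete, here is a minimal sketch. Benchmark, registered_benchmarks, and get() are names taken from this PR; the method signatures, the lock-based download guard, and the CoreBenchmark placeholder are illustrative assumptions (the actual get() presumably reads the configured type from Config() rather than taking it as an argument).

    import os
    import threading
    from abc import ABC, abstractmethod


    class Benchmark(ABC):
        """Abstract benchmark with a thread-safe download guard (illustrative)."""

        _download_lock = threading.Lock()

        def __init__(self, data_path):
            self.data_path = data_path

        def ensure_downloaded(self):
            # Only one thread prepares the benchmark data; others wait here.
            with Benchmark._download_lock:
                if not os.path.exists(self.data_path):
                    self.download()

        @abstractmethod
        def download(self):
            """Fetch the benchmark data into self.data_path."""

        @abstractmethod
        def evaluate(self, model):
            """Run the benchmark and return a dict of per-task metrics."""


    # Populated at import time, e.g. {"core": CoreBenchmark} once core.py registers itself.
    registered_benchmarks = {}


    def get(benchmark_type, **kwargs):
        """Instantiate a registered benchmark by its configured type name."""
        if benchmark_type not in registered_benchmarks:
            raise ValueError(f"Unknown benchmark: {benchmark_type}")
        return registered_benchmarks[benchmark_type](**kwargs)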

Configuration support (plato/config.py):

  • Added a benchmark_path parameter that points to the benchmark data directory
  • Added a [benchmark] section with a type field that selects the benchmark by name; when type = "core" (the CORE benchmark), additional random_seed and max_per_task fields are supported (see the example section below)
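For illustration only, the new section could look like the following in a configuration file such as split_learning_wikitext2_gpt2.toml; the field names are from this PR, the values are assumptions, and benchmark_path (the benchmark data directory) is set separately alongside the other data paths.

    [benchmark]
    type = "core"        # selects the CORE benchmark by name
    random_seed = 1      # illustrative value
    max_per_task = 16    # illustrative value; the helper script also defaults to 16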

Trainer integration (plato/trainers/composable.py):

  • Implemented eval(), eval_model(), and eval_process() to run model evaluation through the configured testing strategy
  • Added save_benchmark_result() and load_benchmark_result() utilities for the trainer's maximum-concurrency mode (a sketch follows this list)
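A minimal sketch of what these two utilities can look like, assuming results are plain dictionaries serialized to JSON; the actual implementation may use a different file format or location.

    import json
    import os


    def save_benchmark_result(result, filename):
        """Persist a benchmark result so a spawned evaluation process can
        hand it back to the parent trainer in maximum-concurrency mode."""
        os.makedirs(os.path.dirname(filename) or ".", exist_ok=True)
        with open(filename, "w", encoding="utf-8") as file:
            json.dump(result, file)


    def load_benchmark_result(filename):
        """Read back the result written by save_benchmark_result()."""
        with open(filename, "r", encoding="utf-8") as file:
            return json.load(file)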

Server orchestration (plato/servers/fedavg.py):

  • Integrated benchmark evaluation after aggregation rounds
  • Lazy benchmark instantiation via benchmarks_registry.get()
  • Formatted result logging via benchmark.get_formatted_result(); a sketch of this flow follows the list
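To make the orchestration concrete, the round-end flow might look roughly like this; benchmarks_registry.get() and get_formatted_result() are named in this PR, while the import alias, the trainer.eval() call signature, and the Config() access pattern are assumptions for illustration.

    import logging
    import os

    from plato.benchmarks import registry as benchmarks_registry
    from plato.config import Config


    def evaluate_on_benchmark(server):
        """Illustrative round-end hook: evaluate the aggregated model."""
        # Lazy instantiation: build the benchmark the first time it is needed.
        if getattr(server, "benchmark", None) is None:
            server.benchmark = benchmarks_registry.get()

        logging.info(
            "[Server #%d] Started model evaluation on benchmark %s.",
            os.getpid(), Config().benchmark.type,
        )
        result = server.trainer.eval(server.benchmark)  # call signature assumed

        logging.info(
            "[Server #%d] Model evaluation result on benchmark %s:\n%s",
            os.getpid(), Config().benchmark.type,
            server.benchmark.get_formatted_result(),  # formatting API named in this PR
        )
        return result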

Testing strategy interface (plato/trainers/testing_strategies/default.py):

  • Added an abstract eval_model() hook to the testing strategy interface; the default strategy raises NotImplementedError so that specialized strategies supply the benchmark evaluation (see the sketch below)
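A minimal sketch of what that default hook can look like; the class name and signature here are hypothetical, and only the eval_model() name and the NotImplementedError behaviour come from this PR's description.

    class DefaultTestingStrategy:
        """Sketch of the testing-strategy interface extended by this PR."""

        def eval_model(self, model, benchmark, **kwargs):
            # The default strategy declines to evaluate; specialized strategies
            # (such as the split learning example) override this hook.
            raise NotImplementedError(
                "Benchmark evaluation requires a testing strategy that "
                "implements eval_model()."
            )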

Split learning example validation (examples/split_learning/llm_split_learning):

  • Implemented eval_model() in split_learning_trainer.py
  • Forwarded the device and tokenizer to the benchmark instance (a sketch follows this list)
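For illustration, the example's hook might reduce to something like the following; the class name, the attribute forwarding (benchmark.device, benchmark.tokenizer), and the evaluate() call are assumptions based on the description above, not the actual code in split_learning_trainer.py.

    class SplitLearningTrainerSketch:
        """Hypothetical, trimmed-down view of eval_model() in the LLM example."""

        def __init__(self, model, tokenizer, device="cpu"):
            self.model = model
            self.tokenizer = tokenizer
            self.device = device

        def eval_model(self, benchmark):
            # Forward the runtime device and tokenizer so the benchmark can
            # drive tokenization and scoring itself.
            benchmark.device = self.device
            benchmark.tokenizer = self.tokenizer
            return benchmark.evaluate(self.model)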

How has this been tested?

  • Tested CORE benchmark evaluation with the split learning LLM example and its corresponding configuration
  • Test execution and results:
    Command:
uv run split_learning_main.py -c split_learning_wikitext2_gpt2.toml

Output showing a successful CORE benchmark evaluation on 22 tasks after a 10-round federated learning session:

[INFO][03:13:29]: [Server #22804] Finished aggregating updated weights.
[INFO][03:13:29]: [Server #22804] Average client accuracy: 0.00%.
[INFO][03:13:29]: [Server #22804] Started model evaluation on benchmark core.
[INFO][03:13:29]: Evaluating task: hellaswag_zeroshot (0-shot, type: multiple_choice)
[INFO][03:13:32]: accuracy: 0.1250 | centered: -0.1667 | time: 2.35s
[INFO][03:13:32]: Evaluating task: jeopardy (10-shot, type: language_modeling)
[INFO][03:13:34]: accuracy: 0.0000 | centered: 0.0000 | time: 2.33s
[INFO][03:13:34]: Evaluating task: bigbench_qa_wikidata (10-shot, type: language_modeling)
[INFO][03:13:35]: accuracy: 0.1875 | centered: 0.1875 | time: 1.16s
[INFO][03:13:35]: Evaluating task: arc_easy (10-shot, type: multiple_choice)
[INFO][03:13:43]: accuracy: 0.4375 | centered: 0.2500 | time: 8.25s
[INFO][03:13:43]: Evaluating task: arc_challenge (10-shot, type: multiple_choice)
[INFO][03:13:52]: accuracy: 0.3125 | centered: 0.0833 | time: 8.73s
[INFO][03:13:52]: Evaluating task: copa (0-shot, type: multiple_choice)
[INFO][03:13:53]: accuracy: 0.5625 | centered: 0.1250 | time: 0.55s
[INFO][03:13:53]: Evaluating task: commonsense_qa (10-shot, type: multiple_choice)
[INFO][03:14:05]: accuracy: 0.0625 | centered: -0.1719 | time: 12.37s
[INFO][03:14:05]: Evaluating task: piqa (10-shot, type: multiple_choice)
[INFO][03:14:13]: accuracy: 0.5625 | centered: 0.1250 | time: 7.65s
[INFO][03:14:13]: Evaluating task: openbook_qa (0-shot, type: multiple_choice)
[INFO][03:14:13]: accuracy: 0.3750 | centered: 0.1667 | time: 0.73s
[INFO][03:14:13]: Evaluating task: lambada_openai (0-shot, type: language_modeling)
[INFO][03:14:14]: accuracy: 0.1250 | centered: 0.1250 | time: 0.92s
[INFO][03:14:14]: Evaluating task: hellaswag (10-shot, type: multiple_choice)
[INFO][03:14:36]: accuracy: 0.1250 | centered: -0.1667 | time: 22.23s
[INFO][03:14:36]: Evaluating task: winograd (0-shot, type: schema)
[INFO][03:14:37]: accuracy: 0.4375 | centered: -0.1250 | time: 0.53s
[INFO][03:14:37]: Evaluating task: winogrande (0-shot, type: schema)
[INFO][03:14:38]: accuracy: 0.5000 | centered: 0.0000 | time: 0.60s
[INFO][03:14:38]: Evaluating task: bigbench_dyck_languages (10-shot, type: language_modeling)
[INFO][03:14:41]: accuracy: 0.0625 | centered: 0.0625 | time: 3.90s
[INFO][03:14:41]: Evaluating task: agi_eval_lsat_ar (3-shot, type: multiple_choice)
[INFO][03:15:02]: accuracy: 0.3125 | centered: 0.1406 | time: 20.20s
[INFO][03:15:02]: Evaluating task: bigbench_cs_algorithms (10-shot, type: language_modeling)
[INFO][03:15:05]: accuracy: 0.2500 | centered: 0.2500 | time: 3.15s
[INFO][03:15:05]: Evaluating task: bigbench_operators (10-shot, type: language_modeling)
[INFO][03:15:08]: accuracy: 0.3125 | centered: 0.3125 | time: 2.86s
[INFO][03:15:08]: Evaluating task: bigbench_repeat_copy_logic (10-shot, type: language_modeling)
[INFO][03:15:11]: accuracy: 0.0000 | centered: 0.0000 | time: 2.97s
[INFO][03:15:11]: Evaluating task: squad (10-shot, type: language_modeling)
[INFO][03:15:17]: accuracy: 0.0000 | centered: 0.0000 | time: 6.25s
[INFO][03:15:17]: Evaluating task: coqa (0-shot, type: language_modeling)
[INFO][03:15:20]: accuracy: 0.0625 | centered: 0.0625 | time: 3.33s
[INFO][03:15:20]: Evaluating task: boolq (10-shot, type: multiple_choice)
[INFO][03:15:33]: accuracy: 0.6250 | centered: 0.0132 | time: 12.44s
[INFO][03:15:33]: Evaluating task: bigbench_language_identification (10-shot, type: multiple_choice)
[INFO][03:15:57]: accuracy: 0.5000 | centered: 0.4499 | time: 24.73s
[INFO][03:15:57]: [Server #22804] Model evaluation result on benchmark core: 
Task                               , Accuracy  , Centered  
hellaswag_zeroshot                 , 0.125000  , -0.166667 
jeopardy                           , 0.000000  , 0.000000  
bigbench_qa_wikidata               , 0.187500  , 0.187500  
arc_easy                           , 0.437500  , 0.250000  
arc_challenge                      , 0.312500  , 0.083333  
copa                               , 0.562500  , 0.125000  
commonsense_qa                     , 0.062500  , -0.171875 
piqa                               , 0.562500  , 0.125000  
openbook_qa                        , 0.375000  , 0.166667  
lambada_openai                     , 0.125000  , 0.125000  
hellaswag                          , 0.125000  , -0.166667 
winograd                           , 0.437500  , -0.125000 
winogrande                         , 0.500000  , 0.000000  
bigbench_dyck_languages            , 0.062500  , 0.062500  
agi_eval_lsat_ar                   , 0.312500  , 0.140625  
bigbench_cs_algorithms             , 0.250000  , 0.250000  
bigbench_operators                 , 0.312500  , 0.312500  
bigbench_repeat_copy_logic         , 0.000000  , 0.000000  
squad                              , 0.000000  , 0.000000  
coqa                               , 0.062500  , 0.062500  
boolq                              , 0.625000  , 0.013158  
bigbench_language_identification   , 0.500000  , 0.449945  
Overall CORE Metric                ,           , 0.078342  
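As a sanity check on the aggregation, the Overall CORE Metric above is consistent with the unweighted mean of the per-task centered accuracies; the snippet below reproduces it from the table.

    # Per-task centered accuracies copied from the table above.
    centered = [
        -0.166667, 0.000000, 0.187500, 0.250000, 0.083333, 0.125000,
        -0.171875, 0.125000, 0.166667, 0.125000, -0.166667, -0.125000,
        0.000000, 0.062500, 0.140625, 0.250000, 0.312500, 0.000000,
        0.000000, 0.062500, 0.013158, 0.449945,
    ]
    print(sum(centered) / len(centered))  # ≈ 0.078342, the Overall CORE Metric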

Types of changes

  • Bug fix (non-breaking change which fixes an issue) Fixes #
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code has been formatted using the Ruff formatter (ruff format) and checked using the Ruff linter (ruff check --fix).
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

Integrated NanoChat’s benchmark components into Plato to enable direct evaluation of Plato models using the NanoChat benchmark.

Added files:
- common.py: shared utilities and configurations for the benchmark
- core_eval.py: implements the CORE benchmark evaluation logic
- evaluate_model.py: main entry point to run model evaluation from Plato
- report.py: handles result aggregation and reporting
- tokenizer.py: provides tokenization utilities for language model evaluation

Features:
- Automatically sets up the NanoChat datasets under `.cache/nanochat`.
- Downloads and unpacks the CORE evaluation bundle if not already available.
- Invokes `evaluate_model.py` with the specified HuggingFace model path.
- Adds argument parsing for `<model_path>` and optional `[max_per_task]`.
- Defaults `max_per_task` to 16 when not provided.

Usage:
    bash evaluate_model.sh <model_path> [optional: max_per_task]

- Introduced eval_model() in testing.py to define a placeholder interface for benchmark-based evaluation.
- The default strategy now raises NotImplementedError to prompt use of specialized testing strategies.
- Added static methods save_benchmark_result() and load_benchmark_result() in base.py for saving and loading benchmark evaluation results.
- Implemented benchmark evaluation pipeline in plato/trainers/composable.py.
- Added eval_model(), eval(), and eval_process() methods.
- Enabled benchmark evaluation in split learning to test and validate benchmark implementations.
@netlify

netlify bot commented Oct 28, 2025

Deploy Preview for platodocs canceled.

🔨 Latest commit: ec1c1ba
🔍 Latest deploy log: https://app.netlify.com/projects/platodocs/deploys/69007eb8d9c1e6000787f025
