InternScience/SGI-Bench

A Definition of Scientific General Intelligence

Paper · GitHub · Page · HuggingFace

SGI Overview

SGI-Bench is a scientist-aligned benchmark for evaluating Scientific General Intelligence (SGI) across the full inquiry cycle: Deliberation, Conception, Action, and Perception. It spans 10 disciplines and more than 1,000 expert‑curated samples inspired by Science’s 125 Big Questions, and it ships with an agentic evaluation framework and a multi‑metric protocol.


What is Scientific General Intelligence (SGI)?

SGI denotes an AI system that can autonomously navigate the full, iterative cycle of scientific inquiry—Deliberation, Conception, Action, and Perception—with the versatility and proficiency of a human scientist. SGI‑Bench operationalizes this definition via four scientist‑aligned task families: deep research, idea generation, AI‑assisted experiments (dry/wet), and multimodal experimental reasoning.


Framework & Tasks

SGI-Bench Pipeline

  • Deliberation (Deep Research): Multi‑hop retrieval, synthesis, and meta‑analysis style reasoning.
  • Conception (Idea Generation): Structured ideation and multi‑dimensional comparative evaluation.
  • Action (Dry/Wet Experiment): Code/simulation and lab protocol generation and verification.
  • Perception (Multimodal Reasoning): Reasoning over process, observation, simulation, experiment, and visualization images.

Grounded in the Practical Inquiry Model (PIM), SGI‑Bench treats science as an iterative cycle linking deliberation, conception, action and perception. Under this lens, SGI captures the capacity to integrate knowledge retrieval, idea formation, action execution, and interpretation into a unified loop of inquiry.


Scientist‑Aligned Data Construction

  • Raw Corpus: Expert‑curated texts/images across 10 domains, inspired by Science’s 125 Big Questions.
  • Question Construction: 100+ graduate/PhD annotators with continuous expert‑in‑the‑loop review.
  • Data Cleaning: Rules + model checks + expert QA to ensure executability and unique answers.
  • Difficulty Filtering: Removes samples that more than 50% of strong LLMs solve, so the remaining set stays challenging.

Result: High‑fidelity, scientist‑aligned tasks that are authentic, challenging, and broadly representative.
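The difficulty-filtering step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes each sample carries a per-model pass/fail record, and the `solved_by` field, the `filter_hard_samples` helper, and the sample IDs are all hypothetical.

```python
from typing import Dict, List


def solve_rate(solved_by: Dict[str, bool]) -> float:
    """Fraction of reference models that answered the sample correctly."""
    if not solved_by:
        return 0.0
    return sum(solved_by.values()) / len(solved_by)


def filter_hard_samples(samples: List[dict], max_solve_rate: float = 0.5) -> List[dict]:
    """Drop samples solved by more than `max_solve_rate` of the models."""
    return [s for s in samples if solve_rate(s["solved_by"]) <= max_solve_rate]


# Hypothetical records: which reference models solved each question.
samples = [
    {"id": "q1", "solved_by": {"model_a": True, "model_b": True, "model_c": False}},
    {"id": "q2", "solved_by": {"model_a": False, "model_b": True, "model_c": False}},
    {"id": "q3", "solved_by": {"model_a": True, "model_b": False}},
]

kept = filter_hard_samples(samples)
print([s["id"] for s in kept])  # q1 is solved by 2 of 3 models and is dropped
```

A sample solved by exactly half of the models is kept here, because the rule as stated removes only samples solved by more than 50%.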


Agentic Evaluation Framework

  • Four Stages: Question Selection → Metric Customization → Predict & Eval → Report Generation
  • Tool Pool: Web search, PDF parser, Python interpreter, file reader, metric functions
  • Task Metrics: EM/SLA; Implementation Similarity; PassAll@k/SER; MCA/RV
  • Customizable: Add scientist‑aligned metrics (e.g., rigor, feasibility) on demand

This agent‑based stack formalizes scoring into traceable stages, improves reproducibility, mitigates evaluator–model coupling bias, and yields actionable, scientist‑aligned insights.
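One way to picture the "Customizable" point is a small metric registry that the scoring stage looks up by name. The sketch below is an assumption about how such a registry could look, not the framework's real interface: `register_metric`, `METRICS`, `evaluate`, and the `token_overlap` metric are hypothetical names, and `exact_match` assumes EM stands for exact match.

```python
from typing import Callable, Dict, List

MetricFn = Callable[[str, str], float]
METRICS: Dict[str, MetricFn] = {}  # hypothetical registry: metric name -> function


def register_metric(name: str) -> Callable[[MetricFn], MetricFn]:
    """Decorator that adds a metric function to the registry under `name`."""
    def decorator(fn: MetricFn) -> MetricFn:
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("exact_match")
def exact_match(prediction: str, reference: str) -> float:
    """1.0 when the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


@register_metric("token_overlap")
def token_overlap(prediction: str, reference: str) -> float:
    """Jaccard overlap of lowercase token sets (a toy custom metric)."""
    pred, ref = set(prediction.lower().split()), set(reference.lower().split())
    return len(pred & ref) / len(pred | ref) if pred | ref else 0.0


def evaluate(prediction: str, reference: str, metric_names: List[str]) -> Dict[str, float]:
    """Score one prediction with every requested metric."""
    return {name: METRICS[name](prediction, reference) for name in metric_names}


report = evaluate("The cell divides", "the cell divides rapidly",
                  ["exact_match", "token_overlap"])
print(report)
```

Registering metrics by name keeps the metric-customization stage separate from the scoring stage, so a new criterion such as rigor or feasibility becomes one more registered function.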


Test‑Time Reinforcement Learning (TTRL)

TTRL Training Dynamics

  • Objective: Address no‑ground‑truth idea generation by optimizing novelty at test time with online retrieval as a moving baseline.
  • Reward Design:
    R = R_format + R_novelty
    Enforce XML format and strict structure (e.g., <think>, <answer>); reward embedding dissimilarity from retrieved works, gated by thresholds.
  • Setup: GRPO on Qwen3‑8B (ms‑swift), group size G=8, high temperature, bfloat16, online retrieval n=4.
  • Dynamics: Format reward saturates quickly; novelty steadily increases. Average novelty improved from 49.36 → 62.06 without labels.

TTRL converts open‑ended ideation into measurable test‑time optimization and extends to multi‑objective rewards (rigor, feasibility, safety, cost).
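The reward R = R_format + R_novelty can be sketched in a few lines. This is an illustration under stated assumptions, not the training code: the regular expression, the hard 0/1 gating, the `threshold` value of 0.5, and the toy 2-D vectors are all assumptions, and real embeddings would come from whichever encoder the retrieval pipeline uses.

```python
import math
import re
from typing import Sequence

# Assumed format rule: one <think> block followed by one <answer> block.
FORMAT_RE = re.compile(
    r"^\s*<think>.+?</think>\s*<answer>.+?</answer>\s*$", re.DOTALL
)


def format_reward(response: str) -> float:
    """R_format: 1.0 if the response follows the assumed tag structure."""
    return 1.0 if FORMAT_RE.match(response) else 0.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def novelty_reward(idea_vec: Sequence[float],
                   retrieved_vecs: Sequence[Sequence[float]],
                   threshold: float = 0.5) -> float:
    """R_novelty: 1.0 when mean dissimilarity to retrieved works clears the gate."""
    if not retrieved_vecs:
        return 0.0
    dissim = [1.0 - cosine_similarity(idea_vec, v) for v in retrieved_vecs]
    return 1.0 if sum(dissim) / len(dissim) >= threshold else 0.0


def total_reward(response: str, idea_vec, retrieved_vecs) -> float:
    """R = R_format + R_novelty."""
    return format_reward(response) + novelty_reward(idea_vec, retrieved_vecs)


good = "<think>plan</think><answer>a new idea</answer>"
print(total_reward(good, [1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]]))  # novel and well formed
print(total_reward(good, [1.0, 0.0], [[1.0, 0.0]]))              # well formed, not novel
print(total_reward("a new idea", [1.0, 0.0], [[0.0, 1.0]]))      # novel, bad format
```

Because the retrieved works are fetched online for each generated idea, the novelty term is measured against a baseline that moves as the policy changes.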


Leaderboard Highlights

| Model | Deep Research | Idea Generation | Dry Experiment | Wet Experiment | Experimental Reasoning | SGI-Score |
|---|---|---|---|---|---|---|
| Gemini-3-Pro 🥇 | 18.48 | 39.68 | 36.64 | 32.45 | 41.92 | 33.83 |
| Claude-Sonnet-4.5 🥈 | 13.84 | 43.20 | 35.79 | 30.15 | 37.80 | 32.16 |
| Qwen3-Max 🥉 | 15.38 | 39.83 | 33.21 | 33.62 | 37.80 | 31.97 |
| GPT-4.1 | 11.32 | 36.49 | 34.32 | 36.63 | 38.49 | 31.45 |
| GPT-5 | 14.47 | 55.40 | 29.89 | 16.31 | 38.14 | 30.84 |
| o3 | 12.89 | 46.07 | 31.73 | 30.04 | 32.65 | 30.68 |
| Claude-Opus-4.1 | 12.93 | 40.29 | 34.69 | 25.38 | 38.83 | 30.42 |
| o4-mini | 11.95 | 40.78 | 35.79 | 28.86 | 33.33 | 30.14 |
| GPT-5.1 | 11.64 | 47.12 | 31.00 | 22.77 | 34.02 | 29.31 |
| Grok-4 | 13.31 | 37.12 | 33.71 | 29.01 | 30.24 | 28.68 |
| Qwen3-VL-235B-A22B | 11.97 | 39.28 | 28.41 | 30.30 | 31.62 | 28.32 |
| Gemini-2.5-Pro | 15.09 | 39.95 | 22.51 | 22.05 | 41.24 | 28.17 |
| Intern-S1 | 15.74 | 38.09 | 28.79 | 29.02 | 28.87 | 28.10 |
| GPT-4o | 7.86 | 35.95 | 26.94 | 31.31 | 32.30 | 26.87 |
| Gemini-2.5-Flash | 10.69 | 39.13 | 21.03 | 18.55 | 34.36 | 24.75 |
| Llama-4-Scout | 7.86 | 29.72 | 20.37 | 21.66 | 25.77 | 21.08 |
| Qwen3-8B | 8.18 | 35.78 | 18.45 | 9.96 | 23.37 | 19.15 |
| Intern-S1-mini | 11.06 | 36.04 | 16.97 | 12.42 | 16.84 | 18.67 |
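
The reported SGI-Score values are consistent with an unweighted mean of the five task scores. This is an observation from the numbers above, not a definition stated in this README. The check below uses four rows copied from the leaderboard; `sgi_score` is a hypothetical helper name.

```python
from statistics import mean
from typing import Dict, List, Tuple

# (five task scores, reported SGI-Score), copied from the leaderboard rows.
ROWS: Dict[str, Tuple[List[float], float]] = {
    "Gemini-3-Pro": ([18.48, 39.68, 36.64, 32.45, 41.92], 33.83),
    "Claude-Sonnet-4.5": ([13.84, 43.20, 35.79, 30.15, 37.80], 32.16),
    "Qwen3-Max": ([15.38, 39.83, 33.21, 33.62, 37.80], 31.97),
    "GPT-5": ([14.47, 55.40, 29.89, 16.31, 38.14], 30.84),
}


def sgi_score(task_scores: List[float]) -> float:
    """Unweighted mean of the task scores, rounded to two decimals."""
    return round(mean(task_scores), 2)


for model, (task_scores, reported) in ROWS.items():
    print(f"{model}: computed={sgi_score(task_scores):.2f} reported={reported:.2f}")
```

Under this reading, a model with one very strong task, such as GPT-5 on Idea Generation, can still rank lower overall if another task score is weak.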

Quick Start

export OPENAI_API_KEY="xxxxx"
export OPENAI_BASE_URL="xxxxx"
cd evaluation
conda create -n sgi python=3.13.7
conda activate sgi
pip install -r requirements.txt

Task 1 Deep Research

conda activate sgi
python task_1_deep_research/step_1_get_answer.py
python task_1_deep_research/step_2_score.py

Task 2 Idea Generation

Coming soon...

Task 3.1 Dry Experiment (Code Generation)

# Run from the evaluation directory; the script paths below are relative to it.
conda create -n dryexp python=3.10.18
conda activate dryexp
pip install -r dry_experiment_requirements.txt
python task_3_dry_experiment/step_1_build.py
conda activate sgi
python task_3_dry_experiment/step_2_get_answer.py
conda activate dryexp
python task_3_dry_experiment/step_3_run_code.py
conda activate sgi
python task_3_dry_experiment/step_4_score.py
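
The dry-experiment steps alternate between the `dryexp` and `sgi` environments. As a hedged sketch, they can be written as `conda run -n <env>` commands so that no manual `conda activate` is needed between steps. The environment names and script paths come from the commands above; `DRY_EXPERIMENT_STEPS` and `build_commands` are hypothetical names, and the snippet only prints the commands rather than executing them.

```python
import shlex
from typing import List, Tuple

# (conda environment, script) pairs in the order listed above.
DRY_EXPERIMENT_STEPS: List[Tuple[str, str]] = [
    ("dryexp", "task_3_dry_experiment/step_1_build.py"),
    ("sgi", "task_3_dry_experiment/step_2_get_answer.py"),
    ("dryexp", "task_3_dry_experiment/step_3_run_code.py"),
    ("sgi", "task_3_dry_experiment/step_4_score.py"),
]


def build_commands(steps: List[Tuple[str, str]]) -> List[List[str]]:
    """Turn each step into an argument list that runs inside its conda env."""
    return [["conda", "run", "-n", env, "python", script] for env, script in steps]


commands = build_commands(DRY_EXPERIMENT_STEPS)
for cmd in commands:
    print(shlex.join(cmd))  # dry run: show the command instead of executing it
```

To execute the steps, each argument list could be passed to `subprocess.run(cmd, check=True)`, which stops at the first failing step.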

Task 3.2 Wet Experiment (Lab Protocol)

conda activate sgi
python task_3_wet_experiment/step_1_get_answer.py
python task_3_wet_experiment/step_2_score.py

Task 4 Experimental Reasoning

conda activate sgi
python task_4_experimental_reasoning/step_1_get_answer.py
python task_4_experimental_reasoning/step_2_score.py

Citation

@article{sgi2025,
  title={SGI-Bench: Scientific Intelligence Benchmark via Scientist-Aligned Workflows},
  author={Research Team},
  journal={arXiv preprint arXiv:2401.xxxxx},
  year={2025}
}
