Executive Order 14168: This repository is under review for potential modification in compliance with Administration directives.
View Documentation & Examples
A LinkML schema for the Datasheets for Datasets model, as published in Datasheets for Datasets. Inspired by the datasheets used in the electronics and other industries, Gebru et al. proposed that every dataset "be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on". To this end, the authors created a series of topics and over 50 questions addressing different aspects of datasets, also useful in an AI/ML context. An example of a completed datasheet can be found here: Structured dataset documentation: a datasheet for CheXpert
Google is working with a different model called Data Cards, which in practice is close to the original Datasheets for Datasets template.
This repository stores a LinkML schema representation of the original Datasheets for Datasets model, representing the topics, sets of questions, and the expected entities and fields in the answers (work in progress). Beyond a less structured markdown template for this model (e.g. template for datasheet for dataset), we are not aware of any other structured representation of Datasheets for Datasets.
We are also tracking related developments, such as augmented Datasheets for Datasets models as in Augmented Datasheets for Speech Datasets and Ethical Decision-Making.
Curated, comprehensive datasheets for each Bridge2AI data-generating project:
- AI-READI - Retinal imaging and diabetes dataset
- CM4AI - Cell maps for AI dataset
- VOICE - Voice biomarker dataset
- CHORUS - Health data for underrepresented populations
Browse the source code repository:
- examples/ - example data
- project/ - project files (do not edit these)
- src/ - source files (edit these)
- src/data_sheets_schema/schema/ - LinkML schema (edit this)
- src/data_sheets_schema/datamodel/ - generated Python datamodel
- tests/ - Python tests
This repository supports two distinct approaches for generating D4D (Datasheets for Datasets) metadata from dataset documentation:
Approach 1: Automated API Agents

Use when: you need to batch-process many files automatically with minimal human intervention.
Automated scripts that use LLM APIs (OpenAI/Anthropic) to extract D4D metadata from dataset documentation. These agents run autonomously and can process hundreds of files in batch mode.
```bash
python src/download/validated_d4d_wrapper.py -i downloads_by_column -o data/extracted_by_column
```

Features:
- Validates downloads succeeded
- Checks content relevance to projects
- Generates D4D YAML metadata via GPT-5
- Creates detailed validation reports
- Processes HTML, JSON, PDF, and text files
- Adds generation metadata to YAML headers
Generated Metadata Includes:
```yaml
# D4D Metadata extracted from: dataset_page.html
# Column: AI_READI
# Validation: Download ✅ success
# Relevance: ✅ relevant
# Generated: 2025-10-31 14:23:15
# Generator: validated_d4d_wrapper (GPT-5)
# Schema: https://raw.githubusercontent.com/monarch-initiative/ontogpt/main/...
```

For integration into Python applications:
```python
import asyncio

from aurelian.agents.d4d.d4d_agent import d4d_agent
from aurelian.agents.d4d.d4d_config import D4DConfig

async def main():
    # Process multiple sources (URLs and local files)
    sources = [
        "https://example.com/dataset",
        "/path/to/metadata.json",
        "/path/to/documentation.html",
    ]
    config = D4DConfig()
    result = await d4d_agent.run(
        f"Extract metadata from: {', '.join(sources)}",
        deps=config,
    )
    print(result.data)  # D4D YAML output

asyncio.run(main())
```

Supported File Types: PDF, HTML, JSON, text/markdown (URLs and local files)
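To persist that output with the generation header this repository expects, one might add a small helper along the lines of the sketch below. `write_output` is hypothetical, not part of the aurelian API, and the header fields mirror the standardized template shown later in this document.

```python
from datetime import datetime, timezone
from pathlib import Path

def write_output(d4d_yaml: str, source: str, out_path: str) -> None:
    """Hypothetical helper: prepend standardized generation metadata
    (see the header template later in this README) before saving."""
    header = "\n".join([
        f"# D4D Metadata for: {source}",
        f"# Generated: {datetime.now(timezone.utc).isoformat()}",
        "# Generator: aurelian d4d_agent",
        "# Method: automated",
    ])
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    Path(out_path).write_text(header + "\n" + d4d_yaml)

# e.g., inside main(): write_output(result.data, sources[0], "data/extracted_by_column/example.yaml")
```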
A simpler version without validation steps, suitable for clean input data:

```bash
python src/download/d4d_agent_wrapper.py -i downloads_by_column -o data/extracted_by_column
```
Requirements for API Agents:
- Set the `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` environment variable (a pre-flight sketch follows this list)
- Wrappers use GPT-5 by default (configurable)
- Files organized in column directories
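Since the wrappers need one of these keys, a minimal pre-flight check (a sketch, not part of the shipped scripts) can fail fast before any batch run:

```python
import os

# Sketch: abort early if no API key is configured before invoking the wrappers.
if not (os.environ.get("ANTHROPIC_API_KEY") or os.environ.get("OPENAI_API_KEY")):
    raise SystemExit("Set ANTHROPIC_API_KEY or OPENAI_API_KEY before running the D4D wrappers")
```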
Approach 2: Interactive Coding Agents

Use when: you need human oversight, domain expertise, or customized metadata extraction.
Use coding assistants like Claude Code, GitHub Copilot, or Cursor to generate D4D metadata interactively. This approach provides human-in-the-loop quality control and domain-specific reasoning.
Step 1: Provide the schema and dataset documentation to Claude Code

```
Please generate D4D (Datasheets for Datasets) metadata for the dataset at:
https://example.com/dataset

Use the D4D schema at:
https://raw.githubusercontent.com/monarch-initiative/ontogpt/main/src/ontogpt/templates/data_sheets_schema.yaml

Generate a complete YAML file following the schema structure.
```
Step 2: Claude Code will:
- Fetch the dataset documentation
- Analyze the content
- Generate structured D4D YAML
- Include reasoning about field mappings
- Iterate based on your feedback
Generated Metadata Includes:
```yaml
# D4D Metadata for: Example Dataset
# Generated: 2025-10-31
# Generator: Claude Code (claude-sonnet-4-5)
# Method: Interactive extraction with human oversight
# Schema: https://raw.githubusercontent.com/monarch-initiative/ontogpt/main/...
# Reviewed by: [Your Name]
```

```bash
# 1. Start interactive session with Claude Code
claude-code

# 2. Provide instructions
"Generate D4D metadata for datasets in downloads_by_column/AI_READI/
following the schema at [schema URL]"

# 3. Review and refine
# Claude Code will generate metadata and you can provide feedback:
# - "Add more detail to the preprocessing section"
# - "Include information from the supplementary materials"
# - "Ensure all required fields are populated"

# 4. Save validated output
# Output is saved with generation metadata in YAML header
```

Benefits of Interactive Approach:
- ✅ Human oversight and quality control
- ✅ Domain expertise applied to field mapping
- ✅ Iterative refinement based on feedback
- ✅ Reasoning captured in generation process
- ✅ Can handle complex, ambiguous documentation
- ✅ Better handling of edge cases
| Aspect | API Agents 🤖 | Interactive Coding Agents 👨‍💻 |
|---|---|---|
| Speed | Fast (batch processing) | Slower (interactive) |
| Scale | Hundreds of files | Few files at a time |
| Quality | Consistent, good | Variable, can be excellent |
| Human oversight | Minimal | Full |
| Cost | API costs Γ files | Time + API costs |
| Best for | Standardized docs | Complex/ambiguous docs |
| Customization | Limited | High |
| Domain expertise | Model knowledge only | Human + model knowledge |
For large-scale extraction:
- Use API agents for initial batch processing
- Use coding agents to review and refine difficult cases
- Document any manual corrections
For high-value datasets:
- Use coding agents with human oversight
- Validate against domain expertise
- Iterate until metadata is complete
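As one illustration of this hybrid workflow, the sketch below scans batch-generated YAML files and flags incomplete ones for interactive review. The names in `REQUIRED_FIELDS` are assumptions for illustration, not actual D4D schema requirements:

```python
import pathlib
import yaml  # PyYAML

# Hypothetical required fields; substitute the D4D schema slots you actually require.
REQUIRED_FIELDS = ["description", "purposes", "creators"]

def needs_review(path: pathlib.Path) -> bool:
    """True when a batch-generated D4D YAML file is missing a required field."""
    doc = yaml.safe_load(path.read_text()) or {}
    return any(not doc.get(field) for field in REQUIRED_FIELDS)

for path in sorted(pathlib.Path("data/extracted_by_column").rglob("*.yaml")):
    if needs_review(path):
        print(f"Route to interactive review: {path}")
```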
Both approaches should include standardized generation metadata in YAML headers (a minimal header check is sketched after the resource list below):

```yaml
# D4D Metadata for: [Dataset Name]
# Source: [URL or file path]
# Generated: [ISO 8601 timestamp]
# Generator: [Tool name and version/model]
# Method: [automated | interactive | hybrid]
# Schema: [D4D schema URL]
# Validator: [Name/email if human reviewed]
# Notes: [Any relevant generation notes]
```

- This repo: https://github.com/bridge2ai/data-sheets-schema
- API Agent Scripts: src/download/
  - Validated wrapper: `src/download/validated_d4d_wrapper.py`
  - Basic wrapper: `src/download/d4d_agent_wrapper.py`
- Aurelian D4D Agent: aurelian/src/aurelian/agents/d4d/
  - Agent: `d4d_agent.py`
  - Tools: `d4d_tools.py`
  - Config: `d4d_config.py`
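A minimal sketch of the header check mentioned above, assuming output lives under `data/extracted_by_column` and treating four of the template keys as mandatory (a project-level decision, not a rule from this repository):

```python
import pathlib

# Sketch: verify each generated file starts with the standardized header keys.
# Which keys you treat as mandatory is an assumption here, not a repo requirement.
REQUIRED_KEYS = ("# Generated:", "# Generator:", "# Method:", "# Schema:")

def header_ok(path: pathlib.Path) -> bool:
    head = path.read_text().splitlines()[:12]  # the header sits at the top of the file
    return all(any(line.startswith(key) for line in head) for key in REQUIRED_KEYS)

for path in pathlib.Path("data/extracted_by_column").rglob("*.yaml"):
    if not header_ok(path):
        print(f"Missing generation metadata: {path}")
```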
Use the `make` command to generate project artefacts:
- `make all`: make everything
- `make deploy`: deploys site
This project was made with linkml-project-cookiecutter.