@d4dassistant

Summary

Created new D4D datasheet for Cell Maps for AI (CM4AI) based on comprehensive information from the CM4AI website and publications page.

Source Documentation

Files Added

  • data/sheets_d4dassistant/cm4ai_d4d.yaml - Comprehensive D4D YAML datasheet

Key Metadata Extracted

Dataset Overview

  • Dataset ID: cm4ai-cell-maps
  • Dataset Name: CM4AI Cell Maps for AI
  • Description: Multimodal cellular maps integrating proteomics, imaging, and CRISPR perturbation data

Three Complementary Mapping Approaches

  1. Proteomic Mass Spectrometry: Protein abundance, modifications, and interactions
  2. Cellular Imaging: High-resolution microscopy of cellular morphology and localization
  3. CRISPR/Cas9 Perturbation: Genetic knockouts and functional characterization

D4D Sections Populated

  • Motivation: Purpose, tasks, funding (NIH Bridge2AI)
  • Composition: Multimodal data types, sampling strategies, missing information
  • Collection: Acquisition methods, collection mechanisms, data collectors, ethical reviews
  • Preprocessing: Processing strategies for each modality, software tools used
  • Uses: Recommended use cases, discouraged uses, impact considerations
  • Distribution: Formats (mzML, TIFF, CSV, etc.), dates, licensing (open access principles)
  • Maintenance: Maintainers, update frequency, versioning, retention policies

Ethical & Governance

  • IRB approvals documented
  • De-identification procedures described
  • Consent processes included
  • Open access licensing following NIH policies

Known Issues

Schema Validation

⚠️ Important Note: LinkML schema validation currently fails both for this file and for existing D4D files in the repository (including VOICE_d4d_alldocs.yaml and CM4AI_d4d_alldocs.yaml, among others).

Validation issues observed:

  • Existing D4D files use complex nested structures that do not match what the current schema expects
  • The schema appears to expect simple strings for most fields, while the example files use structured objects
  • Together, these suggest the schema and the example files are out of sync

Recommendation: Schema review needed to align with the D4D file structure used across the repository.
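
To make the mismatch easier to see during schema review, here is a minimal sketch that lists which top-level fields of a parsed datasheet hold structured values (mappings or lists) rather than plain strings. The helper name `structured_fields` and the example mapping are hypothetical and invented for illustration; they are not taken from the repository or from the LinkML schema, and this is not a replacement for LinkML validation.

```python
from typing import Any


def structured_fields(doc: dict[str, Any]) -> dict[str, str]:
    """Map each top-level key holding a dict or list to that value's type name."""
    return {
        key: type(value).__name__
        for key, value in doc.items()
        if isinstance(value, (dict, list))
    }


# Hypothetical parsed datasheet fragment. The id and name come from this PR;
# the nested shapes are assumptions made up for the example.
example = {
    "id": "cm4ai-cell-maps",
    "name": "CM4AI Cell Maps for AI",
    "motivation": {"purpose": "placeholder text"},
    "distribution": [{"format": "mzML"}, {"format": "TIFF"}],
}

print(structured_fields(example))
```

Running something like this over each existing D4D file would give a rough inventory of the fields the schema would need to accept as structured objects.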

How to Review

  1. Read the YAML: Review data/sheets_d4dassistant/cm4ai_d4d.yaml for completeness and accuracy
  2. Verify against sources: Compare metadata with CM4AI website and publications
  3. Check required fields: Confirm id, name, title, and description are present and accurate
  4. Assess coverage: Evaluate whether all major D4D sections are adequately populated
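
Step 3 can be spot-checked with a short script. The following is a sketch only, assuming the datasheet has already been parsed into a Python mapping; the function name `missing_required_fields` and the sample mapping are hypothetical, and the check covers presence only, not accuracy.

```python
from typing import Any

# Required fields named in the review steps of this PR.
REQUIRED_FIELDS = ("id", "name", "title", "description")


def missing_required_fields(doc: dict[str, Any]) -> list[str]:
    """Return the required fields that are absent, None, or blank in doc."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = doc.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


# Hypothetical parsed content; "title" is left out on purpose to show a failure.
sample = {
    "id": "cm4ai-cell-maps",
    "name": "CM4AI Cell Maps for AI",
    "description": "Multimodal cellular maps integrating proteomics, "
                   "imaging, and CRISPR perturbation data",
}

print(missing_required_fields(sample))
```

In practice the mapping would likely come from loading data/sheets_d4dassistant/cm4ai_d4d.yaml with a YAML parser such as PyYAML's `yaml.safe_load`; that step is omitted here to keep the sketch free of third-party dependencies.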

Metadata Quality

Strengths:

  • Comprehensive coverage of all major D4D sections
  • Based on official CM4AI project documentation
  • Includes specific technical details (software tools, data formats, protocols)
  • Documents ethical review and governance procedures
  • References 30+ publications from CM4AI

Limitations:

  • MCP tools (ARTL, WebFetch) encountered API issues, preventing direct paper analysis
  • Some details inferred from project website rather than individual paper analysis
  • Specific numeric details (participant counts, dataset sizes) not available from sources accessed
  • Would benefit from review by CM4AI consortium members for technical accuracy

Next Steps

  1. Review datasheet content for accuracy
  2. Consider schema updates to support structured D4D metadata
  3. Optionally: Enhance with specific metrics from publications once MCP tool access is resolved
  4. Merge when satisfied with content

Related to: #91


🤖 Generated with D4D Assistant

- Comprehensive metadata for Cell Maps for AI (CM4AI) dataset
- Extracted from CM4AI website and publications page
- Covers all major D4D sections: motivation, composition, collection, preprocessing, uses, distribution, and maintenance
- Describes three complementary mapping approaches: proteomics, imaging, and CRISPR perturbation
- Includes ethical review information and data governance details

Note: Current schema validation shows errors due to schema-example mismatch.
Existing D4D files in the repository also fail validation with current schema.
Schema updates may be needed to align with existing D4D file structures.

Source URLs processed:
- https://cm4ai.org/
- https://cm4ai.org/publications/
- https://cm4ai.org/data-releases/

Co-Authored-By: Claude <noreply@anthropic.com>
@d4dassistant d4dassistant mentioned this pull request Nov 14, 2025