GroDecoder extracts and identifies the molecular components of structure files (PDB or GRO) issued from molecular dynamics simulations.

MDverse/grodecoder

GROdecoder 🔬

GROdecoder is a Python toolkit that extracts and identifies the molecular components of structure files (PDB, GRO, CRD, COOR) generated by molecular dynamics simulations. It detects and classifies proteins, nucleic acids, lipids, ions, solvents, and other small molecules, and reports a detailed molecular inventory in JSON format.

🌟 Features

  • Automatic Molecule Detection: Intelligently identifies proteins, nucleic acids, lipids, ions, solvents, and unknown molecules
  • Chain Segmentation: Detects individual protein/nucleic acid chains using distance-based connectivity analysis
  • Resolution Detection: Automatically determines if structures are all-atom or coarse-grained
  • Comprehensive Database: Built-in databases from CHARMM-GUI CSML and MAD for accurate molecular identification
  • Multiple Output Formats: Full or compact JSON serialization with optional atom indices
  • Flexible Input: Supports PDB, GRO, CRD, and COOR file formats
  • Web Interface: User-friendly Streamlit web application
  • Command Line Tool: Batch processing capabilities for high-throughput analysis

🚀 Quick Start

Installation

GROdecoder uses uv for dependency management:

git clone https://github.com/pierrepo/grodecoder.git
cd grodecoder
uv sync

Basic Usage

Command Line:

# Analyze a structure file
uv run grodecoder path/to/structure.gro

# Analyze a topology file together with its coordinates file
uv run grodecoder path/to/topology.psf path/to/coordinates.coor

# Output to stdout with compact format
uv run grodecoder structure.pdb --compact --stdout

# Custom bond threshold for chain detection
uv run grodecoder structure.gro --bond-threshold 3.5
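
For batch processing, the command-line tool can be driven from a short script. The sketch below only assembles and runs the `uv run grodecoder` invocations shown above, one per file. The helper names `build_commands` and `run_batch` are hypothetical, and the script assumes that the tool exits with a non-zero status on failure.

```python
import subprocess
from pathlib import Path


def build_commands(structure_files, bond_threshold=None):
    """Return one `uv run grodecoder ...` argument list per structure file."""
    commands = []
    for path in structure_files:
        command = ["uv", "run", "grodecoder", str(path)]
        if bond_threshold is not None:
            command += ["--bond-threshold", str(bond_threshold)]
        commands.append(command)
    return commands


def run_batch(directory, pattern="*.gro", bond_threshold=None):
    """Run GROdecoder on every matching file; return the files that failed.

    Assumption: a non-zero exit status signals a failed run.
    """
    failed = []
    files = sorted(Path(directory).glob(pattern))
    for path, command in zip(files, build_commands(files, bond_threshold)):
        result = subprocess.run(command)
        if result.returncode != 0:
            failed.append(path)
    return failed


# Preview the commands without running anything.
for command in build_commands(["a.gro", "b.gro"], bond_threshold=3.5):
    print(" ".join(command))
```

Calling `run_batch("path/to/structures")` from the repository root would then process every GRO file in that directory and return the ones that could not be analyzed.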

Python API:

import grodecoder as gd

# Decode a structure file
decoded = gd.decode_structure("structure.gro")

# Access the molecular inventory
inventory = decoded.inventory
print(f"Found {len(inventory.segments)} protein/nucleic acid chains")
print(f"Found {len(inventory.small_molecules)} small molecules")

# Get specific molecule types
proteins = [seg for seg in inventory.segments
           if seg.molecular_type == gd.MolecularType.PROTEIN]
lipids = [mol for mol in inventory.small_molecules
          if mol.molecular_type == gd.MolecularType.LIPID]

Web Interface:

uv run streamlit run scripts/streamlit_app.py

Then open your browser at http://localhost:8501

📊 Output Format

GROdecoder produces detailed JSON inventories with the following structure:

{
  "inventory": {
    "segments": [
      {
        "atoms": [0, 1, 2, ...],
        "sequence": "MKALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIV...",
        "molecular_type": "protein",
        "number_of_atoms": 1434,
        "number_of_residues": 89
      }
    ],
    "small_molecules": [
      {
        "atoms": [8080, 8081, 8082, ...],
        "name": "SOL",
        "description": "Water (TIP3P model)",
        "molecular_type": "solvent",
        "number_of_atoms": 17259,
        "number_of_residues": 5753
      }
    ]
  },
  "resolution": "all-atom",
  "database_version": "1.0.0"
}
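
Because the output is plain JSON, it can also be inspected with the standard library alone. As a minimal sketch, the snippet below tallies atoms and residues per molecular type. The embedded sample reuses the field names shown above with shortened atom lists and sequence, and the helper name `summarize_inventory` is hypothetical, not part of GROdecoder.

```python
import json
from collections import defaultdict

# Shortened sample that follows the field names of the structure above.
SAMPLE = """
{
  "inventory": {
    "segments": [
      {"atoms": [0, 1, 2], "sequence": "MKA", "molecular_type": "protein",
       "number_of_atoms": 3, "number_of_residues": 3}
    ],
    "small_molecules": [
      {"atoms": [3, 4, 5], "name": "SOL", "description": "Water (TIP3P model)",
       "molecular_type": "solvent", "number_of_atoms": 3, "number_of_residues": 1}
    ]
  },
  "resolution": "all-atom",
  "database_version": "1.0.0"
}
"""


def summarize_inventory(data):
    """Return {molecular_type: {"atoms": n, "residues": m}} (hypothetical helper)."""
    totals = defaultdict(lambda: {"atoms": 0, "residues": 0})
    inventory = data["inventory"]
    for entry in inventory["segments"] + inventory["small_molecules"]:
        bucket = totals[entry["molecular_type"]]
        bucket["atoms"] += entry["number_of_atoms"]
        bucket["residues"] += entry["number_of_residues"]
    return dict(totals)


summary = summarize_inventory(json.loads(SAMPLE))
for molecular_type, counts in sorted(summary.items()):
    print(f"{molecular_type}: {counts['atoms']} atoms, {counts['residues']} residues")
```

For a real inventory file, `json.load` on the opened file would replace `json.loads(SAMPLE)`.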

🔧 Advanced Features

Read back a GROdecoder inventory file

Reading a GROdecoder inventory file back gives access to the different parts of a system without having to identify them again:

from grodecoder import read_grodecoder_output

gro_results = read_grodecoder_output("1BRS_grodecoder.json")

# Print the sequence of each protein segment.
for segment in gro_results.decoded.inventory.segments:
    if segment.is_protein():
        print(segment.sequence)

In conjunction with the structure file, we can use the grodecoder output file to access the different parts of the system, as identified by grodecoder:

import MDAnalysis
from grodecoder import read_grodecoder_output


universe = MDAnalysis.Universe("tests/data/1BRS.pdb")
gro_results = read_grodecoder_output("1BRS_grodecoder.json")

# Prints the center of mass of each protein segment.
for segment in gro_results.decoded.inventory.segments:
    if segment.is_protein():
        seg: MDAnalysis.AtomGroup = universe.atoms[segment.atoms]
        print(seg.center_of_mass())

Chain Detection

GROdecoder detects protein and nucleic acid chains with a distance-based connectivity analysis. The bond threshold used for chain detection can be adjusted:

import grodecoder as gd

# Detect chains with a custom bond threshold
decoded = gd.decode_structure("multi_chain.pdb", bond_threshold=4.0)

# Access individual chains
for i, chain in enumerate(decoded.inventory.segments, start=1):
    print(f"Chain {i}: {len(chain.sequence)} residues")
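
To illustrate the general idea of distance-based segmentation, here is a minimal standalone sketch. It is not GROdecoder's implementation: it assumes one representative point per residue, listed in file order, and starts a new chain whenever two consecutive points are further apart than the threshold. The function name `split_chains` is hypothetical.

```python
import math


def split_chains(positions, bond_threshold):
    """Group consecutive points into chains (illustrative, not GROdecoder's code).

    positions: list of (x, y, z) tuples, one per residue, in file order.
    A new chain starts whenever the distance to the previous point
    exceeds bond_threshold (same length unit as the coordinates).
    Returns a list of chains, each a list of residue indices.
    """
    chains = []
    for index, point in enumerate(positions):
        if index > 0 and math.dist(positions[index - 1], point) <= bond_threshold:
            chains[-1].append(index)
        else:
            chains.append([index])
    return chains


# Three residues spaced 3.8 apart, a 20-unit gap, then two more residues.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (27.6, 0.0, 0.0), (31.4, 0.0, 0.0)]
print(split_chains(coords, bond_threshold=4.0))
```

A smaller threshold splits the same coordinates into more chains, which is why the cutoff may need tuning for unusual structures.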

Molecular Type Classification

The toolkit categorizes molecules into six types:

  • Proteins: Amino acid sequences
  • Nucleic Acids: DNA/RNA sequences
  • Lipids: Membrane components from CHARMM-GUI and MAD databases
  • Ions: Inorganic ions (Na+, Cl-, Ca2+, etc.)
  • Solvents: Water models and organic solvents
  • Unknown: Unidentified small molecules
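
The six categories can be pictured as an enumeration with a residue-name lookup that falls back to "unknown". The sketch below is illustrative only: `MolType`, `RESIDUE_TYPES`, and `classify` are hypothetical names, the lookup table is a tiny hand-written stand-in for the real databases, and the string values other than "protein" and "solvent" are assumptions.

```python
from enum import Enum


class MolType(Enum):
    """Hypothetical stand-in for the six categories listed above."""
    PROTEIN = "protein"
    NUCLEIC_ACID = "nucleic_acid"  # assumed spelling
    LIPID = "lipid"                # assumed spelling
    ION = "ion"                    # assumed spelling
    SOLVENT = "solvent"
    UNKNOWN = "unknown"            # assumed spelling


# Tiny illustrative lookup table; a real database is far larger.
RESIDUE_TYPES = {
    "ALA": MolType.PROTEIN,
    "GLY": MolType.PROTEIN,
    "DA": MolType.NUCLEIC_ACID,
    "POPC": MolType.LIPID,
    "NA": MolType.ION,
    "SOL": MolType.SOLVENT,
}


def classify(residue_name):
    """Map a residue name to a category, falling back to UNKNOWN."""
    return RESIDUE_TYPES.get(residue_name.strip().upper(), MolType.UNKNOWN)


for name in ["ALA", "POPC", "SOL", "XYZ"]:
    print(name, "->", classify(name).value)
```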

Resolution Detection

Automatically distinguishes between:

  • All-atom: High-resolution structures with complete atomic detail
  • Coarse-grained: Simplified representations with grouped atoms
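
One simple way to tell the two apart is to look at the shortest distance between any two particles. The sketch below is a heuristic written for illustration, not GROdecoder's method: it assumes coordinates in angstroms, assumes that all-atom structures contain covalent bonds shorter than the cutoff while coarse-grained beads sit further apart, and uses a hypothetical function name and cutoff value.

```python
import itertools
import math


def guess_resolution(positions, cutoff=1.6):
    """Guess "all-atom" or "coarse-grained" from the shortest pair distance.

    Assumptions (not taken from GROdecoder): coordinates are in angstroms,
    and only all-atom structures contain pairs closer than `cutoff`.
    The all-pairs loop is quadratic, so this is only suitable for small inputs.
    """
    shortest = min(
        (math.dist(a, b) for a, b in itertools.combinations(positions, 2)),
        default=math.inf,
    )
    return "all-atom" if shortest < cutoff else "coarse-grained"


# A water molecule (O, H, H) with O-H bonds of about 0.96 angstrom.
water = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
# Three beads spaced 4.7 angstroms apart.
beads = [(0.0, 0.0, 0.0), (4.7, 0.0, 0.0), (9.4, 0.0, 0.0)]

print(guess_resolution(water))
print(guess_resolution(beads))
```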

🗃️ Database System

GROdecoder includes comprehensive molecular databases:

Built-in Databases

  • Amino Acids: Standard and modified residues
  • Nucleotides: DNA/RNA bases and modifications
  • Ions: Common inorganic ions with various protonation states
  • Solvents: Water models (TIP3P, SPC/E, etc.) and organic solvents

External Database Integration

  • CHARMM-GUI CSML: All-atom lipid and small molecule definitions
  • MAD Database: Coarse-grained molecular definitions

Database Updates

# Update CHARMM-GUI CSML database
uv run scripts/build_lipid_database.py

# Update MAD database
uv run scripts/scrap_MAD.py

🧪 Testing

Run the comprehensive test suite:

# Run all tests
uv run pytest

# Run specific test categories
uv run pytest tests/test_identifier.py  # Core identification tests
uv run pytest tests/test_regression.py  # Regression tests
uv run pytest tests/test_toputils/      # Topology utility tests

🀝 Contributing

Contributions are welcome! Please feel free to:

  1. Report Issues: Bug reports and feature requests via GitHub Issues
  2. Submit Pull Requests: Code improvements and new features
  3. Add Molecules: Extend the molecular databases
  4. Improve Documentation: Help make GROdecoder more accessible

📄 License

GROdecoder is released under the BSD 3-Clause License. See LICENSE for details.

👨‍💻 Authors & Acknowledgments

Created by Pierre Poulain

Special thanks to:

  • Contributors to the CHARMM-GUI CSML database
  • MAD (Martini Database) maintainers
  • MDAnalysis community
  • Beta testers and early adopters


GROdecoder: Making molecular simulation analysis simple, accurate, and reproducible.
