GROdecoder is a Python toolkit for extracting and identifying molecular components from structure files (PDB, GRO, CRD, COOR) generated by molecular dynamics simulations. It automatically detects and classifies proteins, nucleic acids, lipids, ions, solvents, and other small molecules, and writes a detailed molecular inventory in JSON format.
- Automatic Molecule Detection: Intelligently identifies proteins, nucleic acids, lipids, ions, solvents, and unknown molecules
- Chain Segmentation: Detects individual protein/nucleic acid chains using distance-based connectivity analysis
- Resolution Detection: Automatically determines if structures are all-atom or coarse-grained
- Comprehensive Database: Built-in databases from CHARMM-GUI CSML and MAD for accurate molecular identification
- Multiple Output Formats: Full or compact JSON serialization with optional atom indices
- Flexible Input: Supports PDB, GRO, CRD, and COOR file formats
- Web Interface: User-friendly Streamlit web application
- Command Line Tool: Batch processing capabilities for high-throughput analysis
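The chain segmentation feature listed above can be illustrated with a minimal sketch. This is not GROdecoder's implementation: the function name `split_chains`, the use of one representative coordinate per residue, and the rule of starting a new chain whenever two consecutive residues are farther apart than the threshold are all assumptions made for illustration.

```python
import math


def split_chains(coords, threshold):
    """Group consecutive residues into chains (illustrative sketch only).

    coords: list of (x, y, z) tuples, one per residue, in file order.
    threshold: largest distance allowed between consecutive residues
               of the same chain (same unit as the coordinates).
    Returns a list of chains, each a list of residue indices.
    """
    chains = []
    for index, point in enumerate(coords):
        # Extend the current chain if this residue is close to the previous one,
        # otherwise start a new chain.
        if index > 0 and math.dist(coords[index - 1], point) <= threshold:
            chains[-1].append(index)
        else:
            chains.append([index])
    return chains


# Hypothetical coordinates: three residues, a large gap, then two residues.
residues = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
            (30.0, 0.0, 0.0), (33.8, 0.0, 0.0)]
chains = split_chains(residues, threshold=4.0)
print(chains)
```

With these made-up coordinates the gap between the third and fourth residues exceeds the threshold, so the sketch reports two chains. Lowering the threshold splits the system into more, shorter chains.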
GROdecoder uses uv for dependency management:
git clone https://github.com/pierrepo/grodecoder.git
cd grodecoder
uv sync
Command Line:
# Analyze a structure file
uv run grodecoder path/to/structure.gro
# Analyze a topology file together with a coordinates file
uv run grodecoder path/to/topology.psf path/to/coordinates.coor
# Output to stdout with compact format
uv run grodecoder structure.pdb --compact --stdout
# Custom bond threshold for chain detection
uv run grodecoder structure.gro --bond-threshold 3.5
Python API:
import grodecoder as gd
# Decode a structure file
decoded = gd.decode_structure("structure.gro")
# Access the molecular inventory
inventory = decoded.inventory
print(f"Found {len(inventory.segments)} protein/nucleic chains")
print(f"Found {len(inventory.small_molecules)} small molecules")
# Get specific molecule types
proteins = [seg for seg in inventory.segments
            if seg.molecular_type == gd.MolecularType.PROTEIN]
lipids = [mol for mol in inventory.small_molecules
          if mol.molecular_type == gd.MolecularType.LIPID]
Web Interface:
uv run streamlit run scripts/streamlit_app.py
Then open your browser at http://localhost:8501
GROdecoder produces detailed JSON inventories with the following structure:
{
  "inventory": {
    "segments": [
      {
        "atoms": [0, 1, 2, ...],
        "sequence": "MKALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIV...",
        "molecular_type": "protein",
        "number_of_atoms": 1434,
        "number_of_residues": 89
      }
    ],
    "small_molecules": [
      {
        "atoms": [8080, 8081, 8082, ...],
        "name": "SOL",
        "description": "Water (TIP3P model)",
        "molecular_type": "solvent",
        "number_of_atoms": 17259,
        "number_of_residues": 5753
      }
    ]
  },
  "resolution": "all-atom",
  "database_version": "1.0.0"
}
Reading a GROdecoder inventory file gives access to the different parts of a system without having to identify them again:
from grodecoder import read_grodecoder_output
gro_results = read_grodecoder_output("1BRS_grodecoder.json")
# Print the sequence of each protein segment.
for segment in gro_results.decoded.inventory.segments:
    if segment.is_protein():
        print(segment.sequence)
In conjunction with the structure file, we can use the GROdecoder output file to access the different parts of the system, as identified by GROdecoder:
import MDAnalysis
from grodecoder import read_grodecoder_output
universe = MDAnalysis.Universe("tests/data/1BRS.pdb")
gro_results = read_grodecoder_output("1BRS_grodecoder.json")
# Print the center of mass of each protein segment.
for segment in gro_results.decoded.inventory.segments:
    if segment.is_protein():
        seg: MDAnalysis.AtomGroup = universe.atoms[segment.atoms]
        print(seg.center_of_mass())
GROdecoder detects protein and nucleic acid chains with a distance-based connectivity analysis:
import grodecoder as gd
# Detect chains with custom cutoff
decoded = gd.decode_structure("multi_chain.pdb", bond_threshold=4.0)
# Access individual chains
for i, chain in enumerate(decoded.inventory.segments):
    print(f"Chain {i+1}: {len(chain.sequence)} residues")
The toolkit categorizes molecules into six types:
- Proteins: Amino acid sequences
- Nucleic Acids: DNA/RNA sequences
- Lipids: Membrane components from CHARMM-GUI and MAD databases
- Ions: Inorganic ions (Na+, Cl-, Ca2+, etc.)
- Solvents: Water models and organic solvents
- Unknown: Unidentified small molecules
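Because every entry in the JSON inventory carries a `molecular_type` field, these categories can be tallied with the standard library alone. The sketch below assumes the JSON layout shown earlier in this README; the sample document is a made-up, shortened inventory, and `count_atoms_by_type` is a hypothetical helper, not part of the GROdecoder API.

```python
import json
from collections import Counter

# Made-up, shortened inventory following the layout documented in this README.
sample = """
{
  "inventory": {
    "segments": [
      {"atoms": [0, 1, 2], "sequence": "MKA", "molecular_type": "protein",
       "number_of_atoms": 3, "number_of_residues": 3}
    ],
    "small_molecules": [
      {"atoms": [3, 4, 5], "name": "SOL", "description": "Water (TIP3P model)",
       "molecular_type": "solvent", "number_of_atoms": 3, "number_of_residues": 1}
    ]
  },
  "resolution": "all-atom",
  "database_version": "1.0.0"
}
"""


def count_atoms_by_type(document):
    """Sum number_of_atoms per molecular_type over segments and small molecules."""
    inventory = document["inventory"]
    totals = Counter()
    for entry in inventory["segments"] + inventory["small_molecules"]:
        totals[entry["molecular_type"]] += entry["number_of_atoms"]
    return dict(totals)


totals = count_atoms_by_type(json.loads(sample))
print(totals)
```

The same function should work on a real output file loaded with `json.load`, provided the file follows the documented layout.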
GROdecoder automatically distinguishes between two resolutions:
- All-atom: High-resolution structures with complete atomic detail
- Coarse-grained: Simplified representations with grouped atoms
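One plausible way to tell the two resolutions apart is to look at interparticle distances: covalently bonded atoms sit roughly 1 to 1.5 Å apart, whereas coarse-grained beads are usually separated by several ångströms. The sketch below is a heuristic written for illustration, not GROdecoder's detection code; the function name `guess_resolution`, the 2.0 Å cutoff, and the `"coarse-grained"` label are assumptions.

```python
import math
from itertools import combinations


def guess_resolution(coords, cutoff=2.0):
    """Guess the resolution from particle coordinates given in angstroms.

    Returns "all-atom" if any two particles are closer than `cutoff`,
    otherwise "coarse-grained" (also returned for fewer than two particles).
    The check is quadratic in the number of particles, so it is meant
    for a small sample such as a single residue.
    """
    for a, b in combinations(coords, 2):
        if math.dist(a, b) < cutoff:
            return "all-atom"
    return "coarse-grained"


# Hypothetical samples: a water molecule with explicit hydrogens,
# and three beads spaced 4.7 angstroms apart.
water = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
beads = [(0.0, 0.0, 0.0), (4.7, 0.0, 0.0), (9.4, 0.0, 0.0)]
print(guess_resolution(water))
print(guess_resolution(beads))
```

A fixed cutoff is a simplification. Coordinates stored in nanometers (as in GRO files) would need converting to ångströms before applying it.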
GROdecoder includes comprehensive molecular databases:
- Amino Acids: Standard and modified residues
- Nucleotides: DNA/RNA bases and modifications
- Ions: Common inorganic ions with various protonation states
- Solvents: Water models (TIP3P, SPC/E, etc.) and organic solvents
- CHARMM-GUI CSML: All-atom lipid and small molecule definitions
- MAD Database: Coarse-grained molecular definitions
# Update CHARMM-GUI CSML database
uv run scripts/build_lipid_database.py
# Update MAD database
uv run scripts/scrap_MAD.py
Run the comprehensive test suite:
# Run all tests
uv run pytest
# Run specific test categories
uv run pytest tests/test_identifier.py # Core identification tests
uv run pytest tests/test_regression.py # Regression tests
uv run pytest tests/test_toputils/ # Topology utility tests
Contributions are welcome! Please feel free to:
- Report Issues: Bug reports and feature requests via GitHub Issues
- Submit Pull Requests: Code improvements and new features
- Add Molecules: Extend the molecular databases
- Improve Documentation: Help make GROdecoder more accessible
GROdecoder is released under the BSD 3-Clause License. See LICENSE for details.
Created by Pierre Poulain
Special thanks to:
- Contributors to the CHARMM-GUI CSML database
- MAD (Martini Database) maintainers
- MDAnalysis community
- Beta testers and early adopters
- Documentation: GitHub Repository
- Web Demo: Streamlit App
- Issues & Support: GitHub Issues
GROdecoder: Making molecular simulation analysis simple, accurate, and reproducible.