A comprehensive Python package for extracting features from both molecules and proteins for machine learning applications, with special support for graph neural networks and protein-ligand modeling.
pip install git+https://github.com/eightmm/plfeature.gitMolecule Features
from plfeature import MoleculeFeaturizer
from rdkit import Chem
# From SDF file
suppl = Chem.SDMolSupplier('molecules.sdf')
for mol in suppl:
featurizer = MoleculeFeaturizer(mol) # Initialize with molecule
features = featurizer.get_feature()
node, edge, adj = featurizer.get_graph()
# From SMILES
featurizer = MoleculeFeaturizer("CC(=O)Oc1ccccc1C(=O)O")
features = featurizer.get_feature() # All descriptors and fingerprints
node, edge, adj = featurizer.get_graph() # Graph representationProtein Features
from plfeature import ProteinFeaturizer
featurizer = ProteinFeaturizer("protein.pdb")
# Sequence extraction by chain
sequences_by_chain = featurizer.get_sequence_by_chain() # {'A': "ACDEF...", 'B': "GHIKL..."}
# Atom-level graph features (node/edge format with SASA included)
atom_node, atom_edge = featurizer.get_atom_graph(distance_cutoff=4.0)
# Residue-level features (node/edge format)
res_node, res_edge = featurizer.get_features(distance_cutoff=8.0)Hierarchical Features with ESM Embeddings
from plfeature.protein_featurizer import HierarchicalFeaturizer
# Initialize (loads ESMC + ESM3 models)
featurizer = HierarchicalFeaturizer()
# Extract all features
data = featurizer.featurize("protein.pdb")
# Atom-level (integer indices for nn.Embedding)
atom_tokens = data.atom_tokens # [N_atom] - indices 0-186
atom_elements = data.atom_elements # [N_atom] - indices 0-7
atom_coords = data.atom_coords # [N_atom, 3]
# Residue-level
residue_features = data.residue_features # [N_res, 76]
residue_vectors = data.residue_vector_features # [N_res, 31, 3]
# ESM embeddings (6 tensors)
esmc_emb = data.esmc_embeddings # [N_res, 1152]
esmc_bos = data.esmc_bos # [1152]
esmc_eos = data.esmc_eos # [1152]
esm3_emb = data.esm3_embeddings # [N_res, 1536]
esm3_bos = data.esm3_bos # [1536]
esm3_eos = data.esm3_eos # [1536]
# Atom-residue mapping for pooling
atom_to_residue = data.atom_to_residue # [N_atom]
# Select subset of residues (e.g., binding pocket)
pocket = data.select_residues([10, 11, 12, 45, 46])Interaction Features (PLI)
from plfeature import PLInteractionFeaturizer
from rdkit import Chem
# Load protein and ligand molecules (must have 3D coordinates)
protein_mol = Chem.MolFromPDBFile("protein.pdb")
ligand_mol = Chem.MolFromMolFile("ligand.sdf")
# Initialize featurizer with both molecules
featurizer = PLInteractionFeaturizer(protein_mol, ligand_mol, distance_cutoff=4.5)
# Get interaction edges and features for GNN
edges, edge_features = featurizer.get_interaction_edges()
# edges: [2, num_interactions] - (protein_atom_idx, ligand_atom_idx) pairs
# edge_features: [num_interactions, 73] - comprehensive interaction features
# Get complete interaction graph data
graph = featurizer.get_interaction_graph()
# Includes: edges, edge_features, interactions, interaction_counts, metadata
# Detect specific interaction types
hbonds = featurizer.detect_hydrogen_bonds()
salt_bridges = featurizer.detect_salt_bridges()
pi_stacking = featurizer.detect_pi_stacking()
hydrophobic = featurizer.detect_hydrophobic()
# Get pharmacophore features for atoms
protein_pharm, ligand_pharm = featurizer.get_atom_pharmacophore_features()
# protein_pharm: [n_protein_atoms, 7] - pharmacophore type one-hot
# ligand_pharm: [n_ligand_atoms, 7]
# Get summary of detected interactions
print(featurizer.get_interaction_summary())PDB Standardization
from plfeature import PDBStandardizer
# Default: Convert PTMs to base amino acids (for ESM models, MD simulation)
standardizer = PDBStandardizer() # ptm_handling='base_aa' (default)
standardizer.standardize("messy.pdb", "clean.pdb")
# For protein-ligand modeling: Preserve PTM atoms
standardizer = PDBStandardizer(ptm_handling='preserve')
standardizer.standardize("protein.pdb", "protein_with_ptms.pdb")
# Remove PTM residues entirely (standard AA only)
standardizer = PDBStandardizer(ptm_handling='remove')
standardizer.standardize("protein.pdb", "protein_standard_only.pdb")
# Keep hydrogens
standardizer = PDBStandardizer(remove_hydrogens=False)
standardizer.standardize("messy.pdb", "clean.pdb")
# Or use convenience function
from plfeature import standardize_pdb
standardize_pdb("messy.pdb", "clean.pdb", ptm_handling='base_aa')
# Standardization automatically:
# - Removes water molecules
# - Removes DNA/RNA residues
# - Handles PTMs based on ptm_handling mode (see below)
# - Normalizes protonation states (HID/HIE/HIPβHIS)
# - Reorders atoms by standard definitions
# - Renumbers residues sequentially (insertion codes removed)
# - Removes hydrogens (optional)- Descriptors: 40 normalized molecular properties β Details
- Fingerprints: 9 types including Morgan, MACCS, RDKit β Details
- Graph Features: 157D atom features, 66D bond features β Details
- Interaction Types: 7 categories (hydrogen bond, salt bridge, pi-stacking, cation-pi, hydrophobic, halogen bond, metal coordination)
- Edge Features: 73-dimensional features per interaction (interaction type, distance, angles, element types, hybridization, residue info)
- Pharmacophore Detection: SMARTS-based detection for H-bond donors/acceptors, charged groups, aromatics, hydrophobics, halogens
- Heavy Atom Only: Graph nodes use heavy atoms only, hydrogen info encoded in edge features (angles)
- Atom Features: 187 token types with atomic SASA β Details
- Residue Features: Geometry, SASA, contacts, secondary structure β Details
- Hierarchical Features: Atom-residue attention with ESM embeddings β Details
- Atom-level: 187 tokens (integer indices), 8 elements, 22 residue types
- Residue-level: 76-dim scalar + 31x3 vector features
- ESM embeddings: ESMC (1152-dim) + ESM3 (1536-dim) with BOS/EOS tokens
- Graph Representations: Both atom and residue-level networks
- Standardization: Flexible PTM handling with 3 modes (base_aa, preserve, remove)
Custom SMARTS Patterns for Molecules
from plfeature import MoleculeFeaturizer
# Define custom SMARTS patterns
custom_patterns = {
'aromatic_nitrogen': 'n',
'carboxyl': 'C(=O)O',
'hydroxyl': '[OH]',
'amine': '[NX3;H2,H1;!$(NC=O)]'
}
# Initialize with custom patterns (and optionally control hydrogen addition)
featurizer = MoleculeFeaturizer("c1ccncc1CCO", hydrogen=False, custom_smarts=custom_patterns)
# Get graph with custom features automatically included
node, edge, adj = featurizer.get_graph()
# node['node_feats'] is now 157 + n_patterns dimensions
# If you need to check patterns separately
custom_feats = featurizer.get_custom_smarts_features()
# Returns: {'features': tensor, 'names': [...], 'patterns': {...}}PTM (Post-Translational Modification) Handling
from plfeature import PDBStandardizer
# Mode 1: 'base_aa' - Convert PTMs to base amino acids (DEFAULT)
# Use for: ESM models, AlphaFold, MD simulations with standard force fields
standardizer = PDBStandardizer(ptm_handling='base_aa')
standardizer.standardize("protein.pdb", "protein_esm.pdb")
# SEP β SER (phosphate removed)
# PTR β TYR (phosphate removed)
# MSE β MET (selenium β sulfur)
# MLY β LYS (methyl groups removed)
# Mode 2: 'preserve' - Keep PTM atoms intact
# Use for: Protein-ligand binding analysis, structural studies with PTMs
standardizer = PDBStandardizer(ptm_handling='preserve')
standardizer.standardize("protein.pdb", "protein_structure.pdb")
# SEP stays as SEP with all phosphate atoms
# PTR stays as PTR with all phosphate atoms
# MSE stays as MSE with selenium atom
# Mode 3: 'remove' - Remove PTM residues entirely
# Use for: Standard-AA-only analysis
standardizer = PDBStandardizer(ptm_handling='remove')
standardizer.standardize("protein.pdb", "protein_standard.pdb")
# SEP residues removed
# MSE residues removed
# Only 20 standard amino acids remain
# Supported PTMs (15+ types):
# - Phosphorylation: SEP, TPO, PTR
# - Selenomethionine: MSE
# - Methylation/Acetylation: MLY, M3L, ALY
# - Hydroxylation: HYP
# - Cysteine modifications: CSO, CSS, CME, OCS
# - Others: MEN, FMEProtein Sequence Extraction
from plfeature import ProteinFeaturizer
featurizer = ProteinFeaturizer("protein.pdb")
# Get sequences separated by chain
sequences_by_chain = featurizer.get_sequence_by_chain()
for chain_id, sequence in sequences_by_chain.items():
print(f"Chain {chain_id}: {sequence} (length: {len(sequence)})")
# If you need the full sequence, concatenate:
full_sequence = ''.join(sequences_by_chain.values())
print(f"Full sequence: {full_sequence}")
# Example output:
# Chain A: ACDEFGHIKLMNPQRSTVWY (length: 20)
# Chain B: GHIKLMNPQR (length: 10)Contact Maps with Different Thresholds
from plfeature import ProteinFeaturizer
featurizer = ProteinFeaturizer("protein.pdb")
# Different thresholds for different analyses
close_contacts = featurizer.get_contact_map(cutoff=4.5) # Close contacts only
standard_contacts = featurizer.get_contact_map(cutoff=8.0) # Standard threshold
extended_contacts = featurizer.get_contact_map(cutoff=12.0) # Extended interactions
# Access contact information
edges = standard_contacts['edges']
distances = standard_contacts['distances']
adjacency = standard_contacts['adjacency_matrix']Batch Processing (Python)
from plfeature import MoleculeFeaturizer
import torch
smiles_list = ["CCO", "CC(=O)O", "c1ccccc1"]
all_features = []
for smiles in smiles_list:
featurizer = MoleculeFeaturizer(smiles)
features = featurizer.get_feature()
all_features.append(features['descriptor'])
descriptors = torch.stack(all_features)Batch Processing (CLI Scripts)
For large-scale feature extraction, use the provided CLI scripts in the scripts/ directory.
# Basic usage - extract hierarchical features with ESM embeddings
python scripts/batch_protein_featurize.py \
--input_dir /data/proteins \
--output_dir /data/protein-features
# Parallel processing with 4 workers
python scripts/batch_protein_featurize.py \
--input_dir /data/proteins \
--output_dir /data/protein-features \
--num_workers 4
# Resume interrupted processing
python scripts/batch_protein_featurize.py \
--input_dir /data/proteins \
--output_dir /data/protein-features \
--resumeOutput features (.pt files):
| Feature | Shape | Description |
|---|---|---|
atom_tokens |
[N_atom] |
Token indices 0-186 (187 classes) |
atom_elements |
[N_atom] |
Element indices 0-7 (8 classes) |
atom_residue_types |
[N_atom] |
Residue indices 0-21 (22 classes) |
atom_coords |
[N_atom, 3] |
3D coordinates |
atom_sasa |
[N_atom] |
Solvent accessible surface area |
residue_features |
[N_res, 76] |
Residue-level scalar features |
residue_vector_features |
[N_res, 31, 3] |
Residue-level vector features |
esmc_embeddings |
[N_res, 1152] |
ESMC embeddings |
esm3_embeddings |
[N_res, 1536] |
ESM3 embeddings |
atom_to_residue |
[N_atom] |
Atom-to-residue mapping |
# Basic usage - extract all features (descriptors + fingerprints + graph)
python scripts/batch_ligand_featurize.py \
--input_dir /data/ligands \
--output_dir /data/ligand-features
# Graph features only (faster, smaller files)
python scripts/batch_ligand_featurize.py \
--input_dir /data/ligands \
--output_dir /data/ligand-features \
--graph_only
# Parallel processing
python scripts/batch_ligand_featurize.py \
--input_dir /data/ligands \
--output_dir /data/ligand-features \
--num_workers 4
# Resume interrupted processing
python scripts/batch_ligand_featurize.py \
--input_dir /data/ligands \
--output_dir /data/ligand-features \
--resumeSupported file formats: SDF, MOL2, MOL, PDB (priority order)
Output features (.pt files):
| Feature | Shape | Description |
|---|---|---|
node_feats |
[N_atoms, 157] |
Atom-level features |
edge_feats |
[N_edges, 66] |
Bond-level features |
edge_index |
[2, N_edges] |
Edge connectivity |
adjacency |
[N_atoms, N_atoms] |
Adjacency matrix |
descriptor |
[40] |
Molecular descriptors (not with --graph_only) |
morgan |
[2048] |
Morgan fingerprint (not with --graph_only) |
maccs |
[167] |
MACCS fingerprint (not with --graph_only) |
Check out the example/ directory for:
- 10gs_protein.pdb and 10gs_ligand.sdf: Example protein-ligand complex files
- usage_example.ipynb: Jupyter notebook with comprehensive usage examples
- README.md: Quick reference for example files
- Feature Types Overview - Quick overview of all features
- Molecule Features - Descriptors & fingerprints guide
- Molecule Graph Features - Graph representations for molecules
- Protein Atom Features - Atom-level features guide
- Protein Residue Features - Residue-level features guide
- Protein Hierarchical Features - ESM embeddings for proteins
MIT License - see LICENSE file
@software{plfeature2025,
title = {plfeature: Unified molecular and protein feature extraction for protein-ligand modeling},
author = {Jaemin Sim},
year = {2025},
url = {https://github.com/eightmm/plfeature}
}