A PyTorch implementation for sign language detection and recognition using Spatio-Temporal Graph Convolutional Networks (ST-GCN). The model incorporates signer-aware attention and adversarial training to reduce signer bias, and supports both multi-class sign classification and binary signing detection.
- ST-GCN Backbone: Spatio-temporal graph convolutional layers for modeling sign language gestures
- Bias-Aware Attention: Attention mechanism conditioned on signer embeddings to adapt to different signing styles
- Adversarial Signer Head: Optional domain-adversarial training via a gradient reversal layer (sketched after this list) to reduce signer-specific bias
- Supervised Contrastive Learning: Optional InfoNCE-based contrastive loss for better feature learning
- Flexible Landmark Support: Works with pose, hands, and face landmarks (e.g., from MediaPipe or OpenPose)
- Pseudo-Signer Clustering: K-means clustering of pose statistics to create pseudo signer IDs for unsupervised signer diversity
- Binary & Multi-Class: Supports both binary classification (signing vs. not-signing) and multi-class sign recognition
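The adversarial signer head relies on the standard gradient reversal trick from domain-adversarial training (Ganin & Lempitsky, 2015): the layer is the identity on the forward pass and negates (and scales) gradients on the backward pass, so the backbone learns features the signer head cannot exploit. A minimal PyTorch sketch of such a layer (the repo's own implementation may differ in detail):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; gradients are scaled by -lambda on the way back."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # The reversed gradient pushes the backbone to remove signer information
        # while the signer head itself still learns to predict signer identity.
        return -ctx.lam * grad_out, None

def grad_reverse(x: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    return GradReverse.apply(x, lam)
```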
The model consists of several key components, which compose as in the sketch after this list:
- ST-GCN Backbone: Processes spatio-temporal keypoint sequences using graph convolutions
- Signer Encoder: Encodes per-window pose statistics into signer embeddings
- Bias-Aware Attention: Temporally attends to features while being conditioned on signer style
- Temporal Aggregator: BiGRU-based aggregation over time
- Classifier Head: Final classification layer
- Signer Head (optional): Adversarial head for signer identification to encourage signer-invariant features
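Schematically, the components fit together as follows. This is a sketch, not the literal code in model.py; the attribute names, shapes, and the `grad_reverse` helper from the sketch above are assumptions:

```python
import torch

def forward_sketch(model, keypoints, pose_stats, lambda_grl=0.5):
    # keypoints: (B, T, V, C) keypoint windows; pose_stats: (B, S) per-window statistics
    feats = model.backbone(keypoints)                # ST-GCN features, (B, T, D)
    signer_emb = model.signer_encoder(pose_stats)    # signer style embedding, (B, E)
    feats = model.bias_attention(feats, signer_emb)  # attention conditioned on signer style
    pooled = model.temporal_aggregator(feats)        # BiGRU aggregation over time, (B, H)
    logits = model.classifier(pooled)                # sign class / signing-vs-not logits
    # Optional adversarial branch: gradient reversal before the signer head.
    signer_logits = model.signer_head(grad_reverse(pooled, lambda_grl))
    return logits, signer_logits
```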
```bash
pip install -r requirements.txt
```

Core dependencies:

- PyTorch >= 2.1.0
- NumPy >= 1.23
- scikit-learn >= 1.3
- matplotlib >= 3.7
- tqdm >= 4.65
Optional (for landmark extraction):
- MediaPipe >= 0.10.0
- OpenCV >= 4.7.0
The model expects:
- Landmark files: `.npz` files containing keypoint sequences. Each file should contain arrays for:
  - `pose`: `(T, 33, C)` - body pose landmarks
  - `left_hand`: `(T, 21, C)` - left hand landmarks (optional)
  - `right_hand`: `(T, 21, C)` - right hand landmarks (optional)
  - `face`: `(T, 478, C)` - face landmarks (optional)
- Groundtruth file: a text file with one `<video_id> <frame_idx> <label>` entry per line, where `s` means signing and `n` means not signing. Example:

```
video001 0 s
video001 25 s
video001 50 n
```
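For reference, a compatible `.npz` file can be produced with the optional MediaPipe dependency. The helper below is hypothetical (the repo may ship its own extractor); it writes `pose`, `left_hand`, and `right_hand` arrays in the layout above, with missing detections zero-filled:

```python
import cv2
import numpy as np
import mediapipe as mp

def extract_landmarks(video_path, out_path, coords=2):
    pose, lhand, rhand = [], [], []
    holistic = mp.solutions.holistic.Holistic()
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        res = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        def to_array(lms, n_points):
            # Zero-fill missing detections so every frame has a fixed shape.
            if lms is None:
                return np.zeros((n_points, coords), dtype=np.float32)
            return np.array([[p.x, p.y, p.z][:coords] for p in lms.landmark],
                            dtype=np.float32)

        pose.append(to_array(res.pose_landmarks, 33))
        lhand.append(to_array(res.left_hand_landmarks, 21))
        rhand.append(to_array(res.right_hand_landmarks, 21))
    cap.release()
    holistic.close()
    np.savez(out_path, pose=np.stack(pose),
             left_hand=np.stack(lhand), right_hand=np.stack(rhand))

extract_landmarks("video001.mp4", "video001.npz")
```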
Multi-class sign recognition:

```bash
python -m train_landmarks \
    --data /path/to/landmarks/folder \
    --batch 32 \
    --epochs 50 \
    --lr 1e-3 \
    --num-classes 100 \
    --window 25 \
    --stride 1 \
    --include-pose \
    --include-hands \
    --save-best \
    --log-csv \
    --out runs/signgcn
```

Binary signing detection (signing vs. not-signing):

```bash
python -m train_landmarks \
    --data /path/to/landmarks/folder \
    --binary \
    --signing-labels "sign,signing,gesture" \
    --batch 32 \
    --epochs 50 \
    --best-metric pr_auc \
    --out runs/signgcn_binary
```

Pseudo-signer training uses K-means clustering on pose statistics to create pseudo signer IDs for adversarial training:
```bash
python -m train_landmarks \
    --data /path/to/landmarks/folder \
    --use-pseudo-signers \
    --num-pseudo-signers 8 \
    --signer-loss-weight 0.5 \
    --batch 32 \
    --epochs 50 \
    --out runs/signgcn_pseudo
```
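Conceptually, the pseudo-signer IDs come from clustering cheap per-window pose statistics. A scikit-learn sketch (the statistics used here are illustrative; the trainer may compute different ones):

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_signer_ids(windows, n_clusters=8):
    """windows: (N, T, V, C) pose windows -> (N,) pseudo signer labels."""
    # Style descriptor: per-keypoint mean and std over time, flattened.
    mean = windows.mean(axis=1).reshape(len(windows), -1)
    std = windows.std(axis=1).reshape(len(windows), -1)
    stats = np.concatenate([mean, std], axis=1)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(stats)
```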
Supervised contrastive training adds an InfoNCE-based contrastive loss:

```bash
python -m train_landmarks \
    --data /path/to/landmarks/folder \
    --use-supcon \
    --supcon-weight 0.1 \
    --supcon-temp 0.07 \
    --batch 32 \
    --epochs 50 \
    --out runs/signgcn_supcon
```
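This loss family treats same-label windows in a batch as positives. A compact sketch of such a supervised InfoNCE loss (losses.py may differ in details such as multi-view batches):

```python
import torch
import torch.nn.functional as F

def supcon_loss(feats, labels, temp=0.07):
    """feats: (B, D) embeddings, labels: (B,) class labels."""
    z = F.normalize(feats, dim=1)
    sim = z @ z.t() / temp                                   # (B, B) cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    # Log-softmax over all non-self samples, averaged over each anchor's positives.
    denom = torch.logsumexp(sim.masked_fill(self_mask, float('-inf')), dim=1, keepdim=True)
    log_prob = sim - denom
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss[pos_mask.any(1)].mean()  # anchors without positives are skipped
```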
To train with fixed video-level splits:

```bash
# First, create splits
python -m build_video_splits \
    --data /path/to/landmarks/folder \
    --window 25 \
    --stride 1 \
    --train 0.8 \
    --val 0.1 \
    --test 0.1 \
    --out /path/to/landmarks/folder/splits.json
```
```bash
# Then train with splits
python -m train_landmarks \
    --data /path/to/landmarks/folder \
    --splits-json /path/to/landmarks/folder/splits.json \
    --batch 32 \
    --epochs 50 \
    --out runs/signgcn_splits
```

Common arguments:

- `--data`: Path to the folder containing `.npz` files and the groundtruth file
- `--window`: Temporal window size in frames (default: 25; see the slicing sketch after this list)
- `--stride`: Sliding-window stride (default: 1)
- `--coords`: Number of coordinates per keypoint (default: 2, i.e., x, y)
- `--include-pose`: Include body pose landmarks
- `--include-hands`: Include hand landmarks
- `--include-face`: Include face landmarks
- `--batch`: Batch size
- `--epochs`: Number of training epochs
- `--lr`: Learning rate
- `--num-classes`: Number of sign classes (for multi-class training)
- `--binary`: Train a binary classifier (signing vs. not-signing)
- `--use-pseudo-signers`: Enable pseudo-signer clustering
- `--use-supcon`: Enable supervised contrastive learning
- `--save-best`: Save the best model checkpoint
- `--best-metric`: Metric for best-model selection (`acc`, `loss`, `pr_auc`, `f1`)
- `--log-csv`: Log metrics to a CSV file
- `--out`: Output directory for checkpoints and logs
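To make `--window` and `--stride` concrete: training samples are overlapping slices of each keypoint sequence. A sketch of that slicing (the dataset class in datasets/landmarks_npz.py may handle edge cases differently):

```python
import numpy as np

def slice_windows(seq, window=25, stride=1):
    """seq: (T, V, C) keypoints -> (N, window, V, C) overlapping windows."""
    if len(seq) < window:
        # Sequences shorter than the window yield no samples in this sketch.
        return np.empty((0, window) + seq.shape[1:], dtype=seq.dtype)
    starts = range(0, len(seq) - window + 1, stride)
    return np.stack([seq[s:s + window] for s in starts])
```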
Visualize signer embeddings with t-SNE or PCA:
```bash
python -m visualize_signer_embeddings \
    --data /path/to/landmarks/folder \
    --ckpt runs/signgcn/best.pt \
    --window 25 \
    --include-pose \
    --include-hands \
    --reduce tsne \
    --out embeddings_plot.png \
    --save-csv embeddings.csv
```
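To reduce and plot saved embeddings yourself, the equivalent scikit-learn/matplotlib steps look like this. The CSV layout is an assumption (embedding columns followed by a numeric signer ID); adjust to whatever `--save-csv` actually writes:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Assumed layout: all but the last column are the embedding, last column is a signer ID.
data = np.loadtxt("embeddings.csv", delimiter=",", skiprows=1)
emb, signer = data[:, :-1], data[:, -1].astype(int)

xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)
plt.scatter(xy[:, 0], xy[:, 1], c=signer, s=5, cmap="tab10")
plt.title("Signer embeddings (t-SNE)")
plt.savefig("embeddings_tsne.png", dpi=150)
```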
Create train/val/test splits based on videos:

```bash
python -m build_video_splits \
    --data /path/to/landmarks/folder \
    --window 25 \
    --stride 1 \
    --s-label S \
    --rule any \
    --train 0.8 \
    --val 0.1 \
    --test 0.1 \
    --out splits.json
```
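Splitting at the video level keeps overlapping windows from one video out of both train and test. A sketch of that grouping logic (illustrative; build_video_splits.py also applies the `--s-label`/`--rule` filtering):

```python
import json
import random

def split_videos(video_ids, train=0.8, val=0.1, seed=0):
    """Assign whole videos to splits so windows from one video never leak across them."""
    vids = sorted(set(video_ids))
    random.Random(seed).shuffle(vids)
    n_train, n_val = int(train * len(vids)), int(val * len(vids))
    splits = {
        "train": vids[:n_train],
        "val": vids[n_train:n_train + n_val],
        "test": vids[n_train + n_val:],
    }
    with open("splits.json", "w") as f:
        json.dump(splits, f, indent=2)
    return splits
```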
Train all model variations on a small subset of data for quick testing:

```bash
python train_all_variations.py \
    --data /path/to/landmarks/folder \
    --max-files 3 \
    --epochs 2 \
    --batch 8 \
    --base-out runs/test_variations
```

This script will train:
- Basic training
- Binary classification
- Pseudo-signer clustering
- Supervised contrastive learning
- FiLM attention (sketched after this list)
- MI minimization
- Combined variations
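FiLM (feature-wise linear modulation) conditions features on the signer embedding by predicting a per-channel scale and shift. A minimal sketch of the idea (the module in modules/attention.py may differ):

```python
import torch.nn as nn

class FiLM(nn.Module):
    """Scale and shift temporal features with parameters predicted from a conditioning vector."""

    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, feats, cond):
        # feats: (B, T, D) temporal features; cond: (B, C), e.g. a signer embedding
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * feats + beta.unsqueeze(1)
```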
Options:
- `--max-files`: Limit the number of `.npz` files to use (default: 3)
- `--epochs`: Number of epochs per variation (default: 2)
- `--batch`: Batch size (default: 8)
- `--base-out`: Base output directory (default: `runs/test_variations`)
- `--skip-*`: Skip specific variations (e.g., `--skip-binary`, `--skip-pseudo`)
The model can be customized through various hyperparameters (an example instantiation follows this list):
- `stgcn_channels`: Hidden dimensions for the ST-GCN layers (default: `(64, 128, 128)`)
- `stgcn_kernel`: Temporal kernel size (default: `3`)
- `stgcn_dilations`: Dilation rates for temporal convolutions (default: `(1, 2, 3)`)
- `temporal_hidden`: Hidden dimension of the temporal aggregator (default: `256`)
- `signer_emb_dim`: Dimension of the signer embedding (default: `64`)
- `attn_heads`: Number of attention heads (default: `4`)
- `lambda_grl`: Gradient-reversal weight for adversarial training (default: `0.5`)
- `dropout`: Dropout rate (default: `0.1`)
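For example, a hypothetical instantiation (the exact constructor signature of SignSTGCNModel in model.py may differ; the argument names mirror the list above):

```python
from model import SignSTGCNModel

model = SignSTGCNModel(
    stgcn_channels=(64, 128, 128),
    stgcn_kernel=3,
    stgcn_dilations=(1, 2, 3),
    temporal_hidden=256,
    signer_emb_dim=64,
    attn_heads=4,
    lambda_grl=0.5,
    dropout=0.1,
)
```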
Training produces:
- `best.pt`: Best model checkpoint (if `--save-best` is used)
- `last.pt`: Last-epoch checkpoint
- `metrics.csv`: Per-epoch training metrics (if `--log-csv` is used)
Checkpoint format:
```python
{
    'model': model.state_dict(),
    'epoch': epoch_number,
    'val_acc': validation_accuracy,
    # ... other metadata
}
```
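Restoring a checkpoint for evaluation then looks like this (a sketch, with the model constructed as above):

```python
import torch

ckpt = torch.load("runs/signgcn/best.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
model.eval()
print(f"epoch {ckpt['epoch']}, val_acc {ckpt['val_acc']:.3f}")
```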
Project structure:

```
SignSTGCN/
├── train_landmarks.py              # Main training script
├── train_all_variations.py         # Train all variations on a small data subset
├── signGCN.py                      # Model entry point / example usage
├── model.py                        # SignSTGCNModel implementation
├── losses.py                       # Loss functions
├── build_video_splits.py           # Video split generation
├── visualize_signer_embeddings.py  # Embedding visualization
├── test_film_attention.py          # Test script for FiLM attention
├── datasets/
│   └── landmarks_npz.py            # NPZ landmark dataset
├── layers/
│   └── graph.py                    # ST-GCN backbone
├── modules/
│   ├── signer.py                   # Signer encoder
│   ├── attention.py                # Bias-aware attention
│   ├── temporal.py                 # Temporal aggregator
│   └── heads.py                    # Classification and signer heads
└── utils/
    └── graph.py                    # Graph utilities
```
If you use this code in your research, please cite:
```bibtex
@software{signstgcn,
  title={SignSTGCN: Sign Language Detection with Spatio-Temporal Graph Convolutional Networks},
  author={Cyrine Chaabani},
  year={2024}
}
```

This project is licensed under the Apache License, Version 2.0; see the LICENSE file for details.
This implementation builds upon ST-GCN architectures and incorporates techniques for reducing signer bias in sign language recognition models.