Distance and Feature library for phonetic analysis
A focused Python library for phonetic feature extraction and distance calculation, with comprehensive IPA support and glyph normalization.
- 🎯 Phonetic Features: Binary feature vectors for 1000+ IPA phonemes from CLTS BIPA
- 📏 Distance Metrics: Multiple methods including Hamming, Jaccard, Euclidean, Cosine, Manhattan, and K-means clustering
- 🔤 IPA Normalization: Comprehensive glyph normalization with NFD, diacritic ordering, and IPA canonicalization
- 🧪 Cognate Testing: Built-in alignment and optimization tools for historical linguistics
- ⚡ Fast & Lightweight: Minimal dependencies, optimized caching, pure Python with NumPy acceleration
- 🔧 Extensible: Register custom distance methods and feature systems
pip install distfeatFor development:
pip install distfeat[dev]from distfeat import phoneme_to_features, calculate_distance
# Get phonetic features
features_p = phoneme_to_features('p')
features_b = phoneme_to_features('b')
# Calculate distance between phonemes
distance = calculate_distance('p', 'b', method='hamming')
print(f"Distance between 'p' and 'b': {distance:.3f}")
# Output: Distance between 'p' and 'b': 0.053from distfeat import build_distance_matrix
# Build matrix for specific phonemes
phonemes = ['p', 'b', 't', 'd', 'k', 'g']
matrix, labels = build_distance_matrix(phonemes, method='hamming')
# Or build for all phonemes in the system
matrix, labels = build_distance_matrix(method='jaccard')from distfeat import normalize_ipa, normalize_glyph
# Normalize IPA text
text = "pʰæ̃n"
normalized = normalize_ipa(text)
print(normalized) # "pʰæ̃n" with consistent encoding
# Fine-grained control
normalized = normalize_glyph(
text,
nfd=True,
order_diacritics=True,
normalize_length=True,
preserve_tones=False
)from distfeat import save_distance_matrix, load_distance_matrix
# Save in different formats
save_distance_matrix(matrix, phonemes, 'distances.tsv', format='tsv')
save_distance_matrix(matrix, phonemes, 'distances.json', format='json')
# Load matrices
loaded_matrix, loaded_phonemes = load_distance_matrix('distances.tsv')- Hamming: Number of differing features (normalized)
- Jaccard: 1 - (intersection/union) of active features
- Euclidean: L2 distance in feature space
- Cosine: 1 - cosine similarity
- Manhattan: L1 distance (sum of absolute differences)
- K-means: Clustering-based distance using centroids
from distfeat import register_distance_method, calculate_distance
import numpy as np
# Define custom distance
def weighted_hamming(vec1, vec2):
weights = np.linspace(1, 2, len(vec1)) # Weight later features more
return np.sum(weights * (vec1 != vec2)) / np.sum(weights)
# Register and use
register_distance_method('weighted_hamming', weighted_hamming)
distance = calculate_distance('p', 'b', method='weighted_hamming')from distfeat import get_config, set_config
# Get current configuration
config = get_config()
# Modify settings
set_config('default_distance_method', 'jaccard')
set_config('on_error', 'raise') # 'raise', 'warn', or 'ignore'
set_config('kmeans_clusters', 15)# config.yaml
default_distance_method: hamming
default_normalize: true
default_precision: 4
cache_size: 2048
kmeans_clusters: 12
on_error: warnfrom distfeat import load_config
load_config('config.yaml')from distfeat import load_custom_features, phoneme_to_features
# Load custom feature system
load_custom_features(
'my_features.csv',
name='custom',
delimiter=',',
phoneme_col='IPA'
)
# Use custom system
features = phoneme_to_features('p', system='custom')The library includes comprehensive tests for:
- Unit Tests: Core functionality for features and distances
- Property Tests: Mathematical properties (symmetry, triangle inequality)
- Linguistic Tests: Voice distinctions, place of articulation, manner classes
- Integration Tests: Validation against cognate data and IPA charts
Run tests:
pytest tests/- CLTS BIPA: Phonetic features from the Cross-Linguistic Transcription Systems project
- Bundled Data: Complete feature system for 1000+ IPA phonemes included
- Test Data: Sample cognate sets for validation (not included in distribution)
phoneme_to_features(phoneme, system=None, on_error='warn'): Convert phoneme to featuresfeatures_to_phoneme(features, system=None, threshold=1.0): Find best matching phonemecalculate_distance(phoneme1, phoneme2, method='hamming', normalize=True): Calculate distancebuild_distance_matrix(phonemes=None, method='hamming'): Build distance matrix
normalize_ipa(text, canonicalize=True, decompose_affricates=False): Normalize IPA textnormalize_glyph(text, nfd=True, order_diacritics=True, ...): Fine-grained normalization
save_distance_matrix(matrix, phonemes, path, format='tsv'): Save matrixload_distance_matrix(path, format=None): Load matrixexport_matrix_tsv/csv/json(...): Format-specific exports
- Caching: LRU cache for distance calculations (configurable size)
- Vectorization: NumPy arrays for efficient computation
- Lazy Loading: Features loaded on first use
Contributions welcome! The library is designed to be extended with:
- New distance methods
- Additional feature systems
- Enhanced normalization rules
- Performance optimizations
MIT License - see LICENSE file for details.
If you use distfeat in your research, please cite:
@software{distfeat,
title = {distfeat: Distance and Feature library for phonetic analysis},
author = {UNIPA Development Team},
year = {2024},
url = {https://github.com/your-org/distfeat}
}