ASymCat: Asymmetric Categorical Association Analysis

ASymCat is a Python library for analyzing asymmetric associations between categorical variables. Traditional symmetric measures treat a relationship as bidirectional; ASymCat instead provides directional measures that show how well X predicts Y separately from how well Y predicts X. This makes it useful for studying dependencies, information flow, and candidate causal relationships in categorical data.

🚀 Key Features

  • 17+ Association Measures: From basic MLE to advanced information-theoretic measures
  • Directional Analysis: X→Y vs Y→X asymmetric relationship quantification
  • Robust Smoothing: FreqProb integration for numerical stability
  • Multiple Data Formats: Sequences, presence-absence matrices, n-grams
  • Scalable Architecture: Optimized for large datasets with efficient algorithms
  • Comprehensive Testing: 75+ tests with 78%+ coverage ensuring reliability and accuracy

🎯 Why Asymmetric Measures Matter

Traditional measures like Pearson's χ² or Cramér's V treat associations as symmetric: the relationship between X and Y is the same as between Y and X. However, many real-world relationships are inherently directional:

  • Linguistics: Phoneme transitions may be predictable in one direction but not the other
  • Ecology: Species presence may predict other species asymmetrically
  • Market Research: Product purchases may show directional dependencies
  • Medical Analysis: Symptoms may predict conditions more reliably than vice versa

ASymCat quantifies these directional relationships, revealing hidden patterns that symmetric measures miss.
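As a minimal illustration of what "directional" means here, the sketch below computes the two conditional probabilities P(Y|X) and P(X|Y) from toy co-occurrence pairs using only the standard library. The data and the helper names (`p_y_given_x`, `p_x_given_y`) are invented for this example and are not part of ASymCat.

```python
from collections import Counter

# Toy (x, y) co-occurrence pairs; the values are invented for illustration.
pairs = [("q", "u"), ("q", "u"), ("q", "u"), ("g", "u"), ("b", "u"), ("b", "a")]

pair_counts = Counter(pairs)
x_counts = Counter(x for x, _ in pairs)
y_counts = Counter(y for _, y in pairs)


def p_y_given_x(x, y):
    """Relative frequency of y among all pairs whose first element is x."""
    return pair_counts[(x, y)] / x_counts[x]


def p_x_given_y(x, y):
    """Relative frequency of x among all pairs whose second element is y."""
    return pair_counts[(x, y)] / y_counts[y]


print(p_y_given_x("q", "u"))  # 1.0: "q" is always followed by "u"
print(p_x_given_y("q", "u"))  # 0.6: "u" is preceded by "q" in 3 of 5 cases
```

A symmetric measure would assign this pair a single value, while the two conditional probabilities show that "q" predicts "u" better than "u" predicts "q".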

📊 Quick Example

import asymcat

# Load your categorical data
data = asymcat.read_sequences("data.tsv")  # or read_pa_matrix() for binary data

# Collect co-occurrences  
cooccs = asymcat.collect_cooccs(data)

# Create scorer and analyze
scorer = asymcat.CatScorer(cooccs)

# Get asymmetric measures
mle_scores = scorer.mle()           # Maximum likelihood estimation
pmi_scores = scorer.pmi()           # Pointwise mutual information  
chi2_scores = scorer.chi2()         # Chi-square with smoothing
fisher_scores = scorer.fisher()     # Fisher exact test

# Each returns {(x, y): (x→y_score, y→x_score)}
print(f"A→B: {mle_scores[('A', 'B')][0]:.3f}")
print(f"B→A: {mle_scores[('A', 'B')][1]:.3f}")
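Given the `{(x, y): (x→y_score, y→x_score)}` shape described above, one possible way to summarize results is to rank pairs by how far apart the two directions are. The sketch below works on a hand-written dictionary in that shape; the scores and the `rank_by_asymmetry` helper are hypothetical and do not come from ASymCat.

```python
# Hypothetical scores in the documented {(x, y): (x_to_y, y_to_x)} shape.
scores = {
    ("A", "B"): (0.90, 0.30),
    ("A", "C"): (0.50, 0.55),
    ("B", "C"): (0.10, 0.80),
}


def rank_by_asymmetry(scores):
    """Return (predictor, predicted, gap) rows, largest directional gap first."""
    rows = []
    for (x, y), (xy_score, yx_score) in scores.items():
        if xy_score >= yx_score:
            rows.append((x, y, xy_score - yx_score))
        else:
            rows.append((y, x, yx_score - xy_score))
    return sorted(rows, key=lambda row: row[2], reverse=True)


for predictor, predicted, gap in rank_by_asymmetry(scores):
    print(f"{predictor}→{predicted}: gap {gap:.2f}")
```

Whether a raw difference is a meaningful gap depends on the measure: it is reasonable for probabilities in [0, 1], but less so for unbounded statistics.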

🛠️ Installation

From PyPI (Recommended)

pip install asymcat

From Source

git clone https://github.com/tresoldi/asymcat.git
cd asymcat
pip install -e ".[dev]"  # Editable install with the [dev] extras

Dependencies

  • Core: numpy, pandas, scipy, matplotlib, seaborn, tabulate, freqprob
  • Development: pytest, ruff, mypy, jupyter
  • Optional: plotly, bokeh, altair (for enhanced visualization)

📚 Documentation & Resources

ASymCat provides comprehensive documentation organized for different needs:

Core Documentation

Document           | Purpose                                         | Audience
User Guide         | Conceptual foundations, theory, best practices  | Everyone - start here
API Reference      | Complete technical API documentation            | Developers
LLM Documentation  | Quick integration and code patterns             | AI agents, rapid development

Progressive Interactive Tutorials

Learn ASymCat through hands-on Nhandu tutorials with executable code and visualizations:

📘 Tutorial 1: Basics

Foundation - Get started with asymmetric analysis 📄 Python source | 🌐 View HTML

  • What are asymmetric associations and why they matter
  • Basic workflow: load → collect → score
  • Simple measures (MLE, PMI, Jaccard)
  • Working with sequences and presence-absence data

📗 Tutorial 2: Advanced Measures

Depth - Master all 17+ association measures 📄 Python source | 🌐 View HTML

  • Information-theoretic measures (PMI, NPMI, Theil's U)
  • Statistical measures (Chi-square, Cramér's V, Fisher)
  • Smoothing methods and their effects
  • Measure selection decision tree

📙 Tutorial 3: Visualization

Communication - Create publication-quality figures 📄 Python source | 🌐 View HTML

  • Heatmap visualizations of association matrices
  • Score distribution and asymmetry plots
  • Matrix transformations (scaling, inversion)
  • Multi-measure comparison panels

📕 Tutorial 4: Real-World Applications

Application - Complete analysis workflows 📄 Python source | 🌐 View HTML

  • Linguistics: Grapheme-phoneme correspondence analysis
  • Ecology: Galápagos finch species co-occurrence patterns
  • Machine Learning: Feature selection with asymmetric measures
  • Interpretation best practices and reporting strategies

💡 All tutorials are fully executed with committed outputs - view the HTML files online via the links above, or run the Python source files locally to explore and modify. Generate fresh documentation with make docs.

🎮 Usage

Python API

Basic Analysis

import asymcat

# Load data (TSV format: tab-separated sequences)
data = asymcat.read_sequences("linguistic_data.tsv")
cooccs = asymcat.collect_cooccs(data)

# Create scorer with smoothing
scorer = asymcat.CatScorer(cooccs, smoothing_method="laplace", smoothing_alpha=1.0)

# Compute multiple measures
results = {
    'mle': scorer.mle(),
    'pmi': scorer.pmi(),
    'chi2': scorer.chi2(),
    'fisher': scorer.fisher(),
    'theil_u': scorer.theil_u(),
}

# Analyze directional relationships
for measure, scores in results.items():
    for (x, y), (xy_score, yx_score) in scores.items():
        if xy_score > yx_score:
            print(f"{measure}: {x}→{y} stronger than {y}→{x}")

Advanced Features

# N-gram analysis
ngram_cooccs = asymcat.collect_cooccs(data, order=2, pad="#")
ngram_scorer = asymcat.CatScorer(ngram_cooccs)

# Matrix generation for visualization
pmi_scores = ngram_scorer.pmi()
xy_matrix, yx_matrix, x_labels, y_labels = asymcat.scorer.scorer2matrices(pmi_scores)

# Score transformations (applied to the PMI scores computed above)
scaled_scores = asymcat.scorer.scale_scorer(pmi_scores, method="minmax")
inverted_scores = asymcat.scorer.invert_scorer(scaled_scores)
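To make the two transformations concrete, here is a standalone sketch of min-max scaling and inversion on a score dictionary. It assumes that scaling uses the global minimum and maximum across both directions and that inversion means subtracting each value from the global maximum; neither assumption is confirmed by this README, and `minmax_scale` and `invert` are illustrative names, not ASymCat functions.

```python
def minmax_scale(scores):
    """Rescale all directional scores to [0, 1] using the global min and max."""
    values = [v for pair in scores.values() for v in pair]
    low, high = min(values), max(values)
    span = high - low
    return {
        key: tuple((v - low) / span if span else 0.0 for v in pair)
        for key, pair in scores.items()
    }


def invert(scores):
    """Flip the ordering by subtracting every score from the global maximum."""
    high = max(v for pair in scores.values() for v in pair)
    return {key: tuple(high - v for v in pair) for key, pair in scores.items()}


# Hypothetical raw scores in the {(x, y): (x_to_y, y_to_x)} shape.
raw = {("a", "b"): (2.0, 4.0), ("a", "c"): (0.0, 1.0)}
scaled = minmax_scale(raw)
inverted = invert(scaled)
print(scaled)
print(inverted)
```

Inversion of this kind is handy when a plotting routine expects distances (small means close) rather than association strengths (large means strong).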

📈 Association Measures

ASymCat implements 17+ association measures organized by type:

Probabilistic Measures

  • MLE: Maximum Likelihood Estimation - P(X|Y) and P(Y|X)
  • Jaccard Index: Set overlap with asymmetric interpretation

Information-Theoretic Measures

  • PMI: Pointwise Mutual Information, log[P(X,Y) / (P(X)·P(Y))]
  • PMI Smoothed: Numerically stable PMI with FreqProb smoothing
  • NPMI: Normalized PMI, bounded to the range [-1, 1]
  • Mutual Information: Average information shared
  • Conditional Entropy: Information remaining after observing condition
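The textbook definitions behind three of these measures can be sketched from raw counts as follows. This uses unsmoothed relative frequencies and natural logarithms; it is an illustration of the formulas, not ASymCat's implementation, and the function names are chosen for this example only.

```python
import math
from collections import Counter

# Toy (x, y) pairs, invented for illustration.
pairs = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
n = len(pairs)
joint = Counter(pairs)
x_counts = Counter(x for x, _ in pairs)
y_counts = Counter(y for _, y in pairs)


def pmi(x, y):
    """log[P(x, y) / (P(x) * P(y))], in nats."""
    p_xy = joint[(x, y)] / n
    return math.log(p_xy / ((x_counts[x] / n) * (y_counts[y] / n)))


def npmi(x, y):
    """PMI divided by -log P(x, y), which bounds it to [-1, 1]."""
    return pmi(x, y) / -math.log(joint[(x, y)] / n)


def conditional_entropy(condition_on_x=True):
    """H(Y|X) if condition_on_x is True, otherwise H(X|Y), in nats."""
    h = 0.0
    for (x, y), count in joint.items():
        given_count = x_counts[x] if condition_on_x else y_counts[y]
        h -= (count / n) * math.log(count / given_count)
    return h


print(pmi("b", "y"), npmi("b", "y"))
print(conditional_entropy(True), conditional_entropy(False))
```

Note that PMI as defined here is symmetric in x and y, while conditional entropy is not: on this toy data H(Y|X) and H(X|Y) differ.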

Statistical Measures

  • Chi-Square: Pearson's χ² with optional smoothing
  • Cramér's V: Normalized chi-square association
  • Fisher Exact: Exact odds ratios for small samples
  • Log-Likelihood Ratio: G² statistic
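For reference, the standard formulas for Pearson's χ², the G² log-likelihood ratio, and Cramér's V on a contingency table can be written in a few lines of plain Python. How ASymCat builds its tables and derives directional values from them is not described here, so treat this as a sketch of the underlying statistics only; `chi2_and_g2` and `cramers_v` are illustrative names.

```python
import math


def chi2_and_g2(table):
    """Return (chi-square, G-squared) for a table given as a list of rows.

    Assumes every row and column has a nonzero total, so that all
    expected counts are positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes nothing to G-squared
                g2 += 2 * observed * math.log(observed / expected)
    return chi2, g2


def cramers_v(table):
    """Chi-square normalized to [0, 1] by sample size and table shape."""
    chi2, _ = chi2_and_g2(table)
    total = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (total * k))


# Rows: x present / absent; columns: y present / absent (toy counts).
perfect = [[10, 0], [0, 10]]
independent = [[5, 5], [5, 5]]
print(chi2_and_g2(perfect), cramers_v(perfect))
print(chi2_and_g2(independent), cramers_v(independent))
```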

Specialized Measures

  • Theil's U: Uncertainty coefficient (entropy-based)
  • Tresoldi: Custom measure designed for sequence alignment
  • Goodman-Kruskal λ: Proportional reduction in error
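Theil's U and Goodman-Kruskal λ are both asymmetric by definition, which the following sketch demonstrates on toy data where x fully determines y but not the reverse. It follows the textbook formulas, U(T|G) = (H(T) − H(T|G)) / H(T) and λ(T|G) = (Σ_g max_t n_gt − max_t n_t) / (N − max_t n_t); the function names and the `target` parameter are assumptions of this example, not ASymCat's API.

```python
import math
from collections import Counter


def theil_u(pairs, target=1):
    """Share of the target's entropy explained by the other variable.

    target=1 predicts the second element from the first; target=0 the reverse.
    """
    given = 1 - target
    n = len(pairs)
    joint = Counter(pairs)
    target_counts = Counter(p[target] for p in pairs)
    given_counts = Counter(p[given] for p in pairs)
    h_target = -sum(c / n * math.log(c / n) for c in target_counts.values())
    h_cond = -sum(
        c / n * math.log(c / given_counts[p[given]]) for p, c in joint.items()
    )
    return (h_target - h_cond) / h_target if h_target > 0 else 1.0


def gk_lambda(pairs, target=1):
    """Proportional reduction in prediction error for the target variable."""
    given = 1 - target
    n = len(pairs)
    target_counts = Counter(p[target] for p in pairs)
    best_by_given = {}
    for p, c in Counter(pairs).items():
        best_by_given[p[given]] = max(best_by_given.get(p[given], 0), c)
    baseline = max(target_counts.values())
    if baseline == n:  # the target never varies, so there is no error to reduce
        return 0.0
    return (sum(best_by_given.values()) - baseline) / (n - baseline)


# x determines y exactly, but each y value is shared by two x values.
pairs = [("a", "1"), ("b", "1"), ("c", "2"), ("d", "2")]
print(theil_u(pairs, target=1), theil_u(pairs, target=0))
print(gk_lambda(pairs, target=1), gk_lambda(pairs, target=0))
```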

🔬 Scientific Applications

Linguistics & Language Evolution

# Analyze phoneme transitions
phoneme_data = asymcat.read_sequences("phoneme_alignments.tsv")
cooccs = asymcat.collect_cooccs(phoneme_data)
scorer = asymcat.CatScorer(cooccs)

# Asymmetric sound change analysis
tresoldi_scores = scorer.tresoldi()  # Optimized for linguistic alignment

Ecology & Species Analysis

# Species co-occurrence from presence-absence data
species_data = asymcat.read_pa_matrix("galapagos_species.tsv")
scorer = asymcat.CatScorer(species_data)

# Ecological associations
fisher_scores = scorer.fisher()  # Exact tests for species relationships

Market Research & Business Analytics

# Product purchase associations
purchase_data = asymcat.read_sequences("customer_transactions.tsv")
cooccs = asymcat.collect_cooccs(purchase_data)
scorer = asymcat.CatScorer(cooccs, smoothing_method="lidstone", smoothing_alpha=0.5)

# Market basket analysis
chi2_scores = scorer.chi2()  # Statistical significance testing
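The effect of the `smoothing_alpha` value can be seen with the usual Lidstone (additive) estimator, (count + α) / (total + α·V), where V is the number of distinct categories; α = 1 gives Laplace smoothing. The sketch below is a plain-Python illustration of that formula with invented counts. How FreqProb and CatScorer apply smoothing internally is not shown in this README, and `lidstone` is a hypothetical helper.

```python
def lidstone(count, total, vocab_size, alpha=0.5):
    """Additive smoothing: unseen categories get a small nonzero probability."""
    return (count + alpha) / (total + alpha * vocab_size)


# Hypothetical purchase counts following some product; "eggs" was never observed.
counts = {"bread": 3, "milk": 1, "eggs": 0}
total = sum(counts.values())
vocab_size = len(counts)

for alpha in (0.5, 1.0):
    probs = {k: lidstone(c, total, vocab_size, alpha) for k, c in counts.items()}
    print(alpha, {k: round(p, 3) for k, p in probs.items()})
```

Larger α values pull the estimates toward a uniform distribution, which stabilizes measures that take logarithms or ratios of small counts at the cost of flattening real differences.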

🎯 Data Formats

Sequence Data (TSV)

# linguistic_data.tsv
sound_from	sound_to
p a t a	B A T A
k a t a	G A T A
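A file in this layout can be parsed with the standard library alone, as sketched below. The sketch assumes a header row, tab-separated columns, and space-separated symbols, as in the sample. The README does not state how symbols from the two columns are paired, so both a position-wise pairing and a full cross pairing are shown; `parse_sequences` is an illustrative helper, not `asymcat.read_sequences`.

```python
import csv
import io
from itertools import product

# The sample file above as a string; "\t" marks the column separator.
tsv_text = "sound_from\tsound_to\np a t a\tB A T A\nk a t a\tG A T A\n"


def parse_sequences(text, has_header=True):
    """Return a list of (left_symbols, right_symbols) tuples."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if has_header:
        rows = rows[1:]
    return [(left.split(), right.split()) for left, right in rows]


seqs = parse_sequences(tsv_text)

# Option 1: pair symbols by position (requires equal-length sequences).
aligned_pairs = [pair for left, right in seqs for pair in zip(left, right)]

# Option 2: pair every left symbol with every right symbol in the same row.
cross_pairs = [pair for left, right in seqs for pair in product(left, right)]

print(aligned_pairs[:4])
print(len(aligned_pairs), len(cross_pairs))
```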

Presence-Absence Matrix (TSV)

# species_data.tsv
site	species_A	species_B	species_C
island_1	1	0	1
island_2	1	1	0
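The following sketch reads a matrix in this layout and lists which species are present at each site, then derives ordered pairs of co-present species. It assumes the first column is named `site` and that presence is coded as the string `1`, as in the sample; the pairing rule (all ordered pairs of distinct species at the same site) is an assumption of this example and may differ from what `read_pa_matrix` returns.

```python
import csv
import io
from itertools import permutations

# The sample matrix above as a string; "\t" marks the column separator.
pa_text = (
    "site\tspecies_A\tspecies_B\tspecies_C\n"
    "island_1\t1\t0\t1\n"
    "island_2\t1\t1\t0\n"
)


def present_by_site(text, site_column="site"):
    """Map each site to the list of species flagged as present there."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    result = {}
    for row in reader:
        site = row.pop(site_column)
        result[site] = [name for name, flag in row.items() if flag == "1"]
    return result


sites = present_by_site(pa_text)

# Ordered pairs of distinct species that occur at the same site.
co_present = [pair for taxa in sites.values() for pair in permutations(taxa, 2)]

print(sites)
print(co_present)
```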

N-gram Support

# Automatic n-gram extraction
bigrams = asymcat.collect_cooccs(data, order=2, pad="#")
trigrams = asymcat.collect_cooccs(data, order=3, pad="#")
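As a rough picture of what `order` and `pad` control, the sketch below extracts padded n-grams from one sequence and pairs them by position with the n-grams of an aligned sequence. The padding convention used here (`order - 1` pad symbols on each side) is an assumption; ASymCat's own windowing may differ, and `padded_ngrams` is a hypothetical helper.

```python
def padded_ngrams(seq, order, pad="#"):
    """Return all n-grams of the sequence after padding both ends."""
    padded = [pad] * (order - 1) + list(seq) + [pad] * (order - 1)
    return [tuple(padded[i:i + order]) for i in range(len(padded) - order + 1)]


left = "p a t a".split()
right = "B A T A".split()

left_bigrams = padded_ngrams(left, order=2)
right_bigrams = padded_ngrams(right, order=2)

# Position-wise pairing of n-grams from two equal-length, aligned sequences.
bigram_pairs = list(zip(left_bigrams, right_bigrams))

print(left_bigrams)
print(bigram_pairs[0])
```

Padding lets the first and last symbols appear in as many windows as interior symbols, so word boundaries become a context of their own.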

🔧 Development

Setup Development Environment

git clone https://github.com/tresoldi/asymcat.git
cd asymcat

# Install development dependencies
make install-dev  # Creates venv and installs with [dev] extras

Common Development Commands

# Code quality checks (runs all: format-check + lint + typecheck)
make quality

# Auto-format code
make format

# Auto-fix linting issues and format
make ruff-fix

# Type checking
make mypy

# Run tests with coverage report
make test-cov

# Run tests in parallel (faster)
make test-fast

# Generate HTML documentation from tutorials
make docs

# Clean generated documentation
make docs-clean

Testing

# Full test suite (75+ tests)
pytest

# Specific categories
pytest tests/unit/           # Unit tests only
pytest tests/integration/    # Integration tests only
pytest -m slow              # Performance tests
pytest -m "not slow"        # Skip slow tests

# Coverage with threshold enforcement (78%)
make test-cov

Release Process

Version Bumping:

# Bump patch version (0.4.0 → 0.4.1)
make bump-version TYPE=patch

# Bump minor version (0.4.0 → 0.5.0)
make bump-version TYPE=minor

# Bump major version (0.4.0 → 1.0.0)
make bump-version TYPE=major

The bump-version target will:

  1. Update version in asymcat/__init__.py and pyproject.toml
  2. Prompt you to update CHANGELOG.md
  3. Create a git commit with the version bump
  4. Create a git tag (e.g., v0.4.1)
  5. Display next steps for pushing changes
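The version arithmetic behind the three TYPE values follows the usual semantic-versioning rule: bumping a component resets every component to its right. The helper below is a hypothetical illustration of that rule, not the Makefile's actual implementation.

```python
def bump(version, kind):
    """Return the next MAJOR.MINOR.PATCH version for the given bump kind."""
    major, minor, patch = (int(part) for part in version.split("."))
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    if kind == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump kind: {kind!r}")


for kind in ("patch", "minor", "major"):
    print(kind, bump("0.4.0", kind))
```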

Full Release Build:

# Clean → Quality checks → Tests → Build distribution
make build-release

Code Quality Standards

All code must pass:

  • Ruff formatting: ruff format --check asymcat/ tests/
  • Ruff linting: ruff check asymcat/ tests/
  • MyPy type checking: mypy asymcat/ tests/
  • Test coverage: Minimum 78% coverage (goal: 80%)

Run all checks before committing:

make quality && make test-cov

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for:

  • Setting up the development environment
  • Code style guidelines and testing requirements
  • Submitting bug reports and feature requests
  • Contributing new association measures or improvements

Quick Start for Contributors

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make changes and add tests
  4. Run quality checks: make quality && make test-cov
  5. Submit a pull request

📖 Citation

If you use ASymCat in your research, please cite:

@software{tresoldi_asymcat_2024,
  title = {ASymCat: Asymmetric Categorical Association Analysis},
  author = {Tresoldi, Tiago},
  year = {2024},
  url = {https://github.com/tresoldi/asymcat},
  version = {0.3.0}
}

🏆 Acknowledgments

  • FreqProb Library: Robust probability estimation and smoothing
  • SciPy Community: Statistical foundations
  • Linguistic Community: Inspiration from historical linguistics applications

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🚀 What's New in v0.4.0

  • Simplified Dependencies: Consolidated to [viz] and [dev] groups - easier installation
  • Modern Tooling: Unified linting/formatting with Ruff, replacing black/isort/flake8
  • Enhanced CI/CD: Simplified quality workflow with faster feedback
  • Coverage Enforcement: 78% minimum threshold (goal: 80%)
  • Keep a Changelog: Semantic versioning with full version history
  • Developer-Friendly Makefile: Self-documenting help, automated version bumping
  • Library-Only Focus: Removed CLI tool for better coverage and maintainability

Migration from v0.3.1:

  • Use pip install asymcat[dev] instead of multiple dependency groups
  • Use library API directly instead of CLI tool (see examples above)
  • See CHANGELOG.md for detailed migration guide

🔮 Roadmap

  • Statistical Significance: P-value calculations for all measures
  • Confidence Intervals: Uncertainty quantification
  • GPU Acceleration: CUDA support for massive datasets
  • Interactive Dashboards: Web-based exploration tools
  • Extended Measures: Additional domain-specific association metrics
  • Nhandu Documentation: Migration to modern documentation system

⭐ Star us on GitHub if you find ASymCat useful!
