Lynx Vector Database

A fast and light weight vector database implemented in modern C++20 with support for Approximate Nearest Neighbor (ANN) search algorithms. Suitable for medium size projects on embedded systems.

Features

Unified Database Architecture:
- Single VectorDatabase class works with all index types
- Clean, simple API via IVectorDatabase::create(config)
- Consistent threading model using std::shared_mutex
Multiple Index Types (all through unified VectorDatabase):
- Flat: Brute-force exact search - best for <1K vectors, 100% recall
- HNSW: Hierarchical Navigable Small World graphs - O(log N), 95-99% recall
- IVF: Inverted File Index - memory-efficient, fast construction, 85-98% recall
Thread-Safe Operations:
- Concurrent reads: Multiple threads can search simultaneously
- Exclusive writes: Inserts/removes are properly serialized
Modern C++20: Concepts, spans, smart pointers, RAII
Flexible Distance Metrics: L2 (Euclidean), Cosine, Dot Product
Persistence: Save and load database to/from disk

Requirements

C++20 compatible compiler (GCC 11+, Clang 14+)
GNU Make or CMake 3.20+
Git (for automatic dependency cloning)

Dependencies (Auto-managed)

The following dependencies are automatically handled by the build scripts:

Google Test v1.15.2 (github.com/google/googletest)
- Automatically cloned to external/googletest or fetched by CMake

Quick Start

Clone and Build

git clone https://github.com/Alexk-195/lynx_vector_db.git
cd lynx_vector_db

# Build with setup.sh (recommended - handles all dependencies automatically)
./setup.sh

# OR build with CMake (dependencies auto-detected or fetched)
mkdir build && cd build
cmake ..
cmake --build .

Build Options - setup.sh (Makefile)

./setup.sh          # Build release version
./setup.sh debug    # Build debug version
./setup.sh clean    # Clean build artifacts
./setup.sh rebuild  # Clean and rebuild
./setup.sh test     # Run unit tests
./setup.sh install  # Install to system (requires sudo)

# Quality Assurance and Validation (Ticket #2072)
./setup.sh coverage    # Generate code coverage report
./setup.sh tsan        # Run ThreadSanitizer (data race detection)
./setup.sh asan        # Run AddressSanitizer (memory error detection)
./setup.sh ubsan       # Run UndefinedBehaviorSanitizer + AddressSanitizer
./setup.sh valgrind    # Run Valgrind memory leak check (requires valgrind)
./setup.sh clang-tidy  # Run clang-tidy static analysis (requires clang-tidy)
./setup.sh benchmark   # Run performance benchmarks only

Build Options - Make

make              # Build release
make debug        # Build debug
make clean        # Clean build
make run          # Build and run
make info         # Show build configuration

# Note: To run tests, use ./setup.sh test or CMake (see Testing section below)

Build Options - CMake

# Configure (dependencies auto-detected)
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release

# Build
cmake --build . --parallel

# Run tests (see Testing section below for more options)
ctest                    # Run all tests
make test                # Alternative way to run all tests
make check               # Run tests with verbose output

# Install
sudo cmake --install .

Usage

For examples see src/main_minimal.cpp and src/main.cpp and src/compare_indices.cpp

Testing

Running Tests

Quick Start

# Using setup.sh (recommended - automatically configures CMake)
./setup.sh test

# Or using CMake directly
cd build
ctest                 # Run all tests

All Tests

cd build

# Run all tests
ctest

# Run all tests with output
ctest --output-on-failure

# Run all tests with verbose output
ctest -V

# Run all tests quietly
ctest -Q

Running Specific Tests

cd build

# Run tests matching a pattern (regex)
ctest -R "ThreadingTest"                    # All threading tests
ctest -R "IVFDatabaseTest"                  # All IVF database tests
ctest -R "HNSW"                             # All HNSW-related tests

# Run specific test by exact name
ctest -R "^IVFDatabaseTest.BatchInsertAndSearch$"

# Run multiple patterns
ctest -R "Threading|IVF"                    # Threading OR IVF tests

# Exclude tests matching a pattern
ctest -E "Benchmark"                        # Exclude benchmark tests

# Combine include and exclude
ctest -R "Threading" -E "Stress"            # Threading tests except stress tests

Running Tests with Verbose Output

cd build

# Verbose output for specific tests
ctest -R "ThreadingTest" -V

# Show only failures
ctest -R "IVFDatabaseTest" --output-on-failure

# Rerun only failed tests
ctest --rerun-failed

# Stop on first failure
ctest --stop-on-failure

Running Tests Multiple Times

cd build

# Run tests multiple times to check for flakiness
ctest -R "StatisticsConsistency" --repeat until-pass:5

# Run a specific test 10 times
for i in {1..10}; do
    echo "=== Run $i ==="
    ctest -R "ConcurrentWrites" --output-on-failure || break
done

Test Categories

Category	Pattern	Description
Threading Tests	`ctest -R "ThreadingTest"`	Multi-threaded safety tests
IVF Tests	`ctest -R "IVFDatabaseTest\|IVFIndexTest"`	IVF index tests
HNSW Tests	`ctest -R "HNSWTest\|HNSWDatabaseTest"`	HNSW index tests
Flat Tests	`ctest -R "FlatIndexTest"`	Flat/brute-force index tests
Distance Tests	`ctest -R "DistanceMetricsTest"`	Distance metric tests
Persistence Tests	`ctest -R "Persistence"`	Save/load functionality tests
Benchmark Tests	`ctest -R "Benchmark"`	Performance benchmarks

Using Google Test Filters (Advanced)

cd build

# Run the test binary directly with Google Test filters
./bin/lynx_tests --gtest_filter="ThreadingTest.*"
./bin/lynx_tests --gtest_filter="*IVF*"
./bin/lynx_tests --gtest_list_tests                    # List all tests
./bin/lynx_tests --gtest_filter="ThreadingTest.*" --gtest_repeat=10

Choosing an Index Type

Index Type	Best For	Query Speed	Memory	Construction	Recall
Flat	Small datasets (<1K), exact search required	Slow (O(N))	Low	Instant	100%
HNSW	High-performance ANN, low latency critical	Fast (O(log N))	High	Slow	95-99%
IVF	Memory-constrained, fast construction needed	Medium	Low	Fast	85-98%

IVF Parameter Tuning

n_clusters (number of clusters):

Rule of thumb: sqrt(N) where N is the number of vectors
Example: For 10,000 vectors, use ~100 clusters
Larger values: Faster search but may require higher n_probe for good recall
Smaller values: Slower search but better recall with low n_probe

n_probe (clusters to search):

Controls the speed/recall tradeoff
Start with 10 and adjust based on your needs:
- n_probe=1: Fastest queries, ~60-70% recall
- n_probe=10: Balanced, ~90-95% recall
- n_probe=sqrt(n_clusters): High recall, ~98%+
Higher values = better recall but slower queries

Example tuning process:

// Start with default configuration
config.ivf_params.n_clusters = 100;  // For ~10K vectors
config.ivf_params.n_probe = 10;

// If recall is too low, increase n_probe
params.n_probe = 20;  // Better recall, slower

// If queries are too slow, decrease n_probe
params.n_probe = 5;   // Faster, lower recall

// For datasets with natural clusters, fewer clusters may work better
config.ivf_params.n_clusters = 50;  // Try fewer clusters

Code Quality and Validation Tools

Lynx provides comprehensive validation tools through setup.sh for ensuring code quality, thread safety, memory safety, and performance. These tools are essential for production-ready code and are part of the quality assurance process documented in Ticket #2072.

Available Validation Tools

Tool	Command	Purpose	When to Use
Code Coverage	`./setup.sh coverage`	Measure test coverage, identify untested code paths	Before releases, to ensure comprehensive testing
ThreadSanitizer	`./setup.sh tsan`	Detect data races and threading issues	After threading changes, before production
AddressSanitizer	`./setup.sh asan`	Detect memory errors (buffer overflows, use-after-free)	During development, before commits
UBSanitizer	`./setup.sh ubsan`	Detect undefined behavior + memory errors	For comprehensive safety checks
Valgrind	`./setup.sh valgrind`	Thorough memory leak detection	Final validation, investigating memory issues
clang-tidy	`./setup.sh clang-tidy`	Static code analysis, best practices	Code review, quality improvements
Benchmarks	`./setup.sh benchmark`	Performance testing	After optimizations, regression testing

Validation Tool Details

Code Coverage (./setup.sh coverage):

Generates HTML coverage report in build-coverage/coverage_report/
Target: >85% overall, >95% on critical paths
Results saved to tickets/2072_coverage_report.txt

ThreadSanitizer (./setup.sh tsan):

Detects data races, deadlocks, and thread safety issues
Essential for multi-threaded code validation
Results saved to tickets/2072_tsan_results.txt
⚠️ Note: Slower execution, ~2-5x overhead

AddressSanitizer (./setup.sh asan):

Finds buffer overflows, use-after-free, memory corruption
Fast execution, low overhead (~2x)
Results saved to tickets/2072_asan_results.txt
Recommended for regular development use

UndefinedBehaviorSanitizer (./setup.sh ubsan):

Combines AddressSanitizer + undefined behavior detection
Catches integer overflows, null pointer dereferences, etc.
Results saved to tickets/2072_ubsan_results.txt
Most comprehensive sanitizer check

Valgrind (./setup.sh valgrind):

Definitive memory leak detection
Slower execution (~10-50x overhead)
Results saved to tickets/2072_valgrind_results.txt
Requires: sudo apt-get install valgrind (Ubuntu/Debian)

clang-tidy (./setup.sh clang-tidy):

Static analysis for code quality
Enforces modern C++ best practices
Results saved to tickets/2072_clang_tidy_results.txt
Requires: sudo apt-get install clang-tidy (Ubuntu/Debian)

Benchmarks (./setup.sh benchmark):

Runs performance benchmark tests only
Measures search latency, memory usage, throughput
Results saved to tickets/2072_benchmark_results.txt
Uses Release build for accurate performance metrics

Recommended Validation Workflow

# 1. Quick validation (during development)
./setup.sh asan        # Check for memory errors

# 2. Comprehensive validation (before commits)
./setup.sh coverage    # Ensure good test coverage
./setup.sh tsan        # Check threading safety
./setup.sh asan        # Check memory safety

# 3. Release validation (before production)
./setup.sh coverage    # Final coverage check
./setup.sh tsan        # Thread safety validation
./setup.sh ubsan       # Comprehensive sanitizer check
./setup.sh benchmark   # Performance validation
./setup.sh clang-tidy  # Code quality review

# 4. Deep investigation (if issues found)
./setup.sh valgrind    # Thorough memory leak check

Quality Targets

Based on Ticket #2070 quality assurance standards:

✅ Code Coverage: >85% overall, >95% on critical paths
✅ ThreadSanitizer: Clean (no race conditions)
✅ AddressSanitizer: Clean (no memory errors)
✅ UBSanitizer: Clean (no undefined behavior)
✅ Valgrind: Zero memory leaks in application code
✅ clang-tidy: No critical warnings
✅ Benchmarks: No performance regression

Output Files

All validation tools save their results to the tickets/ directory:

tickets/2072_coverage_report.txt - Coverage analysis
tickets/2072_tsan_results.txt - ThreadSanitizer output
tickets/2072_asan_results.txt - AddressSanitizer output
tickets/2072_ubsan_results.txt - UBSanitizer output
tickets/2072_valgrind_results.txt - Valgrind output
tickets/2072_clang_tidy_results.txt - Static analysis
tickets/2072_benchmark_results.txt - Performance benchmarks

Project Structure

lynx_vector_db/
├── Makefile             # Build configuration
├── setup.sh             # Build script
├── CLAUDE.md            # AI assistant instructions
├── README.md            # This file
├── LICENSE              # MIT License
├── doc/
│   └── research.md      # ANN algorithm research and other files
├── tickets/
│   ├── README.md        # Ticketing system documentation
│   └── done/            # Completed tickets
├── src/
│   ├── include/
│   │   └── lynx/
│   │       └── lynx.h   # Public interface
│   ├── lib/
│   │   └── ...          # Implementation files
│   ├── main_minimal.cpp         # Example of minimal database executable
│   └── main.cpp         # Larger example of database usage
└── tests/
    └── ...              # Unit tests

Architecture

Lynx uses a clean, layered architecture with a unified database implementation:

Layers

API Layer: Pure virtual interface (IVectorDatabase) in src/include/lynx/lynx.h
- Provides factory method: IVectorDatabase::create(config)
- All operations (insert, search, remove, persistence)
- Thread-safe by design
Database Layer: Unified VectorDatabase implementation
- Single class that works with all index types
- Thread-safe using std::shared_mutex:
  - Concurrent reads (multiple searches simultaneously)
  - Exclusive writes (inserts/removes serialized)
- Manages vector storage and metadata
- Delegates search to the appropriate index
Index Layer: Pluggable index implementations
- FlatIndex - Brute-force exact search (O(N))
- HNSWIndex - Hierarchical graphs for fast ANN (O(log N))
- IVFIndex - Inverted file with k-means clustering

Architecture Diagram

┌─────────────────────────────────────────────────────────┐
│                  User Application                        │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│              IVectorDatabase (Interface)                 │
│  - create()     - insert()    - search()                 │
│  - remove()     - batch_insert()                         │
│  - get()        - save()/load()                          │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│           VectorDatabase (Unified Implementation)        │
│                                                           │
│  ┌─────────────────────────────────────────────────┐    │
│  │  Thread Safety: std::shared_mutex               │    │
│  │  - Concurrent reads (search, get, contains)     │    │
│  │  - Exclusive writes (insert, remove)            │    │
│  └─────────────────────────────────────────────────┘    │
│                                                           │
│  ┌─────────────────────────────────────────────────┐    │
│  │  Vector Storage                                  │    │
│  │  - std::unordered_map<id, VectorRecord>         │    │
│  │  - Metadata management                           │    │
│  └─────────────────────────────────────────────────┘    │
│                                                           │
│                        │                                  │
│                        ▼                                  │
│  ┌──────────────────────────────────────────────────┐   │
│  │         Index Selection (via Config)             │   │
│  │  - IndexType::Flat   → FlatIndex                 │   │
│  │  - IndexType::HNSW   → HNSWIndex                 │   │
│  │  - IndexType::IVF    → IVFIndex                  │   │
│  └──────────────────────────────────────────────────┘   │
└─────────────────────┬───────────────────────────────────┘
                      │
        ┌─────────────┼─────────────┐
        ▼             ▼             ▼
┌─────────────┐ ┌───────────┐ ┌──────────┐
│  FlatIndex  │ │ HNSWIndex │ │ IVFIndex │
├─────────────┤ ├───────────┤ ├──────────┤
│ Exact       │ │ Graph-    │ │ Cluster- │
│ search      │ │ based ANN │ │ based ANN│
│ O(N)        │ │ O(log N)  │ │ O(k)     │
└─────────────┘ └───────────┘ └──────────┘

Threading Model

All operations are thread-safe using std::shared_mutex:

// Multiple threads can search concurrently
std::thread t1([&]() { db->search(query1, k); });
std::thread t2([&]() { db->search(query2, k); });  // Concurrent with t1

// Writes are exclusive
std::thread t3([&]() { db->insert(record); });     // Blocks during write

t1.join(); t2.join(); t3.join();

Performance characteristics:

Low to medium concurrency: Excellent performance with std::shared_mutex
Embedded systems: Low overhead, simple threading model
High concurrency: Scales well for most use cases

Documentation

doc/MIGRATION_GUIDE.md - Migration guide for unified VectorDatabase
doc/research.md - ANN algorithm research
tests/README.md - Infos about unit testing
tickets/README.md - File-based ticketing system

Contributing

Read CLAUDE.md for coding guidelines
Review open tickets in tickets/ for tasks to work on
Follow the code style conventions
Add tests for new functionality

License

MIT License - see LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 318 Commits
.claude		.claude
.github/workflows		.github/workflows
benchmarks		benchmarks
cmake		cmake
doc		doc
src		src
tests		tests
tickets		tickets
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
context.py		context.py
setup.sh		setup.sh

License

Alexk-195/lynx_vector_db

Folders and files

Latest commit

History

Repository files navigation

Lynx Vector Database

Features

Requirements

Dependencies (Auto-managed)

Quick Start

Clone and Build

Build Options - setup.sh (Makefile)

Build Options - Make

Build Options - CMake

Usage

Testing

Running Tests

Quick Start

All Tests

Running Specific Tests

Running Tests with Verbose Output

Running Tests Multiple Times

Test Categories

Using Google Test Filters (Advanced)

Choosing an Index Type

IVF Parameter Tuning

Code Quality and Validation Tools

Available Validation Tools

Validation Tool Details

Recommended Validation Workflow

Quality Targets

Output Files

Project Structure

Architecture

Layers

Architecture Diagram

Threading Model

Documentation

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages