A high-performance, distributed vector database built from scratch in Rust. This project serves as an educational implementation of a modern vector search engine, demonstrating core concepts like durability, approximate nearest neighbor search, and distributed systems.
- 🛡️ Durability (WAL): Write-Ahead Log ensures data safety. Every operation is checksummed (CRC32) and persisted to disk before being applied.
- ⚡ Fast Search (HNSW): Hierarchical Navigable Small World graph index for sub-linear time Approximate Nearest Neighbor (ANN) search.
- 🌐 Distributed Architecture:
- Sharding: Consistent Hashing distributes vectors across multiple nodes.
- Replication: Configurable replication factor for high availability.
- Scatter-Gather: Queries are broadcast to shards and results are aggregated.
- 🔌 gRPC API: High-performance Protocol Buffers interface for all interactions.
The system consists of three main layers:
- Storage Layer (WAL):
- Handles sequential writes to disk.
- Provides crash recovery by replaying the log on startup.
- Index Layer (HNSW):
- In-memory graph structure.
- Supports
Euclidean,Cosine, andDotProductdistance metrics. - Implements neighbor pruning to maintain graph quality (
M,ef_construction).
- Network Layer (gRPC):
- Router: The entry point. Routes
Putrequests to the correct shard(s) and scattersSearchrequests. - Server: The storage node. Manages the WAL and HNSW index.
- Router: The entry point. Routes
src/
├── bin/
│ ├── server.rs # The Storage Node (gRPC Server)
│ ├── router.rs # The Gateway/Router (Sharding & Replication logic)
│ └── client.rs # CLI Tool for testing
├── index/
│ ├── hnsw.rs # Core HNSW Graph implementation
│ └── distance.rs # Distance metrics (Euclidean, etc.)
├── wal.rs # Write-Ahead Log implementation
└── lib.rs # Shared library code
- Rust (latest stable)
- Cargo
- Protobuf Compiler (
protoc) - Handled automatically by build script
cargo build --releasecargo testFollow these steps to spin up a local cluster with 3 Shards and 1 Router.
Open 3 separate terminals. Each server listens on a different port and maintains its own WAL file.
# Node 1
cargo run --bin server -- --port 50051
# Node 2
cargo run --bin server -- --port 50052
# Node 3
cargo run --bin server -- --port 50053Open a 4th terminal. The router is configured to discover the 3 nodes above.
# Router (Listens on 50050)
cargo run --bin routerOpen a 5th terminal to interact with the cluster.
The router will hash the ID and forward the data to the appropriate node(s).
# Insert ID 100
cargo run --bin client -- put --id 100 --vector 0.1,0.2,0.3
# Insert ID 101
cargo run --bin client -- put --id 101 --vector 0.9,0.8,0.7The router will query all shards and merge the top-k results.
# Find nearest neighbor to [0.1, 0.2, 0.3]
cargo run --bin client -- search --vector 0.1,0.2,0.3 --k 1The service is defined in proto/vector_db.proto.
Inserts a vector into the database.
- id: Unique identifier (uint32).
- vector: List of floats.
Finds the k nearest neighbors.
- vector: Query vector.
- k: Number of results to return.
- Write-Ahead Log (WAL): Append-only log with CRC32.
- HNSW Index: Graph-based ANN search.
- gRPC Network: Server and Client implementation.
- Sharding & Replication: Consistent Hash Ring and Router.
- Persistence: Snapshotting the HNSW graph to disk.
- Dynamic Membership: Gossip protocol for node discovery.
This project is licensed under the MIT License.