I had a discussion with Claude about what this could look like. Let me know whether this looks worth doing, and whether I should have a go at building it and opening a pull request.
RuVector Python Bindings Implementation Specification
Overview
This document provides a complete specification for implementing Python bindings for RuVector, a distributed vector database with graph query support. The bindings should be contributed to the main ruvector repository at https://github.com/ruvnet/ruvector.
Goals
- Create native Python bindings using PyO3 and Maturin
- Match API parity with existing Node.js bindings (@ruvector/core, @ruvector/graph-node, @ruvector/gnn)
- Publish to PyPI as ruvector
- Follow ruvector's existing code style and contribution patterns
Repository Context
Before starting, clone and examine the ruvector repository:
git clone https://github.com/ruvnet/ruvector.git
cd ruvector
Key directories to study:
- crates/ruvector-core/ - Core vector database engine
- crates/ruvector-graph/ - Graph database with Cypher support
- crates/ruvector-gnn/ - Graph Neural Network layers
- crates/ruvector-node/ - Node.js bindings (reference implementation)
- npm/packages/ - npm package structure (reference for Python package structure)
Project Structure
Create the following structure within the ruvector repository:
crates/
└── ruvector-python/
    ├── Cargo.toml
    ├── pyproject.toml
    ├── README.md
    ├── LICENSE                 # MIT (same as main repo)
    ├── src/
    │   ├── lib.rs              # Main module entry point
    │   ├── vector_db.rs        # VectorDB bindings
    │   ├── graph_db.rs         # GraphDB bindings
    │   ├── gnn.rs              # GNN layer bindings
    │   ├── compression.rs      # Tensor compression bindings
    │   ├── types.rs            # Shared type definitions
    │   └── errors.rs           # Error handling
    ├── python/
    │   └── ruvector/
    │       ├── __init__.py     # Re-export from native module
    │       ├── py.typed        # PEP 561 marker
    │       └── _types.pyi      # Type stubs
    ├── tests/
    │   ├── test_vector_db.py
    │   ├── test_graph_db.py
    │   ├── test_gnn.py
    │   └── conftest.py
    └── examples/
        ├── basic_vector_search.py
        ├── graph_queries.py
        ├── rag_pipeline.py
        └── knowledge_graph.py
Dependencies
Cargo.toml
[package]
name = "ruvector-python"
version = "0.1.0"
edition = "2021"
authors = ["RuVector Contributors"]
license = "MIT"
description = "Python bindings for RuVector - a distributed vector database that learns"
repository = "https://github.com/ruvnet/ruvector"
keywords = ["vector-database", "graph-database", "machine-learning", "python", "pyo3"]
categories = ["database", "science", "api-bindings"]
[lib]
name = "ruvector"
crate-type = ["cdylib"]
[dependencies]
pyo3 = { version = "0.22", features = ["extension-module", "abi3-py38"] }
ruvector-core = { path = "../ruvector-core" }
ruvector-graph = { path = "../ruvector-graph" }
ruvector-gnn = { path = "../ruvector-gnn" }
ruvector-collections = { path = "../ruvector-collections" }
ruvector-filter = { path = "../ruvector-filter" }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
thiserror = "1.0"
numpy = "0.22" # For efficient array handling
[build-dependencies]
pyo3-build-config = "0.22"
[features]
default = []
# Enable GNN features (optional, adds dependencies)
gnn = []
# Enable distributed features
distributed = ["ruvector-core/distributed"]
pyproject.toml
[build-system]
requires = ["maturin>=1.4,<2.0"]
build-backend = "maturin"
[project]
name = "ruvector"
version = "0.1.0"
description = "A distributed vector database that learns - Python bindings"
readme = "README.md"
license = { file = "LICENSE" }
requires-python = ">=3.8"
authors = [
{ name = "RuVector Contributors" }
]
classifiers = [
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Rust",
"Topic :: Database",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
]
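keywords = ["vector-database", "embeddings", "graph-database", "machine-learning", "rag"]
dependencies = [
"numpy>=1.20", # needed at runtime: the extension exchanges arrays through the Rust numpy crate
]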
[project.optional-dependencies]
dev = [
"pytest>=7.0",
"pytest-asyncio>=0.21",
"numpy>=1.20",
"sentence-transformers>=2.0", # For testing with real embeddings
]
[project.urls]
Homepage = "https://github.com/ruvnet/ruvector"
Documentation = "https://github.com/ruvnet/ruvector/tree/main/crates/ruvector-python"
Repository = "https://github.com/ruvnet/ruvector"
Issues = "https://github.com/ruvnet/ruvector/issues"
[tool.maturin]
features = ["pyo3/extension-module"]
python-source = "python"
module-name = "ruvector._ruvector"
strip = true
[tool.pytest.ini_options]
testpaths = ["tests"]
asyncio_mode = "auto"
Implementation Details
src/lib.rs
use pyo3::prelude::*;
mod vector_db;
mod graph_db;
mod gnn;
mod compression;
mod types;
mod errors;
use vector_db::PyVectorDB;
use graph_db::PyGraphDB;
use gnn::{PyGNNLayer, PyRuvectorLayer};
use compression::{compress, decompress};
use types::{PySearchResult, PyNode, PyEdge};
/// RuVector: A distributed vector database that learns
#[pymodule]
fn _ruvector(m: &Bound<'_, PyModule>) -> PyResult<()> {
// Core classes
m.add_class::<PyVectorDB>()?;
m.add_class::<PyGraphDB>()?;
// GNN classes
m.add_class::<PyGNNLayer>()?;
m.add_class::<PyRuvectorLayer>()?;
// Result types
m.add_class::<PySearchResult>()?;
m.add_class::<PyNode>()?;
m.add_class::<PyEdge>()?;
// Utility functions
m.add_function(wrap_pyfunction!(compress, m)?)?;
m.add_function(wrap_pyfunction!(decompress, m)?)?;
// Version info
m.add("__version__", env!("CARGO_PKG_VERSION"))?;
Ok(())
}
src/errors.rs
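The error type defined in this file is converted into built-in Python exceptions, so Python callers never see a RuVector-specific exception class. The sketch below restates the match arms of the `From<RuVectorError> for PyErr` conversion as a plain dictionary. The names `EXCEPTION_FOR_VARIANT`, `python_exception_for` and `dimension_mismatch_message` are made up for illustration and are not part of the bindings.

```python
# Variant names and exception classes are copied from the match arms in
# src/errors.rs; the dictionary and helpers themselves are illustrative only.
EXCEPTION_FOR_VARIANT = {
    "Database": RuntimeError,
    "DimensionMismatch": ValueError,
    "CypherError": ValueError,
    "SerializationError": ValueError,
    "NotFound": ValueError,
    "IoError": IOError,  # PyIOError; IOError is an alias of OSError on Python 3
}


def python_exception_for(variant: str) -> type:
    """Return the built-in exception class a RuVectorError variant maps to."""
    return EXCEPTION_FOR_VARIANT[variant]


def dimension_mismatch_message(expected: int, got: int) -> str:
    """Mirror the format! string used for the DimensionMismatch variant."""
    return f"Dimension mismatch: expected {expected}, got {got}"
```

Four of the six variants collapse into `ValueError`, so a caller cannot tell a bad Cypher query from a missing ID by exception type alone. Tests have to match on the message, as `test_dimension_mismatch` does with `match="Dimension mismatch"`.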
use pyo3::exceptions::{PyIOError, PyValueError, PyRuntimeError};
use pyo3::prelude::*;
use thiserror::Error;
#[derive(Error, Debug)]
pub enum RuVectorError {
#[error("Database error: {0}")]
Database(String),
#[error("Invalid dimension: expected {expected}, got {got}")]
DimensionMismatch { expected: usize, got: usize },
#[error("Cypher query error: {0}")]
CypherError(String),
#[error("Serialization error: {0}")]
SerializationError(String),
#[error("Not found: {0}")]
NotFound(String),
#[error("IO error: {0}")]
IoError(#[from] std::io::Error),
}
impl From<RuVectorError> for PyErr {
fn from(err: RuVectorError) -> PyErr {
match err {
RuVectorError::Database(msg) => PyRuntimeError::new_err(msg),
RuVectorError::DimensionMismatch { expected, got } => {
PyValueError::new_err(format!(
"Dimension mismatch: expected {}, got {}", expected, got
))
}
RuVectorError::CypherError(msg) => PyValueError::new_err(msg),
RuVectorError::SerializationError(msg) => PyValueError::new_err(msg),
RuVectorError::NotFound(msg) => PyValueError::new_err(msg),
RuVectorError::IoError(e) => PyIOError::new_err(e.to_string()),
}
}
}
pub type Result<T> = std::result::Result<T, RuVectorError>;
src/types.rs
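The result types in this file surface in Python as read-only attribute holders. A rough pure-Python equivalent of `SearchResult` makes two details of the Rust code easier to see: `__repr__` prints the distance to four decimals, and `to_dict` includes `metadata` only when it is set and never includes `vector`. The class name `SearchResultModel` is hypothetical; this is a behavioural sketch, not the binding itself.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class SearchResultModel:
    """Pure-Python stand-in for the SearchResult pyclass (illustrative name)."""

    id: str
    distance: float
    metadata: Optional[Dict[str, str]] = None
    vector: Optional[List[float]] = None

    def __repr__(self) -> str:
        # Same shape as the Rust format! string: distance to four decimals.
        return f"SearchResult(id='{self.id}', distance={self.distance:.4f})"

    def to_dict(self) -> Dict[str, Any]:
        # As in the Rust version, "metadata" appears only when it is set,
        # and "vector" is never included.
        out: Dict[str, Any] = {"id": self.id, "distance": self.distance}
        if self.metadata is not None:
            out["metadata"] = dict(self.metadata)
        return out
```

If callers are expected to get the vector back from `to_dict()` when `include_vectors=True`, the Rust method would need an extra branch for it.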
use pyo3::prelude::*;
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
#[pyclass]
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct PySearchResult {
#[pyo3(get)]
pub id: String,
#[pyo3(get)]
pub distance: f32,
#[pyo3(get)]
pub metadata: Option<HashMap<String, String>>,
#[pyo3(get)]
pub vector: Option<Vec<f32>>,
}
#[pymethods]
impl PySearchResult {
fn __repr__(&self) -> String {
format!("SearchResult(id='{}', distance={:.4})", self.id, self.distance)
}
fn to_dict(&self) -> HashMap<String, PyObject> {
Python::with_gil(|py| {
let mut dict = HashMap::new();
dict.insert("id".to_string(), self.id.clone().into_py(py));
dict.insert("distance".to_string(), self.distance.into_py(py));
if let Some(ref meta) = self.metadata {
dict.insert("metadata".to_string(), meta.clone().into_py(py));
}
dict
})
}
}
#[pyclass]
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct PyNode {
#[pyo3(get)]
pub id: String,
#[pyo3(get)]
pub labels: Vec<String>,
#[pyo3(get)]
pub properties: HashMap<String, String>,
}
#[pymethods]
impl PyNode {
fn __repr__(&self) -> String {
format!("Node(id='{}', labels={:?})", self.id, self.labels)
}
}
#[pyclass]
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct PyEdge {
#[pyo3(get)]
pub id: String,
#[pyo3(get)]
pub edge_type: String,
#[pyo3(get)]
pub source: String,
#[pyo3(get)]
pub target: String,
#[pyo3(get)]
pub properties: HashMap<String, String>,
}
#[pymethods]
impl PyEdge {
fn __repr__(&self) -> String {
format!("Edge({} -[{}]-> {})", self.source, self.edge_type, self.target)
}
}
src/vector_db.rs
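Two behaviours of `insert` in this file are easy to miss from the Python side: the length check against `dimensions`, and the way metadata is filtered down to string-to-string pairs (anything else is dropped without an error) before being stored as a JSON string. The sketch below models both in plain Python. `prepare_entry` is a hypothetical name, and the model covers only the validation step, not the call into the core engine.

```python
import json
from typing import Any, Dict, List, Optional, Sequence, Tuple


def prepare_entry(
    dimensions: int,
    vector: Sequence[float],
    metadata: Optional[Dict[Any, Any]] = None,
) -> Tuple[List[float], Optional[str]]:
    """Model the validation insert() performs before calling the core engine."""
    if len(vector) != dimensions:
        raise ValueError(
            f"Dimension mismatch: expected {dimensions}, got {len(vector)}"
        )
    stored = None
    if metadata is not None:
        # Mirrors the filter_map: keep only pairs whose key and value are both str.
        kept = {
            k: v
            for k, v in metadata.items()
            if isinstance(k, str) and isinstance(v, str)
        }
        stored = json.dumps(kept)
    return list(vector), stored
```

With this model, `{"source": "wiki", "page": 3}` is stored as `{"source": "wiki"}`. `insert_batch` behaves differently: its tuple extraction requires `Dict[str, str]` metadata, so a non-string value raises instead of being dropped.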
use pyo3::prelude::*;
use pyo3::types::{PyDict, PyList};
use numpy::{PyArray1, PyReadonlyArray1};
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use ruvector_core::{VectorDB, DbOptions, VectorEntry, SearchQuery};
use crate::errors::{RuVectorError, Result};
use crate::types::PySearchResult;
#[pyclass(name = "VectorDB")]
pub struct PyVectorDB {
inner: Arc<RwLock<VectorDB>>,
dimensions: usize,
}
#[pymethods]
impl PyVectorDB {
/// Create a new VectorDB instance
///
/// Args:
/// dimensions: The dimensionality of vectors to store
/// path: Optional path for persistent storage. If None, uses in-memory storage.
/// distance_metric: Distance metric to use ('cosine', 'euclidean', 'dot')
///
/// Example:
/// db = VectorDB(dimensions=384, path="./vectors.db")
#[new]
#[pyo3(signature = (dimensions, path=None, distance_metric="cosine"))]
fn new(dimensions: usize, path: Option<&str>, distance_metric: &str) -> PyResult<Self> {
let mut opts = DbOptions::default();
opts.dimensions = dimensions as u32;
opts.distance_metric = distance_metric.to_string();
if let Some(p) = path {
opts.storage_path = p.to_string();
}
let db = VectorDB::new(opts)
.map_err(|e| RuVectorError::Database(e.to_string()))?;
Ok(Self {
inner: Arc::new(RwLock::new(db)),
dimensions,
})
}
/// Insert a vector into the database
///
/// Args:
/// id: Unique identifier for the vector
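/// vector: The vector as a 1-D NumPy array of dtype float32
/// (convert lists with np.asarray(v, dtype=np.float32))
/// metadata: Optional metadata dictionary; only str-to-str pairs are kept
///
/// Example:
/// db.insert("doc1", np.array([0.1, 0.2, 0.3], dtype=np.float32), metadata={"source": "wiki"})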
#[pyo3(signature = (id, vector, metadata=None))]
fn insert(
&self,
id: &str,
vector: PyReadonlyArray1<f32>,
metadata: Option<&Bound<'_, PyDict>>,
) -> PyResult<()> {
let vec = vector.as_slice()?;
if vec.len() != self.dimensions {
return Err(RuVectorError::DimensionMismatch {
expected: self.dimensions,
got: vec.len(),
}.into());
}
let meta = metadata.map(|d| {
d.iter()
.filter_map(|(k, v)| {
let key = k.extract::<String>().ok()?;
let val = v.extract::<String>().ok()?;
Some((key, val))
})
.collect::<HashMap<_, _>>()
});
let entry = VectorEntry {
id: Some(id.to_string()),
vector: vec.to_vec(),
metadata: meta.map(|m| serde_json::to_string(&m).unwrap()),
};
let mut db = self.inner.write().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
db.insert(entry)
.map_err(|e| RuVectorError::Database(e.to_string()))?;
Ok(())
}
/// Insert multiple vectors in batch
///
/// Args:
/// entries: List of (id, vector, metadata) tuples
///
/// Example:
/// db.insert_batch([
/// ("doc1", [0.1, 0.2], {"type": "article"}),
/// ("doc2", [0.3, 0.4], {"type": "blog"}),
/// ])
fn insert_batch(&self, entries: &Bound<'_, PyList>) -> PyResult<usize> {
let mut db = self.inner.write().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
let mut count = 0;
for item in entries.iter() {
let tuple = item.extract::<(String, Vec<f32>, Option<HashMap<String, String>>)>()?;
let (id, vector, metadata) = tuple;
if vector.len() != self.dimensions {
return Err(RuVectorError::DimensionMismatch {
expected: self.dimensions,
got: vector.len(),
}.into());
}
let entry = VectorEntry {
id: Some(id),
vector,
metadata: metadata.map(|m| serde_json::to_string(&m).unwrap()),
};
db.insert(entry)
.map_err(|e| RuVectorError::Database(e.to_string()))?;
count += 1;
}
Ok(count)
}
/// Search for similar vectors
///
/// Args:
/// query: Query vector as a 1-D NumPy array of dtype float32
/// k: Number of results to return
/// filter: Optional metadata filter (accepted but currently ignored)
/// include_vectors: Whether to include vectors in results
///
/// Returns:
/// List of SearchResult objects
///
/// Example:
/// results = db.search(np.array([0.1, 0.2, 0.3], dtype=np.float32), k=10)
/// for r in results:
/// print(f"{r.id}: {r.distance}")
#[pyo3(signature = (query, k=10, filter=None, include_vectors=false))]
fn search(
&self,
query: PyReadonlyArray1<f32>,
k: usize,
filter: Option<&Bound<'_, PyDict>>,
include_vectors: bool,
) -> PyResult<Vec<PySearchResult>> {
let vec = query.as_slice()?;
if vec.len() != self.dimensions {
return Err(RuVectorError::DimensionMismatch {
expected: self.dimensions,
got: vec.len(),
}.into());
}
let search_query = SearchQuery {
vector: vec.to_vec(),
k,
filter: None, // TODO: Implement filter parsing
include_vectors,
};
let db = self.inner.read().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
let results = db.search(&search_query)
.map_err(|e| RuVectorError::Database(e.to_string()))?;
Ok(results.into_iter().map(|r| PySearchResult {
id: r.id,
distance: r.distance,
metadata: r.metadata.and_then(|m| serde_json::from_str(&m).ok()),
vector: if include_vectors { Some(r.vector) } else { None },
}).collect())
}
/// Get a vector by ID
///
/// Args:
/// id: The vector ID to retrieve
///
/// Returns:
/// SearchResult or None if not found
fn get(&self, id: &str) -> PyResult<Option<PySearchResult>> {
let db = self.inner.read().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
match db.get(id) {
Ok(Some(entry)) => Ok(Some(PySearchResult {
id: entry.id.unwrap_or_default(),
distance: 0.0,
metadata: entry.metadata.and_then(|m| serde_json::from_str(&m).ok()),
vector: Some(entry.vector),
})),
Ok(None) => Ok(None),
Err(e) => Err(RuVectorError::Database(e.to_string()).into()),
}
}
/// Delete a vector by ID
///
/// Args:
/// id: The vector ID to delete
///
/// Returns:
/// True if deleted, False if not found
fn delete(&self, id: &str) -> PyResult<bool> {
let mut db = self.inner.write().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
db.delete(id)
.map_err(|e| RuVectorError::Database(e.to_string()).into())
}
/// Get the number of vectors in the database
fn __len__(&self) -> PyResult<usize> {
let db = self.inner.read().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
db.len().map_err(|e| RuVectorError::Database(e.to_string()).into())
}
/// Get database statistics
fn stats(&self) -> PyResult<HashMap<String, PyObject>> {
let db = self.inner.read().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
Python::with_gil(|py| {
let mut stats = HashMap::new();
stats.insert("dimensions".to_string(), self.dimensions.into_py(py));
stats.insert("count".to_string(), db.len().unwrap_or(0).into_py(py));
Ok(stats)
})
}
/// Sync data to disk (for persistent storage)
fn sync(&self) -> PyResult<()> {
let db = self.inner.write().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
db.sync().map_err(|e| RuVectorError::Database(e.to_string()).into())
}
fn __repr__(&self) -> String {
let count = self.inner.read()
.map(|db| db.len().unwrap_or(0))
.unwrap_or(0);
format!("VectorDB(dimensions={}, count={})", self.dimensions, count)
}
}
src/graph_db.rs
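Unlike `VectorDB.insert`, the `create_node` and `create_edge` methods in this file extract every property value as a `String` and propagate the failure, so a non-string value such as `{"age": 30}` raises instead of being dropped. Callers holding mixed-type properties may want to convert them first. A minimal helper follows, assuming JSON encoding is acceptable for non-string values. The helper name and that encoding choice are illustrative and not part of the spec.

```python
import json
from typing import Any, Dict


def stringify_properties(props: Dict[str, Any]) -> Dict[str, str]:
    """Convert property values to strings; str values pass through unchanged."""
    out: Dict[str, str] = {}
    for key, value in props.items():
        if not isinstance(key, str):
            raise TypeError(f"property keys must be str, got {type(key).__name__}")
        # Non-string values are JSON-encoded so they can be decoded again later.
        out[key] = value if isinstance(value, str) else json.dumps(value)
    return out
```

A call such as `graph.create_node("person1", ["Person"], stringify_properties({"name": "Alice", "age": 30}))` would then pass `"30"` for the age, matching the string-valued examples in the docstrings.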
use pyo3::prelude::*;
use pyo3::types::PyDict;
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use ruvector_graph::{GraphDB, NodeBuilder, EdgeBuilder};
use crate::errors::{RuVectorError, Result};
use crate::types::{PyNode, PyEdge};
#[pyclass(name = "GraphDB")]
pub struct PyGraphDB {
inner: Arc<RwLock<GraphDB>>,
}
#[pymethods]
impl PyGraphDB {
/// Create a new GraphDB instance
///
/// Args:
/// path: Optional path for persistent storage
///
/// Example:
/// graph = GraphDB(path="./graph.db")
#[new]
#[pyo3(signature = (path=None))]
fn new(path: Option<&str>) -> PyResult<Self> {
let db = if let Some(p) = path {
GraphDB::with_path(p)
} else {
GraphDB::new()
}.map_err(|e| RuVectorError::Database(e.to_string()))?;
Ok(Self {
inner: Arc::new(RwLock::new(db)),
})
}
/// Execute a Cypher query
///
/// Args:
/// cypher: Cypher query string
///
/// Returns:
/// Query results as a list of dictionaries
///
/// Example:
/// results = graph.execute("MATCH (n:Person) RETURN n.name")
fn execute(&self, cypher: &str) -> PyResult<PyObject> {
let db = self.inner.write().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
let result = db.execute(cypher)
.map_err(|e| RuVectorError::CypherError(e.to_string()))?;
Python::with_gil(|py| {
// Convert result to Python object
let json_str = serde_json::to_string(&result)
.map_err(|e| RuVectorError::SerializationError(e.to_string()))?;
let json_module = py.import_bound("json")?;
let parsed = json_module.call_method1("loads", (json_str,))?;
Ok(parsed.into())
})
}
/// Create a node
///
/// Args:
/// id: Unique node identifier
/// labels: List of node labels
/// properties: Node properties dictionary
///
/// Example:
/// graph.create_node("person1", ["Person"], {"name": "Alice", "age": "30"})
#[pyo3(signature = (id, labels=None, properties=None))]
fn create_node(
&self,
id: &str,
labels: Option<Vec<String>>,
properties: Option<&Bound<'_, PyDict>>,
) -> PyResult<PyNode> {
let mut builder = NodeBuilder::new(id);
if let Some(lbls) = &labels {
// Borrow here so `labels` can still be moved into the returned PyNode below
for label in lbls {
builder = builder.label(label);
}
}
if let Some(props) = properties {
for (key, value) in props.iter() {
let k: String = key.extract()?;
let v: String = value.extract()?;
builder = builder.property(&k, v);
}
}
let node = builder.build();
let mut db = self.inner.write().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
db.create_node(node.clone())
.map_err(|e| RuVectorError::Database(e.to_string()))?;
Ok(PyNode {
id: id.to_string(),
labels: labels.unwrap_or_default(),
properties: properties
.map(|p| {
p.iter()
.filter_map(|(k, v)| {
Some((k.extract::<String>().ok()?, v.extract::<String>().ok()?))
})
.collect()
})
.unwrap_or_default(),
})
}
/// Create an edge between nodes
///
/// Args:
/// source: Source node ID
/// target: Target node ID
/// edge_type: Relationship type
/// properties: Edge properties dictionary
///
/// Example:
/// graph.create_edge("person1", "person2", "KNOWS", {"since": "2020"})
#[pyo3(signature = (source, target, edge_type, properties=None))]
fn create_edge(
&self,
source: &str,
target: &str,
edge_type: &str,
properties: Option<&Bound<'_, PyDict>>,
) -> PyResult<PyEdge> {
let mut builder = EdgeBuilder::new(source, target, edge_type);
if let Some(props) = properties {
for (key, value) in props.iter() {
let k: String = key.extract()?;
let v: String = value.extract()?;
builder = builder.property(&k, v);
}
}
let edge = builder.build();
let edge_id = edge.id.clone();
let mut db = self.inner.write().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
db.create_edge(edge)
.map_err(|e| RuVectorError::Database(e.to_string()))?;
Ok(PyEdge {
id: edge_id,
edge_type: edge_type.to_string(),
source: source.to_string(),
target: target.to_string(),
properties: properties
.map(|p| {
p.iter()
.filter_map(|(k, v)| {
Some((k.extract::<String>().ok()?, v.extract::<String>().ok()?))
})
.collect()
})
.unwrap_or_default(),
})
}
/// Get a node by ID
fn get_node(&self, id: &str) -> PyResult<Option<PyNode>> {
let db = self.inner.read().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
match db.get_node(id) {
Ok(Some(node)) => Ok(Some(PyNode {
id: node.id,
labels: node.labels,
properties: node.properties,
})),
Ok(None) => Ok(None),
Err(e) => Err(RuVectorError::Database(e.to_string()).into()),
}
}
/// Delete a node by ID
fn delete_node(&self, id: &str) -> PyResult<bool> {
let mut db = self.inner.write().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
db.delete_node(id)
.map_err(|e| RuVectorError::Database(e.to_string()).into())
}
/// Sync data to disk
fn sync(&self) -> PyResult<()> {
let db = self.inner.write().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
db.sync().map_err(|e| RuVectorError::Database(e.to_string()).into())
}
fn __repr__(&self) -> String {
"GraphDB()".to_string()
}
}
src/gnn.rs
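`GNNLayer.forward` in this file documents three inputs whose shapes must agree: a query vector, a neighbours matrix of shape `(n_neighbors, feature_dim)`, and one weight per neighbour. The binding code does not check these before calling into the layer, so a caller-side check could look like the sketch below. It assumes `feature_dim` equals `input_dim`, which the spec implies but does not state. `check_forward_shapes` is a hypothetical helper.

```python
from typing import Sequence


def check_forward_shapes(
    input_dim: int,
    query: Sequence[float],
    neighbors: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> int:
    """Validate GNNLayer.forward inputs and return the number of neighbours."""
    if len(query) != input_dim:
        raise ValueError(f"query has length {len(query)}, expected {input_dim}")
    for i, row in enumerate(neighbors):
        if len(row) != input_dim:
            raise ValueError(
                f"neighbor {i} has length {len(row)}, expected {input_dim}"
            )
    if len(weights) != len(neighbors):
        raise ValueError(
            f"got {len(weights)} weights for {len(neighbors)} neighbors"
        )
    return len(neighbors)
```

The same check could be done in Rust inside `forward`, raising `DimensionMismatch`, which would give Python callers a `ValueError` instead of whatever error the layer produces.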
use pyo3::prelude::*;
use numpy::{PyArray1, PyArray2, PyReadonlyArray1, PyReadonlyArray2};
use std::sync::{Arc, RwLock};
use ruvector_gnn::{GNNLayer, RuvectorLayer, LayerConfig};
use crate::errors::RuVectorError;
#[pyclass(name = "GNNLayer")]
pub struct PyGNNLayer {
inner: Arc<RwLock<GNNLayer>>,
input_dim: usize,
output_dim: usize,
}
#[pymethods]
impl PyGNNLayer {
/// Create a new GNN layer
///
/// Args:
/// input_dim: Input feature dimension
/// output_dim: Output feature dimension
/// heads: Number of attention heads
/// dropout: Dropout rate
///
/// Example:
/// layer = GNNLayer(input_dim=128, output_dim=256, heads=4)
#[new]
#[pyo3(signature = (input_dim, output_dim, heads=4, dropout=0.1))]
fn new(input_dim: usize, output_dim: usize, heads: usize, dropout: f32) -> PyResult<Self> {
let config = LayerConfig {
input_dim,
output_dim,
heads,
dropout,
};
let layer = GNNLayer::new(config)
.map_err(|e| RuVectorError::Database(e.to_string()))?;
Ok(Self {
inner: Arc::new(RwLock::new(layer)),
input_dim,
output_dim,
})
}
/// Forward pass through the GNN layer
///
/// Args:
/// query: Query features (1D array)
/// neighbors: Neighbor features (2D array: n_neighbors x feature_dim)
/// weights: Edge weights (1D array: n_neighbors)
///
/// Returns:
/// Enhanced query features (1D array)
fn forward<'py>(
&self,
py: Python<'py>,
query: PyReadonlyArray1<f32>,
neighbors: PyReadonlyArray2<f32>,
weights: PyReadonlyArray1<f32>,
) -> PyResult<Bound<'py, PyArray1<f32>>> {
let query_vec = query.as_slice()?;
let neighbors_data = neighbors.as_array();
let weights_vec = weights.as_slice()?;
// Convert neighbors to Vec<Vec<f32>>
let neighbors_vec: Vec<Vec<f32>> = neighbors_data
.rows()
.into_iter()
.map(|row| row.to_vec())
.collect();
let layer = self.inner.read().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
let result = layer.forward(query_vec, &neighbors_vec, weights_vec)
.map_err(|e| RuVectorError::Database(e.to_string()))?;
Ok(PyArray1::from_vec_bound(py, result))
}
fn __repr__(&self) -> String {
format!("GNNLayer(input_dim={}, output_dim={})", self.input_dim, self.output_dim)
}
}
#[pyclass(name = "RuvectorLayer")]
pub struct PyRuvectorLayer {
inner: Arc<RwLock<RuvectorLayer>>,
}
#[pymethods]
impl PyRuvectorLayer {
/// Create a RuvectorLayer (GNN + attention for vector search enhancement)
///
/// Args:
/// input_dim: Input dimension
/// output_dim: Output dimension
/// heads: Number of attention heads
/// dropout: Dropout rate
#[new]
#[pyo3(signature = (input_dim, output_dim, heads=4, dropout=0.1))]
fn new(input_dim: usize, output_dim: usize, heads: usize, dropout: f32) -> PyResult<Self> {
let layer = RuvectorLayer::new(input_dim, output_dim, heads, dropout)
.map_err(|e| RuVectorError::Database(e.to_string()))?;
Ok(Self {
inner: Arc::new(RwLock::new(layer)),
})
}
/// Apply differentiable search enhancement
fn enhance_search<'py>(
&self,
py: Python<'py>,
query: PyReadonlyArray1<f32>,
candidates: PyReadonlyArray2<f32>,
) -> PyResult<Bound<'py, PyArray1<f32>>> {
let query_vec = query.as_slice()?;
let candidates_data = candidates.as_array();
let candidates_vec: Vec<Vec<f32>> = candidates_data
.rows()
.into_iter()
.map(|row| row.to_vec())
.collect();
let layer = self.inner.read().map_err(|e| {
RuVectorError::Database(format!("Lock error: {}", e))
})?;
let scores = layer.enhance_search(query_vec, &candidates_vec)
.map_err(|e| RuVectorError::Database(e.to_string()))?;
Ok(PyArray1::from_vec_bound(py, scores))
}
}
src/compression.rs
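The `compress` function in this file picks a compression level from the `ratio` argument through a chain of half-open thresholds. Restated in Python, the mapping is easier to read and to test. The level names and the 32x/16x/8x/2x factors come from the comments in the Rust code. The factor of 1 for `None` and the function name `level_for_ratio` are assumptions made for this sketch.

```python
from typing import Tuple


def level_for_ratio(ratio: float) -> Tuple[str, int]:
    """Return (CompressionLevel name, nominal compression factor) for a ratio."""
    if ratio < 0.1:
        return ("Binary", 32)
    if ratio < 0.25:
        return ("PQ4", 16)
    if ratio < 0.5:
        return ("PQ8", 8)
    if ratio < 0.75:
        return ("Float16", 2)
    # No compression; the factor 1 is an assumption, the source gives none.
    return ("None", 1)
```

Because each threshold is a strict `<`, the default `ratio=0.5` selects `Float16` (2x), not `PQ8`. The docstring example `ratio=0.3` does land on `PQ8`, consistent with its "~8x compression" comment.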
use pyo3::prelude::*;
use numpy::{PyArray1, PyReadonlyArray1};
use ruvector_gnn::compression::{TensorCompressor, CompressionLevel};
use crate::errors::RuVectorError;
/// Compress a vector using adaptive quantization
///
/// Args:
/// vector: Input vector as numpy array
/// ratio: Compression ratio (0.0-1.0). Lower = more compression.
///
/// Returns:
/// Compressed vector as numpy array
///
/// Example:
/// compressed = ruvector.compress(embedding, ratio=0.3) # ~8x compression
#[pyfunction]
#[pyo3(signature = (vector, ratio=0.5))]
pub fn compress<'py>(
py: Python<'py>,
vector: PyReadonlyArray1<f32>,
ratio: f32,
) -> PyResult<Bound<'py, PyArray1<f32>>> {
let vec = vector.as_slice()?;
let level = if ratio < 0.1 {
CompressionLevel::Binary // 32x
} else if ratio < 0.25 {
CompressionLevel::PQ4 // 16x
} else if ratio < 0.5 {
CompressionLevel::PQ8 // 8x
} else if ratio < 0.75 {
CompressionLevel::Float16 // 2x
} else {
CompressionLevel::None
};
let compressor = TensorCompressor::new(level);
let compressed = compressor.compress(vec)
.map_err(|e| RuVectorError::Database(e.to_string()))?;
Ok(PyArray1::from_vec_bound(py, compressed))
}
/// Decompress a vector
///
/// Args:
/// vector: Compressed vector as numpy array
/// original_dim: Original vector dimension
///
/// Returns:
/// Decompressed vector as numpy array
#[pyfunction]
#[pyo3(signature = (vector, original_dim=None))]
pub fn decompress<'py>(
py: Python<'py>,
vector: PyReadonlyArray1<f32>,
original_dim: Option<usize>,
) -> PyResult<Bound<'py, PyArray1<f32>>> {
let vec = vector.as_slice()?;
let compressor = TensorCompressor::default();
let decompressed = compressor.decompress(vec, original_dim)
.map_err(|e| RuVectorError::Database(e.to_string()))?;
Ok(PyArray1::from_vec_bound(py, decompressed))
}
python/ruvector/__init__.py
"""
RuVector: A distributed vector database that learns.
Store embeddings, query with Cypher, scale horizontally with Raft consensus,
and let the index improve itself through Graph Neural Networks.
Example:
>>> from ruvector import VectorDB, GraphDB
>>>
>>> # Vector search
>>> db = VectorDB(dimensions=384, path="./vectors.db")
>>> db.insert("doc1", embedding, metadata={"source": "wiki"})
>>> results = db.search(query_embedding, k=10)
>>>
>>> # Graph queries
>>> graph = GraphDB(path="./graph.db")
>>> graph.execute("CREATE (a:Person {name: 'Alice'})")
>>> graph.execute("MATCH (p:Person) RETURN p.name")
"""
from ._ruvector import (
# Core classes
VectorDB,
GraphDB,
# GNN classes
GNNLayer,
RuvectorLayer,
# Result types
SearchResult,
Node,
Edge,
# Utility functions
compress,
decompress,
# Version
__version__,
)
__all__ = [
# Core
"VectorDB",
"GraphDB",
# GNN
"GNNLayer",
"RuvectorLayer",
# Types
"SearchResult",
"Node",
"Edge",
# Utils
"compress",
"decompress",
# Meta
"__version__",
]
python/ruvector/_types.pyi (Type Stubs)
"""Type stubs for ruvector native module."""
from typing import Any, Dict, List, Optional, Tuple
import numpy as np
import numpy.typing as npt

class SearchResult:
    id: str
    distance: float
    metadata: Optional[Dict[str, str]]
    vector: Optional[List[float]]
    def to_dict(self) -> Dict[str, Any]: ...

class Node:
    id: str
    labels: List[str]
    properties: Dict[str, str]

class Edge:
    id: str
    edge_type: str
    source: str
    target: str
    properties: Dict[str, str]

class VectorDB:
    def __init__(
        self,
        dimensions: int,
        path: Optional[str] = None,
        distance_metric: str = "cosine",
    ) -> None: ...
    def insert(
        self,
        id: str,
        # PyReadonlyArray1<f32> accepts only 1-D float32 arrays, not lists
        vector: npt.NDArray[np.float32],
        metadata: Optional[Dict[str, str]] = None,
    ) -> None: ...
    def insert_batch(
        self,
        entries: List[Tuple[str, List[float], Optional[Dict[str, str]]]],
    ) -> int: ...
    def search(
        self,
        query: npt.NDArray[np.float32],
        k: int = 10,
        filter: Optional[Dict[str, Any]] = None,
        include_vectors: bool = False,
    ) -> List[SearchResult]: ...
    def get(self, id: str) -> Optional[SearchResult]: ...
    def delete(self, id: str) -> bool: ...
    def sync(self) -> None: ...
    def stats(self) -> Dict[str, Any]: ...
    def __len__(self) -> int: ...

class GraphDB:
    def __init__(self, path: Optional[str] = None) -> None: ...
    def execute(self, cypher: str) -> Any: ...
    def create_node(
        self,
        id: str,
        labels: Optional[List[str]] = None,
        properties: Optional[Dict[str, str]] = None,
    ) -> Node: ...
    def create_edge(
        self,
        source: str,
        target: str,
        edge_type: str,
        properties: Optional[Dict[str, str]] = None,
    ) -> Edge: ...
    def get_node(self, id: str) -> Optional[Node]: ...
    def delete_node(self, id: str) -> bool: ...
    def sync(self) -> None: ...

class GNNLayer:
    def __init__(
        self,
        input_dim: int,
        output_dim: int,
        heads: int = 4,
        dropout: float = 0.1,
    ) -> None: ...
    def forward(
        self,
        query: npt.NDArray[np.float32],
        neighbors: npt.NDArray[np.float32],
        weights: npt.NDArray[np.float32],
    ) -> npt.NDArray[np.float32]: ...

class RuvectorLayer:
    def __init__(
        self,
        input_dim: int,
        output_dim: int,
        heads: int = 4,
        dropout: float = 0.1,
    ) -> None: ...
    def enhance_search(
        self,
        query: npt.NDArray[np.float32],
        candidates: npt.NDArray[np.float32],
    ) -> npt.NDArray[np.float32]: ...

def compress(
    vector: npt.NDArray[np.float32],
    ratio: float = 0.5,
) -> npt.NDArray[np.float32]: ...
def decompress(
    vector: npt.NDArray[np.float32],
    original_dim: Optional[int] = None,
) -> npt.NDArray[np.float32]: ...

__version__: str
Tests
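The search tests in this section only check result counts and attribute presence. One way to make them stricter is to compare the returned IDs with a brute-force ranking computed in plain Python. The sketch assumes the default `cosine` metric means cosine distance, that is `1 - cosine similarity`, with smaller values ranked first. The spec does not define this, so the formula and the helper names `cosine_distance` and `brute_force_top_k` are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_top_k(
    vectors: Dict[str, Sequence[float]],
    query: Sequence[float],
    k: int,
) -> List[Tuple[str, float]]:
    """Rank every stored vector by distance to the query and keep the k nearest."""
    scored = [(vid, cosine_distance(vec, query)) for vid, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]
```

A test could then compare `[r.id for r in db.search(query, k=2)]` with the IDs from `brute_force_top_k`. If the index is approximate, an exact match is not guaranteed, and a recall threshold would be the safer assertion.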
tests/conftest.py
import pytest
import numpy as np
import tempfile
import os


@pytest.fixture
def temp_dir():
    """Create a temporary directory for database files."""
    with tempfile.TemporaryDirectory() as tmpdir:
        yield tmpdir


@pytest.fixture
def sample_vectors():
    """Generate sample vectors for testing."""
    np.random.seed(42)
    return {
        "dimensions": 128,
        "vectors": [
            ("doc1", np.random.rand(128).astype(np.float32)),
            ("doc2", np.random.rand(128).astype(np.float32)),
            ("doc3", np.random.rand(128).astype(np.float32)),
        ],
        "query": np.random.rand(128).astype(np.float32),
    }
tests/test_vector_db.py
import pytest
import numpy as np
from ruvector import VectorDB


class TestVectorDB:
    def test_create_in_memory(self):
        db = VectorDB(dimensions=128)
        assert len(db) == 0

    def test_create_persistent(self, temp_dir):
        path = f"{temp_dir}/vectors.db"
        db = VectorDB(dimensions=128, path=path)
        assert len(db) == 0

    def test_insert_and_search(self, sample_vectors):
        db = VectorDB(dimensions=sample_vectors["dimensions"])
        # Insert vectors
        for id, vector in sample_vectors["vectors"]:
            db.insert(id, vector)
        assert len(db) == 3
        # Search
        results = db.search(sample_vectors["query"], k=2)
        assert len(results) == 2
        assert all(hasattr(r, "id") for r in results)
        assert all(hasattr(r, "distance") for r in results)

    def test_insert_with_metadata(self, sample_vectors):
        db = VectorDB(dimensions=sample_vectors["dimensions"])
        id, vector = sample_vectors["vectors"][0]
        db.insert(id, vector, metadata={"source": "test", "type": "document"})
        result = db.get(id)
        assert result is not None
        assert result.metadata["source"] == "test"

    def test_dimension_mismatch(self):
        db = VectorDB(dimensions=128)
        with pytest.raises(ValueError, match="Dimension mismatch"):
            wrong_dim = np.random.rand(256).astype(np.float32)
            db.insert("test", wrong_dim)

    def test_delete(self, sample_vectors):
        db = VectorDB(dimensions=sample_vectors["dimensions"])
        id, vector = sample_vectors["vectors"][0]
        db.insert(id, vector)
        assert len(db) == 1
        deleted = db.delete(id)
        assert deleted
        assert len(db) == 0

    def test_batch_insert(self, sample_vectors):
        db = VectorDB(dimensions=sample_vectors["dimensions"])
        entries = [
            (id, vector.tolist(), {"index": str(i)})
            for i, (id, vector) in enumerate(sample_vectors["vectors"])
        ]
        count = db.insert_batch(entries)
        assert count == 3
        assert len(db) == 3

    def test_persistence(self, temp_dir, sample_vectors):
        path = f"{temp_dir}/vectors.db"
        # Create and populate
        db1 = VectorDB(dimensions=sample_vectors["dimensions"], path=path)
        for id, vector in sample_vectors["vectors"]:
            db1.insert(id, vector)
        db1.sync()
        # Reopen
        db2 = VectorDB(dimensions=sample_vectors["dimensions"], path=path)
        assert len(db2) == 3
tests/test_graph_db.py
import pytest
from ruvector import GraphDB


class TestGraphDB:
    def test_create_in_memory(self):
        graph = GraphDB()
        assert graph is not None

    def test_create_node(self):
        graph = GraphDB()
        node = graph.create_node(
            "person1",
            labels=["Person"],
            properties={"name": "Alice", "age": "30"}
        )
        assert node.id == "person1"
        assert "Person" in node.labels
        assert node.properties["name"] == "Alice"

    def test_create_edge(self):
        graph = GraphDB()
        graph.create_node("person1", ["Person"], {"name": "Alice"})
        graph.create_node("person2", ["Person"], {"name": "Bob"})
        edge = graph.create_edge(
            "person1", "person2", "KNOWS",
            properties={"since": "2020"}
        )
        assert edge.source == "person1"
        assert edge.target == "person2"
        assert edge.edge_type == "KNOWS"

    def test_cypher_create(self):
        graph = GraphDB()
        graph.execute("CREATE (a:Person {name: 'Alice'})")
        graph.execute("CREATE (b:Person {name: 'Bob'})")
        graph.execute("MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'}) CREATE (a)-[:KNOWS]->(b)")
        result = graph.execute("MATCH (p:Person) RETURN p.name")
        assert len(result) == 2

    def test_cypher_match(self):
        graph = GraphDB()
        graph.create_node("alice", ["Person"], {"name": "Alice"})
        graph.create_node("bob", ["Person"], {"name": "Bob"})
        graph.create_edge("alice", "bob", "KNOWS")
        result = graph.execute("""
            MATCH (a:Person)-[:KNOWS]->(b:Person)
            RETURN a.name, b.name
        """)
        assert len(result) == 1

    def test_get_node(self):
        graph = GraphDB()
        graph.create_node("test", ["Label"], {"key": "value"})
        node = graph.get_node("test")
        assert node is not None
        assert node.id == "test"
        missing = graph.get_node("nonexistent")
        assert missing is None

    def test_delete_node(self):
        graph = GraphDB()
        graph.create_node("test", ["Label"])
        assert graph.get_node("test") is not None
        deleted = graph.delete_node("test")
        assert deleted
        assert graph.get_node("test") is None
tests/test_gnn.py
import pytest
import numpy as np
from ruvector import GNNLayer, RuvectorLayer, compress, decompress


class TestGNNLayer:
    def test_create_layer(self):
        layer = GNNLayer(input_dim=128, output_dim=256, heads=4)
        assert layer is not None

    def test_forward(self):
        layer = GNNLayer(input_dim=128, output_dim=256, heads=4)
        query = np.random.rand(128).astype(np.float32)
        neighbors = np.random.rand(5, 128).astype(np.float32)
        weights = np.ones(5, dtype=np.float32)
        output = layer.forward(query, neighbors, weights)
        assert output.shape == (256,)


class TestRuvectorLayer:
    def test_enhance_search(self):
        layer = RuvectorLayer(input_dim=128, output_dim=128, heads=4)
        query = np.random.rand(128).astype(np.float32)
        candidates = np.random.rand(10, 128).astype(np.float32)
        scores = layer.enhance_search(query, candidates)
        assert scores.shape == (10,)


class TestCompression:
    def test_compress_decompress(self):
        original = np.random.rand(128).astype(np.float32)
        compressed = compress(original, ratio=0.5)
        decompressed = decompress(compressed, original_dim=128)
        # Should be close but not exact due to lossy compression
        assert decompressed.shape == original.shape

    def test_compression_levels(self):
        original = np.random.rand(128).astype(np.float32)
        # Different compression levels
        for ratio in [0.05, 0.2, 0.4, 0.6, 0.9]:
            compressed = compress(original, ratio=ratio)
            assert compressed is not None

Examples
examples/basic_vector_search.py
"""Basic vector search example with RuVector."""
import numpy as np
from ruvector import VectorDB


def main():
    # Create a vector database
    db = VectorDB(dimensions=384, path="./demo_vectors.db")

    # Generate some sample embeddings (in practice, use a real embedding model)
    np.random.seed(42)
    documents = [
        ("doc1", "Introduction to machine learning"),
        ("doc2", "Deep learning with neural networks"),
        ("doc3", "Natural language processing basics"),
        ("doc4", "Computer vision fundamentals"),
        ("doc5", "Reinforcement learning tutorial"),
    ]

    # Insert documents with their embeddings
    for doc_id, text in documents:
        # Simulate embedding generation
        embedding = np.random.rand(384).astype(np.float32)
        db.insert(doc_id, embedding, metadata={"text": text})
    print(f"Inserted {len(db)} documents")

    # Search for similar documents
    query = np.random.rand(384).astype(np.float32)
    results = db.search(query, k=3)
    print("\nSearch results:")
    for result in results:
        print(f"  {result.id}: distance={result.distance:.4f}")
        if result.metadata:
            print(f"    text: {result.metadata.get('text', 'N/A')}")


if __name__ == "__main__":
    main()

examples/knowledge_graph.py
"""Knowledge graph example combining vectors and graph queries."""
import numpy as np
from ruvector import VectorDB, GraphDB


def main():
    # Initialize both databases
    vectors = VectorDB(dimensions=384, path="./kg_vectors.db")
    graph = GraphDB(path="./kg_graph.db")

    # Create some entities
    entities = [
        ("python", "Language", {"name": "Python", "type": "programming"}),
        ("rust", "Language", {"name": "Rust", "type": "programming"}),
        ("ml", "Topic", {"name": "Machine Learning"}),
        ("webdev", "Topic", {"name": "Web Development"}),
    ]
    np.random.seed(42)
    for entity_id, label, props in entities:
        # Create graph node
        graph.create_node(entity_id, [label], props)
        # Create embedding and store in vector DB
        embedding = np.random.rand(384).astype(np.float32)
        vectors.insert(entity_id, embedding, metadata=props)

    # Create relationships
    graph.create_edge("python", "ml", "USED_FOR")
    graph.create_edge("python", "webdev", "USED_FOR")
    graph.create_edge("rust", "webdev", "USED_FOR")

    # Query: Find what Python is used for
    result = graph.execute("""
        MATCH (lang:Language {name: 'Python'})-[:USED_FOR]->(topic:Topic)
        RETURN topic.name
    """)
    print("Python is used for:", result)

    # Semantic search: Find entities similar to a query
    query = np.random.rand(384).astype(np.float32)
    similar = vectors.search(query, k=2)
    print("\nMost similar entities:")
    for r in similar:
        print(f"  {r.id}: {r.distance:.4f}")


if __name__ == "__main__":
    main()

CI/CD Configuration
.github/workflows/python.yml (add to existing workflows)
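name: Python Bindings

on:
  push:
    branches: [main]
    # Tag pushes must trigger the workflow, otherwise the publish job's
    # `refs/tags/python-v` condition can never be true.
    tags: ['python-v*']
    paths:
      - 'crates/ruvector-python/**'
  pull_request:
    paths:
      - 'crates/ruvector-python/**'
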
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ['3.8', '3.9', '3.10', '3.11', '3.12']
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable
      - name: Install maturin
        run: pip install maturin pytest numpy
      - name: Build and test
        working-directory: crates/ruvector-python
        run: |
          maturin develop
          pytest tests/ -v

  build-wheels:
    needs: test
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
    steps:
      - uses: actions/checkout@v4
      - name: Build wheels
        uses: PyO3/maturin-action@v1
        with:
          working-directory: crates/ruvector-python
          args: --release --out dist
          manylinux: auto
      - name: Upload wheels
        uses: actions/upload-artifact@v4
        with:
          name: wheels-${{ matrix.os }}
          path: crates/ruvector-python/dist/*.whl

  publish:
    needs: build-wheels
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/python-v')
    steps:
      - uses: actions/download-artifact@v4
        with:
          pattern: wheels-*
          merge-multiple: true
          path: dist
      - name: Publish to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          password: ${{ secrets.PYPI_API_TOKEN }}

README.md
# RuVector Python Bindings
Python bindings for [RuVector](https://github.com/ruvnet/ruvector), a distributed vector database that learns.
## Installation
```bash
pip install ruvector
```
## Quick Start
from ruvector import VectorDB, GraphDB
import numpy as np
# Vector search
db = VectorDB(dimensions=384, path="./vectors.db")
# Insert vectors
embedding = np.random.rand(384).astype(np.float32)
db.insert("doc1", embedding, metadata={"source": "wiki"})
# Search
query = np.random.rand(384).astype(np.float32)
results = db.search(query, k=10)
for r in results:
    print(f"{r.id}: {r.distance}")

# Graph queries with Cypher
graph = GraphDB(path="./graph.db")
graph.execute("CREATE (a:Person {name: 'Alice'})-[:KNOWS]->(b:Person {name: 'Bob'})")
result = graph.execute("MATCH (p:Person) RETURN p.name")

## Features
- Vector Search: HNSW index, <0.5ms latency, SIMD acceleration
- Graph Queries: Neo4j-style Cypher syntax
- GNN Enhancement: Neural network layers that improve search over time
- Compression: 2-32x memory reduction with adaptive quantization
- Persistence: Optional file-based storage with crash recovery
## API Reference
See the [full documentation](https://github.com/ruvnet/ruvector/tree/main/crates/ruvector-python).
## License
MIT
## Contribution Checklist
Before submitting the PR:
1. [ ] All tests pass: `pytest tests/ -v`
2. [ ] Code is formatted: `cargo fmt`
3. [ ] No clippy warnings: `cargo clippy`
4. [ ] Type stubs are complete and accurate
5. [ ] Examples work correctly
6. [ ] README is updated
7. [ ] CHANGELOG entry added
8. [ ] Version matches main ruvector version
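
Item 5 lends itself to automation. The following is a minimal sketch, assuming the examples are standalone scripts in an `examples/` directory as laid out in this spec and that the bindings are already installed in the active environment. `run_examples` is a hypothetical helper name, not part of ruvector.

```python
import subprocess
import sys
from pathlib import Path


def run_examples(example_dir):
    """Run every *.py file in example_dir; return {filename: exit_code}.

    Hypothetical helper for the checklist, not part of the ruvector API.
    """
    results = {}
    for script in sorted(Path(example_dir).glob("*.py")):
        # Use the current interpreter so the installed bindings are importable.
        completed = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=120,
        )
        results[script.name] = completed.returncode
    return results
```

Calling `run_examples("examples")` from `crates/ruvector-python` and checking that every exit code is 0 would cover the "examples work" item without adding a test dependency.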
## PR Description Template
```markdown
## Python Bindings for RuVector
This PR adds native Python bindings for RuVector using PyO3 and Maturin.
### Features
- `VectorDB`: Vector storage and similarity search
- `GraphDB`: Cypher query support
- `GNNLayer`/`RuvectorLayer`: GNN enhancement layers
- `compress`/`decompress`: Tensor compression utilities
### API Parity
Matches the Node.js bindings (`@ruvector/core`, `@ruvector/graph-node`, `@ruvector/gnn`) with Pythonic adaptations.
### Testing
- Unit tests for all public APIs
- Integration tests with numpy arrays
- Tested on Python 3.8-3.12, Linux/macOS/Windows
### Documentation
- Type stubs (`.pyi`) for IDE support
- Docstrings with examples
- README with quickstart guide
Closes #XX (if there's a related issue)
```
Notes for Implementation
- Study the Node.js bindings first: Look at `crates/ruvector-node/src/lib.rs` for patterns
- Match the API signatures: Keep function names and parameters consistent with Node.js
- Use numpy for arrays: Python users expect numpy array support
- Thread safety: Use `Arc<RwLock<>>` for the inner database handles
- Error messages: Make them Pythonic and helpful
- Type hints: Complete `.pyi` files are essential for Python IDE support
- Test on all platforms: macOS, Linux, Windows all behave differently
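
To make the notes on numpy support and error messages concrete, here is a pure-Python sketch of the input normalization the bindings would need to perform before handing a vector to Rust. It is an illustration under assumptions: `as_float32_vector` is a hypothetical name, and the real check would presumably live in the Rust layer. The `Dimension mismatch` wording follows what `test_dimension_mismatch` expects.

```python
import numpy as np


def as_float32_vector(vector, expected_dim):
    """Hypothetical helper: coerce input to a contiguous 1-D float32 array.

    Accepts lists, tuples, or numpy arrays of any numeric dtype.
    """
    arr = np.ascontiguousarray(vector, dtype=np.float32)
    if arr.ndim != 1:
        raise ValueError(
            f"Expected a 1-D vector, got an array with {arr.ndim} dimensions"
        )
    if arr.shape[0] != expected_dim:
        # State both sizes so the caller can see what went wrong.
        raise ValueError(
            f"Dimension mismatch: expected {expected_dim}, got {arr.shape[0]}"
        )
    return arr
```

Raising `ValueError` with both the expected and actual size keeps the message useful, and accepting plain lists as well as arrays matches how `insert_batch` is called with `vector.tolist()` in the tests.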