Skip to content

Conversation

@Wirasm
Copy link
Contributor

@Wirasm Wirasm commented Dec 10, 2025

Summary

  • Fixed hardcoded dimensions = 128 in HNSW index build functions
  • Now extracts dimensions from the indexed column's type modifier (typmod)
  • When users declare ruvector(384), the index correctly uses 384 dimensions

Problem

The hnsw_build() function at crates/ruvector-postgres/src/index/hnsw_am.rs:272 hardcoded:

let dimensions = 128; // TODO: Extract from index definition

This caused all HNSW indexes to be treated as 128-dimensional, regardless of actual vector size.

Solution

Extract dimensions from PostgreSQL's type modifier system:

// RelationGetDescr and TupleDescAttr are C macros, so we access fields directly
let index_desc = (*index).rd_att;
let attr = (*index_desc).attrs.as_ptr().add(0);
let typmod = (*attr).atttypmod;

When users declare ruvector(384), PostgreSQL stores 384 in atttypmod.

Changes

  • hnsw_build(): Extract dimensions from typmod, error if not specified
  • hnsw_buildempty(): Also extract dimensions for empty index creation
  • Added doc comments explaining the dimension extraction behavior
  • Fixed several pre-existing test compilation issues (missing max_layers field, lifetime errors, JsonB type mismatches)

Error Handling

If the column is declared without explicit dimensions (e.g., ruvector instead of ruvector(384)), users get a helpful error:

HNSW: Vector column must have dimensions specified. 
Use ruvector(dimensions) instead of ruvector, e.g., ruvector(384)

Test Plan

  • Library code compiles successfully (cargo check --lib)
  • Manual verification: Create index on ruvector(384) column - should use 384 dimensions
  • Manual verification: Create index on ruvector column - should error with guidance

CI Failures (Pre-existing Issues)

The CI failures are not caused by this PR. They are pre-existing bugs in the CI workflow:

  1. Non-existent features: The workflow references hybrid-search and filtered-search features that don't exist in Cargo.toml. There's even a comment in the crate: # Note: hybrid-search and filtered-search are planned for future releases

  2. Missing Docker image: ruvector-postgres:test is referenced but never built

  3. Benchmark permissions: Fork PRs can't comment on upstream PRs due to GitHub Actions permissions

These issues exist on the main branch and should be addressed in a separate PR.


Architectural Recommendation: Test Structure

While fixing test compilation errors, I noticed a systemic pattern where #[pg_test] tests pass Vec<T> directly to functions that expect JsonB. This makes tests awkward and couples test code to PostgreSQL serialization.

Recommendation: Refactor PostgreSQL operator functions to have two layers:

// Inner: Pure Rust, easily testable with Vec<T>
fn gcn_forward_impl(
    embeddings: &[Vec<f32>],
    src: &[i32],
    dst: &[i32],
    weights: Option<&[f32]>,
    out_dim: usize,
) -> Vec<Vec<f32>> {
    // actual logic here
}

// Outer: PostgreSQL interface handling JsonB serialization
#[pg_extern]
pub fn ruvector_gcn_forward(
    embeddings_json: JsonB,
    src: Vec<i32>,
    dst: Vec<i32>,
    weights: Option<Vec<f32>>,
    out_dim: i32,
) -> JsonB {
    let embeddings = parse_json_to_vec(&embeddings_json);
    let result = gcn_forward_impl(&embeddings, &src, &dst, weights.as_deref(), out_dim as usize);
    JsonB(serde_json::to_value(result).unwrap())
}

Benefits:

  • Unit tests call _impl() functions directly with native Rust types - no JsonB wrapping
  • Integration tests (#[pg_test]) test the full PostgreSQL interface through SQL
  • Better separation of concerns - business logic doesn't depend on PostgreSQL serialization
  • More maintainable - changing the PostgreSQL interface doesn't require updating all unit tests

This pattern should be applied to: gnn/operators.rs, attention/operators.rs, learning/operators.rs, and similar files. This would be a good follow-up refactoring task.

…hardcoding 128

Previously, hnsw_build() hardcoded dimensions=128 regardless of the actual
vector column definition. This caused incorrect index behavior for vectors
with dimensions other than 128.

Now dimensions are extracted from the indexed column's type modifier (typmod).
When users declare ruvector(384), PostgreSQL stores 384 in atttypmod, which
we now read correctly.

Changes:
- hnsw_build(): Extract dimensions from typmod, error if not specified
- hnsw_buildempty(): Also extract dimensions for empty index creation
- Add doc comments explaining the dimension extraction behavior

If the column is declared without explicit dimensions (e.g., just 'ruvector'
instead of 'ruvector(384)'), an error message guides users to specify dimensions.
@Wirasm Wirasm force-pushed the fix/hnsw-hardcoded-dimensions branch from a697416 to 38226c1 Compare December 10, 2025 09:31
- Add missing max_layers field to HnswConfig in tests
- Fix temporary value lifetime issues in patterns.rs test
- Fix JsonB type mismatches in gnn/operators.rs tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant