Conversation


codeflash-ai bot commented on Oct 26, 2025

📄 99% (0.99x) speedup for validate_embeddings in chromadb/api/types.py

⏱️ Runtime : 2.32 milliseconds → 1.16 milliseconds (best of 60 runs)
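
For context, a comparable "best of N runs" measurement can be reproduced with the standard library's timeit. The harness below is purely illustrative; the input shape and run counts are assumptions, not what Codeflash's own timing infrastructure uses, and absolute numbers will vary by machine.

```python
# Illustrative benchmark only -- not the Codeflash harness.
import timeit
import numpy as np
from chromadb.api.types import validate_embeddings

embeddings = [np.arange(10, dtype=np.float32) for _ in range(1000)]

# "Best of N" timing: take the minimum over repeated measurements to suppress noise.
per_call = min(timeit.repeat(lambda: validate_embeddings(embeddings), number=100, repeat=60)) / 100
print(f"{per_call * 1e6:.1f} µs per call (best of 60 runs)")
```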

📝 Explanation and details

The optimized code achieves a 99% speedup through two key optimizations:

1. Pre-computed dtype set lookup: The original code checked embedding.dtype not in [np.float16, np.float32, np.float64, np.int32, np.int64] for each embedding, which builds a fresh five-element list and scans it linearly on every call. The optimized version uses a pre-computed set _ALLOWED_DTYPES of np.dtype() objects, turning each check into an O(1) hash lookup instead of a linear scan.

2. Early-exit loop for type checking: Instead of all([isinstance(e, np.ndarray) for e in embeddings]), which materializes the entire list before all() can evaluate it, the optimized version uses a plain for loop that bails out as soon as it encounters the first non-ndarray element. Both changes are sketched below.

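Putting the two changes together, a minimal sketch of the optimized validator is shown below. This is an illustration based on the description above and the behaviour exercised by the regression tests; the exact code, error messages, and validation order in chromadb/api/types.py may differ.

```python
import numpy as np

# Computed once at import time: a set of np.dtype objects, so membership is a hash lookup.
_ALLOWED_DTYPES = {
    np.dtype(np.float16), np.dtype(np.float32), np.dtype(np.float64),
    np.dtype(np.int32), np.dtype(np.int64),
}

def validate_embeddings(embeddings):
    if not isinstance(embeddings, list):
        raise ValueError(f"Expected embeddings to be a list, got {type(embeddings).__name__}")
    if len(embeddings) == 0:
        raise ValueError("Expected embeddings to be a list with at least one item, got 0 embeddings")
    # Early-exit type check: stop at the first non-ndarray instead of materializing a full list for all().
    for e in embeddings:
        if not isinstance(e, np.ndarray):
            raise ValueError("Expected each embedding to be a numpy array")
    for i, embedding in enumerate(embeddings):
        if embedding.ndim == 0:
            raise ValueError(f"Expected a 1-dimensional array, got a 0-dimensional array at pos {i}")
        if embedding.size == 0:
            raise ValueError(f"Expected each embedding to be non-empty, got an array with no values at pos {i}")
        # O(1) hash-based dtype lookup replaces the per-call list scan.
        if embedding.dtype not in _ALLOWED_DTYPES:
            raise ValueError("Expected each value in the embedding to be an int or float")
    return embeddings
```
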
Performance impact by test case type:

  • Valid embeddings: 80-110% faster across all dtype combinations due to the optimized dtype checking
  • Large datasets: Dramatic improvements - 135% faster for 1000 embeddings, 198% faster for mixed-type collections
  • Invalid dtype cases: 6-26% faster, as the dtype set lookup is more efficient even when raising errors
  • Early validation failures: Minimal impact (some slightly slower) since these exit before reaching the optimized paths

The line profiler shows the dtype checking went from 21.5% of total time (with multiple list constructions) to 16.3% with a single set lookup, while the isinstance checking became more efficient through early termination.
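
These per-line percentages come from Codeflash's own profiling. To reproduce similar line-level numbers locally, the third-party line_profiler package can be used along the lines of the sketch below; the workload here is an arbitrary example, not the profile Codeflash ran.

```python
# Rough, hypothetical harness for per-line profiling (pip install line_profiler).
import numpy as np
from line_profiler import LineProfiler
from chromadb.api.types import validate_embeddings

embeddings = [np.arange(10, dtype=np.float32) for _ in range(1000)]

lp = LineProfiler()
profiled = lp(validate_embeddings)  # wrap the function so each call is recorded line by line
profiled(embeddings)
lp.print_stats()  # prints per-line hits, time, and % of total time
```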

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 44 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from typing import List

# imports
import numpy as np
import pytest
from chromadb.api.types import validate_embeddings  # function under test

# Type aliases mirroring chromadb.api.types
Embedding = np.ndarray
Embeddings = List[Embedding]

# unit tests

# -------------------------
# 1. Basic Test Cases
# -------------------------

def test_valid_embeddings_float32():
    # Test with a valid list of float32 numpy arrays
    arrs = [np.array([1.0, 2.0, 3.0], dtype=np.float32), np.array([4.0, 5.0], dtype=np.float32)]
    codeflash_output = validate_embeddings(arrs); result = codeflash_output # 5.54μs -> 2.99μs (85.2% faster)

def test_valid_embeddings_int64():
    # Test with a valid list of int64 numpy arrays
    arrs = [np.array([1, 2, 3], dtype=np.int64), np.array([4, 5], dtype=np.int64)]
    codeflash_output = validate_embeddings(arrs); result = codeflash_output # 4.53μs -> 2.43μs (86.5% faster)

def test_valid_embeddings_float64_and_int32():
    # Test with mixed valid dtypes (float64 and int32)
    arrs = [np.array([1.2, 2.3], dtype=np.float64), np.array([7, 8, 9], dtype=np.int32)]
    codeflash_output = validate_embeddings(arrs); result = codeflash_output # 4.42μs -> 2.41μs (83.4% faster)

def test_valid_embeddings_float16():
    # Test with float16 dtype
    arrs = [np.array([0.1, 0.2], dtype=np.float16)]
    codeflash_output = validate_embeddings(arrs); result = codeflash_output # 2.73μs -> 2.09μs (31.0% faster)

# -------------------------
# 2. Edge Test Cases
# -------------------------

def test_empty_list_raises():
    # Should raise ValueError for empty list
    with pytest.raises(ValueError, match="at least one item"):
        validate_embeddings([]) # 1.53μs -> 1.54μs (0.715% slower)

def test_non_list_input_raises():
    # Should raise ValueError for non-list input (e.g., dict)
    with pytest.raises(ValueError, match="Expected embeddings to be a list"):
        validate_embeddings({"a": 1}) # 1.72μs -> 1.99μs (13.4% slower)

def test_tuple_input_raises():
    # Should raise ValueError for tuple input
    with pytest.raises(ValueError, match="Expected embeddings to be a list"):
        validate_embeddings((np.array([1, 2, 3]),)) # 1.82μs -> 1.88μs (2.77% slower)

def test_list_with_non_ndarray_raises():
    # Should raise ValueError if any element is not a numpy array
    arrs = [np.array([1, 2, 3]), [4, 5, 6]]  # Second element is a list, not ndarray
    with pytest.raises(ValueError, match="numpy array"):
        validate_embeddings(arrs) # 5.70μs -> 5.45μs (4.74% faster)

def test_list_with_all_non_ndarray_raises():
    # Should raise ValueError if all elements are not numpy arrays
    arrs = [[1, 2, 3], [4, 5, 6]]
    with pytest.raises(ValueError, match="numpy array"):
        validate_embeddings(arrs) # 4.30μs -> 4.30μs (0.047% faster)

def test_zero_dimensional_array_raises():
    # Should raise ValueError for 0-d array
    arrs = [np.array(42)]
    with pytest.raises(ValueError, match="0-dimensional"):
        validate_embeddings(arrs) # 5.46μs -> 4.52μs (20.7% faster)

def test_empty_ndarray_raises():
    # Should raise ValueError for empty 1-d array
    arrs = [np.array([], dtype=np.float32)]
    with pytest.raises(ValueError, match="no values at pos 0"):
        validate_embeddings(arrs) # 3.00μs -> 2.51μs (19.5% faster)

def test_multiple_embeddings_one_empty_raises():
    # Should raise ValueError if one embedding is empty
    arrs = [np.array([1, 2], dtype=np.int32), np.array([], dtype=np.float32)]
    with pytest.raises(ValueError, match="no values at pos 1"):
        validate_embeddings(arrs) # 5.19μs -> 3.19μs (62.4% faster)

def test_invalid_dtype_object_raises():
    # Should raise ValueError for object dtype
    arrs = [np.array([1, 2, 3], dtype=object)]
    with pytest.raises(ValueError, match="int or float"):
        validate_embeddings(arrs) # 48.6μs -> 44.7μs (8.82% faster)

def test_invalid_dtype_bool_raises():
    # Should raise ValueError for bool dtype
    arrs = [np.array([True, False], dtype=bool)]
    with pytest.raises(ValueError, match="int or float"):
        validate_embeddings(arrs) # 41.3μs -> 36.6μs (12.7% faster)

def test_invalid_dtype_uint8_raises():
    # Should raise ValueError for uint8 dtype
    arrs = [np.array([1, 2, 3], dtype=np.uint8)]
    with pytest.raises(ValueError, match="int or float"):
        validate_embeddings(arrs) # 81.4μs -> 64.5μs (26.3% faster)

def test_invalid_dtype_str_raises():
    # Should raise ValueError for string dtype
    arrs = [np.array(["a", "b"], dtype=str)]
    with pytest.raises(ValueError, match="int or float"):
        validate_embeddings(arrs) # 43.3μs -> 39.1μs (10.6% faster)

def test_embedding_with_extra_dimension_allowed():
    # A 2D array passes validation: the function only rejects 0-dimensional
    # arrays, so ndim > 1 is not explicitly checked.
    arrs = [np.array([[1, 2], [3, 4]], dtype=np.float32)]
    codeflash_output = validate_embeddings(arrs); result = codeflash_output # 3.31μs -> 2.23μs (48.3% faster)

def test_embedding_with_1_element():
    # Should work for 1D array with a single element
    arrs = [np.array([42], dtype=np.int32)]
    codeflash_output = validate_embeddings(arrs); result = codeflash_output # 3.50μs -> 2.06μs (70.0% faster)

def test_embedding_with_large_numbers():
    # Should work for large numbers in the embedding
    arrs = [np.array([1e10, -1e10], dtype=np.float64)]
    codeflash_output = validate_embeddings(arrs); result = codeflash_output # 3.36μs -> 2.11μs (59.4% faster)

def test_embedding_with_nan_inf():
    # Should work for nan/inf values (dtype is valid, values are not checked)
    arrs = [np.array([np.nan, np.inf], dtype=np.float32)]
    codeflash_output = validate_embeddings(arrs); result = codeflash_output # 3.17μs -> 1.98μs (60.1% faster)

def test_embedding_with_mixed_valid_dtypes():
    # Should work for a mix of valid dtypes
    arrs = [
        np.array([1, 2], dtype=np.int32),
        np.array([3.5, 4.5], dtype=np.float64),
        np.array([5], dtype=np.float16),
        np.array([6], dtype=np.int64),
    ]
    codeflash_output = validate_embeddings(arrs); result = codeflash_output # 5.28μs -> 2.55μs (107% faster)


def test_large_number_of_embeddings():
    # Test with 1000 embeddings, each of size 10
    arrs = [np.arange(10, dtype=np.float32) for _ in range(1000)]
    codeflash_output = validate_embeddings(arrs); result = codeflash_output # 247μs -> 105μs (135% faster)

def test_large_embedding_size():
    # Test with 10 embeddings, each of size 1000
    arrs = [np.arange(1000, dtype=np.float64) for _ in range(10)]
    codeflash_output = validate_embeddings(arrs); result = codeflash_output # 7.28μs -> 3.47μs (110% faster)

def test_large_number_of_embeddings_one_invalid():
    # 999 valid, 1 invalid (wrong dtype)
    arrs = [np.arange(10, dtype=np.float32) for _ in range(999)]
    arrs.append(np.array([1, 2, 3], dtype=object))
    with pytest.raises(ValueError, match="int or float"):
        validate_embeddings(arrs) # 292μs -> 149μs (95.9% faster)

def test_large_number_of_embeddings_one_empty():
    # 999 valid, 1 empty
    arrs = [np.arange(10, dtype=np.int64) for _ in range(999)]
    arrs.append(np.array([], dtype=np.int64))
    with pytest.raises(ValueError, match="no values at pos 999"):
        validate_embeddings(arrs) # 355μs -> 105μs (237% faster)


#------------------------------------------------
# imports
import numpy as np
import pytest  # used for our unit tests
from chromadb.api.types import validate_embeddings

# unit tests

# ---------------------- Basic Test Cases ----------------------

def test_valid_float32_embeddings():
    # Test with a valid list of float32 numpy arrays
    embeddings = [np.array([1.0, 2.0], dtype=np.float32), np.array([3.0, 4.0], dtype=np.float32)]
    codeflash_output = validate_embeddings(embeddings); result = codeflash_output # 5.41μs -> 2.90μs (86.9% faster)

def test_valid_int64_embeddings():
    # Test with a valid list of int64 numpy arrays
    embeddings = [np.array([1, 2], dtype=np.int64), np.array([3, 4], dtype=np.int64)]
    codeflash_output = validate_embeddings(embeddings); result = codeflash_output # 4.46μs -> 2.46μs (81.4% faster)

def test_valid_mixed_types_embeddings():
    # Test with a valid list of mixed int32 and float64 numpy arrays
    embeddings = [np.array([1, 2], dtype=np.int32), np.array([3.0, 4.0], dtype=np.float64)]
    codeflash_output = validate_embeddings(embeddings); result = codeflash_output # 4.35μs -> 2.26μs (92.6% faster)

def test_valid_float16_embeddings():
    # Test with a valid list of float16 numpy arrays
    embeddings = [np.array([1.0, 2.0], dtype=np.float16)]
    codeflash_output = validate_embeddings(embeddings); result = codeflash_output # 2.84μs -> 1.99μs (42.9% faster)

# ---------------------- Edge Test Cases ----------------------

def test_empty_embeddings_list():
    # Test with an empty list
    with pytest.raises(ValueError) as excinfo:
        validate_embeddings([]) # 1.62μs -> 1.48μs (9.25% faster)

def test_non_list_input():
    # Test with a non-list, non-numpy input (e.g., integer)
    with pytest.raises(ValueError) as excinfo:
        validate_embeddings(42) # 1.76μs -> 1.80μs (2.38% slower)

def test_non_array_in_list():
    # Test with a list containing a non-numpy array element
    embeddings = [np.array([1, 2], dtype=np.int32), [3, 4]]
    with pytest.raises(ValueError) as excinfo:
        validate_embeddings(embeddings) # 5.83μs -> 5.33μs (9.31% faster)

def test_0d_array_in_list():
    # Test with a 0-dimensional numpy array in the list
    embeddings = [np.array([1, 2], dtype=np.int32), np.array(5, dtype=np.float32)]
    with pytest.raises(ValueError) as excinfo:
        validate_embeddings(embeddings) # 9.31μs -> 6.26μs (48.6% faster)

def test_empty_array_in_list():
    # Test with a 1D numpy array of length zero
    embeddings = [np.array([], dtype=np.float32)]
    with pytest.raises(ValueError) as excinfo:
        validate_embeddings(embeddings) # 2.86μs -> 2.46μs (16.2% faster)

def test_invalid_dtype_in_array():
    # Test with an array of unsupported dtype (e.g., uint8)
    embeddings = [np.array([1, 2], dtype=np.uint8)]
    with pytest.raises(ValueError) as excinfo:
        validate_embeddings(embeddings) # 83.1μs -> 71.6μs (16.0% faster)

def test_object_dtype_in_array():
    # Test with an array of dtype=object
    embeddings = [np.array([1, "a"], dtype=object)]
    with pytest.raises(ValueError) as excinfo:
        validate_embeddings(embeddings) # 40.7μs -> 38.3μs (6.27% faster)


def test_2d_array_in_list():
    # Test with a 2D numpy array in the list (should pass as long as ndim != 0)
    embeddings = [np.array([[1, 2], [3, 4]], dtype=np.float32)]
    # Should pass, since ndim == 2 and dtype is valid
    codeflash_output = validate_embeddings(embeddings); result = codeflash_output # 4.87μs -> 2.60μs (87.5% faster)

def test_single_embedding_in_list():
    # Test with a list containing a single valid embedding
    embeddings = [np.array([1, 2, 3], dtype=np.float64)]
    codeflash_output = validate_embeddings(embeddings); result = codeflash_output # 3.89μs -> 2.19μs (77.4% faster)

def test_single_0d_embedding_in_list():
    # Test with a list containing a single 0d embedding
    embeddings = [np.array(42, dtype=np.int32)]
    with pytest.raises(ValueError) as excinfo:
        validate_embeddings(embeddings) # 5.47μs -> 5.09μs (7.45% faster)

# ---------------------- Large Scale Test Cases ----------------------

def test_large_number_of_embeddings():
    # Test with a large list of valid embeddings
    embeddings = [np.array([float(i), float(i+1)], dtype=np.float32) for i in range(1000)]
    codeflash_output = validate_embeddings(embeddings); result = codeflash_output # 247μs -> 105μs (135% faster)

def test_large_embeddings_dimension():
    # Test with embeddings that have large dimensions
    embeddings = [np.ones(1000, dtype=np.float64) for _ in range(10)]
    codeflash_output = validate_embeddings(embeddings); result = codeflash_output # 6.76μs -> 3.43μs (96.9% faster)

def test_large_mixed_types_embeddings():
    # Test with a large list mixing int32 and float64 embeddings
    embeddings = []
    for i in range(500):
        if i % 2 == 0:
            embeddings.append(np.ones(10, dtype=np.int32))
        else:
            embeddings.append(np.ones(10, dtype=np.float64))
    codeflash_output = validate_embeddings(embeddings); result = codeflash_output # 157μs -> 52.8μs (198% faster)

def test_large_invalid_dtype_embeddings():
    # Test with a large list where one embedding has invalid dtype
    embeddings = [np.ones(10, dtype=np.float32) for _ in range(999)]
    embeddings.append(np.ones(10, dtype=np.bool_))
    with pytest.raises(ValueError) as excinfo:
        validate_embeddings(embeddings) # 297μs -> 151μs (96.4% faster)

def test_large_some_empty_embeddings():
    # Test with a large list where one embedding is empty
    embeddings = [np.ones(10, dtype=np.float32) for _ in range(999)]
    embeddings.append(np.array([], dtype=np.float32))
    with pytest.raises(ValueError) as excinfo:
        validate_embeddings(embeddings) # 249μs -> 105μs (136% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from chromadb.api.types import validate_embeddings
import pytest

def test_validate_embeddings():
    with pytest.raises(ValueError, match='Expected\\ embeddings\\ to\\ be\\ a\\ list\\ with\\ at\\ least\\ one\\ item,\\ got\\ 0\\ embeddings'):
        validate_embeddings([])
🔎 Concolic Coverage Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| codeflash_concolic_p_g0hne0/tmpb1qtkwhf/test_concolic_coverage.py::test_validate_embeddings | 2.12μs | 2.00μs | 5.79% ✅ |

To edit these changes, `git checkout codeflash/optimize-validate_embeddings-mh7d8pem` and push.

Codeflash

codeflash-ai bot requested a review from mashraf-222 on October 26, 2025 at 07:07
codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels on Oct 26, 2025