
Conversation


@codeflash-ai codeflash-ai bot commented Oct 26, 2025

📄 45% (0.45x) speedup for validate_metadata in chromadb/api/types.py

⏱️ Runtime: 1.22 milliseconds → 844 microseconds (best of 40 runs)

📝 Explanation and details

The optimization achieves a 45% speedup through several key performance improvements:

1. Early Exit Optimization

  • Moved the None check to the very beginning as a fast-path exit, eliminating unnecessary type checking for the most common case
  • Reordered validation logic to fail-fast on the most likely error conditions first
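
As an illustration only, a minimal sketch of that fast-path shape (hypothetical helper name and simplified signature; the empty-dict and reserved-key messages match the tests below, the rest is not the actual chromadb source):

```python
from typing import Any, Dict, Optional

def validate_metadata_sketch(metadata: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    # Fast-path exit: None is a very common input, so handle it before
    # doing any other type inspection.
    if metadata is None:
        return metadata
    # Fail fast on the next most likely problems before touching entries:
    # wrong container type, then an empty dict.
    if not isinstance(metadata, dict):
        raise ValueError(f"Expected metadata to be a dict or None, got {type(metadata).__name__}")
    if not metadata:
        raise ValueError("Expected metadata to be a non-empty dict, got 0 metadata attributes")
    # ... per-entry key/value validation would follow here ...
    return metadata
```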

2. Reduced Global Lookups

  • Pre-computed commonly used values (allowed_types, reserved_key, sparse_vector_type) outside the loop, avoiding repeated global variable lookups during iteration
  • This is especially beneficial for large metadata dictionaries where these lookups would occur thousands of times
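
A rough sketch of the hoisting pattern (the local names mirror the description above; the key-error message is an assumption, while the reserved-key message follows the tests below):

```python
from chromadb.base_types import SparseVector

META_KEY_CHROMA_DOCUMENT = "chroma:document"

def _validate_entries_sketch(metadata: dict) -> None:
    # Bind frequently used globals to locals once, outside the loop:
    # local lookups (LOAD_FAST) are cheaper than repeated LOAD_GLOBAL
    # lookups on every iteration over a large dict.
    allowed_types = (bool, int, float, str)
    reserved_key = META_KEY_CHROMA_DOCUMENT
    sparse_vector_type = SparseVector

    for key, value in metadata.items():
        if key == reserved_key:
            raise ValueError(f"Expected metadata to not contain the reserved key {reserved_key}")
        if not isinstance(key, str):
            raise TypeError(f"Expected metadata key to be a str, got {key!r}")
        # ... value checks using allowed_types / sparse_vector_type go here ...
```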

3. Faster Type Checking

  • Replaced isinstance(value, SparseVector) with type(value) is sparse_vector_type for exact type matching, which is faster than inheritance-aware isinstance
  • Used type(value) is bool before the tuple check to handle boolean values more efficiently
  • Combined type checks into a single isinstance(value, allowed_types) call using a pre-computed tuple
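
For example, a per-value check ordered this way (illustrative; the None handling and error message are assumptions based on the tests below):

```python
from chromadb.base_types import SparseVector

def _check_value_sketch(key: str, value: object) -> None:
    # In the real loop these would be pre-computed once (see point 2 above).
    allowed_types = (bool, int, float, str)
    sparse_vector_type = SparseVector

    # Exact-type tests (`type(x) is T`) skip the MRO walk that
    # inheritance-aware isinstance() performs, which adds up in a hot loop.
    if value is None or type(value) is bool:
        return
    if type(value) is sparse_vector_type:
        return
    # A single isinstance() call against a pre-built tuple replaces a
    # chain of separate per-type checks.
    if isinstance(value, allowed_types):
        return
    raise ValueError(
        f"Expected metadata value for key {key!r} to be a str, int, float, bool, "
        f"SparseVector or None, got {type(value).__name__}"
    )
```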

4. Optimized Empty Dictionary Check

  • Changed len(metadata) == 0 to not metadata, which is a faster truthiness check in Python
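
If you want to sanity-check that claim yourself, a quick timeit comparison (illustrative only; absolute numbers depend on interpreter and hardware):

```python
import timeit

setup = "metadata = {'key%d' % i: i for i in range(10)}"

# Truthiness check vs. explicit length comparison on a small non-empty dict.
t_not = timeit.timeit("not metadata", setup=setup, number=1_000_000)
t_len = timeit.timeit("len(metadata) == 0", setup=setup, number=1_000_000)

print(f"not metadata:       {t_not:.3f}s")
print(f"len(metadata) == 0: {t_len:.3f}s")
```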

The optimizations are particularly effective for large-scale test cases where the performance gains are most pronounced:

  • Large metadata validation (1000+ entries): 55-61% faster
  • Mixed type validation: 40-43% faster
  • Error detection in large datasets: 50-53% faster

For small metadata dictionaries the improvements are modest (1-8%), but the optimized code preserves the same correctness and error-handling behavior while being significantly faster on larger inputs.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 54 Passed |
| 🌀 Generated Regression Tests | 48 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 3 Passed |
| 📊 Tests Coverage | 100.0% |
⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| test_api.py::test_sparse_vector_dict_format_normalization | 2.31μs | 2.32μs | -0.689% ⚠️ |
| test_api.py::test_sparse_vector_in_metadata_validation | 14.5μs | 14.2μs | 2.34% ✅ |

🌀 Generated Regression Tests and Runtime
import pytest
from chromadb.api.types import validate_metadata


# --- Dummy SparseVector class for testing ---
class SparseVector:
    def __init__(self, indices, values):
        # Minimal validation to mimic real SparseVector
        if not isinstance(indices, list) or not isinstance(values, list):
            raise TypeError("indices and values must be lists")
        if len(indices) != len(values):
            raise ValueError("indices and values must be of same length")
        self.indices = indices
        self.values = values

    def __eq__(self, other):
        return (
            isinstance(other, SparseVector)
            and self.indices == other.indices
            and self.values == other.values
        )

# --- Function under test ---
META_KEY_CHROMA_DOCUMENT = "chroma:document"
from chromadb.api.types import validate_metadata

# --- Unit Tests ---

# ----------- BASIC TEST CASES -----------

def test_metadata_none():
    """Should accept None as valid metadata."""
    codeflash_output = validate_metadata(None) # 458ns -> 459ns (0.218% slower)

def test_metadata_single_str():
    """Should accept a dict with a single string key and string value."""
    meta = {"foo": "bar"}
    codeflash_output = validate_metadata(meta) # 1.53μs -> 1.51μs (1.25% faster)

def test_metadata_single_int():
    """Should accept a dict with a single string key and int value."""
    meta = {"foo": 123}
    codeflash_output = validate_metadata(meta) # 1.62μs -> 1.60μs (1.19% faster)

def test_metadata_single_float():
    """Should accept a dict with a single string key and float value."""
    meta = {"foo": 3.14}
    codeflash_output = validate_metadata(meta) # 1.57μs -> 1.53μs (3.15% faster)

def test_metadata_single_bool():
    """Should accept a dict with a single string key and bool value."""
    meta = {"foo": True}
    codeflash_output = validate_metadata(meta) # 1.26μs -> 1.37μs (8.05% slower)

def test_metadata_single_none_value():
    """Should accept a dict with a single string key and None value."""
    meta = {"foo": None}
    codeflash_output = validate_metadata(meta) # 1.52μs -> 1.64μs (7.02% slower)



def test_metadata_empty_dict():
    """Should raise ValueError for empty dict."""
    with pytest.raises(ValueError) as excinfo:
        validate_metadata({}) # 1.46μs -> 1.57μs (7.21% slower)

def test_metadata_non_dict_type():
    """Should raise ValueError if metadata is not a dict or None."""
    for bad in [[], (), 123, 3.14, "string", set()]:
        with pytest.raises(ValueError) as excinfo:
            validate_metadata(bad)

def test_metadata_reserved_key():
    """Should raise ValueError if reserved key is present."""
    meta = {META_KEY_CHROMA_DOCUMENT: "something"}
    with pytest.raises(ValueError) as excinfo:
        validate_metadata(meta) # 1.65μs -> 1.81μs (8.72% slower)

def test_metadata_non_string_key():
    """Should raise TypeError if a key is not a string."""
    meta = {42: "value"}
    with pytest.raises(TypeError) as excinfo:
        validate_metadata(meta) # 2.34μs -> 2.44μs (3.94% slower)

def test_metadata_tuple_key():
    """Should raise TypeError if a key is a tuple."""
    meta = {(1, 2): "value"}
    with pytest.raises(TypeError) as excinfo:
        validate_metadata(meta) # 3.23μs -> 3.51μs (8.00% slower)

def test_metadata_invalid_value_type_list():
    """Should raise ValueError if a value is a list."""
    meta = {"foo": [1, 2, 3]}
    with pytest.raises(ValueError) as excinfo:
        validate_metadata(meta) # 3.48μs -> 3.44μs (1.13% faster)

def test_metadata_invalid_value_type_dict():
    """Should raise ValueError if a value is a dict."""
    meta = {"foo": {"bar": "baz"}}
    with pytest.raises(ValueError) as excinfo:
        validate_metadata(meta) # 3.86μs -> 3.95μs (2.35% slower)

def test_metadata_invalid_value_type_object():
    """Should raise ValueError if a value is an arbitrary object."""
    class Dummy: pass
    meta = {"foo": Dummy()}
    with pytest.raises(ValueError) as excinfo:
        validate_metadata(meta) # 4.03μs -> 4.20μs (4.10% slower)

def test_metadata_bool_vs_int():
    """Should distinguish bool from int and allow both."""
    meta = {"is_active": True, "count": 1}
    codeflash_output = validate_metadata(meta) # 1.87μs -> 1.90μs (1.42% slower)

def test_metadata_sparse_vector_invalid():
    """Should raise from SparseVector if constructed with bad args."""
    with pytest.raises(TypeError):
        SparseVector("notalist", [1.0])
    with pytest.raises(TypeError):
        SparseVector([1], "notalist")
    with pytest.raises(ValueError):
        SparseVector([1,2], [3.0])

def test_metadata_value_is_set():
    """Should raise ValueError if value is a set."""
    meta = {"foo": set([1,2])}
    with pytest.raises(ValueError):
        validate_metadata(meta) # 4.36μs -> 4.50μs (3.18% slower)

def test_metadata_key_is_none():
    """Should raise TypeError if key is None."""
    meta = {None: "value"}
    with pytest.raises(TypeError):
        validate_metadata(meta) # 2.28μs -> 2.45μs (7.06% slower)

# ----------- LARGE SCALE TEST CASES -----------

def test_metadata_large_number_of_entries():
    """Should accept a large dict with valid types."""
    meta = {f"key{i}": i for i in range(1000)}
    codeflash_output = validate_metadata(meta) # 126μs -> 81.1μs (55.4% faster)


def test_metadata_large_mixed_types():
    """Should accept a dict with 1000 mixed valid types."""
    meta = {}
    for i in range(250):
        meta[f"str{i}"] = str(i)
        meta[f"int{i}"] = i
        meta[f"float{i}"] = float(i)
        meta[f"bool{i}"] = i % 2 == 0
    codeflash_output = validate_metadata(meta) # 114μs -> 81.1μs (40.6% faster)

def test_metadata_large_invalid_key():
    """Should fail fast if any key is not a string, even in large dict."""
    meta = {f"key{i}": i for i in range(999)}
    meta[42] = "badkey"
    with pytest.raises(TypeError):
        validate_metadata(meta) # 125μs -> 82.3μs (53.0% faster)

def test_metadata_large_invalid_value():
    """Should fail fast if any value is invalid, even in large dict."""
    meta = {f"key{i}": i for i in range(999)}
    meta["bad"] = set([1,2])
    with pytest.raises(ValueError):
        validate_metadata(meta) # 128μs -> 84.7μs (51.8% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest
from chromadb.api.types import validate_metadata


# --- Define SparseVector and Metadata for testing ---
# Minimal stub for SparseVector, as expected by validate_metadata
class SparseVector:
    def __init__(self, indices, values):
        # indices: list of int, values: list of float
        if not isinstance(indices, list) or not isinstance(values, list):
            raise TypeError("indices and values must be lists")
        if len(indices) != len(values):
            raise ValueError("indices and values must be of same length")
        if not all(isinstance(i, int) for i in indices):
            raise TypeError("All indices must be ints")
        if not all(isinstance(v, float) for v in values):
            raise TypeError("All values must be floats")
        self.indices = indices
        self.values = values

# Metadata is just a type alias for dict[str, Any]
Metadata = dict

# --- Function to test ---
META_KEY_CHROMA_DOCUMENT = "chroma:document"
from chromadb.api.types import validate_metadata

# --- Unit Tests ---

# 1. Basic Test Cases


def test_valid_metadata_single_entry():
    # Test with a single valid key/value
    metadata = {"name": "Alice"}
    codeflash_output = validate_metadata(metadata) # 1.80μs -> 1.66μs (8.24% faster)

def test_valid_metadata_bool_false():
    # Test with bool False
    metadata = {"active": False}
    codeflash_output = validate_metadata(metadata) # 1.29μs -> 1.45μs (11.3% slower)


def test_valid_metadata_none():
    # Test with None metadata
    codeflash_output = validate_metadata(None) # 608ns -> 516ns (17.8% faster)

# 2. Edge Test Cases

def test_metadata_not_dict_or_none():
    # Test with metadata as a list (invalid type)
    with pytest.raises(ValueError):
        validate_metadata(["not", "a", "dict"]) # 1.53μs -> 1.59μs (3.71% slower)

def test_metadata_empty_dict():
    # Test with empty dict
    with pytest.raises(ValueError):
        validate_metadata({}) # 1.41μs -> 1.46μs (3.44% slower)

def test_metadata_reserved_key():
    # Test with reserved key
    metadata = {META_KEY_CHROMA_DOCUMENT: "should fail"}
    with pytest.raises(ValueError):
        validate_metadata(metadata) # 1.70μs -> 1.93μs (11.9% slower)

def test_metadata_non_str_key_int():
    # Test with non-str key (int)
    metadata = {1: "value"}
    with pytest.raises(TypeError):
        validate_metadata(metadata) # 2.35μs -> 2.54μs (7.34% slower)

def test_metadata_non_str_key_tuple():
    # Test with non-str key (tuple)
    metadata = {(1,2): "value"}
    with pytest.raises(TypeError):
        validate_metadata(metadata) # 3.31μs -> 3.58μs (7.64% slower)

def test_metadata_invalid_value_type_list():
    # Test with invalid value type (list)
    metadata = {"key": [1,2,3]}
    with pytest.raises(ValueError):
        validate_metadata(metadata) # 3.49μs -> 3.52μs (0.908% slower)

def test_metadata_invalid_value_type_dict():
    # Test with invalid value type (dict)
    metadata = {"key": {"nested": "dict"}}
    with pytest.raises(ValueError):
        validate_metadata(metadata) # 3.72μs -> 3.82μs (2.59% slower)

def test_metadata_invalid_value_type_object():
    # Test with invalid value type (custom object)
    class Dummy: pass
    metadata = {"key": Dummy()}
    with pytest.raises(ValueError):
        validate_metadata(metadata) # 3.98μs -> 4.27μs (6.93% slower)

def test_metadata_sparsevector_invalid_indices_type():
    # Test SparseVector with non-int indices
    with pytest.raises(TypeError):
        SparseVector(["a", 2], [0.5, 1.5])

def test_metadata_sparsevector_invalid_values_type():
    # Test SparseVector with non-float values
    with pytest.raises(TypeError):
        SparseVector([1, 2], ["x", 1.5])

def test_metadata_sparsevector_mismatched_length():
    # Test SparseVector with mismatched indices/values
    with pytest.raises(ValueError):
        SparseVector([1], [0.5, 1.5])

def test_metadata_value_bool_vs_int():
    # Test bool and int distinction (bool is allowed)
    metadata = {"is_valid": True, "count": 1}
    codeflash_output = validate_metadata(metadata) # 2.10μs -> 2.06μs (2.18% faster)

def test_metadata_value_none():
    # Test value as None
    metadata = {"optional": None}
    codeflash_output = validate_metadata(metadata) # 1.57μs -> 1.62μs (3.09% slower)

# 3. Large Scale Test Cases

def test_large_metadata_all_str():
    # Test with 1000 string key/value pairs
    metadata = {f"key_{i}": f"value_{i}" for i in range(1000)}
    codeflash_output = validate_metadata(metadata) # 114μs -> 71.2μs (61.3% faster)


def test_large_metadata_with_none_values():
    # Test with 1000 keys, all values None
    metadata = {f"key_{i}": None for i in range(1000)}
    codeflash_output = validate_metadata(metadata) # 143μs -> 100μs (42.9% faster)

def test_large_metadata_invalid_key():
    # Test with 999 valid keys, 1 invalid key (int)
    metadata = {f"key_{i}": i for i in range(999)}
    metadata[12345] = "bad_key"
    with pytest.raises(TypeError):
        validate_metadata(metadata) # 127μs -> 82.7μs (53.5% faster)

def test_large_metadata_invalid_value():
    # Test with 999 valid values, 1 invalid value (list)
    metadata = {f"key_{i}": i for i in range(999)}
    metadata["bad_value"] = [1,2,3]
    with pytest.raises(ValueError):
        validate_metadata(metadata) # 126μs -> 83.3μs (52.3% faster)

def test_large_metadata_reserved_key():
    # Test with 999 valid keys, 1 reserved key
    metadata = {f"key_{i}": i for i in range(999)}
    metadata[META_KEY_CHROMA_DOCUMENT] = "reserved"
    with pytest.raises(ValueError):
        validate_metadata(metadata) # 125μs -> 82.0μs (52.4% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from chromadb.api.types import validate_metadata
from chromadb.base_types import SparseVector
import pytest

def test_validate_metadata():
    with pytest.raises(ValueError, match='Expected\\ metadata\\ to\\ not\\ contain\\ the\\ reserved\\ key\\ chroma:document'):
        validate_metadata({'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00': None, 'chroma:document': 0.0})

def test_validate_metadata_2():
    with pytest.raises(ValueError, match='Expected\\ metadata\\ to\\ be\\ a\\ non\\-empty\\ dict,\\ got\\ 0\\ metadata\\ attributes'):
        validate_metadata({})

def test_validate_metadata_3():
    validate_metadata({'': SparseVector([], [])})

🔎 Concolic Coverage Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| codeflash_concolic_p_g0hne0/tmppseltnm3/test_concolic_coverage.py::test_validate_metadata | 2.65μs | 2.56μs | 3.51% ✅ |
| codeflash_concolic_p_g0hne0/tmppseltnm3/test_concolic_coverage.py::test_validate_metadata_2 | 1.39μs | 1.45μs | -4.41% ⚠️ |
| codeflash_concolic_p_g0hne0/tmppseltnm3/test_concolic_coverage.py::test_validate_metadata_3 | 1.17μs | 1.44μs | -18.9% ⚠️ |

To edit these changes, run `git checkout codeflash/optimize-validate_metadata-mh7c66kv` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 26, 2025 06:37
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Oct 26, 2025