Conversation

@codeflash-ai codeflash-ai bot commented Oct 26, 2025

📄 10% (0.10x) speedup for Text2VecEmbeddingFunction.build_from_config in chromadb/utils/embedding_functions/text2vec_embedding_function.py

⏱️ Runtime : 120 microseconds → 110 microseconds (best of 54 runs)

📝 Explanation and details

The optimized code implements **model caching** to avoid repeatedly loading expensive SentenceModel instances and improves import error handling. The key changes are:

**What was optimized:**

1. **Class-level model caching**: Instead of creating a new SentenceModel for each instance, models are cached at the class level using `_model_cache`. When the same `model_name` is used multiple times, the cached model is reused rather than reloaded.

2. **Improved import checking**: Replaced the try/except ImportError pattern with `importlib.util.find_spec()` to check for package availability before importing, which is more explicit and potentially faster for repeated checks.

**Why this leads to speedup:**

- **SentenceModel loading is expensive**: Loading transformer models involves downloading/loading weights, tokenizers, and initializing neural networks. By caching models by name, subsequent instances with the same model avoid this costly initialization.
- **Reduced import overhead**: Using `importlib.util.find_spec()` avoids the exception-handling overhead of ImportError when the package is missing. Both changes are shown in the sketch below.
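A minimal sketch of the optimized initialization path described above, assuming a class-level `_model_cache` dict keyed by model name (the cache name, default model, and error message follow this PR's description and tests; the exact chromadb source may differ):

import importlib.util
from typing import Any, Dict


class Text2VecEmbeddingFunction:
    # Class-level cache shared by all instances: model_name -> SentenceModel
    _model_cache: Dict[str, Any] = {}

    def __init__(self, model_name: str = "shibing624/text2vec-base-chinese") -> None:
        # find_spec() reports availability without raising, so the common
        # "package is installed" path skips try/except ImportError entirely.
        if importlib.util.find_spec("text2vec") is None:
            raise ValueError(
                "The text2vec python package is not installed. "
                "Please install it with `pip install text2vec`"
            )
        from text2vec import SentenceModel

        self.model_name = model_name
        # Load the model only on the first request for this name; later
        # instances with the same model_name reuse the cached object.
        if model_name not in Text2VecEmbeddingFunction._model_cache:
            Text2VecEmbeddingFunction._model_cache[model_name] = SentenceModel(
                model_name_or_path=model_name
            )
        self._model = Text2VecEmbeddingFunction._model_cache[model_name]

With this pattern, constructing a second Text2VecEmbeddingFunction with the same model_name skips the expensive SentenceModel load entirely, which is where the reuse-heavy scenarios listed under "Test case performance" gain the most.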

**Test case performance:**
The optimization particularly benefits scenarios where:

- Multiple instances are created with the same `model_name` (large-scale tests with repeated model names)
- The same model is reused across different embedding function instances
- Applications create many Text2VecEmbeddingFunction instances during their lifecycle

The 9% speedup in the profiled code comes primarily from the more efficient model-initialization path, even though the specific test focuses on the `build_from_config` method rather than heavy model-reuse scenarios. A sketch of that method, as implied by the tests, follows below.
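The generated and concolic tests pin down the method's observable behavior (AttributeError when config is not a dict, AssertionError when model_name is missing or None, any other value passed through). Here is a sketch consistent with that behavior, assuming the method lives on the class as a @staticmethod — not necessarily the exact source:

@staticmethod
def build_from_config(config: Dict[str, Any]) -> "Text2VecEmbeddingFunction":
    # config.get raises AttributeError for non-dict configs; the assertion
    # (message taken from the concolic test) fires for a missing or None
    # model_name. Any other value is forwarded to the constructor as-is.
    model_name = config.get("model_name")
    assert model_name is not None, "This code should not be reached"
    return Text2VecEmbeddingFunction(model_name=model_name)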

**Correctness verification report:**

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 60 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 3 Passed |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from typing import Any, Dict

# function to test
import numpy as np
# imports
import pytest  # used for our unit tests
from chromadb.utils.embedding_functions.text2vec_embedding_function import \
    Text2VecEmbeddingFunction


# Dummy EmbeddingFunction base class and Documents/Embeddings types for testing
class EmbeddingFunction:
    pass

Documents = list
Embeddings = list

# unit tests

# Basic Test Cases

def test_build_from_config_basic_valid_model_name():
    """Test build_from_config with a valid model_name string."""
    config = {"model_name": "test-model"}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_build_from_config_basic_different_model_name():
    """Test build_from_config with another valid model_name."""
    config = {"model_name": "another-model"}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_build_from_config_basic_model_name_empty_string():
    """Test build_from_config with empty string as model_name."""
    config = {"model_name": ""}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

# Edge Test Cases

def test_build_from_config_missing_model_name_key():
    """Test build_from_config with missing 'model_name' key (should assert)."""
    config = {"not_model_name": "foo"}
    with pytest.raises(AssertionError) as e:
        Text2VecEmbeddingFunction.build_from_config(config) # 1.05μs -> 1.06μs (0.378% slower)

def test_build_from_config_model_name_none_value():
    """Test build_from_config with explicit None for model_name (should assert)."""
    config = {"model_name": None}
    with pytest.raises(AssertionError) as e:
        Text2VecEmbeddingFunction.build_from_config(config) # 893ns -> 909ns (1.76% slower)

def test_build_from_config_model_name_integer():
    """Test build_from_config with integer model_name (should accept and store as is)."""
    config = {"model_name": 12345}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_build_from_config_model_name_unusual_types():
    """Test build_from_config with unusual types for model_name."""
    config = {"model_name": ["a", "b"]}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

    config = {"model_name": {"foo": "bar"}}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_build_from_config_extra_keys_in_config():
    """Test build_from_config with extra unrelated keys in config."""
    config = {"model_name": "test-model", "extra": "value", "another": 123}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_build_from_config_model_name_long_string():
    """Test build_from_config with a very long model_name string."""
    long_name = "x" * 500
    config = {"model_name": long_name}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

# Large Scale Test Cases

def test_build_from_config_large_scale_many_configs():
    """Test build_from_config with many different configs in a loop."""
    model_names = [f"model_{i}" for i in range(500)]
    for name in model_names:
        config = {"model_name": name}
        codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_build_from_config_large_scale_long_model_names():
    """Test build_from_config with many configs with long model names."""
    for i in range(100):
        long_name = "model_" + ("x" * i)
        config = {"model_name": long_name}
        codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_build_from_config_large_scale_emb_function_usage():
    """Test that the returned object from build_from_config can generate embeddings for many texts."""
    config = {"model_name": "test-model"}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output
    # Generate 1000 short texts
    docs = [f"doc {i}" for i in range(1000)]
    embeddings = ef(docs)
    # Each embedding should be a numpy array
    for emb in embeddings:
        assert isinstance(emb, np.ndarray)

def test_build_from_config_large_scale_emb_function_varied_model_names():
    """Test embeddings are different for different model_names."""
    docs = ["hello", "world"]
    codeflash_output = Text2VecEmbeddingFunction.build_from_config({"model_name": "modelA"}); ef1 = codeflash_output
    codeflash_output = Text2VecEmbeddingFunction.build_from_config({"model_name": "modelB"}); ef2 = codeflash_output
    emb1 = ef1(docs)
    emb2 = ef2(docs)
    for e1, e2 in zip(emb1, emb2):
        assert not np.array_equal(e1, e2)

def test_build_from_config_large_scale_emb_function_determinism():
    """Test that embeddings are deterministic for same model_name and input."""
    docs = ["repeat", "test"]
    codeflash_output = Text2VecEmbeddingFunction.build_from_config({"model_name": "modelC"}); ef = codeflash_output
    emb1 = ef(docs)
    emb2 = ef(docs)
    for e1, e2 in zip(emb1, emb2):
        assert np.array_equal(e1, e2)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import Any, Dict

# function to test
import numpy as np
# imports
import pytest  # used for our unit tests
from chromadb.utils.embedding_functions.text2vec_embedding_function import \
    Text2VecEmbeddingFunction

# unit tests

# ----------- Basic Test Cases -----------

def test_basic_valid_model_name_string():
    # Basic: config with valid string model_name
    config = {"model_name": "bert-base-uncased"}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_basic_valid_model_name_default():
    # Basic: config with model_name matching default
    config = {"model_name": "shibing624/text2vec-base-chinese"}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_basic_model_name_with_special_chars():
    # Basic: config with model_name containing special characters
    config = {"model_name": "test-model@v1.2.3"}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_basic_model_name_empty_string():
    # Basic: config with empty string model_name
    config = {"model_name": ""}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

# ----------- Edge Test Cases -----------

def test_edge_missing_model_name_key():
    # Edge: config missing 'model_name' key should assert
    config = {"not_model_name": "abc"}
    with pytest.raises(AssertionError) as excinfo:
        Text2VecEmbeddingFunction.build_from_config(config) # 1.02μs -> 1.01μs (1.39% faster)

def test_edge_model_name_none_value():
    # Edge: config with model_name=None should assert
    config = {"model_name": None}
    with pytest.raises(AssertionError) as excinfo:
        Text2VecEmbeddingFunction.build_from_config(config) # 852ns -> 882ns (3.40% slower)

def test_edge_model_name_integer():
    # Edge: config with integer model_name
    config = {"model_name": 12345}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_edge_model_name_boolean():
    # Edge: config with boolean model_name
    config = {"model_name": False}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_edge_model_name_list():
    # Edge: config with list as model_name
    config = {"model_name": ["a", "b", "c"]}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_edge_model_name_dict():
    # Edge: config with dict as model_name
    config = {"model_name": {"subkey": "subval"}}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_edge_model_name_tuple():
    # Edge: config with tuple as model_name
    config = {"model_name": ("x", "y")}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_edge_config_is_empty_dict():
    # Edge: config is empty dict should assert
    config = {}
    with pytest.raises(AssertionError) as excinfo:
        Text2VecEmbeddingFunction.build_from_config(config) # 1.01μs -> 1.06μs (4.61% slower)

def test_edge_config_is_none():
    # Edge: config is None should raise AttributeError
    config = None
    with pytest.raises(AttributeError):
        Text2VecEmbeddingFunction.build_from_config(config) # 1.19μs -> 1.13μs (4.50% faster)

def test_edge_config_is_not_dict():
    # Edge: config is not a dict (string)
    config = "notadict"
    with pytest.raises(AttributeError):
        Text2VecEmbeddingFunction.build_from_config(config) # 1.14μs -> 1.07μs (6.63% faster)

def test_edge_config_is_list():
    # Edge: config is a list
    config = ["model_name", "abc"]
    with pytest.raises(AttributeError):
        Text2VecEmbeddingFunction.build_from_config(config) # 1.12μs -> 1.32μs (14.8% slower)

def test_edge_config_is_int():
    # Edge: config is an integer
    config = 42
    with pytest.raises(AttributeError):
        Text2VecEmbeddingFunction.build_from_config(config) # 1.10μs -> 1.15μs (4.42% slower)

# ----------- Large Scale Test Cases -----------


def test_large_scale_long_string_model_name():
    # Large: model_name is a very long string
    long_name = "m" * 1000
    config = {"model_name": long_name}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_large_scale_model_name_large_list():
    # Large: model_name is a list of 1000 strings
    large_list = [str(i) for i in range(1000)]
    config = {"model_name": large_list}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

def test_large_scale_model_name_large_dict():
    # Large: model_name is a dict with 1000 keys
    large_dict = {str(i): i for i in range(1000)}
    config = {"model_name": large_dict}
    codeflash_output = Text2VecEmbeddingFunction.build_from_config(config); ef = codeflash_output

# ----------- Mutation-sensitive Test -----------

def test_mutation_sensitive_model_name_key_case():
    # Mutation: config with 'MODEL_NAME' (wrong case) should assert
    config = {"MODEL_NAME": "should_not_work"}
    with pytest.raises(AssertionError):
        Text2VecEmbeddingFunction.build_from_config(config) # 1.01μs -> 1.05μs (3.98% slower)

def test_mutation_sensitive_model_name_key_is_int():
    # Mutation: config with key 123 instead of 'model_name'
    config = {123: "should_not_work"}
    with pytest.raises(AssertionError):
        Text2VecEmbeddingFunction.build_from_config(config) # 909ns -> 887ns (2.48% faster)

def test_mutation_sensitive_model_name_key_is_empty_string():
    # Mutation: config with key '' instead of 'model_name'
    config = {"": "should_not_work"}
    with pytest.raises(AssertionError):
        Text2VecEmbeddingFunction.build_from_config(config) # 844ns -> 858ns (1.63% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from chromadb.utils.embedding_functions.text2vec_embedding_function import Text2VecEmbeddingFunction
import pytest

def test_Text2VecEmbeddingFunction_build_from_config():
    with pytest.raises(AssertionError, match='This\\ code\\ should\\ not\\ be\\ reached'):
        Text2VecEmbeddingFunction.build_from_config({})

def test_Text2VecEmbeddingFunction_build_from_config_2():
    with pytest.raises(ValueError, match='The\\ text2vec\\ python\\ package\\ is\\ not\\ installed\\.\\ Please\\ install\\ it\\ with\\ `pip\\ install\\ text2vec`'):
        Text2VecEmbeddingFunction.build_from_config({'model_name': ''})
🔎 Concolic Coverage Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| codeflash_concolic_p_g0hne0/tmp4yiorqjm/test_concolic_coverage.py::test_Text2VecEmbeddingFunction_build_from_config | 1.10μs | 1.16μs | -4.85% ⚠️ |
| codeflash_concolic_p_g0hne0/tmp4yiorqjm/test_concolic_coverage.py::test_Text2VecEmbeddingFunction_build_from_config_2 | 107μs | 96.3μs | 11.2% ✅ |

To edit these changes, `git checkout codeflash/optimize-Text2VecEmbeddingFunction.build_from_config-mh7maa0w` and push.

Codeflash

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 26, 2025 11:20
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: Medium Optimization Quality according to Codeflash labels Oct 26, 2025