How to Add a New Embedding Model Implementation

This guide demonstrates how to add support for a new embedding model implementation, using OpenAI as an example. The implementation is defined in embedding_model_configuration.py.

Step 1: Add Dependencies

Add the required packages to pyproject.toml:

...
llama-index-embeddings-openai = "0.2.4"
...

Step 2: Define the Embedding Model Provider

In embedding_model_configuration.py, add the new provider to the EmbeddingModelProviderNames enumeration:

class EmbeddingModelProviderNames(str, Enum):
    ...
    OPENAI = "openai"
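
Because the enumeration subclasses str, its members compare equal to their plain-string values, which is what lets "provider": "openai" in a JSON configuration resolve to the enum member. A minimal, self-contained sketch:

```python
from enum import Enum

class EmbeddingModelProviderNames(str, Enum):
    # Other providers elided; only the new member is shown.
    OPENAI = "openai"

# str-valued members compare equal to their raw strings, so values read
# from JSON configurations map directly onto enum members.
print(EmbeddingModelProviderNames.OPENAI == "openai")  # True
print(EmbeddingModelProviderNames("openai").name)      # OPENAI
```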

Step 3: Configure Embedding Model Secrets

Create a secrets class for the new provider:

class OpenAIEmbeddingModelSecrets(BaseSettings):
    model_config = SettingsConfigDict(
        env_file_encoding="utf-8",
        env_prefix="RAG__EMBEDDING_MODELS__OPEN_AI__",
        env_nested_delimiter="__",
        extra="ignore",
    )

    api_key: Optional[SecretStr] = Field(
        None, description="API key for the embedding model"
    )

Add the corresponding environment variable to configurations/secrets.{environment}.env:

...
RAG__EMBEDDING_MODELS__OPEN_AI__API_KEY=<openai_api_key>
...

Step 4: Implement the Embedding Model Configuration

Define the configuration class for the new provider:

class OpenAIEmbeddingModelConfiguration(EmbeddingModelConfiguration):

    provider: Literal[EmbeddingModelProviderNames.OPENAI] = Field(
        ..., description="The provider of the embedding model."
    )
    max_request_size_in_tokens: int = Field(
        8191,
        description="Maximum size of the request in tokens.",
    )
    secrets: OpenAIEmbeddingModelSecrets = Field(
        default_factory=OpenAIEmbeddingModelSecrets,
        description="The secrets for the embedding model.",
    )

    builder: Callable = Field(
        OpenAIEmbeddingModelBuilder.build,
        description="The builder for the embedding model.",
        exclude=True,
    )

    def model_post_init(self, __context):
        super().model_post_init(__context)
        self.batch_size = (
            self.max_request_size_in_tokens
            // self.splitting.chunk_size_in_tokens
        )
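
The model_post_init hook above derives batch_size as the number of chunks that fit into a single request. With the class default of 8191 request tokens and the 384-token chunks from the example configuration in Step 6, that works out as:

```python
# batch_size = max_request_size_in_tokens // splitting.chunk_size_in_tokens
max_request_size_in_tokens = 8191  # class default above
chunk_size_in_tokens = 384         # from the example JSON configuration
batch_size = max_request_size_in_tokens // chunk_size_in_tokens
print(batch_size)  # 21
```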

Step 5: Set Up Tokenizer Initialization

Customize the get_tokenizer method in EmbeddingModelConfiguration:

import tiktoken
...
class EmbeddingModelConfiguration(BaseModel):
    ...
    def get_tokenizer(self) -> Callable:
        match self.provider:
            ...
            case EmbeddingModelProviderNames.OPENAI:
                return tiktoken.encoding_for_model(self.tokenizer_name).encode
            ...

Step 6: Example JSON Configuration

...
    "embedding_model": {
        "provider": "openai",
        "name": "text-embedding-3-small",
        "tokenizer_name": "text-embedding-3-small",
        "splitting": {
            "name": "basic",
            "chunk_overlap_in_tokens": 50,
            "chunk_size_in_tokens": 384
        }
    }
...
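
A quick sanity check that the fragment above is well-formed once wrapped in a complete JSON document, using only the standard library:

```python
import json

# The Step 6 fragment, wrapped into a full JSON document for parsing.
raw = """
{
    "embedding_model": {
        "provider": "openai",
        "name": "text-embedding-3-small",
        "tokenizer_name": "text-embedding-3-small",
        "splitting": {
            "name": "basic",
            "chunk_overlap_in_tokens": 50,
            "chunk_size_in_tokens": 384
        }
    }
}
"""
config = json.loads(raw)["embedding_model"]
print(config["provider"])                           # openai
print(config["splitting"]["chunk_size_in_tokens"])  # 384
```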

Step 7: Expose Embedding Model Configuration

Add the new configuration to AVAILABLE_EMBEDDING_MODELS:

AVAILABLE_EMBEDDING_MODELS = Union[..., OpenAIEmbeddingModelConfiguration]
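
The Union behaves as a discriminated union keyed on the provider field: validation selects the member whose Literal provider matches the configured value. A plain-dict registry conveys the same routing idea without pydantic (the class names here are placeholder sketches, not the real configuration classes):

```python
class OpenAIConfigurationSketch:  # placeholder, not the real class
    provider = "openai"

class OtherProviderConfigurationSketch:  # placeholder, not the real class
    provider = "other"

# What the discriminated union does for us: route by the "provider" value.
AVAILABLE = {
    cls.provider: cls
    for cls in (OpenAIConfigurationSketch, OtherProviderConfigurationSketch)
}
print(AVAILABLE["openai"].__name__)  # OpenAIConfigurationSketch
```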

Step 8: Create the Embedding Model Builder

Add the builder logic to embedding_builders.py:

from typing import TYPE_CHECKING

from injector import inject
from llama_index.embeddings.openai import OpenAIEmbedding

if TYPE_CHECKING:
    from common.bootstrap.configuration.pipeline.embedding.embedding_model.embedding_model_configuration import (
        OpenAIEmbeddingModelConfiguration,
    )


class OpenAIEmbeddingModelBuilder:
    """Builder for creating OpenAI embedding model instances.

    Provides factory method to create configured OpenAIEmbedding objects.
    """

    @staticmethod
    @inject
    def build(
        configuration: "OpenAIEmbeddingModelConfiguration",
    ) -> OpenAIEmbedding:
        """Creates a configured OpenAI embedding model.

        Args:
            configuration: Embedding model settings including API key, name and batch size.

        Returns:
            OpenAIEmbedding: Configured embedding model instance.
        """
        return OpenAIEmbedding(
            api_key=configuration.secrets.api_key.get_secret_value(),
            model_name=configuration.name,
            embed_batch_size=configuration.batch_size,
        )
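
The builder is a static factory: configuration in, ready client out. A dependency-free sketch of that shape, with dataclasses standing in for the real configuration and OpenAIEmbedding types:

```python
from dataclasses import dataclass

@dataclass
class EmbeddingStandIn:  # stand-in for llama-index's OpenAIEmbedding
    api_key: str
    model_name: str
    embed_batch_size: int

@dataclass
class ConfigurationStandIn:  # stand-in for OpenAIEmbeddingModelConfiguration
    api_key: str
    name: str
    batch_size: int

class BuilderSketch:
    @staticmethod
    def build(configuration: ConfigurationStandIn) -> EmbeddingStandIn:
        # Same field mapping as OpenAIEmbeddingModelBuilder.build above.
        return EmbeddingStandIn(
            api_key=configuration.api_key,
            model_name=configuration.name,
            embed_batch_size=configuration.batch_size,
        )

model = BuilderSketch.build(
    ConfigurationStandIn(
        api_key="sk-placeholder",
        name="text-embedding-3-small",
        batch_size=21,
    )
)
print(model.model_name, model.embed_batch_size)  # text-embedding-3-small 21
```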

After completing these steps, the OpenAI embedding models are ready to be configured and used in the RAG System.