# How to Configure the RAG System
This guide explains how to customize the RAG system pipeline through configuration files.
## Environments

### Definition
The following environments are supported:
```python
from enum import Enum

class EnvironmentName(str, Enum):
    DEFAULT = "default"
    LOCAL = "local"
    DEV = "dev"
    TEST = "test"
    PROD = "prod"
```
Each environment requires corresponding configuration and secrets files in the configurations directory:
- Configuration files: `configuration.{environment}.json`
- Secrets files: `secrets.{environment}.env`

The configuration files define the pipeline setup, while the secrets files store credentials and tokens. For security, all files in the configurations directory are git-ignored except for `configuration.default.json` and `configuration.local.json`.
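For illustration, resolving both file paths for a given environment could look like the sketch below; the helper is hypothetical, shown only to make the naming convention concrete:

```python
from pathlib import Path

CONFIGURATIONS_DIR = Path("configurations")

def resolve_config_paths(environment: str) -> tuple[Path, Path]:
    """Return the configuration and secrets file paths for an environment."""
    config = CONFIGURATIONS_DIR / f"configuration.{environment}.json"
    secrets = CONFIGURATIONS_DIR / f"secrets.{environment}.env"
    return config, secrets

print(resolve_config_paths("dev"))
# (PosixPath('configurations/configuration.dev.json'),
#  PosixPath('configurations/secrets.dev.env'))
```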
### Usage

Run the pipeline with a specific configuration using the `--env` flag:
```bash
build/workstation/init.sh --env default
python src/embed.py --env default
```
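Under the hood, such a flag could be parsed along the lines of this sketch (a hypothetical example using only the environment names defined above, not the project's actual CLI code):

```python
import argparse

# The valid choices mirror the EnvironmentName enum values from above.
parser = argparse.ArgumentParser(description="Run a pipeline step.")
parser.add_argument(
    "--env",
    choices=["default", "local", "dev", "test", "prod"],
    default="default",
    help="Environment whose configuration and secrets files are loaded.",
)
args = parser.parse_args()
print(f"Using configuration.{args.env}.json")
```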
## Datasource Configuration
Currently, the following datasources are available:
```python
from enum import Enum

class DatasourceName(str, Enum):
    NOTION = "notion"
    CONFLUENCE = "confluence"
    PDF = "pdf"
```
Blueprint supports using a single datasource or multiple datasources. Adjust the configuration accordingly:
```json
{
  "pipeline": {
    "embedding": {
      "datasources": [
        {
          "name": "notion",
          "export_limit": 100
        },
        {
          "name": "pdf",
          "export_limit": 100,
          "base_path": "data/"
        }
      ]
    }
  }
}
```
Each entry in `datasources` corresponds to a single source; the sources are used sequentially to extract the documents that are then processed further. The `name` of each entry must correspond to one of the implemented enum values. Datasources' secrets must be added to the environment's secrets file. To check the configurable options for specific datasources, visit `datasources_configuration.json`.
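As an illustration, checking the `name` fields against the enum could look like the sketch below; the `validate_datasources` helper is hypothetical, not the project's actual code:

```python
from enum import Enum

class DatasourceName(str, Enum):
    NOTION = "notion"
    CONFLUENCE = "confluence"
    PDF = "pdf"

def validate_datasources(datasources: list[dict]) -> None:
    """Raise ValueError if a configured datasource is not implemented."""
    for entry in datasources:
        if entry["name"] not in {d.value for d in DatasourceName}:
            raise ValueError(f"Unknown datasource: {entry['name']!r}")

validate_datasources([{"name": "notion", "export_limit": 100}])  # passes
```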
## LLM Configuration
Currently, LLMs from these providers are supported:
```python
from enum import Enum

class LLMProviderNames(str, Enum):
    OPENAI = "openai"
    OPENAI_LIKE = "openai-like"
```
`OPENAI` indicates the OpenAI provider, whereas `OPENAI_LIKE` indicates any LLM exposed through an API compatible with OpenAI's API, e.g., a self-hosted LLM exposed via TabbyAPI.
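Conceptually, an `openai-like` provider is just the standard OpenAI client pointed at a different base URL. A minimal sketch, assuming a self-hosted server with a placeholder URL, key, and model name:

```python
from openai import OpenAI

# Placeholder endpoint and credentials for a self-hosted,
# OpenAI-compatible server (e.g., one exposed via TabbyAPI).
client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="my-llm",  # whatever model name the server exposes
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```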
A minimal setup requires LLMs for the augmentation and evaluation processes. To configure them, adjust the following JSON entries:
```json
{
  "pipeline": {
    "augmentation": {
      "query_engine": {
        "synthesizer": {
          "name": "tree",
          "llm": {
            "provider": "openai",
            "name": "gpt-4o-mini",
            "max_tokens": 1024,
            "max_retries": 3,
            "context_window": 16384
          }
        }
      }
    },
    "evaluation": {
      "judge_llm": {
        "provider": "openai",
        "name": "gpt-4o-mini",
        "max_tokens": 1024,
        "max_retries": 3,
        "context_window": 16384
      }
    }
  }
}
```
Providers' secrets must be added to the environment's secrets file. The `provider` field must be one of the values from `LLMProviderNames`, and the `name` field indicates the specific model exposed by the provider. To check the configurable options for specific providers, visit `llm_configuration.json`.
In the case above, the augmentation and evaluation processes use the same LLM, which might be suboptimal. To change this, adjust either of the two entries:
```json
{
  "pipeline": {
    "augmentation": {
      "query_engine": {
        "synthesizer": {
          "name": "tree",
          "llm": {
            "provider": "openai",
            "name": "gpt-4o-mini",
            "max_tokens": 1024,
            "max_retries": 3,
            "context_window": 16384
          }
        }
      }
    },
    "evaluation": {
      "judge_llm": {
        "provider": "openai-like", // another provider
        "name": "my-llm", // another llm
        "max_tokens": 512 // different parameters
      }
    }
  }
}
```
## Embedding Model Configuration
Currently, embedding models from these providers are supported:
```python
from enum import Enum

class EmbeddingModelProviderNames(str, Enum):
    HUGGING_FACE = "hugging_face"
    OPENAI = "openai"
    VOYAGE = "voyage"
```
Any model exposed by these providers can be used in the setup.
A minimal setup requires embedding models in several processes. To configure them, adjust the following JSON entries:
```json
{
  "pipeline": {
    "augmentation": {
      "embedding_model": {
        "provider": "hugging_face",
        "name": "BAAI/bge-small-en-v1.5",
        "tokenizer_name": "BAAI/bge-small-en-v1.5",
        "splitting": {
          "name": "basic",
          "chunk_overlap_in_tokens": 50,
          "chunk_size_in_tokens": 384
        }
      }
    },
    "evaluation": {
      "judge_embedding_model": {
        "provider": "hugging_face",
        "name": "BAAI/bge-small-en-v1.5",
        "tokenizer_name": "BAAI/bge-small-en-v1.5"
      }
    }
  }
}
```
Providers' secrets must be added to the environment's secrets file. The `provider` field must be one of the values from `EmbeddingModelProviderNames`, and the `name` field indicates the specific model exposed by the provider. The `tokenizer_name` field indicates the tokenizer paired with the embedding model; it should be compatible with the specified embedding model. The `splitting` field is optional and defines how the documents are chunked during the embedding process. To check the configurable options for specific providers, visit `embedding_model_configuration.json`.
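To make the `splitting` parameters concrete, token-based chunking with overlap behaves roughly like the sketch below. This is a simplified illustration over a plain token list, not the pipeline's actual splitter:

```python
def split_into_chunks(
    tokens: list[str], chunk_size: int = 384, overlap: int = 50
) -> list[list[str]]:
    """Split a token sequence into chunks of `chunk_size` tokens,
    where consecutive chunks share `overlap` tokens."""
    step = chunk_size - overlap
    return [
        tokens[start : start + chunk_size]
        for start in range(0, max(len(tokens) - overlap, 1), step)
    ]

chunks = split_into_chunks([f"tok{i}" for i in range(1000)])
print(len(chunks), len(chunks[0]))  # 3 chunks; the first holds 384 tokens
```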
Note: The same embedding model is used for the embedding and retrieval processes; therefore, it is defined in the `embedding` configuration only.
In the case above, the embedding/retrieval and evaluation processes use the same embedding model, which might be suboptimal. To change this, adjust either of the two entries:
```json
{
  "pipeline": {
    "augmentation": {
      "embedding_model": {
        "provider": "hugging_face",
        "name": "BAAI/bge-small-en-v1.5",
        "tokenizer_name": "BAAI/bge-small-en-v1.5",
        "splitting": {
          "name": "basic",
          "chunk_overlap_in_tokens": 50,
          "chunk_size_in_tokens": 384
        }
      }
    },
    "evaluation": {
      "judge_embedding_model": {
        "provider": "openai", // different provider
        "name": "text-embedding-3-small", // different embedding model
        "tokenizer_name": "text-embedding-3-small", // different tokenizer
        "batch_size": 64 // different parameters
      }
    }
  }
}
```
## Vector Store Configuration
Currently, the following vector stores are supported:
```python
from enum import Enum

class VectorStoreName(str, Enum):
    QDRANT = "qdrant"
    CHROMA = "chroma"
```
To configure the vector store, update the following entry:
```json
{
  "pipeline": {
    "embedding": {
      "vector_store": {
        "name": "qdrant",
        "collection_name": "collection-default",
        "host": "qdrant",
        "protocol": "http",
        "ports": {
          "rest": 6333
        }
      }
    }
  }
}
```
The `name` field indicates one of the vector stores from `VectorStoreName`, and `collection_name` defines the vector store collection for the embedded documents. The remaining fields define the connection to the vector store. Corresponding secrets must be added to the environment's secrets file. To check the configurable options for specific vector stores, visit `vector_store_configuration.json`.
Note: If the `collection_name` already exists in the vector store, the embedding process will be skipped. To rerun it, delete the collection or use a different name.
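This skip can be reproduced with the vector store client itself. A minimal sketch for Qdrant, assuming a recent `qdrant-client` and the connection details from the configuration above:

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://qdrant:6333")

# The embedding process runs only when the target collection is absent.
if client.collection_exists("collection-default"):
    print("Collection exists; embedding will be skipped.")
else:
    print("Collection not found; embedding will run.")
```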
## Langfuse and Chainlit Configuration

The configuration contains the following entries related to Langfuse and Chainlit:
```json
{
  "pipeline": {
    "augmentation": {
      "langfuse": {
        "host": "langfuse",
        "protocol": "http",
        "port": 3000,
        "database": {
          "host": "langfuse-db",
          "port": 5432,
          "db": "langfuse"
        }
      },
      "chainlit": {
        "port": 8000
      }
    }
  }
}
```
The `chainlit.port` field defines the port on which the chat UI runs. The fields in `langfuse` define the connection details for the Langfuse server, and `langfuse.database` defines the details of its database. Corresponding secrets for Langfuse must be added to the environment's secrets file. For more details, check `langfuse_configuration.json`.
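For reference, the Langfuse connection fields combine into a base URL roughly as follows; the helper below is illustrative only:

```python
def langfuse_base_url(langfuse_config: dict) -> str:
    """Assemble the Langfuse base URL from its configuration entry."""
    return (
        f"{langfuse_config['protocol']}://"
        f"{langfuse_config['host']}:{langfuse_config['port']}"
    )

print(langfuse_base_url({"host": "langfuse", "protocol": "http", "port": 3000}))
# http://langfuse:3000
```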
## Upcoming Docs

Docs about configurable synthesizers, retrievers, postprocessors, and others are in progress.