Confluence Datasource

This module contains functionality related to the Confluence datasource.

Cleaner

`ConfluenceCleaner`

Bases: BaseCleaner

The ConfluenceCleaner class is a concrete implementation of BaseCleaner for cleaning Confluence documents.

Source code in src/embedding/datasources/confluence/cleaner.py

class ConfluenceCleaner(BaseCleaner):
    """
    The `ConfluenceCleaner` class is a concrete implementation of `BaseCleaner` for cleaning Confluence documents.
    """

    def clean(
        self, documents: List[ConfluenceDocument]
    ) -> List[ConfluenceDocument]:
        """
        Clean the list of Confluence documents. If the content is empty it is not added to the cleaned documents.

        :param documents: List of ConfluenceDocument objects
        :return: List of cleaned ConfluenceDocument objects
        """
        cleaned_documents = []

        for document in ConfluenceCleaner._get_documents_with_tqdm(documents):
            if not ConfluenceCleaner._has_empty_content(document):
                cleaned_documents.append(document)

        return cleaned_documents

    @staticmethod
    def _get_documents_with_tqdm(documents: List[ConfluenceDocument]):
        """
        Wrap the documents in a tqdm progress bar labeled for the Confluence cleaning step.

        :param documents: List of ConfluenceDocument objects
        """
        return tqdm(documents, desc="[Confluence] Cleaning documents")

`_get_documents_with_tqdm(documents)` `staticmethod`

Wrap the documents in a tqdm progress bar labeled for the Confluence cleaning step.

:param documents: List of ConfluenceDocument objects

Source code in src/embedding/datasources/confluence/cleaner.py

@staticmethod
def _get_documents_with_tqdm(documents: List[ConfluenceDocument]):
    """
    Wrap the documents in a tqdm progress bar labeled for the Confluence cleaning step.

    :param documents: List of ConfluenceDocument objects
    """
    return tqdm(documents, desc="[Confluence] Cleaning documents")

`clean(documents)`

Clean the list of Confluence documents. If the content is empty it is not added to the cleaned documents.

:param documents: List of ConfluenceDocument objects
:return: List of cleaned ConfluenceDocument objects

Source code in src/embedding/datasources/confluence/cleaner.py

def clean(
    self, documents: List[ConfluenceDocument]
) -> List[ConfluenceDocument]:
    """
    Clean the list of Confluence documents. If the content is empty it is not added to the cleaned documents.

    :param documents: List of ConfluenceDocument objects
    :return: List of cleaned ConfluenceDocument objects
    """
    cleaned_documents = []

    for document in ConfluenceCleaner._get_documents_with_tqdm(documents):
        if not ConfluenceCleaner._has_empty_content(document):
            cleaned_documents.append(document)

    return cleaned_documents
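
The filtering step above can be sketched in isolation. This is a minimal illustration, not the project's implementation: `Doc` is a hypothetical stand-in for `ConfluenceDocument`, and `has_empty_content` assumes that "empty" means no text once whitespace is trimmed (the real `_has_empty_content` helper is not shown on this page).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for ConfluenceDocument; only `text` matters here."""

    text: str


def has_empty_content(document: Doc) -> bool:
    # Assumption: a document is empty when nothing remains after trimming whitespace.
    return not document.text.strip()


def clean(documents: List[Doc]) -> List[Doc]:
    # Keep only the documents that still carry some content, preserving order.
    return [doc for doc in documents if not has_empty_content(doc)]


docs = [Doc("# Release notes"), Doc(""), Doc("   \n"), Doc("Runbook")]
cleaned = clean(docs)
print([doc.text for doc in cleaned])
```

Under that assumption, the blank and whitespace-only documents are dropped and the order of the remaining ones is unchanged.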

Document

`ConfluenceDocument`

Bases: BaseDocument

Document representation for Confluence page content.

Extends BaseDocument to handle Confluence-specific document processing including content extraction, metadata handling, and exclusion configuration.

Attributes:
`text` – Markdown-formatted page content
`attachments` – Dictionary of page attachments (placeholder for future)
`metadata` – Extracted page metadata including dates, IDs, and URLs
`excluded_embed_metadata_keys` – Metadata keys to exclude from embeddings
`excluded_llm_metadata_keys` – Metadata keys to exclude from LLM context

Note

Handles conversion of HTML content to markdown and manages metadata filtering for both embedding and LLM contexts.

Source code in src/embedding/datasources/confluence/document.py

class ConfluenceDocument(BaseDocument):
    """Document representation for Confluence page content.

    Extends BaseDocument to handle Confluence-specific document processing including
    content extraction, metadata handling, and exclusion configuration.

    Attributes:
        text: Markdown-formatted page content
        attachments: Dictionary of page attachments (placeholder for future)
        metadata: Extracted page metadata including dates, IDs, and URLs
        excluded_embed_metadata_keys: Metadata keys to exclude from embeddings
        excluded_llm_metadata_keys: Metadata keys to exclude from LLM context

    Note:
        Handles conversion of HTML content to markdown and manages metadata
        filtering for both embedding and LLM contexts.
    """

    @classmethod
    def from_page(cls, page: dict, base_url: str) -> "ConfluenceDocument":
        """Create ConfluenceDocument instance from page data.

        Args:
            page: Dictionary containing Confluence page details
            base_url: Base URL of the Confluence instance

        Returns:
            ConfluenceDocument: Configured document instance
        """
        document = cls(
            text=md(page["body"]["view"]["value"]),
            attachments={},  # TBD
            metadata=ConfluenceDocument._get_metadata(page, base_url),
        )
        document._set_excluded_embed_metadata_keys()
        document._set_excluded_llm_metadata_keys()
        return document

    def _set_excluded_embed_metadata_keys(self) -> None:
        """Configure metadata keys to exclude from embeddings.

        Identifies metadata keys not explicitly included in embedding
        processing and marks them for exclusion.
        """
        metadata_keys = self.metadata.keys()
        self.excluded_embed_metadata_keys = [
            key
            for key in metadata_keys
            if key not in self.included_embed_metadata_keys
        ]

    def _set_excluded_llm_metadata_keys(self) -> None:
        """Configure metadata keys to exclude from LLM context.

        Identifies metadata keys not explicitly included in LLM
        processing and marks them for exclusion.
        """
        metadata_keys = self.metadata.keys()
        self.excluded_llm_metadata_keys = [
            key
            for key in metadata_keys
            if key not in self.included_llm_metadata_keys
        ]

    @staticmethod
    def _get_metadata(page: dict, base_url: str) -> dict:
        """Extract and format page metadata.

        Args:
            page: Dictionary containing Confluence page details
            base_url: Base URL of the Confluence instance

        Returns:
            dict: Structured metadata including dates, IDs, and URLs
        """
        return {
            "created_time": page["history"]["createdDate"],
            "created_date": page["history"]["createdDate"].split("T")[0],
            "datasource": "confluence",
            "format": "md",
            "last_edited_time": page["history"]["lastUpdated"]["when"],
            "last_edited_date": page["history"]["lastUpdated"]["when"].split(
                "T"
            )[0],
            "page_id": page["id"],
            "space": page["_expandable"]["space"].split("/")[-1],
            "title": page["title"],
            "type": "page",
            "url": base_url + page["_links"]["webui"],
        }

`_get_metadata(page, base_url)` `staticmethod`

Extract and format page metadata.

Parameters:
`page` (`dict`) – Dictionary containing Confluence page details
`base_url` (`str`) – Base URL of the Confluence instance

Returns: `dict` – Structured metadata including dates, IDs, and URLs

Source code in src/embedding/datasources/confluence/document.py

@staticmethod
def _get_metadata(page: dict, base_url: str) -> dict:
    """Extract and format page metadata.

    Args:
        page: Dictionary containing Confluence page details
        base_url: Base URL of the Confluence instance

    Returns:
        dict: Structured metadata including dates, IDs, and URLs
    """
    return {
        "created_time": page["history"]["createdDate"],
        "created_date": page["history"]["createdDate"].split("T")[0],
        "datasource": "confluence",
        "format": "md",
        "last_edited_time": page["history"]["lastUpdated"]["when"],
        "last_edited_date": page["history"]["lastUpdated"]["when"].split(
            "T"
        )[0],
        "page_id": page["id"],
        "space": page["_expandable"]["space"].split("/")[-1],
        "title": page["title"],
        "type": "page",
        "url": base_url + page["_links"]["webui"],
    }
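
To make the expected input shape concrete, here is a small sketch that applies the same lookups to a hand-written page dictionary. The function name `get_metadata`, the sample values, and the `example.atlassian.net` URL are illustrative assumptions. The sketch also assumes that `*_time` keys hold the full timestamp and `*_date` keys hold only the date part, mirroring the `created_*` pair.

```python
from typing import Any, Dict


def get_metadata(page: Dict[str, Any], base_url: str) -> Dict[str, str]:
    created = page["history"]["createdDate"]
    edited = page["history"]["lastUpdated"]["when"]
    return {
        "created_time": created,
        # ISO 8601 timestamps put the date before the "T" separator.
        "created_date": created.split("T")[0],
        "last_edited_time": edited,
        "last_edited_date": edited.split("T")[0],
        "page_id": page["id"],
        # The expandable space link ends with the space key.
        "space": page["_expandable"]["space"].split("/")[-1],
        "title": page["title"],
        "url": base_url + page["_links"]["webui"],
    }


# Hand-written sample containing only the fields the function reads.
sample_page = {
    "id": "12345",
    "title": "Onboarding",
    "history": {
        "createdDate": "2024-01-15T09:30:00.000Z",
        "lastUpdated": {"when": "2024-02-01T14:05:00.000Z"},
    },
    "_expandable": {"space": "/rest/api/space/ENG"},
    "_links": {"webui": "/spaces/ENG/pages/12345/Onboarding"},
}

metadata = get_metadata(sample_page, "https://example.atlassian.net/wiki")
print(metadata["space"], metadata["created_date"], metadata["url"])
```

Note that a page missing any of these nested keys would raise `KeyError`, so callers may want to request the `history.lastUpdated` expansion when fetching pages.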

`_set_excluded_embed_metadata_keys()`

Configure metadata keys to exclude from embeddings.

Identifies metadata keys not explicitly included in embedding processing and marks them for exclusion.

Source code in src/embedding/datasources/confluence/document.py

def _set_excluded_embed_metadata_keys(self) -> None:
    """Configure metadata keys to exclude from embeddings.

    Identifies metadata keys not explicitly included in embedding
    processing and marks them for exclusion.
    """
    metadata_keys = self.metadata.keys()
    self.excluded_embed_metadata_keys = [
        key
        for key in metadata_keys
        if key not in self.included_embed_metadata_keys
    ]

`_set_excluded_llm_metadata_keys()`

Configure metadata keys to exclude from LLM context.

Identifies metadata keys not explicitly included in LLM processing and marks them for exclusion.

Source code in src/embedding/datasources/confluence/document.py

def _set_excluded_llm_metadata_keys(self) -> None:
    """Configure metadata keys to exclude from LLM context.

    Identifies metadata keys not explicitly included in LLM
    processing and marks them for exclusion.
    """
    metadata_keys = self.metadata.keys()
    self.excluded_llm_metadata_keys = [
        key
        for key in metadata_keys
        if key not in self.included_llm_metadata_keys
    ]
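
Both exclusion helpers apply the same rule: every metadata key that is not on an allow-list is excluded. The sketch below shows that rule on plain dictionaries. The allow-lists `included_embed` and `included_llm` are hypothetical values chosen for illustration; the actual `included_embed_metadata_keys` and `included_llm_metadata_keys` are defined elsewhere and are not shown on this page.

```python
from typing import Dict, List


def excluded_keys(metadata: Dict[str, str], included: List[str]) -> List[str]:
    # Everything not explicitly allowed is excluded; dict order is preserved.
    return [key for key in metadata if key not in included]


metadata = {
    "title": "Onboarding",
    "url": "https://example.com/onboarding",
    "page_id": "12345",
    "created_time": "2024-01-15T09:30:00.000Z",
}

# Hypothetical allow-lists, for illustration only.
included_embed = ["title"]
included_llm = ["title", "url"]

print(excluded_keys(metadata, included_embed))
print(excluded_keys(metadata, included_llm))
```

Keeping an allow-list rather than a deny-list means that newly added metadata fields stay out of embeddings and LLM context until they are explicitly included.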

`from_page(page, base_url)` `classmethod`

Create ConfluenceDocument instance from page data.

Parameters:
`page` (`dict`) – Dictionary containing Confluence page details
`base_url` (`str`) – Base URL of the Confluence instance

Returns: `ConfluenceDocument` – Configured document instance

Source code in src/embedding/datasources/confluence/document.py

@classmethod
def from_page(cls, page: dict, base_url: str) -> "ConfluenceDocument":
    """Create ConfluenceDocument instance from page data.

    Args:
        page: Dictionary containing Confluence page details
        base_url: Base URL of the Confluence instance

    Returns:
        ConfluenceDocument: Configured document instance
    """
    document = cls(
        text=md(page["body"]["view"]["value"]),
        attachments={},  # TBD
        metadata=ConfluenceDocument._get_metadata(page, base_url),
    )
    document._set_excluded_embed_metadata_keys()
    document._set_excluded_llm_metadata_keys()
    return document

Manager

`ConfluenceDatasourceManager`

Bases: DatasourceManager

Manager for Confluence content extraction and processing.

Handles document retrieval, cleaning, splitting and embedding updates for Confluence workspace content. Implements the base DatasourceManager interface for Confluence-specific processing.

Source code in src/embedding/datasources/confluence/manager.py

class ConfluenceDatasourceManager(DatasourceManager):
    """Manager for Confluence content extraction and processing.

    Handles document retrieval, cleaning, splitting and embedding updates
    for Confluence workspace content. Implements the base DatasourceManager
    interface for Confluence-specific processing.
    """

    pass

Reader

`ConfluenceReader`

Bases: BaseReader

Reader for extracting documents from Confluence spaces.

Implements document extraction from Confluence spaces, handling pagination and export limits. Only asynchronous retrieval is implemented; the synchronous method is a placeholder.

Attributes:
`export_limit` – Maximum number of documents to extract
`confluence_client` – Client for Confluence API interactions

Source code in src/embedding/datasources/confluence/reader.py

class ConfluenceReader(BaseReader):
    """Reader for extracting documents from Confluence spaces.

    Implements document extraction from Confluence spaces, handling pagination
    and export limits. Only asynchronous retrieval is implemented; the
    synchronous method is a placeholder.

    Attributes:
        export_limit: Maximum number of documents to extract
        confluence_client: Client for Confluence API interactions
    """

    def __init__(
        self,
        configuration: ConfluenceDatasourceConfiguration,
        confluence_client: Confluence,
    ):
        """Initialize the Confluence reader.

        Args:
            configuration: Settings for Confluence access and limits
            confluence_client: Client for Confluence API interactions
        """
        super().__init__()
        self.export_limit = configuration.export_limit
        self.confluence_client = confluence_client

    def get_all_documents(self) -> List[ConfluenceDocument]:
        """Synchronously fetch all documents from Confluence.

        Returns:
            List[ConfluenceDocument]: List of extracted documents

        Note:
            Not implemented - use get_all_documents_async instead.
        """
        pass

    async def get_all_documents_async(self) -> List[ConfluenceDocument]:
        """Asynchronously fetch all documents from Confluence.

        Retrieves documents from all global spaces, respecting export limit.

        Returns:
            List[ConfluenceDocument]: List of extracted and processed documents
        """
        logging.info(
            f"Fetching documents from Confluence with limit {self.export_limit}"
        )
        response = self.confluence_client.get_all_spaces(space_type="global")
        pages = []

        for space in response["results"]:
            space_limit = (
                self.export_limit - len(pages)
                if self.export_limit is not None
                else None
            )
            pages.extend(self._get_all_pages(space["key"], space_limit))
            if (
                self.export_limit is not None
                and len(pages) >= self.export_limit
            ):
                break

        pages = (
            pages if self.export_limit is None else pages[: self.export_limit]
        )
        documents = [
            ConfluenceDocument.from_page(page, self.confluence_client.url)
            for page in pages
        ]
        return documents

    def _get_all_pages(self, space: str, limit: int) -> List[dict]:
        """Fetch all pages from a Confluence space.

        Args:
            space: Space key to fetch pages from
            limit: Maximum number of pages to fetch (None for unlimited)

        Returns:
            List[dict]: List of page details from the space
        """
        start = 0
        params = {
            "space": space,
            "start": start,
            "status": None,
            "expand": "body.view,history.lastUpdated",
        }
        all_pages = []

        try:
            with tqdm(
                desc=f"[Confluence] Reading {space}'s pages content",
                unit="pages",
            ) as pbar:
                while True:
                    pages = self.confluence_client.get_all_pages_from_space(
                        **params
                    )
                    all_pages.extend(pages)
                    pbar.update(len(pages))

                    if len(pages) == 0 or ConfluenceReader._limit_reached(
                        all_pages, limit
                    ):
                        break

                    start = len(all_pages)
                    params["start"] = start
        except HTTPError as e:
            logging.debug(f"Error while fetching pages from {space}: {e}")

        return all_pages if limit is None else all_pages[:limit]

    @staticmethod
    def _limit_reached(pages: List[dict], limit: int) -> bool:
        """Check if page limit has been reached.

        Args:
            pages: List of retrieved pages
            limit: Maximum number of pages (None for unlimited)

        Returns:
            bool: True if limit reached, False otherwise
        """
        return limit is not None and len(pages) >= limit

`__init__(configuration, confluence_client)`

Initialize the Confluence reader.

Parameters:
`configuration` (`ConfluenceDatasourceConfiguration`) – Settings for Confluence access and limits
`confluence_client` (`Confluence`) – Client for Confluence API interactions

Source code in src/embedding/datasources/confluence/reader.py

def __init__(
    self,
    configuration: ConfluenceDatasourceConfiguration,
    confluence_client: Confluence,
):
    """Initialize the Confluence reader.

    Args:
        configuration: Settings for Confluence access and limits
        confluence_client: Client for Confluence API interactions
    """
    super().__init__()
    self.export_limit = configuration.export_limit
    self.confluence_client = confluence_client

`_get_all_pages(space, limit)`

Fetch all pages from a Confluence space.

Parameters:
`space` (`str`) – Space key to fetch pages from
`limit` (`int`) – Maximum number of pages to fetch (None for unlimited)

Returns: `List[dict]` – List of page details from the space

Source code in src/embedding/datasources/confluence/reader.py

def _get_all_pages(self, space: str, limit: int) -> List[dict]:
    """Fetch all pages from a Confluence space.

    Args:
        space: Space key to fetch pages from
        limit: Maximum number of pages to fetch (None for unlimited)

    Returns:
        List[dict]: List of page details from the space
    """
    start = 0
    params = {
        "space": space,
        "start": start,
        "status": None,
        "expand": "body.view,history.lastUpdated",
    }
    all_pages = []

    try:
        with tqdm(
            desc=f"[Confluence] Reading {space}'s pages content",
            unit="pages",
        ) as pbar:
            while True:
                pages = self.confluence_client.get_all_pages_from_space(
                    **params
                )
                all_pages.extend(pages)
                pbar.update(len(pages))

                if len(pages) == 0 or ConfluenceReader._limit_reached(
                    all_pages, limit
                ):
                    break

                start = len(all_pages)
                params["start"] = start
    except HTTPError as e:
        logging.debug(f"Error while fetching pages from {space}: {e}")

    return all_pages if limit is None else all_pages[:limit]
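
The pagination loop can be exercised without a Confluence instance. The sketch below is an illustration under stated assumptions: `FakeClient` is a hypothetical stand-in that serves pages in fixed-size batches, and `fetch_pages` reproduces only the stop conditions (empty batch or limit covered), leaving out the progress bar and the HTTP error handling.

```python
from typing import Dict, List, Optional


class FakeClient:
    """Hypothetical stand-in for the Confluence client, serving fixed-size batches."""

    def __init__(self, total: int, batch_size: int) -> None:
        self._pages = [{"id": str(i)} for i in range(total)]
        self._batch_size = batch_size
        self.calls = 0  # number of simulated API requests

    def get_all_pages_from_space(
        self, space: str, start: int = 0, **_: object
    ) -> List[Dict[str, str]]:
        self.calls += 1
        return self._pages[start : start + self._batch_size]


def fetch_pages(
    client: FakeClient, space: str, limit: Optional[int]
) -> List[Dict[str, str]]:
    all_pages: List[Dict[str, str]] = []
    while True:
        # The next offset is simply the number of pages collected so far.
        batch = client.get_all_pages_from_space(space=space, start=len(all_pages))
        all_pages.extend(batch)
        # Stop on an empty batch (space exhausted) or once the limit is covered.
        if not batch or (limit is not None and len(all_pages) >= limit):
            break
    # The last batch may overshoot the limit, so trim before returning.
    return all_pages if limit is None else all_pages[:limit]


client = FakeClient(total=7, batch_size=3)
limited = fetch_pages(client, "ENG", limit=4)
print([page["id"] for page in limited], client.calls)
```

With no limit, the loop needs one extra request that returns an empty batch before it knows the space is exhausted.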

`_limit_reached(pages, limit)` `staticmethod`

Check if page limit has been reached.

Parameters:
`pages` (`List[dict]`) – List of retrieved pages
`limit` (`int`) – Maximum number of pages (None for unlimited)

Returns: `bool` – True if limit reached, False otherwise

Source code in src/embedding/datasources/confluence/reader.py

@staticmethod
def _limit_reached(pages: List[dict], limit: int) -> bool:
    """Check if page limit has been reached.

    Args:
        pages: List of retrieved pages
        limit: Maximum number of pages (None for unlimited)

    Returns:
        bool: True if limit reached, False otherwise
    """
    return limit is not None and len(pages) >= limit

`get_all_documents()`

Synchronously fetch all documents from Confluence.

Returns: `List[ConfluenceDocument]` – List of extracted documents

Note

Not implemented - use get_all_documents_async instead.

Source code in src/embedding/datasources/confluence/reader.py

def get_all_documents(self) -> List[ConfluenceDocument]:
    """Synchronously fetch all documents from Confluence.

    Returns:
        List[ConfluenceDocument]: List of extracted documents

    Note:
        Not implemented - use get_all_documents_async instead.
    """
    pass

`get_all_documents_async()` `async`

Asynchronously fetch all documents from Confluence.

Retrieves documents from all global spaces, respecting export limit.

Returns: `List[ConfluenceDocument]` – List of extracted and processed documents

Source code in src/embedding/datasources/confluence/reader.py

async def get_all_documents_async(self) -> List[ConfluenceDocument]:
    """Asynchronously fetch all documents from Confluence.

    Retrieves documents from all global spaces, respecting export limit.

    Returns:
        List[ConfluenceDocument]: List of extracted and processed documents
    """
    logging.info(
        f"Fetching documents from Confluence with limit {self.export_limit}"
    )
    response = self.confluence_client.get_all_spaces(space_type="global")
    pages = []

    for space in response["results"]:
        space_limit = (
            self.export_limit - len(pages)
            if self.export_limit is not None
            else None
        )
        pages.extend(self._get_all_pages(space["key"], space_limit))
        if (
            self.export_limit is not None
            and len(pages) >= self.export_limit
        ):
            break

    pages = (
        pages if self.export_limit is None else pages[: self.export_limit]
    )
    documents = [
        ConfluenceDocument.from_page(page, self.confluence_client.url)
        for page in pages
    ]
    return documents
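
The way the export limit is shared across spaces can be sketched as follows. This is an illustration, not the reader itself: `SPACES` and `pages_for_space` are hypothetical in-memory stand-ins for the Confluence client calls, and page titles stand in for full page dictionaries.

```python
import asyncio
from typing import Dict, List, Optional

# Hypothetical in-memory spaces; titles stand in for page dictionaries.
SPACES: Dict[str, List[str]] = {
    "ENG": ["eng-1", "eng-2", "eng-3"],
    "OPS": ["ops-1", "ops-2"],
    "HR": ["hr-1"],
}


def pages_for_space(key: str, limit: Optional[int]) -> List[str]:
    pages = SPACES[key]
    return pages if limit is None else pages[:limit]


async def collect(export_limit: Optional[int]) -> List[str]:
    pages: List[str] = []
    for key in SPACES:
        # Each space may only use what is left of the overall budget.
        remaining = None if export_limit is None else export_limit - len(pages)
        pages.extend(pages_for_space(key, remaining))
        if export_limit is not None and len(pages) >= export_limit:
            break  # budget spent, skip the remaining spaces
    return pages if export_limit is None else pages[:export_limit]


result = asyncio.run(collect(export_limit=4))
print(result)
```

Because the coroutine must be awaited, callers outside an event loop can drive it with `asyncio.run`, as shown in the last lines of the sketch.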

Splitter

`ConfluenceSplitter`

Bases: BaseSplitter

Source code in src/embedding/datasources/confluence/splitter.py

class ConfluenceSplitter(BaseSplitter):

    def __init__(
        self,
        markdown_splitter: BoundEmbeddingModelMarkdownSplitter,
    ):
        """
        Initialize the `ConfluenceSplitter`, a concrete `BaseSplitter` that splits Confluence documents into text nodes.

        :param markdown_splitter: MarkdownSplitter object for splitting documents
        """
        self.markdown_splitter = markdown_splitter

    def split(self, documents: List[ConfluenceDocument]) -> List[TextNode]:
        """
        Split the given list of documents into text nodes using `markdown_splitter`. Documents should be in markdown format.

        :param documents: List of Document objects
        :return: List of TextNode objects
        """
        return self.markdown_splitter.split(documents)

`__init__(markdown_splitter)`

Initialize the ConfluenceSplitter, a concrete BaseSplitter that splits Confluence documents into text nodes.

:param markdown_splitter: MarkdownSplitter object for splitting documents

Source code in src/embedding/datasources/confluence/splitter.py

def __init__(
    self,
    markdown_splitter: BoundEmbeddingModelMarkdownSplitter,
):
    """
    Initialize the `ConfluenceSplitter`, a concrete `BaseSplitter` that splits Confluence documents into text nodes.

    :param markdown_splitter: MarkdownSplitter object for splitting documents
    """
    self.markdown_splitter = markdown_splitter

`split(documents)`

Split the given list of documents into text nodes using markdown_splitter. Documents should be in markdown format.

:param documents: List of Document objects
:return: List of TextNode objects

Source code in src/embedding/datasources/confluence/splitter.py

def split(self, documents: List[ConfluenceDocument]) -> List[TextNode]:
    """
    Split the given list of documents into text nodes using `markdown_splitter`. Documents should be in markdown format.

    :param documents: List of Document objects
    :return: List of TextNode objects
    """
    return self.markdown_splitter.split(documents)