Core Datasource

This module contains the core building blocks shared by datasource implementations: base classes for documents and for reading, parsing, cleaning, splitting, and managing them.

Cleaner

BaseCleaner

Bases: ABC, Generic[DocType]

Abstract base class for document cleaning operations.

Defines the interface for document cleaners with generic type support to ensure type safety across different document implementations.

Source code in src/extraction/datasources/core/cleaner.py (lines 7-24)
class BaseCleaner(ABC, Generic[DocType]):
    """Abstract base class for document cleaning operations.

    Defines the interface for document cleaners with generic type support
    to ensure type safety across different document implementations.
    """

    @abstractmethod
    def clean(self, document: DocType) -> DocType:
        """Clean a single document.

        Args:
            document: The document to be cleaned

        Returns:
            The cleaned document or None if document should be filtered out
        """
        pass

clean(document) abstractmethod

Clean a single document.

Parameters:
  • document (DocType) –

    The document to be cleaned

Returns:
  • DocType

    The cleaned document or None if document should be filtered out

Source code in src/extraction/datasources/core/cleaner.py (lines 14-24)
@abstractmethod
def clean(self, document: DocType) -> DocType:
    """Clean a single document.

    Args:
        document: The document to be cleaned

    Returns:
        The cleaned document or None if document should be filtered out
    """
    pass
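
For illustration, a concrete cleaner only needs to implement clean. The sketch below is hypothetical: the WhitespaceNormalizingCleaner class is not part of this module, and the import paths assume the package is importable as extraction.datasources.core, matching the src/ layout shown in the source paths. It normalizes whitespace and returns None for empty results, following the filtering contract described above.

from typing import Optional

from extraction.datasources.core.cleaner import BaseCleaner
from extraction.datasources.core.document import BaseDocument


class WhitespaceNormalizingCleaner(BaseCleaner[BaseDocument]):
    """Hypothetical cleaner: strips trailing spaces, drops blank lines and empty documents."""

    def clean(self, document: BaseDocument) -> Optional[BaseDocument]:
        lines = [line.rstrip() for line in document.text.splitlines()]
        cleaned_text = "\n".join(line for line in lines if line)
        if not cleaned_text.strip():
            # Per the BaseCleaner contract, returning None filters the document out.
            return None
        return BaseDocument(text=cleaned_text, metadata=dict(document.metadata))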

BasicMarkdownCleaner

Bases: BaseCleaner, Generic[DocType]

Document cleaner for basic content validation.

Checks for empty content in documents and filters them out. Works with any document type that has a text attribute.

Source code in src/extraction/datasources/core/cleaner.py (lines 27-58)
class BasicMarkdownCleaner(BaseCleaner, Generic[DocType]):
    """Document cleaner for basic content validation.

    Checks for empty content in documents and filters them out.
    Works with any document type that has a text attribute.
    """

    def clean(self, document: DocType) -> DocType:
        """Remove document if it contains empty content.

        Args:
            document: The document to validate

        Returns:
            The original document if content is not empty, None otherwise
        """
        if not self._has_empty_content(document):
            return document

        return None

    @staticmethod
    def _has_empty_content(document: DocType) -> bool:
        """Check if document content is empty.

        Args:
            document: Document to check (must have a text attribute)

        Returns:
            True if document's text is empty or contains only whitespace
        """
        return not document.text.strip()

clean(document)

Remove document if it contains empty content.

Parameters:
  • document (DocType) –

    The document to validate

Returns:
  • DocType

    The original document if content is not empty, None otherwise

Source code in src/extraction/datasources/core/cleaner.py (lines 34-46)
def clean(self, document: DocType) -> DocType:
    """Remove document if it contains empty content.

    Args:
        document: The document to validate

    Returns:
        The original document if content is not empty, None otherwise
    """
    if not self._has_empty_content(document):
        return document

    return None
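
A minimal usage sketch (the example documents are illustrative, and the import paths assume the src/ layout shown in the source paths): documents with whitespace-only text are dropped, everything else passes through unchanged.

from extraction.datasources.core.cleaner import BasicMarkdownCleaner
from extraction.datasources.core.document import BaseDocument

cleaner = BasicMarkdownCleaner()

kept = cleaner.clean(BaseDocument(text="# Title\n\nSome content.", metadata={}))
dropped = cleaner.clean(BaseDocument(text="   \n\t", metadata={}))

assert kept is not None   # non-empty content is returned unchanged
assert dropped is None    # whitespace-only content is filtered out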

Document

BaseDocument

Bases: Document

Base document class for structured content storage.

Extends LlamaIndex Document to add support for attachments and metadata filtering for embedding and LLM contexts.

Attributes:
  • attachments (Optional[Dict[str, str]]) –

    Dictionary mapping placeholder keys to attachment content

  • included_embed_metadata_keys (List[str]) –

    Metadata fields to include in embeddings

  • included_llm_metadata_keys (List[str]) –

    Metadata fields to include in LLM context

Note

DocType TypeVar ensures type safety when implementing document types. Default metadata includes title and timestamp information.

Source code in src/extraction/datasources/core/document.py (lines 9-64)
class BaseDocument(Document):
    """Base document class for structured content storage.

    Extends LlamaIndex Document to add support for attachments and
    metadata filtering for embedding and LLM contexts.

    Attributes:
        attachments: Dictionary mapping placeholder keys to attachment content
        included_embed_metadata_keys: Metadata fields to include in embeddings
        included_llm_metadata_keys: Metadata fields to include in LLM context

    Note:
        DocType TypeVar ensures type safety when implementing document types.
        Default metadata includes title and timestamp information.
    """

    attachments: Optional[Dict[str, str]] = Field(
        description="Document attachments with placeholders as keys and content as values",
        default=None,
    )

    included_embed_metadata_keys: List[str] = [
        "title",
        "created_time",
        "last_edited_time",
    ]

    included_llm_metadata_keys: List[str] = [
        "title",
        "created_time",
        "last_edited_time",
    ]

    def __init__(self, text: str, metadata: dict, attachments: dict = None):
        """Initialize a document with text, metadata, and optional attachments.

        Sets up excluded metadata keys for embedding and LLM contexts.
        """
        super().__init__(text=text, metadata=metadata)
        self.attachments = attachments or {}
        self.excluded_embed_metadata_keys = self._set_excluded_metadata_keys(
            self.metadata, self.included_embed_metadata_keys
        )
        self.excluded_llm_metadata_keys = self._set_excluded_metadata_keys(
            self.metadata, self.included_llm_metadata_keys
        )

    @staticmethod
    def _set_excluded_metadata_keys(
        metadata: dict, included_keys: List[str]
    ) -> List[str]:
        """Identify metadata keys to exclude from processing.

        Returns all keys from metadata that aren't in the included_keys list.
        """
        return [key for key in metadata.keys() if key not in included_keys]

__init__(text, metadata, attachments=None)

Initialize a document with text, metadata, and optional attachments.

Sets up excluded metadata keys for embedding and LLM contexts.

Source code in src/extraction/datasources/core/document.py (lines 42-54)
def __init__(self, text: str, metadata: dict, attachments: dict = None):
    """Initialize a document with text, metadata, and optional attachments.

    Sets up excluded metadata keys for embedding and LLM contexts.
    """
    super().__init__(text=text, metadata=metadata)
    self.attachments = attachments or {}
    self.excluded_embed_metadata_keys = self._set_excluded_metadata_keys(
        self.metadata, self.included_embed_metadata_keys
    )
    self.excluded_llm_metadata_keys = self._set_excluded_metadata_keys(
        self.metadata, self.included_llm_metadata_keys
    )
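
A construction sketch under stated assumptions (the import path follows the src/ layout shown above, and the metadata values and the "{diagram}" placeholder format are purely illustrative): only the keys listed in included_embed_metadata_keys and included_llm_metadata_keys remain visible to the embedding and LLM contexts; every other metadata key is excluded automatically.

from extraction.datasources.core.document import BaseDocument

doc = BaseDocument(
    text="# Release notes\n\nSee the attached diagram: {diagram}",
    metadata={
        "title": "Release notes",
        "created_time": "2024-01-01T00:00:00Z",
        "last_edited_time": "2024-01-02T00:00:00Z",
        "source_url": "https://example.invalid/notes",  # not in the included key lists
    },
    attachments={"{diagram}": "<svg>...</svg>"},
)

# Keys outside the included lists are excluded from both contexts.
assert "source_url" in doc.excluded_embed_metadata_keys
assert "source_url" in doc.excluded_llm_metadata_keys
assert "title" not in doc.excluded_llm_metadata_keys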

Manager

BaseDatasourceManager

Bases: ABC, Generic[DocType]

Abstract base class for datasource management.

Defines the interface for managing the extraction, parsing, cleaning, and splitting of documents from a data source. This class serves as a template for implementing specific datasource managers, ensuring a consistent interface and behavior across different implementations.

Source code in src/extraction/datasources/core/manager.py (lines 20-71)
class BaseDatasourceManager(ABC, Generic[DocType]):
    """Abstract base class for datasource management.

    Defines the interface for managing the extraction, parsing,
    cleaning, and splitting of documents from a data source.
    This class serves as a template for implementing specific
    datasource managers, ensuring a consistent interface and
    behavior across different implementations.
    """

    def __init__(
        self,
        configuration: ExtractionConfiguration,
        reader: BaseReader,
        parser: BaseParser = BasicMarkdownParser(),
        cleaner: BaseCleaner = BasicMarkdownCleaner(),
        splitter: BaseSplitter = BasicMarkdownSplitter(),
    ):
        """Initialize datasource manager.

        Args:
            configuration: Embedding and processing settings
            reader: Content extraction component
            parser: Content parsing component
            cleaner: Content cleaning component
            splitter: Content splitting component
        """
        self.configuration = configuration
        self.reader = reader
        self.parser = parser
        self.cleaner = cleaner
        self.splitter = splitter

    @abstractmethod
    async def full_refresh_sync(
        self,
    ) -> AsyncIterator[DocType]:
        """Extract and process all content from the datasource.

        Returns:
            An async iterator yielding processed document chunks of type DocType
        """
        pass

    @abstractmethod
    def incremental_sync(self):
        """Process only new or changed content from the datasource.

        This method should handle differential updates to avoid
        reprocessing all content when only portions have changed.
        Implementations should update the vector storage accordingly.
        """
        pass

__init__(configuration, reader, parser=BasicMarkdownParser(), cleaner=BasicMarkdownCleaner(), splitter=BasicMarkdownSplitter())

Initialize datasource manager.

Parameters:
  • configuration (ExtractionConfiguration) –

    Embedding and processing settings

  • reader (BaseReader) –

    Content extraction component

  • parser (BaseParser, default: BasicMarkdownParser() ) –

    Content parsing component

  • cleaner (BaseCleaner, default: BasicMarkdownCleaner() ) –

    Content cleaning component

  • splitter (BaseSplitter, default: BasicMarkdownSplitter() ) –

    Content splitting component

Source code in src/extraction/datasources/core/manager.py (lines 30-50)
def __init__(
    self,
    configuration: ExtractionConfiguration,
    reader: BaseReader,
    parser: BaseParser = BasicMarkdownParser(),
    cleaner: BaseCleaner = BasicMarkdownCleaner(),
    splitter: BaseSplitter = BasicMarkdownSplitter(),
):
    """Initialize datasource manager.

    Args:
        configuration: Embedding and processing settings
        reader: Content extraction component
        parser: Content parsing component
        cleaner: Content cleaning component
        splitter: Content splitting component
    """
    self.configuration = configuration
    self.reader = reader
    self.parser = parser
    self.cleaner = cleaner
    self.splitter = splitter

full_refresh_sync() abstractmethod async

Extract and process all content from the datasource.

Returns:
  • AsyncIterator[DocType]

    An async iterator yielding processed document chunks of type DocType

Source code in src/extraction/datasources/core/manager.py (lines 52-61)
@abstractmethod
async def full_refresh_sync(
    self,
) -> AsyncIterator[DocType]:
    """Extract and process all content from the datasource.

    Returns:
        An async iterator yielding processed document chunks of type DocType
    """
    pass

incremental_sync() abstractmethod

Process only new or changed content from the datasource.

This method should handle differential updates to avoid reprocessing all content when only portions have changed. Implementations should update the vector storage accordingly.

Source code in src/extraction/datasources/core/manager.py (lines 63-71)
@abstractmethod
def incremental_sync(self):
    """Process only new or changed content from the datasource.

    This method should handle differential updates to avoid
    reprocessing all content when only portions have changed.
    Implementations should update the vector storage accordingly.
    """
    pass

BasicDatasourceManager

Bases: BaseDatasourceManager, Generic[DocType]

Standard implementation of datasource content processing pipeline.

Handles the extraction, parsing, cleaning, and splitting of documents from a data source. Processes documents using the provided components in a sequential pipeline to prepare them for embedding and storage.

Source code in src/extraction/datasources/core/manager.py (lines 74-113)
class BasicDatasourceManager(BaseDatasourceManager, Generic[DocType]):
    """Standard implementation of datasource content processing pipeline.

    Handles the extraction, parsing, cleaning, and splitting of documents
    from a data source. Processes documents using the provided components
    in a sequential pipeline to prepare them for embedding and storage.
    """

    async def full_refresh_sync(
        self,
    ) -> AsyncIterator[DocType]:
        """Process all content from the datasource from scratch.

        Executes the complete pipeline:
        1. Reads source objects asynchronously
        2. Parses each object into a document
        3. Cleans the content
        4. Splits into appropriate chunks

        Returns:
            An async iterator yielding processed document chunks of type DocType
        """
        objects = self.reader.read_all_async()
        async for object in objects:
            md_document = self.parser.parse(object)
            cleaned_document = self.cleaner.clean(md_document)
            if cleaned_document:
                for split_document in self.splitter.split(cleaned_document):
                    yield split_document

    def incremental_sync(self):
        """Process only new or changed content since the last sync.

        Should be implemented by subclasses to provide efficient
        updates when only a portion of the datasource has changed.

        Raises:
            NotImplementedError: This feature is not yet implemented
        """
        raise NotImplementedError("Currently unsupported feature.")

full_refresh_sync() async

Process all content from the datasource from scratch.

Executes the complete pipeline:

  1. Reads source objects asynchronously
  2. Parses each object into a document
  3. Cleans the content
  4. Splits into appropriate chunks

Returns:
  • AsyncIterator[DocType]

    An async iterator yielding processed document chunks of type DocType

Source code in src/extraction/datasources/core/manager.py (lines 82-102)
async def full_refresh_sync(
    self,
) -> AsyncIterator[DocType]:
    """Process all content from the datasource from scratch.

    Executes the complete pipeline:
    1. Reads source objects asynchronously
    2. Parses each object into a document
    3. Cleans the content
    4. Splits into appropriate chunks

    Returns:
        An async iterator yielding processed document chunks of type DocType
    """
    objects = self.reader.read_all_async()
    async for object in objects:
        md_document = self.parser.parse(object)
        cleaned_document = self.cleaner.clean(md_document)
        if cleaned_document:
            for split_document in self.splitter.split(cleaned_document):
                yield split_document

incremental_sync()

Process only new or changed content since the last sync.

Should be implemented by subclasses to provide efficient updates when only a portion of the datasource has changed.

Raises:
  • NotImplementedError

    This feature is not yet implemented

Source code in src/extraction/datasources/core/manager.py (lines 104-113)
def incremental_sync(self):
    """Process only new or changed content since the last sync.

    Should be implemented by subclasses to provide efficient
    updates when only a portion of the datasource has changed.

    Raises:
        NotImplementedError: This feature is not yet implemented
    """
    raise NotImplementedError("Currently unsupported feature.")
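
A minimal end-to-end sketch of the pipeline. The InMemoryReader class is hypothetical, the import paths assume the src/ layout shown above, and configuration is passed as None only to keep the sketch small; a real manager receives an ExtractionConfiguration instance.

import asyncio
from typing import Any, AsyncIterator, List

from extraction.datasources.core.manager import BasicDatasourceManager
from extraction.datasources.core.reader import BaseReader


class InMemoryReader(BaseReader):
    """Hypothetical reader that streams a few markdown strings."""

    def __init__(self, pages: List[str]):
        self.pages = pages

    async def read_all_async(self) -> AsyncIterator[Any]:
        for page in self.pages:
            yield page


async def main() -> None:
    reader = InMemoryReader(["# First page\n\nContent.", "   "])
    # configuration=None is a shortcut for this sketch; the basic pipeline shown
    # above does not consult it.
    manager = BasicDatasourceManager(configuration=None, reader=reader)
    async for chunk in manager.full_refresh_sync():
        print(chunk.text)  # only the first page survives the cleaner


asyncio.run(main())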

Parser

BaseParser

Bases: ABC, Generic[DocType]

Abstract base class for document parsers.

Defines the interface for parsing content into documents of specified type (DocType).

Source code in src/extraction/datasources/core/parser.py (lines 9-28)
class BaseParser(ABC, Generic[DocType]):
    """
    Abstract base class for document parsers.

    Defines the interface for parsing content into documents
    of specified type (DocType).
    """

    @abstractmethod
    def parse(self, content: str) -> DocType:
        """
        Parse content into a document of type DocType.

        Args:
            content: Raw content string to be parsed

        Returns:
            Parsed document of type DocType
        """
        pass

parse(content) abstractmethod

Parse content into a document of type DocType.

Parameters:
  • content (str) –

    Raw content string to be parsed

Returns:
  • DocType

    Parsed document of type DocType

Source code in src/extraction/datasources/core/parser.py (lines 17-28)
@abstractmethod
def parse(self, content: str) -> DocType:
    """
    Parse content into a document of type DocType.

    Args:
        content: Raw content string to be parsed

    Returns:
        Parsed document of type DocType
    """
    pass
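
A custom parser typically maps whatever the reader yields onto a document with useful metadata. The JSONPageParser below is hypothetical (it is not part of this module, and the import paths assume the src/ layout shown above); it parses a JSON-encoded page into a BaseDocument.

import json

from extraction.datasources.core.document import BaseDocument
from extraction.datasources.core.parser import BaseParser


class JSONPageParser(BaseParser[BaseDocument]):
    """Hypothetical parser: turns a JSON-encoded page into a BaseDocument."""

    def parse(self, content: str) -> BaseDocument:
        record = json.loads(content)
        return BaseDocument(
            text=record.get("body", ""),
            metadata={"title": record.get("title", "untitled")},
        )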

BasicMarkdownParser

Bases: BaseParser[Document]

Markdown parser that converts markdown text into Document objects.

Implements the BaseParser interface for basic markdown content.

Source code in src/extraction/datasources/core/parser.py (lines 31-48)
class BasicMarkdownParser(BaseParser[Document]):
    """
    Markdown parser that converts markdown text into Document objects.

    Implements the BaseParser interface for basic markdown content.
    """

    def parse(self, markdown: str) -> Document:
        """
        Parse markdown content into a Document object.

        Args:
            markdown: Markdown content to be parsed

        Returns:
            Document object containing the markdown text
        """
        return Document(text=markdown, metadata={})

parse(markdown)

Parse markdown content into a Document object.

Parameters:
  • markdown (str) –

    Markdown content to be parsed

Returns:
  • Document

    Document object containing the markdown text

Source code in src/extraction/datasources/core/parser.py (lines 38-48)
def parse(self, markdown: str) -> Document:
    """
    Parse markdown content into a Document object.

    Args:
        markdown: Markdown content to be parsed

    Returns:
        Document object containing the markdown text
    """
    return Document(text=markdown, metadata={})
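
In practice this parser simply wraps the raw markdown (a short sketch; the import path assumes the src/ layout shown above):

from extraction.datasources.core.parser import BasicMarkdownParser

document = BasicMarkdownParser().parse("# Title\n\nSome markdown body.")
assert document.text.startswith("# Title")
assert document.metadata == {}  # this parser attaches no metadata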

Reader

BaseReader

Bases: ABC

Abstract base class for document source readers.

This class defines a standard interface for extracting documents from various data sources. Concrete implementations should inherit from this class and implement the required methods to handle specific data source types.

The generic typing allows for flexibility in the document types returned by different implementations.

Source code in src/extraction/datasources/core/reader.py (lines 5-30)
class BaseReader(ABC):
    """Abstract base class for document source readers.

    This class defines a standard interface for extracting documents from various
    data sources. Concrete implementations should inherit from this class and
    implement the required methods to handle specific data source types.

    The generic typing allows for flexibility in the document types returned
    by different implementations.
    """

    @abstractmethod
    async def read_all_async(self) -> AsyncIterator[Any]:
        """Asynchronously retrieve documents from the source.

        Implementations should use async iteration to efficiently stream documents
        from the source without loading all content into memory at once.

        Returns:
            AsyncIterator[Any]: An async iterator that yields documents as they're
                               extracted from the source.

        Raises:
            NotImplementedError: This abstract method must be implemented by subclasses.
        """
        pass

read_all_async() abstractmethod async

Asynchronously retrieve documents from the source.

Implementations should use async iteration to efficiently stream documents from the source without loading all content into memory at once.

Returns:
  • AsyncIterator[Any]

    AsyncIterator[Any]: An async iterator that yields documents as they're extracted from the source.

Raises:
  • NotImplementedError

    This abstract method must be implemented by subclasses.

Source code in src/extraction/datasources/core/reader.py (lines 16-30)
@abstractmethod
async def read_all_async(self) -> AsyncIterator[Any]:
    """Asynchronously retrieve documents from the source.

    Implementations should use async iteration to efficiently stream documents
    from the source without loading all content into memory at once.

    Returns:
        AsyncIterator[Any]: An async iterator that yields documents as they're
                           extracted from the source.

    Raises:
        NotImplementedError: This abstract method must be implemented by subclasses.
    """
    pass
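
A reader implementation is usually an async generator. The MarkdownDirectoryReader below is a hypothetical sketch (not part of this module; import paths assume the src/ layout shown above) that streams local markdown files one at a time; a production reader would also use non-blocking I/O rather than read_text.

from pathlib import Path
from typing import Any, AsyncIterator

from extraction.datasources.core.reader import BaseReader


class MarkdownDirectoryReader(BaseReader):
    """Hypothetical reader: streams markdown files from a local directory."""

    def __init__(self, root: str):
        self.root = Path(root)

    async def read_all_async(self) -> AsyncIterator[Any]:
        # Yielding one file at a time lets callers consume the source without
        # loading the whole directory into memory.
        for path in sorted(self.root.glob("*.md")):
            yield path.read_text(encoding="utf-8")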

Splitter

BaseSplitter

Bases: ABC, Generic[DocType]

Abstract base class for document splitters.

This class defines the interface for splitting documents into smaller chunks. All splitter implementations should inherit from this class.

Source code in src/extraction/datasources/core/splitter.py (lines 7-24)
class BaseSplitter(ABC, Generic[DocType]):
    """Abstract base class for document splitters.

    This class defines the interface for splitting documents into smaller chunks.
    All splitter implementations should inherit from this class.
    """

    @abstractmethod
    def split(self, document: DocType) -> List[DocType]:
        """Split a document into multiple smaller documents.

        Args:
            document: The document to be split.

        Returns:
            A list of document chunks.
        """
        pass

split(document) abstractmethod

Split a document into multiple smaller documents.

Parameters:
  • document (DocType) –

    The document to be split.

Returns:
  • List[DocType]

    A list of document chunks.

Source code in src/extraction/datasources/core/splitter.py (lines 14-24)
@abstractmethod
def split(self, document: DocType) -> List[DocType]:
    """Split a document into multiple smaller documents.

    Args:
        document: The document to be split.

    Returns:
        A list of document chunks.
    """
    pass
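
A concrete splitter decides how a document becomes retrieval-sized chunks. The HeadingSplitter below is a hypothetical sketch (not part of this module; import paths assume the src/ layout shown above) that emits one chunk per top-level markdown heading, in contrast to the pass-through BasicMarkdownSplitter documented next.

from typing import List

from extraction.datasources.core.document import BaseDocument
from extraction.datasources.core.splitter import BaseSplitter


class HeadingSplitter(BaseSplitter[BaseDocument]):
    """Hypothetical splitter: one chunk per top-level markdown heading."""

    def split(self, document: BaseDocument) -> List[BaseDocument]:
        chunks: List[BaseDocument] = []
        current: List[str] = []
        for line in document.text.splitlines():
            if line.startswith("# ") and current:
                # Close the previous chunk when a new top-level heading starts.
                chunks.append(
                    BaseDocument(text="\n".join(current), metadata=dict(document.metadata))
                )
                current = []
            current.append(line)
        if current:
            chunks.append(
                BaseDocument(text="\n".join(current), metadata=dict(document.metadata))
            )
        return chunks or [document]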

BasicMarkdownSplitter

Bases: BaseSplitter, Generic[DocType]

A simple splitter implementation that returns the document as-is.

This splitter does not perform any actual splitting and is primarily used as a pass-through when splitting is not required.

Source code in src/extraction/datasources/core/splitter.py (lines 27-43)
class BasicMarkdownSplitter(BaseSplitter, Generic[DocType]):
    """A simple splitter implementation that returns the document as-is.

    This splitter does not perform any actual splitting and is primarily
    used as a pass-through when splitting is not required.
    """

    def split(self, document: DocType) -> List[DocType]:
        """Return the document as a single-item list without splitting.

        Args:
            document: The document to be processed.

        Returns:
            A list containing the original document as the only element.
        """
        return [document]

split(document)

Return the document as a single-item list without splitting.

Parameters:
  • document (DocType) –

    The document to be processed.

Returns:
  • List[DocType]

    A list containing the original document as the only element.

Source code in src/extraction/datasources/core/splitter.py (lines 34-43)
def split(self, document: DocType) -> List[DocType]:
    """Return the document as a single-item list without splitting.

    Args:
        document: The document to be processed.

    Returns:
        A list containing the original document as the only element.
    """
    return [document]