Base_orchestator

This module contains functionality related to the the base_orchestator module for extraction.orchestrators.

Base_orchestator

BaseDatasourceOrchestrator

Bases: ABC

Abstract base class for datasource orchestration.

Defines interface for managing content extraction, embedding generation, and vector storage operations across datasources.

This class serves as a coordinator for multiple datasource managers, providing unified methods for extracting and processing documents from various data sources in both full and incremental sync modes.

Note

All implementing classes must provide concrete implementations of extract, embed, save and update methods.

Source code in src/extraction/orchestrators/base_orchestator.py
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
class BaseDatasourceOrchestrator(ABC):
    """Abstract base class for datasource orchestration.

    Defines interface for managing content extraction, embedding generation,
    and vector storage operations across datasources.

    This class serves as a coordinator for multiple datasource managers,
    providing unified methods for extracting and processing documents
    from various data sources in both full and incremental sync modes.

    Note:
        All implementing classes must provide concrete implementations
        of extract, embed, save and update methods.
    """

    def __init__(
        self,
        datasource_managers: List[BaseDatasourceManager],
    ):
        """Initialize the orchestrator with datasource managers.

        Args:
            datasource_managers: A list of datasource manager instances that
                                 implement the BaseDatasourceManager interface.
                                 These managers handle the actual data extraction
                                 from specific datasource types.
        """
        self.datasource_managers = datasource_managers

    @abstractmethod
    async def full_refresh_sync(self) -> AsyncIterator[BaseDocument]:
        """Extract content from configured datasources.

        Performs asynchronous content extraction from all configured
        datasource implementations. This method should perform a complete
        refresh of all available content from the datasources, regardless
        of previous sync state.

        Returns:
            An asynchronous iterator yielding BaseDocument objects representing
            the extracted content from all datasources.
        """
        pass

    @abstractmethod
    async def incremental_sync(self) -> AsyncIterator[BaseDocument]:
        """Perform an incremental sync from configured datasources.

        Extracts only new or modified content since the last sync operation.
        This method is designed for efficient regular updates without
        re-processing unchanged content.

        Returns:
            An asynchronous iterator yielding BaseDocument objects representing
            newly added or modified content from all datasources.
        """
        pass

__init__(datasource_managers)

Initialize the orchestrator with datasource managers.

Parameters:
  • datasource_managers (List[BaseDatasourceManager]) –

    A list of datasource manager instances that implement the BaseDatasourceManager interface. These managers handle the actual data extraction from specific datasource types.

Source code in src/extraction/orchestrators/base_orchestator.py
23
24
25
26
27
28
29
30
31
32
33
34
35
def __init__(
    self,
    datasource_managers: List[BaseDatasourceManager],
):
    """Initialize the orchestrator with datasource managers.

    Args:
        datasource_managers: A list of datasource manager instances that
                             implement the BaseDatasourceManager interface.
                             These managers handle the actual data extraction
                             from specific datasource types.
    """
    self.datasource_managers = datasource_managers

full_refresh_sync() abstractmethod async

Extract content from configured datasources.

Performs asynchronous content extraction from all configured datasource implementations. This method should perform a complete refresh of all available content from the datasources, regardless of previous sync state.

Returns:
  • AsyncIterator[BaseDocument]

    An asynchronous iterator yielding BaseDocument objects representing

  • AsyncIterator[BaseDocument]

    the extracted content from all datasources.

Source code in src/extraction/orchestrators/base_orchestator.py
37
38
39
40
41
42
43
44
45
46
47
48
49
50
@abstractmethod
async def full_refresh_sync(self) -> AsyncIterator[BaseDocument]:
    """Extract content from configured datasources.

    Performs asynchronous content extraction from all configured
    datasource implementations. This method should perform a complete
    refresh of all available content from the datasources, regardless
    of previous sync state.

    Returns:
        An asynchronous iterator yielding BaseDocument objects representing
        the extracted content from all datasources.
    """
    pass

incremental_sync() abstractmethod async

Perform an incremental sync from configured datasources.

Extracts only new or modified content since the last sync operation. This method is designed for efficient regular updates without re-processing unchanged content.

Returns:
  • AsyncIterator[BaseDocument]

    An asynchronous iterator yielding BaseDocument objects representing

  • AsyncIterator[BaseDocument]

    newly added or modified content from all datasources.

Source code in src/extraction/orchestrators/base_orchestator.py
52
53
54
55
56
57
58
59
60
61
62
63
64
@abstractmethod
async def incremental_sync(self) -> AsyncIterator[BaseDocument]:
    """Perform an incremental sync from configured datasources.

    Extracts only new or modified content since the last sync operation.
    This method is designed for efficient regular updates without
    re-processing unchanged content.

    Returns:
        An asynchronous iterator yielding BaseDocument objects representing
        newly added or modified content from all datasources.
    """
    pass