Bases: ABC
Abstract base class for datasource orchestration.
Defines interface for managing content extraction, embedding generation,
and vector storage operations across datasources.
This class serves as a coordinator for multiple datasource managers,
providing unified methods for extracting and processing documents
from various data sources in both full and incremental sync modes.
Note
All implementing classes must provide concrete implementations
of extract, embed, save and update methods.
Source code in src/extraction/orchestrators/base_orchestator.py
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64 | class BaseDatasourceOrchestrator(ABC):
"""Abstract base class for datasource orchestration.
Defines interface for managing content extraction, embedding generation,
and vector storage operations across datasources.
This class serves as a coordinator for multiple datasource managers,
providing unified methods for extracting and processing documents
from various data sources in both full and incremental sync modes.
Note:
All implementing classes must provide concrete implementations
of extract, embed, save and update methods.
"""
def __init__(
self,
datasource_managers: List[BaseDatasourceManager],
):
"""Initialize the orchestrator with datasource managers.
Args:
datasource_managers: A list of datasource manager instances that
implement the BaseDatasourceManager interface.
These managers handle the actual data extraction
from specific datasource types.
"""
self.datasource_managers = datasource_managers
@abstractmethod
async def full_refresh_sync(self) -> AsyncIterator[BaseDocument]:
"""Extract content from configured datasources.
Performs asynchronous content extraction from all configured
datasource implementations. This method should perform a complete
refresh of all available content from the datasources, regardless
of previous sync state.
Returns:
An asynchronous iterator yielding BaseDocument objects representing
the extracted content from all datasources.
"""
pass
@abstractmethod
async def incremental_sync(self) -> AsyncIterator[BaseDocument]:
"""Perform an incremental sync from configured datasources.
Extracts only new or modified content since the last sync operation.
This method is designed for efficient regular updates without
re-processing unchanged content.
Returns:
An asynchronous iterator yielding BaseDocument objects representing
newly added or modified content from all datasources.
"""
pass
|
Initialize the orchestrator with datasource managers.
Parameters: |
-
datasource_managers
(List[BaseDatasourceManager] )
–
A list of datasource manager instances that
implement the BaseDatasourceManager interface.
These managers handle the actual data extraction
from specific datasource types.
|
Source code in src/extraction/orchestrators/base_orchestator.py
23
24
25
26
27
28
29
30
31
32
33
34
35 | def __init__(
self,
datasource_managers: List[BaseDatasourceManager],
):
"""Initialize the orchestrator with datasource managers.
Args:
datasource_managers: A list of datasource manager instances that
implement the BaseDatasourceManager interface.
These managers handle the actual data extraction
from specific datasource types.
"""
self.datasource_managers = datasource_managers
|
Extract content from configured datasources.
Performs asynchronous content extraction from all configured
datasource implementations. This method should perform a complete
refresh of all available content from the datasources, regardless
of previous sync state.
Returns: |
-
AsyncIterator[BaseDocument]
–
An asynchronous iterator yielding BaseDocument objects representing
-
AsyncIterator[BaseDocument]
–
the extracted content from all datasources.
|
Source code in src/extraction/orchestrators/base_orchestator.py
37
38
39
40
41
42
43
44
45
46
47
48
49
50 | @abstractmethod
async def full_refresh_sync(self) -> AsyncIterator[BaseDocument]:
"""Extract content from configured datasources.
Performs asynchronous content extraction from all configured
datasource implementations. This method should perform a complete
refresh of all available content from the datasources, regardless
of previous sync state.
Returns:
An asynchronous iterator yielding BaseDocument objects representing
the extracted content from all datasources.
"""
pass
|
Perform an incremental sync from configured datasources.
Extracts only new or modified content since the last sync operation.
This method is designed for efficient regular updates without
re-processing unchanged content.
Returns: |
-
AsyncIterator[BaseDocument]
–
An asynchronous iterator yielding BaseDocument objects representing
-
AsyncIterator[BaseDocument]
–
newly added or modified content from all datasources.
|
Source code in src/extraction/orchestrators/base_orchestator.py
52
53
54
55
56
57
58
59
60
61
62
63
64 | @abstractmethod
async def incremental_sync(self) -> AsyncIterator[BaseDocument]:
"""Perform an incremental sync from configured datasources.
Extracts only new or modified content since the last sync operation.
This method is designed for efficient regular updates without
re-processing unchanged content.
Returns:
An asynchronous iterator yielding BaseDocument objects representing
newly added or modified content from all datasources.
"""
pass
|