Core Datasource
This module contains functionality related to the Core datasource.
Cleaner
BaseCleaner
Bases: ABC, Generic[DocType]
Abstract base class for document cleaning operations.
Defines the interface for document cleaners with generic type support to ensure type safety across different document implementations.
Source code in src/extraction/datasources/core/cleaner.py
clean(document)
abstractmethod
Clean a single document.
Source code in src/extraction/datasources/core/cleaner.py
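The interface described above can be sketched roughly as follows. This is a minimal illustration, not the module's source: `SimpleDoc` and `WhitespaceCleaner` are hypothetical names, and the `Optional[DocType]` return type (returning `None` to drop a document) is an assumption.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

DocType = TypeVar("DocType")


@dataclass
class SimpleDoc:
    """Hypothetical stand-in document; not part of the module."""

    text: str


class BaseCleaner(ABC, Generic[DocType]):
    """Sketch of the abstract cleaner interface."""

    @abstractmethod
    def clean(self, document: DocType) -> Optional[DocType]:
        """Return the cleaned document, or None to drop it (assumed contract)."""


class WhitespaceCleaner(BaseCleaner[SimpleDoc]):
    """Hypothetical concrete cleaner that trims surrounding whitespace."""

    def clean(self, document: SimpleDoc) -> Optional[SimpleDoc]:
        document.text = document.text.strip()
        return document
```

Because `clean` is abstract, `BaseCleaner` itself cannot be instantiated; only subclasses that implement `clean` can.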
BasicMarkdownCleaner
Bases: BaseCleaner, Generic[DocType]
Document cleaner for basic content validation.
Checks for empty content in documents and filters them out. Works with any document type that has a text attribute.
Source code in src/extraction/datasources/core/cleaner.py
clean(document)
Remove document if it contains empty content.
Source code in src/extraction/datasources/core/cleaner.py
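A minimal sketch of the empty-content check, assuming that whitespace-only text counts as empty and that a dropped document is signalled by returning `None`. Neither detail is confirmed by the text above. `SimpleDoc` and `EmptyContentCleaner` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

DocType = TypeVar("DocType")


@dataclass
class SimpleDoc:
    """Hypothetical document type exposing a text attribute."""

    text: str


class EmptyContentCleaner(Generic[DocType]):
    """Sketch of a cleaner that filters out documents with no content."""

    def clean(self, document: DocType) -> Optional[DocType]:
        # Works with any object that has a `text` attribute.
        text = getattr(document, "text", "") or ""
        if not text.strip():
            return None  # assumed way of signalling "drop this document"
        return document


docs = [SimpleDoc("# Title"), SimpleDoc("   "), SimpleDoc("")]
cleaner = EmptyContentCleaner()
kept = [d for d in docs if cleaner.clean(d) is not None]
```

Here only the first document survives; the other two contain no visible content.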
Document
BaseDocument
Bases: Document
Base document class for structured content storage.
Extends LlamaIndex Document to add support for attachments and metadata filtering for embedding and LLM contexts.
Note
DocType TypeVar ensures type safety when implementing document types. Default metadata includes title and timestamp information.
Source code in src/extraction/datasources/core/document.py
__init__(text, metadata, attachments=None)
Initialize a document with text, metadata, and optional attachments.
Sets up excluded metadata keys for embedding and LLM contexts.
Source code in src/extraction/datasources/core/document.py
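The idea of excluding metadata keys per context can be illustrated without LlamaIndex. The sketch below is a stdlib-only stand-in, not the actual class: the attribute names `excluded_embed_metadata_keys` and `excluded_llm_metadata_keys` follow LlamaIndex's `Document`, while the specific excluded keys, the attachment type, and the helper methods are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


class SketchDocument:
    """Stdlib-only stand-in for a document with attachments (hypothetical)."""

    # Hypothetical key lists; the real exclusions are not shown in this page.
    EXCLUDED_EMBED_KEYS: List[str] = ["url", "created_at"]
    EXCLUDED_LLM_KEYS: List[str] = ["created_at"]

    def __init__(
        self,
        text: str,
        metadata: Dict[str, Any],
        attachments: Optional[Dict[str, str]] = None,
    ) -> None:
        self.text = text
        self.metadata = metadata
        # Build a fresh dict so instances never share a mutable default.
        self.attachments = attachments or {}
        self.excluded_embed_metadata_keys = list(self.EXCLUDED_EMBED_KEYS)
        self.excluded_llm_metadata_keys = list(self.EXCLUDED_LLM_KEYS)

    def _visible(self, excluded: List[str]) -> Dict[str, Any]:
        """Return metadata with the excluded keys filtered out."""
        return {k: v for k, v in self.metadata.items() if k not in excluded}

    def metadata_for_embedding(self) -> Dict[str, Any]:
        return self._visible(self.excluded_embed_metadata_keys)

    def metadata_for_llm(self) -> Dict[str, Any]:
        return self._visible(self.excluded_llm_metadata_keys)


doc = SketchDocument(
    "body", {"title": "T", "url": "https://example.com", "created_at": "2024-01-01"}
)
```

With these hypothetical lists, the embedding context sees only `title`, while the LLM context also sees `url`.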
Manager
BaseDatasourceManager
Bases: ABC, Generic[DocType]
Abstract base class for datasource management.
Defines the interface for managing the extraction, parsing, cleaning, and splitting of documents from a data source. This class serves as a template for implementing specific datasource managers, ensuring a consistent interface and behavior across different implementations.
Source code in src/extraction/datasources/core/manager.py
__init__(configuration, reader, parser=BasicMarkdownParser(), cleaner=BasicMarkdownCleaner(), splitter=BasicMarkdownSplitter())
Initialize datasource manager.
Source code in src/extraction/datasources/core/manager.py
full_refresh_sync()
abstractmethod
async
Extract and process all content from the datasource.
Source code in src/extraction/datasources/core/manager.py
incremental_sync()
abstractmethod
Process only new or changed content from the datasource.
This method should handle differential updates to avoid reprocessing all content when only portions have changed. Implementations should update the vector storage accordingly.
Source code in src/extraction/datasources/core/manager.py
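A rough sketch of the manager interface and one way a subclass might satisfy it. The class names are hypothetical, and the return type of `full_refresh_sync` (an async iterator of documents) is an assumption, since the documented return value did not survive on this page.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncIterator, Generic, List, TypeVar

DocType = TypeVar("DocType")


class SketchManager(ABC, Generic[DocType]):
    """Sketch of the abstract manager interface (names are hypothetical)."""

    @abstractmethod
    async def full_refresh_sync(self) -> AsyncIterator[DocType]:
        """Extract and process all content (async iterator is an assumption)."""

    @abstractmethod
    def incremental_sync(self) -> None:
        """Process only new or changed content."""


class InMemoryManager(SketchManager[str]):
    """Hypothetical subclass backed by a plain list."""

    def __init__(self, items: List[str]) -> None:
        self.items = items

    async def full_refresh_sync(self) -> AsyncIterator[str]:
        # Implemented as an async generator so callers can stream results.
        for item in self.items:
            yield item

    def incremental_sync(self) -> None:
        raise NotImplementedError("incremental sync is not part of this sketch")


async def collect(manager: SketchManager[str]) -> List[str]:
    """Drain the manager into a list."""
    return [doc async for doc in manager.full_refresh_sync()]
```

Calling `asyncio.run(collect(InMemoryManager(["a", "b"])))` drains the manager into a list.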
BasicDatasourceManager
Bases: BaseDatasourceManager, Generic[DocType]
Standard implementation of datasource content processing pipeline.
Handles the extraction, parsing, cleaning, and splitting of documents from a data source. Processes documents using the provided components in a sequential pipeline to prepare them for embedding and storage.
Source code in src/extraction/datasources/core/manager.py
full_refresh_sync()
async
Process all content from the datasource from scratch.
Executes the complete pipeline:
1. Reads source objects asynchronously
2. Parses each object into a document
3. Cleans the content
4. Splits into appropriate chunks
Source code in src/extraction/datasources/core/manager.py
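The four pipeline steps can be sketched as a single async generator. All component classes below are simplified, hypothetical stand-ins, and the behaviour of skipping documents that the cleaner rejects (by returning `None`) is an assumption rather than something this page confirms.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List, Optional


@dataclass
class SimpleDoc:
    """Hypothetical document type."""

    text: str


class ListReader:
    """Hypothetical reader that streams raw strings from a list."""

    def __init__(self, items: List[str]) -> None:
        self.items = items

    async def read_all_async(self) -> AsyncIterator[str]:
        for item in self.items:
            yield item


class Parser:
    def parse(self, content: str) -> SimpleDoc:
        return SimpleDoc(text=content)


class Cleaner:
    def clean(self, document: SimpleDoc) -> Optional[SimpleDoc]:
        return document if document.text.strip() else None


class Splitter:
    def split(self, document: SimpleDoc) -> List[SimpleDoc]:
        return [document]


async def full_refresh(reader, parser, cleaner, splitter) -> AsyncIterator[SimpleDoc]:
    async for raw in reader.read_all_async():   # 1. read source objects
        document = parser.parse(raw)            # 2. parse into a document
        cleaned = cleaner.clean(document)       # 3. clean the content
        if cleaned is None:
            continue                            # dropped by the cleaner (assumed)
        for chunk in splitter.split(cleaned):   # 4. split into chunks
            yield chunk


async def run_pipeline(items: List[str]) -> List[str]:
    """Run the sketch end to end and return the text of each chunk."""
    stream = full_refresh(ListReader(items), Parser(), Cleaner(), Splitter())
    return [doc.text async for doc in stream]
```

Processing one object at a time keeps memory use bounded by a single document and its chunks, rather than by the whole datasource.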
incremental_sync()
Process only new or changed content since the last sync.
Should be implemented by subclasses to provide efficient updates when only a portion of the datasource has changed.
Source code in src/extraction/datasources/core/manager.py
Parser
BaseParser
Bases: ABC, Generic[DocType]
Abstract base class for document parsers.
Defines the interface for parsing content into documents of specified type (DocType).
Source code in src/extraction/datasources/core/parser.py
parse(content)
abstractmethod
Parse content into a document of type DocType.
Source code in src/extraction/datasources/core/parser.py
BasicMarkdownParser
Bases: BaseParser[Document]
Markdown parser that converts markdown text into Document objects.
Implements the BaseParser interface for basic markdown content.
Source code in src/extraction/datasources/core/parser.py
parse(markdown)
Parse markdown content into a Document object.
Source code in src/extraction/datasources/core/parser.py
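A minimal sketch of how the abstract parser and a basic markdown implementation could relate. `SimpleDoc`, `SketchParser`, and `SketchMarkdownParser` are hypothetical names; treating the input as a `str` and starting with empty metadata are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, Generic, TypeVar

DocType = TypeVar("DocType")


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for the Document class."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class SketchParser(ABC, Generic[DocType]):
    """Sketch of the abstract parser interface."""

    @abstractmethod
    def parse(self, content: str) -> DocType:
        """Parse raw content into a document of type DocType."""


class SketchMarkdownParser(SketchParser[SimpleDoc]):
    """Binds DocType to SimpleDoc, as BaseParser[Document] does above."""

    def parse(self, markdown: str) -> SimpleDoc:
        # The markdown is stored unchanged; metadata starts empty (assumed).
        return SimpleDoc(text=markdown, metadata={})


parsed = SketchMarkdownParser().parse("# Title\n\nBody text")
```

Binding the type parameter in the subclass lets a type checker infer that `parsed` is a `SimpleDoc`.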
Reader
BaseReader
Bases: ABC
Abstract base class for document source readers.
This class defines a standard interface for extracting documents from various data sources. Concrete implementations should inherit from this class and implement the required methods to handle specific data source types.
The generic typing allows for flexibility in the document types returned by different implementations.
Source code in src/extraction/datasources/core/reader.py
read_all_async()
abstractmethod
async
Asynchronously retrieve documents from the source.
Implementations should use async iteration to efficiently stream documents from the source without loading all content into memory at once.
Source code in src/extraction/datasources/core/reader.py
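The streaming behaviour described above maps naturally onto an async generator. The sketch below assumes a hypothetical paged source: `PagedReader` and `first_n` are illustrative names, and `asyncio.sleep(0)` stands in for an awaited network or disk call.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncIterator, List


class SketchReader(ABC):
    """Sketch of the abstract reader interface."""

    @abstractmethod
    async def read_all_async(self) -> AsyncIterator[str]:
        """Stream items from the source one at a time."""


class PagedReader(SketchReader):
    """Hypothetical reader over a source that returns results page by page."""

    def __init__(self, pages: List[List[str]]) -> None:
        self.pages = pages

    async def read_all_async(self) -> AsyncIterator[str]:
        for page in self.pages:
            await asyncio.sleep(0)  # stand-in for fetching the next page
            for item in page:
                yield item  # hand items over without buffering the whole source


async def first_n(reader: SketchReader, n: int) -> List[str]:
    """Consume only the first n items; later pages are never fetched."""
    out: List[str] = []
    async for item in reader.read_all_async():
        out.append(item)
        if len(out) == n:
            break
    return out
```

Because the consumer pulls items lazily, stopping early means the remaining pages are never requested.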
Splitter
BaseSplitter
Bases: ABC, Generic[DocType]
Abstract base class for document splitters.
This class defines the interface for splitting documents into smaller chunks. All splitter implementations should inherit from this class.
Source code in src/extraction/datasources/core/splitter.py
split(document)
abstractmethod
Split a document into multiple smaller documents.
Source code in src/extraction/datasources/core/splitter.py
BasicMarkdownSplitter
Bases: BaseSplitter, Generic[DocType]
A simple splitter implementation that returns the document as-is.
This splitter does not perform any actual splitting and is primarily used as a pass-through when splitting is not required.
Source code in src/extraction/datasources/core/splitter.py
split(document)
Return the document as a single-item list without splitting.
Source code in src/extraction/datasources/core/splitter.py
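The pass-through behaviour amounts to wrapping the document in a one-element list. A minimal sketch, with hypothetical class names and an assumed `List[DocType]` return type:

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

DocType = TypeVar("DocType")


@dataclass
class SimpleDoc:
    """Hypothetical document type."""

    text: str


class PassThroughSplitter(Generic[DocType]):
    """Sketch of a splitter that performs no splitting."""

    def split(self, document: DocType) -> List[DocType]:
        # Same object, wrapped in a list so callers can always iterate.
        return [document]


original = SimpleDoc("# Title\n\nBody")
chunks = PassThroughSplitter().split(original)
```

Returning a list even for the no-op case lets downstream code treat every splitter uniformly, whether it yields one chunk or many.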