Core Datasource
This module contains functionality related to the Core datasource.
Cleaner
BaseCleaner
Bases: ABC, Generic[DocType]
Abstract base class defining document cleaning interface.
Provides interface for cleaning document collections with type safety through generic typing.
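A minimal sketch of how an abstract, generically typed cleaning interface can be declared. Only the names `BaseCleaner`, `clean` and `DocType` come from this page; the signature annotations and the `PassthroughCleaner` subclass are assumptions made for illustration, not the project's code.

```python
from abc import ABC, abstractmethod
from typing import Generic, List, TypeVar

# Stand-in type variable; the project defines its own DocType.
DocType = TypeVar("DocType")


class BaseCleaner(ABC, Generic[DocType]):
    """Illustrative abstract interface for cleaning documents."""

    @abstractmethod
    def clean(self, documents: List[DocType]) -> List[DocType]:
        """Return the cleaned documents."""


class PassthroughCleaner(BaseCleaner[str]):
    """Hypothetical subclass bound to `str` documents; it changes nothing."""

    def clean(self, documents: List[str]) -> List[str]:
        return list(documents)


result = PassthroughCleaner().clean(["a", "b"])
```

Binding the type variable in the subclass (`BaseCleaner[str]`) lets a static type checker flag calls that pass the wrong document type, while the abstract method prevents the base class from being instantiated directly.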
Source code in src/embedding/datasources/core/cleaner.py
_has_empty_content(document)
staticmethod
Check if document content is empty.
Source code in src/embedding/datasources/core/cleaner.py
clean(documents)
abstractmethod
Clean a list of documents.
Source code in src/embedding/datasources/core/cleaner.py
Cleaner
Bases: BaseCleaner
Generic document cleaner implementation.
Removes empty documents from collections while tracking progress. Supports any document type with a text attribute.
Source code in src/embedding/datasources/core/cleaner.py
_get_documents_with_tqdm(documents, document_type_name)
Wrap document iteration with optional progress bar.
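One way such a wrapper could look, as a sketch only. The `show_progress` flag and the bar description are hypothetical; how the real method decides whether to show progress is not documented here. The example falls back to plain iteration when `tqdm` is not installed.

```python
from typing import Iterable, List, TypeVar

try:
    from tqdm import tqdm  # third-party progress bar, optional in this sketch
except ImportError:
    tqdm = None

DocType = TypeVar("DocType")


def get_documents_with_tqdm(
    documents: List[DocType],
    document_type_name: str,
    show_progress: bool = True,  # hypothetical switch, not from the source
) -> Iterable[DocType]:
    """Return the documents, wrapped in a progress bar when one is available."""
    if show_progress and tqdm is not None:
        return tqdm(documents, desc=f"Cleaning {document_type_name} documents")
    return documents


plain = get_documents_with_tqdm(["a", "b"], "Note", show_progress=False)
wrapped = list(get_documents_with_tqdm(["a", "b"], "Note"))
```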
Source code in src/embedding/datasources/core/cleaner.py
_has_empty_content(document)
staticmethod
Check if document has empty content.
Source code in src/embedding/datasources/core/cleaner.py
clean(documents)
Remove empty documents from collection.
Note
Document type is inferred from first document in collection.
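The behaviour described above can be sketched roughly as follows. The `Page` class, the whitespace rule for "empty", and the printed summary are assumptions; only the method names and the idea of reading the type from the first document come from this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical document type with a `text` attribute."""
    text: str


class Cleaner:
    """Stand-in cleaner that drops documents whose text is empty."""

    @staticmethod
    def _has_empty_content(document) -> bool:
        # Assumption: whitespace-only text counts as empty.
        return not document.text.strip()

    def clean(self, documents: List) -> List:
        if not documents:
            return []
        # The type name is read from the first document in the collection.
        document_type_name = type(documents[0]).__name__
        kept = [d for d in documents if not self._has_empty_content(d)]
        print(f"Kept {len(kept)} of {len(documents)} {document_type_name} documents")
        return kept


pages = Cleaner().clean([Page("Intro"), Page(""), Page("  \n")])
```

Reading the type name from the first element means an empty input list needs its own guard, which is why the sketch returns early in that case.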
Source code in src/embedding/datasources/core/cleaner.py
Document
BaseDocument
Bases: Document
Base document class for structured content storage.
Extends LlamaIndex Document to add support for attachments and metadata filtering for embedding and LLM contexts.
Note
DocType TypeVar ensures type safety when implementing document types. Default metadata includes title and timestamp information.
Source code in src/embedding/datasources/core/document.py
Manager
BaseDatasourceManager
Bases: ABC, Generic[DocType]
Abstract base class for datasource management.
Provides interface for content extraction and vector storage updates.
Source code in src/embedding/datasources/core/manager.py
__init__(configuration, reader, cleaner, splitter)
Initialize datasource manager.
Source code in src/embedding/datasources/core/manager.py
extract()
abstractmethod
async
Extract and process content from datasource.
Source code in src/embedding/datasources/core/manager.py
update_vector_storage()
abstractmethod
Update vector storage with current embeddings.
Source code in src/embedding/datasources/core/manager.py
DatasourceManager
Bases: BaseDatasourceManager
Manager for datasource content processing and embedding.
Implements content extraction pipeline using configurable components for reading, cleaning, splitting and embedding content.
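A rough sketch of how the read, clean and split stages might be chained inside an async `extract()`. The constructor parameters match the `__init__(configuration, reader, cleaner, splitter)` signature listed on this page; the three helper classes, the use of plain strings as documents, and the return type are hypothetical. The embedding and vector storage steps are left out.

```python
import asyncio
from typing import List


class InMemoryReader:
    """Hypothetical reader returning canned documents."""

    async def get_all_documents_async(self) -> List[str]:
        return ["# Title\n\nBody text", "   "]


class WhitespaceCleaner:
    """Hypothetical cleaner dropping whitespace-only documents."""

    def clean(self, documents: List[str]) -> List[str]:
        return [d for d in documents if d.strip()]


class ParagraphSplitter:
    """Hypothetical splitter cutting documents at blank lines."""

    def split(self, documents: List[str]) -> List[str]:
        return [p for d in documents for p in d.split("\n\n") if p.strip()]


class DatasourceManager:
    """Stand-in manager wiring reader, cleaner and splitter together."""

    def __init__(self, configuration, reader, cleaner, splitter):
        self.configuration = configuration
        self.reader = reader
        self.cleaner = cleaner
        self.splitter = splitter

    async def extract(self) -> List[str]:
        documents = await self.reader.get_all_documents_async()
        cleaned = self.cleaner.clean(documents)
        return self.splitter.split(cleaned)


manager = DatasourceManager({}, InMemoryReader(), WhitespaceCleaner(), ParagraphSplitter())
nodes = asyncio.run(manager.extract())
```

Passing the components in through the constructor keeps each stage replaceable, so a different datasource only needs its own reader while the rest of the pipeline stays the same.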
Source code in src/embedding/datasources/core/manager.py
extract()
async
Extract and process content from datasource.
Source code in src/embedding/datasources/core/manager.py
update_vector_storage()
Update vector storage with current embeddings.
Source code in src/embedding/datasources/core/manager.py
Reader
BaseReader
Bases: ABC, Generic[DocType]
Abstract base class for document source readers.
Defines the interface for extracting documents from various sources. Declares both a synchronous and an asynchronous retrieval method, and is generic over the document type.
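A sketch of a reader interface with both retrieval methods, plus one hypothetical implementation. The method names come from this page; the return annotations, the `ListReader` class, and the choice to run the blocking variant in a worker thread are assumptions.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Generic, List, TypeVar

DocType = TypeVar("DocType")


class BaseReader(ABC, Generic[DocType]):
    """Illustrative abstract reader."""

    @abstractmethod
    def get_all_documents(self) -> List[DocType]:
        """Synchronously return all documents."""

    @abstractmethod
    async def get_all_documents_async(self) -> List[DocType]:
        """Asynchronously return all documents."""


class ListReader(BaseReader[str]):
    """Hypothetical reader backed by an in-memory list."""

    def __init__(self, items: List[str]) -> None:
        self._items = items

    def get_all_documents(self) -> List[str]:
        return list(self._items)

    async def get_all_documents_async(self) -> List[str]:
        # Run the blocking variant in a worker thread (Python 3.9+).
        return await asyncio.to_thread(self.get_all_documents)


reader = ListReader(["a", "b"])
sync_docs = reader.get_all_documents()
async_docs = asyncio.run(reader.get_all_documents_async())
```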
Source code in src/embedding/datasources/core/reader.py
get_all_documents()
abstractmethod
Synchronously retrieve all documents from source.
Source code in src/embedding/datasources/core/reader.py
get_all_documents_async()
abstractmethod
async
Asynchronously retrieve all documents from source.
Source code in src/embedding/datasources/core/reader.py
Splitter
BaseSplitter
Bases: ABC, Generic[DocType]
Abstract base class for document splitting.
Defines interface for splitting documents into text nodes with generic typing support for document types.
Source code in src/embedding/datasources/core/splitter.py
split(documents)
abstractmethod
Split documents into text nodes.
Source code in src/embedding/datasources/core/splitter.py
MarkdownSplitter
Bases: BaseSplitter
Splitter for markdown documents with token-based chunking.
Splits markdown content into nodes based on document structure and token limits. Supports node merging and splitting to maintain consistent chunk sizes.
Source code in src/embedding/datasources/core/splitter.py
__init__(chunk_size_in_tokens, chunk_overlap_in_tokens, tokenize_func)
Initialize markdown splitter.
Source code in src/embedding/datasources/core/splitter.py
_merge_small_nodes(document_nodes)
Merge adjacent small nodes into larger chunks.
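A greedy merge is one plausible way to do this; the sketch below is not taken from the project. It works on plain strings rather than node objects, joins neighbours with a blank line, and uses whitespace splitting as a stand-in for the configured `tokenize_func`.

```python
from typing import Callable, List


def merge_small_nodes(
    nodes: List[str],
    chunk_size_in_tokens: int,
    tokenize_func: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Greedily merge neighbours while the merged text fits the token limit."""
    merged: List[str] = []
    for node in nodes:
        if merged:
            candidate = merged[-1] + "\n\n" + node
            if len(tokenize_func(candidate)) <= chunk_size_in_tokens:
                # The previous chunk still has room, so absorb this node.
                merged[-1] = candidate
                continue
        merged.append(node)
    return merged


chunks = merge_small_nodes(["a b", "c", "d e f g", "h"], chunk_size_in_tokens=4)
# chunks == ["a b\n\nc", "d e f g", "h"]
```

Only adjacent nodes are combined, so the original reading order of the document is preserved.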
Source code in src/embedding/datasources/core/splitter.py
_split_big_node(document_node)
Split single oversized node into smaller nodes.
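As an illustration, an oversized text can be cut into overlapping token windows. This sketch assumes that the `chunk_overlap_in_tokens` setting is applied at this stage and that whitespace tokens stand in for model tokens; the real method may differ on both points.

```python
from typing import List


def split_big_node(
    text: str,
    chunk_size_in_tokens: int,
    chunk_overlap_in_tokens: int,
) -> List[str]:
    """Cut text into windows of at most chunk_size tokens that overlap."""
    if chunk_overlap_in_tokens >= chunk_size_in_tokens:
        raise ValueError("overlap must be smaller than the chunk size")
    tokens = text.split()  # assumption: whitespace tokenization
    step = chunk_size_in_tokens - chunk_overlap_in_tokens
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size_in_tokens]))
        if start + chunk_size_in_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


parts = split_big_node(
    "t1 t2 t3 t4 t5 t6 t7", chunk_size_in_tokens=4, chunk_overlap_in_tokens=1
)
# parts == ["t1 t2 t3 t4", "t4 t5 t6 t7"]
```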
Source code in src/embedding/datasources/core/splitter.py
_split_big_nodes(document_nodes)
Split oversized nodes into smaller chunks.
Source code in src/embedding/datasources/core/splitter.py
split(documents)
Split markdown documents into text nodes.
Processes documents through markdown parsing, then adjusts node sizes through splitting and merging to match chunk size requirements.
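The structural first pass could look something like the sketch below, which cuts markdown at ATX headings with a regular expression. This is an assumption about what "markdown parsing" means here; the project may rely on a dedicated parser, and this simplified version would also split at `#` lines inside fenced code blocks.

```python
import re
from typing import List


def split_markdown_by_headings(markdown: str) -> List[str]:
    """Split markdown into sections, each starting at an ATX heading."""
    # Zero-width split before lines that begin with one to six '#' characters.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", markdown)
    return [s.strip() for s in sections if s.strip()]


sections = split_markdown_by_headings("# A\nintro\n\n## B\ndetails\n")
# sections == ["# A\nintro", "## B\ndetails"]
```

The resulting sections would then be resized: oversized ones split further and undersized neighbours merged, so that chunks stay close to the configured token limit.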
Source code in src/embedding/datasources/core/splitter.py
Builders
MarkdownSplitterBuilder
Builder for creating markdown content splitter instances.
Provides factory method to create configured MarkdownSplitter objects using embedding model settings for chunking parameters.
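A sketch of the builder pattern described here. The `EmbeddingModelConfiguration` fields are hypothetical (the real configuration object is not documented on this page), and `MarkdownSplitter` is reduced to a stand-in that only stores the three `__init__` parameters listed above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EmbeddingModelConfiguration:
    """Hypothetical configuration; field names are assumptions."""
    chunk_size_in_tokens: int
    chunk_overlap_in_tokens: int
    tokenize_func: Callable[[str], List[str]]


class MarkdownSplitter:
    """Stand-in that only stores its constructor arguments."""

    def __init__(self, chunk_size_in_tokens, chunk_overlap_in_tokens, tokenize_func):
        self.chunk_size_in_tokens = chunk_size_in_tokens
        self.chunk_overlap_in_tokens = chunk_overlap_in_tokens
        self.tokenize_func = tokenize_func


class MarkdownSplitterBuilder:
    @staticmethod
    def build(embedding_model_configuration: EmbeddingModelConfiguration) -> MarkdownSplitter:
        # Copy the chunking parameters from the embedding model settings.
        return MarkdownSplitter(
            chunk_size_in_tokens=embedding_model_configuration.chunk_size_in_tokens,
            chunk_overlap_in_tokens=embedding_model_configuration.chunk_overlap_in_tokens,
            tokenize_func=embedding_model_configuration.tokenize_func,
        )


config = EmbeddingModelConfiguration(512, 64, str.split)
splitter = MarkdownSplitterBuilder.build(config)
```

Deriving the chunk size from the embedding model settings keeps chunks within what the model can embed, without callers having to repeat those numbers.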
Source code in src/embedding/datasources/core/builders.py
build(embedding_model_configuration)
staticmethod
Creates a configured markdown splitter instance.
Source code in src/embedding/datasources/core/builders.py