PDF Datasource
This module contains functionality related to the PDF datasource.
Document
PdfDocument
Bases: BaseDocument
Document representation for PDF file content.
Extends BaseDocument to handle PDF-specific document processing, including metadata filtering for embeddings and LLM contexts.
Source code in src/embedding/datasources/pdf/document.py
__init__(text, metadata, attachments=None)
Initialize PDF document.
Source code in src/embedding/datasources/pdf/document.py
_set_excluded_metadata_keys(metadata, included_keys)
staticmethod
Determine metadata keys to exclude from processing.
Note
Returns every key in metadata that is not in included_keys.
Source code in src/embedding/datasources/pdf/document.py
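The exclusion rule described in the note can be sketched as follows. This is a minimal illustration rather than the code in document.py: the function name `excluded_metadata_keys`, its type hints, and the sample metadata are assumptions made for the example.

```python
from typing import Any, Dict, List


def excluded_metadata_keys(metadata: Dict[str, Any], included_keys: List[str]) -> List[str]:
    """Return every key in `metadata` that is not listed in `included_keys`.

    Hypothetical stand-in for the static method documented above.
    """
    included = set(included_keys)  # set lookup keeps the membership test cheap
    return [key for key in metadata if key not in included]


# Sample metadata invented for the illustration.
metadata = {"title": "Report", "author": "A. Writer", "file_path": "/tmp/report.pdf"}
excluded = excluded_metadata_keys(metadata, ["title"])
print(excluded)  # ['author', 'file_path']
```

Keys are returned in the order they appear in the metadata dictionary, which keeps the result deterministic.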
Manager
PdfDatasourceManager
Bases: DatasourceManager
Manager for PDF content extraction and processing.
Handles document retrieval, cleaning, splitting and embedding updates for PDF documents. Implements the base DatasourceManager interface for PDF-specific processing.
Source code in src/embedding/datasources/pdf/manager.py
Reader
DefaultPDFParser
Source code in src/embedding/datasources/pdf/reader.py
_extract_metadata(reader, file_path)
Extract and process PDF metadata.
Note
Converts date strings to ISO format where possible.
Source code in src/embedding/datasources/pdf/reader.py
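As an illustration of the date conversion mentioned in the note, the sketch below turns a PDF-style date string (`D:YYYYMMDDHHmmSS...`) into an ISO 8601 string and returns the input unchanged when it cannot be parsed. The helper name `pdf_date_to_iso` is hypothetical, and the sketch deliberately ignores any timezone suffix; the real implementation in reader.py may handle dates differently.

```python
import re
from datetime import datetime

# "D:" followed by a 4-digit year and up to five optional 2-digit fields.
_PDF_DATE = re.compile(r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?")


def pdf_date_to_iso(value: str) -> str:
    """Convert a PDF date string to ISO format, or return it unchanged.

    Hypothetical helper for illustration; the timezone suffix is ignored.
    """
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return value  # not a PDF-style date: leave as-is
    defaults = (0, 1, 1, 0, 0, 0)  # missing month/day default to 1, times to 0
    year, month, day, hour, minute, second = (
        int(group) if group else default
        for group, default in zip(match.groups(), defaults)
    )
    try:
        return datetime(year, month, day, hour, minute, second).isoformat()
    except ValueError:
        return value  # out-of-range fields, e.g. month 13


print(pdf_date_to_iso("D:20240131120000Z"))  # 2024-01-31T12:00:00
```

Falling back to the original string matches the "where possible" wording: unparseable values are kept rather than dropped.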
NLMPDFParser
Source code in src/embedding/datasources/pdf/reader.py
_extract_page_metadata(file_path)
Extract metadata from the first pages of the PDF.
Source code in src/embedding/datasources/pdf/reader.py
parse(file_path)
Parses the given PDF file and enriches its metadata with additional fields.
Source code in src/embedding/datasources/pdf/reader.py
PdfReader
Bases: BaseReader[PdfDocument]
Reader for extracting content from PDF files.
Implements document extraction from PDF files with support for text and metadata extraction.
Source code in src/embedding/datasources/pdf/reader.py
__init__(configuration)
Initialize PDF reader.
Source code in src/embedding/datasources/pdf/reader.py
get_all_documents_async()
async
Load documents asynchronously from configured path.
Note
Currently calls the synchronous implementation.
Source code in src/embedding/datasources/pdf/reader.py
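The note on get_all_documents_async has a practical consequence: an async method that calls synchronous code directly blocks the event loop while the documents load. The sketch below shows that pattern next to one possible non-blocking alternative. `SketchReader`, the synchronous `get_all_documents` method, and `get_all_documents_in_thread` are hypothetical names for this illustration, not part of the documented API.

```python
import asyncio
from typing import List


class SketchReader:
    """Hypothetical stand-in for a reader; documents are plain strings here."""

    def get_all_documents(self) -> List[str]:
        # Placeholder for synchronous PDF loading.
        return ["page one text", "page two text"]

    async def get_all_documents_async(self) -> List[str]:
        # Pattern described in the note: the async method simply calls
        # the synchronous one, so the event loop is blocked meanwhile.
        return self.get_all_documents()

    async def get_all_documents_in_thread(self) -> List[str]:
        # Possible alternative (Python 3.9+): run the blocking call in a
        # worker thread so other coroutines can make progress.
        return await asyncio.to_thread(self.get_all_documents)


documents = asyncio.run(SketchReader().get_all_documents_async())
threaded = asyncio.run(SketchReader().get_all_documents_in_thread())
```

Both coroutines return the same documents; they differ only in whether the event loop stays responsive during loading.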
preprocess_text(text)
Preprocess text to clean split labels and values while preserving structure.
Source code in src/embedding/datasources/pdf/reader.py
Builders
PdfCleanerBuilder
Builder for creating PDF content cleaner instances.
Provides factory method to create Cleaner objects for PDF content.
Source code in src/embedding/datasources/pdf/builders.py
build()
staticmethod
Creates a content cleaner for PDFs.
Source code in src/embedding/datasources/pdf/builders.py
PdfDatasourceManagerBuilder
Builder for creating PDF datasource manager instances.
Provides factory method to create configured PdfDatasourceManager with required components for content processing.
Source code in src/embedding/datasources/pdf/builders.py
build(configuration, reader, cleaner, splitter)
staticmethod
Creates a configured PDF datasource manager.
Source code in src/embedding/datasources/pdf/builders.py
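To illustrate how a static build method can wire a reader, cleaner, and splitter into a manager, here is a minimal sketch of the builder pattern. It assumes nothing about the real classes: `SketchManager`, `SketchManagerBuilder`, the `run` method, and the callables used as components are hypothetical stand-ins, and only the `build(configuration, reader, cleaner, splitter)` argument list is taken from the signature above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SketchManager:
    """Hypothetical manager holding the components it was built with."""

    configuration: dict
    reader: Callable[[], List[str]]
    cleaner: Callable[[List[str]], List[str]]
    splitter: Callable[[List[str]], List[str]]

    def run(self) -> List[str]:
        # Retrieve, clean, then split, in that order.
        return self.splitter(self.cleaner(self.reader()))


class SketchManagerBuilder:
    """Hypothetical builder mirroring the documented build() arguments."""

    @staticmethod
    def build(configuration, reader, cleaner, splitter) -> SketchManager:
        return SketchManager(
            configuration=configuration,
            reader=reader,
            cleaner=cleaner,
            splitter=splitter,
        )


manager = SketchManagerBuilder.build(
    configuration={"name": "pdf"},
    reader=lambda: ["  first page  ", "", "second page"],
    cleaner=lambda docs: [d.strip() for d in docs if d.strip()],
    splitter=lambda docs: [word for d in docs for word in d.split()],
)
chunks = manager.run()
```

Passing the components in as arguments keeps the manager easy to test, since each piece can be replaced with a simple stub as done here.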
PdfReaderBuilder
Builder for creating PDF reader instances.
Provides factory method to create configured PdfReader objects.
Source code in src/embedding/datasources/pdf/builders.py
build(configuration)
staticmethod
Creates a configured PDF reader.
Source code in src/embedding/datasources/pdf/builders.py