PDF Datasource

This module contains the configuration, document, manager, parser, and reader classes for the PDF datasource.

Configuration

PDFDatasourceConfiguration

Bases: DatasourceConfiguration

Configuration for PDF data source.

This class defines the configuration parameters required for extracting data from PDF files. It inherits from the base DatasourceConfiguration class.

Source code in src/extraction/datasources/pdf/configuration.py
class PDFDatasourceConfiguration(DatasourceConfiguration):
    """Configuration for PDF data source.

    This class defines the configuration parameters required for extracting data from PDF files.
    It inherits from the base DatasourceConfiguration class.
    """

    name: Literal[DatasourceName.PDF] = Field(
        ..., description="The name of the data source."
    )
    base_path: str = Field(
        ..., description="Base path to the directory containing PDF files"
    )
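
A minimal sketch of how these two fields constrain a configuration, written with a plain dataclass instead of the project's Pydantic-style model. `PDFConfigSketch`, the `"pdf"` string value for `DatasourceName.PDF`, and the `export_limit` field are assumptions for illustration. The reader accesses `configuration.export_limit`, which this class does not declare, so it is presumably inherited from `DatasourceConfiguration`.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: DatasourceName.PDF serializes to the string "pdf".
PDF_NAME = "pdf"


@dataclass(frozen=True)
class PDFConfigSketch:
    """Plain-dataclass stand-in for PDFDatasourceConfiguration (hypothetical)."""

    name: str
    base_path: str
    # Assumption: provided by the base configuration class in the real project.
    export_limit: Optional[int] = None

    def __post_init__(self) -> None:
        # Mirrors the Literal[DatasourceName.PDF] constraint on `name`.
        if self.name != PDF_NAME:
            raise ValueError(f"name must be {PDF_NAME!r}, got {self.name!r}")


config = PDFConfigSketch(name="pdf", base_path="data/pdfs", export_limit=10)
```

Rejecting any other `name` at construction time is what lets a factory trust that a configuration it receives really targets the PDF datasource.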

Document

PDFDocument

Bases: BaseDocument

Document representation for PDF file content.

Extends BaseDocument for PDF content. The class adds no fields or methods of its own, so document processing, including metadata filtering for embeddings and LLM contexts, is inherited from BaseDocument.

Source code in src/extraction/datasources/pdf/document.py
class PDFDocument(BaseDocument):
    """Document representation for PDF file content.

    Extends BaseDocument to handle PDF-specific document processing including
    metadata filtering for embeddings and LLM contexts.
    """

    pass

Manager

PDFDatasourceManagerFactory

Bases: Factory

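Factory for creating PDF datasource managers.

Builds a BasicDatasourceManager from a PDFDatasourceConfiguration by creating the matching reader and parser through their own factories.

Attributes:
  • _configuration_class (Type) –

    Configuration class accepted by this factory (PDFDatasourceConfiguration)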

Source code in src/extraction/datasources/pdf/manager.py
class PDFDatasourceManagerFactory(Factory):
    """Factory for creating datasource managers.

    Provides type-safe creation of datasource managers based on configuration.

    Attributes:
        _configuration_class: Type of configuration object
    """

    _configuration_class: Type = PDFDatasourceConfiguration

    @classmethod
    def _create_instance(
        cls, configuration: PDFDatasourceConfiguration
    ) -> BasicDatasourceManager:
        """Create an instance of the PDF datasource manager.

        This method constructs a BasicDatasourceManager by creating the appropriate
        reader and parser based on the provided configuration.

        Args:
            configuration: Configuration specifying how to set up the PDF datasource
                          manager, reader, and parser.

        Returns:
            A configured BasicDatasourceManager instance for handling PDF documents.
        """
        reader = PDFDatasourceReaderFactory.create(configuration)
        parser = PDFDatasourceParserFactory.create(configuration)
        return BasicDatasourceManager(
            configuration=configuration,
            reader=reader,
            parser=parser,
        )
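
The composition above can be sketched with stand-in classes. This is a hedged illustration only: the `Factory` base class is not shown in this page, so `FactorySketch.create` and its type check are assumptions about how a `_configuration_class` attribute might be used, and every `*Sketch` name is hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ConfigSketch:
    base_path: str


@dataclass
class ReaderSketch:
    base_path: str


class ParserSketch:
    """Stand-in parser that needs no configuration."""


@dataclass
class ManagerSketch:
    configuration: ConfigSketch
    reader: ReaderSketch
    parser: ParserSketch


class FactorySketch:
    """Hypothetical base: checks the configuration type, then delegates."""

    _configuration_class: type = object

    @classmethod
    def create(cls, configuration: Any) -> Any:
        if not isinstance(configuration, cls._configuration_class):
            raise TypeError(
                f"expected {cls._configuration_class.__name__}, "
                f"got {type(configuration).__name__}"
            )
        return cls._create_instance(configuration)

    @classmethod
    def _create_instance(cls, configuration: Any) -> Any:
        raise NotImplementedError


class ReaderFactorySketch(FactorySketch):
    _configuration_class = ConfigSketch

    @classmethod
    def _create_instance(cls, configuration: ConfigSketch) -> ReaderSketch:
        return ReaderSketch(base_path=configuration.base_path)


class ParserFactorySketch(FactorySketch):
    _configuration_class = ConfigSketch

    @classmethod
    def _create_instance(cls, _: ConfigSketch) -> ParserSketch:
        return ParserSketch()


class ManagerFactorySketch(FactorySketch):
    _configuration_class = ConfigSketch

    @classmethod
    def _create_instance(cls, configuration: ConfigSketch) -> ManagerSketch:
        # One configuration object drives the reader, the parser, and the manager.
        return ManagerSketch(
            configuration=configuration,
            reader=ReaderFactorySketch.create(configuration),
            parser=ParserFactorySketch.create(configuration),
        )


manager = ManagerFactorySketch.create(ConfigSketch(base_path="data/pdfs"))
```

Keeping construction in `_create_instance` and validation in `create` means each concrete factory only states which configuration class it accepts and how to build its product.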

Parser

PDFDatasourceParser

Bases: BaseParser[PDFDocument]

Parser for PDF documents that converts them to structured PDFDocument objects.

Uses MarkItDown to convert PDF files to markdown format for easier processing.

Source code in src/extraction/datasources/pdf/parser.py
class PDFDatasourceParser(BaseParser[PDFDocument]):
    """
    Parser for PDF documents that converts them to structured PDFDocument objects.

    Uses MarkItDown to convert PDF files to markdown format for easier processing.
    """

    def __init__(self, parser: MarkItDown = MarkItDown()):
        """
        Initialize the PDF parser.

        Args:
            parser: MarkItDown parser instance for PDF to markdown conversion
        """
        self.parser = parser

    def parse(self, file_path: str) -> PDFDocument:
        """
        Parses the given PDF file into a structured document.

        Args:
            file_path: Path to the PDF file

        Returns:
            PDFDocument object containing the parsed content and metadata
        """
        markdown = self.parser.convert(
            file_path, file_extension=".pdf"
        ).text_content
        metadata = self._extract_metadata(file_path)
        return PDFDocument(text=markdown, metadata=metadata)

    def _extract_metadata(self, file_path: str) -> dict:
        """
        Extract and process PDF metadata from the file.

        Args:
            file_path: Path to the PDF file

        Returns:
            Processed metadata dictionary with standardized fields
        """
        metadata = default_file_metadata_func(file_path)
        metadata.update(
            {
                "datasource": "pdf",
                "format": "pdf",
                "url": None,
                "title": os.path.basename(file_path),
                "last_edited_date": metadata["last_modified_date"],
                "created_date": metadata["creation_date"],
            }
        )
        del metadata["last_modified_date"]
        del metadata["creation_date"]
        return metadata
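
The key renaming in `_extract_metadata` can be tried in isolation. The sketch below is a simplified illustration, not the project's code: `file_metadata_sketch` is a hypothetical stand-in for `default_file_metadata_func` that is assumed to return at least `creation_date` and `last_modified_date`, and it uses `dict.pop` to rename each key in one step instead of a separate `del`.

```python
import os
import tempfile
from datetime import datetime, timezone


def _to_date(timestamp: float) -> str:
    """Format a POSIX timestamp as a UTC calendar date."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")


def file_metadata_sketch(file_path: str) -> dict:
    """Hypothetical stand-in for default_file_metadata_func."""
    stat = os.stat(file_path)
    return {
        "file_name": os.path.basename(file_path),
        # st_ctime is only a rough proxy for creation time on some platforms.
        "creation_date": _to_date(stat.st_ctime),
        "last_modified_date": _to_date(stat.st_mtime),
    }


def extract_metadata_sketch(file_path: str) -> dict:
    metadata = file_metadata_sketch(file_path)
    metadata.update(
        {
            "datasource": "pdf",
            "format": "pdf",
            "url": None,
            "title": os.path.basename(file_path),
            # pop() returns the old value and removes the old key.
            "last_edited_date": metadata.pop("last_modified_date"),
            "created_date": metadata.pop("creation_date"),
        }
    )
    return metadata


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report.pdf")
    with open(path, "wb"):
        pass  # an empty file is enough to produce file-system metadata
    result = extract_metadata_sketch(path)
```

After the call, `result` carries the standardized `last_edited_date` and `created_date` keys and no longer contains the original `last_modified_date` and `creation_date` keys.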

__init__(parser=MarkItDown())

Initialize the PDF parser.

Parameters:
  • parser (MarkItDown, default: MarkItDown() ) –

    MarkItDown parser instance for PDF to markdown conversion

Source code in src/extraction/datasources/pdf/parser.py
def __init__(self, parser: MarkItDown = MarkItDown()):
    """
    Initialize the PDF parser.

    Args:
        parser: MarkItDown parser instance for PDF to markdown conversion
    """
    self.parser = parser
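
One consequence of the signature above is that Python evaluates a default argument once, when the `def` statement runs, so every parser built without an explicit argument receives the same converter object. The sketch below demonstrates this with a hypothetical `ConverterSketch` in place of `MarkItDown`. Whether sharing matters in practice depends on whether the converter keeps state, which this page does not say.

```python
from typing import Optional


class ConverterSketch:
    """Hypothetical stand-in for a converter such as MarkItDown."""


class EagerDefaultParser:
    # The default instance is created once, at function definition time.
    def __init__(self, parser: ConverterSketch = ConverterSketch()):
        self.parser = parser


class LazyDefaultParser:
    # A None sentinel defers construction until each call.
    def __init__(self, parser: Optional[ConverterSketch] = None):
        self.parser = parser if parser is not None else ConverterSketch()


eager_a, eager_b = EagerDefaultParser(), EagerDefaultParser()
lazy_a, lazy_b = LazyDefaultParser(), LazyDefaultParser()

eager_shared = eager_a.parser is eager_b.parser
lazy_shared = lazy_a.parser is lazy_b.parser
```

Here `eager_shared` is true and `lazy_shared` is false. The sentinel form is the common choice when each instance should own its collaborator.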

parse(file_path)

Parses the given PDF file into a structured document.

Parameters:
  • file_path (str) –

    Path to the PDF file

Returns:
  • PDFDocument

    PDFDocument object containing the parsed content and metadata

Source code in src/extraction/datasources/pdf/parser.py
def parse(self, file_path: str) -> PDFDocument:
    """
    Parses the given PDF file into a structured document.

    Args:
        file_path: Path to the PDF file

    Returns:
        PDFDocument object containing the parsed content and metadata
    """
    markdown = self.parser.convert(
        file_path, file_extension=".pdf"
    ).text_content
    metadata = self._extract_metadata(file_path)
    return PDFDocument(text=markdown, metadata=metadata)
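
Because the converter is injected through the constructor, `parse` can be exercised without a real PDF. The following is a minimal sketch under stated assumptions: `FakeConverter`, `ResultSketch`, `DocumentSketch`, and `ParserSketch` are hypothetical stand-ins that copy only the call shape used above (`convert(path, file_extension=".pdf").text_content`), and the metadata is reduced to a title.

```python
import os
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ResultSketch:
    text_content: str


class FakeConverter:
    """Records calls and returns canned markdown instead of reading a file."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Optional[str]]] = []

    def convert(self, file_path: str, file_extension: Optional[str] = None) -> ResultSketch:
        self.calls.append((file_path, file_extension))
        return ResultSketch(text_content=f"# Converted {file_path}")


@dataclass
class DocumentSketch:
    text: str
    metadata: dict


class ParserSketch:
    def __init__(self, parser: FakeConverter) -> None:
        self.parser = parser

    def parse(self, file_path: str) -> DocumentSketch:
        # Same two steps as above: convert to markdown, then attach metadata.
        markdown = self.parser.convert(file_path, file_extension=".pdf").text_content
        return DocumentSketch(
            text=markdown, metadata={"title": os.path.basename(file_path)}
        )


fake = FakeConverter()
document = ParserSketch(parser=fake).parse("docs/report.pdf")
```

The recorded `fake.calls` list shows that the parser always passes `file_extension=".pdf"`, regardless of the file name it was given.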

PDFDatasourceParserFactory

Bases: Factory

Factory for creating PDF parser instances.

Creates and configures PDFDatasourceParser objects according to the provided configuration.

Source code in src/extraction/datasources/pdf/parser.py
class PDFDatasourceParserFactory(Factory):
    """
    Factory for creating PDF parser instances.

    Creates and configures PDFDatasourceParser objects according to
    the provided configuration.
    """

    _configuration_class: Type = PDFDatasourceConfiguration

    @classmethod
    def _create_instance(
        cls, _: PDFDatasourceConfiguration
    ) -> PDFDatasourceParser:
        """
        Creates a new instance of the PDF parser.

        Args:
            _: Configuration object for the parser (not used in this implementation)

        Returns:
            PDFDatasourceParser: Configured parser instance
        """
        return PDFDatasourceParser()

Reader

PDFDatasourceReader

Bases: BaseReader

Reader that yields the paths of PDF files found in the configured base directory.

Source code in src/extraction/datasources/pdf/reader.py
class PDFDatasourceReader(BaseReader):

    def __init__(
        self,
        configuration: PDFDatasourceConfiguration,
        logger: logging.Logger = LoggerConfiguration.get_logger(__name__),
    ):
        """Initialize PDF reader.

        Args:
            configuration: Settings for PDF processing
            logger: Logger instance for logging messages
        """
        super().__init__()
        self.export_limit = configuration.export_limit
        self.base_path = configuration.base_path
        self.logger = logger

    async def read_all_async(self) -> AsyncIterator[str]:
        """Asynchronously yield PDF file paths from the configured directory.

        Retrieves a list of PDF files from the base path, applies any configured
        export limit, and yields each file path individually.

        Returns:
            AsyncIterator[str]: An asynchronous iterator of PDF file paths
        """
        self.logger.info(
            f"Fetching PDF files from '{self.base_path}' with limit {self.export_limit}"
        )

        pdf_files = [
            f for f in os.listdir(self.base_path) if f.endswith(".pdf")
        ]
        files_to_load = (
            pdf_files
            if self.export_limit is None
            else pdf_files[: self.export_limit]
        )

        for file_name in tqdm(
            files_to_load, desc="[PDF] Loading files", unit="files"
        ):
            file_path = os.path.join(self.base_path, file_name)
            if os.path.isfile(file_path):
                yield file_path
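
Three details of the selection logic above are worth noting: `os.listdir` returns names in arbitrary order, the `.pdf` match is case-sensitive, and the limit is applied before the `os.path.isfile` check, so fewer than `export_limit` paths may be yielded. The sketch below is a hypothetical variant, not the project's implementation, that sorts the names, matches the extension case-insensitively, and filters out non-files before slicing. `select_pdf_files` is an illustrative name.

```python
import os
import tempfile
from typing import List, Optional


def select_pdf_files(base_path: str, export_limit: Optional[int] = None) -> List[str]:
    """Return PDF file paths in a stable order, honoring an optional limit."""
    candidates = sorted(
        name
        for name in os.listdir(base_path)
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(base_path, name))
    )
    selected = candidates if export_limit is None else candidates[:export_limit]
    return [os.path.join(base_path, name) for name in selected]


with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        with open(os.path.join(tmp, name), "wb"):
            pass  # empty placeholder files
    # A directory whose name ends in .pdf must not be returned.
    os.mkdir(os.path.join(tmp, "archive.pdf"))

    all_names = [os.path.basename(p) for p in select_pdf_files(tmp)]
    limited_names = [os.path.basename(p) for p in select_pdf_files(tmp, export_limit=1)]
```

Sorting makes a limited export reproducible across runs, which matters when `export_limit` is used to sample a large directory.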

__init__(configuration, logger=LoggerConfiguration.get_logger(__name__))

Initialize PDF reader.

Parameters:
  • configuration (PDFDatasourceConfiguration) –

    Settings for PDF processing

  • logger (Logger, default: get_logger(__name__) ) –

    Logger instance for logging messages

Source code in src/extraction/datasources/pdf/reader.py
def __init__(
    self,
    configuration: PDFDatasourceConfiguration,
    logger: logging.Logger = LoggerConfiguration.get_logger(__name__),
):
    """Initialize PDF reader.

    Args:
        configuration: Settings for PDF processing
        logger: Logger instance for logging messages
    """
    super().__init__()
    self.export_limit = configuration.export_limit
    self.base_path = configuration.base_path
    self.logger = logger

read_all_async() async

Asynchronously yield PDF file paths from the configured directory.

Retrieves a list of PDF files from the base path, applies any configured export limit, and yields each file path individually.

Returns:
  • AsyncIterator[str]

    An asynchronous iterator of PDF file paths

Source code in src/extraction/datasources/pdf/reader.py
async def read_all_async(self) -> AsyncIterator[str]:
    """Asynchronously yield PDF file paths from the configured directory.

    Retrieves a list of PDF files from the base path, applies any configured
    export limit, and yields each file path individually.

    Returns:
        AsyncIterator[str]: An asynchronous iterator of PDF file paths
    """
    self.logger.info(
        f"Fetching PDF files from '{self.base_path}' with limit {self.export_limit}"
    )

    pdf_files = [
        f for f in os.listdir(self.base_path) if f.endswith(".pdf")
    ]
    files_to_load = (
        pdf_files
        if self.export_limit is None
        else pdf_files[: self.export_limit]
    )

    for file_name in tqdm(
        files_to_load, desc="[PDF] Loading files", unit="files"
    ):
        file_path = os.path.join(self.base_path, file_name)
        if os.path.isfile(file_path):
            yield file_path
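
Since `read_all_async` is an async generator, callers iterate it with `async for` and do not `await` it directly. A minimal consumption sketch follows, with a hypothetical `ReaderSketch` that yields a fixed list in place of a directory scan:

```python
import asyncio
from typing import AsyncIterator, List


class ReaderSketch:
    """Hypothetical reader that yields preset paths instead of scanning a directory."""

    def __init__(self, paths: List[str]) -> None:
        self._paths = paths

    async def read_all_async(self) -> AsyncIterator[str]:
        for path in self._paths:
            yield path


async def collect(reader: ReaderSketch) -> List[str]:
    # An async comprehension drains the generator inside a coroutine.
    return [path async for path in reader.read_all_async()]


collected = asyncio.run(collect(ReaderSketch(["a.pdf", "b.pdf"])))
```

In a pipeline, each yielded path would typically be handed to the parser as it arrives, so the full list never needs to be held in memory.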

PDFDatasourceReaderFactory

Bases: Factory

Factory for creating PDF reader instances.

Implements the factory pattern to produce configured PDFDatasourceReader objects based on the provided configuration settings.

Source code in src/extraction/datasources/pdf/reader.py
class PDFDatasourceReaderFactory(Factory):
    """Factory for creating PDF reader instances.

    Implements the factory pattern to produce configured PDFDatasourceReader
    objects based on the provided configuration settings.
    """

    _configuration_class: Type = PDFDatasourceConfiguration

    @classmethod
    def _create_instance(
        cls, configuration: PDFDatasourceConfiguration
    ) -> PDFDatasourceReader:
        """Create a new PDFDatasourceReader with the specified configuration.

        Args:
            configuration: Settings that control PDF processing behavior including
                           base path and export limits

        Returns:
            PDFDatasourceReader: A fully configured reader instance ready for use
        """
        return PDFDatasourceReader(configuration=configuration)